aswan

Concurrent data collection, compression and storage

https://github.com/endremborza/aswan

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary

Keywords

data python
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: endremborza
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 316 KB
Statistics
  • Stars: 1
  • Watchers: 2
  • Forks: 5
  • Open Issues: 0
  • Releases: 29
Topics
data python
Created almost 5 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

aswan

Collect and organize data into a T1 data depot. Named after the Aswan Dam.

Collect and compress data from the internet for later parsing

  • quick, parallel and customizable collection
  • compressed storage
  • quick to sync with a remote store
    • sync to continue collecting
    • sync to parse
  • immutable collection

To Set Up a Remote

Set the environment variables ASWAN_AUTH_HEX and ASWAN_AUTH_PASS as required by the zimmauth package, and set ASWAN_REMOTE to the name of the default remote.
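
As an illustration, the three variables can be put into the process environment from Python before any code that reads them runs. This is only a sketch: the helper `export_remote_settings` is hypothetical, and the values are placeholders, since the expected format of the hex and password values is defined by zimmauth and is not described here.

```python
import os

# Placeholder values (assumptions): replace them with what zimmauth expects.
PLACEHOLDERS = {
    "ASWAN_AUTH_HEX": "<hex string prepared for zimmauth>",
    "ASWAN_AUTH_PASS": "<password belonging to the hex string>",
    "ASWAN_REMOTE": "<name of the default remote>",
}


def export_remote_settings(values, environ=os.environ):
    """Copy the settings into `environ`, keeping any value that is already set."""
    for name, value in values.items():
        environ.setdefault(name, value)
    # Return the effective values so the caller can inspect what is in use.
    return {name: environ[name] for name in values}


settings = export_remote_settings(PLACEHOLDERS)
```

Using `setdefault` means values exported in the shell take precedence over the placeholders in the script.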

Concepts

  • objects
    • saved by collection events
  • events
    • collection
    • registration (v2: registration for parsing)
    • (v2) parsing
  • runs
    • manual run vs automated run
    • makes manual adding of URLs easy but revertible
    • has unique id
    • generates events
    • linked to a specific version of the code
      • ideally commit hash + pip freeze
  • statuses
    • determined by base status + runs integrated
    • contains
      • what URLs need to be collected
      • (v2) what collected objects need to be parsed
    • sqlite file, constantly trimmed
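
The idea that a status is determined by a base status plus the runs integrated into it can be sketched as a small immutable record. The names `Status`, `runs_to_integrate` and `integrate` are hypothetical and chosen for this illustration only; they are not taken from the aswan code base.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Status:
    """A status: its parent (base) status and the runs integrated so far."""

    status_id: str
    parent_id: Optional[str] = None
    integrated_runs: FrozenSet[str] = frozenset()


def runs_to_integrate(status: Status, all_run_ids: Iterable[str]) -> List[str]:
    """Return the runs that exist but are not yet part of this status."""
    return sorted(set(all_run_ids) - status.integrated_runs)


def integrate(status: Status, new_id: str, run_ids: Iterable[str]) -> Status:
    """Derive a new status from `status`; the old one is left untouched."""
    merged = status.integrated_runs | frozenset(run_ids)
    return Status(new_id, parent_id=status.status_id, integrated_runs=merged)


base = Status("status-0")
first = integrate(base, "status-1", runs_to_integrate(base, ["run-a", "run-b"]))
still_pending = runs_to_integrate(first, ["run-a", "run-b", "run-c"])
```

Deriving a new record instead of mutating the old one mirrors the "immutable collection" goal: reverting a run amounts to going back to the parent status.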

Structure

  • objects
    • 00, 01, ...
  • runs
    • run-hash
      • context.yaml
        • commit-hash, pip-freeze, ...
      • events.zip
  • statuses
    • status-hash
      • context.yaml
        • parent-status, integrated
      • db.sqlite.zip
  • current-run
    • context.yaml
    • events
      • these to be compressed into ../runs
    • status.sqlite
  • there is a 'TEST' status
    • whatever is based on it cannot be integrated
    • a test run can be made on it...
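
A minimal sketch of this layout, built with `pathlib` in a temporary directory. The helpers `init_depot` and `object_path` are hypothetical. Reading the `00, 01, ...` folders as buckets keyed by the first two hex characters of a SHA-256 content hash is an assumption made for illustration, not something the structure above states.

```python
import hashlib
import tempfile
from pathlib import Path

# Top-level folders named in the structure above.
TOP_LEVEL = ("objects", "runs", "statuses", "current-run")


def init_depot(root: Path) -> Path:
    """Create the top-level folders if they do not exist yet."""
    for name in TOP_LEVEL:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def object_path(root: Path, content: bytes) -> Path:
    """Assumed bucketing: first two hex characters of the SHA-256 digest."""
    digest = hashlib.sha256(content).hexdigest()
    return root / "objects" / digest[:2] / digest[2:]


with tempfile.TemporaryDirectory() as tmp:
    depot = init_depot(Path(tmp))
    target = object_path(depot, b"<html>example page</html>")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(b"<html>example page</html>")
    created = sorted(p.name for p in depot.iterdir())
```

Spreading objects over short-named buckets keeps any single directory from growing too large, which is a common reason for such a layout.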

When starting a run:

  • check if current-run is empty
    • if not, fail with
  • find latest status
    • if it has not integrated all past runs, create a new status that has
  • start collection (+ registration)
    • either stops or breaks, all events and objects are saved to disk
    • if properly stops, move and compress stuff
    • based on one that was the starter, and current run id
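
The run procedure above can be sketched roughly as follows. This is an illustration under stated assumptions, not the aswan implementation: `start_run` and `finish_run` are hypothetical helpers, the error message is invented because the original text does not say what the failure reports, and the status lookup step is left out.

```python
import shutil
import uuid
import zipfile
from pathlib import Path


def start_run(root: Path) -> str:
    """Refuse to start if current-run still holds data from an earlier run."""
    current = root / "current-run"
    current.mkdir(parents=True, exist_ok=True)
    if any(current.iterdir()):
        # Assumed wording; the source leaves the failure message unspecified.
        raise RuntimeError("current-run is not empty")
    (current / "events").mkdir()
    run_id = uuid.uuid4().hex  # stands in for the unique run id
    (current / "context.yaml").write_text(f"run-id: {run_id}\n")
    return run_id


def finish_run(root: Path, run_id: str) -> Path:
    """Compress the events of a properly stopped run into runs/<run_id>."""
    current = root / "current-run"
    target = root / "runs" / run_id
    target.mkdir(parents=True)
    with zipfile.ZipFile(target / "events.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        for event_file in sorted((current / "events").iterdir()):
            zf.write(event_file, arcname=event_file.name)
    shutil.move(str(current / "context.yaml"), str(target / "context.yaml"))
    shutil.rmtree(current / "events")
    return target
```

If collection breaks instead of stopping properly, `finish_run` is never called, so the events stay on disk in current-run and the next `start_run` fails until they are dealt with.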

Pre v1.0 laundry list

  • parallelize push / pull
  • parsing/connection/broken session error docs
  • transferring / ignoring cookies

  • template projects
    • oddsportal
      • updating thingy, based on latest match in season
    • footy
    • rotten
    • boxoffice

Owner

  • Name: Endre Mark Borza
  • Login: endremborza
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: If you use this software, please cite it as below.
url: https://github.com/endremborza/aswan
authors:
- family-names: Borza
  given-names: Endre Márk
  orcid: https://orcid.org/0000-0002-8804-4520
title: endremborza/aswan
version: 0.5.15
date-released: 2024-06-07

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 228
  • Total Committers: 2
  • Avg Commits per committer: 114.0
  • Development Distribution Score (DDS): 0.039
Top Committers
Name | Email | Commits
Endre Márk Borza | e****a@g****m | 219
papsebestyen | p****n@g****m | 9
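
The summary figures above can be reproduced from the per-committer counts. The sketch below assumes the usual definition of the Development Distribution Score, one minus the share of commits made by the most active committer; the page itself does not spell the formula out.

```python
# Commit counts per committer, taken from the table above.
commits = {"Endre Márk Borza": 219, "papsebestyen": 9}

total = sum(commits.values())
average = total / len(commits)

# Assumed definition: DDS = 1 - (commits of top committer / total commits).
dds = 1 - max(commits.values()) / total

print(total, average, f"{dds:.3f}")  # prints: 228 114.0 0.039
```

A score this close to zero indicates that nearly all commits come from a single committer.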

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 7
  • Average time to close issues: N/A
  • Average time to close pull requests: 20 days
  • Total issue authors: 0
  • Total pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.57
  • Merged pull requests: 7
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • papsebestyen (4)
  • endremborza (3)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 1,715 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 3
  • Total versions: 27
  • Total maintainers: 1
pypi.org: aswan

Data collection manager

  • Versions: 27
  • Dependent Packages: 0
  • Dependent Repositories: 3
  • Downloads: 1,715 Last month
Rankings
Downloads: 5.5%
Dependent repos count: 9.0%
Dependent packages count: 10.0%
Average: 13.3%
Forks count: 14.2%
Stargazers count: 27.8%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/test.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • codecov/codecov-action v3 composite
  • nanasess/setup-chromedriver v1 composite
.github/workflows/twine_release.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
pyproject.toml pypi
  • atqo >=0.3.0
  • beautifulsoup4 *
  • brotli *
  • flask *
  • flask-cors *
  • html5lib *
  • pyyaml *
  • requests *
  • selenium *
  • sqlalchemy *
  • typer *