pycurator

Simplifying large-scale metadata curation

https://github.com/michaelbaluja/pycurator

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (17.1%) to scientific vocabulary

Last synced: 9 months ago

Repository


Basic Info

Host: GitHub
Owner: michaelbaluja
License: other
Language: Python
Default Branch: main
Homepage: https://pycurator.readthedocs.io/en/latest/
Size: 4.26 MB

Statistics

Stars: 7
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 1

Created about 5 years ago · Last pushed almost 4 years ago

Metadata Files

Readme License Citation

PyCurator

Making data extraction and curation as easy as py.

PyCurator allows users to easily query research repositories without the trouble of reading through API documentation. Whether you want the ease of clicking some buttons and getting the data or the flexibility of modifying query format, PyCurator provides a simple UI for quickly retrieving metadata that is built on top of an extensible collection of Web and API scraper classes.

Supported Repositories

PyCurator currently supports the following repositories in the capacities listed. Authentication is required only for Kaggle. For Dryad it is optional, but it may shorten runtime because rate limiting is relaxed for authenticated requests.

| Repository       | Authentication |
|------------------|----------------|
| Dataverse        |                |
| Dryad            | Dryad          |
| Figshare         |                |
| Kaggle           | Kaggle         |
| OpenML           |                |
| Papers With Code |                |
| Zenodo           |                |
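Rate limiting is the practical difference between authenticated and anonymous access noted above. As a rough illustration only (this is not PyCurator's internal code), a client can back off and retry when a repository answers with HTTP 429. The `fetch` callable, the name `fetch_with_backoff`, and the delay values are all hypothetical.

```python
import time
from typing import Callable, Tuple


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `fetch` until it stops reporting HTTP 429 (Too Many Requests).

    `fetch` is a hypothetical callable returning (status_code, body).
    Any non-429 response is returned as is; the delay doubles after
    each rate-limited attempt.
    """
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 429:
            return body
        # Wait only if another attempt will follow.
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting; an authenticated client would simply reach the retry branch less often.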

If there's a repository that you would like to see added to the list, check out the Contributions section.

Installation and use

Installation

Pip

The latest stable release of PyCurator is available on PyPI:

```bash
$ pip install pycurator
```

Install from source

Dependencies are listed in the requirements.txt file. Creating a virtual environment is recommended to avoid conflicts with the packages in your current workspace.

PyCurator requires a Python version >= 3.10.

To install from source and launch PyCurator, run the following commands in your terminal:

```bash
$ git clone https://github.com/michaelbaluja/PyCurator.git
$ cd PyCurator
$ python -m pip install -e .
$ pycurator
```
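Before installing from source, it can help to confirm that the interpreter in the active virtual environment meets the 3.10 minimum stated above. A minimal check follows; `meets_minimum` is an illustrative helper name and is not part of PyCurator.

```python
import sys

MINIMUM = (3, 10)  # minimum Python version stated in this README


def meets_minimum(version=None, minimum=MINIMUM):
    """Return True if `version` (default: the running interpreter) is new enough."""
    if version is None:
        version = sys.version_info
    # Compare only (major, minor); patch level does not matter here.
    return tuple(version[:2]) >= minimum


if __name__ == "__main__":
    current = ".".join(map(str, sys.version_info[:3]))
    if meets_minimum():
        print(f"Python {current} is supported")
    else:
        print(f"Python {current} is too old; 3.10 or newer is required")
```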

Use

Repository Selection

After running the commands above, you will see the landing page, which contains licensing, funding, and copyright information. Clicking Continue brings you to the repository selection page.

[Screenshot: Repository Selection Page]

Parameter Selection

Clicking on one of the repositories brings up the parameters used for querying its API and saving your results. The available parameters vary by repository.

[Screenshot: Parameter Selection]

These parameters are outlined as follows:

| Parameter      | Description |
|----------------|-------------|
| Save Directory | Location to save results. Defaults to "/data/{reponame}/{searchterm}{searchtype}.json" within PyCurator /data sub-directory. |
| Search Terms   | Search term(s) to query. Terms should be separated with a comma, and multi-word terms should be wrapped in quotes. |
| Search Types   | Type of objects to query. |

After all required parameters are provided, the Run button is activated.
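The Search Terms format described above (comma-separated, with quotes around multi-word terms) matches what Python's standard `csv` module already parses. The sketch below shows one way such input could be split; `parse_search_terms` is a hypothetical helper, and PyCurator's own parsing may differ.

```python
import csv


def parse_search_terms(raw: str) -> list[str]:
    """Split comma-separated terms, keeping quoted multi-word terms whole."""
    # skipinitialspace lets a quoted term follow ", " rather than only ",".
    row = next(csv.reader([raw], skipinitialspace=True), [])
    # Drop empty entries such as those produced by a trailing comma.
    return [term.strip() for term in row if term.strip()]


terms = parse_search_terms('climate, "machine learning", genomics')
print(terms)  # ['climate', 'machine learning', 'genomics']
```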

Run Page

The run page shows high-level status updates in the main window: the beginning and end of processes, rate-limiting issues, runtime completion, and saving confirmation. Below the main window are real-time status updates for the specific query in progress, along with a progress bar for the high-level task. For tasks with a known length, such as metadata querying or some web scraping, a fixed-length progress bar shows how far the task has advanced. For tasks with an indeterminate duration, a cycling progress bar indicates that work is continuing.
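The two progress styles described above can be illustrated with a small text-based sketch. This is not how PyCurator's interface is implemented; the function names, bar width, and characters are assumptions chosen for the example.

```python
from itertools import cycle
from typing import Iterator


def determinate_bar(done: int, total: int, width: int = 20) -> str:
    """Fixed-length bar for tasks whose size is known up front."""
    filled = min(width, width * done // total) if total else width
    return "[" + "#" * filled + "-" * (width - filled) + f"] {done}/{total}"


def indeterminate_bar(width: int = 20, block: int = 4) -> Iterator[str]:
    """Endlessly cycling bar for tasks with no known end point."""
    # The block slides left to right, then starts over from the left.
    for start in cycle(range(width - block + 1)):
        yield "[" + " " * start + "=" * block + " " * (width - block - start) + "]"


print(determinate_bar(5, 20))       # [#####---------------] 5/20
frames = indeterminate_bar(width=6, block=2)
print(next(frames), next(frames))   # [==    ] [ ==   ]
```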

At the bottom are the navigation buttons. To avoid unnecessary queries, the Back button is unresponsive during runtime, but is activated after completion. The Stop button is used to interrupt runtime and stop querying. After runtime completion or interruption, the Stop button is replaced by the Exit button, allowing you to safely terminate the program.
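The button behaviour described above amounts to a small state table. The sketch below models it without any GUI code; `RunState` and `navigation_buttons` are hypothetical names, and enabling Back after an interruption (not only after normal completion) is an assumption.

```python
from enum import Enum, auto


class RunState(Enum):
    RUNNING = auto()
    FINISHED = auto()  # completed normally
    STOPPED = auto()   # interrupted with the Stop button


def navigation_buttons(state: RunState) -> dict[str, bool]:
    """Map each visible button label to whether it is enabled."""
    if state is RunState.RUNNING:
        # Back is shown but unresponsive while queries are in flight.
        return {"Back": False, "Stop": True}
    # After completion or interruption, Stop is replaced by Exit.
    return {"Back": True, "Exit": True}


print(navigation_buttons(RunState.RUNNING))   # {'Back': False, 'Stop': True}
print(navigation_buttons(RunState.FINISHED))  # {'Back': True, 'Exit': True}
```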

Contributions

Bugs

For any bugs or problems that you come across, open an issue that details the problem that you're experiencing.

Extension

Know of an API that you think should be included in PyCurator? Create a Pull Request outlining the API and why you think it would be beneficial, and make sure to follow the format set out through the existing Scraper classes.
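To give a sense of what following a shared scraper format can look like, here is a generic sketch of a base class with one toy subclass. None of these names come from PyCurator: `ExampleBaseScraper`, `query`, `run`, and `InMemoryScraper` are hypothetical, and the real Scraper classes in the repository should be treated as the reference when preparing a Pull Request.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable


class ExampleBaseScraper(ABC):
    """Hypothetical stand-in for a shared scraper interface."""

    def __init__(self, search_terms: Iterable[str]) -> None:
        self.search_terms = list(search_terms)

    @abstractmethod
    def query(self, term: str) -> list[dict[str, Any]]:
        """Return metadata records for one search term."""

    def run(self) -> dict[str, list[dict[str, Any]]]:
        """Query every term and collect the results by term."""
        return {term: self.query(term) for term in self.search_terms}


class InMemoryScraper(ExampleBaseScraper):
    """Toy subclass that searches a local list instead of calling an API."""

    # Made-up records, used only so the example runs offline.
    RECORDS = [
        {"title": "Ocean temperature profiles"},
        {"title": "Urban air temperature sensors"},
        {"title": "Bird migration counts"},
    ]

    def query(self, term: str) -> list[dict[str, Any]]:
        return [r for r in self.RECORDS if term.lower() in r["title"].lower()]


results = InMemoryScraper(["temperature", "bird"]).run()
print({term: len(records) for term, records in results.items()})
```

Keeping the repository-specific logic inside a single overridden method is one common way to let new sources plug in without touching shared code.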

Funding and acknowledgements

The initial development of this program was funded by the Librarians Association of The University of California (LAUC) and UC San Diego Library Research Data Curation Program (RDCP).

Thank you to Matt Peters, Dan LaSusa, John Chen, Joshua Weimer, and Amy Ly for their feedback during testing of early iterations of PyCurator.

Owner

Name: Michael Baluja
Login: michaelbaluja
Kind: user
Location: San Diego, CA

Website: michaelbaluja.com
Twitter: mlbaluja
Repositories: 4
Profile: https://github.com/michaelbaluja

Machine Learning and Data Science MS student, enjoyer, and user.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Baluja"
  given-names: "Michael"
  orcid: "https://orcid.org/0000-0003-1155-8571"
- family-names: "Labou"
  given-names: "Stephanie"
  orcid: "https://orcid.org/0000-0001-5633-5983"
title: "PyCurator"
version: 0.1
doi: [tbd]
date-released: 2022-05-16
repository-code: "https://pypi.org/project/pycurator/"
url: "https://github.com/michaelbaluja/PyCurator"

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 5
Total pull requests: 2
Average time to close issues: 7 days
Average time to close pull requests: 4 days
Total issue authors: 3
Total pull request authors: 1
Average comments per issue: 1.0
Average comments per pull request: 1.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

michaelbaluja (2)
stephlabou (2)
jggautier (1)

Pull Request Authors

stephlabou (2)

Top Labels

Issue Labels

bug (2)
enhancement (1)

Packages

Total packages: 1
Total downloads:
- pypi 12 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 5
Total maintainers: 1

pypi.org: pycurator

Collect data from popular data repositories with ease.

Homepage: https://github.com/michaelbaluja/PyCurator
Documentation: https://pycurator.readthedocs.io/
License: BSD License
Latest release: 0.1.2
published almost 4 years ago

Versions: 5
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 12 Last month

Rankings

Dependent packages count: 10.0%

Stargazers count: 20.3%

Dependent repos count: 21.7%

Average: 21.8%

Forks count: 22.6%

Downloads: 34.2%

Maintainers (1)

michaelbaluja

Last synced: 10 months ago

Dependencies

environment.yml conda

kaggle >=1.5.12
openml >=0.12.2
openpyxl
pandas >=1.4.1
pytest
python >=3.10
requests >=2.27.1
setuptools >=57.4.0
sphinx

docs/doc_requires.txt pypi

sphinx-book-theme *

requirements.txt pypi

appdirs *
kaggle *
pandas *
pytest *
requests *
