Retriever
Retriever: Data Retrieval Tool - Published in JOSS (2017)
Science Score: 100.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 7 DOI reference(s) in README and JOSS metadata
- ✓ Academic publication links: Links to joss.theoj.org, zenodo.org
- ✓ Committers with academic emails: 5 of 73 committers (6.8%) from academic institutions
- ○ Institutional organization owner
- ✓ JOSS paper metadata: Published in Journal of Open Source Software
Repository
Quickly download, clean up, and install public datasets into a database management system
Basic Info
- Host: GitHub
- Owner: weecology
- License: other
- Language: Python
- Default Branch: main
- Homepage: http://data-retriever.org
- Size: 77.4 MB
Statistics
- Stars: 318
- Watchers: 31
- Forks: 141
- Open Issues: 54
- Releases: 12
Metadata Files
README.md

Finding data is one thing. Getting it ready for analysis is another. Acquiring, cleaning, standardizing, and importing publicly available data is time-consuming because many datasets lack machine-readable metadata and do not conform to established data structures and formats. The Data Retriever automates the first steps in the data analysis pipeline by downloading, cleaning, and standardizing datasets, and importing them into relational databases, flat files, or programming languages. Automating this process reduces the time it takes a user to get most large datasets up and running by hours, and in some cases days.
Installing the Current Release
If you have Python installed you can install the current release using either pip:
```bash
pip install retriever
```
or conda after adding the conda-forge channel (`conda config --add channels conda-forge`):
```bash
conda install retriever
```
Depending on your system configuration this may require sudo for pip:
```bash
sudo pip install retriever
```
Precompiled binary installers are also available for Windows, OS X, and Ubuntu/Debian on the releases page. These do not require a Python installation.
List of Available Datasets
Installing From Source
To install the Data Retriever from source, you'll need Python 3.6.8+ with the following packages installed:
- xlrd
The following packages are optionally needed to interact with associated database management systems:
- PyMySQL (for MySQL)
- sqlite3 (for SQLite)
- psycopg2-binary (for PostgreSQL), previously psycopg2.
- pyodbc (for MS Access - this option is only available on Windows)
- Microsoft Access Driver (ODBC for Windows)
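Before choosing an engine, it can help to check which of the optional drivers listed above are importable in the current environment. The following is a minimal sketch using only the standard library; the helper name `available_engines` and the engine labels used as dictionary keys are illustrative, not part of the Data Retriever itself. The import names (`pymysql`, `sqlite3`, `psycopg2`, `pyodbc`) are the modules those packages provide.

```python
from importlib.util import find_spec

# Engine label -> importable module name for the driver listed above.
# The labels are illustrative; the module names are what each package installs.
OPTIONAL_DRIVERS = {
    "mysql": "pymysql",      # PyMySQL
    "sqlite": "sqlite3",     # part of the Python standard library
    "postgres": "psycopg2",  # provided by psycopg2-binary
    "msaccess": "pyodbc",    # only usable on Windows with the Access ODBC driver
}


def available_engines(drivers=OPTIONAL_DRIVERS):
    """Return {engine: bool} telling whether each driver module can be found."""
    return {engine: find_spec(module) is not None
            for engine, module in drivers.items()}


for engine, found in available_engines().items():
    print(f"{engine:10s} {'available' if found else 'missing'}")
```

A `missing` entry only means the Python driver is not installed; it says nothing about whether the database server itself is reachable.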
To install from source
Either use pip to install directly from GitHub:
```shell
pip install git+https://git@github.com/weecology/retriever.git
```
or:
- Clone the repository
- From the directory containing setup.py, run the following command: `pip install .`. You may need to include `sudo` at the beginning of the command depending on your system (i.e., `sudo pip install .`).
More extensive documentation for those interested in developing can be found in the Data Retriever documentation (https://retriever.readthedocs.io/).
Using the Command Line
After installing, run retriever update to download all of the available dataset scripts.
To see the full list of command line options and datasets run retriever --help.
The output will look like this:
```shell
usage: retriever [-h] [-v] [-q]
                 {download,install,defaults,update,new,new_json,edit_json,delete_json,ls,citation,reset,help}
                 ...

positional arguments:
  {download,install,defaults,update,new,new_json,edit_json,delete_json,ls,citation,reset,help}
                        sub-command help
    download            download raw data files for a dataset
    install             download and install dataset
    defaults            displays default options
    update              download updated versions of scripts
    new                 create a new sample retriever script
    new_json            CLI to create retriever datapackage.json script
    edit_json           CLI to edit retriever datapackage.json script
    delete_json         CLI to remove retriever datapackage.json script
    ls                  display a list all available dataset scripts
    citation            view citation
    reset               reset retriever: removes configuration settings,
                        scripts, and cached data
    help

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -q, --quiet           suppress command-line output
```
To install datasets, use retriever install:
```shell
usage: retriever install [-h] [--compile] [--debug]
                         {mysql,postgres,sqlite,msaccess,csv,json,xml} ...

positional arguments:
  {mysql,postgres,sqlite,msaccess,csv,json,xml}
                        engine-specific help
    mysql               MySQL
    postgres            PostgreSQL
    sqlite              SQLite
    msaccess            Microsoft Access
    csv                 CSV
    json                JSON
    xml                 XML

optional arguments:
  -h, --help            show this help message and exit
  --compile             force re-compile of script before downloading
  --debug               run in debug mode
```
Examples
These examples are using the Iris flower dataset. More examples can be found in the Data Retriever documentation.
Using Install
```shell
retriever install -h   # gives install options
```
Using specific database engine, retriever install {Engine}
```shell
retriever install mysql -h   # gives install mysql options
retriever install mysql --user myuser --password ******** --host localhost --port 8888 --database_name testdbase iris
```
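When the MySQL install is scripted, the password does not have to be written into the script. The sketch below builds the same argument list as the command above, using only the flags shown there, and reads the password from an environment variable. The function name `mysql_install_command` and the variable name `RETRIEVER_MYSQL_PASSWORD` are assumptions made for illustration; the Data Retriever does not define them.

```python
import os


def mysql_install_command(dataset, user, database_name,
                          host="localhost", port=3306,
                          password_env="RETRIEVER_MYSQL_PASSWORD"):
    """Build the argv list for `retriever install mysql ... <dataset>`.

    The password is read from the environment variable named by
    `password_env` (a hypothetical name chosen for this sketch).
    """
    password = os.environ.get(password_env)
    if password is None:
        raise KeyError(f"set the {password_env} environment variable first")
    return [
        "retriever", "install", "mysql",
        "--user", user,
        "--password", password,
        "--host", host,
        "--port", str(port),  # 3306 is MySQL's standard port
        "--database_name", database_name,
        dataset,
    ]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`. The password still appears in the process arguments while the command runs, so this keeps it out of source files but not out of a process listing.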
To install data into a SQLite database named iris.db you would use:
```shell
retriever install sqlite iris -f iris.db
```
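Once the install has finished, the resulting file can be inspected with Python's built-in `sqlite3` module. This is a minimal sketch that lists every table in the file with its row count; it does not assume any particular table name, since the names the Data Retriever assigns are not stated here. The helper name `table_row_counts` is illustrative.

```python
import sqlite3
from contextlib import closing


def table_row_counts(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    with closing(sqlite3.connect(db_path)) as conn:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )]
        counts = {}
        for name in tables:
            # Quote the identifier so unusual table names are handled safely.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
```

Calling `table_row_counts("iris.db")` from the directory where the install was run should show the installed tables. Note that `sqlite3.connect` creates an empty file when the path does not exist, so an empty result usually means the path is wrong.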
Using download
```shell
retriever download -h   # gives you help options
retriever download iris
retriever download iris --path C:\Users\Documents
```
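The download commands above can also be driven from a Python script by calling the command line tool through `subprocess`. The sketch below is one way to do that under the assumption that the `retriever` executable is on `PATH`; the helper names `download_command` and `run_download` are illustrative and not part of the Data Retriever.

```python
import shutil
import subprocess
from pathlib import Path


def download_command(dataset, path=None):
    """Build the argv list for `retriever download <dataset> [--path DIR]`."""
    argv = ["retriever", "download", dataset]
    if path is not None:
        argv += ["--path", str(Path(path))]
    return argv


def run_download(dataset, path=None):
    """Run the download, failing early if the CLI is not installed."""
    if shutil.which("retriever") is None:
        raise FileNotFoundError("the retriever command was not found on PATH")
    # check=True raises CalledProcessError if the command exits with an error.
    return subprocess.run(download_command(dataset, path), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces or backslashes, such as the Windows path in the example above.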
Using citation
```shell
retriever citation        # citation of the retriever engine
retriever citation iris   # citation for the iris data
```
Spatial Dataset Installation
Set up Spatial support
To set up spatial support for Postgres using Postgis please refer to the spatial set-up docs.
```shell
retriever install postgres harvard-forest # Vector data
retriever install postgres bioclim # Raster data

# Install only the data of USGS elevation in the given extent
retriever install postgres usgs-elevation -b -94.98704597353938 39.027001800158615 -94.3599408119917 40.69577051867074
```
Website
For more information see the Data Retriever website.
Acknowledgments
Development of this software was funded by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through Grant GBMF4563 to Ethan White and the National Science Foundation as part of a CAREER award to Ethan White.
Owner
- Name: Weecology
- Login: weecology
- Kind: organization
- Website: http://weecology.org
- Repositories: 93
- Profile: https://github.com/weecology
JOSS Publication
Retriever: Data Retrieval Tool
Authors
Tags
data retrieval data processing python data data science datasets
Citation (CITATION)
Morris, B.D. and E.P. White. 2013. The EcoData Retriever: improving access to
existing ecological data. PLOS ONE 8:e65848.
https://doi.org/10.1371/journal.pone.0065848
@article{morris2013ecodata,
title={The EcoData Retriever: Improving Access to Existing Ecological Data},
author={Morris, Benjamin D and White, Ethan P},
journal={PLOS One},
volume={8},
number={6},
pages={e65848},
year={2013},
publisher={Public Library of Science},
doi={10.1371/journal.pone.0065848}
}
Papers & Mentions
Total mentions: 3
*De novo* genome assembly of two tomato ancestors, *Solanum pimpinellifolium* and *Solanum lycopersicum* var. *cerasiforme*, by long-read sequencing
- DOI: 10.1093/dnares/dsaa029
- OpenAlex ID: https://openalex.org/W3125995565
- Published: January 2021
Environmental Enrichment Induces Epigenomic and Genome Organization Changes Relevant for Cognition
- DOI: 10.3389/fnmol.2021.664912
- OpenAlex ID: https://openalex.org/W3157674268
- Published: May 2021
Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline
- DOI: 10.1186/s13059-019-1905-y
- OpenAlex ID: https://openalex.org/W2994632914
- Published: December 2019
GitHub Events
Total
- Issues event: 1
- Watch event: 11
- Issue comment event: 8
- Push event: 1
- Pull request event: 5
- Gollum event: 13
- Fork event: 8
Last Year
- Issues event: 1
- Watch event: 11
- Issue comment event: 8
- Push event: 1
- Pull request event: 5
- Gollum event: 13
- Fork event: 8
Committers
Last synced: 5 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Ben Morris | b****n@b****m | 743 |
| Ethan White | e****n@w****g | 403 |
| henrykironde | h****e@g****m | 393 |
| goelakash | g****3@g****m | 66 |
| zhangcandrew | z****w@g****m | 43 |
| Apoorva Pandey | a****5@g****m | 40 |
| Shreyash sharma | s****l@y****n | 35 |
| Harshit Bansal | h****c@g****m | 34 |
| ShivamNegi | s****9@g****m | 28 |
| Elita Baldridge | e****e@w****g | 26 |
| Aakash Chaudhary | a****0@g****m | 19 |
| Ansh Dassani | a****4@g****m | 17 |
| Ashish | a****1@g****m | 16 |
| Daniel McGlinn | d****n@g****m | 10 |
| Sumit Saha | s****6@g****m | 9 |
| akshayah3 | a****5@g****m | 8 |
| David J. Harris | h****1@g****m | 7 |
| Katherine Thibault | k****t@n****g | 7 |
| Pankaj Kumar | me@p****e | 6 |
| Nageshbansal | n****9@g****m | 6 |
| ddigges | d****s@i****m | 5 |
| Kunal Pal | m****l@g****m | 5 |
| Shawn Taylor | s****r@w****g | 5 |
| kapil kumar | k****3@g****m | 5 |
| paul wolf | p****f@u****u | 5 |
| ashu | j****u@g****m | 4 |
| sarsees | r****e@g****m | 4 |
| pranita-s | p****a@g****m | 4 |
| Kristina Riemer | k****r@w****g | 4 |
| unknown | B****n@.****) | 4 |
| and 43 more... | | |
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 30
- Total pull requests: 84
- Average time to close issues: 8 months
- Average time to close pull requests: about 1 month
- Total issue authors: 8
- Total pull request authors: 17
- Average comments per issue: 2.63
- Average comments per pull request: 2.63
- Merged pull requests: 48
- Bot issues: 0
- Bot pull requests: 1
Past Year
- Issues: 0
- Pull requests: 3
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 3
- Average comments per issue: 0
- Average comments per pull request: 2.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- henrykironde (17)
- Aakash3101 (3)
- kkothari2001 (3)
- ethanwhite (2)
- Nageshbansal (2)
- dikwickley (1)
- bw4sz (1)
- ha0ye (1)
Pull Request Authors
- henrykironde (42)
- dassaniansh (13)
- Aakash3101 (9)
- Nageshbansal (6)
- dikwickley (4)
- kkothari2001 (2)
- dependabot[bot] (2)
- Khush2040 (2)
- Luckysteve007 (2)
- apeksha235 (1)
- jainamritanshu (1)
- bloemenk (1)
- pri1311 (1)
- PatriceJada (1)
- pyther-hub (1)
Packages
- Total packages: 2
- Total downloads:
  - pypi: 818 last month
- Total docker downloads: 2,029
- Total dependent packages: 0 (may contain duplicates)
- Total dependent repositories: 4 (may contain duplicates)
- Total versions: 20
- Total maintainers: 2
pypi.org: retriever
Data Retriever
- Homepage: https://github.com/weecology/retriever
- Documentation: https://retriever.readthedocs.io/
- License: MIT License
- Latest release: 3.1.0 (published over 3 years ago)
Rankings
Maintainers (2)
conda-forge.org: retriever
This module analyzes jpeg/jpeg2000/png/gif image header and return image size.
- Homepage: https://github.com/weecology/retriever
- License: MIT
- Latest release: 3.1.0 (published over 3 years ago)
Rankings
Dependencies
- Pillow *
- PyMySQL >=0.4
- argcomplete *
- coverage *
- future *
- h5py *
- inquirer *
- kaggle *
- numpydoc *
- openpyxl *
- pandas *
- psycopg2-binary *
- requests *
- setuptools *
- sphinx_py3doc_enhanced_theme *
- sphinx_rtd_theme *
- sphinxcontrib-napoleon *
- tables *
- tqdm ==4.30.0
- xlrd >=0.7
- Pillow *
- argcomplete *
- future *
- h5py *
- kaggle *
- pandas *
- requests *
- tqdm *
- xlrd *
- actions/checkout v2 composite
- actions/setup-python v2 composite
- crazy-max/ghaction-docker-meta v1 composite
- docker/build-push-action ad44023a93711e3deb337508980b4b5e9bcdc5dc composite
- docker/login-action 28218f9b04b4f3f62068d7b6ce6ca5b26e35336c composite
- actions/checkout master composite
- actions/setup-python v1 composite
- pypa/gh-action-pypi-publish master composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- codecov/codecov-action v1 composite
- huaxk/postgis-action v1 composite
- mysql 5.7 docker
- osgeo/gdal latest build
- kartoza/postgis latest
- mysql 5.7
- ret_image latest
