ena-upload-cli
ENA upload cli - script your Open Data upload to the European Nucleotide Archive
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (7.4%) to scientific vocabulary
Repository
Statistics
- Stars: 28
- Watchers: 5
- Forks: 17
- Open Issues: 11
- Releases: 42
Metadata Files
README.md
ENA upload cli
This command line tool (CLI) allows easy submission of experimental data and the respective metadata to the European Nucleotide Archive (ENA) using tabular files or one of the Excel spreadsheets that can be found on this template repo. The supported metadata includes study, sample, run and experiment info, so you can use the tool for programmatic submission of everything ENA needs without logging in to the Webin interface. This also includes client-side validation using ENA checklists and releasing the ENA objects. The tool is also available as a Galaxy tool: you can add it to your own Galaxy instance or use one of the existing Galaxy instances, like usegalaxy.eu or usegalaxy.be.
Overview
The metadata should be provided in separate tables or files carrying similar information corresponding to the following ENA objects:
- STUDY
- SAMPLE
- EXPERIMENT
- RUN
You can set the tool to perform the following actions:
- add: add an object to the archive
- modify: modify an object in the archive
- cancel: cancel a private object and its dependent objects
- release: release a private object immediately to the public
After a successful submission, new tsv tables will be generated with the ENA accession numbers filled in along with a submission receipt.
Tool dependencies
- python 3.8+ including the following packages:
  - Genshi
  - lxml
  - pandas
  - requests
  - pyyaml
  - openpyxl
  - jsonschema
Installation
pip install ena-upload-cli
Usage
```
Minimal: ena-upload-cli --action {add,modify,cancel,release} --center CENTER_NAME --secret SECRET

All supported arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --action {add,modify,cancel,release}
                        add: add an object to the archive
                        modify: modify an object in the archive
                        cancel: cancel a private object and its dependent objects
                        release: release a private object immediately to public
  --study STUDY         table of STUDY object
  --sample SAMPLE       table of SAMPLE object
  --experiment EXPERIMENT
                        table of EXPERIMENT object
  --run RUN             table of RUN object
  --data [FILE ...]     data for submission
  --center CENTER_NAME  specific to your Webin account
  --checklist CHECKLIST
                        specify the sample checklist with following pattern: ERC0000XX, Default: ERC000011
  --xlsx XLSX           filled in excel template with metadata
  --isa_json ISA_JSON   BETA: ISA json describing the ENA objects
  --isa_assay_stream ISA_ASSAY_STREAM
                        BETA: specify the assay stream(s) that holds the ENA information, this can be a list of assay streams
  --auto_action         BETA: detect automatically which action (add or modify) to apply when the action column is not given
  --tool TOOL_NAME      specify the name of the tool this submission is done with. Default: ena-upload-cli
  --tool_version TOOL_VERSION
                        specify the version of the tool this submission is done with
  --no_data_upload      indicate if no upload should be performed and you like to submit a RUN object (e.g. if the upload was done separately).
  --draft               indicate if no submission should be performed
  --secret SECRET       .secret.yml file containing the password and Webin ID of your ENA account
  -d, --dev             flag to use the dev/sandbox endpoint of ENA
```
Mandatory arguments: --action, --center and --secret.
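When submissions are scripted from Python, the command can be assembled as an argument list before it is handed to a process runner. The sketch below is only an illustration: `build_command` is a hypothetical helper, not part of ena-upload-cli, and it relies solely on flags listed in the help text above.

```python
ACTIONS = {"add", "modify", "cancel", "release"}


def build_command(action, center, secret, tables=None, data=None, dev=True):
    """Assemble an ena-upload-cli argument list (hypothetical helper)."""
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    # --action, --center and --secret are the mandatory arguments.
    cmd = ["ena-upload-cli", "--action", action,
           "--center", center, "--secret", secret]
    # tables maps an object name (study, sample, experiment, run) to a file.
    for name, path in (tables or {}).items():
        cmd += [f"--{name}", path]
    if data:
        cmd += ["--data", *data]
    if dev:
        cmd.append("--dev")  # target the sandbox instead of production
    return cmd


cmd = build_command("add", "your_center_name", ".secret.yml",
                    tables={"study": "example_tables/ENA_template_studies.tsv"})
print(" ".join(cmd))
# To run it for real: import subprocess; subprocess.run(cmd, check=True)
```

Defaulting `dev` to `True` is a deliberate choice in this sketch, so that an accidental call goes to the sandbox rather than to the production archive.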
ENA Webin
A Webin account can be created here if you don't have one already. The Webin ID is the full username, which looks like Webin-XXXXX. Visit Webin online to check on your submissions, or dev Webin to check on test submissions.
The .secret.yml file
To avoid exposing your credentials through the terminal history, it is recommended to make use of a .secret.yml file, containing your password and username keywords. An example is given in the root of this directory.
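A minimal sketch for creating such a file from Python follows. The key names `username` and `password` are taken from the description above and should be checked against the example file in the repository root. `write_secret_file` is a hypothetical helper, and restricting the file to owner-only access is a suggestion of this sketch, not a requirement of the tool.

```python
import json
import os


def write_secret_file(path, username, password):
    """Write a two-key YAML credentials file readable only by the owner."""
    # json.dumps yields double-quoted strings, which are also valid YAML
    # scalars, so special characters in the password are escaped safely.
    content = (f"username: {json.dumps(username)}\n"
               f"password: {json.dumps(password)}\n")
    # Create the file with mode 0o600 so other users cannot read it
    # (the mode is not fully honoured on Windows).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(content)
    return content
```

Calling `write_secret_file(".secret.yml", "Webin-XXXXX", "your_password")` would produce a file that can then be passed to `--secret`.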
ENA sample checklists
You can specify an ENA sample checklist using the --checklist parameter. By default, the ENA default sample checklist (ERC000011) is used, which covers the minimum information required for a sample. The supported checklists are listed on our template repo.
Fixed sample columns
The command line tool will automatically fetch the correct scientific name based on the taxon ID or fetch the taxon ID based on the scientific name. Both can be given and no overwrite will be done.
- Mandatory: alias, title, sample_description, collection date, geographic location (country and/or sea) and either scientific_name or taxon_id (preferred)
- Optional: common_name, sample_description
| alias | title | taxon_id | scientific_name | common_name | sample_description | collection date | geographic location (country and/or sea) |
|---|---|---|---|---|---|---|---|
| samplealias4 | sampletitle2 | 2697049 | Severe acute respiratory syndrome coronavirus 2 | covid-19 | sampledescription1 | 2020-10-11 | Argentina |
| samplealias5 | sampletitle3 | 2697049 | Severe acute respiratory syndrome coronavirus 2 | covid-19 | sampledescription2 | 2008-01-24 | Belgium |
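Before submitting, a sample table can be checked locally for the fixed columns listed above. The following sketch uses only the standard library. `missing_sample_columns` is a hypothetical helper and does not replace the tool's own validation against the ENA checklist.

```python
import csv
import io

REQUIRED = ["alias", "title", "sample_description", "collection date",
            "geographic location (country and/or sea)"]


def missing_sample_columns(tsv_text):
    """Return the mandatory sample columns absent from a TSV header."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = [name.strip() for name in next(reader)]
    missing = [name for name in REQUIRED if name not in header]
    # Either taxon_id or scientific_name must be present (taxon_id preferred).
    if "taxon_id" not in header and "scientific_name" not in header:
        missing.append("taxon_id or scientific_name")
    return missing


table = ("alias\ttitle\ttaxon_id\tsample_description\n"
         "samplealias4\tsampletitle2\t2697049\tsampledescription1\n")
print(missing_sample_columns(table))
```

For the two-line table in the example, the helper reports the two columns that are absent, collection date and geographic location.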
Custom attributes
Additional custom attributes (i.e. attributes not specified in the ERC checklist) can be added to the sample table by adding columns whose headers are named like sample_attribute[attribute_name], for example sample_attribute[treatment] or sample_attribute[age]. An example tsv file using custom attributes can be found in example_tables/ENA_template_samples_xtra_attrs.tsv. The same syntax also applies to xlsx input files.
| alias | ... | sample_attribute[treatment] | sample_attribute[age] |
|---|---|---|---|
| samplealias4 | ... | treated | 2 days |
| samplealias5 | ... | untreated | 2 days |
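The header convention can be illustrated with a small parser. This is a sketch of the naming pattern only, not the way the tool itself reads the table, and `custom_attributes` is a hypothetical name.

```python
import re

# Matches headers such as sample_attribute[treatment].
ATTRIBUTE_HEADER = re.compile(r"^sample_attribute\[(?P<name>[^\[\]]+)\]$")


def custom_attributes(header):
    """Map each sample_attribute[...] column to its attribute name."""
    found = {}
    for column in header:
        match = ATTRIBUTE_HEADER.match(column.strip())
        if match:
            found[column] = match.group("name")
    return found


header = ["alias", "title", "sample_attribute[treatment]", "sample_attribute[age]"]
print(custom_attributes(header))
```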
Viral submissions
If you want to submit viral samples, you can use the ENA virus pathogen checklist by passing ERC000033 to the --checklist parameter. Check out our viral example command as a demonstration. Please consult the ENA virus pathogen checklist in our template repo to see which values are allowed in the controlled vocabulary fields.
ENA study, experiment and run tables
Please check out the template of your checklist to discover which attributes are mandatory for the study, experiment and run ENA object.
Read info run attributes
Using read_type and read_label as column headers in the table of ENA run objects allows you to set information about reads. Values are given as a comma-separated list, without spaces. read_type has a controlled vocabulary, which can be found in the ENA Documentation. An example tsv file using these attributes can be found in example_tables/ENA_template_runs_read_info.tsv. The same syntax also applies to xlsx input files.
This feature is currently limited to FastQ files.
Encrypted files
When transferring encrypted files, an additional unencrypted_checksum column can be added to the run table. This column should contain the md5 checksum of the unencrypted file. Note that no check is performed on this value.
This feature is currently limited to FastQ files.
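Since the tool does not verify the unencrypted_checksum value, it is worth computing it programmatically rather than by hand. A minimal sketch using the standard library (the function name `md5_of_file` is ours):

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large FastQ files are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest must be taken from the file before encryption, and the resulting string goes into the unencrypted_checksum column of the matching run row.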
Study and experiment custom attributes
As with samples, additional custom attributes can be added to the experiment and study tables by adding columns whose headers are named like experiment_attribute[attribute_name] and study_attribute[attribute_name], respectively.
Dev instance
By default, the submission is sent to the following ENA URL: https://www.ebi.ac.uk/ena/submit/drop-box/submit/?auth=ENA
Use the --dev flag if you want to do a test submission against the sandbox dev instance of ENA: https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/?auth=ENA. A TEST submission will be discarded within 24 hours.
Submitting a selection of rows to ENA
There are two ways of submitting only a selection of objects to ENA. This is handy for recurring submissions, especially when they belong to the same study.
- Manual: you can add an optional status column to every table/sheet that contains the action you want to apply during this submission. If you choose to add only the first 2 samples to ENA, you specify --action add as parameter in the command and you add the add value to the status column of the rows you want to submit, as demonstrated below. The same holds for the actions modify, release and cancel.
- Automatic (BETA): with --auto_action, the tool can detect automatically whether an object (identified by its alias) is already present on ENA and fill in the specified action (--action parameter) accordingly. In practice, this means that if a user chooses to add objects and an object with the same alias already exists, that object will not be added. Conversely, if the command is used to modify objects, the action is applied solely to objects that already exist on ENA. The detection only works with ENA objects that are published and findable on the website through the search function (both the dev and live website). If the tool does not correctly detect the presence of your ENA object, we suggest using the more robust manual approach described above.
Example with modify as seen in the example sample modify table
| alias | status | title | taxon_id | sample_description |
|---|---|---|---|---|
| samplealias4 | modify | sampletitle1 | 2697049 | sampledescription1 |
| samplealias5 | | sampletitle2 | 2697049 | sampledescription2 |
IMPORTANT: if the status column is given but not filled in, or filled in with a different action from the one in the --action parameter, no rows will be submitted! Either leave out the column or add the correct action to every row you want to submit.
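The selection rule described above can be paraphrased in a few lines. This is an illustration of the documented behaviour under our reading of it, not the tool's actual implementation, and `rows_to_submit` is a hypothetical name.

```python
def rows_to_submit(rows, action):
    """Select rows for submission following the status-column rule."""
    # No status column at all: the action applies to every row.
    if not any("status" in row for row in rows):
        return list(rows)
    # Status column present: keep only rows whose status equals the action.
    return [row for row in rows
            if (row.get("status") or "").strip() == action]


samples = [
    {"alias": "samplealias4", "status": "modify"},
    {"alias": "samplealias5", "status": ""},
]
print([row["alias"] for row in rows_to_submit(samples, "modify")])
print([row["alias"] for row in rows_to_submit(samples, "add")])
```

With the example table, only samplealias4 is selected for modify, and nothing is selected for add because the status column exists but never contains that action.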
Using Excel templates
We also support the use of specific Excel templates, designed for each sample checklist. Use the --xlsx option to give the path to a filled-in Excel template file from this template repo.
The data files
Supported data
- [x] Read data
- [ ] Genome Assembly
- [ ] Transcriptome Assembly
- [x] Template Sequence
- [x] Other Analyses
Most files uploaded to the ENA FTP server need to be compressed.
More information on how ENA wants to receive the files can be found here.
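As a sketch of the compression step, the snippet below gzips a file with the standard library. Choosing gzip is an assumption based on the `*gz` example data used in this README; the ENA documentation remains the reference for accepted formats. `gzip_file` is a hypothetical helper.

```python
import gzip
import shutil


def gzip_file(path):
    """Compress path to path + '.gz' and return the new file name."""
    target = path + ".gz"
    # Stream the content so large files are not held in memory.
    with open(path, "rb") as source, gzip.open(target, "wb") as compressed:
        shutil.copyfileobj(source, compressed)
    return target
```

The original file is left in place, so it can still be used to compute checksums of the uncompressed data if needed.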
Note for data upload:
Uploaded files remain stored on the ENA server for some time after the upload. Thus, if multiple test submissions are performed, the data upload can be skipped with --no_data_upload in subsequent submissions. This also allows uploading (large) datasets separately, e.g. with Aspera.
With the --no_data_upload argument, data file(s) still need to be provided with --data if a RUN object is submitted without its MD5 sums in the file_checksum column.
Releasing and canceling a submission
If you want to release or cancel data, use release or cancel as the --action parameter on the command line. Tables that have to be released or cancelled need an accession column with the corresponding accession ids. This means that, if you have not yet submitted your data, you first have to use add to submit it and afterwards use the updated table with accession ids.
By default, the updated tables produced after a submission have the performed action recorded in their status column. Don't forget to change the values to release or cancel if you want to use one of these actions (or delete the status column if your action applies to the whole table).
NOTE: Releasing a study will make all child elements like runs and experiments public.
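Rewriting the status column of an updated table can be scripted. The sketch below copies a TSV table and sets every row's status to the requested action. `set_status` is a hypothetical helper, and the column name status follows the description above.

```python
import csv


def set_status(in_path, out_path, new_status):
    """Copy a TSV table, setting every row's status column to new_status."""
    with open(in_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src, delimiter="\t")
        fields = list(reader.fieldnames or [])
        rows = list(reader)
    # Add the column if the table did not have one yet.
    if "status" not in fields:
        fields.append("status")
    for row in rows:
        row["status"] = new_status
    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fields, delimiter="\t",
                                lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Writing to a separate output path keeps the table returned by the add submission intact, which is useful if the release has to be repeated.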
Tool overview
inputs:
* metadata tables/excelsheet/isajson
  * examples in `example_tables` and on this [template repo](https://github.com/ELIXIR-Belgium/ENA-metadata-templates) for excel sheets
  * (optional) define actions in **status** column e.g. `add`, `modify`, `cancel`, `release` (when not given the whole table is submitted)
  * to perform bulk submission of all objects, the `aliases ids` in the different ENA objects should be associated with each other; the alias ids in the experiment object link all objects together
* experimental data
  * examples in `example_data`
outputs:
* a receipt.xml file in the working directory with the receipt from the ENA submission
* metadata tables with updated info in the same directory as the inputs:
  * updated status: added, modified, canceled, released
  * accession ids
  * submission date
  * file checksums in runs table if not given
  * taxon id or scientific name in sample table if not given
Test the tool
Add metadata and sequence data
ena-upload-cli --action add --center 'your_center_name' --study example_tables/ENA_template_studies.tsv --sample example_tables/ENA_template_samples.tsv --experiment example_tables/ENA_template_experiments.tsv --run example_tables/ENA_template_runs.tsv --data example_data/*gz --dev --secret .secret.yml
Add metadata only
ena-upload-cli --action add --center 'your_center_name' --study example_tables/ENA_template_studies.tsv --sample example_tables/ENA_template_samples.tsv --experiment example_tables/ENA_template_experiments.tsv --run example_tables/ENA_template_runs_md5sums.tsv --dev --secret .secret.yml
Add studies
ena-upload-cli --action add --center 'your_center_name' --study example_tables/ENA_template_studies.tsv --dev --secret .secret.yml
Modify sample metadata
ena-upload-cli --action modify --center 'your_center_name' --sample example_tables/ENA_template_samples_modify.tsv --dev --secret .secret.yml
Viral data
ena-upload-cli --action add --center 'your_center_name' --study example_tables/ENA_template_studies.tsv --sample example_tables/ENA_template_samples_vir.tsv --experiment example_tables/ENA_template_experiments.tsv --run example_tables/ENA_template_runs.tsv --data example_data/*gz --dev --checklist ERC000033 --secret .secret.yml
Using an Excel template
ena-upload-cli --action add --center 'your_center_name' --data example_data/*gz --dev --checklist ERC000033 --secret .secret.yml --xlsx example_tables/ENA_excel_example_ERC000033.xlsx
Using an ISA JSON
ena-upload-cli --action add --center 'your_center_name' --data example_data/*gz --dev --secret .secret.yml --isa_json tests/test_data/simple_test_case_v2.json --isa_assay_stream "Ena stream 1"
Release submission
ena-upload-cli --action release --center 'your_center_name' --study example_tables/ENA_template_studies_release.tsv --dev --secret .secret.yml
Note for Windows users: Windows, by default, does not support wildcard expansion in command-line arguments. Because of this, the --data example_data/*gz argument should be substituted with one containing a list of the data files. For this example, use:
--data example_data/ENA_TEST1.R1.fastq.gz example_data/ENA_TEST2.R1.fastq.gz example_data/ENA_TEST2.R2.fastq.gz
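When the command is launched from a Python script, the wildcard can be expanded with the standard `glob` module before the file list is passed to --data, which sidesteps the shell difference. A minimal sketch, with `expand_data_args` as a hypothetical helper name:

```python
import glob


def expand_data_args(patterns):
    """Expand wildcard patterns into a sorted list of matching files."""
    files = []
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        # Keep the literal value when nothing matches, so that a missing
        # file is reported downstream instead of being silently dropped.
        files.extend(matches or [pattern])
    return files
```

For instance, `expand_data_args(["example_data/*gz"])` returns the matching file names, which can be appended after `--data` in an argument list.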
Owner
- Name: useGalaxy.eu
- Login: usegalaxy-eu
- Kind: organization
- Email: contact@galaxyproject.eu
- Location: Europe
- Website: https://usegalaxy.eu
- Repositories: 47
- Profile: https://github.com/usegalaxy-eu
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Droesbeke"
    given-names: "Bert"
    orcid: "https://orcid.org/0000-0002-3079-6586"
  - family-names: "Yusuf"
    given-names: "Dilmurat"
  - family-names: "Grüning"
    given-names: "Björn"
title: "ena-upload-cli"
doi: 10.5281/zenodo.5599603
date-released: 2021-02-12
url: "https://github.com/usegalaxy-eu/ena-upload-cli"
preferred-citation:
  type: article
  authors:
    - family-names: "Roncoroni"
      given-names: "Miguel"
      orcid: "https://orcid.org/0000-0001-7461-1427"
    - family-names: "Droesbeke"
      given-names: "Bert"
      orcid: "https://orcid.org/0000-0002-3079-6586"
    - family-names: "Eguinoa"
      given-names: "Ignacio"
      orcid: "https://orcid.org/0000-0001-8231-3323"
    - family-names: "De Ruyck"
      given-names: "Kim"
    - family-names: "D’Anna"
      given-names: "Flora"
    - family-names: "Yusuf"
      given-names: "Dilmurat"
    - family-names: "Grüning"
      given-names: "Björn"
    - family-names: "Backofen"
      given-names: "Rolf"
    - family-names: "Coppens"
      given-names: "Frederik"
  doi: "https://doi.org/10.1093/bioinformatics/btab421"
  journal: "Bioinformatics"
  month: 11
  start: 3983
  end: 3985
  title: "A SARS-CoV-2 sequence submission tool for the European Nucleotide Archive"
  issue: 21
  volume: 37
  year: 2021
GitHub Events
Total
- Create event: 3
- Release event: 2
- Issues event: 10
- Delete event: 1
- Issue comment event: 44
- Push event: 47
- Pull request review event: 19
- Pull request review comment event: 11
- Pull request event: 17
- Fork event: 1
Last Year
- Create event: 3
- Release event: 2
- Issues event: 10
- Delete event: 1
- Issue comment event: 44
- Push event: 47
- Pull request review event: 19
- Pull request review comment event: 11
- Pull request event: 17
- Fork event: 1
Committers
Last synced: almost 3 years ago
All Time
- Total Commits: 248
- Total Committers: 10
- Avg Commits per committer: 24.8
- Development Distribution Score (DDS): 0.323
Top Committers
| Name | Email | Commits |
|---|---|---|
| bedroesb | b****o@p****e | 168 |
| Bert Droesbeke | 4****b@u****m | 29 |
| Fritjof Lammers | f****s@d****e | 20 |
| dyusuf | d****f@g****m | 10 |
| bedroesb | b****b@u****m | 7 |
| Björn Grüning | b****n@g****u | 6 |
| Taavi Päll | t****1@g****m | 4 |
| rafael buono | 7****o@u****m | 2 |
| ieguinoa | i****a@g****m | 1 |
| bgruening | b****g@u****m | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 42
- Total pull requests: 79
- Average time to close issues: 3 months
- Average time to close pull requests: 11 days
- Total issue authors: 11
- Total pull request authors: 9
- Average comments per issue: 1.29
- Average comments per pull request: 1.51
- Merged pull requests: 75
- Bot issues: 0
- Bot pull requests: 25
Past Year
- Issues: 9
- Pull requests: 17
- Average time to close issues: 13 days
- Average time to close pull requests: 9 days
- Issue authors: 3
- Pull request authors: 3
- Average comments per issue: 0.11
- Average comments per pull request: 3.06
- Merged pull requests: 15
- Bot issues: 0
- Bot pull requests: 8
Top Authors
Issue Authors
- bedroesb (25)
- Cecilia-Sensalari (3)
- roncoronimiguel (2)
- mobilegenome (2)
- wna-se (2)
- cgirardot (2)
- Najatamk (1)
- mthang (1)
- tpall (1)
- FernandoDuarteF (1)
- rabuono (1)
Pull Request Authors
- bedroesb (42)
- github-actions[bot] (28)
- cgirardot (6)
- mobilegenome (3)
- nuwang (2)
- tpall (1)
- ieguinoa (1)
- wna-se (1)
- rabuono (1)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 12,950 last month
- Total dependent packages: 1
- Total dependent repositories: 3
- Total versions: 42
- Total maintainers: 2
pypi.org: ena-upload-cli
Command Line Interface to upload data to the European Nucleotide Archive
- Homepage: https://github.com/usegalaxy-eu/ena-upload-cli
- Documentation: https://ena-upload-cli.readthedocs.io/
- License: MIT
- Latest release: 0.9.0 (published 10 months ago)
Dependencies
- genshi *
- lxml *
- openpyxl *
- pandas >=1.2
- pyyaml *
- requests *
- required *
- actions/checkout v2 composite
- actions/setup-python v2 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- peter-evans/create-pull-request v3 composite