https://github.com/asadprodhan/downloading_genomes_from_refseq
Downloading Genomes from RefSeq
Keywords
download
genomes
refseq
Created almost 4 years ago
· Last pushed almost 3 years ago
https://github.com/asadprodhan/Downloading_genomes_from_RefSeq/blob/main/
# **Downloading Genomes from RefSeq**
## **Step 1: Collect the assembly summary report for your organism of interest from the NCBI RefSeq Index**

**For example, the assembly summary report for Bacteria can be obtained as follows:**

```
wget ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/assembly_summary_refseq.txt
```

For other organisms, navigate to the assembly summary report starting from the Index of /genomes/refseq as shown below:
Figure showing organism directory in RefSeq
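
Before filtering, it can help to confirm which column number holds which field, since the later steps refer to columns by position. The following is a minimal sketch, not part of the original workflow: `list_columns` is a hypothetical helper name, and it assumes the column header is a tab-separated line that starts with `#` and contains `assembly_accession`.

```shell
#!/bin/bash
# Hypothetical helper: print the column names of an assembly summary report,
# one per line, prefixed with their column number.
# Usage: list_columns < assembly_summary_refseq.txt
list_columns() {
    awk -F '\t' '
        # The header is assumed to be the "#" line that names assembly_accession
        /^#/ && /assembly_accession/ {
            sub(/^#[ ]*/, "", $1)          # strip the leading "#" and spaces
            for (i = 1; i <= NF; i++) print i ": " $i
            exit                           # only the first header line is needed
        }
    '
}
```

Running `list_columns < assembly_summary_refseq.txt` should list the fields in order, so the column numbers used in the filtering step can be checked against the report you downloaded.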
## **Step 2: Filter out your targeted genomes from the assembly report**

**For example, all species of Pseudomonas can be extracted from the bacterial assembly report as follows:**

```
#!/bin/bash
# Input is the report downloaded in Step 1
awk -F '\t' '{if($8 ~ /Pseudomonas/) print $1","$2","$3","$5","$8","$11","$12","$14","$15","$16","$20}' assembly_summary_refseq.txt > assembly_summary_complete_genomes_Pseudomonas.txt
```

**What the script does:**

- Column 8 ($8) in the assembly report contains the name of the organism. `~ /Pseudomonas/` keeps only the rows whose organism name contains Pseudomonas.
- The selected rows are written as comma-separated values, keeping the following metadata columns of the assembly report:
- Column 1 ($1): assembly_accession
- Column 2 ($2): bioproject ID
- Column 3 ($3): biosample ID
- Column 5 ($5): refseq_category, is it a representative genome? Representative genomes are quality-checked by the RefSeq team
- Column 8 ($8): organism_name
- Column 11 ($11): version_status, is it the latest?
- Column 12 ($12): assembly_level, complete genome, scaffold or contig
- Column 14 ($14): genome_rep, full or partial?
- Column 15 ($15): seq_rel_date, release date
- Column 16 ($16): asm_name, assembly name
- Column 20 ($20): ftp_path, the download link. The links as they appear here do not download the files; they need to be amended in the following step to make them download-ready.
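
The filter above selects on the organism name only. If you also want to restrict the result to the latest version of complete genomes, the version_status and assembly_level columns can be tested as well. The following is a sketch under assumptions: `filter_complete_latest` is a hypothetical helper name, and the values `latest` and `Complete Genome` are assumed to be the exact strings used in columns 11 and 12 of the report, so check them against your file first.

```shell
#!/bin/bash
# Hypothetical helper: keep only the latest, complete genomes of one genus.
# Prints the same eleven columns as the Step 2 script, comma-separated.
# Usage: filter_complete_latest Pseudomonas < assembly_summary_refseq.txt
filter_complete_latest() {
    local genus="$1"
    awk -F '\t' -v OFS=',' -v genus="$genus" '
        /^#/ { next }                      # skip the comment and header lines
        # Assumed values: "latest" in column 11, "Complete Genome" in column 12
        $8 ~ genus && $11 == "latest" && $12 == "Complete Genome" {
            print $1, $2, $3, $5, $8, $11, $12, $14, $15, $16, $20
        }
    '
}
```

For example, `filter_complete_latest Pseudomonas < assembly_summary_refseq.txt > assembly_summary_complete_genomes_Pseudomonas.txt` would produce a file with the same layout as the one above, only with fewer rows.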
## **Step 3: Amend the above links to get them download-ready**

In column 20, the links appear as follows:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/763/245/GCF_000763245.3_ASM76324v3

To make a link download-ready, two amendments are required:

- The last part, i.e. GCF_000763245.3_ASM76324v3, needs to be repeated, so that the link looks like this: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/763/245/GCF_000763245.3_ASM76324v3/GCF_000763245.3_ASM76324v3
- A file extension (_genomic.fna.gz) needs to be added.

So, the download-ready version of the links in column 20 will look like this:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/763/245/GCF_000763245.3_ASM76324v3/GCF_000763245.3_ASM76324v3_genomic.fna.gz

**This amendment can be done in Excel as follows:**

- Convert the filtered assembly report from text to xlsx format
- Select Column 20 and split it using the Text to Columns function in the Data tab, with / as the text separator
- Then build the link using the concatenation function in Excel
- Save the names of the genomes and their newly built download-ready links in csv format. This file will serve as a template or metadata for the next step
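
As an alternative to Excel, the same two amendments can be scripted. The following is a minimal sketch, not the author's method: `build_download_links` is a hypothetical helper name, it assumes the input is the comma-separated output of Step 2 with the ftp_path in the last field, and it uses the last part of the path (for example GCF_000763245.3_ASM76324v3) as the genome name, which is an illustrative choice.

```shell
#!/bin/bash
# Hypothetical helper: build "name,download-ready link" lines from Step 2 output.
# Assumes the ftp_path is the last comma-separated field on each line.
# Usage: build_download_links < assembly_summary_complete_genomes_Pseudomonas.txt
build_download_links() {
    awk -F ',' -v OFS=',' '
        {
            path = $NF
            if (path == "" || path == "na") next   # skip rows without a link
            sub(/\/+$/, "", path)                  # drop any trailing slash
            n = split(path, parts, "/")            # last part is the directory name
            base = parts[n]
            # Repeat the last part and append the file extension
            print base, path "/" base "_genomic.fna.gz"
        }
    '
}
```

Redirecting the output to a csv file, for example `build_download_links < assembly_summary_complete_genomes_Pseudomonas.txt > genomes.csv` (the file name is only an example), gives a two-column file with a name in field 1 and a link in field 2.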
## **Step 4: Download the genomes**

The following script downloads the genomes using the download-ready links and renames the files:

```
#!/bin/bash
#
# Text formatting
Red="$(tput setaf 1)"
Green="$(tput setaf 2)"
reset="$(tput sgr0)"   # turns off all attributes
Bold="$(tput bold)"
#
# CSV with genome names (field 1) and FTP links (field 2);
# the working directory must contain exactly one csv file
SAMPLES=*.csv
#
while IFS=, read -r field1 field2
do
    echo "${Red}${Bold} Downloading...${reset}: ${field1}"
    echo "Name     : $field1"
    echo "FTP-link : $field2"
    wget "${field2}" -O "${field1}.fna.gz"
    gzip -d "${field1}.fna.gz"
    mv "${field1}.fna" "${field1}.fasta"
    echo "${Green}${Bold} Download completed${reset}: ${field1}"
    echo " "
done < ${SAMPLES}
```

**What the script does:**

- 'SAMPLES=*.csv' takes a csv file that has the genome names in Column 1 (Field 1) and the download-ready links in Column 2 (Field 2). Run the script in a directory that contains only one csv file. ***Make sure that the genome names (Field 1) DO NOT have any spaces***
- 'wget' downloads and renames the files
- 'gzip' decompresses the file
- 'mv' changes the file extension from 'fna' to 'fasta'
- 'echo' shows the progress on the screen
- 'tput' commands are for color formatting of the screen output (optional)
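
Because the download loop trusts the csv file as-is, it can be worth checking the file before starting a long run. The following is a sketch under assumptions, not part of the original script: `check_samples_csv` is a hypothetical helper name, and the three checks (two fields per line, no spaces or tabs in the name, link ending in `_genomic.fna.gz`) are illustrative. The last check would also flag Windows-style line endings, which a csv saved from Excel may carry.

```shell
#!/bin/bash
# Hypothetical helper: report problems in the "name,link" csv on stderr.
# Returns 0 if every line passes, 1 otherwise.
# Usage: check_samples_csv < genomes.csv
check_samples_csv() {
    awk -F ',' '
        NF != 2 {
            printf "line %d: expected 2 fields, found %d\n", NR, NF
            bad = 1
            next
        }
        $1 ~ /[ \t]/ {
            printf "line %d: genome name contains a space or tab\n", NR
            bad = 1
        }
        # A trailing carriage return would also make this check fail
        $2 !~ /_genomic\.fna\.gz$/ {
            printf "line %d: link does not end in _genomic.fna.gz\n", NR
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' >&2
}
```

A possible use is `check_samples_csv < genomes.csv && bash download.sh`, where both file names are placeholders for your own csv file and download script.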
**The End**
Owner
- Name: Asad Prodhan
- Login: asadprodhan
- Kind: user
- Location: Perth, Australia
- Company: Department of Primary Industries and Regional Development
- Website: www.linkedin.com/in/asadprodhan
- Twitter: Asad_Prodhan
- Repositories: 2
- Profile: https://github.com/asadprodhan
Laboratory Scientist at DPIRD. My work involves Oxford Nanopore Sequencing and Bioinformatics for pest and pathogen diagnosis.