https://github.com/asadprodhan/downloading_genomes_from_refseq

Downloading Genomes from RefSeq

Keywords

download genomes refseq

Repository

Downloading Genomes from RefSeq

Basic Info

Host: GitHub
Owner: asadprodhan
Language: Shell
Default Branch: main
Homepage:
Size: 48.8 KB

Statistics

Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Created almost 4 years ago · Last pushed almost 3 years ago

# **Downloading Genomes from RefSeq**

## **Step 1: Collect the assembly summary report for your organism of interest from the NCBI RefSeq Index**

**For example, the assembly summary report for Bacteria can be obtained as follows:**

```
wget ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/assembly_summary.txt
```

For other organisms, navigate to the assembly summary report starting from the Index of /genomes/refseq as shown below:

Figure showing organism directory in RefSeq
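
For scripted use, the navigation described above can be approximated by building the report URL from the organism directory name. This is a minimal sketch: the function name `summary_url` and the example group `fungi` are illustrative, and it assumes that every organism directory under /genomes/refseq holds a file named `assembly_summary.txt`, which is worth confirming in the index before relying on it.

```shell
#!/bin/bash
# summary_url GROUP
# Print the assumed location of the assembly summary report for one
# organism directory (e.g. bacteria, fungi, viral) of the RefSeq index.
summary_url() {
    local group="$1"
    echo "https://ftp.ncbi.nlm.nih.gov/genomes/refseq/${group}/assembly_summary.txt"
}

# Show the URL that would be used for the fungi directory
summary_url fungi

# To download the report (requires network access):
# wget "$(summary_url fungi)" -O assembly_summary_fungi.txt
```
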

## **Step 2: Filter your target genomes from the assembly report**

**For example, all species of Pseudomonas can be extracted from the bacterial assembly report as follows:**

```
#!/bin/bash
# Keep the rows whose organism name (column 8) contains "Pseudomonas"
# and print the selected metadata columns, separated by commas
awk -F '\t' '{if($8 ~ /Pseudomonas/) print $1","$2","$3","$5","$8","$11","$12","$14","$15","$16","$20}' assembly_summary.txt > assembly_summary_complete_genomes_Pseudomonas.txt
```

**What the script does:**

Column 8 ($8) in the assembly report contains the name of the species, so `~ /Pseudomonas/` keeps only the Pseudomonas species. Each matching row is written out together with the following metadata columns of the assembly report:

- Column 1 ($1): assembly_accession
- Column 2 ($2): bioproject ID
- Column 3 ($3): biosample ID
- Column 5 ($5): refseq_category, is it a representative genome? Representative genomes are quality-checked by the RefSeq team
- Column 8 ($8): organism_name
- Column 11 ($11): version_status, is it the latest?
- Column 12 ($12): assembly_level, complete genome, scaffold or contig
- Column 14 ($14): genome_rep, full or partial?
- Column 15 ($15): seq_rel_date, release date
- Column 16 ($16): asm_name, assembly name
- Column 20 ($20): ftp_path, the download link (however, the links as they appear here do not download the files; they need to be amended in the following step to make them download-ready)

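
The same column tests can be combined to narrow the selection further. The sketch below is one possible variant, not part of the original workflow: the function name `filter_complete` is hypothetical, and it keeps only rows whose version_status (column 11) is `latest` and whose assembly_level (column 12) is `Complete Genome`, printing the accession and the ftp_path.

```shell
#!/bin/bash
# filter_complete GENUS REPORT
# Print "accession,ftp_path" for every row of a tab-separated assembly
# report whose organism name (column 8) matches GENUS, whose
# version_status (column 11) is "latest" and whose assembly_level
# (column 12) is "Complete Genome".
filter_complete() {
    local genus="$1" report="$2"
    awk -F '\t' -v genus="$genus" '
        $8 ~ genus && $11 == "latest" && $12 == "Complete Genome" {
            print $1 "," $20
        }' "$report"
}

# Example call (assumes the report from Step 1 is in the working directory):
# filter_complete Pseudomonas assembly_summary.txt > Pseudomonas_complete.csv
```

Using the accession rather than the organism name as the first field keeps the names free of spaces.
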
## **Step 3: Amend the above links to get them download-ready**

In column 20, the links appear as follows:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/763/245/GCF_000763245.3_ASM76324v3

To make a link download-ready, two amendments are required:

- The last part, i.e. GCF_000763245.3_ASM76324v3, needs to be repeated, so that the link looks like this:

  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/763/245/GCF_000763245.3_ASM76324v3/GCF_000763245.3_ASM76324v3

- A file suffix (_genomic.fna.gz) needs to be added

So, the download-ready version of the links in column 20 will look like this:

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/763/245/GCF_000763245.3_ASM76324v3/GCF_000763245.3_ASM76324v3_genomic.fna.gz

**This amendment can be done in Excel as follows:**

- Convert the filtered assembly report from text to xlsx format
- Select Column 20 and split it using the Text to Columns function in the Data tab, with / as the separator
- Then build the link using the concatenation function in Excel
- Save the names of the genomes and their newly built download-ready links in csv format. This file will serve as a template or metadata for the next step

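
As an alternative to the spreadsheet route, the two amendments can be sketched in bash with parameter expansion. The function names `make_link` and `build_metadata` are hypothetical, and `build_metadata` assumes a comma-separated input whose first field is the assembly accession and whose last field is the ftp_path, as written by the Step 2 script.

```shell
#!/bin/bash
# make_link FTP_PATH
# Repeat the last path component and append the _genomic.fna.gz suffix.
make_link() {
    local path="${1%/}"        # drop a trailing slash, if any
    local base="${path##*/}"   # last path component, e.g. GCF_..._ASM...
    echo "${path}/${base}_genomic.fna.gz"
}

# build_metadata FILTERED_REPORT
# Print "accession,download-ready link" for every row of the report.
build_metadata() {
    local accession rest
    while IFS=, read -r accession rest
    do
        # ${rest##*,} keeps only the text after the last comma (the ftp_path)
        echo "${accession},$(make_link "${rest##*,}")"
    done < "$1"
}

# The example link from this step:
make_link "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/763/245/GCF_000763245.3_ASM76324v3"
# prints https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/763/245/GCF_000763245.3_ASM76324v3/GCF_000763245.3_ASM76324v3_genomic.fna.gz
```

Redirecting the output of `build_metadata` to a csv file would give a two-column file in the layout that Step 4 expects, with the accession serving as the genome name.
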
## **Step 4: Download the genomes**

The following script will download the genomes using the download-ready links and rename the files

```
#!/bin/bash
#
# text formatting
Red="$(tput setaf 1)"
Green="$(tput setaf 2)"
reset="$(tput sgr0)" # turns off all attributes
Bold="$(tput bold)"
#
# FTP-links: expects exactly one csv file in the working directory
SAMPLES=*.csv
#
while IFS=, read -r field1 field2
do
    # strip a trailing carriage return, in case the csv has Windows line endings
    field2="${field2%$'\r'}"
    echo "${Red}${Bold} Downloading...${reset}: ${field1}"
    echo "Name     : $field1"
    echo "FTP-link : $field2"
    wget "${field2}" -O "${field1}.fna.gz"
    gzip -d "${field1}.fna.gz"
    mv "${field1}.fna" "${field1}.fasta"
    echo "${Green}${Bold} Download completed${reset}: ${field1}"
    echo " "
done < ${SAMPLES}
```

**What the script does:**

- 'SAMPLES=*.csv' takes a csv file that has the genome names in Column 1 (Field 1) and the download-ready links in Column 2 (Field 2). The working directory must contain exactly one csv file. ***Make sure that the genome names (Field 1) DO NOT contain any spaces***
- 'wget' downloads and renames the files
- 'gzip' decompresses the file
- 'mv' changes the file extension from 'fna' to 'fasta'
- 'echo' shows the progress on the screen
- 'tput' commands are for color formatting of the screen displays (optional)
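
Because the download loop relies on the layout of the csv file, it can help to check the file before starting. The following is a minimal sketch under stated assumptions: the function name `check_metadata` is hypothetical, and the rules it applies (a non-empty name without spaces, and a link ending in `_genomic.fna.gz`) are taken from the requirements described in Steps 3 and 4.

```shell
#!/bin/bash
# check_metadata CSV
# Report rows whose name is empty or contains a space, or whose link
# does not end in _genomic.fna.gz. Returns 1 if any problem is found.
check_metadata() {
    local csv="$1" status=0 line=0 name link
    # "|| [[ -n $name ]]" also processes a last row without a final newline
    while IFS=, read -r name link || [[ -n "$name" ]]
    do
        line=$((line + 1))
        link="${link%$'\r'}"   # tolerate Windows line endings
        if [[ -z "$name" || "$name" == *" "* ]]; then
            echo "line ${line}: name is empty or contains a space: '${name}'"
            status=1
        fi
        if [[ "$link" != *_genomic.fna.gz ]]; then
            echo "line ${line}: link does not end in _genomic.fna.gz"
            status=1
        fi
    done < "$csv"
    return "$status"
}

# Example call before running the download script
# (metadata.csv is a placeholder file name):
# check_metadata metadata.csv && echo "metadata looks fine"
```
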

**The End**

Owner

Name: Asad Prodhan
Login: asadprodhan
Kind: user
Location: Perth, Australia
Company: Department of Primary Industries and Regional Development

Website: www.linkedin.com/in/asadprodhan
Twitter: Asad_Prodhan
Repositories: 2
Profile: https://github.com/asadprodhan

Laboratory Scientist at DPIRD. My work involves Oxford Nanopore Sequencing and Bioinformatics for pest and pathogen diagnosis.
