https://github.com/bge-barcoding/sample-processing

Collection of scripts to download NCBI taxonomy, process BOLD-downloaded sample metadata, and generate necessary files for downstream processes.

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
○
codemeta.json file
○
.zenodo.json file
○
DOI references
✓
Academic publication links
Links to: ncbi.nlm.nih.gov
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (8.3%) to scientific vocabulary

Last synced: 6 months ago · JSON representation

Repository

Collection of scripts to download NCBI taxonomy, process BOLD-downloaded sample metadata, and generate necessary files for downstream processes.

Basic Info

Host: GitHub
Owner: bge-barcoding
License: apache-2.0
Language: Python
Default Branch: main
Homepage:
Size: 169 KB

Statistics

Stars: 0
Watchers: 0
Forks: 1
Open Issues: 0
Releases: 0

Fork of SchistoDan/sample-processing

Created over 1 year ago · Last pushed about 1 year ago

https://github.com/bge-barcoding/sample-processing/blob/main/

Collection of scripts for processing and parsing necessary BOLD-downloaded sample metadata, and generating input files for downstream processes (e.g. ([uploading trimmed reads to ENA](https://github.com/bge-barcoding/ena-read-upload), input into [skim2mito](https://github.com/o-william-white/skim2mito) and [MGE](https://github.com/bge-barcoding/MitoGeneExtractor-BGE), and [requesting taxid creation from ENA](https://github.com/bge-barcoding/ena-taxid-creation))). 
- Written by Ben Price and Dan Parsons @ NHMUK.
- For help, see usage information and docstrings within each script.
- Requires sample_processing conda environment to be activated. Environment with all necessary dependencies can be created from sample_processing.yaml file in this repo.

## 1_combine_tsv.py ##
- Merges TSV files from multiple zipped folders downloaded from BOLD.
  - **`usage: python combine_tsv.py  `**
  - `output_dir: Directory where combined .tsv files will be stored.`
  - `zip_files_dir: Directory containing the zipped files to be merged.
    - Recommended directory structure:
```
parent_directory/
 collection_data.tsv
 lab.tsv
 merged_custom_fields.tsv
 specimen_details.tsv
 tags.tsv
 taxonomy.tsv
 voucher.tsv

 subdirectory.zip
    collection_data.tsv
    lab.tsv
    merged_custom_fields.tsv
    specimen_details.tsv
    tags.tsv
    taxonomy.tsv
    voucher.tsv

 subdirectory2.zip
      collection_data.tsv
     lab.tsv
     merged_custom_fields.tsv
     specimen_details.tsv
     tags.tsv
     taxonomy.tsv
     voucher.tsv
```
 
## 2_download_taxonomy.sh ##
- Downloads the [newst NCBI taxonomy (i.e. new_taxdump)](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/).
- If the taxdump was previously downloaded to the same directory, the script will replace that taxdump with the newst version. The tar.gz download will also be removed post-extraction.
  - **`usage: srun download_taxonomy.sh  `**
  - `output_file = Name of NCBI taxdump (e.g. _taxdump.tar.gz)`
  - `extract_dir = Directory to output taxdump files to (e.g. _taxdump). If given a relative or absolute path, new directories will be created.`
 


## 3_local_bold_processing.py ##
- Merges relevant sample metadata from BOLD .tsv files, and resolves the taxonomic ID using the hierarchical structure and checks for mismatches between BOLD and NCBI taxonomy. Outputs sample_metadata.csv containing fields below:
  - Sample ID
  - Process ID
  - BOLD taxonomic ranks (phylum->species)
  - taxid
  - matched_rank (taxonimic rank the taxid corresponds to)
  - lineage (full NCBI lineage for taxid)
  - lineage_mismatch (did the major BOLD taxonomic ranks match the fetched NCBI lineage)
  - BOLD sample metadata (Identifier (identified_by), Collection Date (collection_date), Geographic Location (Country/Ocean) (geographic_location), Exact Site (geographic_location_locality), Latitude (latitude), Longitude (longitude), Collected By (collected_by), Habitat (habitat), Tissue Descriptor (organism_part), Sex (sex), Life stage (lifestage), Museum ID (specimen_voucher), Institution Storing (collecting_institution), Type Status (type_status).
- **`usage: python local_bold_processing.py   `**
  - `input_dir = Directory containing BOLD-downloaded sample metadata (.tsv files).`
  - `rankedlineage_path = Path to NCBI taxonomic hierarchy/lineage (_taxdump/rankedlineage.dmp file).`
  - `output = Directory to output sample_metadata.csv to. Provide name of output .csv file`
 
If metadata was not collected for a particular sample, 'not collected' output to field as required by [ToL ENA sample registration checklist](https://www.ebi.ac.uk/ena/browser/view/ERC000053).

**Example samples_metadata.csv**
| Sample ID | Process ID  | Phylum | Class | Order | Family | Subfamily | Genus | Species  | taxid | matched_rank | specimen_voucher | lifestage | collection_date | geographic_location | geographic_location_locality | latitude | longitude | collected_by | habitat | identified_by | collecting_institution | organism_part | sex | type_status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | ---| --- | --- | --- |--- | --- | --- | --- | --- | --- | --- | ---| --- | --- | --- | --- |
| BGE_0001_A01  | BSNHM001-24 | Arthropoda | Insecta | Trichoptera | Apataniidae | Apataniinae | Apatania | Apatania stylata | 177658 | genus | 'Museum ID' | adult | YYYY-MM-DD | France | not collected | lat (DD) | lon (DD) | 'Collectors' | not collected | not collected | NHMMUK | Whole | M | type |
| BGE_0001_A02 | BSNHM002-24 | Arthropoda | Insecta | Trichoptera | Glossosomatidae | Agapetinae | Agapetus | Agapetus iridipennis | 177627 | genus | 'Museum ID' | adult | YYYY-MM-DD | Switzerland | not collected | lat (DD) | lon (DD) | 'Collectors' | not collected | not collected | Naturalis | not collected | F | Paratype |
| BGE_0001_A03 | BSNHM003-24 | Arthropoda | Insecta | Trichoptera | Glossosomatidae | Hydropsychidae | Diplectrona | Diplectrona meridionalis | 177860 | genus | 'Museum ID' | adult | YYYY-MM-DD |  France | not collected | lat (DD) | lon (DD) | 'Collectors' | not collected | not collected | NHMMUK | Leg | M | no |

- This can also be run using the 3_local_bold_proccessing.sh script in this repo to run it on a slurm cluster. This will greatly speed up creation of the sample_metadata.csv output file.

## 4_samples_spreadsheet.py ##
Script to generate samples.csv requried to run MGE and skim2mito piplines. Script also screen BGE project samples using the 27 BGE project codes.
- **`usage: python 2_samples_spreadsheet.py [path/to/raw/read/dir] [path/to/samples_metadata.csv]`**
- `path/to/raw/read/dir: Path to parent directory with 'flat' or 'nested' structure (see below/docstring for more example) containing raw PE read files.`
- `path/to/output/dir/samples_metadata.csv`: Directory containing sample_metadata.csv file.`
- outputs three CSV files containing ID (Process ID), forward (absolute path to R1 read (fastq.gz), reverse (absolute path to R2 read (fastq.gz) and taxid to current directory. samples_[parent_dir_name]_types.csv also contains type_status field, for reference.
  - samples_[parent_dir_name].csv = contains all samples containing BGE project codes in their Process ID's.
  - samples_[parent_dir_name]_nonproject.csv = contains all samples not containing BGE project codes.
  - samples_[parent_dir_name]_types.csv = contains all samples (project or non-project) that have 'type' in Type Status field.
```
Example Directory Structures Supported (where BSNHM593-24 is the process ID):
1. Flat structure:
   /parent_dir/
    BSNHM593-24_R1.fq
    BSNHM593-24_R2.fq
    ...

2. Nested structure:
   /parent_dir/XE-4013/
    20240906_LH00179_0123_A22CKGHLT4/
        Sample_XE-4013-BSNHM593-24/
           BSNHM593-24_R1.fastq.gz
           BSNHM593-24_R2.fastq.gz
        ...
```

## Miro workflow ##
![image](![image](https://github.com/user-attachments/assets/dd960f04-2957-4ef1-bd58-d5457ecba777)

Owner

Name: BGE barcoding
Login: bge-barcoding
Kind: organization

Website: https://biodiversitygenomics.eu/
Twitter: BioGenEurope
Repositories: 1
Profile: https://github.com/bge-barcoding

Biodiversity Genomics Europe (BGE) - (meta)barcoding software artifacts

GitHub Events

Total

Push event: 5
Fork event: 1

Last Year

Push event: 5
Fork event: 1

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

https://github.com/bge-barcoding/sample-processing

Science Score: 10.0%

Repository

Basic Info

Statistics

https://github.com/bge-barcoding/sample-processing/blob/main/

Owner

GitHub Events

Total

Last Year