https://github.com/armbrustlab/npac_euk_gene_catalog
North Pacific eukaryotic metatranscriptome assemblies and annotations
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (5.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: armbrustlab
- License: mit
- Language: Shell
- Default Branch: main
- Homepage: https://www.nature.com/articles/s41597-024-04005-5
- Size: 292 KB
Statistics
- Stars: 2
- Watchers: 5
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
North Pacific Eukaryotic Gene Catalog (NPEGC)
Overview
The North Pacific Eukaryotic Gene Catalog (NPEGC) is a compilation of metatranscriptome sequence data and annotations derived from 261 samples collected from four oceanographic research cruises in the North Pacific Ocean.
Key Features
- 261 metatranscriptomes from five studies conducted on four cruises
- 182 million transcript contigs (clustered at 99% protein identity)
- Taxonomic and functional annotations
- Read abundance data

Sample sites for metatranscriptomes in the North Pacific Eukaryotic Gene Catalog
Data Sources
- Diel1: 48 samples from SCOPE HOE-Legacy 2 (July 2015) Diel 1 project page
- Gradients1: 47 samples from KOK1606 (April-May 2016) Gradients 1 project page
- Gradients2: 59 samples from MGL1704 (May-June 2017) Gradients 2 project page
- Gradients3: 63 samples from KM1906 (April 2019) Gradients 3 project page
- G3 diel: 44 samples from KM1906 (April 2019) G3 diel study project page
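The per-study counts above can be checked against the headline figures with a short tally. This is a minimal sketch: the dictionary layout and variable names are illustrative and not part of the repository, and the cruise ID for Diel1 (KM1513) is taken from the CMAP link listed under Associated Data.

```python
# Sample counts and cruise IDs as listed in this README.
# The data structure and names are illustrative assumptions, not repository code.
studies = {
    "Diel1": {"cruise": "KM1513", "samples": 48},
    "Gradients1": {"cruise": "KOK1606", "samples": 47},
    "Gradients2": {"cruise": "MGL1704", "samples": 59},
    "Gradients3": {"cruise": "KM1906", "samples": 63},
    "G3 diel": {"cruise": "KM1906", "samples": 44},
}

# Total number of metatranscriptomes across all studies.
total_samples = sum(s["samples"] for s in studies.values())

# Gradients3 and G3 diel share cruise KM1906, so cruises are de-duplicated.
distinct_cruises = sorted({s["cruise"] for s in studies.values()})

print(f"{len(studies)} studies, {len(distinct_cruises)} cruises, {total_samples} samples")
```

The tally gives five studies on four distinct cruises, because Gradients3 and the G3 diel study were both sampled on KM1906.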
Data Products
1. Raw Metatranscriptome Assemblies
2. Processed Protein Contigs and Annotations
3. Processed Nucleotide Metatranscripts and Read Counts
Script Index
Universal Scripts
These scripts are used across all studies in the North Pacific Eukaryotic Gene Catalog:
IlluminaQCAWS.sh: Performs quality control and trimming of raw Illumina sequencing data using Trimmomatic.
NPEGC.6trframeselection_clustering.sh: Translates nucleotide sequences, selects the longest coding frame(s), and clusters protein sequences at 99% identity.
NPEGC.diamond_taxonomy.log.sh: Assigns taxonomic identifiers to protein sequences using DIAMOND alignment against the MarFERReT + MARMICRODB database.
NPEGC.hmmer_function.sh: Annotates protein sequences with protein families using HMMER against the Pfam database.
NPEGC.ntkallistocounts.sh: Quantifies transcript abundances by pseudoaligning short reads to assembled transcripts using kallisto.
aggregatekallistocounts.R: Consolidates kallisto output files, joining sequence length and estimated count values for each project's metatranscriptome.
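To illustrate the translation and frame-selection step described for NPEGC.6trframeselection_clustering.sh, the following is a minimal sketch of six-frame translation in plain Python. It is not the repository's implementation. It assumes the standard genetic code, and it assumes that "longest coding frame" means the frame containing the longest stretch uninterrupted by a stop codon; the script's actual criteria may differ. All function names are hypothetical, and clustering at 99% identity is not covered.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Translate three forward (+1..+3) and three reverse (-1..-3) frames."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


def longest_open_stretch(protein):
    """Longest run of residues between stop codons ('*')."""
    return max(protein.split("*"), key=len)


def best_frame(seq):
    """Pick the frame whose longest stop-free stretch is longest (assumed criterion)."""
    frames = six_frame_translations(seq)
    name = max(frames, key=lambda f: len(longest_open_stretch(frames[f])))
    return name, longest_open_stretch(frames[name])


frame, peptide = best_frame("ATGGCCAAATAG")
print(frame, peptide)
```

Because this sketch does not require a start codon, a reverse-strand frame can win over a short forward open reading frame. A real pipeline would typically apply a minimum length threshold and may keep more than one frame per contig.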
Study-Specific Scripts
Each study (G1PA, G2PA, G3PA, G3PA_diel, D1PA) has two specific scripts:
{STUDY_ID}.process_short_reads.sh: Performs quality control and preprocessing of raw sequencing data for the specific study.
{STUDY_ID}.trinity_assemblies.sh: Uses Trinity to perform de novo assembly of metatranscriptomes for the specific study.
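The `{STUDY_ID}` placeholder expands to one script pair per study. The sketch below enumerates the resulting file names from the pattern and study IDs given above. The helper name `study_scripts` is hypothetical, and the directories in which these files live are not stated here, so only base names are produced.

```python
# Study IDs and file name templates as stated in this README.
STUDY_IDS = ["G1PA", "G2PA", "G3PA", "G3PA_diel", "D1PA"]
SCRIPT_TEMPLATES = [
    "{STUDY_ID}.process_short_reads.sh",
    "{STUDY_ID}.trinity_assemblies.sh",
]


def study_scripts(study_id):
    """Return the two expected script base names for one study (hypothetical helper)."""
    return [template.format(STUDY_ID=study_id) for template in SCRIPT_TEMPLATES]


# Print the expected script names for every study.
for study_id in STUDY_IDS:
    for script in study_scripts(study_id):
        print(script)
```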
Links to study-specific scripts:
Gradients 1 (G1PA):
Gradients 2 (G2PA):
Gradients 3 (G3PA):
G3 Diel (G3PA_diel):
Diel1 (D1PA):
Associated Data
Additional metadata and associated datasets are available on the Simons CMAP ocean data portal.
- SCOPE Diel1 associated data: https://simonscmap.com/catalog/cruises/KM1513
- Gradients 1 associated data: https://simonscmap.com/catalog/cruises/KOK1606
- Gradients 2 associated data: https://simonscmap.com/catalog/cruises/MGL1704
- Gradients 3 associated data: https://simonscmap.com/catalog/cruises/KM1906
Additional metadata for the Gradients cruises can be found here: http://scope.soest.hawaii.edu/data/gradients/gradients.html
Citation
If you use this data in your research, please cite:
Groussman, R. D., Coesel, S. N., Durham, B. P., Schatz, M. J. & Armbrust, E. V. The North Pacific Eukaryotic Gene Catalog of metatranscriptome assemblies and annotations. Sci. Data 11, 1161 (2024). doi:10.1038/s41597-024-04005-5
Owner
- Name: Armbrust Lab
- Login: armbrustlab
- Kind: organization
- Location: Seattle, WA
- Website: http://armbrustlab.ocean.washington.edu
- Repositories: 23
- Profile: https://github.com/armbrustlab
Biological Oceanography Lab at the University of Washington
GitHub Events
Total
- Watch event: 2
- Push event: 7
Last Year
- Watch event: 2
- Push event: 7
Dependencies
- python 3 build
- biopython ==1.79
- numpy ==1.23.4
- pandas ==1.5.1
- python-dateutil ==2.8.2
- pytz ==2022.6
- six ==1.16.0