unassigner
Type strain identification for 16S reads
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (10.0%) to scientific vocabulary
Repository
Type strain identification for 16S reads
Basic Info
- Host: GitHub
- Owner: PennChopMicrobiomeProgram
- Language: Python
- Default Branch: master
- Size: 342 KB
Statistics
- Stars: 5
- Watchers: 2
- Forks: 5
- Open Issues: 8
- Releases: 11
Metadata Files
README.md
Unassigner
Evaluate consistency with named bacterial species for short 16S rRNA marker gene sequences.
Summary
The 16S rRNA gene is found in all bacteria, and its gene sequence is highly conserved. Amplification and sequencing of bacterial 16S rRNA genes is a common method used to survey bacterial communities in microbiome research. However, high throughput instruments are unable to sequence the entire gene. Therefore, a short region of the gene is selected for amplification and sequencing.
The resultant sequences, spanning part of the 16S gene, can be used to identify the types of bacteria present in a specimen. For example, one sequence might be assigned to the Streptococcus genus based on sequence similarity. Many programs are available to carry out such taxonomic assignment.
It is generally thought that the 16S rRNA gene is not suitable for assignment of bacterial species. We agree, but with a catch: the gene sequence is suitable for ruling out assignment to many bacterial species. This software is designed to rule out all the species designations that are inconsistent with a partial 16S rRNA gene sequence. For those species that are not definitively ruled out, we assign a probability that the sequence is inconsistent with the species.
Because the software is geared towards ruling out species rather than deciding on the best assignment, we call it the unassigner. It's a cheesy joke, but we've decided to roll with it.
The unassigner library provides a command-line program, unassign,
that takes a FASTA file of DNA sequences in a 16S gene region, and
gives the probability that the sequence is inconsistent with nearby
bacterial species.
Installation
Install with conda using:
bash
conda create --name unassigner -c conda-forge -c bioconda unassigner
Or run with Docker using:
bash
docker run --rm -it ctbushman/unassigner:latest unassign --help
Alternative Installation
Unassigner can be installed using pip:
bash
pip install unassigner
But will require vsearch to be installed separately. It can also be installed from GitHub:
bash
git clone https://github.com/PennChopMicrobiomeProgram/unassigner.git
pip install unassigner/
Usage
The unassign program requires one argument, a FASTA-formatted file
of short 16S sequences:
bash
unassign my_sequences.fasta
If the program has not been run before, it will automatically download
the bacterial species data it needs, format its reference files,
create an output directory named my_sequences_unassigned, and write
a table of results there, along with some auxiliary output files. Note
that the output directory will be in the same directory as my_sequences.fasta.
Please see the output of unassign --help for a list of the available
options.
Trim ragged
The trimragged program takes in a query sequence to search and trim and an input fasta file (or it can read from stdin):
bash
trimragged AGAGTTTGATCCTGGCTCAG --input_file my_sequences.fasta
Trimragged is included to extract different regions from the full length 16S rRNA gene. The purpose of this auxiliary software is to account for the full length 16S rRNA sequences where only a part of the primer is present in the sequence. This can be due to low quality at the beginning or at the end of a sequence due to limitations of sequencing platforms.
The software operates in three steps: 1) Matching the full length of the primer, 2) Matching the partial primer, 3) Aligning reads to other sequences with a known primer location. The sequence of the primer to search and trim is required for the software. Only one primer is accepted at a time, so the user needs to run the software twice with each primer sequence.
Step 1: The software first searches for the full length of the primer sequence. If mismatches are allowed, then the software expands all possibilities of the primer sequence mutations in a list and searches for each. Once a hit is found, the start and end index is stored as a PrimerMatch object.
Step 2: If the minpartial argument is greater than 0, the software then searches for partial matches of the primer in the remaining sequences. The software makes a list of all the possibilities of primers, removing nucleotides from the beginning of the sequence till the minimum length specified by minpartial is reached. Then the software searches for each of the possible primer sequences. Once a hit is found, the start and end index is stored as a Primer Match object.
Step 3: The last part of the software relies on building a database of the sequences with already identified primer sequences from the previous two steps. Then the rest of the reads are aligned against the database of sequences with known primer locations using vsearch. Once a hit is found, and the positions of the primers are estimated by extending the aligned region.
Please see the output of trimragged --help for a list of the available
options.
Contributing
We welcome ideas from our users about how to improve this software. Please open an issue if you have a question or would like to suggest a feature.
Owner
- Name: PennChopMicrobiomeProgram
- Login: PennChopMicrobiomeProgram
- Kind: organization
- Location: United States of America
- Repositories: 15
- Profile: https://github.com/PennChopMicrobiomeProgram
GitHub Events
Total
- Create event: 10
- Release event: 1
- Issues event: 8
- Watch event: 1
- Delete event: 6
- Issue comment event: 3
- Push event: 20
- Pull request review event: 3
- Pull request review comment event: 5
- Pull request event: 10
- Fork event: 1
Last Year
- Create event: 10
- Release event: 1
- Issues event: 8
- Watch event: 1
- Delete event: 6
- Issue comment event: 3
- Push event: 20
- Pull request review event: 3
- Pull request review comment event: 5
- Pull request event: 10
- Fork event: 1
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 11
- Total pull requests: 24
- Average time to close issues: almost 2 years
- Average time to close pull requests: 14 days
- Total issue authors: 3
- Total pull request authors: 2
- Average comments per issue: 0.27
- Average comments per pull request: 0.21
- Merged pull requests: 20
- Bot issues: 0
- Bot pull requests: 8
Past Year
- Issues: 6
- Pull requests: 21
- Average time to close issues: 23 days
- Average time to close pull requests: 9 days
- Issue authors: 1
- Pull request authors: 2
- Average comments per issue: 0.0
- Average comments per pull request: 0.19
- Merged pull requests: 17
- Bot issues: 0
- Bot pull requests: 8
Top Authors
Issue Authors
- Ulthran (9)
- cchehoud (1)
- eclarke (1)
Pull Request Authors
- Ulthran (16)
- dependabot[bot] (8)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- biopython *
- scipy *
- biopython *
- scipy *
- actions/checkout v2 composite
- actions/setup-python v2 composite
- codecov/codecov-action v3 composite
- s-weigand/setup-conda v1.1.0 composite
- actions/checkout v3 composite
- github/super-linter v4 composite
- actions/checkout v3 composite
- actions/setup-python v3 composite
- pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite
- actions/checkout v3 composite
- actions/setup-python v3 composite
- s-weigand/setup-conda v1.1.0 composite