https://github.com/bricoletc/backtranslate
Repository
Basic Info
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Created over 6 years ago
· Last pushed over 5 years ago
Metadata Files
Readme
README.rst
Protein back translation with pairwise distance constraints
============================================================
Usage
``````
Run using `python3 -m src.backTrans`
::

    usage: backTrans.py [-h] (-i INPUT_FILE | -s SEQUENCE) -m {dna,protein}
                        [-d MIN_DIST] [-n NUM_SAMPLES] [-o OUTPUT]
                        [--output_prefix OUT_PREFIX] [--stats_header]
                        [--forbidden FORBIDDEN]

    Backtranslate amino acid sequences with pairwise distance constraints

    optional arguments:
      -h, --help            show this help message and exit
      -i INPUT_FILE, --input_file INPUT_FILE
                            Path to fasta file containing amino acid or DNA
                            sequences
      -s SEQUENCE, --sequence SEQUENCE
                            amino acid sequence passed on command-line
      -m {dna,protein}, --mode {dna,protein}
                            mode to run the tool on: in protein mode, samples
                            backtranslations, in dna mode, uses the dna sequences
                            as samples directly.
      -d MIN_DIST, --min_distance MIN_DIST
                            Minimum distance (Hamming, as fraction of distinct
                            nucleotides) between all returned sequences
      -n NUM_SAMPLES, --num_samples NUM_SAMPLES
                            Maximum number of backtranslated DNA samples to
                            produce
      -o OUTPUT, --output-dir OUTPUT
                            An existing directory where the output will go. This
                            is a fasta of the sampled sequence(s) and a stats file
      --output_prefix OUT_PREFIX
                            prefix for output files
      --stats_header        Prints the header for the stats file
      --forbidden FORBIDDEN
                            File path to DNA sequences that cannot appear in the
                            sample; one sequence per line.

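To make the option grammar in the help text concrete, here is a minimal ``argparse`` sketch that accepts the same flags. It is reconstructed from the help output only and is not the repository's code: the function name ``build_parser``, the ``float``/``int`` types, and the absence of defaults are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="backTrans.py",
        description="Backtranslate amino acid sequences with pairwise "
        "distance constraints",
    )
    # Exactly one of -i / -s must be given, as "(-i INPUT_FILE | -s SEQUENCE)" implies.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-i", "--input_file")
    source.add_argument("-s", "--sequence")
    # -m is shown without brackets in the usage line, so it is required.
    parser.add_argument("-m", "--mode", choices=["dna", "protein"], required=True)
    # Assumed types: the help describes a fraction and a count.
    parser.add_argument("-d", "--min_distance", dest="min_dist", type=float)
    parser.add_argument("-n", "--num_samples", type=int)
    parser.add_argument("-o", "--output-dir", dest="output")
    parser.add_argument("--output_prefix", dest="out_prefix")
    parser.add_argument("--stats_header", action="store_true")
    parser.add_argument("--forbidden")
    return parser


if __name__ == "__main__":
    # Example argument list: one protein sequence, 20% minimum distance, 5 samples.
    example = ["-s", "MKV", "-m", "protein", "-d", "0.2", "-n", "5"]
    print(build_parser().parse_args(example))
```

Under these assumptions, an argument list that supplies both ``-i`` and ``-s``, or omits ``-m``, is rejected by the parser.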
Protein(s)
-----------
Produces DNA sequences that translate to each input protein, such that no two returned DNA sequences are closer than `--min_distance` in Hamming distance (measured as the fraction of differing nucleotides).
* Provide a single amino acid sequence on command-line: `-s`
* Or a fasta file with one or more protein sequences: `-i` and `-m protein`
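The idea behind protein mode can be sketched as follows: sample random synonymous codons for each amino acid, then keep a candidate only if it is far enough from every sequence already kept. This is a minimal illustration using the standard genetic code and simple rejection sampling; the function names, the ``max_tries`` cap, and the sampling strategy are assumptions and may differ from what the tool does.

```python
import itertools
import random

# Standard genetic code, codons enumerated in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)
}
AA_TO_CODONS = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)


def back_translate(protein, rng):
    """Pick one synonymous codon uniformly at random for each amino acid."""
    return "".join(rng.choice(AA_TO_CODONS[aa]) for aa in protein)


def translate(dna):
    """Translate an in-frame DNA string back to its protein sequence."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


def hamming_fraction(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def sample_distant(protein, min_dist, num_samples, rng, max_tries=1000):
    """Collect up to num_samples back-translations that are pairwise >= min_dist apart."""
    kept = []
    for _ in range(max_tries):
        if len(kept) == num_samples:
            break
        candidate = back_translate(protein, rng)
        if all(hamming_fraction(candidate, k) >= min_dist for k in kept):
            kept.append(candidate)
    return kept


if __name__ == "__main__":
    for dna in sample_distant("MKLSR", 0.1, 3, random.Random(0)):
        print(dna, translate(dna))
```

Every sequence returned translates back to the input protein, and fewer than ``num_samples`` sequences may come back when the distance constraint is hard to satisfy.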
DNA(s)
-------
Finds a set of DNA sequences where no two have less than ``--min_distance`` Hamming distance.
Provide a fasta file: `-i` and `-m dna`
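One simple way to obtain such a set is a greedy filter: walk through the input and keep a sequence only if it is at least the minimum distance from everything kept so far. The sketch below is an illustration under assumptions (equal-length sequences, the hypothetical name ``select_distant``); it is not necessarily the graph approach the tool uses, and a greedy pass does not guarantee the largest possible set.

```python
def hamming_fraction(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def select_distant(sequences, min_dist):
    """Greedily keep sequences that are >= min_dist from all kept so far."""
    kept = []
    for seq in sequences:
        if all(hamming_fraction(seq, other) >= min_dist for other in kept):
            kept.append(seq)
    return kept


if __name__ == "__main__":
    dna = ["AAAA", "AAAT", "TTTT", "TTAA"]
    # "AAAT" is dropped: it differs from "AAAA" at only 1 of 4 positions.
    print(select_distant(dna, 0.5))
```

The result depends on input order, which is one reason a graph-based formulation can return a larger set.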
TODOs
``````
* Try CD-Hit on DNA sequences as opposed to the graph approach. Note:

  * CD-Hit produces one representative per cluster, and the pairwise threshold is only guaranteed to hold among those representatives
  * cd-hit-est seems to be capped at a minimum of 75% identity

* Faster pairwise distance computation:

  * CD-Hit-like approach: avoid alignment for sequence pairs where a given number of shared short-word frequency counts is exceeded.
  * Sketching-based approach, e.g. Mash. Con: inexact.
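The short-word idea in the last TODO can be illustrated by counting the k-mers two sequences share, a cheap quantity that a prefilter could compare against a cutoff before running any exact comparison. This is a sketch only: the function names, the choice of ``k``, and any cutoff applied to the count are assumptions, not part of the tool.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer (short word) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmer_count(a, b, k):
    """Number of k-mers two sequences have in common, counted with multiplicity."""
    common = kmer_counts(a, k) & kmer_counts(b, k)  # multiset intersection
    return sum(common.values())


if __name__ == "__main__":
    print(shared_kmer_count("ACGTACGT", "ACGTTTTT", 4))
```

Counting words is linear in sequence length, so it can screen many pairs quickly, but the count is only a proxy for the Hamming distance and pairs near the cutoff would still need the exact computation.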
Owner
- Name: Brice Letcher
- Login: bricoletc
- Kind: user
- Company: EMBL-EBI
- Twitter: bricoletc
- Repositories: 2
- Profile: https://github.com/bricoletc
Bioinformatician and early-career researcher - EMBL-EBI and CNRS ~~~~~~ Parsing my way through DNA sequence data