https://github.com/aliyoussef96/sequence-database-curator
This program dereplicates and/or filters nucleotide and/or protein databases from a list of names or sequences (by exact match).
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: found 2 DOI reference(s) in README
- ✓ Academic publication links: links to springer.com
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.8%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: AliYoussef96
- License: gpl-3.0
- Default Branch: master
- Homepage: https://sites.google.com/pharma.cu.edu.eg/eslam-ibrahim/github-and-softwares/sddc-program
- Size: 2.09 MB
Statistics
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Fork of Eslam-Samir-Ragab/Sequence-database-curator
Created over 5 years ago · Last pushed about 7 years ago
https://github.com/AliYoussef96/Sequence-database-curator/blob/master/
# Sequence Dereplicator and Database Curator (SDDC) program

This program dereplicates and/or filters nucleotide and/or protein databases from a list of names or sequences (by exact match).

## This software is under GNU General Public License v3.0

## Please, cite:

DOI: [10.1007/s00284-017-1327-6](https://link.springer.com/article/10.1007/s00284-017-1327-6)

## How to use:

1. Install [python 2.7](https://www.python.org/downloads/) or [python 3](https://www.python.org/downloads/) on your machine.
2. Install [Numpy](https://pypi.python.org/pypi/numpy) and [Biopython](http://biopython.org/wiki/Download).
3. Install the future module using the [pip command](https://docs.python.org/3/installing/).
4. Click Clone or download > Download ZIP, then extract the downloaded file.
5. Open the file **sddc.py** with (python.exe).
    * [Windows](http://stackoverflow.com/a/1527012/7414020)
    * U/Linux : use the command `chmod u+x sddc.py` to make the script executable
    * Mac : use the command `python sddc.py`
6. State your variables and press Enter.

### **The full SDDC commands, Cheat sheet and notes are [here](https://github.com/Eslam-Samir-Ragab/Sequence-database-curator/blob/master/additionals/SDDC%20Cheat%20sheet.pdf)**

## *Updates in SDDC v3.0:*

1. Bug fixes.
2. Usage of -org_order with -kw is updated.
3. Exchange FASTA headers mode is now available.

## *Updates in SDDC v2.0:*

1. You can filter the sequences using only keywords (separated by a comma), inclusively or exclusively, by adding the (-kw) argument to your normal command line.
2. You can get your sequences in their original order after dereplication and/or sequence filtration by adding (-org_order) to your normal command line.

## *Notes:*

* The rate of SDDC as determined using Intel(R) Pentium(R) CPU G630 @ 2.70GHz 2.70 GHz Processor, 4.00 GB RAM, 32-bit Operating System
* The list of options and commands in the program can be downloaded from [here](https://github.com/Eslam-Samir-Ragab/Sequence-database-curator/blob/master/additionals/SDDC%20Commands.pdf).
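
The keyword filtering described in the v2.0 updates can be pictured with a small standalone sketch. This is not SDDC's own code: the function names `parse_fasta` and `filter_by_keywords`, the sample data, and the case-insensitive substring match are all assumptions made for illustration. It only shows the idea of keeping (inclusive) or dropping (exclusive) records whose names contain any of the comma-separated keywords, while leaving the surviving records in their original order.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_keywords(records, keywords_csv, approach="inclusive"):
    """Keep (inclusive) or drop (exclusive) records whose header has any keyword.

    Assumption: matching is a case-insensitive substring test; SDDC itself
    may match differently.
    """
    if approach not in ("inclusive", "exclusive"):
        raise ValueError("approach must be 'inclusive' or 'exclusive'")
    keywords = [k.strip().lower() for k in keywords_csv.split(",") if k.strip()]
    kept = []
    for header, seq in records:
        hit = any(k in header.lower() for k in keywords)
        # Inclusive keeps hits; exclusive keeps non-hits.
        if hit == (approach == "inclusive"):
            kept.append((header, seq))
    return kept


# Hypothetical sample data, not taken from the SDDC repository.
EXAMPLE_FASTA = """>sp1 kinase alpha
MKTAYIAK
>sp2 hypothetical protein
MSSLLK
>sp3 kinase beta
MKV
"""

records = list(parse_fasta(EXAMPLE_FASTA))
inclusive = filter_by_keywords(records, "kinase", "inclusive")
exclusive = filter_by_keywords(records, "kinase", "exclusive")
print([h for h, _ in inclusive])
print([h for h, _ in exclusive])
```

Appending to a list while iterating in input order is what keeps the original order here; whether SDDC needs (-org_order) for the same effect depends on its internals, which this sketch does not model.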
## Examples

If you want to dereplicate protein sequences, use the following command:

`python sddc.py -in (input_file) -p -out (output_file) -mode derep`

If you want to dereplicate protein sequences and preserve the original order of the sequences in the new file, use the following command:

`python sddc.py -in (input_file) -p -out (output_file) -mode derep -org_order`

If you want to dereplicate protein sequences with a minimum length = 30 and the sequences are in multiple files, use the following command:

`python sddc.py -in (input_file) -p -out (output_file) -mode derep -min_length 30 -multi`

If you want to dereplicate nucleotide sequences with the optimum approach and a normal protein length = 300, use the following command:

`python sddc.py -in (input_file) -n -out (output_file) -mode derep -optimum -prot_length 300`

If you want to filter protein sequences inclusively by name (i.e. you want to retrieve only the sequences whose names you have specified), use the following command:

`python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file) -approach inclusive`

If you want to filter protein sequences inclusively by keyword(s) (i.e. you want to retrieve only the sequences whose names contain the keywords you have specified, separated by a comma), use the following command:

`python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file in csv) -approach inclusive -kw`

If you want to filter protein sequences exclusively by name (i.e. you want to retrieve the sequences that aren't present in your filter file), use the following command:

`python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file) -approach exclusive`

If you want to filter protein sequences exclusively by keyword(s) in their names (i.e. you want to retrieve the sequences whose names do not contain the keywords, separated by a comma, listed in your filter file), use the following command:

`python sddc.py -in (input_file) -p -out (output_file) -mode filter -flt_by name -flt_file (filter_file in csv) -approach exclusive -kw`

If you want to filter nucleotide sequences by sequence (only exclusive), use the following command:

`python sddc.py -in (input_file) -n -out (output_file) -mode filter -flt_by seq -flt_file (filter_file)`

If you want to exchange words in the FASTA headers of your protein sequences, use the following command:

`python sddc.py -in (input_file) -p -out (output_file) -mode exchange_headers -ex_file (exchange_file in csv)`

If you want to exchange words in the FASTA headers of your nucleotide sequences, use the following command:

`python sddc.py -in (input_file) -n -out (output_file) -mode exchange_headers -ex_file (exchange_file in csv)`
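
To make the `derep` mode above more concrete, here is a minimal sketch of exact-match dereplication in plain Python. It is an illustration under stated assumptions, not the SDDC implementation: the names `parse_fasta` and `dereplicate` are hypothetical, the sample records are made up, and treating the minimum length as "discard sequences shorter than this" is a guess at what `-min_length` means.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dereplicate(records, min_length=0):
    """Keep the first record of every distinct sequence (exact string match).

    Assumption: records shorter than min_length are discarded before
    comparison; SDDC's -min_length option may behave differently.
    """
    seen = set()
    unique = []
    for header, seq in records:
        if len(seq) < min_length:
            continue
        if seq in seen:
            continue  # exact duplicate of an earlier sequence
        seen.add(seq)
        unique.append((header, seq))  # input order is preserved
    return unique


# Hypothetical sample data: records "a" and "c" carry the same sequence,
# with "a" wrapped over two lines.
EXAMPLE_FASTA = """>a
MKT
AYI
>b
MK
>c
MKTAYI
>d
MSSLLK
"""

records = list(parse_fasta(EXAMPLE_FASTA))
all_unique = dereplicate(records)
long_unique = dereplicate(records, min_length=3)
print([h for h, _ in all_unique])
print([h for h, _ in long_unique])
```

A set of already-seen sequences gives constant-time duplicate checks, so the pass is linear in the number of records; the trade-off is that every distinct sequence is held in memory at once.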
### Any errors please send me an email to
## Visit [my website](https://sites.google.com/pharma.cu.edu.eg/eslam-ibrahim/) for more details, other publications, and contact
Owner
- Name: Ali Youssef
- Login: AliYoussef96
- Kind: user
- Location: Egypt
- Company: Cairo university
- Website: https://www.linkedin.com/in/ali-youssef-455a92130/
- Repositories: 3
- Profile: https://github.com/AliYoussef96
Trying to be a Bioinformatician




