https://github.com/a-slide/refmasker
Hard mask homologies between fasta reference sequences identified by Blastn
Repository
Basic Info
- Host: GitHub
- Owner: a-slide
- License: gpl-2.0
- Language: Python
- Default Branch: master
- Homepage: http://a-slide.github.io/RefMasker
- Size: 511 KB
Statistics
- Stars: 0
- Watchers: 3
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
RefMasker
Hard mask homologies between fasta reference sequences identified by Blastn
Creation : 2015/06/08
Last update : 2015/06/24
Motivation
RefMasker is an object-oriented Python 2.7 script developed to assign short sequencing reads more accurately when they come from a mix of reference sequences with highly unbalanced abundances. When a rare reference shares homologous regions with a much more abundant reference, reads from the abundant reference can be misattributed to the rare one, leading to a large overestimation of the rare sequence.
Principle
- Users can generate a template configuration file and fill it in according to their requirements. The order of references in the configuration file is CRITICAL, since it determines the order in which sequences are masked.
- The configuration file containing all program parameters (including reference fasta location) is parsed and verified for validity.
- Following the order indicated in the configuration file, reference fasta files are uncompressed (if needed), parsed, and indexed using a memory mapping.
- An iterative masking is performed, starting from the last reference (subject) against all the references listed before (queries). For each new iteration the penultimate reference from the previous iteration becomes the subject and is removed from the queries (see figure below).
- When the list of queries is empty the iteration stops.
- Depending on the user's requirements, blast and masking reports are generated.
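The iteration order described above can be sketched in a few lines of Python. This is only an illustration of the scheduling logic, not RefMasker's actual code: the function name `masking_schedule` and the reference names are hypothetical.

```python
def masking_schedule(references):
    """Yield (subject, queries) pairs in the order described above.

    references: list of reference names in configuration-file order
    (hypothetical names, for illustration only).
    """
    remaining = list(references)
    # Stop as soon as the next subject would have no query left
    while len(remaining) > 1:
        subject = remaining.pop()        # last remaining reference is the subject
        yield subject, list(remaining)   # every reference listed before it is a query


for subject, queries in masking_schedule(["ref1", "ref2", "ref3", "ref4"]):
    print("mask {} using hits from {}".format(subject, queries))
```

With four references, three iterations are run and the first reference in the configuration file is never masked, which is why the order of references matters.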

Details of iterations
- Imperfect matches between the subject and the queries are found using a wrapper of NCBI Blast+ (pyBlast submodule)
- If matches are found, the program writes a masked version of the subject reference in which each position of the subject overlapping a hit is replaced by an 'N' base (hard masking).
- The subject reference is removed from the reference list.
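As a minimal sketch of the hard masking step, the function below replaces every position covered by at least one hit with 'N'. The name `hard_mask` and the hit format are assumptions made for this example: hits are given as 0-based, end-exclusive (start, end) pairs with start <= end, whereas Blast reports 1-based inclusive coordinates, so a conversion would be needed in practice.

```python
def hard_mask(sequence, hits):
    """Return a copy of sequence with all positions covered by a hit set to 'N'.

    hits: iterable of (start, end) pairs, 0-based and end-exclusive
    (an assumed convention for this sketch, not Blast's own coordinates).
    """
    masked = list(sequence)
    for start, end in hits:
        # Clamp the interval to the sequence boundaries
        start = max(start, 0)
        end = min(end, len(masked))
        for position in range(start, end):
            masked[position] = "N"
    return "".join(masked)


# Two overlapping hits are merged implicitly: positions 2 to 6 are masked
print(hard_mask("ACGTACGTAC", [(2, 5), (4, 7)]))
```

Overlapping hits need no special handling here, since masking an already masked position has no further effect.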
Dependencies
The program was developed under Linux Mint 17 and has not been tested on other operating systems.
In addition to Python 2.7, the following dependencies are required for proper program execution:
- ncbi blast+ 2.2.28+
Install blast with your favorite package manager (e.g. sudo apt-get install ncbi-blast+)
- python package pyfasta 0.5.2 +
Install pip with your favorite package manager and enter the following line to install pyfasta: sudo pip install pyfasta
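Before running the program, it can be handy to check that both dependencies are reachable. The snippet below is a small sketch of such a check and is not part of RefMasker: the helper names `find_on_path` and `missing_dependencies` are hypothetical, and it assumes the Blast+ executable is named blastn.

```python
import os


def find_on_path(program, path=None):
    """Return the full path of an executable found on PATH, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_dependencies():
    """Return a list of human-readable messages, one per missing dependency."""
    missing = []
    if find_on_path("blastn") is None:
        missing.append("ncbi blast+ (blastn not found on PATH)")
    try:
        import pyfasta  # noqa: F401 (only checking that the import works)
    except ImportError:
        missing.append("pyfasta (try: sudo pip install pyfasta)")
    return missing


for message in missing_dependencies():
    print("Missing dependency: " + message)
```

The check only confirms that the tools are present; it does not verify the minimum versions listed above.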
Get and install
Clone the repository in recursive mode to download the main repo and its submodules
git clone --recursive https://github.com/a-slide/RefMasker.git
Enter the src folder of the program folder and make the main script executable
sudo chmod u+x RefMasker.py
Finally, add RefMasker.py to your PATH
Usage
Run the program from the folder where the output files will be created
```
Usage: RefMasker.py -c Conf.txt [-i -h]

Options:
  --version     show program's version number and exit
  -h, --help    show this help message and exit
  -c CONF_FILE  Path to the configuration file [Mandatory]
  -i            Generate an example configuration file and exit [Facultative]
```
An example configuration file can be generated by running the program with the option -i
The possible options are extensively described in the configuration file.
The program can be tested from the test folder with the dataset provided and the default configuration file.
cd ./test/result
RefMasker.py -i
RefMasker.py -c Quade_conf_file.txt
Testing
The module can be tested easily with pytest, which also runs the tests of the pyBlast submodule.
- Install pytest with pip
pip install pytest
- Run the tests with py.test-2.7 -v
Example of output if successful. Please note that some tests might fail due to the random sampling of DNA sequences and the uncertainties of the Blastn algorithm.
```
========================================================================= test session starts =========================================================================
platform linux2 -- Python 2.7.5 -- py-1.4.27 -- pytest-2.7.0 -- /usr/bin/python
rootdir: /home/adrien/Programming/Python/Refeed/src, inifile:
collected 39 items

testRefMasker.py::testSequencecreate PASSED
testRefMasker.py::testSequenceaddhit[100-seq0-90-110-0-0--0-0-0-0-] xfail
testRefMasker.py::testSequenceaddhit[100-seq1-80-100-0-0--0-0-0-0-] xfail
testRefMasker.py::testSequenceaddhit[100-seq0-80-90-20-30-ATCG-79-90-19-30-ATCG] PASSED
testRefMasker.py::testSequenceaddhit[100-seq0-90-80-20-30-ATCG-79-90-19-30-CGAT] PASSED
testRefMasker.py::testSequenceaddhit[100-seq0-80-90-30-20-ATCG-79-90-19-30-CGAT] PASSED
testRefMasker.py::testSequenceaddhit[100-seq0-90-80-30-20-ATCG-79-90-19-30-ATCG] PASSED
testRefMasker.py::testSequenceoutputsequence1[100-1] PASSED
testRefMasker.py::testSequenceoutputsequence1[100-5] PASSED
testRefMasker.py::testSequenceoutputsequence1[200-10] PASSED
testRefMasker.py::testSequenceoutputsequence2 PASSED
testRefMasker.py::testReferencecreate[1-1000-1-False] PASSED
testRefMasker.py::testReferencecreate[1-1000-1-True] PASSED
testRefMasker.py::testReferencecreate[2-10000-2-False] PASSED
testRefMasker.py::testReferencecreate[2-10000-2-True] PASSED
testRefMasker.py::testReferenceaddhitlist[1-1000-1] PASSED
testRefMasker.py::testReferenceaddhitlist[2-10000-2] PASSED
testRefMasker.py::testReferenceoutputmaskedreference PASSED
pyBlast/testpyBlast.py::testBlastHit[36.9133828132-88-75-85-47-98-88-14-8.78046725086-92.5815421121] PASSED
pyBlast/testpyBlast.py::testBlastHit[-1-19-100-17-17-54-53-33-79.1465130808-41.6977101708] xfail
pyBlast/testpyBlast.py::testBlastHit[65.8976266941--1-46-9-74-59-97-56-59.2270229149-93.0689987714] xfail
pyBlast/testpyBlast.py::testBlastHit[75.9701897823-71--1-26-16-91-16-82-5.78377016797-79.1291574854] xfail
pyBlast/testpyBlast.py::testBlastHit[80.9394959784-54-85--1-5-78-33-35-8.3011500976-53.4993883036] xfail
pyBlast/testpyBlast.py::testBlastHit[35.5821954158-26-23-29--1-69-35-57-47.706286329-4.1842760318] xfail
pyBlast/testpyBlast.py::testBlastHit[52.9290346724-31-3-44-74--1-30-76-36.6917151434-43.8870409292] xfail
pyBlast/testpyBlast.py::testBlastHit[16.7597390274-26-0-37-100-15--1-91-89.8637578655-63.9053323995] xfail
pyBlast/testpyBlast.py::testBlastHit[94.5094431806-49-70-48-9-39-80--1-72.722423521-98.7208732416] xfail
pyBlast/testpyBlast.py::testBlastHit[44.4349347822-84-83-96-49-59-16-9--1-91.9302274501] xfail
pyBlast/testpyBlast.py::testBlastHit[77.9794166482-19-89-79-33-46-9-26-21.2569521087--1] xfail
pyBlast/testpyBlast.py::testBlastn[blastn-Queries from Subject] PASSED
pyBlast/testpyBlast.py::testBlastn[blastn-Random queries] xfail
pyBlast/testpyBlast.py::testBlastn[blastn-short-Queries from Subject] PASSED
pyBlast/testpyBlast.py::testBlastn[blastn-short-Random queries] xfail
pyBlast/testpyBlast.py::testBlastn[dc-megablast-Queries from Subject] PASSED
pyBlast/testpyBlast.py::testBlastn[dc-megablast-Random queries] xfail
pyBlast/testpyBlast.py::testBlastn[megablast-Queries from Subject] PASSED
pyBlast/testpyBlast.py::testBlastn[megablast-Random queries] xfail
pyBlast/testpyBlast.py::testBlastn[rmblastn-Queries from Subject] PASSED
pyBlast/testpyBlast.py::testBlastn[rmblastn-Random queries] xfail

================================================================ 22 passed, 17 xfailed in 7.02 seconds ================================================================
```
Authors and Contact
- Adrien Leger aleg@ebi.ac.uk @a-slide
- Emilie Lecomte emilie.lecomte@univ-nantes.fr @emlec
Owner
- Name: Adrien Leger
- Login: a-slide
- Kind: user
- Location: Oxford, UK
- Company: @nanoporetech
- Website: https://adrienleger.com/
- Twitter: AdrienLeger2
- Repositories: 50
- Profile: https://github.com/a-slide
Research scientist at Oxford Nanopore Technologies