https://github.com/a-slide/refmasker

Hard mask homologies between fasta reference sequences identified by Blastn



RefMasker

Hard mask homologies between fasta reference sequences identified by Blastn


Creation : 2015/06/08

Last update : 2015/06/24


Motivation

RefMasker is an object-oriented Python 2.7 script developed to assign short sequencing reads more accurately when they come from a mix of reference sequences whose abundances are highly unbalanced. When a rare reference shares homologous regions with a much more abundant one, reads can be misattributed to the rare sequence, leading to a large overestimation of its abundance.

Principle

  1. Users can generate a template configuration file and fill it in according to their requirements. The order of the references in the configuration file is CRITICAL, since it determines the order in which the sequences are masked.
  2. The configuration file containing all program parameters (including the location of the reference fasta files) is parsed and checked for validity.
  3. Following the order indicated in the configuration file, reference fasta files are uncompressed (if needed), parsed, and indexed using memory mapping.
  4. An iterative masking is performed, starting with the last reference (subject) against all the references listed before it (queries). For each new iteration, the penultimate reference from the previous iteration becomes the subject and is removed from the queries (see figure below).
  5. When the list of queries is empty, the iteration stops.
  6. Depending on the user requirements, blast and masking reports are generated.

Figure: RefMasker_iteration
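
The iteration order described in steps 4 and 5 can be sketched as follows. This is only an illustration of the ordering, not the actual RefMasker code: the function name `masking_schedule` and the reference labels are hypothetical.

```python
def masking_schedule(references):
    """Yield (subject, queries) pairs in the order described above.

    `references` is the ordered list of references, as given in the
    configuration file. The name of this helper is hypothetical.
    """
    remaining = list(references)
    # Stop as soon as the current subject would have no query left.
    while len(remaining) > 1:
        subject = remaining[-1]   # the last remaining reference gets masked
        queries = remaining[:-1]  # against every reference listed before it
        yield subject, queries
        remaining.pop()           # the subject is removed from the list


for subject, queries in masking_schedule(["refA", "refB", "refC"]):
    print("mask {} against {}".format(subject, ", ".join(queries)))
```

With three references, `refC` is masked against `refA` and `refB`, then `refB` against `refA`. The first reference in the list is never masked, which is why the order in the configuration file matters.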

Details of iterations

  • Imperfect matches between the subject and the queries are found using a wrapper around NCBI Blast+ (pyBlast submodule).
  • If matches are found, the program writes a masked version of the subject reference in which every position overlapped by a hit is replaced by an 'N' base (hard masking).
  • The subject reference is removed from the reference list.
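
A minimal sketch of the hard-masking step, assuming the hits are available in the default 12-column tabular BLAST output (`-outfmt 6`), where columns 9 and 10 hold the subject start and end (1-based, inclusive, with start greater than end for minus-strand hits). The function names are hypothetical, and pyBlast may represent hits differently.

```python
from collections import defaultdict


def subject_intervals(blast_tab_lines):
    """Group hit intervals by subject sequence id.

    Assumes the default `-outfmt 6` columns: qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    intervals = defaultdict(list)
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        sseqid = fields[1]
        sstart, send = int(fields[8]), int(fields[9])
        # Normalise the orientation and convert to a 0-based, half-open interval.
        intervals[sseqid].append((min(sstart, send) - 1, max(sstart, send)))
    return dict(intervals)


def hard_mask(sequence, intervals, mask_char="N"):
    """Return `sequence` with every position covered by an interval masked."""
    masked = list(sequence)
    for start, end in intervals:
        # Clamp to the sequence bounds so out-of-range hits cannot raise.
        for pos in range(max(start, 0), min(end, len(masked))):
            masked[pos] = mask_char
    return "".join(masked)


# Two made-up hits on the same subject sequence, the second on the minus strand.
hits = [
    "query1\tseqX\t95.0\t4\t0\t0\t1\t4\t3\t6\t1e-5\t50.0",
    "query2\tseqX\t90.0\t3\t0\t0\t1\t3\t12\t10\t1e-3\t40.0",
]
by_subject = subject_intervals(hits)
print(hard_mask("ACGTACGTACGTACGT", by_subject["seqX"]))  # ACNNNNGTANNNACGT
```

Overlapping hits need no special handling here, since masking a position twice has the same effect as masking it once.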

Dependencies

The program was developed under Linux Mint 17 and has not been tested on other operating systems.

In addition to Python 2.7, the following dependencies are required:

  • NCBI Blast+: install it with your favorite package manager (e.g. sudo apt-get install ncbi-blast+)

  • pyfasta: install pip with your favorite package manager, then run sudo pip install pyfasta

Get and install

  • Clone the repository in recursive mode to download the main repo and its submodules: git clone --recursive https://github.com/a-slide/RefMasker.git

  • Enter the src folder of the program folder and make the main script executable: sudo chmod u+x RefMasker.py

  • Finally, add RefMasker.py to your PATH

Usage

Run the program from the folder where the output files should be created:

```
Usage: RefMasker.py -c Conf.txt [-i -h]

Options:
  --version     show program's version number and exit
  -h, --help    show this help message and exit
  -c CONF_FILE  Path to the configuration file [Mandatory]
  -i            Generate an example configuration file and exit [Facultative]
```

An example configuration file can be generated by running the program with the option -i

The possible options are extensively described in the configuration file.

The program can be tested from the test folder with the dataset provided and the default configuration file.

    cd ./test/result
    RefMasker.py -i
    RefMasker.py -c Quade_conf_file.txt

Testing

The module can be tested with pytest, which also runs the tests of the pyBlast submodule.

  • Install pytest with pip: pip install pytest
  • Run the tests with: py.test-2.7 -v

Example of output for a successful run. Please note that some tests might fail due to the random sampling of DNA sequences and the uncertainties of the Blastn algorithm.

```
========================================================================= test session starts =========================================================================
platform linux2 -- Python 2.7.5 -- py-1.4.27 -- pytest-2.7.0 -- /usr/bin/python
rootdir: /home/adrien/Programming/Python/Refeed/src, inifile:
collected 39 items

testRefMasker.py::testSequencecreate PASSED
testRefMasker.py::testSequenceaddhit[100-seq0-90-110-0-0--0-0-0-0-] xfail
testRefMasker.py::testSequenceaddhit[100-seq1-80-100-0-0--0-0-0-0-] xfail
testRefMasker.py::testSequenceaddhit[100-seq0-80-90-20-30-ATCG-79-90-19-30-ATCG] PASSED
testRefMasker.py::testSequenceaddhit[100-seq0-90-80-20-30-ATCG-79-90-19-30-CGAT] PASSED
testRefMasker.py::testSequenceaddhit[100-seq0-80-90-30-20-ATCG-79-90-19-30-CGAT] PASSED
testRefMasker.py::testSequenceaddhit[100-seq0-90-80-30-20-ATCG-79-90-19-30-ATCG] PASSED
testRefMasker.py::testSequenceoutputsequence1[100-1] PASSED
testRefMasker.py::testSequenceoutputsequence1[100-5] PASSED
testRefMasker.py::testSequenceoutputsequence1[200-10] PASSED
testRefMasker.py::testSequenceoutputsequence2 PASSED
testRefMasker.py::testReferencecreate[1-1000-1-False] PASSED
testRefMasker.py::testReferencecreate[1-1000-1-True] PASSED
testRefMasker.py::testReferencecreate[2-10000-2-False] PASSED
testRefMasker.py::testReferencecreate[2-10000-2-True] PASSED
testRefMasker.py::testReferenceaddhitlist[1-1000-1] PASSED
testRefMasker.py::testReferenceaddhitlist[2-10000-2] PASSED
testRefMasker.py::testReferenceoutputmaskedreference PASSED
pyBlast/testpyBlast.py::testBlastHit[36.9133828132-88-75-85-47-98-88-14-8.78046725086-92.5815421121] PASSED
pyBlast/testpyBlast.py::testBlastHit[-1-19-100-17-17-54-53-33-79.1465130808-41.6977101708] xfail
pyBlast/testpyBlast.py::testBlastHit[65.8976266941--1-46-9-74-59-97-56-59.2270229149-93.0689987714] xfail
pyBlast/testpyBlast.py::testBlastHit[75.9701897823-71--1-26-16-91-16-82-5.78377016797-79.1291574854] xfail
pyBlast/testpyBlast.py::testBlastHit[80.9394959784-54-85--1-5-78-33-35-8.3011500976-53.4993883036] xfail
pyBlast/testpyBlast.py::testBlastHit[35.5821954158-26-23-29--1-69-35-57-47.706286329-4.1842760318] xfail
pyBlast/testpyBlast.py::testBlastHit[52.9290346724-31-3-44-74--1-30-76-36.6917151434-43.8870409292] xfail
pyBlast/testpyBlast.py::testBlastHit[16.7597390274-26-0-37-100-15--1-91-89.8637578655-63.9053323995] xfail
pyBlast/testpyBlast.py::testBlastHit[94.5094431806-49-70-48-9-39-80--1-72.722423521-98.7208732416] xfail
pyBlast/testpyBlast.py::testBlastHit[44.4349347822-84-83-96-49-59-16-9--1-91.9302274501] xfail
pyBlast/testpyBlast.py::testBlastHit[77.9794166482-19-89-79-33-46-9-26-21.2569521087--1] xfail
pyBlast/testpyBlast.py::testBlastn[blastn-Queries from Subject] PASSED
pyBlast/testpyBlast.py::testBlastn[blastn-Random queries] xfail
pyBlast/testpyBlast.py::testBlastn[blastn-short-Queries from Subject] PASSED
pyBlast/testpyBlast.py::testBlastn[blastn-short-Random queries] xfail
pyBlast/testpyBlast.py::testBlastn[dc-megablast-Queries from Subject] PASSED
pyBlast/testpyBlast.py::testBlastn[dc-megablast-Random queries] xfail
pyBlast/testpyBlast.py::testBlastn[megablast-Queries from Subject] PASSED
pyBlast/testpyBlast.py::testBlastn[megablast-Random queries] xfail
pyBlast/testpyBlast.py::testBlastn[rmblastn-Queries from Subject] PASSED
pyBlast/testpyBlast.py::testBlastn[rmblastn-Random queries] xfail

================================================================ 22 passed, 17 xfailed in 7.02 seconds ================================================================
```

Authors and Contact

Owner

  • Name: Adrien Leger
  • Login: a-slide
  • Kind: user
  • Location: Oxford, UK
  • Company: @nanoporetech

Research scientist at Oxford Nanopore Technologies
