https://github.com/broadinstitute/rec-seq

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
✓
Academic publication links
Links to: ncbi.nlm.nih.gov
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (5.7%) to scientific vocabulary

Last synced: 7 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: broadinstitute
License: mit
Language: Python
Default Branch: master
Size: 11.7 KB

Statistics

Stars: 1
Watchers: 7
Forks: 0
Open Issues: 1
Releases: 0

Created about 7 years ago · Last pushed about 7 years ago

Metadata Files

Readme License

Rec-seq

Rec-seq is a method for determining the DNA specificity and potential off-target substrates of site-specific recombinases via high-throughput sequencing of recombined substrates from a pool of partially randomized DNA sequences. The Rec-seq data analysis script, rec-seq.py, is used to quantify the enzyme’s substrate specificity. Post-recombination sequencing reads that contain the matched target core sequence are aligned to the native target sequence, with no gaps allowed. After alignment, reads with excessive numbers of mismatches are considered to be the result of sequencing errors and are filtered out of subsequent analysis (determined by maxmismatchcount parameter). For the remaining sequences, at each position in the recombinase target, the abundance of the canonical base (A_i) and the sum of the non-canonical bases (B_i) are calculated. The same analysis is performed for the sequencing reads of the input library, except the abundance of the canonical base and the sum of the non-canonical bases are expressed as fractions α_i and β_i. The enrichment score for each position is then calculated as the ratio r_i = (A_i/B_i)/(α_i/β_i). Analysis is performed separately for the left and right half-sites, using as input the sequencing reads from experiments with either left- or right-randomized half-sites.

Input

index file - tab-delimited text file defining analysis of individual experiments. Each row in the index file specifies an enzyme variant used in the experiment, a date of the experiment, and fastq files containing pre- and post-selection sequencing reads for both left and right half sites. Index file contains the following columns:
- Enzyme_variant - enzyme variant used in the experiment
- Date - date of the experiment
- Left_library_file - fastq file containing post-selection sequencing reads for the left half site
- Right_library_file - fastq file containing post-selection sequencing reads for the right half site
- Left_control_file - fastq file containing pre-selection sequencing reads for the left half site
- Right_control_file - fastq file containing pre-selection sequencing reads for the left half site
substrate file - tab-delimited text file defining canonical substrate sequences and their formats. Each file listed in the index file must have an entry in the substrates file. Library files and their corresponding control files must have the same site layout. Substrate file contains the following columns:
- Library_file - fastq file containing sequencing reads
- Substrate – nucleotide sequence of the canonical substrate
- Site_layout - layout of the substrate in the format left half-site length;core length;right half-site length
fastq files (listed in index/substrate files) containing sequencing reads.
maxmismatchcount - parameter that controls which sequencing reads that are included in the analysis, only reads that have at most maxmismatchcount mismatches when compared to left and right half sites of the canonical substrate are considered.

Output

Analysis output, printed to stdout, is a tab-delimited text with the following columns: * Enzyme_variant - enzyme variant used in analyzed experiment * Date - date of analyzed experiment * Left_library_file - fastq file containing post-selection sequencing reads for the left half site * Right_library_file - fastq file containing post-selection sequencing reads for the right half site * Total_read_count - total number of reads present in post-selection library file * Control_read_count - total number of reads present in pre-selection control file * Core_count - number of reads containing core sequence in post-selection library file * Control_core_count - number of reads containing core sequence in pre-selection control file * Position - position for which enrichment was computed, negative for the left half-site and positive for the right half-site. * Nucleotide - nucleotide for which enrichment was computed * Match - true when the nucleotide for which enrichment was computed matches a canonical substrate base at a given position * Library_count - number of reads (containing core sequence and having at most maxmismatchcount mismatches) matching a given nucleotide at a given position in post-selection library file * Control_count - number of reads (containing core sequence and having at most maxmismatchcount mismatches) matching a given nucleotide at a given position in pre-selection control file * Enrichment - computed value of enrichment of a given nucleotide at a given position

Usage

rec-seq.py [-h] [-mmc MAX_MISMATCH_COUNT] index_file substrate_file

positional arguments:
  index_file            input index file
  substrate_file        input substrates file

optional arguments:
  -h, --help            show help message and exit
  -mmc MAX_MISMATCH_COUNT, --max_mismatch_count MAX_MISMATCH_COUNT
                    default max mismatch count value 5

Requirements

Python 3.2 or later

pandas 0.16.1 or later

Example

Example index and substrate files are included. The corresponding fastq files are available from https://www.ncbi.nlm.nih.gov/bioproject/517947

Owner

Name: Broad Institute
Login: broadinstitute
Kind: organization
Location: Cambridge, MA

Website: http://www.broadinstitute.org/
Twitter: broadinstitute
Repositories: 1,083
Profile: https://github.com/broadinstitute

Broad Institute of MIT and Harvard

GitHub Events

Total

Issues event: 1

Last Year

Issues event: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 1
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science