minseq-find

MinSeq Find Algorithm

https://github.com/dev11ume/minseq-find

Science Score: 54.0%

This score indicates how likely this project is to be science-related, based on the following indicators:

✓ CITATION.cff file: found
✓ codemeta.json file: found
✓ .zenodo.json file: found
○ DOI references
✓ Academic publication links: links to ncbi.nlm.nih.gov
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (9.8%) to scientific vocabulary


Repository

MinSeq Find Algorithm

Basic Info

Host: GitHub
Owner: dev11ume
License: gpl-3.0
Language: MATLAB
Default Branch: main
Size: 2.18 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 2

Created about 3 years ago · Last pushed over 2 years ago

Metadata Files

Readme License Citation

Five sequencing datasets were used in this study:
1. From this study – Sequencing data can be downloaded from https://www.ncbi.nlm.nih.gov/bioproject/PRJNA729962
2. Zhang et al. 2018 Genome Research (SelexGLM study) – Sequencing data can be downloaded from https://www.ncbi.nlm.nih.gov/bioproject/PRJNA379022
3. Jolma et al. 2013 Cell (HT-SELEX study) – Sequencing data can be downloaded from ENA (European Nucleotide Archive) under accession numbers ERP001824 and ERP001826
4. Yin et al. 2017 Science (HT-SELEX study) – Sequencing data can be downloaded from ENA (European Nucleotide Archive) under accession number PRJEB9797.
5. Isakova et al. 2017 Nature Methods (SMiLE-seq study) – Sequencing data can be downloaded from https://www.ncbi.nlm.nih.gov/sra/?term=SRP073361

The analysis was run on a system with 16 GB of RAM.
Getting MinSeqs for Nuclear Receptor (NR) proteins:
1. Download the NR data from the links above.
2. The downloaded data, for this study and the other studies, are in fastq format and must first be converted to sequence format. Use the provided trim_fastq_1.pl script to convert fastq files to sequence format. This script also renames the files. “perl trim_fastq_1.pl Sra_mapping1.txt 20”
3. Start with a list of NRs, such as “list_NRs_complete_1.txt” (all NRs from this study or from previous high-throughput in vitro sequencing studies that gave a valid motif) or “list_NRs_bhimsaria_1.txt” (NRs from this study only). The columns are:
   1. Sample file name with an ID.
   2. Library or previous-round file name with an ID. The starting random library was used for this study, SMiLE-seq, and SelexGLM, whereas previous-round data were used for Jolma et al. 2013 and Yin et al. 2017 because the random library wasn’t always available for those datasets.
   3. HT-SELEX round number.
   4. Name 1 provided to the sample or run.
   5. Constant region to the left of the random DNA library (x if not available).
   6. Constant region to the right of the random DNA library (x if not available).
   7. Name 2 provided to the sample or run.
   8. Partner protein (x if not available).
   9. x.
   10. Monomer used for each sample for landscape and repeat heat maps.
   11. Name 3 provided to the sample or run.
   12. Numbering 1 (ignore).
   13. Numbering 2 (ignore).
   14. Name 4 provided to the sample or run (this is used to name the sample).
4. Run the MinSeqFind algorithm with the run_minseqfind_1.m program (in the MinSeqFind directory) in Octave. Each NR sample typically takes around an hour to run. Set the list of samples on line 5, e.g. list_file='list_NRs_bhimsaria_1_demo.txt'; the remaining parameters can be set on lines 10 to 38 of run_minseqfind_1.m.
5. Program “run_minseqfind_1.m” generates PWMs in text format and chen format, which can then be converted to logos using the seqLogo package in R or the chen2meme and ceqlogo commands from the MEME Suite (https://meme-suite.org/meme/). The PWM logos used in the manuscript were generated by ceqlogo.
6. Program “run_minseqfind_1.m” also generates MinSeqs, which are printed in a text file with the columns: a) sequence, b) reverse complement, c) weighted MinSeq score, d) length of the sequence (not counting Ns), e) enrichment/MinSeq score.
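
The MinSeq text file described in step 6 can be inspected with a short script. The Python sketch below is an illustration only and is not part of this repository. It assumes the five columns are separated by whitespace and that the file has no header row (the delimiter is not stated here, so adjust the split if needed). The names MinSeq, reverse_complement, parse_minseq_line and is_consistent are hypothetical.

```python
from typing import NamedTuple

# Complement table for DNA; N (any base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


class MinSeq(NamedTuple):
    sequence: str            # column a
    reverse_complement: str  # column b
    weighted_score: float    # column c
    length: int              # column d, not counting Ns
    enrichment: float        # column e


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def parse_minseq_line(line: str) -> MinSeq:
    """Parse one row, ASSUMING five whitespace-separated columns."""
    seq, rc, weighted, length, enrichment = line.split()
    # int(float(...)) tolerates a length written as "12" or "12.0".
    return MinSeq(seq, rc, float(weighted), int(float(length)), float(enrichment))


def is_consistent(rec: MinSeq) -> bool:
    """Check column b against column a, and column d against the non-N count."""
    non_n = sum(1 for base in rec.sequence.upper() if base != "N")
    return (
        rec.reverse_complement.upper() == reverse_complement(rec.sequence).upper()
        and rec.length == non_n
    )
```

Running is_consistent over every row is a quick way to confirm that the assumed column order and delimiter match the file actually produced.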

Demo:
After downloading and converting the data, run run_minseqfind_1.m with list_NRs_bhimsaria_1_demo.txt to produce the files for the GR round 3 binding data, named S07_166_S07_998, in the MinSeqFindFunct-op directory. The results are stored as resf.mat in the S07_166_S07_998 directory. The corresponding MinSeqs are in the file S07_166_S07_998_1_100000.txt in the MinSeqs directory. Twenty numbered PWMs appear in the TXT and CHEN directories, e.g. S07_166_S07_998_1_1.txt and S07_166_S07_998_1_1.chen, along with their reverse complements, S07_166_S07_998_1_1r.txt and S07_166_S07_998_1_1r.chen.
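
To check that a demo run produced all of its PWM files, the expected names can be generated and compared with what is on disk. The Python sketch below is illustrative and not part of this repository. It assumes the file names follow the pattern of the examples above (sample name, then "_1_", then the PWM number from 1 to 20, with an "r" suffix for the reverse complement). The function names are hypothetical, and the TXT and CHEN directory paths must be supplied by the user because their exact location is not assumed here.

```python
from pathlib import Path
from typing import List


def expected_pwm_files(sample: str = "S07_166_S07_998", n_pwms: int = 20) -> List[str]:
    """List PWM file names, ASSUMING the pattern <sample>_1_<k>[r].<ext>."""
    names = []
    for k in range(1, n_pwms + 1):
        for suffix in ("", "r"):          # "r" marks the reverse complement
            for ext in ("txt", "chen"):
                names.append(f"{sample}_1_{k}{suffix}.{ext}")
    return names


def missing_pwm_files(txt_dir: str, chen_dir: str,
                      sample: str = "S07_166_S07_998", n_pwms: int = 20) -> List[str]:
    """Return the expected file names that are not present on disk."""
    missing = []
    for name in expected_pwm_files(sample, n_pwms):
        # .txt files are looked up in txt_dir, .chen files in chen_dir.
        folder = Path(txt_dir) if name.endswith(".txt") else Path(chen_dir)
        if not (folder / name).is_file():
            missing.append(name)
    return missing
```

An empty list from missing_pwm_files means every expected file was found under the assumed naming pattern.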

MinSeqScoring_1.m
Program to score sequences using the MinSeqs stored in a resf.mat file generated by run_minseqfind_1.m.
Sequences longer than 1000 are ignored because scoring them takes much longer.

Define the following variables in the file:
1) 'seq_file': the file containing the sequences
2) 'load_resf': the MinSeq resf.mat file
3) 'op_file': the name of the output file where the scores will be stored
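
Because sequences longer than 1000 are not scored, it can help to see in advance which entries of a sequence file would be left out. The Python sketch below is illustrative only and is not part of MinSeqScoring_1.m. It assumes one sequence per line, which may not match the actual input format, and the names MAX_LEN and split_by_length are hypothetical.

```python
from typing import Iterable, List, Tuple

MAX_LEN = 1000  # length limit stated for MinSeqScoring_1.m


def split_by_length(sequences: Iterable[str],
                    max_len: int = MAX_LEN) -> Tuple[List[str], List[str]]:
    """Split sequences into (kept, skipped); skipped are longer than max_len."""
    kept, skipped = [], []
    for seq in sequences:
        seq = seq.strip()
        if not seq:                      # ignore blank lines
            continue
        if len(seq) <= max_len:
            kept.append(seq)
        else:
            skipped.append(seq)
    return kept, skipped
```

Passing an open file handle, for example split_by_length(open(path)), works because the function accepts any iterable of lines.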

Owner

Name: Devesh Bhimsaria
Login: dev11ume
Kind: user
Location: India
Company: IIT Roorkee

Repositories: 1
Profile: https://github.com/dev11ume

Professor @ IIT Roorkee

Citation (CITATION.cff)

cff-version: 1.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Bhimsaria
    given-names: Devesh
    orcid: https://orcid.org/0000-0001-8413-3801
title: "Hidden Modes of DNA Binding by Human Nuclear Receptors"
version: 1.0
doi: 10.5281/zenodo.7844417
date-released: 2023-04-19
