manorm2-utils
To pre-process a set of ChIP-seq samples and coordinate with MAnorm2 for differential analysis
Keywords
chip-seq
normalization
python-library
Basic Info
- Host: GitHub
- Owner: tushiqi
- License: gpl-3.0
- Language: Python
- Default Branch: master
- Size: 4.4 MB
=============================
Introduction to MAnorm2_utils
=============================
:Author: Shiqi Tu
:Contact: tushiqi@picb.ac.cn
:Version: 1.0.0
:Date: 2018-08-24
:code:`MAnorm2_utils` is designed to coordinate with MAnorm2_, an R package for
differential analysis with ChIP-seq_ signals between two or more groups of
replicate samples. :code:`MAnorm2_utils` is primarily used for processing a set
of ChIP-seq samples into a regular table recording the read abundances and
enrichment states of a list of genomic bins in each of these samples.
.. _MAnorm2: https://github.com/tushiqi/MAnorm2
.. _ChIP-seq: https://en.wikipedia.org/wiki/ChIP-sequencing
Usage
------------------------------
The primary utility of :code:`MAnorm2_utils` comes from the two scripts bundled
with it, :code:`profile_bins` and :code:`sam2bed`.
Profiling ChIP-seq signals in reference genomic regions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Given the peak regions and read mapping positions for each of a set of
ChIP-seq_ samples, :code:`profile_bins` derives a list of reference
genomic bins (each enriched for ChIP-seq signals in at least one of the
samples), and determines the read count and enrichment status of each
bin in each sample. Refer to MACS_ for more information about the
technical terms mentioned above.
.. _MACS: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2008-9-9-r137
We recommend `MACS 1.4`_ for identifying peaks for ChIP-seq samples associated
with narrow genomic regions of read enrichment (e.g., samples for most
transcription factors and histone modifications like H3K4me3 and H3K27ac). In
fact, although generally applicable, :code:`profile_bins` is
specifically suited to processing the output files generated by MACS 1.4. For
histone modifications that form broad enriched domains (e.g., H3K9me3 and
H3K27me3), we recommend SICER_ as the peak caller.
.. _MACS 1.4: https://github.com/taoliu/MACS/downloads
.. _SICER: https://academic.oup.com/bioinformatics/article/25/15/1952/212783
The following is a sample usage of :code:`profile_bins` in its simplest form:

.. code:: bash

    profile_bins --peaks=peak1.bed,peak2.bed \
                 --reads=read1.bed,read2.bed \
                 --labs=s1,s2 -n example

.. Note::

    :code:`profile_bins` only recognizes BED-formatted_ input files. For read
    alignment results stored in SAM_ files, first use :code:`sam2bed` to
    convert them into BED files before calling :code:`profile_bins` (BED files
    created by :code:`sam2bed` are specifically designed to suit
    :code:`profile_bins`; see also the `section below`__). For BAM-formatted_
    files, refer to Samtools_ for converting them into SAM files.

.. _BED-formatted: BED_
.. _BED: http://genome.ucsc.edu/FAQ/FAQformat.html#format1
.. _BAM-formatted: SAM_
.. _SAM: https://samtools.github.io/hts-specs/SAMv1.pdf
.. _Samtools: https://www.htslib.org/
__ `Transforming SAM into BED files`_
If everything goes smoothly, the command above will generate two files, named
``example_profile_bins_log.txt`` and ``example_profile_bins.xls``,
respectively. The former records the full list of parameter settings for
calling :code:`profile_bins`, as well as some summary statistics regarding each
of the supplied ChIP-seq samples. The latter gives the read count and
enrichment status for each deduced reference genomic bin in each sample, and
has a format like the following (the data shown are for illustration only):

.. table:: Example output of :code:`profile_bins`
    :align: right

    ======  =======  =======  ============  ============  =============  =============
    chrom   start    end      s1.read_cnt   s2.read_cnt   s1.occupancy   s2.occupancy
    ======  =======  =======  ============  ============  =============  =============
    chr1    28112    29788    115           4             1              0
    chr1    164156   166417   233           194           1              1
    chr1    166417   168417   465           577           1              1
    chr1    168417   169906   15            34            0              1
    ======  =======  =======  ============  ============  =============  =============

To clarify, a genomic bin is "occupied" by a ChIP-seq sample if and only if its
middle point is covered by some peak region of the sample.
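
The occupancy rule can be illustrated with a minimal Python sketch. This is not
the code used by :code:`profile_bins`; the function name ``is_occupied``, the
half-open interval convention, and the integer rounding of the middle point are
all assumptions made for illustration, as are the peak coordinates.

```python
def is_occupied(bin_start, bin_end, peaks):
    """Return 1 if the bin's middle point lies inside any peak, else 0.

    Intervals are treated as BED-style half-open [start, end); the rounding
    of the middle point is an assumption of this sketch.
    """
    midpoint = (bin_start + bin_end) // 2
    return int(any(start <= midpoint < end for start, end in peaks))


# Hypothetical peaks of one sample on a single chromosome.
s1_peaks = [(28000, 29900), (164000, 168500)]

print(is_occupied(28112, 29788, s1_peaks))    # bin lies fully inside a peak
print(is_occupied(168417, 169906, s1_peaks))  # overlaps a peak, midpoint outside
```

Note that the second bin overlaps a peak but is still reported as unoccupied,
because only the middle point is tested.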
:code:`profile_bins` supports a number of parameters for a customized
configuration for deducing reference genomic bins as well as counting the reads
falling in them. Run :code:`profile_bins --help` on the command line for a
complete list of these parameters and a brief description of each.
Among others, several parameters deserve specific attention:
- By default, :code:`profile_bins` merges peaks from all the provided ChIP-seq
  samples into a consensus set of peak regions, and divides each *broad*
  merged peak into consecutive genomic bins. Specify :code:`--typical-bin-size`
  to control the size of these bins. Merged peaks whose size is comparable to
  this parameter are left untouched.

  The default value of :code:`--typical-bin-size`, 2000, is well suited to
  ChIP-seq samples of histone modifications. For ChIP-seq samples of
  transcription factors, setting the parameter to 1000 is recommended.
- In cases where summit positions of the supplied peaks are available (e.g.,
  when the peaks are called with `MACS 1.4`_), you may provide them to
  :code:`profile_bins` by specifying :code:`--summits`. Summit positions are
  used to determine an appropriate start point for dividing up a broad merged
  peak.
- Alternatively, you can directly specify a set of genomic regions as the
  reference bins to profile, by setting :code:`--bins` to a BED_ file. In this
  case, :code:`profile_bins` focuses on the provided bins and skips the
  peak merging procedure.

  :code:`--typical-bin-size` and :code:`--summits` are ignored when
  :code:`--bins` is specified.
- Before being assigned to reference bins, each read (or read pair) is
  converted into a genomic locus representing the middle point of the
  underlying DNA fragment. By default, :code:`profile_bins` treats the supplied
  reads as single-end, and shifts the 5' end of each read downstream by
  :code:`--shiftsize` to reach the putative middle point. :code:`--shiftsize`
  defaults to 100, and may be set to half of the DNA fragment size
  selected during library preparation.
- Set :code:`--paired` to indicate that the reads are paired-end. In this case,
  the middle point of the DNA fragment underlying each read pair
  can be inferred accurately. Note that two reads from the same ChIP-seq
  sample are considered a read pair only if they have *exactly the same*
  name (i.e., the 4th column in a BED_ file).

  :code:`--shiftsize` is ignored when :code:`--paired` is set.
- :code:`--keep-dup` controls the program's behavior regarding duplicate reads
  (or read pairs) potentially resulting from PCR amplification. For single-end
  reads, two reads are considered duplicates if their 5' ends are mapped to
  the same genomic locus; for paired-end reads, two read pairs are considered
  duplicates if their implied DNA fragments occupy the same genomic
  interval.

  By default, :code:`profile_bins` preserves all the reads (or read pairs) for
  the counting procedure. For both paired-end reads and deep-sequencing
  single-end reads, we strongly recommend setting :code:`--keep-dup` to 1 to
  enhance the specificity of downstream analyses. In that case, for each
  ChIP-seq sample, only one read (or read pair) of a set of duplicates is
  retained for counting. Note also that the output log file records, for each
  sample, the ratio of reads (or read pairs) that are removed due to
  :code:`--keep-dup`.
- :code:`profile_bins` supports using a configuration file to deliver
  parameters, which avoids repeated typing on the command line. To do so,
  write a configuration file in the format demonstrated below, and
  pass it to :code:`--parameters`::

      peaks=peak1.bed,peak2.bed
      reads=read1.bed,read2.bed
      labs=s1,s2
      n=example
      summits=summit1.bed,summit2.bed
      paired
      keep-dup=1

  Note that :code:`--parameters` can be combined with other
  command-line arguments.
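
The read-handling rules described in the list above (shifting single-end reads,
locating the middle point of a read pair, and removing duplicates) can be
sketched in Python as follows. This is an illustrative sketch, not the
implementation inside :code:`profile_bins`: the function names are
hypothetical, coordinates are assumed to follow BED conventions (0-based,
half-open), and including the strand in the duplicate key is an assumption.

```python
def single_end_midpoint(start, end, strand, shiftsize=100):
    """Shift the 5' end of a single-end read downstream by shiftsize."""
    if strand == "+":
        return start + shiftsize
    # On the reverse strand the 5' end is the last base of the interval,
    # and "downstream" means towards smaller coordinates.
    return end - 1 - shiftsize


def paired_end_midpoint(read1, read2):
    """Middle point of the fragment spanned by two mates given as (start, end)."""
    frag_start = min(read1[0], read2[0])
    frag_end = max(read1[1], read2[1])
    return (frag_start + frag_end) // 2


def drop_duplicates(reads, keep_dup=1):
    """Keep at most keep_dup single-end reads per (chrom, strand, 5' end)."""
    seen = {}
    kept = []
    for chrom, start, end, strand in reads:
        five_prime = start if strand == "+" else end - 1
        key = (chrom, strand, five_prime)  # strand in the key is an assumption
        seen[key] = seen.get(key, 0) + 1
        if seen[key] <= keep_dup:
            kept.append((chrom, start, end, strand))
    return kept


# Hypothetical reads: the 2nd and 4th share a 5' end with an earlier read.
reads = [
    ("chr1", 100, 150, "+"),
    ("chr1", 100, 136, "+"),
    ("chr1", 120, 150, "-"),
    ("chr1", 100, 150, "-"),
]
print(single_end_midpoint(100, 150, "+"))            # 200
print(single_end_midpoint(100, 150, "-"))            # 49
print(paired_end_midpoint((100, 150), (260, 310)))   # 205
print(len(drop_duplicates(reads, keep_dup=1)))       # 2
```

For paired-end data, the analogous duplicate key would be the fragment interval
itself rather than the 5' end of a single read.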
Refer to the `Manual of MAnorm2_utils`_ for a full specification of the
parameters supported by :code:`profile_bins`.
.. _Manual of MAnorm2_utils: https://github.com/tushiqi/MAnorm2_utils/tree/master/docs
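
When many samples are processed, the configuration file passed to
:code:`--parameters` may be easier to generate than to write by hand. The
following Python sketch renders a dictionary into the ``key=value`` layout
shown earlier. The helper ``render_parameters`` is hypothetical and not part of
:code:`MAnorm2_utils`; it assumes that flags without a value (such as
``paired``) are written as a bare name, exactly as in the example above.

```python
def render_parameters(options):
    """Render options as lines of a profile_bins-style configuration file.

    Values of True become bare flag lines (e.g. "paired"); lists are joined
    with commas; everything else is written as key=value.
    """
    lines = []
    for key, value in options.items():
        if value is True:
            lines.append(key)
        elif isinstance(value, (list, tuple)):
            lines.append(f"{key}={','.join(map(str, value))}")
        else:
            lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


config_text = render_parameters({
    "peaks": ["peak1.bed", "peak2.bed"],
    "reads": ["read1.bed", "read2.bed"],
    "labs": ["s1", "s2"],
    "n": "example",
    "paired": True,
    "keep-dup": 1,
})
print(config_text)
```

The resulting text can be written to a file whose path is then supplied to
:code:`--parameters`.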
Transforming SAM into BED files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
:code:`sam2bed` is designed to coordinate with :code:`profile_bins`, since the
latter only accepts BED-formatted_ files. The simplest form of calling
:code:`sam2bed` is as follows:

.. code:: bash

    sam2bed -i File.sam -o File.bed

The program will read from the standard input stream if :code:`-i` is not
specified.
In the vast majority of cases, the default settings of the parameters
supported by :code:`sam2bed` should be used. The only parameter that may need
to be customized in practice is :code:`--min-qual`, which controls the
filtering of SAM_ alignment records with a low mapping quality.
Run :code:`sam2bed --help` on the command line for a brief description of each
parameter supported by :code:`sam2bed`.
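
To make the conversion concrete, the following Python sketch turns a single SAM
alignment line into a BED line. It is not the implementation of
:code:`sam2bed`; the function name ``sam_line_to_bed`` is hypothetical, and two
choices are assumptions of this sketch: the mapping quality is written to the
BED score column, and records whose quality is strictly below ``min_qual`` are
dropped. The coordinate handling itself follows the SAM and BED format
definitions (SAM positions are 1-based; BED intervals are 0-based and
half-open).

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def sam_line_to_bed(line, min_qual=0):
    """Convert one SAM alignment line to a 6-column BED line, or None."""
    fields = line.rstrip("\n").split("\t")
    name, flag, chrom, pos, mapq, cigar = fields[:6]
    flag, pos, mapq = int(flag), int(pos), int(mapq)
    if flag & 0x4 or cigar == "*" or mapq < min_qual:
        return None  # unmapped, no CIGAR, or below the quality threshold
    # Only M, D, N, = and X operations consume reference bases.
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")
    start = pos - 1  # SAM is 1-based; BED is 0-based, half-open
    strand = "-" if flag & 0x10 else "+"
    return "\t".join([chrom, str(start), str(start + ref_len), name, str(mapq), strand])


# Hypothetical reverse-strand alignment with a 36-base match.
record = "r1\t16\tchr1\t101\t30\t36M\t*\t0\t0\tACGT\tIIII"
print(sam_line_to_bed(record))               # chr1, 100, 136, r1, 30, -
print(sam_line_to_bed(record, min_qual=40))  # None: filtered by quality
```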
pypi.org: manorm2-utils
To pre-process a set of ChIP-seq samples
- Homepage: https://github.com/tushiqi/MAnorm2_utils
- Documentation: https://manorm2-utils.readthedocs.io/
- License: GNU General Public License v3 (GPLv3)