motifpeeker

Benchmark Epigenomic Profiling Methods with Motif Enrichment as Key Metric

https://github.com/neurogenomics/motifpeeker

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
○
.zenodo.json file
○
DOI references
✓
Academic publication links
Links to: ncbi.nlm.nih.gov
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (14.6%) to scientific vocabulary

Keywords

bioconductor bioconductor-package chip-seq epigenomics interactive-report motif-enrichment-analysis r-package

Last synced: 9 months ago · JSON representation

Repository

Benchmark Epigenomic Profiling Methods with Motif Enrichment as Key Metric

Basic Info

Host: GitHub
Owner: neurogenomics
License: gpl-3.0
Language: R
Default Branch: master
Homepage: https://neurogenomics.github.io/MotifPeeker/
Size: 7.32 MB

Statistics

Stars: 2
Watchers: 3
Forks: 0
Open Issues: 1
Releases: 0

Topics

bioconductor bioconductor-package chip-seq epigenomics interactive-report motif-enrichment-analysis r-package

Created almost 2 years ago · Last pushed 11 months ago

Metadata Files

Readme Changelog License

README.Rmd

---
output:
  github_document
---

```{r, echo = FALSE, include = FALSE}
pkg <- read.dcf("DESCRIPTION", fields = "Package")[1]
title <- gsub("\n"," ",read.dcf("DESCRIPTION", fields = "Title")[1])
description <- gsub("\n"," ",read.dcf("DESCRIPTION", fields = "Description")[1])
URL <- read.dcf('DESCRIPTION', fields = 'URL')[1]
owner <- strsplit(URL,"/")[[1]][4]
repo <- strsplit(URL,"/")[[1]][5]
```

# MotifPeeker
Benchmarking Epigenomic Profiling Methods Using Motif Enrichment  
![](https://github.com/neurogenomics/MotifPeeker/raw/master/inst/hex/hex.png){style='height: 300px !important;'}

[![Bioc history](https://bioconductor.org/shields/years-in-bioc/MotifPeeker.svg)](https://bioconductor.org/packages/devel/bioc/html/MotifPeeker.html#since)
`r rworkflows::use_badges(add_hex = FALSE, add_authors = FALSE)`

**Authors:** ***Hiranyamaya (Hiru) Dash, Thomas Roberts, Maria Weinert, Nathan Skene***  
**Updated:** ***`r format(Sys.Date(), '%b-%d-%Y')`***  


## Introduction

`MotifPeeker` is used to compare and analyse datasets from epigenomic
profiling methods with motif enrichment as the key benchmark. The package
outputs an HTML report consisting of three sections:  

1. **General Metrics**: Provides an overview of metrics related to dataset
peaks, including FRiP scores, peak widths, and motif-to-summit distances.  

2. **Known Motif Enrichment Analysis**: Presents statistics on the frequency of
enriched user-supplied motifs in the datasets and compares them between the
common and unique peaks from comparison and reference datasets.  

3. **Discovered Motif Enrichment Analysis**: Details the statistics of
motifs discovered in common and unique peaks from comparison and reference
datasets. Examines motif similarities and identifies the closest known motifs in
the JASPAR or the provided database.
 




## Installation 

`MotifPeeker` uses
[`memes`](https://www.bioconductor.org/packages/release/bioc/html/memes.html)
which relies on a local install of the
[MEME suite](https://meme-suite.org/meme/), which can be installed as follows:
```{bash, eval = FALSE}
MEME_VERSION=5.5.5  # or the latest version

wget https://meme-suite.org/meme/meme-software/$MEME_VERSION/meme-$MEME_VERSION.tar.gz
tar zxf meme-$MEME_VERSION.tar.gz
cd meme-$MEME_VERSION
./configure --prefix=$HOME/meme --with-url=http://meme-suite.org/ \
--enable-build-libxml2 --enable-build-libxslt
make
make install

# Add to PATH
echo 'export PATH=$HOME/meme/bin:$HOME/meme/libexec/meme-$MEME_VERSION:$PATH' >> ~/.bashrc
echo 'export MEME_BIN=$HOME/meme/bin' >> ~/.bashrc
source ~/.bashrc
```

**NOTE:** It is important that Perl dependencies associated with MEME suite are
also installed, particularly `XML::Parser`, which can be installed using the
following command in the terminal:
```{bash, eval = FALSE}
cpan install XML::Parser
```
For more information, refer to the [Perl dependency section of the MEME suite](https://meme-suite.org/meme/doc/install.html#prereq_perl).

Once the MEME suite and its associated Perl dependencies are installed, the
development version of `MotifPeeker` can be installed using the following code:
```{r, eval = FALSE}
# Install latest version of MotifPeeker
BiocManager::install("MotifPeeker", version = "devel", dependencies = TRUE)

# Load the package
library(MotifPeeker)
```

Alternatively, you can use the [Docker/Singularity container](https://neurogenomics.github.io/MotifPeeker/articles/docker.html) to
run the package out-of-the-box.

## Documentation 

#### [MotifPeeker Website](https://`r owner`.github.io/`r repo`) 
#### [Get Started](https://neurogenomics.github.io/MotifPeeker/articles/MotifPeeker.html) 
#### [Docker/Singularity Container](https://neurogenomics.github.io/MotifPeeker/articles/docker.html) 
#### [Example Reports](https://neurogenomics.github.io/MotifPeeker/articles/examples.html) 
#### [Troubleshooting](https://neurogenomics.github.io/MotifPeeker/articles/troubleshooting.html) 

## Usage

Load the package and example datasets.  
```{r, eval = FALSE}
library(MotifPeeker)
data("CTCF_ChIP_peaks", package = "MotifPeeker")
data("CTCF_TIP_peaks", package = "MotifPeeker")
data("motif_MA1102.2", package = "MotifPeeker")
data("motif_MA1930.2", package = "MotifPeeker")
```

Prepare input files.  
```{r, eval = FALSE}
peak_files <- list(CTCF_ChIP_peaks, CTCF_TIP_peaks)
alignment_files <- list(
    system.file("extdata", "CTCF_ChIP_alignment.bam", package = "MotifPeeker"),
    system.file("extdata", "CTCF_TIP_alignment.bam", package = "MotifPeeker")
)
motif_files <- list(motif_MA1102.2, motif_MA1930.2)
```

Run `MotifPeeker()`:  
```{r, eval = FALSE}
MotifPeeker(
    peak_files = peak_files,
    reference_index = 2,  # Set TIP-seq experiment as reference
    alignment_files = alignment_files,
    exp_labels = c("ChIP", "TIP"),
    exp_type = c("chipseq", "tipseq"),
    genome_build = "hg38",
    motif_files = motif_files,
    cell_counts = NULL,  # No cell-count information
    motif_discovery = TRUE,
    motif_discovery_count = 3,
    motif_db = NULL,
    download_buttons = TRUE,
    out_dir = tempdir(),
    workers = 2,
    debug = FALSE,
    quiet = FALSE,
    verbose = TRUE
)
```

### Required Inputs

These input parameters must be provided:  

Details

- `peak_files`: A list of path to peak files or `GRanges` objects with the
  peaks to analyse. Currently, only peak files from `MACS2/3` (`.narrowPeak`)
  and `SEACR` (`.bed`) are supported. ENCODE file IDs can also be provided to
  automatically fetch peak file(s) from the ENCODE database.  
- `reference_index`: An integer specifying the index of the reference dataset
  in the `peak_files` list to use as reference for various comparisons.
  (default = 1)  
- `genome_build`: A character string or a `BSgenome` object specifying the
  genome build of the datasets. At the moment, only hg38 and hg19 are supported
  as abbreviated input.  
- `out_dir`: A character string specifying the output directory to save the
  HTML report and other files.  



### Optional Inputs

These input parameters optional, but *recommended* to add more analyses, or
enhance them:  

Details

- `alignment_files`: A list of path to alignment files or `Rsamtools::BamFile`
  objects with the alignment sequences to analyse. Alignment files are used to
  calculate read-related metrics like FRiP score. ENCODE file IDs can also be
  provided to automatically fetch alignment file(s) from the ENCODE database.  
- `exp_labels`: A character vector of labels for each peak file. If not
  provided, capital letters will be used as labels in the report.
- `exp_type`: A character vector of experimental types for each peak file.  
  Useful for comparison of different methods. If not provided, all datasets will
  be classified as "unknown" experiment types in the report. `exp_type` is used
  only for labelling. It does not affect the analyses. You can also input custom
  strings. Datasets will be grouped as long as they match their respective
  `exp_type`. Supported experimental types are:  
    - `chipseq`: ChIP-seq data  
    - `tipseq`: TIP-seq data  
    - `cuttag`: CUT&Tag data  
    - `cutrun`: CUT&Run data  
- `motif_files`: A character vector of path to motif files, or a vector of
  `universalmotif-class` objects. Required to run Known Motif Enrichment
  Analysis. JASPAR matrix IDs can also be provided to automatically fetch motifs
  from the JASPAR.  
- `motif_labels`: A character vector of labels for each motif file. Only used if
  path to file names are passed in motif_files. If not provided, the motif file
  names will be used as labels.  
- `cell_counts`: An integer vector of experiment cell counts for each peak file
  (if available). Creates additional comparisons based on cell counts.  
- `motif_db`: Path to `.meme` format file to use as reference database, or a
  list of `universalmotif-class` objects. Results from motif discovery
  are searched against this database to find similar motifs. If not provided,
  JASPAR CORE database will be used, making this parameter **truly optional**.
  **NOTE**: p-value estimates are inaccurate when the database has fewer than
  50 entries.  



### Other Options

For more information on additional parameters, please refer to the documentation
for [`MotifPeeker()`](https://neurogenomics.github.io/MotifPeeker/reference/MotifPeeker.html).  

### Runtime Guidance

For 4 datasets, the runtime is approximately 3 minutes with
motif_discovery disabled. However, motif discovery can take
hours to complete.  

To make computation faster, we highly recommend tuning the following arguments:  

Details

- `workers`: Running motif discovery in parallel can significantly reduce
  runtime, but it is very memory-intensive, consuming upwards of 10GB of RAM per
  thread. Memory starvation can greatly slow the process, so set `workers` with
  caution.  
- `motif_discovery_count`: The number of motifs to discover per sequence group
  exponentially increases runtime. We recommend no more than 5 motifs to make a
  meaningful inference.  
- `trim_seq_width`: Trimming sequences before running motif discovery
  can significantly reduce the search space. Sequence length can exponentially
  increase runtime. We recommend running the script with
  `motif_discovery = FALSE` and studying the motif-summit distance
  distribution under general metrics to find the sequence length that captures
  most motifs. A good starting point is 150 but it can be reduced further if
  appropriate.



### Outputs
  
`MotifPeeker` generates its output in a new folder within he `out_dir`
directory. The folder is named `MotifPeeker_YYYYMMDD_HHMMSS` and contains the
following files:  

- `MotifPeeker.html`: The main HTML report, including all analyses and plots.  
- Output from various MEME suite tools in their respecive sub-directories, if
  `save_runfiles` is set to `TRUE`.  

## Datasets

`MotifPeeker` comes with several datasets bundled:  

Details

- `CTCF_TIP_peaks`: Human CTCF peak file generated with TIP-seq using
  HCT116 cell-line. No control files were used to generate the peak file. The
  peaks were called using `MACS3` with `CTCF_TIP_alignment.bam` as input.  
- `CTCF_ChIP_peaks`: Human CTCF peak file generated with ChIP-seq using
  HCT116 cell-line. No control files were used to generate the peak file. The
  peaks were called using `MACS3` with `CTCF_ChIP_alignment.bam` as input.  
- `motif_MA1102.3`: The JASPAR motif for CTCFL (MA1102.3) for *Homo Sapiens*.
  Sourced from [JASPAR](https://jaspar.elixir.no/matrix/MA1102.3/)
- `motif_MA1930.2`: The JASPAR motif for CTCFL (MA1930.2) for *Homo Sapiens*.
  Sourced from [JASPAR](https://jaspar.elixir.no/matrix/MA1930.2/)
- `CTCF_TIP_alignment.bam`: Human CTCF alignment file generated with TIP-seq
  using HCT116 cell-line. The alignment file was generated using the
  [`nf-core/cutandrun`](https://nf-co.re/cutandrun/3.2.2) pipeline.
  Raw read files were sourced from *NIH Sequence Read Archives*
  [*ID: SRR16963166*](https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR16963166). 
  Only available as *extdata*.  
- `CTCF_ChIP_alignment.bam`: Human CTCF alignment file generated with ChIP-seq
  using HCT116 cell-line. Sourced from
  [ENCODE (Accession: ENCFF091ODJ)](https://www.encodeproject.org/files/ENCFF091ODJ/).
  Only available as *extdata*.  



Please note that the peaks and alignments included are a very small subset
(*chr10:65,654,529-74,841,155*) of the actual data. It only serves as an example
to demonstrate the package and run tests to maintain the integrity of the
package.

## Citation
 
If you use ``r pkg``, please cite: 


> `r citation(pkg)$textVersion`

## Licensing Restrictions

MotifPeeker incorporates the MEME Suite, which is available free of charge for
educational, research, and non-profit purposes. Users intending to use
MotifPeeker for commercial purposes are required to purchase a license for the
MEME Suite.

For more details, please refer to the [MEME Suite Copyright Page](https://meme-suite.org/meme/doc/copyright.html).

## Contact

### [Neurogenomics Lab](https://www.neurogenomics.co.uk)  
UK Dementia Research Institute  
Department of Brain Sciences  
Faculty of Medicine  
Imperial College London  
[GitHub](https://github.com/neurogenomics)  



### Session Info


```{r}
utils::sessionInfo()
```

Owner

Name: neurogenomics
Login: neurogenomics
Kind: organization
Location: United Kingdom

Website: https://www.neurogenomics.co.uk
Repositories: 56
Profile: https://github.com/neurogenomics

Neurogenomics Lab, UK Dementia Research Institute at Imperial College London

GitHub Events

Total

Issues event: 1
Watch event: 1
Delete event: 4
Issue comment event: 4
Push event: 58
Pull request event: 9
Create event: 7

Last Year

Issues event: 1
Watch event: 1
Delete event: 4
Issue comment event: 4
Push event: 58
Pull request event: 9
Create event: 7

Issues and Pull Requests

Last synced: 9 months ago

All Time

Total issues: 1
Total pull requests: 6
Average time to close issues: 3 days
Average time to close pull requests: less than a minute
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.67
Merged pull requests: 6
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 6
Average time to close issues: 3 days
Average time to close pull requests: less than a minute
Issue authors: 1
Pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.67
Merged pull requests: 6
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

HDash (1)

Pull Request Authors

HDash (11)

Top Labels

Issue Labels

bug (1)

Pull Request Labels

bug (2) enhancement (2)

Dependencies

.github/workflows/rworkflows.yml actions

neurogenomics/rworkflows master composite

DESCRIPTION cran

R >= 4.4.0 depends
knitr * suggests
markdown * suggests
methods * suggests
remotes * suggests
rmarkdown * suggests
rworkflows * suggests
testthat >= 3.0.0 suggests
utils * suggests

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science