cactus

Chromatin ACcessibility and Transcriptomics Unifying Software

https://github.com/jsalignon/cactus

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.4%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: jsalignon
  • License: mit
  • Language: Nextflow
  • Default Branch: main
  • Homepage:
  • Size: 70.9 MB
Statistics
  • Stars: 13
  • Watchers: 2
  • Forks: 3
  • Open Issues: 0
  • Releases: 9
Created over 4 years ago · Last pushed almost 2 years ago

README.md

Overview of Cactus.
(a) Key features. Icons were adapted from Servier Medical Art and the Database Center for Life Science/TogoTV. (b) Simplified workflow. (c) Example of enrichment analysis performed for a gene showing an increase in both chromatin accessibility and gene expression upon treatment. Enrichment of internal GRs and GSs indicates enrichment of GRs and GSs in other GRs and GSs generated by the pipeline. Black lines and blue circles represent DNA and nucleosomes, respectively. Orange lines represent mRNA molecules. (d) Sub-workflow showing the creation of DASs. Dotted arrows indicate optional additional filters. Abbreviations: DAR, differentially accessible region; DEG, differentially expressed gene; ChIP, ChIP-Seq binding sites; motifs, DNA binding motifs; FDR, false discovery rate; prom, promoter; distNC, distal non-coding region.

CACTUS (Chromatin ACcessibility and Transcriptomics Unification Software) is an mRNA-Seq and ATAC-Seq analysis pipeline that aims to assist researchers in formulating hypotheses about the molecular mechanisms regulating their conditions of interest. The pipeline performs standard preprocessing and differential abundance analysis, followed by enrichment analysis using various large-scale external datasets, such as databases of gene ontologies, pathways, DNA binding motifs, ChIP-Seq binding sites, and chromatin states. Currently, Cactus can analyze data from any of the four ENCODE/modENCODE species: H. sapiens, M. musculus, D. melanogaster and C. elegans. The pipeline is designed to be easy to use for people without bioinformatics skills; efficient and reproducible through the use of the workflow language Nextflow and various tool managers (Singularity, Docker, Conda, Mamba); and flexible, with many parameters available to customize the analysis. Output files are easy to view (e.g., MultiQC reports, merged and individual PDFs and tables, formatted Excel tables) and interpret (e.g., standardized downstream analysis figures, customizable heatmaps).

This introductory section provides a quick overview of how Cactus works, with:

- A quick start guide to get started rapidly.
- A tutorial to get details on usage, options and interpretation.
- A flowchart to visualize the key steps of the analysis.
- An overview of the output files to know what to expect.

Reference: Salignon, J., Millan-Ariño, L., Garcia, M. U. & Riedel, C. G. (2024). Cactus: a user-friendly and reproducible ATAC-Seq and mRNA-Seq analysis pipeline for data preprocessing, differential analysis, and enrichment analysis. Links: Genomics, bioRxiv.

License: This source code is released under the MIT license, included here.

Owner

  • Name: Jérôme Salignon
  • Login: jsalignon
  • Kind: user
  • Location: Stockholm
  • Company: Karolinska Institute

Citation (CITATIONS.md)


# Table of contents

  - [Cactus](#Cactus)
    - [jsalignon/cactus](#jsalignon/cactus)
    - [Languages](#Languages)
    - [Dependencies](#Dependencies)
    - [Tools](#cactus_tools)
  - [Scripts](#Scripts)
    - [Create references](#Create-references)
      - [Tools](#create_refs_tools)
      - [Databases](#create_refs_databases)
    - [Create test datasets](#Create-test-datasets)
      - [Tools](#create_test_ds_tools)
      - [Databases](#create_test_ds_databases)
  - [Research papers](#Research-papers)
    - [ATAC-Seq](#ATAC-Seq)
    - [Statistics](#Statistics)



# Cactus


## jsalignon/cactus

- [Cactus](https://doi.org/10.1101/2023.05.11.540110)

> Salignon J, Millan-Ariño L, Garcia M, Riedel C G. Cactus: a user-friendly and reproducible ATAC-Seq and mRNA-Seq analysis pipeline for data preprocessing, differential analysis, and enrichment analysis. doi: 10.1101/2023.05.11.540110.


## Languages

- [R](https://www.R-project.org/)

> R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

- [Nextflow](https://doi.org/10.1038/nbt.3820)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. PubMed PMID: 28398311.


## Dependencies

- [Anaconda](https://anaconda.com)

> Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

> Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

> da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

- [Mamba](https://medium.com/@QuantStack/open-software-packaging-for-science-61cecee7fc23)

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

> Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.


<h2 name="cactus_tools">Tools</h2>


- [BBMap](https://sourceforge.net/projects/bbmap/)

- [BEDTools](https://doi.org/10.1093/bioinformatics/btq033)

> Aaron R. Quinlan, Ira M. Hall, BEDTools: a flexible suite of utilities for comparing genomic features, Bioinformatics, Volume 26, Issue 6, 15 March 2010, Pages 841–842.

- [bioconductor-ChIPseeker](http://dx.doi.org/10.1093/bioinformatics/btv145)

> G Yu, LG Wang, QY He. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 2015, 31(14):2382-2383.

- [bioconductor-GenomicFeatures](https://doi.org/10.1371/journal.pcbi.1003118)

> Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, et al. (2013) Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9(8): e1003118.

- [bioconductor-clusterProfiler](https://doi.org/10.1089/omi.2011.0118)

> Yu G, Wang L, Han Y and He Q*. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.

- [bioconductor-AnnotationDbi](https://bioconductor.org/packages/AnnotationDbi)

> Pagès H, Carlson M, Falcon S, Li N (2022). AnnotationDbi: Manipulation of SQLite-based annotations in Bioconductor. R package version 1.58.0.

- [bioconductor-DiffBind](https://doi.org/10.1038/nature10730)

> Ross-Innes CS, Stark R, Teschendorff AE, Holmes KA, Ali HR, Dunning MJ, Brown GD, Gojis O, Ellis IO, Green AR, Ali S, Chin S, Palmieri C, Caldas C, Carroll JS (2012). “Differential oestrogen receptor binding is associated with clinical outcome in breast cancer.” Nature, 481, 389-393.

> Stark R, Brown G (2011). DiffBind: differential binding analysis of ChIP-Seq peak data.

- [Bowtie2](https://doi.org/10.1038/nmeth.1923)

> Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012, 9:357-359.

- [DeepTools](https://doi.org/10.1093/nar/gkw257)

> Ramírez, Fidel, Devon P. Ryan, Björn Grüning, Vivek Bhardwaj, Fabian Kilpert, Andreas S. Richter, Steffen Heyne, Friederike Dündar, and Thomas Manke. deepTools2: A next Generation Web Server for Deep-Sequencing Data Analysis. Nucleic Acids Research (2016).

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

- [HOMER](https://doi.org/10.1016/j.molcel.2010.05.004)

> Heinz S, Benner C, Spann N, Bertolino E et al. Simple Combinations of Lineage-Determining Transcription Factors Prime cis-Regulatory Elements Required for Macrophage and B Cell Identities. Mol Cell 2010 May 28;38(4):576-589. PMID: 20513432

- [MACS2](https://doi.org/10.1186/gb-2008-9-9-r137)

> Zhang, Y., Liu, T., Meyer, C.A. et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol 9, R137 (2008).

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411. PubMed Central PMCID: PMC5039924.

- [Picard](https://broadinstitute.github.io/picard/)

> “Picard Toolkit.” 2019. Broad Institute, GitHub Repository.

- [PIGZ](https://zlib.net/pigz/)

- [pdftk](https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/)

- [kallisto](https://doi.org/10.1038/nbt.3519)

> Nicolas L Bray, Harold Pimentel, Páll Melsted and Lior Pachter, Near-optimal probabilistic RNA-seq quantification, Nature Biotechnology 34, 525–527 (2016).

- [r-sleuth](http://dx.doi.org/10.1038/nmeth.4324)

> Harold J. Pimentel, Nicolas Bray, Suzette Puente, Páll Melsted and Lior Pachter, Differential analysis of RNA-Seq incorporating quantification uncertainty, Nature Methods (2017), advanced access.

- [r-openxlsx](https://ycphs.github.io/openxlsx/index.html)

> Schauberger P, Walker A (2022). openxlsx: Read, Write and Edit xlsx Files.

- [r-magrittr](https://magrittr.tidyverse.org)

> Bache S, Wickham H (2022). magrittr: A Forward-Pipe Operator for R.

- [r-dplyr](https://dplyr.tidyverse.org)

> Wickham H, François R, Henry L, Müller K (2022). dplyr: A Grammar of Data Manipulation.

- [r-purrr](http://purrr.tidyverse.org)

> Henry L, Wickham H (2022). purrr: Functional Programming Tools.

- [r-ggplot2](https://ggplot2.tidyverse.org/)

> H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.

- [r-data.table](https://CRAN.R-project.org/package=data.table)

> Matt Dowle and Arun Srinivasan (2021). data.table: Extension of `data.frame`. R package version 1.14.2.
  
- [r-gridExtra](https://CRAN.R-project.org/package=gridExtra)

> Baptiste Auguie (2017). gridExtra: Miscellaneous Functions for "Grid" Graphics. R package version 2.3.
  
- [r-RColorBrewer](https://CRAN.R-project.org/package=RColorBrewer)

> Erich Neuwirth (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-3. 

- [r-VennDiagram](https://doi.org/10.1186/1471-2105-12-35)

> Chen, H., Boutros, P.C. VennDiagram: a package for the generation of highly-customizable Venn and Euler diagrams in R. BMC Bioinformatics 12, 35 (2011). 

- [SAMtools](https://pubmed.ncbi.nlm.nih.gov/19505943/)

> Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9. doi: 10.1093/bioinformatics/btp352. Epub 2009 Jun 8. PubMed PMID: 19505943; PubMed Central PMCID: PMC2723002.

- [Skewer](https://doi.org/10.1186/1471-2105-15-182)

> Jiang, H., Lei, R., Ding, S.W. and Zhu, S. (2014) Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads. BMC Bioinformatics, 15, 182.



# Scripts

## Create references

<h3 name="create_refs_tools">Tools</h3>

- [GffRead](https://doi.org/10.12688/f1000research.23297.2)

> Pertea G and Pertea M. GFF Utilities: GffRead and GffCompare [version 2; peer review: 3 approved]. F1000Research 2020, 9:304.

- [AnnotationHub](https://doi.org/10.18129/B9.bioc.AnnotationHub)

> Morgan M, Shepherd L (2022). AnnotationHub: Client to access AnnotationHub resources. R package version 3.2.0.

- [cvbio](https://github.com/clintval/cvbio#cvbio)

- [LiftOver](https://doi.org/10.1093/nar/gkj144)

> A. S. Hinrichs, D. Karolchik, R. Baertsch, G. P. Barber, G. Bejerano, H. Clawson, M. Diekhans, T. S. Furey, R. A. Harte, F. Hsu, J. Hillman-Jackson, R. M. Kuhn, J. S. Pedersen, A. Pohl, B. J. Raney, K. R. Rosenbloom, A. Siepel, K. E. Smith, C. W. Sugnet, A. Sultan-Qurraie, D. J. Thomas, H. Trumbower, R. J. Weber, M. Weirauch, A. S. Zweig, D. Haussler, W. J. Kent, The UCSC Genome Browser Database: update 2006, Nucleic Acids Research, Volume 34, Issue suppl_1, 1 January 2006, Pages D590–D598.

- [BEDOPS](https://doi.org/10.1093/bioinformatics/bts277)

> Shane Neph, M. Scott Kuehn, Alex P. Reynolds, Eric Haugen, Robert E. Thurman, Audra K. Johnson, Eric Rynes, Matthew T. Maurano, Jeff Vierstra, Sean Thomas, Richard Sandstrom, Richard Humbert, John A. Stamatoyannopoulos, BEDOPS: high-performance genomic feature operations, Bioinformatics, Volume 28, Issue 14, 15 July 2012, Pages 1919–1920.

<h3 name="create_refs_databases">Databases</h3>

- [ChromHMM](https://doi.org/10.1038/nmeth.1906)

> Ernst, J., Kellis, M. ChromHMM: automating chromatin-state discovery and characterization. Nat Methods 9, 215–216 (2012).

- [ChromHMM human chromatin states](https://doi.org/10.1038/s41586-020-03145-z)

> Boix, C.A., James, B.T., Park, Y.P. et al. Regulatory genomic circuitry of human disease loci by integrative epigenomics. Nature 590, 300–307 (2021).

- [ChromHMM mouse chromatin states](https://doi.org/10.1038/s42003-021-01756-4)

> van der Velde, A., Fan, K., Tsuji, J. et al. Annotation of chromatin states in 66 complete mouse epigenomes during development. Commun Biol 4, 239 (2021).

- [CIS-BP motifs](https://doi.org/10.1016/j.cell.2014.08.009)

> Weirauch, M. T. et al. Determination and Inference of Eukaryotic Transcription Factor Sequence Specificity. Cell 158, 1431–1443 (2014).

- [ENCODE blacklisted regions](https://doi.org/10.1038/s41598-019-45839-z)

> Amemiya, H.M., Kundaje, A. & Boyle, A.P. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep 9, 9354 (2019). 

- [ENCODE integrative analysis](https://doi.org/10.1038/nature11247)

> The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012). 

- [ENCODE portal](https://doi.org/10.1093/nar/gkz1062)

> Luo, Y. et al. New developments on the Encyclopedia of DNA Elements (ENCODE) data portal. Nucleic Acids Research 48, D882–D889 (2020).

- [Ensembl genomes](https://doi.org/10.1093/nar/gkab1049)

> Fiona Cunningham *et al.*, Ensembl 2022, Nucleic Acids Research, Volume 50, Issue D1, 7 January 2022, Pages D988–D995.

- [HiHMM chromatin states](https://doi.org/10.1038/nature13415)

> Ho, J., Jung, Y., Liu, T. et al. Comparative analysis of metazoan chromatin organization. Nature 512, 449–452 (2014). 


## Create test datasets

<h3 name="create_test_ds_tools">Tools</h3>

- [nf-core/fetchngs](https://doi.org/10.1038/s41587-020-0439-x)

> Ewels, P.A., Peltzer, A., Fillinger, S. et al. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol 38, 276–278 (2020). 

- [seqtk](https://github.com/lh3/seqtk)


<h3 name="create_test_ds_databases">Databases</h3>

- [NCBI GEO](https://doi.org/10.1093/nar/gks1193)

> Tanya Barrett, Stephen E. Wilhite, Pierre Ledoux, Carlos Evangelista, Irene F. Kim, Maxim Tomashevsky, Kimberly A. Marshall, Katherine H. Phillippy, Patti M. Sherman, Michelle Holko, Andrey Yefanov, Hyeseung Lee, Naigong Zhang, Cynthia L. Robertson, Nadezhda Serova, Sean Davis, Alexandra Soboleva, NCBI GEO: archive for functional genomics data sets—update, Nucleic Acids Research, Volume 41, Issue D1, 1 January 2013, Pages D991–D995.


### Studies

- [Worm and human (GSE98758)](https://doi.org/10.1016/j.devcel.2018.07.006)

> Kolundzic E, Ofenbauer A, Bulut SI, Uyar B et al. FACT Sets a Barrier for Cell Fate Reprogramming in Caenorhabditis elegans and Human Cells. Dev Cell 2018 Sep 10;46(5):611-626.e12. PMID: 30078731

- [Fly (GSE149339)](https://doi.org/10.1101/gad.341768.120)

> Judd, J., Duarte, F. M. & Lis, J. T. Pioneer-like factor GAF cooperates with PBAP (SWI/SNF) and NURF (ISWI) to regulate transcription. Genes Dev. 35, 147–156 (2021).

- [Mouse (GSE193393)](https://doi.org/10.1093/nar/gkac584)

> Park SW, Kim J, Oh S, Lee J et al. PHF20 is crucial for epigenetic control of starvation-induced autophagy through enhancer activation. Nucleic Acids Res 2022 Aug 12;50(14):7856-7872. PMID: 35821310


# Research papers

## ATAC-Seq

- [First ATAC-Seq paper in *C. elegans*](https://doi.org/10.1101/gr.226233.117)

> Daugherty, A. C. et al. Chromatin accessibility dynamics reveal novel functional enhancers in C. elegans. Genome Research (2017).

- [Initial ATAC-Seq paper](https://doi.org/10.1038/NMETH.2688)

> Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature Methods 10, 1213–1218 (2013).

## Statistics

- [Tests for enrichment or depletion of GO categories](https://doi.org/10.1093/bioinformatics/btl633)

> Isabelle Rivals, Léon Personnaz, Lieng Taing, Marie-Claude Potier, Enrichment or depletion of a GO category within a class of genes: which test?, Bioinformatics, Volume 23, Issue 4, 15 February 2007, Pages 401–407.

- [Multiple testing adjustment with the False Discovery Rate](https://doi.org/10.1111/j.2517-6161.1995.tb02031.x)

> Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. _Journal of the Royal Statistical Society Series B_, *57*, 289-300.
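
For readers unfamiliar with the two statistical references above, the following Python sketch illustrates the general ideas: an upper-tail hypergeometric test for over-representation of a gene set (one of the tests compared by Rivals et al.) and the Benjamini-Hochberg adjustment of the resulting p-values. It is an illustration only and is not taken from the Cactus source code; the function names and the toy numbers are hypothetical, and the test Cactus actually applies may differ.

```python
from math import comb


def hypergeom_enrichment_pvalue(overlap, set_size, query_size, universe_size):
    """Upper-tail hypergeometric p-value, P(X >= overlap).

    overlap:       genes shared by the query list and the gene set
    set_size:      genes in the gene set (e.g. one GO category)
    query_size:    genes in the query list (e.g. DEGs)
    universe_size: all genes that could have been drawn
    """
    max_overlap = min(set_size, query_size)
    total = comb(universe_size, query_size)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Toy example (hypothetical numbers): three gene sets tested against
    # a query list of 50 genes drawn from a universe of 1000 genes.
    raw = [
        hypergeom_enrichment_pvalue(12, 100, 50, 1000),
        hypergeom_enrichment_pvalue(6, 100, 50, 1000),
        hypergeom_enrichment_pvalue(3, 40, 50, 1000),
    ]
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"p = {p:.3g}, FDR-adjusted = {q:.3g}")
```

The adjustment multiplies each p-value by the number of tests divided by its rank, then enforces monotonicity from the largest p-value downward, which is the same convention used by R's `p.adjust(method = "BH")`.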

GitHub Events

Total
  • Watch event: 4
  • Push event: 3
Last Year
  • Watch event: 4
  • Push event: 3