lagoon-mcl

LArGe cOmparative Omics Networks - Markov CLustering (LAGOON-MCL)

https://github.com/jroussea/lagoon-mcl

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.7%) to scientific vocabulary

Repository


Basic Info
  • Host: GitHub
  • Owner: jroussea
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 226 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 8
Created about 2 years ago · Last pushed about 1 year ago

README.md

LArGe cOmparative Omics Networks - Markov CLustering


Introduction

LAGOON-MCL is a FAIR pipeline that uses Nextflow as its workflow manager. Its main objective is to build putative protein families using sequence similarity networks and graph clustering. To explore the resulting clusters, LAGOON-MCL uses annotations (functional, taxonomic, etc.) provided by the user or obtained by the pipeline from Pfam. To take sequence exploration a step further, the ESM Metagenomic Atlas clustered at 30% identity can be scanned for information on the proteins' three-dimensional structure.

  • The first step builds a Sequence Similarity Network (SSN) by aligning all sequences against each other with Diamond BLASTp. The network is then clustered with the Markov CLustering algorithm (MCL).
  • The second [optional] step is to obtain information about the sequences (function, taxonomy, etc.). LAGOON-MCL can scan Pfam using MMseqs2.
  • The third step calculates a homogeneity score for each cluster based on sequence information (one score per cluster and per annotation).
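As a rough illustration of the first and third steps above, the sketch below defines two shell helpers built on `awk`. Neither is taken from LAGOON-MCL: `blast_to_abc` and `homogeneity` are hypothetical names, using the bitscore as edge weight is an assumption, and the majority-label score is only one plausible definition, since this README does not state the formula the pipeline uses. The file names in the commented commands are placeholders.

```bash
# Illustrative sketch only: these helpers are NOT part of LAGOON-MCL.

# Step 1 (assumption): convert Diamond tabular output (--outfmt 6 with its
# default 12 columns, bitscore last) into MCL "abc" edges: node, node, weight.
# Self-hits are dropped; the bitscore (column 12) is used as the edge weight.
blast_to_abc() {
    awk -F'\t' 'BEGIN { OFS = "\t" } $1 != $2 { print $1, $2, $12 }' "$1"
}

# Step 3 (assumption): one possible homogeneity score, namely the fraction of
# a cluster's sequences that carry the cluster's most frequent label.
# Input: TSV with columns cluster_id, label. Output: cluster_id, score.
homogeneity() {
    awk -F'\t' '
        { n[$1]++; c[$1 SUBSEP $2]++ }
        END {
            for (k in c) {
                split(k, p, SUBSEP)
                if (c[k] > best[p[1]]) best[p[1]] = c[k]
            }
            for (id in n) printf "%s\t%.2f\n", id, best[id] / n[id]
        }' "$1" | sort
}

# How the helpers could fit around the real tools (placeholder file names):
#   diamond makedb --in proteins.fasta -d proteins
#   diamond blastp -d proteins -q proteins.fasta --outfmt 6 -o alignments.tsv
#   blast_to_abc alignments.tsv > network.abc
#   mcl network.abc --abc -I 2.0 -o clusters.txt
```

With this definition, a cluster whose sequences all share one label scores 1.00, and the score decreases as the labels become more mixed.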

Start with LAGOON-MCL

  1. Install Nextflow

  2. Install Singularity

  3. Download the pipeline

```bash
git clone https://github.com/jroussea/lagoon-mcl.git
```

  4. Build Singularity images

The tool-specific containers (SeqKit2, MCL, Diamond and MMseqs2) are built from BioContainers. The LAGOON-MCL container (with R, Python, packages and modules) is built from an image available on Docker Hub; its Dockerfile is available in the repository (containers/lagoon-mcl/1.1.0/Dockerfile).

```bash
# SeqKit2 v2.9.0
wget -O containers/seqkit/2.9.0/seqkit.sif https://depot.galaxyproject.org/singularity/seqkit:2.9.0--h9ee0642_0

# Diamond v2.1.10
wget -O containers/diamond/2.1.10/diamond.sif https://depot.galaxyproject.org/singularity/diamond:2.1.10--h43eeafb_2

# MCL v22.282
wget -O containers/mcl/22.282/mcl.sif https://depot.galaxyproject.org/singularity/mcl:22.282--pl5321h031d066_2

# MMseqs2 v15.6f452
wget -O containers/mmseqs2/15.6f452/mmseqs.sif https://depot.galaxyproject.org/singularity/mmseqs2:15.6f452--pl5321h6a68c12_3

# LAGOON-MCL v1.1.0
singularity build --fakeroot containers/lagoon-mcl/1.1.0/lagoon-mcl.sif docker://jroussea/lagoon-mcl:latest
```
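Before moving on, it can help to confirm that all five images are in place, since a failed `wget -O` download can leave an empty file behind. The helper below is a minimal sketch and not part of the pipeline: `check_images` is a hypothetical name, and the paths are simply those used in the commands above, resolved against a root directory passed as the first argument (the current directory by default).

```bash
# Illustrative helper, not part of LAGOON-MCL: report which of the image
# files named in the build commands are missing or empty under a root directory.
check_images() {
    local root="${1:-.}" missing=0 img
    for img in \
        containers/seqkit/2.9.0/seqkit.sif \
        containers/diamond/2.1.10/diamond.sif \
        containers/mcl/22.282/mcl.sif \
        containers/mmseqs2/15.6f452/mmseqs.sif \
        containers/lagoon-mcl/1.1.0/lagoon-mcl.sif
    do
        # -s is true only for files that exist and are non-empty
        if [ ! -s "$root/$img" ]; then
            echo "missing: $img"
            missing=1
        fi
    done
    return "$missing"
}
```

Running `check_images` from the repository root prints one line per missing image and returns a non-zero status if any is absent.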

  5. Download and build the databases

```bash
cd tool-kit
chmod +x buildalpahfolddb.sh buildpfamdb.sh

# Download and build Pfam
./buildpfamdb.sh

# Download and build AlphaFoldDB
./buildalpahfolddb.sh
```

Default path for Pfam database: lagoon-mcl/database/pfamDB
Default path for AlphaFold database: lagoon-mcl/database/alaphafoldDB

  6. Test the pipeline

```bash
chmod +x bin/*
nextflow run main.nf -profile test,singularity
```

  7. Run your analysis

```bash
nextflow run main.nf -profile custom,singularity [-c <institute_config_file>]
```

Documentation

For more information about LAGOON-MCL, please read the documentation.

Contributions and Support

LAGOON-MCL is an actively supported and developed pipeline. Please use the issue tracker to report bugs and the GitHub discussions for questions, comments, feature requests, etc.

Acknowledgments

LAGOON-MCL (LArGe cOmparative Omics Networks - Markov CLustering) is developed by the Atelier de BioInformatique team of the Institut de Systématique, Évolution, Biodiversité - UMR 7205 (Muséum National d'Histoire Naturelle, Paris, France).
LAGOON-MCL is a new version of LAGOON developed by Dylan Klein.

Citations

If you use LAGOON-MCL, the references to cite can be found in CITATION.md.

Owner

  • Login: jroussea
  • Kind: user

Citation (CITATION.md)

# LAGOON-MCL: Citations

## [Nextflow](https://www.nextflow.io/)

> Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316‑319. [https://doi.org/10.1038/nbt.3820](https://doi.org/10.1038/nbt.3820)

## Pipeline tools

* [**Diamond**](https://github.com/bbuchfink/diamond)

> Buchfink, B., Reuter, K., & Drost, H.-G. (2021). Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods, 18(4), Article 4. [https://doi.org/10.1038/s41592-021-01101-x](https://doi.org/10.1038/s41592-021-01101-x)

* [**Markov CLustering algorithm**](https://micans.org/mcl/)

> Enright, A. J., Van Dongen, S., & Ouzounis, C. A. (2002). An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research, 30(7), 1575‑1584.

> Van Dongen, S. (2008). Graph Clustering Via a Discrete Uncoupling Process. SIAM Journal on Matrix Analysis and Applications, 30(1), 121‑141. [https://doi.org/10.1137/040608635](https://doi.org/10.1137/040608635)

> van Dongen, S., & Abreu-Goodger, C. (2012). Using MCL to Extract Clusters from Networks. In J. van Helden, A. Toussaint, & D. Thieffry (Eds.), Bacterial Molecular Networks: Methods and Protocols (pp. 281‑295). Springer. [https://doi.org/10.1007/978-1-61779-361-5_15](https://doi.org/10.1007/978-1-61779-361-5_15)

* [**SeqKit2**](https://bioinf.shenwei.me/seqkit/)

> Shen, W., Sipos, B., & Zhao, L. (2024). SeqKit2: A Swiss army knife for sequence and alignment processing. iMeta, 3(3), e191. [https://doi.org/10.1002/imt2.191](https://doi.org/10.1002/imt2.191)

* [**MMseqs2**](https://github.com/soedinglab/MMseqs2)

> Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), Article 11. [https://doi.org/10.1038/nbt.3988](https://doi.org/10.1038/nbt.3988)

## Pipeline databases

* [**Pfam database**](http://pfam.xfam.org/)

> Mistry, J., Chuguransky, S., Williams, L., Qureshi, M., Salazar, G. A., Sonnhammer, E. L. L., Tosatto, S. C. E., Paladin, L., Raj, S., Richardson, L. J., Finn, R. D., & Bateman, A. (2021). Pfam: The protein families database in 2021. Nucleic Acids Research, 49(D1), D412‑D419. [https://doi.org/10.1093/nar/gkaa913](https://doi.org/10.1093/nar/gkaa913)

* **AlphaFold clusters database**

> Barrio-Hernandez, I., Yeo, J., Jänes, J., Mirdita, M., Gilchrist, C. L. M., Wein, T., Varadi, M., Velankar, S., Beltrao, P., & Steinegger, M. (2023). Clustering predicted structures at the scale of the known protein universe. Nature, 622(7983), 637‑645. [https://doi.org/10.1038/s41586-023-06510-w](https://doi.org/10.1038/s41586-023-06510-w)

> Varadi, M., Bertoni, D., Magana, P., Paramval, U., Pidruchna, I., Radhakrishnan, M., Tsenkov, M., Nair, S., Mirdita, M., Yeo, J., Kovalevskiy, O., Tunyasuvunakool, K., Laydon, A., Žídek, A., Tomlinson, H., Hariharan, D., Abrahamson, J., Green, T., Jumper, J., … Velankar, S. (2024). AlphaFold Protein Structure Database in 2024: Providing structure coverage for over 214 million protein sequences. Nucleic Acids Research, 52(D1), D368‑D375. [https://doi.org/10.1093/nar/gkad1011](https://doi.org/10.1093/nar/gkad1011)

> Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., … Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), Article 7873. [https://doi.org/10.1038/s41586-021-03819-2](https://doi.org/10.1038/s41586-021-03819-2)

## Software packaging / containerisation tools

* [**Singularity**](https://sylabs.io/singularity/)

> Kurtzer, G. M., Sochat, V., & Bauer, M. W. (2017). Singularity: Scientific containers for mobility of compute. PLOS ONE, 12(5), e0177459. [https://doi.org/10.1371/journal.pone.0177459](https://doi.org/10.1371/journal.pone.0177459)

* [**BioContainers**](https://biocontainers.pro/)

> da Veiga Leprevost, F., Grüning, B. A., Alves Aflitos, S., Röst, H. L., Uszkoreit, J., Barsnes, H., Vaudel, M., Moreno, P., Gatto, L., Weber, J., Bai, M., Jimenez, R. C., Sachsenberg, T., Pfeuffer, J., Vera Alvarez, R., Griss, J., Nesvizhskii, A. I., & Perez-Riverol, Y. (2017). BioContainers: An open-source and community-driven framework for software standardization. Bioinformatics (Oxford, England), 33(16), 2580‑2582. [https://doi.org/10.1093/bioinformatics/btx192](https://doi.org/10.1093/bioinformatics/btx192)

* [**Docker**](https://www.docker.com/)

> Merkel, D. (2014). Docker: Lightweight Linux containers for consistent development and deployment. Linux J., 2014(239), 2:2.

GitHub Events

Total
  • Watch event: 1
  • Delete event: 11
  • Public event: 1
  • Push event: 72
  • Gollum event: 39
  • Pull request review event: 4
  • Pull request event: 8
  • Create event: 7
Last Year
  • Watch event: 1
  • Delete event: 11
  • Public event: 1
  • Push event: 72
  • Gollum event: 39
  • Pull request review event: 4
  • Pull request event: 8
  • Create event: 7

Dependencies

containers/lagoon-mcl/1.1.0/Dockerfile docker
  • condaforge/miniforge3 latest build