lagoon-mcl
LArGe cOmparative Omics Networks - Markov CLustering (LAGOON-MCL)
LArGe cOmparative Omics Networks - Markov CLustering
Introduction
LAGOON-MCL is a FAIR pipeline that uses Nextflow as its workflow manager. Its main objective is to build putative protein families using sequence similarity networks and graph clustering. To explore the resulting clusters, LAGOON-MCL uses annotations (functional, taxonomic, ...) provided by the user or obtained by the pipeline from Pfam. To take sequence exploration a step further, the ESM Metagenomic Atlas clustered at 30% identity can be scanned for information on the proteins' three-dimensional structure.
- The first step builds a Sequence Similarity Network (SSN) by aligning all the sequences against themselves with Diamond BLASTp. The network is then clustered with the Markov CLustering algorithm (MCL).
- The second [optional] step gathers information about the sequences (function, taxonomy, etc.). LAGOON-MCL can scan Pfam using MMseqs2.
- The third step calculates a homogeneity score for each cluster based on sequence information (the homogeneity score is calculated separately for each annotation).
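To give an idea of what the third step produces, here is a minimal Python sketch of a per-cluster homogeneity score. This README does not give the formula, so the definition used here (the share of annotated sequences in a cluster that carry the cluster's most common label) is an assumption made for illustration and may differ from what the pipeline computes; see the documentation for the actual definition. The function name `homogeneity_by_cluster` and the sample identifiers are hypothetical.

```python
from collections import Counter, defaultdict


def homogeneity_by_cluster(cluster_of, label_of):
    """Return {cluster_id: score} for one annotation type.

    ASSUMPTION: score = share of annotated sequences in the cluster that
    carry the most common label. This is an illustrative definition, not
    necessarily the one implemented in LAGOON-MCL.
    Clusters without any annotated sequence get None.
    """
    labels = defaultdict(list)
    for seq_id, cluster_id in cluster_of.items():
        # Sequences without an annotation are recorded as None
        labels[cluster_id].append(label_of.get(seq_id))

    scores = {}
    for cluster_id, values in labels.items():
        annotated = [v for v in values if v is not None]
        if not annotated:
            scores[cluster_id] = None
            continue
        top_count = Counter(annotated).most_common(1)[0][1]
        scores[cluster_id] = top_count / len(annotated)
    return scores


# Hypothetical cluster membership (sequence -> cluster) and Pfam labels
clusters = {"s1": "c1", "s2": "c1", "s3": "c1", "s4": "c2", "s5": "c3"}
pfam = {"s1": "PF00001", "s2": "PF00001", "s3": "PF00002", "s4": "PF00003"}

scores = homogeneity_by_cluster(clusters, pfam)
for cluster_id in sorted(scores):
    print(cluster_id, scores[cluster_id])
```

Because the score is computed per annotation, the same function would be called once for each annotation source (for example once with Pfam labels and once with taxonomic labels).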
Start with LAGOON-MCL
- Install Nextflow
- Install Singularity
- Download the pipeline
```bash
git clone https://github.com/jroussea/lagoon-mcl.git
```
- Build Singularity images
The tool-specific containers (SeqKit2, MCL, Diamond and MMseqs2) are built from BioContainers. The LAGOON-MCL container (with R, Python, packages and modules) is built from a container available on Docker Hub; the Dockerfile is available here.
```bash
# SeqKit2 v2.9.0
wget -O containers/seqkit/2.9.0/seqkit.sif https://depot.galaxyproject.org/singularity/seqkit:2.9.0--h9ee0642_0
# Diamond v2.1.10
wget -O containers/diamond/2.1.10/diamond.sif https://depot.galaxyproject.org/singularity/diamond:2.1.10--h43eeafb_2
# MCL v22.282
wget -O containers/mcl/22.282/mcl.sif https://depot.galaxyproject.org/singularity/mcl:22.282--pl5321h031d066_2
# MMseqs2 v15.6f452
wget -O containers/mmseqs2/15.6f452/mmseqs.sif https://depot.galaxyproject.org/singularity/mmseqs2:15.6f452--pl5321h6a68c12_3
# LAGOON-MCL v1.1.0
singularity build --fakeroot containers/lagoon-mcl/1.1.0/lagoon-mcl.sif docker://jroussea/lagoon-mcl:latest
```
- Download and build the databases
```bash
cd tool-kit
chmod +x buildalpahfolddb.sh buildpfamdb.sh
# Download and build Pfam
./buildpfamdb.sh
# Download and build AlphaFoldDB
./buildalpahfolddb.sh
```
Default path for Pfam database: lagoon-mcl/database/pfamDB \
Default path for AlphaFold database: lagoon-mcl/database/alaphafoldDB
- Test the pipeline
```bash
chmod +x bin/*
nextflow run main.nf -profile test,singularity
```
- Run your analysis
```bash
nextflow run main.nf -profile custom,singularity [-c <institute_config_file>]
```
Documentation
For more information about LAGOON-MCL, please read the documentation.
Contributions and Support
LAGOON-MCL is an actively supported and developed pipeline. Please use the issue tracker to report malfunctions and the GitHub discussions for questions, comments, feature requests, etc.
Acknowledgments
LArGe cOmparative Omics Networks (LAGOON) Markov CLustering algorithm (MCL) is developed by the Atelier de BioInformatique team of the Institut de Systématique, Évolution, Biodiversité - UMR 7205 (Muséum National d'Histoire Naturelle, Paris, France).
LAGOON-MCL is a new version of LAGOON developed by Dylan Klein.
Citations
If you use LAGOON-MCL, please cite the references listed in CITATION.md.
Owner
- Login: jroussea
- Kind: user
- Repositories: 1
- Profile: https://github.com/jroussea
Citation (CITATION.md)
# LAGOON-MCL: Citations

## [Nextflow](https://www.nextflow.io/)

> Di Tommaso, P., Chatzou, M., Floden, E. W., Barja, P. P., Palumbo, E., & Notredame, C. (2017). Nextflow enables reproducible computational workflows. Nature Biotechnology, 35(4), 316‑319. [https://doi.org/10.1038/nbt.3820](https://doi.org/10.1038/nbt.3820)

## Pipeline tools

* [**Diamond**](https://github.com/bbuchfink/diamond)

  > Buchfink, B., Reuter, K., & Drost, H.-G. (2021). Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods, 18(4), Article 4. [https://doi.org/10.1038/s41592-021-01101-x](https://doi.org/10.1038/s41592-021-01101-x)

* [**Markov CLustering algorithm**](https://micans.org/mcl/)

  > Enright, A. J., Van Dongen, S., & Ouzounis, C. A. (2002). An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research, 30(7), 1575‑1584.

  > Van Dongen, S. (2008). Graph Clustering Via a Discrete Uncoupling Process. SIAM Journal on Matrix Analysis and Applications, 30(1), 121‑141. [https://doi.org/10.1137/040608635](https://doi.org/10.1137/040608635)

  > van Dongen, S., & Abreu-Goodger, C. (2012). Using MCL to Extract Clusters from Networks. In J. van Helden, A. Toussaint, & D. Thieffry (Eds.), Bacterial Molecular Networks: Methods and Protocols (p. 281‑295). Springer. [https://doi.org/10.1007/978-1-61779-361-5_15](https://doi.org/10.1007/978-1-61779-361-5_15)

* [**SeqKit2**](https://bioinf.shenwei.me/seqkit/)

  > Shen, W., Sipos, B., & Zhao, L. (2024). SeqKit2: A Swiss army knife for sequence and alignment processing. iMeta, 3(3), e191. [https://doi.org/10.1002/imt2.191](https://doi.org/10.1002/imt2.191)

* [**MMseqs2**](https://github.com/soedinglab/MMseqs2)

  > Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), Article 11. [https://doi.org/10.1038/nbt.3988](https://doi.org/10.1038/nbt.3988)

## Pipeline databases

* [**Pfam database**](http://pfam.xfam.org/)

  > Mistry, J., Chuguransky, S., Williams, L., Qureshi, M., Salazar, G. A., Sonnhammer, E. L. L., Tosatto, S. C. E., Paladin, L., Raj, S., Richardson, L. J., Finn, R. D., & Bateman, A. (2021). Pfam: The protein families database in 2021. Nucleic Acids Research, 49(D1), D412‑D419. [https://doi.org/10.1093/nar/gkaa913](https://doi.org/10.1093/nar/gkaa913)

* **AlphaFold clusters database**

  > Barrio-Hernandez, I., Yeo, J., Jänes, J., Mirdita, M., Gilchrist, C. L. M., Wein, T., Varadi, M., Velankar, S., Beltrao, P., & Steinegger, M. (2023). Clustering predicted structures at the scale of the known protein universe. Nature, 622(7983), 637‑645. [https://doi.org/10.1038/s41586-023-06510-w](https://doi.org/10.1038/s41586-023-06510-w)

  > Varadi, M., Bertoni, D., Magana, P., Paramval, U., Pidruchna, I., Radhakrishnan, M., Tsenkov, M., Nair, S., Mirdita, M., Yeo, J., Kovalevskiy, O., Tunyasuvunakool, K., Laydon, A., Žídek, A., Tomlinson, H., Hariharan, D., Abrahamson, J., Green, T., Jumper, J., … Velankar, S. (2024). AlphaFold Protein Structure Database in 2024: Providing structure coverage for over 214 million protein sequences. Nucleic Acids Research, 52(D1), D368‑D375. [https://doi.org/10.1093/nar/gkad1011](https://doi.org/10.1093/nar/gkad1011)

  > Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., … Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), Article 7873. [https://doi.org/10.1038/s41586-021-03819-2](https://doi.org/10.1038/s41586-021-03819-2)

## Software packaging / containerisation tools

* [**Singularity**](https://sylabs.io/singularity/)

  > Kurtzer, G. M., Sochat, V., & Bauer, M. W. (2017). Singularity: Scientific containers for mobility of compute. PLOS ONE, 12(5), e0177459. [https://doi.org/10.1371/journal.pone.0177459](https://doi.org/10.1371/journal.pone.0177459)

* [**BioContainers**](https://biocontainers.pro/)

  > da Veiga Leprevost, F., Grüning, B. A., Alves Aflitos, S., Röst, H. L., Uszkoreit, J., Barsnes, H., Vaudel, M., Moreno, P., Gatto, L., Weber, J., Bai, M., Jimenez, R. C., Sachsenberg, T., Pfeuffer, J., Vera Alvarez, R., Griss, J., Nesvizhskii, A. I., & Perez-Riverol, Y. (2017). BioContainers: An open-source and community-driven framework for software standardization. Bioinformatics (Oxford, England), 33(16), 2580‑2582. [https://doi.org/10.1093/bioinformatics/btx192](https://doi.org/10.1093/bioinformatics/btx192)

* [**Docker**](https://www.docker.com/)

  > Merkel, D. (2014). Docker: Lightweight Linux containers for consistent development and deployment. Linux J., 2014(239), 2:2.
Dependencies
- condaforge/miniforge3 latest build