multiplesequencealign
A pipeline to run and systematically evaluate Multiple Sequence Alignment (MSA) methods.
Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 10 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.4%) to scientific vocabulary
Keywords
Repository
Basic Info
- Host: GitHub
- Owner: nf-core
- License: mit
- Language: Nextflow
- Default Branch: master
- Homepage: https://nf-co.re/multiplesequencealign
- Size: 19.6 MB
Statistics
- Stars: 35
- Watchers: 163
- Forks: 14
- Open Issues: 12
- Releases: 2
Topics
Metadata Files
README.md
Introduction
Use nf-core/multiplesequencealign to:
- Deploy one (or many) of the most popular Multiple Sequence Alignment (MSA) tools.
- Benchmark MSA tools (and their inputs) using various metrics.
Main steps:
- Inputs summary (Optional): Computation of summary statistics on the input files (e.g., average sequence similarity across the input sequences, their length, pLDDT extraction if available).
- Guide Tree (Optional): Renders a guide tree with a chosen tool (list available in usage). Some aligners use guide trees to define the order in which the sequences are aligned.
- Align (Required): Aligns the sequences with a chosen tool (list available in usage).
- Evaluate (Optional): Evaluates the generated alignments with different metrics: Sum Of Pairs (SoP), Total Column score (TC), iRMSD, Total Consistency Score (TCS), etc.
- Report (Optional): Reports the collected information of the runs in a Shiny app and a summary table in MultiQC. Optionally, it can also render the Foldmason MSA visualization in HTML format.
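As a rough illustration of the kind of statistics the inputs summary step describes (number of sequences and their average length), here is a minimal shell sketch. It is not the pipeline's own implementation; the `fasta_stats` helper and the `example.fa` file are hypothetical names introduced only for this example.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical scratch directory and toy FASTA file for this sketch.
workdir=$(mktemp -d)
cat > "$workdir/example.fa" <<'EOF'
>seq1
MKTAYIAK
QRQISFVK
>seq2
MKTAYIAKQR
>seq3
MKT
EOF

# Count sequences (header lines) and average the residue count per sequence.
# Sequences may span several lines, so lengths are summed across all of them.
fasta_stats() {
  awk '
    /^>/ { n++; next }
    { gsub(/[ \t\r]/, ""); total += length($0) }
    END { if (n > 0) printf "sequences=%d mean_length=%.2f\n", n, total / n }
  ' "$1"
}

fasta_stats "$workdir/example.fa"
# prints: sequences=3 mean_length=9.67
```

The pipeline's real summary also covers sequence similarity and pLDDT extraction, which this sketch does not attempt.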
More introductory material: bytesize talk, nextflow summit talk, poster.

Usage
[!NOTE] If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with `-profile test` before running the workflow on actual data.
Quick start - test run
To get a feeling of what the pipeline does, run:
(You don't need to download or provide any file, try it!)
```bash
nextflow run nf-core/multiplesequencealign \
   -profile test_tiny,docker \
   --outdir results
```
If you want to see what a more complete run looks like, you can try:
```bash
nextflow run nf-core/multiplesequencealign \
   -profile test,docker \
   --outdir results
```
How to set up an easy run:
[!NOTE] Many more use-case examples are available under FAQs.
Input data
You can provide either (or both) a fasta file or a set of protein structures.
Alternatively, you can provide a samplesheet and a toolsheet.
See below how to provide them.
Find some example input data here
CASE 1: One input dataset, one tool.
If you only have one dataset and want to align it using one specific MSA tool (e.g. FAMSA or FOLDMASON), you can run the pipeline with one single command.
Is your input a fasta file (example)? Then:
```bash
nextflow run nf-core/multiplesequencealign \
   -profile easy_deploy,docker \
   --seqs <YOUR_FASTA.fa> \
   --aligner FAMSA \
   --outdir outdir
```
Is your input a directory where your PDB files are stored (example)? Then:
```bash
nextflow run nf-core/multiplesequencealign \
   -profile easy_deploy,docker \
   --pdbs_dir <PATH_TO_YOUR_PDB_DIR> \
   --aligner FOLDMASON \
   --outdir outdir
```
FAQ: Which tools are available?
Check the list here: available tools.
FAQ: Can I use both `--seqs` and `--pdbs_dir`?
Yes, go for it! This might be useful if you want a structural evaluation of a sequence-based aligner, for instance.
FAQ: Can I also specify which guide tree to use?
Yes, use the `--tree` flag. More info: usage and parameters.
FAQ: Can I specify the arguments of the tools (tree and aligner)?
Yes, use the `--args_tree` and `--args_aligner` flags. More info: usage and parameters.
CASE 2: Multiple datasets, multiple tools.
```bash
nextflow run nf-core/multiplesequencealign \
   -profile test,docker \
   --input <samplesheet.csv> \
   --tools <toolsheet.csv> \
   --outdir outdir
```
You need 2 input files:
- samplesheet (your datasets)
- toolsheet (which tools you want to use).
What is a samplesheet?
The sample sheet defines the input datasets (sequences, structures, etc.) that the pipeline will process.

A minimal version:

```csv
id,fasta
seatoxin,seatoxin.fa
toxin,toxin.fa
```

A more complete one:

```csv
id,fasta,reference,optional_data
seatoxin,seatoxin.fa,seatoxin-ref.fa,seatoxin_structures
toxin,toxin.fa,toxin-ref.fa,toxin_structures
```

Each row represents a set of sequences (in this case the seatoxin and toxin protein families) to be aligned and the associated (if available) reference alignments and dependency files (this can be anything from protein structure or any other information you would want to use in your favourite MSA tool).

Please check: usage.

> [!NOTE]
> The only required input is the id column and either fasta or optional_data.

What is a toolsheet?
The toolsheet specifies which combination of tools will be deployed and benchmarked in the pipeline.

Each line defines a combination of guide tree and multiple sequence aligner to run with the respective arguments to be used.

The only required field is `aligner`. The fields `tree`, `args_tree` and `args_aligner` are optional and can be left empty.

A minimal version:

```csv
tree,args_tree,aligner,args_aligner
,,FAMSA,
```

This will run the FAMSA aligner.

A more complex one:

```csv
tree,args_tree,aligner,args_aligner
FAMSA, -gt upgma -medoidtree, FAMSA,
, ,TCOFFEE,
FAMSA,,REGRESSIVE,
```

This will run, in parallel:

- the FAMSA guidetree with the arguments -gt upgma -medoidtree. This guidetree is then used as input for the FAMSA aligner.
- the TCOFFEE aligner
- the FAMSA guidetree with default arguments. This guidetree is then used as input for the REGRESSIVE aligner.

Please check: usage.

> [!NOTE]
> The only required input is `aligner`.

For more details on more advanced runs: usage documentation and the parameter documentation.
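The two sheets described above can be put together by hand, but a short script helps when there are many datasets. The following is a minimal sketch, assuming one FASTA file per dataset with a `.fa` extension. The `msa_demo` directory, the placeholder FASTA contents and the `check_columns` helper are hypothetical and only serve the example.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical demo directory with two placeholder FASTA files.
demo_dir="msa_demo"
mkdir -p "$demo_dir/fasta"
printf '>a\nMKT\n' > "$demo_dir/fasta/seatoxin.fa"
printf '>b\nMKV\n' > "$demo_dir/fasta/toxin.fa"

# Samplesheet: one row per FASTA file; the file name without .fa becomes the id.
samplesheet="$demo_dir/samplesheet.csv"
echo "id,fasta" > "$samplesheet"
for fasta in "$demo_dir"/fasta/*.fa; do
  id=$(basename "$fasta" .fa)
  echo "$id,$fasta" >> "$samplesheet"
done

# Toolsheet: FAMSA on its own, and a FAMSA guide tree feeding REGRESSIVE
# (both rows are taken from the examples above).
toolsheet="$demo_dir/toolsheet.csv"
cat > "$toolsheet" <<'EOF'
tree,args_tree,aligner,args_aligner
,,FAMSA,
FAMSA,,REGRESSIVE,
EOF

# Sanity check: every row must have as many comma-separated fields as the header.
# This simple check does not understand quoted fields that contain commas.
check_columns() {
  awk -F',' '
    NR == 1 { n = NF; next }
    NF != n { bad = 1; printf "line %d: expected %d fields, got %d\n", NR, n, NF }
    END { exit bad }
  ' "$1"
}

check_columns "$samplesheet"
check_columns "$toolsheet"
echo "Sheets written to $demo_dir; pass them with --input and --tools."
```

The paths in the samplesheet are written relative to the current directory; switch to absolute paths if your setup launches the pipeline from elsewhere.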
[!WARNING] Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration except for parameters; see docs.
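A parameters file is a convenient way to keep a run reproducible, and it avoids shell-quoting problems with tool arguments that contain spaces or leading dashes. The sketch below assumes that the keys mirror the CLI flags shown earlier (`seqs`, `tree`, `args_tree`, `aligner`, `outdir`). The file name `params.yaml` and the input `seatoxin.fa` are placeholders, and the script only prints the launch command rather than running it.

```bash
#!/usr/bin/env bash
set -eu

# Write the pipeline parameters to a YAML file (the file name is an arbitrary choice).
cat > params.yaml <<'EOF'
seqs: "seatoxin.fa"
tree: "FAMSA"
args_tree: "-gt upgma -medoidtree"
aligner: "FAMSA"
outdir: "results"
EOF

# Show the command that would launch the run with these parameters.
echo "nextflow run nf-core/multiplesequencealign -profile easy_deploy,docker -params-file params.yaml"
```

Configuration that is not a pipeline parameter (for example resource limits) still belongs in a config file passed with `-c`.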
Pipeline resources
Which resources is the pipeline using? You can find the default resources used in base.config.
If you are using specific profiles, e.g. test, these will overwrite the defaults.
If you want to modify the needed resources, please refer to usage.
Pipeline output
Example results: results tab on the nf-core website pipeline page. For more details: output documentation.
Extending the pipeline
For details on how to add your favourite guide tree, MSA or evaluation step in nf-core/multiplesequencealign please refer to the extending documentation.
Credits
nf-core/multiplesequencealign was originally written by Luisa Santus (@luisas) and Jose Espinosa-Carrasco (@JoseEspinosa) from The Comparative Bioinformatics Group at The Centre for Genomic Regulation, Spain.
The following people have significantly contributed to the development of the pipeline and its modules: Leon Rauschning (@lrauschning), Alessio Vignoli (@alessiovignoli), Igor Trujnara (@itrujnara) and Leila Mansouri (@l-mansouri).
Contributions and Support
If you would like to contribute to this pipeline, please see the contributing guidelines.
For further information or help, don't hesitate to get in touch on the Slack #multiplesequencealign channel (you can join with this invite).
Citations
If you use nf-core/multiplesequencealign for your analysis, please cite it using the following doi: 10.5281/zenodo.13889386
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
You can cite the nf-core publication as follows:
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
Owner
- Name: nf-core
- Login: nf-core
- Kind: organization
- Email: core@nf-co.re
- Website: http://nf-co.re
- Twitter: nf_core
- Repositories: 84
- Profile: https://github.com/nf-core
A community effort to collect a curated set of analysis pipelines built using Nextflow.
Citation (CITATIONS.md)
# nf-core/multiplesequencealign: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [3DCoffee](https://pubmed.ncbi.nlm.nih.gov/15201059/)

> O'Sullivan O, Suhre K, Abergel C, Higgins DG, Notredame C. 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J Mol Biol. 2004 Jul 2;340(2):385-95. doi: 10.1016/j.jmb.2004.04.058. PMID: 15201059.

- [ClustalO](https://pubmed.ncbi.nlm.nih.gov/21988835/)

> Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011 Oct 11;7:539. doi: 10.1038/msb.2011.75. PMID: 21988835; PMCID: PMC3261699.

- [csvtk](https://github.com/shenwei356/csvtk)

- [FAMSA](https://pubmed.ncbi.nlm.nih.gov/27670777/)

> Deorowicz S, Debudaj-Grabysz A, Gudyś A. FAMSA: Fast and accurate multiple sequence alignment of huge protein families. Sci Rep. 2016 Sep 27;6:33964. doi: 10.1038/srep33964. PMID: 27670777; PMCID: PMC5037421.

- [FoldMason](https://www.biorxiv.org/content/10.1101/2024.08.01.606130v3)

> Cameron Laurence Mathison Gilchrist, Milot Mirdita, Martin Steinegger: Multiple Protein Structure Alignment at Scale with FoldMason. bioRxiv 2024.08.01.606130; doi: https://doi.org/10.1101/2024.08.01.606130

- [Kalign3](https://pubmed.ncbi.nlm.nih.gov/31665271/)

> Lassmann T. Kalign 3: multiple sequence alignment of large data sets. Bioinformatics. 2019 Oct 26;36(6):1928–9. doi: 10.1093/bioinformatics/btz795. Epub ahead of print. PMID: 31665271; PMCID: PMC7703769.

- [learnMSA](https://pubmed.ncbi.nlm.nih.gov/36399060/)

> Becker F, Stanke M. learnMSA: learning and aligning large protein families. Gigascience. 2022 Nov 18;11:giac104. doi: 10.1093/gigascience/giac104. PMID: 36399060; PMCID: PMC9673500.

- [MAFFT](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC135756/)

> Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002 Jul 15;30(14):3059-66. doi: 10.1093/nar/gkf436. PMID: 12136088; PMCID: PMC135756.

- [MAGUS](https://pubmed.ncbi.nlm.nih.gov/33252662/)

> Smirnov V, Warnow T. MAGUS: Multiple sequence Alignment using Graph clUStering. Bioinformatics. 2021 Jul 19;37(12):1666-1672. doi: 10.1093/bioinformatics/btaa992. PMID: 33252662; PMCID: PMC8289385.

- FastQC

> Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

- [mTM-align](https://pubmed.ncbi.nlm.nih.gov/29281009/)

> Dong R, Peng Z, Zhang Y, Yang J. mTM-align: an algorithm for fast and accurate multiple protein structure alignment. Bioinformatics. 2018 May 15;34(10):1719-1725. doi: 10.1093/bioinformatics/btx828. PMID: 29281009; PMCID: PMC5946935.

- [Muscle5](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9664440/)

> Edgar RC. Muscle5: High-accuracy alignment ensembles enable unbiased assessments of sequence homology and phylogeny. Nat Commun. 2022 Nov 15;13(1):6968. doi: 10.1038/s41467-022-34630-w. PMID: 36379955; PMCID: PMC9664440.

- [T-Coffee](https://pubmed.ncbi.nlm.nih.gov/10964570/)

> Notredame C, Higgins DG, Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000 Sep 8;302(1):205-17. doi: 10.1006/jmbi.2000.4042. PMID: 10964570.

- [UPP](https://academic.oup.com/bioinformatics/article/39/1/btad007/6982552)

> Park M, Ivanovic S, Chu G, Shen C, Warnow T. UPP2: fast and accurate alignment of datasets with fragmentary sequences. Bioinformatics. 2023 Jan 1;39(1):btad007. doi: 10.1093/bioinformatics/btad007. PMID: 36625535; PMCID: PMC9846425.

## Python packages

- [Biopython](https://pubmed.ncbi.nlm.nih.gov/19304878/)

> Cock PJ, Antao T, Chang JT, Chapman BA, Cox CJ, Dalke A, Friedberg I, Hamelryck T, Kauff F, Wilczynski B, de Hoon MJ. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009 Jun 1;25(11):1422-3. doi: 10.1093/bioinformatics/btp163. Epub 2009 Mar 20. PMID: 19304878; PMCID: PMC2682512.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

> Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

> Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

> da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

> Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

> Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.
GitHub Events
Total
- Create event: 8
- Release event: 2
- Issues event: 35
- Watch event: 17
- Delete event: 1
- Issue comment event: 60
- Push event: 70
- Pull request event: 69
- Pull request review event: 90
- Pull request review comment event: 115
- Fork event: 3
Last Year
- Create event: 8
- Release event: 2
- Issues event: 35
- Watch event: 17
- Delete event: 1
- Issue comment event: 60
- Push event: 70
- Pull request event: 69
- Pull request review event: 90
- Pull request review comment event: 115
- Fork event: 3
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 19
- Total pull requests: 34
- Average time to close issues: 23 days
- Average time to close pull requests: 3 days
- Total issue authors: 5
- Total pull request authors: 4
- Average comments per issue: 0.53
- Average comments per pull request: 0.79
- Merged pull requests: 24
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 19
- Pull requests: 34
- Average time to close issues: 23 days
- Average time to close pull requests: 3 days
- Issue authors: 5
- Pull request authors: 4
- Average comments per issue: 0.53
- Average comments per pull request: 0.79
- Merged pull requests: 24
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- luisas (45)
- lrauschning (3)
- JoseEspinosa (2)
- jen-reeve (1)
- random-annoym (1)
- bxskdh (1)
Pull Request Authors
- luisas (51)
- nf-core-bot (13)
- lrauschning (9)
- JoseEspinosa (7)
- itrujnara (3)
- mirpedrol (1)
- alessiovignoli (1)
- nvnieuwk (1)
- maxulysse (1)
- Joon-Klaps (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- actions/upload-artifact v3 composite
- seqeralabs/action-tower-launch v2 composite
- actions/upload-artifact v3 composite
- seqeralabs/action-tower-launch v2 composite
- mshick/add-pr-comment v1 composite
- actions/checkout v3 composite
- nf-core/setup-nextflow v1 composite
- actions/stale v7 composite
- actions/checkout v3 composite
- actions/setup-node v3 composite
- actions/checkout v3 composite
- actions/setup-node v3 composite
- actions/setup-python v4 composite
- actions/upload-artifact v3 composite
- mshick/add-pr-comment v1 composite
- nf-core/setup-nextflow v1 composite
- psf/black stable composite
- dawidd6/action-download-artifact v2 composite
- marocchino/sticky-pull-request-comment v2 composite
- actions/setup-python v5 composite
- eWaterCycle/setup-singularity v7 composite
- nf-core/setup-nextflow v1 composite
- actions/setup-python v5 composite
- rzr/fediverse-action master composite
- zentered/bluesky-post-action v0.1.0 composite
- py_fasta_validator 0.6.*
- foldmason 2.7bd21ed.*
- foldmason 2.7bd21ed.*
- pigz 2.8.*
- foldmason 2.7bd21ed.*
- mafft 7.520.*
- pigz 2.8.*
- mafft 7.525.*
- magus-msa 0.2.0.*
- pigz 2.8.*
- mtm-align 20220104.*
- pigz 2.8.*
- pigz 2.8.*
- pigz 2.8.*
- t-coffee 13.46.0.919e8c6b.*
- pigz 2.8.*
- t-coffee 13.46.0.919e8c6b.*
- t-coffee 13.46.0.919e8c6b.*
- pigz 2.8.*
- t-coffee 13.46.0.919e8c6b.*
- pigz 2.8.*
- t-coffee 13.46.0.919e8c6b.*
- t-coffee 13.46.0.919e8c6b.*
- actions/checkout 0ad4b8fadaa221de15dcec353f45205ec38ea70b composite
- actions/checkout 11bd71901bbe5b1630ceea73d27597364c9af683 composite
- conda-incubator/setup-miniconda a4260408e20b96e80095f42ff7f1a15b27dd94ca composite
- eWaterCycle/setup-apptainer main composite
- jlumbroso/free-disk-space 54081f138730dfa15788a46383842cd2f914a1be composite
- nf-core/setup-nextflow v2 composite
- actions/checkout 11bd71901bbe5b1630ceea73d27597364c9af683 composite
- mshick/add-pr-comment b8f338c590a895d50bcbfa6c5859251edc8952fc composite
- nichmor/minimal-read-yaml v0.0.2 composite