https://github.com/arklumpus/alifilter

A machine learning approach to alignment filtering

Science Score: 59.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 29 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
    1 of 1 committers (100.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.1%) to scientific vocabulary
Last synced: 7 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: arklumpus
  • License: gpl-3.0
  • Language: C#
  • Default Branch: main
  • Size: 339 MB
Statistics
  • Stars: 1
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created about 1 year ago · Last pushed 11 months ago
Metadata Files
Readme License

README.md

AliFilter: a machine learning approach to alignment filtering

Sequence alignment filtering (or "trimming") consists of removing parts of a DNA or protein alignment to improve the performance of a downstream analysis (such as a phylogenetic reconstruction). Alignment columns are removed because they are deemed unsuitable for the analysis, e.g. because they are likely to result from mistakes introduced by the sequence alignment software, because they contain no information, or because they contain a large amount of noise.
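As a minimal illustration of what "removing alignment columns" means, the Python sketch below applies a keep/discard mask to a toy alignment. The function name `filter_columns`, the toy sequences and the hand-written mask are all assumptions made for this example; it does not show how AliFilter decides which columns to remove, only the effect of a mask once that decision has been made.

```python
def filter_columns(alignment, keep_mask):
    """Return a copy of the alignment keeping only columns where keep_mask is True.

    `alignment` maps sequence names to aligned sequences of equal length;
    `keep_mask` holds one boolean per alignment column.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("all sequences in an alignment must have the same length")
    if lengths and lengths != {len(keep_mask)}:
        raise ValueError("mask length must match the alignment length")
    return {
        name: "".join(ch for ch, keep in zip(seq, keep_mask) if keep)
        for name, seq in alignment.items()
    }


# Toy alignment (hypothetical data, for illustration only).
alignment = {
    "seq1": "ATG-CA",
    "seq2": "ACGTCA",
    "seq3": "AT--CT",
}

# Hand-written mask that discards the fourth column (index 3), which is mostly gaps.
keep_mask = [True, True, True, False, True, True]

filtered = filter_columns(alignment, keep_mask)
for name, seq in filtered.items():
    print(name, seq)
# seq1 ATGCA
# seq2 ACGCA
# seq3 AT-CT
```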

Alignment filtering can be performed by manually inspecting alignments and identifying problematic alignment columns, or by using a variety of software tools (e.g., BMGE [1], ClipKIT [2], Gblocks [3], Noisy [4], trimAl [5]; see Tan et al. 2015 [6] for a review of some of these). Compared to manual filtering, automated filtering tools have the advantage of being easily applicable to large datasets and producing consistent results; however, apart from some customisation settings, they are often a "black box" offering little control over which parts of the alignments are preserved or deleted. Manual filtering, conversely, is more time-consuming and less reproducible, but allows for a more fine-tuned filtering approach.

AliFilter is a tool to automate a manual filtering approach. Using a machine learning algorithm, AliFilter can create a model from a small set of manually filtered alignments; the model can then be used to reproducibly filter many alignments, simulating the manual filtering approach. The program also comes with a pre-trained model that can be used to filter alignments out of the box.

AliFilter is a command-line tool available for Windows, macOS and Linux; it is distributed under a GPLv3 license. An API is also available, which allows programs written in C#, C/C++, Python, R, and JavaScript to use AliFilter models for alignment filtering.

Quick usage guide

AliFilter does not require any installation; you just need to download the latest program release for your operating system and you are good to go.

To filter an alignment with AliFilter, run the following command:

AliFilter -i alignment.fas -o output.fas

Here, alignment.fas is the input (unfiltered) alignment, while output.fas is the name of the file where the output (filtered) alignment will be saved. Alignments can be in FASTA or relaxed PHYLIP format. This command will use the default model implemented in AliFilter to filter the alignment.

If you wish to use a specific model, you can use the -m argument:

AliFilter -i alignment.fas -o output.fas -m <model>

Where <model> is either a standard model specification, or the path to a model.json file containing a custom trained model. A list of the standard model specifications is available in the Wiki.
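If you want to call the command-line tool from a script, the two commands above can be wrapped in a few lines of Python. This is only a sketch: the helper names `build_alifilter_command` and `filter_alignment` are made up for this example, it assumes the `AliFilter` executable is on your `PATH`, and it uses only the `-i`, `-o` and `-m` arguments described above. It does not use the AliFilter Python API.

```python
import subprocess


def build_alifilter_command(input_path, output_path, model=None, executable="AliFilter"):
    """Build the argument list for an AliFilter run.

    Only the -i, -o and -m arguments from the quick usage guide are used.
    `executable` is an assumption: adjust it to wherever AliFilter was unpacked.
    """
    command = [executable, "-i", str(input_path), "-o", str(output_path)]
    if model is not None:
        # Either a standard model specification or the path to a model.json file.
        command += ["-m", str(model)]
    return command


def filter_alignment(input_path, output_path, model=None, executable="AliFilter"):
    """Run AliFilter and raise CalledProcessError if it exits with an error."""
    command = build_alifilter_command(input_path, output_path, model, executable)
    subprocess.run(command, check=True)


print(build_alifilter_command("alignment.fas", "output.fas"))
# ['AliFilter', '-i', 'alignment.fas', '-o', 'output.fas']
```

Calling `filter_alignment("alignment.fas", "output.fas", model="model.json")` would then run the same command as the `-m` example above, with a custom model file.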

If you do not provide the -i or -o arguments, the program will read from the standard input or write to the standard output, respectively. This makes it possible to chain sequence alignment and filtering in a single command line; for example, if you are using mafft to align the sequences:

mafft --auto unaligned.fas | AliFilter > filtered.fas

This command will directly create a file called filtered.fas containing the filtered sequence alignment.
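The same pipe can be set up from Python when the alignment step is part of a larger script. The sketch below is a generic two-process pipeline built on the standard `subprocess` module; the function name `run_pipeline` is hypothetical, and it assumes that both `mafft` and `AliFilter` can be found on your `PATH`.

```python
import subprocess


def run_pipeline(producer_cmd, consumer_cmd, output_path):
    """Pipe the standard output of one command into another and save the result.

    Equivalent to the shell line: producer_cmd | consumer_cmd > output_path
    """
    with open(output_path, "wb") as out:
        producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
        consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout, stdout=out)
        # Close our copy of the pipe so the producer is notified if the consumer exits early.
        producer.stdout.close()
        consumer_status = consumer.wait()
        producer_status = producer.wait()
    if producer_status != 0 or consumer_status != 0:
        raise RuntimeError(
            f"pipeline failed (producer exit code {producer_status}, "
            f"consumer exit code {consumer_status})"
        )
```

With this helper, `run_pipeline(["mafft", "--auto", "unaligned.fas"], ["AliFilter"], "filtered.fas")` would mirror the one-line command above.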

AliFilter can also perform additional tasks, including training new models, comparing two alignments or masks, and combining multiple masks. See the Wiki for more details on all the features of the program.

Citation

If you use AliFilter in your research, please cite it as:

Bianchini, G., Zhu, R., Cicconardi, F., & Moody, E. R. R. (2025). AliFilter: a machine learning approach to alignment filtering. Zenodo. https://doi.org/10.5281/zenodo.14861812

Building from source

Note that if you just wish to use the program, you can simply download the precompiled executables from the release page, rather than compiling the program.

If you wish to build AliFilter from source, you will need to install the .NET 8.0 SDK. Afterwards, clone this repository (the source code is in the src folder) and execute the build script for your platform.

Windows

To build signed executables on Windows, you will need to install a code signing certificate on your system; in the following commands, <subject> should be the code signing certificate subject, while <pin> should be your smart card pin. If you do not have a code signing certificate, you can still build unsigned executables by using random strings for these parameters: the signing step will fail, but the executables will still be produced.

You will need to execute a different script depending on your architecture.

Windows-x64

BuildRelease-win-x64.cmd <subject> <pin>

Windows-arm64

BuildRelease-win-arm64.cmd <subject> <pin>

Linux

The command you will need to execute on Linux systems depends on your architecture.

Linux-x64

chmod +x BuildRelease-linux-x64.sh
./BuildRelease-linux-x64.sh

Linux-arm64

chmod +x BuildRelease-linux-arm64.sh
./BuildRelease-linux-arm64.sh

macOS

The command you need to execute on macOS also depends on your architecture. If you do not have a paid Apple Developer account, you can enter random strings for the various required arguments of the script; the code signing and notarization steps will fail, but the executable will still be produced.

macOS-x64

chmod +x BuildRelease-mac-x64.sh
./BuildRelease-mac-x64.sh <Developer ID Application> <Apple ID> <App-specific password> <Developer team ID>

macOS-arm64

chmod +x BuildRelease-mac-arm64.sh
./BuildRelease-mac-arm64.sh <Developer ID Application> <Apple ID> <App-specific password> <Developer team ID>

References

[1] Criscuolo, A., Gribaldo, S. BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evol Biol 10, 210 (2010). https://doi.org/10.1186/1471-2148-10-210

[2] Steenwyk JL, Buida TJ III, Li Y, Shen X-X, Rokas A (2020) ClipKIT: A multiple sequence alignment trimming software for accurate phylogenomic inference. PLoS Biol 18(12): e3001007. https://doi.org/10.1371/journal.pbio.3001007

[3] Castresana, J. (2000). Selection of Conserved Blocks from Multiple Alignments for Their Use in Phylogenetic Analysis. Molecular Biology and Evolution, 17(4), 540–552. https://doi.org/10.1093/OXFORDJOURNALS.MOLBEV.A026334

[4] Dress, A.W., Flamm, C., Fritzsch, G. et al. Noisy: Identification of problematic columns in multiple sequence alignments. Algorithms Mol Biol 3, 7 (2008). https://doi.org/10.1186/1748-7188-3-7

[5] Salvador Capella-Gutiérrez, José M. Silla-Martínez, Toni Gabaldón, trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses, Bioinformatics, Volume 25, Issue 15, August 2009, Pages 1972–1973, https://doi.org/10.1093/bioinformatics/btp348

[6] Ge Tan, Matthieu Muffato, Christian Ledergerber, Javier Herrero, Nick Goldman, Manuel Gil, Christophe Dessimoz, Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference, Systematic Biology, Volume 64, Issue 5, September 2015, Pages 778–791, https://doi.org/10.1093/sysbio/syv033

Owner

  • Name: Giorgio Bianchini
  • Login: arklumpus
  • Kind: user
  • Company: University of Bristol

GitHub Events

Total
  • Release event: 1
  • Watch event: 2
  • Member event: 1
  • Public event: 1
  • Push event: 4
  • Create event: 1
Last Year
  • Release event: 1
  • Watch event: 2
  • Member event: 1
  • Public event: 1
  • Push event: 4
  • Create event: 1

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 11
  • Total Committers: 1
  • Avg Commits per committer: 11.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 11
  • Committers: 1
  • Avg Commits per committer: 11.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Giorgio Bianchini g****i@b****k 11

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

API/R/alifilter/DESCRIPTION cran
  • R >= 2.10.0 depends
  • Rcpp >= 1.0.12 imports
  • ape >= 5.7.1 imports
  • jsonlite >= 1.8.8 imports
API/CSharp/AliFilterExample.csproj nuget
  • AliFilter 0.2.12
src/AliFilter/AliFilter.csproj nuget
  • Accord 3.8.0
  • Accord.MachineLearning 3.8.0
  • Accord.Statistics 3.8.0
  • Mono.Options 6.12.0.148
  • VectSharp.Markdown 1.7.0
  • VectSharp.PDF 3.1.0
  • VectSharp.Plots 1.1.0
  • VectSharp.SVG 1.10.1