phylter
Detection of outlier genes and species in phylogenomics
Science Score: 36.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: Found 15 DOI reference(s) in README
- ○ Academic publication links
- ✓ Committers with academic emails: 2 of 4 committers (50.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.8%) to scientific vocabulary
Repository
Detection of outlier genes and species in phylogenomics
Basic Info
- Host: GitHub
- Owner: damiendevienne
- Language: R
- Default Branch: master
- Homepage: https://damiendevienne.github.io/phylter
- Size: 181 MB
Statistics
- Stars: 12
- Watchers: 3
- Forks: 5
- Open Issues: 2
- Releases: 4
Metadata Files
README.md
PhylteR, a tool for analyzing, visualizing and filtering phylogenomics datasets 
phylter detects, removes and visualizes outliers in phylogenomics datasets by iteratively removing taxa from gene families (gene trees) and optimizing a score of concordance between individual matrices.
phylter relies on DISTATIS (Abdi et al., 2005), an extension of multidimensional scaling to 3 dimensions that compares multiple distance matrices at once.
phylter builds on Phylo-MCOA (de Vienne et al., 2012) but is much faster and more accurate.
phylter takes as input either a collection of phylogenetic trees (that are converted to distance matrices by phylter), or a collection of pairwise distance matrices (obtained from multiple sequence alignments, for instance).
phylter accepts data with missing values (missing taxa in some genes).
phylter detects outliers with a method proposed by Hubert & Vandervieren (2008) for skewed data.
phylter does not accept the same taxon more than once in the same gene.
phylter is written in R.
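Because a taxon may appear only once per gene, it can be useful to screen the input before running phylter. The following is a minimal sketch in base R, not part of phylter: `find_duplicated_taxa` and `make_mat` are hypothetical helpers, and the input is assumed to be a named list of distance matrices whose row names are taxon labels.

```R
# Hypothetical helper (not part of phylter): report, for each gene, the taxon
# labels that occur more than once among the row names of its matrix.
find_duplicated_taxa <- function(mats) {
  dup <- lapply(mats, function(m) unique(rownames(m)[duplicated(rownames(m))]))
  dup[lengths(dup) > 0]
}

# Toy distance matrices; gene2 contains taxon "A" twice.
make_mat <- function(taxa) {
  m <- matrix(1, length(taxa), length(taxa), dimnames = list(taxa, taxa))
  diag(m) <- 0
  m
}
mats <- list(gene1 = make_mat(c("A", "B", "C")),
             gene2 = make_mat(c("A", "A", "B")))

dups <- find_duplicated_taxa(mats)
dups
```

For a list of trees, the same check could be applied to each tree's `tip.label` vector instead of the row names.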
For details about the functions, their usage, and an in-depth, step-by-step description of the use of phylter on a biological dataset, please visit the phylter web page: https://damiendevienne.github.io/phylter.
Note: if you don't use R or don't want to use R, containerized versions of phylter are also available (Docker and Singularity): https://damiendevienne.github.io/phylter/articles/phyltercontainer.html
If you use phylter, please cite: Comte, A., Tricou, T., Tannier, E., Joseph, J., Siberchicot, A., Penel, S., Allio, R., Delsuc, F., Dray, S., de Vienne, D.M. (2023). PhylteR: Efficient Identification of Outlier Sequences in Phylogenomic Datasets, Molecular Biology and Evolution, 40(11), msad234, https://doi.org/10.1093/molbev/msad234
Installation
phylter is now on CRAN.
Installation is as easy as typing what follows at the R command prompt:
```R
install.packages("phylter")
```
If you want the latest version, you can also install the development version of phylter:
1. Install the release version of `remotes` from CRAN:
```R
install.packages("remotes")
```
2. Install the development version of `phylter` from GitHub:
```R
remotes::install_github("damiendevienne/phylter")
```
3. Once installed, the package can be loaded:
```R
library("phylter")
```
Note: phylter requires R version > 4.0, otherwise it cannot be installed. Also, R uses the GNU Scientific Library. On Ubuntu, this can be installed prior to the installation of the phylter package by typing `sudo apt install libgsl-dev` in a terminal.
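The prerequisites can also be checked from within R before attempting an installation. This is a small sketch using base R only: `phylter_prereqs` is a hypothetical helper, and the version threshold assumes the `R >= 4.0` requirement given in the package's dependency list.

```R
# Hypothetical helper (not part of phylter): report whether the running R
# session is recent enough and whether phylter is already installed.
phylter_prereqs <- function() {
  c(r_version_ok = getRversion() >= "4.0.0",   # assumed threshold: R >= 4.0
    phylter_installed = requireNamespace("phylter", quietly = TRUE))
}

prereqs <- phylter_prereqs()
prereqs
```

The function only inspects the session and installs nothing, so it is safe to run on a shared machine.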
Usage
Here is a brief introduction to the use of phylter on a collection of gene trees. For more detailed explanations and a use case example, please visit https://damiendevienne.github.io/phylter/.
1. With the `read.tree` function from the `ape` package, read trees from an external file and save them as a list called `trees`.
```R
# Install ape only if it is not already available
if (!requireNamespace("ape", quietly = TRUE))
  install.packages("ape")
trees <- ape::read.tree("treefile.tre")
```
2. (optional) Read or otherwise obtain the gene names (in the same order as the trees) and save them as a vector called `names`.
3. Run phylter on your trees (see details below for possible options).
```R
results <- phylter(trees, gene.names = names)
```
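Since gene names are matched to the trees by position, a quick sanity check before calling `phylter` can catch a mismatched name file. The sketch below uses base R only. `check_gene_names` is a hypothetical helper, the name file is a temporary stand-in, and the uniqueness test is an assumption about what makes names useful rather than a documented phylter requirement.

```R
# Hypothetical helper (not part of phylter): confirm that gene names can be
# matched to the input list by position.
check_gene_names <- function(X, gene.names) {
  stopifnot(
    is.character(gene.names),
    length(gene.names) == length(X),  # one name per tree or matrix
    !anyNA(gene.names),
    anyDuplicated(gene.names) == 0    # assumption: names should be unique
  )
  invisible(TRUE)
}

# Stand-ins for three trees and for a file holding one gene name per line
X <- vector("list", 3)
name_file <- tempfile(fileext = ".txt")
writeLines(c("geneA", "geneB", "geneC"), name_file)
names <- readLines(name_file)

check_gene_names(X, names)
```

If the check passes silently, `names` can be given to the `gene.names` argument as in step 3.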
Options
The `phylter` function is called as follows by default:
```R
phylter(X, bvalue = 0, distance = "patristic", k = 3, k2 = k, Norm = "median",
        Norm.cutoff = 0.001, gene.names = NULL, test.island = TRUE,
        verbose = TRUE, stop.criteria = 1e-5, InitialOnly = FALSE,
        normalizeby = "row", parallel = TRUE)
```
Arguments are as follows:
- `X`: A list of phylogenetic trees (phylo object) or a list of distance matrices. Trees can have different numbers of leaves and matrices can have different dimensions. If this is the case, missing values are imputed.
- `bvalue`: If `X` is a list of trees, nodes with a support below `bvalue` will be collapsed prior to the outlier detection.
- `distance`: If `X` is a list of trees, type of distance used to compute the pairwise matrices for each tree. Can be "patristic" (sum of branch lengths separating tips, the default) or "nodal" (number of nodes separating tips).
- `k`: Strength of outlier detection. The higher this value, the fewer outliers are detected.
- `k2`: Same as `k` for complete gene outlier detection. To preserve complete genes from being discarded, `k2` can be increased. By default, `k2 = k`.
- `Norm`: Should the matrices be normalized prior to the complete analysis, and how. If "median", matrices are divided by their median; if "mean", they are divided by their mean; if "none", no normalization is performed. Normalizing ensures that fast-evolving (and slow-evolving) genes are not treated as outliers. Normalization by median is a better choice as it is less sensitive to outlier values.
- `Norm.cutoff`: Value of the median (if `Norm = "median"`) or the mean (if `Norm = "mean"`) below which matrices are simply discarded from the analysis. This prevents dividing by 0, and allows getting rid of genes that contain mostly branches of length 0 and are therefore uninformative anyway. Discarded genes, if any, are listed in the output (`out$DiscardedGenes`).
- `gene.names`: List of gene names used to rename elements in `X`. If NULL (the default), elements are named 1,2,...,length(X).
- `test.island`: If `TRUE` (the default), only the highest value in an island of outliers is considered an outlier. This prevents non-outliers hitchhiked by outliers from being considered outliers themselves.
- `verbose`: If `TRUE` (the default), messages are written during the filtering process to get information on what is happening.
- `stop.criteria`: The optimization stops when the gain (quality of compromise) between round n and round n+1 is smaller than this value. Defaults to 1e-5.
- `InitialOnly`: Logical. If `TRUE`, only the initial state of the data is computed.
- `normalizeby`: Should the gene x species matrix be normalized prior to outlier detection, and how.
- `parallel`: Logical. Should the computations be parallelized when possible? Defaults to `TRUE`. Note that the number of threads cannot be set by the user when `parallel = TRUE`. It uses all available cores on the machine.
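To illustrate what `Norm = "median"` and `Norm.cutoff` are described as doing, here is a rough sketch in base R. It is not phylter's implementation: `normalize_by_median` is a hypothetical function, and taking the median over the off-diagonal entries is an assumption made for the illustration.

```R
# Illustration only: divide each distance matrix by its median and drop the
# matrices whose median falls below the cutoff.
# Assumption: the median is taken over off-diagonal entries; phylter's
# internal computation may differ.
normalize_by_median <- function(mats, cutoff = 0.001) {
  med <- vapply(mats, function(m) median(m[upper.tri(m)], na.rm = TRUE),
                numeric(1))
  keep <- med >= cutoff
  list(matrices = Map(function(m, s) m / s, mats[keep], med[keep]),
       discarded = names(mats)[!keep])
}

# Three toy genes: "fast" is "slow" scaled by 10, "flat" has only zero distances.
slow <- matrix(c(0, 1, 2,
                 1, 0, 3,
                 2, 3, 0), nrow = 3, byrow = TRUE)
mats <- list(slow = slow, fast = slow * 10, flat = slow * 0)

out <- normalize_by_median(mats)
same_after_scaling <- isTRUE(all.equal(out$matrices$slow, out$matrices$fast))
same_after_scaling
out$discarded
```

After scaling, the slow and fast genes become identical, which is why a uniformly fast-evolving gene is not flagged as an outlier, while the gene with only zero distances is discarded instead of causing a division by zero.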
4. Analyze the results
To get the list of outliers detected by phylter, simply type:
```R
results$Final$Outliers
```
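The list of outliers can then be summarized with base R. The sketch below works on a mock object and assumes that `results$Final$Outliers` is a two-column character matrix with the gene in the first column and the species in the second. That layout is an assumption to verify on real output, and the gene and species names are invented.

```R
# Mock object standing in for results$Final$Outliers.
# Assumed layout: column 1 = gene, column 2 = species.
outliers <- matrix(c("gene1", "Homo",
                     "gene1", "Mus",
                     "gene7", "Homo"), ncol = 2, byrow = TRUE)

# Number of outlier sequences per gene, and per species (most affected first)
per_gene <- table(outliers[, 1])
per_species <- sort(table(outliers[, 2]), decreasing = TRUE)
per_gene
per_species
```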
In addition, many functions allow looking at the outliers detected and comparing before and after phyltering.
```R
# Get a summary: nb of outliers, gain in concordance, etc.
summary(results)

# Show the number of species in each gene, and how many per gene are outliers
plot(results, "genes")

# Show the number of genes where each species is found, and how many are outliers
plot(results, "species")

# Compare before and after genes x species matrices, highlighting missing data and outliers
# identified (not efficient for large datasets)
plot2WR(results)

# Plot the dispersion of data before and after outlier removal. One dot represents one
# gene x species association
plotDispersion(results)

# Plot the genes x genes matrix showing pairwise correlation between genes
plotRV(results)

# Plot optimization scores during optimization
plotopti(results)
```
5. Save the results of the analysis to an external file, for example to clean raw alignments or prune gene trees based on the results from phylter.
```R
write.phylter(results, file = "phylter.out")
```
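Pruning gene trees based on the detected outliers could look like the sketch below, which uses `read.tree` and `drop.tip` from `ape`. The trees and the outlier table are toy data. The table is assumed to have the gene in column 1 and the species in column 2, mirroring the assumed layout of `results$Final$Outliers`, and the trees are assumed to be stored in a list named by gene.

```R
library(ape)

# Toy gene trees, named by gene; in practice these are the trees given to phylter.
trees <- list(
  gene1 = read.tree(text = "((Homo:1,Pan:1):1,(Mus:1,Rattus:1):1);"),
  gene2 = read.tree(text = "((Homo:1,Pan:1):1,(Mus:1,Rattus:1):1);")
)

# Mock outlier table (assumed layout: column 1 = gene, column 2 = species).
outliers <- matrix(c("gene1", "Mus"), ncol = 2, byrow = TRUE)

# For each gene, drop the tips reported as outliers; leave other trees untouched.
pruned <- lapply(names(trees), function(g) {
  bad <- outliers[outliers[, 1] == g, 2]
  if (length(bad) > 0) drop.tip(trees[[g]], bad) else trees[[g]]
})
names(pruned) <- names(trees)

pruned$gene1$tip.label
```

Only `gene1` loses a tip here; `gene2` is returned unchanged because no outlier was reported for it.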
References
Abdi, H., O’Toole, A.J., Valentin, D. & Edelman, B. (2005). DISTATIS: The analysis of multiple distance matrices. Proceedings of the IEEE Computer Society: International Conference on Computer Vision and Pattern Recognition (San Diego, CA, USA). doi: 10.1109/CVPR.2005.445. https://www.utdallas.edu/~herve/abdi-distatis2005.pdf
Allio, R., Tilak, M. K., Scornavacca, C., Avenant, N. L., Kitchener, A. C., Corre, E., ... & Delsuc, F. (2021). High-quality carnivoran genomes from roadkill samples enable comparative species delineation in aardwolf and bat-eared fox. Elife, 10, e63167. https://doi.org/10.7554/eLife.63167
Comte, A., Tricou, T., Tannier, E., Joseph, J., Siberchicot, A., Penel, S., Allio, R., Delsuc, F., Dray, S., de Vienne, D.M. (2023). PhylteR: Efficient Identification of Outlier Sequences in Phylogenomic Datasets, Molecular Biology and Evolution, 40(11), msad234, https://doi.org/10.1093/molbev/msad234
Hubert, M. and Vandervieren, E. (2008). An adjusted boxplot for skewed distributions. Computational Statistics and Data Analysis. https://doi.org/10.1016/j.csda.2007.11.008
de Vienne, D.M., Ollier, S. and Aguileta, G. (2012). Phylo-MCOA: A Fast and Efficient Method to Detect Outlier Genes and Species in Phylogenomics Using Multiple Co-inertia Analysis. Molecular Biology and Evolution. https://doi.org/10.1093/molbev/msr317 (This is the ancestor of phylter).
For comments, suggestions and bug reports, please open an issue on this GitHub repository.
Owner
- Login: damiendevienne
- Kind: user
- Repositories: 25
- Profile: https://github.com/damiendevienne
GitHub Events
Total
- Create event: 1
- Release event: 1
- Issues event: 2
- Watch event: 2
- Issue comment event: 4
- Push event: 1
Last Year
- Create event: 1
- Release event: 1
- Issues event: 2
- Watch event: 2
- Issue comment event: 4
- Push event: 1
Committers
Last synced: about 3 years ago
All Time
- Total Commits: 409
- Total Committers: 4
- Avg Commits per committer: 102.25
- Development Distribution Score (DDS): 0.357
Top Committers
| Name | Email | Commits |
|---|---|---|
| Damien de Vienne | d****e@u****r | 263 |
| Aurore | a****e@g****m | 72 |
| Aurélie Siberchicot | a****t@u****r | 68 |
| theotricou | t****u@g****m | 6 |
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 10
- Total pull requests: 0
- Average time to close issues: 21 days
- Average time to close pull requests: N/A
- Total issue authors: 9
- Total pull request authors: 0
- Average comments per issue: 2.7
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: 6 days
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 5.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- Boussau (2)
- MaelDore (1)
- MareikeJaniak (1)
- francicco (1)
- 000generic (1)
- Ofsm (1)
- bvalot (1)
- yangliu-szu (1)
Packages
- Total packages: 1
- Total downloads:
  - cran: 320 last-month
- Total docker downloads: 28
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 4
- Total maintainers: 1
cran.r-project.org: phylter
Detect and Remove Outliers in Phylogenomics Datasets
- Homepage: https://github.com/damiendevienne/phylter
- Documentation: http://cran.r-project.org/web/packages/phylter/phylter.pdf
- License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
- Latest release: 0.9.12 (published 10 months ago)
Dependencies
- R >= 4.0 depends
- RSpectra * imports
- Rfast * imports
- ape * imports
- ggplot2 * imports
- mrfDepth * imports
- reshape2 * imports
- stats * imports
- utils * imports
- actions/checkout v3 composite
- r-lib/actions/check-r-package v2 composite
- r-lib/actions/setup-pandoc v2 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite
- JamesIves/github-pages-deploy-action 4.1.4 composite
- actions/checkout v3 composite
- r-lib/actions/setup-pandoc v2 composite
- r-lib/actions/setup-r v2 composite
- r-lib/actions/setup-r-dependencies v2 composite