https://github.com/cbg-ethz/quasifit
QuasiFit is a Bayesian MCMC sampler for inferring fitness landscapes in the quasispecies model subject to mutation-selection equilibrium.
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
○codemeta.json file
-
○.zenodo.json file
-
✓DOI references
Found 1 DOI reference(s) in README -
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (12.1%) to scientific vocabulary
Repository
QuasiFit is a Bayesian MCMC sampler for inferring fitness landscapes in the quasispecies model subject to mutation-selection equilibrium.
Basic Info
Statistics
- Stars: 2
- Watchers: 6
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
QuasiFit 0.3
David Seifert (david.seifert@bsse.ethz.ch) Niko Beerenwinkel (niko.beerenwinkel@bsse.ethz.ch)
Citation
If you find QuasiFit useful, please cite our paper in Genetics
David Seifert, Francesca Di Giallonardo, Karin J. Metzner, Huldrych F. Günthard, and Niko Beerenwinkel. A Framework for Inferring Fitness Landscapes of Patient-Derived Viruses Using Quasispecies Theory. Genetics 2015, 199(1). DOI: 10.1534/genetics.114.172312
Introduction
QuasiFit is an MCMC sampler that implements (relative) fitness inference for NGS data assuming a mutation-selection equilibrium. From the posterior, neutral networks and epistasis can be determined.
Binaries
We have pre-compiled 64-bit binaries for Linux and Mac users:
- Linux:
quasifit-linux-static-amd64
You will require a 64-bit distribution having at least glibc 2.3.2. Any distribution from the past 10 years should work.
The linux binary was built on Debian Etch 4.0r9 64-bit.
- Mac:
quasifit-mac-static-amd64
You will require at least Mac OS X 10.4.11 running on a 64-bit Mac.
QuasiFit was built on both platforms with GSL 1.16 and Boost 1.55 with GCC 4.8.2 on -O2 optimizations. The main code (including Eigen) was compiled with -O3 optimizations. All libraries, including the C++ runtime libraries, have been linked statically to produce a binary that has no external dependencies, in other words, they are directly useable.
All static binaries can be downloaded from the main git tree for the most recent release or from the releases page.
Prerequisites
If you wish to compile QuasiFit from source, you will require the following components (these are not necessary for running the statically linked binary):
- GSL; somewhat recent release (http://www.gnu.org/software/gsl/)
The GNU Scientific Library is required for random number generating functions.
- Eigen; at least 3.2 (http://eigen.tuxfamily.org/)
Eigen forms the core mathematics library of QuasiFit, with all its linear algebra routines.
- Boost; at least 1.50 (http://www.boost.org/)
Boost provides the necessary abstraction for time routines and thread handling. Also abstracts the different precision types.
- GMP (optional); somewhat recent release (http://www.gmplib.org/)
The GNU Multiple Precision Arithmetic Library (GMP) provides the basis (mpf_t) for arbitrary precision floating-point calculations. It is only required if you wish to build an arbitrary precision sampler.
- libquadmath (optional); at least GCC 4.6 (http://gcc.gnu.org/onlinedocs/libquadmath/)
GCC's libquadmath provides the __float128 quad-precision floating point type and associated operations. This is an internal GCC library that is included with GCC since 4.6. It is only required if you wish to use a quad-precision sampler. Quad precision represents a trade-off between performance and precision.
Furthermore, you will require a compiler that can handle C++0x (which includes all C++11 compilers). QuasiFit has been successfully compiled with GCC 4.4 on RHEL 6, GCC 4.8 on Gentoo/Debian Etch and icc 12.0 on RHEL 6. Please note that building an arbitrary precision sampler requires either GCC or Clang, as the Intel C++ Compiler has a known bug in handling Boost's Multiprecision library.
If you wish to do development, you will require parts of the extended GNU toolchain (the infamous Autotools):
- Autoconf; latest 2.69 release (http://www.gnu.org/software/autoconf/)
GNU Autoconf produces the ./configure script from configure.ac.
- Automake; latest 1.14 release (http://www.gnu.org/software/automake/)
GNU Automake produces the Makefile.in precursor, that is processed with ./configure to yield the final Makefile.
- Libtool; latest 2.4.2 release (http://www.gnu.org/software/libtool/)
GNU Libtool is required as a dependency of boost.m4.
QuasiFit is strongly intertwined with libraries and programs that heavily rely on features of UNIX-like systems, hence supporting Microsoft Windows is not a goal (in particular, building the GNU Scientific Library and using the GNU build system on Windows is a nightmare).
Preparing
To install the aforementioned dependencies, follow the guides here.
Linux
Due to the large inherent heterogeneity of the Linux landscape, we will only detail the procedure of installing dependencies for Ubuntu 14.04 LTS here. The procedure should be very similiar for Debian. ```
install basic compiler toolchain (you will be prompted to enter your password)
sudo apt-get install build-essential
install GSL
sudo apt-get install libgsl0-dev
install Eigen
sudo apt-get install libeigen3-dev
install Boost
sudo apt-get install libboost-all-dev
(optional) install GMP for arbitrary precision arithmetic
sudo apt-get install libgmp-dev ```
If you wish to work with the bleeding-edge release of QuasiFit, you will need the complete GNU Autotools toolchain. It should be reiterated here that the recommended way of building QuasiFit is by downloading the provided tarball and using either the included static binaries or compiling from source. The Git tree needs to be bootstrapped to produce the various scripts. To install the Autotools: ```
install the GNU toolchain
sudo apt-get install autoconf automake libtool pkg-config git ```
Mac OS X
QuasiFit should be buildable without complications on all Mac OS X versions above and including 10.6. For older versions of Mac OS X, the build process is significantly more involved due to the C++11 requirement. In this case we recommend using the provided precompiled binaries.
In any case, you will need to install the latest version of
Xcode for your platform (4.2 for 10.6; 4.6.3 for 10.7; 5.1.x for 10.8 & 10.9) either via the Mac App Store or by downloading the disk image from the Apple Developer Connection (http://developer.apple.com/).
Command Line Tools for Xcode. Since Xcode 4.3 Apple has stopped shipping command line tools with the standard Xcode package. You will need to install these via "Downloads" in the "Xcode" -> "Preferences" menu, or (preferably) by downloading the latest appropriate "Command Line Tools" package from the Apple Developer Connection.
MacPorts (http://www.macports.org/install.php).
Henceforth, we assume Xcode, the Command Line Tools and Macports to be installed.
Install the remaining libraries from MacPorts by performing ```
install general prerequisites
sudo port install wget pkgconfig
GCC (optional; if you wish to use quad-precision on Mac OS X,
you will require GCC as Clang/LLVM cannot handle libquadmath)
sudo port install gcc48
install GSL; choose one of the following
standard variant:
sudo port install gsl
optimized variant, compiled with -O3 and -march=native:
sudo port install gsl +optimize
install Eigen
sudo port install eigen3
install Boost; choose one of the following
standard variant, pulls in a load of dependencies:
sudo port install boost
minimalist variant,
disables python, avoids multiple dependencies:
sudo port install boost -python27
minimalist variant,
also builds static libraries that can be linked into the
final executable to reduce dynamic linking,
making the executable more portable:
sudo port install boost -python27 -no_static
(optional) install GMP for arbitrary precision arithmetic
sudo port install gmp ```
If you wish to work with the bleeding-edge release of QuasiFit, you will need the complete GNU Autotools toolchain. It should be reiterated here that the recommended way of building QuasiFit is by downloading the provided tarball and using either the included static binaries or compiling from source. The Git tree needs to be bootstrapped to produce the various scripts. To install the Autotools: ```
install the GNU toolchain
sudo port install autoconf automake libtool ```
Building
After having installed all of the required dependencies, you can build QuasiFit. For this, run
wget --no-check-certificate http://github.com/SoapZA/QuasiFit/releases/download/v0.3/quasifit-0.3.tar.bz2 -O - | tar xj
cd quasifit-0.3/
./configure
make -j3
For users wishing to do development or want to stay up-to-date with the latest development, you will need to clone the git tree. This method is not recommended for users just wishing to use QuasiFit, as it requires the complete Autotools toolchain.
git clone https://github.com/SoapZA/QuasiFit.git
cd QuasiFit/
./autogen.sh
./configure
make -j3
The resulting executable quasifit located in src/ can then be run. You can also install QuasiFit into a directory of your choosing if you specify the directory to the configure script with
./configure --prefix=<DIR>
and then install after the initial make with make install.
Options
The QuasiFit sampler has a number of options for controlling the Metropolis-Hastings algorithm and I/O. See quasifit -h for more information.
Usage
Input file
QuasiFit is versatile in what sequence file formats it can take as input:
- Generic FASTA file. One drawback of the Generic FASTA input file is that it cannot include unobserved haplotypes, as every sequence in a FASTA file represents one observation. To include unobserved haplotypes into the inference procedure, use one of the two other input formats.
- QuasiRecomb FASTA file. QuasiFit was designed with QuasiRecomb's output file structure as input. QuasiRecomb writes the statistics of its output into the sequence identifier field of a FASTA file:
```
read0_0.5026 TAGAAGATATGGAGTTGCCAGGGAGGTGGA ```
QuasiFit parses this expression and extracts the counts. While QuasiRecomb by itself does not include any unobserved haplotypes, you can insert these manually into the output FASTA file. 3. QuasiFit input file. QuasiFit can load its own kind of input file, which consists simply of comma-separated values. Every line includes one haplotype, separated by its observed count with a comma. For instance
AAA,1
AAT,1
would be a two-haplotype input file for QuasiFit. This is the same input file as used in the supplemental information of the main paper in the section "Unobserved haplotypes simulations".
Obviously, all haplotypes have to be of the same length for the quasispecies model (in its simplest form) to be applicable.
Output files
QuasiFit produces multiple output files:
<FILE>-f.csv: these contain the actual fitness samples from the fitness manifold. Be aware that QuasiFit will write out the full number of decimal digits for each floating-point value, hence this file can become somewhat large.<FILE>-p.csv: these contain the estimated population distribution samples. Every row should theoretically sum to 1 (within numerical truncation errors), as every component represents the probability of a haplotype in an asymptotically infinite population.<FILE>-r.csv: these contain the samples from the subset of the Euclidean space, which in fact is the true sampling space. Every row will include at least one 0, as the dimensionality of the euclidean space is the same as the degrees of freedom of the quasispecies distribution simplex, namely #Haplotypes - 1.<FILE>-diag.csv: these contain 3 columns of diagnostic data. The first column represents the logarithm of the posterior density function (up to a constant shift), the second column represents the logarithm of the absolute value of the determinant of the Jacobian of h(p), and finally, the third column represents the logarithm of the multinomial likelihood (excluding the constant prefactor).
All of these files can be analysed with standard tools. We recommend using R for its sophisticated plotting capabilities. To load one such file, fire up R and use for instance
diagnostic_data = read.table("<FILE>-diag.csv", header=TRUE, sep=",", colClasses="numeric")
to have a look at the diagnostic data. QuasiFit does not automatically detect or remove the burn-in phase, as this is generally tricky and should be left for the user to determine. To determine the burn-in phase, just plot the beginning of the log Posterior and notice when the values flatten out and converge to their supposed stationary distribution. To plot the first 25'000 values of the log Posterior, proceed with
plot(diagnostic_data$LogPost[1:25000], xlab="MCMC iteration", ylab="Log Posterior", type="l")
to get something like this
Notice how the MCMC chains converge to the stationary distribution at around 10'000 iterations - this would be considered the burn-in phase. The drop in the log Posterior from the initial value is a result of starting at the MLE of the problem and the general curse of dimensionality. For more complicated post-MCMC diagnostics, try the coda package from CRAN (http://cran.r-project.org/web/packages/coda/index.html). For instructions on verifying convergence, see http://www.people.fas.harvard.edu/~plam/teaching/methods/convergence/convergence_print.pdf.
Unconnected haplotypes
QuasiFit makes strong assumptions on the connectedness of haplotypes. For instance, the haplotypes AAA and TTT are separated by a mutational step that requires 3 simultaneous mutations in one replication cycle. Mainly for numerical reasons, this causes the mutation matrix Q to become numerically reducible and a global equilibrium distribution of the quasispecies equation is not numerically guaranteed anymore.
To circumvent this issue, we have detailed a procedure in the main paper in the section "Haplotype space and mutation probabilities" that inserts a minimal number of unobserved haplotypes such that we arrive at a network of haplotypes, where every haplotype can mutate into every other haplotype by taking only simultaneous k mutations per replication cycle (in practice we require k = 1). To do this, we have included our MATLAB script curateSample.m in the scripts/ folder. Fire up a MATLAB session and run for instance
curateSample('quasispecies.fasta')
where quasispecies.fasta is the output file of QuasiRecomb. The curateSample script converts QuasiRecomb's output to a QuasiFit input file and includes the minimal number of unobserved haplotypes to make the haplotype graph connected with one component for given k.
Owner
- Name: Computational Biology Group (CBG)
- Login: cbg-ethz
- Kind: organization
- Location: Basel, Switzerland
- Website: https://www.bsse.ethz.ch/cbg
- Twitter: cbg_ethz
- Repositories: 91
- Profile: https://github.com/cbg-ethz
Beerenwinkel Lab at ETH Zurich
GitHub Events
Total
Last Year
Issues and Pull Requests
Last synced: over 1 year ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0