https://github.com/biocore/micov

Aggregate genome coverage

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.1%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: biocore
  • License: other
  • Language: Python
  • Default Branch: main
  • Size: 27.5 MB
Statistics
  • Stars: 4
  • Watchers: 5
  • Forks: 5
  • Open Issues: 8
  • Releases: 2
Created about 2 years ago · Last pushed 11 months ago
Metadata Files
Readme Changelog License

README.md

micov: aggregate MIcrobiome COVerage

We introduce aggregate MIcrobiome COVerage (micov), a bioinformatic tool that efficiently computes precise, optionally-aggregated, genomic coverage positions across numerous metagenomes and arbitrary sample types. Micov offers three key advantages over conventional tools: rapid sample type-specific cumulative coverage calculations, identification of mobile or polymorphic genetic elements, and detection of strain heterogeneity through coverage variations.

Design

The primary input mapping structure for micov is per-sample SAM/BAM or BED (3-column). These data are then consolidated into Parquet files to utilize pushdown filters.

Installation

We recommend creating a separate conda environment, and installing into that.

```bash
$ pip install micov
```

Installation From Source

To install the most up-to-date version of micov:

```bash
$ git clone https://github.com/biocore/micov.git
$ cd micov
$ conda create -n micov python=3.12
$ conda install -q --yes -n micov -c conda-forge --file ci/conda_requirements.txt
$ conda activate micov
$ pip install -e .
```

Example Usages

See below for examples of running micov on SAM files.

1. Set Up Environment

First, activate the Conda environment where micov is installed:

```bash
conda activate micov
```

2. Process SAM Files to Extract Covered Positions

Next, we will process SAM files to extract covered positions. Note: if you have coverages.tgz coverage files from Qiita, skip to step 4.

micov accepts headerless SAM/BAM files and writes out BED-like files that describe the observed start and stop positions on the references in the SAM data.

If your input files contain headers, remove them using samtools before running micov:

```bash
samtools view -S input.sam > output.sam
```

Similarly, if your input files are in BAM format, convert them to SAM format using samtools:

```bash
samtools view input.bam > output.sam
```

Next, compress the SAM data into BED coverage files. The samtools command above can be piped directly into micov if desired, but for simplicity we demonstrate use from SAM files. When writing output, we assume the name of the SAM file corresponds to a sample name. The subsequent code expects the BED files to have either a .cov or .cov.gz extension.

```bash
mkdir -p "./example/coverages"

for file in ./example/samfiles/*.sam.xz; do
    sample_id=$(basename "$file" .sam.xz)

    echo "Processing $file..."

    # Run micov compress
    xzcat "$file" | micov compress | gzip > "./example/coverages/${sample_id}.cov.gz"
done
```
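To make the idea of compressing alignments into covered intervals concrete, here is a minimal Python sketch. It is not micov's implementation: the function names `reference_span`, `merge_intervals`, and `covered_regions` are hypothetical, the interval convention (0-based start, exclusive stop) is an assumption, and the example records are made up for illustration.

```python
import re
from collections import defaultdict

# CIGAR operations that consume reference bases (per the SAM specification).
_REF_CONSUMING = set("MDN=X")
_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(pos, cigar):
    """Return (start, stop) on the reference for one alignment.

    `pos` is the 1-based leftmost position from the SAM record. The result
    uses a 0-based start and an exclusive stop (an assumed convention).
    """
    length = sum(int(n) for n, op in _CIGAR_RE.findall(cigar) if op in _REF_CONSUMING)
    start = pos - 1
    return start, start + length


def merge_intervals(intervals):
    """Collapse overlapping or adjacent (start, stop) pairs."""
    merged = []
    for start, stop in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged


def covered_regions(sam_lines):
    """Map each reference name to its merged covered intervals."""
    by_reference = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines, if any remain
            continue
        fields = line.rstrip("\n").split("\t")
        rname, pos, cigar = fields[2], int(fields[3]), fields[5]
        if rname == "*" or cigar == "*":  # unmapped read
            continue
        by_reference[rname].append(reference_span(pos, cigar))
    return {name: merge_intervals(spans) for name, spans in by_reference.items()}


# Made-up records: two overlapping reads and one read with a deletion.
records = [
    "r1\t0\tG1\t1\t60\t10M\t*\t0\t0\t*\t*",
    "r2\t0\tG1\t6\t60\t10M\t*\t0\t0\t*\t*",
    "r3\t0\tG1\t101\t60\t5M2D5M\t*\t0\t0\t*\t*",
]
print(covered_regions(records))  # {'G1': [(0, 15), (100, 112)]}
```

Merging after sorting keeps the output compact: many overlapping reads collapse into a few start/stop pairs per reference, which is the property the BED-like files rely on.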

3. Consolidate Coverage Files

After extracting coverage data, consolidate the .cov files into Parquet representations. This requires a length mapping file (length.tsv), which maps genome IDs to their corresponding genome lengths. An example length file can be found in ./example/metadata/length.tsv. If this file is not available, it can be generated, for example, with seqkit:

```bash
seqkit fx2tab --length --name --header-line foo.fasta > length.tsv
```
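If seqkit is not at hand, the same genome-ID-to-length mapping can be derived with a few lines of Python. The sketch below is illustrative only: `fasta_lengths` is a hypothetical helper, the sequence ID is assumed to be the header text before the first whitespace, and the exact column layout micov expects should be checked against ./example/metadata/length.tsv before writing a file.

```python
def fasta_lengths(lines):
    """Return {sequence_id: length} from an iterable of FASTA lines."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the ID is the header text before the first whitespace.
            header = line[1:].split()
            current = header[0] if header else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


# Made-up FASTA content; in practice, pass an open file handle instead.
example = [">G1 some description", "ACGTACGT", "ACGT", ">G2", "AAAA"]
print(fasta_lengths(example))  # {'G1': 12, 'G2': 4}
```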

Now, consolidate the coverage files. On read, micov will interpret the non-extension portion of a filename as the sample ID. For example, given foo/bar/baz.cov.gz, the sample ID will be baz.

```bash
micov nonqiita-to-parquet \
    --pattern "example/coverages/*.cov.gz" \
    --output example/parquet/example \
    --lengths example/metadata/length.tsv
```
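The filename-to-sample-ID rule described above can be sketched as follows. This is an approximation for checking your own file names, not micov's code: `sample_id_from_path` is a hypothetical helper, and how micov treats names with other extensions or extra dots is not covered here.

```python
from pathlib import PurePosixPath


def sample_id_from_path(path):
    """Strip the directory and a trailing .cov.gz or .cov extension."""
    name = PurePosixPath(path).name
    for suffix in (".cov.gz", ".cov"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name  # assumption: unknown extensions are left untouched


print(sample_id_from_path("foo/bar/baz.cov.gz"))  # baz
```

Running a check like this over the files matched by --pattern is a quick way to spot two files that would resolve to the same sample ID.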

4. Convert Coverage Data to Parquet Format

micov provides functionality to convert Qiita-formatted coverage data into Parquet format as well.

```bash
mkdir -p "./example/parquet"

# note: multiple coverage files can be specified by repeating the --qiita-coverages argument

micov qiita-to-parquet \
    --qiita-coverages "./example/consolidate/consolidated.tgz" \
    --output "./example/parquet/example" \
    --lengths "./example/metadata/length.tsv"
```

5. Generate Per-Sample-Group Plots

A series of plots can be constructed guided by metadata. Specifically, micov produces the following:

  • Non-cumulative coverage curves for each genome in the feature metadata.
  • Cumulative coverage curves for each genome in the feature metadata. These accumulation data are supported by K-S tests written to the output directory.
  • Scaled and unscaled position plots for each genome in the feature metadata.
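As a rough illustration of what a cumulative coverage curve represents, the Python sketch below computes the percent of a genome covered by the union of intervals as samples are added one at a time. It is a sketch under stated assumptions rather than micov's implementation: the helper names are hypothetical, intervals use a 0-based start and exclusive stop, and the order in which samples are accumulated is left to the caller.

```python
def covered_length(intervals):
    """Number of positions covered by a collection of (start, stop) pairs."""
    total = 0
    current_stop = 0
    for start, stop in sorted(intervals):
        start = max(start, current_stop)  # skip the part already counted
        if stop > start:
            total += stop - start
            current_stop = stop
    return total


def cumulative_curve(sample_intervals, genome_length):
    """Percent of the genome covered after adding each sample in order."""
    seen = []
    curve = []
    for intervals in sample_intervals:
        seen.extend(intervals)
        curve.append(100.0 * covered_length(seen) / genome_length)
    return curve


def non_cumulative_curve(sample_intervals, genome_length):
    """Percent of the genome covered by each sample on its own."""
    return [100.0 * covered_length(iv) / genome_length for iv in sample_intervals]


# Made-up intervals for three samples on a genome of length 100.
samples = [[(0, 10)], [(5, 20)], [(50, 60)]]
print(non_cumulative_curve(samples, 100))  # [10.0, 15.0, 10.0]
print(cumulative_curve(samples, 100))      # [10.0, 20.0, 30.0]
```

The second sample covers 15 positions by itself but adds only 10 new ones to the union, which is why the cumulative curve rises more slowly than the per-sample values would suggest.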

Categorical metadata can be used to group samples; sample-metadata is required. The genomes to examine can optionally be constrained using features-to-keep. Start and stop positions can also be specified within features-to-keep to restrict a genome to a specific region, but this is currently limited to a single region per genome.

micov expects the first column of a sample metadata file to be the sample ID under the header sample_id. Similarly, the first column of a feature metadata file should be the feature ID under the header genome_id.

The --output parameter specifies a prefix for the output files.

Optionally, Monte Carlo curves can be produced for the cumulative plots by specifying --monte. There are two Monte Carlo options: unfocused and focused. The unfocused option selects samples at random from those with any coverage data, while the focused option selects samples at random from those with nonzero coverage of the current genome. Both options select samples independently of sample metadata, and the number selected is the maximum number of samples observed in a sample group.
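The difference between the two selection modes can be sketched in Python as below. This is a hypothetical illustration, not micov's code: the function names, the `coverage` data structure (sample ID to per-genome covered length), and the example values are all assumptions made for the sketch.

```python
import random
from collections import Counter


def largest_group_size(sample_groups):
    """Number of samples in the biggest metadata group."""
    return max(Counter(sample_groups.values()).values())


def monte_carlo_selection(coverage, genome_id, n, focused, rng):
    """Pick up to n sample IDs at random, ignoring sample metadata.

    `coverage` maps sample ID -> {genome ID -> covered length} (assumed layout).
    """
    if focused:
        # Only samples with nonzero coverage of the genome of interest.
        pool = [s for s, g in coverage.items() if g.get(genome_id, 0) > 0]
    else:
        # Any sample with coverage data for at least one genome.
        pool = [s for s, g in coverage.items() if any(v > 0 for v in g.values())]
    return rng.sample(sorted(pool), min(n, len(pool)))


# Made-up data: s4 has no coverage at all, s2 does not cover G1.
coverage = {"s1": {"G1": 10}, "s2": {"G2": 5}, "s3": {"G1": 3, "G2": 1}, "s4": {}}
sample_groups = {"s1": "dog", "s2": "dog", "s3": "cat", "s4": "cat"}

n = largest_group_size(sample_groups)
rng = random.Random(0)  # seeded so the sketch is repeatable
focused_pick = monte_carlo_selection(coverage, "G1", n, True, rng)
unfocused_pick = monte_carlo_selection(coverage, "G1", n, False, rng)
print(sorted(focused_pick), len(unfocused_pick))
```

Because the focused pool excludes samples that never hit the genome, a focused curve is a stricter baseline for the same number of draws.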

Additionally, users can specify --percentile to display plots with the x-axis representing percentile of samples instead of absolute sample counts.

Pairwise Kolmogorov-Smirnov (KS) tests between all sample groups' cumulative coverage curves are automatically conducted and results saved in cumulative.ks.tsv. The KS test quantifies whether two sample groups differ in the distribution of their cumulative genome coverages, with the KS statistic measuring the maximal difference between the two cumulative distributions, and the KS p-value assessing the statistical significance of the difference.

```bash
mkdir -p "./example/plots/persamplegroups"

micov per-sample-group \
    --parquet-coverage "./example/parquet/example" \
    --sample-metadata "./example/metadata/samplemetadata.txt" \
    --sample-metadata-column "dog" \
    --features-to-keep "./example/metadata/featuremetadata.txt" \
    --output "./example/plots/persamplegroups/example" \
    --plot
```
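For readers unfamiliar with the KS statistic reported in cumulative.ks.tsv, the following Python sketch computes it for pairs of groups using only the standard library. It is an illustration under assumptions: the helper names and curve values are made up, each curve's values are treated as one sample, and only the statistic is computed. For p-values, `scipy.stats.ks_2samp` is the usual choice (scipy is among micov's listed dependencies).

```python
from bisect import bisect_right
from itertools import combinations


def ks_statistic(a, b):
    """Maximum distance between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:  # the maximum is attained at an observed value
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def pairwise_ks(curves):
    """KS statistic for every pair of groups in {group: values}."""
    return {
        (g1, g2): ks_statistic(curves[g1], curves[g2])
        for g1, g2 in combinations(sorted(curves), 2)
    }


# Made-up cumulative coverage values (percent) for two sample groups.
curves = {
    "cat": [10.0, 12.0, 14.0, 16.0],
    "dog": [10.0, 20.0, 30.0, 40.0],
}
print(pairwise_ks(curves))  # {('cat', 'dog'): 0.75}
```

A statistic near 0 means the two groups accumulate coverage in much the same way; a value near 1 means the distributions barely overlap.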

6. Binning and Ranking

The binning command divides genome positions into fixed-size bins and computes summary statistics across samples, based on sample metadata. This is useful for identifying regions of interest (e.g. regions with high variability across samples).

```bash
mkdir -p "./example/binning"

micov binning \
    --parquet-coverage ./example/parquet/example \
    --sample-metadata ./example/metadata/samplemetadata.txt \
    --features-to-keep ./example/metadata/featuremetadata.txt \
    --metadata-variable "dog" \
    --outdir ./example/binning \
    --rank
```

Each bin is ranked by the standard deviation of sample hits across the groups associated with the chosen metadata category, with bins exhibiting higher variability ranked at the top.

The rankings are saved in stats_by_variance_of_sample_hits.tsv, whereas binning statistics (start and end positions of each bin, number of sample hits per bin, number of read hits per bin, etc.) are saved in stats_bins.tsv.
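The ranking idea can be sketched in Python as follows. This is a simplified illustration rather than micov's implementation: the function names, the input layout, and the choice of population standard deviation are assumptions, and a "sample hit" is taken to mean that a sample has at least one covered interval overlapping the bin.

```python
from collections import defaultdict
from statistics import pstdev


def bins_hit(intervals, bin_size):
    """Indices of fixed-size bins overlapped by any (start, stop) interval."""
    hit = set()
    for start, stop in intervals:
        hit.update(range(start // bin_size, (stop - 1) // bin_size + 1))
    return hit


def rank_bins(sample_intervals, sample_groups, bin_size):
    """Rank bins by the spread of per-group sample-hit counts.

    Returns (bin_start, bin_stop, stdev) tuples, most variable first.
    """
    groups = sorted(set(sample_groups.values()))
    counts = defaultdict(lambda: dict.fromkeys(groups, 0))
    for sample, intervals in sample_intervals.items():
        for b in bins_hit(intervals, bin_size):
            counts[b][sample_groups[sample]] += 1
    ranked = [
        # Assumption: population standard deviation across groups.
        (b * bin_size, (b + 1) * bin_size, pstdev(list(per_group.values())))
        for b, per_group in counts.items()
    ]
    return sorted(ranked, key=lambda row: (-row[2], row[0]))


# Made-up covered intervals for four samples in two groups.
sample_intervals = {
    "s1": [(0, 50)],
    "s2": [(10, 40)],
    "s3": [(120, 180)],
    "s4": [(0, 10), (150, 160)],
}
sample_groups = {"s1": "dog", "s2": "dog", "s3": "cat", "s4": "cat"}
print(rank_bins(sample_intervals, sample_groups, 100))
# [(100, 200, 1.0), (0, 100, 0.5)]
```

In this toy data the second bin is hit by two cat samples and no dog samples, so it ranks above the first bin, where the groups differ by only one sample.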

7. Additional Usage (optional)

Existing SAM/BAM data can be converted into coverage percentages by specifying length data at compression:

```bash
$ xzcat some_data.sam.xz | micov compress --length length.tsv > coverages.tsv
```

Multiple coverage files for the same sample can be aggregated into a single file:

```bash
$ zcat run1/sample1.cov.gz run2/sample1.cov.gz | micov compress | gzip > combined/sample1.cov.gz
```

Owner

  • Name: biocore
  • Login: biocore
  • Kind: organization
  • Location: Cyberspace

Collaboratively developed bioinformatics software.

GitHub Events

Total
  • Create event: 20
  • Release event: 6
  • Watch event: 2
  • Delete event: 4
  • Member event: 2
  • Issue comment event: 9
  • Push event: 62
  • Pull request event: 37
  • Pull request review event: 31
  • Pull request review comment event: 34
  • Fork event: 4
Last Year
  • Create event: 20
  • Release event: 6
  • Watch event: 2
  • Delete event: 4
  • Member event: 2
  • Issue comment event: 9
  • Push event: 62
  • Pull request event: 37
  • Pull request review event: 31
  • Pull request review comment event: 34
  • Fork event: 4

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 8
  • Total pull requests: 46
  • Average time to close issues: N/A
  • Average time to close pull requests: 7 days
  • Total issue authors: 1
  • Total pull request authors: 4
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.41
  • Merged pull requests: 37
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 42
  • Average time to close issues: N/A
  • Average time to close pull requests: 6 days
  • Issue authors: 0
  • Pull request authors: 4
  • Average comments per issue: 0
  • Average comments per pull request: 0.36
  • Merged pull requests: 34
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • wasade (8)
Pull Request Authors
  • wasade (31)
  • sherlyn99 (20)
  • lucaspatel (2)
  • cguccione (2)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 21 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 6
  • Total maintainers: 1
pypi.org: micov
  • Documentation: https://micov.readthedocs.io/
  • License:

    ==========================
    The micov licensing terms
    ==========================

    The micov project is licensed under the terms of the Modified BSD License (also known as New or Revised BSD), as follows:

    Copyright (c) 2024-, The micov Development Team <damcdonald@ucsd.edu>
    All rights reserved.

    Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

    * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
    * Neither the name of the micov development team nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE MICOV DEVELOPMENT TEAM BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  • Latest release: 0.0.0
    published over 1 year ago
  • Versions: 6
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 21 Last month
Rankings
Dependent packages count: 9.5%
Average: 36.2%
Dependent repos count: 62.8%
Maintainers (1)
Last synced: 10 months ago

Dependencies

setup.py pypi
.github/workflows/ci.yml actions
  • actions/checkout v4 composite
  • conda-incubator/setup-miniconda v2 composite
ci/conda_requirements.txt pypi
  • click *
  • matplotlib *
  • numba *
  • polars *
  • scipy *
  • versioneer *
ci/requirements.lint.txt pypi
  • check-manifest *
  • ruff *
  • versioneer *
ci/requirements.test.txt pypi
  • pytest * test
  • versioneer * test
pyproject.toml pypi