semibin

SemiBin: metagenomics binning with self-supervised deep learning

https://github.com/bigdatabiology/semibin

Science Score: 64.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 12 DOI reference(s) in README
✓ Academic publication links: Links to scholar.google
✓ Committers with academic emails: 1 of 6 committers (16.7%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (15.0%) to scientific vocabulary

Keywords

assembly-binning bioinformatics deep-learning metagenomics self-supervised-learning

Last synced: 6 months ago

Repository

SemiBin: metagenomics binning with self-supervised deep learning

Basic Info

Host: GitHub
Owner: BigDataBiology
Language: Python
Default Branch: main
Homepage: https://semibin.rtfd.io/
Size: 106 MB

Statistics

Stars: 134
Watchers: 4
Forks: 12
Open Issues: 20
Releases: 23

Topics

assembly-binning bioinformatics deep-learning metagenomics self-supervised-learning

Created over 5 years ago · Last pushed 11 months ago

Metadata Files

Readme Changelog Citation

SemiBin: Metagenomic Binning Using Siamese Neural Networks for short and long reads

SemiBin is a command-line tool for metagenomic binning with deep learning that handles both short and long reads.

CONTACT US: Please use GitHub issues for bug reports and the SemiBin users mailing-list for more open-ended discussions or questions.

If you use this software in a publication please cite:

Pan, S.; Zhu, C.; Zhao, XM.; Coelho, LP. A deep siamese neural network improves metagenome-assembled genomes in microbiome datasets across different environments. Nat Commun 13, 2326 (2022). https://doi.org/10.1038/s41467-022-29843-y

The self-supervised approach and the algorithms used for long-read datasets (as well as their benchmarking) are described in

Pan, S.; Zhao, XM; Coelho, LP. SemiBin2: self-supervised contrastive learning leads to better MAGs for short- and long-read sequencing. Bioinformatics Volume 39, Issue Supplement_1, June 2023, Pages i21–i29; https://doi.org/10.1093/bioinformatics/btad209

Basic usage of SemiBin

A tutorial on running SemiBin from scratch can be found in the SemiBin tutorial.

Installation with conda:

    conda create -n SemiBin
    conda activate SemiBin
    conda install -c conda-forge -c bioconda semibin

This will install both the SemiBin2 command and (for backwards compatibility) the old SemiBin command. For new projects, it is recommended that you exclusively use SemiBin2: both commands do the same thing, but SemiBin2 has a slightly nicer interface.

The inputs to SemiBin are contigs (assembled from the reads) and BAM files (reads mapped to the contigs). The docs show how to generate these inputs starting from a metagenome.

Running with single-sample binning (for example: human gut samples):

    SemiBin2 single_easy_bin -i contig.fa -b S1.sorted.bam -o output --environment human_gut

(If you are using contigs from long reads, add the --sequencing-type=long_read argument.)

Running with multi-sample binning:

    SemiBin2 multi_easy_bin -i contig_whole.fa -b *.sorted.bam -o output

The output includes the bins in the output_bins directory (including the bin.*.fa and recluster.*.fa).

Please find more options and details below and read the docs.

Advanced Installation

SemiBin runs (and is continuously tested) on Python 3.7-3.13.

pixi

The current recommended way to install SemiBin with GPU-support is to use pixi. Pixi will use the packages from conda-forge and bioconda to install SemiBin and its dependencies. See the docs for more details, but the basic idea is to create a pixi.toml file with the following content:

```toml
[project]
authors = ["Luis Pedro Coelho luis@luispedro.org"]
channels = ["conda-forge", "bioconda"]
name = "semibin_install"
platforms = ["linux-64"]
version = "0.1.0"

[tasks]

[dependencies]
semibin = ">=2.2.0,<3"
pytorch-gpu = "*"

[system-requirements]
cuda = "12.0"
```

This will install SemiBin with GPU support, but it does require a CUDA-compatible GPU. Alternatively, you can install SemiBin in CPU-only mode by removing the pytorch-gpu and cuda lines.

Source

You will need the following dependencies: bedtools, hmmer and samtools.

The easiest way to install the dependencies is with conda:

    conda install -c bioconda bedtools hmmer samtools

Once the dependencies are installed, you can install SemiBin by running:

    pip install .

Optional extra dependencies:

Examples of binning

SemiBin runs on single-sample, co-assembly and multi-sample binning. Here we show the simple modes as an example. For the details and examples of every SemiBin subcommand, please read the docs.

Binning assemblies from long reads

Since version 1.4, SemiBin provides a new algorithm (ensemble-based DBSCAN) for binning assemblies from long reads. To use it, you can use the bin_long subcommand or pass the option --sequencing-type=long_read to the single_easy_bin or multi_easy_bin subcommands.

Easy single/co-assembly binning mode

Single sample and co-assembly are handled the same way by SemiBin.

You will need the following inputs:

A contig file (contig.fa in the example below)
BAM file(s) from mapping short reads to the contigs, sorted (mapped_reads.sorted.bam in the example below)

The single_easy_bin command can be used to produce results in a single step.

For example:

    SemiBin2 \
        single_easy_bin \
        --input-fasta contig.fa \
        --input-bam mapped_reads.sorted.bam \
        --environment human_gut \
        --output output

Alternatively, you can train a new model for that sample, by not passing in the --environment flag:

    SemiBin2 \
        single_easy_bin \
        --input-fasta contig.fa \
        --input-bam mapped_reads.sorted.bam \
        --output output

The following environments are supported:

human_gut
dog_gut
ocean
soil
cat_gut
human_oral
mouse_gut
pig_gut
built_environment
wastewater
chicken_caecum (Contributed by Florian Plaza Oñate)
global

The global environment can be used if none of the others is appropriate. Note that training a new model can take a lot of time and disk space. Some patience will be required. If you have a lot of samples from the same environment, you can also train a new model from them and reuse it.
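The two invocations above differ only in whether --environment is passed. As a rough illustration, the following Python sketch assembles the argument list and rejects environment names that are not in the list above. It is not part of SemiBin: the helper name `single_easy_bin_command` and its parameters are hypothetical, and only flags that appear in this README are used.

```python
import shlex

# Environment names copied from the list in this README.
SUPPORTED_ENVIRONMENTS = {
    "human_gut", "dog_gut", "ocean", "soil", "cat_gut", "human_oral",
    "mouse_gut", "pig_gut", "built_environment", "wastewater",
    "chicken_caecum", "global",
}


def single_easy_bin_command(contigs, bams, output, environment=None,
                            long_read=False):
    """Return the argument list for a SemiBin2 single_easy_bin run.

    With environment=None no --environment flag is added, which
    corresponds to training a new model for the sample.
    """
    if environment is not None and environment not in SUPPORTED_ENVIRONMENTS:
        raise ValueError(f"unsupported environment: {environment!r}")
    cmd = ["SemiBin2", "single_easy_bin", "--input-fasta", contigs,
           "--input-bam", *bams, "--output", output]
    if environment is not None:
        cmd += ["--environment", environment]
    if long_read:
        cmd.append("--sequencing-type=long_read")
    return cmd


cmd = single_easy_bin_command("contig.fa", ["mapped_reads.sorted.bam"],
                              "output", environment="human_gut")
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once SemiBin2 is installed; building a list rather than a string avoids shell quoting problems with unusual file names.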

Easy multi-samples binning mode

The multi_easy_bin command can be used in multi-samples binning mode:

You will need the following inputs:

A combined contig file
BAM files from mapping

Every contig must be named <sample_name>:<contig_name>, where : is the default separator (it can be changed with the --separator argument). NOTE: Make sure the sample names are unique and that the separator does not appear elsewhere in the names, otherwise splitting becomes ambiguous. For example:

```
>S1:Contig1
AGATAATAAAGATAATAATA
>S1:Contig2
CGAATTTATCTCAAGAACAAGAAAA
>S1:Contig3
AAAAAGAGAAAATTCAGAATTAGCCAATAAAATA
>S2:Contig1
AATGATATAATACTTAATA
>S2:Contig2
AAAATATTAAAGAAATAATGAAAGAAA
>S3:Contig1
ATAAAGACGATAAAATAATAAAAGCCAAATCCGACAAAGAAAGAACGG
>S3:Contig2
AATATTTTAGAGAAAGACATAAACAATAAGAAAAGTATT
>S3:Contig3
CAAATACGAATGATTCTTTATTAGATTATCTTAATAAGAATATC
```
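The naming convention can be checked before binning. The following is a minimal sketch and not SemiBin's own parsing code: the function names `split_contig_name` and `contigs_per_sample` are hypothetical, and requiring the separator to appear exactly once is an assumption that may be stricter than what SemiBin enforces.

```python
from collections import Counter


def split_contig_name(name, separator=":"):
    """Split '<sample_name><separator><contig_name>' into its two parts.

    Raises ValueError when the separator is missing or appears more than
    once, because the split would then be ambiguous.
    """
    if name.count(separator) != 1:
        raise ValueError(
            f"expected exactly one {separator!r} in contig name {name!r}"
        )
    sample, contig = name.split(separator)
    return sample, contig


def contigs_per_sample(names, separator=":"):
    """Count how many contigs each sample contributes."""
    return Counter(split_contig_name(n, separator)[0] for n in names)


# Contig names taken from the example FASTA in this README.
names = ["S1:Contig1", "S1:Contig2", "S1:Contig3", "S2:Contig1",
         "S2:Contig2", "S3:Contig1", "S3:Contig2", "S3:Contig3"]
print(dict(contigs_per_sample(names)))
# prints {'S1': 3, 'S2': 2, 'S3': 3}
```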

You can use this to get the combined contig:

    SemiBin2 concatenate_fasta -i contig*.fa -o output

If either the sample or the contig names use the default separator (:), you will need to change it with the --separator,-s argument.
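To make the renaming concrete, here is a rough Python sketch of building a combined FASTA by hand. It only illustrates the naming scheme and is not the implementation of concatenate_fasta; the helper name `combine_fastas` and its in-memory input format (a dict of sample name to FASTA text) are assumptions made for the example.

```python
def combine_fastas(samples, separator=":"):
    """Build a combined FASTA string from {sample_name: fasta_text}.

    Every header '>contig' becomes '>sample<separator>contig'. A ValueError
    is raised if a sample or contig name already contains the separator,
    since the combined names could then not be split back unambiguously.
    """
    out = []
    for sample, fasta_text in samples.items():
        if separator in sample:
            raise ValueError(f"sample name {sample!r} contains {separator!r}")
        for line in fasta_text.splitlines():
            if line.startswith(">"):
                fields = line[1:].split()
                if not fields:
                    raise ValueError("empty FASTA header")
                # Keep only the contig identifier (text before whitespace).
                contig = fields[0]
                if separator in contig:
                    raise ValueError(
                        f"contig name {contig!r} contains {separator!r}"
                    )
                out.append(f">{sample}{separator}{contig}")
            elif line.strip():
                out.append(line.strip())
    return "\n".join(out) + "\n"


samples = {
    "S1": ">Contig1\nAGAT\n>Contig2\nCGAA\n",
    "S2": ">Contig1\nAATG\n",
}
print(combine_fastas(samples), end="")
```

If a name does contain the separator, the sketch fails early instead of writing an ambiguous file; choosing a different separator is then the fix, as described above.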

After mapping samples (individually) to the combined FASTA file, you can get the results with one line of code:

    SemiBin2 multi_easy_bin -i concatenated.fa -b *.sorted.bam -o output

Running with abundance information from strobealign-aemb

Strobealign-aemb is a fast abundance estimation method for metagenomic binning. Because strobealign-aemb does not provide mapping information for every position of a contig, SemiBin2 cannot be run with strobealign-aemb in binning modes that use fewer than 5 samples, and the contigs need to be split beforehand to generate the must-link constraints.

Split the FASTA files to generate the must-link constraints:

    python script/generate_split.py -c contig.fa -o output

Map reads using strobealign-aemb to generate the abundance information:

    strobealign --aemb output/split.fa read1_1.fq read1_2.fq -R 6 > sample1.txt
    strobealign --aemb output/split.fa read2_1.fq read2_2.fq -R 6 > sample2.txt
    strobealign --aemb output/split.fa read3_1.fq read3_2.fq -R 6 > sample3.txt
    strobealign --aemb output/split.fa read4_1.fq read4_2.fq -R 6 > sample4.txt
    strobealign --aemb output/split.fa read5_1.fq read5_2.fq -R 6 > sample5.txt

Run SemiBin2 (like running SemiBin with BAM files):

    SemiBin2 generate_sequence_features_single -i contig.fa -a *.txt -o output
    SemiBin2 generate_sequence_features_multi -i contig.fa -a *.txt -s : -o output
    SemiBin2 single_easy_bin -i contig.fa -a *.txt -o output
    SemiBin2 multi_easy_bin -i contig.fa -a *.txt -s : -o output
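The mapping step repeats the same strobealign command once per sample, so it can be generated rather than typed. The sketch below is an illustration under stated assumptions: the helper name `strobealign_aemb_commands` is hypothetical, the `sample<N>.txt` output naming simply mirrors the example above, and the minimum of 5 samples reflects the restriction described in this section.

```python
import shlex


def strobealign_aemb_commands(split_fasta, read_pairs, min_samples=5):
    """Return one shell command line per sample.

    read_pairs is a list of (forward_reads, reverse_reads) tuples. Each
    command redirects the abundance table to sample<N>.txt, numbered from 1.
    """
    if len(read_pairs) < min_samples:
        raise ValueError(
            f"need at least {min_samples} samples, got {len(read_pairs)}"
        )
    commands = []
    for i, (fwd, rev) in enumerate(read_pairs, start=1):
        args = ["strobealign", "--aemb", split_fasta, fwd, rev, "-R", "6"]
        quoted = " ".join(shlex.quote(arg) for arg in args)
        commands.append(f"{quoted} > sample{i}.txt")
    return commands


pairs = [(f"read{i}_1.fq", f"read{i}_2.fq") for i in range(1, 6)]
for line in strobealign_aemb_commands("output/split.fa", pairs):
    print(line)
```

The printed lines can be written to a shell script or passed to a job scheduler, one line per sample.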

Output

The output folder will contain:

Features computed from the data and used for training and clustering
Saved semi-supervised deep learning model
Output bins
Table with basic information about each bin
Some intermediate files

By default, bins are written to the output_bins directory.

For more details about the output, read the docs.
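As a quick sanity check after a run, the bins can be listed and their contigs counted. This is a minimal sketch, not a SemiBin utility: the function name `summarize_bins` is hypothetical, and it assumes the bins are uncompressed FASTA files named bin.*.fa or recluster.*.fa inside output_bins, as described in this README. Compressed bins would need to be opened with `gzip` instead.

```python
from pathlib import Path


def summarize_bins(output_dir):
    """Return {bin_file_name: number_of_contigs} for FASTA bins.

    Looks in <output_dir>/output_bins for files matching bin.*.fa and
    recluster.*.fa, and counts header lines (those starting with '>').
    """
    bins_dir = Path(output_dir) / "output_bins"
    summary = {}
    for pattern in ("bin.*.fa", "recluster.*.fa"):
        for path in sorted(bins_dir.glob(pattern)):
            with path.open() as handle:
                summary[path.name] = sum(
                    1 for line in handle if line.startswith(">")
                )
    return summary
```

For example, `summarize_bins("output")` returns an empty dict when no bins were produced, which is an easy condition to test for in a pipeline.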

Owner

Name: Big Data Biology Lab
Login: BigDataBiology
Kind: organization
Email: luis@luispedro.org

Repositories: 15
Profile: https://github.com/BigDataBiology

Citation (CITATION.md)

If you use this software in a publication please cite:

>  Pan, S.; Zhu, C.; Zhao, XM.; Coelho, LP. A deep siamese neural network
>  improves metagenome-assembled genomes in microbiome datasets across
>  different environments. *Nat Commun* **13,** 2326 (2022).
>  https://doi.org/10.1038/s41467-022-29843-y

And

> Pan, S.; Zhao, XM; Coelho, LP. SemiBin2: Self-Supervised Contrastive Learning
> Leads to Better MAGs for Short- and Long-Read Sequencing. *Bioinformatics* 39
> (Suppl 1): i21–i29 (2023). https://doi.org/10.1093/bioinformatics/btad209

GitHub Events

Total

Create event: 2
Release event: 1
Issues event: 42
Watch event: 22
Issue comment event: 65
Push event: 24
Pull request event: 2
Fork event: 1

Last Year

Create event: 2
Release event: 1
Issues event: 42
Watch event: 22
Issue comment event: 65
Push event: 24
Pull request event: 2
Fork event: 1

Committers

Last synced: over 1 year ago

All Time

Total Commits: 574
Total Committers: 6
Avg Commits per committer: 95.667
Development Distribution Score (DDS): 0.491

Past Year

Commits: 80
Committers: 2
Avg Commits per committer: 40.0
Development Distribution Score (DDS): 0.188

Top Committers

Name	Email	Commits
Luis Pedro Coelho	l**s@l**g	292
psj1997	1**9@q**m	268
SvetlanaUP	6****P	10
Yang Yunyi	y**g@l**k	2
Florian Plaza Oñate	f**e@i**r	1
Sebastian Jaenicke	s**k@C**E	1

Committer Domains (Top 20 + Academic)

cebitec.uni-bielefeld.de: 1
inra.fr: 1
ln.hk: 1
qq.com: 1
luispedro.org: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 126
Total pull requests: 39
Average time to close issues: 2 months
Average time to close pull requests: 3 days
Total issue authors: 78
Total pull request authors: 7
Average comments per issue: 3.19
Average comments per pull request: 0.46
Merged pull requests: 31
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 24
Pull requests: 2
Average time to close issues: about 1 month
Average time to close pull requests: less than a minute
Issue authors: 17
Pull request authors: 1
Average comments per issue: 1.0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

luispedro (12)
fplaza (7)
SilasK (5)
Louis-MG (3)
guangingmai (3)
FelipeMSD (3)
adityabandla (3)
LanSabb (2)
SvetlanaUP (2)
yazhinia (2)
ZarulHanifah (2)
PeterCx (2)
B-1991-ing (2)
nashanghenzan (2)
rhysnewell (2)

Pull Request Authors

SvetlanaUP (14)
luispedro (14)
psj1997 (6)
SebastianDall (2)
alienzj (2)
fplaza (1)
sjaenick (1)

Top Labels

Issue Labels

enhancement (4)
good first issue (3)
bad-error-message (2)
wontfix (2)
bug (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 124 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 23
Total maintainers: 1

pypi.org: semibin

Metagenomic binning with siamese neural networks

Documentation: https://semibin.readthedocs.io/
License: MIT
Latest release: 2.2.0
published 11 months ago

Versions: 23
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 124 Last month

Rankings

Stargazers count: 7.9%

Dependent packages count: 10.1%

Forks count: 11.4%

Downloads: 12.5%

Average: 12.7%

Dependent repos count: 21.6%

Maintainers (1)

luispedro

Last synced: 6 months ago

Dependencies

docs/requirements.txt pypi

mkdocs >=1.3.0

setup.py pypi

atomicwrites *
numexpr *
numpy *
pandas *
python-igraph *
pyyaml *
requests *
scikit-learn *
torch *
tqdm *

.github/workflows/semibin_test.yml actions

actions/checkout v2 composite
actions/setup-python v2 composite
conda-incubator/setup-miniconda v2 composite
