https://github.com/cfsan-biostatistics/rabbittclust
RabbitTClust: enabling fast clustering analysis of millions bacteria genomes with MinHash sketches
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
○codemeta.json file
-
○.zenodo.json file
-
✓DOI references
Found 2 DOI reference(s) in README -
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (10.3%) to scientific vocabulary
Last synced: 10 months ago
·
JSON representation
Repository
RabbitTClust: enabling fast clustering analysis of millions bacteria genomes with MinHash sketches
Basic Info
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Fork of RabbitBio/RabbitTClust
Created over 3 years ago
· Last pushed over 3 years ago
https://github.com/CFSAN-Biostatistics/RabbitTClust/blob/main/
 RabbitTClust is a fast and memory-efficient genome clustering tool based on sketch-based distance estimations. It enables processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. RabbitTClust supports classical single-linkage hierarchical (clust-mst) and greedy incremental clustering (clust-greedy) algorithms for different scenarios. ## Installation RabbitTClust version 2.0 can only support 64-bit Linux Systems. ### Dependancy * cmake v.3.0 or later * c++14 * [zlib](https://zlib.net/) ### Compile and install automatically ```bash git clone --recursive https://github.com/RabbitBio/RabbitTClust.git cd RabbitTClust ./install.sh ``` ### Compile and install manually ```bash git clone --recursive https://github.com/RabbitBio/RabbitTClust.git cd RabbitTClust #make rabbitSketch library cd RabbitSketch && mkdir -p build && cd build && cmake -DCXXAPI=ON -DCMAKE_INSTALL_PREFIX=. .. && make -j8 && make install && cd ../../ && #make rabbitFX library cd RabbitFX && mkdir -p build && cd build && cmake -DCMAKE_INSTALL_PREFIX=. .. && make -j8 && make install && cd ../../ && #compile the clust-greedy mkdir -p build && cd build && cmake -DUSE_RABBITFX=ON -DUSE_GREEDY=ON .. && make -j8 && make install && cd ../ && #compile the clust-mst cd build && cmake -DUSE_RABBITFX=ON -DUSE_GREEDY=OFF .. && make -j8 && make install && cd ../ ``` ## Usage ```bash usage: clust-mst [-h] [-l] [-t][-d] [-F] [-i] [-o] usage: clust-mst [-h] [-f] [-E] [-d] [-i] [-o] usage: clust-greedy [-h] [-l] [-t] [-d] [-F] [-i] [-o] usage: clust-greedy [-h] [-f] [-d] [-i] [-o] -h : this help message -k : set kmer size, default 21, for both clust-mst and clust-greedy -s : set sketch size, default 1000, for both clust-mst and clust-greedy -c : set sampling ratio to compute variable sketchSize, sketchSize = genomeSize/samplingRatio, only support with MinHash sketch function, for clust-greedy -d : set the distance threshold, default 0.05 for both clust-mst and clust-greedy -t : set the thread number, default take full usage of platform cores number, for both clust-mst and clust-greedy -l : input is a file list, not a single genome file. Lines in the input file list specify paths to genome files, one per line, for both clust-mst and clust-greedy -i : path of input file. One file list or single genome file. Two input file with -f and -E option -f : two input files, genomeInfo and MSTInfo files for clust-mst; genomeInfo and sketchInfo files for clust-greedy -E : two input files, genomeInfo and sketchInfo for clust-mst -F : set the sketch function, including MinHash and KSSD, default MinHash, for both clust-mst and clust-greedy -o : path of output file, for both clust-mst and clust-greedy -e : not save the intermediate file generated from the origin genome file, such as the GenomeInfo, MSTInfo, and SketchInfo files, for both clust-mst and clust-greedy ``` ## Example: ```bash #input is a file list, one genome path per line: ./clust-mst -l -i bact_refseq.list -o bact_refseq.mst.clust ./clust-greedy -l -i bact_genbank.list -o bact_genbank.greedy.clust #input is a single genome file in FASTA format, one genome as a sequence: ./clust-mst -i bacteria.fna -o bacteria.mst.clust ./clust-greedy -i bacteria.fna -o bacteria.greedy.clust #the sketch size (reciprocal of sampling proportion), kmer size, and distance threshold can be specified by -s (-c), -k, and -d options. ./clust-mst -l -k 21 -s 1000 -d 0.05 -i bact_refseq.list -o bact_refseq.mst.clust ./clust-greedy -l -k 21 -c 1000 -d 0.05 -i bact_genbank.list -o bact_genbank.greedy.clust #for redundancy detection with clust-greedy, input is a genome file list: #use -d to specify the distance threshold corresponding to various degrees of redundancy. ./clust-greedy -d 0.001 -l -i bacteriaList -o bacteria.out #for generator cluster from exist MST with a distance threshold of 0.045: #ATTENTION: the -f must in front of the -i option ./clust-mst -d 0.05 -f -i bact_refseq.list.MinHashGenomeInfo bact_refseq.list.MinHashMSTInfo -o bact_refseq.mst.d.045.clust #for generator cluster from exist sketches of clust-greedy with a distance threshold of 0.001: #ATTENTION: the -f must in front of the -i option ./clust-greedy -d 0.001 -f -i bact_genbank.list.MinHashGenomeInfo bact_genbank.list.MinHashSketchInfo -o bact_genbank.greedy.d.001.clust ``` ## Output The output file is in a CD-HIT output format and is slightly different when running with varying input options (*-l* and *-i*). Option *-l* means input as a FASTA file list, one file per genome, and *-i* means input as a single FASTA file, one sequence per genome. #### Output format for a FASTA file list input With *-l* option, the tab-delimited values in the lines beginning with tab delimiters are: * local index in a cluster * global index of the genome * genome length * genome file name (including genome assembly accession number) * sequence name (first sequence in the genome file) * sequence comment (remaining part of the line) **Example:** ```txt the cluster 0 is: 0 0 14782125nt bacteria/GCF_000418325.1_ASM41832v1_genomic.fna NC_021658.1 Sorangium cellulosum So0157-2, complete sequence 1 1 14598830nt bacteria/GCF_004135755.1_ASM413575v1_genomic.fna NZ_CP012672.1 Sorangium cellulosum strain So ce836 chromosome, complete genome the cluster 1 is: 0 2 14557589nt bacteria/GCF_002950945.1_ASM295094v1_genomic.fna NZ_CP012673.1 Sorangium cellulosum strain So ce26 chromosome, complete genome the cluster 2 is: 0 3 13673866nt bacteria/GCF_019396345.1_ASM1939634v1_genomic.fna NZ_JAHKRM010000001.1 Nonomuraea guangzhouensis strain CGMCC 4.7101 NODE_1, whole genome shotgun sequence ...... ``` #### Output format for a single FASTA file input With *-i* option, the tab-delimited values in the lines beginning with tab delimiters are: * local index in a cluster * global index of the genome * genome length * sequence name * sequence comment (remaining part of this line) **Example:** ```txt the cluster 0 is: 0 0 11030030nt NZ_GG657755.1 Streptomyces himastatinicus ATCC 53653 supercont1.2, whole genome shotgun sequence 1 1 11008137nt NZ_RIBZ01000339.1 Streptomyces sp. NEAU-LD23 C2041, whole genome shotgun sequence the cluster 1 is: 0 2 11006208nt NZ_KL647031.1 Nonomuraea candida strain NRRL B-24552 Doro1_scaffold1, whole genome shotgun sequence the cluster 2 is: 0 3 10940472nt NZ_VTHK01000001.1 Amycolatopsis anabasis strain EGI 650086 RDPYD18112716_A.Scaf1, whole genome shotgun sequence ...... ``` # Bug Report All bug reports, comments and suggestions are welcome. ## Cite [Xu, X. et al. (2022). RabbitTClust: enabling fast clustering analysis of millions bacteria genomes with minhash sketches. bioRxiv.](https://doi.org/10.1101/2022.10.13.512052)
Owner
- Name: CFSAN (Center for Food Safety and Applied Nutrition)
- Login: CFSAN-Biostatistics
- Kind: organization
- Website: https://www.fda.gov/about-fda/fda-organization/center-food-safety-and-applied-nutrition-cfsan
- Repositories: 18
- Profile: https://github.com/CFSAN-Biostatistics
GitHub Events
Total
- Fork event: 1
Last Year
- Fork event: 1