https://github.com/broadinstitute/tandem-repeat-catalog
Genome-wide TR catalog and variation clusters described in [Weisburd, Dolzhenko, et al. 2024]
Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 3 DOI reference(s) in README -
✓Academic publication links
Links to: biorxiv.org -
○Academic email domains
-
✓Institutional organization owner
Organization broadinstitute has institutional domain (www.broadinstitute.org) -
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (7.5%) to scientific vocabulary
Repository
Genome-wide TR catalog and variation clusters described in [Weisburd, Dolzhenko, et al. 2024]
Basic Info
Statistics
- Stars: 15
- Watchers: 7
- Forks: 0
- Open Issues: 0
- Releases: 4
Metadata Files
README.md
Tandem Repeat Catalog & Variation Clusters
This repo contains a genome-wide TR catalog with 4.9 million loci.
It stratifies TRs into 2 groups: 1) isolated TRs suitable for traditional repeat copy number analysis using short-read or long-read data 2) TRs embedded within wider polymorphic regions (ie. variation clusters) that are best studied through sequence-level analysis
Release v1.0 is available for download, and is described in:
Defining a tandem repeat catalog and variation clusters for genome-wide analyses and population databases
Ben Weisburd, Egor Dolzhenko, Mark F. Bennett, Matt C. Danzi, Adam English, Laurel Hiatt, Hope Tanudisastro, Nehir Edibe Kurtas, Helyaneh Ziaei Jam, Harrison Brand, Fritz J. Sedlazeck, Melissa Gymrek, Harriet Dashnow, Michael A. Eberle, Heidi L. Rehm
bioRxiv 2024.10.04.615514; doi: https://doi.org/10.1101/2024.10.04.615514
Background
Tandem repeats (TRs) are regions of the genome that consist of consecutive copies of some motif sequence. For example, CAGCAGCAG is a tandem repeat of the CAG motif. Many types of genomic studies require annotations of tandem repeats in the reference genome, called repeat catalogs, which specify the genomic start and end coordinates of each tandem repeat region, as well as the one or more motifs that repeat there.
For example, if a hypothetical region at the beginning of chrX consisted of the following nucleotide sequence:
ATCAGTAGA ATATATATAT CAGACAGCAGCAG TGAGTGCGTAC...
it could be represented in a repeat catalog as two entries:
chrX:10-19 (AT)*
chrX:20-32 (CAG)*
indicating that a repeat of the AT motif occurs between positions 10 and 19 (inclusive), and of the CAG motif between positions 20 and 32.
A genome-wide catalog would contain such entries for all repeat regions of interest found anywhere in the genome.
Catalog Construction
The genome-wide TR catalog was created by combining 4 source catalogs in order: 1) Known disease-associated loci 2) Illumina catalog of 174k polymorphic repeats 3) All perfect repeats in hg38 that span ≥ 9bp and consist of at least 3 repeats of any motif between 1 and 1000 bp in size. These were identified using ColabRepeatFinder. 4) Catalog of polymorphic loci computed by applying methods described in [Weisburd et al. 2023] to 78 haplotype-resolved T2T assemblies from the HPRC and HGSVC
The numbers (and %) of loci in the combined catalog that were added from each of the source catalogs were as follows:
83 out of 83 (100.0%) TRs from source 1: known disease-associated loci as well as 20 adjacent or historical candidate loci
174,244 out of 174,286 (100.0%) TRs from source 2: Illumina catalog of 174k polymorphic loci
4,391,197 out of 4,558,281 ( 96.3%) TRs from source 3: perfect repeats in hg38
297,517 out of 1,937,805 ( 15.4%) TRs from source 4: polymorphic loci in 78 haplotype-resolved T2T assemblies
The merging procedure involved taking all loci from the 1st catalog, then all non-duplicate loci from the next catalog, then from the third catalog and so on, in the order listed above. A locus was considered a duplicate if it overlapped a previously-added locus by at least 66% and the two loci had the same motif after cyclic shift and/or reverse complement (ie. CAG, AGC, GCA, CTG, TGC, GCT were considered to be the same motif).
Catalog Stats
The following catalog stats for v1.0 were computed using stranalysis/computecatalog_stats.py:
``` Stats for repeatcatalogv1.hg38.1to1000bpmotifs.EH.withannotations.json.gz: 4,863,041 total loci 65,678,112 base pairs spanned by all loci (2.127% of the genome) 3,210,115 out of 4,863,041 ( 66.0%) repeat interval size is an integer multiple of the motif size (aka. trimmed) 1,567,337 out of 4,863,041 ( 32.2%) repeat intervals are homopolymers 18,340 out of 4,863,041 ( 0.4%) repeat intervals overlap each other by at least two motif lengths 11 out of 4,863,041 ( 0.0%) repeat intervals have non-ACGT motifs Examples of overlapping repeats: chr1:82008141-82008152, chr3:78937990-78938032, chr4:1046750-1046794, chr5:52437153-52437201, chr6:34683425-34683453, chr4:107646310-107646327, chr18:52413438-52413450, chr6:150149299-150149323, chr7:40295582-40295597, chr9:35561915-35561931
Ranges: Motif size range: 1-833bp Locus size range: 1-2523bp Num repeats range: 1-300x repeats
Max locus size = 2,523bp @ chrX:71520430-71522953 (CCAGCACTTTGGGAGGCCGAGGCAGGCTGATCACTAGGTCAGGAGTTCAAGACCAGCCTGGCCAACATGGTGAAACCCCCGTCTCTACTAAAAATACAAAAATTACCTGGGTGTGGGGGTGGGCACCTGTAATCCCAGCTACTCGGGAGGCTGGGGAGGCAGGAGAATTGCCTGAACCTGAGAGGCAGAGGCTGCAGTGAGCTGAGATTGTGCCACTGCACTCCAGCCTGGGCGACAGAGTGAGACTCAGTCTCAAAACAAAAAAAAAAAAAGATTTTAGTAACTTTTATCCTGTTTTAATAATACTGACTCAGAAACTATAATGTGTACTTTATAATTTACTTCCTAGATGACACTTGATTTTCTTCAAGAGCAAGATAGCTGCCCTGTGCAGTTGGTCTCCTTGAAAACTATTTTAGTTCTATCATAATTTCCTGTGATAAATATTTTGACCTTCTAAAATTTCAGAATATTGCACCAAGTAGAAAGAAAATAGGTTTTTTCTCTTTTCTTCTTCTTCCTTTTTTTTTTCTGAGAAAGAGGGAATGAGAACTTTAGTGTTCTTTCAATAGCGTTCTTATTTGTAGAAATGCATAATAGTGTCCTAGTAAGGCTTGACAATAACTCTGGTCTTCATCATATTTTGTGATAAAACTTTTGATTTAAAAAAACCTCTGATCTATTTATCATGGCAAATGGATAGAGCTTTCCTGCCTGTTTTCTTTCTTTTCTTTTTTCTTTCTTTCCTTTTTTTTCCTTTGAGCTTAGATTTTTAGAAGCACATATTTAAAAATCAGGTATAAGACTGGATGCAGTGGCTCACGCCTGTAATC) Min reference repeat purity = 0.43 @ chr3:112804380-112804514 (TCT) Min overall mappability = 0.00 @ chrY:56887882-56887891 (TGA) Base-level purity median: 1.000, mean: 0.999
chrX: 244,191 out of 4,863,041 ( 5.0%) repeat intervals
chrY: 39,257 out of 4,863,041 ( 0.8%) repeat intervals
chrM: 14 out of 4,863,041 ( 0.0%) repeat intervals
alt contigs: 0 out of 4,863,041 ( 0.0%) repeat intervals
Motif size distribution: 1bp: 1,567,337 out of 4,863,041 ( 32.2%) repeat intervals 2bp: 978,972 out of 4,863,041 ( 20.1%) repeat intervals 3bp: 1,432,117 out of 4,863,041 ( 29.4%) repeat intervals 4bp: 590,787 out of 4,863,041 ( 12.1%) repeat intervals 5bp: 177,422 out of 4,863,041 ( 3.6%) repeat intervals 6bp: 56,731 out of 4,863,041 ( 1.2%) repeat intervals 7-24bp: 43,996 out of 4,863,041 ( 0.9%) repeat intervals 25+bp: 15,679 out of 4,863,041 ( 0.3%) repeat intervals
Num repeats in reference: 1x: 10,443 out of 4,863,041 ( 0.2%) repeat intervals 2x: 38,922 out of 4,863,041 ( 0.8%) repeat intervals 3x: 1,799,189 out of 4,863,041 ( 37.0%) repeat intervals 4x: 650,397 out of 4,863,041 ( 13.4%) repeat intervals 5x: 356,525 out of 4,863,041 ( 7.3%) repeat intervals 6x: 151,893 out of 4,863,041 ( 3.1%) repeat intervals 7x: 85,760 out of 4,863,041 ( 1.8%) repeat intervals 8x: 257,475 out of 4,863,041 ( 5.3%) repeat intervals 9x: 352,993 out of 4,863,041 ( 7.3%) repeat intervals 10-15x: 759,188 out of 4,863,041 ( 15.6%) repeat intervals 16-25x: 348,837 out of 4,863,041 ( 7.2%) repeat intervals 26-35x: 45,478 out of 4,863,041 ( 0.9%) repeat intervals 36-50x: 5,610 out of 4,863,041 ( 0.1%) repeat intervals 51+x: 331 out of 4,863,041 ( 0.0%) repeat intervals
Reference repeat purity distribution: 0.0: 0 out of 4,863,041 ( 0.0%) repeat intervals 0.1: 0 out of 4,863,041 ( 0.0%) repeat intervals 0.2: 0 out of 4,863,041 ( 0.0%) repeat intervals 0.3: 0 out of 4,863,041 ( 0.0%) repeat intervals 0.4: 3 out of 4,863,041 ( 0.0%) repeat intervals 0.5: 14 out of 4,863,041 ( 0.0%) repeat intervals 0.6: 44 out of 4,863,041 ( 0.0%) repeat intervals 0.7: 1,570 out of 4,863,041 ( 0.0%) repeat intervals 0.8: 12,336 out of 4,863,041 ( 0.3%) repeat intervals 0.9: 21,126 out of 4,863,041 ( 0.4%) repeat intervals 1.0: 4,827,948 out of 4,863,041 ( 99.3%) repeat intervals
Mappability distribution: 0.0: 154,279 out of 4,863,041 ( 3.2%) loci 0.1: 214,471 out of 4,863,041 ( 4.4%) loci 0.2: 246,877 out of 4,863,041 ( 5.1%) loci 0.3: 236,856 out of 4,863,041 ( 4.9%) loci 0.4: 391,388 out of 4,863,041 ( 8.0%) loci 0.5: 561,639 out of 4,863,041 ( 11.5%) loci 0.6: 352,273 out of 4,863,041 ( 7.2%) loci 0.7: 306,208 out of 4,863,041 ( 6.3%) loci 0.8: 337,715 out of 4,863,041 ( 6.9%) loci 0.9: 626,047 out of 4,863,041 ( 12.9%) loci 1.0: 1,435,288 out of 4,863,041 ( 29.5%) loci
Locus sizes at each motif size: 1bp motifs: locus size range: 1 bp to 90 bp (median: 11 bp) based on 1,567,337 loci. Mean base purity: 1.00. Mean mappability: 0.66 2bp motifs: locus size range: 2 bp to 600 bp (median: 10 bp) based on 978,972 loci. Mean base purity: 1.00. Mean mappability: 0.76 3bp motifs: locus size range: 3 bp to 632 bp (median: 9 bp) based on 1,432,117 loci. Mean base purity: 1.00. Mean mappability: 0.75 4bp motifs: locus size range: 4 bp to 533 bp (median: 14 bp) based on 590,787 loci. Mean base purity: 1.00. Mean mappability: 0.68 5bp motifs: locus size range: 5 bp to 400 bp (median: 18 bp) based on 177,422 loci. Mean base purity: 1.00. Mean mappability: 0.61 6bp motifs: locus size range: 6 bp to 1,103 bp (median: 20 bp) based on 56,731 loci. Mean base purity: 1.00. Mean mappability: 0.62 7bp motifs: locus size range: 7 bp to 151 bp (median: 22 bp) based on 15,083 loci. Mean base purity: 1.00. Mean mappability: 0.58 8bp motifs: locus size range: 8 bp to 312 bp (median: 25 bp) based on 7,107 loci. Mean base purity: 1.00. Mean mappability: 0.57 9bp motifs: locus size range: 9 bp to 153 bp (median: 28 bp) based on 3,231 loci. Mean base purity: 1.00. Mean mappability: 0.51 10bp motifs: locus size range: 10 bp to 150 bp (median: 31 bp) based on 2,713 loci. Mean base purity: 1.00. Mean mappability: 0.50 ```
Additional stats can be found in the [run log]
Owner
- Name: Broad Institute
- Login: broadinstitute
- Kind: organization
- Location: Cambridge, MA
- Website: http://www.broadinstitute.org/
- Twitter: broadinstitute
- Repositories: 1,083
- Profile: https://github.com/broadinstitute
Broad Institute of MIT and Harvard
GitHub Events
Total
- Create event: 2
- Release event: 2
- Issues event: 2
- Watch event: 6
- Issue comment event: 2
- Push event: 1
Last Year
- Create event: 2
- Release event: 2
- Issues event: 2
- Watch event: 6
- Issue comment event: 2
- Push event: 1
Issues and Pull Requests
Last synced: about 1 year ago
All Time
- Total issues: 1
- Total pull requests: 1
- Average time to close issues: about 1 month
- Average time to close pull requests: about 4 hours
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 8.0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 1
- Average time to close issues: about 1 month
- Average time to close pull requests: about 4 hours
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 8.0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- hdashnow (1)
Pull Request Authors
- bw2 (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- matplotlib *
- pandas *
- seaborn *
- ijson *
- matplotlib *
- pandas *
- seaborn *
- weisburd/str-analysis@sha256 69f989a7e6e34a34c7aee126f9501caaa49ba879464577ca07dae6ac73b213b9 build