https://github.com/broadinstitute/sma-finder

A tool for diagnosing SMA using exome, genome or targeted sequencing data

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
✓ Academic publication links: Links to ncbi.nlm.nih.gov
○ Academic email domains
✓ Institutional organization owner: Organization broadinstitute has institutional domain (www.broadinstitute.org)
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (8.3%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: broadinstitute
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 3.6 MB

Statistics

Stars: 11
Watchers: 4
Forks: 1
Open Issues: 0
Releases: 3

Created almost 4 years ago · Last pushed about 1 year ago

Metadata Files

Readme License

SMA Finder

SMA Finder is a tool for diagnosing spinal muscular atrophy (SMA) from short-read exome, genome, or panel sequencing data. It takes a reference sequence (FASTA) and one or more alignment files (CRAM or BAM) as input, evaluates reads at the c.840 position of SMN1 and SMN2 to detect the most common molecular causes of SMA, and then reports whether the data indicate a complete loss of functional SMN1, and therefore a diagnosis of SMA.

SMA Finder has been tested on over 30,000 exome, genome, and panel sequencing samples from CMG/GREGoR rare disease cohorts as well as 200,000 exomes and genomes from the UK Biobank (UKBB). In these tests, SMA Finder's false positive rate was less than 1 in 200,000, its true positive rate was 28/28 (100%), and its positive predictive value (PPV) was 97%.
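The relationship between these figures can be checked with a few lines of arithmetic. The sketch below uses the standard definitions of sensitivity (true positive rate) and PPV. The 28 true positives come from the text above; the single false positive is an assumption chosen only because it reproduces a PPV of roughly 97%, and is not a count stated here.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """True positive rate: TP / (TP + FN)."""
    affected = true_positives + false_negatives
    if affected == 0:
        raise ValueError("sensitivity is undefined when there are no affected samples")
    return true_positives / affected


def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV: TP / (TP + FP)."""
    called_positive = true_positives + false_positives
    if called_positive == 0:
        raise ValueError("PPV is undefined when no samples are called positive")
    return true_positives / called_positive


if __name__ == "__main__":
    # 28 of 28 known cases detected (from the text above).
    print(f"sensitivity: {sensitivity(28, 0):.0%}")
    # ASSUMPTION: one false positive, used purely for illustration.
    print(f"PPV: {positive_predictive_value(28, 1):.0%}")
```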

Limitations:
- does not report SMA carrier status or SMN1/SMN2 copy numbers
- does not detect the ~5% of cases caused by SMN1 loss-of-function mutations that do not involve the c.840 position
- requires at least 14 reads to overlap the c.840 position in SMN1 plus SMN2 in order to make a call
- was developed and tested on Illumina short-read sequencing data generated using DNA extracted from whole blood and aligned using the BWA aligner. Performance on data from other sequencing technologies, sample types, and alignment pipelines is unknown.

Publication

For more information about SMA Finder, see:

Weisburd B, Sharma R, Pata V, et al. Detecting missed diagnoses of spinal muscular atrophy in genome, exome, and panel sequencing datasets. Preprint. medRxiv. 2024;2024.02.11.24302646. Published 2024 Feb 27. doi:10.1101/2024.02.11.24302646

Install

To install the latest version of SMA Finder, run: python3 -m pip install -U sma-finder

Example

Example command: sma_finder --verbose --hg38-reference-fasta /ref/hg38.fa sample1.cram

Command output:
```
Input args:
    --hg38-reference-fasta: /ref/hg38.fa
    --output-tsv: sample1.sma_finder_results.tsv
    CRAMS or BAMS: sample1.cram

Output row #1:
    filename_prefix              sample1
    file_type                    cram
    genome_version               hg38
    sample_id                    s1
    sma_status                   has SMA
    confidence_score             168
    c840_reads_with_smn1_base_C  0
    c840_total_reads             174
Wrote 1 rows to sample1.sma_finder_results.tsv
```

Usage

```
sma_finder --help

usage: sma_finder.py [-h] [--hg37-reference-fasta HG37_REFERENCE_FASTA]
                     [--hg38-reference-fasta HG38_REFERENCE_FASTA]
                     [--t2t-reference-fasta T2T_REFERENCE_FASTA]
                     [-o OUTPUT_TSV] [-v]
                     cram_or_bam_path [cram_or_bam_path ...]

positional arguments:
  cram_or_bam_path      One or more CRAM or BAM file paths

optional arguments:
  -h, --help            show this help message and exit
  --hg37-reference-fasta HG37_REFERENCE_FASTA
                        HG37 reference genome FASTA path. This should be
                        specified if the input bam or cram is aligned to HG37.
  --hg38-reference-fasta HG38_REFERENCE_FASTA
                        HG38 reference genome FASTA path. This should be
                        specified if the input bam or cram is aligned to HG38.
  --t2t-reference-fasta T2T_REFERENCE_FASTA
                        T2T reference genome FASTA path. This should be
                        specified if the input bam or cram is aligned to the
                        CHM13 telomere-to-telomere benchmark.
  -o OUTPUT_TSV, --output-tsv OUTPUT_TSV
                        Optional output tsv file path
  -v, --verbose         Whether to print extra details during the run
```
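When SMA Finder is run from a pipeline, it can help to assemble the command line programmatically. The following is a minimal sketch that uses only the flags listed in the help text above. The function names `build_sma_finder_command` and `run_sma_finder` are hypothetical helpers, not part of SMA Finder, and the sketch assumes the `sma_finder` executable is on the `PATH`.

```python
import shlex
import subprocess
from typing import List, Optional, Sequence

# Reference flags taken from the `sma_finder --help` output.
REFERENCE_FLAGS = {
    "hg37": "--hg37-reference-fasta",
    "hg38": "--hg38-reference-fasta",
    "t2t": "--t2t-reference-fasta",
}


def build_sma_finder_command(
    reference_fasta: str,
    alignment_paths: Sequence[str],
    genome_version: str = "hg38",
    output_tsv: Optional[str] = None,
    verbose: bool = False,
) -> List[str]:
    """Return the argument list for one sma_finder invocation (hypothetical helper)."""
    if genome_version not in REFERENCE_FLAGS:
        raise ValueError(f"genome_version must be one of {sorted(REFERENCE_FLAGS)}")
    if not alignment_paths:
        raise ValueError("at least one CRAM or BAM path is required")
    command = ["sma_finder", REFERENCE_FLAGS[genome_version], reference_fasta]
    if output_tsv:
        command += ["--output-tsv", output_tsv]
    if verbose:
        command.append("--verbose")
    command += list(alignment_paths)
    return command


def run_sma_finder(command: List[str]) -> None:
    """Run the command and raise CalledProcessError on a non-zero exit status."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    cmd = build_sma_finder_command("/ref/hg38.fa", ["sample1.cram"], verbose=True)
    # Only prints the command; call run_sma_finder(cmd) to execute it.
    print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file paths that contain spaces.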

Output

The output .tsv contains one row per input CRAM or BAM file and has the following columns:

filename_prefix	CRAM or BAM filename prefix. If the input file is /path/sample1.cram this would be "sample1".
file_type	"cram" or "bam"
genome_version	"hg37", "hg38", or "t2t"
sample_id	sample id from the CRAM or BAM file header (parsed from the read group metadata)
sma_status	possible values are: "has SMA", "does not have SMA", or "not enough coverage at SMN c.840 position"
confidence_score	PHRED-scaled integer score measuring the level of confidence that the sma_status is correct. The bigger the score, the higher the confidence. It is calculated in a similar way to the PL field in GATK HaplotypeCaller genotypes.
c840_reads_with_smn1_base_C	number of reads that have a 'C' nucleotide at the c.840 position in SMN1 plus SMN2
c840_total_reads	total number of reads overlapping the c.840 position in SMN1 plus SMN2
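As an illustration of how these columns might be consumed downstream, the sketch below parses a results table with Python's standard `csv` module and lists the samples with a given `sma_status`. The helper names are hypothetical, and the example row simply reuses the values from the command output shown earlier. The `phred_to_probability` function applies the generic PHRED relationship Q = -10 * log10(p); exactly which quantity SMA Finder scales this way is not specified here, so treat the converted value as a rough guide only.

```python
import csv
from typing import Dict, Iterable, List


def read_results(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse tab-separated SMA Finder output (an open file or a list of lines)."""
    return list(csv.DictReader(lines, delimiter="\t"))


def samples_with_status(rows: List[Dict[str, str]], status: str = "has SMA") -> List[str]:
    """Return the sample_id of every row whose sma_status equals `status`."""
    return [row["sample_id"] for row in rows if row["sma_status"] == status]


def phred_to_probability(score: float) -> float:
    """Invert the generic PHRED scaling Q = -10 * log10(p)."""
    return 10 ** (-score / 10)


# Example table built from the column names above and the example output row.
EXAMPLE_TSV = (
    "filename_prefix\tfile_type\tgenome_version\tsample_id\tsma_status\t"
    "confidence_score\tc840_reads_with_smn1_base_C\tc840_total_reads\n"
    "sample1\tcram\thg38\ts1\thas SMA\t168\t0\t174\n"
)

if __name__ == "__main__":
    rows = read_results(EXAMPLE_TSV.splitlines())
    print(samples_with_status(rows))
    for row in rows:
        score = int(row["confidence_score"])
        print(row["sample_id"], score, phred_to_probability(score))
```

To read a real output file, pass the open file object instead, for example `read_results(open("sample1.sma_finder_results.tsv"))`.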

Combining results from multiple samples

After running SMA Finder on many samples, it's often useful to combine the per-sample output tables into a single table. One way to do this is with the following shell command:

    cd <directory with multiple SMA Finder output tsvs>
    combined_table_filename=combined_results.tsv
    head -n 1 $(ls *.sma_finder_results.tsv | head -n 1) > ${combined_table_filename}  # get table header from the 1st table
    for i in *.sma_finder_results.tsv; do
        tail -n +2 $i >> ${combined_table_filename}  # concatenate all tables
    done
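The same concatenation can be done in Python where a shell is not convenient. This is a minimal sketch, assuming every per-sample table shares the same header and uses the `*.sma_finder_results.tsv` naming shown above. `combine_tables` is a hypothetical helper, not a command shipped with SMA Finder.

```python
import glob
import os


def _with_newline(line: str) -> str:
    """Make sure a line ends with a newline so rows from different files do not merge."""
    return line if line.endswith("\n") else line + "\n"


def combine_tables(input_dir: str, output_path: str,
                   pattern: str = "*.sma_finder_results.tsv") -> int:
    """Concatenate per-sample tables into one file and return the number of data rows."""
    paths = sorted(glob.glob(os.path.join(input_dir, pattern)))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {input_dir!r}")
    rows_written = 0
    with open(output_path, "w") as out:
        for index, path in enumerate(paths):
            with open(path) as table:
                header = table.readline()
                if index == 0:
                    out.write(_with_newline(header))  # keep the header from the 1st table only
                for line in table:
                    if not line.strip():
                        continue  # skip blank lines
                    out.write(_with_newline(line))
                    rows_written += 1
    return rows_written


if __name__ == "__main__":
    try:
        count = combine_tables(".", "combined_results.tsv")
        print(f"Wrote {count} rows to combined_results.tsv")
    except FileNotFoundError as error:
        print(error)
```

The output name `combined_results.tsv` does not match the input pattern, so rerunning the function in the same directory does not fold the combined table into itself.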

Plotting combined results

A scatter plot summarizing read counts from many samples can be generated using the plot_SMN1_SMN2_scatter command:

python3 plot_SMN1_SMN2_scatter.py --format svg --format png ${combined_table_filename}

It generates plots like this one which is based on 16,626 exomes that include neuromuscular disease cohorts:

Poster from SVAR22

This poster on SMA Finder was presented at the SVAR22 conference:

Owner

Name: Broad Institute
Login: broadinstitute
Kind: organization
Location: Cambridge, MA

Website: http://www.broadinstitute.org/
Twitter: broadinstitute
Repositories: 1,083
Profile: https://github.com/broadinstitute

Broad Institute of MIT and Harvard

GitHub Events

Total

Watch event: 1
Push event: 1

Last Year

Watch event: 1
Push event: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Dependencies

requirements.txt pypi

pysam >=0.18.0
scipy >=1.7.3

docker/Dockerfile docker

python 3.9-slim-bullseye build
