Recent Releases of somalier
somalier -
More lenient sex inference
Installation
grab the static binary, or use docker via brentp/somalier:v0.3.0
sites files
a T2T sites file by @kpalin (sites.chm13v2.T2T.vcf.gz)
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz sites.hg38.nochr.vcf.gz sites.GRCh37.vcf.gz sites.hg38.vcf.gz
- Nim
Published by brentp 11 months ago
somalier -
v0.2.19
- [relate/infer] fix check that would prevent some inference (#123 thanks @equinne5 for reporting and providing test-case)
Installation
grab the static binary, or use docker via brentp/somalier:v0.2.19
sites files
a T2T sites file by @kpalin (sites.chm13v2.T2T.vcf.gz)
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz sites.hg38.nochr.vcf.gz sites.GRCh37.vcf.gz sites.hg38.vcf.gz
Published by brentp over 2 years ago
somalier - find_sites fixes
v0.2.18
- [find_sites] handle empty alternate alleles (#121 thanks @johanneskoester for reporting)
- [find_sites] add --output-vcf option
Installation
grab the static binary, or use docker via brentp/somalier:v0.2.18
sites files
a T2T sites file by @kpalin (sites.chm13v2.T2T.vcf.gz)
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz sites.hg38.nochr.vcf.gz sites.GRCh37.vcf.gz sites.hg38.vcf.gz
Published by brentp almost 3 years ago
somalier - missing RG
- allow setting bam sample name via ENV when RG is missing (#115)
Installation
grab the static binary, or use docker via brentp/somalier:v0.2.17
sites files
a T2T sites file by @kpalin (sites.chm13v2.T2T.vcf.gz)
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz sites.hg38.nochr.vcf.gz sites.GRCh37.vcf.gz sites.hg38.vcf.gz
Published by brentp almost 3 years ago
somalier -
This makes find-sites faster and less buggy (only needed if you wish to create your own sites files).
Installation
grab the static binary, or use docker via brentp/somalier:v0.2.16
sites files (unchanged from previous releases except for T2T)
a new T2T sites file matching those for hg38 and GRCh37 was created by @kpalin (sites.chm13v2.T2T.vcf.gz)
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz sites.hg38.nochr.vcf.gz sites.GRCh37.vcf.gz sites.hg38.vcf.gz
Published by brentp over 3 years ago
somalier - minor find-sites fix
this is a minor release that fixes problems with the previous binary.
Installation
grab the static binary, or use docker via brentp/somalier:v0.2.15
sites files (unchanged from previous releases)
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz sites.hg38.nochr.vcf.gz sites.GRCh37.vcf.gz sites.hg38.vcf.gz
sites.hg38.rna.vcf.gz only includes sites likely to be expressed in GTEx
Published by brentp over 4 years ago
somalier -
this is a minor release with small usability improvements. see below for details:
changes
- minor fixes to find-sites
- allow setting env variable SOMALIER_REPORT_ALL_PAIRS to force reporting of all sample-pairs (#76)
- improve readme (via @zztin)
Installation
grab the static binary, or use docker via brentp/somalier:v0.2.14
sites files (unchanged from previous releases)
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz sites.hg38.nochr.vcf.gz sites.GRCh37.vcf.gz sites.hg38.vcf.gz
sites.hg38.rna.vcf.gz only includes sites likely to be expressed in GTEx
Published by brentp over 4 years ago
somalier - trio inference fix
v0.2.13
- add "Heterozygosity rate" as a per-sample metric to the html output. (Thanks Irenaeus and Kelly for the suggestion)
- fix inference for some cases. obvious parent-child pairs were sometimes missed.
Installation
grab the static binary, or use docker via brentp/somalier:v0.2.13
sites files (unchanged from previous releases)
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz sites.hg38.nochr.vcf.gz sites.GRCh37.vcf.gz sites.hg38.vcf.gz
sites.hg38.rna.vcf.gz only includes sites likely to be expressed in GTEx
Published by brentp about 5 years ago
somalier - scaling IBS
v0.2.12
- add checkbox to HTML to scale IBS0, IBS2, etc by number of sites shared by the samples. this almost always results in a scaling that is better across (pairs of) samples.
- ancestry: allow globs for ancestry files (#59)
Installation
grab the static binary, or use docker via brentp/somalier:v0.2.12
sites files (unchanged from previous releases)
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz sites.hg38.nochr.vcf.gz sites.GRCh37.vcf.gz sites.hg38.vcf.gz
sites.hg38.rna.vcf.gz only includes sites likely to be expressed in GTEx
Published by brentp over 5 years ago
somalier - v0.2.11
The changes in this release are small, with one exception: the relatedness calculation has been corrected. It previously worked for most cases but was off when the number of sites was extremely low; it is now accurate in more cases. This was reported by @fgvieira. See below for full release details.
v0.2.11
- more informative error message on bad sample name (#53)
- allow setting SOMALIER_AB_HOM_CUTOFF to change which calls are considered hom-ref (#56)
- adjust (fix) relatedness calculation which was off when the number of shared sites was low (#55).
many thanks to @fgvieira who found and diagnosed this problem. This change adds a hets_ab column which is the count of times sample a was het and b was not unknown + the times sample b was het and a was not unknown; this is mostly not needed except to (re)calculate relatedness but is reported in the text output for completeness.
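The hets_ab definition above can be read literally as a count over paired genotype calls. The following is a minimal Python sketch of that reading, not somalier's implementation: the integer genotype encoding (0 = hom-ref, 1 = het, 2 = hom-alt, -1 = unknown) and the function name are assumptions made for illustration. Under this reading, a site where both samples are het contributes twice.

```python
UNKNOWN = -1  # assumed encoding for an unknown genotype
HET = 1       # assumed encoding for a heterozygous call


def hets_ab(a, b):
    """Count sites where a is het and b is known, plus sites where b is het and a is known."""
    total = 0
    for ga, gb in zip(a, b):
        if ga == HET and gb != UNKNOWN:
            total += 1
        if gb == HET and ga != UNKNOWN:
            total += 1
    return total


if __name__ == "__main__":
    sample_a = [1, 1, 0, -1, 2]
    sample_b = [1, -1, 1, 1, 0]
    print(hets_ab(sample_a, sample_b))
```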
Installation
grab the static binary, or use docker via brentp/somalier:v0.2.11
sites files (unchanged from previous releases)
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz sites.hg38.nochr.vcf.gz sites.GRCh37.vcf.gz sites.hg38.vcf.gz
sites.hg38.rna.vcf.gz only includes sites likely to be expressed in GTEx
Published by brentp almost 6 years ago
somalier - sites file optimized for RNA-Seq, bug-fixes and ancestry improvements
v0.2.10
- added a new sites file that includes sites likely to be expressed in GTEx to improve kinship estimation in RNA-Seq data (see below for link to new sites file).
- fix extra output column in pairs.tsv (#47)
- change output file name of ancestry to include "somalier"
- fix for gvcf with empty alts (#46)
- add include regions and exclude sites to find-sites
- add --min-ab option to somalier relate to limit het sites to min_ab..(1-min_ab). default is 0.3
- html output for sample plot defaults to number of het sites on X (was hom-alt)
- better estimates in somalier ancestry when incoming samples are different ancestry from training (thousand genomes samples)
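To make the --min-ab range concrete, here is a small Python sketch of a filter that keeps a het site only when its allele balance falls in min_ab..(1-min_ab). It is an illustration under assumptions: the function name, the ref/alt read-count inputs, and the inclusive bounds are hypothetical and not taken from somalier's source.

```python
def is_usable_het(ref_count, alt_count, min_ab=0.3):
    """Return True when allele balance lies in [min_ab, 1 - min_ab] (bounds assumed inclusive)."""
    depth = ref_count + alt_count
    if depth == 0:
        return False  # no reads, so no allele balance to evaluate
    ab = alt_count / depth
    return min_ab <= ab <= 1 - min_ab


if __name__ == "__main__":
    print(is_usable_het(10, 10))  # balanced het
    print(is_usable_het(18, 2))   # allele balance of 0.1, outside the default range
```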
example output is now at: https://brentp.github.io/somalier/ex.html and: https://brentp.github.io/somalier/ex.somalier-ancestry.html
Installation
grab the static binary, or use docker via brentp/somalier:v0.2.10
sites files (unchanged from previous releases)
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz sites.hg38.nochr.vcf.gz sites.GRCh37.vcf.gz sites.hg38.vcf.gz
sites.hg38.rna.vcf.gz only includes sites likely to be expressed in GTEx
Published by brentp about 6 years ago
somalier - multi-sample GVCFs
v0.2.9
- support multi-sample GVCF and fix some GVCF cases (thanks @ameynert for implementing)
- also fixes some edge-cases with GVCFs
Installation
grab the static binary, or use docker via brentp/somalier:v0.2.9
sites files (unchanged from previous releases)
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz sites.hg38.nochr.vcf.gz sites.GRCh37.vcf.gz sites.hg38.vcf.gz
Published by brentp over 6 years ago
somalier - pedigree inference and better handling of identical samples
v0.2.8
- html output has a list of pre-sets to auto-select informative X, Y axes for the sample plot
- add --infer flag to somalier relate to allow inferring relatedness. this accompanies a change in the .samples.tsv output so that it can be used as a pedigree file
- add --sample-prefix option to extract and corresponding (multi-)option to relate. So, given a cohort with DNA and RNA where samples have identical IDs (SM tags) in the DNA and RNA, you can use somalier as:
```
somalier extract -d DNA --sample-prefix DNA- ...
somalier extract -d RNA --sample-prefix RNA- ...
somalier relate --sample-prefix DNA- --sample-prefix RNA- DNA/*.somalier RNA/*.somalier ...
```
and it will show the samples that have matching IDs after stripping the prefixes as "identical".
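The prefix-stripping behaviour described above can be sketched in Python. This only illustrates the matching logic, assuming that IDs are compared after removing the first matching prefix; the function names are hypothetical and this is not somalier's code.

```python
def strip_prefix(sample_id, prefixes):
    """Remove the first matching prefix from a sample ID, if any."""
    for prefix in prefixes:
        if sample_id.startswith(prefix):
            return sample_id[len(prefix):]
    return sample_id


def identical_pairs(sample_ids, prefixes):
    """Return pairs of sample IDs that are equal once prefixes are stripped."""
    groups = {}
    for sid in sample_ids:
        groups.setdefault(strip_prefix(sid, prefixes), []).append(sid)
    pairs = []
    for ids in groups.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.append((ids[i], ids[j]))
    return pairs


if __name__ == "__main__":
    ids = ["DNA-s1", "RNA-s1", "DNA-s2"]
    print(identical_pairs(ids, ["DNA-", "RNA-"]))
```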
Installation
grab the static binary, or use docker via brentp/somalier:v0.2.8
sites files (unchanged from previous releases)
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz sites.hg38.nochr.vcf.gz sites.GRCh37.vcf.gz sites.hg38.vcf.gz
Published by brentp over 6 years ago
somalier - ancestry
this release adds an initial implementation of an ancestry sub-command that can use a set of labelled samples (with extracted somalier files) to train a small neural network which is then used to predict the ancestry of incoming samples.
the implementation is incomplete, but works for well-behaved data. Here is an example:
http://home.chpc.utah.edu/~u6000771/somalier/somalier-ancestry.n.html
This is possible thanks to a very fast randomized PCA implementation (along with a neural network framework) from @mratsim in Arraymancer.
There are also improvements for huge cohorts. See below for full change-set.
Installation
grab the static binary below, or use docker via brentp/somalier:v0.2.7
v0.2.7
- new subcommand ancestry to predict ancestry using a simple neural network on the somalier sketches. creates an interactive html output and a text file
- fix for "Argument list too long" on huge cohorts (#37)
- sub-sample .pairs.tsv output for huge cohorts -- only for unrelated samples.
- better sub-sampling of html output
sites files (unchanged from previous releases)
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz sites.hg38.nochr.vcf.gz sites.GRCh37.vcf.gz sites.hg38.vcf.gz
Published by brentp over 6 years ago
somalier - html for huge cohorts
somalier was fast enough to use on large >5,000 sample cohorts, but the html output was not useful. this fixes that by sub-sampling pairs of samples that are expected to be unrelated and also appear to be unrelated by the genotype information.
v0.2.6
- for large cohorts (>1K samples) the html output is now usable. it randomly subsets samples that should be and are unrelated.
- better error messages for bad input
- inspect environment variable SOMALIER_ALLOWED_FILTERS so that users can give a comma-delimited list of FILTERs that should be allowed (by default only PASS and RefCall variants are considered). This is useful for some GVCF formats.
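As an illustration of how a comma-delimited filter list from the environment might be interpreted, here is a hedged Python sketch. Whether the listed FILTERs are added to the defaults or replace them is not stated in the release note; this sketch assumes they are added, and the helper names are hypothetical.

```python
import os

# Defaults as stated in the release note.
DEFAULT_ALLOWED = frozenset({"PASS", "RefCall"})


def allowed_filters(environ=None):
    """Build the set of allowed FILTER values (assumption: env values extend the defaults)."""
    environ = os.environ if environ is None else environ
    raw = environ.get("SOMALIER_ALLOWED_FILTERS", "")
    extra = {item.strip() for item in raw.split(",") if item.strip()}
    return set(DEFAULT_ALLOWED) | extra


def keep_variant(filter_value, allowed):
    """Decide whether a variant with this FILTER value would be considered."""
    return filter_value in allowed


if __name__ == "__main__":
    allowed = allowed_filters({"SOMALIER_ALLOWED_FILTERS": "LowQual, q10"})
    print(sorted(allowed))
    print(keep_variant("LowQual", allowed))
```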
sites files
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz sites.hg38.nochr.vcf.gz sites.GRCh37.vcf.gz sites.hg38.vcf.gz
Published by brentp over 6 years ago
somalier - VCF+GVCF edge-cases
get started with a binary below, or with docker: brentp/somalier:v0.2.5
v0.2.5
- handle more types of GVCF (#27, thanks @holtjma)
- handle VCFs without depth (AD) information. this enables extracting VCFs with only genotypes such as files converted from array information (#31, thanks @asazonov)
sites files
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz sites.hg38.nochr.vcf.gz sites.GRCh37.vcf.gz sites.hg38.vcf.gz
Published by brentp over 6 years ago
somalier - GVCF support and parameter changes
get started with a binary below, or with docker: brentp/somalier:v0.2.4
v0.2.4
- unify genotyping between all code-paths (thanks Filipe)
- if both groups and pedigree information are specified, they correctly share information (#26)
- relax allele-balance cutoffs: hom-ref is now < 0.04 and hom-alt is > 0.96 (was 0.02 and 0.98 respectively).
- support for GVCF (#27)
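The relaxed allele-balance cutoffs in the list above can be illustrated with a short Python sketch. Only the hom-ref and hom-alt thresholds come from the release note; the function name, the read-count inputs, and the "other" label for everything between the cutoffs are assumptions for this example.

```python
def genotype_from_ab(ref_count, alt_count, hom_cutoff=0.04):
    """Classify a site from allele balance using symmetric hom-ref / hom-alt cutoffs."""
    depth = ref_count + alt_count
    if depth == 0:
        return "unknown"
    ab = alt_count / depth
    if ab < hom_cutoff:
        return "hom-ref"
    if ab > 1 - hom_cutoff:
        return "hom-alt"
    return "other"  # assumed placeholder: het or unknown is decided elsewhere


if __name__ == "__main__":
    # 3 alt reads out of 100: hom-ref with the 0.04 cutoff, not with the older 0.02 cutoff.
    print(genotype_from_ab(97, 3))
    print(genotype_from_ab(97, 3, hom_cutoff=0.02))
```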
sites files
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz sites.hg38.nochr.vcf.gz sites.GRCh37.vcf.gz sites.hg38.vcf.gz
Published by brentp over 6 years ago
somalier - major performance improvements, bugfixes
The main change in this release is the use of bitvectors to calculate all-vs-all relatedness. This speeds up the relatedness step by about 100X such that we can calculate relatedness of all 4,825,171 possible pairwise combinations of the 2,504 thousand genomes samples in about 20 seconds. It also fixes a bug in the a-allele/b-allele designation for VCF that caused problems when comparing samples extracted from VCF/BCF to those from CRAM/BAM.
The readme now includes instructions on how to estimate ancestry from somalier sketches.
v0.2.3
- calculate relatedness correctly for samples with parent-ids specified when the parents are not actually in the pedigree file.
- use bit-vectors to calculate relatedness. this gives up to a 250X speedup. with this code, I can now evaluate relatedness for 3756 in under 30 seconds on my laptop.
- better scaling of X and Y depth
- use final RG as the sample id in relate
- output expected relatedness in .pairs.tsv file
- fix ref/alt (a/b-allele) ordering for VCF. this was a bug that caused problems when comparing samples extracted from VCF files to other samples extracted from BAM/CRAM files. Thanks very much to Filipe and Sergio for finding this issue and providing several test-cases. (if you have previously downloaded the thousand genomes files from zenodo, please update to the latest).
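The bit-vector idea mentioned in this release can be sketched in Python by using arbitrary-size integers as bit sets, one bit per site. This is an illustration of the general technique under assumptions (the Sketch class, the genotype encoding, and the function names are hypothetical); it is not somalier's actual data layout or code.

```python
from dataclasses import dataclass


@dataclass
class Sketch:
    """Per-sample bit sets; bit i is set when site i has that genotype."""
    hom_ref: int = 0
    het: int = 0
    hom_alt: int = 0


def to_sketch(genotypes):
    """Pack genotypes (assumed 0=hom-ref, 1=het, 2=hom-alt, other=unknown) into bit sets."""
    s = Sketch()
    for i, g in enumerate(genotypes):
        if g == 0:
            s.hom_ref |= 1 << i
        elif g == 1:
            s.het |= 1 << i
        elif g == 2:
            s.hom_alt |= 1 << i
    return s


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def ibs0(a, b):
    """Sites where the samples are opposite homozygotes, found with bitwise AND."""
    return popcount(a.hom_ref & b.hom_alt) + popcount(a.hom_alt & b.hom_ref)


def shared_hets(a, b):
    """Sites where both samples are heterozygous."""
    return popcount(a.het & b.het)


if __name__ == "__main__":
    a = to_sketch([0, 1, 2, 2, -1])
    b = to_sketch([2, 1, 0, 2, 1])
    print(ibs0(a, b), shared_hets(a, b))
```

Because each comparison reduces to a few AND operations and bit counts over whole machine words, all-vs-all comparisons avoid a per-site loop for every pair, which is the kind of saving the release note describes.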
sites files
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz sites.hg38.nochr.vcf.gz sites.GRCh37.vcf.gz sites.hg38.vcf.gz
Published by brentp over 6 years ago
somalier - static build with curl. default output dir
v0.2.2
- add a default output directory. previously if no output directory was specified, it would try to write to / and give a non-informative error.
- static build with libcurl. the static binary now supports bams/vcfs/crams over https/s3 etc.
Install
- somalier_static is a completely static binary and the recommended way to run somalier; just wget, chmod+x (get a sites file) and go.
sites files
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz sites.hg38.nochr.vcf.gz sites.GRCh37.vcf.gz sites.hg38.vcf.gz
Published by brentp almost 7 years ago
somalier -
v0.2.1
- fix hover in html
- add --unknown flag for somalier relate to set unknown genotypes to hom-ref (useful when merging single-sample VCFs).
- change sites to be alphabetical by allele so that they are the same between genome builds
- add version to .somalier files created with extract -- these will not be compatible with those made with v0.2.0. I don't foresee a backwards incompatible change like this one in the near future.
- sites files for hg38 and GRCh37 are compatible. That is, we can extract sites from bams or vcfs from samples aligned to GRCh37 reference and accurately calculate relatedness on files extracted from samples aligned to hg38.
- better HTML performance for large numbers of samples by sub-sampling individuals that are expected to be unrelated and that have a calculated relatedness < 0.09.
- add a depthview sub-command to plot the depth of each sample along each chromosome.
- much nicer html and several fixes thanks to Joe Brown
Install
This release comes with 2 linux binaries:
- somalier_static is a completely static binary and the recommended way to run somalier; just wget, chmod+x (get a sites file) and go.
- somalier_shared requires htslib (and libhts.so). use this binary if you need to access S3 or https files.
sites files
These sites files are build-specific, but as of this release, once the sites are extracted, the resulting somalier files can be used to compare samples even across genome builds.
sites.hg19.vcf.gz sites.hg38.nochr.vcf.gz sites.GRCh37.vcf.gz sites.hg38.vcf.gz
Published by brentp almost 7 years ago
somalier - major refactor for scalability
v0.2.0
This was a large re-write of somalier. The command-line usage is backwards incompatible (but
should not change moving forward). There is now a per-sample extract step:
somalier extract -d extracted/ -s $sites_vcf -f $fasta $sample.cram
followed by a relate step:
somalier relate --ped $ped extracted/*.somalier
This enables parallelization by sample across nodes and the resulting, extracted, binary "somalier"
files are only ~220KB per sample so reading them is nearly instant and the relate step
runs in 10 seconds for my 603-sample test-case which makes adjusting pedigree files or removing samples
and re-running a much faster process.
This means we can add a single (n+1) sample and once it's extracted, we can compare it to an entire cohort in a few seconds.
somalier extract can also take a (multi-sample) VCF and create an identical "somalier" file for cases when a VCF is available.
The sites files (linked below) are also greatly improved (with fewer sites, better accuracy) in this release.
For example, here is the output from previous version:
[figure: output plot from the previous version]
compared to this version:
[figure: output plot from this version]
Note how on the bottom figure for this version, like colors (relationships indicated from a pedigree file) cluster more tightly than in the previous version.
This release also reports values for X and Y chromosomes which help to evaluate observed vs expected sex, which can help resolve sample swaps.
Install
This release comes with 2 linux binaries:
- somalier_static is a completely static binary and the recommended way to run somalier; just wget, chmod+x (get a sites file) and go.
- somalier_shared requires htslib (and libhts.so). use this binary if you need to access S3 or https files.
sites files
sites.hg38.vcf.gz sites.GRCh37.vcf.gz
Published by brentp about 7 years ago
somalier -
v0.1.5
- add experimental contamination estimate. this simply prints to stderr the sample and inferred source (another sample) of contamination along with the estimated level of contamination and the number of sites used to estimate it.
- fix threading bug with large numbers of samples.
- more lenient ped file parsing ("Female" will be recognized in sex column and "Affected" in phenotype column).
- the html output now allows selecting a single sample to be highlighted in the plot; this allows finding a sample of interest in a large cohort.
- the output now includes a new metric for proportion of sites with an allele balance > 0.02 and < 0.2 or > 0.8 and < 0.98. this turns out to be a nice QC (high is bad)
- for low coverage or targeted sites, sometimes nan values would stop the entire html page from working; this has been fixed.
- make sure all reported relationships are plotted in correct colors (#14)
- plotting fixes (#15)
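The allele-balance QC metric in the list above can be sketched as a simple proportion. The thresholds are the ones quoted in the release note; the function name and the choice of denominator (all sites with a computed allele balance) are assumptions for this Python illustration.

```python
def off_balance_proportion(allele_balances):
    """Fraction of sites with allele balance in (0.02, 0.2) or (0.8, 0.98)."""
    if not allele_balances:
        return 0.0
    off = sum(1 for ab in allele_balances if 0.02 < ab < 0.2 or 0.8 < ab < 0.98)
    return off / len(allele_balances)


if __name__ == "__main__":
    # Two of five sites fall into the suspicious ranges.
    print(off_balance_proportion([0.0, 0.1, 0.5, 0.9, 1.0]))
```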
Install
This release comes with 2 linux binaries:
- somalier_static is a completely static binary; just wget, chmod+x (get a sites file) and go.
- somalier_shared requires htslib (and libhts.so). use this binary if you need to access S3 or https files.
Sites files
hg38:
sites.hg38.vcf.gz sites.chr.hg38.vcf.gz (for hg38 VCFs with "chr" prefix)
GRCh37
Published by brentp about 7 years ago
somalier - .list to specify bam/cram and index paths.
v0.1.4
- if a file ending with ".list" is given as an argument (instead of .bam, .cram), it can contain paths to the alignment files and optionally the indexes. e.g.
    https://abc/path/to/aaa.bam https://abc/indexes/path/aaa.bam.bai
    https://abc/path/to/bbb.bam https://abc/indexes/path/bbb.bam.bai
  These can be space, comma, or tab-delimited.
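A ".list" line of the kind shown above could be split as in the following Python sketch. It assumes one alignment path per line with an optional index path as the second field; the function name is hypothetical and this is not somalier's parser.

```python
import re

# Fields may be separated by spaces, commas, or tabs.
_DELIMS = re.compile(r"[ ,\t]+")


def parse_list_line(line):
    """Return (alignment_path, index_path_or_None), or None for a blank line."""
    fields = [f for f in _DELIMS.split(line.strip()) if f]
    if not fields:
        return None
    index = fields[1] if len(fields) > 1 else None
    return fields[0], index


if __name__ == "__main__":
    print(parse_list_line("https://abc/path/to/aaa.bam,https://abc/indexes/path/aaa.bam.bai"))
    print(parse_list_line("https://abc/path/to/bbb.bam"))
```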
here are the current best sites files for hg38:
sites.hg38.vcf.gz sites.chr.hg38.vcf.gz (for hg38 VCFs with "chr" prefix)
Published by brentp over 7 years ago
somalier - fixes for deep coverage and better QC metrics.
v0.1.3
- if a sample had > 1 allele that was neither REF nor ALT at a given site, it was assigned an unknown genotype. This was too stringent for deep sequencing so it was changed to a proportion (> 0.04 [or 1 in 25 alleles]) #7
- for samples with sparse coverage, e.g. from targeted sequencing projects, mean depth is not very informative because it gets washed out by all the zero-depth sites. The new columns: gt_depth_mean, gt_depth_std, gt_depth_skew report the values for the depth at genotyped sites--those meeting the depth requirement (default of 7).
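To show why depth statistics restricted to genotyped sites are more informative for sparse coverage, here is a minimal Python sketch. The depth requirement of 7 comes from the release note; the use of the population standard deviation and the standardized third moment for skew are assumptions, as is the function name.

```python
import math


def gt_depth_stats(depths, min_depth=7):
    """Mean, std, and skew of depth over sites with depth >= min_depth; None if no site qualifies."""
    kept = [d for d in depths if d >= min_depth]
    n = len(kept)
    if n == 0:
        return None
    mean = sum(kept) / n
    std = math.sqrt(sum((d - mean) ** 2 for d in kept) / n)
    if std == 0:
        skew = 0.0
    else:
        skew = sum((d - mean) ** 3 for d in kept) / n / std ** 3
    return mean, std, skew


if __name__ == "__main__":
    depths = [0, 0, 0, 10, 20, 30]   # targeted data: many zero-depth sites
    print(sum(depths) / len(depths))  # overall mean is pulled down by the zeros
    print(gt_depth_stats(depths))     # statistics over genotyped sites only
```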
Published by brentp over 7 years ago
somalier - plot-aesthetics, fixes for RNA-Seq, more depth diagnostics
v0.1.2
- allow lower-case reference alleles in case of masked genomes (see #5)
- set relatedness values < -1.5 to -1.5 in the plot
- fix bug that affected relatedness calcs especially in RNA-Seq
- add more diagnostic values (allele-balance and number of non ref/alt bases)
see previous release for sites file. For hg38, it's possible to use: https://github.com/brentp/peddy/blob/master/peddy/GRCH38.sites (if your chromosomes have the 'chr' prefix, that must be added to this sites file).
Published by brentp over 7 years ago
somalier - bug-fixes and interaction in html output
v0.1.1
- fix bug in plot labels
- better inter-plot interaction in html
continue to use sites.vcf.gz from previous release (GRCh37 only)
Published by brentp over 7 years ago
somalier - v0.1.0
this release improves the parallelization by sample and provides a better (GRCh37) sites file. It is recommended to use this file. It has fewer sites (23K) but they will work for BS-Seq data and should provide a slightly better relatedness estimate than the 37K from the previous release.
It also removed the heatmap plot in favor of a depth (diagnostic) plot.
Published by brentp over 7 years ago
somalier - first release
see binary attached (built on oldish system so should avoid libc problems on most systems).
the sites.vcf.gz will work for hg37. the next release will provide one that works for hg38, but any set of common variants will work. sites.vcf.gz
Published by brentp almost 8 years ago