Recent Releases of pgsc_calc
pgsc_calc - v2.0.1
Docs improvements
- Update explanation to reflect that the reports are shareable and don't contain individual level scores
- Add example reports to download to the explanation
- Simplify offline usage section, which is useful for generic Trusted Research Environments (TREs) and sensitive compute on HPCs [@nebfield]
- Add information about running on the AllOfUs TRE [@joeltg10 @HasangaDM @smlmbrt]
Bug fixes
- Fix singularity tests on GitHub actions
Full Changelog: https://github.com/PGScatalog/pgsc_calc/compare/v2.0.0...v2.0.1
- Nextflow
Published by nebfield about 1 year ago
pgsc_calc - v2.0.0
We've marked this release as the first full release of v2, linked with our recent publication describing the calculator in full (Lambert, Wingfield, et al. Nature Genetics. 2024.).
Changelog
Improvements
- Make report shareable by default
- Remove individual level data
- Don't show density plots with small sample sizes
- Add warnings about complex alleles (e.g. HLA/APOE) and dosage specific effect weights to the report
- The variant verification step added in 2.0.0-beta.3 has been integrated into pgscatalog-aggregate
  - The symmetric difference of scoring file variant IDs (.scorefile.gz) and variants that contributed to the final calculated score (.vars plink file) must be an empty set
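The verification rule above can be sketched in a few lines of Python. This is an illustration only, not the code in pgscatalog-aggregate: the function names are hypothetical, and the file layout (variant ID in the first tab-separated column of the scoring file after a header row, one ID per line in the .vars file) is assumed from the manual check shown in the v2.0.0-beta.3 notes.

```python
import gzip


def read_scorefile_ids(path):
    """Variant IDs from a gzipped scoring file: first column, header skipped."""
    with gzip.open(path, "rt") as handle:
        next(handle, None)  # skip the header row
        return {line.split("\t", 1)[0].strip() for line in handle if line.strip()}


def read_vars_ids(path):
    """Variant IDs from a plink2 .vars file: one ID per line."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def unverified_variants(scorefile_ids, scored_ids):
    """Symmetric difference: IDs present in exactly one of the two sets."""
    return scorefile_ids ^ scored_ids


# Toy input: the second ID is made up for illustration.
# One variant was written to the scoring file but never scored.
missing = unverified_variants(
    {"22:40682469:T:C", "22:17080378:G:A"},
    {"22:17080378:G:A"},
)
if missing:
    print(f"Verification failed for: {sorted(missing)}")
```

An empty result means every variant in the scoring file contributed to the score and no unexpected variant was scored.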
Bug fixes
- Stop crashing when encountering a scoring file with dosage specific effect weights (skip instead)
- Fix report logo
- Fix VCF input with JSON samplesheets
- Add tar to zstd conda environment to prevent very old tar installs failing to extract the database
- Reduce download max thread workers to prevent throttling by the EBI
- Fix variant verification step failing in some conda deployments
Full Changelog: https://github.com/PGScatalog/pgsc_calc/compare/v2.0.0-beta.3...v2.0.0
[!NOTE]
- The COMBINE_SCOREFILES process may take longer to finish and use more memory than in previous versions
- New internal variant data models were added in this release to improve handling of complex alleles and dosage specific effect weights
- Also, every variant (scoring file row) has many more validation steps now to ensure data quality and consistency
- Speed and memory usage will be improved in the next release
Published by smlmbrt over 1 year ago
pgsc_calc - v2.0.0-beta.3
Changelog
Important fix: Fix splitting duplicated variant IDs across multiple scoring files
Background
- The MATCH_COMBINE step writes new scoring files for input to plink2 --score
- When plink2 encounters a variant with the same ID across multiple rows in a scoring file it will ignore duplicates and warn about them
- This only happens when the same variant ID has different effect alleles across different rows
- A variant ID with the same effect allele and scores across multiple columns is OK; this causes scores to be calculated in parallel
Example
When using PGS000039, PGS000040, and PGS000041 in parallel some variants have different effect alleles at the same coordinates, for example:
22:40682469:T:C with effect allele T (PGS000041_hmPOS_GRCh38)
22:40682469:T:C with effect allele C (PGS000039_hmPOS_GRCh38)
Impact
In versions v2.0.0-beta, beta.1, and beta.2 the duplicated variant is written to the same scoring file and ignored by plink2. The duplicated variant doesn't contribute to the final calculated PGS.
In all v2.0.0-alpha versions and beta.3 a second scoring file is correctly written containing the other allele (additional alleles create extra scoring files automatically within the updated MATCH_COMBINE process). We have also updated the software tests to ensure this error doesn't occur in future releases.
This problem is more likely to happen when larger scores are calculated in parallel. As more scores are calculated in parallel, it's more likely that variant IDs with different effect alleles will duplicate and be ignored during the score calculation stage.
While the overall impact on the final score is likely to be small we encourage users to upgrade to beta.3, especially if they calculate larger scores in parallel.
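The splitting behaviour described above can be sketched as follows. This is a simplified illustration, not the MATCH_COMBINE implementation: the function name is hypothetical, each row is assumed to be one (variant ID, effect allele, weight) tuple, and the weights are made-up values.

```python
def split_duplicate_ids(rows):
    """Distribute rows over scoring files so no file repeats a variant ID.

    rows: iterable of (variant_id, effect_allele, weight) tuples.
    Returns a list of files, each a list of rows with unique variant IDs.
    """
    files = []  # output scoring files, in creation order
    seen = []   # seen[i] holds the variant IDs already written to files[i]
    for row in rows:
        variant_id = row[0]
        for ids, out in zip(seen, files):
            if variant_id not in ids:
                ids.add(variant_id)
                out.append(row)
                break
        else:
            # every existing file already contains this ID: start a new file
            seen.append({variant_id})
            files.append([row])
    return files


rows = [
    ("22:40682469:T:C", "T", 0.5),   # made-up weight
    ("22:40682469:T:C", "C", -0.2),  # same ID, different effect allele
]
files = split_duplicate_ids(rows)
print(len(files))  # 2
```

Because each output file has unique variant IDs, plink2 has no duplicates to ignore, and both effect alleles contribute to their respective scores.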
How do I know if my data are affected?
$ cd work/71/35fa3c977993b71d5a85fb6721e8c3 # cd to a scoring process directory
$ comm -3 <(sort hgdp_22_additive_0.sscore.vars) <(zcat hgdp_22_additive_0.scorefile.gz | tail -n +2 | cut -f 1 | sort)
22:40682469:T:C
One missing variant appears in the output. This check is now included in the scoring module.
Other fixes
- Fix --keep_ambiguous parameter #346 (@nebfield)
- Fix variant matching information getting dropped from log when scores didn't pass the match rate threshold (@nebfield)
- Fix fraposa-pgsc handling exclusively numeric IIDs https://github.com/PGScatalog/fraposa_pgsc/pull/18 (@smlmbrt)
Published by nebfield almost 2 years ago
pgsc_calc - v2.0.0-beta.2
Changelog
Features
- Add FID support internally (FID + IID must be unique for all samples) [@nebfield, thanks to @jasamack for initial draft fix]
- Add parameters to tune target variant missingness (--pca_geno_miss_target, default maximum 10%) and/or MAF (--pca_maf_target, default no filtering) during intersection with the reference panel. [@smlmbrt]
  - The new defaults will help prevent incorrect ancestry assignments when running the calculator on small sample sizes (reverting to pre-beta behaviour), as these incorrect assignments were caused by the earlier MAF filter.
- Add --efo_id parameter, deprecating --trait_efo, which will be removed in a future release
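As a rough illustration of what the missingness and MAF thresholds mean for a single variant, here is a minimal sketch. It is not the pipeline's code (the intersection step is implemented elsewhere); the function name and the genotype encoding (alternate allele counts 0, 1, 2, with None for a missing call) are assumptions, and the defaults simply mirror the documented ones (at most 10% missing, no MAF filtering).

```python
def passes_filters(genotypes, max_missing=0.1, min_maf=0.0):
    """Decide whether one variant passes missingness and MAF thresholds.

    genotypes: per-sample alternate allele counts (0, 1, 2), None if missing.
    """
    if not genotypes:
        return False
    called = [g for g in genotypes if g is not None]
    missing_rate = (len(genotypes) - len(called)) / len(genotypes)
    if missing_rate > max_missing or not called:
        return False
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)  # frequency of the rarer allele
    return maf >= min_maf


print(passes_filters([0, 1, 2, None, 0, 0, 1, 0, 0, 0]))  # 10% missing: kept
print(passes_filters([0, 1, None, None]))                 # 50% missing: dropped
print(passes_filters([0, 0, 0, 0], min_maf=0.05))         # monomorphic: dropped
```

With few samples, MAF estimates are noisy, so a strict MAF filter can discard many usable variants; leaving it off by default avoids that.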
Misc
- Remove default anaconda channels because of license changes https://github.com/PGScatalog/pgsc_calc/pull/342
Published by nebfield almost 2 years ago
pgsc_calc - v2.0.0-beta.1
Changelog
Bug fixes
- Fix samplesheet parsing error warnings by @smlmbrt in https://github.com/PGScatalog/pgsc_calc/pull/322
- Write consistent column sets to variant information files by @nebfield in https://github.com/PGScatalog/pgsc_calc/pull/330
Full Changelog: https://github.com/PGScatalog/pgsc_calc/compare/v2.0.0-beta...v2.0.0-beta.1
Published by nebfield almost 2 years ago
pgsc_calc - v2.0.0-beta
Changelog
Graduating to beta with the release of our preprint 🎉
Improvements
- Improve aggregation https://github.com/PGScatalog/pygscatalog/pull/23
- Improve matching performance https://github.com/PGScatalog/pygscatalog/pull/22
- Improve match error docs https://github.com/PGScatalog/pgsc_calc/pull/311
- Publish dependencies to Bioconda to improve conda profile UX
  - https://anaconda.org/bioconda/fraposa-pgsc
  - https://anaconda.org/bioconda/pgscatalog.core
  - https://anaconda.org/bioconda/pgscatalog.match
  - https://anaconda.org/bioconda/pgscatalog.calc
Bug fixes
- Fix for https://github.com/PGScatalog/pygscatalog/issues/21
- Closes #301
- Specify modules explicitly to fix #312
- Fix bim input to pgscatalog-aggregate #319
Published by nebfield about 2 years ago
pgsc_calc - pgsc_calc v2.0.0-alpha.6
Changelog
2024-05-28 update: We're investigating unexpected pgscatalog.core.lib.pgsexceptions.MatchRateError in some environments (e.g. UK Biobank on an HPC). This release has been downgraded to a pre-release.
Please note the minimum required nextflow version has been updated to v23.10.0, released in October 2023. Run nextflow self-update to upgrade your nextflow version.
Improvements
- Migrate our custom python tools to new pygscatalog packages
  - Reference / target intersection now considers allelic frequency and variant missingness to determine PCA eligibility
  - Downloads from PGS Catalog should be faster (async)
  - Packages are now documented
- Update plink version to alpha 5.10 final #179
- Add docs describing cloud execution
- Add correlation test comparing calculated scores against known good scores
- When matching variants, matching logs are now written before scorefiles to improve debugging UX
- Improvements to PCA quality (ensuring low missingness and suitable MAF for PCA-eligible variants in target samples).
- This could allow us to implement MAF/missingness filters for scoring file variants in the future.
Bug fixes
- Fix ancestry adjustment with VCFs #252
- Fix support for scoring files that only have one effect type column #280
- Fix adjusting PGS with zero variance (skip them) #283
- Check for reserved characters in sampleset names
Known bug
- Incorrectly adjusting the AVG in --run_ancestry mode #301
- Unexpected pgscatalog.core.lib.pgsexceptions.MatchRateError in some environments (e.g. UK Biobank on an HPC)
Published by nebfield about 2 years ago
pgsc_calc - pgsc_calc v2.0.0-alpha.5
Changelog
Improvements
- Automatically mount directories inside singularity containers without setting any configuration
- Improve permanent caching of ancestry processes with --genotypes_cache parameter
- Resync with nf-core framework
- Refactor combine_scorefiles to improve speed and quality control processes
Bug fixes
- Fix semantic storeDir definitions causing problems with cloud execution (Google Batch)
- Fix missing DENOM values with multiple custom scoring files (score calculation not affected)
- Fix liftover failing silently with custom scoring files (thanks Brooke!)
Misc:
- Move aggregation step out of report
- Improve speed of ANCESTRY_ANALYSIS
Published by nebfield over 2 years ago
pgsc_calc - pgsc_calc v2.0.0-alpha.4
Changelog
Improvements
- Give a more helpful error message when there are no valid matches in match_combine
Bug fixes
- Fix retrying downloads when the EBI servers are sleepy on a Monday morning
- Fix numeric sample identifiers breaking ancestry analysis
- Check chr prefix in samplesheets
Published by nebfield over 2 years ago
pgsc_calc - pgsc_calc v2.0.0-alpha.3
Improvements:
- Automatically retry scoring with more RAM on larger datasets
- Describe scoring precision in docs
- Change handling of VCFs to reduce errors when recoding
- Internal changes to improve support for custom reference panels
Bug fixes:
- Fix VCF input to ancestry projection subworkflow (thanks frahimov and AWS-crafter for patiently debugging)
- Fix scoring options when reading allelic frequencies from a reference panel (thanks raimondsre for reporting the changes from v1.3.2 -> 2.0.0-alpha)
- Fix conda profile action
Published by nebfield over 2 years ago
pgsc_calc - pgsc_calc v2.0.0-alpha.2
Changelog
- Bump pgscatalog_utils v0.4.0 -> v0.4.1
  - Closes #165
Published by nebfield almost 3 years ago
pgsc_calc - pgsc_calc v2.0.0-alpha.1
This patch fixes a bug when running the workflow directly from github with the test profile (i.e. without cloning first). Thanks to @staedlern for reporting the problem.
Published by nebfield almost 3 years ago
pgsc_calc - pgsc_calc v2.0.0-alpha
This is the alpha release of the pgsc_calc pipeline's major new feature: to compare samples to a reference population in order to adjust PGS with genetic ancestry data (see documentation for details). The normal calculation of PGS is largely unaffected and directly comparable with previous versions of the calculator and PGS calculated with other tools.
Features
Major
- Breaking changes to samplesheet structure to provide more flexible support for extra genomic file types in the future.
- Genetic ancestry group similarity is calculated against a population reference panel (default: 1000 Genomes) when the --run_ancestry flag is supplied. This runs using PCA and projection implemented in the fraposa_pgsc (v0.1.0) package.
- Calculated PGS can be adjusted for genetic ancestry using empirical PGS distributions from the most similar reference panel population or continuous PCA-based regressions.
These new features are optional and don't run in the default workflow.
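To give a feel for the empirical adjustment, here is a minimal sketch that expresses target PGS as Z-scores of the score distribution in the most similar reference population. This is one simple reading of the idea, not the pipeline's method (see the documentation for that); the function name and the toy numbers are assumptions.

```python
from statistics import mean, pstdev


def adjust_to_reference(target_scores, reference_scores):
    """Express target PGS as Z-scores of a reference population's distribution.

    Returns None when the reference scores have zero variance, because a
    Z-score is undefined in that case.
    """
    mu = mean(reference_scores)
    sd = pstdev(reference_scores)
    if sd == 0:
        return None
    return [(score - mu) / sd for score in target_scores]


# Toy values: reference mean is 2.0 and standard deviation is 1.0
print(adjust_to_reference([2.0, 4.0], [1.0, 3.0]))  # [0.0, 2.0]
```

The PCA-based regression approach instead models how the PGS varies with principal components, which avoids assigning each sample to a single population.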
Minor
- Speed optimizations for PGS scoring (skipping allele frequency calculation). Thanks to @mglev1n for the suggestion!
Credits
Contributions from: @nebfield @smlmbrt @ens-lgil
Published by smlmbrt almost 3 years ago
pgsc_calc - pgsc_calc v1.3.2
This patch fixes a bug that caused the effect weight column in some PGS Catalog scoring files to be read as strings instead of floats, which triggered an assertion error. Thanks to @j0n-a for reporting the problem.
Published by nebfield over 3 years ago
pgsc_calc - pgsc_calc v1.3.1
This patch fixes a bug that breaks the workflow if all variants in one or more PGS scoring files match perfectly with the target genomes. Thanks to @lemieuxl for reporting the problem.
Published by nebfield over 3 years ago
pgsc_calc - pgsc_calc v1.3.0
This release is focused on improving scalability.
Features
- Variant matching is made more efficient using a split-apply-combine approach when the data is split across chromosomes. This supports parallel PGS calculation for the largest traits in the PGS Catalog (e.g. cancer, 418 PGS [avg 261,000 variants/score]) on big datasets such as UK Biobank.
- Better support for running in offline environments:
- Internet access is only required to download scores by ID. Scores can be pre-downloaded using the utils package (https://pypi.org/project/pgscatalog-utils/)
- Scoring file metadata is read from headers and displayed in the report (removed API calls during report generation)
- Implemented flag (--efo_direct) to return only PGS tagged with exact EFO term (e.g. no PGS for child/descendant terms in the ontology)
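The split-apply-combine approach to variant matching mentioned above can be illustrated with a small sketch. It is heavily simplified and not the pipeline's implementation: matching here is plain variant ID equality, the function name is hypothetical, and the variant IDs in the example are made up.

```python
from collections import defaultdict


def match_by_chromosome(scoring_variants, target_variants):
    """Match variant IDs per chromosome, then combine the results.

    Both inputs are iterables of (chromosome, variant_id) tuples.
    """
    # split: bucket target variant IDs by chromosome
    targets = defaultdict(set)
    for chrom, variant_id in target_variants:
        targets[chrom].add(variant_id)

    # split the scoring file variants the same way
    scoring = defaultdict(list)
    for chrom, variant_id in scoring_variants:
        scoring[chrom].append(variant_id)

    # apply: match each chromosome independently (these could run in parallel)
    per_chrom = {
        chrom: [v for v in ids if v in targets.get(chrom, set())]
        for chrom, ids in scoring.items()
    }

    # combine: concatenate the per-chromosome matches
    return [v for chrom in sorted(per_chrom) for v in per_chrom[chrom]]


scoring = [("22", "22:1:A:G"), ("1", "1:5:C:T"), ("2", "2:9:G:A")]
target = [("1", "1:5:C:T"), ("22", "22:1:A:G"), ("22", "22:7:T:C")]
print(match_by_chromosome(scoring, target))  # ['1:5:C:T', '22:1:A:G']
```

Each chromosome only needs its own slice of the target variants in memory, which is what makes the approach scale to large datasets.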
Published by nebfield over 3 years ago
pgsc_calc - pgsc_calc v1.2.0
This release is focused on improving memory and storage usage.
Features
- Allow genotype dosages to be imported from VCF by specifying the vcf_genotype_field of the samplesheet (default: GT / hard calls)
- Makes use of durable caching when relabelling and recoding target genomes (--genotypes_cache)
- Improvements to use less storage space:
  - All intermediate files are now compressed by default
  - Add parameter to support zstd compressed input files
- Improved memory usage when matching variants
(updated tagged release to fix docs)
Published by nebfield over 3 years ago
pgsc_calc - pgsc_calc v1.1.0
The first public release of the pgsc_calc pipeline. This release adds compatibility for every score published in the PGS Catalog. Each scoring file in the PGS Catalog has been processed to provide consistent genomic coordinates in builds GRCh37 and GRCh38. The pipeline has been updated to take advantage of the harmonised scoring files (see PGS Catalog downloads for additional details).
Features
- Many of the underlying software tools are now implemented within a pgscatalog_utils package (v0.1.2, https://github.com/PGScatalog/pgscatalog_utils and https://pypi.org/project/pgscatalog-utils/). The packaging allows for independent testing and development of tools for downloading and working with the scoring files.
- The output report has been improved to have more detailed metadata describing the scoring files and how well the variants match the target sampleset(s).
- Improvements to variant matching:
  - More precise control of variant matching parameters is now possible, like ignoring strand flips
  - match_variants should now use less RAM by default:
    - A laptop with 16GB of RAM should be able to comfortably calculate scores on the 1000 Genomes dataset
    - Fast matching mode (--fast_match) is available if ~32GB of RAM is available and you'd like to calculate scores for larger datasets
- Groups of scores from the PGS Catalog can be calculated by specifying a specific --trait (EFO ID) or --publication (PGP ID), in addition to using individual scoring files --pgs_id (PGS ID).
- Score validation has been integrated with the test suite
- Support for M1 Macs with --platform parameter (docker executor only)
Bug fixes
- Implemented a more robust prioritisation procedure if a variant has multiple candidate matches or duplicated IDs
- Fixed processing multiple samplesets in parallel (e.g. 1000 Genomes + UK Biobank)
- When combining multiple scoring files, all variants are now kept to reflect the correct denominator for % matching statistics.
- Fixed the matched effect allele not being correctly complemented when correcting for strand flips
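For readers unfamiliar with strand flips, the sketch below shows what complementing an effect allele involves for single-nucleotide variants. It is an illustration only, not the match_variants code; all function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(allele):
    """Complement a single-nucleotide allele (A<->T, C<->G)."""
    return allele.upper().translate(COMPLEMENT)


def is_ambiguous(ref, alt):
    """True for A/T and C/G variants, where a strand flip is undetectable."""
    return complement(ref) == alt.upper()


def match_effect_allele(effect_allele, ref, alt):
    """Return the target allele matching the effect allele, flipping if needed."""
    alleles = {ref.upper(), alt.upper()}
    effect = effect_allele.upper()
    if effect in alleles:
        return effect
    flipped = complement(effect)
    if flipped in alleles:
        return flipped  # the complemented allele, as it appears in the target
    return None


print(match_effect_allele("A", "T", "C"))  # T (matched after a strand flip)
print(is_ambiguous("A", "T"))              # True
```

For ambiguous variants (A/T or C/G) the effect allele always matches directly, so a flip cannot be detected from the alleles alone.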
Published by nebfield almost 4 years ago
pgsc_calc - v1.0.0
This release reliably calculates scores that contain chromosomal positions (scores with only rsID information will fail). Significant effort has been made to validate scores on different reference datasets. In the next release we'll add score validation to our test suite to make sure calculated scores are consistent between releases.
Changelog
- Add support for PLINK2 format (samplesheet structure changed)
- Add support for allosomes (e.g. X, Y)
- Improve PGS Catalog compatibility (e.g. missing other allele)
- Add automatic liftover of scoring files to match target genome build
- Performance improvements to support UK Biobank scale data (500,000 genomes)
- Support calculation of multiple scores in parallel
- Significantly improved test coverage (> 80%)
- Lots of other small changes to improve correctness and handling edge cases
In Development
This is marked as a pre-release because it will fail for PGS Catalog scores that only have an rsID. Mapped positions will eventually be provided for existing scores via the PGS Catalog API and these will be integrated into the calculator pipeline.
Published by nebfield about 4 years ago
pgsc_calc - 0.1.3dev
[0.1.3dev] - 2022-02-04
pgsc_calc should run on GRCh37 scoring files from the PGS Catalog & GRCh37 target genomic data but :rotating_light: don't trust the output :rotating_light:
This release is the final implementation of the MVP.
Changelog
- Better support for calling pipeline via an API
- Documentation(!)
- Better schemas for validation
Published by nebfield over 4 years ago
pgsc_calc - 0.1.2dev
[0.1.2dev] - 2022-01-17
pgsc_calc should run on GRCh37 scoring files from the PGS Catalog & GRCh37 target genomic data but :rotating_light: don't trust the output :rotating_light:
Enhancements & fixes
- #2: Set up github action CI and linting
- A lot of work to integrate with IGS4EU (e.g. JSON input)
Published by nebfield over 4 years ago
pgsc_calc - 0.1.1dev
Small fix to reflect the new scoring file format used by the PGS Catalog (version 2.0)
pgsc_calc should run on GRCh37 scoring files from the PGS Catalog & GRCh37 target genomic data but :rotating_light: don't trust the output :rotating_light:
Published by nebfield over 4 years ago