https://github.com/broadinstitute/variants-cosmos-spikes

DSP Variants team spikes to explore Azure Cosmos DB

Repository

Basic Info
  • Host: GitHub
  • Owner: broadinstitute
  • Language: Java
  • Default Branch: main
  • Size: 105 KB
Statistics
  • Stars: 0
  • Watchers: 7
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 3 years ago · Last pushed almost 3 years ago
Metadata Files
Readme

README.md

A collection of DSP Variants Team spikes for Azure Cosmos DB.

VS-890 Translation of the original Cosmos ingest spike from Ammonite to Java.

Includes initial Gradle and GitHub Actions setup to run unit tests.

VS-893 Group records for improved ingest performance.

Groups variant and reference rows into fewer higher-level documents. Rows are grouped 6000 per document for vets and 40000 per document for reference ranges; larger groups produce HTTP 413 "Request size is too large" errors from Cosmos. This version of the code requires slightly modified EXPORT DATA statements that order the data by sample_id and location:

ref_ranges:

    EXPORT DATA OPTIONS(
      uri='gs://bucket/path/to/avros/ref_ranges/ref_ranges_001/ref_ranges_001_*.avro',
      format='AVRO',
      compression='SNAPPY') AS
    SELECT r.sample_id AS sample_id, location, length, state
    FROM `gvs-internal.quickstart_dataset.ref_ranges_001` r
    INNER JOIN `gvs-internal.quickstart_dataset.sample_info` s ON s.sample_id = r.sample_id
    WHERE withdrawn IS NULL AND is_control = FALSE
    ORDER BY sample_id, location

vets:

    EXPORT DATA OPTIONS(
      uri='gs://bucket/path/to/avros/vets/vet_001/vet_001_*.avro',
      format='AVRO',
      compression='SNAPPY') AS
    SELECT v.sample_id AS sample_id, location, ref, alt, AS_RAW_MQ, AS_RAW_MQRankSum,
      QUALapprox, AS_QUALapprox, AS_RAW_ReadPosRankSum, AS_SB_TABLE, AS_VarDP,
      call_GT, call_AD, call_GQ, call_PGT, call_PID, call_PL
    FROM `gvs-internal.quickstart_dataset.vet_001` v
    INNER JOIN `gvs-internal.quickstart_dataset.sample_info` s ON s.sample_id = v.sample_id
    WHERE withdrawn IS NULL AND is_control = FALSE
    ORDER BY sample_id, location
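
The grouping described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the `DocumentGrouper` class, the `Row` type, and the `group` method are hypothetical names, and it assumes one document never mixes samples, which is why the input must already be ordered by sample_id and location.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: pack rows (already ordered by sample_id, location) into bounded groups. */
class DocumentGrouper {

    /** Hypothetical minimal row; the real Avro records carry more fields. */
    static final class Row {
        final long sampleId;
        final long location;

        Row(long sampleId, long location) {
            this.sampleId = sampleId;
            this.location = location;
        }
    }

    /**
     * Starts a new group whenever the sample changes or the current group
     * already holds maxRecords rows. Relies on the ORDER BY in the export.
     */
    static List<List<Row>> group(List<Row> sortedRows, int maxRecords) {
        if (maxRecords < 1) {
            throw new IllegalArgumentException("maxRecords must be positive");
        }
        List<List<Row>> groups = new ArrayList<>();
        List<Row> current = new ArrayList<>();
        for (Row row : sortedRows) {
            boolean sampleChanged =
                    !current.isEmpty() && current.get(0).sampleId != row.sampleId;
            if (sampleChanged || current.size() == maxRecords) {
                groups.add(current);
                current = new ArrayList<>();
            }
            current.add(row);
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row(1, 100), new Row(1, 200), new Row(1, 300),
                new Row(2, 100), new Row(2, 150));
        // A tiny limit of 2 stands in for the 6000 / 40000 limits quoted above.
        for (List<Row> g : group(rows, 2)) {
            System.out.println("sample " + g.get(0).sampleId + ": " + g.size() + " rows");
        }
    }
}
```

Each resulting group would become one Cosmos document, so the limit passed as `maxRecords` plays the role of `--max-records-per-document`.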

To realize ~10x performance gains relative to the VS-890 code, change the indexing policy on the target Cosmos container from automatic to none. The code works without this change, but it will not be much faster than the VS-890 baseline. The updated indexing policy should look like:

    {
      "indexingMode": "none",
      "automatic": false,
      "includedPaths": [],
      "excludedPaths": []
    }

VS-906 Cosmos DB Serverless Exploration

Cosmos DB serverless is restricted to 50 GB per container and 5000 RU/s throughput. A preview offering raises the container limit to 1 TB and offers higher throughput as storage grows; I have enrolled the Variants subscription in it.

An ingest run with Quickstart data using the code from this spike consumed 6.65 M RU for reference data and 10.36 M RU for variant data. Run times were ~32 minutes and ~37 minutes, respectively, for a total of ~69 minutes. RU consumption cost was

    6.65 M RU + 10.36 M RU ≈ 17 M RU
    17 M RU * $0.25 / M RU = $4.25

or $0.425 per sample. Storage cost was

    10.03 GB + 13.13 GB = 23.16 GB
    23.16 GB * $0.25 / GB / month = $5.79 / month

or $0.58 per sample per month.
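
The cost arithmetic above can be reproduced with a short sketch. The rates are the ones quoted in the text ($0.25 per million RU and $0.25 per GB per month); the sample count of 10 is an assumption inferred from the ratio of the total to the per-sample figures, and the class and method names are hypothetical.

```java
import java.util.Locale;

/** Worked version of the serverless cost arithmetic quoted above. */
class ServerlessCost {
    // Figures taken from the text above.
    static final double REFERENCE_MILLION_RU = 6.65;
    static final double VARIANT_MILLION_RU = 10.36;
    static final double USD_PER_MILLION_RU = 0.25;
    static final double REFERENCE_GB = 10.03;
    static final double VARIANT_GB = 13.13;
    static final double USD_PER_GB_MONTH = 0.25;
    // Assumption: 10 samples, inferred from $4.25 total vs. $0.425 per sample.
    static final int SAMPLES = 10;

    /** One-off request-unit charge for an ingest, in US dollars. */
    static double ingestCostUsd(double millionRu) {
        return millionRu * USD_PER_MILLION_RU;
    }

    /** Recurring storage charge per month, in US dollars. */
    static double monthlyStorageCostUsd(double gb) {
        return gb * USD_PER_GB_MONTH;
    }

    public static void main(String[] args) {
        // Uses the unrounded 17.01 M RU; the text rounds this to 17 M RU first.
        double ingest = ingestCostUsd(REFERENCE_MILLION_RU + VARIANT_MILLION_RU);
        double storage = monthlyStorageCostUsd(REFERENCE_GB + VARIANT_GB);
        System.out.printf(Locale.ROOT, "ingest: $%.2f total, $%.3f per sample%n",
                ingest, ingest / SAMPLES);
        System.out.printf(Locale.ROOT, "storage: $%.2f per month, $%.2f per sample per month%n",
                storage, storage / SAMPLES);
    }
}
```

Note that the ingest charge is paid once per load, while the storage charge recurs every month the data is kept.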

After creating indexes with this specification, the UI did not report any change in storage amount (or cost):

    {
      "indexingMode": "consistent",
      "automatic": true,
      "includedPaths": [
        { "path": "/sample_id/*" },
        { "path": "/location/*" }
      ],
      "excludedPaths": [
        { "path": "/*" }
      ]
    }

Invocations on a Standard_E4-2ads_v5 VM looked like:

    java -Xms2g -Xmx26g -jar build/libs/variantstore-*.jar \
      --database cosmos-gvs-serverless --container ref_ranges \
      --avro-dir /mnt/data/avros-sample-location/ref_ranges/ref_ranges_001/ \
      --max-records-per-document 40000 --drop-state 4

    java -Xms2g -Xmx26g -jar build/libs/variantstore-*.jar \
      --database cosmos-gvs-serverless --container vets \
      --avro-dir /mnt/data/avros-sample-location/vets/vet_001/ \
      --max-records-per-document 6000

Improvements in VS-906

  • Support for reference drop states with the --drop-state parameter
  • Always split Cosmos documents on chromosome boundaries
  • Parameter --submission-batch-size to control the maximum size of document batches sent to Cosmos at one time.
  • Parameter --continuous-flux to turn on continuous-Flux data loading, which is more efficient but more crash-prone at low throughput.
  • Support for various Cosmos parameters. I experimented with all of these but none seemed to help with the high numbers of 429 statuses seen in Cosmos Insights:
    • --target-throughput: Value to specify for Cosmos container local target throughput
    • --max-micro-batch-size: CosmosBulkExecutionOptions micro batch size
    • --micro-batch-concurrency: CosmosBulkExecutionOptions micro batch concurrency
    • --max-micro-batch-retry-rate: CosmosBulkExecutionOptions max micro batch retry rate
    • --min-micro-batch-retry-rate: CosmosBulkExecutionOptions min micro batch retry rate
    • --min-micro-batch-interval-millis: CosmosBulkExecutionOptions min micro batch interval in milliseconds
  • Lots of general cleanup
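
Two of the behaviors listed above, splitting documents on chromosome boundaries and capping the size of submission batches, can be sketched as below. This is an illustration under stated assumptions rather than the spike's implementation: the class and method names are hypothetical, and the location encoding (chromosome index times 10^12 plus position) is an assumption that this README does not state.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: chromosome-boundary detection and bounded submission batches. */
class IngestSplitting {
    // Assumption: location = chromosomeIndex * 10^12 + position.
    static final long CHROMOSOME_MULTIPLIER = 1_000_000_000_000L;

    static long chromosomeIndex(long location) {
        return location / CHROMOSOME_MULTIPLIER;
    }

    /** True when two consecutive rows of one sample belong in different documents. */
    static boolean crossesChromosomeBoundary(long previousLocation, long nextLocation) {
        return chromosomeIndex(previousLocation) != chromosomeIndex(nextLocation);
    }

    /** Cuts a list of documents into consecutive batches of at most batchSize items. */
    static <T> List<List<T>> submissionBatches(List<T> documents, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < documents.size(); start += batchSize) {
            int end = Math.min(start + batchSize, documents.size());
            batches.add(new ArrayList<>(documents.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        long chr1 = 1 * CHROMOSOME_MULTIPLIER + 10_000;
        long chr2 = 2 * CHROMOSOME_MULTIPLIER + 500;
        System.out.println("boundary crossed: " + crossesChromosomeBoundary(chr1, chr2));

        List<String> documents = List.of("doc-a", "doc-b", "doc-c", "doc-d", "doc-e");
        for (List<String> batch : submissionBatches(documents, 2)) {
            System.out.println("submit " + batch);
        }
    }
}
```

In a loader along these lines, the boundary check would be one more condition for closing the current document, and the batch size would correspond to `--submission-batch-size`.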

Owner

  • Name: Broad Institute
  • Login: broadinstitute
  • Kind: organization
  • Location: Cambridge, MA

Broad Institute of MIT and Harvard

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 0
  • Total pull requests: 4
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 day
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • mcovarr (4)