https://github.com/broadinstitute/dig-bioindex

BIO index for genetic records stored in S3 and FastAPI server.

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found)
✓ .zenodo.json file (found)
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (12.0%) to scientific vocabulary

Keywords

ot2od036440

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: broadinstitute
License: bsd-3-clause
Language: Python
Default Branch: master
Homepage:
Size: 9.1 MB

Statistics

Stars: 6
Watchers: 7
Forks: 6
Open Issues: 3
Releases: 1

Topics

ot2od036440

Created over 6 years ago · Last pushed about 1 year ago

Metadata Files

Readme License

Bio-Index

Bio-Index is a tool that indexes genomic data stored in AWS S3 "tables" (typically generated by Spark) so that it can be rapidly queried and loaded. It uses a MySQL database to store the indexes and to look up where in S3 each "record" is located.

The Bio-Index has two entry points: a CLI used for basic CRUD operations and a simple HTTP server and REST API for pure querying.

Prerequisites

Python 3.8+

```bash
$ sudo amazon-linux-extras install python3.8
```

Make sure python3 and pip3 are on the path. You may need to do something like:

```bash
$ sudo ln -s `which python3.8` /usr/bin/python3
$ sudo ln -s /usr/local/bin/pip3 /usr/bin/pip3
```

Upgrade pip, otherwise installing the dependencies later won't work:

```bash
$ sudo python3 -m pip install --upgrade pip
```

Setup

First clone the git repository:

```bash
$ git clone https://github.com/broadinstitute/dig-bioindex.git
```

Then cd into the newly created directory and install the requirements:

```bash
$ sudo pip3 install -r requirements.txt
```

At this point, the BioIndex is installed on your system and you can run it with python3 -m bioindex.main:

```bash
$ python3 -m bioindex.main [--env-file <environment overrides>] <command> [args]
```

Configuring the BioIndex

The bio-index uses python-dotenv (environment variables) for configuration. There are two environment files of importance: .bioindex and .env. The .bioindex file contains environment variables for connecting to AWS if they need to differ from those in the AWS credentials file. If you pass --env-file before the command you can override which environment file is used instead of .bioindex.

The following are the environment variables that can be set in the .bioindex file:

```ini
BIOINDEX_S3_BUCKET        # S3 bucket to index/read from
BIOINDEX_RDS_SECRET       # AWS SecretID used to connect to the RDS instance (*)
BIOINDEX_RDS_INSTANCE     # RDS instance name; used if no secret specified (*)
BIOINDEX_RDS_USERNAME     # RDS instance login; used if no secret specified (**)
BIOINDEX_RDS_PASSWORD     # RDS instance credentials; used if no secret specified (**)
BIOINDEX_BIO_SCHEMA       # RDS MySQL schema for the bio index (default=bio)
BIOINDEX_PORTAL_SCHEMA    # RDS MySQL schema for the portal (optional)
BIOINDEX_LAMBDA_FUNCTION  # Lambda function that can be used for indexing remotely (optional)
BIOINDEX_GRAPHQL_SCHEMA   # File the GraphQL schema is written to and read from (optional)
BIOINDEX_GENES_URI        # Location of a GFF gene source (default=genes/genes.gff.gz)
BIOINDEX_RESPONSE_LIMIT   # Number of bytes to read from S3 per request (default=2 MB)
BIOINDEX_MATCH_LIMIT      # Number of matches to return per request (default=100)

(*)  - Either BIOINDEX_RDS_SECRET or BIOINDEX_RDS_INSTANCE is required
(**) - If BIOINDEX_RDS_INSTANCE is used, then username and password are required
```

Additionally, one can set a single environment variable (BIOINDEX_ENVIRONMENT), which should be the name of an AWS secret. If set, the BioIndex will read that secret as JSON and expects it to contain the rest of the environment setup.

Likewise, the environment can be overridden. Values are resolved in the following order, from lowest to highest priority:

secret < .bioindex < environment

For example, consider the following setup:

BIOINDEX_ENVIRONMENT contains "bio-index-secret", which sets BIOINDEX_S3_BUCKET to "bio-index"
BIOINDEX_S3_BUCKET is set in .bioindex to "bio-index-dev"

When run, the S3 bucket will be set to "bio-index-dev". Likewise, if the command line is run like so:

```bash
$ BIOINDEX_S3_BUCKET=bio-test python3 -m bioindex.main query gene SLC30A8
```

The S3 bucket used will be "bio-test".
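The precedence described above can be illustrated with a minimal sketch. This is not the BioIndex's own loading code: `resolve_config` is a hypothetical helper, and the three inputs stand in for the parsed AWS secret, the parsed .bioindex file, and the process environment.

```python
import json
from collections import ChainMap


def resolve_config(secret_json, dotfile_values, process_env):
    """Merge three sources so that: secret < .bioindex < environment.

    Hypothetical helper for illustration only; the real loader may differ.
    """
    secret_values = json.loads(secret_json) if secret_json else {}
    # ChainMap looks keys up left to right, so the highest-priority
    # source (the process environment) must come first.
    merged = ChainMap(process_env, dotfile_values, secret_values)
    return {k: v for k, v in merged.items() if k.startswith("BIOINDEX_")}


# The secret sets the bucket to "bio-index"; .bioindex overrides it.
secret = '{"BIOINDEX_S3_BUCKET": "bio-index"}'
dotfile = {"BIOINDEX_S3_BUCKET": "bio-index-dev"}

print(resolve_config(secret, dotfile, {})["BIOINDEX_S3_BUCKET"])
print(resolve_config(secret, dotfile, {"BIOINDEX_S3_BUCKET": "bio-test"})["BIOINDEX_S3_BUCKET"])
```

The first call prints the value from .bioindex and the second prints the value from the environment, matching the two scenarios described above.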

NOTE: The only environment variables that must be set are BIOINDEX_S3_BUCKET and either BIOINDEX_RDS_SECRET or the other BIOINDEX_RDS_* variables. These tell the BioIndex both where the data is located and where to write/read the index data.

Creating Indexes

To create a new index, use the create command. Example:

```bash
$ python3 -m bioindex.main create my-index prefix/key/to/files/ phenotype,chrom:pos
```

The above would create a new index named my-index, or overwrite an existing one with that name. It indicates that all the files to index in the S3 bucket are located under prefix/key/to/files/ (searched recursively), and that records are indexed by phenotype first and then by the locus chrom:pos.

The "prefix" to the files should always be a directory name and end with /. Every object in S3 under it will be indexed, no matter how deeply nested.

The "schema" parameter for the index controls how each record is indexed. Every schema follows the same general format: keys,...,locus. Consider the following JSON record:

```json
{
  "varId": "8:117962623:C:T",
  "dbSNP": "rs769898168",
  "chromosome": "8",
  "position": 117962623,
  "phenotype": "T2D",
  "pValue": 0.39,
  "beta": 0.3,
  "consequence": "splice_region_variant",
  "gene": "SLC30A8",
  "impact": "LOW"
}
```

This record may be indexed in many different ways. For example:

By variant ID: varId
By dbSNP: dbSNP
By variant ID or dbSNP: varId|dbSNP
By position: chromosome:position
By phenotype, then position: phenotype,chromosome:position
By gene, then phenotype: gene,phenotype
...

The rules of indexing are as follows:

Key columns can only be cardinal values and are matched exactly.
Interchangeable keys may be separated with |.
Locus must be last.
Locus must be a position (chr:pos), region (chr:start-stop), or field template (varId=$chr:$pos) where the field can be parsed as a position/region, but is matched exactly by the field value as if it were a key column.
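The rules above can be sketched as a small parser. This is an illustrative sketch only, not the BioIndex's own schema parser: `parse_schema` and `ParsedSchema` are hypothetical names, and the sketch assumes that any schema part containing a `:` is a locus expression.

```python
from typing import List, NamedTuple, Optional


class ParsedSchema(NamedTuple):
    keys: List[List[str]]   # one entry per key column; each lists interchangeable fields
    locus: Optional[str]    # raw locus expression, or None if the schema has no locus


def parse_schema(schema: str) -> ParsedSchema:
    """Split a schema string such as 'phenotype,chromosome:position'."""
    parts = [p.strip() for p in schema.split(",") if p.strip()]
    if not parts:
        raise ValueError("empty schema")

    locus = None
    # Assumption: a locus (chr:pos, chr:start-stop, or varId=$chr:$pos)
    # is recognizable because it contains ':'.
    if ":" in parts[-1]:
        locus = parts.pop()

    # Any remaining part that looks like a locus violates "locus must be last".
    for part in parts:
        if ":" in part:
            raise ValueError("locus must be last")

    # Interchangeable keys are separated with '|'.
    keys = [part.split("|") for part in parts]
    return ParsedSchema(keys, locus)


print(parse_schema("phenotype,chromosome:position"))
print(parse_schema("varId|dbSNP"))
```

A schema such as `chromosome:position,phenotype` is rejected by this sketch because the locus is not the final element.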

Preparing S3 Objects

Once everything is set up, you can begin creating or preparing the objects in S3 to be indexed. Each object is expected to be in JSON-lines format and must be sorted in the order in which its records are to be indexed! The only exception is an index that is always a 1:1 mapping to a single record (e.g. indexing by ID).

For example, if the schema phenotype,chromosome:position is used, then the objects in S3 are expected to be written (using Spark) like so:

```python
df.orderBy(['phenotype', 'chromosome', 'position']) \
    .write \
    .json('s3://my-bucket/folder')
```

The above code would write out many part files to the bucket/path, each perfectly sorted and ready to be indexed using the index CLI command.
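Before indexing, it can be useful to confirm that a part file really is sorted. The following is a minimal sketch of such a check, not part of the BioIndex CLI: `is_sorted_for_schema` is a hypothetical helper, and it assumes that Python's tuple ordering over the listed fields matches the ordering used when the data was written.

```python
import json
from typing import Iterable, Sequence


def is_sorted_for_schema(lines: Iterable[str], fields: Sequence[str]) -> bool:
    """Return True if JSON-lines records are in non-decreasing order of `fields`."""
    previous = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        current = tuple(record[field] for field in fields)
        if previous is not None and current < previous:
            return False
        previous = current
    return True


part_file = [
    '{"phenotype": "BMI", "chromosome": "8", "position": 100}',
    '{"phenotype": "T2D", "chromosome": "1", "position": 50}',
    '{"phenotype": "T2D", "chromosome": "8", "position": 117962623}',
]
fields = ["phenotype", "chromosome", "position"]
print(is_sorted_for_schema(part_file, fields))
print(is_sorted_for_schema(list(reversed(part_file)), fields))
```

The check streams the lines one at a time, so it can be applied to large part files without loading them into memory.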

Indexing

Once an index has been created, simply use the index command and pass along a comma-separated list of indexes to build.

```bash
$ python3 -m bioindex.main index my-index,another-index
```

NOTE: You can also pass * to build all indexes!

You can also build indexes "remotely" using an AWS Lambda Function. To do this, see the DIG Indexer project, which is a Serverless project that can be used to deploy a Lambda Function to AWS. Once deployed, set the BIOINDEX_LAMBDA_FUNCTION environment variable and pass --use-lambda on the CLI for the index command. You can also adjust the number of workers (--workers) to use, which is the number of Lambda functions that will execute in parallel.

Block Gzip Compression

Once you've indexed plain JSON files, you can compress them using bgzip and mark the index as compressed in the database. For the steps to compress the plain JSON files, see README.md. Setting the compressed field in __Indexes will cause the BioIndex server to retrieve the relevant data with the bgzip command. To install bgzip on AWS Linux, follow these steps (as of 06/2023, it is necessary to build from source in order to enable seamless S3 support):

1. sudo yum install -y gcc make zlib-devel bzip2-devel xz-devel curl-devel ncurses-devel openssl-devel
2. wget https://github.com/samtools/htslib/releases/download/1.17/htslib-1.17.tar.bz2 (replace with the latest version from https://github.com/samtools/htslib/releases if you want)
3. tar -xvf htslib-1.17.tar.bz2
4. cd htslib-1.17
5. ./configure --enable-s3 --enable-libcurl
6. make
7. sudo make install

Querying Indexes

Once you've built an index, you can then query it and retrieve all the records that match various input keys and/or overlap the given region. For example, to query all records in the genes key space that overlap a given region:

```bash
$ python3 -m bioindex.main query genes chr3:983248-1180000
{'chromosome': '3', 'end': 1445901, 'name': 'CNTN6', 'source': 'symbol', 'start': 1134260, 'type': 'protein_coding'}
```

NOTE: If you'd like to limit the output, just pipe it to head -n.

In addition to querying, there are also commands to count records, fetch all records, and match keys. Examples:

```bash
$ python3 -m bioindex.main count genes 8:100000000-200000000
1587

$ python3 -m bioindex.main match gene SLC30A
SLC30A1
SLC30A10
SLC30A2
SLC30A3
SLC30A4
SLC30A5
SLC30A6
SLC30A7
SLC30A8
SLC30A9
```

NOTE: The count command is an approximation. It reads the first 500 records and divides the total number of bytes to read from S3 by the average byte size per record.
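The estimate described in the note can be sketched in a few lines. This is only an illustration of the arithmetic, not the BioIndex's own implementation: `estimate_count` is a hypothetical function, and `lines` stands in for records streamed from S3.

```python
from itertools import islice
from typing import Iterable


def estimate_count(lines: Iterable[str], total_bytes: int, sample_size: int = 500) -> int:
    """Estimate the record count from the average size of the first records."""
    # Measure the encoded byte size of up to `sample_size` records.
    sizes = [len(line.encode("utf-8")) for line in islice(lines, sample_size)]
    if not sizes:
        return 0
    average_size = sum(sizes) / len(sizes)
    # Total bytes to read divided by the average bytes per record.
    return round(total_bytes / average_size)


records = ['{"gene": "SLC30A8", "chromosome": "8"}\n'] * 2000
total = sum(len(r.encode("utf-8")) for r in records)
print(estimate_count(records, total))
```

When every record has the same size, as in this example, the estimate is exact; with variable-sized records it is only as good as the first 500 records are representative.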

RsId => VariantId Index

This index looks up the variant ID that most frequently corresponds to a specified rsID. It is backed by a DynamoDB table. You can access it via /api/bio/varIdLookup/<rsid>, e.g. curl http://localhost:5000/api/bio/varIdLookup/rs1294894678. You can find more info about this index and how it's created here.

The GraphQL REST Server

In addition to a CLI, Bio-Index is also a FastAPI server that allows you to query records using GraphQL via REST calls.

Building the GraphQL Schema

GraphQL requires a schema to process queries. The schema is inferred from the data and built with the build-schema CLI command:

```bash
$ python3 -m bioindex.main build-schema --save
```

If you don't pass --save, the schema is simply printed out. With --save, it is written to the file named by the BIOINDEX_GRAPHQL_SCHEMA environment variable (default: schema.graphql); you can change the destination either by providing --out <filename> or by redirecting the printed output somewhere else.

Once the schema has been saved, you can then start the server.

Starting the Server

The server is started using the serve command:

```bash
$ bioindex serve --port 5000
```

REST Queries

The entire REST API can be explored both via the demo page and via the REST API documentation page.

Each request results in a JSON response that looks like so:

```json
{
  "continuation": null,
  "nonce": "Ox4YfcJapxGYST_siDYjFtp150BZEMqC5JdyTuyTMUQ",
  "count": 1,
  "page": 1,
  "data": [
    {
      "chromosome": "8",
      "end": 100728,
      "ensemblId": "ENSG00000254193",
      "name": "AC131281.2",
      "start": 100584,
      "type": "processed_pseudogene"
    }
  ],
  "index": "genes",
  "limit": null,
  "profile": {
    "query": 0.138009,
    "fetch": 0.417972
  },
  "progress": {
    "bytes_read": 368,
    "bytes_total": 368
  },
  "q": [
    "chr8:100000-101000"
  ]
}
```

The count is the total number of records returned by this request.

The data is the array of records (if format=row) or a dictionary of columns (if format=column).

The profile shows how long the index query took vs. how much time was spent fetching the records from S3.

The progress shows how many bytes were read from S3 for this request and the total number of bytes that need to be read.

If the continuation value is non-null, then it is a string token indicating that there are more bytes left to be read and more records left to be returned. They can be retrieved using the /api/bio/cont?token=<token> end-point.

If the continuation is followed to download more records, then the page count is increased each subsequent call.
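Following continuation tokens until they run out can be sketched as below. This is a minimal client-side illustration, not code from the repository: `iter_records`, `continuation_url`, and the injected `fetch_json` callable are hypothetical, the base URL is an assumption, and the sketch assumes format=row so that data is a list of records.

```python
from typing import Callable, Iterator
from urllib.parse import urlencode


def continuation_url(base_url: str, token: str) -> str:
    """Build the /api/bio/cont URL for a continuation token."""
    return f"{base_url}/api/bio/cont?{urlencode({'token': token})}"


def iter_records(first_page: dict,
                 fetch_json: Callable[[str], dict],
                 base_url: str) -> Iterator[dict]:
    """Yield every record, following continuation tokens until one is null."""
    page = first_page
    while True:
        yield from page["data"]
        token = page.get("continuation")
        if token is None:
            break
        # `fetch_json` is supplied by the caller, e.g. a function that
        # performs an HTTP GET and decodes the JSON response body.
        page = fetch_json(continuation_url(base_url, token))


# Simulated server responses keyed by URL, so the sketch runs offline.
base = "http://localhost:5000"
fake_pages = {
    f"{base}/api/bio/cont?token=t1": {"data": [{"name": "B"}], "continuation": "t2"},
    f"{base}/api/bio/cont?token=t2": {"data": [{"name": "C"}], "continuation": None},
}
first = {"data": [{"name": "A"}], "continuation": "t1"}
print([r["name"] for r in iter_records(first, fake_pages.__getitem__, base)])
```

Injecting the fetch function keeps the loop independent of any particular HTTP library; in real use it could wrap requests.get(url).json().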

Using Docker

The image/ subfolder contains a Dockerfile that can be used to build a Docker image. Alternatively, a pre-built image can be pulled from DockerHub.

Building the Image

To build the image from scratch, run the following:

```bash
$ docker build -t broadinstitute/bioindex:latest image
```

Once built, running docker images should show it ready for use.

Executing Using Docker

When running the BioIndex from the Docker image, it's best to pass the environment data through with --env-file. If you want to make use of the GraphQL API, a volume must also be mounted that points to where the BIOINDEX_GRAPHQL_SCHEMA file is located.

```bash
# list all indexes
$ docker run --env-file ./my-bioindex.env --rm broadinstitute/bioindex bioindex list

# build the schema and output it to stdout
$ docker run --env-file ./my-bioindex.env -v .:. --rm broadinstitute/bioindex bioindex build-schema

# start the server
$ docker run --env-file ./my-bioindex.env -v .:. --rm broadinstitute/bioindex bioindex serve
```

Genes URI

When executing queries, it's often more convenient to use a gene name instead of trying to pass a specific region. Since gene names are specific to species and assemblies, the gene names are configurable using a GFF3 file. This can be a local file (and by default is the one located in this repository), but can also be a remote, hosted file. It is expected that the attributes column contains either the ID or Name field set to the gene name to use. If the Alias attribute is also present, it is assumed to be a comma-separated list of alternate names for the gene, and those will also be included in the map.

Here is an example of the first few lines of the default GFF file in this repository:

```
19 . protein_coding 58856544 58864865 . + . Name=A1BG;Alias=ENSG00000121410,HGNC:5,uc002qsd.5,MGI:2152878
10 . protein_coding 52559169 52645435 . + . Name=A1CF;Alias=ENSG00000148584,HGNC:24086,uc057tgv.1,MGI:1917115
12 . protein_coding 9220260 9268825 . + . Name=A2M;Alias=ENSG00000175899,HGNC:7,uc001qvk.2,MGI:2449119
12 . protein_coding 8975068 9039597 . + . Name=A2ML1;Alias=ENSG00000166535,HGNC:23336,uc001quz.6
1 . protein_coding 33772367 33786699 . + . Name=A3GALT2;Alias=ENSG00000184389,HGNC:30005,uc031plq.1,MGI:2685279
```

NOTE: GFF files are tab-delimited! The spacing shown above is only for readability.

The GFF file is only downloaded and parsed if needed. It is loaded on demand (only once per execution) when a query requiring a locus is given something other than a known region format (e.g. chromosome:start-end); the value provided is then interpreted as a gene name.
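The gene lookup described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the BioIndex's own loader: `load_gene_map` and `resolve_locus` are hypothetical names, the sketch prefers Name over ID when both are present (the source does not say which wins), and it assumes an optional chr prefix may be dropped from region strings.

```python
import re
from typing import Dict, Iterable, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end)

# Matches region strings such as "chr8:100000-101000" or "8:100000-101000".
REGION_RE = re.compile(r"^(?:chr)?([0-9A-Za-z]+):(\d+)-(\d+)$")


def load_gene_map(gff_lines: Iterable[str]) -> Dict[str, Region]:
    """Map each gene name and alias to its region from tab-delimited GFF lines."""
    genes: Dict[str, Region] = {}
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a standard nine-column GFF record
        region = (cols[0], int(cols[3]), int(cols[4]))
        attrs = dict(a.split("=", 1) for a in cols[8].split(";") if "=" in a)
        name = attrs.get("Name") or attrs.get("ID")  # assumption: Name wins
        if not name:
            continue
        aliases = [a for a in attrs.get("Alias", "").split(",") if a]
        for key in [name, *aliases]:
            genes[key] = region
    return genes


def resolve_locus(text: str, genes: Dict[str, Region]) -> Region:
    """Parse a region string, falling back to a gene-name lookup."""
    match = REGION_RE.match(text)
    if match:
        return (match.group(1), int(match.group(2)), int(match.group(3)))
    return genes[text]  # raises KeyError for an unknown gene name


gff = ["19\t.\tprotein_coding\t58856544\t58864865\t.\t+\t.\t"
       "Name=A1BG;Alias=ENSG00000121410,HGNC:5,uc002qsd.5,MGI:2152878"]
gene_map = load_gene_map(gff)
print(resolve_locus("chr8:100000-101000", gene_map))
print(resolve_locus("A1BG", gene_map))
print(resolve_locus("ENSG00000121410", gene_map))
```

A real implementation would build the map lazily on the first gene-name query and reuse it for the rest of the execution, as described above.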

fin.

Owner

Name: Broad Institute
Login: broadinstitute
Kind: organization
Location: Cambridge, MA

Website: http://www.broadinstitute.org/
Twitter: broadinstitute
Repositories: 1,083
Profile: https://github.com/broadinstitute

Broad Institute of MIT and Harvard

GitHub Events

Total

Push event: 16
Pull request event: 7
Fork event: 4
Create event: 2

Last Year

Push event: 16
Pull request event: 7
Fork event: 4
Create event: 2

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 0
Total pull requests: 79
Average time to close issues: N/A
Average time to close pull requests: 10 days
Total issue authors: 0
Total pull request authors: 7
Average comments per issue: 0
Average comments per pull request: 0.08
Merged pull requests: 66
Bot issues: 0
Bot pull requests: 9

Past Year

Issues: 0
Pull requests: 27
Average time to close issues: N/A
Average time to close pull requests: 7 days
Issue authors: 0
Pull request authors: 4
Average comments per issue: 0
Average comments per pull request: 0.07
Merged pull requests: 19
Bot issues: 0
Bot pull requests: 6

Top Authors

Issue Authors

dependabot[bot] (1)

Pull Request Authors

psmadbec (24)
sagehen03 (18)
massung (8)
dependabot[bot] (8)
wnojopra (2)
qu-y (1)
ClintAtTheBroad (1)

Top Labels

Issue Labels

dependencies (1)

Pull Request Labels

dependencies (8)

Dependencies

requirements.txt pypi

aiofiles ==0.6
boto3 ==1.17
botocore ==1.20
click ==7.0
fastapi ==0.65.2
graphene ==3.0
graphql-core ==3.1.2
orjson ==3.5
pydantic ==1.6.2
pymysql ==0.10
python-dotenv ==0.15
requests ==2.25
rich ==12.0
smart_open ==5.0
sqlalchemy ==1.4
typing-extensions ==4.1.1
uvicorn ==0.13

setup.py pypi

aiofiles >=0.6
boto3 >=1.17
botocore >=1.20
click >=7.0
fastapi >=0.60
graphql-core >=3.0
orjson >=3.5
pydantic >=1.4
pymysql >=0.10
python-dotenv >=0.15
requests >=2.25
rich >=10.0
smart_open >=5.0
sqlalchemy >=1.4
uvicorn >=0.13

.github/workflows/codeql.yml actions

actions/checkout v3 composite
github/codeql-action/analyze v2 composite
github/codeql-action/autobuild v2 composite
github/codeql-action/init v2 composite

image/Dockerfile docker

python 3 build

batch-index-files/Dockerfile docker

ubuntu 20.04 build

batch-index-files/requirements.txt pypi

bioindex master
boto3 ==1.26.116
click ==8.1.3