https://github.com/biocommons/uta

Universal Transcript Archive: comprehensive genome-transcript alignments; multiple transcript sources, versions, and alignment methods; available as a docker image

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (7.9%) to scientific vocabulary

Keywords

bioinformatics sequence-alignment sequences

Last synced: 5 months ago

Repository

Basic Info

Host: GitHub
Owner: biocommons
License: apache-2.0
Language: Python
Default Branch: main
Homepage:
Size: 14.5 MB

Statistics

Stars: 65
Watchers: 10
Forks: 24
Open Issues: 24
Releases: 0

Topics

bioinformatics sequence-alignment sequences

Created over 9 years ago · Last pushed 6 months ago

Metadata Files

Readme License Codeowners

uta -- Universal Transcript Archive

bringing smiles to transcript users since 2013

The UTA (Universal Transcript Archive) stores transcripts aligned to sequence references (typically genome reference assemblies). It supports aligning the same transcript to multiple references using multiple alignment methods. Specifically, it facilitates the following:

Querying for multiple transcript sources through a single interface
Interpreting variants reported in literature against obsolete transcript records
Identifying regions where transcript and reference genome sequence assemblies disagree
Comparing transcripts from distinct sources
Comparing transcript alignments generated by multiple methods
Identifying ambiguities in transcript alignments

UTA is used by the hgvs package to map variants between genomic, transcript, and protein coordinates.

This code repository is primarily used for generating the UTA database. The primary interface for the database itself is via direct PostgreSQL access. (A REST interface is planned, but not yet available.)

Users can access a public instance of UTA or build their own instance of the database.

Accessing the Public UTA Instance

Invitae provides a public instance of UTA. The connection parameters are:

param | value
------------ | --------------------
host | uta.biocommons.org
port | 5432 (default)
database | uta
login | anonymous
password | anonymous

For example:

$ PGPASSWORD=anonymous psql -h uta.biocommons.org -U anonymous -d uta

Or, in Python (requires psycopg2):

> import psycopg2, psycopg2.extras
> conn = psycopg2.connect("host=uta.biocommons.org dbname=uta user=anonymous password=anonymous")
> cur = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
> cur.execute("select * from uta_20140210.tx_def_summary_v where hgnc='BRCA1'")
> row = cur.fetchone()
> dict(row)
{'hgnc': 'BRCA1',
'cds_md5': 'b3d16af258a759d0321d4f83b55dd51b',
'es_fingerprint': 'f91ab768a35c8db477fbf04dde6955e2',
'tx_ac': 'ENST00000357654',
'alt_ac': 'ENST00000357654',
'alt_aln_method': 'transcript',
'alt_strand': 1,
'exon_set_id': 7027,
'n_exons': 23,
'se_i': '0,100;100,199;199,253;253,331;331,420;420,560;560,666;666,712;712,789;789,4215;4215,4304;4304,4476;4476,4603;4603,4794;4794,5105;5105,5193;5193,5271;5271,5312;5312,5396;5396,5451;5451,5525;5525,5586;5586,7094',
'starts_i': [0,
100,
199,
253,
331,
420,
560,
666,
712,
789,
4215,
4304,
4476,
4603,
4794,
5105,
5193,
5271,
5312,
5396,
5451,
5525,
5586],
'ends_i': [100,
199,
253,
331,
420,
560,
666,
712,
789,
4215,
4304,
4476,
4603,
4794,
5105,
5193,
5271,
5312,
5396,
5451,
5525,
5586,
7094],
'lengths': [100,
99,
54,
78,
89,
140,
106,
46,
77,
3426,
89,
172,
127,
191,
311,
88,
78,
41,
84,
55,
74,
61,
1508],
'cds_start_i': 119,
'cds_end_i': 5711}
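
The `se_i` field in the row above packs the exon coordinates into a single string of semicolon-separated `start,end` pairs. The following is a minimal sketch of unpacking it. The helper name `parse_se_i` is hypothetical and not part of UTA, and reading the coordinates as zero-based with an exclusive end is an assumption inferred from the sample row, where each entry in `lengths` equals end minus start.

```python
def parse_se_i(se_i):
    """Split an se_i string such as '0,100;100,199' into (start, end) tuples."""
    exons = []
    for pair in se_i.split(";"):
        start, end = pair.split(",")
        exons.append((int(start), int(end)))
    return exons


# First three exons from the BRCA1 row shown above.
exons = parse_se_i("0,100;100,199;199,253")
lengths = [end - start for start, end in exons]
print(exons)    # [(0, 100), (100, 199), (199, 253)]
print(lengths)  # [100, 99, 54]
```

Applied to the full row, the derived lengths should match the `lengths` column. As a further sanity check, `cds_end_i - cds_start_i` for this row is 5711 - 119 = 5592, which is a multiple of three, as expected for a coding sequence.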

Installing UTA Locally

Installing with Docker (preferred)

Docker enables the distribution of lightweight, isolated packages that run on essentially any platform. When you use this approach, you will end up with a local UTA installation that runs as a local PostgreSQL process. The only requirement is Docker itself - you will not need to install PostgreSQL or any of its dependencies.

Install Docker.
Define the UTA version to download. A list of available versions can be found here:
```
$ uta_v=uta_20210129b
```
This variable is used only for consistency in the examples that follow. Defining this variable is not required for any other reason.

The UTA version string indicates the data release date. The tag is made at the time of loading and is used to derive the filename for the database dumps and docker images. Therefore, the public instances, database dumps, and docker images will always contain exactly the same content.
Fetch the UTA Docker image from Docker Hub:
```
$ docker pull biocommons/uta:$uta_v
```
This process will likely take 1-3 minutes.
Run the image:
```
$ docker run \
   -d \
   -e POSTGRES_PASSWORD=some-password-that-you-make-up \
   -v uta-dl-cache:/tmp \
   -v uta_vol:/var/lib/postgresql/data \
   --name $uta_v \
   --network=host \
   biocommons/uta:$uta_v
```
The first time you run this image, it will initialize a PostgreSQL database cluster, then download a database dump from biocommons and install it.

The uta container stages the postgres dump archive into the /tmp directory. Putting that in another volume called uta-dl-cache is helpful because it lets you re-build the database without having to re-download the postgres dump archive.

-d starts the container in detached (background) mode. To see progress:
```
$ docker logs -f $uta_v
```
You will see messages from several processes running in parallel. Near the end, you'll see:
```
== You may now connect to uta.  No password is required.
...
2020-05-28 22:08:45.654 UTC [1] LOG:  database system is ready to accept connections
```
Hit Ctrl-C to stop watching logs. The container will still be running.
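
If you would rather script the wait than watch the logs, the two messages shown above can be checked programmatically. This is a rough sketch only: the function name `uta_is_ready` is hypothetical, the message texts are copied from the log excerpt above and may differ between image versions, and the lines are assumed to come from something like `docker logs`.

```python
def uta_is_ready(log_lines):
    """Return True once a 'ready to accept connections' line follows the
    'You may now connect to uta' banner in the container log."""
    banner_seen = False
    for line in log_lines:
        if "You may now connect to uta" in line:
            banner_seen = True
        elif banner_seen and "database system is ready to accept connections" in line:
            return True
    return False


# Lines copied from the log excerpt above.
sample = [
    "== You may now connect to uta.  No password is required.",
    "...",
    "2020-05-28 22:08:45.654 UTC [1] LOG:  database system is ready to accept connections",
]
print(uta_is_ready(sample))      # True
print(uta_is_ready(sample[:1]))  # False
```

Requiring the banner first is a deliberate choice in this sketch: a "ready" line on its own might also be logged while the database is still being initialized, so it is not treated as sufficient.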

Test your installation

With the test commands below, you should see a table dump with at least 4 lines showing schema_version, create date, license, and uta (code) version used to build the instance.

$ psql -h localhost -U anonymous -d uta -c "select * from $uta_v.meta"

      key       |                               value
----------------+--------------------------------------------------------------------
 schema_version | 1.1
 created on     | 2015-08-21T10:53:50.666152
 license        | CC-BY-SA (http://creativecommons.org/licenses/by-sa/4.0/deed.en_US
 uta version    | 0.2.0a2.dev11+n52ed6e969cfc
 (4 rows)
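
The same check can be made from Python after fetching the rows of the meta table. The sketch below only validates rows that have already been fetched, so it runs without a database. The helper name `missing_meta_keys` is hypothetical, and the expected key names are taken from the sample output above; other UTA releases may store additional keys.

```python
# Key names as they appear in the sample output above.
EXPECTED_META_KEYS = {"schema_version", "created on", "license", "uta version"}


def missing_meta_keys(rows):
    """Return the expected meta keys that are absent from (key, value) rows."""
    present = {key for key, _value in rows}
    return EXPECTED_META_KEYS - present


# Rows as they might come back from cur.fetchall() on "select * from <schema>.meta".
rows = [
    ("schema_version", "1.1"),
    ("created on", "2015-08-21T10:53:50.666152"),
    ("license", "CC-BY-SA"),
    ("uta version", "0.2.0a2.dev11+n52ed6e969cfc"),
]
print(missing_meta_keys(rows))      # set()
print(missing_meta_keys(rows[:3]))  # {'uta version'}
```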

(Optional) To configure hgvs to use this local installation, consult the hgvs documentation

Installing from database dumps

Users should prefer the public UTA instance (uta.biocommons.org) or the Docker installation wherever possible. When those options are not available, users may wish to create a local PostgreSQL database from database dumps. Users choosing this method of installation should be experienced with PostgreSQL administration.

The public site and Docker images are built from exactly the same dumps as provided below. Building a database from these should result in a local database that is essentially identical to those options.

Due to the heterogeneity of operating systems and PostgreSQL installations, installing from database dumps is unsupported.

The following commands will likely need modification appropriate for the installation environment.

Download an appropriate database dump from dl.biocommons.org.
Create a user and database.

You may choose any username and database name you like. uta and uta_admin are likely to ease installation.
```
$ createuser -U postgres uta_admin
$ createuser -U postgres anonymous
$ createdb -U postgres -O uta_admin uta
```

Restore the database.

$ uta_v=uta_20210129b
$ gzip -cdq $uta_v.pgd.gz | psql -U uta_admin -1 -v ON_ERROR_STOP=1 -d uta -Eae
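
Both the dump filename in the restore command above and the Docker image tag are derived from the UTA version string, which encodes the data release date. The sketch below makes that relationship explicit. The tag shape (`uta_` followed by `YYYYMMDD` and an optional lowercase letter) is an assumption inferred from the examples in this document, and the function names are hypothetical.

```python
import re
from datetime import date

# Assumed tag shape, inferred from examples such as uta_20210129b and uta_20240512.
_TAG_RE = re.compile(r"^uta_(\d{4})(\d{2})(\d{2})([a-z]?)$")


def parse_uta_version(tag):
    """Return (release_date, suffix) for a tag like 'uta_20210129b'."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized UTA version tag: {tag!r}")
    year, month, day, suffix = match.groups()
    return date(int(year), int(month), int(day)), suffix


def artifact_names(tag):
    """Derive the dump filename and Docker image name used in the commands above."""
    parse_uta_version(tag)  # reject tags that do not fit the assumed shape
    return {"dump": f"{tag}.pgd.gz", "image": f"biocommons/uta:{tag}"}


print(parse_uta_version("uta_20210129b"))  # (datetime.date(2021, 1, 29), 'b')
print(artifact_names("uta_20210129b"))
```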

Developer Setup

Virtual Environment

To develop UTA, follow these steps.

Set up a virtual environment using your preferred method. For example:
```
$ python3 -m venv uta-venv
$ source uta-venv/bin/activate
```

Clone UTA and install:

$ git clone git@github.com:biocommons/uta.git
$ cd uta
$ pip install -e .[test]

Restore a database or load a new one using the instructions above.
To run the tests:
```
$ python3 -m unittest
```

Docker

Clone UTA and build docker image:

$ git clone git@github.com:biocommons/uta.git
$ cd uta
$ docker build -t uta .

Restore a database or load a new one using the instructions above.
Run container and tests
```
$ docker run -it --rm uta bash
```

Testing

$ docker build --target uta-test -t uta-test .
$ docker run --rm uta-test python -m unittest

UTA update procedure

Requires docker.

0. Setup

Make directories:

mkdir -p $(pwd)/ncbi-data
mkdir -p $(pwd)/output/artifacts
mkdir -p $(pwd)/output/logs

Set variables:

export UTA_ETL_OLD_UTA_IMAGE_TAG=uta_20210129b
export UTA_ETL_OLD_UTA_VERSION=$UTA_ETL_OLD_UTA_IMAGE_TAG
export UTA_ETL_NEW_UTA_VERSION=uta_20240512
export UTA_ETL_NCBI_DIR=./ncbi-data
export UTA_ETL_WORK_DIR=./output/artifacts
export UTA_ETL_LOG_DIR=./output/logs

Build the UTA image:

docker build --target uta -t uta-update .
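
Because the later steps depend on the variables set in this step, it may help to confirm they are all defined before starting a long run. The following is a small sketch of such a check. The variable names are taken from the setup step above, while the helper `missing_etl_vars` is hypothetical and not part of UTA.

```python
import os

# Variable names from the setup step above.
REQUIRED_ETL_VARS = (
    "UTA_ETL_OLD_UTA_IMAGE_TAG",
    "UTA_ETL_OLD_UTA_VERSION",
    "UTA_ETL_NEW_UTA_VERSION",
    "UTA_ETL_NCBI_DIR",
    "UTA_ETL_WORK_DIR",
    "UTA_ETL_LOG_DIR",
)


def missing_etl_vars(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ETL_VARS if not env.get(name)]


# A deliberately incomplete environment, to show what gets reported.
example_env = {
    "UTA_ETL_OLD_UTA_IMAGE_TAG": "uta_20210129b",
    "UTA_ETL_NEW_UTA_VERSION": "uta_20240512",
}
print(missing_etl_vars(example_env))
# ['UTA_ETL_OLD_UTA_VERSION', 'UTA_ETL_NCBI_DIR', 'UTA_ETL_WORK_DIR', 'UTA_ETL_LOG_DIR']
```

Called with no argument, the function inspects the current process environment instead of a dictionary.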

1. Download SeqRepo data

docker compose run seqrepo-pull

Note: pulling data takes ~30 minutes and requires ~13 GB.

Note: a container called seqrepo will be left behind.

2. Extract and transform data from NCBI

Download files from NCBI, extract into intermediate files, and load into UTA and SeqRepo.

See 2A for nuclear transcripts and 2B for mitochondrial transcripts.

2A. Nuclear transcripts

docker compose run ncbi-download
docker compose run uta-extract
docker compose run seqrepo-load
docker compose run uta-load
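
The nuclear-transcript commands above are listed in a fixed order. One way to script them is sketched below, stopping at the first failure so that a later step never runs on incomplete output. That each step depends on the previous one is an assumption here, and `run_steps` and `dry_run` are hypothetical helpers, not part of UTA.

```python
import subprocess
from types import SimpleNamespace

# Service names taken from the nuclear-transcript commands above.
NUCLEAR_STEPS = ["ncbi-download", "uta-extract", "seqrepo-load", "uta-load"]


def run_steps(steps, runner=subprocess.run):
    """Run each docker compose service in order; stop at the first failure."""
    completed = []
    for step in steps:
        result = runner(["docker", "compose", "run", step])
        if result.returncode != 0:
            raise RuntimeError(f"step {step!r} failed with exit code {result.returncode}")
        completed.append(step)
    return completed


def dry_run(cmd):
    """Stand-in runner that prints the command instead of executing it."""
    print(" ".join(cmd))
    return SimpleNamespace(returncode=0)


# Print the commands without invoking Docker.
run_steps(NUCLEAR_STEPS, runner=dry_run)
```

Leaving out the `runner` argument makes the function call `subprocess.run` and execute the real commands. The injectable runner exists only so the ordering logic can be exercised without Docker.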

2B. Mitochondrial transcripts

docker compose -f docker-compose.yml -f misc/mito-transcripts/docker-compose-mito-extract.yml run mito-extract
docker compose run seqrepo-load
docker compose run uta-load

2C. Manual splign transcripts

To load splign-manual transcripts, the workflow expects an input txdata.yaml file and splign alignments. Define the directory that contains them using the environment variable $UTA_SPLIGN_MANUAL_DIR. These file paths should exist:
- `$UTA_SPLIGN_MANUAL_DIR/splign-manual/txdata.yaml`
- `$UTA_SPLIGN_MANUAL_DIR/splign-manual/alignments/*.splign`

txdata.yaml defines the transcripts and their metadata. The alignments dir contains the splign alignments. To run the workflow:

export UTA_SPLIGN_MANUAL_DIR=$(pwd)/loading/data/splign-manual/
docker compose run splign-manual
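
Before running the workflow, the expected input layout can be checked with a few lines of Python. This sketch follows the two file paths listed above literally, including the `splign-manual` path component under the base directory. If your checkout is laid out differently, adjust the paths accordingly. The helper name `check_splign_manual_dir` is hypothetical.

```python
import tempfile
from pathlib import Path


def check_splign_manual_dir(base_dir):
    """Return a list of problems found under base_dir; empty means the layout looks fine."""
    root = Path(base_dir) / "splign-manual"
    alignments = root / "alignments"
    problems = []
    if not (root / "txdata.yaml").is_file():
        problems.append(f"missing {root / 'txdata.yaml'}")
    if not alignments.is_dir() or not any(alignments.glob("*.splign")):
        problems.append(f"no *.splign files in {alignments}")
    return problems


# An empty directory fails both checks.
with tempfile.TemporaryDirectory() as tmp:
    for problem in check_splign_manual_dir(tmp):
        print(problem)
```

In practice, the argument would be the value of `UTA_SPLIGN_MANUAL_DIR`, for example `check_splign_manual_dir(os.environ["UTA_SPLIGN_MANUAL_DIR"])` after importing `os`.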

At this point, UTA has been updated and the database has been dumped into a pgd file in UTA_ETL_WORK_DIR. SeqRepo has been updated in place.

Migrations

UTA uses alembic to manage database migrations. To auto-generate a migration:

alembic -c etc/alembic.ini revision --autogenerate -m "description of the migration"

This will create a migration script in the alembic/versions directory. Adjust the upgrade and downgrade function definitions. To apply the migration:

alembic -c etc/alembic.ini upgrade head

To reverse a migration, use downgrade with the number of steps to reverse. For example, to reverse the last:

alembic -c etc/alembic.ini downgrade -1

Owner

Name: biocommons
Login: biocommons
Kind: organization

Website: https://github.com/biocommons/biocommons/wiki/Welcome
Repositories: 19
Profile: https://github.com/biocommons

a collection of open source bioinformatics tools

GitHub Events

Total

Issues event: 5
Watch event: 6
Delete event: 1
Issue comment event: 2
Push event: 5
Pull request event: 3
Pull request review event: 2
Create event: 3

Last Year

Issues event: 5
Watch event: 6
Delete event: 1
Issue comment event: 2
Push event: 5
Pull request event: 3
Pull request review event: 2
Create event: 3

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 23
Total pull requests: 17
Average time to close issues: over 4 years
Average time to close pull requests: about 1 month
Total issue authors: 14
Total pull request authors: 9
Average comments per issue: 4.13
Average comments per pull request: 1.41
Merged pull requests: 4
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 5
Pull requests: 2
Average time to close issues: 24 days
Average time to close pull requests: 12 minutes
Issue authors: 5
Pull request authors: 2
Average comments per issue: 0.2
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

reece (9)
korikuzma (2)
theferrit32 (1)
anderspitman (1)
DGMichael (1)
deannachurch (1)
budsonjelmont (1)
nordhuang (1)
pablosolar (1)
ypchan (1)
sstadick (1)
soensz01 (1)
ahwagner (1)
koopmann (1)

Pull Request Authors

bsgiles73 (4)
sptaylor (3)
korikuzma (2)
andreasprlic (2)
b0d0nne11 (2)
naomifox-invitae (1)
gomoto (1)
theferrit32 (1)
ahwagner (1)

Top Labels

Issue Labels

stale (17)
closed-by-stale (17)
enhancement (8)
resurrected (4)
bug (1)
keep alive (1)

Pull Request Labels

resurrected (2)
stale (1)
closed-by-stale (1)
documentation (1)

Dependencies

.github/workflows/uta_update.yml actions

actions/github-script v6 composite

pyproject.toml pypi

attrs *
biocommons.seqrepo *
biopython >=1.69
bioutils *
colorlog *
configparser *
docopt *
eutils >=0.3.2
nose *
prettytable *
psycopg2-binary *
pytz *
recordtype *
sqlalchemy *
uta-align *

.github/workflows/labels.yml actions

actions/checkout v3 composite
crazy-max/ghaction-github-labeler v4 composite

.github/workflows/stale.yml actions

actions/stale v8 composite
