dbgap

dbGaP to biocaddie conversion utilities

https://github.com/crddi/dbgap

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: crDDI
License: bsd-3-clause
Language: Python
Default Branch: master
Size: 150 KB

Statistics

Stars: 0
Watchers: 2
Forks: 1
Open Issues: 0
Releases: 0

Created about 10 years ago · Last pushed about 10 years ago

Metadata Files

Readme License

dbgap

dbGaP to bioCADDIE metadata conversion utilities

Introduction

This package contains a general utility that allows you to:

- Download study metadata from the dbGaP ftp site by study id.
- Convert the study metadata from XML into JSON.
- Transform the dbGaP JSON into a structure that is compatible with the bioCADDIE study schema, dataset schema and dimension schema.
- Transform the bioCADDIE compatible JSON into RDF for use in mapping functions.
- Transform RDF into bioCADDIE compatible JSON.

Installation

1. Make sure you have a working installation of Python 3.
2. Enter the appropriate virtual environment:

```bash
. myenv/bin/activate
(myenv) >
```

3a. Install `dbgap` from GitHub:

```bash
(myenv) > git clone https://github.com/crDDI/dbgap
(myenv) > cd dbgap
(myenv) > python setup.py install
```

3b. Install `dbgap` from PyPI:

```bash
(myenv) > pip install dbgap
```

4. Run `download_study`:

```bash
(myenv) > download_study
usage: download_study [-h] [-i [INFILE [INFILE ...]]] [-id INDIR]
                      [-o [OUTFILE [OUTFILE ...]]] [-od OUTDIR] [-f] [-s]
                      [-v VERSION] [-p PVALUE] [--ftproot FTPROOT] [-r RDFDIR]
                      [--logfile LOGFILE]
                      [--loglevel {DEBUG,INFO,WARNING,ERROR}] [--port PORT]
                      [-c CONTEXT]
                      studyid [{d,j,r,a} [{d,j,r,a} ...]]
download_study: error: the following arguments are required: studyid
```

Use

Transformation description

Downloading XML files

The utility allows any version of any study to be downloaded in XML from the dbGaP XML server.

The default download directory is `data/<studyid>/xml`.

As an example,

```bash
(myenv) > download_study 979 d
```

creates a `data/phs000979/xml` directory with the following files:

```bash
(myenv) > ls xml
StudyDescription.xml
phs000979.v1.pht005193.v1.Mental_Disorders_Postmortem_Subject.data_dict.xml
phs000979.v1.pht005194.v1.Mental_Disorders_Postmortem_Sample.data_dict.xml
phs000979.v1.pht005195.v1.Mental_Disorders_Postmortem_Subject_Phenotypes.data_dict.xml
phs000979.v1.pht005196.v1.Mental_Disorders_Postmortem_Sample_Attributes.data_dict.xml
```

Here `StudyDescription.xml` was downloaded from ftp://ftp.ncbi.nlm.nih.gov/dbgap/studies/phs000979/phs000979.v1.p1/GapExchange_phs000979.v1.p1.xml

and the four data_dict files from ftp://ftp.ncbi.nlm.nih.gov/dbgap/studies/phs000979/phs000979.v1.p1/phenovariable_summaries/
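
The two locations above follow a regular pattern, which can be sketched in Python. This is an illustration only and not the package's own code: the helper names are hypothetical, the six-digit zero padding is inferred from the single `979` to `phs000979` example, and the default version and participant-set values of 1 are assumptions.

```python
FTP_ROOT = "ftp://ftp.ncbi.nlm.nih.gov/dbgap/studies"  # root seen in the example URLs


def study_name(study_id: int) -> str:
    """Format a numeric study id as in the example: 979 -> phs000979 (assumed padding)."""
    return f"phs{study_id:06d}"


def study_urls(study_id: int, version: int = 1, pvalue: int = 1) -> dict:
    """Build the two FTP locations used in the example (hypothetical helper)."""
    name = study_name(study_id)
    release = f"{name}.v{version}.p{pvalue}"
    base = f"{FTP_ROOT}/{name}/{release}"
    return {
        # Saved locally as StudyDescription.xml in the example.
        "study_description": f"{base}/GapExchange_{release}.xml",
        # Directory holding the data_dict files.
        "data_dicts": f"{base}/phenovariable_summaries/",
    }


if __name__ == "__main__":
    for label, url in study_urls(979).items():
        print(label, url)
```

Keeping the release string (`phs000979.v1.p1`) in one variable avoids repeating the version and participant-set numbers in both the directory and the file name.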

Converting XML to JSON

This utility uses the Object Management Group (OMG) XML to JSON conversion specification, as implemented in the pyjxslt utility, and loads the result as a first-class python object using the jsonasobj utility. The following transformations are performed on the input data:

Study transformations

The transformations in the table below are implemented by the biocaddie_json method in https://github.com/crDDI/dbgap/blob/master/dbgap/dbgapstudyinformation.py, and generate bioCADDIE compatible output from a dbGaP Study record:

| key | value | Notes |
|---|---|---|
| @type | `"biocaddie:Study"` | This is necessary to establish the type of the entire document |
| @id | `"dbgap:"<study>".v"<version>` | This is necessary to establish the subject of the entire document |
| identifierInfo | `identifier = "dbgap:"<study>".v"<version>` | The bioCaddie schema calls for an identifier/scheme pair, although, curiously, the identifier is specified to be a URI |
| | `identifierScheme = "dbGaP"` | |
| title | `GapExchange.Studies.Study[0].Configuration.StudyNameEntrez` | ISSUE: We need to determine what an entry with more than one study looks like |
| description | `GapExchange.Studies.Study[0].Configuration.StudyNameReportPage` | |
| studyType | `GapExchange.Studies.Study[0].Configuration.StudyTypes.StudyType[0]` | ISSUE: The alignment between dbGaP study type(s) and bioCaddie StudyType is not obvious. Mapping may be required or this may not be a valid field. |
| keywords | `GapExchange.Studies.Study[0].Configuration.Diseases.Disease` (prefixed with "MESH - ") | ISSUE: There are no keywords in the latest bioCaddie schema. Is there somewhere else this would work better? Perhaps isAboutBiologicalProcess |
| resultsIn | (this is a list of the identifiers of all of the datasets) | |
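
The study rules in the table above can be read as a plain dictionary mapping. The sketch below is an illustration and not the `biocaddie_json` method itself: `map_study` and its parameters are hypothetical, the input is assumed to be a nested `dict` mirroring the `GapExchange` paths in the table, and `Disease` is assumed to be a list of plain strings (the README does not show its actual shape).

```python
def map_study(gap_exchange: dict, study: str, version: int, dataset_ids: list) -> dict:
    """Apply the study table to a GapExchange-shaped dict (hypothetical helper)."""
    # Only the first study is used, matching the Study[0] paths in the table.
    cfg = gap_exchange["GapExchange"]["Studies"]["Study"][0]["Configuration"]
    ident = f"dbgap:{study}.v{version}"
    return {
        "@type": "biocaddie:Study",
        "@id": ident,
        "identifierInfo": [{"identifier": ident, "identifierScheme": "dbGaP"}],
        "title": cfg["StudyNameEntrez"],
        "description": cfg["StudyNameReportPage"],
        "studyType": cfg["StudyTypes"]["StudyType"][0],
        # Assumption: diseases are plain strings joined with commas.
        "keywords": "MESH - " + ",".join(cfg["Diseases"]["Disease"]),
        "resultsIn": [f"dbgap:{d}" for d in dataset_ids],
    }


if __name__ == "__main__":
    sample = {"GapExchange": {"Studies": {"Study": [{"Configuration": {
        "StudyNameEntrez": "Example title",
        "StudyNameReportPage": "Example description",
        "StudyTypes": {"StudyType": ["Case-Control"]},
        "Diseases": {"Disease": ["Schizophrenia", "Bipolar Disorder"]},
    }}]}}}
    print(map_study(sample, "phs000979", 1, ["pht005193.v1"]))
```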

The transformations in the table below are implemented by the xform_dbgap_dataset method in https://github.com/crDDI/dbgap/blob/master/dbgap/xform_dbgap.py, and generate bioCADDIE compatible output from a dbGaP DataSet record:

| key | value | Notes |
|---|---|---|
| @type | `"biocaddie:Dataset"` | |
| @id | `"biocaddie:"data_table.study_id` | |
| identifierInfo | `identifier="dbgap:"data_table.study_id` | |
| | `identifierScheme=dbgap` | |
| dateinfo | `date=data_table.date_created` | |
| | `dateType="dct:created"` | Dublin core seemed to be a reasonable source for dateinfo |
| context | `"fhir:Observation"` | if dataset is "Subject Phenotypes" |
| | `"fhir:Specimen"` | if dataset is "Sample Attributes" |
| hasPartDimension | `"dbgap:"v.id` | for each data_table.variable |
| * | * | All other dbgap elements are copied as is. |
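
A minimal sketch of the dataset rules follows. It is not the `xform_dbgap_dataset` method: `map_dataset` is a hypothetical helper, the input keys (`study_id`, `date_created`, `variable`) are taken from the sample JSON shown later in this README, and deciding the `context` from a dataset name passed by the caller is an assumption. The table is followed literally, so the sketch does not try to reconcile it with the sample output, which shows a `dbgap:` `@id` and a `date_info` key.

```python
def map_dataset(data_table: dict, dataset_name: str) -> dict:
    """Apply the dataset table to one data_table dict (hypothetical helper)."""
    out = dict(data_table)  # all other dbGaP elements are copied as is
    study = data_table["study_id"]
    out["@type"] = "biocaddie:Dataset"
    out["@id"] = f"biocaddie:{study}"
    out["identifierInfo"] = [{"identifier": f"dbgap:{study}", "identifierScheme": "dbgap"}]
    out["dateinfo"] = [{"date": data_table["date_created"], "dateType": "dct:created"}]
    # Assumption: the dataset kind is recognised from its name.
    label = dataset_name.replace("_", " ")
    if "Subject Phenotypes" in label:
        out["context"] = "fhir:Observation"
    elif "Sample Attributes" in label:
        out["context"] = "fhir:Specimen"
    # Assumption: "variable" is a list of dicts, each with an "id".
    out["hasPartDimension"] = [f"dbgap:{v['id']}" for v in data_table.get("variable", [])]
    return out


if __name__ == "__main__":
    table = {
        "id": "pht005196.v1",
        "study_id": "phs000979.v1",
        "date_created": "Wed Dec 9 12:55:00 2015",
        "variable": [{"id": "phv00258279.v1"}, {"id": "phv00258280.v1"}],
    }
    print(map_dataset(table, "Mental_Disorders_Postmortem_Sample_Attributes"))
```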

The transformations in the table below are implemented by the xform_dbgap_dimension method in https://github.com/crDDI/dbgap/blob/master/dbgap/xform_dbgap.py, and generate bioCADDIE compatible output from a dbGaP dataset variable:

| key | value | Notes |
|---|---|---|
| @type | `"biocaddie:Dimension"` | |
| @id | `"biocaddie:"variable.id` | |
| identifierInfo | `identifier="dbgap:"variable.id` | |
| | `identifierScheme="dbgap"` | |
| dimensionType | `"xsd:string"` | if variable.type == "string". Note: We need to decide whether this is the correct use of type and whether datatypes even belong in bioCaddie |
| * | * | All other dbgap elements are copied as is. |
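
The dimension rules are small enough to sketch in a few lines. This is not the `xform_dbgap_dimension` method: `map_dimension` is a hypothetical helper that follows the table literally. Dropping the original `type` key when it is replaced by `dimensionType` is an assumption based on the sample output, and that sample shows a `dbgap:` prefix on `@id` where the table says `biocaddie:`.

```python
def map_dimension(variable: dict) -> dict:
    """Apply the dimension table to one variable dict (hypothetical helper)."""
    out = dict(variable)  # all other dbGaP elements are copied as is
    vid = variable["id"]
    out["@type"] = "biocaddie:Dimension"
    out["@id"] = f"biocaddie:{vid}"
    out["identifierInfo"] = [{"identifier": f"dbgap:{vid}", "identifierScheme": "dbgap"}]
    if variable.get("type") == "string":
        # Assumption: the dbGaP "type" is replaced rather than kept alongside.
        out.pop("type")
        out["dimensionType"] = "xsd:string"
    return out


if __name__ == "__main__":
    print(map_dimension({"id": "phv00258279.v1", "name": "SAMPLE_ID", "type": "string"}))
```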

The JSON images of the XML are stored in the data/<studyid>/json directory.
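
To give a feel for what a "JSON image" of the XML looks like, here is a simplified sketch using only the standard library. It does not implement the OMG specification or reproduce pyjxslt. The rules are inferred from the sample output in this README (attributes become plain keys, text beside attributes is stored under `_content`, empty elements become `""`), and collecting repeated tags into a list is an assumption.

```python
import json
import xml.etree.ElementTree as ET


def element_to_json(elem: ET.Element):
    """Convert one element to a JSON-ready value (simplified, assumed rules)."""
    text = (elem.text or "").strip()
    children = list(elem)
    if not elem.attrib and not children:
        return text  # leaf element: plain string, "" when empty
    node = dict(elem.attrib)  # attributes become plain keys
    if text:
        node["_content"] = text  # text that sits next to attributes
    for child in children:
        value = element_to_json(child)
        if child.tag in node:
            # Repeated tags are collected into a list.
            existing = node[child.tag]
            if isinstance(existing, list):
                existing.append(value)
            else:
                node[child.tag] = [existing, value]
        else:
            node[child.tag] = value
    return node


def xml_to_json(xml_text: str) -> dict:
    """Wrap the converted root element under its own tag name."""
    root = ET.fromstring(xml_text)
    return {root.tag: element_to_json(root)}


if __name__ == "__main__":
    sample = """<data_table id="pht005196.v1" study_id="phs000979.v1">
      <description/>
      <variable id="phv00258282.v1">
        <name>IS_TUMOR</name>
        <value code="N">Is not a tumor</value>
        <value code="Y">Is Tumor</value>
      </variable>
    </data_table>"""
    print(json.dumps(xml_to_json(sample), indent=2))
```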

Study Transformation Example

```bash
(myenv) > download_study 979 j
(myenv) > ls data/phs000979/json
StudyDescription.biocaddie.json
StudyDescription.json
phs000979.v1.pht005193.v1.Mental_Disorders_Postmortem_Subject.data_dict.json
phs000979.v1.pht005194.v1.Mental_Disorders_Postmortem_Sample.data_dict.json
phs000979.v1.pht005195.v1.Mental_Disorders_Postmortem_Subject_Phenotypes.data_dict.json
phs000979.v1.pht005196.v1.Mental_Disorders_Postmortem_Sample_Attributes.data_dict.json
```

Here `StudyDescription.json` is the direct JSON image of `../xml/StudyDescription.xml`, and `StudyDescription.biocaddie.json` has been mapped according to the rules above.

StudyDescription in XML

Study in XML

Mapped StudyDescription in JSON

```json
{
  "resultsIn": [
    "dbgap:pht005193.v1",
    "dbgap:pht005194.v1",
    "dbgap:pht005195.v1",
    "dbgap:pht005196.v1"
  ],
  "description": "Gene Expression in Postmortem DLPFC and Hippocampus from Schizophrenia and Mood Disorders",
  "studyType": "Case-Control",
  "identifierInfo": [
    {
      "identifierScheme": "dbGaP",
      "identifier": "dbgap:phs000979.v1"
    }
  ],
  "@type": "biocaddie:Study",
  "title": "Gene Expression in Postmortem DLPFC and Hippocampus from Schizophrenia and Mood Disorders",
  "@id": "dbgap:phs000979.v1",
  "keywords": "MESH - Schizophrenia, Schizophrenia,Bipolar Disorder,Major Depressive Disorder"
}
```

data_dict in XML

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="./datadict_v2.xsl"?>
<data_table id="pht005196.v1" study_id="phs000979.v1" participant_set="1" date_created="Wed Dec 9 12:55:00 2015">
  <description/>
  <variable id="phv00258279.v1">
    <name>SAMPLE_ID</name>
    <description>De-identified Sample ID</description>
    <type>string</type>
  </variable>
  <variable id="phv00258280.v1">
    <name>BODY_SITE</name>
    <description>Body site where sample was collected</description>
    <type>string</type>
  </variable>
  <variable id="phv00258281.v1">
    <name>ANALYTE_TYPE</name>
    <description>Analyte Type</description>
    <type>string</type>
  </variable>
  <variable id="phv00258282.v1">
    <name>IS_TUMOR</name>
    <description>Tumor status</description>
    <type>encoded values</type>
    <value code="N">Is not a tumor</value>
    <value code="Y">Is Tumor</value>
  </variable>
  <variable id="phv00258283.v1">
    <name>HISTOLOGICAL_TYPE</name>
    <description>Cell or tissue type or subtype of sample</description>
    <type>string</type>
  </variable>
  <variable id="phv00258284.v1">
    <name>RIN</name>
    <description>RNA integrity number</description>
    <type/>
  </variable>
  <variable id="phv00258285.v1">
    <name>BATCH</name>
    <description>Sample batch number</description>
    <type/>
  </variable>
</data_table>
```

Mapped data_dict in JSON

```json
{
  "data_table": {
    "study_id": "phs000979.v1",
    "participant_set": "1",
    "description": "",
    "date_created": "Wed Dec 9 12:55:02 2015",
    "id": "pht005193.v1",
    "identifierInfo": [
      {
        "identifierScheme": "dbgap",
        "identifier": "dbgap:phs000979.v1"
      }
    ],
    "variable": [
      {
        "name": "SUBJECT_ID",
        "identifierInfo": [
          {
            "identifierScheme": "dbgap",
            "identifier": "dbgap:phv00258253.v1"
          }
        ],
        "description": "Subject ID",
        "@type": "biocaddie:Dimension",
        "dimensionType": "xsd:string",
        "id": "phv00258253.v1",
        "@id": "dbgap:phv00258253.v1"
      },
      {
        "name": "CONSENT",
        "value": {
          "code": "1",
          "_content": "General Research Use (GRU)"
        },
        "identifierInfo": [
          {
            "identifierScheme": "dbgap",
            "identifier": "dbgap:phv00258254.v1"
          }
        ],
        "description": "Consent group as determined by DAC",
        "@type": "biocaddie:Dimension",
        "type": "encoded value",
        "id": "phv00258254.v1",
        "@id": "dbgap:phv00258254.v1"
      }
    ],
    "date_info": [
      {
        "dateType": "dct:created",
        "date": "Wed Dec 9 12:55:02 2015"
      }
    ],
    "hasPartDimension": [
      "dbgap:phv00258253.v1",
      "dbgap:phv00258254.v1"
    ],
    "@type": "biocaddie:Dataset",
    "@id": "dbgap:phs000979.v1"
  }
}
```

Converting JSON to RDF

The JSON to RDF conversion uses the PyLD JSON-LD library to convert the JSON generated in the previous step into RDF. It uses the output of the schematocontext converter, which has been applied to the JSON Schemas in the bioCaddie Working Group 3 Repository. It adds one additional context:

```json
{
  "@context": {
    "dbgap": "http://www.ncbi.nlm.nih.gov/gap/mms#",
    "@vocab": "http://www.ncbi.nlm.nih.gov/gap/mms#"
  }
}
```

which assigns a prefix and URI for tags that are specifically identified as being part of dbGaP as well as assigning the default tag.
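
The effect of this context can be illustrated with a few lines of plain Python. This is a rough sketch of how a prefix and `@vocab` resolve names, not a use of PyLD and not a full JSON-LD expansion; `expand_term` is a hypothetical helper. Real JSON-LD has more rules: for instance, `@vocab` applies to keys and type values but not to `@id` values.

```python
CONTEXT = {
    "dbgap": "http://www.ncbi.nlm.nih.gov/gap/mms#",
    "@vocab": "http://www.ncbi.nlm.nih.gov/gap/mms#",
}


def expand_term(term: str, context: dict = CONTEXT) -> str:
    """Expand a prefixed name or bare term, roughly as JSON-LD would."""
    if term.startswith("@"):
        return term  # JSON-LD keywords such as @id are left alone
    prefix, sep, local = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + local  # prefixed name such as dbgap:phv...
    if sep:
        return term  # unknown prefix: treated as an absolute IRI here
    return context["@vocab"] + term  # bare term falls back to @vocab


if __name__ == "__main__":
    for name in ("dbgap:phv00258282.v1", "participant_set", "@id"):
        print(name, "->", expand_term(name))
```

Because the prefix and `@vocab` point at the same namespace, a key copied over from dbGaP without a prefix ends up in the same namespace as an explicitly prefixed `dbgap:` name.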

Sample conversion

```bash
(myenv) > download_study 979 r -c http://localhost:8080/json-ld
```

Resulting Study in RDF Turtle

```turtle
@prefix biocaddie: <http://biocaddie.org/mms#> .
@prefix dbgap: <http://www.ncbi.nlm.nih.gov/gap/mms#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix fhir: <http://hl7.org/fhir/mms#> .
@prefix mms: <http://rdf.cdisc.org/mms#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

dbgap:phs000979.v1 a biocaddie:Study ;
    biocaddie:description "Gene Expression in Postmortem DLPFC and Hippocampus from Schizophrenia and Mood Disorders" ;
    biocaddie:identifierInfo ( [ biocaddie:identifier dbgap:phs000979.v1 ;
                biocaddie:identifierScheme "dbGaP" ] ) ;
    biocaddie:resultsIn ( "dbgap:pht005193.v1" "dbgap:pht005194.v1" "dbgap:pht005195.v1" "dbgap:pht005196.v1" ) ;
    biocaddie:title "Gene Expression in Postmortem DLPFC and Hippocampus from Schizophrenia and Mood Disorders" ;
    dbgap:keywords "MESH - Schizophrenia, Schizophrenia,Bipolar Disorder,Major Depressive Disorder" .
```

Resulting Dataset in RDF Turtle

```turtle
@prefix biocaddie: <http://biocaddie.org/mms#> .
@prefix dbgap: <http://www.ncbi.nlm.nih.gov/gap/mms#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix fhir: <http://hl7.org/fhir/mms#> .
@prefix mms: <http://rdf.cdisc.org/mms#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

dbgap:phs000979.v1 a biocaddie:Dataset ;
    biocaddie:dateinfo ( [ biocaddie:date "Wed Dec 9 12:55:00 2015"^^xsd:dateTime ;
                biocaddie:dateType dct:created ] ) ;
    biocaddie:description "" ;
    biocaddie:hasPartDimension ( "dbgap:phv00258279.v1" "dbgap:phv00258280.v1" "dbgap:phv00258281.v1" "dbgap:phv00258282.v1" "dbgap:phv00258283.v1" "dbgap:phv00258284.v1" "dbgap:phv00258285.v1" ) ;
    biocaddie:identifierInfo ( [ biocaddie:identifier dbgap:phs000979.v1 ;
                biocaddie:identifierScheme "dbgap" ] ) ;
    dbgap:context "fhir:Specimen" ;
    dbgap:date_created "Wed Dec 9 12:55:00 2015" ;
    dbgap:id "pht005196.v1" ;
    dbgap:participant_set "1" ;
    dbgap:study_id "phs000979.v1" ;
    dbgap:variable dbgap:phv00258279.v1,
        dbgap:phv00258280.v1,
        dbgap:phv00258281.v1,
        dbgap:phv00258282.v1,
        dbgap:phv00258283.v1,
        dbgap:phv00258284.v1,
        dbgap:phv00258285.v1 .
```

Sample Dimension Entry in RDF Turtle

```turtle
@prefix biocaddie: <http://biocaddie.org/mms#> .
@prefix dbgap: <http://www.ncbi.nlm.nih.gov/gap/mms#> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix fhir: <http://hl7.org/fhir/mms#> .
@prefix mms: <http://rdf.cdisc.org/mms#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

dbgap:phv00258282.v1 a biocaddie:Dimension ;
    biocaddie:description "Tumor status" ;
    biocaddie:identifierInfo ( [ biocaddie:identifier dbgap:phv00258282.v1 ;
                biocaddie:identifierScheme "dbgap" ] ) ;
    biocaddie:name "IS_TUMOR" ;
    dbgap:id "phv00258282.v1" ;
    dbgap:type "encoded values" ;
    dbgap:value [ dbgap:_content "Is Tumor" ;
            dbgap:code "Y" ],
        [ dbgap:_content "Is not a tumor" ;
            dbgap:code "N" ] .
```

Committers

Last synced: over 2 years ago

All Time

Total Commits: 13
Total Committers: 2
Avg Commits per committer: 6.5
Development Distribution Score (DDS): 0.077

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
hsolbrig	s**d@m**u	12
Harold Solbrig	s**g@e**t	1

Committer Domains (Top 20 + Academic)

earthlink.net: 1
mayo.edu: 1

Issues and Pull Requests

Last synced: 8 months ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Packages

Total packages: 1
Total downloads:
- pypi 10 last-month

Total dependent packages: 0
Total dependent repositories: 2
Total versions: 2
Total maintainers: 1

pypi.org: dbgap

dbGaP to bioCaddie conversion utility

Homepage: http://github.com/crDDI/dbgap
Documentation: https://dbgap.readthedocs.io/
License: BSD 3-Clause license
Latest release: 0.2.1
published about 10 years ago

Versions: 2
Dependent Packages: 0
Dependent Repositories: 2
Downloads: 10 Last month

Rankings

Dependent packages count: 10.0%

Dependent repos count: 11.6%

Forks count: 22.6%

Average: 28.4%

Stargazers count: 38.8%

Downloads: 58.9%

Maintainers (1)

hsolbrig

Last synced: 7 months ago

dbgap

Science Score: 20.0%
