Managing sequencing data with RDF
hts-rdf
Author: Pierre Lindenbaum PhD.
Here are a few notes about managing sequencing data with RDF. I want to keep track of the samples, BAMs, references, diseases, etc. used in my lab.
- This document is auto-generated using a Makefile. Do not edit it.
- I don't want to use a `SQL` database.
- I don't want to join too many tab-delimited files.
- I want to use a controlled vocabulary to define things like diseases, organisms, etc.
- This document is NOT a tutorial for `RDF` or `SPARQL`.
- I use the `RDF+XML` notation because I'm used to working with `XML`.
- I created a namespace for my lab: `https://umr1087.univ-nantes.fr/rdf/` and an `XML` entity for this namespace: `&u1087;`.
- I tried to reuse existing ontologies (e.g. `foaf:Person` for samples) as much as I can, but sometimes I created my own classes and properties.
- I'm not an expert in `SPARQL` or `RDF`.
- Required tools are [jena](https://jena.apache.org/download/), `bcftools` (for `VCF`s), `samtools` (for `BAM`s), `awk`.
Building the RDF GRAPH
Species
I manually wrote data/species.rdf defining the species used in my lab.
We will use rdfs:subClassOf to find organisms that are sub-taxa of a given taxon in the NCBI taxonomy tree.
```rdf
(...)
</rdf:RDF>
```
Diseases / Phenotypes
I manually wrote data/diseases.rdf defining the diseases used in my lab.
We will use rdfs:subClassOf to find diseases that are sub-diseases of a given disease in the disease ontology tree.
```rdf
(...)
<owl:Class rdf:about="http://purl.obolibrary.org/obo/DOID_0080600">
<rdfs:label>COVID-19</rdfs:label>
<rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/DOID_0080599"/>
</owl:Class>
<owl:Class rdf:about="http://purl.obolibrary.org/obo/DOID_0081013">
<rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/DOID_0080600"/>
<rdfs:label>Severe COVID-19</rdfs:label>
</owl:Class>
<owl:Class rdf:about="http://purl.obolibrary.org/obo/DOID_3491">
<rdfs:label>Turner Syndrome</rdfs:label>
</owl:Class>
</rdf:RDF>
```
References / FASTA / Genomes
I manually wrote data/references.tsv, a tab-delimited text file defining each FASTA reference genome available on my cluster.
The taxon id will be used to retrieve the species associated with a FASTA file.
```
path	genomeId	ucsc	taxid
data/hg19.fasta	grch37	hg19	9606
data/hg38.fasta	grch38	hg38	9606
data/rotavirus_rf.fa	rotavirus		10912
```
The table is transformed into RDF using awk:
```bash
tail -n+2 data/references.tsv |\
awk -F '\t' '{printf("<u:Reference rdf:about=\"&u1087;references/%s\">\n\t<u:genomeId>%s</u:genomeId>\n\t<u:filename>%s</u:filename>\n",$2,$2,$1);if($4!="") printf("\t<u:taxon rdf:resource=\"http://purl.uniprot.org/taxonomy/%s\"/>\n",$4); printf("</u:Reference>\n");}'
```
output:
```rdf
(...)
<u:Reference rdf:about="&u1087;references/grch37">
	<u:genomeId>grch37</u:genomeId>
	<u:filename>data/hg19.fasta</u:filename>
	<u:taxon rdf:resource="http://purl.uniprot.org/taxonomy/9606"/>
</u:Reference>
<u:Reference rdf:about="&u1087;references/grch38">
	<u:genomeId>grch38</u:genomeId>
	<u:filename>data/hg38.fasta</u:filename>
	<u:taxon rdf:resource="http://purl.uniprot.org/taxonomy/9606"/>
</u:Reference>
<u:Reference rdf:about="&u1087;references/rotavirus">
	<u:genomeId>rotavirus</u:genomeId>
	<u:filename>data/rotavirus_rf.fa</u:filename>
	<u:taxon rdf:resource="http://purl.uniprot.org/taxonomy/10912"/>
</u:Reference>
</rdf:RDF>
```
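The awk command above writes each field into the XML as-is, so a path or identifier containing `&`, `<`, `>` or `"` would produce malformed RDF/XML. Here is a minimal sketch of a variant that escapes those characters first. The names `refs_to_rdf` and `xml_escape` are illustrative and are not part of the repository; the column layout is assumed to be the one of data/references.tsv shown earlier.

```bash
#!/usr/bin/env bash
# Sketch: references.tsv -> RDF/XML with XML-escaped values.
# refs_to_rdf and xml_escape are illustrative names, not part of the repository.
refs_to_rdf() {   # $1: tab-delimited file with a header line (path, genomeId, ucsc, taxid)
  tail -n+2 "$1" | awk -F '\t' '
    # replace the characters that are special in XML text and attribute values
    function xml_escape(s) {
      gsub(/&/, "\\&amp;", s)    # must run first, before other entities are introduced
      gsub(/</, "\\&lt;", s)
      gsub(/>/, "\\&gt;", s)
      gsub(/"/, "\\&quot;", s)
      return s
    }
    {
      id = xml_escape($2)
      printf("<u:Reference rdf:about=\"&u1087;references/%s\">\n", id)
      printf("\t<u:genomeId>%s</u:genomeId>\n", id)
      printf("\t<u:filename>%s</u:filename>\n", xml_escape($1))
      if ($4 != "") printf("\t<u:taxon rdf:resource=\"http://purl.uniprot.org/taxonomy/%s\"/>\n", xml_escape($4))
      printf("</u:Reference>\n")
    }'
}
```

It could be called as `refs_to_rdf data/references.tsv`. Characters that are not allowed in an IRI (spaces, for instance) would still need percent-encoding, which this sketch does not attempt.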
Samples
I manually wrote data/samples.rdf defining the samples sequenced in my lab.
This is where we can define the gender, associate a sample with a disease and define the familial relations.
The Class foaf:Group is used to create a group of samples.
```rdf
(...)
</rdf:RDF>
```
VCF BCF
VCF and genomes
For each VCF file, we need to associate the VCF with its reference genome.
The chromosome names and lengths are extracted from the index (.fai) of each reference; we calculate the md5 checksum of that list and sort on md5.
```bash
tail -n+2 data/references.tsv | sort -T TMP -t $'\t' -k1,1 > TMP/sorted.refs.txt
cut -f 1 TMP/sorted.refs.txt | while read FA; do echo -ne "${FA}\t" && cut -f 1,2 "${FA}.fai" | md5sum | cut -d ' ' -f 1; done > TMP/references.md5.tmp.a
join -t $'\t' -1 1 -2 1 TMP/sorted.refs.txt TMP/references.md5.tmp.a | sort -t $'\t' -k5,5 > TMP/references.md5
rm -f TMP/references.md5.tmp.a
```
For each VCF, we extract the header, keep the chromosome name and length from each contig line, calculate the md5 checksum of that list and sort on md5.
```bash
find data -type f \( -name "*.vcf.gz" -o -name "*.bcf" -o -name "*.vcf" \) | sort > TMP/vcfs.txt
(cat TMP/vcfs.txt| while read V ; \
do echo -en "${V}\t" && \
bcftools view --header-only "${V}" | awk -F '[=,<>]' '/^##contig/ {printf("%s\t%s\n",$4,$6);}' | md5sum | cut -d ' ' -f1 ; done) | sort -t $'\t' -k2,2 > TMP/vcfs.md5.txt
```
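The two checksum steps rely on the same idea: a FASTA index and a VCF header that describe the same contigs, in the same order, produce the same "name TAB length" listing and therefore the same md5. The toy example below illustrates this with made-up file contents and without `bcftools`; the function names are illustrative, and `md5sum` from GNU coreutils is assumed.

```bash
#!/usr/bin/env bash
# Toy demonstration with made-up data: a .fai index and a VCF header describing
# the same contigs yield the same md5 signature.
fai_signature() {          # $1: path to a .fai index
  cut -f 1,2 "$1" | md5sum | cut -d ' ' -f 1
}
vcf_header_signature() {   # reads the text of a VCF header on stdin
  awk -F '[=,<>]' '/^##contig/ {printf("%s\t%s\n",$4,$6);}' | md5sum | cut -d ' ' -f 1
}

workdir=$(mktemp -d)
# .fai columns: name, length, offset, line bases, line width
printf 'chr1\t1000\t6\t60\t61\nchr2\t500\t1030\t60\t61\n' > "${workdir}/toy.fa.fai"
printf '##fileformat=VCFv4.2\n##contig=<ID=chr1,length=1000>\n##contig=<ID=chr2,length=500>\n#CHROM\tPOS\n' > "${workdir}/toy.header.txt"

sig_fai=$(fai_signature "${workdir}/toy.fa.fai")
sig_vcf=$(vcf_header_signature < "${workdir}/toy.header.txt")
if [ "${sig_fai}" = "${sig_vcf}" ]; then echo "same dictionary"; else echo "different dictionary"; fi
```

The awk field positions (`$4`, `$6`) assume contig lines written as `##contig=<ID=...,length=...>`; a header that puts another attribute between `ID` and `length` would give a different listing and would not match any reference.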
We join both files on md5 and convert the result to RDF using awk:
```bash
cat data/header.rdf.part > TMP/vcf2ref.rdf
join -t $'\t' -1 2 -2 5 TMP/vcfs.md5.txt TMP/references.md5 |\
awk -F '\t' '{printf("<u:Vcf rdf:about=\"file://%s\"><u:filename>%s</u:filename><u:reference rdf:resource=\"&u1087;references/%s\"/></u:Vcf>",$2,$2,$4); }' >> TMP/vcf2ref.rdf
cat data/footer.rdf.part >> TMP/vcf2ref.rdf
```
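The awk step reads `$2` and `$4` because `join` prints the join field first, then the remaining columns of the first file, then those of the second. The toy run below makes that layout explicit; the paths and the md5 values (`aaa`, `bbb`, `ccc`) are made up for illustration. It also shows `join -v 1`, which lists the VCFs whose contigs match no known reference and which the join above would silently drop.

```bash
#!/usr/bin/env bash
# Toy data (made up): VCF table is "path <TAB> md5", sorted on column 2;
# reference table is "path genomeId ucsc taxid md5", sorted on column 5.
workdir=$(mktemp -d)
printf 'data/v1.vcf\taaa\ndata/v2.vcf\tbbb\ndata/v3.vcf\tccc\n' > "${workdir}/vcfs.md5.txt"
printf 'data/hg38.fasta\tgrch38\thg38\t9606\taaa\ndata/hg19.fasta\tgrch37\thg19\t9606\tbbb\n' > "${workdir}/references.md5"

# joined columns: 1=md5 2=VCF path 3=FASTA path 4=genomeId 5=ucsc 6=taxid
joined=$(join -t $'\t' -1 2 -2 5 "${workdir}/vcfs.md5.txt" "${workdir}/references.md5")
pairs=$(echo "${joined}" | awk -F '\t' '{printf("%s -> %s\n",$2,$4);}')
echo "${pairs}"

# VCFs without a matching reference
unmatched=$(join -t $'\t' -1 2 -2 5 -v 1 "${workdir}/vcfs.md5.txt" "${workdir}/references.md5")
echo "unmatched: ${unmatched}"
```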
VCF and samples
To link the VCF files to their samples, we use bcftools query -l to extract the sample names and convert them to RDF using awk:
```bash
find data -type f \( -name "*.vcf.gz" -o -name "*.bcf" -o -name "*.vcf" \) | sort > TMP/vcfs.txt
cat data/header.rdf.part > TMP/vcf2samples.rdf
cat TMP/vcfs.txt | while read F; do bcftools query -l "${F}" | awk -vVCF="$F" 'BEGIN {printf("<u:Vcf rdf:about=\"file://%s\"><u:filename>%s</u:filename>",VCF,VCF); } {printf("<u:sample rdf:resource=\"&u1087;samples/%s\"/>",$1);} END {printf("</u:Vcf>");}' >> TMP/vcf2samples.rdf ; done
cat data/footer.rdf.part >> TMP/vcf2samples.rdf
```
BAM files
BAM files contain the sample names in their read groups. We use samtools samples to extract the samples, the reference and the path of each BAM file.
The script data/samtools.samples.to.rdf.awk converts the output of samtools samples to RDF.
```bash
find ${PWD}/data -type f -name "*.bam" |\
	samtools samples -F TMP/references.txt |\
	sort -T TMP -t $'\t' -k3,3 |\
	join -t $'\t' -1 3 -2 1 - TMP/sorted.refs.txt > TMP/bams.txt

cat data/header.rdf.part > TMP/bams.rdf
awk -F '\t' -f data/samtools.samples.to.rdf.awk TMP/bams.txt >> TMP/bams.rdf
cat data/footer.rdf.part >> TMP/bams.rdf
```
The output:
```rdf
(...)
<foaf:Person rdf:about="&u1087;samples/S5">
	<foaf:name>S5</foaf:name>
</foaf:Person>
<u:Bam rdf:about="file:///home/lindenb/src/hts-rdf/data/S5.grch38.bam">
	<u:filename>/home/lindenb/src/hts-rdf/data/S5.grch38.bam</u:filename>
	<u:sample rdf:resource="&u1087;samples/S5"/>
	<u:reference rdf:resource="&u1087;references/grch38"/>
</u:Bam>
(...)
```
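The content of data/samtools.samples.to.rdf.awk is not shown here. As a rough idea of what such a conversion could look like, the sketch below produces records shaped like the output above. It is a hypothetical reconstruction, not the author's script, and it assumes the joined TMP/bams.txt has the columns: FASTA path, sample name, BAM path, genomeId, ucsc, taxid (the join field first, then the remaining columns of each input).

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for data/samtools.samples.to.rdf.awk (NOT the original script).
# Assumed input columns: 1=FASTA path 2=sample 3=BAM path 4=genomeId 5=ucsc 6=taxid
bams_to_rdf() {   # reads the files given as arguments, or stdin when none is given
  awk -F '\t' '{
    # one foaf:Person per line; duplicated descriptions collapse when the graph is merged
    printf("<foaf:Person rdf:about=\"&u1087;samples/%s\">\n\t<foaf:name>%s</foaf:name>\n</foaf:Person>\n", $2, $2)
    printf("<u:Bam rdf:about=\"file://%s\">\n\t<u:filename>%s</u:filename>\n", $3, $3)
    printf("\t<u:sample rdf:resource=\"&u1087;samples/%s\"/>\n", $2)
    printf("\t<u:reference rdf:resource=\"&u1087;references/%s\"/>\n</u:Bam>\n", $4)
  }' "$@"
}
```

For example, `bams_to_rdf TMP/bams.txt >> TMP/bams.rdf` would take the place of the `awk -f` line in the pipeline above.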
Combining all the RDF chunks
jena/riot is used to merge the RDF files into knowledge.rdf:
```bash
riot --formatted=RDFXML TMP/references.rdf data/species.rdf TMP/bams.rdf data/diseases.rdf data/samples.rdf TMP/vcf2ref.rdf TMP/vcf2samples.rdf > knowledge.rdf
```
Querying the GRAPH
jena/arq is used to run the SPARQL queries.
```bash
arq --data=knowledge.rdf --query=querysparql
```
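The examples that follow all use the same pattern: a query file `data/query.NAME.sparql` is run against knowledge.rdf and its result is saved as `TMP/NAME.out`. A small loop can apply that pattern to any number of query files. This is only a sketch: `run_queries` and the `ARQ` variable are illustrative names, and the loop uses only the `--data` and `--query` options shown above.

```bash
#!/usr/bin/env bash
# Sketch: run several SPARQL query files and save each result as <outdir>/<NAME>.out,
# where NAME is the query file name without the "query." prefix and ".sparql" suffix.
ARQ=${ARQ:-arq}   # command to run; can be overridden, e.g. for a dry run

run_queries() {   # usage: run_queries GRAPH OUTDIR QUERY_FILE...
  local graph=$1 outdir=$2
  shift 2
  local q name
  for q in "$@"; do
    name=$(basename "${q}" .sparql)   # e.g. query.species.01
    name=${name#query.}               # e.g. species.01
    "${ARQ}" --data="${graph}" --query="${q}" > "${outdir}/${name}.out"
  done
}
```

A call such as `run_queries knowledge.rdf TMP data/query.*.sparql` would then produce TMP/species.01.out, TMP/diseases.01.out and so on.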
Example
Show the species that are sub-taxa of "Homo".
query data/query.species.01.sparql :
```sparql
(...)
SELECT DISTINCT ?taxonName
WHERE {
    ?taxon dc:title ?taxonName .
    ?taxon a u:Taxon .
    ?taxon rdfs:subClassOf* ?root .
    ?root a u:Taxon .
    ?root dc:title "Homo" .
}
```
execute:
```bash
arq --data=knowledge.rdf --query=data/query.species.01.sparql > TMP/species.01.out
```
output TMP/species.01.out:
| taxonName |
|-----|
| "Homo sapiens neanderthalensis" |
| "Homo Sapiens" |
| "Homo" |
Example
Show the diseases that are sub-diseases of COVID-19.
query data/query.diseases.01.sparql :
```sparql
(...)
SELECT DISTINCT ?diseaseName
WHERE {
    ?disease rdfs:label ?diseaseName .
    ?disease a owl:Class .
    ?disease rdfs:subClassOf* ?root .
    ?root a owl:Class .
    ?root rdfs:label "COVID-19" .
}
```
execute:
```bash
arq --data=knowledge.rdf --query=data/query.diseases.01.sparql > TMP/diseases.01.out
```
output TMP/diseases.01.out:
| diseaseName |
|-----|
| "COVID-19" |
| "Severe COVID-19" |
Example
Find the samples with their children, parents and diseases.
query data/query.samples.01.sparql :
```sparql
(...)
SELECT DISTINCT
    (SAMPLE(?sampleName) as ?colName)
    (SAMPLE(?gender) as ?colGender )
    (SAMPLE(?fatherName) as ?colFather )
    (SAMPLE(?motherName) as ?colMother)
    (GROUP_CONCAT(DISTINCT ?childName; SEPARATOR=";") as ?colChildren)
    (GROUP_CONCAT(DISTINCT ?diseaseName; SEPARATOR=";") as ?colDiseases)
WHERE {
    ?sample a foaf:Person .
    ?sample foaf:name ?sampleName .
    OPTIONAL {?sample foaf:gender ?gender .}
    OPTIONAL {
        ?sample u:has-disease ?disease .
        ?disease a owl:Class .
        ?disease rdfs:label ?diseaseName .
    } .
    OPTIONAL {
        ?father a foaf:Person .
        ?sample rel:childOf ?father .
        ?father foaf:gender "male" .
        ?father foaf:name ?fatherName .
    } .
    OPTIONAL {
        ?mother a foaf:Person .
        ?sample rel:childOf ?mother .
        ?mother foaf:gender "female" .
        ?mother foaf:name ?motherName .
    } .
    OPTIONAL {
        ?child a foaf:Person .
        ?child rel:childOf ?sample .
        ?child foaf:name ?childName .
    } .
}
GROUP BY ?sample
```
execute:
```bash
arq --data=knowledge.rdf --query=data/query.samples.01.sparql > TMP/samples.01.out
```
output TMP/samples.01.out:
| colName | colGender | colFather | colMother | colChildren | colDiseases |
|-----|-----|-----|-----|-----|-----|
| "S1" | "female" | "S3" | "S2" | | "Turner Syndrome;COVID-19" |
| "S2" | "female" | | | "S1" | |
| "S3" | "male" | | | "S1" | "Severe COVID-19" |
| "S4" | | | | | |
| "S5" | | | | | |
Example
List all the VCF files that contain at least the sample "S1", along with all of their samples.
query data/query.vcfs.01.sparql :
```sparql
(...)
SELECT DISTINCT ?vcfPath ?fasta ?taxonName ?sampleName
WHERE {
    ?vcf a u:Vcf .
    ?vcf u:filename ?vcfPath .

    ?vcf u:sample ?sample1 .
    ?sample1 a foaf:Person .
    ?sample1 foaf:name "S1" .

    ?vcf u:sample ?sample2 .
    ?sample2 a foaf:Person .
    ?sample2 foaf:name ?sampleName .

    OPTIONAL {
        ?vcf u:reference ?ref .
        ?ref a u:Reference .
        ?ref u:filename ?fasta
        OPTIONAL {
            ?ref u:taxon ?taxon .
            ?taxon a u:Taxon .
            ?taxon dc:title ?taxonName .
        }
    }
}
```
execute:
```bash
arq --data=knowledge.rdf --query=data/query.vcfs.01.sparql > TMP/vcfs.01.out
```
output TMP/vcfs.01.out:
| vcfPath | fasta | taxonName | sampleName |
|-----|-----|-----|-----|
| "data/variants2.vcf" | "data/hg19.fasta" | "Homo Sapiens" | "S1" |
| "data/variants2.vcf" | "data/hg19.fasta" | "Homo Sapiens" | "S2" |
| "data/variants2.vcf" | "data/hg19.fasta" | "Homo Sapiens" | "S3" |
| "data/variants1.vcf" | "data/hg38.fasta" | "Homo Sapiens" | "S1" |
| "data/variants1.vcf" | "data/hg38.fasta" | "Homo Sapiens" | "S5" |
| "data/variants1.vcf" | "data/hg38.fasta" | "Homo Sapiens" | "S2" |
| "data/variants1.vcf" | "data/hg38.fasta" | "Homo Sapiens" | "S3" |
Example
Find the BAM files with their reference, sample, groups, diseases and relatives.
query data/query.bams.01.sparql :
```sparql
(...)
SELECT DISTINCT ?bamPath
    (SAMPLE(?fasta) as ?colFasta)
    (SAMPLE(?taxonName) as ?colTaxon)
    (SAMPLE(?sampleName) as ?colSampleName )
    (GROUP_CONCAT(DISTINCT ?groupName; SEPARATOR=";") as ?colGroups )
    (GROUP_CONCAT(DISTINCT ?gender; SEPARATOR=";") as ?colGender )
    (GROUP_CONCAT(DISTINCT ?diseaseName; SEPARATOR=";") as ?colDiseases)
    (SAMPLE(?fatherName) as ?colFather )
    (SAMPLE(?motherName) as ?colMother)
    (GROUP_CONCAT(DISTINCT ?childName; SEPARATOR="; ") as ?colChildren)
WHERE {
    ?bam a u:Bam .
    ?bam u:filename ?bamPath .

    OPTIONAL {
        ?bam u:reference ?ref .
        ?ref a u:Reference .
        ?ref u:filename ?fasta
        OPTIONAL {
            ?ref u:taxon ?taxon .
            ?taxon a u:Taxon .
            ?taxon dc:title ?taxonName .
        }
    }
    OPTIONAL {
        ?bam u:sample ?sample .
        ?sample a foaf:Person .
        OPTIONAL {?sample foaf:name ?sampleName .}
        OPTIONAL {?sample foaf:gender ?gender .}
        OPTIONAL {
            ?group foaf:member ?sample .
            ?group a foaf:Group .
            ?group foaf:name ?groupName .
        } .
        OPTIONAL {
            ?sample u:has-disease ?disease .
            ?disease a owl:Class .
            ?disease rdfs:label ?diseaseName .
        } .
        OPTIONAL {
            ?father a foaf:Person .
            ?sample rel:childOf ?father .
            ?father foaf:gender "male" .
            ?father foaf:name ?fatherName .
        } .
        OPTIONAL {
            ?mother a foaf:Person .
            ?sample rel:childOf ?mother .
            ?mother foaf:gender "female" .
            ?mother foaf:name ?motherName .
        } .
        OPTIONAL {
            ?child a foaf:Person .
            ?child rel:childOf ?sample .
            ?child foaf:name ?childName .
        } .
    }.
}
GROUP BY ?bamPath
```
execute:
```bash
arq --data=knowledge.rdf --query=data/query.bams.01.sparql > TMP/bams.01.out
```
output TMP/bams.01.out:
| bamPath | colFasta | colTaxon | colSampleName | colGroups | colGender | colDiseases | colFather | colMother | colChildren |
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| "/home/lindenb/src/hts-rdf/data/S1.grch38.bam" | "data/hg38.fasta" | "Homo Sapiens" | "S1" | "Fam01" | "female" | "Turner Syndrome;COVID-19" | "S3" | "S2" | |
| "/home/lindenb/src/hts-rdf/data/S2.grch37.bam" | "data/hg19.fasta" | "Homo Sapiens" | "S2" | "Fam01" | "female" | | | | "S1" |
| "/home/lindenb/src/hts-rdf/data/S4.RF.bam" | "data/rotavirus_rf.fa" | "Rotavirus" | "S4" | | | | | | |
| "/home/lindenb/src/hts-rdf/data/S5.grch38.bam" | "data/hg38.fasta" | "Homo Sapiens" | "S5" | "Fam01" | | | | | |
| "/home/lindenb/src/hts-rdf/data/S3.grch38.bam" | "data/hg38.fasta" | "Homo Sapiens" | "S3" | "Fam01" | "male" | "Severe COVID-19" | | | "S1" |
| "/home/lindenb/src/hts-rdf/data/S1.grch37.bam" | "data/hg19.fasta" | "Homo Sapiens" | "S1" | "Fam01" | "female" | "Turner Syndrome;COVID-19" | "S3" | "S2" | |
The Graph
Here is the RDF graph as an SVG document:
Owner
- Name: Pierre Lindenbaum
- Login: lindenb
- Kind: user
- Location: Nantes, France
- Company: INSERM
- Website: https://genomic.social/web/@yokofakun
- Twitter: yokofakun
- Repositories: 86
- Profile: https://github.com/lindenb
Institut-du-Thorax, Nantes, France.
Citation (CITATION.cff)
cff-version: "1.1.0"
abstract: "Managing sequencing data with RDF"
authors:
  - family-names: "Lindenbaum"
    given-names: "Pierre"
    orcid: "https://orcid.org/0000-0003-0148-9787"
keywords:
  - bioinformatics
  - rdf
  - hts
  - sparql
license: MIT
message: "If you use this code, please cite it using these metadata."
repository-code: "http://github.com/lindenb/hts-rdf"
title: hts-rdf
version: " 2023.09.13"