https://github.com/ammar257ammar/swat4hcls2021-wikidata-subset-hdt


Repository

Basic Info

Host: GitHub
Owner: ammar257ammar
License: mit
Default Branch: main
Size: 211 KB

Statistics

Stars: 0
Watchers: 1
Forks: 1
Open Issues: 1
Releases: 0

Created over 5 years ago · Last pushed about 5 years ago


README.md

Note

This work was carried out during the SWAT4HCLS Hackathon 2021 (Semantic Web Applications and Tools for Healthcare and Life Sciences).

Workflow to subset Wikidata using Wdumper and convert the output to HDT for sharing/querying

Shared Link

The same diagram is below:

Steps:

I downloaded a 2014 Wikidata dump (~4 GB compressed) from here
I obtained the JSON specs file for Wdumper from here. Guillermo generated it to subset Wikidata according to the model published in https://doi.org/10.7554/eLife.52614
I created a Docker image for each of "Wdumper" and "hdt-java" and published them on DockerHub with example usage (Wdumper & hdt-java), so they can be used directly without cloning the GitHub repos and building the Docker images.
I ran the Wdumper tool with the Wikidata dump and the JSON specs as input. The output was a gzipped N-Triples (.nt.gz) file.

File Size compressed (.gz): ~500MB
File Size Uncompressed (.nt): 6.64GB

Command used:

```shell=bash
docker run -it \
  --rm --name wdumper \
  -v YOUR_DATA_PATH_HERE:/data \
  -e DUMPS_PATH=/data \
  aammar/wdumper \
  /data/wikidata_2014_dump.json.gz \
  /data/life_sciences.json
```

Next, I tried to run the hdt-java Docker image on the .nt.gz file, but some IRIs contain illegal characters, so the file needs cleaning first.
I unzipped the .nt file, cleaned it with regular expressions, and zipped it again, like this:

```shell=bash
gunzip wdump-1.nt.gz

# Remove a stray "}", a literal "\n" escape, or a "|" found between "<" and ">"
sed -i -E 's/(<.*)}(.*>)/\1\2/' wdump-1.nt
sed -i -E 's/(<.*)\\n(.*>)/\1\2/' wdump-1.nt
sed -i -E 's/(<.*)\|(.*>)/\1\2/' wdump-1.nt

gzip wdump-1.nt
```
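To see what the clean-up does to a single line, the sketch below applies the three expressions to a made-up triple. It assumes the expressions are meant to use `.*` wildcards and to match a literal `}`, a literal backslash-`n` sequence, and a literal `|` (written `\\n` and `\|` under `sed -E`). The `clean_iris` helper and the `example.org` IRIs are hypothetical and only serve as illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: applies the three clean-up expressions to stdin.
clean_iris() {
  sed -E \
    -e 's/(<.*)}(.*>)/\1\2/' \
    -e 's/(<.*)\\n(.*>)/\1\2/' \
    -e 's/(<.*)\|(.*>)/\1\2/'
}

# Made-up sample triple with a stray "}" inside the object IRI.
sample='<http://example.org/s> <http://example.org/p> <http://example.org/o}x> .'
cleaned=$(printf '%s\n' "$sample" | clean_iris)
printf '%s\n' "$cleaned"
```

Each expression removes at most one offending character per line, so a line with several of them may need repeated passes. Because the match runs from the first `<` to the last `>` on the line, a `}` or `|` inside a literal that sits between two IRIs would be removed as well.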

I ran hdt-java (the rdf2hdt.sh command) on the .nt.gz file to produce the compressed HDT file. Command used:

```shell=bash
docker run -it \
  --rm --name hdt \
  -v YOUR_DATA_PATH_HERE:/data \
  aammar/hdt-java \
  rdf2hdt.sh \
  /data/wdump-1.nt.gz \
  /data/lifesciences_subset.hdt
```

The results of the compression are really good:

File converted in: 16 min 37 sec 9 ms 389 us
Total Triples: 60161218
Different subjects: 5715877
Different predicates: 802
Different objects: 38767299
Common Subject/Object: 216743
HDT saved to file in: 14 sec 493 ms 307 us

HDT file size: 589MB
Input file size (.nt): 6.64GB = 6800MB
(Compression ratio: 589 / 6800 = 8.66%)
The HDT file is provided in the Releases of this GitHub repo.
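The ratio quoted above can be reproduced from the reported file sizes. This is a minimal sketch, assuming 1 GB = 1024 MB (which is what the 6800MB figure implies) and taking the gzipped size as the approximate ~500MB reported earlier. The variable names are illustrative.

```shell
#!/usr/bin/env bash
hdt_mb=589     # HDT file size in MB
nt_gb=6.64     # uncompressed .nt size in GB
gz_mb=500      # gzipped .nt.gz size in MB (approximate)

# HDT size as a percentage of the uncompressed N-Triples size.
ratio=$(LC_ALL=C awk -v hdt="$hdt_mb" -v gb="$nt_gb" \
  'BEGIN { printf "%.2f", hdt / (gb * 1024) * 100 }')

# HDT size relative to the gzipped N-Triples size.
vs_gz=$(LC_ALL=C awk -v hdt="$hdt_mb" -v gz="$gz_mb" \
  'BEGIN { printf "%.2f", hdt / gz }')

echo "HDT is ${ratio}% of the uncompressed .nt size"
echo "HDT is ${vs_gz}x the size of the .nt.gz file"
```

By these figures the HDT file is slightly larger than the gzipped N-Triples file, but it can be queried directly, which the gzip archive cannot.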

The next step is to run an instance of the Fuseki server (Docker) provided by Jose on top of this HDT file and try some queries.

Owner

Name: Ammar Ammar
Login: ammar257ammar
Kind: user
Location: The Netherlands
Company: Maastricht University

Repositories: 14
Profile: https://github.com/ammar257ammar
