https://github.com/ammar257ammar/swat4hcls2021-wikidata-subset-hdt
Repository
Basic Info
- Host: GitHub
- Owner: ammar257ammar
- License: mit
- Default Branch: main
- Size: 211 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
Note
This work was carried out during the SWAT4HCLS Hackathon 2021 (Semantic Web Applications and Tools for Healthcare and Life Sciences).
Workflow to subset Wikidata using Wdumper and convert the output to HDT for sharing/querying
The same diagram is below:

Steps:
- I downloaded a Wikidata dump from 2014 (~4 GB compressed) from here.
- I obtained the JSON specs file for Wdumper here. Guillermo generated it to subset Wikidata according to the model published in https://doi.org/10.7554/eLife.52614
- I created Docker images for both "Wdumper" and "hdt-java" and published them on DockerHub with example usage (Wdumper & hdt-java), so they can be used directly without cloning the GitHub repos and building the Docker images.
- I ran the Wdumper tool using the WD dump and the JSON specs as input. The output was a (.nt.gz) file.
  - File size compressed (.gz): ~500MB
  - File size uncompressed (.nt): 6.64GB
Command used:
```shell=bash
docker run -it \
  --rm --name wdumper \
  -v YOUR_DATA_PATH_HERE:/data \
  -e DUMPS_PATH=/data \
  aammar/wdumper \
  /data/wikidata_2014_dump.json.gz \
  /data/life_sciences.json
```
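Before moving on, it can help to check how many triples Wdumper actually wrote. The following is a minimal sketch, assuming the output is plain gzipped N-Triples with one triple per line. The function name `count_triples` is hypothetical and is not part of Wdumper or hdt-java.

```shell
#!/usr/bin/env bash
# count_triples FILE.nt.gz
# Hypothetical helper (not part of Wdumper or hdt-java).
# Counts the lines of a gzipped N-Triples file that are neither blank
# nor comments, assuming one triple per line.
count_triples() {
  # grep -c exits with status 1 when the count is 0, so mask that.
  gzip -dc "$1" | grep -cvE '^[[:space:]]*(#|$)' || true
}
```

For example, `count_triples wdump-1.nt.gz` prints a single number. It should be in the same range as the "Total Triples" value that rdf2hdt.sh reports, although the two may differ if the input contains duplicate triples.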
- Next, I tried to run the hdt-java Docker image on the .nt.gz file, but apparently some IRIs contain illegal characters and the file needs cleaning.
- I unzipped the .nt file, cleaned it (regex), and zipped it again, like this:
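```shell=bash
gunzip wdump-1.nt.gz

# Remove a "}" from inside an IRI (<...>)
sed -i -E 's/(<.*)}(.*>)/\1\2/' wdump-1.nt
# Remove a literal backslash-n sequence from inside an IRI
sed -i -E 's/(<.*)\\n(.*>)/\1\2/' wdump-1.nt
# Remove a "|" from inside an IRI (escaped, because a bare "|" means alternation with -E)
sed -i -E 's/(<.*)\|(.*>)/\1\2/' wdump-1.nt

gzip wdump-1.nt
```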
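To see how many lines are affected before or after cleaning, a rough check like the one below can be used. It is only a heuristic sketch: it assumes that any `<...>` token without spaces or quotes is an IRI, and it looks for a few of the characters the N-Triples grammar does not allow inside an IRI (`{`, `}`, `|`, `^` and the backtick). The function name `count_suspect_iri_lines` is hypothetical, and literals that happen to contain angle brackets can produce false positives.

```shell
#!/usr/bin/env bash
# count_suspect_iri_lines FILE.nt
# Hypothetical helper: counts lines in which an IRI-like <...> token
# contains one of the characters { } | ^ ` (a heuristic, not a validator).
count_suspect_iri_lines() {
  # grep -c exits with status 1 when the count is 0, so mask that.
  grep -cE '<[^<> "]*[{}|^`][^<> "]*>' "$1" || true
}
```

Running `count_suspect_iri_lines wdump-1.nt` on the uncompressed file prints the number of suspect lines. A result of 0 after the `sed` commands suggests that these particular characters are gone, but it does not prove that the file is valid N-Triples.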
- I ran the hdt-java (rdf2hdt.sh command) on the .nt.gz file to get the HDT compressed file (still running now). Command used:
```shell=bash
docker run -it \
  --rm --name hdt \
  -v YOUR_DATA_PATH_HERE:/data \
  aammar/hdt-java \
  rdf2hdt.sh \
  /data/wdump-1.nt.gz \
  /data/lifesciences_subset.hdt
```
- The results of the compression are really good:
  - File converted in: 16 min 37 sec 9 ms 389 us
  - Total Triples: 60161218
  - Different subjects: 5715877
  - Different predicates: 802
  - Different objects: 38767299
  - Common Subject/Object: 216743
  - HDT saved to file in: 14 sec 493 ms 307 us
- HDT file size: 589MB
- Input file size (.nt): 6.64GB = 6800MB
- (Compression ratio: 589 / 6800 = 8.66%)
- The HDT file is provided in the Releases of this GitHub repo.
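The compression ratio above is plain arithmetic, and it can be reproduced for other subsets with a small helper. This is a minimal sketch; the function name `compression_ratio` is hypothetical, and both sizes are assumed to be given in the same unit (MB here).

```shell
#!/usr/bin/env bash
# compression_ratio HDT_SIZE NT_SIZE
# Hypothetical helper: prints the HDT size as a percentage of the
# uncompressed N-Triples size, rounded to two decimals.
compression_ratio() {
  awk -v hdt="$1" -v nt="$2" 'BEGIN { printf "%.2f\n", 100 * hdt / nt }'
}

compression_ratio 589 6800   # prints 8.66
```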
The next step is to run an instance of the Fuseki server (Docker), provided by Jose, on top of this HDT file and try some queries.
Owner
- Name: Ammar Ammar
- Login: ammar257ammar
- Kind: user
- Location: The Netherlands
- Company: Maastricht University
- Repositories: 14
- Profile: https://github.com/ammar257ammar