https://github.com/ammar257ammar/swat4hcls2021-wikidata-subset-hdt

Repository

Basic Info
  • Host: GitHub
  • Owner: ammar257ammar
  • License: mit
  • Default Branch: main
  • Size: 211 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 1
  • Open Issues: 1
  • Releases: 0
Created over 5 years ago · Last pushed about 5 years ago
Metadata Files
Readme License

README.md

Note

This work was carried out during the SWAT4HCLS Hackathon 2021 (Semantic Web Applications and Tools for Healthcare and Life Sciences).

Workflow to subset Wikidata using Wdumper and convert the output to HDT for sharing/querying

Shared Link

The same diagram is below:

Steps:

  1. I downloaded a 2014 Wikidata dump (~4 GB compressed) from here

  2. I obtained the JSON spec file for Wdumper, created by Guillermo here, which subsets Wikidata according to the model published in https://doi.org/10.7554/eLife.52614

  3. I created Docker images for both "Wdumper" and "hdt-java" and published them on DockerHub with example usage (Wdumper & hdt-java), so they can be used directly without cloning the GitHub repos and building the Docker images.

  4. I ran the wdumper tool using the WD dump and the JSON specs as input. The output was a (.nt.gz) file.

  • File Size compressed (.gz): ~500MB
  • File Size Uncompressed (.nt): 6.64GB

Command used:

```shell
docker run -it \
  --rm --name wdumper \
  -v YOUR_DATA_PATH_HERE:/data \
  -e DUMPS_PATH=/data \
  aammar/wdumper \
  /data/wikidata_2014_dump.json.gz \
  /data/life_sciences.json
```
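A quick sanity check on the Wdumper output is to count its triples before converting it. The following is a minimal sketch, not part of the original workflow: `count_triples` is a hypothetical helper, and it assumes one triple per line (as in N-Triples), ignoring blank lines and `#` comments. The sample file is made up for illustration.

```shell
#!/usr/bin/env bash
# Count triples in a gzip-compressed N-Triples file: one triple per line,
# skipping blank lines and '#' comment lines.
count_triples() {
  gzip -dc < "$1" | grep -cvE '^[[:space:]]*(#|$)' || true
}

# Tiny hypothetical sample; the real input would be wdump-1.nt.gz.
sample_gz="$(mktemp)"
printf '%s\n' \
  '# sample dump' \
  '<http://example.org/s> <http://example.org/p> <http://example.org/o1> .' \
  '' \
  '<http://example.org/s> <http://example.org/p> "two" .' \
  | gzip -c > "$sample_gz"

count_triples "$sample_gz"
```

For the sample this prints 2. On the real file the count should be roughly comparable to the number of triples reported by the HDT conversion, though the two may differ if the dump contains duplicate triples.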

  5. Next, I tried to run the hdt-java Docker image on the .nt.gz file, but some IRIs contain illegal characters, so the file needs cleaning.

  6. I unzipped the .nt file, cleaned it with regular expressions, and zipped it again, like this:

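```shell
gunzip wdump-1.nt.gz

# Remove a '}' that appears inside an IRI (<...>)
sed -i -E 's/(<.*)}(.*>)/\1\2/' wdump-1.nt
# Remove a literal "\n" sequence inside an IRI (assumed intent; sed reads one line at a time)
sed -i -E 's/(<.*)\\n(.*>)/\1\2/' wdump-1.nt
# Remove a '|' inside an IRI (escaped, since a bare | is alternation with -E)
sed -i -E 's/(<.*)\|(.*>)/\1\2/' wdump-1.nt

gzip wdump-1.nt
```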
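To see how many lines are affected before and after cleaning, a rough check along these lines can help. It is a sketch rather than part of the original workflow: `count_bad_iris` is a hypothetical helper, and the characters it looks for (`{`, `}`, `|`, `^` and the backtick) are only a subset of what the N-Triples grammar forbids in IRIs. It may also flag `<...>` text that happens to appear inside a string literal.

```shell
#!/usr/bin/env bash
# Count lines in an N-Triples file where a <...> term contains one of
# the characters { } | ^ ` (none of which may appear unescaped in an IRI).
count_bad_iris() {
  grep -cE '<[^<>]*[{}|^`][^<>]*>' "$1" || true
}

# Tiny hypothetical sample; the real input would be wdump-1.nt.
sample_nt="$(mktemp)"
cat > "$sample_nt" <<'EOF'
<http://example.org/s1> <http://example.org/p> <http://example.org/ok> .
<http://example.org/s2> <http://example.org/p> <http://example.org/bad}iri> .
<http://example.org/s3> <http://example.org/p> <http://example.org/bad|iri> .
EOF

count_bad_iris "$sample_nt"
```

For the sample this prints 2. Running it on the uncompressed file after the `sed` pass gives a quick indication of whether any such lines remain.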

  7. I ran hdt-java (the rdf2hdt.sh command) on the .nt.gz file to get the compressed HDT file. Command used:

```shell
docker run -it \
  --rm --name hdt \
  -v YOUR_DATA_PATH_HERE:/data \
  aammar/hdt-java \
  rdf2hdt.sh \
  /data/wdump-1.nt.gz \
  /data/lifesciences_subset.hdt
```

  8. The conversion log and the resulting file sizes show good compression:

File converted in: 16 min 37 sec 9 ms 389 us
Total Triples: 60161218
Different subjects: 5715877
Different predicates: 802
Different objects: 38767299
Common Subject/Object:216743
HDT saved to file in: 14 sec 493 ms 307 us

  • HDT file size: 589MB
  • Input file size (.nt): 6.64GB = 6800MB
  • (Compression ratio: 589 / 6800 = 8.66%)
  • The HDT file is provided in the Releases of this GitHub repo.
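The ratio above can be reproduced with a small helper. This is only an illustrative sketch: `ratio_pct` is a hypothetical function name, and the sizes are the approximate figures quoted in this README (in MB).

```shell
#!/usr/bin/env bash
# ratio_pct COMPRESSED ORIGINAL: size of COMPRESSED as a percentage of
# ORIGINAL (both in the same unit), rounded to two decimals.
ratio_pct() {
  awk -v c="$1" -v o="$2" 'BEGIN { printf "%.2f\n", 100 * c / o }'
}

ratio_pct 589 6800    # HDT vs uncompressed .nt
ratio_pct 500 6800    # .nt.gz vs uncompressed .nt, for comparison
```

With these inputs the two calls print 8.66 and 7.35.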

The next step is to run an instance of the Fuseki server (Docker) provided by Jose on top of this HDT file and try some queries.

Owner

  • Name: Ammar Ammar
  • Login: ammar257ammar
  • Kind: user
  • Location: The Netherlands
  • Company: Maastricht University
