🗣 Multilingual RDF Verbalizer – Google Summer of Code 2019

https://github.com/dbpedia/neural-rdf-verbalizer

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • ○ CITATION.cff file
  • ○ codemeta.json file
  • ○ .zenodo.json file
  • ○ DOI references
  • ✓ Academic publication links
    Links to: arxiv.org
  • ○ Academic email domains
  • ○ Institutional organization owner
  • ○ JOSS paper metadata
  • ○ Scientific vocabulary similarity
    Low similarity (12.6%) to scientific vocabulary

Keywords

deep-learning knowledge-graph natural-language-generation neural-machine-translation
Last synced: 5 months ago

Repository

🗣 Multilingual RDF Verbalizer – Google Summer of Code 2019

Basic Info
  • Host: GitHub
  • Owner: dbpedia
  • License: mit
  • Language: Python
  • Default Branch: final
  • Homepage:
  • Size: 101 MB
Statistics
  • Stars: 21
  • Watchers: 13
  • Forks: 7
  • Open Issues: 2
  • Releases: 0
Topics
deep-learning knowledge-graph natural-language-generation neural-machine-translation
Created over 6 years ago · Last pushed almost 3 years ago
Metadata Files
Readme License

README.md

Multilingual RDF verbalizer - GSoC/2019

Author - Dwaraknath Gnaneshwar

Abstract :

This project aims to create a deep-learning-based natural language generation framework that verbalizes RDF triples.

The input is a set of RDF triples, each of the form < subject | predicate | object >; the model takes in such a set and outputs the information in human-readable form.

A high-level overview of the dataflow is as follows:

[Dataflow diagram – image not shown]

For example, given < Dwarak | birthplace | Chennai > and < Dwarak | lives in | India >, the output would be: "Dwarak was born in Chennai, and lives in India." The model must be capable of doing the same in multiple languages, hence the name multilingual RDF verbalizer.
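
As a rough, hypothetical illustration of this input format (the helper below is not part of the repository), the < subject | predicate | object > notation can be parsed into Python tuples like so:

```python
# Illustrative sketch: parse the "< subject | predicate | object >" notation
# used above. Not part of the repository's preprocessing code.

def parse_triples(text):
    """Extract (subject, predicate, object) tuples from a triple-set string."""
    triples = []
    for chunk in text.split(">"):
        chunk = chunk.strip()
        if not chunk.startswith("<"):
            continue
        parts = [p.strip() for p in chunk.lstrip("<").split("|")]
        if len(parts) == 3:
            triples.append(tuple(parts))
    return triples

example = "< Dwarak | birthplace | Chennai > < Dwarak | lives in | India >"
print(parse_triples(example))
# [('Dwarak', 'birthplace', 'Chennai'), ('Dwarak', 'lives in', 'India')]
```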

Model Architecture :

We use an attention-based encoder-decoder architecture with a Graph Attention Network encoder and a Transformer decoder, along with a pure-RNN model and a pure-Transformer model.

The architecture of our model takes the following form:

[Architecture diagram – image not shown]

The dataset used is the WebNLG challenge dataset.

Intuition :

We justify the use of Graph Attention Networks by noting that, in a graph, each node is related to its first-order neighbours. While generating the encoded representation, which is passed to the decoder to produce a probability distribution over the target vocabulary, we consider each node's features together with its neighbours' features and apply mutual- and self-attention mechanisms over them. The model must combine these features while preserving the semantics of the triples. By using graph networks we inject a sense of structure into the encoder, which is useful given that RDF triples can be maintained and viewed as parts of knowledge graphs.
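
For intuition only, here is a minimal NumPy sketch of single-head graph attention in the spirit of the Graph Attention Network idea; it is not taken from this repository, and the weight shapes and the fully connected toy graph are arbitrary assumptions.

```python
# Minimal single-head graph-attention sketch (illustrative only, not the
# project's encoder implementation).
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gat_layer(H, A, W, a):
    """One graph-attention layer.
    H: node features (N, F); A: adjacency with self-loops (N, N);
    W: weight matrix (F, F_out); a: attention vector (2 * F_out,)."""
    Z = H @ W                                   # linearly transform node features
    out = np.zeros_like(Z)
    for i in range(Z.shape[0]):
        neigh = np.where(A[i] > 0)[0]           # node i's neighbours (incl. itself)
        logits = np.array(
            [leaky_relu(np.concatenate([Z[i], Z[j]]) @ a) for j in neigh]
        )
        alpha = softmax(logits)                 # attention over the neighbourhood
        out[i] = (alpha[:, None] * Z[neigh]).sum(axis=0)  # weighted feature mix
    return out

# Tiny example: 3 fully connected nodes with self-loops.
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
A = np.ones((3, 3))
out = gat_layer(H, A, rng.normal(size=(4, 4)), rng.normal(size=(8,)))
print(out.shape)  # (3, 4)
```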

Usage :

  • To preprocess the dataset and save the graph nodes and edges:

```
python preprocess.py \
  --train_src 'data/processed_data/eng/train_src' \
  --train_tgt 'data/processed_data/eng/train_tgt' \
  --eval_src 'data/processed_data/eng/eval_src' \
  --eval_tgt 'data/processed_data/eng/eval_tgt' \
  --test_src 'data/processed_data/eng/test_src' \
  --spl_sym 'data/processed_data/special_symbols' \
  --model gat --lang eng --sentencepiece True \
  --vocab_size 16000 --sentencepiece_model 'bpe'
```

  • To start training with the Graph Attention Network encoder and Transformer decoder. The preprocessed files are stored in the data folder; use that path in the code snippet below. Set the hyper-parameters as you see fit and provide the necessary arguments.
  • NOTE: If you use sentencepiece for preprocessing but do not pass the flag to the training script, you may get shape errors. Also, for the Transformer and RNN models the source and target vocabularies are the same.

```
python train_single.py \
  --train_path 'data/processed_graphs/eng/gat/train' \
  --eval_path 'data/processed_graphs/eng/gat/eval' \
  --test_path 'data/processed_graphs/eng/gat/test' \
  --src_vocab 'vocabs/gat/eng/src_vocab' \
  --tgt_vocab 'vocabs/gat/eng/train_vocab.model' \
  --batch_size 1 --enc_type gat --dec_type transformer --model gat --vocab_size 16000 \
  --emb_dim 16 --hidden_size 16 --filter_size 16 --beam_size 5 \
  --beam_alpha 0.1 --enc_layers 1 --dec_layers 1 --num_heads 1 --sentencepiece True \
  --steps 10000 --eval_steps 1000 --checkpoint 1000 --alpha 0.2 --dropout 0.2 \
  --reg_scale 0.0 --decay True --decay_steps 5000 --lang eng --debug_mode False \
  --eval 'data/processed_data/eng/eval_src' --eval_ref 'data/processed_data/eng/eval_tgt'
```

  • To train the multilingual model, which concatenates the datasets of the individual languages and appends a token to each language's input sentences:

```
python train_multiple.py \
  --train_path 'data/processed_graphs/eng/gat/train' \
  --eval_path 'data/processed_graphs/eng/gat/eval' \
  --test_path 'data/processed_graphs/eng/gat/test' \
  --src_vocab 'vocabs/gat/eng/src_vocab' \
  --tgt_vocab 'vocabs/gat/eng/train_vocab.model' \
  --batch_size 1 --enc_type gat --dec_type transformer \
  --model multi --vocab_size 16000 --emb_dim 16 --hidden_size 16 \
  --filter_size 16 --beam_size 5 --sentencepiece_model 'bpe' --beam_alpha 0.1 \
  --enc_layers 1 --dec_layers 1 --num_heads 1 --sentencepiece True --steps 10000 \
  --eval_steps 1000 --checkpoint 1000 --alpha 0.2 --dropout 0.2 --distillation False \
  --reg_scale 0.0 --decay True --decay_steps 5000 --lang multi --debug_mode False \
  --eval 'data/processed_data/eng/eval_src' --eval_ref 'data/processed_data/eng/eval_tgt'
```
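
The preprocessing step saves its outputs (graph nodes, edges, vocabularies) to disk as pickle dumps. If you want to sanity-check what was written, a generic peek like the following works; the path below is a placeholder, and the structure of the dump is whatever preprocess.py chose to pickle, so treat this as a rough sketch rather than part of the project.

```python
# Hedged sketch: inspect a preprocessed pickle dump.
# "path/to/preprocessed_dump" is a placeholder; preprocess.py decides the real
# location and the structure of the object it pickles.
import pickle

with open("path/to/preprocessed_dump", "rb") as f:
    data = pickle.load(f)

print(type(data))  # e.g. a dict/list of per-example graph nodes and edges
```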

  • If you want to train an RNN or Transformer model, the model's input is the .triple file and its target is the .lex file; a minimal loading sketch is shown below.
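
Purely as an illustration (the file names and the line-aligned format assumed here are not taken from the repository), such a parallel .triple/.lex pair could be read like this:

```python
# Illustrative sketch of reading a parallel .triple / .lex file pair.
# File names are hypothetical; the repository's own loaders may differ.

def load_parallel(triple_path, lex_path):
    """Return a list of (triple_line, lexicalisation_line) pairs."""
    with open(triple_path, encoding="utf-8") as f_src, \
         open(lex_path, encoding="utf-8") as f_tgt:
        return [(s.strip(), t.strip()) for s, t in zip(f_src, f_tgt)]

# Example usage (paths are placeholders):
# pairs = load_parallel("train.triple", "train.lex")
# print(pairs[0])
```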

Use Colab

  • To use Google Colab, set the argument 'use_colab' to True, run the following command first, and then run the above commands with '!' in front.

```
!git clone https://<github_access_token>@github.com/DwaraknathT/GSoC-19.git
```

  • You can get your GitHub access token from GitHub's developer settings.

  • To preprocess the files:

```
!python 'GSoC-19/preprocess.py' \
  --train_src 'GSoC-19/data/processed_data/eng/train_src' \
  --train_tgt 'GSoC-19/data/processed_data/eng/train_tgt' \
  --eval_src 'GSoC-19/data/processed_data/eng/eval_src' \
  --eval_tgt 'GSoC-19/data/processed_data/eng/eval_tgt' \
  --test_src 'GSoC-19/data/processed_data/eng/test_src' \
  --spl_sym 'GSoC-19/data/processed_data/special_symbols' \
  --model gat --lang eng --use_colab True \
  --vocab_size 16000 --sentencepiece_model 'bpe' --sentencepiece True
```

  • Replace the 'eng' in each parameter with 'ger' or 'rus' to process the German or Russian corpus. You can also set sentencepiece to True and change sentencepiece_model to 'unigram' or 'word'. The vocab size is usually set to 32000, but can be set to anything.

  • To start training:

```
!python 'GSoC-19/train_single.py' \
  --train_path '/content/gdrive/My Drive/data/processed_graphs/eng/gat/train' \
  --eval_path '/content/gdrive/My Drive/data/processed_graphs/eng/gat/eval' \
  --test_path '/content/gdrive/My Drive/data/processed_graphs/eng/gat/test' \
  --src_vocab 'vocabs/gat/eng/src_vocab' \
  --tgt_vocab 'vocabs/gat/eng/train_vocab.model' \
  --batch_size 64 --enc_type gat --dec_type transformer \
  --model gat --vocab_size 16000 \
  --emb_dim 256 --hidden_size 256 \
  --filter_size 512 --use_bias True --beam_size 5 \
  --beam_alpha 0.1 --enc_layers 6 --dec_layers 6 \
  --num_heads 8 --sentencepiece True \
  --steps 150 --eval_steps 500 --checkpoint 1000 \
  --alpha 0.2 --dropout 0.2 --debug_mode False \
  --reg_scale 0.0 --learning_rate 0.0001 \
  --lang eng --use_colab True \
  --eval 'GSoC-19/data/processed_data/eng/eval_src' \
  --eval_ref 'GSoC-19/data/processed_data/eng/eval_tgt'
```
  • If you use sentencepiece, the vocab_size argument must match the vocab_size used in the preprocess script (see the sketch after this list). The preprocess script automatically saves the preprocessed datasets as pickle dumps in your drive.

  • To run the multilingual model, replace train_single.py with train_multiple.py. All languages must be preprocessed before training the multilingual model. The multilingual model picks up the preprocessed data for all languages automatically, so there is no need to change train_path, eval_path, or test_path. The lang, eval, and eval_ref parameters must be changed to 'multi' so that its checkpoints are saved in a folder of the same name.
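
As a hedged illustration of the vocab_size consistency point above (file names such as corpus.txt and bpe_model are placeholders, not outputs of this repository), a sentencepiece model carries its vocabulary size with it, so you can check it against the value you plan to pass to the training script:

```python
# Illustrative sketch: train a sentencepiece BPE model and verify its vocabulary
# size matches the --vocab_size you pass to preprocess.py / train_single.py.
# All file names here are placeholders.
import sentencepiece as spm

VOCAB_SIZE = 16000

# Train a BPE model on a plain-text corpus (hypothetical path).
spm.SentencePieceTrainer.Train(
    "--input=corpus.txt --model_prefix=bpe_model "
    "--vocab_size={} --model_type=bpe".format(VOCAB_SIZE)
)

sp = spm.SentencePieceProcessor()
sp.Load("bpe_model.model")

# A mismatch here is what leads to shape errors in the embedding layers.
assert sp.GetPieceSize() == VOCAB_SIZE, "vocab_size mismatch"

print(sp.EncodeAsIds("Dwarak was born in Chennai."))
```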

Credits :

Owner

  • Name: DBpedia
  • Login: dbpedia
  • Kind: organization
  • Email: dbpedia-discussion@lists.sourceforge.net

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Dependencies

requirements.txt pypi
  • networkx ==2.4
  • pickle *
  • sentencepiece ==0.1.85
  • tensorflow-gpu ==1.15.0
  • tqdm ==4.41.0