retomaton

PyTorch code for the RetoMaton paper: "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" (ICML 2022)

Keywords

icml-2022 knn knn-lm language modeling nearest nearest-neighbor neuro-symbolic neurosymbolic pytorch retomaton retrieval

Keywords from Contributors

count-model econometrics generalized-linear-models hypothesis-testing prediction regression-models robust-estimation timeseries-analysis

Last synced: 6 months ago · JSON representation ·

Repository

PyTorch code for the RetoMaton paper: "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" (ICML 2022)

Basic Info

Host: GitHub
Owner: neulab
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 6.62 MB

Statistics

Stars: 73
Watchers: 6
Forks: 4
Open Issues: 1
Releases: 0

Topics

icml-2022 knn knn-lm language modeling nearest nearest-neighbor neuro-symbolic neurosymbolic pytorch retomaton retrieval

Created almost 4 years ago · Last pushed over 3 years ago

Metadata Files

Readme Contributing License Code of conduct Citation

RetoMaton: Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval

A neuro-symbolic language model, based on a base neural-LM and an automaton that retrieves examples from the training data. This is an official implementation of the model described in:

Uri Alon, Frank F. Xu, Junxian He, Sudipta Sengupta, Dan Roth, and Graham Neubig, \ "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval"

July 2022 - A Hugging Face transformers implementation of RetoMaton and kNN-LM is available at https://github.com/neulab/knn-transformers

June 2022 - Overview tweet!

May 2022 - The paper was accepted to ICML'2022! See you in Baltimore in July 2022 [Poster here]

April 2022 - a talk video is available here: [1-hour video] [5-min video]

Please let us know if anything here is not working as expected, and feel free to create new issues with any questions.

Overview

Results

WikiText-103:

Law-MT, with a based LM that was trained on WMT News Crawl:

Law-MT, with a base LM that was fine-tuned on Law-MT:

| Method | ppl | ppl, saving 50% of the searches | | :--- | ----: | ---: | | Fine-tuned LM | 8.61 | 8.61 | | kNN-LM | 7.93 | 8.25 | | AdaptRet baseline (He et al., 2021) | 7.81 | 7.91 | | RetoMaton (this work) | 7.10 | 7.15|

This project is based on python3 and PyTorch 1.9.0. To check PyTorch version: python python3 -c 'import torch; print(torch.__version__)
The project also depends on the faiss library. We recommend using the GPU version of faiss: pip install faiss-gpu The CPU version can be installed using pip install faiss. On a Macbook, use the Anaconda installation instead: conda install -c conda-forge pytorch faiss-cpu
Finally, from this project's directory, run: bash pip install --editable .

Hardware

Experiments for this paper were conducted on a machine that contains 16GB of RAM, and a single NVIDIA RTX 3090 GPU.

Saving the Wikitext-103 datastore requires 200GB of disk space (in fp16, which does not degrade the performance compared to fp32).

Quickstart

Step 0: Clone this repository:

bash git clone https://github.com/neulab/retomaton cd retomaton mkdir -p checkpoints/wt103 mkdir -p checkpoints/law

Step 1: Preparing the data

You can either download our preprocessed Wikitext-103 and Law-MT datasets, or preprocess them yourself.

Download the preprocessed Wikitext-103 dataset:

bash wget https://retomaton.s3.us-east-2.amazonaws.com/wt103/wiki103_preprocessed.tar.gz tar -xzvf wiki103_preprocessed.tar.gz

Download the preprocessed Law-MT dataset:

bash wget https://retomaton.s3.us-east-2.amazonaws.com/law/law_preprocessed.tar.gz tar -xzvf law_preprocessed.tar.gz

Preprocessing the dataset (not needed if you already downloaded our preprocessed dataset):

We include Fairseq's instructions on how to prepare the data here.

Preprocessing Wikitext-103:

```bash cd examples/language_model/ bash prepare-wikitext-103.sh cd ../..

TEXT=examples/language_model/wikitext-103 python preprocess.py \ --only-source \ --trainpref $TEXT/wiki.train.tokens \ --validpref $TEXT/wiki.valid.tokens \ --testpref $TEXT/wiki.test.tokens \ --destdir data-bin/wikitext-103 \ --workers 20 ```

Preprocessing Law-MT:

The data is originally from: https://github.com/roeeaharoni/unsupervised-domain-clusters. We used the law/ subdirectory, and only the English "source" files. Then, we re-tokenized the dataset using the model's BPE tokenizer.

The tokenized dataset can be downloaded from: mkdir -p datasets/law wget -P datasets/law/ https://retomaton.s3.us-east-2.amazonaws.com/law/law_tokenized.tar.gz cd datasets/law tar -xzvf law_tokenized.tar.gz cd ../../

and then preprocessing it can be performed using:

TEXT=datasets/law python preprocess.py \ --only-source \ --trainpref $TEXT/train.tokenized \ --validpref $TEXT/dev.en.tokenized \ --testpref $TEXT/test.en.tokenized \ --destdir data-bin/law \ --workers 20

Step 2: Downloading the Base Language Model

The models that we used can be downloaded from the following sources: For Wikitext-103: wget -P checkpoints/wt103/ https://nlp.stanford.edu/projects/knnlm/wt103_checkpoint_best.pt

For Law-MT: wget -P checkpoints/law/ https://dl.fbaipublicfiles.com/fairseq/models/lm/wmt19.en.tar.gz cd checkpoints/law tar -xzvf wmt19.en.tar.gz cd ..

We also include Fairseq's instructions on how to train the language model here:

bash python train.py --task language_modeling \ data-bin/wikitext-103 \ --save-dir checkpoints/ \ --arch transformer_lm_wiki103 \ --max-update 286000 --max-lr 1.0 --t-mult 2 --lr-period-updates 270000 --lr-scheduler cosine --lr-shrink 0.75 \ --warmup-updates 16000 --warmup-init-lr 1e-07 --min-lr 1e-09 --optimizer nag --lr 0.0001 --clip-norm 0.1 \ --criterion adaptive_loss --max-tokens 3072 --update-freq 3 --tokens-per-sample 3072 --seed 1 \ --sample-break-mode none --skip-invalid-size-inputs-valid-test --ddp-backend=no_c10d

This model was trained on 8 gpus.

Step 3: Evaluating the base Language Model

To evaluate the base model on the validation set (without any retrieval):

For Wikitext-103: bash python eval_lm.py data-bin/wikitext-103 \ --path checkpoints/wt103/wt103_checkpoint_best.pt \ --sample-break-mode complete --max-tokens 3072 \ --context-window 2560 --softmax-batch 1024 --batch-size 2 \ --gen-subset valid

For Law-MT: bash python eval_lm.py data-bin/law \ --sample-break-mode eos \ --path checkpoints/law/wmt19.en/model.pt \ --max-tokens 2048 --context-window 0 --batch-size 2 \ --gen-subset valid --remove-bpe

Notice that the main difference between the datasets is that in Law-MT we use the flags --remove-bpe and --sample-break-mode eos, and also the --max-tokens and --context-window values are different.

Step 4: Saving the keys and values for the datastore

The next step is to run model evaluation over the entire training set, and save keys and values.

To download the keys and values that we already saved for Wikitext-103:

bash wget -P checkpoints/wt103/ https://retomaton.s3.us-east-2.amazonaws.com/wt103/dstore16_vals.npy wget -P checkpoints/wt103/ https://retomaton.s3.us-east-2.amazonaws.com/wt103/dstore16_keys.npy

Note: The keys of Wikitext-103 take 200GB of disk space

To download the keys and values that we already saved for Law-MT:

bash wget -P checkpoints/law/ https://retomaton.s3.us-east-2.amazonaws.com/law/dstore16_vals.npy wget -P checkpoints/law/ https://retomaton.s3.us-east-2.amazonaws.com/law/dstore16_keys.npy

To save keys and values (not needed if you already downloaded our keys and values):

for Wikitext-103:

bash python eval_lm.py data-bin/wikitext-103 \ --path checkpoints/wt103/wt103_checkpoint_best.pt \ --sample-break-mode none --max-tokens 3072 \ --softmax-batch 1024 --batch-size 2 --gen-subset train \ --context-window 1536 --tokens-per-sample 1536 \ --dstore-mmap checkpoints/wt103/dstore16 --knn-keytype 'last_ffn_input' \ --dstore-size 103225485 --model-overrides "{'knn_keytype': 'last_ffn_input'}" \ --save-knnlm-dstore --dstore-fp16

The total number of tokens in the Wikitext-103 training set is 103227021. The dstore size 103225485 is 1536 tokens less than the total due to the context-window. We want each key to be constructed using a minimum amount of prior context.

For Law-MT:

Following the instructions and using the code of https://github.com/jxhe/efficient-knnlm, we created the datastore using their code: cd ../efficient-knnlm python eval_lm.py ../retomaton/data-bin/law \ --path ../retomaton/checkpoints/law/wmt19.en/model.pt \ --sample-break-mode eos --max-tokens 2048 \ --softmax-batch 1024 --batch-size 2 --gen-subset train \ --context-window 0 --tokens-per-sample 512 \ --dstore-mmap ../retomaton/checkpoints/law/dstore16 --knn-keytype 'last_ffn_input' \ --dstore-size 19068709 \ --log-interval 100 \ --model-overrides "{'knn_keytype': 'last_ffn_input'}" \ --dstore-fp16 \ --save-knnlm-dstore

Step 5: Building the FAISS index

The FAISS index requires a training stage where it learns an index for accessing the keys quickly. Once this is completed, the keys must all be added to the index. The speed of adding keys to the index depends on the hardware, particularly the amount of RAM available.

To download our index:

Wikitext-103:

wget -P checkpoints/wt103/ https://retomaton.s3.us-east-2.amazonaws.com/wt103/knn16.index

For Law-MT:

wget -P checkpoints/law/ https://retomaton.s3.us-east-2.amazonaws.com/law/knn.19048862.index

To build the FAISS index yourself (not needed if you already downloaded our index):

For Wikitext-103: DSTORE=checkpoints/wt103/dstore16 DSTORE_SIZE=103225485 INDEX=checkpoints/wt103/knn16.index

For Law-MT: DSTORE=checkpoints/law/dstore16 DSTORE_SIZE=19068709 INDEX=checkpoints/law/knn16.index

and then for both datasets: bash python build_dstore.py \ --dstore_mmap ${DSTORE} \ --dstore_size ${DSTORE_SIZE} \ --faiss_index ${INDEX} \ --num_keys_to_add_at_a_time 500000 \ --starting_point 0

Step 6: Evaluating RetoMaton without clustering

To evaluate the model on the validation set:

Wikitext-103:

```bash DSTORE=checkpoints/wt103/dstore16 DSTORESIZE=103225485 INDEX=checkpoints/wt103/knn16.index MODEL=checkpoints/wt103/wt103checkpoint_best.pt

python evallm.py data-bin/wikitext-103 \ --path ${MODEL} \ --sample-break-mode complete --max-tokens 3072 \ --context-window 2560 --softmax-batch 1024000 --batch-size 2 \ --gen-subset valid --dstore-filename ${DSTORE} \ --indexfile ${INDEX} \ --model-overrides "{'knnkeytype': 'lastffninput'}" \ --k 1024 --lmbda 0.25 --dstore-size ${DSTORESIZE} --knn-keytype lastffninput \ --probe 32 --knnlm --dstore-fp16 \ --knn-sim-func donotrecompl2 --no-load-keys --move-dstore-to-mem \ --knnlm-gpu --min-knns 1 --max-knns 1024 ```

To encourage the model to perform a full kNN search more frequently and thus increase accuracy and reduce perplexity, use a larger value of --min-knns such as 100. Using --min-knns 9999999 makes the model perform kNN search at every step (FoSS = 0 in Figure 3 of the paper), and achieves the best results at the cost of slower speed.

To run the baseline kNN-LM, add the flag --no-pointer.

Law-MT:

```bash DSTORE=checkpoints/law/dstore16 DSTORE_SIZE=19068709 INDEX=checkpoints/law/knn16.index MODEL=checkpoints/law/wmt19.en/model.pt

python evallm.py data-bin/law \ --path ${MODEL} \ --sample-break-mode eos --max-tokens 2048 \ --context-window 0 --softmax-batch 1024000 --batch-size 2 \ --gen-subset valid --dstore-filename ${DSTORE} \ --indexfile ${INDEX} \ --model-overrides "{'knnkeytype': 'lastffninput'}" \ --k 1024 --lmbda 0.9 --dstore-size ${DSTORESIZE} --knn-keytype lastffninput \ --probe 32 --knnlm --dstore-fp16 \ --knn-sim-func donotrecompl2 --no-load-keys --move-dstore-to-mem \ --remove-bpe \ --knnlm-gpu --min-knns 1 --max-knns 1024 ```

Notice that the difference betweens between the datasets is that in Law-MT we use the flags --remove-bpe and --sample-break-mode eos, and also the --max-tokens and --context-window values are different. Further, as found by He et al., 2021, the interpolation coefficient should be set to --lmbda 0.9, to give more weight to the datastore than the base LM.

Step 7: Adding clustering

For the Greedy Merge clustering algorithm. See the code of He et al. (2021). Greedy Merge is much faster and requires much fewer memory than k-means, but results in slightly higher perplexity:

See also Figures 8 and 9 in Appendix D in the paper.

To download our clusters for Wikitext-103:

Note that only one of the following files is needed. For the main experiments in the paper, we used: bash wget -P checkpoints/wt103/ https://retomaton.s3.us-east-2.amazonaws.com/wt103/clusters_s40000000_k1000000_members.pkl

but additional clusterings are available as well: bash wget -P checkpoints/wt103/ https://retomaton.s3.us-east-2.amazonaws.com/wt103/clusters_s20000000_k500000_members.pkl wget -P checkpoints/wt103/ https://retomaton.s3.us-east-2.amazonaws.com/wt103/dstore_merge15_members_sp.pkl wget -P checkpoints/wt103/ https://retomaton.s3.us-east-2.amazonaws.com/wt103/dstore_merge29_members.pkl

To download our clusters for Law-MT:

Note that only one of the following files is needed. For the main experiments in the paper, we used: bash wget -P checkpoints/law/ https://retomaton.s3.us-east-2.amazonaws.com/law/law_clusters_s40000000_k200000_members.pkl

but additional clustering is available as well: bash wget -P checkpoints/law/ https://retomaton.s3.us-east-2.amazonaws.com/law/law_clusters_s40000000_k400000_members.pkl

Evaluating RetoMaton with clustering:

Basically identical to Step 6: Evaluating RetoMaton without clustering, except that we add the flag --members <filename>_members.pkl,

Wikitext-103:

```bash DSTORE=checkpoints/wt103/dstore16 DSTORESIZE=103225485 INDEX=checkpoints/wt103/knn16.index MODEL=checkpoints/wt103/wt103checkpointbest.pt MEMBERS=checkpoints/wt103/clusterss40000000k1000000members.pkl

python evallm.py data-bin/wikitext-103 \ --path ${MODEL} \ --sample-break-mode complete --max-tokens 3072 \ --context-window 2560 --softmax-batch 1024000 --batch-size 2 \ --gen-subset valid --dstore-filename ${DSTORE} \ --indexfile ${INDEX} \ --model-overrides "{'knnkeytype': 'lastffninput'}" \ --k 1024 --lmbda 0.25 --dstore-size ${DSTORESIZE} --knn-keytype lastffninput \ --probe 32 --knnlm --dstore-fp16 \ --knn-sim-func donotrecompl2 --no-load-keys --move-dstore-to-mem \ --knnlm-gpu --min-knns 1 --max-knns 1024 \ --members ${MEMBERS} ```

Law-MT:

```bash DSTORE=checkpoints/law/dstore16 DSTORESIZE=19068709 INDEX=checkpoints/law/knn16.index MODEL=checkpoints/law/wmt19.en/model.pt MEMBERS=checkpoints/law/lawclusterss40000000k200000_members.pkl

python evallm.py data-bin/law \ --path ${MODEL} \ --sample-break-mode eos --max-tokens 2048 \ --context-window 0 --softmax-batch 1024000 --batch-size 2 \ --gen-subset valid --dstore-filename ${DSTORE} \ --indexfile ${INDEX} \ --model-overrides "{'knnkeytype': 'lastffninput'}" \ --k 1024 --lmbda 0.9 --dstore-size ${DSTORESIZE} --knn-keytype lastffninput \ --probe 32 --knnlm --dstore-fp16 \ --knn-sim-func donotrecompl2 --no-load-keys --move-dstore-to-mem \ --remove-bpe \ --knnlm-gpu --min-knns 1 --max-knns 1024\ --members ${MEMBERS} ```

Cluster the keys yourself (not needed if you downloaded our clusters):

for Wikitext-103: DSTORE=checkpoints/wt103/dstore16 DSTORE_SIZE=103225485 NUM_CLUSTERS=1000000 SAMPLE=40000000 DIM=1024 SAVE=kmeans_wt103

For Law-MT: bash DSTORE=checkpoints/law/dstore16 DSTORE_SIZE=19068709 NUM_CLUSTERS=200000 SAMPLE=40000000 DIM=1536 SAVE=kmeans_law

And then for both datasets: bash python kmeans.py --dstore ${DSTORE} --dstore-size ${DSTORE_SIZE} --num-clusters ${NUM_CLUSTERS} --sample ${SAMPLE} --dim ${DIM} --save ${}

Step 8: Evaluating the Fine-tuned Model

The model that was fine-tuned on Law-MT, along with its corresponding datastore, FAISS index and clustering can be downloaded from:

bash mkdir checkpoints/law-finetuned/ wget -P checkpoints/law-finetuned/ https://retomaton.s3.us-east-2.amazonaws.com/law/finetuned.pt wget -P checkpoints/law-finetuned/ https://retomaton.s3.us-east-2.amazonaws.com/law/dstore16_finetuned_size19068709_embed1536_fp16_vals.npy wget -P checkpoints/law-finetuned/ https://retomaton.s3.us-east-2.amazonaws.com/law/dstore16_finetuned_size19068709_embed1536_fp16_keys.npy wget -P checkpoints/law-finetuned/ https://retomaton.s3.us-east-2.amazonaws.com/law/knn_finetuned.index wget -P checkpoints/law-finetuned/ https://retomaton.s3.us-east-2.amazonaws.com/law/law_finetuned_clusters_s20000000_k200000_members.pkl

Finally, evaluate using the fine-tuned checkpoint, datastore, and index.

It is important to also set --lmbda 0.25 when using the fine-tuned model: since the model is fine-tuned, we can rely on it more than before. See a clarification at #lambda-values

Best results with the fine-tuned model are achieved without clustering (that is, every datastore entry is a singleton cluster).

Then, the same steps as before should be run on the Law-MT datasets, except that: * finetuned.pt should be used as the ${MODEL} * dstore16_finetuned_size19068709_embed1536_fp16 should be used as the ${DSTORE} * knn_finetuned.index should be used as the ${INDEX} * law_finetuned_clusters_s20000000_k200000_members.pkl shoould be used as${MEMBERS}

That is:

```bash DSTORE=checkpoints/law-finetuned/dstore16finetunedsize19068709embed1536fp16 DSTORESIZE=19068709 INDEX=checkpoints/law-finetuned/knnfinetuned.index MODEL=checkpoints/law-finetuned/finetuned.pt MEMBERS=checkpoints/law-finetuned/lawfinetunedclusterss20000000k200000_members.pkl

python evallm.py data-bin/law \ --path ${MODEL} \ --sample-break-mode eos --max-tokens 2048 \ --context-window 0 --softmax-batch 1024000 --batch-size 2 \ --gen-subset valid --dstore-filename ${DSTORE} \ --indexfile ${INDEX} \ --model-overrides "{'knnkeytype': 'lastffninput'}" \ --k 1024 --lmbda 0.25 --dstore-size ${DSTORESIZE} --knn-keytype lastffninput \ --probe 32 --knnlm --dstore-fp16 \ --knn-sim-func donotrecompl2 --no-load-keys --move-dstore-to-mem \ --remove-bpe \ --knnlm-gpu --min-knns 1 --max-knns 1024 ```

Lambda values

In all configurations, the interpolation factor lmbda is set to 0.25, except when the base LM is checkpoints/law/wmt19.en/model.pt and the model is evaluated on Law-MT, since this scenario tests domain adaptation, and thus lmbda should be set to 0.9:

| | wt103_checkpoint_best.pt | wmt19.en/model.pt | finetuned.pt | | :--- | ----: | ---: | ---: | | Wikitext-103| 0.25 | - | - | | Law-MT | - | 0.9 | 0.25 |

All files:

Checkpoints and datasets can be downloaded from here: https://zenodo.org/record/6525426

And also from the AWS S3 bucket

Differences from the kNN-LM implementation

Implementation Pointers

Here we point to the code that differs our work from kNN-LM. * The main changes are in this commit. The pointers for the next timestep are initially the current k-nearest neighbors + 1. Then we extend each pointer to consider all entries in its cluster. This is the function that maps each pointer to its cluster, removes duplicate clusters, and then finds the members of each cluster. We find the log probabilities as suggested by the new pointers, and finally take to the next timestep - only the pointers that are consistent with the token that the model eventually predicted. * In this commit we utilize the given pointers, or perform kNN search and combine the results with the existing pointers. * When using the --knnlm-gpu flag, we use a GPU index to search for nearest neighbors, and its copy CPU index to reconstruct vectors given their ID. Unfortunately, currently reconstructing vectors in faiss is not implemented for GPU indexes (see also this issue). * Reconstructing a batch of vectors from the index is unfortunately not implemented in faiss (see this issue), and thus the fastest way that we found to do that is using np.vectorize, and reconstructing many single vectors in parallel: fairseq/knnlm.py#L92-L94. * Performing k-means clustering on millions of vectors can be performed in many ways, but specifically we utilize the faiss library to do it using the script kmeans.py.

Additional minor differences:

The original kNN-LM repository uses faiss CPU to perform retrieval. However, we added the flag --knnlm-gpu that allows performing retrieval much faster on the GPU.
After each retrieval, the original kNN-LM repository loads the found keys and re-computes the distance from the query to each nearest neighbor. This is much more time consuming, unless loading all the keys (200GB) into memory. We thus use the flags --knn-sim-func do_not_recomp_l2 --no-load-keys --move-dstore-to-mem.
When using faiss-gpu, it is useful to import faiss.contrib.torch_utils. This allows performing the kNN search using torch tensors (rather than only numpy arrays). Additionally, sometimes this import statement prevents searching bugs in faiss (see this issue).

Citation

Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval

@inproceedings{alon2022neuro, title={Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval}, author={Alon, Uri and Xu, Frank and He, Junxian and Sengupta, Sudipta and Roth, Dan and Neubig, Graham}, booktitle={International Conference on Machine Learning}, pages={468--485}, year={2022}, organization={PMLR} }

Owner

Name: NeuLab
Login: neulab
Kind: organization
Location: Pittsburgh, PA

Website: http://www.cs.cmu.edu/~neulab/
Repositories: 58
Profile: https://github.com/neulab

Graham Neubig's Lab at LTI/CMU

Citation (CITATION.cff)

@inproceedings{alon2022neuro,
  title={Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval},
  author={Alon, Uri and Xu, Frank and He, Junxian and Sengupta, Sudipta and Roth, Dan and Neubig, Graham},
  booktitle={International Conference on Machine Learning},
  pages={468--485},
  year={2022},
  organization={PMLR}
}

GitHub Events

Total

Last Year

Committers

Last synced: 8 months ago

All Time

Total Commits: 1,174
Total Committers: 193
Avg Commits per committer: 6.083
Development Distribution Score (DDS): 0.549

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Myle Ott	m**t@f**m	530
alexeib	a**b@g**m	77
urialon	u**1@g**m	43
Sergey Edunov	e**v@f**m	26
Naman Goyal	n**l@d**r	22
Liezl Puzon	l**e@f**m	21
Louis MARTIN	l**m@g**m	19
Ning Dong	d**n@f**m	18
Angela Fan	a**n@g**m	16
Sergey Edunov	e**v@a**g	11
Kartikay Khandelwal	k**k@f**m	10
freewym	f**m@g**m	10
Jiatao Gu	j**u@f**m	10
Peng-Jen Chen	p**c@f**m	9
Wei Ho	w**o@f**m	9
Nayan Singhal	n**g@f**m	9
Haoran Li	a**i@f**m	9
Matt Le	m**e@f**m	8
Jerry Ma	n**e@g**m	8
Changhan Wang	c**n@f**m	8
urvashik	u****k	7
Stephen Roller	r**r@f**m	7
Nathan Ng	n**5@g**m	7
James Cross	j**s@f**m	7
Dmytro Okhonko	o**o@f**m	7
Jingfei Du	j**u@f**m	6
Halil Akin	h**n@f**m	6
Spencer Poff	s**f@f**m	6
ngimel	n**n@n**m	6
Xian Li	x**l@f**m	6
and 163 more...

Committer Domains (Top 20 + Academic)

fb.com: 66 qq.com: 4 apple.com: 2 learnfair1944.h2.fair: 1 devfair0226.h2.fair: 1 cs.cmu.edu: 1 jgu-fedora-mj0a98z5.dhcp.thefacebook.com: 1 ntu.edu.tw: 1 devfair0160.h2.fair: 1 fenbi.com: 1 devfair0162.h2.fair: 1 devfair0435.h2.fair: 1 learnfair0685.h2.fair: 1 learnfair0356.h2.fair: 1 learnfair0339.h2.fair: 1 learnfair0329.h2.fair: 1 hmc.edu: 1 devfair0110.h2.fair: 1 devfair0209.h2.fair: 1 learnfair0757.h2.fair: 1 smail.nju.edu.cn: 1 eng.ucsd.edu: 1 alumni.stanford.edu: 1 u.northwestern.edu: 1 vt.edu: 1 student.uva.nl: 1

Issues and Pull Requests

Last synced: 8 months ago

All Time

Total issues: 1
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 0
Average comments per issue: 13.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

retomaton

Science Score: 72.0%

Keywords

Keywords from Contributors

Repository

Basic Info

Statistics

Topics

Metadata Files

README.md

RetoMaton: Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval

Overview

Results

Table of Contents

Requirements

Dependencies

Hardware

Quickstart

Step 0: Clone this repository:

Step 1: Preparing the data

Download the preprocessed Wikitext-103 dataset:

Download the preprocessed Law-MT dataset:

Preprocessing the dataset (not needed if you already downloaded our preprocessed dataset):

Preprocessing Wikitext-103:

Preprocessing Law-MT:

Step 2: Downloading the Base Language Model

Step 3: Evaluating the base Language Model

Step 4: Saving the keys and values for the datastore

To download the keys and values that we already saved for Wikitext-103:

To download the keys and values that we already saved for Law-MT:

To save keys and values (not needed if you already downloaded our keys and values):

for Wikitext-103:

For Law-MT:

Step 5: Building the FAISS index

Wikitext-103:

For Law-MT:

To build the FAISS index yourself (not needed if you already downloaded our index):

Step 6: Evaluating RetoMaton without clustering

Wikitext-103:

Law-MT:

Step 7: Adding clustering

To download our clusters for Wikitext-103:

To download our clusters for Law-MT:

Evaluating RetoMaton with clustering:

Wikitext-103:

Law-MT:

Cluster the keys yourself (not needed if you downloaded our clusters):

Step 8: Evaluating the Fine-tuned Model

Lambda values

All files:

Differences from the kNN-LM implementation

Implementation Pointers

Additional minor differences:

Citation

Owner

Citation (CITATION.cff)

GitHub Events

Total

Last Year

Committers

All Time

Past Year

Top Committers

Committer Domains (Top 20 + Academic)

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Dependencies