retrieval-table-augmentation
This is the code for reproducing the TABBIE baseline in our paper: "Retrieval-Based Transformer for Table Augmentation"
Repository
Basic Info
- Host: GitHub
- Owner: IBM
- License: mit
- Language: Python
- Default Branch: main
- Size: 7.78 MB
Statistics
- Stars: 12
- Watchers: 2
- Forks: 1
- Open Issues: 2
- Releases: 0
Metadata Files
README.md
This is the code for reproducing our paper: "Retrieval-Based Transformer for Table Augmentation"
Retrieval
1) First, create passages from your table corpus. This may involve splitting tables into multiple passages.
For example:
```bash
python ${PYTHONPATH}/table_augmentation/entitables/convert_entitables.py \
  --table_dir ${TBL_DIR}/tables_redi2_1 \
  --split_definitions ${TBL_DIR}/sigir2017-table/Data \
  --passage_dir ${TBL_DIR}/passages \
  --query_dir ${TBL_DIR}/queries
```
These passages should be in jsonl files with 'pid', 'title' and 'text' fields. The table_id that the passage comes from should be a prefix of the pid. This will allow excluding by pid prefix during training.
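As a rough illustration of the format described above, the sketch below writes passages as jsonl and uses the table id as a prefix of each `pid`. The helper names, the `::` separator and the sample tables are assumptions made for illustration; the repository's converter may build ids differently.

```python
import io
import json


def make_passages(table_id, title, chunks):
    """Build passage dicts whose pid starts with the source table_id."""
    # The "::" separator is a hypothetical choice; only the prefix property matters.
    return [
        {"pid": f"{table_id}::{i}", "title": title, "text": text}
        for i, text in enumerate(chunks)
    ]


def write_jsonl(passages, out):
    """Write one JSON object per line to a text file-like object."""
    for passage in passages:
        out.write(json.dumps(passage) + "\n")


def exclude_table(passages, table_id):
    """Drop passages that come from the given table, matched by pid prefix."""
    return [p for p in passages if not p["pid"].startswith(table_id)]


passages = make_passages("table-0001", "Example table", ["row 1 | a | b", "row 2 | c | d"])
passages += make_passages("table-0002", "Another table", ["row 1 | x | y"])

buffer = io.StringIO()
write_jsonl(passages, buffer)

# Passages from table-0001 are filtered out by prefix.
remaining = exclude_table(passages, "table-0001")
```

In practice the output would go to a (possibly gzipped) file rather than an in-memory buffer.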
2) Index your table passages with Anserini
Download and build Anserini. You will need Maven and a Java JDK.
```bash
git clone https://github.com/castorini/anserini.git
cd anserini
# to use the 0.4.1 version dprBM25.jar is built for
git checkout 3a60106fdc83473d147218d78ae7dca7c3b6d47c
export JAVA_HOME=<your JDK directory>
mvn clean package appassembler:assemble
```
Run formatting and indexing. For example:
```bash
python ${PYTHONPATH}/dpr/anserini_index.py \
  --jar ${YOUR_ANSERINI_DIR}/Anserini/target/anserini-0.4.1-SNAPSHOT-fatjar.jar \
  --input ${TBL_DIR}/passages/row/a.jsonl.gz \
  --output_dir ${TBL_DIR}/passages/row
```
3) Create DPR training data
For example:
```bash
python ${PYTHONPATH}/table_augmentation/table_dpr_bm25_answer_bearing.py \
  --train_file ${TBL_DIR}/queries/row_train.jsonl.gz --task.task row --task.answer_normalization identity \
  --jar ${YOUR_ANSERINI_DIR}/Anserini/target/anserini-0.4.1-SNAPSHOT-fatjar.jar \
  --anserini_index ${TBL_DIR}/passages/row/index \
  --output_dir ${TBL_DIR}/dpr_train/row
```
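The script name suggests that BM25 results are split according to whether they contain an answer. The sketch below shows one plausible version of that idea and is not the repository's implementation: the function name, the `pid`/`text` hit layout and the plain substring test are all assumptions.

```python
def split_bm25_hits(hits, answers, query_table_id):
    """Split BM25 hits into answer-bearing positives and hard negatives.

    hits: list of dicts with 'pid' and 'text' keys (hypothetical layout).
    answers: strings that count as correct for this query.
    query_table_id: hits whose pid starts with this prefix are skipped.
    """
    positives, negatives = [], []
    for hit in hits:
        # Skip passages that come from the table the query was built from.
        if hit["pid"].startswith(query_table_id):
            continue
        # Assumption: no normalization, answers are matched as raw substrings.
        if any(answer in hit["text"] for answer in answers):
            positives.append(hit)
        else:
            negatives.append(hit)
    return positives, negatives


hits = [
    {"pid": "table-7::0", "text": "Paris | France"},
    {"pid": "table-8::1", "text": "Paris | Texas"},
    {"pid": "table-9::0", "text": "Berlin | Germany"},
]
positives, negatives = split_bm25_hits(hits, ["France", "Texas"], "table-7")
```

Here the first hit is dropped because it shares the query table's pid prefix, which is the exclusion the passage format above is designed to allow.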
4) Train DPR
For example:
```bash
python ${PYTHONPATH}/midpr/biencoder_trainer.py \
  --train_dir ${TBL_DIR}/dpr_train/row \
  --output_dir ${TBL_DIR}/models/dpr_e3_row \
  --seq_len_q 64 --seq_len_c 128 \
  --num_train_epochs 3 \
  --encoder_gpu_train_limit 16 \
  --max_grad_norm 1.0 --learning_rate 5e-5 \
  --full_train_batch_size 128
```
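DPR-style bi-encoders are commonly trained with in-batch negatives: each query is scored against every passage in the batch, and the passage paired with it is treated as the correct class. The following dependency-free sketch computes that loss for small vectors. Whether `biencoder_trainer.py` uses exactly this objective is an assumption, and the function name is illustrative.

```python
import math


def in_batch_negative_loss(query_vecs, passage_vecs):
    """Mean negative log-likelihood where passage i is the positive for query i."""
    losses = []
    for i, query in enumerate(query_vecs):
        # Inner-product score of this query against every passage in the batch.
        scores = [sum(a * b for a, b in zip(query, passage)) for passage in passage_vecs]
        # Numerically stable log-sum-exp over the batch.
        max_score = max(scores)
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        losses.append(log_norm - scores[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
passages = [[5.0, 0.0], [0.0, 5.0]]

aligned = in_batch_negative_loss(queries, passages)        # positives on the diagonal
swapped = in_batch_negative_loss(queries, passages[::-1])  # positives mismatched
```

Under this objective, a larger batch gives each query more negatives, which is one reason a full batch size such as 128 can matter.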
5) Build DPR index
This example shows building the index in two parts (if you want to use 2 GPUs in parallel)
```bash
python ${PYTHONPATH}/dpr/index_simple_corpus.py \
  --embed 1of2 --sharded_index \
  --dpr_ctx_encoder_path ${TBL_DIR}/models/dpr_e3_row/ctx_encoder \
  --corpus ${TBL_DIR}/passages/row/a.jsonl.gz \
  --output_dir ${TBL_DIR}/passages/row/dpr_index

python ${PYTHONPATH}/dpr/index_simple_corpus.py \
  --embed 2of2 --sharded_index \
  --dpr_ctx_encoder_path ${TBL_DIR}/models/dpr_e3_row/ctx_encoder \
  --corpus ${TBL_DIR}/passages/row/a.jsonl.gz \
  --output_dir ${TBL_DIR}/passages/row/dpr_index
```
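To make the `1of2` / `2of2` idea concrete, here is a small sketch of how a corpus could be divided so that each process embeds a disjoint part. The contiguous slicing and the helper names are assumptions; the indexing script may partition the corpus differently.

```python
def parse_part(spec):
    """Parse a 'KofN' string such as '1of2' into (k, n)."""
    k_text, n_text = spec.split("of")
    k, n = int(k_text), int(n_text)
    if not 1 <= k <= n:
        raise ValueError(f"invalid part spec: {spec}")
    return k, n


def select_part(items, spec):
    """Return the k-th of n contiguous slices of items (k is 1-based)."""
    k, n = parse_part(spec)
    start = (k - 1) * len(items) // n
    end = k * len(items) // n
    return items[start:end]


corpus = ["p0", "p1", "p2", "p3", "p4"]
first_half = select_part(corpus, "1of2")
second_half = select_part(corpus, "2of2")
```

The two parts are disjoint and together cover the whole corpus, so the processes can run independently on separate GPUs.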
6) Apply DPR
For example:
```bash
python ${PYTHONPATH}/dpr/table_aug_dpr_apply.py \
  --tables ${TBL_DIR}/queries/row_id_validation.jsonl --task.task row --task.answer_normalization identity \
  --qry_encoder_path ${TBL_DIR}/models/dpr_e3_row/qry_encoder \
  --corpus_endpoint ${TBL_DIR}/passages/row/dpr_index \
  --output ${TBL_DIR}/apply/dpr_row.jsonl
```
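Conceptually, applying DPR means encoding the query and returning the passages whose vectors have the highest inner product with it. The sketch below does this by brute force over a tiny in-memory dictionary. The function name and sample vectors are illustrative, and a real index (for example one built with FAISS) would replace the exhaustive scan.

```python
import heapq


def top_k_passages(query_vec, passage_vecs, k=5):
    """Return (pid, score) pairs for the k highest inner-product scores."""
    scored = (
        (pid, sum(q * p for q, p in zip(query_vec, vec)))
        for pid, vec in passage_vecs.items()
    )
    # nlargest returns results sorted from highest to lowest score.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


passage_vecs = {
    "t1::0": [1.0, 0.0],
    "t2::0": [0.5, 0.5],
    "t3::0": [0.0, 1.0],
}
results = top_k_passages([1.0, 0.0], passage_vecs, k=2)
```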
Note that when a directory (rather than an http://IP:port) is provided as the corpus_endpoint, a DPR index service will be started.
An abnormal exit from either training or apply that uses such a service can leave the service still running.
Check that no stray DPR services are left running with:
```bash
ps -ef | grep python
```
and look for any leftover processes.
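The note above distinguishes a directory from an `http://IP:port` value for the corpus endpoint. A minimal sketch of how such a value could be classified is shown below; the function name is hypothetical and the repository's own check may differ.

```python
from urllib.parse import urlparse


def is_remote_endpoint(corpus_endpoint):
    """Return True for http(s)://host[:port] values, False for local paths."""
    parsed = urlparse(corpus_endpoint)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


remote = is_remote_endpoint("http://127.0.0.1:5001")
local = is_remote_endpoint("/data/passages/row/dpr_index")
```

Only the second case would lead to a service being started locally, which is the situation where stray processes can be left behind.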
Reader
1) Train the row or column population model
```bash
export CORPUS=${TBL_DIR}/passages/row/dpr_index
```
Optionally start the corpus service
```bash
python ${PYTHONPATH}/corpus/corpus_server_direct.py \
  --port 5001 --corpus_dir ${CORPUS}

export CORPUS_ENDPOINT=http://127.0.0.1:5001
```
OR let the train / apply scripts start it for you
```bash
export CORPUS_ENDPOINT=${CORPUS}
```
```bash
python ${PYTHONPATH}/extractive/raex_train.py \
  --task.task row --task.answer_normalization identity \
  --train_data ${TBL_DIR}/queries/row_train.jsonl.gz \
  --model_name_or_path bert-large-cased \
  --dpr.qry_encoder_path ${TBL_DIR}/models/dpr_e3_row/qry_encoder \
  --dpr.corpus_endpoint ${CORPUS_ENDPOINT} --dpr.n_docs 5 \
  --num_train_epochs 1 --warmup_fraction 0.1 --full_train_batch_size 32 \
  --output_dir ${TBL_DIR}/models/raex_row
```
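The `--warmup_fraction 0.1` flag presumably means that the learning rate ramps up over the first 10% of training steps. The sketch below shows one common schedule, linear warmup followed by linear decay. Whether `raex_train.py` uses this exact shape is an assumption, and the function name and step counts are illustrative only.

```python
def lr_at_step(step, total_steps, base_lr, warmup_fraction=0.1):
    """Learning rate under linear warmup followed by linear decay to zero."""
    warmup_steps = int(total_steps * warmup_fraction)
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(total_steps - warmup_steps, 1)
    return base_lr * max(total_steps - step, 0) / remaining


# base_lr=1.0 so the values read as a fraction of the peak learning rate.
schedule = [lr_at_step(step, 100, 1.0) for step in (0, 5, 10, 55, 100)]
```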
2) Apply the model
```bash
python ${PYTHONPATH}/extractive/raex_apply.py \
  --tables ${TBL_DIR}/queries/row_id_validation.jsonl \
  --task.task row --task.answer_normalization identity \
  --model_name_or_path bert-large-cased --resume_from ${TBL_DIR}/models/raex_row \
  --dpr.corpus_endpoint ${CORPUS_ENDPOINT} --dpr.n_docs 5 \
  --output_dir ${TBL_DIR}/apply/raex_row
```
NOTE: cell population
Cell filling is documented under table_augmentation/cell_filling
Citation
@inproceedings{glass2023retrieval,
title={Retrieval-Based Transformer for Table Augmentation},
author={Glass, Michael and Wu, Xuecheng and Naik, Ankita and Rossiello, Gaetano and Gliozzo, Alfio},
booktitle={Annual Meeting of the Association for Computational Linguistics},
year={2023}
}
Owner
- Name: International Business Machines
- Login: IBM
- Kind: organization
- Email: awesome@ibm.com
- Location: United States of America
- Website: https://www.ibm.com/opensource/
- Twitter: ibmdeveloper
- Repositories: 3,152
- Profile: https://github.com/IBM
Citation (CITATION.cff)
cff-version: 1.2.0
title: Retrieval-Based Transformer for Table Augmentation
message: 'If you use this software, please cite it as below.'
type: software
authors:
  - family-names: Glass
    given-names: Michael
    orcid: 'https://orcid.org/0009-0000-1505-4667'
    affiliation: IBM Research AI
  - family-names: Wu
    given-names: Xuecheng
    affiliation: IBM Research AI
  - family-names: Naik
    given-names: Ankita
    affiliation: IBM Research AI
  - family-names: Rossiello
    given-names: Gaetano
    orcid: 'https://orcid.org/0000-0003-1042-4782'
    affiliation: IBM Research AI
  - family-names: Gliozzo
    given-names: Alfio
    orcid: 'https://orcid.org/0000-0002-8044-2911'
    affiliation: IBM Research AI
url: 'https://github.com/IBM/retrieval-table-augmentation'
abstract: >-
  Data preparation, also called data wrangling, is
  considered one of the most expensive and time-consuming
  steps when performing analytics or building machine
  learning models. Preparing data typically involves
  collecting and merging data from complex heterogeneous,
  and often large-scale data sources, such as data lakes.
  In this paper, we introduce a novel approach toward
  automatic data wrangling in an attempt to alleviate the
  effort of end-users, e.g. data analysts, in structuring
  dynamic views from data lakes in the form of tabular
  data. Given a corpus of tables, we propose a retrieval
  augmented transformer model that is self-trained for the
  table augmentation tasks of row/column population and data
  imputation. Our self-learning strategy consists in
  randomly ablating tables from the corpus and training the
  retrieval-based model with the objective of reconstructing
  the partial tables given as input with the original values
  or headers. We adopt this strategy to first train the
  dense neural retrieval model encoding portions of tables
  to vectors, and then the end-to-end model trained to
  perform table augmentation tasks. We test on EntiTables,
  the standard benchmark for table augmentation, as well as
  introduce a new benchmark to advance further research:
  WebTables. Our model consistently and substantially
  outperforms both supervised statistical methods and the
  current state-of-the-art transformer-based models.
license: MIT
version: 1
date-released: '2023-06-16'
preferred-citation:
  type: article
  authors:
    - family-names: Glass
      given-names: Michael
      orcid: 'https://orcid.org/0009-0000-1505-4667'
      affiliation: IBM Research AI
    - family-names: Wu
      given-names: Xuecheng
      affiliation: IBM Research AI
    - family-names: Naik
      given-names: Ankita
      affiliation: IBM Research AI
    - family-names: Rossiello
      given-names: Gaetano
      orcid: 'https://orcid.org/0000-0003-1042-4782'
      affiliation: IBM Research AI
    - family-names: Gliozzo
      given-names: Alfio
      orcid: 'https://orcid.org/0000-0002-8044-2911'
      affiliation: IBM Research AI
  journal: Annual Meeting of the Association for Computational Linguistics
  title: Retrieval-Based Transformer for Table Augmentation
  year: 2023
GitHub Events
Total
- Watch event: 2
- Pull request event: 1
- Create event: 1
Last Year
- Watch event: 2
- Pull request event: 1
- Create event: 1
Dependencies
- faiss-cpu ==1.7.2
- flask *
- pyjnius *
- pyserini ==0.18.0
- requests *
- torch >=1.13.1
- transformers ==4.21.2
- ujson *