retrieval-table-augmentation

This is the code for reproducing the TABBIE baseline in our paper: "Retrieval-Based Transformer for Table Augmentation"

https://github.com/ibm/retrieval-table-augmentation

Repository

Basic Info
  • Host: GitHub
  • Owner: IBM
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 7.78 MB
Statistics
  • Stars: 12
  • Watchers: 2
  • Forks: 1
  • Open Issues: 2
  • Releases: 0
Created over 3 years ago · Last pushed about 1 year ago

README.md

This is the code for reproducing our paper: "Retrieval-Based Transformer for Table Augmentation"

Retrieval

1) First create passages from your table corpus. This may involve splitting tables into multiple passages.

For example:
```bash
python ${PYTHONPATH}/table_augmentation/entitables/convert_entitables.py \
  --table_dir ${TBL_DIR}/tables_redi2_1 \
  --split_definitions ${TBL_DIR}/sigir2017-table/Data \
  --passage_dir ${TBL_DIR}/passages \
  --query_dir ${TBL_DIR}/queries
```

These passages should be in jsonl files with 'pid', 'title' and 'text' fields. The table_id that the passage comes from should be a prefix of the pid. This will allow excluding by pid prefix during training.
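
As a rough illustration of that layout, the sketch below splits a table into row passages and serializes them as JSON lines. Only the 'pid', 'title' and 'text' field names and the table-id-as-prefix convention come from the description above. The `::` separator, the way rows are flattened into text, and the helper names are assumptions made for illustration and are not taken from the conversion script.

```python
import json


def table_to_passages(table_id, title, header, rows, rows_per_passage=2):
    """Split one table into passages whose pid starts with the table id.

    The '::' separator and the 'header: value' text layout are illustrative
    assumptions, not the format produced by the repository's converter.
    """
    passages = []
    for start in range(0, len(rows), rows_per_passage):
        chunk = rows[start:start + rows_per_passage]
        lines = [
            " | ".join(f"{h}: {v}" for h, v in zip(header, row))
            for row in chunk
        ]
        passages.append({
            "pid": f"{table_id}::{start // rows_per_passage}",
            "title": title,
            "text": " * ".join(lines),
        })
    return passages


def exclude_by_pid_prefix(passages, table_id):
    """Drop every passage that originates from the given table."""
    prefix = f"{table_id}::"
    return [p for p in passages if not p["pid"].startswith(prefix)]


# Hypothetical example tables.
passages = table_to_passages(
    "table-0001-1",
    "List of rivers",
    ["River", "Length (km)"],
    [["Nile", "6650"], ["Amazon", "6400"], ["Yangtze", "6300"]],
)
passages += table_to_passages("table-0002-7", "Lakes", ["Lake"], [["Baikal"]])

# One JSON object per line, as expected for a .jsonl file.
jsonl = "\n".join(json.dumps(p) for p in passages)

# Passages from the first table are filtered out by their pid prefix.
kept = exclude_by_pid_prefix(passages, "table-0001-1")
```

Including a separator in the prefix keeps an id such as `table-1` from also matching `table-10`; whichever separator is used, it needs to be applied consistently when the pids are created.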

2) Index your table passages with Anserini

Download and build Anserini. You will need to have Maven and a Java JDK.
```bash
git clone https://github.com/castorini/anserini.git
cd anserini
# to use the 0.4.1 version dprBM25.jar is built for
git checkout 3a60106fdc83473d147218d78ae7dca7c3b6d47c
export JAVA_HOME=<your JDK directory>
mvn clean package appassembler:assemble
```

Run formatting and indexing. For example:
```bash
python ${PYTHONPATH}/dpr/anserini_index.py \
  --jar ${YOUR_ANSERINI_DIR}/Anserini/target/anserini-0.4.1-SNAPSHOT-fatjar.jar \
  --input ${TBL_DIR}/passages/row/a.jsonl.gz \
  --output_dir ${TBL_DIR}/passages/row
```

3) Create DPR training data

For example:
```bash
python ${PYTHONPATH}/table_augmentation/table_dpr_bm25_answer_bearing.py \
  --train_file ${TBL_DIR}/queries/row_train.jsonl.gz \
  --task.task row --task.answer_normalization identity \
  --jar ${YOUR_ANSERINI_DIR}/Anserini/target/anserini-0.4.1-SNAPSHOT-fatjar.jar \
  --anserini_index ${TBL_DIR}/passages/row/index \
  --output_dir ${TBL_DIR}/dpr_train/row
```

4) Train DPR

For example:
```bash
python ${PYTHONPATH}/midpr/biencoder_trainer.py \
  --train_dir ${TBL_DIR}/dpr_train/row \
  --output_dir ${TBL_DIR}/models/dpr_e3_row \
  --seq_len_q 64 --seq_len_c 128 \
  --num_train_epochs 3 \
  --encoder_gpu_train_limit 16 \
  --max_grad_norm 1.0 --learning_rate 5e-5 \
  --full_train_batch_size 128
```

5) Build DPR index

This example shows building the index in two parts (if you want to use 2 GPUs in parallel):
```bash
python ${PYTHONPATH}/dpr/index_simple_corpus.py \
  --embed 1of2 --sharded_index \
  --dpr_ctx_encoder_path ${TBL_DIR}/models/dpr_e3_row/ctx_encoder \
  --corpus ${TBL_DIR}/passages/row/a.jsonl.gz \
  --output_dir ${TBL_DIR}/passages/row/dpr_index

python ${PYTHONPATH}/dpr/index_simple_corpus.py \
  --embed 2of2 --sharded_index \
  --dpr_ctx_encoder_path ${TBL_DIR}/models/dpr_e3_row/ctx_encoder \
  --corpus ${TBL_DIR}/passages/row/a.jsonl.gz \
  --output_dir ${TBL_DIR}/passages/row/dpr_index
```
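
The `--embed 1of2` and `--embed 2of2` arguments give each process one part of the corpus. A minimal sketch of how such an "XofY" spec could map to a slice of passages is shown below. The contiguous, near-equal split and the function name are assumptions for illustration; the indexing script may partition the corpus differently.

```python
import re


def shard_bounds(spec, num_passages):
    """Return the [start, end) passage range for a spec such as '1of2'.

    Assumes contiguous, near-equal blocks; this is an illustrative choice,
    not necessarily how the repository's indexer splits the corpus.
    """
    match = re.fullmatch(r"(\d+)of(\d+)", spec)
    if match is None:
        raise ValueError(f"expected a spec like '1of2', got {spec!r}")
    part, total = int(match.group(1)), int(match.group(2))
    if not 1 <= part <= total:
        raise ValueError(f"part {part} is outside 1..{total}")
    start = (part - 1) * num_passages // total
    end = part * num_passages // total
    return start, end


# With 5 passages, the two parts cover the corpus without overlap.
first = shard_bounds("1of2", 5)
second = shard_bounds("2of2", 5)
```

Computing both bounds with the same integer formula means adjacent parts always meet exactly, so no passage is embedded twice or skipped.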

6) Apply DPR

For example:
```bash
python ${PYTHONPATH}/dpr/table_aug_dpr_apply.py \
  --tables ${TBL_DIR}/queries/row_id_validation.jsonl \
  --task.task row --task.answer_normalization identity \
  --qry_encoder_path ${TBL_DIR}/models/dpr_e3_row/qry_encoder \
  --corpus_endpoint ${TBL_DIR}/passages/row/dpr_index \
  --output ${TBL_DIR}/apply/dpr_row.jsonl
```

Note that when a directory (rather than an http://IP:port URL) is provided as the corpus_endpoint, a DPR index service will be started. An abnormal exit from either training or apply that uses such a service can leave the service still running. Check that no stray DPR services are left running with:
```bash
ps -ef | grep python
```
and look for any leftover processes.
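
The `ps -ef | grep python` check can also be scripted. The sketch below parses `ps -ef` style output and lists the Python processes in it. It assumes the usual eight-column layout (UID, PID, PPID, C, STIME, TTY, TIME, CMD); the sample text and the `corpus_service.py` command in it are made up for illustration and do not name a real script from this repository.

```python
import subprocess


def find_python_processes(ps_output, own_pid=None):
    """Return (pid, command) pairs for python processes in `ps -ef` output."""
    found = []
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        pid, command = int(fields[1]), fields[7].strip()
        if "python" in command and pid != own_pid:
            found.append((pid, command))
    return found


def current_ps_output():
    """Capture live `ps -ef` output (defined for convenience, not called here)."""
    return subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    ).stdout


# Hypothetical sample output, used so the example is deterministic.
sample = """\
UID        PID  PPID  C STIME TTY          TIME CMD
user      4242     1  0 10:01 ?        00:00:07 python corpus_service.py --port 5001
user      4300  4100  0 10:05 pts/0    00:00:00 bash
"""

stray = find_python_processes(sample)
```

To inspect the live system, pass `current_ps_output()` instead of `sample`, together with `own_pid=os.getpid()` (after `import os`) so the checking script does not report itself.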

Reader

1) Train the row or column population model

```bash
export CORPUS=${TBL_DIR}/passages/row/dpr_index
```

Optionally start the corpus service

```bash
python ${PYTHONPATH}/corpus/corpus_server_direct.py \
  --port 5001 --corpus_dir ${CORPUS}

export CORPUS_ENDPOINT=http://127.0.0.1:5001
```

OR let the train / apply scripts start it for you

```bash
export CORPUS_ENDPOINT=${CORPUS}
```
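
The corpus endpoint can therefore be either a service URL such as http://127.0.0.1:5001 or an index directory. A minimal sketch of telling the two apart is given below. The function names and returned descriptions are hypothetical; the check actually used by the train and apply scripts is not described in this README and may differ.

```python
from urllib.parse import urlparse


def is_service_url(corpus_endpoint):
    """True for values like 'http://127.0.0.1:5001', False for directories."""
    parsed = urlparse(corpus_endpoint)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def describe_endpoint(corpus_endpoint):
    """Summarize what a given endpoint value implies (illustrative only)."""
    if is_service_url(corpus_endpoint):
        return "connect to an already running corpus service"
    return "start a corpus service from the index directory"


url_case = describe_endpoint("http://127.0.0.1:5001")
dir_case = describe_endpoint("/data/tables/passages/row/dpr_index")
```

When the directory form is used, a service is started on your behalf, so it is worth checking for leftover processes after an abnormal exit.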

```bash
python ${PYTHONPATH}/extractive/raex_train.py \
  --task.task row --task.answer_normalization identity \
  --train_data ${TBL_DIR}/queries/row_train.jsonl.gz \
  --model_name_or_path bert-large-cased \
  --dpr.qry_encoder_path ${TBL_DIR}/models/dpr_e3_row/qry_encoder \
  --dpr.corpus_endpoint ${CORPUS_ENDPOINT} --dpr.n_docs 5 \
  --num_train_epochs 1 --warmup_fraction 0.1 --full_train_batch_size 32 \
  --output_dir ${TBL_DIR}/models/raex_row
```

2) Apply the model

```bash
python ${PYTHONPATH}/extractive/raex_apply.py \
  --tables ${TBL_DIR}/queries/row_id_validation.jsonl \
  --task.task row --task.answer_normalization identity \
  --model_name_or_path bert-large-cased --resume_from ${TBL_DIR}/models/raex_row \
  --dpr.corpus_endpoint ${CORPUS_ENDPOINT} --dpr.n_docs 5 \
  --output_dir ${TBL_DIR}/apply/raex_row
```

NOTE: cell population

Cell filling is documented under table_augmentation/cell_filling

Citation

@inproceedings{glass2023retrieval,
  title={Retrieval-Based Transformer for Table Augmentation},
  author={Glass, Michael and Wu, Xuecheng and Naik, Ankita and Rossiello, Gaetano and Gliozzo, Alfio},
  booktitle={Annual Meeting of the Association for Computational Linguistics},
  year={2023}
}

Owner

  • Name: International Business Machines
  • Login: IBM
  • Kind: organization
  • Email: awesome@ibm.com
  • Location: United States of America

Citation (CITATION.cff)

cff-version: 1.2.0
title: Retrieval-Based Transformer for Table Augmentation
message: 'If you use this software, please cite it as below.'
type: software
authors:
  - family-names: Glass
    given-names: Michael
    orcid: 'https://orcid.org/0009-0000-1505-4667'
    affiliation: IBM Research AI
  - family-names: Wu
    given-names: Xuecheng
    affiliation: IBM Research AI
  - family-names: Naik
    given-names: Ankita
    affiliation: IBM Research AI
  - family-names: Rossiello
    given-names: Gaetano
    orcid: 'https://orcid.org/0000-0003-1042-4782'
    affiliation: IBM Research AI
  - family-names: Gliozzo
    given-names: Alfio
    orcid: 'https://orcid.org/0000-0002-8044-2911'
    affiliation: IBM Research AI
url: 'https://github.com/IBM/retrieval-table-augmentation'
abstract: >-
  Data preparation, also called data wrangling, is
  considered one of the most expensive and time-consuming
  steps when performing analytics or building machine
  learning models.  Preparing data typically involves
  collecting and merging data from complex heterogeneous,
  and often large-scale data sources, such as data lakes. 
  In this paper, we introduce a novel approach toward
  automatic data wrangling in an attempt to alleviate the
  effort of end-users, e.g. data analysts, in structuring
  dynamic views from data lakes in the form of tabular
  data.  Given a corpus of tables, we propose a retrieval
  augmented transformer model that is self-trained for the
  table augmentation tasks of row/column population and data
  imputation.  Our self-learning strategy consists in
  randomly ablating tables from the corpus and training the
  retrieval-based model with the objective of reconstructing
  the partial tables given as input with the original values
  or headers.  We adopt this strategy to first train the
  dense neural retrieval model encoding portions of tables
  to vectors, and then the end-to-end model trained to
  perform table augmentation tasks.  We test on EntiTables,
  the standard benchmark for table augmentation, as well as
  introduce a new benchmark to advance further research:
  WebTables.  Our model consistently and substantially
  outperforms both supervised statistical methods and the
  current state-of-the-art transformer-based models.
license: MIT
version: 1
date-released: '2023-06-16'
preferred-citation:
  type: article
  authors:
  - family-names: Glass
    given-names: Michael
    orcid: 'https://orcid.org/0009-0000-1505-4667'
    affiliation: IBM Research AI
  - family-names: Wu
    given-names: Xuecheng
    affiliation: IBM Research AI
  - family-names: Naik
    given-names: Ankita
    affiliation: IBM Research AI
  - family-names: Rossiello
    given-names: Gaetano
    orcid: 'https://orcid.org/0000-0003-1042-4782'
    affiliation: IBM Research AI
  - family-names: Gliozzo
    given-names: Alfio
    orcid: 'https://orcid.org/0000-0002-8044-2911'
    affiliation: IBM Research AI
  journal: Annual Meeting of the Association for Computational Linguistics
  title: Retrieval-Based Transformer for Table Augmentation
  year: 2023

GitHub Events

Total
  • Watch event: 2
  • Pull request event: 1
  • Create event: 1
Last Year
  • Watch event: 2
  • Pull request event: 1
  • Create event: 1

Dependencies

requirements.txt pypi
  • faiss-cpu ==1.7.2
  • flask *
  • pyjnius *
  • pyserini ==0.18.0
  • requests *
  • torch >=1.13.1
  • transformers ==4.21.2
  • ujson *