https://github.com/bayer-group/xtars-naacl2022

Zero/few-shot learning for classification with very large label sets and long-tailed distribution of labels in data points

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (15.1%) to scientific vocabulary

Keywords

bayer-not-classified bayer-reg-none beat-not-applicable few-shot-learning large-scale-classification natural-language-processing text-classification zero-shot-learning

Last synced: 5 months ago

Repository

Basic Info

Host: GitHub
Owner: Bayer-Group
License: mit
Language: Python
Default Branch: master
Homepage:
Size: 18.6 KB

Statistics

Stars: 6
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Created almost 4 years ago · Last pushed about 3 years ago

Metadata Files

Readme License Codeowners

XTARS: zero/few-shot learning for large-scale text classification

This repository contains the code of the following paper:

@inproceedings{ziletti-etal-2022-medical,
title = "{M}edical Coding with Biomedical Transformer Ensembles and Zero/Few-shot Learning",
author = "Ziletti, Angelo  and
Akbik, Alan  and
Berns, Christoph  and
Herold, Thomas  and
Legler, Marion  and
Viell, Martina",
booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track",
month = jul,
year = "2022",
address = "Hybrid: Seattle, Washington + Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.naacl-industry.21",
doi = "10.18653/v1/2022.naacl-industry.21",
pages = "176--187",
abstract = "Medical coding (MC) is an essential pre-requisite for reliable data retrieval and reporting. Given a free-text \textit{reported term} (RT) such as {``}pain of right thigh to the knee{''}, the task is to identify the matching \textit{lowest-level term} (LLT) {--}in this case {``}unilateral leg pain{''}{--} from a very large and continuously growing repository of standardized medical terms. However, automating this task is challenging due to a large number of LLT codes (as of writing over $80\,000$), limited availability of training data for long tail/emerging classes, and the general high accuracy demands of the medical domain. With this paper, we introduce the MC task, discuss its challenges, and present a novel approach called xTARS that combines traditional BERT-based classification with a recent zero/few-shot learning approach (TARS). We present extensive experiments that show that our combined approach outperforms strong baselines, especially in the few-shot regime. The approach is developed and deployed at Bayer, live since November 2021. As we believe our approach potentially promising beyond MC, and to ensure reproducibility, we release the code to the research community.",
}

In this paper, we present a novel approach called XTARS that combines traditional BERT-based classification with a recent zero/few-shot learning approach (TARS, by Halder et al. (2020)).
XTARS is suitable for classification tasks with very large label sets and a long-tailed distribution of labels across data points.

Installation

We recommend creating a virtual Python 3.8 environment (for instance, with conda: https://docs.anaconda.com/anaconda/install/linux/).

Then install the latest version from the master branch on GitHub:

git clone <GITHUB-URL>
cd xtars
python setup.py install

Quick start

The XTARSClassifier in this repository can be used in the same way as the TARSClassifier in Flair.

Documentation on the usage of the TARSClassifier is available in the Flair documentation.

To use XTARS instead of TARS, simply substitute TARSClassifier with XTARSClassifier at training time.
An example of training is presented in main_train.py.

During prediction, a saved XTARSClassifier can be loaded in exactly the same way as the TARSClassifier. We refer you to the Flair documentation for more details.
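
As a minimal sketch of the prediction step described above, the snippet below loads a saved model and tags a few texts with Flair's TARSClassifier. The helper name predict_labels and the model path are hypothetical. Since a saved XTARSClassifier is stated to load in the same way, swapping the class should be the only change needed, but the import path of XTARSClassifier is not given here and has to be taken from this repository.

```python
def predict_labels(model_path, texts):
    """Load a saved TARS-style model and return (label, score) pairs per text.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    # Imported lazily so that defining the helper does not require Flair.
    from flair.data import Sentence
    # Assumption: replace TARSClassifier with XTARSClassifier (imported from
    # this repository's package) to load a saved XTARS model instead.
    from flair.models import TARSClassifier

    model = TARSClassifier.load(model_path)
    sentences = [Sentence(text) for text in texts]
    # predict() annotates the Sentence objects in place.
    model.predict(sentences)
    return [
        [(label.value, label.score) for label in sentence.labels]
        for sentence in sentences
    ]
```

Called with a model path and a list such as ["pain of right thigh to the knee"], the helper returns one list of (label, score) pairs per input text.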

Example code

Scripts for fine-tuning (main_train.py) and for making predictions (main_predict.py) are provided for your convenience.

Data for Fine Tuning

Sample data are provided in the /sample_data/ folder. If you use your own data, it must be formatted like the sample data.
As prescribed by Flair, three files are needed to create the corpus: train.csv, dev.csv, and test.csv.
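
If your own data is a single list of labelled texts, the three files can be produced with a small script. The sketch below is illustrative only: the helper name write_splits, the split fractions, and the two-column (text, label) layout are assumptions, so check the files in /sample_data/ for the exact column layout the training script expects.

```python
import csv
import random
from pathlib import Path


def write_splits(rows, out_dir, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle (text, label) rows and write train.csv, dev.csv and test.csv.

    Hypothetical helper; the column layout is an assumption, not the
    format confirmed by this repository.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # deterministic shuffle for a given seed
    n_dev = int(len(rows) * dev_frac)
    n_test = int(len(rows) * test_frac)
    splits = {
        "dev.csv": rows[:n_dev],
        "test.csv": rows[n_dev:n_dev + n_test],
        "train.csv": rows[n_dev + n_test:],
    }
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, split in splits.items():
        with open(out_dir / name, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(split)
    # Return the number of rows written to each file.
    return {name: len(split) for name, split in splits.items()}
```

With a long-tailed label distribution, a purely random split can leave rare labels out of dev.csv or test.csv entirely, so a per-label split may be preferable in practice.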

Fine Tuning

Use the main_train.py script to fine-tune a model on the provided sample data.

python main_train.py

Predictions

After training a model, use the main_predict.py script to obtain predictions for the test set.

python main_predict.py

Owner

Name: Bayer Open Source
Login: Bayer-Group
Kind: organization

Website: https://bayer.com/
Repositories: 98
Profile: https://github.com/Bayer-Group

Science for a better life

GitHub Events

Total

Pull request event: 1
Create event: 1

Last Year

Pull request event: 1
Create event: 1

Committers

Last synced: about 1 year ago

All Time

Total Commits: 2
Total Committers: 1
Avg Commits per committer: 2.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Angelo Ziletti	a**i@b**m	2

Committer Domains (Top 20 + Academic)

bayer.com: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 1
Total pull requests: 0
Average time to close issues: 8 months
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 0
Average comments per issue: 2.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

abhinav-kashyap-asus (1)

Pull Request Authors

dependabot[bot] (1)

Top Labels

Issue Labels

question (1)

Pull Request Labels

dependencies (1) python (1)

Dependencies

requirements.txt pypi

black *
datasets *
flair ==0.10
isort *
numpy *
pandas *
s3fs *
sklearn *
torch *
transformers *
