catbird

Open-source toolkit for paraphrase generation.

https://github.com/afonso-sousa/catbird

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file (found)
  • .zenodo.json file (found)
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity: low (19.2%)

Keywords

paraphrase-generation pytorch
Last synced: 6 months ago

Repository

Open-source toolkit for paraphrase generation.

Basic Info
  • Host: GitHub
  • Owner: afonso-sousa
  • License: MIT
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 26.6 MB
Statistics
  • Stars: 5
  • Watchers: 1
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Topics
paraphrase-generation pytorch
Created about 4 years ago · Last pushed almost 3 years ago
Metadata Files
Readme · License · Citation

README.md

[![No Maintenance Intended](http://unmaintained.tech/badge.svg)](http://unmaintained.tech/) [![License: MIT](https://img.shields.io/badge/License-MIT-brightgreen.svg)](https://opensource.org/licenses/MIT) [![codecov](https://img.shields.io/codecov/c/gh/AfonsoSalgadoSousa/catbird)](https://codecov.io/gh/AfonsoSalgadoSousa/catbird)

⛔️ DEPRECATED

This was a personal project I built to help in my research. As it got bigger, I thought others could find some of its features useful. However, with the fast pace of development in NLP, other projects (e.g., Hugging Face or fairseq) are doing it faster and better. As such, I am no longer maintaining this repository.

Catbird is an open source paraphrase generation toolkit based on PyTorch.

Main Features

This is an ongoing, one-person project. Hopefully you will find it useful; if you do, do not forget to leave a star 🌟.

Datasets

  • Quora Question Pairs
  • MSCOCO

Tokenizers

We use Hugging Face's Tokenizers package, so you can easily use any pretrained tokenizer. Additionally, you can train your own tokenizers with the BPE, Unigram, WordPiece, or word-level algorithms. To do so, you might find the wikitext-103 dataset useful.
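
For concreteness, here is a minimal sketch of the two workflows described above, using the Tokenizers package directly. The wikitext-103 file path, vocabulary size, and special tokens are assumptions for illustration; Catbird's own tooling may wrap these calls differently.

```python
# Minimal sketch using the Hugging Face Tokenizers package (pip install tokenizers).
# The wikitext-103 file path and the tokenizer settings below are assumptions.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Option 1: reuse any pretrained tokenizer from the Hugging Face Hub.
pretrained = Tokenizer.from_pretrained("bert-base-uncased")

# Option 2: train your own, e.g. a BPE tokenizer on the wikitext-103 raw text.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=30000,
                     special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])
tokenizer.train(files=["data/wikitext-103-raw/wiki.train.raw"], trainer=trainer)
tokenizer.save("data/wikitext103-bpe.json")
```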

Metrics

We support the following metrics. We currently use the Hugging Face implementations and wrap them for use with PyTorch Ignite; a sketch of such a wrapper is shown after the list below.

  • BLEU
  • METEOR
  • TER
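
As an illustration of such a wrapper (not Catbird's actual code), the sketch below accumulates hypothesis/reference strings in a custom Ignite `Metric` and scores them with the Hugging Face `evaluate` package, using sacreBLEU as a stand-in; older versions of this pattern used `datasets.load_metric` instead.

```python
# Hypothetical sketch of wrapping a Hugging Face metric for PyTorch Ignite.
# Requires: pip install evaluate sacrebleu pytorch-ignite
import evaluate
from ignite.metrics import Metric


class SacreBleuMetric(Metric):
    """Accumulates (hypothesis, reference) string pairs and scores them with sacreBLEU."""

    def __init__(self, output_transform=lambda x: x):
        self._scorer = evaluate.load("sacrebleu")
        self._hypotheses = []
        self._references = []
        super().__init__(output_transform=output_transform)

    def reset(self):
        self._hypotheses = []
        self._references = []

    def update(self, output):
        # `output` is assumed to be (list_of_hypothesis_strings, list_of_reference_strings).
        hyps, refs = output
        self._hypotheses.extend(hyps)
        self._references.extend([[r] for r in refs])  # one reference per sample

    def compute(self):
        result = self._scorer.compute(predictions=self._hypotheses,
                                      references=self._references)
        return result["score"]


# Standalone usage; inside an Ignite pipeline you would call metric.attach(evaluator, "bleu").
metric = SacreBleuMetric()
metric.update((["the cat sat on the mat"], ["the cat sat on the mat"]))
print(metric.compute())
```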

Seq2Seq Techniques

We support teacher forcing for training and, for decoding, both greedy and beam search.
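
As a rough, hypothetical illustration of these techniques (not taken from Catbird's own training loop in tools/train.py), a teacher-forced loss computation and greedy versus beam-search decoding with a Hugging Face T5 model might look like this; the checkpoint name and the "paraphrase:" prompt prefix are assumptions.

```python
# Hypothetical sketch; Catbird's actual training loop differs.
# Requires: pip install torch transformers sentencepiece
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

src = ["paraphrase: How do I learn Python quickly?"]
tgt = ["What is the fastest way to learn Python?"]

inputs = tokenizer(src, return_tensors="pt", padding=True)
labels = tokenizer(tgt, return_tensors="pt", padding=True).input_ids
# For batched targets, replace pad token ids in `labels` with -100 so they are ignored by the loss.

# Teacher forcing: passing `labels` makes the model feed the gold target tokens
# (shifted right internally) to the decoder instead of its own previous predictions.
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()

# Decoding: greedy search (num_beams=1) vs. beam search (num_beams > 1).
with torch.no_grad():
    greedy = model.generate(**inputs, max_length=32)
    beam = model.generate(**inputs, max_length=32, num_beams=4)
print(tokenizer.batch_decode(beam, skip_special_tokens=True))
```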

Quick Start

Requirements and Installation

The project is based on PyTorch 1.11+ and Python 3.8+.

Install Catbird

The package can be installed using pip:

```shell
pip install catbird
```

This does not include configuration files or tools and is not yet actively updated. Alternatively, you can run from the source code:

a. Clone the repository.

```shell
git clone https://github.com/AfonsoSalgadoSousa/catbird.git
```

b. Install dependencies.

This project uses Poetry as its package manager; make sure you have it installed. For more info, check Poetry's official documentation. To install dependencies, simply run:

```shell
poetry install
```

We have also compiled an environment.yml file with all the required dependencies to create an Anaconda environment. To do so, simply run:

```shell
conda env create -f environment.yml
```

Dataset Preparation

For now, we support the Quora Question Pairs and MSCOCO datasets. It is recommended to download and extract the datasets somewhere outside the project directory and symlink the dataset root to $CATBIRD/data as shown below. If your folder structure is different, you may need to change the corresponding paths in the config files.

```text
catbird
├── catbird
├── tools
├── configs
├── data
│   ├── quora
│   │   ├── quora_duplicate_questions.tsv
│   ├── mscoco
│   │   ├── captions_train2014.json
│   │   ├── captions_val2014.json
```

Download the Quora data HERE. Prepare the Quora data by running:

```shell
poetry run python tools/preprocessing/create_data.py quora --root-path ./data/quora --out-dir ./data/quora
```

Download MSCOCO HERE, under the link '2014 Train/Val annotations'. Prepare MSCOCO data by running:

```shell
poetry run python tools/preprocessing/create_data.py mscoco --root-path ./data/mscoco --out-dir ./data/mscoco --split train
poetry run python tools/preprocessing/create_data.py mscoco --root-path ./data/mscoco --out-dir ./data/mscoco --split val
```

Train

```shell
poetry run python tools/train.py ${CONFIG_FILE} [optional arguments]
```

Example:

  1. Train T5 on QQP.

```bash
$ poetry run python tools/train.py configs/t5_quora.yaml
```

Contributors

Acknowledgement

This project borrowed ideas from the following open-source repositories:

Owner

  • Name: Afonso Sousa
  • Login: afonso-sousa
  • Kind: user
  • Location: Porto
  • Company: @FEUP

Aspiring Data Scientist


Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 1.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • vfdev-5 (1)
Pull Request Authors