https://github.com/amazon-science/contraclm

[ACL 2023] Code for ContraCLM: Contrastive Learning For Causal Language Model

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.1%) to scientific vocabulary

Keywords

contrastive-learning generative-ai gpt-2 llm nlp
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: amazon-science
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 128 KB
Statistics
  • Stars: 34
  • Watchers: 1
  • Forks: 2
  • Open Issues: 1
  • Releases: 0
Topics
contrastive-learning generative-ai gpt-2 llm nlp
Created almost 3 years ago · Last pushed over 2 years ago
Metadata Files
Readme Contributing License Code of conduct

README.md

ContraCLM: Contrastive Learning for Causal Language Model

This repository contains code for the ACL 2023 paper, ContraCLM: Contrastive Learning for Causal Language Model.

Work done by: Nihal Jain, Dejiao Zhang, Wasi Uddin Ahmad, Zijian Wang, Feng Nan, Xiaopeng Li, Ming Tan, Ramesh Nallapati, Baishakhi Ray, Parminder Bhatia, Xiaofei Ma, Bing Xiang.

Updates

  • [07-08-2023] Initial release of the code.

Overview

We present ContraCLM, a novel contrastive learning framework which operates at both the token-level and sequence-level. ContraCLM enhances the discrimination of representations from a decoder-only language model and bridges the gap with encoder-only models, making causal language models better suited for tasks beyond language generation. We encourage you to check out our paper for more details.
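As a rough illustration of the contrastive idea described above, the following is a minimal sketch of a generic InfoNCE-style loss in plain Python. It is not the objective from the paper or the code in this repository: the function names `cosine` and `info_nce`, the temperature value, and the one-directional formulation are all assumptions made for the example. The same function could be applied to token representations within one sequence (token-level) or to pooled sequence representations within a batch (sequence-level).

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.05):
    """Average InfoNCE loss over two views of the same items.

    view_a[i] and view_b[i] are treated as a positive pair; every other
    view_b[j] acts as a negative for view_a[i]. The temperature is an
    illustrative value, not one taken from the paper.
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        max_logit = max(logits)
        log_denom = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)
```

The loss is small when each item is closest to its own counterpart in the other view and large when the pairing is scrambled, which is the property that pushes representations of different tokens or sequences apart.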

Setup

The setup involves installing the necessary dependencies in an environment and placing the datasets in the requisite directory.

Environment

Run these commands to create a new conda environment and install the required packages for this repository.

```bash
# create a new conda environment with python >= 3.8
conda create -n contraclm python=3.8.12

# install dependencies within the environment
conda activate contraclm
pip install -r requirements.txt
```
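After activating the environment, a quick sanity check can confirm that the interpreter satisfies the `python >= 3.8` requirement noted in the setup commands. This is a minimal sketch; the helper name `meets_minimum` is hypothetical and not part of the repository.

```python
import sys

# Minimum version taken from the comment in the setup commands (python >= 3.8).
MINIMUM_PYTHON = (3, 8)


def meets_minimum(version, minimum=MINIMUM_PYTHON):
    """Return True if a (major, minor, ...) version tuple is at least `minimum`."""
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == "__main__":
    status = "OK" if meets_minimum(sys.version_info) else "too old"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```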

Datasets & Preprocessing

See here.

Pretraining

In this section, we show how to use this repository to pretrain (i) GPT2 on Natural Language (NL) data, and (ii) CodeGen-350M-Mono on Programming Language (PL) data.

Common Instructions

  1. This section assumes that you have the train and validation data stored at TRAIN_DIR and VALID_DIR respectively, and are within an environment with all the above dependencies installed (see Setup).

  2. You can get an overview of all the flags associated with pretraining by running `python pl_trainer.py --help`.
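Before launching a run, it can help to verify that the directories assumed above actually exist and contain data. The sketch below is one way to do that; the helper `check_data_dirs` is hypothetical and is not provided by this repository, and it makes no assumption about the file format inside the directories.

```python
from pathlib import Path


def check_data_dirs(train_dir, valid_dir):
    """Return a list of problems found; an empty list means both directories look usable."""
    problems = []
    for label, raw in (("TRAIN_DIR", train_dir), ("VALID_DIR", valid_dir)):
        path = Path(raw)
        if not path.is_dir():
            problems.append(f"{label} is not a directory: {path}")
        elif not any(path.iterdir()):
            problems.append(f"{label} is empty: {path}")
    return problems
```

Calling `check_data_dirs(TRAIN_DIR, VALID_DIR)` with your own paths and printing the result surfaces a missing or empty directory before any training time is spent.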

Pretrain GPT2 on NL Data

Usage

Run `bash runscripts/run_wikitext.sh`.

  1. To quickly test and debug the code, we suggest running with the MLE loss only by setting `CL_Config=$(eval echo ${options[1]})` within the script.

  2. All other options involve the CL loss at either the token level or the sequence level.

Pretrain CodeGen-350M-Mono on PL Data

Usage

  1. Configure the variables at the top of runscripts/run_code.sh. There are lots of options but only the dropout options are explained here (others are self-explanatory):
  • dropout_p: The dropout probability value used in torch.nn.Dropout

  • dropout_layers: If > 0, this will activate the last dropout_layers with probability dropout_p

  • functional_dropout: If specified, will use a functional dropout layer on top of the token representations output from the CodeGen model

  2. Set the variable CL according to the desired model configuration. Make sure the paths to TRAIN_DIR and VALID_DIR are set as desired.

  3. Run the command: bash runscripts/run_code.sh
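To make the dropout options above more concrete, here is a small plain-Python sketch of what a dropout probability means in training mode: each value is zeroed with probability `dropout_p` and the survivors are scaled by `1 / (1 - dropout_p)`, which is how `torch.nn.Dropout` behaves. The second helper reflects one reading of `dropout_layers`, namely that dropout is enabled only in the last `dropout_layers` layers; that reading and both function names are assumptions, not code from `runscripts/run_code.sh` or the trainer.

```python
import random


def inverted_dropout(values, p, rng):
    """Zero each value with probability p and scale survivors by 1 / (1 - p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("dropout probability must be in [0, 1]")
    if p == 1.0:
        return [0.0 for _ in values]
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]


def per_layer_dropout(num_layers, dropout_layers, dropout_p):
    """Hypothetical helper: dropout probability per layer, non-zero only for the last layers."""
    active = max(0, min(dropout_layers, num_layers))
    return [0.0] * (num_layers - active) + [dropout_p] * active


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same mask
    print(inverted_dropout([1.0, 2.0, 3.0, 4.0], 0.5, rng))
    print(per_layer_dropout(num_layers=4, dropout_layers=2, dropout_p=0.1))
```

Running two forward passes over the same input with independent dropout masks is a common way to obtain the two views that a contrastive loss compares.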

Evaluation

See the relevant task-specific directories here.

Citation

If you use our code in your research, please cite our work as:

@inproceedings{jain-etal-2023-contraclm,
    title = "{C}ontra{CLM}: Contrastive Learning For Causal Language Model",
    author = "Jain, Nihal and Zhang, Dejiao and Ahmad, Wasi Uddin and Wang, Zijian and Nan, Feng and Li, Xiaopeng and Tan, Ming and Nallapati, Ramesh and Ray, Baishakhi and Bhatia, Parminder and Ma, Xiaofei and Xiang, Bing",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.355",
    pages = "6436--6459"
}

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization

GitHub Events

Total
  • Watch event: 4
Last Year
  • Watch event: 4

Committers

Last synced: 11 months ago

All Time
  • Total Commits: 4
  • Total Committers: 3
  • Avg Commits per committer: 1.333
  • Development Distribution Score (DDS): 0.5
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Nihal Jain n****n@g****m 2
Amazon GitHub Automation 5****o 1
Nihal Jain n****n@a****m 1
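The committer statistics above can be reproduced from the per-committer commit counts. The sketch below assumes that the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is not stated on this page, and the function name `committer_stats` is hypothetical.

```python
def committer_stats(commit_counts):
    """Return (total commits, average commits per committer, DDS) for a list of counts."""
    total = sum(commit_counts)
    if total == 0:
        # No activity in the period: report zeros, as in the "Past Year" block.
        return 0, 0.0, 0.0
    average = total / len(commit_counts)
    # Assumed definition: DDS = 1 - (commits by top committer / total commits).
    dds = 1 - max(commit_counts) / total
    return total, round(average, 3), dds


# Commit counts from the "Top Committers" table: 2, 1 and 1.
print(committer_stats([2, 1, 1]))  # -> (4, 1.333, 0.5)
```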
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 0
  • Total pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: 15 minutes
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.5
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 2
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • dependabot[bot] (2)
Top Labels
Issue Labels
Pull Request Labels
dependencies (2)