https://github.com/amazon-science/text_generation_diffusion_llm_topic

Topic Embedding, Text Generation and Modeling using diffusion

Science Score: 39.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.1%) to scientific vocabulary

Keywords

diffusion-models lda machine-learning natural-language-processing nlp sentence-embeddings t5 text-embedding text-embeddings text-generation topic topic-modeling topic-models transformers
Last synced: 5 months ago

Repository

Topic Embedding, Text Generation and Modeling using diffusion

Basic Info
  • Host: GitHub
  • Owner: amazon-science
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 154 KB
Statistics
  • Stars: 15
  • Watchers: 3
  • Forks: 4
  • Open Issues: 2
  • Releases: 0
Topics
diffusion-models lda machine-learning natural-language-processing nlp sentence-embeddings t5 text-embedding text-embeddings text-generation topic topic-modeling topic-models transformers
Created about 2 years ago · Last pushed 9 months ago
Metadata Files
Readme Contributing License Code of conduct

README.md

DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM (Accepted by EMNLP 2023 as Findings)

This repository is the official implementation of DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM.

DeTiME can generate topic embeddings, apply diffusion to denoise them, and generate text from the denoised embeddings.

Installation

To install requirements:

`pip install -r requirements.txt`

Training and Evaluation

To train and evaluate the model, run this command:

Step 1: If the data is hosted on Hugging Face, pass the Hugging Face repository name to --data_source. If the data is a CSV file, pass --data_source csv and point --data_path to the file.

Step 2: Define the number of topics. If the number is 10, use --numb_embeddings 10.

Step 3: Define the metric you want to evaluate; currently diversity, cv, c_uci, and others are supported.

Then run `python3 main.py --data_source xwjzds/ag_news --metric diversity --topk 20`. It will output the diversity metric using the data in xwjzds/ag_news.
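
The CSV path described in Step 1 is not shown as a command, so here is a hedged sketch; the file path is a placeholder, and the flags are the ones documented under Argument Explain below.

```bash
# Hypothetical CSV run: ./data/my_corpus.csv is a placeholder path.
python3 main.py --data_source csv --data_path ./data/my_corpus.csv \
    --numb_embeddings 10 --metric diversity --topk 20
```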

Embedding Explain

Diffusion Explain

After obtaining embeddings with the DeTiME encoders, diffusion can be leveraged to denoise them. The denoised embeddings can then be passed to the DeTiME decoders to generate text.

Training the diffusor involves two steps.

Step 1: generate embeddings for the dataset using the DeTiME encoders. The code below shows how to generate embeddings:

```python
import gc
from tqdm import tqdm

# dataset, tokenizer, models and args are assumed to be defined as in the
# repository's training code
outputs = []

text_ls = dataset['summary']
batch_size = 2

# split the texts into batches
batch_ls = [text_ls[ind: ind + batch_size] for ind in range(0, len(text_ls), batch_size)]

for text in tqdm(batch_ls):
    # optionally prepend an instruction, e.g. text = ['repeat: ' + t for t in text]

    inputs = tokenizer(text, return_tensors="pt", padding='max_length',
                       truncation=True, max_length=args.max_length)

    # move input ids and attention mask to the model's device
    inputs_id = inputs.input_ids.to(models.device)
    attention = inputs.attention_mask.to(models.device)

    # encode with the LLM encoder: batch size * seq length * embedding size
    output = models.model.encoder(inputs_id, attention).last_hidden_state
    # project to the DeTiME embedding space
    output = models.encoder(output)
    outputs.append(output.detach().cpu())

    gc.collect()
```
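
The diffusor trainer in Step 2 expects the embeddings as a saved tensor file. A minimal sketch of how the collected outputs could be written to disk, assuming standard PyTorch (the file name is only a placeholder, not the exact file used in the repository):

```python
import torch

# concatenate the per-batch embeddings and save them for diffusor training
embeddings = torch.cat(outputs, dim=0)
torch.save(embeddings, './example/embeddings.pt')
```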

Step 2: train a diffusor using the embeddings. To train a diffusor, run `python diffuser_training.py --embedding_input './example/embedvectorsbase71000prefix.pt' --model_name 'UNetConv' --output_dir './example'`. Here, --embedding_input is the embedding file location, --model_name is the diffusor model to train, and --output_dir is the directory where the trained diffusor is saved.

To generate text from the denoised embeddings, three steps are involved.

Step 1: generate embeddings for the dataset using the DeTiME encoders, as shown above.

Step 2: denoise the embeddings using the trained diffusor.

```python
import torch

from diffusion.diffusion_generate import generate_diffused_embed, generate_text

# generate from the noise vector
sampling_turn = 2
timesteps = 1000

# num_images, latent_dim, device and the trained diffusor `model` are assumed
# to be defined already
x_noise = torch.randn((num_images, 4, latent_dim // 4), device=device)
x_track_ls_ls_noise, x0_track_ls_ls_noise = generate_diffused_embed(
    x_noise, model, timesteps, device,
    batch_size=2, num_generated_sample=2, return_all_time_embed=True)
```

Step 3: generate text from the denoised embeddings.

Interactive Code

Example of using a dataset from OCTIS:

```python
from octis.dataset.dataset import Dataset
import sys
sys.path.insert(0, '../src/topic_modeling')
from model import TopicModel
from datasets import load_dataset
from octis.evaluation_metrics.diversity_metrics import TopicDiversity
from octis.evaluation_metrics.coherence_metrics import Coherence

dataset = Dataset()
dataset.fetch_dataset("20NewsGroup")  # It can support 20NewsGroup, BBC_News, DBLP, DBPedia_IT
tm = TopicModel(numb_embeddings=10)
texts = [' '.join(i) for i in dataset.get_corpus()]
model_output = tm.train_model(texts)
metric = TopicDiversity(topk=10)
topic_diversity_score = metric.score(model_output)  # Compute score of diversity
cmetric = Coherence(texts=tm.tp.lemmas, measure='c_npmi')
coherence = cmetric.score(model_output)  # Compute score of coherence
```

Example of using datasets from Hugging Face:

```python
import sys
sys.path.insert(0, '../src/topic_modeling')
from model import TopicModel
from datasets import load_dataset
from octis.evaluation_metrics.diversity_metrics import TopicDiversity
from octis.evaluation_metrics.coherence_metrics import Coherence

df = load_dataset('xwjzds/ag_news')
tm = TopicModel(numb_embeddings=10)

model_output = tm.train_model(df['train']['text'])
metric = TopicDiversity(topk=10)
topic_diversity_score = metric.score(model_output)  # Compute score of diversity
cmetric = Coherence(texts=tm.tp.lemmas, measure='c_npmi')
coherence = cmetric.score(model_output)  # Compute score of coherence
```

Argument Explain

Arguments Explained:

--numb_embeddings: Number of embeddings (default is 10).

--epochs: Number of epochs for training (default is 20).

--batch_size: Batch size for training (default is 256).

--gpu_num: GPU number to use (default is 1).

--learning_rate: Learning rate (default is 0.002).

--weight_decay: Weight decay (default is 1.2e-6).

--penalty: Penalty term (default is 1).

--beta: Beta value (default is 1).

--temp: Temperature (default is 10).

--data_source: Data source type (default is 'huggingface'). Can be 'huggingface', 'csv', or 'txt'.

--data_path: Path to the data file for 'csv' or 'txt' (default is '').

--metrics: List of metrics to report (default is ['diversity', 'cv', 'c_npmi', 'c_uci', 'u_mass']).

--topk: Top k words to report for diversity (default is 10).
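
Putting several of these arguments together, a full training run might look like the sketch below; the values shown are simply the defaults listed above and can be changed freely.

```bash
# Example invocation combining the documented flags with their default values.
python3 main.py \
    --data_source xwjzds/ag_news \
    --numb_embeddings 10 \
    --epochs 20 \
    --batch_size 256 \
    --learning_rate 0.002 \
    --topk 10
```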

Results

Our model achieves the following performance on Ag News:

| Model name | Diversity | Cv    | Cnpmi |
| ---------- | --------- | ----- | ----- |
| vONT       | 0.865     | 0.618 | 0.115 |
| DeTiME     | 0.93      | 0.645 | 0.113 |

We use existing embeddings in this code release instead of spherical embeddings, since training spherical embeddings takes time. We noticed that the performance reported here is better than the performance in our paper.

Citation

@inproceedings{xu-etal-2023-vontss,
    title = "v{ONTSS}: v{MF} based semi-supervised neural topic modeling with optimal transport",
    author = "Xu, Weijie and Jiang, Xiaoyu and Sengamedu Hanumantha Rao, Srinivasan and Iannacci, Francis and Zhao, Jinjin",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.271",
    doi = "10.18653/v1/2023.findings-acl.271",
    pages = "4433--4457",
    abstract = "Recently, Neural Topic Models (NTM), inspired by variational autoencoders, have attracted a lot of research interest; however, these methods have limited applications in the real world due to the challenge of incorporating human knowledge. This work presents a semi-supervised neural topic modeling method, vONTSS, which uses von Mises-Fisher (vMF) based variational autoencoders and optimal transport. When a few keywords per topic are provided, vONTSS in the semi-supervised setting generates potential topics and optimizes topic-keyword quality and topic classification. Experiments show that vONTSS outperforms existing semi-supervised topic modeling methods in classification accuracy and diversity. vONTSS also supports unsupervised topic modeling. Quantitative and qualitative experiments show that vONTSS in the unsupervised setting outperforms recent NTMs on multiple aspects: vONTSS discovers highly clustered and coherent topics on benchmark datasets. It is also much faster than the state-of-the-art weakly supervised text classification method while achieving similar classification performance. We further prove the equivalence of optimal transport loss and cross-entropy loss at the global minimum.",
}

@inproceedings{xu-etal-2023-detime,
    title = "{D}e{T}i{ME}: Diffusion-Enhanced Topic Modeling using Encoder-decoder based {LLM}",
    author = "Xu, Weijie and Hu, Wenxiang and Wu, Fanyou and Sengamedu, Srinivasan",
    editor = "Bouamor, Houda and Pino, Juan and Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.606",
    doi = "10.18653/v1/2023.findings-emnlp.606",
    pages = "9040--9057",
    abstract = "In the burgeoning field of natural language processing, Neural Topic Models (NTMs) and Large Language Models (LLMs) have emerged as areas of significant research interest. Despite this, NTMs primarily utilize contextual embeddings from LLMs, which are not optimal for clustering or capable for topic generation. Our study addresses this gap by introducing a novel framework named Diffusion-Enhanced Topic Modeling using Encoder-Decoder-based LLMs (DeTiME). DeTiME leverages Encoder-Decoder-based LLMs to produce highly clusterable embeddings that could generate topics that exhibit both superior clusterability and enhanced semantic coherence compared to existing methods. Additionally, by exploiting the power of diffusion, our framework also provides the capability to generate content relevant to the identified topics. This dual functionality allows users to efficiently produce highly clustered topics and related content simultaneously. DeTiME{'}s potential extends to generating clustered embeddings as well. Notably, our proposed framework proves to be efficient to train and exhibits high adaptability, demonstrating its potential for a wide array of applications.",
}

Owner

  • Name: Amazon Science
  • Login: amazon-science
  • Kind: organization

GitHub Events

Total
  • Watch event: 7
  • Fork event: 1
  • Create event: 2
Last Year
  • Watch event: 7
  • Fork event: 1
  • Create event: 2

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 1
  • Total pull requests: 5
  • Average time to close issues: N/A
  • Average time to close pull requests: 8 days
  • Total issue authors: 1
  • Total pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.2
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 4
Past Year
  • Issues: 1
  • Pull requests: 5
  • Average time to close issues: N/A
  • Average time to close pull requests: 8 days
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.2
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 4
Top Authors
Issue Authors
  • LLLD-0901 (1)
Pull Request Authors
  • dependabot[bot] (8)
  • wenxh0718 (2)
Top Labels
Issue Labels
Pull Request Labels
dependencies (8)

Dependencies

requirements.txt pypi
  • cudatoolkit ==10.1
  • datasets ==2.11.0
  • gensim ==4.3.1
  • gensim ==4.2.0
  • huggingface-hub ==0.16.4
  • nltk ==3.6.6
  • octis ==1.13.1
  • scikit-learn ==1.3.0
  • torch ==2.0.1
  • torchvision ==0.2.1