BTM

Biterm Topic Modelling for Short Text with R

https://github.com/bnosac/btm

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.5%) to scientific vocabulary

Keywords

biterm-topic-modelling natural-language-processing r topic-modeling
Last synced: 6 months ago

Repository

Biterm Topic Modelling for Short Text with R

Basic Info
  • Host: GitHub
  • Owner: bnosac
  • License: apache-2.0
  • Language: C++
  • Default Branch: master
  • Homepage:
  • Size: 178 KB
Statistics
  • Stars: 96
  • Watchers: 7
  • Forks: 15
  • Open Issues: 4
  • Releases: 8
Topics
biterm-topic-modelling natural-language-processing r topic-modeling
Created about 7 years ago · Last pushed about 3 years ago
Metadata Files
Readme Changelog License

README.md

BTM - Biterm Topic Modelling for Short Text with R

This is an R package wrapping the C++ code available at https://github.com/xiaohuiyan/BTM for constructing a Biterm Topic Model (BTM). This model captures word-word co-occurrence patterns (i.e., biterms).

Topic modelling using biterms is particularly well suited to finding topics in short texts, such as short survey answers or Twitter data.

Installation

This R package is on CRAN; install it with install.packages('BTM').
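
In an R session that amounts to the following; the commented GitHub line is an optional assumption for installing the development version and requires the remotes package:

```
install.packages("BTM")                 # released version from CRAN
# remotes::install_github("bnosac/btm") # development version (optional, assumes remotes is installed)
library(BTM)
```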

What

The Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modelling word-word co-occurrence patterns (i.e., biterms).

  • A biterm consists of two words co-occurring in the same context, for example, in the same short text window (a small illustrative sketch follows this list).
  • BTM models the biterm occurrences in a corpus (unlike LDA models, which model the word occurrences in a document).
  • It is a generative model. In the generation procedure, a biterm is generated by drawing two words independently from the same topic z. In other words, the distribution of a biterm b = (wi, wj) is defined as P(b) = sum_z P(wi|z) * P(wj|z) * P(z), where the sum runs over the k topics you want to extract.
  • Estimation of the topic model is done with a Gibbs sampling algorithm, which provides estimates of P(w|z) = phi and P(z) = theta.
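
A purely illustrative sketch (plain R, not part of the BTM API) of what the biterms of one short text look like: all unordered word pairs that co-occur in it. BTM then models the frequencies of such pairs aggregated over the whole corpus rather than per-document word counts. The example tokens are made up.

```
## Toy example: enumerate the biterms of a single short text
tokens  <- c("visit", "brussels", "grand", "place")
biterms <- t(combn(tokens, 2))
colnames(biterms) <- c("term1", "term2")
biterms
##      term1      term2
## [1,] "visit"    "brussels"
## [2,] "visit"    "grand"
## [3,] "visit"    "place"
## [4,] "brussels" "grand"
## [5,] "brussels" "place"
## [6,] "grand"    "place"
```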

More details can be found in the following paper:

Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model For Short Text. WWW2013. https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf

Example

```
library(udpipe)
library(BTM)
data("brussels_reviews_anno", package = "udpipe")

## Taking only nouns of Dutch data
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]

## Building the model
set.seed(321)
model <- BTM(x, k = 3, beta = 0.01, iter = 1000, trace = 100)

## Inspect the model - topic frequency + conditional term probabilities
model$theta
[1] 0.3406998 0.2413721 0.4179281

topicterms <- terms(model, top_n = 10)
topicterms
[[1]]
         token probability
1  appartement  0.06168297
2      brussel  0.04057012
3        kamer  0.02372442
4      centrum  0.01550855
5      locatie  0.01547671
6         stad  0.01229227
7        buurt  0.01181460
8     verblijf  0.01155985
9         huis  0.01111402
10         dag  0.01041345

[[2]]
         token probability
1  appartement  0.05687312
2      brussel  0.01888307
3        buurt  0.01883812
4        kamer  0.01465696
5     verblijf  0.01339812
6     badkamer  0.01285862
7   slaapkamer  0.01276870
8          dag  0.01213928
9          bed  0.01195945
10        raam  0.01164474

[[3]]
         token probability
1  appartement 0.061804812
2      brussel 0.035873377
3      centrum 0.022193831
4         huis 0.020091282
5        buurt 0.019935537
6     verblijf 0.018611710
7     aanrader 0.014614272
8        kamer 0.011447470
9      locatie 0.010902365
10      keuken 0.009448751

scores <- predict(model, newdata = x)
```
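
The scores object returned by predict holds, for each document, the predicted probability of each of the k topics. A minimal inspection, assuming the usual layout of one row per doc_id and one column per topic:

```
## One row per document, one column per topic; each row should sum to 1
dim(scores)
head(scores)
```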

Make a specific topic called the background

```
## If you set background to TRUE,
## the first topic is set to a background topic that equals the empirical word distribution.
## This can be used to filter out common words.
set.seed(321)
model <- BTM(x, k = 5, beta = 0.01, background = TRUE, iter = 1000, trace = 100)
topicterms <- terms(model, top_n = 5)
topicterms
```
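
Since the background topic is the first topic, its top terms show which common words it absorbs; a small usage note reusing the topicterms object from the chunk above:

```
## Terms of the background topic (the empirical word distribution)
topicterms[[1]]
```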

Visualisation of your model

  • Can be done using the textplot package (https://github.com/bnosac/textplot), which is also available on CRAN (https://cran.r-project.org/package=textplot)
  • An example visualisation built on a model of all R packages from the Natural Language Processing and Machine Learning task views is shown at https://www.bnosac.be/index.php/blog/98-biterm-topic-modelling-for-short-texts

```
library(textplot)
library(ggraph)
library(concaveman)
plot(model)
```

Provide your own set of biterms

An interesting use case of this package is to

  • cluster based on part-of-speech tags, e.g. nouns and adjectives that occur close to one another in the text
  • cluster dependency relationships provided by NLP tools like udpipe (https://CRAN.R-project.org/package=udpipe)

This can be done by providing your own set of biterms to cluster upon; a minimal sketch of such a biterms object follows, and two full examples are given below.
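
A minimal sketch of the shape such a biterms object takes, using the doc_id/term1/term2/cooc columns that the examples below construct; the values here are made up purely for illustration:

```
## Hand-crafted biterms: one row per word pair within a document
biterms <- data.frame(doc_id = c("doc1", "doc1", "doc2"),
                      term1  = c("appartement", "kamer", "locatie"),
                      term2  = c("centrum", "badkamer", "centrum"),
                      cooc   = 1,
                      stringsAsFactors = FALSE)
## x must still contain the doc_id/lemma pairs of the terms used in the biterms
# model <- BTM(x, k = 3, biterms = biterms)
```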

Example clustering cooccurrences of nouns/adjectives

```
library(data.table)
library(udpipe)

## Annotate text with parts of speech tags
data("brussels_reviews", package = "udpipe")
anno <- subset(brussels_reviews, language %in% "nl")
anno <- data.frame(doc_id = anno$id, text = anno$feedback, stringsAsFactors = FALSE)
anno <- udpipe(anno, "dutch", trace = 10)

## Get cooccurrences of nouns / adjectives and proper nouns
biterms <- as.data.table(anno)
biterms <- biterms[, cooccurrence(x = lemma,
                                  relevant = upos %in% c("NOUN", "PROPN", "ADJ"),
                                  skipgram = 2),
                   by = list(doc_id)]

## Build the model
set.seed(123456)
x <- subset(anno, upos %in% c("NOUN", "PROPN", "ADJ"))
x <- x[, c("doc_id", "lemma")]
model <- BTM(x, k = 5, beta = 0.01, iter = 2000, background = TRUE,
             biterms = biterms, trace = 100)
topicterms <- terms(model, top_n = 5)
topicterms
```

Example clustering dependency relationships

```
library(udpipe)
library(tm)
library(data.table)
data("brussels_reviews", package = "udpipe")
exclude <- stopwords("nl")

## Do annotation on Dutch text
anno <- subset(brussels_reviews, language %in% "nl")
anno <- data.frame(doc_id = anno$id, text = anno$feedback, stringsAsFactors = FALSE)
anno <- udpipe(anno, "dutch", trace = 10)
anno <- setDT(anno)
anno <- merge(anno, anno,
              by.x = c("doc_id", "paragraph_id", "sentence_id", "head_token_id"),
              by.y = c("doc_id", "paragraph_id", "sentence_id", "token_id"),
              all.x = TRUE, all.y = FALSE, suffixes = c("", "_parent"), sort = FALSE)

## Specify a set of relationships you are interested in (e.g. objects of a verb)
anno$relevant <- anno$dep_rel %in% c("obj") & !is.na(anno$lemma_parent)
biterms <- subset(anno, relevant == TRUE)
biterms <- data.frame(doc_id = biterms$doc_id, term1 = biterms$lemma,
                      term2 = biterms$lemma_parent, cooc = 1, stringsAsFactors = FALSE)
biterms <- subset(biterms, !term1 %in% exclude & !term2 %in% exclude)

## Put in x only terms which were used in the biterms object
## such that frequency stats of terms can be computed in BTM
anno <- anno[, keep := relevant | (token_id %in% head_token_id[relevant == TRUE]),
             by = list(doc_id, paragraph_id, sentence_id)]
x <- subset(anno, keep == TRUE, select = c("doc_id", "lemma"))
x <- subset(x, !lemma %in% exclude)

## Build the topic model
model <- BTM(data = x, biterms = biterms, k = 6, iter = 2000, background = FALSE, trace = 100)
topicterms <- terms(model, top_n = 5)
topicterms
```

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be

Owner

  • Name: bnosac
  • Login: bnosac
  • Kind: organization

  • Description: open sourced projects

GitHub Events

Total
  • Watch event: 1
  • Fork event: 1
Last Year
  • Watch event: 1
  • Fork event: 1

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 54
  • Total Committers: 2
  • Avg Commits per committer: 27.0
  • Development Distribution Score (DDS): 0.056
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • Jan Wijffels (j****s@b****e): 51 commits
  • Michael Chirico (c****m@g****m): 3 commits
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 16
  • Total pull requests: 2
  • Average time to close issues: 4 months
  • Average time to close pull requests: 1 day
  • Total issue authors: 11
  • Total pull request authors: 1
  • Average comments per issue: 4.94
  • Average comments per pull request: 1.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • jwijffels (5)
  • rdatasculptor (2)
  • isuytto (1)
  • hans-ekbrand (1)
  • adjoshi81 (1)
  • wanthanaj (1)
  • wisamb (1)
  • lhmcgrath (1)
  • Evelynhuang (1)
  • omstuhler (1)
  • YixiC94 (1)
Pull Request Authors
  • MichaelChirico (2)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cran 465 last-month
  • Total docker downloads: 20,476
  • Total dependent packages: 2
  • Total dependent repositories: 6
  • Total versions: 9
  • Total maintainers: 1
cran.r-project.org: BTM

Biterm Topic Models for Short Text

  • Versions: 9
  • Dependent Packages: 2
  • Dependent Repositories: 6
  • Downloads: 465 Last month
  • Docker Downloads: 20,476
Rankings
Stargazers count: 4.3%
Forks count: 4.8%
Average: 10.8%
Dependent repos count: 12.0%
Docker downloads count: 12.6%
Dependent packages count: 13.7%
Downloads: 17.7%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • Rcpp * imports
  • utils * imports
  • data.table * suggests
  • udpipe * suggests