doc2vec

Distributed Representations of Sentences and Documents

https://github.com/bnosac/doc2vec

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary

Keywords

doc2vec embeddings natural-language-processing paragraph2vec r-package word2vec
Last synced: 6 months ago

Repository

Distributed Representations of Sentences and Documents

Basic Info
  • Host: GitHub
  • Owner: bnosac
  • License: other
  • Language: C++
  • Default Branch: master
  • Size: 3.2 MB
Statistics
  • Stars: 48
  • Watchers: 3
  • Forks: 7
  • Open Issues: 9
  • Releases: 0
Topics
doc2vec embeddings natural-language-processing paragraph2vec r-package word2vec
Created over 5 years ago · Last pushed over 4 years ago
Metadata Files
Readme · License

README.md

doc2vec

This repository contains an R package for building Paragraph Vector models, also known as doc2vec models. You can train the distributed memory ('PV-DM') and the distributed bag of words ('PV-DBOW') models. In addition, the package lets you build a top2vec model, which clusters documents based on these embeddings.

  • doc2vec is based on the paper Distributed Representations of Sentences and Documents by Mikolov et al., while top2vec is based on the paper Distributed Representations of Topics by Angelov
  • The doc2vec part is an Rcpp wrapper around https://github.com/hiyijian/doc2vec
  • The package allows you
    • to train paragraph embeddings (also known as document embeddings) on character data or data in a text file
    • to use the embeddings to find similar documents, paragraphs, sentences or words
    • to cluster document embeddings using top2vec
  • Note: for word vectors in R, see the package https://github.com/bnosac/word2vec; for Starspace embeddings, see the package https://github.com/bnosac/ruimtehol

Installation

  • For regular users: install the package from your local CRAN mirror with install.packages("doc2vec")
  • For the development version of this package: remotes::install_github("bnosac/doc2vec"); see the combined snippet below
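
Both commands, taken together as a ready-to-run snippet (the remotes install line is only needed if that package is not yet available):

```r
## Released version from CRAN
install.packages("doc2vec")

## Development version from GitHub; assumes the remotes package is available
install.packages("remotes")
remotes::install_github("bnosac/doc2vec")
```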

Consult the documentation of the functions:

```r
help(package = "doc2vec")
```

Example on doc2vec

  • Take some data and standardise it a bit.
    • Make sure it has columns doc_id and text
    • Make sure that each text has less than 1000 words (words are considered separated by a single space)
    • Make sure that each text does not contain newline symbols

```r
library(doc2vec)
library(tokenizers.bpe)
library(udpipe)
data(belgium_parliament, package = "tokenizers.bpe")
x <- subset(belgium_parliament, language %in% "dutch")
x <- data.frame(doc_id = sprintf("doc_%s", 1:nrow(x)),
                text = x$text,
                stringsAsFactors = FALSE)
x$text   <- tolower(x$text)
x$text   <- gsub("[^[:alpha:]]", " ", x$text)
x$text   <- gsub("[[:space:]]+", " ", x$text)
x$text   <- trimws(x$text)
x$nwords <- txt_count(x$text, pattern = " ")
x <- subset(x, nwords < 1000 & nchar(text) > 0)
```

  • Build the model

```r
## Low-dimensional model using PV-DM, low number of iterations, for speed and display purposes
model <- paragraph2vec(x = x, type = "PV-DM", dim = 5, iter = 3,
                       min_count = 5, lr = 0.05, threads = 1)
str(model)
```

```
List of 3
 $ model  :<externalptr> 
 $ data   :List of 4
  ..$ file        : chr "C:\Users\Jan\AppData\Local\Temp\Rtmpk9Npjg\textspace_1c446bffa0e.txt"
  ..$ n           : num 170469
  ..$ n_vocabulary: num 3867
  ..$ n_docs      : num 1000
 $ control:List of 9
  ..$ min_count: int 5
  ..$ dim      : int 5
  ..$ window   : int 5
  ..$ iter     : int 3
  ..$ lr       : num 0.05
  ..$ skipgram : logi FALSE
  ..$ hs       : int 0
  ..$ negative : int 5
  ..$ sample   : num 0.001
 - attr(*, "class")= chr "paragraph2vec_trained"
```

```r
## More realistic model
model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 100, iter = 20,
                       min_count = 5, lr = 0.05, threads = 4)
```

  • Get the embedding of the documents or words and get the vocabulary

```r
embedding <- as.matrix(model, which = "words")
embedding <- as.matrix(model, which = "docs")
vocab     <- summary(model, which = "docs")
vocab     <- summary(model, which = "words")
```
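
To compare these embeddings directly, cosine similarity can be computed with base R; a minimal sketch (the normalise helper below is our own, not part of the package):

```r
## Minimal sketch: cosine similarity of every document to doc_1, in base R.
## The normalise() helper is illustrative and not part of the doc2vec package.
embedding  <- as.matrix(model, which = "docs")
normalise  <- function(m) m / sqrt(rowSums(m^2))
similarity <- normalise(embedding) %*% t(normalise(embedding["doc_1", , drop = FALSE]))
head(sort(similarity[, 1], decreasing = TRUE))
```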

  • Get the embedding of specific documents / words or sentences.

```r
sentences <- list(
  sent1 = c("geld", "diabetes"),
  sent2 = c("frankrijk", "koning", "proximus"))
embedding <- predict(model, newdata = sentences, type = "embedding")
embedding <- predict(model, newdata = c("geld", "koning"), type = "embedding", which = "words")
embedding <- predict(model, newdata = c("doc_1", "doc_10", "doc_3"), type = "embedding", which = "docs")
ncol(embedding)
```

```
[1] 100
```

```r
embedding[, 1:4]
```

```
             [,1]        [,2]       [,3]        [,4]
doc_1  0.05721277 -0.10298843  0.1089350 -0.03075439
doc_10 0.09553983  0.05211980 -0.0513489 -0.11847925
doc_3  0.08008177 -0.03324692  0.1563442  0.06585038
```

  • Get similar documents or words when providing sentences, documents or words

```r
nn <- predict(model, newdata = c("proximus", "koning"), type = "nearest", which = "word2word", top_n = 5)
nn
```

```
[[1]]
     term1              term2 similarity rank
1 proximus telefoontoestellen  0.5357178    1
2 proximus            belfius  0.5169221    2
3 proximus                ceo  0.4839031    3
4 proximus            klanten  0.4819543    4
5 proximus               taal  0.4590944    5

[[2]]
   term1          term2 similarity rank
1 koning     ministerie  0.5615162    1
2 koning verplaatsingen  0.5484987    2
3 koning        familie  0.4911003    3
4 koning       grondwet  0.4871097    4
5 koning       gedragen  0.4694150    5
```

```r
nn <- predict(model, newdata = c("proximus", "koning"), type = "nearest", which = "word2doc", top_n = 5)
nn
```

```
[[1]]
     term1   term2 similarity rank
1 proximus doc_105  0.6684639    1
2 proximus doc_863  0.5917463    2
3 proximus doc_186  0.5233522    3
4 proximus doc_620  0.4919243    4
5 proximus doc_862  0.4619178    5

[[2]]
   term1   term2 similarity rank
1 koning  doc_44  0.6686417    1
2 koning  doc_45  0.5616031    2
3 koning doc_583  0.5379452    3
4 koning doc_943  0.4855201    4
5 koning doc_797  0.4573555    5
```

```r
nn <- predict(model, newdata = c("doc_198", "doc_285"), type = "nearest", which = "doc2doc", top_n = 5)
nn
```

```
[[1]]
    term1   term2 similarity rank
1 doc_198 doc_343  0.5522854    1
2 doc_198 doc_899  0.4902798    2
3 doc_198 doc_983  0.4847047    3
4 doc_198 doc_642  0.4829021    4
5 doc_198 doc_336  0.4674844    5

[[2]]
    term1   term2 similarity rank
1 doc_285 doc_319  0.5318567    1
2 doc_285 doc_286  0.5100293    2
3 doc_285 doc_113  0.5056069    3
4 doc_285 doc_526  0.4840761    4
5 doc_285 doc_488  0.4805686    5
```

```r
sentences <- list(
  sent1 = c("geld", "frankrijk"),
  sent2 = c("proximus", "onderhandelen"))
nn <- predict(model, newdata = sentences, type = "nearest", which = "sent2doc", top_n = 5)
nn
```

```
$sent1
  term1   term2 similarity rank
1 sent1 doc_742  0.4830917    1
2 sent1 doc_151  0.4340138    2
3 sent1 doc_825  0.4263285    3
4 sent1 doc_740  0.4059283    4
5 sent1 doc_776  0.4024554    5

$sent2
  term1   term2 similarity rank
1 sent2 doc_105  0.5497447    1
2 sent2 doc_863  0.5061581    2
3 sent2 doc_862  0.4973840    3
4 sent2 doc_620  0.4793786    4
5 sent2 doc_186  0.4755909    5
```

```r
sentences <- strsplit(setNames(x$text, x$doc_id), split = " ")
nn <- predict(model, newdata = sentences, type = "nearest", which = "sent2doc", top_n = 5)
```
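
Because these sentences are just the documents themselves, one possible sanity check (a sketch, assuming predict keeps the names and the list-of-data.frames shape shown above) is to count how often a document retrieves itself as its own top match:

```r
## Sketch of a sanity check, assuming nn is a named list of data.frames as above:
## how often is a document its own nearest neighbour?
self_hit <- sapply(names(nn), function(doc) nn[[doc]]$term2[1] == doc)
mean(self_hit)
```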

Example on top2vec

Top2vec clusters documents semantically and finds the most semantically relevant terms for each topic.

```r
library(doc2vec)
library(word2vec)
library(uwot)
library(dbscan)
data(be_parliament_2020, package = "doc2vec")
x      <- data.frame(doc_id = be_parliament_2020$doc_id,
                     text   = be_parliament_2020$text_nl,
                     stringsAsFactors = FALSE)
x$text <- txt_clean_word2vec(x$text)
x      <- subset(x, txt_count_words(text) < 1000)

d2v    <- paragraph2vec(x, type = "PV-DBOW", dim = 50, lr = 0.05, iter = 10,
                        window = 15, hs = TRUE, negative = 0,
                        sample = 0.00001, min_count = 5, threads = 1)
model  <- top2vec(d2v,
                  control.dbscan = list(minPts = 50),
                  control.umap = list(n_neighbors = 15L, n_components = 3),
                  umap = tumap, trace = TRUE)
info   <- summary(model, top_n = 7)
info$topwords
```
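
From here you can inspect the result; a sketch assuming info$topwords contains one element per discovered topic, as produced by summary above:

```r
## Sketch, assuming info$topwords holds one element per discovered topic
length(info$topwords)           # number of topics found by dbscan
lapply(info$topwords, head, 3)  # peek at the first top terms of each topic
```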

Note

The package has some hard limits, namely:

  • Each document should contain less than 1000 words
  • Each word has a maximum length of 100 letters
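
If your corpus violates the word-count limit, you can truncate texts yourself before training; a minimal sketch in base R (truncate_words is a hypothetical helper, not part of the package):

```r
## Hypothetical helper (not part of the doc2vec package): keep at most
## 999 space-separated words per document so the 1000-word limit holds.
truncate_words <- function(text, max_words = 999) {
  sapply(strsplit(text, split = " "), function(words) {
    paste(head(words, max_words), collapse = " ")
  })
}
x$text <- truncate_words(x$text)
```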

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be

Owner

  • Name: bnosac
  • Login: bnosac
  • Kind: organization

  • Description: open sourced projects

GitHub Events

Total
  • Watch event: 2
  • Issue comment event: 2
  • Fork event: 1
Last Year
  • Watch event: 2
  • Issue comment event: 2
  • Fork event: 1

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 83
  • Total Committers: 1
  • Avg Commits per committer: 83.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Jan Wijffels j****s@b****e 83

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 24
  • Total pull requests: 1
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 14 days
  • Total issue authors: 8
  • Total pull request authors: 1
  • Average comments per issue: 2.92
  • Average comments per pull request: 1.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 3.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • jwijffels (16)
  • michalovadek (2)
  • mlinegar (1)
  • Cdk29 (1)
  • Ingolifs (1)
  • dominiqueemmanuel (1)
  • jusme326 (1)
  • dmhenke (1)
Pull Request Authors
  • jwijffels (1)

Packages

  • Total packages: 2
  • Total downloads: 867 last month (CRAN)
  • Total docker downloads: 8
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 0
    (may contain duplicates)
  • Total versions: 4
  • Total maintainers: 1
cran.r-project.org: doc2vec

Distributed Representations of Sentences, Documents and Topics

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 867 last month
  • Docker Downloads: 8
Rankings
Stargazers count: 8.7%
Forks count: 11.3%
Average: 21.4%
Downloads: 21.8%
Dependent packages count: 29.8%
Dependent repos count: 35.5%
Maintainers (1)
Last synced: 7 months ago
conda-forge.org: r-doc2vec
  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 34.0%
Stargazers count: 40.7%
Average: 43.4%
Forks count: 47.7%
Dependent packages count: 51.2%
Last synced: 6 months ago

Dependencies

DESCRIPTION (CRAN)
  • Depends: R (>= 2.10)
  • Imports: Rcpp (>= 0.11.5), stats, utils
  • Suggests: dbscan, tokenizers.bpe, udpipe (>= 0.8), uwot, word2vec (>= 0.3.3)