Science Score: 10.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org)
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 11.7%, to scientific vocabulary)
Keywords
Repository
Distributed Representations of Sentences and Documents
Basic Info
Statistics
- Stars: 48
- Watchers: 3
- Forks: 7
- Open Issues: 9
- Releases: 0
Topics
Metadata Files
README.md
doc2vec
This repository contains an R package for building Paragraph Vector models, also known as doc2vec models. You can train the distributed memory ('PV-DM') and the distributed bag of words ('PV-DBOW') models.
In addition, it allows you to build a top2vec model, which clusters documents based on these embeddings.
- doc2vec is based on the paper Distributed Representations of Sentences and Documents by Mikolov et al., while top2vec is based on the paper Distributed Representations of Topics by Angelov
- The doc2vec part is an Rcpp wrapper around https://github.com/hiyijian/doc2vec
- The package allows you to:
  - train paragraph embeddings (also known as document embeddings) on character data or data in a text file
  - use the embeddings to find similar documents, paragraphs, sentences or words
  - cluster document embeddings using top2vec
- Note: for word vectors in R, look at the package https://github.com/bnosac/word2vec; for Starspace embeddings, look at the package https://github.com/bnosac/ruimtehol
Installation
- For regular users, install the package from your local CRAN mirror:

```r
install.packages("doc2vec")
```

- For installing the development version of this package:

```r
remotes::install_github("bnosac/doc2vec")
```
Consult the documentation of the functions:

```r
help(package = "doc2vec")
```
Example on doc2vec
- Take some data and standardise it a bit.
- Make sure it has columns doc_id and text
- Make sure that each text has fewer than 1000 words (words are considered to be separated by a single space)
- Make sure that each text does not contain newline symbols
```r
library(doc2vec)
library(tokenizers.bpe)
library(udpipe)
data(belgium_parliament, package = "tokenizers.bpe")
x <- subset(belgium_parliament, language %in% "dutch")
x <- data.frame(doc_id = sprintf("doc_%s", 1:nrow(x)),
                text = x$text,
                stringsAsFactors = FALSE)
x$text   <- tolower(x$text)
x$text   <- gsub("[^[:alpha:]]", " ", x$text)
x$text   <- gsub("[[:space:]]+", " ", x$text)
x$text   <- trimws(x$text)
x$nwords <- txt_count(x$text, pattern = " ")
x <- subset(x, nwords < 1000 & nchar(text) > 0)
```
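A quick sanity check that the constraints listed above now hold (plain base R, using the nwords column just computed):

```r
## Verify the cleaned data respects the constraints before training
stopifnot(all(x$nwords < 1000))                     # fewer than 1000 words per text
stopifnot(!any(grepl("\n", x$text, fixed = TRUE)))  # no newline symbols left
stopifnot(anyDuplicated(x$doc_id) == 0)             # doc_id uniquely identifies texts
```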
- Build the model
```r
## Low-dimensional model using PV-DM, with a low number of iterations, for speed and display purposes
model <- paragraph2vec(x = x, type = "PV-DM", dim = 5, iter = 3,
min_count = 5, lr = 0.05, threads = 1)
str(model)
```
```
List of 3
$ model :<externalptr>
$ data :List of 4
..$ file : chr "C:\Users\Jan\AppData\Local\Temp\Rtmpk9Npjg\textspace_1c446bffa0e.txt"
..$ n : num 170469
..$ n_vocabulary: num 3867
..$ n_docs : num 1000
$ control:List of 9
..$ min_count: int 5
..$ dim : int 5
..$ window : int 5
..$ iter : int 3
..$ lr : num 0.05
..$ skipgram : logi FALSE
..$ hs : int 0
..$ negative : int 5
..$ sample : num 0.001
- attr(*, "class")= chr "paragraph2vec_trained"
```
```r
## More realistic model
model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 100, iter = 20,
                       min_count = 5, lr = 0.05, threads = 4)
```
- Get the embedding of the documents or words and get the vocabulary
```r
embedding <- as.matrix(model, which = "words")
embedding <- as.matrix(model, which = "docs")
vocab     <- summary(model, which = "docs")
vocab     <- summary(model, which = "words")
```
- Get the embedding of specific documents / words or sentences.
```r
sentences <- list(
  sent1 = c("geld", "diabetes"),
  sent2 = c("frankrijk", "koning", "proximus"))
embedding <- predict(model, newdata = sentences, type = "embedding")
embedding <- predict(model, newdata = c("geld", "koning"), type = "embedding", which = "words")
embedding <- predict(model, newdata = c("doc_1", "doc_10", "doc_3"), type = "embedding", which = "docs")
ncol(embedding)
```
```
[1] 100
```
```r
embedding[, 1:4]
```
```
             [,1]        [,2]       [,3]        [,4]
doc_1  0.05721277 -0.10298843  0.1089350 -0.03075439
doc_10 0.09553983  0.05211980 -0.0513489 -0.11847925
doc_3  0.08008177 -0.03324692  0.1563442  0.06585038
```
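The nearest-neighbour queries shown below operate on exactly these vectors. A minimal base-R sketch of how such similarity scores can be reproduced from the embedding matrix, assuming cosine similarity between the stored vectors:

```r
## Cosine similarity between two vectors, plain base R
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))

docs <- as.matrix(model, which = "docs")
cosine(docs["doc_1", ], docs["doc_10", ])

## Rank all documents against doc_1 and keep the 5 closest, excluding itself
sims <- apply(docs, 1, cosine, b = docs["doc_1", ])
head(sort(sims[names(sims) != "doc_1"], decreasing = TRUE), n = 5)
```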
- Get similar documents or words when providing sentences, documents or words
```r
nn <- predict(model, newdata = c("proximus", "koning"), type = "nearest", which = "word2word", top_n = 5)
nn
```
```
[[1]]
term1 term2 similarity rank
1 proximus telefoontoestellen 0.5357178 1
2 proximus belfius 0.5169221 2
3 proximus ceo 0.4839031 3
4 proximus klanten 0.4819543 4
5 proximus taal 0.4590944 5
[[2]]
term1 term2 similarity rank
1 koning ministerie 0.5615162 1
2 koning verplaatsingen 0.5484987 2
3 koning familie 0.4911003 3
4 koning grondwet 0.4871097 4
5 koning gedragen 0.4694150 5
```
```r
nn <- predict(model, newdata = c("proximus", "koning"), type = "nearest", which = "word2doc", top_n = 5)
nn
```
```
[[1]]
term1 term2 similarity rank
1 proximus doc_105 0.6684639 1
2 proximus doc_863 0.5917463 2
3 proximus doc_186 0.5233522 3
4 proximus doc_620 0.4919243 4
5 proximus doc_862 0.4619178 5
[[2]]
term1 term2 similarity rank
1 koning doc_44 0.6686417 1
2 koning doc_45 0.5616031 2
3 koning doc_583 0.5379452 3
4 koning doc_943 0.4855201 4
5 koning doc_797 0.4573555 5
```
```r
nn <- predict(model, newdata = c("doc_198", "doc_285"), type = "nearest", which = "doc2doc", top_n = 5)
nn
```
```
[[1]]
term1 term2 similarity rank
1 doc_198 doc_343 0.5522854 1
2 doc_198 doc_899 0.4902798 2
3 doc_198 doc_983 0.4847047 3
4 doc_198 doc_642 0.4829021 4
5 doc_198 doc_336 0.4674844 5
[[2]]
term1 term2 similarity rank
1 doc_285 doc_319 0.5318567 1
2 doc_285 doc_286 0.5100293 2
3 doc_285 doc_113 0.5056069 3
4 doc_285 doc_526 0.4840761 4
5 doc_285 doc_488 0.4805686 5
```
```r
sentences <- list(
  sent1 = c("geld", "frankrijk"),
  sent2 = c("proximus", "onderhandelen"))
nn <- predict(model, newdata = sentences, type = "nearest", which = "sent2doc", top_n = 5)
nn
```
```
$sent1
term1 term2 similarity rank
1 sent1 doc_742 0.4830917 1
2 sent1 doc_151 0.4340138 2
3 sent1 doc_825 0.4263285 3
4 sent1 doc_740 0.4059283 4
5 sent1 doc_776 0.4024554 5
$sent2
term1 term2 similarity rank
1 sent2 doc_105 0.5497447 1
2 sent2 doc_863 0.5061581 2
3 sent2 doc_862 0.4973840 3
4 sent2 doc_620 0.4793786 4
5 sent2 doc_186 0.4755909 5
```
```r
sentences <- strsplit(setNames(x$text, x$doc_id), split = " ")
nn <- predict(model, newdata = sentences, type = "nearest", which = "sent2doc", top_n = 5)
```
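A trained model can be saved to disk and reloaded in a later session. A short sketch, assuming the package mirrors the save/load interface of the companion word2vec package with write.paragraph2vec() and read.paragraph2vec(); check help(package = "doc2vec") for the exact signatures:

```r
## Persist the model in binary format and read it back in
path  <- file.path(tempdir(), "doc2vec.bin")
write.paragraph2vec(model, file = path)
model <- read.paragraph2vec(file = path)
```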
Example on top2vec
Top2vec clusters documents semantically and finds the most semantically relevant terms for each topic.

```r
library(doc2vec)
library(word2vec)
library(uwot)
library(dbscan)
data(be_parliament_2020, package = "doc2vec")
x      <- data.frame(doc_id = be_parliament_2020$doc_id,
                     text   = be_parliament_2020$text_nl,
                     stringsAsFactors = FALSE)
x$text <- txt_clean_word2vec(x$text)
x      <- subset(x, txt_count_words(text) < 1000)
d2v    <- paragraph2vec(x, type = "PV-DBOW", dim = 50, lr = 0.05, iter = 10,
                        window = 15, hs = TRUE, negative = 0,
                        sample = 0.00001, min_count = 5, threads = 1)
model  <- top2vec(d2v,
                  control.dbscan = list(minPts = 50),
                  control.umap   = list(n_neighbors = 15L, n_components = 3),
                  umap = tumap, trace = TRUE)
info   <- summary(model, top_n = 7)
info$topwords
```
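The summary call above collects, for each detected topic, the terms closest to the topic centre. A short usage note, assuming topwords is a list with one entry per topic, as the top_n = 7 call suggests:

```r
## Inspect the terms of the first detected topic and count the topics found
info$topwords[[1]]
length(info$topwords)
```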
Note
The package has some hard limits, namely:
- Each document should contain fewer than 1000 words
- Each word has a maximum length of 100 characters
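If a corpus contains longer texts, they can be split into chunks that respect the first limit before training. A minimal base-R sketch; the chunk_text helper below is hypothetical and not part of the package:

```r
## Hypothetical helper: split a space-separated text into chunks of at most
## max_words words, so each chunk fits the per-document limit
chunk_text <- function(text, max_words = 999) {
  words  <- strsplit(text, split = " ")[[1]]
  groups <- ceiling(seq_along(words) / max_words)
  sapply(split(words, groups), paste, collapse = " ")
}

chunks <- chunk_text(x$text[1])
length(chunks)  # number of sub-documents created from the first text
```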
Support in text mining
Need support in text mining? Contact BNOSAC: http://www.bnosac.be
Owner
- Name: bnosac
- Login: bnosac
- Kind: organization
- Website: www.bnosac.be
- Repositories: 28
- Profile: https://github.com/bnosac
GitHub Events
Total
- Watch event: 2
- Issue comment event: 2
- Fork event: 1
Last Year
- Watch event: 2
- Issue comment event: 2
- Fork event: 1
Committers
Last synced: over 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Jan Wijffels | j****s@b****e | 83 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 24
- Total pull requests: 1
- Average time to close issues: about 1 month
- Average time to close pull requests: 14 days
- Total issue authors: 8
- Total pull request authors: 1
- Average comments per issue: 2.92
- Average comments per pull request: 1.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 3.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- jwijffels (16)
- michalovadek (2)
- mlinegar (1)
- Cdk29 (1)
- Ingolifs (1)
- dominiqueemmanuel (1)
- jusme326 (1)
- dmhenke (1)
Pull Request Authors
- jwijffels (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 2
- Total downloads: cran 867 last month
- Total docker downloads: 8
- Total dependent packages: 0 (may contain duplicates)
- Total dependent repositories: 0 (may contain duplicates)
- Total versions: 4
- Total maintainers: 1
cran.r-project.org: doc2vec
Distributed Representations of Sentences, Documents and Topics
- Homepage: https://github.com/bnosac/doc2vec
- Documentation: http://cran.r-project.org/web/packages/doc2vec/doc2vec.pdf
- License: MIT + file LICENSE
- Latest release: 0.2.0 (published almost 5 years ago)
Rankings
Maintainers (1)
conda-forge.org: r-doc2vec
- Homepage: https://github.com/bnosac/doc2vec
- License: MIT
- Latest release: 0.2.0 (published over 4 years ago)
Rankings
Dependencies
- R >= 2.10 depends
- Rcpp >= 0.11.5 imports
- stats * imports
- utils * imports
- dbscan * suggests
- tokenizers.bpe * suggests
- udpipe >= 0.8 suggests
- uwot * suggests
- word2vec >= 0.3.3 suggests