doc2vec

Distributed Representations of Sentences and Documents

https://github.com/bnosac/doc2vec

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary

Keywords

doc2vec embeddings natural-language-processing paragraph2vec r-package word2vec
Last synced: 6 months ago

Repository

Distributed Representations of Sentences and Documents

Basic Info
  • Host: GitHub
  • Owner: bnosac
  • License: other
  • Language: C++
  • Default Branch: master
  • Size: 3.2 MB
Statistics
  • Stars: 48
  • Watchers: 3
  • Forks: 7
  • Open Issues: 9
  • Releases: 0
Topics
doc2vec embeddings natural-language-processing paragraph2vec r-package word2vec
Created over 5 years ago · Last pushed over 4 years ago
Metadata Files
Readme · License

README.md

doc2vec

This repository contains an R package for building Paragraph Vector models, also known as doc2vec models. You can train the distributed memory ('PV-DM') and the distributed bag of words ('PV-DBOW') models. In addition, the package lets you build a top2vec model, which clusters documents based on these embeddings.

  • doc2vec is based on the paper Distributed Representations of Sentences and Documents by Mikolov et al., while top2vec is based on the paper Distributed Representations of Topics by Angelov
  • The doc2vec part is an Rcpp wrapper around https://github.com/hiyijian/doc2vec
  • The package allows you
    • to train paragraph embeddings (also known as document embeddings) on character data or data in a text file
    • to use the embeddings to find similar documents, paragraphs, sentences or words
    • to cluster document embeddings using top2vec
  • Note: for word vectors in R, see the package https://github.com/bnosac/word2vec; for Starspace embeddings, see the package https://github.com/bnosac/ruimtehol

Installation

  • For regular users: install the package from your local CRAN mirror with install.packages("doc2vec")
  • For the development version of this package: remotes::install_github("bnosac/doc2vec"); see the combined snippet below
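
Both commands, taken together as a ready-to-run snippet (the remotes install line is only needed if that package is not yet available):

```r
## Released version from CRAN
install.packages("doc2vec")

## Development version from GitHub; assumes the remotes package is available
install.packages("remotes")
remotes::install_github("bnosac/doc2vec")
```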

Consult the documentation of the functions:

```r
help(package = "doc2vec")
```

Example on doc2vec

  • Take some data and standardise it a bit.
    • Make sure it has columns doc_id and text
    • Make sure that each text has less than 1000 words (words are considered separated by a single space)
    • Make sure that each text does not contain newline symbols

```r
library(doc2vec)
library(tokenizers.bpe)
library(udpipe)
data(belgium_parliament, package = "tokenizers.bpe")
x <- subset(belgium_parliament, language %in% "dutch")
x <- data.frame(doc_id = sprintf("doc_%s", 1:nrow(x)),
                text = x$text,
                stringsAsFactors = FALSE)
x$text   <- tolower(x$text)
x$text   <- gsub("[^[:alpha:]]", " ", x$text)
x$text   <- gsub("[[:space:]]+", " ", x$text)
x$text   <- trimws(x$text)
x$nwords <- txt_count(x$text, pattern = " ")
x <- subset(x, nwords < 1000 & nchar(text) > 0)
```

  • Build the model

```r
## Low-dimensional model using PV-DM, low number of iterations, for speed and display purposes
model <- paragraph2vec(x = x, type = "PV-DM", dim = 5, iter = 3,
                       min_count = 5, lr = 0.05, threads = 1)
str(model)
```

```
List of 3
 $ model  :<externalptr> 
 $ data   :List of 4
  ..$ file        : chr "C:\Users\Jan\AppData\Local\Temp\Rtmpk9Npjg\textspace_1c446bffa0e.txt"
  ..$ n           : num 170469
  ..$ n_vocabulary: num 3867
  ..$ n_docs      : num 1000
 $ control:List of 9
  ..$ min_count: int 5
  ..$ dim      : int 5
  ..$ window   : int 5
  ..$ iter     : int 3
  ..$ lr       : num 0.05
  ..$ skipgram : logi FALSE
  ..$ hs       : int 0
  ..$ negative : int 5
  ..$ sample   : num 0.001
 - attr(*, "class")= chr "paragraph2vec_trained"
```

```r
## More realistic model
model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 100, iter = 20,
                       min_count = 5, lr = 0.05, threads = 4)
```

  • Get the embedding of the documents or words and get the vocabulary

```r
embedding <- as.matrix(model, which = "words")
embedding <- as.matrix(model, which = "docs")
vocab     <- summary(model, which = "docs")
vocab     <- summary(model, which = "words")
```
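
To compare these embeddings directly, cosine similarity can be computed with base R; a minimal sketch (the normalise helper below is our own, not part of the package):

```r
## Minimal sketch: cosine similarity of every document to doc_1, in base R.
## The normalise() helper is illustrative and not part of the doc2vec package.
embedding  <- as.matrix(model, which = "docs")
normalise  <- function(m) m / sqrt(rowSums(m^2))
similarity <- normalise(embedding) %*% t(normalise(embedding["doc_1", , drop = FALSE]))
head(sort(similarity[, 1], decreasing = TRUE))
```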

  • Get the embedding of specific documents / words or sentences.

```r
sentences <- list(
  sent1 = c("geld", "diabetes"),
  sent2 = c("frankrijk", "koning", "proximus"))
embedding <- predict(model, newdata = sentences, type = "embedding")
embedding <- predict(model, newdata = c("geld", "koning"), type = "embedding", which = "words")
embedding <- predict(model, newdata = c("doc_1", "doc_10", "doc_3"), type = "embedding", which = "docs")
ncol(embedding)
```

```
[1] 100
```

```r
embedding[, 1:4]
```

```
             [,1]        [,2]       [,3]        [,4]
doc_1  0.05721277 -0.10298843  0.1089350 -0.03075439
doc_10 0.09553983  0.05211980 -0.0513489 -0.11847925
doc_3  0.08008177 -0.03324692  0.1563442  0.06585038
```

  • Get similar documents or words when providing sentences, documents or words

```r
nn <- predict(model, newdata = c("proximus", "koning"), type = "nearest", which = "word2word", top_n = 5)
nn
```

```
[[1]]
     term1              term2 similarity rank
1 proximus telefoontoestellen  0.5357178    1
2 proximus            belfius  0.5169221    2
3 proximus                ceo  0.4839031    3
4 proximus            klanten  0.4819543    4
5 proximus               taal  0.4590944    5

[[2]]
   term1          term2 similarity rank
1 koning     ministerie  0.5615162    1
2 koning verplaatsingen  0.5484987    2
3 koning        familie  0.4911003    3
4 koning       grondwet  0.4871097    4
5 koning       gedragen  0.4694150    5
```

```r
nn <- predict(model, newdata = c("proximus", "koning"), type = "nearest", which = "word2doc", top_n = 5)
nn
```

```
[[1]]
     term1   term2 similarity rank
1 proximus doc_105  0.6684639    1
2 proximus doc_863  0.5917463    2
3 proximus doc_186  0.5233522    3
4 proximus doc_620  0.4919243    4
5 proximus doc_862  0.4619178    5

[[2]]
   term1   term2 similarity rank
1 koning  doc_44  0.6686417    1
2 koning  doc_45  0.5616031    2
3 koning doc_583  0.5379452    3
4 koning doc_943  0.4855201    4
5 koning doc_797  0.4573555    5
```

```r
nn <- predict(model, newdata = c("doc_198", "doc_285"), type = "nearest", which = "doc2doc", top_n = 5)
nn
```

```
[[1]]
    term1   term2 similarity rank
1 doc_198 doc_343  0.5522854    1
2 doc_198 doc_899  0.4902798    2
3 doc_198 doc_983  0.4847047    3
4 doc_198 doc_642  0.4829021    4
5 doc_198 doc_336  0.4674844    5

[[2]]
    term1   term2 similarity rank
1 doc_285 doc_319  0.5318567    1
2 doc_285 doc_286  0.5100293    2
3 doc_285 doc_113  0.5056069    3
4 doc_285 doc_526  0.4840761    4
5 doc_285 doc_488  0.4805686    5
```

```r
sentences <- list(
  sent1 = c("geld", "frankrijk"),
  sent2 = c("proximus", "onderhandelen"))
nn <- predict(model, newdata = sentences, type = "nearest", which = "sent2doc", top_n = 5)
nn
```

```
$sent1
  term1   term2 similarity rank
1 sent1 doc_742  0.4830917    1
2 sent1 doc_151  0.4340138    2
3 sent1 doc_825  0.4263285    3
4 sent1 doc_740  0.4059283    4
5 sent1 doc_776  0.4024554    5

$sent2
  term1   term2 similarity rank
1 sent2 doc_105  0.5497447    1
2 sent2 doc_863  0.5061581    2
3 sent2 doc_862  0.4973840    3
4 sent2 doc_620  0.4793786    4
5 sent2 doc_186  0.4755909    5
```

```r
sentences <- strsplit(setNames(x$text, x$doc_id), split = " ")
nn <- predict(model, newdata = sentences, type = "nearest", which = "sent2doc", top_n = 5)
```
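
Because these sentences are just the documents themselves, one possible sanity check (a sketch, assuming predict keeps the names and the list-of-data.frames shape shown above) is to count how often a document retrieves itself as its own top match:

```r
## Sketch of a sanity check, assuming nn is a named list of data.frames as above:
## how often is a document its own nearest neighbour?
self_hit <- sapply(names(nn), function(doc) nn[[doc]]$term2[1] == doc)
mean(self_hit)
```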

Example on top2vec

Top2vec clusters documents semantically and finds the most semantically relevant terms for each topic.

```r
library(doc2vec)
library(word2vec)
library(uwot)
library(dbscan)
data(be_parliament_2020, package = "doc2vec")
x      <- data.frame(doc_id = be_parliament_2020$doc_id,
                     text   = be_parliament_2020$text_nl,
                     stringsAsFactors = FALSE)
x$text <- txt_clean_word2vec(x$text)
x      <- subset(x, txt_count_words(text) < 1000)

d2v    <- paragraph2vec(x, type = "PV-DBOW", dim = 50, lr = 0.05, iter = 10,
                        window = 15, hs = TRUE, negative = 0,
                        sample = 0.00001, min_count = 5, threads = 1)
model  <- top2vec(d2v,
                  control.dbscan = list(minPts = 50),
                  control.umap = list(n_neighbors = 15L, n_components = 3),
                  umap = tumap, trace = TRUE)
info   <- summary(model, top_n = 7)
info$topwords
```
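
From here you can inspect the result; a sketch assuming info$topwords contains one element per discovered topic, as produced by summary above:

```r
## Sketch, assuming info$topwords holds one element per discovered topic
length(info$topwords)           # number of topics found by dbscan
lapply(info$topwords, head, 3)  # peek at the first top terms of each topic
```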

Note

The package has some hard limits, namely:

  • Each document should contain less than 1000 words
  • Each word has a maximum length of 100 letters
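
If your corpus violates the word-count limit, you can truncate texts yourself before training; a minimal sketch in base R (truncate_words is a hypothetical helper, not part of the package):

```r
## Hypothetical helper (not part of the doc2vec package): keep at most
## 999 space-separated words per document so the 1000-word limit holds.
truncate_words <- function(text, max_words = 999) {
  sapply(strsplit(text, split = " "), function(words) {
    paste(head(words, max_words), collapse = " ")
  })
}
x$text <- truncate_words(x$text)
```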

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be

Owner

  • Name: bnosac
  • Login: bnosac
  • Kind: organization

  • Description: open sourced projects

GitHub Events

Total
  • Watch event: 2
  • Issue comment event: 2
  • Fork event: 1
Last Year
  • Watch event: 2
  • Issue comment event: 2
  • Fork event: 1

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 83
  • Total Committers: 1
  • Avg Commits per committer: 83.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Jan Wijffels j****s@b****e 83

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 24
  • Total pull requests: 1
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 14 days
  • Total issue authors: 8
  • Total pull request authors: 1
  • Average comments per issue: 2.92
  • Average comments per pull request: 1.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 3.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • jwijffels (16)
  • michalovadek (2)
  • mlinegar (1)
  • Cdk29 (1)
  • Ingolifs (1)
  • dominiqueemmanuel (1)
  • jusme326 (1)
  • dmhenke (1)
Pull Request Authors
  • jwijffels (1)

Packages

  • Total packages: 2
  • Total downloads: 867 last month (CRAN)
  • Total docker downloads: 8
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 0
    (may contain duplicates)
  • Total versions: 4
  • Total maintainers: 1
cran.r-project.org: doc2vec

Distributed Representations of Sentences, Documents and Topics

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 867 last month
  • Docker Downloads: 8
Rankings
Stargazers count: 8.7%
Forks count: 11.3%
Average: 21.4%
Downloads: 21.8%
Dependent packages count: 29.8%
Dependent repos count: 35.5%
Maintainers (1)
Last synced: 7 months ago
conda-forge.org: r-doc2vec
  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 34.0%
Stargazers count: 40.7%
Average: 43.4%
Forks count: 47.7%
Dependent packages count: 51.2%
Last synced: 6 months ago

Dependencies

DESCRIPTION (CRAN)
  • Depends: R (>= 2.10)
  • Imports: Rcpp (>= 0.11.5), stats, utils
  • Suggests: dbscan, tokenizers.bpe, udpipe (>= 0.8), uwot, word2vec (>= 0.3.3)