word2vec

Distributed Representations of Words using word2vec

https://github.com/bnosac/word2vec

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.0%) to scientific vocabulary

Keywords

embeddings natural-language-processing r-package word2vec
Last synced: 6 months ago

Repository

Distributed Representations of Words using word2vec

Basic Info
  • Host: GitHub
  • Owner: bnosac
  • License: apache-2.0
  • Language: C++
  • Default Branch: master
  • Size: 291 KB
Statistics
  • Stars: 71
  • Watchers: 9
  • Forks: 5
  • Open Issues: 10
  • Releases: 8
Topics
embeddings natural-language-processing r-package word2vec
Created over 5 years ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog License

README.md

word2vec

This repository contains an R package that allows you to build a word2vec model

  • It is based on the paper Distributed Representations of Words and Phrases and their Compositionality [Mikolov et al.]
  • This R package is an Rcpp wrapper around https://github.com/maxoodf/word2vec
  • The package allows one
    • to train word embeddings using multiple threads on character data or data in a text file
    • to use the embeddings to find relations between words

Installation

  • For regular users, install the package from your local CRAN mirror: `install.packages("word2vec")`
  • To install the development version of this package: `remotes::install_github("bnosac/word2vec")`

Look at the documentation of the functions

```{r}
help(package = "word2vec")
```

Example

  • Take some data and standardise it a bit

```{r}
library(udpipe)
data(brussels_reviews, package = "udpipe")
x <- subset(brussels_reviews, language == "nl")
x <- tolower(x$feedback)
```

  • Build a model

```{r}
library(word2vec)
set.seed(123456789)
model     <- word2vec(x = x, type = "cbow", dim = 15, iter = 20)
embedding <- as.matrix(model)
embedding <- predict(model, c("bus", "toilet"), type = "embedding")
lookslike <- predict(model, c("bus", "toilet"), type = "nearest", top_n = 5)
lookslike
$bus
  term1  term2 similarity rank
    bus gratis  0.9959141    1
    bus   tram  0.9898559    2
    bus   voet  0.9882312    3
    bus    ben  0.9854795    4
    bus   auto  0.9839599    5

$toilet
   term1       term2 similarity rank
  toilet    koelkast  0.9870380    1
  toilet      douche  0.9850463    2
  toilet      werkte  0.9843599    3
  toilet slaapkamers  0.9802811    4
  toilet       eigen  0.9759347    5
```

  • Save the model, read it back in, and do something with it

```{r}
write.word2vec(model, "mymodel.bin")
model     <- read.word2vec("mymodel.bin")
terms     <- summary(model, "vocabulary")
embedding <- as.matrix(model)
```
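
As a quick sanity check on a model read back from disk, you can compare two embedding vectors directly. Below is a minimal base-R sketch of cosine similarity between two rows of the embedding matrix; it is not a package function, and it assumes the words "bus" and "tram" occur in the model vocabulary.

```{r}
## Minimal sketch: cosine similarity between two embedding vectors,
## assuming both words are part of the trained vocabulary
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))
cosine(embedding["bus", ], embedding["tram", ])
```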

Visualise the embeddings

  • Using another example, we get the embeddings of words together with their parts-of-speech tag (see the help of the udpipe R package to easily get parts-of-speech tags on text)

```{r}
library(udpipe)
data(brussels_reviews_anno, package = "udpipe")
x <- subset(brussels_reviews_anno, language == "fr" & !is.na(lemma) & nchar(lemma) > 1)
x <- subset(x, xpos %in% c("NN", "IN", "RB", "VB", "DT", "JJ", "PRP", "CC", "VBN",
                           "NNP", "NNS", "PRP$", "CD", "WP", "VBG", "UH", "SYM"))
x$text <- sprintf("%s//%s", x$lemma, x$xpos)
x <- paste.data.frame(x, term = "text", group = "doc_id", collapse = " ")

model     <- word2vec(x = x$text, dim = 15, iter = 20, split = c(" ", ".\n?!"))
embedding <- as.matrix(model)
```
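
If you start from raw text instead of the pre-annotated example data, the lemma and xpos columns used above can be obtained with udpipe. The following is a rough sketch only: the udpipe() convenience call with a language name and the `texts` character vector are assumptions, see the udpipe documentation for the exact workflow.

```{r}
## Rough sketch (not run): annotate raw text to get lemma/xpos,
## then build the "lemma//xpos" tokens used above.
## `texts` is a hypothetical character vector of raw documents.
library(udpipe)
anno <- udpipe(x = texts, object = "french")
anno <- subset(anno, !is.na(lemma) & nchar(lemma) > 1)
anno$text <- sprintf("%s//%s", anno$lemma, anno$xpos)
```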

  • Perform dimension reduction using UMAP and make an interactive plot of, for example, only the adjectives

```{r}
library(uwot)
viz <- umap(embedding, n_neighbors = 15, n_threads = 2)

## Static plot
library(ggplot2)
library(ggrepel)
df <- data.frame(word = gsub("//.+", "", rownames(embedding)),
                 xpos = gsub(".+//", "", rownames(embedding)),
                 x = viz[, 1], y = viz[, 2],
                 stringsAsFactors = FALSE)
df <- subset(df, xpos %in% c("JJ"))
ggplot(df, aes(x = x, y = y, label = word)) +
  geom_text_repel() + theme_void() +
  labs(title = "word2vec - adjectives in 2D using UMAP")

## Interactive plot
library(plotly)
plot_ly(df, x = ~x, y = ~y, type = "scatter", mode = "text", text = ~word)
```

Pretrained models

  • Note that the framework is compatible with the original word2vec model implementation. In order to use external models which were not trained and saved with this R package, you need to set normalize = TRUE in read.word2vec. This holds e.g. for models trained with gensim or for the models made available through the R package sentencepiece
  • Example below using a pretrained model available for English at https://github.com/maxoodf/word2vec#basic-usage

```{r}
library(word2vec)
model <- read.word2vec(file = "cb_ns_500_10.w2v", normalize = TRUE)
```

Examples on word similarities, classical analogies and embedding similarities

  • Which words are similar to fries or money

```{r}
predict(model, newdata = c("fries", "money"), type = "nearest", top_n = 5)
$fries
  term1         term2 similarity rank
  fries       burgers  0.7641346    1
  fries cheeseburgers  0.7636056    2
  fries  cheeseburger  0.7570285    3
  fries    hamburgers  0.7546136    4
  fries      coleslaw  0.7540344    5

$money
  term1     term2 similarity rank
  money     funds  0.8281102    1
  money      cash  0.8158758    2
  money    monies  0.7874741    3
  money      sums  0.7648080    4
  money taxpayers  0.7553093    5
```

  • Classical example: king - man + woman = queen

```{r}
wv <- predict(model, newdata = c("king", "man", "woman"), type = "embedding")
wv <- wv["king", ] - wv["man", ] + wv["woman", ]
predict(model, newdata = wv, type = "nearest", top_n = 3)
      term similarity rank
      king  0.9479475    1
     queen  0.7680065    2
  princess  0.7155131    3
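```

The same arithmetic can be wrapped in a small helper. This is only a convenience sketch built on the predict calls shown above, not part of the package API.

```{r}
## Convenience sketch: solve "a is to b as c is to ?" via vector arithmetic
analogy <- function(model, a, b, c, top_n = 3) {
  wv <- predict(model, newdata = c(a, b, c), type = "embedding")
  predict(model, newdata = wv[a, ] - wv[b, ] + wv[c, ], type = "nearest", top_n = top_n)
}
analogy(model, "king", "man", "woman")
```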

  • What could Belgium look like if we had a government, or Belgium without a government. Intelligent :)

```{r}
wv <- predict(model, newdata = c("belgium", "government"), type = "embedding")

predict(model, newdata = wv["belgium", ] + wv["government", ], type = "nearest", top_n = 2)
         term similarity rank
  netherlands  0.9337973    1
      germany  0.9305047    2

predict(model, newdata = wv["belgium", ] - wv["government", ], type = "nearest", top_n = 1)
     term similarity rank
  belgium  0.9759384    1
```

  • They are just numbers, you can prove anything with them

```{r}
wv <- predict(model, newdata = c("black", "white", "racism", "person"), type = "embedding")
wv <- wv["white", ] - wv["person", ] + wv["racism", ]

predict(model, newdata = wv, type = "nearest", top_n = 10)
             term similarity rank
            black  0.9480463    1
           racial  0.8962515    2
           racist  0.8518659    3
  segregationists  0.8304701    4
          bigotry  0.8055548    5
       racialized  0.8053641    6
          racists  0.8034531    7
         racially  0.8023036    8
       dixiecrats  0.8008670    9
       homophobia  0.7886864   10

wv <- predict(model, newdata = c("black", "white"), type = "embedding")
wv <- wv["black", ] + wv["white", ]

predict(model, newdata = wv, type = "nearest", top_n = 3)
     term similarity rank
     blue  0.9792663    1
   purple  0.9520039    2
  colored  0.9480994    3
```

Integration with ...

quanteda

  • You can build a word2vec model by providing a tokenised list

```{r}
library(quanteda)
library(word2vec)
data("data_corpus_inaugural", package = "quanteda")
toks <- data_corpus_inaugural %>%
  corpus_reshape(to = "sentences") %>%
  tokens(remove_punct = TRUE, remove_symbols = TRUE) %>%
  tokens_tolower() %>%
  as.list()

set.seed(54321)
model <- word2vec(toks, dim = 25, iter = 20, min_count = 3, type = "skip-gram", lr = 0.05)
emb   <- as.matrix(model)
predict(model, c("freedom", "constitution", "president"), type = "nearest", top_n = 5)
$freedom
    term1       term2 similarity rank
  freedom       human  0.9094619    1
  freedom         man  0.9001195    2
  freedom        life  0.8840834    3
  freedom generations  0.8676646    4
  freedom     mankind  0.8632550    5

$constitution
         term1          term2 similarity rank
  constitution constitutional  0.8814662    1
  constitution     conformity  0.8810275    2
  constitution      authority  0.8786194    3
  constitution     prescribed  0.8768463    4
  constitution         states  0.8661923    5

$president
      term1    term2 similarity rank
  president  clinton  0.9552274    1
  president   clergy  0.9426718    2
  president   carter  0.9386149    3
  president    chief  0.9377645    4
  president reverend  0.9347451    5
```

byte-pair encoding tokenizers (e.g. tokenizers.bpe/sentencepiece)

  • You can build a word2vec model by providing a tokenised list of token ids or subwords, for example in order to feed these embeddings into deep learning models; a sketch of embedding a new sentence follows the example below

```{r}
library(tokenizers.bpe)
library(word2vec)
data(belgium_parliament, package = "tokenizers.bpe")
x <- subset(belgium_parliament, language == "french")
x <- x$text
tokeniser <- bpe(x, coverage = 0.999, vocab_size = 1000, threads = 1)
toks  <- bpe_encode(tokeniser, x = x, type = "subwords")
toks  <- bpe_encode(tokeniser, x = x, type = "ids")
model <- word2vec(toks, dim = 25, iter = 20, min_count = 3, type = "skip-gram", lr = 0.05)
emb   <- as.matrix(model)
```
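
As a follow-up, here is a hedged sketch of how the trained embeddings could be looked up for a new sentence before feeding them into a downstream model. It only reuses bpe_encode and the embedding matrix from above; the example sentence is made up, and the intersect call is there because some token ids may not be part of the word2vec vocabulary.

```{r}
## Sketch: look up the embeddings of the token ids of a new sentence
newtext <- "La Belgique est un pays"
ids     <- unlist(bpe_encode(tokeniser, x = newtext, type = "ids"))
vectors <- emb[intersect(as.character(ids), rownames(emb)), , drop = FALSE]
dim(vectors)
```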

Support in text mining

Need support in text mining? Contact BNOSAC: http://www.bnosac.be

Owner

  • Name: bnosac
  • Login: bnosac
  • Kind: organization

  • Description: open sourced projects

GitHub Events

Total
  • Issues event: 2
  • Watch event: 1
  • Issue comment event: 4
Last Year
  • Issues event: 2
  • Watch event: 1
  • Issue comment event: 4

Committers

Last synced: 11 months ago

All Time
  • Total Commits: 87
  • Total Committers: 2
  • Avg Commits per committer: 43.5
  • Development Distribution Score (DDS): 0.218
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Jan Wijffels j****s@b****e 68
Kohei Watanabe w****i@g****m 19
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 18
  • Total pull requests: 8
  • Average time to close issues: 8 months
  • Average time to close pull requests: 3 days
  • Total issue authors: 9
  • Total pull request authors: 3
  • Average comments per issue: 5.5
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 2
  • Pull request authors: 0
  • Average comments per issue: 2.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • jwijffels (6)
  • koheiw (3)
  • niutyut (3)
  • steffen-stell (1)
  • michalovadek (1)
  • luciebaudoin (1)
  • ahmoreira (1)
  • dataspelunking (1)
  • dafnevk (1)
Pull Request Authors
  • jwijffels (7)
  • koheiw (2)
  • randef1ned (1)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 2
  • Total downloads:
    • cran 1,111 last-month
  • Total docker downloads: 20,366
  • Total dependent packages: 6
    (may contain duplicates)
  • Total dependent repositories: 6
    (may contain duplicates)
  • Total versions: 9
  • Total maintainers: 1
cran.r-project.org: word2vec

Distributed Representations of Words

  • Versions: 8
  • Dependent Packages: 6
  • Dependent Repositories: 6
  • Downloads: 1,111 Last month
  • Docker Downloads: 20,366
Rankings
Stargazers count: 5.9%
Dependent packages count: 7.3%
Forks count: 10.8%
Downloads: 11.1%
Dependent repos count: 12.0%
Average: 12.1%
Docker downloads count: 25.8%
Maintainers (1)
Last synced: 7 months ago
conda-forge.org: r-word2vec
  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 34.0%
Stargazers count: 36.0%
Average: 43.9%
Dependent packages count: 51.2%
Forks count: 54.2%
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 2.10 depends
  • Rcpp >= 0.11.5 imports
  • stats * imports
  • udpipe * suggests
.github/workflows/R-CMD-check.yml actions
  • actions/checkout v3 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite