wordvector

https://github.com/koheiw/wordvector

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 2 DOI reference(s) in README
✓ Academic publication links: Links to arxiv.org
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.5%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: koheiw
License: apache-2.0
Language: R
Default Branch: master
Homepage: https://koheiw.github.io/wordvector/
Size: 6.26 MB

Statistics

Stars: 10
Watchers: 1
Forks: 1
Open Issues: 8
Releases: 0

Created almost 2 years ago · Last pushed 10 months ago

Metadata Files

Readme Changelog License

README.Rmd

---
output: github_document
editor_options: 
  chunk_output_type: console
---

```{r, echo=FALSE, message=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  fig.path = "man/images/"
)
```

# Wordvector: word and document vector models

The **wordvector** package creates word and document vectors using **quanteda**. It currently supports word2vec ([Mikolov et al., 2013](http://arxiv.org/abs/1310.4546)) and latent semantic analysis ([Deerwester et al., 1990](https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9)).

## How to install

**wordvector** is available on CRAN.

```{r, eval=FALSE}
install.packages("wordvector")
```

The latest development version is available on GitHub.

```{r, eval=FALSE}
remotes::install_github("koheiw/wordvector")
```


## Example

We train the word2vec model on a [corpus of news summaries collected from Yahoo News](https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1) via RSS between 2012 and 2016. 

### Download data

```{r, eval=FALSE}
# download data
download.file('https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1', 
              '~/yahoo-news.RDS', mode = "wb")
```
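
The download only needs to happen once. A minimal sketch, assuming a hypothetical helper `fetch_once()` (not part of **wordvector** or **quanteda**) that skips the download when the destination file already exists:

```r
# Hypothetical helper: download `url` to `dest` only if `dest` is missing
fetch_once <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb")
  }
  invisible(dest)
}

# Demonstration with a file that already exists, so no network access occurs
tmp <- tempfile(fileext = ".RDS")
saveRDS(data.frame(x = 1), tmp)
path <- fetch_once("https://example.invalid/unused.RDS", tmp)
```

With the corpus used here, the call would be `fetch_once('https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1', '~/yahoo-news.RDS')`.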

### Train word2vec

```{r}
library(wordvector)
library(quanteda)

# Load data
dat <- readRDS('~/yahoo-news.RDS')
dat$text <- paste0(dat$head, ". ", dat$body)
corp <- corpus(dat, text_field = 'text', docid_field = "tid")

# Pre-processing
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>% 
    tokens_remove(stopwords("en", "marimo"), padding = TRUE) %>% 
    tokens_select("^[a-zA-Z-]+$", valuetype = "regex", case_insensitive = FALSE,
                  padding = TRUE)

# Train word2vec
wdv <- textmodel_word2vec(toks, dim = 50, type = "cbow", min_count = 5, verbose = TRUE)
```

### Similarity between word vectors

`similarity()` computes cosine similarity between word vectors.

```{r}
head(similarity(wdv, c("amazon", "forests", "obama", "america", "afghanistan"), mode = "word"))
```
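
To make the measure concrete, here is a minimal base R sketch of cosine similarity on toy vectors. The matrix `vecs`, its values, and the helper `cosine_sim()` are assumptions made for illustration; they are not taken from the trained model and do not show how `similarity()` is implemented.

```r
# Toy word vectors (hypothetical values, not from the trained model)
vecs <- rbind(
  amazon  = c(0.9, 0.1, 0.3),
  forests = c(0.8, 0.2, 0.1),
  obama   = c(0.1, 0.9, 0.4)
)

# Hypothetical helper: cosine similarity between every pair of rows
cosine_sim <- function(m) {
  unit <- m / sqrt(rowSums(m^2))  # scale each row to unit length
  tcrossprod(unit)                # dot products between unit vectors
}

sim <- cosine_sim(vecs)
round(sim, 3)
```

Each word has a similarity of 1 with itself, and in this toy data `amazon` is closer to `forests` than to `obama`.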

### Arithmetic operations on word vectors

`analogy()` offers an interface for arithmetic operations on word vectors.

```{r}
# What is Amazon without forests?
head(similarity(wdv, analogy(~ amazon - forests))) 
```

```{r}
# What is to Afghanistan as Obama is to America?
head(similarity(wdv, analogy(~ obama - america + afghanistan))) 
```

The following examples replicate analogy tasks from the original word2vec paper.

```{r}
# What is to France as Berlin is to Germany?
head(similarity(wdv, analogy(~ berlin - germany + france))) 
```

```{r}
# What is to slowly as quick is to quickly?
head(similarity(wdv, analogy(~ quick - quickly + slowly)))
```
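
The idea behind these analogy queries can be sketched in base R: combine word vectors by addition and subtraction, then rank all words by cosine similarity to the result. The embeddings in `emb` are hypothetical values chosen so that the arithmetic works out exactly. This is an illustration of the general technique, not a description of how `analogy()` is implemented.

```r
# Toy embeddings (hypothetical): columns 1-2 encode the country,
# column 3 encodes "is a capital city"
emb <- rbind(
  germany = c(1, 0, 0),
  france  = c(0, 1, 0),
  berlin  = c(1, 0, 1),
  paris   = c(0, 1, 1)
)

# Query vector for "berlin - germany + france"
query <- emb["berlin", ] - emb["germany", ] + emb["france", ]

# Cosine similarity between the query and every word vector
cos_to_query <- as.vector(emb %*% query) /
  (sqrt(rowSums(emb^2)) * sqrt(sum(query^2)))
names(cos_to_query) <- rownames(emb)

# Words ranked from most to least similar to the query
ranked <- sort(cos_to_query, decreasing = TRUE)
ranked
```

In this toy setup the query vector coincides with the vector for `paris`, so `paris` ranks first. Real embeddings are noisier, and the nearest word is only approximately the expected answer.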

Owner

Name: Kohei Watanabe
Login: koheiw
Kind: user
Location: Japan

Website: http://koheiw.net
Twitter: koheiw7
Repositories: 34
Profile: https://github.com/koheiw

Data analyst specializing in political and financial texts

GitHub Events

Total

Issues event: 13
Watch event: 7
Delete event: 10
Issue comment event: 6
Push event: 125
Pull request event: 39
Fork event: 1
Create event: 28

Last Year

Issues event: 13
Watch event: 7
Delete event: 10
Issue comment event: 6
Push event: 125
Pull request event: 39
Fork event: 1
Create event: 28

Committers

Last synced: 12 months ago

All Time

Total Commits: 438
Total Committers: 2
Avg Commits per committer: 219.0
Development Distribution Score (DDS): 0.155

Past Year

Commits: 297
Committers: 1
Avg Commits per committer: 297.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Kohei Watanabe	w**i@g**m	370
Jan Wijffels	j**s@b**e	68

Committer Domains (Top 20 + Academic)

bnosac.be: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 11
Total pull requests: 39
Average time to close issues: 25 days
Average time to close pull requests: about 18 hours
Total issue authors: 1
Total pull request authors: 2
Average comments per issue: 0.0
Average comments per pull request: 0.21
Merged pull requests: 33
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 11
Pull requests: 39
Average time to close issues: 25 days
Average time to close pull requests: about 18 hours
Issue authors: 1
Pull request authors: 2
Average comments per issue: 0.0
Average comments per pull request: 0.21
Merged pull requests: 33
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

koheiw (10)

Pull Request Authors

koheiw (42)
kbenoit (2)

Packages

Total packages: 1
Total downloads:
- cran 131 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 5
Total maintainers: 1

cran.r-project.org: wordvector

Word and Document Vector Models

Homepage: https://github.com/koheiw/wordvector
Documentation: http://cran.r-project.org/web/packages/wordvector/wordvector.pdf
License: Apache License (≥ 2.0)
Latest release: 0.5.1 (published about 1 year ago)

Versions: 5
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 131 Last month

Rankings

Dependent packages count: 27.6%

Dependent repos count: 34.0%

Average: 49.5%

Downloads: 86.9%

Maintainers (1)

watanabe.kohei@gmail.com

Last synced: 10 months ago
