Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.5%) to scientific vocabulary
Last synced: 7 months ago

Repository

Basic Info
Statistics
  • Stars: 10
  • Watchers: 1
  • Forks: 1
  • Open Issues: 8
  • Releases: 0
Created over 1 year ago · Last pushed 8 months ago
Metadata Files
Readme Changelog License

README.Rmd

---
output: github_document
editor_options: 
  chunk_output_type: console
---

```{r, echo=FALSE, message=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  fig.path = "man/images/"
)
```

# Wordvector: word and document vector models

The **wordvector** package creates word and document vectors using **quanteda**. It currently supports word2vec ([Mikolov et al., 2013](http://arxiv.org/abs/1310.4546)) and latent semantic analysis ([Deerwester et al., 1990](https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9)).

## How to install

**wordvector** is available on CRAN.

```{r, eval=FALSE}
install.packages("wordvector")
```

The latest version is available on GitHub.

```{r, eval=FALSE}
remotes::install_github("koheiw/wordvector")
```


## Example

We train the word2vec model on a [corpus of news summaries collected from Yahoo News](https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1) via RSS between 2012 and 2016. 

### Download data

```{r, eval=FALSE}
# download data
download.file('https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1', 
              '~/yahoo-news.RDS', mode = "wb")
```

### Train word2vec

```{r}
library(wordvector)
library(quanteda)

# Load data
dat <- readRDS('~/yahoo-news.RDS')
dat$text <- paste0(dat$head, ". ", dat$body)
corp <- corpus(dat, text_field = 'text', docid_field = "tid")

# Pre-processing
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>% 
    tokens_remove(stopwords("en", "marimo"), padding = TRUE) %>% 
    tokens_select("^[a-zA-Z-]+$", valuetype = "regex", case_insensitive = FALSE,
                  padding = TRUE)

# Train word2vec
wdv <- textmodel_word2vec(toks, dim = 50, type = "cbow", min_count = 5, verbose = TRUE)
```

### Similarity between word vectors

`similarity()` computes cosine similarity between word vectors.

```{r}
head(similarity(wdv, c("amazon", "forests", "obama", "america", "afghanistan"), mode = "word"))
```
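For intuition, cosine similarity is the dot product of two vectors after each has been scaled to unit length. The base R sketch below uses made-up three-dimensional vectors. The numbers and the helper `cosine_sim()` are hypothetical; they are neither output nor functions of the package.

```r
# Toy "word vectors" with made-up values (not produced by any model)
vecs <- rbind(
  amazon  = c(0.9, 0.1, 0.3),
  forests = c(0.8, 0.2, 0.1),
  obama   = c(0.1, 0.9, 0.4)
)

# Hypothetical helper: cosine similarity between every pair of rows
cosine_sim <- function(m) {
  unit <- m / sqrt(rowSums(m^2))  # scale each row to unit length
  unit %*% t(unit)                # dot products of the unit vectors
}

sim <- cosine_sim(vecs)
print(round(sim, 3))
```

Each word has a similarity of 1 with itself, and in this toy data `amazon` is closer to `forests` than to `obama`.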

### Arithmetic operations of word vectors

`analogy()` offers an interface for arithmetic operations on word vectors.

```{r}
# What is Amazon without forests?
head(similarity(wdv, analogy(~ amazon - forests))) 
```

```{r}
# What is to Afghanistan as Obama is to America?
head(similarity(wdv, analogy(~ obama - america + afghanistan))) 
```

These examples replicate analogical tasks in the original word2vec paper.

```{r}
# What is to France as Berlin is to Germany?
head(similarity(wdv, analogy(~ berlin - germany + france))) 
```

```{r}
# What is to slowly as quick is to quickly?
head(similarity(wdv, analogy(~ quick - quickly + slowly)))
```
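As a rough illustration of the arithmetic that a formula such as `~ berlin - germany + france` describes, the base R sketch below adds and subtracts toy vectors and ranks every word by cosine similarity to the result. The vectors are invented for the example, and the code is not the package's implementation of `analogy()` or `similarity()`.

```r
# Toy vectors (hypothetical): dim 1 ~ "is a capital", dim 2 ~ Germany, dim 3 ~ France
vecs <- rbind(
  berlin  = c(1, 1, 0),
  germany = c(0, 1, 0),
  paris   = c(1, 0, 1),
  france  = c(0, 0, 1)
)

# berlin - germany + france
query <- vecs["berlin", ] - vecs["germany", ] + vecs["france", ]

# Cosine similarity between the query vector and every word vector
sims <- drop(vecs %*% query) / (sqrt(rowSums(vecs^2)) * sqrt(sum(query^2)))
ranked <- sort(sims, decreasing = TRUE)
print(round(ranked, 3))
```

With these toy values the query vector coincides with `paris`, so it ranks first. With trained vectors the nearest word is only approximately the expected answer.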


Owner

  • Name: Kohei Watanabe
  • Login: koheiw
  • Kind: user
  • Location: Japan

Data analyst specializing in political and financial texts

GitHub Events

Total
  • Issues event: 13
  • Watch event: 7
  • Delete event: 10
  • Issue comment event: 6
  • Push event: 125
  • Pull request event: 39
  • Fork event: 1
  • Create event: 28
Last Year
  • Issues event: 13
  • Watch event: 7
  • Delete event: 10
  • Issue comment event: 6
  • Push event: 125
  • Pull request event: 39
  • Fork event: 1
  • Create event: 28

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 438
  • Total Committers: 2
  • Avg Commits per committer: 219.0
  • Development Distribution Score (DDS): 0.155
Past Year
  • Commits: 297
  • Committers: 1
  • Avg Commits per committer: 297.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Kohei Watanabe w****i@g****m 370
Jan Wijffels j****s@b****e 68

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 11
  • Total pull requests: 39
  • Average time to close issues: 25 days
  • Average time to close pull requests: about 18 hours
  • Total issue authors: 1
  • Total pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.21
  • Merged pull requests: 33
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 11
  • Pull requests: 39
  • Average time to close issues: 25 days
  • Average time to close pull requests: about 18 hours
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.21
  • Merged pull requests: 33
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • koheiw (10)
Pull Request Authors
  • koheiw (42)
  • kbenoit (2)

Packages

  • Total packages: 1
  • Total downloads:
    • cran 131 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 5
  • Total maintainers: 1
cran.r-project.org: wordvector

Word and Document Vector Models

  • Versions: 5
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 131 Last month
Rankings
Dependent packages count: 27.6%
Dependent repos count: 34.0%
Average: 49.5%
Downloads: 86.9%
Maintainers (1)
Last synced: 7 months ago