wordvector
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 2 DOI reference(s) in README
- ✓ Academic publication links: links to arxiv.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.5%) to scientific vocabulary
Last synced: 7 months ago
Repository
Basic Info
- Host: GitHub
- Owner: koheiw
- License: apache-2.0
- Language: R
- Default Branch: master
- Homepage: https://koheiw.github.io/wordvector/
- Size: 6.26 MB
Statistics
- Stars: 10
- Watchers: 1
- Forks: 1
- Open Issues: 8
- Releases: 0
Created over 1 year ago · Last pushed 8 months ago
Metadata Files
Readme
Changelog
License
README.Rmd
---
output: github_document
editor_options:
chunk_output_type: console
---
```{r, echo=FALSE, message=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "##",
fig.path = "man/images/"
)
```
# Wordvector: word and document vector models
The **wordvector** package creates word and document vectors using **quanteda**. It currently supports word2vec ([Mikolov et al., 2013](http://arxiv.org/abs/1310.4546)) and latent semantic analysis ([Deerwester et al., 1990](https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9)).
## How to install
**wordvector** is available on CRAN.
```{r, eval=FALSE}
install.packages("wordvector")
```
The latest development version is available on GitHub.
```{r, eval=FALSE}
remotes::install_github("koheiw/wordvector")
```
## Example
We train the word2vec model on a [corpus of news summaries collected from Yahoo News](https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1) via RSS between 2012 and 2016.
### Download data
```{r, eval=FALSE}
# download data
download.file('https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1',
              '~/yahoo-news.RDS', mode = "wb")
```
### Train word2vec
```{r}
library(wordvector)
library(quanteda)
# Load data
dat <- readRDS('~/yahoo-news.RDS')
dat$text <- paste0(dat$head, ". ", dat$body)
corp <- corpus(dat, text_field = 'text', docid_field = "tid")
# Pre-processing
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>%
    tokens_remove(stopwords("en", "marimo"), padding = TRUE) %>%
    tokens_select("^[a-zA-Z-]+$", valuetype = "regex", case_insensitive = FALSE,
                  padding = TRUE)
# Train word2vec
wdv <- textmodel_word2vec(toks, dim = 50, type = "cbow", min_count = 5, verbose = TRUE)
```
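The regular expression in the pre-processing step keeps only tokens made entirely of ASCII letters and hyphens, and `padding = TRUE` leaves an empty string where a token was removed. The effect can be sketched in base R on a hypothetical token vector; this only illustrates the pattern and is not how **quanteda** implements `tokens_select()`.

```r
# Hypothetical tokens, used only for illustration
toks_toy <- c("Amazon", "2016", "co-op", "U.S.")

# Same pattern as in the pre-processing step above
pattern <- "^[a-zA-Z-]+$"

# Keep matching tokens; replace the rest with "" to mimic padding = TRUE
kept <- ifelse(grepl(pattern, toks_toy), toks_toy, "")
print(kept)
```

Keeping the padding preserves the original distances between the remaining tokens.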
### Similarity between word vectors
`similarity()` computes cosine similarity between word vectors.
```{r}
head(similarity(wdv, c("amazon", "forests", "obama", "america", "afghanistan"), mode = "word"))
```
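For readers unfamiliar with the measure, cosine similarity is the dot product of two vectors after each has been scaled to unit length. A minimal base R sketch on made-up three-dimensional vectors follows; the values and the helper `cosine()` are hypothetical and do not come from the package or the trained model.

```r
# Toy word vectors (hypothetical values, one row per word)
vecs <- rbind(
  amazon  = c(1, 0, 1),
  forests = c(1, 0, 0),
  obama   = c(0, 1, 0)
)

# Cosine similarity between every pair of rows
cosine <- function(m) {
  unit <- m / sqrt(rowSums(m^2))  # scale each row to unit length
  unit %*% t(unit)                # dot products of the unit vectors
}

sim <- cosine(vecs)

# Words ranked by similarity to "amazon"
ranked <- names(sort(sim["amazon", ], decreasing = TRUE))
print(ranked)
```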
### Arithmetic operations of word vectors
`analogy()` offers an interface for arithmetic operations on word vectors.
```{r}
# What is Amazon without forests?
head(similarity(wdv, analogy(~ amazon - forests)))
```
```{r}
# What is to Afghanistan as Obama is to America?
head(similarity(wdv, analogy(~ obama - america + afghanistan)))
```
The following examples replicate analogy tasks from the original word2vec paper.
```{r}
# What is to France as Berlin is to Germany?
head(similarity(wdv, analogy(~ berlin - germany + france)))
```
```{r}
# What is to slowly as quick is to quickly?
head(similarity(wdv, analogy(~ quick - quickly + slowly)))
```
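The general idea behind these analogy queries is to add and subtract word vectors and then look for the words closest to the result. The base R sketch below uses made-up two-dimensional vectors chosen so that the arithmetic works out exactly; the values and the helper `cosine_to()` are hypothetical, and the sketch is not the package's implementation of `analogy()`.

```r
# Toy 2-D vectors (hypothetical values):
# dimension 1 encodes the country, dimension 2 encodes "is a capital"
vecs <- rbind(
  germany = c(1, 0),
  berlin  = c(1, 1),
  france  = c(2, 0),
  paris   = c(2, 1),
  tokyo   = c(-1, 1)
)

# Query vector for "berlin - germany + france"
query <- vecs["berlin", ] - vecs["germany", ] + vecs["france", ]

# Cosine similarity between each row of m and the vector q
cosine_to <- function(m, q) {
  drop(m %*% q) / (sqrt(rowSums(m^2)) * sqrt(sum(q^2)))
}

sim <- cosine_to(vecs, query)

# The word whose vector points most nearly in the query's direction
best <- names(which.max(sim))
print(best)
```

With real embeddings the match is never exact, which is why the examples above inspect the top few words with `head()` rather than a single best match.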
Owner
- Name: Kohei Watanabe
- Login: koheiw
- Kind: user
- Location: Japan
- Website: http://koheiw.net
- Twitter: koheiw7
- Repositories: 34
- Profile: https://github.com/koheiw
Data analyst specializing in political and financial texts
GitHub Events
Total
- Issues event: 13
- Watch event: 7
- Delete event: 10
- Issue comment event: 6
- Push event: 125
- Pull request event: 39
- Fork event: 1
- Create event: 28
Last Year
- Issues event: 13
- Watch event: 7
- Delete event: 10
- Issue comment event: 6
- Push event: 125
- Pull request event: 39
- Fork event: 1
- Create event: 28
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Kohei Watanabe | w****i@g****m | 370 |
| Jan Wijffels | j****s@b****e | 68 |
Committer Domains (Top 20 + Academic)
bnosac.be: 1
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 11
- Total pull requests: 39
- Average time to close issues: 25 days
- Average time to close pull requests: about 18 hours
- Total issue authors: 1
- Total pull request authors: 2
- Average comments per issue: 0.0
- Average comments per pull request: 0.21
- Merged pull requests: 33
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 11
- Pull requests: 39
- Average time to close issues: 25 days
- Average time to close pull requests: about 18 hours
- Issue authors: 1
- Pull request authors: 2
- Average comments per issue: 0.0
- Average comments per pull request: 0.21
- Merged pull requests: 33
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- koheiw (10)
Pull Request Authors
- koheiw (42)
- kbenoit (2)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: cran 131 last-month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 5
- Total maintainers: 1
cran.r-project.org: wordvector
Word and Document Vector Models
- Homepage: https://github.com/koheiw/wordvector
- Documentation: http://cran.r-project.org/web/packages/wordvector/wordvector.pdf
- License: Apache License (≥ 2.0)
- Latest release: 0.5.1, published 10 months ago
Rankings
Dependent packages count: 27.6%
Dependent repos count: 34.0%
Average: 49.5%
Downloads: 86.9%
Maintainers (1)
Last synced: 7 months ago