wordmap

https://github.com/koheiw/wordmap

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (11.0%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: koheiw
License: other
Language: R
Default Branch: master
Size: 3.27 MB

Statistics

Stars: 4
Watchers: 1
Forks: 1
Open Issues: 2
Releases: 0

Created about 2 years ago · Last pushed 12 months ago

Metadata Files

Readme Changelog License

README.Rmd

---
output: github_document
editor_options: 
  chunk_output_type: console
---

```{r, echo=FALSE, message=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  fig.path = "man/images/"
)
```

# Wordmap: Semi-supervised Multinomial Document Classifier

**wordmap** is a semi-supervised algorithm for multinomial document classification originally created for [newsmap](https://github.com/koheiw/newsmap). **wordmap** is separated from **newsmap** to expand the scope of its application beyond geographical classification of news.

The algorithm is also useful for extracting features associated with document metadata (industry group, patent class, etc.) from very large corpora. The resulting list of features can be used to create a lexicon for dictionary analysis.

## How to install

**wordmap** has been available on CRAN since v0.8.0. You can install the package with the following R command.

```{r, eval=FALSE}
install.packages("wordmap")
```

If you want the latest version, install it from GitHub by running the following commands in R. This requires the **devtools** package, which the first command installs.

```{r, eval=FALSE}
install.packages("devtools")
devtools::install_github("koheiw/wordmap")
```

## Example

In this example, we identify the topics of sentences using a seed topic dictionary adopted from [Watanabe & Zhou (2020)](https://journals.sagepub.com/doi/full/10.1177/0894439320907027).
`data_corpus_ungd2017` contains transcripts of speeches delivered at the United Nations General Assembly in 2017.

```{r}
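library(quanteda)
library(wordmap)

# Seed topic dictionary
dict <- data_dictionary_topic
print(dict)

# Split the speeches into sentences (the default unit of corpus_reshape())
corp <- data_corpus_ungd2017 %>% 
    corpus_reshape(to = "sentences")

# Tokenize and remove stopwords; padding = TRUE leaves empty placeholders
# where tokens were removed
toks <- tokens(corp, remove_url = TRUE, remove_numbers = TRUE) %>% 
    tokens_remove(stopwords("en"), min_nchar = 2, padding = TRUE) #%>% 
    #tokens_remove("^[A-Z]", valuetype = "regex", case_insensitive = FALSE, padding = TRUE)

# Feature matrix: all remaining tokens that occur at least five times
dfmt_feat <- dfm(toks, remove_padding = TRUE) %>% 
    dfm_trim(min_termfreq = 5)
# Label matrix: counts of seed-word matches per topic in each sentence
dfmt_label <- tokens_lookup(toks, dict) %>% 
    dfm()

# Fit the model and show the features most strongly associated with each topic
map <- textmodel_wordmap(dfmt_feat, dfmt_label)
coef(map)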
```

### Predict topics of sentences

```{r}
dat <- data.frame(text = corp, topic = predict(map))
```

```{r echo=FALSE}
knitr::kable(head(dat, 10))
```
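
The semi-supervised step can be easier to follow on a very small input. The sketch below is not part of the package documentation: the sentences, the one-word seed dictionary and all object names ending in `_toy` are made up for illustration. It assumes that the model can be fitted on such a tiny corpus; the predictions it produces carry no substantive meaning.

```r
library(quanteda)
library(wordmap)

# Made-up sentences; s5 and s6 contain no seed words
txt_toy <- c(
    s1 = "trade and investment drive economic growth",
    s2 = "economic growth depends on open trade",
    s3 = "terrorism threatens peace and security",
    s4 = "peace requires security against terrorism",
    s5 = "investment supports growth",
    s6 = "we must protect peace"
)

# Hypothetical seed dictionary with a single seed word per topic
dict_toy <- dictionary(list(economy = "trade", security = "terrorism"))

toks_toy <- tokens(txt_toy)
dfmt_feat_toy <- dfm(toks_toy)                        # features
dfmt_label_toy <- dfm(tokens_lookup(toks_toy, dict_toy))  # noisy labels from seeds

# Learn feature-topic associations from the seed-labelled sentences
map_toy <- textmodel_wordmap(dfmt_feat_toy, dfmt_label_toy)

# One predicted topic per sentence, including those without seed words
pred_toy <- predict(map_toy)
table(pred_toy, useNA = "ifany")
```

Sentences without any seed word can still receive a topic because the model scores them on the other words that co-occur with the seeds.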

### Create a topic dictionary

Create a **quanteda** dictionary object from the extracted features. The dictionary can be used to analyze other corpora.

```{r}
as.dictionary(map, n = 100)
```
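
As a rough sketch of how such a dictionary might be applied to another corpus, the example below counts topic words per document with **quanteda**'s `tokens_lookup()`. The dictionary `dict_new` is a hand-written stand-in for the object returned by `as.dictionary()`, and the two documents are invented; only the lookup step itself is the point here.

```r
library(quanteda)

# Hypothetical dictionary standing in for the output of as.dictionary(map, n = 100)
dict_new <- dictionary(list(
    economy = c("trade", "growth"),
    security = c("peace", "terrorism")
))

# Invented documents from "another corpus"
toks_new <- tokens(c(
    d1 = "Trade fuels growth",
    d2 = "Peace defeats terrorism and protects trade"
))

# Replace matched tokens with their topic keys and count them per document
dfmt_topic <- dfm(tokens_lookup(toks_new, dict_new))
print(dfmt_topic)
```

The resulting matrix has one row per document and one column per topic, so it can be compared across documents or normalized by document length.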

Owner

Name: Kohei Watanabe
Login: koheiw
Kind: user
Location: Japan

Website: http://koheiw.net
Twitter: koheiw7
Repositories: 34
Profile: https://github.com/koheiw

Data analyst specializing in political and financial texts

GitHub Events

Total

Issues event: 1
Watch event: 2
Delete event: 3
Issue comment event: 1
Push event: 21
Pull request event: 11
Fork event: 1
Create event: 5

Last Year

Issues event: 1
Watch event: 2
Delete event: 3
Issue comment event: 1
Push event: 21
Pull request event: 11
Fork event: 1
Create event: 5

Issues and Pull Requests

All Time

Total issues: 1
Total pull requests: 12
Average time to close issues: N/A
Average time to close pull requests: about 20 hours
Total issue authors: 1
Total pull request authors: 2
Average comments per issue: 0.0
Average comments per pull request: 0.17
Merged pull requests: 10
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 10
Average time to close issues: N/A
Average time to close pull requests: 1 day
Issue authors: 1
Pull request authors: 2
Average comments per issue: 0.0
Average comments per pull request: 0.1
Merged pull requests: 8
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

koheiw (1)

Pull Request Authors

koheiw (14)
kbenoit (2)

Packages

Total packages: 1
Total downloads:
- cran 297 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 6
Total maintainers: 1

cran.r-project.org: wordmap

Feature Extraction and Document Classification with Noisy Labels

Homepage: https://github.com/koheiw/wordmap
Documentation: http://cran.r-project.org/web/packages/wordmap/wordmap.pdf
License: MIT + file LICENSE
Latest release: 0.9.5
published 12 months ago

Versions: 6
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 297 Last month

Rankings

Dependent packages count: 28.7%

Dependent repos count: 35.4%

Average: 50.1%

Downloads: 86.2%

Maintainers (1)

watanabe.kohei@gmail.com