newsmap
Semi-supervised algorithm for geographical document classification
Science Score: 33.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to scholar.google
- ✓ Committers with academic emails: 3 of 14 committers (21.4%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (15.9%) to scientific vocabulary
Keywords
machine-learning
news-stories
quanteda
text-analysis
Keywords from Contributors
corpus
text-analytics
encoding
Last synced: 6 months ago
Repository
Basic Info
Statistics
- Stars: 64
- Watchers: 5
- Forks: 24
- Open Issues: 10
- Releases: 3
Created almost 10 years ago · Last pushed 8 months ago
Metadata Files
Readme
License
README.Rmd
---
output: github_document
---
```{r, echo=FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "##",
fig.path = "man/images/"
)
```
# Newsmap: geographical document classifier
Semi-supervised Bayesian model for geographical document classification. Newsmap automatically constructs a large geographical dictionary from a corpus to classify documents accurately. Currently, the **newsmap** package contains seed dictionaries in multiple languages: *English*, *German*, *French*, *Spanish*, *Portuguese*, *Russian*, *Italian*, *Arabic*, *Turkish*, *Hebrew*, *Japanese*, and *Chinese*.
The algorithm is explained in detail in [Newsmap: semi-supervised approach to geographical news classification](https://www.tandfonline.com/eprint/dDeyUTBrhxBSSkHPn5uB/full). **newsmap** has also been used in scientific research in various fields ([Google Scholar](https://scholar.google.com/scholar?oi=bibs&hl=en&cites=3438152153062747083)).
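The idea of learning word-country associations from documents labelled by seed words can be illustrated with a toy example. The sketch below is not the package's implementation: the four-document matrix, the seed labels, the `score_words()` helper, and the smoothed log likelihood ratio used as the association score are all assumptions made for illustration.

```r
# Toy document-feature matrix: rows are documents, columns are words.
# All counts are made up for illustration.
feat <- matrix(c(2, 0, 1, 0,
                 1, 0, 2, 0,
                 0, 2, 0, 1,
                 0, 1, 0, 2),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("doc", 1:4),
                               c("Abe", "Cameron", "Tokyo", "London")))

# Labels a seed dictionary might assign (hypothetical): documents 1-2
# contain a Japanese seed word, documents 3-4 a British one.
label <- c("jp", "jp", "gb", "gb")

# Hypothetical helper: smoothed log likelihood ratio of each word
# inside versus outside the documents carrying a given label.
score_words <- function(feat, label, class, smooth = 1) {
  inside <- colSums(feat[label == class, , drop = FALSE]) + smooth
  outside <- colSums(feat[label != class, , drop = FALSE]) + smooth
  log(inside / sum(inside)) - log(outside / sum(outside))
}

# One column of word scores per country
scores <- sapply(c("jp", "gb"), function(cl) score_words(feat, label, cl))
round(scores, 2)

# Classify each document by the country with the largest summed score
pred <- colnames(scores)[max.col(feat %*% scores, ties.method = "first")]
pred
```

In this toy setting, words that co-occur with a seed label (such as "Tokyo" with "jp") receive positive scores for that country, which is how a small seed dictionary can grow into a much larger one.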
## How to install
**newsmap** has been available on CRAN since version 0.6. You can install the package using the RStudio GUI or the following command.
```{r, eval=FALSE}
install.packages("newsmap")
```
If you want the latest version, install it from GitHub by running the commands below in R. The **devtools** package is required, and the first command installs it.
```{r, eval=FALSE}
install.packages("devtools")
devtools::install_github("koheiw/newsmap")
```
## Example
In this example, we preprocess the textual data with the text analysis package [**quanteda**](https://quanteda.io) and train a geographical classification model on a [corpus of news summaries collected from Yahoo News](https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1) via RSS in 2014.
### Download example data
```{r, eval=FALSE}
download.file('https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1',
              '~/yahoo-news.RDS', mode = "wb")
```
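When the document is knitted repeatedly, it can be convenient to skip the download if the file is already on disk. The helper below is a minimal sketch of that idea; `download_if_missing()` is a hypothetical name and is not part of **newsmap** or **quanteda**.

```r
# Hypothetical helper: download a file only when it is not already present.
# Returns the destination path invisibly in both cases.
download_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    download.file(url, destfile, mode = "wb")
  }
  invisible(destfile)
}
```

It could be called with the same URL and destination as above, for example `download_if_missing('https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1', path.expand('~/yahoo-news.RDS'))`.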
### Train Newsmap classifier
```{r}
library(newsmap)
library(quanteda)
# Load data
dat <- readRDS('~/yahoo-news.RDS')
dat$text <- paste0(dat$head, ". ", dat$body)
dat$body <- NULL
corp <- corpus(dat, text_field = 'text')
# Custom stopwords
month <- c('January', 'February', 'March', 'April', 'May', 'June',
           'July', 'August', 'September', 'October', 'November', 'December')
day <- c('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday')
agency <- c('AP', 'AFP', 'Reuters')
# Select training period
sub_corp <- corpus_subset(corp, '2014-01-01' <= date & date <= '2014-12-31')
# Tokenize
toks <- tokens(sub_corp)
toks <- tokens_remove(toks, stopwords('english'), valuetype = 'fixed', padding = TRUE)
toks <- tokens_remove(toks, c(month, day, agency), valuetype = 'fixed', padding = TRUE)
# quanteda v1.5 introduced 'nested_scope' to reduce ambiguity in dictionary lookup
toks_label <- tokens_lookup(toks, data_dictionary_newsmap_en,
                            levels = 3, nested_scope = "dictionary")
dfmt_label <- dfm(toks_label)
dfmt_feat <- dfm(toks, tolower = FALSE)
dfmt_feat <- dfm_select(dfmt_feat, selection = "keep", '^[A-Z][A-Za-z1-2]+',
                        valuetype = 'regex', case_insensitive = FALSE) # include only proper nouns to model
dfmt_feat <- dfm_trim(dfmt_feat, min_termfreq = 10)
model <- textmodel_newsmap(dfmt_feat, dfmt_label)
# Features with largest weights
coef(model, n = 7)[c("us", "gb", "fr", "br", "jp")]
```
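To see which features the proper-noun pattern in `dfm_select()` keeps, it can be tried on a few words. This sketch uses base R's `grepl()` rather than **quanteda**, on the assumption that both treat this simple pattern the same way; the words are hypothetical and chosen to show the edge cases.

```r
# Same pattern as in the dfm_select() call above
pattern <- "^[A-Z][A-Za-z1-2]+"

# Hypothetical tokens, not taken from the corpus
words <- c("Tokyo", "tokyo", "UK", "A", "G20", "G7")

# TRUE means the word would be kept as a feature
matched <- grepl(pattern, words)
setNames(matched, words)
```

The pattern requires an uppercase first letter followed by at least one further character, so the single letter "A" and the lowercase "tokyo" are dropped. As written, the class `1-2` admits only the digits 1 and 2, which is why "G20" matches and "G7" does not.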
### Predict geographical focus of texts
```{r}
pred_data <- data.frame(text = as.character(sub_corp), country = predict(model))
```
```{r echo=FALSE}
knitr::kable(head(pred_data))
```
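A common next step is to count how many documents were assigned to each country. The sketch below uses a small hypothetical data frame, `pred_toy`, in place of `pred_data` so that it runs without the corpus; the texts and country codes in it are made up.

```r
# Hypothetical stand-in for pred_data; the real object comes from predict(model)
pred_toy <- data.frame(
  text = c("text a", "text b", "text c", "text d"),
  country = factor(c("us", "gb", "us", "jp"))
)

# Count documents per predicted country, most frequent first
freq <- sort(table(pred_toy$country), decreasing = TRUE)
freq
```

The same two lines should work on `pred_data$country`, assuming `predict(model)` returns a factor or character vector of country codes.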
Owner
- Name: Kohei Watanabe
- Login: koheiw
- Kind: user
- Location: Japan
- Website: http://koheiw.net
- Twitter: koheiw7
- Repositories: 34
- Profile: https://github.com/koheiw
Data analyst specializing in political and financial texts
GitHub Events
Total
- Watch event: 5
- Issue comment event: 7
- Pull request event: 2
- Fork event: 2
Last Year
- Watch event: 5
- Issue comment event: 7
- Pull request event: 2
- Fork event: 2
Committers
Last synced: almost 3 years ago
All Time
- Total Commits: 349
- Total Committers: 14
- Avg Commits per committer: 24.929
- Development Distribution Score (DDS): 0.097
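The two derived figures above can be reproduced from the commit counts. The sketch assumes that the Development Distribution Score is defined as one minus the top committer's share of all commits; that definition is an assumption and is not stated on this page.

```r
total_commits <- 349
total_committers <- 14
top_committer_commits <- 315  # Kohei Watanabe, from the Top Committers table

# Average commits per committer
avg_commits <- total_commits / total_committers

# Assumed definition: DDS = 1 - (top committer's share of commits)
dds <- 1 - top_committer_commits / total_commits

round(c(avg = avg_commits, dds = dds), 3)
```

Under that assumption, both values round to the figures reported in the list, 24.929 and 0.097.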
Top Committers
| Name | Email | Commits |
|---|---|---|
| Kohei Watanabe | w****i@g****m | 315 |
| Stefan Müller | m****s@t****e | 7 |
| Chung-hong Chan | c****y@g****m | 5 |
| bah-elly | 7****y@u****m | 5 |
| daiyamao | 5****o@u****m | 3 |
| eladseg | 5****g@u****m | 3 |
| Giuseppe Carteny | 4****y@u****m | 2 |
| Dani Madrid-Morales | d****d@m****k | 2 |
| kbenoit | k****t@l****k | 2 |
| Giuseppe Carteny | g****y@u****t | 1 |
| KT01 | 4****1@u****m | 1 |
| Dani Madrid-Morales | d****e@c****u | 1 |
| Lanabi | 3****i@u****m | 1 |
| Ke Cheng | k****c@g****m | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 20
- Total pull requests: 60
- Average time to close issues: 9 months
- Average time to close pull requests: 21 days
- Total issue authors: 6
- Total pull request authors: 11
- Average comments per issue: 3.45
- Average comments per pull request: 0.97
- Merged pull requests: 53
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- koheiw (14)
- chainsawriot (2)
- JBGruber (1)
- SCommain (1)
- giucarny (1)
- R01010010R (1)
Pull Request Authors
- koheiw (46)
- eladseg (5)
- LungtaSEKI (4)
- danimadrid (3)
- kbenoit (3)
- giucarny (2)
- stefan-mueller (2)
- daiyamao (2)
- kecheng-ac (1)
- bah-elly (1)
- chainsawriot (1)
- yuanzhouIR (1)
Top Labels
Issue Labels
dictionary (6)
meta (2)
help wanted (1)
question (1)
bug (1)
Packages
- Total packages: 1
- Total downloads: cran 606 last-month
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 16
- Total maintainers: 1
cran.r-project.org: newsmap
Semi-Supervised Model for Geographical Document Classification
- Homepage: https://github.com/koheiw/newsmap
- Documentation: http://cran.r-project.org/web/packages/newsmap/newsmap.pdf
- License: MIT + file LICENSE
- Latest release: 0.9.2 (published 8 months ago)
Rankings
Forks count: 3.8%
Stargazers count: 6.1%
Average: 16.3%
Downloads: 19.1%
Dependent repos count: 23.8%
Dependent packages count: 28.6%
Maintainers (1)
Last synced: 6 months ago
Dependencies
DESCRIPTION
cran
- R >= 3.5 depends
- methods * depends
- Matrix * imports
- quanteda >= 2.1 imports
- quanteda.textstats * imports
- stringi * imports
- utils * imports
- testthat * suggests
.github/workflows/check-standard.yaml
actions
- actions/checkout v2 composite
- actions/upload-artifact main composite
- r-lib/actions/check-r-package v1 composite
- r-lib/actions/setup-pandoc v1 composite
- r-lib/actions/setup-r v1 composite
- r-lib/actions/setup-r-dependencies v1 composite
.github/workflows/test-coverage.yaml
actions
- actions/cache v2 composite
- actions/checkout v2 composite
- r-lib/actions/setup-pandoc v1 composite
- r-lib/actions/setup-r v1 composite