newsmap

Semi-supervised algorithm for geographical document classification

https://github.com/koheiw/newsmap

Science Score: 33.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
✓ Academic publication links: Links to scholar.google
✓ Committers with academic emails: 3 of 14 committers (21.4%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (15.9%) to scientific vocabulary

Keywords

machine-learning news-stories quanteda text-analysis

Keywords from Contributors

corpus text-analytics encoding

Last synced: 9 months ago

Repository

Semi-supervised algorithm for geographical document classification

Basic Info

Host: GitHub
Owner: koheiw
License: other
Language: R
Default Branch: master
Homepage:
Size: 1.84 MB

Statistics

Stars: 64
Watchers: 5
Forks: 24
Open Issues: 10
Releases: 3

Topics

machine-learning news-stories quanteda text-analysis

Created about 10 years ago · Last pushed 11 months ago

Metadata Files

Readme License

README.Rmd

---
output: github_document
---

```{r, echo=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  fig.path = "man/images/"
)
```

# Newsmap: geographical document classifier



[![CRAN
Version](https://www.r-pkg.org/badges/version/newsmap)](https://CRAN.R-project.org/package=newsmap)
[![Downloads](https://cranlogs.r-pkg.org/badges/newsmap)](https://CRAN.R-project.org/package=newsmap)
[![Total
Downloads](https://cranlogs.r-pkg.org/badges/grand-total/newsmap?color=orange)](https://CRAN.R-project.org/package=newsmap)
[![R build
status](https://github.com/koheiw/newsmap/workflows/R-CMD-check/badge.svg)](https://github.com/koheiw/newsmap/actions)
[![codecov](https://codecov.io/gh/koheiw/newsmap/branch/master/graph/badge.svg)](https://codecov.io/gh/koheiw/newsmap)


Semi-supervised Bayesian model for geographical document classification. Newsmap automatically constructs a large geographical dictionary from a corpus to accurately classify documents. Currently, the **newsmap** package contains seed dictionaries in the following languages: *English*, *German*, *French*, *Spanish*, *Portuguese*, *Russian*, *Italian*, *Arabic*, *Turkish*, *Hebrew*, *Japanese*, and *Chinese*.

The algorithm is explained in detail in [Newsmap: semi-supervised approach to geographical news classification](https://www.tandfonline.com/eprint/dDeyUTBrhxBSSkHPn5uB/full). **newsmap** has also been used in scientific research in various fields ([Google Scholar](https://scholar.google.com/scholar?oi=bibs&hl=en&cites=3438152153062747083)).
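To give a feel for how a dictionary can grow from a handful of seed words, the sketch below computes a smoothed likelihood-ratio score for each feature and country on made-up counts. It is a simplified illustration in base R, not the package's implementation: the function name `smoothed_lr`, the toy matrices, and the smoothing value are all assumptions, and the paper remains the reference for the actual estimator.

```{r, eval=FALSE}
# Toy document-feature counts: 4 documents x 4 capitalised features (made-up data)
feat <- matrix(c(2, 1, 0, 0,
                 1, 2, 0, 1,
                 0, 0, 3, 1,
                 0, 1, 1, 2),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("doc", 1:4),
                               c("Paris", "Macron", "Tokyo", "Abe")))

# Seed labels: 1 if a document matched a seed word for that country (made-up data)
label <- matrix(c(1, 0,
                  1, 0,
                  0, 1,
                  0, 1),
                nrow = 4, byrow = TRUE,
                dimnames = list(rownames(feat), c("fr", "jp")))

# Hypothetical helper: log ratio of a feature's relative frequency inside
# the documents labelled with a country to its relative frequency outside them
smoothed_lr <- function(feat, label, smooth = 1) {
  total <- colSums(feat)
  scores <- sapply(colnames(label), function(k) {
    raw <- colSums(feat[label[, k] > 0, , drop = FALSE])
    inside <- raw + smooth
    outside <- total - raw + smooth
    log(inside / sum(inside)) - log(outside / sum(outside))
  })
  t(scores)  # rows are countries, columns are features
}

scores <- smoothed_lr(feat, label)

# Classify each document by the country with the highest summed score
doc_scores <- feat %*% t(scores)
pred <- colnames(doc_scores)[apply(doc_scores, 1, which.max)]
```

In this toy data "Macron" never appears in a seed dictionary, yet it receives a positive score for `fr` because it co-occurs with documents labelled as French. That is the sense in which the dictionary is extended from the corpus.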

## How to install

**newsmap** has been available on CRAN since version 0.6. You can install the package using the RStudio GUI or the following command.

```{r, eval=FALSE}
install.packages("newsmap")
```

If you want the latest version, install it from GitHub by running these commands in R. The package **devtools** is required, and the first line installs it.

```{r, eval=FALSE}
install.packages("devtools")
devtools::install_github("koheiw/newsmap")
```

## Example

This example uses the text analysis package [**quanteda**](https://quanteda.io) to preprocess the textual data, then trains a geographical classification model on a [corpus of news summaries collected from Yahoo News](https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1) via RSS in 2014.

### Download example data

```{r, eval=FALSE}
download.file('https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1', 
              '~/yahoo-news.RDS', mode = "wb")
```

### Train Newsmap classifier

```{r}
require(newsmap)
require(quanteda)

# Load data
dat <- readRDS('~/yahoo-news.RDS')
dat$text <- paste0(dat$head, ". ", dat$body)
dat$body <- NULL
corp <- corpus(dat, text_field = 'text')

# Custom stopwords
month <- c('January', 'February', 'March', 'April', 'May', 'June',
           'July', 'August', 'September', 'October', 'November', 'December')
day <- c('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday')
agency <- c('AP', 'AFP', 'Reuters')

# Select training period
sub_corp <- corpus_subset(corp, '2014-01-01' <= date & date <= '2014-12-31')

# Tokenize
toks <- tokens(sub_corp)
toks <- tokens_remove(toks, stopwords('english'), valuetype = 'fixed', padding = TRUE)
toks <- tokens_remove(toks, c(month, day, agency), valuetype = 'fixed', padding = TRUE)

# quanteda v1.5 introduced 'nested_scope' to reduce ambiguity in dictionary lookup
toks_label <- tokens_lookup(toks, data_dictionary_newsmap_en, 
                            levels = 3, nested_scope = "dictionary")
dfmt_label <- dfm(toks_label)

dfmt_feat <- dfm(toks, tolower = FALSE)
dfmt_feat <- dfm_select(dfmt_feat, selection = "keep", '^[A-Z][A-Za-z1-2]+', 
                        valuetype = 'regex', case_insensitive = FALSE) # include only proper nouns to model
dfmt_feat <- dfm_trim(dfmt_feat, min_termfreq = 10)

model <- textmodel_newsmap(dfmt_feat, dfmt_label)

# Features with largest weights
coef(model, n = 7)[c("us", "gb", "fr", "br", "jp")]
```
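The feature-selection step keeps only tokens that match the regular expression `'^[A-Z][A-Za-z1-2]+'`. The base R sketch below shows which kinds of tokens that pattern accepts. The word list is made up for illustration, and `grepl(..., perl = TRUE)` is used here as a stand-in for quanteda's own regex matching, on the assumption that the two behave the same for this simple ASCII pattern.

```{r, eval=FALSE}
# Same pattern as in dfm_select() above: an upper-case letter followed by
# at least one letter or one of the digits 1 and 2
pattern <- "^[A-Z][A-Za-z1-2]+"

# Made-up tokens to probe the pattern
words <- c("Paris", "paris", "UK", "G20", "G7", "A")

keep <- grepl(pattern, words, perl = TRUE)
names(keep) <- words
keep
```

"Paris", "UK" and "G20" pass the filter. "paris" fails because it is lower-case, "A" because it has only one character, and "G7" because 7 is outside the `1-2` digit range. The pattern is anchored only at the start, so a token needs just its first two characters to match.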

### Predict geographical focus of texts 

```{r}
pred_data <- data.frame(text = as.character(sub_corp), country = predict(model))
```

```{r echo=FALSE}
knitr::kable(head(pred_data))
```
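Once every document has a predicted country, a natural next step is to see which countries dominate the coverage. The sketch below tabulates predictions with base R. The helper `top_countries` is hypothetical, and the `country` vector is a made-up stand-in for `pred_data$country`, not real model output.

```{r, eval=FALSE}
# Stand-in for pred_data$country (made-up values, not real predictions)
country <- factor(c("us", "gb", "us", "fr", "us", "gb"))

# Hypothetical helper: the n most frequent predicted countries and their shares
top_countries <- function(country, n = 3) {
  freq <- sort(table(country), decreasing = TRUE)
  head(data.frame(country = names(freq),
                  n = as.integer(freq),
                  share = as.numeric(freq) / length(country)), n)
}

top <- top_countries(country, n = 2)
```

With the real data, the same call would be `top_countries(pred_data$country)`.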

Owner

Name: Kohei Watanabe
Login: koheiw
Kind: user
Location: Japan

Website: http://koheiw.net
Twitter: koheiw7
Repositories: 34
Profile: https://github.com/koheiw

Data analyst specializing in political and financial texts

GitHub Events

Total

Watch event: 5
Issue comment event: 7
Pull request event: 2
Fork event: 2

Last Year

Watch event: 5
Issue comment event: 7
Pull request event: 2
Fork event: 2

Committers

Last synced: about 3 years ago

All Time

Total Commits: 349
Total Committers: 14
Avg Commits per committer: 24.929
Development Distribution Score (DDS): 0.097

Top Committers

Name	Email	Commits
Kohei Watanabe	w**i@g**m	315
Stefan Müller	m**s@t**e	7
Chung-hong Chan	c**y@g**m	5
bah-elly	7**y@u**m	5
daiyamao	5**o@u**m	3
eladseg	5**g@u**m	3
Giuseppe Carteny	4**y@u**m	2
Dani Madrid-Morales	d**d@m**k	2
kbenoit	k**t@l**k	2
Giuseppe Carteny	g**y@u**t	1
KT01	4**1@u**m	1
Dani Madrid-Morales	d**e@c**u	1
Lanabi	3**i@u**m	1
Ke Cheng	k**c@g**m	1

Committer Domains (Top 20 + Academic)

central.uh.edu: 1 unimi.it: 1 lse.ac.uk: 1 my.cityu.edu.hk: 1 tcd.ie: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 20
Total pull requests: 60
Average time to close issues: 9 months
Average time to close pull requests: 21 days
Total issue authors: 6
Total pull request authors: 11
Average comments per issue: 3.45
Average comments per pull request: 0.97
Merged pull requests: 53
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

koheiw (14)
chainsawriot (2)
JBGruber (1)
SCommain (1)
giucarny (1)
R01010010R (1)

Pull Request Authors

koheiw (46)
eladseg (5)
LungtaSEKI (4)
danimadrid (3)
kbenoit (3)
giucarny (2)
stefan-mueller (2)
daiyamao (2)
kecheng-ac (1)
bah-elly (1)
chainsawriot (1)
yuanzhouIR (1)

Top Labels

Issue Labels

dictionary (6) meta (2) help wanted (1) question (1) bug (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- cran 606 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 16
Total maintainers: 1

cran.r-project.org: newsmap

Semi-Supervised Model for Geographical Document Classification

Homepage: https://github.com/koheiw/newsmap
Documentation: http://cran.r-project.org/web/packages/newsmap/newsmap.pdf
License: MIT + file LICENSE
Latest release: 0.9.2
published 11 months ago

Versions: 16
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 606 Last month

Rankings

Forks count: 3.8%

Stargazers count: 6.1%

Average: 16.3%

Downloads: 19.1%

Dependent repos count: 23.8%

Dependent packages count: 28.6%

Maintainers (1)

watanabe.kohei@gmail.com

Last synced: 10 months ago

Dependencies

DESCRIPTION cran

R >= 3.5 depends
methods * depends
Matrix * imports
quanteda >= 2.1 imports
quanteda.textstats * imports
stringi * imports
utils * imports
testthat * suggests

.github/workflows/check-standard.yaml actions

actions/checkout v2 composite
actions/upload-artifact main composite
r-lib/actions/check-r-package v1 composite
r-lib/actions/setup-pandoc v1 composite
r-lib/actions/setup-r v1 composite
r-lib/actions/setup-r-dependencies v1 composite

.github/workflows/test-coverage.yaml actions

actions/cache v2 composite
actions/checkout v2 composite
r-lib/actions/setup-pandoc v1 composite
r-lib/actions/setup-r v1 composite
