newsmap

Semi-supervised algorithm for geographical document classification

https://github.com/koheiw/newsmap

Science Score: 33.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: scholar.google
  • Committers with academic emails
    3 of 14 committers (21.4%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.9%) to scientific vocabulary

Keywords

machine-learning news-stories quanteda text-analysis

Keywords from Contributors

corpus text-analytics encoding
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: koheiw
  • License: other
  • Language: R
  • Default Branch: master
  • Homepage:
  • Size: 1.84 MB
Statistics
  • Stars: 64
  • Watchers: 5
  • Forks: 24
  • Open Issues: 10
  • Releases: 3
Created almost 10 years ago · Last pushed 8 months ago

README.Rmd

---
output: github_document
---

```{r, echo=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  fig.path = "man/images/"
)
```

# Newsmap: geographical document classifier



[![CRAN Version](https://www.r-pkg.org/badges/version/newsmap)](https://CRAN.R-project.org/package=newsmap)
[![Downloads](https://cranlogs.r-pkg.org/badges/newsmap)](https://CRAN.R-project.org/package=newsmap)
[![Total Downloads](https://cranlogs.r-pkg.org/badges/grand-total/newsmap?color=orange)](https://CRAN.R-project.org/package=newsmap)
[![R build status](https://github.com/koheiw/newsmap/workflows/R-CMD-check/badge.svg)](https://github.com/koheiw/newsmap/actions)
[![codecov](https://codecov.io/gh/koheiw/newsmap/branch/master/graph/badge.svg)](https://codecov.io/gh/koheiw/newsmap)


Newsmap is a semi-supervised Bayesian model for geographical document classification. It automatically constructs a large geographical dictionary from a corpus to classify documents accurately. Currently, the **newsmap** package contains seed dictionaries in the following languages: *English*, *German*, *French*, *Spanish*, *Portuguese*, *Russian*, *Italian*, *Arabic*, *Turkish*, *Hebrew*, *Japanese*, and *Chinese*.

The algorithm is explained in detail in [Newsmap: semi-supervised approach to geographical news classification](https://www.tandfonline.com/eprint/dDeyUTBrhxBSSkHPn5uB/full). **newsmap** has also been used in scientific research in various fields ([Google Scholar](https://scholar.google.com/scholar?oi=bibs&hl=en&cites=3438152153062747083)).
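
To make the idea of growing a dictionary from seed words concrete, here is a minimal sketch in base R. It is a simplified illustration, not the package's implementation: the toy matrices, the helper `score_words()`, and the smoothing value of 1 are all assumptions made for this example. Each word is scored by the smoothed log ratio of its relative frequency inside versus outside the documents that matched a country's seed words, and each document is assigned the country with the highest summed score.

```r
# Toy document-feature counts: rows are documents, columns are words.
feat <- matrix(
  c(2, 0, 1, 0,
    1, 0, 2, 0,
    0, 2, 0, 1,
    0, 1, 0, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("Paris", "Tokyo", "Macron", "Abe"))
)

# Toy seed labels: 1 if the document matched a seed word for the country.
label <- matrix(
  c(1, 0,
    1, 0,
    0, 1,
    0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(rownames(feat), c("fr", "jp"))
)

# Hypothetical helper (not part of newsmap): smoothed log ratio of a word's
# relative frequency inside vs. outside the documents labelled per country.
score_words <- function(feat, label, smooth = 1) {
  sapply(colnames(label), function(k) {
    inside <- colSums(feat[label[, k] > 0, , drop = FALSE]) + smooth
    outside <- colSums(feat[label[, k] == 0, , drop = FALSE]) + smooth
    log(inside / sum(inside)) - log(outside / sum(outside))
  })
}

scores <- score_words(feat, label)

# Assign each document the country with the highest frequency-weighted score.
pred <- colnames(scores)[max.col(feat %*% scores, ties.method = "first")]
names(pred) <- rownames(feat)
pred
```

In this toy setup, "Macron" receives a positive score for `fr` even though only the document labels, not the word itself, came from the seed dictionary. This is the sense in which the dictionary is extended from the corpus.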

## How to install

**newsmap** has been available on CRAN since version 0.6. You can install the package using the RStudio GUI or the following command.

```{r, eval=FALSE}
install.packages("newsmap")
```

If you want the latest version, install it from GitHub by running the following commands in R. The installation requires **devtools**, which the first command installs.

```{r, eval=FALSE}
install.packages("devtools")
devtools::install_github("koheiw/newsmap")
```

## Example

In this example, we train a geographical classification model on a [corpus of news summaries collected from Yahoo News](https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1) via RSS in 2014, using the text analysis package [**quanteda**](https://quanteda.io) to preprocess the texts.

### Download example data

```{r, eval=FALSE}
download.file('https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1', 
              '~/yahoo-news.RDS', mode = "wb")
```
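
When the README is re-run, the download above is repeated each time. A small guard along the following lines avoids that. The helper `download_if_missing()` is a hypothetical name introduced here for illustration and is not part of **newsmap**.

```r
# Hypothetical helper: download a file only when no local copy exists.
download_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb")  # binary mode keeps the RDS file intact
  }
  invisible(dest)
}

# Example call (commented out to avoid network access here):
# download_if_missing('https://www.dropbox.com/s/e19kslwhuu9yc2z/yahoo-news.RDS?dl=1',
#                     path.expand('~/yahoo-news.RDS'))
```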

### Train Newsmap classifier

```{r}
library(newsmap)
library(quanteda)

# Load data
dat <- readRDS('~/yahoo-news.RDS')
dat$text <- paste0(dat$head, ". ", dat$body)
dat$body <- NULL
corp <- corpus(dat, text_field = 'text')

# Custom stopwords
month <- c('January', 'February', 'March', 'April', 'May', 'June',
           'July', 'August', 'September', 'October', 'November', 'December')
day <- c('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday')
agency <- c('AP', 'AFP', 'Reuters')

# Select training period
sub_corp <- corpus_subset(corp, '2014-01-01' <= date & date <= '2014-12-31')

# Tokenize
toks <- tokens(sub_corp)
toks <- tokens_remove(toks, stopwords('english'), valuetype = 'fixed', padding = TRUE)
toks <- tokens_remove(toks, c(month, day, agency), valuetype = 'fixed', padding = TRUE)

# quanteda v1.5 introduced 'nested_scope' to reduce ambiguity in dictionary lookup
toks_label <- tokens_lookup(toks, data_dictionary_newsmap_en, 
                            levels = 3, nested_scope = "dictionary")
dfmt_label <- dfm(toks_label)

# Keep only capitalized words (mostly proper nouns) as model features
dfmt_feat <- dfm(toks, tolower = FALSE)
dfmt_feat <- dfm_select(dfmt_feat, pattern = '^[A-Z][A-Za-z1-2]+', selection = "keep",
                        valuetype = 'regex', case_insensitive = FALSE)
dfmt_feat <- dfm_trim(dfmt_feat, min_termfreq = 10)

model <- textmodel_newsmap(dfmt_feat, dfmt_label)

# Features with largest weights
coef(model, n = 7)[c("us", "gb", "fr", "br", "jp")]
```
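
The feature selection step above relies on the regular expression `'^[A-Z][A-Za-z1-2]+'`. The following sketch shows which kinds of tokens such a pattern keeps. It uses base R's `grepl()` with `perl = TRUE` as a stand-in for quanteda's matching, which is an assumption of this example, and the word list is made up.

```r
pattern <- '^[A-Z][A-Za-z1-2]+'
words <- c("Tokyo", "tokyo", "UK", "A", "G20", "G7", "the")

# A token is kept if it starts with an uppercase letter followed by at least
# one more letter or one of the digits 1 and 2.
keep <- grepl(pattern, words, perl = TRUE)
words[keep]
```

Lowercase words and single capital letters are dropped, so the model is trained mostly on names of places, people, and organizations.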

### Predict geographical focus of texts 

```{r}
pred_data <- data.frame(text = as.character(sub_corp), country = predict(model))
```

```{r echo=FALSE}
knitr::kable(head(pred_data))
```
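
Once predictions are available, a natural next step is to count how many documents were assigned to each country. The sketch below does this in base R on a made-up data frame, `pred_toy`, which stands in for `pred_data`. It assumes that `country` holds country codes and may contain missing values.

```r
# Hypothetical stand-in for `pred_data`; the values are invented.
pred_toy <- data.frame(
  text = c("a", "b", "c", "d", "e"),
  country = factor(c("us", "gb", "us", NA, "fr"))
)

# Count documents per predicted country, most frequent first;
# useNA = "ifany" also reports documents without a prediction.
counts <- sort(table(pred_toy$country, useNA = "ifany"), decreasing = TRUE)
counts
```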

Owner

  • Name: Kohei Watanabe
  • Login: koheiw
  • Kind: user
  • Location: Japan

Data analyst specializing in political and financial texts

GitHub Events

Total
  • Watch event: 5
  • Issue comment event: 7
  • Pull request event: 2
  • Fork event: 2
Last Year
  • Watch event: 5
  • Issue comment event: 7
  • Pull request event: 2
  • Fork event: 2

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 349
  • Total Committers: 14
  • Avg Commits per committer: 24.929
  • Development Distribution Score (DDS): 0.097
Top Committers
Name                  Email            Commits
Kohei Watanabe        w****i@g****m    315
Stefan Müller         m****s@t****e    7
Chung-hong Chan       c****y@g****m    5
bah-elly              7****y@u****m    5
daiyamao              5****o@u****m    3
eladseg               5****g@u****m    3
Giuseppe Carteny      4****y@u****m    2
Dani Madrid-Morales   d****d@m****k    2
kbenoit               k****t@l****k    2
Giuseppe Carteny      g****y@u****t    1
KT01                  4****1@u****m    1
Dani Madrid-Morales   d****e@c****u    1
Lanabi                3****i@u****m    1
Ke Cheng              k****c@g****m    1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 20
  • Total pull requests: 60
  • Average time to close issues: 9 months
  • Average time to close pull requests: 21 days
  • Total issue authors: 6
  • Total pull request authors: 11
  • Average comments per issue: 3.45
  • Average comments per pull request: 0.97
  • Merged pull requests: 53
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • koheiw (14)
  • chainsawriot (2)
  • JBGruber (1)
  • SCommain (1)
  • giucarny (1)
  • R01010010R (1)
Pull Request Authors
  • koheiw (46)
  • eladseg (5)
  • LungtaSEKI (4)
  • danimadrid (3)
  • kbenoit (3)
  • giucarny (2)
  • stefan-mueller (2)
  • daiyamao (2)
  • kecheng-ac (1)
  • bah-elly (1)
  • chainsawriot (1)
  • yuanzhouIR (1)
Top Labels
Issue Labels
dictionary (6), meta (2), help wanted (1), question (1), bug (1)

Packages

  • Total packages: 1
  • Total downloads:
    • cran 606 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 16
  • Total maintainers: 1
cran.r-project.org: newsmap

Semi-Supervised Model for Geographical Document Classification

  • Versions: 16
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 606 Last month
Rankings
Forks count: 3.8%
Stargazers count: 6.1%
Average: 16.3%
Downloads: 19.1%
Dependent repos count: 23.8%
Dependent packages count: 28.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.5 depends
  • methods * depends
  • Matrix * imports
  • quanteda >= 2.1 imports
  • quanteda.textstats * imports
  • stringi * imports
  • utils * imports
  • testthat * suggests
.github/workflows/check-standard.yaml actions
  • actions/checkout v2 composite
  • actions/upload-artifact main composite
  • r-lib/actions/check-r-package v1 composite
  • r-lib/actions/setup-pandoc v1 composite
  • r-lib/actions/setup-r v1 composite
  • r-lib/actions/setup-r-dependencies v1 composite
.github/workflows/test-coverage.yaml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • r-lib/actions/setup-pandoc v1 composite
  • r-lib/actions/setup-r v1 composite