rainette

R implementation of the Reinert text clustering method

https://github.com/juba/rainette

Science Score: 59.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
    1 of 4 committers (25.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.8%) to scientific vocabulary

Keywords

r text-analysis text-classification
Last synced: 6 months ago

Repository

R implementation of the Reinert text clustering method

Basic Info
Statistics
  • Stars: 57
  • Watchers: 4
  • Forks: 7
  • Open Issues: 5
  • Releases: 4
Topics
r text-analysis text-classification
Created over 7 years ago · Last pushed almost 2 years ago
Metadata Files
Readme Changelog

README.md

Rainette


Rainette is an R package which implements a variant of the Reinert textual clustering method. This method is also available in other software such as Iramuteq (free software) or Alceste (commercial, closed source).

Features

  • Simple and double clustering algorithms
  • Plot functions and shiny interfaces to visualise and explore clustering results
  • Utility functions to split a corpus into segments or import a corpus in Iramuteq format

Installation

The package is installable from CRAN.

```r
install.packages("rainette")
```

The development version is installable from R-universe.

```r
install.packages("rainette", repos = "https://juba.r-universe.dev")
```

Usage

Let's start with an example corpus provided by the excellent quanteda package.

```r
library(quanteda)
data_corpus_inaugural
```

First, we'll use split_segments() to split each document into segments of about 40 words (punctuation is taken into account).

```r
corpus <- split_segments(data_corpus_inaugural, segment_size = 40)
```
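
As a quick sanity check, quanteda's ndoc() can be used to compare the number of documents before and after segmentation:

```r
# Each inaugural address is split into segments of roughly 40 words,
# so the segmented corpus contains many more "documents" than the original
ndoc(data_corpus_inaugural)
ndoc(corpus)
```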

Next, we'll apply some preprocessing and compute a document-term matrix with quanteda functions.

```r
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
dtm <- dfm_trim(dtm, min_docfreq = 10)
```
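
Before clustering, it can be useful to check the size and content of the resulting matrix with quanteda helpers such as ndoc(), nfeat() and topfeatures():

```r
# Number of segments and number of features kept after trimming
ndoc(dtm)
nfeat(dtm)
# Most frequent features retained in the matrix
topfeatures(dtm, 10)
```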

We can then apply a simple clustering on this matrix with the rainette() function. We specify the number of clusters (k) and the minimum number of forms per segment (min_segment_size). Segments which do not include enough forms are merged with the following or previous segment when possible.

```r
res <- rainette(dtm, k = 6, min_segment_size = 15)
```

We can use the rainette_explor() shiny interface to visualise and explore the different clusterings at each k.

```r
rainette_explor(res, dtm, corpus)
```

rainette_explor() interface

The Cluster documents tab allows browsing and filtering the documents in each cluster.

rainette_explor() documents tab

We can also directly generate the cluster description plot for a given k with rainette_plot().

```r
rainette_plot(res, dtm, k = 5)
```

Or cut the tree at a chosen k and add a cluster membership variable to our corpus metadata.

```r
docvars(corpus)$cluster <- cutree(res, k = 5)
```
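
As a minimal illustration, the new membership variable can then be summarised or used like any other document-level metadata with base R and quanteda (the cluster number below is arbitrary):

```r
# Number of segments in each cluster
table(docvars(corpus)$cluster, useNA = "ifany")

# Keep only the segments assigned to cluster 3
cluster3 <- corpus_subset(corpus, cluster == 3)
```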

In addition, we can perform a double clustering, i.e. two simple clusterings computed with different min_segment_size values which are then "crossed" to generate more robust clusters. To do this, we apply rainette2() to two rainette() results:

```r
res1 <- rainette(dtm, k = 5, min_segment_size = 10)
res2 <- rainette(dtm, k = 5, min_segment_size = 15)
res <- rainette2(res1, res2, max_k = 5)
```

We can then use rainette2_explor() to explore and visualise the results.

```r
rainette2_explor(res, dtm, corpus)
```

rainette2_explor() interface
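
Cluster membership can also be extracted from a double clustering result. A hedged sketch, assuming cutree() accepts rainette2 objects in the same way as simple clustering results:

```r
# Assumed usage: cutree() on a rainette2 result
docvars(corpus)$cluster_double <- cutree(res, k = 5)

# Segments not assigned to any crossed cluster may be NA
table(docvars(corpus)$cluster_double, useNA = "ifany")
```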

Tell me more

Two vignettes are available:

Credits

This clustering method was created by Max Reinert and is described in several articles, notably:

Thanks to Pierre Ratinaud, the author of Iramuteq, for releasing it as free and open source software. Even though the R code has been almost entirely rewritten, Iramuteq was a valuable resource for understanding the algorithms.

Many thanks to Sébastien Rochette for the creation of the hex logo.

Many thanks to Florian Privé for his work on rewriting and optimizing the Rcpp code.

Owner

  • Name: Julien Barnier
  • Login: juba
  • Kind: user
  • Location: Villeurbanne, France

GitHub Events

Total
  • Issues event: 1
  • Issue comment event: 2
Last Year
  • Issues event: 1
  • Issue comment event: 2

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 517
  • Total Committers: 4
  • Avg Commits per committer: 129.25
  • Development Distribution Score (DDS): 0.01
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Julien j****n@n****g 512
Florian Privé f****1@g****m 3
kbenoit k****t@l****k 1
lvaudor l****r@e****r 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 35
  • Total pull requests: 3
  • Average time to close issues: 16 days
  • Average time to close pull requests: about 23 hours
  • Total issue authors: 11
  • Total pull request authors: 3
  • Average comments per issue: 3.06
  • Average comments per pull request: 1.67
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • gabrielparriaux (19)
  • JacquesAntoine (4)
  • manubonnet (3)
  • wilcar (2)
  • romane-lry (1)
  • claireINRS (1)
  • CreaPolitics (1)
  • yoannfol (1)
  • ghost (1)
  • jackobenco016 (1)
  • juba (1)
Pull Request Authors
  • lvaudor (2)
  • privefl (1)
  • kbenoit (1)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cran 289 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 9
  • Total maintainers: 1
cran.r-project.org: rainette

The Reinert Method for Textual Data Clustering

  • Versions: 9
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 289 Last month
Rankings
Stargazers count: 6.8%
Forks count: 10.8%
Average: 20.9%
Dependent repos count: 23.9%
Dependent packages count: 28.7%
Downloads: 34.1%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.6.0 depends
  • RSpectra * imports
  • Rcpp >= 1.0.3 imports
  • dendextend * imports
  • dplyr >= 1.0.0 imports
  • ggplot2 * imports
  • ggwordcloud * imports
  • gridExtra * imports
  • highr * imports
  • miniUI * imports
  • progressr * imports
  • purrr * imports
  • quanteda >= 2.1 imports
  • quanteda.textstats * imports
  • rlang * imports
  • shiny * imports
  • stringr * imports
  • tidyr * imports
  • FNN * suggests
  • knitr * suggests
  • quanteda.textmodels * suggests
  • rmarkdown * suggests
  • testthat * suggests
  • tm * suggests
  • vdiffr * suggests
.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v3 composite
  • actions/upload-artifact main composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pkgdown.yaml actions
  • JamesIves/github-pages-deploy-action v4.4.1 composite
  • actions/checkout v3 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite