rainette

R implementation of the Reinert text clustering method

https://github.com/juba/rainette

Science Score: 59.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
    1 of 4 committers (25.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.8%) to scientific vocabulary

Keywords

r text-analysis text-classification
Last synced: 6 months ago

Repository

R implementation of the Reinert text clustering method

Basic Info
Statistics
  • Stars: 57
  • Watchers: 4
  • Forks: 7
  • Open Issues: 5
  • Releases: 4
Topics
r text-analysis text-classification
Created over 7 years ago · Last pushed almost 2 years ago
Metadata Files
Readme Changelog

README.md

Rainette


Rainette is an R package which implements a variant of the Reinert textual clustering method. This method is also available in other software such as Iramuteq (free software) or Alceste (commercial, closed source).

Features

  • Simple and double clustering algorithms
  • Plot functions and shiny interfaces to visualise and explore clustering results
  • Utility functions to split a corpus into segments or import a corpus in Iramuteq format

Installation

The package is installable from CRAN.

```r
install.packages("rainette")
```

The development version is installable from R-universe.

```r
install.packages("rainette", repos = "https://juba.r-universe.dev")
```

Usage

Let's start with an example corpus provided by the excellent quanteda package.

```r
library(quanteda)
data_corpus_inaugural
```

First, we'll use split_segments() to split each document into segments of about 40 words (punctuation is taken into account).

```r
corpus <- split_segments(data_corpus_inaugural, segment_size = 40)
```
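
As a quick sanity check, quanteda's ndoc() can be used to compare the number of documents before and after segmentation:

```r
# Each inaugural address is split into segments of roughly 40 words,
# so the segmented corpus contains many more "documents" than the original
ndoc(data_corpus_inaugural)
ndoc(corpus)
```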

Next, we'll apply some preprocessing and compute a document-term matrix with quanteda functions.

```r
tok <- tokens(corpus, remove_punct = TRUE)
tok <- tokens_remove(tok, stopwords("en"))
dtm <- dfm(tok, tolower = TRUE)
dtm <- dfm_trim(dtm, min_docfreq = 10)
```
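
Before clustering, it can be useful to check the size and content of the resulting matrix with quanteda helpers such as ndoc(), nfeat() and topfeatures():

```r
# Number of segments and number of features kept after trimming
ndoc(dtm)
nfeat(dtm)
# Most frequent features retained in the matrix
topfeatures(dtm, 10)
```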

We can then apply a simple clustering on this matrix with the rainette() function. We specify the number of clusters (k) and the minimum number of forms per segment (min_segment_size). Segments which do not include enough forms are merged with the following or previous segment when possible.

```r
res <- rainette(dtm, k = 6, min_segment_size = 15)
```

We can use the rainette_explor() shiny interface to visualise and explore the different clusterings at each k.

```r
rainette_explor(res, dtm, corpus)
```

rainette_explor() interface

The Cluster documents tab allows browsing and filtering the documents in each cluster.

rainette_explor() documents tab

We can also directly generate the cluster description plot for a given k with rainette_plot().

```r
rainette_plot(res, dtm, k = 5)
```

Or cut the tree at a chosen k and add a cluster membership variable to our corpus metadata.

```r
docvars(corpus)$cluster <- cutree(res, k = 5)
```
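
As a minimal illustration, the new membership variable can then be summarised or used like any other document-level metadata with base R and quanteda (the cluster number below is arbitrary):

```r
# Number of segments in each cluster
table(docvars(corpus)$cluster, useNA = "ifany")

# Keep only the segments assigned to cluster 3
cluster3 <- corpus_subset(corpus, cluster == 3)
```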

In addition, we can perform a double clustering, i.e. two simple clusterings computed with different min_segment_size values which are then "crossed" to generate more robust clusters. To do this, we apply rainette2() to two rainette() results:

```r
res1 <- rainette(dtm, k = 5, min_segment_size = 10)
res2 <- rainette(dtm, k = 5, min_segment_size = 15)
res <- rainette2(res1, res2, max_k = 5)
```

We can then use rainette2_explor() to explore and visualise the results.

```r
rainette2_explor(res, dtm, corpus)
```

rainette2_explor() interface
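
Cluster membership can also be extracted from a double clustering result. A hedged sketch, assuming cutree() accepts rainette2 objects in the same way as simple clustering results:

```r
# Assumed usage: cutree() on a rainette2 result
docvars(corpus)$cluster_double <- cutree(res, k = 5)

# Segments not assigned to any crossed cluster may be NA
table(docvars(corpus)$cluster_double, useNA = "ifany")
```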

Tell me more

Two vignettes are available:

Credits

This clustering method was created by Max Reinert and is described in several articles, notably:

Thanks to Pierre Ratinaud, the author of Iramuteq, for releasing it as free and open source software. Even though the R code has been almost entirely rewritten, Iramuteq was a valuable resource for understanding the algorithms.

Many thanks to Sébastien Rochette for the creation of the hex logo.

Many thanks to Florian Privé for his work on rewriting and optimizing the Rcpp code.

Owner

  • Name: Julien Barnier
  • Login: juba
  • Kind: user
  • Location: Villeurbanne, France

GitHub Events

Total
  • Issues event: 1
  • Issue comment event: 2
Last Year
  • Issues event: 1
  • Issue comment event: 2

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 517
  • Total Committers: 4
  • Avg Commits per committer: 129.25
  • Development Distribution Score (DDS): 0.01
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Julien j****n@n****g 512
Florian Privé f****1@g****m 3
kbenoit k****t@l****k 1
lvaudor l****r@e****r 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 35
  • Total pull requests: 3
  • Average time to close issues: 16 days
  • Average time to close pull requests: about 23 hours
  • Total issue authors: 11
  • Total pull request authors: 3
  • Average comments per issue: 3.06
  • Average comments per pull request: 1.67
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • gabrielparriaux (19)
  • JacquesAntoine (4)
  • manubonnet (3)
  • wilcar (2)
  • romane-lry (1)
  • claireINRS (1)
  • CreaPolitics (1)
  • yoannfol (1)
  • ghost (1)
  • jackobenco016 (1)
  • juba (1)
Pull Request Authors
  • lvaudor (2)
  • privefl (1)
  • kbenoit (1)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cran 289 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 9
  • Total maintainers: 1
cran.r-project.org: rainette

The Reinert Method for Textual Data Clustering

  • Versions: 9
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 289 Last month
Rankings
Stargazers count: 6.8%
Forks count: 10.8%
Average: 20.9%
Dependent repos count: 23.9%
Dependent packages count: 28.7%
Downloads: 34.1%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.6.0 depends
  • RSpectra * imports
  • Rcpp >= 1.0.3 imports
  • dendextend * imports
  • dplyr >= 1.0.0 imports
  • ggplot2 * imports
  • ggwordcloud * imports
  • gridExtra * imports
  • highr * imports
  • miniUI * imports
  • progressr * imports
  • purrr * imports
  • quanteda >= 2.1 imports
  • quanteda.textstats * imports
  • rlang * imports
  • shiny * imports
  • stringr * imports
  • tidyr * imports
  • FNN * suggests
  • knitr * suggests
  • quanteda.textmodels * suggests
  • rmarkdown * suggests
  • testthat * suggests
  • tm * suggests
  • vdiffr * suggests
.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v3 composite
  • actions/upload-artifact main composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pkgdown.yaml actions
  • JamesIves/github-pages-deploy-action v4.4.1 composite
  • actions/checkout v3 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite