textstem

Tools for fast text stemming & lemmatization

https://github.com/trinker/textstem

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 3 committers (33.3%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.0%) to scientific vocabulary

Keywords

lemmatization r stemming text-mining

Keywords from Contributors

sentiment-analysis text-analysis
Last synced: 6 months ago

Repository

Tools for fast text stemming & lemmatization

Basic Info
  • Host: GitHub
  • Owner: trinker
  • Language: R
  • Default Branch: master
  • Homepage:
  • Size: 178 KB
Statistics
  • Stars: 44
  • Watchers: 6
  • Forks: 9
  • Open Issues: 6
  • Releases: 0
Topics
lemmatization r stemming text-mining
Created about 9 years ago · Last pushed over 7 years ago
Metadata Files
Readme

README.Rmd

---
title: "textstem"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  md_document:
    toc: true      
---

```{r, echo=FALSE, warning=FALSE}
desc <- suppressWarnings(readLines("DESCRIPTION"))
regex <- "(^Version:\\s+)(\\d+\\.\\d+\\.\\d+)"
loc <- grep(regex, desc)
ver <- gsub(regex, "\\2", desc[loc])
verbadge <- sprintf('Version', ver, ver)
verbadge <- ''
pacman::p_load(textstem)
pacman::p_load_current_gh('trinker/numform')
nr <- numform::f_comma(length(presidential_debates_2012$dialogue))
nw <- numform::f_comma(sum(stringi::stri_count_words(presidential_debates_2012$dialogue), na.rm = TRUE))
```

[![Project Status: Active - The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/0.1.0/active.svg)](https://www.repostatus.org/#active)
[![Build Status](https://travis-ci.org/trinker/textstem.svg?branch=master)](https://travis-ci.org/trinker/textstem)
[![Coverage Status](https://coveralls.io/repos/trinker/textstem/badge.svg?branch=master)](https://coveralls.io/r/trinker/textstem?branch=master)
[![](https://cranlogs.r-pkg.org/badges/textstem)](https://cran.r-project.org/package=textstem)
`r verbadge`

![](tools/textstem_logo/r_textstem.png)

**textstem** is a tool-set for stemming and lemmatizing words. Stemming is a process that removes affixes. Lemmatization is the process of grouping inflected forms together as a single base form.

# Functions

The main functions, task category, & descriptions are summarized in the table below:

| Function                | Task        | Description                                |
|-------------------------|-------------|--------------------------------------------|
| `stem_words`            | stemming    | Stem words                                 |
| `stem_strings`          | stemming    | Stem strings                               |
| `lemmatize_words`       | lemmatizing | Lemmatize words                            |
| `lemmatize_strings`     | lemmatizing | Lemmatize strings                          |
| `make_lemma_dictionary` | lemmatizing | Generate a dictionary of lemmas for a text |

# Installation

To download the development version of **textstem**:

Download the [zip ball](https://github.com/trinker/textstem/zipball/master) or [tar ball](https://github.com/trinker/textstem/tarball/master), decompress and run `R CMD INSTALL` on it, or use the **pacman** package to install the development version:

```r
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/textstem")
```

# Contact

You are welcome to:

- submit suggestions and bug-reports at:
- send a pull request on:
- compose a friendly e-mail to:

# Examples

The following examples demonstrate some of the functionality of **textstem**.

## Load the Tools/Data

```{r, message=FALSE, warning=FALSE}
if (!require("pacman")) install.packages("pacman")
pacman::p_load(textstem, dplyr)

data(presidential_debates_2012)
```

## Stemming Versus Lemmatizing

Before moving into the meat of these examples, let's highlight the difference between stemming and lemmatizing.

### "Drive" Stemming vs. Lemmatizing

```{r}
dw <- c('driver', 'drive', 'drove', 'driven', 'drives', 'driving')

stem_words(dw)
lemmatize_words(dw)
```

### "Be" Stemming vs. Lemmatizing

```{r}
bw <- c('are', 'am', 'being', 'been', 'be')

stem_words(bw)
lemmatize_words(bw)
```

## Stemming

Stemming is the act of removing inflections from a word; the result is not necessarily ["identical to the morphological root of the word" (wikipedia)](https://en.wikipedia.org/wiki/Stemming). Below I show stemming of several small strings.

```{r}
y <- c(
    'the dirtier dog has eaten the pies',
    'that shameful pooch is tricky and sneaky',
    "He opened and then reopened the food bag",
    'There are skies of blue and red roses too!',
    NA,
    "The doggies, well they aren't joyfully running.",
    "The daddies are coming over...",
    "This is 34.546 above"
)

stem_strings(y)
```

## Lemmatizing

### Default Lemma Dictionary

Lemmatizing is the ["grouping together the inflected forms of a word so they can be analysed as a single item" (wikipedia)](https://en.wikipedia.org/wiki/Lemmatisation). In the example below I reduce the strings to their lemma form. `lemmatize_strings` uses a lookup dictionary. The default uses [Mechura's (2016)](http://www.lexiconista.com) English lemmatization list available from the [**lexicon**](https://cran.r-project.org/package=lexicon) package. The `make_lemma_dictionary` function contains two additional engines for generating a lemma lookup table for use in `lemmatize_strings`.

```{r}
y <- c(
    'the dirtier dog has eaten the pies',
    'that shameful pooch is tricky and sneaky',
    "He opened and then reopened the food bag",
    'There are skies of blue and red roses too!',
    NA,
    "The doggies, well they aren't joyfully running.",
    "The daddies are coming over...",
    "This is 34.546 above"
)

lemmatize_strings(y)
```

### Hunspell Lemma Dictionary

This lemmatization uses the [**hunspell**](https://CRAN.R-project.org/package=hunspell) package to generate lemmas.

```{r}
lemma_dictionary_hs <- make_lemma_dictionary(y, engine = 'hunspell')
lemmatize_strings(y, dictionary = lemma_dictionary_hs)
```

### koRpus Lemma Dictionary

This lemmatization uses the [**koRpus**](https://CRAN.R-project.org/package=koRpus) package and the [TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) program to generate lemmas. You'll have to get TreeTagger set up, preferably in your machine's root directory.

```{r}
lemma_dictionary_tt <- make_lemma_dictionary(y, engine = 'treetagger')
lemmatize_strings(y, lemma_dictionary_tt)
```

### Lemmatization Speed

It's pretty fast too. Observe:

```{r}
tic <- Sys.time()

presidential_debates_2012$dialogue %>%
    lemmatize_strings() %>%
    head()

(toc <- Sys.time() - tic)
```

That's `r nr` rows of text, or `r nw` words, in `r round(as.numeric(toc), 2)` seconds.

## Combine With Other Text Tools

This example shows how stemming/lemmatizing might be complemented by other text tools such as `replace_contraction` from the **textclean** package.

```{r}
library(textclean)

'aren\'t' %>%
    lemmatize_strings()

'aren\'t' %>%
    textclean::replace_contraction() %>%
    lemmatize_strings()
```
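
## Sketch: Lemmatizing as a Dictionary Lookup

Since lemmatization here is driven by a lookup dictionary, the idea can be sketched in a few lines of base R. This is only an illustration of the lookup concept, not how **textstem** is implemented: the `toy_lemmas` table and the `lookup_lemmas` helper are hypothetical names made up for this example, and the real default dictionary comes from the **lexicon** package.

```r
# Hypothetical toy dictionary: first column is the inflected token (lower case),
# second column is the lemma it should be replaced with.
toy_lemmas <- data.frame(
    token = c('drove', 'driven', 'drives', 'driving', 'am', 'are', 'been', 'being'),
    lemma = c('drive', 'drive', 'drive', 'drive', 'be', 'be', 'be', 'be'),
    stringsAsFactors = FALSE
)

# Hypothetical helper: replace each word with its lemma when the lower-cased
# word is found in the dictionary; otherwise return the word unchanged.
lookup_lemmas <- function(words, dictionary) {
    idx <- match(tolower(words), dictionary$token)
    ifelse(is.na(idx), words, dictionary$lemma[idx])
}

lookup_lemmas(c('driver', 'drove', 'being', NA), toy_lemmas)
# "driver" "drive" "be" NA
```

Words missing from the table (here `'driver'`) and `NA` values pass through untouched, which is why the coverage of the dictionary matters more than the lookup itself.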

Owner

  • Name: Tyler Rinker
  • Login: trinker
  • Kind: user
  • Location: Buffalo, NY
  • Company: Anthology

Director, Data Scientist, open-source developer, #rstats enthusiast, #dataviz geek, and #nlp buff

GitHub Events

Total
  • Watch event: 1
  • Fork event: 1
Last Year
  • Watch event: 1
  • Fork event: 1

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 45
  • Total Committers: 3
  • Avg Commits per committer: 15.0
  • Development Distribution Score (DDS): 0.133
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Tyler t****r@g****m 39
Tyler Rinker t****r@c****m 5
Kenneth Benoit k****t@l****k 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 12
  • Total pull requests: 2
  • Average time to close issues: 4 days
  • Average time to close pull requests: about 4 hours
  • Total issue authors: 8
  • Total pull request authors: 2
  • Average comments per issue: 0.75
  • Average comments per pull request: 2.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • trinker (5)
  • mlinegar (1)
  • nateGeorge (1)
  • santoshbs (1)
  • jonathanbratt (1)
  • GabriellaS-K (1)
  • oguzozbay (1)
  • JaySLee (1)
Pull Request Authors
  • jlricon (1)
  • kbenoit (1)

Packages

  • Total packages: 2
  • Total downloads:
    • cran 2,017 last-month
  • Total docker downloads: 178
  • Total dependent packages: 2
    (may contain duplicates)
  • Total dependent repositories: 10
    (may contain duplicates)
  • Total versions: 4
  • Total maintainers: 1
cran.r-project.org: textstem

Tools for Stemming and Lemmatizing Text

  • Versions: 3
  • Dependent Packages: 2
  • Dependent Repositories: 10
  • Downloads: 2,017 Last month
  • Docker Downloads: 178
Rankings
Stargazers count: 7.8%
Forks count: 7.9%
Downloads: 8.7%
Dependent repos count: 9.2%
Average: 11.9%
Dependent packages count: 13.7%
Docker downloads count: 24.2%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: r-textstem
  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent repos count: 34.0%
Stargazers count: 39.4%
Average: 42.3%
Forks count: 44.7%
Dependent packages count: 51.2%
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.3.0 depends
  • koRpus.lang.en * depends
  • SnowballC * imports
  • dplyr * imports
  • hunspell * imports
  • koRpus * imports
  • lexicon >= 0.4.1 imports
  • quanteda >= 0.99.12 imports
  • stats * imports
  • stringi * imports
  • textclean * imports
  • textshape * imports
  • utils * imports
  • testthat * suggests