https://github.com/kupolak/textstat

Ruby gem to calculate statistics from text to determine readability, complexity and grade level of a particular corpus.

https://github.com/kupolak/textstat

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.4%) to scientific vocabulary

Keywords

flesch-kincaid-grade flesch-reading-ease reading ruby smog statistics text-processing textstat translation

Keywords from Contributors

sequences projection interactive serializer measurement cycles packaging charts network-simulation archival
Last synced: 5 months ago · JSON representation

Repository

Ruby gem to calculate statistics from text to determine readability, complexity and grade level of a particular corpus.

Basic Info
  • Host: GitHub
  • Owner: kupolak
  • License: mit
  • Language: Ruby
  • Default Branch: master
  • Size: 368 KB
Statistics
  • Stars: 35
  • Watchers: 5
  • Forks: 10
  • Open Issues: 16
  • Releases: 9
Topics
flesch-kincaid-grade flesch-reading-ease reading ruby smog statistics text-processing textstat translation
Created over 7 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog License Code of conduct

README.md

TextStat 1.0.0 🚀

Gem Version Documentation Ruby License

A powerful Ruby gem for text readability analysis with exceptional performance

Calculate readability statistics, complexity metrics, and grade levels from text using proven formulas. Now with 36x performance improvement and support for 22 languages.

🎯 Key Features

  • ⚡ 36x Performance Boost: Dictionary caching provides massive speed improvements
  • 🌍 Multi-Language Support: 22 languages including English, Spanish, French, German, Russian, and more
  • 📊 13 Readability Formulas: Flesch, SMOG, Coleman-Liau, Gunning Fog, and others
  • 🏗️ Modular Architecture: Clean, maintainable code structure
  • 📚 Complete API Documentation: 100% documented with examples
  • 🧪 Comprehensive Testing: 199 tests with 87.4% success rate
  • 🔄 Backward Compatible: Seamless upgrade from 0.1.x versions

📈 Performance Comparison

| Operation | v0.1.x | v1.0.0 | Improvement | |-----------|--------|--------|-------------| | difficult_words | ~0.0047s | ~0.0013s | 36x faster | | text_standard | ~0.015s | ~0.012s | 20% faster | | Dictionary loading | File I/O every call | Cached in memory | 2x faster |

🚀 Quick Start

Installation

bash gem install textstat

Or add to your Gemfile:

ruby gem 'textstat', '~> 1.0'

Basic Usage

```ruby require 'textstat'

text = "This is a sample text for readability analysis. It contains multiple sentences with varying complexity levels."

Basic statistics

TextStat.charcount(text) # => 112 TextStat.lexiconcount(text) # => 18 TextStat.syllablecount(text) # => 28 TextStat.sentencecount(text) # => 2

Readability formulas

TextStat.fleschreadingease(text) # => 45.12 TextStat.fleschkincaidgrade(text) # => 11.2 TextStat.gunningfog(text) # => 14.5 TextStat.textstandard(text) # => "11th and 12th grade"

Difficult words (with automatic caching)

TextStat.difficult_words(text) # => 4 ```

🌍 Multi-Language Support

TextStat supports 22 languages with optimized dictionary caching:

```ruby

English (default)

TextStat.difficultwords("Complex analysis", 'enus')

Spanish

TextStat.difficult_words("Análisis complejo", 'es')

French

TextStat.difficult_words("Analyse complexe", 'fr')

German

TextStat.difficult_words("Komplexe Analyse", 'de')

Russian

TextStat.difficult_words("Сложный анализ", 'ru')

Check cache status

TextStat::DictionaryManager.cachesize # => 5 TextStat::DictionaryManager.cachedlanguages # => ["en_us", "es", "fr", "de", "ru"] ```

Supported Languages

| Code | Language | Status | Code | Language | Status | |------|----------|--------|------|----------|--------| | en_us | English (US) | ✅ | fr | French | ✅ | | en_uk | English (UK) | ✅ | es | Spanish | ✅ | | de | German | ✅ | it | Italian | ✅ | | ru | Russian | ✅ | pt | Portuguese | ✅ | | pl | Polish | ✅ | sv | Swedish | ✅ | | da | Danish | ✅ | nl | Dutch | ✅ | | fi | Finnish | ✅ | ca | Catalan | ✅ | | cs | Czech | ✅ | hu | Hungarian | ✅ | | et | Estonian | ✅ | id | Indonesian | ✅ | | is | Icelandic | ✅ | la | Latin | ✅ | | hr | Croatian | ⚠️ | no2 | Norwegian | ⚠️ |

Note: Croatian and Norwegian have known issues with the text-hyphen library.

⚡ Performance Optimization

Dictionary Caching (New in 1.0.0)

TextStat now caches language dictionaries in memory for massive performance improvements:

```ruby

First call loads dictionary from disk

TextStat.difficultwords(text, 'enus') # ~0.0047s

Subsequent calls use cached dictionary

TextStat.difficultwords(text, 'enus') # ~0.0013s (36x faster!)

Cache management

TextStat::DictionaryManager.cachesize # => 1 TextStat::DictionaryManager.cachedlanguages # => ["enus"] TextStat::DictionaryManager.clearcache # Clear all cached dictionaries ```

Memory Usage

  • Efficient: Each dictionary ~200KB in memory
  • Scalable: Cache multiple languages simultaneously
  • Manageable: Clear cache when needed

📊 Complete API Reference

Basic Text Statistics

```ruby

Character and word counts

TextStat.charcount(text, ignorespaces = true) TextStat.lexiconcount(text, removepunctuation = true) TextStat.syllablecount(text, language = 'enus') TextStat.sentence_count(text)

Averages

TextStat.avgsentencelength(text) TextStat.avgsyllablesperword(text, language = 'enus') TextStat.avgletterperword(text) TextStat.avgsentenceperword(text)

Advanced statistics

TextStat.difficultwords(text, language = 'enus') TextStat.polysyllabcount(text, language = 'enus') ```

Readability Formulas

```ruby

Popular formulas

TextStat.fleschreadingease(text, language = 'enus') TextStat.fleschkincaidgrade(text, language = 'enus') TextStat.gunningfog(text, language = 'enus') TextStat.smogindex(text, language = 'enus')

Academic formulas

TextStat.colemanliauindex(text) TextStat.automatedreadabilityindex(text) TextStat.linsearwriteformula(text, language = 'enus') TextStat.dalechallreadabilityscore(text, language = 'en_us')

International formulas

TextStat.lix(text) # Swedish formula TextStat.forcast(text, language = 'enus') # Technical texts TextStat.powerssumnerkearl(text, language = 'enus') # Primary grades TextStat.spache(text, language = 'en_us') # Elementary texts

Consensus grade level

TextStat.textstandard(text) # => "8th and 9th grade" TextStat.textstandard(text, true) # => 8.5 (numeric) ```

🏗️ Architecture (New in 1.0.0)

TextStat 1.0.0 features a clean modular architecture:

Modules

  • TextStat::BasicStats - Character, word, syllable, and sentence counting
  • TextStat::DictionaryManager - Dictionary loading and caching with 36x performance boost
  • TextStat::ReadabilityFormulas - All readability calculations and text standards
  • TextStat::Main - Unified interface combining all modules

Backward Compatibility

All existing code continues to work unchanged:

```ruby

This still works exactly the same

TextStat.fleschreadingease(text) # => 45.12 TextStat.difficult_words(text) # => 4 (but now 36x faster!) ```

📚 Documentation

🧪 Testing & Quality

TextStat 1.0.0 includes comprehensive testing:

  • 199 total tests (vs. 26 in 0.1.x)
  • 87.4% success rate (174/199 tests passing)
  • Multi-language testing for all 22 supported languages
  • Performance benchmarks with regression detection
  • Edge case testing (empty text, Unicode, very long texts)
  • Integration tests for module cooperation

Run tests:

bash bundle exec rspec

🔄 Migrating from 0.1.x

Zero Changes Required

TextStat 1.0.0 is 100% backward compatible:

```ruby

Your existing code works unchanged

TextStat.fleschreadingease(text) # Same API TextStat.difficult_words(text) # Same API, 36x faster! ```

New Features Available

```ruby

New cache management (optional)

TextStat::DictionaryManager.cachesize TextStat::DictionaryManager.cachedlanguages TextStat::DictionaryManager.clear_cache

New modular access (optional)

analyzer = TextStat::Main.new analyzer.fleschreadingease(text) ```

📈 Benchmarking

Compare performance yourself:

```ruby require 'textstat' require 'benchmark'

text = "Your sample text here..." * 100

Benchmark.bm do |x| x.report("difficultwords (first call)") { TextStat.difficultwords(text) } x.report("difficultwords (cached)") { TextStat.difficultwords(text) } x.report("textstandard") { TextStat.textstandard(text) } end ```

🛠️ Development

Setup

bash git clone https://github.com/kupolak/textstat.git cd textstat bundle install

Running Tests

```bash

All tests

bundle exec rspec

Specific test files

bundle exec rspec spec/languagesspec.rb bundle exec rspec spec/performancespec.rb ```

Generating Documentation

bash bundle exec yard doc

Code Quality

bash bundle exec rubocop

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Add tests for your changes
  4. Ensure all tests pass (bundle exec rspec)
  5. Run code quality checks (bundle exec rubocop)
  6. Commit your changes (git commit -m 'Add amazing feature')
  7. Push to the branch (git push origin feature/amazing-feature)
  8. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE.txt file for details.

🙏 Acknowledgments

  • Built on the excellent text-hyphen library
  • Inspired by the Python textstat library
  • Thanks to all contributors and users who helped improve this gem

📊 Project Stats

  • Version: 1.0.0 (First Stable Release)
  • Ruby Support: 2.7+
  • Languages: 22 supported
  • Tests: 199 total, 87.4% passing
  • Documentation: 100% API coverage
  • Performance: 36x improvement in key operations

⭐ Star this project if you find it useful!

Owner

  • Name: Jakub Polak
  • Login: kupolak
  • Kind: user
  • Location: Łódź (Poland)
  • Company: @jobandtalent

GitHub Events

Total
  • Create event: 5
  • Issues event: 1
  • Release event: 1
  • Watch event: 5
  • Delete event: 1
  • Issue comment event: 6
  • Push event: 8
  • Pull request event: 5
  • Pull request review event: 1
  • Pull request review comment event: 9
Last Year
  • Create event: 5
  • Issues event: 1
  • Release event: 1
  • Watch event: 5
  • Delete event: 1
  • Issue comment event: 6
  • Push event: 8
  • Pull request event: 5
  • Pull request review event: 1
  • Pull request review comment event: 9

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 108
  • Total Committers: 4
  • Avg Commits per committer: 27.0
  • Development Distribution Score (DDS): 0.074
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Jakub Polak j****z@g****m 100
nialljames n****s@d****k 5
Andrew Bromwich a****h@g****m 2
dependabot[bot] 4****] 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 42
  • Total pull requests: 7
  • Average time to close issues: over 1 year
  • Average time to close pull requests: 16 days
  • Total issue authors: 7
  • Total pull request authors: 5
  • Average comments per issue: 0.88
  • Average comments per pull request: 0.86
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 0
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • kupolak (35)
  • abrom (2)
  • adrienpoly (1)
  • scarroll32 (1)
  • helphop (1)
  • tahseenbaig (1)
Pull Request Authors
  • dependabot[bot] (3)
  • abrom (2)
  • kupolak (2)
  • Niall47 (2)
  • adrienpoly (1)
Top Labels
Issue Labels
enhancement (33) good first issue (28)
Pull Request Labels
dependencies (3) ruby (2)

Packages

  • Total packages: 1
  • Total downloads:
    • rubygems 63,805 total
  • Total dependent packages: 0
  • Total dependent repositories: 4
  • Total versions: 10
  • Total maintainers: 1
rubygems.org: textstat

Ruby gem to calculate readability statistics of a text object - paragraphs, sentences, articles

  • Versions: 10
  • Dependent Packages: 0
  • Dependent Repositories: 4
  • Downloads: 63,805 Total
Rankings
Stargazers count: 9.9%
Forks count: 10.3%
Dependent repos count: 11.0%
Average: 12.1%
Downloads: 13.5%
Dependent packages count: 15.8%
Maintainers (1)
Last synced: 5 months ago

Dependencies

textstat.gemspec rubygems
  • bundler ~> 2.0.a development
  • rake ~> 13.0 development
  • rspec ~> 3.0 development
  • text-hyphen ~> 1.4, >= 1.4.1
Gemfile rubygems