TRUNAJOD

TRUNAJOD: A text complexity library to enhance natural language processing - Published in JOSS (2021)

https://github.com/dpalmasan/trunajod2.0

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 11 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: springer.com, ieee.org, joss.theoj.org, zenodo.org
  • Committers with academic emails
    1 of 8 committers (12.5%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

coherence cohesion entity-graph lexical-diversity natural-language-processing readability-metrics semantic-measurements spacy spacy-extensions text-analysis text-mining text-processing ttr type-token-ratio

Scientific Fields

Engineering Computer Science - 60% confidence
Last synced: 4 months ago

Repository

An easy-to-use library to extract indices from texts.

Basic Info
Statistics
  • Stars: 29
  • Watchers: 5
  • Forks: 7
  • Open Issues: 20
  • Releases: 3
Topics
coherence cohesion entity-graph lexical-diversity natural-language-processing readability-metrics semantic-measurements spacy spacy-extensions text-analysis text-mining text-processing ttr type-token-ratio
Created over 7 years ago · Last pushed over 4 years ago
Metadata Files
Readme Changelog Contributing License

README.md

TRUNAJOD: A text complexity library for text analysis built on spaCy


TRUNAJOD is a Python library for text complexity analysis built on the high-performance spaCy library. With all the basic NLP capabilities provided by spaCy (dependency parsing, POS tagging, tokenizing), TRUNAJOD focuses on extracting measurements from texts that might be interesting for different applications and use cases. While most of the indices could be computed for different languages, we currently support mostly Spanish. We would be happy if you contributed indices implemented for your language!

Features

  • Utilities for text processing such as lemmatization and POS checking.
  • Semantic measurements from text such as average coherence between sentences and average synonym overlap.
  • Givenness measurements such as pronoun density and pronoun-noun ratio.
  • Built-in emotion lexicon to compute emotion calculations based on words in the text.
  • Lexico-semantic norm dataset to compute lexico-semantic variables from text.
  • Type-token ratio (TTR) based metrics, and tunable TTR metrics.
  • A built-in syllabizer (currently only for Spanish).
  • Discourse-marker-based measurements to obtain measures of connectivity inside the text.
  • Plenty of surface proxies of text readability that can be computed directly from text.
  • Measurements of parse tree similarity as an approximation to syntactic complexity.
  • Parse tree correction to add periphrases and heuristics for clause count, all based on linguistic expertise.
  • Entity grid and entity graph model implementations as measures of coherence.
  • An easy-to-use, user-friendly API.
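To illustrate the idea behind the TTR-based metrics, here is a minimal, library-independent sketch (not TRUNAJOD's API) of the plain type-token ratio:

```python
def type_token_ratio(words):
    """Number of unique word types divided by total tokens.

    Higher values indicate greater lexical diversity. The raw ratio is
    sensitive to text length, which is why tunable variants such as
    MTLD are often preferred for longer texts.
    """
    words = [w.lower() for w in words]
    return len(set(words)) / len(words)


tokens = "el universo ha tenido un principio y el universo tiene ciclos".split()
print(type_token_ratio(tokens))  # 9 unique types over 11 tokens
```

TRUNAJOD's own TTR functions operate on spaCy `Doc` objects rather than raw token lists; this sketch only conveys the underlying measure.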

Installation

TRUNAJOD can be installed by running pip install trunajod. It requires Python 3.6.2+ to run.

Getting Started

Using this package has some other prerequisites: it assumes that you already have your spaCy model set up. If not, please first install or download a model (for Spanish users, a Spanish model). Then you can get started with the following code snippet.

You can download pre-built TRUNAJOD models from the repo, under the models directory.

Below is a small code snippet to help you get started with this lib. Don't forget to take a look at the documentation.

The example below assumes you have the es_core_news_sm spaCy Spanish model installed. You can install the model by running python -m spacy download es_core_news_sm. For other models, please check the spaCy docs.
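The setup steps above, collected as shell commands:

```shell
pip install trunajod
python -m spacy download es_core_news_sm
```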

```python
from TRUNAJOD import surface_proxies
from TRUNAJOD.entity_grid import EntityGrid
from TRUNAJOD.lexico_semantic_norms import LexicoSemanticNorm
import pickle
import spacy
import tarfile


class ModelLoader(object):
    """Class to load model."""

    def __init__(self, model_file):
        tar = tarfile.open(model_file, "r:gz")
        self.crea_frequency = {}
        self.infinitive_map = {}
        self.lemmatizer = {}
        self.spanish_lexicosemantic_norms = {}
        self.stopwords = {}
        self.wordnet_noun_synsets = {}
        self.wordnet_verb_synsets = {}

        for member in tar.getmembers():
            f = tar.extractfile(member)
            if "crea_frequency" in member.name:
                self.crea_frequency = pickle.loads(f.read())
            if "infinitive_map" in member.name:
                self.infinitive_map = pickle.loads(f.read())
            if "lemmatizer" in member.name:
                self.lemmatizer = pickle.loads(f.read())
            if "spanish_lexicosemantic_norms" in member.name:
                self.spanish_lexicosemantic_norms = pickle.loads(f.read())
            if "stopwords" in member.name:
                self.stopwords = pickle.loads(f.read())
            if "wordnet_noun_synsets" in member.name:
                self.wordnet_noun_synsets = pickle.loads(f.read())
            if "wordnet_verb_synsets" in member.name:
                self.wordnet_verb_synsets = pickle.loads(f.read())


# Load TRUNAJOD models
model = ModelLoader("trunajod_models_v0.1.tar.gz")

# Load spaCy model
nlp = spacy.load("es_core_news_sm", disable=["ner", "textcat"])

example_text = (
    "El espectáculo del cielo nocturno cautiva la mirada y suscita preguntas "
    "sobre el universo, su origen y su funcionamiento. No es sorprendente que "
    "todas las civilizaciones y culturas hayan formado sus propias "
    "cosmologías. Unas relatan, por ejemplo, que el universo ha "
    "sido siempre tal como es, con ciclos que inmutablemente se repiten; "
    "otras explican que este universo ha tenido un principio, "
    "que ha aparecido por obra creadora de una divinidad."
)

doc = nlp(example_text)

# Lexico-semantic norms
lexico_semantic_norms = LexicoSemanticNorm(
    doc,
    model.spanish_lexicosemantic_norms,
    model.lemmatizer,
)

# Frequency index
freq_index = surface_proxies.frequency_index(doc, model.crea_frequency)

# Clause count (heuristically)
clause_count = surface_proxies.clause_count(doc, model.infinitive_map)

# Compute Entity Grid
egrid = EntityGrid(doc)

print("Concreteness: {}".format(lexico_semantic_norms.get_concreteness()))
print("Frequency Index: {}".format(freq_index))
print("Clause count: {}".format(clause_count))
print("Entity grid:")
print(egrid.get_egrid())
```

This should output:

```
Concreteness: 1.95
Frequency Index: -0.7684649336888104
Clause count: 10
Entity grid:
{'ESPECTÁCULO': ['S', '-', '-'], 'CIELO': ['X', '-', '-'], 'MIRADA': ['O', '-', '-'], 'UNIVERSO': ['O', '-', 'S'], 'ORIGEN': ['X', '-', '-'], 'FUNCIONAMIENTO': ['X', '-', '-'], 'CIVILIZACIONES': ['-', 'S', '-'], 'CULTURAS': ['-', 'X', '-'], 'COSMOLOGÍAS': ['-', 'O', '-'], 'EJEMPLO': ['-', '-', 'X'], 'TAL': ['-', '-', 'X'], 'CICLOS': ['-', '-', 'X'], 'QUE': ['-', '-', 'S'], 'SE': ['-', '-', 'O'], 'OTRAS': ['-', '-', 'S'], 'PRINCIPIO': ['-', '-', 'O'], 'OBRA': ['-', '-', 'X'], 'DIVINIDAD': ['-', '-', 'X']}
```
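For intuition: the entity grid records each entity's syntactic role per sentence (S = subject, O = object, X = other, '-' = absent), and entity-grid coherence models score texts by the distribution of role transitions between consecutive sentences. A minimal, library-independent sketch of that counting step (not TRUNAJOD's implementation) might look like this:

```python
from collections import Counter

# A few rows from an entity grid: role of each entity per sentence.
egrid = {
    "UNIVERSO": ["O", "-", "S"],
    "CIELO": ["X", "-", "-"],
    "CIVILIZACIONES": ["-", "S", "-"],
}


def transition_probabilities(grid):
    """Relative frequency of (role, role) transitions between
    consecutive sentences, pooled over all entities."""
    counts = Counter()
    for roles in grid.values():
        for prev, curr in zip(roles, roles[1:]):
            counts[(prev, curr)] += 1
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}


probs = transition_probabilities(egrid)
print(probs[("-", "S")])  # 2 of the 6 transitions reintroduce an entity as subject
```

Coherent texts tend to show more continuity transitions (e.g. S→S) than grids built from scrambled sentences, which is what makes these counts useful as coherence features.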

A real world example

The TRUNAJOD lib was used to build the TRUNAJOD web app, an application to assess text complexity and to check the adequacy of a text for a particular school level. To achieve this, several TRUNAJOD indices were analyzed for multiple Chilean school system texts (from textbooks), and latent features were created. Here is a snippet:

```python
"""Example of TRUNAJOD usage."""
import glob

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import spacy
import textract  # To read .docx files
import TRUNAJOD.givenness
import TRUNAJOD.ttr
from TRUNAJOD import surface_proxies
from TRUNAJOD.syllabizer import Syllabizer

plt.rcParams["figure.figsize"] = (11, 4)
plt.rcParams["figure.dpi"] = 200

nlp = spacy.load("es_core_news_sm", disable=["ner", "textcat"])

features = {
    "lexical_diversity_mltd": [],
    "lexical_density": [],
    "pos_dissimilarity": [],
    "connection_words_ratio": [],
    "grade": [],
}

for filename in glob.glob("corpus/*/*.docx"):
    text = textract.process(filename).decode("utf8")
    doc = nlp(text)
    features["lexical_diversity_mltd"].append(
        TRUNAJOD.ttr.lexical_diversity_mtld(doc)
    )
    features["lexical_density"].append(surface_proxies.lexical_density(doc))
    features["pos_dissimilarity"].append(
        surface_proxies.pos_dissimilarity(doc)
    )
    features["connection_words_ratio"].append(
        surface_proxies.connection_words_ratio(doc)
    )

    # In our case corpus was organized as:
    # corpus/5B/5_2_55.docx where the folder that
    # contained the doc, contained the school level, in
    # this example 5th grade
    features["grade"].append(filename.split("/")[1][0])

df = pd.DataFrame(features)

fig, axes = plt.subplots(2, 2)

sns.boxplot(x="grade", y="lexical_diversity_mltd", data=df, ax=axes[0, 0])
sns.boxplot(x="grade", y="lexical_density", data=df, ax=axes[0, 1])
sns.boxplot(x="grade", y="pos_dissimilarity", data=df, ax=axes[1, 0])
sns.boxplot(x="grade", y="connection_words_ratio", data=df, ax=axes[1, 1])
```

Which yields:

TRUNAJOD web app example

The TRUNAJOD web app backend was built using the TRUNAJOD lib. A demo video is shown below (it is in Spanish):

TRUNAJOD demo

Contributing to TRUNAJOD

Bug reports and fixes are always welcome! Feel free to file issues or ask for a feature request; we use the GitHub issue tracker for this. If you'd like to contribute, feel free to submit a pull request. For more questions you can contact me at dipalma (at) udec (dot) cl.

More details can be found in CONTRIBUTING.

References

If you find any of this useful, feel free to cite the following papers, on which much of this Python library is based:

  1. Palma, D., & Atkinson, J. (2018). Coherence-based automatic essay assessment. IEEE Intelligent Systems, 33(5), 26-36.
  2. Palma, D., Soto, C., Veliz, M., Riffo, B., & Gutiérrez, A. (2019, August). A Data-Driven Methodology to Assess Text Complexity Based on Syntactic and Semantic Measurements. In International Conference on Human Interaction and Emerging Technologies (pp. 509-515). Springer, Cham.

```bib
@article{Palma2021,
  doi = {10.21105/joss.03153},
  url = {https://doi.org/10.21105/joss.03153},
  year = {2021},
  publisher = {The Open Journal},
  volume = {6},
  number = {60},
  pages = {3153},
  author = {Diego A. Palma and Christian Soto and Mónica Veliz and Bruno Karelovic and Bernardo Riffo},
  title = {TRUNAJOD: A text complexity library to enhance natural language processing},
  journal = {Journal of Open Source Software}
}

@article{palma2018coherence,
  title={Coherence-based automatic essay assessment},
  author={Palma, Diego and Atkinson, John},
  journal={IEEE Intelligent Systems},
  volume={33},
  number={5},
  pages={26--36},
  year={2018},
  publisher={IEEE}
}

@inproceedings{palma2019data,
  title={A Data-Driven Methodology to Assess Text Complexity Based on Syntactic and Semantic Measurements},
  author={Palma, Diego and Soto, Christian and Veliz, M{\'o}nica and Riffo, Bernardo and Guti{\'e}rrez, Antonio},
  booktitle={International Conference on Human Interaction and Emerging Technologies},
  pages={509--515},
  year={2019},
  organization={Springer}
}
```

Owner

  • Name: Diego Palma
  • Login: dpalmasan
  • Kind: user
  • Location: Seattle
  • Company: Meta

I am an electrical engineer with an MSc in Computer Science, interested in teaching, programming, web application development, AI, NLP, and machine learning.

JOSS Publication

TRUNAJOD: A text complexity library to enhance natural language processing
Published
April 21, 2021
Volume 6, Issue 60, Page 3153
Authors
Diego A. Palma ORCID
Universidad de Concepción
Christian Soto
Universidad de Concepción
Mónica Veliz
Universidad de Concepción
Bruno Karelovic
Universidad de Concepción
Bernardo Riffo
Universidad de Concepción
Editor
Daniel S. Katz ORCID
Tags
natural language processing machine learning text complexity text coherence

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 119
  • Total Committers: 8
  • Avg Commits per committer: 14.875
  • Development Distribution Score (DDS): 0.445
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
dpalmasan d****n@g****m 66
Diego d****a@e****m 31
sourvad s****d@g****m 7
supersonic1999 j****y@g****m 5
Brandon Goding b****g@g****m 5
Daniel S. Katz d****z@i****g 2
Bruce Lee w****e@g****m 2
Alejandro Piad a****d@g****m 1

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 40
  • Total pull requests: 28
  • Average time to close issues: 6 days
  • Average time to close pull requests: about 14 hours
  • Total issue authors: 3
  • Total pull request authors: 8
  • Average comments per issue: 1.25
  • Average comments per pull request: 1.11
  • Merged pull requests: 20
  • Bot issues: 0
  • Bot pull requests: 4
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • dpalmasan (33)
  • mbdemoraes (6)
  • apiad (1)
Pull Request Authors
  • dpalmasan (14)
  • supersonic1999 (4)
  • dependabot[bot] (4)
  • sourvad (2)
  • brucewlee (1)
  • BrandonGoding (1)
  • apiad (1)
  • danielskatz (1)
Top Labels
Issue Labels
good first issue (16) enhancement (13) usability (3) bug (3) feature: coherence (1) feature: linguistic index (1) documentation (1) dependencies (1) help wanted (1)
Pull Request Labels
dependencies (4)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 517 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 3
  • Total maintainers: 1
pypi.org: trunajod

A Python lib for readability analyses.

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 517 Last month
Rankings
Dependent packages count: 10.1%
Stargazers count: 11.9%
Forks count: 13.3%
Average: 17.4%
Dependent repos count: 21.6%
Downloads: 30.0%
Maintainers (1)
Last synced: 4 months ago

Dependencies

docs/requirements.txt pypi
  • m2r *
  • spacy *
  • sphinx ==2.4.4
  • sphinxcontrib-bibtex *
requirements-test.txt pypi
  • mock >=3.0.5 test
  • pre-commit * test
  • pytest >=6.1.1 test
  • pytest-cov >=2.10.1 test
  • tox * test
setup.py pypi
  • spacy >=2.3.2