TRUNAJOD

TRUNAJOD: A text complexity library to enhance natural language processing - Published in JOSS (2021)

https://github.com/dpalmasan/trunajod2.0

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 11 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: springer.com, ieee.org, joss.theoj.org, zenodo.org
  • Committers with academic emails
    1 of 8 committers (12.5%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

coherence cohesion entity-graph lexical-diversity natural-language-processing readability-metrics semantic-measurements spacy spacy-extensions text-analysis text-mining text-processing ttr type-token-ratio

Scientific Fields

Engineering Computer Science - 60% confidence
Last synced: 4 months ago

Repository

An easy-to-use library to extract indices from texts.

Basic Info
Statistics
  • Stars: 29
  • Watchers: 5
  • Forks: 7
  • Open Issues: 20
  • Releases: 3
Topics
coherence cohesion entity-graph lexical-diversity natural-language-processing readability-metrics semantic-measurements spacy spacy-extensions text-analysis text-mining text-processing ttr type-token-ratio
Created over 7 years ago · Last pushed over 4 years ago
Metadata Files
Readme Changelog Contributing License

README.md

TRUNAJOD: A text complexity library for text analysis built on spaCy


TRUNAJOD is a Python library for text complexity analysis built on the high-performance spaCy library. With all the basic NLP capabilities provided by spaCy (dependency parsing, POS tagging, tokenizing), TRUNAJOD focuses on extracting measurements from texts that might be interesting for different applications and use cases. While most of the indices could be computed for different languages, we currently support mostly Spanish. We would be happy if you contributed indices implemented for your language!

Features

  • Utilities for text processing such as lemmatization and POS checking.
  • Semantic measurements from text such as average coherence between sentences and average synonym overlap.
  • Givenness measurements such as pronoun density and pronoun-noun ratio.
  • Built-in emotion lexicon to compute emotion calculations based on words in the text.
  • Lexico-semantic norm dataset to compute lexico-semantic variables from text.
  • Type-token ratio (TTR) based metrics, and tunable TTR metrics.
  • A built-in syllabizer (currently only for Spanish).
  • Discourse-marker-based measurements to obtain measures of connectivity inside the text.
  • Plenty of surface proxies of text readability that can be computed directly from text.
  • Measurements of parse tree similarity as an approximation to syntactic complexity.
  • Parse tree correction to add periphrases and heuristics for clause count, all based on linguistic expertise.
  • Entity grid and entity graph model implementations as measures of coherence.
  • An easy-to-use, user-friendly API.
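To illustrate the idea behind the TTR-based metrics, here is a minimal, library-independent sketch (not TRUNAJOD's API) of the plain type-token ratio:

```python
def type_token_ratio(words):
    """Number of unique word types divided by total tokens.

    Higher values indicate greater lexical diversity. The raw ratio is
    sensitive to text length, which is why tunable variants such as
    MTLD are often preferred for longer texts.
    """
    words = [w.lower() for w in words]
    return len(set(words)) / len(words)


tokens = "el universo ha tenido un principio y el universo tiene ciclos".split()
print(type_token_ratio(tokens))  # 9 unique types over 11 tokens
```

TRUNAJOD's own TTR functions operate on spaCy `Doc` objects rather than raw token lists; this sketch only conveys the underlying measure.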

Installation

TRUNAJOD can be installed by running pip install trunajod. It requires Python 3.6.2+ to run.

Getting Started

Using this package has some other prerequisites: it assumes that you already have your spaCy model set up. If not, please first install or download a model (for Spanish users, a Spanish model). Then you can get started with the following code snippet.

You can download pre-built TRUNAJOD models from the repo, under the models directory.

Below is a small code snippet to help you get started with this lib. Don't forget to take a look at the documentation.

The example below assumes you have the es_core_news_sm spaCy Spanish model installed. You can install the model by running python -m spacy download es_core_news_sm. For other models, please check the spaCy docs.
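The setup steps above, collected as shell commands:

```shell
pip install trunajod
python -m spacy download es_core_news_sm
```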

```python
from TRUNAJOD import surface_proxies
from TRUNAJOD.entity_grid import EntityGrid
from TRUNAJOD.lexico_semantic_norms import LexicoSemanticNorm
import pickle
import spacy
import tarfile


class ModelLoader(object):
    """Class to load model."""

    def __init__(self, model_file):
        tar = tarfile.open(model_file, "r:gz")
        self.crea_frequency = {}
        self.infinitive_map = {}
        self.lemmatizer = {}
        self.spanish_lexicosemantic_norms = {}
        self.stopwords = {}
        self.wordnet_noun_synsets = {}
        self.wordnet_verb_synsets = {}

        for member in tar.getmembers():
            f = tar.extractfile(member)
            if "crea_frequency" in member.name:
                self.crea_frequency = pickle.loads(f.read())
            if "infinitive_map" in member.name:
                self.infinitive_map = pickle.loads(f.read())
            if "lemmatizer" in member.name:
                self.lemmatizer = pickle.loads(f.read())
            if "spanish_lexicosemantic_norms" in member.name:
                self.spanish_lexicosemantic_norms = pickle.loads(f.read())
            if "stopwords" in member.name:
                self.stopwords = pickle.loads(f.read())
            if "wordnet_noun_synsets" in member.name:
                self.wordnet_noun_synsets = pickle.loads(f.read())
            if "wordnet_verb_synsets" in member.name:
                self.wordnet_verb_synsets = pickle.loads(f.read())


# Load TRUNAJOD models
model = ModelLoader("trunajod_models_v0.1.tar.gz")

# Load spaCy model
nlp = spacy.load("es_core_news_sm", disable=["ner", "textcat"])

example_text = (
    "El espectáculo del cielo nocturno cautiva la mirada y suscita preguntas "
    "sobre el universo, su origen y su funcionamiento. No es sorprendente que "
    "todas las civilizaciones y culturas hayan formado sus propias "
    "cosmologías. Unas relatan, por ejemplo, que el universo ha "
    "sido siempre tal como es, con ciclos que inmutablemente se repiten; "
    "otras explican que este universo ha tenido un principio, "
    "que ha aparecido por obra creadora de una divinidad."
)

doc = nlp(example_text)

# Lexico-semantic norms
lexico_semantic_norms = LexicoSemanticNorm(
    doc,
    model.spanish_lexicosemantic_norms,
    model.lemmatizer,
)

# Frequency index
freq_index = surface_proxies.frequency_index(doc, model.crea_frequency)

# Clause count (heuristically)
clause_count = surface_proxies.clause_count(doc, model.infinitive_map)

# Compute Entity Grid
egrid = EntityGrid(doc)

print("Concreteness: {}".format(lexico_semantic_norms.get_concreteness()))
print("Frequency Index: {}".format(freq_index))
print("Clause count: {}".format(clause_count))
print("Entity grid:")
print(egrid.get_egrid())
```

This should output:

```
Concreteness: 1.95
Frequency Index: -0.7684649336888104
Clause count: 10
Entity grid:
{'ESPECTÁCULO': ['S', '-', '-'], 'CIELO': ['X', '-', '-'], 'MIRADA': ['O', '-', '-'], 'UNIVERSO': ['O', '-', 'S'], 'ORIGEN': ['X', '-', '-'], 'FUNCIONAMIENTO': ['X', '-', '-'], 'CIVILIZACIONES': ['-', 'S', '-'], 'CULTURAS': ['-', 'X', '-'], 'COSMOLOGÍAS': ['-', 'O', '-'], 'EJEMPLO': ['-', '-', 'X'], 'TAL': ['-', '-', 'X'], 'CICLOS': ['-', '-', 'X'], 'QUE': ['-', '-', 'S'], 'SE': ['-', '-', 'O'], 'OTRAS': ['-', '-', 'S'], 'PRINCIPIO': ['-', '-', 'O'], 'OBRA': ['-', '-', 'X'], 'DIVINIDAD': ['-', '-', 'X']}
```
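For intuition: the entity grid records each entity's syntactic role per sentence (S = subject, O = object, X = other, '-' = absent), and entity-grid coherence models score texts by the distribution of role transitions between consecutive sentences. A minimal, library-independent sketch of that counting step (not TRUNAJOD's implementation) might look like this:

```python
from collections import Counter

# A few rows from an entity grid: role of each entity per sentence.
egrid = {
    "UNIVERSO": ["O", "-", "S"],
    "CIELO": ["X", "-", "-"],
    "CIVILIZACIONES": ["-", "S", "-"],
}


def transition_probabilities(grid):
    """Relative frequency of (role, role) transitions between
    consecutive sentences, pooled over all entities."""
    counts = Counter()
    for roles in grid.values():
        for prev, curr in zip(roles, roles[1:]):
            counts[(prev, curr)] += 1
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}


probs = transition_probabilities(egrid)
print(probs[("-", "S")])  # 2 of the 6 transitions reintroduce an entity as subject
```

Coherent texts tend to show more continuity transitions (e.g. S→S) than grids built from scrambled sentences, which is what makes these counts useful as coherence features.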

A real world example

The TRUNAJOD lib was used to build the TRUNAJOD web app, an application to assess text complexity and to check the adequacy of a text for a particular school level. To achieve this, several TRUNAJOD indices were analyzed for multiple Chilean school system texts (from textbooks), and latent features were created. Here is a snippet:

```python
"""Example of TRUNAJOD usage."""
import glob

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import spacy
import textract  # To read .docx files
import TRUNAJOD.givenness
import TRUNAJOD.ttr
from TRUNAJOD import surface_proxies
from TRUNAJOD.syllabizer import Syllabizer

plt.rcParams["figure.figsize"] = (11, 4)
plt.rcParams["figure.dpi"] = 200

nlp = spacy.load("es_core_news_sm", disable=["ner", "textcat"])

features = {
    "lexical_diversity_mltd": [],
    "lexical_density": [],
    "pos_dissimilarity": [],
    "connection_words_ratio": [],
    "grade": [],
}

for filename in glob.glob("corpus/*/*.docx"):
    text = textract.process(filename).decode("utf8")
    doc = nlp(text)
    features["lexical_diversity_mltd"].append(
        TRUNAJOD.ttr.lexical_diversity_mtld(doc)
    )
    features["lexical_density"].append(surface_proxies.lexical_density(doc))
    features["pos_dissimilarity"].append(
        surface_proxies.pos_dissimilarity(doc)
    )
    features["connection_words_ratio"].append(
        surface_proxies.connection_words_ratio(doc)
    )

    # In our case corpus was organized as:
    # corpus/5B/5_2_55.docx where the folder that
    # contained the doc, contained the school level, in
    # this example 5th grade
    features["grade"].append(filename.split("/")[1][0])

df = pd.DataFrame(features)

fig, axes = plt.subplots(2, 2)

sns.boxplot(x="grade", y="lexical_diversity_mltd", data=df, ax=axes[0, 0])
sns.boxplot(x="grade", y="lexical_density", data=df, ax=axes[0, 1])
sns.boxplot(x="grade", y="pos_dissimilarity", data=df, ax=axes[1, 0])
sns.boxplot(x="grade", y="connection_words_ratio", data=df, ax=axes[1, 1])
```

Which yields:

TRUNAJOD web app example

The TRUNAJOD web app backend was built using the TRUNAJOD lib. A demo video is shown below (it is in Spanish):

TRUNAJOD demo

Contributing to TRUNAJOD

Bug reports and fixes are always welcome! Feel free to file issues or ask for a feature request; we use the GitHub issue tracker for this. If you'd like to contribute, feel free to submit a pull request. For more questions you can contact me at dipalma (at) udec (dot) cl.

More details can be found in CONTRIBUTING.

References

If you find any of this useful, feel free to cite the following papers, on which much of this Python library is based:

  1. Palma, D., & Atkinson, J. (2018). Coherence-based automatic essay assessment. IEEE Intelligent Systems, 33(5), 26-36.
  2. Palma, D., Soto, C., Veliz, M., Riffo, B., & Gutiérrez, A. (2019, August). A Data-Driven Methodology to Assess Text Complexity Based on Syntactic and Semantic Measurements. In International Conference on Human Interaction and Emerging Technologies (pp. 509-515). Springer, Cham.

```bib
@article{Palma2021,
  doi = {10.21105/joss.03153},
  url = {https://doi.org/10.21105/joss.03153},
  year = {2021},
  publisher = {The Open Journal},
  volume = {6},
  number = {60},
  pages = {3153},
  author = {Diego A. Palma and Christian Soto and Mónica Veliz and Bruno Karelovic and Bernardo Riffo},
  title = {TRUNAJOD: A text complexity library to enhance natural language processing},
  journal = {Journal of Open Source Software}
}

@article{palma2018coherence,
  title={Coherence-based automatic essay assessment},
  author={Palma, Diego and Atkinson, John},
  journal={IEEE Intelligent Systems},
  volume={33},
  number={5},
  pages={26--36},
  year={2018},
  publisher={IEEE}
}

@inproceedings{palma2019data,
  title={A Data-Driven Methodology to Assess Text Complexity Based on Syntactic and Semantic Measurements},
  author={Palma, Diego and Soto, Christian and Veliz, M{\'o}nica and Riffo, Bernardo and Guti{\'e}rrez, Antonio},
  booktitle={International Conference on Human Interaction and Emerging Technologies},
  pages={509--515},
  year={2019},
  organization={Springer}
}
```

Owner

  • Name: Diego Palma
  • Login: dpalmasan
  • Kind: user
  • Location: Seattle
  • Company: Meta

I am an electrical engineer with an MSc in Computer Science, interested in teaching, programming, web application development, AI, NLP, and machine learning.

JOSS Publication

TRUNAJOD: A text complexity library to enhance natural language processing
Published
April 21, 2021
Volume 6, Issue 60, Page 3153
Authors
Diego A. Palma ORCID
Universidad de Concepción
Christian Soto
Universidad de Concepción
Mónica Veliz
Universidad de Concepción
Bruno Karelovic
Universidad de Concepción
Bernardo Riffo
Universidad de Concepción
Editor
Daniel S. Katz ORCID
Tags
natural language processing machine learning text complexity text coherence

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 119
  • Total Committers: 8
  • Avg Commits per committer: 14.875
  • Development Distribution Score (DDS): 0.445
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
dpalmasan d****n@g****m 66
Diego d****a@e****m 31
sourvad s****d@g****m 7
supersonic1999 j****y@g****m 5
Brandon Goding b****g@g****m 5
Daniel S. Katz d****z@i****g 2
Bruce Lee w****e@g****m 2
Alejandro Piad a****d@g****m 1

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 40
  • Total pull requests: 28
  • Average time to close issues: 6 days
  • Average time to close pull requests: about 14 hours
  • Total issue authors: 3
  • Total pull request authors: 8
  • Average comments per issue: 1.25
  • Average comments per pull request: 1.11
  • Merged pull requests: 20
  • Bot issues: 0
  • Bot pull requests: 4
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • dpalmasan (33)
  • mbdemoraes (6)
  • apiad (1)
Pull Request Authors
  • dpalmasan (14)
  • supersonic1999 (4)
  • dependabot[bot] (4)
  • sourvad (2)
  • brucewlee (1)
  • BrandonGoding (1)
  • apiad (1)
  • danielskatz (1)
Top Labels
Issue Labels
good first issue (16) enhancement (13) usability (3) bug (3) feature: coherence (1) feature: linguistic index (1) documentation (1) dependencies (1) help wanted (1)
Pull Request Labels
dependencies (4)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 517 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 3
  • Total maintainers: 1
pypi.org: trunajod

A Python lib for readability analyses.

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 517 Last month
Rankings
Dependent packages count: 10.1%
Stargazers count: 11.9%
Forks count: 13.3%
Average: 17.4%
Dependent repos count: 21.6%
Downloads: 30.0%
Maintainers (1)
Last synced: 4 months ago

Dependencies

docs/requirements.txt pypi
  • m2r *
  • spacy *
  • sphinx ==2.4.4
  • sphinxcontrib-bibtex *
requirements-test.txt pypi
  • mock >=3.0.5 test
  • pre-commit * test
  • pytest >=6.1.1 test
  • pytest-cov >=2.10.1 test
  • tox * test
setup.py pypi
  • spacy >=2.3.2