lexicalrichness

A module to compute textual lexical richness (aka lexical diversity).

Keywords

data-mining data-science information-retrieval lexical-analysis lexical-analyzer linguistic-analysis natural-language natural-language-processing nlp python

Keywords from Contributors

mesh interpretability sequences projection interactive hacking network-simulation

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: LSYS
License: mit
Language: Python
Default Branch: master
Homepage: http://lexicalrichness.readthedocs.io/
Size: 3.46 MB

Statistics

Stars: 109
Watchers: 3
Forks: 22
Open Issues: 3
Releases: 13

Created about 8 years ago · Last pushed almost 3 years ago

Metadata Files

Readme Contributing License Citation

README.rst

===============
LexicalRichness
===============
|	|pypi| |conda-forge| |latest-release| |python-ver| 
|	|ci-status| |rtfd| |maintained|
|	|PRs| |codefactor| |isort|
|	|license| |mybinder| |zenodo|

LexicalRichness is a small Python module to compute textual lexical richness (aka lexical diversity) measures.

Lexical richness refers to the range and variety of vocabulary deployed in a text by a speaker/writer (McCarthy and Jarvis 2007). Lexical richness is used interchangeably with lexical diversity, lexical variation, lexical density, and vocabulary richness and is measured by a wide variety of indices. Uses include (but are not limited to) measuring writing quality, vocabulary knowledge (Šišková 2012), speaker competence, and socioeconomic status (McCarthy and Jarvis 2007).
See the notebook for examples.

.. TOC
.. contents:: **Table of Contents**
   :depth: 1
   :local:
	
1. Installation
---------------
**Install using PIP**

.. code-block:: bash

	pip install lexicalrichness

If you encounter the following error,

.. code-block:: python

	ModuleNotFoundError: No module named 'textblob'

install textblob:

.. code-block:: bash

	pip install textblob

*Note*: This error should only occur for :code:`versions <= v0.1.3`. It was fixed in
v0.1.4 by David Lesieur and Christophe Bedetti.


**Install from Conda-Forge**

*LexicalRichness* is also available on conda-forge. If you are using the Anaconda or Miniconda distribution, you can create a conda environment and install the package from conda.

.. code-block:: bash

	conda create -n lex
	conda activate lex 
	conda install -c conda-forge lexicalrichness

*Note*: If :code:`conda activate lex` fails in *Bash* with the error :code:`CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'`, try either of the following:

    * run :code:`conda init bash` in the *Anaconda Prompt* and then retry :code:`conda activate lex` in a new *Bash* session
    * or just try :code:`source activate lex` in *Bash*

**Install manually using Git and GitHub**

.. code-block:: bash

	git clone https://github.com/LSYS/LexicalRichness.git
	cd LexicalRichness
	pip install .

**Run from the cloud**

Try the package on the cloud (without setting anything up on your local machine) by clicking the icon here:  

|mybinder|



2. Quickstart
-------------

.. code-block:: python

	>>> from lexicalrichness import LexicalRichness

	# text example
	>>> text = """Measure of textual lexical diversity, computed as the mean length of sequential words in
            		a text that maintains a minimum threshold TTR score.

            		Iterates over words until TTR scores falls below a threshold, then increase factor
            		counter by 1 and start over. McCarthy and Jarvis (2010, pg. 385) recommends a factor
            		threshold in the range of [0.660, 0.750].
            		(McCarthy 2005, McCarthy and Jarvis 2010)"""

	# instantiate new text object (use the tokenizer=blobber argument to use the textblob tokenizer)
	>>> lex = LexicalRichness(text)

	# Return word count.
	>>> lex.words
	57

	# Return (unique) word count.
	>>> lex.terms
	39

	# Return type-token ratio (TTR) of text.
	>>> lex.ttr
	0.6842105263157895

	# Return root type-token ratio (RTTR) of text.
	>>> lex.rttr
	5.165676192553671

	# Return corrected type-token ratio (CTTR) of text.
	>>> lex.cttr
	3.6526846651686067

	# Return mean segmental type-token ratio (MSTTR).
	>>> lex.msttr(segment_window=25)
	0.88

	# Return moving average type-token ratio (MATTR).
	>>> lex.mattr(window_size=25)
	0.8351515151515151

	# Return Measure of Textual Lexical Diversity (MTLD).
	>>> lex.mtld(threshold=0.72)
	46.79226361031519

	# Return hypergeometric distribution diversity (HD-D) measure.
	>>> lex.hdd(draws=42)
	0.7468703323966486
	
	# Return voc-D measure.
	>>> lex.vocd(ntokens=50, within_sample=100, iterations=3)
	46.27679899103406

	# Return Herdan's lexical diversity measure.
	>>> lex.Herdan
	0.9061378160786574

	# Return Summer's lexical diversity measure.
	>>> lex.Summer
	0.9294460323356605

	# Return Dugast's lexical diversity measure.
	>>> lex.Dugast
	43.074336212149774

	# Return Maas's lexical diversity measure.
	>>> lex.Maas
	0.023215679867353005

	# Return Yule's K.
	>>> lex.yulek
	153.8935056940597

	# Return Yule's I.
	>>> lex.yulei
	22.36764705882353
	
	# Return Herdan's Vm.
	>>> lex.herdanvm
	0.08539428890448784

	# Return Simpson's D.
	>>> lex.simpsond
	0.015664160401002505
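
The count-based values above can be reproduced by hand from the word count (57) and the unique-term count (39). The sketch below is not the package's own code. It applies the formulas listed under Attributes and assumes natural logarithms, which appears consistent with the Quickstart output. The function name :code:`closed_form_measures` is made up for illustration.

.. code-block:: python

    import math


    def closed_form_measures(w, t):
        """Return the count-based measures for w words and t unique terms.

        Assumes w > t > 1; otherwise some logarithms or ratios are undefined.
        """
        log_w, log_t = math.log(w), math.log(t)  # natural logs (an assumption)
        return {
            "ttr": t / w,
            "rttr": t / math.sqrt(w),
            "cttr": t / math.sqrt(2 * w),
            "Herdan": log_t / log_w,
            "Summer": math.log(log_t) / math.log(log_w),
            "Dugast": log_w ** 2 / (log_w - log_t),
            "Maas": (log_w - log_t) / log_w ** 2,
        }


    # Counts taken from the Quickstart example: lex.words and lex.terms
    measures = closed_form_measures(w=57, t=39)
    for name, value in measures.items():
        print(f"{name}: {value}")

Because these measures depend only on the two counts, any two texts with the same number of words and unique terms receive identical scores. The windowed and sampling-based methods (MSTTR, MATTR, MTLD, HD-D, voc-D) do not have this property.
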

	
3. Use LexicalRichness in your own pipeline
-------------------------------------------
:code:`LexicalRichness` comes packaged with minimal preprocessing and tokenization for a quick start.

Intermediate users, however, likely have their own preferred :code:`nlp_pipeline`:

.. code-block:: python

	# Your preferred preprocessing + tokenization pipeline
	def nlp_pipeline(text):
	    ...
	    return list_of_tokens

Use :code:`LexicalRichness` with your own :code:`nlp_pipeline`:

.. code-block:: python

	# Initiate new LexicalRichness object with your preprocessing pipeline as input
	lex = LexicalRichness(text, preprocessor=None, tokenizer=nlp_pipeline)

	# Compute lexical richness
	mtld = lex.mtld()
	
Or use :code:`LexicalRichness` at the end of your pipeline and input the :code:`list_of_tokens` with :code:`preprocessor=None` and :code:`tokenizer=None`:
	
.. code-block:: python

	# Preprocess the text
	list_of_tokens = nlp_pipeline(text)
	
	# Initiate new LexicalRichness object with your list of tokens as input
	lex = LexicalRichness(list_of_tokens, preprocessor=None, tokenizer=None)

	# Compute lexical richness
	mtld = lex.mtld()	
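
As a concrete illustration, here is one possible :code:`nlp_pipeline` built only on the standard library. It is a minimal sketch, not a recommended pipeline: the stop-word set and the tokenization regex are hypothetical choices made for this example.

.. code-block:: python

    import re

    # Hypothetical stop-word list, for illustration only
    STOPWORDS = {"a", "an", "the", "of", "in", "and"}


    def nlp_pipeline(text):
        """Lowercase the text, extract alphabetic tokens, and drop stop words."""
        # Match runs of letters, optionally followed by an apostrophe suffix ("don't")
        tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
        return [tok for tok in tokens if tok not in STOPWORDS]


    list_of_tokens = nlp_pipeline("The cat sat in the hat, and the hat sat on the cat.")
    print(list_of_tokens)

A function like this can be passed as :code:`tokenizer=nlp_pipeline` together with :code:`preprocessor=None`, or its output can be given directly as a list of tokens, following the two patterns shown above.
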
	
4. Using with Pandas
--------------------
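Here's a minimal example using :code:`lexicalrichness` with a Pandas :code:`DataFrame` that has a column containing text:

.. code-block:: python

    from lexicalrichness import LexicalRichness


    def mtld(text):
        """Return the MTLD score of a single text."""
        lex = LexicalRichness(text)
        return lex.mtld()


    # df is assumed to be an existing DataFrame with a 'text' column
    df['mtld'] = df['text'].apply(mtld)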


5. Attributes
-------------

+------------------+-----------------------------------------------------------------+
| ``wordlist``     | list of words                                                   |
+------------------+-----------------------------------------------------------------+
| ``words``        | number of words (w)                                             |
+------------------+-----------------------------------------------------------------+
| ``terms``        | number of unique terms (t)                                      |
+------------------+-----------------------------------------------------------------+
| ``preprocessor`` | preprocessor used                                               |
+------------------+-----------------------------------------------------------------+
| ``tokenizer``    | tokenizer used                                                  |
+------------------+-----------------------------------------------------------------+
| ``ttr``          | type-token ratio computed as t / w (Chotlos 1944, Templin 1957) |
+------------------+-----------------------------------------------------------------+
| ``rttr``         | root TTR computed as t / sqrt(w) (Guiraud 1954, 1960)           |
+------------------+-----------------------------------------------------------------+
| ``cttr``         | corrected TTR computed as t / sqrt(2w) (Carroll 1964)           |
+------------------+-----------------------------------------------------------------+
| ``Herdan``       | log(t) / log(w) (Herdan 1960, 1964)                             |
+------------------+-----------------------------------------------------------------+
| ``Summer``       | log(log(t)) / log(log(w)) (Summer 1966)                         |
+------------------+-----------------------------------------------------------------+
| ``Dugast``       | (log(w) ** 2) / (log(w) - log(t)) (Dugast 1978)                 |
+------------------+-----------------------------------------------------------------+
| ``Maas``         | (log(w) - log(t)) / (log(w) ** 2) (Maas 1972)                   |
+------------------+-----------------------------------------------------------------+
| ``yulek``        | Yule's K (Yule 1944, Tweedie and Baayen 1998)                   |
+------------------+-----------------------------------------------------------------+
| ``yulei``        | Yule's I (Yule 1944, Tweedie and Baayen 1998)                   |
+------------------+-----------------------------------------------------------------+
| ``herdanvm``     | Herdan's Vm (Herdan 1955, Tweedie and Baayen 1998)              |
+------------------+-----------------------------------------------------------------+
| ``simpsond``     | Simpson's D (Simpson 1949, Tweedie and Baayen 1998)             |
+------------------+-----------------------------------------------------------------+
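
The table gives no formulas for Yule's K, Yule's I, Herdan's Vm, and Simpson's D. The sketch below uses common definitions of these indices in terms of term frequencies. Whether the package computes them in exactly this way is an assumption, although the numbers are consistent with the Quickstart output. The function name :code:`spectrum_measures` and the synthetic token list are made up for illustration.

.. code-block:: python

    import math
    from collections import Counter


    def spectrum_measures(tokens):
        """Compute Yule's K, Yule's I, Herdan's Vm, and Simpson's D from tokens.

        Assumes at least two tokens and at least one repeated term.
        """
        n = len(tokens)                    # number of words (w)
        freqs = Counter(tokens).values()   # occurrences of each unique term
        v = len(freqs)                     # number of unique terms (t)
        m2 = sum(f * f for f in freqs)     # sum of squared term frequencies
        return {
            "yulek": 1e4 * (m2 - n) / n ** 2,
            "yulei": v ** 2 / (m2 - v),
            "herdanvm": math.sqrt(m2 / n ** 2 - 1 / v),
            "simpsond": (m2 - n) / (n * (n - 1)),
        }


    # Synthetic sample with 57 words and 39 unique terms:
    # 27 terms occur once, 7 twice, 4 three times, and 1 four times.
    counts = [1] * 27 + [2] * 7 + [3] * 4 + [4]
    tokens = [f"term{i}" for i, c in enumerate(counts) for _ in range(c)]
    print(spectrum_measures(tokens))

Unlike the ratios of t and w, these four indices depend on how often each term repeats, so two texts with the same word and term counts can still score differently.
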

6. Methods
----------

+--------------+--------------------------------------------------------------------------------+
| ``msttr``    | Mean segmental TTR (Johnson 1944)                                              |
+--------------+--------------------------------------------------------------------------------+
| ``mattr``    | Moving average TTR (Covington 2007, Covington and McFall 2010)                 |
+--------------+--------------------------------------------------------------------------------+
| ``mtld``     | Measure of Textual Lexical Diversity (McCarthy 2005, McCarthy and Jarvis 2010) |
+--------------+--------------------------------------------------------------------------------+
| ``hdd``      | HD-D (McCarthy and Jarvis 2007)                                                |
+--------------+--------------------------------------------------------------------------------+
| ``vocd``     | voc-D (McKee, Malvern, and Richards 2010)                                      |
+--------------+--------------------------------------------------------------------------------+
| ``vocd_fig`` | Utility to plot empirical voc-D curve                                          |
+--------------+--------------------------------------------------------------------------------+
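
To make the difference between the two windowed TTR variants concrete, here is a minimal pure-Python sketch of MSTTR and MATTR. It is not the package's implementation. In particular, dropping a trailing segment that is shorter than the window is an assumption of this sketch.

.. code-block:: python

    def ttr(tokens):
        """Type-token ratio: unique terms divided by total words."""
        return len(set(tokens)) / len(tokens)


    def msttr(tokens, segment_window):
        """Mean TTR over consecutive, non-overlapping segments.

        Assumption: a trailing segment shorter than segment_window is dropped.
        """
        starts = range(0, len(tokens) - segment_window + 1, segment_window)
        scores = [ttr(tokens[i:i + segment_window]) for i in starts]
        if not scores:
            raise ValueError("segment_window exceeds the number of tokens")
        return sum(scores) / len(scores)


    def mattr(tokens, window_size):
        """Mean TTR over every window of window_size consecutive tokens."""
        starts = range(len(tokens) - window_size + 1)
        scores = [ttr(tokens[i:i + window_size]) for i in starts]
        if not scores:
            raise ValueError("window_size exceeds the number of tokens")
        return sum(scores) / len(scores)


    tokens = "the cat sat on the mat and the dog sat on the log".split()
    print(msttr(tokens, segment_window=5))
    print(mattr(tokens, window_size=5))

MSTTR looks at each word at most once, while MATTR slides the window one word at a time, so it uses every word and is less sensitive to where the segment boundaries happen to fall.
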

**Plot the empirical voc-D curve**

.. code-block:: python

	lex.vocd_fig(
	    ntokens=50,  # Maximum number for the token/word size in the random samplings
	    within_sample=100,  # Number of samples
	    seed=42,  # Seed for reproducibility
	)

.. image:: https://raw.githubusercontent.com/LSYS/LexicalRichness/master/docs/images/vocd.png
	:width: 450


**Accessing method docstrings**

.. code-block:: python

	>>> import inspect

	# docstring for hdd (HD-D)
	>>> print(inspect.getdoc(LexicalRichness.hdd))

	Hypergeometric distribution diversity (HD-D) score.

	For each term (t) in the text, compute the probabiltiy (p) of getting at least one appearance
	of t with a random draw of size n < N (text size). The contribution of t to the final HD-D
	score is p * (1/n). The final HD-D score thus sums over p * (1/n) with p computed for
	each term t. Described in McCarthy and Javis 2007, p.g. 465-466.
	(McCarthy and Jarvis 2007)

	Parameters
	__________
	draws: int
	    Number of random draws in the hypergeometric distribution (default=42).

	Returns
	_______
	float
	
Alternatively, just do

.. code-block:: python

	>>> print(lex.hdd.__doc__)
	
	Hypergeometric distribution diversity (HD-D) score.

            For each term (t) in the text, compute the probabiltiy (p) of getting at least one appearance
            of t with a random draw of size n < N (text size). The contribution of t to the final HD-D
            score is p * (1/n). The final HD-D score thus sums over p * (1/n) with p computed for
            each term t. Described in McCarthy and Javis 2007, p.g. 465-466.
            (McCarthy and Jarvis 2007)

            Parameters
            ----------
            draws: int
                Number of random draws in the hypergeometric distribution (default=42).

            Returns
            -------
            float	
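
The docstring above can be turned into a short computation. The following is a minimal sketch of that description using only the standard library, not the package's own code: for each term, the probability of at least one appearance in a sample of size :code:`draws` is one minus the hypergeometric probability of zero appearances.

.. code-block:: python

    from collections import Counter
    from math import comb


    def hdd(tokens, draws):
        """HD-D as described in the docstring: sum over terms of p * (1 / draws)."""
        n_tokens = len(tokens)
        if not 0 < draws < n_tokens:
            raise ValueError("draws must be positive and smaller than the text size")
        total_samples = comb(n_tokens, draws)  # ways to draw the sample
        score = 0.0
        for freq in Counter(tokens).values():
            # Probability that a term with this frequency is absent from the sample
            p_absent = comb(n_tokens - freq, draws) / total_samples
            score += (1 - p_absent) / draws
        return score


    print(hdd(list("aabb"), draws=2))
    print(hdd(list("abcde"), draws=2))

Requires Python 3.8 or later for :code:`math.comb`.
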
	    
7. Formulation & Algorithmic Details
------------------------------------
For details under the hood, please see the section on formulations and algorithms in the docs (https://lexicalrichness.readthedocs.io/).
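
As an example of the kind of algorithm involved, the MTLD procedure summarized in the Quickstart sample text can be sketched as follows. This is an illustrative reading of that description, not the package's code. The strict :code:`<` comparison, the partial factor for leftover words, and the averaging of a forward and a backward pass are assumptions based on the usual presentation of MTLD.

.. code-block:: python

    def _mtld_one_pass(tokens, threshold):
        """Walk through tokens once, counting factors."""
        factors = 0.0
        types, count = set(), 0
        for token in tokens:
            count += 1
            types.add(token)
            if len(types) / count < threshold:
                factors += 1              # TTR fell below the threshold
                types, count = set(), 0   # start over
        if count > 0:
            # Partial factor for the leftover words at the end
            factors += (1 - len(types) / count) / (1 - threshold)
        if factors == 0:
            raise ValueError("TTR never drops below 1, so MTLD is undefined")
        return len(tokens) / factors


    def mtld(tokens, threshold):
        """Average of a forward and a backward pass over a list of tokens."""
        forward = _mtld_one_pass(tokens, threshold)
        backward = _mtld_one_pass(tokens[::-1], threshold)
        return (forward + backward) / 2


    print(mtld(list("aaaa"), threshold=0.72))
    print(mtld(list("aab"), threshold=0.5))

A higher score means the text sustains a TTR above the threshold for longer stretches, that is, it is lexically more diverse.
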

	    
8. Example use cases
--------------------
* [1] **SENTiVENT** used the metrics that LexicalRichness provides to estimate the classification difficulty of annotated categories in their corpus (Jacobs & Hoste 2020). The metrics show which categories will be more difficult for modeling approaches that rely on linguistic inputs because greater lexical diversity means greater data scarcity and more need for generalization. (h/t Gilles Jacobs)

  Jacobs, Gilles, and Véronique Hoste. "SENTiVENT: enabling supervised information extraction of company-specific events in economic and financial news." Language Resources and Evaluation (2021): 1-33.

  .. code-block:: bib

     @article{jacobs2021sentivent,
         title={SENTiVENT: enabling supervised information extraction of company-specific events in economic and financial news},
         author={Jacobs, Gilles and Hoste, V{\'e}ronique},
         journal={Language Resources and Evaluation},
         pages={1--33},
         year={2021},
         publisher={Springer}
     }

    
* [2] **Measuring political media using text data.** This chapter of my thesis investigates whether political media bias manifests in coverage accuracy. As covariates, I use characteristics of the text data (political speech and news article transcripts). One way to characterize speeches is by their lexical richness.

  Shen, Lucas (2021). Measuring political media using text data.

  .. code-block:: bib

     @techreport{accuracybias,
         title={Measuring Political Media Slant Using Text Data},
         author={Shen, Lucas},
         url={https://www.lucasshen.com/research/media.pdf},
         year={2021}
     }
	
* [3] **Unreadable News: How Readable is American News?** This study characterizes modern news by readability and lexical richness. Focusing on the NYT, they find increasing readability and lexical richness, suggesting that the NYT feels pressure from alternative sources to be accessible while retaining its key demographic of college-educated Americans.

  Figure: NYT's lexical superiority? Source: https://github.com/notnews/unreadable_news

* [4] **German is more complicated than English.** This study analyses a small sample of English books and compares them to their German translations. Within the sample, the German translations tend to be shorter in length but contain more unique terms than their English counterparts. LexicalRichness was used to generate the statistics modeled within the study.

  Figure: Words vs Terms in Each Book. Source: https://github.com/g-hurst/Comparing-Properties-of-German-and-English-Books
	
	    
9. Contributing
---------------
**Author**

Lucas Shen

**Contributors**

.. image:: https://contrib.rocks/image?repo=lsys/lexicalrichness
   :target: https://github.com/lsys/lexicalrichness/graphs/contributors

Contributions are welcome and greatly appreciated! Every little bit helps, and credit will always be given.
See `how to contribute <./docs/CONTRIBUTING.rst>`__ to this project, and see the Contributor Code of Conduct.

If you'd like to contribute via a Pull Request (PR), feel free to first open an issue on the Issue Tracker to discuss the potential contribution.

10. Citing
----------
If you have used this codebase and wish to cite it, here is the citation metadata.

Codebase:

.. code-block:: bib

	@misc{lex,
		author = {Shen, Lucas},
		doi = {10.5281/zenodo.6607007},
		license = {MIT license},
		title = {{LexicalRichness: A small module to compute textual lexical richness}},
		url = {https://github.com/LSYS/lexicalrichness},
		year = {2022}
	}

Documentation on formulations and algorithms:

.. code-block:: bib

	@misc{accuracybias, 
		title={Measuring Political Media Slant Using Text Data},
		author={Shen, Lucas},
		url={https://www.lucasshen.com/research/media.pdf},
		year={2021}
	}

The package is released under the MIT License.

.. macros -------------------------------------------------------------------------------------------------------
.. badges
.. |pypi| image:: https://badge.fury.io/py/lexicalrichness.svg
	:target: https://pypi.org/project/lexicalrichness/
.. |conda-forge| image:: https://img.shields.io/conda/vn/conda-forge/lexicalrichness   
	:target: https://anaconda.org/conda-forge/lexicalrichness
.. |latest-release| image:: https://img.shields.io/github/v/release/lsys/lexicalrichness   
	:target: https://github.com/LSYS/LexicalRichness/releases
.. |ci-status| image:: https://github.com/LSYS/LexicalRichness/actions/workflows/build.yml/badge.svg?branch=master   
	:target: https://github.com/LSYS/LexicalRichness/actions/workflows/build.yml
.. |python-ver| image:: https://img.shields.io/pypi/pyversions/lexicalrichness   
	:target: https://img.shields.io/pypi/pyversions/lexicalrichness
.. |codefactor| image:: https://www.codefactor.io/repository/github/lsys/lexicalrichness/badge
	:target: https://www.codefactor.io/repository/github/lsys/lexicalrichness     
.. |lgtm| image:: https://img.shields.io/lgtm/grade/python/g/LSYS/LexicalRichness.svg?logo=lgtm&logoWidth=18
	:target: https://lgtm.com/projects/g/LSYS/LexicalRichness/context:python   
.. |maintained| image:: https://img.shields.io/badge/Maintained%3F-yes-green.svg
   :target: https://GitHub.com/Naereen/StrapDown.js/graphs/commit-   
.. |PRs| image:: https://img.shields.io/badge/PRs-welcome-brightgreen.svg
	:target: http://makeapullrequest.com   
.. |license| image:: https://img.shields.io/github/license/LSYS/LexicalRichness?color=blue&label=License  
	:target: https://github.com/LSYS/LexicalRichness/blob/master/LICENSE   
.. |mybinder| image:: https://mybinder.org/badge_logo.svg
   :target: https://mybinder.org/v2/gh/LSYS/lexicaldiversity-example/main?labpath=example.ipynb	
.. |zenodo| image:: https://zenodo.org/badge/DOI/10.5281/zenodo.6607007.svg
   :target: https://doi.org/10.5281/zenodo.6607007
		
.. |rtfd| image:: https://readthedocs.org/projects/lexicalrichness/badge/?version=latest
    :target: https://lexicalrichness.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status
.. |isort| image:: https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336
	:target: https://pycqa.github.io/isort
	:alt: Imports: isort

Owner

Name: Lucas Shen Y. S.
Login: LSYS
Kind: user

Website: https://www.lucasshen.com
Repositories: 7
Profile: https://github.com/LSYS

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you found this software useful for your work, please cite it as below."
preferred-citation:
  authors:
  - family-names: "Shen"
    given-names: "Lucas"
  title: "LexicalRichness: A small module to compute textual lexical richness"
  year: 2022
  url: "https://pypi.org/project/lexicalrichness/"
  repository-code: "https://github.com/LSYS/lexicalrichness"
  license:  MIT license
  identifiers:
  - description: "This is from the archived snapshot of the code, supported by Zenodo."
    type: doi
    value: 10.5281/zenodo.6607007
  doi: 10.5281/zenodo.6607007

GitHub Events

Total

Watch event: 18
Fork event: 2

Last Year

Watch event: 18
Fork event: 2

Committers

Last synced: about 1 year ago

All Time

Total Commits: 86
Total Committers: 7
Avg Commits per committer: 12.286
Development Distribution Score (DDS): 0.116

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Lucas Shen Y. S	l**s@l**m	76
David Lesieur	d**d@d**m	5
dependabot[bot]	4****]	1
Pip coder	7****1	1
Garrett Hurst	6****t	1
Earl Brown	e**7@g**m	1
Christophe Bedetti	c**i@u**a	1

Committer Domains (Top 20 + Academic)

umontreal.ca: 1 davidlesieur.com: 1 lucasshen.com: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 30
Total pull requests: 52
Average time to close issues: 3 months
Average time to close pull requests: 7 days
Total issue authors: 7
Total pull request authors: 11
Average comments per issue: 0.83
Average comments per pull request: 0.21
Merged pull requests: 44
Bot issues: 0
Bot pull requests: 1

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

LSYS (23)
niekveldhuis (2)
GillesJ (1)
neubig (1)
mreygal (1)
ValRCS (1)
paulxvii (1)

Pull Request Authors

LSYS (41)
davidlesieur (2)
g-hurst (1)
gitter-badger (1)
codacy-badger (1)
cbedetti (1)
a-1an (1)
ekbrown (1)
Sreetama2001 (1)
dependabot[bot] (1)
xhulianoThe1 (1)

Top Labels

Issue Labels

Documentation (15) Release (12) enhancement (5) In Progress (2) hacktoberfest (2) bug (1)

Pull Request Labels

hacktoberfest (3) dependencies (1)

Packages

Total packages: 2
Total downloads:
- pypi 6,605 last-month

Total dependent packages: 5
(may contain duplicates)
Total dependent repositories: 8
(may contain duplicates)
Total versions: 19
Total maintainers: 1

pypi.org: lexicalrichness

A small module to compute textual lexical richness (aka lexical diversity).

Homepage: https://github.com/LSYS/lexicalrichness
Documentation: https://lexicalrichness.readthedocs.io/
License: MIT license
Latest release: 0.5.1
published almost 3 years ago

Versions: 15
Dependent Packages: 5
Dependent Repositories: 7
Downloads: 6,605 Last month

Rankings

Dependent packages count: 1.8%

Dependent repos count: 5.6%

Average: 6.2%

Downloads: 6.8%

Stargazers count: 8.2%

Forks count: 8.4%

Maintainers (1)

LSYS

Last synced: 11 months ago

conda-forge.org: lexicalrichness

Homepage: https://github.com/LSYS/lexicalrichness
License: MIT
Latest release: 0.3.0
published over 3 years ago

Versions: 4
Dependent Packages: 0
Dependent Repositories: 1

Rankings

Dependent repos count: 24.2%

Average: 38.7%

Forks count: 39.4%

Stargazers count: 39.8%

Dependent packages count: 51.5%

Last synced: 11 months ago

lexicalrichness

Science Score: 67.0%
