https://github.com/brucewlee/lftk

[BEA @ ACL 2023] General-purpose tool for linguistic features extraction; Tested on readability assessment, essay scoring, fake news detection, hate speech detection, etc.


Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.5%) to scientific vocabulary

Keywords

bea-workshop feature-extraction handcrafted-features linguistic-features natural-language-processing python readability-scores reading-time spacy text-analysis word-difficulty
Last synced: 5 months ago

Repository

[BEA @ ACL 2023] General-purpose tool for linguistic features extraction; Tested on readability assessment, essay scoring, fake news detection, hate speech detection, etc.

Basic Info
  • Host: GitHub
  • Owner: brucewlee
  • License: other
  • Language: Python
  • Default Branch: main
  • Homepage: https://lftk.rtfd.io/
  • Size: 7.19 MB
Statistics
  • Stars: 140
  • Watchers: 3
  • Forks: 25
  • Open Issues: 4
  • Releases: 0
Topics
bea-workshop feature-extraction handcrafted-features linguistic-features natural-language-processing python readability-scores reading-time spacy text-analysis word-difficulty
Created almost 3 years ago · Last pushed about 1 year ago
Metadata Files
Readme License

readme.md


LFTK: Handcrafted Features in Computational Linguistics

  • :microscope: Comprehensive: LFTK is a Python research package that extracts various handcrafted features (e.g., number of words per sentence, Flesch-Kincaid Readability Score) commonly used in computational linguistics.
  • :fire: Blazing Fast: Extracting more than 200 handcrafted features takes less than 0.01 seconds per word, much faster than LFTK's predecessor, LingFeat. Reported times exclude spaCy processing, which is not part of LFTK.
  • :rocket: Do More with spaCy: LFTK is built on top of the popular NLP library spaCy. Explore spaCy's pre-trained pipelines and get the most out of spaCy.

LFTK can calculate readability scores, evaluate word difficulty, count the number of nouns, and much more. There is a lot to explore in this package. Use our handcrafted features to support linguistic studies or to build machine learning models.

Installation

Use the package manager pip to install LFTK.

```bash
pip install lftk
```

Also, install spaCy and a trained spaCy pipeline of your choice. Here, we use en_core_web_sm. Installing LFTK automatically installs spaCy, but you will still have to download one of its trained pipelines.

```bash
pip install spacy
python -m spacy download en_core_web_sm
```

News

  • Presentation at BEA @ ACL 2023.
  • Preprint available on arXiv.
  • v.1.0.9 -> Documentation update! Keep track of our progress.
  • v.1.0.8 -> 7 features that extract conjunctions were deleted, replaced by features extracting subordinating and coordinating conjunctions.

Usage

```python
import spacy
import lftk

# load a trained pipeline of your choice from spaCy
# remember, we already downloaded the "en_core_web_sm" pipeline above
nlp = spacy.load("en_core_web_sm")

# create a spaCy doc object
doc = nlp("I love research but my professor is strange.")

# initiate the LFTK extractor by passing in the doc
# you can also pass in a list of multiple docs
LFTK = lftk.Extractor(docs = doc)

# optionally, customize how the LFTK extractor calculates handcrafted linguistic features
# for example: include stop words? include punctuation? maximum decimal digits?
LFTK.customize(stopwords=True, punctuations=False, round_decimal=3)

# now, extract the handcrafted linguistic features that you need
# refer to them by their feature keys
extracted_features = LFTK.extract(features = ["a_word_ps", "a_kup_pw", "n_noun"])

# {'a_word_ps': 8.0, 'a_kup_pw': 5.754, 'n_noun': 2}
print(extracted_features)
```

Also, read Essential Tips and To-Do Guides.

Deep Dive: Handcrafted Linguistic Features

TL;DR: Google Sheet of All Handcrafted Linguistic Features

Each handcrafted linguistic feature represents a certain linguistic property. We categorize all features into the broad linguistic branches of lexico-semantics, syntax, discourse, and surface. The surface branch also holds features that do not belong to any specific linguistic branch.

Apart from linguistic branches, handcrafted features are also categorized into linguistic families. The linguistic families group features into smaller subcategories, enabling users to search more effectively for the feature they need. All family names are unique, and each family belongs to a specific formulation; that is, the features in a family are either all foundation or all derivation. A linguistic family also serves as an important building block of our feature extraction system: LFTK as a program is essentially a linked collection of feature extraction modules, where each module represents a linguistic family.
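The family-per-module structure described above can be sketched generically. This is a hypothetical illustration of a module registry, not LFTK's actual internals; all names in it are illustrative.

```python
# Hypothetical sketch: each linguistic family is one extraction module,
# and the extractor is a linked collection of these modules.
# Not LFTK's real code; names are illustrative only.
FAMILY_MODULES = {}

def register_family(name):
    """Register an extraction function under a family name."""
    def wrap(fn):
        FAMILY_MODULES[name] = fn
        return fn
    return wrap

@register_family("wordsent")          # foundation features
def wordsent(tokens):
    return {"t_word": len(tokens), "t_uword": len(set(tokens))}

@register_family("avgwordsent")       # derivation features build on foundations
def avgwordsent(tokens):
    base = FAMILY_MODULES["wordsent"](tokens)
    return {"a_char_pw": round(sum(len(t) for t in tokens) / base["t_word"], 3)}

tokens = "I love research".split()
features = {}
for family in ("wordsent", "avgwordsent"):
    features.update(FAMILY_MODULES[family](tokens))
print(features)  # {'t_word': 3, 't_uword': 3, 'a_char_pw': 4.333}
```

Because every family is self-contained behind a common interface, adding a new feature group amounts to registering one more module, which is what makes the system systematically expandable.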

Each handcrafted linguistic feature is either a foundation or a derivation. Derivation-type linguistic features are derived from foundation-type linguistic features. For example, the total number of words and the total number of sentences in a given text are foundation features. The average number of words per sentence, on the other hand, is a derivation feature, as it builds on those two foundation features.
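The foundation-to-derivation relationship can be shown in plain Python. This is an illustrative sketch only; LFTK computes these features from spaCy docs, not from naive string splitting.

```python
# Illustrative sketch: a derivation feature (average words per sentence)
# computed from two foundation features (total words, total sentences).
# Real LFTK features are computed from spaCy docs.
text = "I love research. My professor is very strange."
sentences = [s for s in text.split(".") if s.strip()]
words = text.replace(".", "").split()

t_word = len(words)                    # foundation: total number of words
t_sent = len(sentences)                # foundation: total number of sentences
a_word_ps = round(t_word / t_sent, 3)  # derivation: average words per sentence

print(t_word, t_sent, a_word_ps)  # 8 2 4.0
```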

Each handcrafted linguistic feature also has an assigned language value. If the linguistic feature is universally applicable across languages, it is denoted "general". These general linguistic features can be used with any language given that spaCy has a supporting pipeline for that functionality in that language. This can be easily checked on spaCy pipelines. If the feature is designed for a specific language, like English, it is denoted with the specific language code.

Programmatically Searching Handcrafted Features

```python
import lftk

# returns all available features as a list of dictionaries by default
searched_features = lftk.search_features()

# [{'key': 't_word', 'name': 'total_number_of_words', 'formulation': 'foundation', 'domain': 'surface', 'family': 'wordsent'},
#  {'key': 't_uword', 'name': 'total_number_of_unique_words', 'formulation': 'foundation', 'domain': 'surface', 'family': 'wordsent'},
#  {'key': 't_sent', 'name': 'total_number_of_sentences', 'formulation': 'foundation', 'domain': 'surface', 'family': 'wordsent'}, ...]
print(searched_features)

# specify attributes
searched_features = lftk.search_features(domain = "surface", family = "avgwordsent")

# [{'key': 'a_word_ps', 'name': 'average_number_of_words_per_sentence', 'formulation': 'derivation', 'domain': 'surface', 'family': 'avgwordsent'},
#  {'key': 'a_char_ps', 'name': 'average_number_of_characters_per_sentence', 'formulation': 'derivation', 'domain': 'surface', 'family': 'avgwordsent'},
#  {'key': 'a_char_pw', 'name': 'average_number_of_characters_per_word', 'formulation': 'derivation', 'domain': 'surface', 'family': 'avgwordsent'}]
print(searched_features)

# return a pandas dataframe instead of a list of dictionaries
searched_features = lftk.search_features(domain = 'surface', family = "avgwordsent", return_format = "pandas")

#          key                                       name formulation   domain       family
# 4  a_word_ps       average_number_of_words_per_sentence  derivation  surface  avgwordsent
# 5  a_char_ps  average_number_of_characters_per_sentence  derivation  surface  avgwordsent
# 6  a_char_pw      average_number_of_characters_per_word  derivation  surface  avgwordsent
print(searched_features)
```

Attribute: domain

  • surface : surface-level features that often do not represent a specific linguistic property
  • lexico-semantics : attributes associated with words
  • discourse : high-level dependencies between words and sentences
  • syntax : arrangement of words and phrases

Attribute: family

  • wordsent : basic counts of words and sentences
  • worddiff : difficulty, familiarity, frequency of words
  • partofspeech : features that deal with part-of-speech properties; we follow the Universal POS tagging scheme
  • entity : named entities or entities such as location or person
  • avgwordsent : averaging wordsent features over certain spans
  • avgworddiff : averaging worddiff features over certain spans
  • avgpartofspeech : averaging partofspeech features over certain spans
  • avgentity : averaging entity features over certain spans
  • lexicalvariation : features that measure lexical variation (that are not TTR)
  • typetokenratio : type token ratio is known to capture lexical richness of a text
  • readformula : traditional readability formulas that calculate text readability
  • readtimeformula : basic reading time formulas (in seconds)
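To make one of these families concrete, the type-token ratio behind the typetokenratio family can be sketched in plain Python. This is illustrative only; LFTK's actual features use spaCy tokenization rather than whitespace splitting.

```python
# Minimal sketch of a type-token ratio (TTR): unique tokens / total tokens.
# Illustrative only; LFTK's typetokenratio family uses spaCy tokenization.
def type_token_ratio(text):
    tokens = text.lower().split()
    return round(len(set(tokens)) / len(tokens), 3)

print(type_token_ratio("the cat sat on the mat"))  # 0.833
```

A TTR of 1.0 means every token is unique; repeated words pull the ratio down, which is why TTR is used as a rough measure of lexical richness.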

Attribute: language

  • general : LFTK can extract this feature in a language-agnostic manner when supplied with an appropriate spaCy pipeline
  • en : LFTK can extract this feature in English only

Essential Tips and To-Do Guides

Q: How to extract features by group? Do I have to specify each feature individually?

No. There is a convenient way around this, using the search function. First, think about how you want to search for your handcrafted linguistic features. In this case, we only want wordsent family features that work across languages.

```python
import lftk

# specify attributes and (IMPORTANT) set return_format to "list_key"
searched_features = lftk.search_features(family = "wordsent", language = "general", return_format = "list_key")

# ['t_word', 't_stopword', 't_punct', 't_uword', 't_sent', 't_char']
print(searched_features)
```

How is this possible? The `search_features` function returns all available features by default, and a user can restrict the returned features by specifying attributes. This is analogous to asking the function to "return all features that are {attribute 1}, {attribute 2}, ...". In the above case: "return all features that are {family = "wordsent"}, {language = "general"}".

Also, see how setting the return_format parameter to "list_key" returns a list of the feature keys that match the user-given attributes. Now, we pass those searched keys into the extract function.

```python
# now, extract the handcrafted linguistic features that you need
extracted_features = LFTK.extract(features = searched_features)

# {'t_word': 8, 't_stopword': 4, 't_punct': 1, 't_uword': 9, 't_sent': 1, 't_char': 36}
print(extracted_features)
```

Q: What if I wanted to extract features from multiple groups?

The search_features function only allows one argument per attribute, which means you will need to make multiple individual calls. For example, to obtain a list of features from the wordsent family and the readtimeformula family:

```python
searched_features_A = lftk.search_features(family = "wordsent", return_format = "list_key")
searched_features_B = lftk.search_features(family = "readtimeformula", return_format = "list_key")

result = searched_features_A + searched_features_B
```

Then, you can call the usual extraction function:

```python
extracted_features = LFTK.extract(features = result)
```
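If a feature key ever appears in more than one searched list, an order-preserving deduplication before extraction keeps the combined list clean. This is a generic Python idiom, not an LFTK API; the keys below are examples, and "rt_fast" is hypothetical.

```python
# Generic sketch: merge feature-key lists from multiple searches while
# dropping duplicates and preserving order. "rt_fast" is a hypothetical key.
list_a = ["t_word", "t_sent", "t_char"]
list_b = ["t_char", "rt_fast"]

result = list(dict.fromkeys(list_a + list_b))
print(result)  # ['t_word', 't_sent', 't_char', 'rt_fast']
```

`dict.fromkeys` is used here because Python dicts preserve insertion order, so duplicates collapse without reshuffling the keys.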

Publications

LFTK has been used in the following publications. If you don't see your paper on the list, but you used LFTK, let us know, and we'll add it to the list!

Citation

@inproceedings{lee-lee-2023-lftk,
    title = "{LFTK}: Handcrafted Features in Computational Linguistics",
    author = "Lee, Bruce W. and Lee, Jason",
    booktitle = "Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.bea-1.1",
    pages = "1--19",
    abstract = "Past research has identified a rich set of handcrafted linguistic features that can potentially assist various tasks. However, their extensive number makes it difficult to effectively select and utilize existing handcrafted features. Coupled with the problem of inconsistent implementation across research works, there has been no categorization scheme or generally-accepted feature names. This creates unwanted confusion. Also, no actively-maintained open-source library extracts a wide variety of handcrafted features. The current handcrafted feature extraction practices have several inefficiencies, and a researcher often has to build such an extraction system from the ground up. We collect and categorize more than 220 popular handcrafted features grounded on past literature. Then, we conduct a correlation analysis study on several task-specific datasets and report the potential use cases of each feature. Lastly, we devise a multilingual handcrafted linguistic feature extraction system in a systematically expandable manner. We open-source our system to give the community a rich set of pre-implemented handcrafted features.",
}

Owner

  • Name: Bruce W. Lee (이웅성)
  • Login: brucewlee
  • Kind: user
  • Location: Philadelphia, PA
  • Company: University of Pennsylvania

Research Scientist - NLP

GitHub Events

Total
  • Watch event: 24
  • Pull request event: 3
  • Fork event: 3
  • Create event: 1
Last Year
  • Watch event: 24
  • Pull request event: 3
  • Fork event: 3
  • Create event: 1

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 48
  • Total Committers: 3
  • Avg Commits per committer: 16.0
  • Development Distribution Score (DDS): 0.042
Past Year
  • Commits: 1
  • Committers: 1
  • Avg Commits per committer: 1.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • Bruce Lee (w****e@g****m): 46 commits
  • Dang Nguyen (d****t@g****m): 1 commit
  • Alex Strick van Linschoten (s****l): 1 commit

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 4
  • Total pull requests: 27
  • Average time to close issues: 1 day
  • Average time to close pull requests: about 1 hour
  • Total issue authors: 4
  • Total pull request authors: 4
  • Average comments per issue: 1.75
  • Average comments per pull request: 0.04
  • Merged pull requests: 26
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • satoshi-2000 (1)
  • chaojiang06 (1)
  • Annafavaro (1)
  • masun (1)
Pull Request Authors
  • brucewlee (24)
  • dangne (2)
  • SiddharthPant (2)
  • strickvl (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 136 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 18
  • Total maintainers: 1
pypi.org: lftk

Comprehensive Handcrafted Linguistic Features Extraction in Python

  • Versions: 18
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 136 Last month
Rankings
  • Dependent packages count: 6.6%
  • Downloads: 10.0%
  • Average: 22.0%
  • Forks count: 30.5%
  • Dependent repos count: 30.6%
  • Stargazers count: 32.3%
Maintainers (1)
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • Babel ==2.12.1
  • Jinja2 ==3.1.2
  • MarkupSafe ==2.1.2
  • PyYAML ==6.0
  • Pygments ==2.15.1
  • Sphinx ==6.2.1
  • alabaster ==0.7.13
  • bleach ==6.0.0
  • blis ==0.7.9
  • catalogue ==2.0.8
  • certifi ==2022.12.7
  • charset-normalizer ==3.1.0
  • click ==8.1.3
  • confection ==0.0.4
  • cymem ==2.0.7
  • docutils ==0.19
  • idna ==3.4
  • imagesize ==1.4.1
  • importlib-metadata ==6.6.0
  • jaraco.classes ==3.2.3
  • keyring ==23.13.1
  • langcodes ==3.3.0
  • lftk ==1.0.9
  • markdown-it-py ==2.2.0
  • mdit-py-plugins ==0.3.5
  • mdurl ==0.1.2
  • more-itertools ==9.1.0
  • murmurhash ==1.0.9
  • myst-parser ==1.0.0
  • ndjson ==0.3.1
  • numpy ==1.24.3
  • packaging ==23.1
  • pandas ==2.0.1
  • pathy ==0.10.1
  • pkginfo ==1.9.6
  • preshed ==3.0.8
  • pydantic ==1.10.7
  • python-dateutil ==2.8.2
  • pytz ==2023.3
  • readme-renderer ==37.3
  • requests ==2.29.0
  • requests-toolbelt ==0.10.1
  • rfc3986 ==2.0.0
  • rich ==13.3.5
  • six ==1.16.0
  • smart-open ==6.3.0
  • snowballstemmer ==2.2.0
  • spacy ==3.5.2
  • spacy-legacy ==3.0.12
  • spacy-loggers ==1.0.4
  • sphinxcontrib-applehelp ==1.0.4
  • sphinxcontrib-devhelp ==1.0.2
  • sphinxcontrib-htmlhelp ==2.0.1
  • sphinxcontrib-jsmath ==1.0.1
  • sphinxcontrib-qthelp ==1.0.3
  • sphinxcontrib-serializinghtml ==1.1.5
  • srsly ==2.4.6
  • thinc ==8.1.9
  • tqdm ==4.65.0
  • twine ==4.0.2
  • typer ==0.7.0
  • typing_extensions ==4.5.0
  • tzdata ==2023.3
  • urllib3 ==1.26.15
  • wasabi ==1.1.1
  • webencodings ==0.5.1
  • zipp ==3.15.0
setup.py pypi
  • ndjson *
  • pandas *
  • spacy *