Arabica

Arabica: A Python package for exploratory analysis of text data - Published in JOSS (2024)

https://github.com/petrkorab/arabica

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 3 DOI reference(s) in README and JOSS metadata
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
✓ JOSS paper metadata: Published in Journal of Open Source Software

Keywords

exploratory-data-analysis nlp text-mining

Last synced: 6 months ago

Repository

Python package for text mining of time-series data

Basic Info

Host: GitHub
Owner: PetrKorab
License: apache-2.0
Language: Python
Default Branch: main
Homepage:
Size: 107 MB

Statistics

Stars: 75
Watchers: 3
Forks: 16
Open Issues: 0
Releases: 0

Topics

exploratory-data-analysis nlp text-mining

Created over 3 years ago · Last pushed 10 months ago

Metadata Files

Readme License

Arabica

Python package for text mining of time-series data

Text data is often recorded as a time series with significant variability over time. Some examples of time-series text data include social media conversations, product reviews, research metadata, central banker communication, and newspaper headlines. Arabica makes exploratory analysis of these datasets simple by providing:

Descriptive n-gram analysis: n-gram frequencies
Time-series n-gram analysis: n-gram frequencies over a period
Text visualization: n-gram heatmap, line plot, word cloud
Sentiment analysis: VADER sentiment classifier
Financial sentiment analysis: with FinVADER
Structural breaks identification: Jenks Optimization Method

It automatically removes punctuation from the input text. It can also apply all or any combination of the following cleaning operations:

Remove digits from the text
Remove the standard list(s) of stopwords
Remove an additional list of stop words

Arabica works with texts of languages based on the Latin alphabet, uses cleantext for punctuation cleaning, and enables stop words removal for languages in the NLTK corpus of stopwords.
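The cleaning steps listed above can be illustrated with a minimal standard-library sketch. This is not Arabica's implementation: the function `clean_text`, its parameters, and the tiny `EXAMPLE_STOPWORDS` set are hypothetical names chosen for illustration, whereas Arabica relies on cleantext and the NLTK stop word corpus.

```python
import re
import string

# Hypothetical stop word list for illustration; Arabica uses the NLTK corpus instead.
EXAMPLE_STOPWORDS = {"the", "a", "of", "and", "is"}


def clean_text(text, remove_digits=True, stopwords=None, lower_case=True):
    """Strip punctuation, then optionally digits, casing, and stop words."""
    # Punctuation is always removed, mirroring the automatic step described above.
    text = text.translate(str.maketrans("", "", string.punctuation))
    if remove_digits:
        text = re.sub(r"\d+", "", text)
    if lower_case:
        text = text.lower()
    tokens = text.split()
    if stopwords:
        # Compare case-insensitively so "The" is dropped even without lowercasing.
        tokens = [t for t in tokens if t.lower() not in stopwords]
    return " ".join(tokens)


cleaned = clean_text(
    "The price of 3 coffees is $12.50, and rising!",
    stopwords=EXAMPLE_STOPWORDS,
)
print(cleaned)
```

Each optional step is controlled by its own argument, so any combination of the operations can be switched on or off independently.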

It reads dates in the following date and datetime formats:

US-style: MM/DD/YYYY (2013-12-31, Feb-09-2009, 2013-12-31 11:46:17, etc.)
European-style: DD/MM/YYYY (2013-31-12, 09-Feb-2009, 2013-31-12 11:46:17, etc.)
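Why the date convention has to be stated explicitly can be shown with a short sketch. The format strings and the helper `parse_date` below are assumptions made for illustration and cover only one slash-separated variant of each style; Arabica itself accepts the wider range of inputs listed above.

```python
from datetime import datetime

# Assumed format strings for the two conventions (illustration only).
DATE_FORMATS = {
    "us": "%m/%d/%Y",   # MM/DD/YYYY
    "eur": "%d/%m/%Y",  # DD/MM/YYYY
}


def parse_date(value, date_format):
    """Parse a date string according to the chosen convention ('us' or 'eur')."""
    return datetime.strptime(value, DATE_FORMATS[date_format]).date()


# The same string resolves to different days depending on the convention.
us_date = parse_date("02/09/2009", "us")    # 9 February 2009
eur_date = parse_date("02/09/2009", "eur")  # 2 September 2009
print(us_date.isoformat(), eur_date.isoformat())
```

An ambiguous string such as 02/09/2009 is valid under both conventions, which is why a wrong `date_format` value can silently shift observations into the wrong period.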

Installation

Arabica requires Python 3.8 - 3.10 and the following packages: NLTK (stop words removal), cleantext (text cleaning), wordcloud (word cloud visualization), plotnine (heatmaps and line graphs), matplotlib (word clouds and graphical operations), vaderSentiment (sentiment analysis), finvader (financial sentiment analysis), and jenkspy (breakpoint identification).

To install using pip, use:

pip install arabica

Usage

Import the library:

    from arabica import arabica_freq
    from arabica import cappuccino
    from arabica import coffee_break

Choose a method:

arabica_freq enables a specific set of cleaning operations (lowercasing and removal of numbers, common stop words, and additional stop words) and returns a dataframe with aggregated unigram, bigram, and trigram frequencies over a period.

    def arabica_freq(text: str,                # Text
                     time: str,                # Time
                     date_format: str,         # Date format: 'eur' - European, 'us' - American
                     time_freq: str,           # Aggregation period: 'Y'/'M'/'D', if no aggregation: 'ungroup'
                     max_words: int,           # Maximum of most frequent n-grams displayed for each period
                     stopwords: [],            # Languages for stop words
                     stopwords_ext: [],        # Languages for extended stop words list, currently provided lists: 'english'
                     skip: [],                 # Remove additional strings. Cuts the characters out without tokenization, useful for specific or rare characters. Be careful not to bias the dataset.
                     numbers: bool = False,    # Remove numbers
                     lower_case: bool = False  # Lowercase text
    )
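To make the idea of n-gram frequencies aggregated over a period concrete, here is a minimal standard-library sketch. It is not how Arabica computes its output: `ngram_counts_by_period` is a hypothetical helper, it assumes the text is already cleaned, and it handles only yearly ('Y') and monthly (any other value) grouping.

```python
from collections import Counter, defaultdict
from datetime import date


def ngram_counts_by_period(records, n=2, time_freq="Y", max_words=2):
    """Count n-grams per period and keep the `max_words` most frequent ones.

    `records` is an iterable of (date, text) pairs with already cleaned text.
    """
    counts = defaultdict(Counter)
    for day, text in records:
        # Build the period key: year for 'Y', year-month otherwise.
        key = f"{day.year}" if time_freq == "Y" else f"{day.year}-{day.month:02d}"
        tokens = text.split()
        # Zipping n shifted copies of the token list yields consecutive n-grams.
        ngrams = zip(*(tokens[i:] for i in range(n)))
        counts[key].update(",".join(g) for g in ngrams)
    return {k: c.most_common(max_words) for k, c in sorted(counts.items())}


records = [
    (date(2013, 1, 5), "great coffee great coffee shop"),
    (date(2013, 6, 1), "great coffee"),
    (date(2014, 2, 9), "bad service bad service"),
]
result = ngram_counts_by_period(records, n=2, time_freq="Y", max_words=1)
print(result)
```

N-grams are counted within each document separately, so a bigram never spans two records that happen to fall in the same period.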

cappuccino enables cleaning operations (lower casing, numbers, common stop words, and additional stop words removal) and provides plots for descriptive (word cloud) and time-series (heatmap, line plot) visualization.

    def cappuccino(text: str,                # Text
                   time: str,                # Time
                   date_format: str,         # Date format: 'eur' - European, 'us' - American
                   plot: str,                # Chart type: 'wordcloud'/'heatmap'/'line'
                   ngram: int,               # N-gram size, 1 = unigram, 2 = bigram, 3 = trigram
                   time_freq: str,           # Aggregation period: 'Y'/'M', if no aggregation: 'ungroup'
                   max_words: int,           # Maximum of most frequent n-grams displayed for each period
                   stopwords: [],            # Languages for stop words
                   stopwords_ext: [],        # Languages for extended stop words list, currently provided lists: 'english'
                   skip: [],                 # Remove additional strings. Cuts the characters out without tokenization, useful for specific or rare characters. Be careful not to bias the dataset.
                   numbers: bool = False,    # Remove numbers
                   lower_case: bool = False  # Lowercase text
    )

coffee_break provides sentiment analysis and breakpoint identification in aggregated time series of sentiment. The implemented models are:

VADER is a lexicon and rule-based sentiment classifier attuned explicitly to general language expressed in social media
FinVADER improves VADER's classification accuracy on financial texts, including two financial lexicons

Break points in the time series are identified with the Fisher-Jenks algorithm (Jenks, 1977. Optimal data classification for choropleth maps).

    def coffee_break(text: str,                # Text
                     time: str,                # Time
                     date_format: str,         # Date format: 'eur' - European, 'us' - American
                     model: str,               # Sentiment classifier, 'vader' - general language, 'finvader' - financial text
                     skip: [],                 # Remove additional strings. Cuts the characters out without tokenization, useful for specific or rare characters. Be careful not to bias the dataset.
                     preprocess: bool = False, # Clean data from numbers and punctuation
                     time_freq: str,           # Aggregation period: 'Y'/'M'
                     n_breaks: int             # Number of breakpoints: min. 2
    )
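The criterion behind Fisher-Jenks classification can be sketched in plain Python: split the sorted values into contiguous groups so that the total within-group sum of squared deviations is as small as possible. The function below is an illustrative dynamic-programming version, not the jenkspy implementation that Arabica depends on, and the name `jenks_breaks`, the parameter `n_classes`, and the sample sentiment values are all hypothetical. How Arabica's `n_breaks` argument maps onto the number of classes is not specified here.

```python
def jenks_breaks(values, n_classes):
    """Split sorted values into `n_classes` contiguous groups that minimize
    the total within-group sum of squared deviations (the Jenks criterion)."""
    data = sorted(values)
    n = len(data)
    if not 1 <= n_classes <= n:
        raise ValueError("n_classes must be between 1 and len(values)")

    # Prefix sums give the squared deviation of any slice in O(1).
    prefix, prefix_sq = [0.0], [0.0]
    for v in data:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    def ssd(i, j):
        """Sum of squared deviations from the mean for data[i:j]."""
        total = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[k][j]: best cost of splitting data[:j] into k groups.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    split = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = cost[k - 1][i] + ssd(i, j)
                if candidate < cost[k][j]:
                    cost[k][j] = candidate
                    split[k][j] = i

    # Walk back through the split table to recover each group's upper bound.
    bounds, j = [], n
    for k in range(n_classes, 0, -1):
        bounds.append(data[j - 1])
        j = split[k][j]
    return [data[0]] + bounds[::-1]


sentiment = [0.10, 0.12, 0.11, 0.45, 0.47, 0.44, 0.80, 0.82]
breaks = jenks_breaks(sentiment, n_classes=3)
print(breaks)
```

The returned list holds the minimum value followed by the upper bound of each group, so three groups produce four numbers. The cubic-time loop is acceptable for short aggregated series such as yearly or monthly sentiment, but a dedicated library is the better choice for large inputs.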

Documentation, examples and tutorials

Read the documentation: https://arabica.readthedocs.io/

For more examples of coding, read these tutorials:

General use:

Sentiment Analysis and Structural Breaks in Time-Series Text Data
Visualization Module in Arabica Speeds Up Text Data Exploration
Text as Time Series: Arabica 1.0 Brings New Features for Exploratory Text Data Analysis

Applications:

Business Intelligence: Customer Satisfaction Measurement with N-gram and Sentiment Analysis
Research meta-data analysis: Research Article Meta-data Description Made Quick and Easy
Media coverage text mining
Social media analysis

💬 Please visit here for any questions, issues, bugs, and suggestions.

Citation

Using arabica in a paper or thesis? Please cite this paper:

```bibtex
@article{Koráb:2024,
  author = {Koráb, P. and Poměnková, J.},
  title = {Arabica: A Python package for exploratory analysis of text data},
  journal = {Journal of Open Source Software},
  volume = {9},
  number = {97},
  pages = {6186},
  year = {2024},
  doi = {10.21105/joss.06186},
}
```

Owner

Name: Petr Koráb
Login: PetrKorab
Kind: user
Location: Prague
Company: Lentiamo

Website: https://petrkorab.medium.com
Twitter: PetrKorab
Repositories: 6
Profile: https://github.com/PetrKorab

Data Analyst at Lentiamo, Prague & Visiting Researcher, Zeppelin University, Friedrichshafen.

JOSS Publication

Arabica: A Python package for exploratory analysis of text data

Published

May 05, 2024

DOI

10.21105/joss.06186

Volume 9, Issue 97, Page 6186

Authors

Petr Koráb

Zeppelin University in Friedrichshafen, Germany

Jitka Poměnková

Brno University of Technology, Department of Radio Electronics, Czech Republic

Editor

Olivia Guest

GitHub Events

Total

Watch event: 6
Push event: 9
Fork event: 1

Last Year

Watch event: 6
Push event: 9
Fork event: 1

Committers

Last synced: 7 months ago

All Time

Total Commits: 1,588
Total Committers: 5
Avg Commits per committer: 317.6
Development Distribution Score (DDS): 0.003

Past Year

Commits: 10
Committers: 1
Avg Commits per committer: 10.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Petr Koráb	x**b@g**m	1,584
petr-korab-testing	1****g	1
Olivia Guest	o****t	1
Kapil Sharma	b**7@g**m	1
Dr. Chandrakant Bangar	1****t	1
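
The committer statistics above can be reproduced with a few lines of arithmetic. This sketch assumes the Development Distribution Score is defined as one minus the share of commits made by the most active committer, which is consistent with the reported figures; the function name is hypothetical.

```python
def development_distribution_score(commit_counts):
    """Return 1 minus the share of commits made by the most active committer."""
    total = sum(commit_counts)
    return 1 - max(commit_counts) / total


all_time = [1584, 1, 1, 1, 1]  # commits per committer from the table above
dds = development_distribution_score(all_time)
avg = sum(all_time) / len(all_time)
print(round(dds, 3), round(avg, 1))
```

Under this definition a score near zero means that a single committer accounts for nearly all commits, and a project with only one committer in a period scores exactly 0.0.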

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 4
Total pull requests: 10
Average time to close issues: 4 days
Average time to close pull requests: 2 days
Total issue authors: 4
Total pull request authors: 5
Average comments per issue: 1.75
Average comments per pull request: 0.7
Merged pull requests: 6
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

karwester (1)
benjaminhuus32 (1)
linuxscout (1)
hosamn (1)

Pull Request Authors

PetrKorab (5)
petr-korab-testing (2)
oliviaguest (2)
imbishal7 (1)
DrChandrakant (1)

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 556 last-month

Total dependent packages: 1
Total dependent repositories: 1
Total versions: 63
Total maintainers: 1

pypi.org: arabica

Python package for text mining of time-series data

Homepage: https://github.com/PetrKorab/Arabica
Documentation: https://arabica.readthedocs.io/
License: OSI Approved :: Apache Software License
Latest release: 1.8.2
published about 1 year ago

Versions: 63
Dependent Packages: 1
Dependent Repositories: 1
Downloads: 556 Last month

Rankings

Dependent packages count: 10.0%

Downloads: 10.4%

Stargazers count: 11.4%

Forks count: 11.9%

Average: 13.1%

Dependent repos count: 21.8%

Maintainers (1)

PetrKorab

Last synced: 6 months ago

Dependencies

.github/workflows/draft-pdf.yml actions

actions/checkout v3 composite
actions/upload-artifact v1 composite
openjournals/openjournals-draft-action master composite

requirements.txt pypi

cleantext ==1.1.4
finvader *
jenkspy ==0.3.2
matplotlib ==3.6.0
matplotlib-inline ==0.1.6
mizani ==0.9.2
nltk ==3.6.2
pandas ==1.4.0
pillow ==9.4.0
plotnine ==0.10.1
regex ==2022.10.31
vaderSentiment ==3.3.2
wordcloud ==1.8.2.2

requirements_tests.txt pypi

arabica ==1.7.7 test
pandas ==1.4.0 test
pip ==23.3.1 test

setup.py pypi

cleantext *
finvader *
jenkspy *
matplotlib *
matplotlib-inline *
mizani *
nltk *
pandas *
pillow *
plotnine *
regex *
vaderSentiment *
wordcloud *
