Arabica
Arabica: A Python package for exploratory analysis of text data - Published in JOSS (2024)
Science Score: 93.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 3 DOI reference(s) in README and JOSS metadata
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ✓ JOSS paper metadata: published in Journal of Open Source Software
Repository
Python package for text mining of time-series data
Statistics
- Stars: 75
- Watchers: 3
- Forks: 16
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Arabica
Python package for text mining of time-series data
Text data is often recorded as a time series with significant variability over time. Some examples of time-series text data include social media conversations, product reviews, research metadata, central banker communication, and newspaper headlines. Arabica makes exploratory analysis of these datasets simple by providing:
- Descriptive n-gram analysis: n-gram frequencies
- Time-series n-gram analysis: n-gram frequencies over a period
- Text visualization: n-gram heatmap, line plot, word cloud
- Sentiment analysis: VADER sentiment classifier
- Financial sentiment analysis: with FinVADER
- Structural breaks identification: Jenks Optimization Method
It automatically removes punctuation from the input text. It can also apply any combination of the following cleaning operations:
- Remove digits from the text
- Remove the standard list(s) of stop words
- Remove an additional list of stop words
Arabica works with texts in languages based on the Latin alphabet, uses cleantext for punctuation cleaning, and supports stop-word removal for the languages in the NLTK stop words corpus.
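The cleaning steps listed above can be sketched with the standard library alone. This is a minimal illustration, not Arabica's implementation: the `clean_text` function, its parameters, and the tiny `STOPWORDS` set are hypothetical stand-ins for the cleantext and NLTK machinery the package relies on.

```python
import string

# Tiny illustrative stop-word list (an assumption for this sketch);
# Arabica itself draws on the NLTK stop words corpus.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}


def clean_text(text, remove_digits=True, stopwords=STOPWORDS, extra_stopwords=()):
    """Strip punctuation, then optionally drop digits and stop words."""
    # Punctuation is always removed, mirroring the automatic step described above.
    text = text.translate(str.maketrans("", "", string.punctuation))
    if remove_digits:
        text = "".join(ch for ch in text if not ch.isdigit())
    # Merge the standard and the additional stop-word lists.
    blocked = set(stopwords) | set(extra_stopwords)
    tokens = [tok for tok in text.split() if tok.lower() not in blocked]
    return " ".join(tokens)


print(clean_text("The price of oil rose 5% in 2023!", extra_stopwords=["rose"]))
# price oil
```

Passing `remove_digits=False` or an empty `stopwords` collection switches off the corresponding step, which mirrors the "all or a selected combination" behaviour described above.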
It reads dates in the following date and datetime formats:
- US-style: MM/DD/YYYY (2013-12-31, Feb-09-2009, 2013-12-31 11:46:17, etc.)
- European-style: DD/MM/YYYY (2013-31-12, 09-Feb-2009, 2013-31-12 11:46:17, etc.)
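The difference between the two styles matters because the same string can denote two different days. The sketch below shows this with `datetime.strptime`; the `DATE_PATTERNS` table and the `parse_date` helper are hypothetical and cover only a few of the formats mentioned above, so they should not be read as the parsing logic Arabica actually uses.

```python
from datetime import datetime

# Hypothetical pattern tables for illustration only; the set of formats
# Arabica accepts is broader than this.
DATE_PATTERNS = {
    "us": ["%m/%d/%Y", "%Y-%m-%d", "%b-%d-%Y", "%Y-%m-%d %H:%M:%S"],
    "eur": ["%d/%m/%Y", "%Y-%d-%m", "%d-%b-%Y", "%Y-%d-%m %H:%M:%S"],
}


def parse_date(value, date_format):
    """Try each pattern of the chosen style until one matches."""
    for pattern in DATE_PATTERNS[date_format]:
        try:
            return datetime.strptime(value, pattern)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized {date_format!r} date: {value!r}")


# The same string resolves to different days depending on the style.
print(parse_date("03/04/2021", "us").date())   # 2021-03-04
print(parse_date("03/04/2021", "eur").date())  # 2021-04-03
```

Note that the `%b` patterns depend on the active locale for month abbreviations, so the named-month examples assume an English locale.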
Installation
Arabica requires Python 3.8 - 3.10 and the following packages:
- NLTK: stop words removal
- cleantext: text cleaning
- wordcloud: word cloud visualization
- plotnine: heatmaps and line graphs
- matplotlib: word clouds and graphical operations
- vaderSentiment: sentiment analysis
- finvader: financial sentiment analysis
- jenkspy: breakpoint identification
To install using pip, use:
pip install arabica
Usage
- Import the library:
```python
from arabica import arabica_freq
from arabica import cappuccino
from arabica import coffee_break
```
- Choose a method:
arabica_freq applies an optional set of cleaning operations (lowercasing and removal of numbers, common stop words, and additional stop words) and returns a dataframe with unigram, bigram, and trigram frequencies aggregated over a period.
```python
def arabica_freq(text: str,                # Text
                 time: str,                # Time
                 date_format: str,         # Date format: 'eur' - European, 'us' - American
                 time_freq: str,           # Aggregation period: 'Y'/'M'/'D', if no aggregation: 'ungroup'
                 max_words: int,           # Maximum of most frequent n-grams displayed for each period
                 stopwords: [],            # Languages for stop words
                 stopwords_ext: [],        # Languages for extended stop words list, currently provided lists: 'english'
                 skip: [],                 # Remove additional strings. Cuts the characters out without tokenization, useful for specific or rare characters. Be careful not to bias the dataset.
                 numbers: bool = False,    # Remove numbers
                 lower_case: bool = False  # Lowercase text
)
```
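To make the idea of "n-gram frequencies aggregated over a period" concrete, here is a small standard-library sketch. It is not how Arabica computes its output (the package returns a dataframe and handles date parsing and cleaning itself); the helper names `ngrams` and `ngram_freq_by_year` and the sample records are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_freq_by_year(records, n=1, max_words=2):
    """Count n-grams per year and keep the `max_words` most frequent ones.

    `records` is an iterable of (year, text) pairs with already-cleaned text.
    """
    counts = defaultdict(Counter)
    for year, text in records:
        counts[year].update(ngrams(text.lower().split(), n))
    return {year: counter.most_common(max_words)
            for year, counter in sorted(counts.items())}


records = [
    (2022, "inflation rises again"),
    (2022, "inflation rises sharply"),
    (2023, "rates cut expected"),
    (2023, "rates cut delayed"),
]
print(ngram_freq_by_year(records, n=2, max_words=1))
# {2022: [('inflation rises', 2)], 2023: [('rates cut', 2)]}
```

Here `max_words` plays the same role as in the signature above: it caps how many of the most frequent n-grams are kept for each period.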
cappuccino enables cleaning operations (lower casing, numbers, common stop words, and additional stop words removal) and provides plots for descriptive (word cloud) and time-series (heatmap, line plot) visualization.
```python
def cappuccino(text: str,                # Text
               time: str,                # Time
               date_format: str,         # Date format: 'eur' - European, 'us' - American
               plot: str,                # Chart type: 'wordcloud'/'heatmap'/'line'
               ngram: int,               # N-gram size, 1 = unigram, 2 = bigram, 3 = trigram
               time_freq: str,           # Aggregation period: 'Y'/'M', if no aggregation: 'ungroup'
               max_words: int,           # Maximum of most frequent n-grams displayed for each period
               stopwords: [],            # Languages for stop words
               stopwords_ext: [],        # Languages for extended stop words list, currently provided lists: 'english'
               skip: [],                 # Remove additional strings. Cuts the characters out without tokenization, useful for specific or rare characters. Be careful not to bias the dataset.
               numbers: bool = False,    # Remove numbers
               lower_case: bool = False  # Lowercase text
)
```
coffee_break provides sentiment analysis and breakpoint identification in aggregated time series of sentiment. The implemented models are:
- VADER is a lexicon and rule-based sentiment classifier attuned explicitly to general language expressed in social media
- FinVADER improves VADER's classification accuracy on financial texts, including two financial lexicons
Breakpoints in the time series are identified with the Fisher-Jenks algorithm (Jenks, 1977. Optimal data classification for choropleth maps).
```python
def coffee_break(text: str,                # Text
                 time: str,                # Time
                 date_format: str,         # Date format: 'eur' - European, 'us' - American
                 model: str,               # Sentiment classifier, 'vader' - general language, 'finvader' - financial text
                 skip: [],                 # Remove additional strings. Cuts the characters out without tokenization, useful for specific or rare characters. Be careful not to bias the dataset.
                 preprocess: bool = False, # Clean data from numbers and punctuation
                 time_freq: str,           # Aggregation period: 'Y'/'M'
                 n_breaks: int             # Number of breakpoints: min. 2
)
```
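The objective behind Fisher-Jenks classification is to split a set of values into classes so that the squared deviations from each class mean are as small as possible. The sketch below solves that one-dimensional problem exactly with dynamic programming. It is an illustrative assumption rather than the code Arabica or jenkspy runs: the function name `natural_breaks`, its return convention (the upper bound of each class), and the sample sentiment values are all hypothetical.

```python
def natural_breaks(values, n_classes):
    """Return the upper bound of each class in an optimal 1-D partition."""
    data = sorted(values)
    n = len(data)
    if not 1 <= n_classes <= n:
        raise ValueError("n_classes must be between 1 and len(values)")

    # Prefix sums give the squared deviations of any slice in O(1).
    prefix = [0.0] * (n + 1)
    prefix_sq = [0.0] * (n + 1)
    for i, x in enumerate(data):
        prefix[i + 1] = prefix[i] + x
        prefix_sq[i + 1] = prefix_sq[i] + x * x

    def ssd(i, j):
        """Sum of squared deviations of data[i:j] from its mean."""
        total = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[c][j]: minimal total ssd when data[:j] is split into c classes.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    split = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for c in range(1, n_classes + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):  # the last class is data[i:j]
                candidate = cost[c - 1][i] + ssd(i, j)
                if candidate < cost[c][j]:
                    cost[c][j] = candidate
                    split[c][j] = i

    # Walk back through the stored split points to recover the boundaries.
    bounds = []
    j = n
    for c in range(n_classes, 0, -1):
        bounds.append(data[j - 1])
        j = split[c][j]
    return bounds[::-1]


# Hypothetical yearly mean sentiment scores.
sentiment = [0.10, 0.12, 0.11, 0.35, 0.38, 0.36, -0.20, -0.22]
print(natural_breaks(sentiment, 3))
# [-0.2, 0.12, 0.38]
```

The classification works on the sorted values, so it groups periods by sentiment level rather than by their position in time; periods whose scores fall into different classes mark where the level of the series shifts.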
Documentation, examples and tutorials
- Read the documentation
For more coding examples, read these tutorials:
General use:
- Sentiment Analysis and Structural Breaks in Time-Series Text Data
- Visualization Module in Arabica Speeds Up Text Data Exploration
- Text as Time Series: Arabica 1.0 Brings New Features for Exploratory Text Data Analysis
Applications:
- Business Intelligence: Customer Satisfaction Measurement with N-gram and Sentiment Analysis
- Research meta-data analysis: Research Article Meta-data Description Made Quick and Easy
- Media coverage text mining
- Social media analysis
💬 Please visit here for any questions, issues, bugs, and suggestions.
Citation
Using arabica in a paper or thesis? Please cite this paper:
```bibtex
@article{Koráb:2024,
  author  = {Koráb, P. and Poměnková, J.},
  title   = {Arabica: A Python package for exploratory analysis of text data},
  journal = {Journal of Open Source Software},
  volume  = {9},
  number  = {97},
  pages   = {6186},
  year    = {2024},
  doi     = {10.21105/joss.06186},
}
```
Owner
- Name: Petr Koráb
- Login: PetrKorab
- Kind: user
- Location: Prague
- Company: Lentiamo
- Website: https://petrkorab.medium.com
- Twitter: PetrKorab
- Repositories: 6
- Profile: https://github.com/PetrKorab
Data Analyst at Lentiamo, Prague & Visiting Researcher, Zeppelin University, Friedrichshafen.
JOSS Publication
Arabica: A Python package for exploratory analysis of text data
Authors
Tags
text mining, exploratory data analysis, sentiment analysis, data visualization
GitHub Events
Total
- Watch event: 6
- Push event: 9
- Fork event: 1
Last Year
- Watch event: 6
- Push event: 9
- Fork event: 1
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Petr Koráb | x****b@g****m | 1,584 |
| petr-korab-testing | 1****g | 1 |
| Olivia Guest | o****t | 1 |
| Kapil Sharma | b****7@g****m | 1 |
| Dr. Chandrakant Bangar | 1****t | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 4
- Total pull requests: 10
- Average time to close issues: 4 days
- Average time to close pull requests: 2 days
- Total issue authors: 4
- Total pull request authors: 5
- Average comments per issue: 1.75
- Average comments per pull request: 0.7
- Merged pull requests: 6
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- karwester (1)
- benjaminhuus32 (1)
- linuxscout (1)
- hosamn (1)
Pull Request Authors
- PetrKorab (5)
- petr-korab-testing (2)
- oliviaguest (2)
- imbishal7 (1)
- DrChandrakant (1)
Packages
- Total packages: 1
- Total downloads: 556 last month (PyPI)
- Total dependent packages: 1
- Total dependent repositories: 1
- Total versions: 63
- Total maintainers: 1
pypi.org: arabica
Python package for text mining of time-series data
- Homepage: https://github.com/PetrKorab/Arabica
- Documentation: https://arabica.readthedocs.io/
- License: OSI Approved :: Apache Software License
- Latest release: 1.8.2 (published about 1 year ago)
Maintainers (1)
Dependencies
- actions/checkout v3 composite
- actions/upload-artifact v1 composite
- openjournals/openjournals-draft-action master composite
- cleantext ==1.1.4
- finvader *
- jenkspy ==0.3.2
- matplotlib ==3.6.0
- matplotlib-inline ==0.1.6
- mizani ==0.9.2
- nltk ==3.6.2
- pandas ==1.4.0
- pillow ==9.4.0
- plotnine ==0.10.1
- regex ==2022.10.31
- vaderSentiment ==3.3.2
- wordcloud ==1.8.2.2
- arabica ==1.7.7 test
- pandas ==1.4.0 test
- pip ==23.3.1 test
- cleantext *
- finvader *
- jenkspy *
- matplotlib *
- matplotlib-inline *
- mizani *
- nltk *
- pandas *
- pillow *
- plotnine *
- regex *
- vaderSentiment *
- wordcloud *
