pandas-association-measures

Statistical association measures for Python pandas

https://github.com/fau-klue/pandas-association-measures

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 5 committers (20.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.5%) to scientific vocabulary

Keywords

corpus-linguistics pandas
Last synced: 7 months ago

Repository

Statistical association measures for Python pandas

Basic Info
  • Host: GitHub
  • Owner: fau-klue
  • License: MIT
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 888 KB
Statistics
  • Stars: 10
  • Watchers: 2
  • Forks: 2
  • Open Issues: 2
  • Releases: 12
Topics
corpus-linguistics pandas
Created almost 7 years ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md


Statistical Association Measures for Python pandas

Association measures are mathematical formulae that interpret cooccurrence frequency data. For each pair of words extracted from a corpus, they compute an association score, a single real value that indicates the amount of (statistical) association between the two words.

http://www.collocations.de/AM/index.html
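For a concrete sense of what such a score looks like, one simple measure (the z-score) can be computed by hand from a pair's contingency table. This is a plain-Python illustration, independent of the package; the counts are those of "appreciated" from the example data below:

```python3
from math import sqrt

# Contingency table for "appreciated" (example data below)
O11, O12, O21, O22 = 1, 15333, 1, 176663

N = O11 + O12 + O21 + O22   # sample size
R1 = O11 + O12              # row marginal (frequency of word 1)
C1 = O11 + O21              # column marginal (frequency of word 2)

# Expected cooccurrence frequency under independence
E11 = R1 * C1 / N

# z-score: standardised difference between observed and expected
z = (O11 - E11) / sqrt(E11)

print(round(E11, 6))  # 0.159731
print(round(z, 4))    # 2.1024
```

Positive scores indicate that the pair cooccurs more often than expected by chance; the same O11/E11 building blocks underlie all measures implemented here.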

Installation

Dependencies
  • numpy
  • pandas
  • scipy
Installation using pip
python3 -m pip install association-measures
Installation from source (requires Cython)
# Compile Cython code
python3 setup.py build_ext --inplace

# Cython already compiled
python3 setup.py install

Usage

Input

The module expects a pandas dataframe with reasonably named columns; i.e. the columns must follow one of the following notations:

contingency table

```python3
df
           item  O11    O12  O21     O22
1   appreciated    1  15333    1  176663
2       certain    7  15327  113  176551
3     measuring    1  15333    7  176657
4  particularly    2  15332   45  176619
5       arrived    2  15332    3  176661
```

frequency signature (see Evert 2008: Figure 8)

```python3
df
           item  f     f1   f2       N
1   appreciated  1  15334    2  191998
2       certain  7  15334  120  191998
3     measuring  1  15334    8  191998
4  particularly  2  15334   47  191998
5       arrived  2  15334    5  191998
```

where `f = O11`, `f1 = O11 + O12`, `f2 = O11 + O21`, `N = O11 + O12 + O21 + O22`.

corpus frequencies (“keyword-friendly”)

```python3
df
           item  f1     N1   f2      N2
1   appreciated   1  15334    1  176664
2       certain   7  15334  113  176664
3     measuring   1  15334    7  176664
4  particularly   2  15334   45  176664
5       arrived   2  15334    3  176664
```

where `f1 = O11`, `f2 = O21`, `N1 = O11 + O12`, `N2 = O21 + O22`.

Observed and Expected Frequencies

Given a dataframe following one of the notations specified above, you can calculate expected frequencies via

```python3
import association_measures.frequencies as fq

fq.expected_frequencies(df)
        E11           E12         E21            E22
1  0.159731  15333.840269    1.840269  176662.159731
2  9.583850  15324.416150  110.416150  176553.583850
3  0.638923  15333.361077    7.361077  176656.638923
4  3.753675  15330.246325   43.246325  176620.753675
5  0.399327  15333.600673    4.600673  176659.399327
```

The observed_frequencies method converts to contingency notation:

```python3
import association_measures.frequencies as fq

fq.observed_frequencies(df)
   O11    O12  O21     O22
1    1  15333    1  176663
2    7  15327  113  176551
3    1  15333    7  176657
4    2  15332   45  176619
5    2  15332    3  176661
```

Note that all methods return dataframes that are indexed the same way the input dataframe is indexed:

```python3
df
              f     f1   f2       N
item
appreciated   1  15334    2  191998
certain       7  15334  120  191998
measuring     1  15334    8  191998
particularly  2  15334   47  191998
arrived       2  15334    5  191998

fq.observed_frequencies(df)
              O11    O12  O21     O22
item
appreciated     1  15333    1  176663
certain         7  15327  113  176551
measuring       1  15333    7  176657
particularly    2  15332   45  176619
arrived         2  15332    3  176661
```

You can thus join the results directly to the input.
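Since the returned frames share the input's index, a plain DataFrame.join suffices. The sketch below reproduces the expected-frequency computation by hand (without importing the package, using the frequency-signature formulas above) just to illustrate the join; `fq.expected_frequencies(df)` would produce the same `E11`…`E22` columns:

```python3
import pandas as pd

# Input in frequency-signature notation (counts from the example above)
df = pd.DataFrame(
    {"f": [1, 7], "f1": [15334, 15334], "f2": [2, 120], "N": [191998, 191998]},
    index=pd.Index(["appreciated", "certain"], name="item"),
)

# Expected frequencies by hand: E11 = f1 * f2 / N, the rest from the marginals
expected = pd.DataFrame(index=df.index)
expected["E11"] = df["f1"] * df["f2"] / df["N"]
expected["E12"] = df["f1"] - expected["E11"]
expected["E21"] = df["f2"] - expected["E11"]
expected["E22"] = df["N"] - df["f1"] - expected["E21"]

# Identical indexing makes a direct join possible
result = df.join(expected)
print(result.loc["appreciated", "E11"])
```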

Association Measures

The following association measures are currently implemented (v0.2.2):

  • asymptotic hypothesis tests:
    • z-score (z_score)
    • t-score (t_score)
      • parameter: disc
    • Dunning's log-likelihood ratio (log_likelihood)
      • parameter: signed
    • simple-ll (simple_ll)
      • parameter: signed
  • point estimates of association strength:
    • Liddell (liddell)
    • minimum sensitivity (min_sensitivity)
    • log-ratio (log_ratio)
      • parameters: disc, discounting
    • Dice coefficient (dice)
  • information theory:
    • mutual information (mutual_information)
      • parameter: disc
    • local mutual information (local_mutual_information)
  • conservative estimates:
    • conservative log-ratio (conservative_log_ratio)
      • parameters: disc, alpha, correct, one_sided, boundary, vocab
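As an illustration of how such a measure is defined, the (undiscounted) log-ratio of a pair can be computed by hand from its contingency table; the value agrees with the log_ratio column in the score() output further down. This is a plain-Python sketch, not the package's implementation (the package's disc parameter additionally discounts zero counts):

```python3
from math import log2

# Contingency counts for "appreciated" (example data above)
O11, O12, O21, O22 = 1, 15333, 1, 176663
R1, R2 = O11 + O12, O21 + O22

# log-ratio: binary log of the ratio of relative cooccurrence frequencies
lr = log2((O11 / R1) / (O21 / R2))
print(round(lr, 4))  # 3.5262
```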

You can calculate specific measures individually:

```python3
import association_measures.measures as am

am.log_likelihood(df)
item
appreciated     2.448757
certain        -0.829802
measuring       0.191806
particularly   -1.059386
arrived         3.879126
```

This assumes that df contains the necessary columns (observed frequencies in contingency notation and expected frequencies). It is usually more convenient to just use score():

```python3
import association_measures.measures as am

am.score(df, measures=['log_likelihood'])
              O11    O12  O21     O22     R1      R2   C1      C2       N       E11           E12         E21            E22  log_likelihood         ipm  ipm_reference  ipm_expected
item
appreciated     1  15333    1  176663  15334  176664    2  191996  191998  0.159731  15333.840269    1.840269  176662.159731        2.448757   65.214556       5.660463     10.416775
certain         7  15327  113  176551  15334  176664  120  191878  191998  9.583850  15324.416150  110.416150  176553.583850       -0.829802  456.501891     639.632296    625.006510
measuring       1  15333    7  176657  15334  176664    8  191990  191998  0.638923  15333.361077    7.361077  176656.638923        0.191806   65.214556      39.623240     41.667101
particularly    2  15332   45  176619  15334  176664   47  191951  191998  3.753675  15330.246325   43.246325  176620.753675       -1.059386  130.429112     254.720826    244.794217
arrived         2  15332    3  176661  15334  176664    5  191993  191998  0.399327  15333.600673    4.600673  176659.399327        3.879126  130.429112      16.981388     26.041938
```

Note that by default, score() yields observed frequencies in contingency notation (and marginal frequencies) as well as expected frequencies. You can turn off this behaviour by setting freq=False.

To calculate all available measures, don't specify any measures:

```python3
am.score(df, freq=False)
               z_score   t_score  log_likelihood  simple_ll  min_sensitivity   liddell      dice  log_ratio  binomial_likelihood  conservative_log_ratio  mutual_information  local_mutual_information
item
appreciated   2.102442  0.840269        2.448757   1.987992         0.000065  0.420139  0.000130   3.526202             0.000000                     0.0            0.796611                  0.796611
certain      -0.834636 -0.976603       -0.829802  -0.769331         0.000457 -0.021546  0.000906  -0.486622             0.117117                     0.0           -0.136442                 -0.955094
measuring     0.451726  0.361077        0.191806   0.173788         0.000065  0.045136  0.000130   0.718847             0.000000                     0.0            0.194551                  0.194551
particularly -0.905150 -1.240035       -1.059386  -0.988997         0.000130 -0.037321  0.000260  -0.965651             0.224042                     0.0           -0.273427                 -0.546853
arrived       2.533018  1.131847        3.879126   3.243141         0.000130  0.320143  0.000261   2.941240             0.000000                     0.0            0.699701                  1.399402
```

You can also pass constant integer counts as parameters to score(). This is reasonable for the following notations:

  • frequency signature: integers f1 and N (DataFrame contains columns f and f2)

    ```python3
    df
                  f   f2
    item
    appreciated   1    2
    certain       7  120
    measuring     1    8
    particularly  2   47
    arrived       2    5

    am.score(df, f1=15334, N=191998)
    ```

  • corpus frequencies: integers N1 and N2 (DataFrame contains columns f1 and f2)

    ```python3
    df
                  f1   f2
    item
    appreciated    1    1
    certain        7  113
    measuring      1    7
    particularly   2   45
    arrived        2    3

    am.score(df, N1=15334, N2=176664)
    ```

Some association measures have parameters (see above). You can pass these parameters as keywords to score(), e.g.:

```python3
am.score(df, measures=['log_likelihood'], signed=False, freq=False)
              log_likelihood
item
appreciated         2.448757
certain             0.829802
measuring           0.191806
particularly        1.059386
arrived             3.879126
```

Topographic Maps

New since version 0.3: You can use association_measures.grids.topography to create a dataframe for visualising association measures as topographic maps. It yields a logarithmically scaled grid from N1 to N2 with the values of all association measures at reasonable sampling points of all combinations of f1 and f2.

```python3
from association_measures.grids import topography

topography(N1=10e6, N2=10e6)
            O11         O12       O21         O22          R1          R2        C1          C2           N         E11  ...      dice  log_ratio  conservative_log_ratio  mutual_information  local_mutual_information        ipm  ipm_reference  ipm_expected  clr_normal  log_ratio_hardie
index
0             0  10000000.0         0  10000000.0  10000000.0  10000000.0         0  20000000.0  20000000.0         0.0  ...  0.000000   0.000000                0.000000                 inf                       NaN        0.0            0.0          0.00    0.000000          0.000000
1             0  10000000.0         1   9999999.0  10000000.0  10000000.0         1  19999999.0  20000000.0         0.5  ...  0.000000  -9.967226                0.000000           -2.698970                  0.000000        0.0            0.1          0.05    0.000000         -9.965784
2             0  10000000.0         2   9999998.0  10000000.0  10000000.0         2  19999998.0  20000000.0         1.0  ...  0.000000 -10.966505                0.000000           -3.000000                  0.000000        0.0            0.2          0.10    0.000000        -10.965784
3             0  10000000.0         3   9999997.0  10000000.0  10000000.0         3  19999997.0  20000000.0         1.5  ...  0.000000 -11.551228                0.000000           -3.176091                 -0.000000        0.0            0.3          0.15    0.000000        -11.550747
4             0  10000000.0         4   9999996.0  10000000.0  10000000.0         4  19999996.0  20000000.0         2.0  ...  0.000000 -11.966145                0.000000           -3.301030                 -0.000000        0.0            0.4          0.20    0.000000        -11.965784
...         ...         ...       ...         ...         ...         ...       ...         ...         ...         ...  ...       ...        ...                     ...                 ...                       ...        ...            ...           ...         ...               ...
39995  10000000         0.0   7205937   2794063.0  10000000.0  10000000.0  17205937   2794063.0  20000000.0   8602968.5  ...  0.735134   0.472742                0.468813            0.065352             653516.672773  1000000.0       720593.7     860296.85    0.471159          0.472742
39996  10000000         0.0   7821100   2178900.0  10000000.0  10000000.0  17821100   2178900.0  20000000.0   8910550.0  ...  0.718879   0.354557                0.350718            0.050095             500954.884892  1000000.0       782110.0     891055.00    0.353215          0.354557
39997  10000000         0.0   8488779   1511221.0  10000000.0  10000000.0  18488779   1511221.0  20000000.0   9244389.5  ...  0.702031   0.236371                0.232619            0.034122             341217.643897  1000000.0       848877.9     924438.95    0.235298          0.236371
39998  10000000         0.0   9213457    786543.0  10000000.0  10000000.0  19213457    786543.0  20000000.0   9606728.5  ...  0.684616   0.118186                0.114514            0.017424             174244.829132  1000000.0       921345.7     960672.85    0.117443          0.118186
39999  10000000         0.0  10000000         0.0  10000000.0  10000000.0  20000000         0.0  20000000.0  10000000.0  ...  0.666667   0.000000                0.000000            0.000000                  0.000000  1000000.0      1000000.0    1000000.00    0.000000          0.000000

[40000 rows x 29 columns]
```

Development

The package is tested using pylint and pytest.

```bash
# Installing dev requirements
make install

# Compile Cython code
make compile

# Lint
make lint

# Unittest
make test

# Coverage
make coverage

# Performance
make performance
```

Owner

  • Name: fau-klue
  • Login: fau-klue
  • Kind: organization
  • Email: info@linguistik.uni-erlangen.de
  • Location: Erlangen

Computational Corpus Linguistics at FAU Erlangen-Nürnberg

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Heinrich"
  given-names: "Philipp"
  orcid: "https://orcid.org/0000-0002-4785-9205"
- family-names: "Opolka"
  given-names: "Markus"
title: "Pandas Association Measures"
version: 0.3.1
date-released: 2025-02-28
url: "https://github.com/fau-klue/pandas-association-measures"

GitHub Events

Total
  • Release event: 1
  • Watch event: 2
  • Delete event: 2
  • Issue comment event: 2
  • Push event: 12
  • Pull request event: 3
  • Create event: 3
Last Year
  • Release event: 1
  • Watch event: 2
  • Delete event: 2
  • Issue comment event: 2
  • Push event: 12
  • Pull request event: 3
  • Create event: 3

Committers

Last synced: about 3 years ago

All Time
  • Total Commits: 158
  • Total Committers: 5
  • Avg Commits per committer: 31.6
  • Development Distribution Score (DDS): 0.323
Top Committers
Name Email Commits
Philipp Heinrich p****h@f****e 107
Markus Opolka m****s@m****e 45
dependabot[bot] 4****]@u****m 3
Markus Opolka o****s@i****e 2
Andreas a****h@f****e 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 9
  • Total pull requests: 30
  • Average time to close issues: 4 months
  • Average time to close pull requests: about 1 month
  • Total issue authors: 3
  • Total pull request authors: 4
  • Average comments per issue: 1.89
  • Average comments per pull request: 0.8
  • Merged pull requests: 25
  • Bot issues: 0
  • Bot pull requests: 8
Past Year
  • Issues: 0
  • Pull requests: 3
  • Average time to close issues: N/A
  • Average time to close pull requests: 3 minutes
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 1
Top Authors
Issue Authors
  • ausgerechnet (7)
  • tsproisl (1)
  • martialblog (1)
Pull Request Authors
  • ausgerechnet (14)
  • dependabot[bot] (9)
  • martialblog (8)
  • SpitfireX (2)
Top Labels
Issue Labels
enhancement (2) bug (1) needs-feedback (1)
Pull Request Labels
dependencies (6) needs-feedback (1) dont merge yet (1) needs-work (1) enhancement (1) python (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 70 last-month
  • Total docker downloads: 8
  • Total dependent packages: 1
  • Total dependent repositories: 3
  • Total versions: 17
  • Total maintainers: 2
pypi.org: association-measures

Statistical association measures for Python pandas

  • Versions: 17
  • Dependent Packages: 1
  • Dependent Repositories: 3
  • Downloads: 70 Last month
  • Docker Downloads: 8
Rankings
Docker downloads count: 4.3%
Dependent packages count: 4.8%
Dependent repos count: 8.9%
Average: 14.6%
Stargazers count: 18.5%
Downloads: 21.4%
Forks count: 29.8%
Maintainers (2)
Last synced: 7 months ago

Dependencies

.github/workflows/python-build.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
.github/workflows/python-publish.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
  • pypa/gh-action-pypi-publish release/v1 composite
pyproject.toml pypi
requirements-dev.txt pypi
  • cython ==3.0.12 development
  • pylint ==2.17.5 development
  • pytest ==7.4.0 development
  • pytest-cov ==4.1.0 development
  • setuptools ==75.8.2 development
  • twine ==6.1.0 development
  • wheel ==0.45.1 development
requirements.txt pypi
  • numpy >=2.0,<3.0
  • pandas >=2.2.2,<3.0
  • scipy >=1.13.0,<2.0
setup.py pypi