comparing-keyword-importance-across-texts

The method extracts representative keywords for each document in a collection using comparative analysis

https://github.com/stephan-linzbach/comparing-keyword-importance-across-texts

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (6.1%) to scientific vocabulary

Keywords

comparative-analysis keyword-extraction log-odd-ratio word-importance
Last synced: 6 months ago

Repository

The method extracts representative keywords for each document in a collection using comparative analysis

Basic Info
  • Host: GitHub
  • Owner: Stephan-Linzbach
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 166 KB
Statistics
  • Stars: 1
  • Watchers: 2
  • Forks: 2
  • Open Issues: 6
  • Releases: 0
Topics
comparative-analysis keyword-extraction log-odd-ratio word-importance
Created over 1 year ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md

Comparing Keyword Importance Across Texts

Description

This method identifies and ranks the most important words in a collection of documents, such as articles, speeches, or social media posts, by analyzing their frequency and uniqueness within each document. Using measures like TF-IDF, PMI, and Log Odds Ratio, it highlights terms that are especially relevant to a specific document while contrasting them with others in the collection. This approach is ideal for uncovering key themes, comparing language use across texts, and tracking shifts in terminology or public discourse over time, making it a valuable tool for summarizing content or analyzing trends.

| | TF-IDF | Log Odds Ratio | PMI |
|:-----------------|:----------------:|:----------------:|:----------------:|
| Definition | Measures the importance of a term in a document not only by frequent usage but also through its absence in other documents. | Quantifies the increase in the relative importance of a term for a document in comparison to all other documents. | Measures the association between a term and a document, indicating a dependency. |
| When to use? | Finding terms that are characteristic of a document and only used by a subset of other documents. | Finding terms that have higher relevance for a certain document. | Finding terms that are characteristic of a document and seldom used by other documents. |
| Interpretability | High scores indicate greater importance of the term within the document. | Positive values indicate association with the document. Negative values indicate low importance of the term for the document. | High scores indicate strength of association between the term and document. Low scores indicate disassociation between the term and the document. |
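To make the PMI column concrete, here is a toy sketch of the measure over hand-rolled counts (an illustrative mini-corpus, not the repository's implementation):

```python
import math
from collections import Counter

# Toy corpus: two tiny "documents" (illustrative only).
docs = {
    "Document A": "liberal liberal center solution",
    "Document B": "center center liberal solution",
}

counts = {name: Counter(text.split()) for name, text in docs.items()}
total = sum(sum(c.values()) for c in counts.values())

def pmi(term, doc):
    """PMI(term, doc) = log( P(term, doc) / (P(term) * P(doc)) )."""
    joint = counts[doc][term] / total
    p_term = sum(c[term] for c in counts.values()) / total
    p_doc = sum(counts[doc].values()) / total
    return math.log(joint / (p_term * p_doc)) if joint else float("-inf")

print(pmi("liberal", "Document A"))  # positive: associated with Document A
print(pmi("center", "Document A"))   # negative: under-represented there
```

A positive value means the term co-occurs with the document more often than the independence assumption predicts, which is exactly the "dependency" reading in the table above.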

Use Cases

  • Studying climate change discourse on Twitter over time: By extracting and comparing keywords, this method can reveal emerging terms (e.g., carbon neutrality), diminishing terms (e.g., global warming), and stable terms (e.g., climate crisis), offering insights into evolving public conversations and priorities.
  • Analyzing political speeches to identify shifts in rhetoric: Social scientists can track how key terms (e.g., freedom, equality, security) gain or lose prominence across different administrations or during election campaigns, providing a lens into changing political priorities and strategies.
  • Examining public sentiment in online forums: By comparing keyword importance across threads, researchers can uncover dominant themes, recurring concerns, or evolving opinions on topics like healthcare, education, or economic policies.
  • Studying cultural narratives in literature or media: Social scientists can analyze how specific terms (e.g., identity, tradition, modernity) are emphasized in different texts, revealing underlying societal values, conflicts, or trends over time.

Input Data

The method handles digital behavioral data, including social media posts, comments, search queries, clickstream text (e.g., website titles), forum threads, and open-text survey responses.

The corpus data used in the script is stored in JSON format at data/default_corpus.json and looks something like this:

```json
{
  "Document A": "This is the liberal solution: All text is good as well as bad. The good one has to take his own position. We are the liberal ones. Not the center nor the progressive ones.",
  "Document B": "This is the center solution: They are bad, not good, if everyone remains in his own position we are all alone which is bad. We are the center ones. Not the progressive nor the liberal ones.",
  "Document C": "This is the progressive solution: Another group's position is the problem. They don't move from their position. We are the progressive ones. Not the liberal nor the center ones."
}
```

Note: The corpus should ideally be a larger text dataset to produce more meaningful results.
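A quick sanity check that your corpus matches this format (a single JSON object mapping document names to raw text strings) can be done with the standard library; the inline string below stands in for reading your own file:

```python
import json

# Inline string used for illustration; normally you would read
# data/default_corpus.json (or your own corpus file) instead.
corpus_json = '{"Document A": "We are the liberal ones.", "Document B": "We are the center ones."}'
corpus = json.loads(corpus_json)

# Every key should be a document name and every value a text string.
assert all(isinstance(k, str) and isinstance(v, str) for k, v in corpus.items())
print(f"{len(corpus)} documents, "
      f"{sum(len(t.split()) for t in corpus.values())} tokens in total")
```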

Output Data

The method will produce a CSV in the following form:

| Words | Document A | Document B | Document C |
|:-----------------|:-----------------:|:-----------------:|:-----------------:|
| progressive | 0.24816330799414105 | 0.24816330799414105 | 1.2392023539955106 |
| ones | 0.636647135255376 | 0.636647135255376 | 0.6276861812567451 |
| position | 0.24816330799414105 | 0.24816330799414105 | 1.2392023539955106 |
| solution | 0.636647135255376 | 0.636647135255376 | 0.6276861812567451 |
| center | 0.20851385530561406 | 1.208513855305614 | 0.19955290130698336 |
| liberal | 1.208513855305614 | 0.20851385530561406 | 0.19955290130698336 |
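Given a CSV of this shape (a `Words` column plus one score column per document), the top keyword per document can be read off with the standard library; the rows below are truncated illustrative values, not real output:

```python
import csv
import io

# Illustrative rows in the output shape (scores truncated, not real output).
csv_text = """Words,Document A,Document B,Document C
progressive,0.248,0.248,1.239
liberal,1.209,0.209,0.200
center,0.209,1.209,0.200
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
doc_columns = [c for c in rows[0] if c != "Words"]

# For each document column, pick the word with the highest score.
top = {d: max(rows, key=lambda r: float(r[d]))["Words"] for d in doc_columns}
print(top)  # {'Document A': 'liberal', 'Document B': 'center', 'Document C': 'progressive'}
```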

Moreover, in the output_config/ directory, you will find a JSON file that saves all the used parameters for the resulting table.

```json
{
  "corpus": "/path/to/your_corpus.json",
  "comparison_corpus": "",
  "language": "english",
  "min_df": null,
  "more_freq_than": 0,
  "less_freq_than": 100,
  "method": "pmi",
  "only_words": true,
  "return_values": true
}
```

Hardware Requirements

The method runs on a small virtual machine provided by a cloud computing company (2 x86 CPU cores, 4 GB RAM, 40 GB HDD).

Environment Setup

  • Install Python >= 3.9 (preferably through Anaconda):

```bash
conda create -n env python=3.11
```

  • Clone the repository:

```bash
git clone https://git.gesis.org/bda/keyword_extraction.git
```

  • Install all the packages and libraries with the specific versions required to run this method:

```bash
pip install -r requirements.txt
```

How to Use

You can configure the parameters in the config.json file and run the script:

```bash
python keyword_extraction.py
```

Alternatively, you can set the parameters directly in the command line when running the script.

To see the list of parameters you can specify, use the Command Line Options:

```bash
python keyword_extraction.py --help
```

Below is the output of the --help command, which lists all available options for the script:

```text
options:
  -h, --help            show this help message and exit
  --corpus CORPUS       A path to a json corpus in this format ./data/default_corpus.json.
  --comparison_corpus COMPARISON_CORPUS
                        A path to a json comparison_corpus in this format ./data/default_corpus.json.
                        You need this for the log_odd ratio.
  --config CONFIG       If you do not have a config.json in the working directory or want to set
                        your setting with the cli tool set this var to False.
  --language LANGUAGE   Language (default: english)
  --min_df MIN_DF       Minimum document frequency (default: 1)
  --more_freq_than MORE_FREQ_THAN
                        Frequency threshold for more frequent words (default: 0)
  --less_freq_than LESS_FREQ_THAN
                        Frequency threshold for less frequent words (default: 1.0)
  --method METHOD       Choose a method from the list of implemented methods
                        ['log_odds', 'tfidf', 'pmi', 'tfidf_pmi']
  --stop_words STOP_WORDS
                        Exclude stop_words from this list ['english'].
  --only_words ONLY_WORDS
                        Exclude numbers, urls, and everything that is not alphabetic.
  --return_values RETURN_VALUES
                        Use this parameter if you want the associated values of the respective
                        method to be returned.
```

The help output also explains how each parameter alters the method's behavior. Next, execute:

```bash
python keyword_extraction.py --method pmi --corpus /path/to/your_corpus.json
```

Example Commands and Parameters

Below are example commands demonstrating how to use the method with different configurations and parameters to extract and analyze keyword importance effectively.

1. Pointwise Mutual Information (PMI) - Calculate PMI for words that appear in more than 80% of the documents:

```bash
python keyword_extraction.py --config False --method pmi --corpus /path/to/your_corpus.json --more_freq_than 80
```

Words that occur frequently across documents are prioritized.

2. TF-IDF (Term Frequency-Inverse Document Frequency) - Compute importance scores based on TF-IDF, excluding words that appear in fewer than two documents:

```bash
python keyword_extraction.py --config False --method tfidf --corpus /path/to/your_corpus.json --min_df 2
```

TF-IDF highlights words that are unique to specific documents compared to those shared across all documents. With min_df, we specify that words should appear in at least 2 documents. Thus, we exclude all words that only appear in one document.
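The `min_df` cutoff is a document-frequency filter in the same sense scikit-learn's vectorizers use the term. A pure-Python sketch of the filtering step (toy documents, not the script's own code):

```python
from collections import Counter

docs = [
    "the liberal solution",
    "the center solution",
    "the progressive solution",
]

# Document frequency: in how many documents each term occurs at least once.
df = Counter()
for text in docs:
    df.update(set(text.split()))

min_df = 2
kept = sorted(term for term, n in df.items() if n >= min_df)
print(kept)  # ['solution', 'the'] -- terms occurring in at least two documents
```

Here `liberal`, `center`, and `progressive` each occur in only one document and are dropped before any scoring happens.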

3. PMI with TF-IDF - Combine PMI with TF-IDF scores, excluding the least frequent 20% of words. This is calculated using the weighted term frequencies from the TF-IDF matrix rather than raw term frequencies. The result is a PMI matrix that incorporates the TF-IDF weighting into the PMI calculation.

```bash
python keyword_extraction.py --config False --method tfidf_pmi --corpus /path/to/your_corpus.json --less_freq_than 20
```

This approach accounts for both document-specific importance and overall word weighting.
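One way to read "PMI over TF-IDF weights": substitute TF-IDF weights for raw counts in the PMI formula. A toy sketch with made-up weights (illustrative only; the weight values and structure are assumptions, not the script's actual matrix):

```python
import math

# Made-up TF-IDF weights (term -> document -> weight); illustrative only.
weights = {
    "liberal": {"Document A": 1.2, "Document B": 0.2},
    "center":  {"Document A": 0.2, "Document B": 1.2},
}

total = sum(w for row in weights.values() for w in row.values())

def weighted_pmi(term, doc):
    """PMI with TF-IDF weights standing in for raw term frequencies."""
    joint = weights[term][doc] / total
    p_term = sum(weights[term].values()) / total
    p_doc = sum(row[doc] for row in weights.values()) / total
    return math.log(joint / (p_term * p_doc))

print(weighted_pmi("liberal", "Document A"))  # positive: weighted toward A
```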

4. Log Odds Ratio - Compute Log Odds Ratio using a comparison corpus to identify word importance:

```bash
python keyword_extraction.py --config False --method log_odds --corpus /path/to/your_corpus.json --comparison_corpus /path/to/your_comparison_corpus.json
```

A comparison corpus is required to determine word frequencies under "normal" circumstances, reducing noise and highlighting significant terms.
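The underlying idea can be sketched as the difference between a term's smoothed log odds in the target document and in the comparison corpus. This is a simplified version for intuition, not necessarily the exact estimator the script implements:

```python
import math
from collections import Counter

# Toy data: a target document and a comparison corpus (illustrative only).
document   = Counter("liberal liberal center solution".split())
comparison = Counter("center center solution solution liberal".split())

def log_odds(term, counts, alpha=0.01):
    """Log odds of a term within one corpus, with additive smoothing."""
    n = sum(counts.values())
    p = (counts[term] + alpha) / (n + alpha * len(counts))
    return math.log(p / (1 - p))

def log_odds_ratio(term):
    """Positive: over-represented in the document vs. the comparison corpus."""
    return log_odds(term, document) - log_odds(term, comparison)

print(log_odds_ratio("liberal"))  # positive
print(log_odds_ratio("center"))   # negative
```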

Contact Details

Stephan.Linzbach@gesis.org

Owner

  • Login: Stephan-Linzbach
  • Kind: user

Citation (citation.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Linzbach
    given-names: Stephan
    orcid: https://orcid.org/0009-0009-6955-2368
title: "Comparative Keyword Analysis"
version: 1.0.0
date-released: 2024-02-15

GitHub Events

Total
  • Commit comment event: 1
  • Issues event: 14
  • Watch event: 1
  • Member event: 1
  • Issue comment event: 17
  • Push event: 50
  • Pull request event: 4
  • Fork event: 1
Last Year
  • Commit comment event: 1
  • Issues event: 14
  • Watch event: 1
  • Member event: 1
  • Issue comment event: 17
  • Push event: 50
  • Pull request event: 4
  • Fork event: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 9
  • Total pull requests: 3
  • Average time to close issues: 14 days
  • Average time to close pull requests: 9 days
  • Total issue authors: 4
  • Total pull request authors: 1
  • Average comments per issue: 0.78
  • Average comments per pull request: 0.67
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 9
  • Pull requests: 3
  • Average time to close issues: 14 days
  • Average time to close pull requests: 9 days
  • Issue authors: 4
  • Pull request authors: 1
  • Average comments per issue: 0.78
  • Average comments per pull request: 0.67
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • momenifi (6)
  • FlxVctr (1)
  • shyamgupta196 (1)
  • taimoorkhan-nlp (1)
Pull Request Authors
  • shyamgupta196 (3)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

requirements.txt pypi
  • joblib ==1.3.2
  • numpy ==1.26.4
  • scikit-learn ==1.4.0
  • scipy ==1.12.0
  • threadpoolctl ==3.2.0