kpa-hierarchy
kpa_hierarchy: shared code for key point analysis hierarchies.
Repository
Basic Info
- Host: GitHub
- Owner: IBM
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 751 KB
Statistics
- Stars: 4
- Watchers: 5
- Forks: 1
- Open Issues: 2
- Releases: 0
Metadata Files
README.md
kpa-hierarchy
Scope
This repository contains:
(1) The ThinkP dataset: a high quality benchmark dataset of key point hierarchies for business and product reviews.
(2) NEW: Code for KPH construction and evaluation.
Data
Information about using the data can be found here.
Setup
In order to run this repo, you need a Python Anaconda environment with all the requirements installed:
```bash
conda create --name kpa_hierarchy python=3.9
conda activate kpa_hierarchy
pip install -r requirements.txt
```
Creating Pairwise Scores:
```bash
python create_pairwise_scores.py --output_path "./out/pairwise_scores.csv"
```
To generate pairwise scores for the ThinkP dataset, use the create_pairwise_scores.py script.
It loads the ThinkP dataset and converts it to a dataframe that holds the predicted scores for every pair of key points in
each topic. Replace lines 17-19 of the script with your own code for computing
the pairwise scores: add a column to the dataframe, named with a unique method name, that holds the score for each row.
The scores for the methods reported in the paper are in eval/pairwise_scores/all_pairwise_scores.csv.
Arguments:
- gold_path: path to the gold data jsonl (defaults to the path of ThinkP).
- output_path: path of the csv file in which to save the pairwise scores dataframe.
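As an illustration of "adding a column with a unique method name", here is a minimal sketch. It uses plain dictionaries in place of dataframe rows so that it runs without extra dependencies; with pandas, the same idea is a single column assignment. The field names `topic`, `parent_kp` and `child_kp`, the method name `TokenOverlap`, and the scoring function itself are all assumptions made for this example and are not taken from create_pairwise_scores.py.

```python
import re


def token_overlap_score(parent_kp: str, child_kp: str) -> float:
    """Toy directional score: fraction of the parent's tokens found in the child."""
    parent_tokens = set(re.findall(r"[a-z0-9']+", parent_kp.lower()))
    child_tokens = set(re.findall(r"[a-z0-9']+", child_kp.lower()))
    if not parent_tokens:
        return 0.0
    return len(parent_tokens & child_tokens) / len(parent_tokens)


def add_method_scores(rows, method_name, score_fn):
    """Return copies of the rows with one extra field holding the method's score."""
    scored = []
    for row in rows:
        if method_name in row:
            # The method name must be unique, otherwise existing scores get clobbered.
            raise ValueError(f"method name already in use: {method_name}")
        new_row = dict(row)
        new_row[method_name] = score_fn(row["parent_kp"], row["child_kp"])
        scored.append(new_row)
    return scored


# Hypothetical rows: one per ordered pair of key points within a topic.
rows = [
    {"topic": "t1", "parent_kp": "The staff is friendly",
     "child_kp": "The staff is friendly and helpful"},
    {"topic": "t1", "parent_kp": "Great food", "child_kp": "Rooms are clean"},
]
scored_rows = add_method_scores(rows, "TokenOverlap", token_overlap_score)
for row in scored_rows:
    print(row["parent_kp"], "->", row["child_kp"], row["TokenOverlap"])
```

Any real method would replace `token_overlap_score` with a model-based score; the only requirement stated above is one score per row under a method name that is not already used.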
Evaluating Pairwise Scores:
```bash
python eval_pairwise_scores.py --output_path "./out/eval_pairwise.csv" --methods APInc NLI BinInc KPA-Match NLI_BinInc_WL
```
This script evaluates the scores computed in the previous section and outputs:
1. The precision-recall graphs per domain.
2. The AUC (for recall > 0.1) and the best F1 score for each domain, where the classification threshold is chosen using leave-one-topic-out.
Arguments:
- output_path: path of the output .png file. The table with the scores is saved to a csv file in the same path.
- pairwise_scores_file: path to a csv file with pairwise scores (defaults to our provided pairwise scores).
- methods: space-separated list of methods to evaluate, i.e. columns in the dataframe in pairwise_scores_file.
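The leave-one-topic-out idea can be sketched as follows: for each held-out topic, the threshold that maximizes F1 on the remaining topics is selected and then applied to the held-out topic. This is only an illustration of the general procedure under assumed inputs; the function names, the threshold grid and the toy data are hypothetical, and eval_pairwise_scores.py may differ in its details.

```python
def f1_at_threshold(examples, threshold):
    """examples: list of (score, is_related) pairs; predict 'related' when score >= threshold."""
    tp = sum(1 for score, gold in examples if score >= threshold and gold)
    fp = sum(1 for score, gold in examples if score >= threshold and not gold)
    fn = sum(1 for score, gold in examples if score < threshold and gold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def leave_one_topic_out_f1(examples_by_topic, thresholds):
    """Map each topic to (threshold tuned on the other topics, F1 on this topic)."""
    results = {}
    for held_out, held_out_examples in examples_by_topic.items():
        # Pool the examples of every topic except the held-out one.
        train = [ex for topic, exs in examples_by_topic.items()
                 if topic != held_out for ex in exs]
        best = max(thresholds, key=lambda th: f1_at_threshold(train, th))
        results[held_out] = (best, f1_at_threshold(held_out_examples, best))
    return results


# Toy data: (predicted score, gold label) per key point pair, grouped by topic.
examples_by_topic = {
    "topic_a": [(0.9, True), (0.2, False), (0.6, True)],
    "topic_b": [(0.8, True), (0.4, False), (0.7, False)],
    "topic_c": [(0.95, True), (0.3, False), (0.55, True)],
}
loto = leave_one_topic_out_f1(examples_by_topic, thresholds=[0.25, 0.5, 0.75])
for topic, (threshold, f1) in loto.items():
    print(topic, threshold, round(f1, 3))
```

Tuning the threshold on the other topics keeps the reported F1 for a topic from being inflated by a threshold fitted to that same topic.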
Constructing KPH
```bash
python predict_kph.py --topic "AV6weBrZFFBfRGCbcRGO4g_neg" --viz --output_dir "./out/build_kph/" --pairwise_method "NLI_BinInc_WL" --tree_method "tncf"
```
Tree construction is performed in two steps: first, computing the pairwise scores for each pair of key points,
and then using the pairwise scores to construct the hierarchy. The first step is done in the previous section
and results in the pairwise scores dataframe. To run KPH construction from the pairwise scores, run the predict_kph.py
script. This script constructs a single KPH for a given classification threshold and prints its evaluation measures
against the gold data.
Arguments:
- gold_path: path to the gold data jsonl (defaults to the path of ThinkP).
- pairwise_scores_file: path to a csv file with pairwise scores (defaults to our provided pairwise scores).
- pairwise_methods: the pairwise method to use, i.e. a column in the dataframe in pairwise_scores_file (defaults to NLI_BinInc_WL).
- threshold: the decision threshold for counting two key points as related (default 0.5).
- topic: the (string) topic id of the business or product to build the KPH for.
- output_dir: path to the output directory, in which a directory with the jsonl file of the hierarchy and a .txt file for visualization is saved.
- viz: whether to create a user-friendly visualization of the generated tree.
- tree_methods: the KPH method to use for tree construction; must be a key in kph_method_to_predictor_class, the dictionary
in predict_kph.py. All construction methods described in the paper are available.
Adding a new hierarchy construction method
KPH construction is done using a class that extends TreePredictor: its constructor receives a decision threshold and a
dataframe that contains all the rows for a given topic in the pairwise scores dataframe, with the relevant pairwise scores column
named "score". The class has a method called get_hierarchy that returns a KPH object. Both TreePredictor and KPH
are documented in KPH.py.
Once the class is ready, add an entry to kph_method_to_predictor_class with a unique name as the key and the class as the value, and run
predict_kph.py as explained in the previous section.
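The shape of such a class can be sketched as below. Because the real TreePredictor and KPH live in KPH.py and their exact signatures are not reproduced here, the sketch defines simplified stand-ins for both; the field names `parent` and `child`, the class name `GreedyEdgePredictor`, the registry key `my_greedy_edge` and the greedy rule itself are assumptions for illustration, not one of the methods from the paper. Rows are plain dictionaries here; with a pandas dataframe they could be obtained via `df.to_dict("records")`.

```python
class KPH:
    """Stand-in for the KPH class in KPH.py: just a child -> parent mapping."""

    def __init__(self, parent_of):
        self.parent_of = dict(parent_of)


class TreePredictor:
    """Stand-in for the base class in KPH.py."""

    def __init__(self, threshold, topic_rows):
        self.threshold = threshold
        self.topic_rows = topic_rows  # rows of one topic, each with a "score" field

    def get_hierarchy(self):
        raise NotImplementedError


class GreedyEdgePredictor(TreePredictor):
    """Hypothetical method: accept edges by descending score while keeping a forest."""

    def get_hierarchy(self):
        parent_of = {}

        def is_self_or_ancestor(candidate, node):
            # Walk from node up to its root, looking for candidate on the way.
            while node is not None:
                if node == candidate:
                    return True
                node = parent_of.get(node)
            return False

        for row in sorted(self.topic_rows, key=lambda r: r["score"], reverse=True):
            if row["score"] < self.threshold:
                break  # all remaining pairs are below the decision threshold
            parent, child = row["parent"], row["child"]
            if child in parent_of:
                continue  # each key point keeps at most one parent
            if is_self_or_ancestor(child, parent):
                continue  # the edge would close a cycle
            parent_of[child] = parent
        return KPH(parent_of)


# Registration under a unique name, mirroring the dictionary in predict_kph.py.
kph_method_to_predictor_class = {"my_greedy_edge": GreedyEdgePredictor}

topic_rows = [
    {"parent": "food", "child": "pizza", "score": 0.9},
    {"parent": "pizza", "child": "food", "score": 0.7},
    {"parent": "service", "child": "staff", "score": 0.8},
    {"parent": "food", "child": "staff", "score": 0.6},
    {"parent": "staff", "child": "service", "score": 0.55},
    {"parent": "food", "child": "service", "score": 0.3},
]
predictor = kph_method_to_predictor_class["my_greedy_edge"](0.5, topic_rows)
kph = predictor.get_hierarchy()
print(kph.parent_of)
```

In the actual repository the class would extend the real TreePredictor and build a real KPH object, so the constructor arguments and return type should follow KPH.py and not these stand-ins.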
Evaluating KPH constructions
```bash
python eval_kph.py --output_dir ./out/eval_kph --tree_methods reduced_tree greedy_local_score greedy_best_edge tncf --pairwise_methods NLI_BinInc_WL
```
This script first creates and saves all KPHs with all thresholds for all the combinations of the construction methods and pairwise methods. Then it performs the evaluation, computes the best f1_score (using leave-one-topic-out in each domain) and saves a visualization of the best tree for each combination of methods and topic.
Arguments:
- output_dir: required, directory to save trees and evaluation results. Previous evaluations in the same directory will be overwritten.
The generated trees are saved during the run, so if the execution was terminated, or if you want to add more methods to the evaluation,
you can use the same output directory and continue from where you left off.
- gold_path: path to the gold data jsonl (default to the path of ThinkP).
- pairwise_scores_file: path to csv with pairwise scores (defaults to our provided pairwise scores).
- tree_methods: space-separated list of KPH construction methods to evaluate; must be keys in kph_method_to_predictor_class (as explained in the previous section).
- pairwise_methods: space-separated list of pairwise methods to evaluate, i.e. columns in the dataframe in pairwise_scores_file.
- domains: list of domains to evaluate (by default, runs for all domains).
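The "continue from where you left off" behavior of output_dir can be pictured as a build-or-load pattern: a tree is built only if no saved copy exists for that combination of methods, topic and threshold. The sketch below is an assumption about how such resumption could work; the file naming scheme, the JSON format and the function name `build_or_load_tree` are hypothetical and do not describe the files that eval_kph.py actually writes.

```python
import json
import os
import tempfile


def build_or_load_tree(output_dir, tree_method, pairwise_method, topic, threshold, build_fn):
    """Return (tree, loaded_from_disk); build and save the tree only when it is missing."""
    file_name = f"{tree_method}__{pairwise_method}__{topic}__{threshold}.json"
    path = os.path.join(output_dir, file_name)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f), True
    tree = build_fn()
    os.makedirs(output_dir, exist_ok=True)
    with open(path, "w") as f:
        json.dump(tree, f)
    return tree, False


build_calls = []


def expensive_build():
    # Placeholder for real tree construction; records how often it runs.
    build_calls.append(1)
    return {"pizza": "food"}


out_dir = os.path.join(tempfile.mkdtemp(), "eval_kph")
args = (out_dir, "tncf", "NLI_BinInc_WL", "AV6weBrZFFBfRGCbcRGO4g_neg", 0.5, expensive_build)
first_tree, first_loaded = build_or_load_tree(*args)
second_tree, second_loaded = build_or_load_tree(*args)
print(first_loaded, second_loaded, len(build_calls))
```

With this pattern, rerunning with extra methods only builds the trees that are not yet on disk, which matches the resumption behavior described for output_dir.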
Citing
If you are using ThinkP in a publication, please cite the following paper:
From Key Points to Key Point Hierarchy: Structured and Expressive Opinion Summarization
Arie Cattan, Lilach Eden, Yoav Kantor and Roy Bar-Haim.
ACL 2023.
Changelog
Major changes are documented here
Owner
- Name: International Business Machines
- Login: IBM
- Kind: organization
- Email: awesome@ibm.com
- Location: United States of America
- Website: https://www.ibm.com/opensource/
- Twitter: ibmdeveloper
- Repositories: 3,152
- Profile: https://github.com/IBM
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this data or software, please cite the paper below."
title: "From Key Points to Key Point Hierarchy: Structured and Expressive Opinion Summarization"
authors:
  - family-names: Cattan
    given-names: Arie
  - family-names: Eden
    given-names: Lilach
  - family-names: Kantor
    given-names: Yoav
  - family-names: Bar-Haim
    given-names: Roy
version: 1.0.0
date-released: 2023-06-05
license: Apache-2.0
url: "https://arxiv.org/abs/2306.03853"
repository-code: "https://github.com/IBM/kpa-hierarchy"
GitHub Events
Total
- Pull request event: 1
- Create event: 1
Last Year
- Pull request event: 1
- Create event: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 2
- Total pull requests: 2
- Average time to close issues: 2 months
- Average time to close pull requests: N/A
- Total issue authors: 2
- Total pull request authors: 1
- Average comments per issue: 0.5
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 2
Past Year
- Issues: 0
- Pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 2
Top Authors
Issue Authors
- 1mAlbert (1)
- rudra0713 (1)
Pull Request Authors
- renovate[bot] (2)
Dependencies
- jsonlines *
- matplotlib *
- networkx *
- numpy *
- pandas *
- scikit_learn *
- tabulate *
- torch *
- tqdm *