detection-of-verified-claims
Lexical and Semantic Similarity Based Detection of Verified Claims (SimBA)
Science Score: 39.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 2 DOI reference(s) in README -
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (11.6%) to scientific vocabulary
Repository
Lexical and Semantic Similarity Based Detection of Verified Claims (SimBA)
Basic Info
- Host: GitHub
- Owner: BDA-KTS
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 244 MB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
SimBa: Lexical and Semantic Similarity Based Detection of Verified Claims
Description
This method receives an input claim/sentence (called "query"), searches a registry of fact-checked claims and returns fact-checks for similar claims. More precisely, SimBa computes the queries' similarity with ~40.000 english fact-checked claims from ClaimsKG and returns a set of ranked claims, their relevance scores, veracity ratings and the corresponding fact-check sources.
This method facilitates fact-checking of arbitrary claims or statements (e.g. taken from online discourse and social media posts). It takes advantage of a unique and constantly updated repository of fact-checked claims mined from the web (ClaimsKG).
ClaimsKG is a structured knowledge base (KB) which serves as a registry of claims. The KB is updated at regular intervals. The latest release of ClaimsKG contains 74000 claims collected from 13 different fact-checking websites. For more details regarding ClaimsKG, please refer to the official webpage https://data.gesis.org/claimskg/
Use Cases
- Veracity Verification: Check whether a set of statements have already been fact-checked before. SimBa can be used on these statements to find existing fact-checks, including who performed the checks and when they were done.
- Check-Worthiness Analysis: Find out which claims have been fact-checked before and which have not to gain information on perceived check-worthiness of statements
- Information Spread Analysis: Find claims that are semantically similar to claims that have been previously fact-checked to analyze information spread
Input Data
The required input consists of an input query file data/sample/queries.tsv: a text file in .tsv format (tab-separated) containing one query per line. One query consists of an ID and a claim. For each of the claims, SimBa will retrieve the most similar fact-checked claims in ClaimsKG.
Example of data/sample/queries.tsv:
1 Covid-19 vaccines increase the risk of dying from the new Covid-19 variants
Optional input data: If desired, a different corpus than ClaimsKG can be supplied as database (data/claimsKG/corpus.tsv).
This file should be in tab-separated format (.tsv) and must follow the structure below:
Format of corpus.tsv
The file should contain the following columns:
| Column | Description | |--------|-------------| | qid | A unique identifier for the claim | | text | The textual content of the claim | | title | The title of the fact-checking review | | url | A link to the fact-checking article | | rating | The fact-checking assessment of the claim (e.g., true, false, half false, etc.) |
If available, a goldstandard file can be supplied which lists the optimal results.
For example, SimBa can be evaluated directly on the CLEF CheckThat! data using the respective datasets and gold files.
You can also use the same evaluation scripts to evaluate SimBas performance on your own data, provided you supply a goldstandard file (gold.tsv) in the same folder as your input claims file (e.g., data/
Output Data
The outputs are exported to two files:
1. Standard Output File: data/sample/sample.tsv
Contains the results in a tab-separated format with the following columns:
| Query ID | Q0 | Claim ID | Rank | Similarity Score | Method Name | |----------|----|---------:|-----:|-----------------:|-------------| | 1 | Q0 | http://data.gesis.org/claimskg/creative_work/4a27c731-c9a3-5ff6-81b3-cd46845d5ef9 | 1 | 51.24902489669692 | SimBa |
2. Client-Friendly Output File: data/sample/pred_client.tsv
Contains a more readable format with the following columns:
| Query | VClaim | ClaimReviewURL | Rating | Similarity | |-------|--------|----------------|--------|------------| | Covid-19 vaccines increase the risk of dying from the new Covid-19 variants | Vaccinated people are more susceptible to Covid-19 variants | https://factcheck.afp.com/http%253A%252F%252Fdoc.afp.com%252F9PB64D-1 | b'False' | 51.24902489669692 | | Covid-19 vaccines increase the risk of dying from the new Covid-19 variants | Covid-19 vaccines will leave people exposed to deadly illness during the next cold and flu season and germ theory is a hoax | https://factcheck.afp.com/covid-19-shots-not-designed-increase-cold-flu-lethality | b'False' | 51.102840014017175 | | Covid-19 vaccines increase the risk of dying from the new Covid-19 variants | Getting the first dose of Covid-19 vaccine increases risk of catching the novel coronavirus | https://factcheck.afp.com/misleading-facebook-posts-claim-covid-19-vaccine-increases-risk-catching-novel-coronavirus | b'Misleading' | 50.774385556493115 | | Covid-19 vaccines increase the risk of dying from the new Covid-19 variants | People vaccinated against Covid-19 pose a health risk to others by shedding spike proteins | https://factcheck.afp.com/covid-19-vaccine-does-not-make-people-dangerous-others | b'False' | 49.87148066707767 | | Covid-19 vaccines increase the risk of dying from the new Covid-19 variants | Mass vaccination will cause monster Covid-19 variants | https://factcheck.afp.com/mass-covid-19-vaccination-will-not-lead-out-control-variants | b'False' | 49.756611441979224 |
Hardware Requirements
The method requires higher hardware specifications for optimal performance. Below is the recommended machine configuration:
- CPU: 8-core x86 CPU (e.g., Intel Core i7/i9 or AMD Ryzen 7/9)
- GPU: NVIDIA GPU with at least 4GB VRAM (e.g., NVIDIA RTX 2000 or higher. Not compulsory but important for faster operations)
- RAM: 8 GB or more
- Storage: 256 GB SSD (for faster read/write operations) + 256 GB HDD (for additional storage)
Note: While the above specifications are recommended for optimal performance, SimBa can still run on more modest hardware. It has been successfully tested on a virtual machine with limited resources, though processing times will be significantly longer. The -c cache option becomes particularly helpful when working with limited hardware.
Environment Setup
This version of SimBa has been tested with Python 3.11.13 on Windows. Using other Python versions and/or operating systems might require other package versions.
Follow the steps below to install SimBa on your system using the recommended setup.
1. Install Python (Version 3.11.13)
Download Python 3.11.13 from the official Python website:
https://www.python.org/downloads/release/python-31113/.Install Python:
- During installation, make sure to check the box that says "Add Python to PATH". This step is crucial, as it ensures that Python and pip (Python's package manager) are available in your terminal or command prompt.
- Follow the on-screen instructions to complete the installation.
Verify the Installation: After installation, open your terminal (or command prompt) and type the following command to check if Python was installed correctly:
bash python --version
2. Clone the Repository and Navigate to the Main Project Directory
To download SimBa, clone the repository from GitHub.
Run the following commands in your terminal or command prompt:
bash
git clone <repository-url>
cd <repository name>
3. Install Required Dependencies and Data
SimBa's required libraries and dependencies are listed in the requirements.txt file. Install them using the following command:
bash
pip install -r requirements.txt
How to Use
Once everything is installed, you can run the SimBa project. To do so, use the following command in the terminal:
bash
python main.py sample
Or more generally:
bash
python main.py <dataset name>
Parameters:
- <dataset name>: A custom name for your input query dataset (e.g., "sample"). SimBa will automatically look for the input file at data/<dataset name>/queries.tsv.
This will use the ClaimsKG database to find fact-checked claims that are similar to your input queries. The results will be written to the folder of your dataset, e.g.:
data/<dataset>/pred_client.tsv: human-readable output (claims, URLs, ratings, similarity scores)data/<dataset>/pred_qrels.tsv: standard format for evaluation
Using Cache for Efficiency
For increased efficiency, SimBa generates embeddings for the claims in each database only once and stores them in a cache for re-use. To use this cache, if it already exists, supply the -c option:
bash
python main.py <dataset name> -c
Using Cache for Efficiency and Quick Testing
SimBa now comes with pre-computed embeddings for both queries and the default fact-checking corpus (ClaimsKG).
- Query embeddings are stored in:
data/cache/sample/* - Target embeddings (ClaimsKG corpus) are stored in:
data/cache/claimsKG/*
These allow you to run the system without recomputing embeddings from scratch, saving significant time and resources.
You can quickly test the system in the interactive environment (e.g., Jupyter Notebook or terminal) by running:
bash
python main.py sample -c
- The -c option enables the use of cached embeddings.
- The default query file is located at data/sample/queries.tsv.
Important: If you plan to use a different database or modify the corpus, make sure to regenerate the embeddings accordingly, or remove the cache to force regeneration.
Technical Details
SimBa is fully unsupervised, i.e. it does not need any training data. It operates in two steps:
- Candidate Retrieval: The semantically most similar claims are retrieved as candidates. Semantic similarity is computed using sentence embeddings.
- Re-Ranking: A computationally more costly re-ranking step is applied to all candidates in order to find the best matches. Again, sentence embeddings combined with a lexical feature are used.
SimBa was evaluated on the CLEF CheckThat! Lab Task 2 Claim Retrieval challenge data and achieved the following scores:
| Dataset | Map@1 | Map@3 | Map@5 | |---------|-------|-------|-------| | 2020 2a English | 0.9425 | 0.9617 | 0.9617 | | 2021 2a English | 0.9208 | 0.9431 | 0.9450 | | 2021 2b English | 0.4114 | 0.4388 | 0.4414 | | 2022 2a English | 0.9043 | 0.9258 | 0.9258 | | 2022 2b English | 0.4462 | 0.4744 | 0.4805 |
References
Hvelmeyer, Alica, Katarina Boland, and Stefan Dietze. 2022. SimBa at CheckThat! 2022: Lexical and Semantic Similarity-Based Detection of Verified Claims in an Unsupervised and Supervised Way. In: CEUR Workshop Proceedings, Vol. 3180, pp. 511531. PDF
Boland, Katarina, Hvelmeyer, Alica, Fafalios, Pavlos, Todorov, Konstantin, Mazhar, Usama, & Dietze, Stefan. 2023. Robust and Efficient Claim Retrieval for Online Fact-Checking Applications. Preprint. DOI
Contact Details
For further assistance or inquiries, please contact: katarina.boland@hhu.de
Owner
- Name: BDA-KTS
- Login: BDA-KTS
- Kind: organization
- Repositories: 1
- Profile: https://github.com/BDA-KTS
GitHub Events
Total
- Issues event: 9
- Issue comment event: 15
- Member event: 1
- Push event: 21
- Pull request event: 2
Last Year
- Issues event: 9
- Issue comment event: 15
- Member event: 1
- Push event: 21
- Pull request event: 2
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 7
- Total pull requests: 2
- Average time to close issues: about 1 month
- Average time to close pull requests: 2 minutes
- Total issue authors: 4
- Total pull request authors: 2
- Average comments per issue: 0.57
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 7
- Pull requests: 2
- Average time to close issues: about 1 month
- Average time to close pull requests: 2 minutes
- Issue authors: 4
- Pull request authors: 2
- Average comments per issue: 0.57
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- johanneskiesel (2)
- shyamgupta196 (2)
- rgaiacs (2)
- taimoorkhan-nlp (1)
Pull Request Authors
- elboukai (1)
- shyamgupta196 (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- GESIS-Methods-Hub/preview v1 composite
- actions/checkout v4 composite
- demoji ==1.1.0
- entity-fishing-client *
- gdown ==4.6.4
- huggingface-hub ==0.25.2
- matplotlib ==3.6.3
- nerd ==1.0.0
- nltk ==3.7
- numpy ==1.24.3
- pandas ==1.5.3
- protobuf ==3.20.3
- python_Levenshtein ==0.20.9
- regex ==2022.10.31
- requests ==2.25.1
- scikit_learn ==1.2.2
- scipy ==1.10.1
- sentence_transformers ==2.2.2
- spacy *
- tensorflow ==2.13.1
- tensorflow_hub ==0.12.0
- torch ==2.1.0
- trectools ==0.0.49