mh_scitweets_classifier
Repository
Basic Info
- Host: GitHub
- Owner: SEBSCHELLI
- License: mit
- Language: Python
- Default Branch: main
- Size: 81.1 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
SciTweets Classifier - Classification of Science-Relatedness of Tweets
Description
This repository contains a script to classify the science-relatedness of Tweets. The underlying classifier was trained as part of "SciTweets - A Dataset and Annotation Framework for Detecting Scientific Online Discourse", published at CIKM 2022. The classifier predicts three forms of science-relatedness for Tweets (categories 1.1, 1.2, and 1.3), which are defined as follows:
Category 1 - Science-related: Texts that fall under at least one of the following categories:
Category 1.1 - Scientific knowledge (scientifically verifiable claims): Does the text include a claim or a question that could be scientifically verified?
Category 1.2 - Reference to scientific knowledge: Does the text include at least one reference to scientific knowledge? References can be either direct (e.g., a DOI or the title of a paper) or indirect (e.g., a link to an article that includes a direct reference).
Category 1.3 - Related to scientific research in general: Does the text mention a scientific research context (e.g., mention of a scientist, scientific research efforts, research findings)?
Category 2 - Not science-related: Texts that don't fall under any of the three previous categories.
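The "at least one of" rule for Category 1 can be sketched as a small helper over the three classifier scores. This is a minimal illustration, not part of classify.py: the function name `is_science_related`, the threshold of 0.5, and the assumption that the score columns `cat1_score`, `cat2_score`, and `cat3_score` correspond to categories 1.1, 1.2, and 1.3 are all hypothetical.

```python
from typing import Mapping

# Assumption: these column names match the classifier's output file and map,
# in order, to categories 1.1, 1.2, and 1.3.
SCORE_COLUMNS = ("cat1_score", "cat2_score", "cat3_score")


def is_science_related(scores: Mapping[str, float], threshold: float = 0.5) -> bool:
    """Return True if at least one sub-category score reaches the threshold.

    The default threshold of 0.5 is a hypothetical choice, not a value
    taken from the paper or the script.
    """
    return any(scores[col] >= threshold for col in SCORE_COLUMNS)
```

A Tweet for which this returns False would be treated as Category 2 under the chosen threshold; a stricter or looser threshold may suit a given analysis better.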
Keywords
Science-Relatedness, Scientific Online Discourse, Tweets, Claims
Use Cases
- A social scientist wants to analyze scientific online discourse and needs to extract such data from a list of existing tweet texts.
- A social scientist wants to identify scientific claims in Tweets.
- A social scientist wants to identify scientific references in Tweets.
Repo Structure
This repository contains the following files:
├── classify.py the python script for classifying the science-relatedness of Tweets
├── example_tweets.tsv an exemplary dataset in tsv format (tab separated)
Setup
Environment Setup
To run the classifier the following software is required.
First, the script requires a Python environment with version >= 3.9.
Second, within the Python environment, install the modules from requirements.txt with
python -m pip install -r requirements.txt
Note that the script may also run properly with other versions of the modules.
Hardware Requirements (Optional)
The classifier does not require specific hardware. When running the classification script classify.py for the first time, a network connection is required to download the underlying classifier. A GPU is not required but can speed up the classification, especially for larger collections of input Tweets.
Input Data
The input data must be a .tsv (tab-separated) file containing the Tweets to classify. The input file must have a text column. Optionally, if the input file also has a urls column, the classify.py script replaces the URLs in the text column with the URLs from the urls column. For example, in the last Tweet in example_tweets.tsv the text
"Vestislav Apostolov, David M. J. Calderbank, Eveline Legendre: Weighted K-stability of polarized varieties and extremality of Sasaki manifolds https://t.co/wd6l9ARN21 https://t.co/rwzu51tW32"
will be updated based on the information in the urls column
"['https://arxiv.org/abs/2012.08628', 'https://arxiv.org/pdf/2012.08628']"
to
"Vestislav Apostolov, David M. J. Calderbank, Eveline Legendre: Weighted K-stability of polarized varieties and extremality of Sasaki manifolds https://arxiv.org/abs/2012.08628 https://arxiv.org/pdf/2012.08628"
which can improve the classifier's performance. All other columns in the input file will not be used by the classifier and will stay the same in the output file.
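The URL replacement described above can be sketched as follows. This is an illustrative approximation, not the code from classify.py: the helper name `expand_urls`, the regular expression, and the assumption that URLs are paired by their order of appearance are all hypothetical. It also assumes the urls column holds a Python-style list literal, as in the example.

```python
import ast
import re

# Matches anything that looks like an http(s) URL up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def expand_urls(text: str, urls_cell: str) -> str:
    """Replace the URLs in `text`, in order, with those listed in `urls_cell`."""
    # The urls column is assumed to hold a list literal, e.g. "['https://...']".
    expanded = ast.literal_eval(urls_cell) if urls_cell else []
    replacements = iter(expanded)
    # Keep the original URL when the urls column has fewer entries than the text.
    return URL_PATTERN.sub(lambda m: next(replacements, m.group(0)), text)
```

Applied to the example Tweet, the two t.co links would be swapped for the two arxiv.org links in the order they are listed.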
Sample Input and Output Data
The file example_tweets.tsv contains example input data. After running classify.py, a new file is created that contains the input data and three additional columns, one for each category, holding the output scores of the classifier. The scores range from 0 to 1.
Structure of the input file:
tweetid text urls
... ... ...
... ... ...
Structure of the output file:
tweetid text urls cat1_score cat2_score cat3_score
... ... ... ... ... ...
... ... ... ... ... ...
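A minimal sketch of reading such an output file with the Python standard library is shown below. The helper name `read_predictions` and the sample rows are made up for illustration, and the score column names `cat1_score`, `cat2_score`, and `cat3_score` are assumed to match the output structure above.

```python
import csv
import io
from typing import Dict, Iterable, List

# Assumption: score column names as listed in the output file structure.
SCORE_COLUMNS = ("cat1_score", "cat2_score", "cat3_score")


def read_predictions(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Parse tab-separated prediction rows and convert the score columns to float."""
    rows = []
    for row in csv.DictReader(lines, delimiter="\t"):
        parsed: Dict[str, object] = dict(row)
        for col in SCORE_COLUMNS:
            parsed[col] = float(row[col])
        rows.append(parsed)
    return rows


# Small in-memory stand-in for an output file; the values are made up.
sample = io.StringIO(
    "tweetid\ttext\turls\tcat1_score\tcat2_score\tcat3_score\n"
    "1\tsome text\t[]\t0.91\t0.12\t0.40\n"
    "2\tother text\t[]\t0.05\t0.02\t0.03\n"
)
predictions = read_predictions(sample)
# Pick the row with the highest score in the first score column.
top = max(predictions, key=lambda r: r["cat1_score"])
print(top["tweetid"], top["cat1_score"])
```

For a real output file, the same function can be given a file object opened with `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.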
How to Use
1. Install the required software and modules.
2. Run the classifier with:
python3 classify.py inputfilepath
where inputfilepath is the path on your computer/server where the input file is located.
3. After the classifier has finished, it saves the output file to the same location as the input file, with "_pred" appended to the input file's filename.
Contact Details
For questions or feedback, contact sebastian.schellhammer@gesis.org
Publication
Please cite the following paper if you are using the classifier:
Hafid, Salim, et al. "SciTweets-A Dataset and Annotation Framework for Detecting Scientific Online Discourse." Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 2022.
@inproceedings{hafid2022scitweets,
title={SciTweets-A Dataset and Annotation Framework for Detecting Scientific Online Discourse},
author={Hafid, Salim and Schellhammer, Sebastian and Bringay, Sandra and Todorov, Konstantin and Dietze, Stefan},
booktitle={Proceedings of the 31st ACM International Conference on Information \& Knowledge Management},
pages={3988--3992},
year={2022}
}
Owner
- Login: SEBSCHELLI
- Kind: user
- Repositories: 1
- Profile: https://github.com/SEBSCHELLI
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: Schellhammer
given-names: Sebastian
type: software
title: "SciTweets Classifier - Classification of Science-Relatedness of Tweets"
version: 1.0
GitHub Events
Total
- Issues event: 1
- Issue comment event: 1
- Push event: 2
- Pull request event: 2
- Create event: 1
Last Year
- Issues event: 1
- Issue comment event: 1
- Push event: 2
- Pull request event: 2
- Create event: 1