an-information-retrieval-approach-to-building-datasets-for-hate-speech-detection
A hate speech dataset constructed using an IR pooling technique to enhance diversity
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.8%) to scientific vocabulary
Repository
Statistics
- Stars: 8
- Watchers: 2
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
An-Information-Retrieval-Approach-to-Building-Datasets-for-Hate-Speech-Detection
For more details about our hate speech dataset, please read the following research article:
Md Mustafizur Rahman, Dinesh Balakrishnan, Dhiraj Murthy, Mucahid Kutlu, and Matthew Lease. An Information Retrieval Approach to Building Datasets for Hate Speech Detection. [pdf]
Source code
- Pooling --> /codes/pooling.py
- Active learning --> /codes/active_learning.py
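As a rough illustration of what "pooling" means in information retrieval, the sketch below builds a depth-k pool: the union of the top-k items from several ranked lists. This is a generic textbook version and is not taken from /codes/pooling.py, which may work differently; the function name `depth_k_pool`, the ranker names, and the tweet IDs are all hypothetical.

```python
from typing import Dict, List, Set


def depth_k_pool(rankings: Dict[str, List[str]], k: int) -> Set[str]:
    """Return the union of the top-k items from each ranked list."""
    pool: Set[str] = set()
    for ranked_ids in rankings.values():
        pool.update(ranked_ids[:k])
    return pool


# Hypothetical rankings of tweet IDs from two different rankers.
rankings = {
    "ranker_a": ["t1", "t2", "t3", "t4"],
    "ranker_b": ["t3", "t5", "t1", "t6"],
}

pool = depth_k_pool(rankings, k=2)
print(sorted(pool))  # ['t1', 't2', 't3', 't5']
```

Because each ranker contributes its own top-ranked items, the pooled set tends to be more varied than the output of any single ranker, which is the diversity argument made in the repository description.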
Benchmark Models
- BiLSTM [1] [Source code]
- LSTM [2] [Source code]
- BERT [3] [Source code]
The source code for the BiLSTM and LSTM models used in this project is taken from [4], whose authors made the necessary corrections to those two models.
Train and Test sets to Benchmark Models
- Train.csv --> /data/traintestsets/
- Test.csv --> /data/traintestsets/
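A minimal sketch for inspecting these files with the Python standard library might look like the following. The column names are not documented here, so the code only reports whatever header it finds; the helper name `load_rows` is hypothetical, and the relative path assumes the script runs from the repository root.

```python
import csv
from pathlib import Path


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Assumed location, based on the directory listed above.
train_path = Path("data/traintestsets/Train.csv")
if train_path.exists():
    rows = load_rows(train_path)
    columns = list(rows[0].keys()) if rows else []
    print(len(rows), "rows; columns:", columns)
```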
Annotation Interface
The two annotation interfaces used during the pilot and main phases are provided in HTML format under the /interface/ directory.
Author Distribution of Tweets
Total number of authors: 9534
1. Authors with exactly 1 contribution: 9430
2. Authors with exactly 2 contributions: 97
3. Authors with more than 2 contributions: 7
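The three buckets above sum to the reported total (9430 + 97 + 7 = 9534). A breakdown of this kind could be computed from a list of per-tweet author identifiers roughly as follows; the function name `contribution_breakdown` and the toy author IDs are hypothetical, and the dataset's actual author field is not specified here.

```python
from collections import Counter


def contribution_breakdown(author_ids):
    """Count distinct authors and bucket them by number of tweets."""
    per_author = Counter(author_ids)
    buckets = {"exactly 1": 0, "exactly 2": 0, "more than 2": 0}
    for n_tweets in per_author.values():
        if n_tweets == 1:
            buckets["exactly 1"] += 1
        elif n_tweets == 2:
            buckets["exactly 2"] += 1
        else:
            buckets["more than 2"] += 1
    return len(per_author), buckets


# Toy input: one author ID per tweet.
total, buckets = contribution_breakdown(["a", "b", "b", "c", "c", "c", "d"])
print(total, buckets)

# Consistency check on the figures reported above.
assert 9430 + 97 + 7 == 9534
```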
References
[1] Sweta Agrawal and Amit Awekar. 2018. Deep learning for detecting cyberbullying across multiple social media platforms. In European Conference on Information Retrieval. Springer, 141–153.
[2] Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. 2017. Deep learning for hate speech detection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion. 759–760.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[4] Aymé Arango, Jorge Pérez, and Barbara Poblete. 2019. Hate speech detection is not as easy as you may think: A closer look at model validation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 45–54.
Owner
- Name: Md Mustafizur Rahman
- Login: mdmustafizurrahman
- Kind: user
- Location: Austin, TX
- Company: The University of Texas at Austin
- Website: https://www.ischool.utexas.edu/~nahid
- Repositories: 2
- Profile: https://github.com/mdmustafizurrahman
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Md Mustafizur"
    given-names: "Rahman"
  - family-names: "Matthew"
    given-names: "Lease"
title: "An-Information-Retrieval-Approach-to-Building-Datasets-for-Hate-Speech-Detection"
version: 1.0.0
doi: 10.5281/zenodo.1234
date-released: 2021-09-26
url: "https://github.com/mdmustafizurrahman/An-Information-Retrieval-Approach-to-Building-Datasets-for-Hate-Speech-Detection"