qasports-dataset-scripts
Scripts used to generate the (Question-Answering) QASports2 Dataset
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (14.6%) to scientific vocabulary
Keywords
Repository
Scripts used to generate the (Question-Answering) QASports2 Dataset
Basic Info
- Host: GitHub
- Owner: leomaurodesenv
- License: MIT
- Language: Python
- Default Branch: main
- Homepage: https://huggingface.co/datasets/leomaurodesenv/QASports2
- Size: 10.6 MB
Statistics
- Stars: 6
- Watchers: 2
- Forks: 4
- Open Issues: 0
- Releases: 2
Topics
Metadata Files
README.md
📄 QASports2: Question-Answering Dataset about Sports
The first large-scale, open-domain sports question answering dataset
QASports is a comprehensive dataset featuring over 1 million question-answer-context tuples derived from more than 400,000 thoroughly preprocessed, cleaned, and organized documents about players, teams, and matches from multiple sports. The data is sourced from Wikipedia-like resources to ensure quality and relevance.
📚 Research Paper
Abstract
Sport is an engaging topic that is quickly evolving due to its growing popularity and revenue potential. This theme introduces several opportunities for question-answering (QA) systems, such as supporting tactical decision-making. Nevertheless, these QA systems require expert, specialized datasets and models. To advance the field of sports question answering, we first present QASports2, a novel and significantly expanded dataset. QASports2 compiles information from 20 different wiki sports sources, resulting in over 1 million context-question-answer tuples. Subsequently, we introduce a data labeling validation algorithm that leverages both cluster sampling and large language models (LLMs). To corroborate the LLM results, we conducted a detailed manual review to ensure the labeling was accurate. We found Qwen2 to be the LLM closest to human labeling, reaching 83.9% question and 87.5% answer accuracy. Additionally, we conduct a comparative analysis of language models on QASports2 to evaluate their effectiveness as document reader and retriever models. We highlight the best models, BM25 and MiniLM, which obtained 90.8% recall@20 and 93.4% F-score, respectively. Finally, this work outlines different aspects and applications of this real-world sports data.
🚀 Quick Start
Prerequisites
- Python 3.10+
- uv package manager (Installation Guide)
Installation
```bash
# Clone the repository
git clone https://github.com/leomaurodesenv/qasports-dataset-scripts.git
cd qasports-dataset-scripts

# Install dependencies
uv sync

# Verify installation
uv run pre-commit run --all-files
```
📥 Download the Dataset
- 🎲 Full Dataset: OSF Repository
- 🎲 Formatted Dataset: Hugging Face Hub
- 🛠 Dataset v1: GitHub Release v1.1.0
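For the formatted release, the dataset can be pulled straight from the Hugging Face Hub with the `datasets` library (already listed among the project dependencies). A minimal sketch, assuming the Hub id from the homepage works without an explicit configuration name and that a `train` split exists; check the dataset card for the exact configuration names:

```python
# Minimal sketch: load the formatted QASports2 dataset from the Hugging Face Hub.
# Assumptions: the repository id below (taken from the project homepage) loads
# without an explicit configuration name, and a "train" split exists.
from datasets import load_dataset

dataset = load_dataset("leomaurodesenv/QASports2")  # a config name may be required
print(dataset)              # available splits and sizes
print(dataset["train"][0])  # one question-answer-context example
```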
🏗️ Dataset Generation Pipeline
The dataset generation process is organized into seven main stages, each contained in a separate module. A sketch that chains all stages follows the step list below.
📁 Project Structure
```
src/
├── crawler/             # 🔍 Gather wiki links
├── fetching/            # 📥 Fetch raw HTML from links
├── processing/          # 🧹 Process and clean textual data
├── extracting_context/  # 📄 Extract contexts from data
├── question_answer/     # ❓ Generate questions and answers
├── sampling/            # 🎯 Sample representative questions
└── labeling_llm/        # 🏷️ Label samples using LLMs
```
🔄 Generation Steps
```bash
# 1. Crawler: Gather wiki links (~2 minutes)
uv run -m src.crawler.run

# 2. Fetching: Download wiki pages (~20 hours)
uv run -m src.fetching.run

# 3. Processing: Clean and process text (~50 minutes)
uv run -m src.processing.run

# 4. Context Extraction: Extract relevant contexts (~35 seconds)
uv run -m src.extracting_context.run

# 5. Q&A Generation: Create questions and answers (~5 days)
uv run -m src.question_answer.run
uv run -m src.question_answer.run_huggingface  # Optional

# 6. Sampling: Select representative questions (~1.5 hours)
uv run -m src.sampling.run

# 7. LLM Labeling: Label samples using LLMs (~91 hours)
uv run -m src.labeling_llm.run
```
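As referenced above, the seven stages can also be chained from a small Python driver. This is only a sketch built from the module names in the step list; it shells out to `uv run -m` for each stage, and running everything end-to-end still takes several days:

```python
# Sketch: run the seven generation stages in order by shelling out to `uv run -m`.
# The module names come from the step list above; this is not code taken from
# the repository itself.
import subprocess

STAGES = [
    "src.crawler.run",             # 1. gather wiki links
    "src.fetching.run",            # 2. download wiki pages
    "src.processing.run",          # 3. clean and process text
    "src.extracting_context.run",  # 4. extract relevant contexts
    "src.question_answer.run",     # 5. generate questions and answers
    "src.sampling.run",            # 6. sample representative questions
    "src.labeling_llm.run",        # 7. label samples using LLMs
]

for module in STAGES:
    print(f"Running {module} ...")
    subprocess.run(["uv", "run", "-m", module], check=True)  # stop on the first failure
```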
🧪 Experiments
This repository includes experimental frameworks for evaluating QA systems using the QASports dataset.
Document Retriever Experiments
```bash
# Run document retriever experiments (~24 days)
uv run -m experiments.doc_retriever --help

# Example usage
uv run -m experiments.doc_retriever --model BM25 --numk 3
```
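To illustrate what a sparse document retriever does in this setting, here is a small, self-contained sketch using the `rank_bm25` package on a toy corpus. It is a stand-in for the concept only; the repository's `experiments.doc_retriever` module has its own implementation and dependencies (e.g., farm-haystack):

```python
# Illustrative BM25 retrieval over a toy sports corpus using rank_bm25
# (pip install rank-bm25). This is not the repository's retriever code.
from rank_bm25 import BM25Okapi

corpus = [
    "Lionel Messi scored twice in the final match.",
    "The Boston Celtics won the NBA championship in 2008.",
    "England won the Cricket World Cup in 2019 at Lord's.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "who won the cricket world cup".split()
print(bm25.get_scores(query))              # BM25 score per document
print(bm25.get_top_n(query, corpus, n=2))  # top-2 retrieved documents
```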
Document Reader Experiments
```bash
# Run document reader experiments (~37 hours)
uv run -m experiments.doc_reader --help

# Example usage
uv run -m experiments.doc_reader --model RoBERTa --dataset SQuAD
```
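For context, a document reader answers a question given a retrieved passage. The sketch below uses the Hugging Face `transformers` question-answering pipeline with a publicly available SQuAD-tuned RoBERTa checkpoint; the model name is an example, not necessarily the checkpoint configured by `experiments.doc_reader`:

```python
# Illustrative extractive QA (document reader) with a SQuAD-tuned RoBERTa model.
# The checkpoint below is an example; the repository's experiments configure
# their own models and datasets.
from transformers import pipeline

reader = pipeline("question-answering", model="deepset/roberta-base-squad2")
result = reader(
    question="Which team won the 2023 final?",
    context="In the 2023 final, Manchester City beat Inter Milan 1-0 to win the trophy.",
)
print(result["answer"], round(result["score"], 3))
```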
📊 Dataset Statistics
- Total Questions: 1,000,000+
- Source Documents: 400,000+ preprocessed documents
- Data Sources: Wikipedia-like resources
- Question Types: Extractive QA, Wh-questions
- Sports Covered: Football, American Football, Basketball, Cricket, and 15+ more sports
```bash
# Run dataset general analysis
uv run -m experiments.dataset_analysis --help

# Example usage
uv run -m experiments.dataset_analysis --dataset QASports --sport RUGBY
```
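As a rough illustration of how such statistics can be reproduced from the formatted release, the snippet below converts a split to pandas and counts rows; the column names are assumptions about the schema and should be checked against the dataset card:

```python
# Sketch: basic dataset statistics from the formatted release.
# Column names ("question", "context") are assumed; verify them on the dataset card.
import pandas as pd
from datasets import load_dataset

split = load_dataset("leomaurodesenv/QASports2", split="train")  # config/split names may differ
df = split.to_pandas()

print(f"rows: {len(df)}")
print(f"unique contexts: {df['context'].nunique()}")
print("most common question openers:")
print(df["question"].str.split().str[0].value_counts().head())
```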
🤝 Contributing
We welcome contributions! Please see our Contributing Guidelines for details.
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Brazilian Computer Society for hosting the Dataset Showcase Workshop
- The research community for feedback and contributions
- All contributors to the QASports dataset
📖 Citation
If you use QASports in your research, please cite our paper:
```bibtex
@inproceedings{jardim:2023:qasports-dataset,
  author    = {Pedro Calciolari Jardim and Leonardo Mauro Pereira Moraes and Cristina Dutra Aguiar},
  title     = {{QASports}: A Question Answering Dataset about Sports},
  booktitle = {Proceedings of the Brazilian Symposium on Databases: Dataset Showcase Workshop},
  address   = {Belo Horizonte, MG, Brazil},
  url       = {https://github.com/leomaurodesenv/qasports-dataset-scripts},
  publisher = {Brazilian Computer Society},
  pages     = {1-12},
  year      = {2023}
}
```
Owner
- Name: Leonardo Mauro
- Login: leomaurodesenv
- Kind: user
- Location: Sao Carlos, SP - Brazil
- Company: Sinch
- Website: https://leomaurodesenv.github.io/
- Repositories: 14
- Profile: https://github.com/leomaurodesenv
Data Scientist | Tutor (Data Mining, Machine Learning, Business Intelligence)
Citation (CITATION.bib)
@inproceedings{jardim:2023:qasports-dataset,
author={Pedro Calciolari Jardim and Leonardo Mauro Pereira Moraes and Cristina Dutra Aguiar},
title = {{QASports}: A Question Answering Dataset about Sports},
booktitle = {Proceedings of the Brazilian Symposium on Databases: Dataset Showcase Workshop},
address = {Belo Horizonte, MG, Brazil},
url = {https://github.com/leomaurodesenv/qasports-dataset-scripts},
publisher = {Brazilian Computer Society},
pages = {1-12},
year = {2023}
}
GitHub Events
Total
- Create event: 25
- Commit comment event: 1
- Release event: 2
- Issues event: 6
- Watch event: 5
- Delete event: 30
- Issue comment event: 5
- Push event: 134
- Pull request review event: 5
- Pull request review comment event: 2
- Pull request event: 61
- Fork event: 1
Last Year
- Create event: 25
- Commit comment event: 1
- Release event: 2
- Issues event: 6
- Watch event: 5
- Delete event: 30
- Issue comment event: 5
- Push event: 134
- Pull request review event: 5
- Pull request review comment event: 2
- Pull request event: 61
- Fork event: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 4
- Total pull requests: 42
- Average time to close issues: 3 months
- Average time to close pull requests: 1 day
- Total issue authors: 3
- Total pull request authors: 4
- Average comments per issue: 0.0
- Average comments per pull request: 0.07
- Merged pull requests: 39
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 3
- Pull requests: 39
- Average time to close issues: 1 day
- Average time to close pull requests: 1 day
- Issue authors: 2
- Pull request authors: 3
- Average comments per issue: 0.0
- Average comments per pull request: 0.08
- Merged pull requests: 36
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- leomaurodesenv (2)
- leonardo-moraes-inbev (1)
- Pedro-C-Jardim (1)
Pull Request Authors
- leomaurodesenv (53)
- Pedro-C-Jardim (3)
- estebantapia-encora (1)
- Enzoonofre (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- beautifulsoup4 *
- black *
- datasets *
- farm-haystack *
- isort *
- mmh3 *
- pandas *
- protobuf *
- requests *
- sentence-transformers *
- tqdm *
- transformers *
- unidecode *
- actions/checkout v3 composite
- actions/setup-python v4 composite