qasports-dataset-scripts

Scripts used to generate the QASports2 Question-Answering (QA) dataset

https://github.com/leomaurodesenv/qasports-dataset-scripts

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.6%) to scientific vocabulary

Keywords

crawler dataset python question-answering sports
Last synced: 6 months ago

Repository

Scripts used to generate the QASports2 Question-Answering (QA) dataset

Basic Info
Statistics
  • Stars: 6
  • Watchers: 2
  • Forks: 4
  • Open Issues: 0
  • Releases: 2
Topics
crawler dataset python question-answering sports
Created over 2 years ago · Last pushed 6 months ago
Metadata Files
Readme · Contributing · License · Citation

README.md

📄 QASports2: Question-Answering Dataset about Sports


The first large-scale, open-domain sports question answering dataset

QASports is a comprehensive dataset featuring over 1 million question-answer-context tuples derived from more than 400,000 thoroughly preprocessed, cleaned, and organized documents about players, teams, and matches from multiple sports. The data is sourced from Wikipedia-like resources to ensure quality and relevance.

📚 Research Paper

Abstract

Sport is an engaging topic that is quickly evolving due to its growing popularity and revenue potential. This theme introduces several opportunities for question-answering (QA) systems, such as supporting tactical decision-making. Nevertheless, these QA systems require specialized datasets and models. To advance the field of sports question answering, we first present QASports2, a novel and significantly expanded dataset. QASports2 compiles information from 20 different wiki sports sources, resulting in over 1 million context-question-answer tuples. We then introduce a data labeling validation algorithm that leverages both cluster sampling and large language models (LLMs). To corroborate the LLM results, we conducted a detailed manual review to ensure the labeling was accurate. We found that Qwen2 produced results closest to human labeling, reaching 83.9% question accuracy and 87.5% answer accuracy. Additionally, we conduct a comparative analysis of language models on QASports2 to evaluate their effectiveness as document reader and retriever models. We highlight the best models, BM25 and MiniLM, which obtained 90.8% recall@20 and 93.4% F-score, respectively. Finally, this work outlines different aspects and applications of this real-world sports data.
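
For context, recall@k counts a query as successful when a relevant document appears among the top-k retrieved results. A minimal sketch of that computation (the function and data below are illustrative, not the paper's evaluation code):

```python
# Illustrative recall@k computation; not the paper's evaluation code.
def recall_at_k(ranked_lists, gold_ids, k=20):
    """Fraction of queries whose gold document appears in the top-k retrieved results."""
    hits = sum(1 for ranked, gold in zip(ranked_lists, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids) if gold_ids else 0.0


# Example: 2 of 3 queries have their gold document within the top-20 results.
ranked = [["d1", "d7", "d3"], ["d9", "d5"], ["d2", "d8"]]
gold = ["d7", "d4", "d2"]
print(recall_at_k(ranked, gold, k=20))  # ~0.667
```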

🚀 Quick Start

Prerequisites

  • Git
  • Python with the uv package manager

Installation

```bash
# Clone the repository
git clone https://github.com/leomaurodesenv/qasports-dataset-scripts.git
cd qasports-dataset-scripts

# Install dependencies
uv sync

# Verify installation
uv run pre-commit run --all-files
```

📥 Download the Dataset
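
A minimal loading sketch with the Hugging Face `datasets` library (already listed in requirements.txt); the Hub identifier below is an assumption about where the published data lives, not a value confirmed by this repository:

```python
# Sketch only: the dataset identifier is an assumption, not confirmed by this repository.
from datasets import load_dataset

dataset = load_dataset("PedroCJardim/QASports")  # hypothetical Hugging Face Hub ID

print(dataset)              # available splits and their sizes
print(dataset["train"][0])  # one question-answer-context record
```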

🏗️ Dataset Generation Pipeline

The dataset generation process is organized into seven main stages, each contained in a separate module:

📁 Project Structure

```
src/
├── crawler/             # 🔍 Gather wiki links
├── fetching/            # 📥 Fetch raw HTML from links
├── processing/          # 🧹 Process and clean textual data
├── extracting_context/  # 📄 Extract contexts from data
├── question_answer/     # ❓ Generate questions and answers
├── sampling/            # 🎯 Sample representative questions
└── labeling_llm/        # 🏷️ Label samples using LLMs
```

🔄 Generation Steps

```bash
# 1. Crawler: Gather wiki links (~2 minutes)
uv run -m src.crawler.run

# 2. Fetching: Download wiki pages (~20 hours)
uv run -m src.fetching.run

# 3. Processing: Clean and process text (~50 minutes)
uv run -m src.processing.run

# 4. Context Extraction: Extract relevant contexts (~35 seconds)
uv run -m src.extracting_context.run

# 5. Q&A Generation: Create questions and answers (~5 days)
uv run -m src.question_answer.run
uv run -m src.question_answer.run_huggingface  # Optional

# 6. Sampling: Select representative questions (~1.5 hours)
uv run -m src.sampling.run

# 7. LLM Labeling: Label samples using LLMs (~91 hours)
uv run -m src.labeling_llm.run
```
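
Each stage is an independent module invoked with `uv run -m`; if you want to chain them from a single script, here is a small orchestration sketch (the wrapper itself is illustrative and not part of the repository):

```python
# Illustrative wrapper: runs the pipeline stages in order via `uv run -m`.
import subprocess

STAGES = [
    "src.crawler.run",
    "src.fetching.run",
    "src.processing.run",
    "src.extracting_context.run",
    "src.question_answer.run",
    "src.sampling.run",
    "src.labeling_llm.run",
]

for module in STAGES:
    print(f"Running {module} ...")
    # check=True aborts the pipeline if a stage exits with a non-zero status.
    subprocess.run(["uv", "run", "-m", module], check=True)
```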

🧪 Experiments

This repository includes experimental frameworks for evaluating QA systems using the QASports dataset.

Document Retriever Experiments

```bash
# Run document retriever experiments (~24 days)
uv run -m experiments.doc_retriever --help

# Example usage
uv run -m experiments.doc_retriever --model BM25 --num_k 3
```
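
For intuition about what the BM25 retriever ranks, here is a standalone sketch using the `rank_bm25` package (an external illustration, not this repository's retriever implementation or one of its dependencies):

```python
# Standalone BM25 illustration (pip install rank-bm25); not the repository's retriever code.
from rank_bm25 import BM25Okapi

contexts = [
    "Michael Jordan won six NBA championships with the Chicago Bulls.",
    "The FIFA World Cup is held every four years.",
    "Test cricket matches can last up to five days.",
]
tokenized = [doc.lower().split() for doc in contexts]

bm25 = BM25Okapi(tokenized)
query = "how many championships did michael jordan win".split()

print(bm25.get_scores(query))                # BM25 score for every context
print(bm25.get_top_n(query, contexts, n=1))  # highest-ranked context
```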

Document Reader Experiments

```bash
# Run document reader experiments (~37 hours)
uv run -m experiments.doc_reader --help

# Example usage
uv run -m experiments.doc_reader --model RoBERTa --dataset SQuAD
```
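
The reader experiment evaluates extractive QA models that locate an answer span inside a retrieved context. A minimal sketch with the `transformers` pipeline; the checkpoint is a public RoBERTa model fine-tuned on SQuAD 2.0 and is only an example, not necessarily the model used in these experiments:

```python
# Illustrative extractive reader; the checkpoint is an example, not the repository's exact model.
from transformers import pipeline

reader = pipeline("question-answering", model="deepset/roberta-base-squad2")

context = (
    "QASports compiles information from 20 different wiki sports sources, "
    "resulting in over 1 million context-question-answer tuples."
)
result = reader(question="How many wiki sources does QASports use?", context=context)

# The pipeline returns the answer span, its confidence score, and character offsets.
print(result["answer"], result["score"], result["start"], result["end"])
```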

📊 Dataset Statistics

  • Total Questions: 1,000,000+
  • Source Documents: 400,000+ preprocessed documents
  • Data Sources: Wikipedia-like resources
  • Question Types: Extractive QA, Wh-questions
  • Sports Covered: Football, American Football, Basketball, Cricket, +15 Sports

```bash
# Run dataset general analysis
uv run -m experiments.dataset_analysis --help

# Example usage
uv run -m experiments.dataset_analysis --dataset QASports --sport RUGBY
```
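
As a hedged illustration of this kind of analysis, a short pandas sketch that counts questions by their leading wh-word; the file path and column name are assumptions for the example, not a documented schema of the generated files:

```python
# Illustration only: the file path and the "question" column name are assumptions.
import pandas as pd

df = pd.read_json("data/qasports_sample.jsonl", lines=True)  # hypothetical pipeline output

# Count questions by their leading word (what, who, when, where, which, how, ...).
wh_counts = (
    df["question"]
    .str.strip()
    .str.split()
    .str[0]
    .str.lower()
    .value_counts()
)
print(wh_counts.head(10))
```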

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Brazilian Computer Society for hosting the Dataset Showcase Workshop
  • The research community for feedback and contributions
  • All contributors to the QASports dataset

📖 Citation

If you use QASports in your research, please cite our paper:

```bibtex
@inproceedings{jardim:2023:qasports-dataset,
    author={Pedro Calciolari Jardim and Leonardo Mauro Pereira Moraes and Cristina Dutra Aguiar},
    title = {{QASports}: A Question Answering Dataset about Sports},
    booktitle = {Proceedings of the Brazilian Symposium on Databases: Dataset Showcase Workshop},
    address = {Belo Horizonte, MG, Brazil},
    url = {https://github.com/leomaurodesenv/qasports-dataset-scripts},
    publisher = {Brazilian Computer Society},
    pages = {1-12},
    year = {2023}
}
```

Owner

  • Name: Leonardo Mauro
  • Login: leomaurodesenv
  • Kind: user
  • Location: Sao Carlos, SP - Brazil
  • Company: Sinch

Data Scientist | Tutor (Data Mining, Machine Learning, Business Intelligence)

Citation (CITATION.bib)

@inproceedings{jardim:2023:qasports-dataset,
    author={Pedro Calciolari Jardim and Leonardo Mauro Pereira Moraes and Cristina Dutra Aguiar},
    title = {{QASports}: A Question Answering Dataset about Sports},
    booktitle = {Proceedings of the Brazilian Symposium on Databases: Dataset Showcase Workshop},
    address = {Belo Horizonte, MG, Brazil},
    url = {https://github.com/leomaurodesenv/qasports-dataset-scripts},
    publisher = {Brazilian Computer Society},
    pages = {1-12},
    year = {2023}
}

GitHub Events

Total
  • Create event: 25
  • Commit comment event: 1
  • Release event: 2
  • Issues event: 6
  • Watch event: 5
  • Delete event: 30
  • Issue comment event: 5
  • Push event: 134
  • Pull request review event: 5
  • Pull request review comment event: 2
  • Pull request event: 61
  • Fork event: 1
Last Year
  • Create event: 25
  • Commit comment event: 1
  • Release event: 2
  • Issues event: 6
  • Watch event: 5
  • Delete event: 30
  • Issue comment event: 5
  • Push event: 134
  • Pull request review event: 5
  • Pull request review comment event: 2
  • Pull request event: 61
  • Fork event: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 4
  • Total pull requests: 42
  • Average time to close issues: 3 months
  • Average time to close pull requests: 1 day
  • Total issue authors: 3
  • Total pull request authors: 4
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.07
  • Merged pull requests: 39
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 39
  • Average time to close issues: 1 day
  • Average time to close pull requests: 1 day
  • Issue authors: 2
  • Pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.08
  • Merged pull requests: 36
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • leomaurodesenv (2)
  • leonardo-moraes-inbev (1)
  • Pedro-C-Jardim (1)
Pull Request Authors
  • leomaurodesenv (53)
  • Pedro-C-Jardim (3)
  • estebantapia-encora (1)
  • Enzoonofre (1)
Top Labels
Issue Labels
  • enhancement (2)
  • bug (2)
Pull Request Labels
  • enhancement (40)
  • bug (12)
  • documentation (5)
  • fix (5)
  • good first issue (1)

Dependencies

requirements.txt pypi
  • beautifulsoup4 *
  • black *
  • datasets *
  • farm-haystack *
  • isort *
  • mmh3 *
  • pandas *
  • protobuf *
  • requests *
  • sentence-transformers *
  • tqdm *
  • transformers *
  • unidecode *
.github/workflows/continuous-integration.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite