qasports-dataset-scripts

Scripts used to generate the QASports2 Question-Answering (QA) dataset

https://github.com/leomaurodesenv/qasports-dataset-scripts

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.6%) to scientific vocabulary

Keywords

crawler dataset python question-answering sports
Last synced: 6 months ago

Repository

Scripts used to generate the QASports2 Question-Answering (QA) dataset

Basic Info
Statistics
  • Stars: 6
  • Watchers: 2
  • Forks: 4
  • Open Issues: 0
  • Releases: 2
Topics
crawler dataset python question-answering sports
Created over 2 years ago · Last pushed 6 months ago
Metadata Files
Readme · Contributing · License · Citation

README.md

📄 QASports2: Question-Answering Dataset about Sports


The first large-scale, open-domain sports question answering dataset

QASports is a comprehensive dataset featuring over 1 million question-answer-context tuples derived from more than 400,000 thoroughly preprocessed, cleaned, and organized documents about players, teams, and matches from multiple sports. The data is sourced from Wikipedia-like resources to ensure quality and relevance.

📚 Research Paper

Abstract

Sport is an engaging topic that is quickly evolving due to its growing popularity and revenue potential. This theme introduces several opportunities for question-answering (QA) systems, such as supporting tactical decision-making. Nevertheless, these QA systems require specialized datasets and models. To advance the field of sports question answering, we first present QASports2, a novel and significantly expanded dataset. QASports2 compiles information from 20 different wiki sports sources, resulting in over 1 million context-question-answer tuples. We then introduce a data labeling validation algorithm that leverages both cluster sampling and large language models (LLMs). To corroborate the LLM results, we conducted a detailed manual review to ensure the labeling was accurate. We found that Qwen2 produced results closest to human labeling, reaching 83.9% question accuracy and 87.5% answer accuracy. Additionally, we conduct a comparative analysis of language models on QASports2 to evaluate their effectiveness as document reader and retriever models. We highlight the best models, BM25 and MiniLM, which obtained 90.8% recall@20 and 93.4% F-score, respectively. Finally, this work outlines different aspects and applications of this real-world sports data.
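
For context, recall@k counts a query as successful when a relevant document appears among the top-k retrieved results. A minimal sketch of that computation (the function and data below are illustrative, not the paper's evaluation code):

```python
# Illustrative recall@k computation; not the paper's evaluation code.
def recall_at_k(ranked_lists, gold_ids, k=20):
    """Fraction of queries whose gold document appears in the top-k retrieved results."""
    hits = sum(1 for ranked, gold in zip(ranked_lists, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids) if gold_ids else 0.0


# Example: 2 of 3 queries have their gold document within the top-20 results.
ranked = [["d1", "d7", "d3"], ["d9", "d5"], ["d2", "d8"]]
gold = ["d7", "d4", "d2"]
print(recall_at_k(ranked, gold, k=20))  # ~0.667
```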

🚀 Quick Start

Prerequisites

  • Git
  • Python with the uv package manager

Installation

```bash
# Clone the repository
git clone https://github.com/leomaurodesenv/qasports-dataset-scripts.git
cd qasports-dataset-scripts

# Install dependencies
uv sync

# Verify installation
uv run pre-commit run --all-files
```

📥 Download the Dataset
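
A minimal loading sketch with the Hugging Face `datasets` library (already listed in requirements.txt); the Hub identifier below is an assumption about where the published data lives, not a value confirmed by this repository:

```python
# Sketch only: the dataset identifier is an assumption, not confirmed by this repository.
from datasets import load_dataset

dataset = load_dataset("PedroCJardim/QASports")  # hypothetical Hugging Face Hub ID

print(dataset)              # available splits and their sizes
print(dataset["train"][0])  # one question-answer-context record
```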

🏗️ Dataset Generation Pipeline

The dataset generation process is organized into seven main stages, each contained in a separate module:

📁 Project Structure

```
src/
├── crawler/             # 🔍 Gather wiki links
├── fetching/            # 📥 Fetch raw HTML from links
├── processing/          # 🧹 Process and clean textual data
├── extracting_context/  # 📄 Extract contexts from data
├── question_answer/     # ❓ Generate questions and answers
├── sampling/            # 🎯 Sample representative questions
└── labeling_llm/        # 🏷️ Label samples using LLMs
```

🔄 Generation Steps

```bash
# 1. Crawler: Gather wiki links (~2 minutes)
uv run -m src.crawler.run

# 2. Fetching: Download wiki pages (~20 hours)
uv run -m src.fetching.run

# 3. Processing: Clean and process text (~50 minutes)
uv run -m src.processing.run

# 4. Context Extraction: Extract relevant contexts (~35 seconds)
uv run -m src.extracting_context.run

# 5. Q&A Generation: Create questions and answers (~5 days)
uv run -m src.question_answer.run
uv run -m src.question_answer.run_huggingface  # Optional

# 6. Sampling: Select representative questions (~1.5 hours)
uv run -m src.sampling.run

# 7. LLM Labeling: Label samples using LLMs (~91 hours)
uv run -m src.labeling_llm.run
```
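
Each stage is an independent module invoked with `uv run -m`; if you want to chain them from a single script, here is a small orchestration sketch (the wrapper itself is illustrative and not part of the repository):

```python
# Illustrative wrapper: runs the pipeline stages in order via `uv run -m`.
import subprocess

STAGES = [
    "src.crawler.run",
    "src.fetching.run",
    "src.processing.run",
    "src.extracting_context.run",
    "src.question_answer.run",
    "src.sampling.run",
    "src.labeling_llm.run",
]

for module in STAGES:
    print(f"Running {module} ...")
    # check=True aborts the pipeline if a stage exits with a non-zero status.
    subprocess.run(["uv", "run", "-m", module], check=True)
```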

🧪 Experiments

This repository includes experimental frameworks for evaluating QA systems using the QASports dataset.

Document Retriever Experiments

```bash
# Run document retriever experiments (~24 days)
uv run -m experiments.doc_retriever --help

# Example usage
uv run -m experiments.doc_retriever --model BM25 --num_k 3
```
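
For intuition about what the BM25 retriever ranks, here is a standalone sketch using the `rank_bm25` package (an external illustration, not this repository's retriever implementation or one of its dependencies):

```python
# Standalone BM25 illustration (pip install rank-bm25); not the repository's retriever code.
from rank_bm25 import BM25Okapi

contexts = [
    "Michael Jordan won six NBA championships with the Chicago Bulls.",
    "The FIFA World Cup is held every four years.",
    "Test cricket matches can last up to five days.",
]
tokenized = [doc.lower().split() for doc in contexts]

bm25 = BM25Okapi(tokenized)
query = "how many championships did michael jordan win".split()

print(bm25.get_scores(query))                # BM25 score for every context
print(bm25.get_top_n(query, contexts, n=1))  # highest-ranked context
```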

Document Reader Experiments

```bash
# Run document reader experiments (~37 hours)
uv run -m experiments.doc_reader --help

# Example usage
uv run -m experiments.doc_reader --model RoBERTa --dataset SQuAD
```
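
The reader experiment evaluates extractive QA models that locate an answer span inside a retrieved context. A minimal sketch with the `transformers` pipeline; the checkpoint is a public RoBERTa model fine-tuned on SQuAD 2.0 and is only an example, not necessarily the model used in these experiments:

```python
# Illustrative extractive reader; the checkpoint is an example, not the repository's exact model.
from transformers import pipeline

reader = pipeline("question-answering", model="deepset/roberta-base-squad2")

context = (
    "QASports compiles information from 20 different wiki sports sources, "
    "resulting in over 1 million context-question-answer tuples."
)
result = reader(question="How many wiki sources does QASports use?", context=context)

# The pipeline returns the answer span, its confidence score, and character offsets.
print(result["answer"], result["score"], result["start"], result["end"])
```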

📊 Dataset Statistics

  • Total Questions: 1,000,000+
  • Source Documents: 400,000+ preprocessed documents
  • Data Sources: Wikipedia-like resources
  • Question Types: Extractive QA, Wh-questions
  • Sports Covered: Football, American Football, Basketball, Cricket, +15 Sports

```bash
# Run dataset general analysis
uv run -m experiments.dataset_analysis --help

# Example usage
uv run -m experiments.dataset_analysis --dataset QASports --sport RUGBY
```
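
As a hedged illustration of this kind of analysis, a short pandas sketch that counts questions by their leading wh-word; the file path and column name are assumptions for the example, not a documented schema of the generated files:

```python
# Illustration only: the file path and the "question" column name are assumptions.
import pandas as pd

df = pd.read_json("data/qasports_sample.jsonl", lines=True)  # hypothetical pipeline output

# Count questions by their leading word (what, who, when, where, which, how, ...).
wh_counts = (
    df["question"]
    .str.strip()
    .str.split()
    .str[0]
    .str.lower()
    .value_counts()
)
print(wh_counts.head(10))
```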

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Brazilian Computer Society for hosting the Dataset Showcase Workshop
  • The research community for feedback and contributions
  • All contributors to the QASports dataset

📖 Citation

If you use QASports in your research, please cite our paper:

```bibtex
@inproceedings{jardim:2023:qasports-dataset,
    author={Pedro Calciolari Jardim and Leonardo Mauro Pereira Moraes and Cristina Dutra Aguiar},
    title = {{QASports}: A Question Answering Dataset about Sports},
    booktitle = {Proceedings of the Brazilian Symposium on Databases: Dataset Showcase Workshop},
    address = {Belo Horizonte, MG, Brazil},
    url = {https://github.com/leomaurodesenv/qasports-dataset-scripts},
    publisher = {Brazilian Computer Society},
    pages = {1-12},
    year = {2023}
}
```

Owner

  • Name: Leonardo Mauro
  • Login: leomaurodesenv
  • Kind: user
  • Location: Sao Carlos, SP - Brazil
  • Company: Sinch

Data Scientist | Tutor (Data Mining, Machine Learning, Business Intelligence)

Citation (CITATION.bib)

@inproceedings{jardim:2023:qasports-dataset,
    author={Pedro Calciolari Jardim and Leonardo Mauro Pereira Moraes and Cristina Dutra Aguiar},
    title = {{QASports}: A Question Answering Dataset about Sports},
    booktitle = {Proceedings of the Brazilian Symposium on Databases: Dataset Showcase Workshop},
    address = {Belo Horizonte, MG, Brazil},
    url = {https://github.com/leomaurodesenv/qasports-dataset-scripts},
    publisher = {Brazilian Computer Society},
    pages = {1-12},
    year = {2023}
}

GitHub Events

Total
  • Create event: 25
  • Commit comment event: 1
  • Release event: 2
  • Issues event: 6
  • Watch event: 5
  • Delete event: 30
  • Issue comment event: 5
  • Push event: 134
  • Pull request review event: 5
  • Pull request review comment event: 2
  • Pull request event: 61
  • Fork event: 1
Last Year
  • Create event: 25
  • Commit comment event: 1
  • Release event: 2
  • Issues event: 6
  • Watch event: 5
  • Delete event: 30
  • Issue comment event: 5
  • Push event: 134
  • Pull request review event: 5
  • Pull request review comment event: 2
  • Pull request event: 61
  • Fork event: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 4
  • Total pull requests: 42
  • Average time to close issues: 3 months
  • Average time to close pull requests: 1 day
  • Total issue authors: 3
  • Total pull request authors: 4
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.07
  • Merged pull requests: 39
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 39
  • Average time to close issues: 1 day
  • Average time to close pull requests: 1 day
  • Issue authors: 2
  • Pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.08
  • Merged pull requests: 36
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • leomaurodesenv (2)
  • leonardo-moraes-inbev (1)
  • Pedro-C-Jardim (1)
Pull Request Authors
  • leomaurodesenv (53)
  • Pedro-C-Jardim (3)
  • estebantapia-encora (1)
  • Enzoonofre (1)
Top Labels
Issue Labels
  • enhancement (2)
  • bug (2)
Pull Request Labels
  • enhancement (40)
  • bug (12)
  • documentation (5)
  • fix (5)
  • good first issue (1)

Dependencies

requirements.txt pypi
  • beautifulsoup4 *
  • black *
  • datasets *
  • farm-haystack *
  • isort *
  • mmh3 *
  • pandas *
  • protobuf *
  • requests *
  • sentence-transformers *
  • tqdm *
  • transformers *
  • unidecode *
.github/workflows/continuous-integration.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite