https://github.com/datascienceuibk/chroniclingamericaqa

ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 6 DOI reference(s) in README
✓ Academic publication links: Links to acm.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (6.9%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository


Basic Info

Host: GitHub
Owner: DataScienceUIBK
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 1.21 MB

Statistics

Stars: 10
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 0

Created over 2 years ago · Last pushed 10 months ago

Metadata Files

Readme License

ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages

ChroniclingAmericaQA is a large-scale question-answering dataset of question-answer pairs built over a collection of historical American newspapers. It is intended to facilitate the development of question answering (QA) and machine reading comprehension (MRC) systems over historical texts.

Download Links

Dataset

The ChroniclingAmericaQA dataset is distributed as JSON files: train.json, dev.json, and test.json for training, validation, and testing, respectively.

Data Structure:

```json
[
  {
    "queryid": "",
    "question": "",
    "answer": "",
    "organswer": "",
    "paraid": "",
    "context": "",
    "rawocr": "",
    "publicationdate": "",
    "transque": "",
    "trans_ans": "",
    "url": ""
  }
]
```
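As a minimal sketch of how a split might be read, assuming each file is a single JSON array of records as in the layout listed above: the helper names `load_split` and `missing_fields` are hypothetical and not part of the repository, and the field names are copied as they appear in this README, so the actual files may spell them differently (for example, with underscores).

```python
import json
from pathlib import Path

# Field names copied from the README's data-structure listing.
# Assumption: the real files may spell some of them differently.
EXPECTED_FIELDS = [
    "queryid", "question", "answer", "organswer", "paraid", "context",
    "rawocr", "publicationdate", "transque", "trans_ans", "url",
]


def load_split(path):
    """Load one split (assumed to be a JSON array of objects) as a list of dicts."""
    with Path(path).open(encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError(f"{path}: expected a JSON array of records")
    return records


def missing_fields(record, expected=EXPECTED_FIELDS):
    """Return the expected field names that are absent from a record."""
    return [name for name in expected if name not in record]


if __name__ == "__main__":
    split_path = Path("train.json")  # assumed local path to a downloaded split
    if split_path.exists():
        records = load_split(split_path)
        print(f"{len(records)} records loaded")
        print("fields missing from first record:", missing_fields(records[0]))
```

Checking the first record with `missing_fields` is a quick way to confirm the real key spelling before writing any code that depends on it.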

Training Set: Download
Development Set: Download
Test Set: Download

Dataset Statistics

| | Training | Development | Test |
| ----------------- | --------- | ----------- | ------ |
| Num. of Questions | 439,302 | 24,111 | 24,084 |
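The counts in the table can be turned into split proportions with a few lines of arithmetic. The sketch below uses only the numbers reported above; the names `SPLIT_SIZES` and `split_shares` are illustrative and not part of the repository.

```python
# Question counts copied from the Dataset Statistics table.
SPLIT_SIZES = {"train": 439_302, "dev": 24_111, "test": 24_084}


def split_shares(sizes):
    """Return each split's fraction of the total number of questions."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total = sum(SPLIT_SIZES.values())
shares = split_shares(SPLIT_SIZES)

print(f"total questions: {total}")  # total questions: 487497
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

This works out to roughly a 90/5/5 split across training, development, and test.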

Citation

If you find the dataset helpful, please consider citing our paper.

@inproceedings{10.1145/3626772.3657891,
  author = {Piryani, Bhawna and Mozafari, Jamshid and Jatowt, Adam},
  title = {ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages},
  year = {2024},
  isbn = {9798400704314},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3626772.3657891},
  doi = {10.1145/3626772.3657891},
  booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  pages = {2038–2048},
  numpages = {11},
  keywords = {heritage collections, ocr text, question answering},
  location = {Washington DC, USA},
  series = {SIGIR '24}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Owner

Name: DataScienceUIBK
Login: DataScienceUIBK
Kind: organization

Repositories: 1
Profile: https://github.com/DataScienceUIBK

GitHub Events

Total

Issues event: 2
Watch event: 5
Issue comment event: 4
Push event: 2

Last Year

Issues event: 2
Watch event: 5
Issue comment event: 4
Push event: 2
