llms4subjects

The official SemEval 2025 Task 5 - LLMs4Subjects - Shared Task Dataset repository

https://github.com/jd-coderepos/llms4subjects

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 6 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.6%) to scientific vocabulary

Keywords

ai artificial-intelligence dataset large-language-models llms natural-language-processing natural-language-understanding nlp semeval shared-task subject-indexing

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: jd-coderepos
License: cc-by-4.0
Default Branch: main
Homepage: https://sites.google.com/view/llms4subjects/
Size: 1.02 GB

Statistics

Stars: 6
Watchers: 2
Forks: 7
Open Issues: 0
Releases: 1

Created over 2 years ago · Last pushed 12 months ago

Metadata Files

Readme License Citation

Welcome to the LLMs4Subjects SemEval 2025 Shared Task Dataset Repository!

About

The LLMs4Subjects shared task invites the research community to develop cutting-edge, LLM-based semantic solutions for automated subject indexing of the ever-growing collection of technical records, written in various natural languages, held by TIB, the German National Library of Science and Technology. This task, also known as subject tagging or subject classification, relies on the GND (Gemeinsame Normdatei in German, Integrated Authority File in English), an international authority file used primarily by German-speaking libraries to catalog and link information on people, organizations, topics, and works.

To support the development of systems for the LLMs4Subjects shared task, we provide participants with two types of datasets:

A curated, human-readable form of the GND subjects taxonomy.
A large-scale dataset of technical records from TIB's open-access collection, annotated with GND subjects, available in both English and German.

Although TIB's technical records span multiple languages, this shared task focuses on the most representative collections, those in English and German. We used TIB's open-access catalog of technical records (https://www.tib.eu/en/services/open-data), known as TIBKAT, and restricted it to records that include abstract metadata. This collection can be browsed on the TIB portal. While the overall collection includes various types of technical records, this shared task focuses on the most representative types: article, book, conference, report, and thesis. The official shared task dataset therefore comprises only records of these five types.
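The selection criteria in this paragraph (abstract present, English or German, one of five record types) amount to a simple filter. The following is a minimal sketch only: the `Record` class, its field names, and the language codes `en` and `de` are hypothetical and do not reflect the actual file format of the released dataset.

```python
from dataclasses import dataclass

# Languages and record types named in the task description.
# The short codes "en"/"de" are an assumption made for this sketch.
TASK_LANGUAGES = {"en", "de"}
TASK_RECORD_TYPES = {"article", "book", "conference", "report", "thesis"}


@dataclass
class Record:
    """Hypothetical in-memory record; field names are illustrative only."""
    title: str
    abstract: str
    language: str
    record_type: str


def in_shared_task(record: Record) -> bool:
    """Return True if the record has an abstract, is in English or German,
    and belongs to one of the five record types."""
    return (
        bool(record.abstract.strip())
        and record.language.lower() in TASK_LANGUAGES
        and record.record_type.lower() in TASK_RECORD_TYPES
    )


# Made-up example records, not taken from TIBKAT.
records = [
    Record("Sample A", "An abstract.", "en", "book"),
    Record("Sample B", "", "de", "thesis"),            # no abstract
    Record("Sample C", "Un résumé.", "fr", "article"),  # other language
    Record("Sample D", "Ein Abstract.", "de", "report"),
]

selected = [r for r in records if in_shared_task(r)]
print([r.title for r in selected])
```

Records that fail the filter correspond, in spirit, to what the repository places in its supplementary datasets rather than in the official shared task data.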

For the convenience of our participants, both the GND and the TIBKAT datasets have been reorganized, formatted with human-readable tags, and released as the official shared task dataset in this repository. Standardized library taxonomies and collections often rely on long-established identifier schemes and are dense with codes, and processing and interpreting these codes can be time-consuming. Therefore, in consultation with TIB subject matter experts, we have preprocessed both the GND and TIBKAT datasets, converting their fine-grained coding into human-readable formats. This should help LLMs4Subjects participants download the relevant data and get started right away.

This shared task offers the research community an opportunity to creatively develop LLMs for subject tagging of technical records based on the GND taxonomy. Systems need to demonstrate bilingual language modeling by processing technical records in both German and English. Moreover, successful solutions may be integrated directly into the operational workflows of TIB (Leibniz Information Centre for Science and Technology University Library).

Repositories Included

shared-task-datasets: This subfolder includes the human-readable formatted GND subjects taxonomy and the training and development sets for the TIBKAT records. Participants in the LLMs4Subjects shared task are requested to download the relevant files from this folder for system development.
supplementary-datasets: This subfolder includes all excluded data from the open-access GND and TIBKAT datasets that are not part of the LLMs4Subjects shared task. For instance, this may include records from TIBKAT in languages other than English or German or records where a specific record type is too sparse. Although not part of the official shared task, these records are available for participants to use as needed.
shared-task-eval-script: This subfolder contains the official evaluation script used to generate the quantitative evaluation results for LLMs4Subjects participant team submissions.
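
This README does not state which metrics the official evaluation script reports. Purely as an illustration of how ranked subject predictions are often scored against gold annotations, precision@k and recall@k can be sketched as below. The function name and the placeholder subject identifiers are made up for this example and are not taken from the official script or from the GND.

```python
from typing import Sequence, Set, Tuple


def precision_recall_at_k(
    predicted: Sequence[str], gold: Set[str], k: int
) -> Tuple[float, float]:
    """Score the top-k predicted subjects against the gold subjects.

    Assumes `predicted` is ranked best-first and contains no duplicates.
    """
    top_k = predicted[:k]
    hits = sum(1 for subject in top_k if subject in gold)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Placeholder identifiers, not real GND subjects.
predicted = ["subj-1", "subj-2", "subj-3", "subj-4"]
gold = {"subj-1", "subj-4", "subj-9"}

for k in (2, 4):
    p, r = precision_recall_at_k(predicted, gold, k)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```

For authoritative numbers, use the script in shared-task-eval-script rather than this sketch.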

Contact

llms4subjects [at] gmail.com

Citation

The recommended citation for this dataset resource is provided below. If you find this resource useful, please consider citing it.

```bibtex
@dataset{dsouza2024llms4subjects,
  author    = {Jennifer D'Souza and Sameer Sadruddin and Holger Israel and Mathias Begoin and Diana Slawig},
  title     = {The SemEval 2025 LLMs4Subjects Shared Task Dataset},
  year      = {2024},
  version   = {1.0.1},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.15185475},
  url       = {https://doi.org/10.5281/zenodo.15185475},
  abstract  = {To support the development of systems for the LLMs4Subjects shared task, we provide participants with two types of datasets: (1) a curated, human-readable form of the GND subjects taxonomy, and (2) a large-scale dataset of technical records from TIB's open-access collection, annotated with GND subjects, available in both English and German.}
}
```

Acknowledgements

The LLMs4Subjects shared task, organized as SemEval 2025 Task 5, is jointly supported by the SCINEXT project (BMBF, German Federal Ministry of Education and Research, Grant ID: 01lS22070) and the NFDI4DataScience initiative (DFG, German Research Foundation, Grant ID: 460234259).

The dataset is archived on Zenodo. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Owner

Login: jd-coderepos
Kind: user

Repositories: 1
Profile: https://github.com/jd-coderepos

GitHub Events

Total

Release event: 2
Watch event: 5
Push event: 36
Pull request event: 7
Fork event: 6
Create event: 2

Last Year

Release event: 2
Watch event: 5
Push event: 36
Pull request event: 7
Fork event: 6
Create event: 2

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 0
Total pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: 5 days
Total issue authors: 0
Total pull request authors: 2
Average comments per issue: 0
Average comments per pull request: 0.5
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: 5 days
Issue authors: 0
Pull request authors: 2
Average comments per issue: 0
Average comments per pull request: 0.5
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0
