llms4subjects

The official SemEval 2025 Task 5 - LLMs4Subjects - Shared Task Dataset repository

https://github.com/jd-coderepos/llms4subjects

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.6%) to scientific vocabulary

Keywords

ai artificial-intelligence dataset large-language-models llms natural-language-processing natural-language-understanding nlp semeval shared-task subject-indexing
Last synced: 6 months ago

Repository

Statistics
  • Stars: 6
  • Watchers: 2
  • Forks: 7
  • Open Issues: 0
  • Releases: 1
Created almost 2 years ago · Last pushed 7 months ago
Metadata Files
Readme License Citation

README.md

Welcome to the LLMs4Subjects SemEval 2025 Shared Task Dataset Repository!

About

The LLMs4Subjects shared task invites the research community to develop cutting-edge, LLM-based semantic solutions for automated subject indexing of the ever-growing collection of technical records, written in various natural languages, held by TIB, the German National Library of Science and Technology. This task, also known as subject tagging or subject classification, leverages the GND (Gemeinsame Normdatei in German, or Integrated Authority File in English), an international authority file primarily used by German-speaking libraries to catalog and link information on people, organizations, topics, and works.

To support the development of systems for the LLMs4Subjects shared task, we provide participants with two types of datasets:

  1. A curated, human-readable form of the GND subjects taxonomy.
  2. A large-scale dataset of technical records from TIB's open-access collection, annotated with GND subjects, available in both English and German.

Although TIB's technical records span multiple languages, this shared task focuses on the most representative collections, which are in English and German. We used TIB's open-access catalog of technical records (https://www.tib.eu/en/services/open-data), known as TIBKAT, and restricted it to records that include abstract metadata. This collection can also be browsed dynamically on the TIB portal. While the overall collection includes various types of technical records, this shared task focuses on the most representative types: article, book, conference, report, and thesis. The official shared task dataset therefore comprises only records of these five types.
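The scoping rules in the paragraph above (two languages, five record types, abstract required) can be sketched as a simple filter. This is only an illustration, not the organizers' preprocessing code: the record layout, the field names `record_type`, `language`, and `abstract`, and the language codes `en` and `de` are all assumptions. Only the five record types and the two languages come from the text.

```python
from typing import Iterable, Iterator

# The five record types and two languages named in this README.
RECORD_TYPES = {"article", "book", "conference", "report", "thesis"}
LANGUAGES = {"en", "de"}  # assumed codes for English and German


def select_records(records: Iterable[dict]) -> Iterator[dict]:
    """Yield records that fall within the shared task's scope.

    The field names used here are hypothetical; adapt them to the
    structure of the files you actually download.
    """
    for record in records:
        if (
            record.get("record_type") in RECORD_TYPES
            and record.get("language") in LANGUAGES
            and record.get("abstract")  # keep only records with an abstract
        ):
            yield record


# Toy input, invented purely for illustration.
sample = [
    {"record_type": "book", "language": "en", "abstract": "A survey of ..."},
    {"record_type": "book", "language": "fr", "abstract": "Un aperçu ..."},
    {"record_type": "periodical", "language": "de", "abstract": "Eine ..."},
    {"record_type": "thesis", "language": "de", "abstract": ""},
    {"record_type": "thesis", "language": "de", "abstract": "Diese Arbeit ..."},
]
selected = list(select_records(sample))
```

With this toy input, only the English book and the German thesis that has an abstract pass the filter.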

For the convenience of our participants, both the GND and the TIBKAT datasets have been reorganized, formatted with human-readable tags, and released as the official shared task dataset in this repository. Standardized library taxonomies and collections often rely on long-established identifier schemes and are full of codes, and processing and interpreting these codes can be time-consuming. Therefore, in consultation with TIB subject matter experts, we have preprocessed both the GND and TIBKAT datasets, converting their fine-grained coding into human-readable formats. This should help LLMs4Subjects participants download the relevant data and get started right away.

This shared task offers the research community an opportunity to creatively develop LLMs for subject tagging of technical records based on the GND taxonomy. Systems need to demonstrate bilingual language modeling by processing technical records in both German and English. Moreover, successful solutions may be integrated directly into the operational workflows of the TIB Leibniz Information Centre for Science and Technology University Library.

Repositories Included

  • shared-task-datasets: This subfolder includes the human-readable formatted GND subjects taxonomy and the training and development sets for the TIBKAT records. Participants in the LLMs4Subjects shared task are requested to download the relevant files from this folder for system development.

  • supplementary-datasets: This subfolder includes all excluded data from the open-access GND and TIBKAT datasets that are not part of the LLMs4Subjects shared task. For instance, this may include records from TIBKAT in languages other than English or German or records where a specific record type is too sparse. Although not part of the official shared task, these records are available for participants to use as needed.

  • shared-task-eval-script: This subfolder contains the official evaluation script used to generate the quantitative evaluation results for LLMs4Subjects participant team submissions.
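This README does not say which metrics the official evaluation script computes, so the following is not a reproduction of it. As a rough sketch of what quantitative evaluation of ranked subject suggestions can look like, the example below assumes precision@k and recall@k, which are common choices for this kind of task. The function name and the subject identifiers are placeholders invented for illustration.

```python
def precision_recall_at_k(
    predicted: list[str], gold: set[str], k: int
) -> tuple[float, float]:
    """Compute precision@k and recall@k for a single record.

    predicted: subject identifiers ranked best-first by a system.
    gold: subject identifiers assigned by human annotators.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # Deduplicate the top-k so a repeated suggestion is counted once.
    hits = len(set(predicted[:k]) & gold)
    precision = hits / k  # conventional definition: divide by k
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Placeholder identifiers, not real GND subjects.
predicted = ["subj-A", "subj-B", "subj-C", "subj-D"]
gold = {"subj-B", "subj-D", "subj-E"}
p, r = precision_recall_at_k(predicted, gold, k=2)
```

Here the top two suggestions contain one of the three gold subjects, so precision@2 is 1/2 and recall@2 is 1/3. For official results, use the script in the shared-task-eval-script subfolder.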

Contact

llms4subjects [at] gmail.com

Citation

The recommended citation for this dataset resource is provided below. If you find this resource useful, please consider citing it.

```bibtex
@dataset{dsouza2024llms4subjects,
  author    = {Jennifer D'Souza and Sameer Sadruddin and Holger Israel and Mathias Begoin and Diana Slawig},
  title     = {The SemEval 2025 LLMs4Subjects Shared Task Dataset},
  year      = {2024},
  version   = {1.0.1},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.15185475},
  url       = {https://doi.org/10.5281/zenodo.15185475},
  abstract  = {To support the development of systems for the LLMs4Subjects shared task, we provide participants with two types of datasets: (1) a curated, human-readable form of the GND subjects taxonomy, and (2) a large-scale dataset of technical records from TIB's open-access collection, annotated with GND subjects, available in both English and German.}
}
```

Acknowledgements

The LLMs4Subjects shared task, organized as SemEval 2025 Task 5, is jointly supported by the SCINEXT project (BMBF, German Federal Ministry of Education and Research, Grant ID: 01lS22070) and the NFDI4DataScience initiative (DFG, German Research Foundation, Grant ID: 460234259).

The dataset is archived on Zenodo (DOI: 10.5281/zenodo.15185475). This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).

Owner

  • Login: jd-coderepos
  • Kind: user

GitHub Events

Total
  • Release event: 2
  • Watch event: 5
  • Push event: 36
  • Pull request event: 7
  • Fork event: 6
  • Create event: 2
Last Year
  • Release event: 2
  • Watch event: 5
  • Push event: 36
  • Pull request event: 7
  • Fork event: 6
  • Create event: 2

Issues and Pull Requests

All Time
  • Total issues: 0
  • Total pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: 5 days
  • Total issue authors: 0
  • Total pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.5
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: 5 days
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.5
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • osma (2)
  • lisa-kluge (1)
  • mfakaehler (1)