llms4subjects
The official SemEval 2025 Task 5 - LLMs4Subjects - Shared Task Dataset repository
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 6 DOI reference(s) in README
- ✓ Academic publication links: Links to: zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.6%) to scientific vocabulary
Keywords
Repository
Basic Info
- Host: GitHub
- Owner: jd-coderepos
- License: cc-by-4.0
- Default Branch: main
- Homepage: https://sites.google.com/view/llms4subjects/
- Size: 1.02 GB
Statistics
- Stars: 6
- Watchers: 2
- Forks: 7
- Open Issues: 0
- Releases: 1
Topics
Metadata Files
README.md
Welcome to the LLMs4Subjects SemEval 2025 Shared Task Dataset Repository!
About
The LLMs4Subjects shared task invites the research community to develop cutting-edge, LLM-based semantic solutions for automated subject indexing of the ever-growing collection of technical records, in various natural languages, held by TIB, the German National Library of Science and Technology. This task, also known as subject tagging or subject classification, leverages the GND (Gemeinsame Normdatei in German, or Integrated Authority File in English), an international authority file primarily used by German-speaking libraries to catalog and link information on people, organizations, topics, and works.
To support the development of systems for the LLMs4Subjects shared task, we provide participants with two types of datasets:
- Curated, human-readable form of the GND subjects taxonomy.
- A large-scale dataset of technical records from TIB's open-access collection, annotated with GND subjects, available in both English and German.
Although TIB's technical records span multiple languages, this shared task focuses on the most representative collections, those in English and German. We used TIB's open-access catalog of technical records (https://www.tib.eu/en/services/open-data), known as TIBKAT, and restricted it to records that include abstract metadata. This collection can be browsed on the TIB portal. While the overall collection includes various types of technical records, this shared task focuses on the most representative types: article, book, conference, report, and thesis. The official shared task dataset therefore comprises only records of these five types.
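The selection criteria described above (an abstract is present, the language is English or German, and the record is one of the five types) can be illustrated with a minimal sketch. It is not the organizers' preprocessing code: the field names `type`, `language`, and `abstract`, the language codes `en` and `de`, and the sample records are all hypothetical and chosen only for illustration.

```python
from typing import Iterable

# Record types and languages named in the text above.
# The language codes "en"/"de" are an assumption, not the dataset's actual encoding.
SHARED_TASK_TYPES = {"article", "book", "conference", "report", "thesis"}
SHARED_TASK_LANGUAGES = {"en", "de"}


def in_shared_task(record: dict) -> bool:
    """Return True if a record has an abstract, a kept type, and a kept language."""
    # Hypothetical field names: "abstract", "type", "language".
    has_abstract = bool((record.get("abstract") or "").strip())
    return (
        has_abstract
        and record.get("type") in SHARED_TASK_TYPES
        and record.get("language") in SHARED_TASK_LANGUAGES
    )


def select_records(records: Iterable[dict]) -> list[dict]:
    """Keep only the records that satisfy all three criteria."""
    return [r for r in records if in_shared_task(r)]


# Made-up sample records, for illustration only.
sample = [
    {"type": "book", "language": "en", "abstract": "A study of bridges."},
    {"type": "book", "language": "fr", "abstract": "Une étude des ponts."},
    {"type": "thesis", "language": "de", "abstract": ""},
    {"type": "periodical", "language": "de", "abstract": "Eine Studie."},
]
selected = select_records(sample)
print(len(selected))
```

In this sketch, records that fail any one criterion are dropped; in the repository itself, such excluded records are described as living in the supplementary data rather than being discarded.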
For the convenience of our participants, both the GND and the TIBKAT datasets have been reorganized, formatted with human-readable tags, and released as the official shared task dataset in this repository. Standardized library taxonomies and collections often rely on long-established identifier schemes and are filled with codes, and processing and interpreting these codes can be time-consuming. Therefore, in consultation with TIB subject matter experts, we have preprocessed both the GND and TIBKAT datasets, converting their fine-grained coding into human-readable formats. This should help LLMs4Subjects participants download the relevant data and get started right away.
This shared task offers the research community an opportunity to creatively develop LLMs for subject tagging of technical records based on the GND taxonomy. Systems need to demonstrate bilingual language modeling by processing technical records in both German and English. Moreover, successful solutions may be integrated directly into the operational workflows of the TIB Leibniz Information Centre for Science and Technology University Library.
Repositories Included
- shared-task-datasets: This subfolder includes the human-readable formatted GND subjects taxonomy and the training and development sets for the TIBKAT records. Participants in the LLMs4Subjects shared task are requested to download the relevant files from this folder for system development.
- supplementary-datasets: This subfolder includes all data from the open-access GND and TIBKAT datasets that was excluded from the LLMs4Subjects shared task. For instance, this may include TIBKAT records in languages other than English or German, or records of a type that is too sparse. Although not part of the official shared task, these records are available for participants to use as needed.
- shared-task-eval-script: This subfolder contains the official evaluation script used to generate the quantitative evaluation results for LLMs4Subjects participant team submissions.
Contact
llms4subjects [at] gmail.com
Citation
The recommended citation for this dataset resource is provided below. If you find this resource useful, please consider citing it.
```bibtex
@dataset{dsouza2024llms4subjects,
  author    = {Jennifer D'Souza and Sameer Sadruddin and Holger Israel and Mathias Begoin and Diana Slawig},
  title     = {The SemEval 2025 LLMs4Subjects Shared Task Dataset},
  year      = {2024},
  version   = {1.0.1},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.15185475},
  url       = {https://doi.org/10.5281/zenodo.15185475},
  abstract  = {To support the development of systems for the LLMs4Subjects shared task, we provide participants with two types of datasets: (1) a curated, human-readable form of the GND subjects taxonomy, and (2) a large-scale dataset of technical records from TIB's open-access collection, annotated with GND subjects, available in both English and German.}
}
```
Acknowledgements
The LLMs4Subjects shared task, organized as SemEval 2025 Task 5, is jointly supported by the SCINEXT project (BMBF, German Federal Ministry of Education and Research, Grant ID: 01lS22070) and the NFDI4DataScience initiative (DFG, German Research Foundation, Grant ID: 460234259).
The dataset is archived on Zenodo. This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Owner
- Login: jd-coderepos
- Kind: user
- Repositories: 1
- Profile: https://github.com/jd-coderepos
GitHub Events
Total
- Release event: 2
- Watch event: 5
- Push event: 36
- Pull request event: 7
- Fork event: 6
- Create event: 2
Last Year
- Release event: 2
- Watch event: 5
- Push event: 36
- Pull request event: 7
- Fork event: 6
- Create event: 2
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 0
- Total pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: 5 days
- Total issue authors: 0
- Total pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 0.5
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: 5 days
- Issue authors: 0
- Pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 0.5
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- osma (2)
- lisa-kluge (1)
- mfakaehler (1)
