grobid-datacat-trainingdata

Training datasets for GROBID sale catalogue models.

https://github.com/datacatalogue/grobid-datacat-trainingdata

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (4.9%) to scientific vocabulary

Keywords

grobid tei-xml training-data xml

Repository

Basic Info
  • Host: GitHub
  • Owner: DataCatalogue
  • License: cc-by-4.0
  • Language: Python
  • Default Branch: main
  • Size: 98.4 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created almost 4 years ago · Last pushed over 3 years ago
Metadata Files
Readme License Citation

README.md

Training datasets for GROBID sale catalogue models

Each directory of this repository contains datasets created to train GROBID sale catalogue models. Datasets are grouped first by the institution that holds the original documents, then by author or auction house.
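As an illustration of this layout, the sketch below indexes annotated files by holding institution and by author or auction house. It assumes a two-level layout of the form <institution>/<author-or-auction-house>/ with .xml files somewhere below it. The function name index_training_files is hypothetical and not part of the repository, and the real directory layout may differ.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Tuple


def index_training_files(root: Path) -> Dict[Tuple[str, str], List[Path]]:
    """Group .xml files by (institution, author or auction house).

    Assumption: files live under <root>/<institution>/<author>/...;
    anything shallower than that is skipped.
    """
    index = defaultdict(list)
    for xml_file in sorted(root.rglob("*.xml")):
        parts = xml_file.relative_to(root).parts
        if len(parts) < 3:
            # Not inside an <institution>/<author>/ directory pair.
            continue
        index[(parts[0], parts[1])].append(xml_file)
    return dict(index)
```

Calling index_training_files(Path(".")) from the repository root would return a dictionary keyed by (institution, author) pairs. Under this assumed layout, other XML files at least two directories deep would be picked up as well, so a real script would probably restrict the search to the dataset directories.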

Annotated files are in the TEI-XML format.

Naming convention

  • BnF files are named with their Gallica ARK identifier.
  • INHA files are named with the digital identifier ("identifiant numérique") given in their online catalogue record.

GROBID models

  • Segmentation: the segmentation model produces a high-level segmentation of the catalogues.

Data quality

Before being pushed to the main branch, annotated files have been proofread at least once and are validated against an XSD by a GitHub action.
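The validation step can be sketched with lxml, which the repository's XML checker lists as a dependency. This is a minimal sketch, not the repository's actual checker: the function names load_schema and validate_file are hypothetical, and the location of the XSD is left to the caller.

```python
from pathlib import Path
from typing import List

from lxml import etree


def load_schema(xsd_path: Path) -> etree.XMLSchema:
    """Compile an XSD file into a reusable lxml schema object."""
    return etree.XMLSchema(etree.parse(str(xsd_path)))


def validate_file(xml_path: Path, schema: etree.XMLSchema) -> List[str]:
    """Return error messages for one file; an empty list means it is valid."""
    try:
        doc = etree.parse(str(xml_path))
    except etree.XMLSyntaxError as exc:
        # The file is not even well-formed XML, so schema validation is skipped.
        return [f"{xml_path}: not well-formed: {exc}"]
    if schema.validate(doc):
        return []
    # error_log holds the entries from the most recent validate() call.
    return [f"{xml_path}:{err.line}: {err.message}" for err in schema.error_log]
```

A continuous-integration job could call validate_file on every annotated file and fail the build when any returned list is non-empty.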

Toolbox

This repository also contains a set of tools that can be used on the training sets.

  • PDF Preprocessing
  • Quality assessment
  • XML validity checker (used by a GitHub action)

Owner

  • Name: DataCatalogue
  • Login: DataCatalogue
  • Kind: organization
  • Location: France

GitHub organization for the research project DataCatalogue (Inria - BnF - INHA).

Citation (CITATION.CFF)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Romary"
  given-names: "Laurent"
  orcid: "https://orcid.org/0000-0002-5659-4675"
- family-names: "Scheithauer"
  given-names: "Hugo"
  orcid: "https://orcid.org/0000-0000-0000-0000"
title: "TrainingData"
version: 1.0.0
date-released: 2021-12-20
url: "https://github.com/DataCatalogue/TrainingData"

Dependencies

toolbox/xml_checker/requirements.txt pypi
  • click ==8.0.4
  • lxml ==4.8.0
.github/workflows/xml_checker.yaml actions
  • JasonEtco/create-an-issue v2.4.0 composite
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • andstor/file-existence-action v1 composite