grobid-datacat-trainingdata

Training datasets for GROBID sale catalogues models.

https://github.com/datacatalogue/grobid-datacat-trainingdata

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: found
✓ codemeta.json file: found
✓ .zenodo.json file: found
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (4.9%) to scientific vocabulary

Keywords

grobid tei-xml training-data xml

Repository

Basic Info

Host: GitHub
Owner: DataCatalogue
License: cc-by-4.0
Language: Python
Default Branch: main
Size: 98.4 MB

Statistics

Stars: 0
Watchers: 0
Forks: 1
Open Issues: 0
Releases: 0

Created over 4 years ago · Last pushed almost 4 years ago

Datasets for training GROBID sale catalogue models

Each directory in this repository contains datasets created to train GROBID sale catalogue models. Datasets are first grouped by the institution that holds the original documents, then organized by author or auction house.
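As a rough illustration of that layout, the sketch below groups the XML files of a local checkout by their top-level directory. The directory and file names used in the demo (`BnF`, `INHA`, `house_a`, `doc1.xml`, and so on) are hypothetical placeholders, and `group_by_top_level` is an illustrative helper, not part of the repository's toolbox.

```python
from collections import defaultdict
from pathlib import Path
import tempfile


def group_by_top_level(root):
    """Map each top-level directory name to the XML files found below it."""
    root = Path(root)
    groups = defaultdict(list)
    for path in sorted(root.rglob("*.xml")):
        rel = path.relative_to(root)
        # Skip XML files sitting at the root and hidden directories such as .github
        if len(rel.parts) < 2 or rel.parts[0].startswith("."):
            continue
        groups[rel.parts[0]].append(rel.as_posix())
    return dict(groups)


# Demo on a throwaway tree; every name below is a hypothetical placeholder.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "BnF/house_a/doc1.xml",
        "BnF/house_b/doc2.xml",
        "INHA/house_c/doc3.xml",
    ):
        target = Path(tmp, name)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("<TEI/>", encoding="utf-8")

    demo = group_by_top_level(tmp)
    for holder, files in demo.items():
        print(holder, len(files))
```

On a real checkout, pass the repository root instead of the temporary directory. Any non-dataset directory that contains XML files would show up as its own group, so the result may need filtering.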

Annotated files are in the TEI-XML format.

Naming convention

BnF files are named with their Gallica ark identifier.
INHA files are named with their digital identifier ("identifiant numérique") provided in their online notice.

GROBID models

Segmentation: the segmentation model aims to produce a high-level segmentation of the catalogues.

Data quality

Before being pushed to the main branch, annotated files are proofread at least once and validated against an XSD schema by a GitHub Action.
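The validation step can be approximated locally with lxml, which is listed among the repository's dependencies. The following is a minimal sketch, not the repository's actual checker: the schema `TOY_XSD`, the element names `text`, `entry` and `item`, and the helpers `load_schema` and `validate` are all assumptions made for illustration.

```python
from io import BytesIO

from lxml import etree

# Hypothetical miniature schema; the repository's real XSD is not reproduced here.
TOY_XSD = b"""<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="text">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="entry" type="xs:string" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def load_schema(xsd_bytes):
    """Compile an XSD document into an lxml schema object."""
    return etree.XMLSchema(etree.parse(BytesIO(xsd_bytes)))


def validate(xml_bytes, schema):
    """Return a list of error messages; an empty list means the file is valid."""
    try:
        doc = etree.parse(BytesIO(xml_bytes))
    except etree.XMLSyntaxError as exc:
        return [f"not well-formed: {exc}"]
    if schema.validate(doc):
        return []
    return [f"line {err.line}: {err.message}" for err in schema.error_log]


schema = load_schema(TOY_XSD)
good = b"<text><entry>Lot 1</entry></text>"
bad = b"<text><item>Lot 1</item></text>"  # <item> is not allowed by the toy schema

print(validate(good, schema))
print(validate(bad, schema))
```

Returning messages instead of raising lets a caller report every invalid file in one pass, which suits a check that runs over a whole directory of annotated files.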

Toolbox

This repository also contains a set of tools that can be used on the training sets.

PDF Preprocessing
Quality assessment
XML validity checker (used by a GitHub Action)

Owner

Name: DataCatalogue
Login: DataCatalogue
Kind: organization
Location: France

Repositories: 7
Profile: https://github.com/DataCatalogue

GitHub organization for the research project DataCatalogue (Inria - BnF - INHA).

Citation (CITATION.CFF)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Romary"
  given-names: "Laurent"
  orcid: "https://orcid.org/0000-0002-5659-4675"
- family-names: "Scheithauer"
  given-names: "Hugo"
  orcid: "https://orcid.org/0000-0000-0000-0000"
title: "TrainingData"
version: 1.0.0
date-released: 2021-12-20
url: "https://github.com/DataCatalogue/TrainingData"

Dependencies

toolbox/xml_checker/requirements.txt pypi

click ==8.0.4
lxml ==4.8.0

.github/workflows/xml_checker.yaml actions

JasonEtco/create-an-issue v2.4.0 composite
actions/checkout v2 composite
actions/setup-python v2 composite
andstor/file-existence-action v1 composite
