ladas

Layout Analysis Dataset with SegmOnto (LADaS)

https://github.com/defi-colaf/ladas

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.8%) to scientific vocabulary
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: DEFI-COLaF
  • License: cc-by-4.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 2.81 GB
Statistics
  • Stars: 20
  • Watchers: 5
  • Forks: 1
  • Open Issues: 2
  • Releases: 5
Created about 2 years ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

Layout Analysis Dataset with SegmOnto (LADaS)

DOI License: CC BY 4.0

LADaS, created by the ALMANaCH project team at Inria and continued in partnership with other researchers, is a multi-document, diachronic layout analysis dataset. It includes:

  • Monographs from the Bibliothèque nationale de France (17th century - today);
  • PhD theses in various fields (not only STEM, 20th-21st century);
  • Sales catalogs (for manuscripts and art pieces) in various fields (18th-20th century);
  • Noisy digitizations (with fingers visible in the scan, for example; 20th-21st century);
  • Academic papers (mostly Humanities and Social Sciences, 19th-21st century);
  • Magazines about technology and video games, from the 1920s to 2010;
  • Miscellaneous documents from various other sources.

The annotations are in YOLOv8 txt format, one bounding box per line (class center_x center_y width height).
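As an illustration of the YOLOv8 txt format, here is a minimal sketch that parses one label line and converts it to pixel corners. It relies on the usual YOLO convention that coordinates are normalized to the image size (values between 0 and 1). The names `YoloBox`, `parse_yolo_line` and `to_pixel_corners` are hypothetical helpers written for this example and are not part of the repository.

```python
from typing import NamedTuple, Tuple


class YoloBox(NamedTuple):
    """One annotation line: class id plus a normalized, center-based box."""
    class_id: int
    center_x: float
    center_y: float
    width: float
    height: float


def parse_yolo_line(line: str) -> YoloBox:
    """Parse a single 'class center_x center_y width height' line."""
    class_id, center_x, center_y, width, height = line.split()
    return YoloBox(int(class_id), float(center_x), float(center_y),
                   float(width), float(height))


def to_pixel_corners(box: YoloBox, image_width: int,
                     image_height: int) -> Tuple[float, float, float, float]:
    """Convert a normalized center-based box to (left, top, right, bottom) in pixels."""
    left = (box.center_x - box.width / 2) * image_width
    top = (box.center_y - box.height / 2) * image_height
    right = (box.center_x + box.width / 2) * image_width
    bottom = (box.center_y + box.height / 2) * image_height
    return left, top, right, bottom


if __name__ == "__main__":
    # Made-up example line, not taken from the dataset.
    box = parse_yolo_line("3 0.5 0.5 0.5 0.25")
    print(box.class_id, to_pixel_corners(box, 1000, 2000))
```

The class id is an index into the label list used for training; the mapping from ids to SegmOnto labels is defined by the dataset configuration and is not reproduced here.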

The documents are mostly in Latin script, and the language is mostly French, with some representation of the main Western academic languages.

Annotation

Annotation was carried out using the SegmOnto vocabulary. An annotation guide is available here.

More details about some subsets

./figures/corpus.png

Last update of the plot: 15/12/2023

Structure

The data can be found in ./data. Each subset has its own folder, in case you want to train cross-genre.

A script, collate.sh, gathers the subsets into a single directory with train/dev/test folders for YOLOv8 training.
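To make the idea of collating concrete, here is a rough Python sketch of merging per-subset folders into one train/dev/test tree. It is not a port of collate.sh. It assumes a layout of `<data_dir>/<subset>/<split>/...`, which is an assumption made for this example and may not match the repository. The function name `collate` and the file-name prefixing scheme are hypothetical.

```python
import shutil
from pathlib import Path
from typing import Sequence


def collate(data_dir: Path, output_dir: Path,
            splits: Sequence[str] = ("train", "dev", "test")) -> int:
    """Copy every file under <data_dir>/<subset>/<split>/ into <output_dir>/<split>/.

    ASSUMPTION: each subset folder contains one folder per split. File names
    are prefixed with the subset name so that pages with the same name in
    different subsets do not overwrite each other. Images and label files
    receive the same prefix, so their stems still match.
    Returns the number of files copied.
    """
    copied = 0
    for subset_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        for split in splits:
            split_dir = subset_dir / split
            if not split_dir.is_dir():
                continue  # this subset has no such split
            for source in sorted(split_dir.rglob("*")):
                if not source.is_file():
                    continue
                relative = source.relative_to(split_dir)
                target = (output_dir / split / relative.parent
                          / f"{subset_dir.name}_{source.name}")
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(source, target)
                copied += 1
    return copied
```

Prefixing with the subset name is one way to avoid collisions. Keeping a per-subset subfolder inside each split would work equally well if the training tool searches for files recursively.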

Partners

|Funding|Project|Comment|
|---|---|---|
|||Originally established and funded as part of the DEFI COLaF (2023-2027).|
|||Funded by the European Union under Grant Agreement n. 101132163. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.|
||The Geographic Horizon of writers|Funded by FNS-Spark project N220833.|

Licence

This dataset is licensed under CC BY 4.0.

Citation

See the CITATION.CFF file

Contact

Thibault Clérice (th[a-z]+.cle[a-z]+ [at] inria.fr)

Owner

  • Name: DEFI-COLaF
  • Login: DEFI-COLaF
  • Kind: organization

GitHub Events

Total
  • Release event: 1
  • Watch event: 2
  • Push event: 17
  • Pull request event: 2
  • Fork event: 2
  • Create event: 2
Last Year
  • Release event: 1
  • Watch event: 2
  • Push event: 17
  • Pull request event: 2
  • Fork event: 2
  • Create event: 2