gallicalbum

Dataset of blank pages from Gallica

https://github.com/hugoschtr/gallicalbum

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 2 committers (50.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.1%) to scientific vocabulary
Last synced: 7 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: HugoSchtr
  • Language: Python
  • Default Branch: main
  • Size: 6.49 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Created over 2 years ago · Last pushed over 2 years ago

README.md

Gallicalbum

Gallicalbum is a dataset of blank pages taken from manuscripts digitized and distributed via the Gallica web portal.

The dataset is distributed as a CSV file listing the URLs that give access to the images on the IIIF server offered by Gallica.
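To make the structure of such URLs concrete, here is a minimal sketch that splits a IIIF Image API URL into its request parameters (region, size, rotation, quality and format). The example URL, its host and its identifier are hypothetical and only follow the general IIIF Image API pattern; the URLs actually listed in the CSV file may differ.

```python
from urllib.parse import urlsplit


def parse_iiif_image_url(url):
    """Split a IIIF Image API URL into its request parameters.

    Assumes the generic pattern
    {prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}.
    """
    path = urlsplit(url).path
    # The last four path segments are region, size, rotation and
    # quality.format; everything before them is prefix + identifier.
    prefix_and_id, region, size, rotation, quality_format = path.rsplit("/", 4)
    quality, _, image_format = quality_format.partition(".")
    return {
        "prefix_and_identifier": prefix_and_id,
        "region": region,
        "size": size,
        "rotation": rotation,
        "quality": quality,
        "format": image_format,
    }


# Hypothetical URL, for illustration only.
example_url = "https://iiif.example.org/iiif/ark:/12148/EXAMPLE/f1/full/full/0/native.jpg"
parts = parse_iiif_image_url(example_url)
print(parts["region"], parts["size"], parts["format"])
```

In the IIIF Image API, the size segment controls the resolution of the returned image, so `full` requests the image at its full size.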

Download the dataset

In order to download the dataset, you need to:

  • clone the repository
  • create a Python virtual environment and install requests with pip
  • run python download.py (Linux syntax)

The download.py script creates a directory named data/ containing all the images that make up the Gallicalbum dataset!

A typical series of commands, on Linux, to download the dataset could be:

    $ git clone git@github.com:HugoSchtr/Gallicalbum.git
    $ cd Gallicalbum/
    $ python -m venv env
    $ source env/bin/activate
    $ pip install -r requirements.txt
    $ python download.py

Et voilà!
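For readers who want to see what a download step of this kind involves, here is a minimal sketch. It is not the repository's download.py. The column name `url`, the numbered file names and the helper functions are assumptions made for illustration, and it uses `urllib` from the standard library instead of requests so that it runs without extra dependencies.

```python
import csv
import io
from pathlib import Path
from urllib.request import urlopen


def read_urls(csv_text, column="url"):
    """Return the non-empty values of `column` (an assumed column name)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row.get(column) or "").strip()
        for row in reader
        if (row.get(column) or "").strip()
    ]


def target_path(index, out_dir="data"):
    """Build a numbered destination path (hypothetical naming scheme)."""
    return Path(out_dir) / f"{index:05d}.jpg"


def download_all(urls, out_dir="data"):
    """Save each URL into out_dir, skipping files that already exist."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for index, url in enumerate(urls, start=1):
        destination = target_path(index, out_dir)
        if destination.exists():
            continue  # already downloaded on a previous run
        with urlopen(url, timeout=30) as response:
            destination.write_bytes(response.read())


# Parsing only; no network access happens here.
sample_csv = "url\nhttps://iiif.example.org/a.jpg\nhttps://iiif.example.org/b.jpg\n"
urls = read_urls(sample_csv)
print(len(urls), target_path(1).name)
```

Skipping files that already exist lets an interrupted run resume without downloading the same images again.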

Image definition (the -low option)

Gallica currently limits downloads to 5 HD images per minute. Downloading the entire dataset can therefore take a while, because download.py pauses for one minute after every 5 images. If you only want a quick look at the dataset, or if you don't mind lower-definition images, you can pass the -low option to download.py. The whole dataset then downloads very quickly, but the images are in low resolution.

The command looks like this:

    $ python download.py -low

Note that images downloaded this way will have a bottom infobox crediting Gallica, which is not the case on the HD images.
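The cooldown described in this section (five HD images, then a one-minute pause) can be sketched as follows. This illustrates the batching idea under stated assumptions and is not the code used in download.py. The function names and the injectable `sleep` parameter are hypothetical.

```python
import time


def fetch_in_batches(items, fetch, batch_size=5, cooldown=60.0, sleep=time.sleep):
    """Call fetch(item) for every item, pausing between batches."""
    results = []
    for position, item in enumerate(items):
        # Pause before starting each new batch, except the very first one.
        if position > 0 and position % batch_size == 0:
            sleep(cooldown)
        results.append(fetch(item))
    return results


def estimated_pause_seconds(n_items, batch_size=5, cooldown=60.0):
    """Total time spent waiting, ignoring the downloads themselves."""
    if n_items <= 0:
        return 0.0
    return ((n_items - 1) // batch_size) * cooldown


# Dry run: record the pauses instead of actually sleeping.
recorded_pauses = []
doubled = fetch_in_batches(range(12), lambda x: x * 2, sleep=recorded_pauses.append)
print(recorded_pauses, estimated_pause_seconds(12))
```

With these defaults, a hypothetical set of 1,000 images would spend 199 minutes in pauses alone, which is why the low-resolution option is useful for a quick look.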

Examples of images contained in the dataset


Citation

If you use this dataset, please cite us!

@misc{Chague_Gallicalbum_2023,
  author = {Chagué, Alix and Scheithauer, Hugo},
  month = aug,
  title = {{Gallicalbum}},
  url = {https://github.com/HugoSchtr/Gallicalbum/},
  year = {2023}
}

Chagué, A., & Scheithauer, H. (2023). Gallicalbum [Data set]. https://github.com/HugoSchtr/Gallicalbum/

Any questions?

You can contact us by email at hugo.scheithauer[at]inria.fr or alix.chague[at]inria.fr if you have any questions or suggestions to improve this dataset.

Owner

  • Name: Hugo Scheithauer
  • Login: HugoSchtr
  • Kind: user
  • Location: Paris
  • Company: Inria

PhD Candidate in the ALMAnaCH research team at Inria Paris.

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Gallicalbum
message: >-
  If you use this dataset, please cite it using the metadata
  from this file.
type: dataset
authors:
  - given-names: Alix
    family-names: Chagué
    email: alix.chague@inria.fr
    affiliation: 'ALMAnaCH, INRIA'
    orcid: 'https://orcid.org/0000-0002-0136-4434'
  - given-names: Hugo
    family-names: Scheithauer
    email: hugo.scheithauer@inria.fr
    affiliation: 'ALMAnaCH, INRIA'
    orcid: 'https://orcid.org/0000-0002-5659-4675'
repository-code: 'https://github.com/HugoSchtr/Gallicalbum/'
abstract: >-
  Gallicalbum is a dataset of blank pages taken from
  manuscripts digitized and distributed via the Gallica web
  portal. It is distributed in the form of a CSV file with a
  Python script to build the dataset.
keywords:
  - Gallica
  - manuscripts
  - blank page
  - dataset
license: CC-BY-SA-4.0
date-released: '2023-08-29'

Committers

Last synced: 12 months ago

All Time
  • Total Commits: 22
  • Total Committers: 2
  • Avg Commits per committer: 11.0
  • Development Distribution Score (DDS): 0.273
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Alix Chagué a****e@i****r 16
Hugo Scheithauer h****r@g****m 6
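The summary figures above can be reproduced from the per-committer commit counts. The sketch below assumes that the Development Distribution Score is defined as one minus the top committer's share of all commits. That definition is an assumption, although it matches the values listed here. The function name is hypothetical.

```python
def development_distribution_score(commit_counts):
    """Return 1 - (commits by the top committer / total commits).

    Assumed definition; returns 0.0 when there are no commits.
    """
    total = sum(commit_counts)
    if total == 0:
        return 0.0
    return 1 - max(commit_counts) / total


# Commit counts from the "Top Committers" table: 16 and 6.
all_time_counts = [16, 6]
all_time_dds = development_distribution_score(all_time_counts)
average_commits = sum(all_time_counts) / len(all_time_counts)

print(round(all_time_dds, 3))  # 0.273
print(average_commits)         # 11.0
```

A score close to 0 means that one person wrote almost all the commits, and higher values mean the work is spread more evenly.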

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 3
  • Total pull requests: 3
  • Average time to close issues: about 2 hours
  • Average time to close pull requests: 5 days
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 0.67
  • Average comments per pull request: 0.33
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • alix-tz (1)
Pull Request Authors
  • alix-tz (3)