Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 2 committers (50.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.1%) to scientific vocabulary
Repository
Dataset of blank pages from Gallica
Basic Info
- Host: GitHub
- Owner: HugoSchtr
- Language: Python
- Default Branch: main
- Size: 6.49 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
Gallicalbum
Gallicalbum is a dataset of blank pages taken from manuscripts digitized and distributed via the Gallica web portal.
The dataset is distributed as a CSV file listing the URLs that give access to the images on the IIIF server offered by Gallica.
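As an illustration of how such a CSV can be consumed, here is a minimal sketch. The column name `url`, the sample rows (on the placeholder host `example.org`) and the helper names `read_urls` and `split_iiif_url` are assumptions made for the example, not the actual layout of the Gallicalbum file, so check the header of the CSV in the repository before reusing it. The URL split relies only on the general IIIF Image API pattern, `{identifier}/{region}/{size}/{rotation}/{quality}.{format}`.

```python
import csv
import io

# Hypothetical sample: the real CSV may use other column names and URLs.
SAMPLE_CSV = """url
https://example.org/iiif/ark:/12345/abc/f1/full/full/0/native.jpg
https://example.org/iiif/ark:/12345/abc/f2/full/full/0/native.jpg
"""


def read_urls(csv_text, column="url"):
    """Return the value of `column` for every row that has one."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column].strip() for row in reader if row.get(column)]


def split_iiif_url(url):
    """Split an IIIF Image API URL into its trailing parameters.

    The identifier may itself contain slashes, so the split starts
    from the right-hand side of the URL.
    """
    base, region, size, rotation, filename = url.rsplit("/", 4)
    quality, _, fmt = filename.rpartition(".")
    return {
        "base": base,
        "region": region,
        "size": size,
        "rotation": rotation,
        "quality": quality,
        "format": fmt,
    }


urls = read_urls(SAMPLE_CSV)
parts = split_iiif_url(urls[0])
```

Splitting from the right keeps the sketch independent of how the image identifier is written, which matters for ARK-style identifiers that contain slashes.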
Download the dataset
To download the dataset, you need to:
- clone the repository
- create a Python virtual environment and install `requests` with pip
- run `python download.py` (Linux syntax)
The `download.py` script creates a directory named `data/` containing all the images that make up the Gallicalbum dataset.
A typical series of commands, on Linux, to download the dataset could be:
```sh
$ git clone git@github.com:HugoSchtr/Gallicalbum.git
$ cd Gallicalbum/
$ python -m venv env
$ source env/bin/activate
$ pip install -r requirements.txt
$ python download.py
```
Et voilà!
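For readers curious about what a download step of this kind involves, the sketch below saves a list of URLs into a target directory. It is not the code of `download.py`: the function name `download_all`, the file naming scheme and the `fetch` parameter are all assumptions chosen for illustration. Passing the HTTP call in as `fetch` keeps the example runnable without network access.

```python
import tempfile
from pathlib import Path


def download_all(urls, fetch, out_dir="data"):
    """Save the bytes returned by `fetch(url)` for each URL into `out_dir`.

    `fetch` is any callable mapping a URL to bytes, which keeps this
    sketch independent of a specific HTTP library.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for index, url in enumerate(urls, start=1):
        path = target / f"{index:05d}.jpg"  # hypothetical naming scheme
        if not path.exists():
            # Files already on disk are kept, so an interrupted run can resume.
            path.write_bytes(fetch(url))
        written.append(path)
    return written


# Demo with a fake fetcher, so no request is sent anywhere.
with tempfile.TemporaryDirectory() as tmp:
    paths = download_all(
        ["https://example.org/a", "https://example.org/b"],
        fetch=lambda url: b"fake image bytes",
        out_dir=tmp,
    )
    names = [p.name for p in paths]
    sizes = [p.stat().st_size for p in paths]
```

With `requests` installed, `fetch` could be something like `lambda url: requests.get(url, timeout=30).content`, ideally after checking the response status with `raise_for_status()`.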
Image definition (the -low option)
Gallica currently limits downloads to 5 HD images per minute. Downloading the entire dataset can therefore take a while, because the script waits for a 1-minute cooldown after every 5 images. If you only want a quick look at the dataset, or if you don't mind a lower image definition, you can pass the `-low` option to `download.py`. It downloads the whole dataset much faster, but in low resolution.
The command looks like this:
```sh
python download.py -low
```
Note that images downloaded this way will have a bottom infobox crediting Gallica, which is not the case on the HD images.
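The pacing described in this section (a pause after every 5 images) can be sketched as a small generator. This is an illustration under stated assumptions, not the logic actually used in `download.py`: the names `throttled` and `estimated_wait_minutes` are hypothetical, and the `sleep` parameter exists only so the pause can be replaced in the demo.

```python
import math
import time


def throttled(items, batch_size=5, cooldown=60.0, sleep=time.sleep):
    """Yield items, pausing `cooldown` seconds before each new batch."""
    for index, item in enumerate(items):
        if index and index % batch_size == 0:
            # A full batch has been consumed: wait before starting the next one.
            sleep(cooldown)
        yield item


def estimated_wait_minutes(n_images, batch_size=5, cooldown_minutes=1.0):
    """Total cooldown time when one pause separates consecutive batches."""
    batches = math.ceil(n_images / batch_size)
    return max(batches - 1, 0) * cooldown_minutes


# Demo: record the pauses instead of actually sleeping.
pauses = []
processed = list(throttled(range(12), sleep=pauses.append))
```

With 12 items and the default settings, the demo records two pauses, which matches `estimated_wait_minutes(12)`. The same function gives a rough lower bound on the time needed for an HD download of any size.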
Examples of images contained in the dataset
Citation
If you use this dataset, please cite us!
@misc{Chague_Gallicalbum_2023,
author = {Chagué, Alix and Scheithauer, Hugo},
month = aug,
title = {{Gallicalbum}},
url = {https://github.com/HugoSchtr/Gallicalbum/},
year = {2023}
}
Chagué, A., & Scheithauer, H. (2023). Gallicalbum [Data set]. https://github.com/HugoSchtr/Gallicalbum/
Any questions?
You can contact us by email at hugo.scheithauer[at]inria.fr or alix.chague[at]inria.fr if you have any questions or suggestions to improve this dataset.
Owner
- Name: Hugo Scheithauer
- Login: HugoSchtr
- Kind: user
- Location: Paris
- Company: Inria
- Twitter: HugoSchtr
- Repositories: 21
- Profile: https://github.com/HugoSchtr
PhD Candidate in the ALMAnaCH research team at Inria Paris.
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: Gallicalbum
message: >-
  If you use this dataset, please cite it using the metadata
  from this file.
type: dataset
authors:
  - given-names: Alix
    family-names: Chagué
    email: alix.chague@inria.fr
    affiliation: 'ALMAnaCH, INRIA'
    orcid: 'https://orcid.org/0000-0002-0136-4434'
  - given-names: Hugo
    family-names: Scheithauer
    email: hugo.scheithauer@inria.fr
    affiliation: 'ALMAnaCH, INRIA'
    orcid: 'https://orcid.org/0000-0002-5659-4675'
repository-code: 'https://github.com/HugoSchtr/Gallicalbum/'
abstract: >-
  Gallicalbum is a dataset of blank pages taken from
  manuscripts digitized and distributed via the Gallica web
  portal. It is distributed in the form of a CSV file with a
  Python script to build the dataset.
keywords:
  - Gallica
  - manuscripts
  - blank page
  - dataset
license: CC-BY-SA-4.0
date-released: '2023-08-29'
Committers
Last synced: 12 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Alix Chagué | a****e@i****r | 16 |
| Hugo Scheithauer | h****r@g****m | 6 |
Issues and Pull Requests
Last synced: about 1 year ago
All Time
- Total issues: 3
- Total pull requests: 3
- Average time to close issues: about 2 hours
- Average time to close pull requests: 5 days
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 0.67
- Average comments per pull request: 0.33
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- alix-tz (1)
Pull Request Authors
- alix-tz (3)