freemmax_oa
Corpus for a language model of (early) modern French
Repository
Basic Info
- Host: GitHub
- Owner: FreEM-corpora
- Language: HTML
- Default Branch: master
- Size: 158 MB
Statistics
- Stars: 2
- Watchers: 2
- Forks: 5
- Open Issues: 2
- Releases: 3
Metadata Files
README.md
FreEM max OA
A Large Corpus for Early modern French - open access
For more information about FreEM corpora, cf. our website.
Description
This repo contains documents from very different sources (Wikipedia, web scraping, XML).
- Documents found online or provided by colleagues are stored in the 0_source folder. Those provided in .doc/.txt format or found online are loosely encoded in TEI.
- New teiHeader elements, with limited but highly structured information, are available in the 1_header folder. These headers are used to generate the table of contents.
- The final corpus can be generated with python3 build.py. For legal reasons, we are not allowed to modify the files, so we provide the script that modifies them. This script creates:
  - a new version of the transcriptions with minimal TEI encoding, adapted to a dedicated ODD/schema;
  - cleaned .txt files.
After execution of the script, we obtain the following data:
|-0_source
  |-(.*).xml
  |-(.*).xml
  |-(.*).xml
  |- ...
|-1_header
  |-(.*)_dAlembert.xml
  |-(.*)_dAlembert.xml
  |-(.*)_dAlembert.xml
  |- ...
|-2_TEI
  |-(.*)_dAlembert.xml
  |-(.*)_dAlembert.xml
  |-(.*)_dAlembert.xml
  |- ...
|-3_TXT
  |-(.*)_dAlembert.txt
  |-(.*)_dAlembert.txt
  |-(.*)_dAlembert.txt
  |-(.*)_dAlembert.txt
  |- ...
|-ODD
  |-ODD_clean.ODD
  |-out
    |-ODD_clean.rng
|-scripts
  |-build.xsl.xsl
  |-make_TOC.xsl
  |-1to2.xsl
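Once the script has run, it can be useful to check that every header has a matching TEI and TXT file. The sketch below is not part of the repository: the function name `missing_outputs` is hypothetical, and it assumes that related files share the same file stem across the 1_header, 2_TEI and 3_TXT folders, as the tree above suggests.

```python
from pathlib import Path


def missing_outputs(root):
    """Return the header stems that lack a TEI or TXT counterpart.

    Assumption: a header `X_dAlembert.xml` in 1_header corresponds to
    `X_dAlembert.xml` in 2_TEI and `X_dAlembert.txt` in 3_TXT.
    """
    root = Path(root)
    # Collect file stems (file names without extension) per folder.
    headers = {p.stem for p in (root / "1_header").glob("*.xml")}
    tei = {p.stem for p in (root / "2_TEI").glob("*.xml")}
    txt = {p.stem for p in (root / "3_TXT").glob("*.txt")}
    return {
        "no_tei": sorted(headers - tei),
        "no_txt": sorted(headers - txt),
    }
```

Calling `missing_outputs(".")` from the repository root returns two lists of stems; both should be empty if the assumed naming rule holds and the build completed.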
Table of contents
A list of the files is available here.
Warning
This corpus is the open access version of the FreEM max corpus. Some (important) corpora have been withdrawn from the data available here.
Licences
Licences vary from one file and one project to another. Please pay attention to the <licence> element in the <teiHeader>.
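Because the licence has to be checked file by file, a small helper can read it out of each header. This is a minimal sketch, not a script shipped with the repository: the function name `licence_info` is hypothetical, and it assumes the files declare the standard TEI namespace and may carry the licence URL in a `target` attribute.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; assumed to be the one used by the corpus files.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def licence_info(xml_text):
    """Return a list of (target, text) pairs, one per <licence> element."""
    root = ET.fromstring(xml_text)
    found = []
    for el in root.iterfind(".//tei:licence", TEI_NS):
        # Join nested text nodes and collapse runs of whitespace.
        text = " ".join("".join(el.itertext()).split())
        found.append((el.get("target"), text))
    return found
```

For a file on disk, pass its content as a string, for example `licence_info(Path("2_TEI/some_file.xml").read_text(encoding="utf-8"))`, where the path is a placeholder. An empty list means no `<licence>` element was found under the assumed namespace.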
Cite this repository
bibtex
@software{gabay_simon_2022_6481135,
author = {Gabay, Simon and
Bartz, Alexandre and
Gambette, Philippe and
Chagu{\'e}, Alix},
title = {{FreEM-corpora/FreEMmax\_OA: FreEM max OA: A Large
Corpus for Early modern French - Open access
version}},
month = apr,
year = 2022,
note = {If you use this software, please cite it as below.},
publisher = {Zenodo},
version = {1.0.0},
doi = {10.5281/zenodo.6481135},
url = {https://doi.org/10.5281/zenodo.6481135}
}
bibtex
@inproceedings{gabay:hal-03596653,
TITLE = {{From FreEM to D'AlemBERT}},
AUTHOR = {Gabay, Simon and Ortiz Suarez, Pedro and Bartz, Alexandre and Chagu{\'e}, Alix and Bawden, Rachel and Gambette, Philippe and Sagot, Beno{\^i}t},
URL = {https://hal.inria.fr/hal-03596653},
NOTE = {8 pages, 2 figures, 4 tables},
BOOKTITLE = {{Proceedings of the 13th Language Resources and Evaluation Conference}},
ADDRESS = {Marseille, France},
ORGANIZATION = {{European Language Resources Association}},
YEAR = {2022},
MONTH = Jun,
HAL_ID = {hal-03596653},
HAL_VERSION = {v1},
}
Please keep me posted if you use this data!
Contact
simon.gabay[at]unige.ch
Owner
- Name: FreEM-corpora
- Login: FreEM-corpora
- Kind: organization
- Repositories: 2
- Profile: https://github.com/FreEM-corpora