freemmax_oa

Corpus for a language model of (early) modern French

https://github.com/freem-corpora/freemmax_oa

Repository

Basic Info

Host: GitHub
Owner: FreEM-corpora
Language: HTML
Default Branch: master
Size: 158 MB

Statistics

Stars: 2
Watchers: 2
Forks: 5
Open Issues: 2
Releases: 3

Created over 4 years ago · Last pushed over 3 years ago

FreEM max OA

A Large Corpus for Early modern French - open access

For more information about FreEM corpora, cf. our website.

Description

This repo contains documents from very different sources (Wikipedia, scraping, XML).

Documents found online or given by colleagues are stored in the 0_source folder. Those provided in .doc / .txt format or found online are loosely encoded in TEI.
New teiHeader elements, with limited but highly structured information, are available in the 1_header folder. These headers are used to generate the table of contents.
The final corpus can be generated with python3 build.py. For legal reasons, we are not allowed to modify the files ourselves, so we provide a script that modifies them. This script creates:
a new version of the transcriptions with a minimal TEI encoding, adapted to a dedicated ODD/schema;
cleaned .txt files.
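To make the "cleaned .txt" step more concrete, here is a minimal sketch of how plain text could be pulled out of a TEI transcription. It is not the logic of build.py, which is not reproduced here. The function name tei_to_plain_text and the sample document are hypothetical, and the sketch assumes the files declare the standard TEI namespace.

```python
import xml.etree.ElementTree as ET

# Assumption: the transcriptions use the standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample document, for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Exemple</title></titleStmt></fileDesc></teiHeader>
  <text>
    <body>
      <p>Premier <hi>paragraphe</hi>.</p>
      <p>Second paragraphe.</p>
    </body>
  </text>
</TEI>"""


def tei_to_plain_text(xml_string: str) -> str:
    """Return the text content of <text>, ignoring the <teiHeader>."""
    root = ET.fromstring(xml_string)
    text_el = root.find("tei:text", TEI_NS)
    if text_el is None:
        return ""
    # Concatenate every text node, then collapse runs of whitespace.
    return " ".join("".join(text_el.itertext()).split())


if __name__ == "__main__":
    print(tei_to_plain_text(SAMPLE_TEI))
```

Joining the text nodes without a separator keeps words intact when inline elements such as <hi> split them; the trade-off is that adjacent paragraphs are only separated if the source file has whitespace between them.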

After execution of the script, we obtain the following data:

Table of contents

A list of the files is available here.

Warning

This corpus is the open access version of the FreEM max corpus. Some (important) corpora have been withdrawn from the data made available here.

Licences

Licences vary from one file and one project to another. Please pay attention to the <licence> element in the <teiHeader>.

Cite this repository

@software{gabay_simon_2022_6481135,
  author    = {Gabay, Simon and Bartz, Alexandre and Gambette, Philippe and Chagu{\'e}, Alix},
  title     = {{FreEM-corpora/FreEMmax\_OA: FreEM max OA: A Large Corpus for Early modern French - Open access version}},
  month     = apr,
  year      = 2022,
  note      = {If you use this software, please cite it as below.},
  publisher = {Zenodo},
  version   = {1.0.0},
  doi       = {10.5281/zenodo.6481135},
  url       = {https://doi.org/10.5281/zenodo.6481135}
}

@inproceedings{gabay:hal-03596653,
  TITLE        = {{From FreEM to D'AlemBERT}},
  AUTHOR       = {Gabay, Simon and Ortiz Suarez, Pedro and Bartz, Alexandre and Chagu{\'e}, Alix and Bawden, Rachel and Gambette, Philippe and Sagot, Beno{\^i}t},
  URL          = {https://hal.inria.fr/hal-03596653},
  NOTE         = {8 pages, 2 figures, 4 tables},
  BOOKTITLE    = {{Proceedings of the 13th Language Resources and Evaluation Conference}},
  ADDRESS      = {Marseille, France},
  ORGANIZATION = {{European Language Resources Association}},
  YEAR         = {2022},
  MONTH        = Jun,
  HAL_ID       = {hal-03596653},
  HAL_VERSION  = {v1},
}

Please keep me posted if you use this data!

Contact

simon.gabay[at]unige.ch

Owner

Name: FreEM-corpora
Login: FreEM-corpora
Kind: organization

Repositories: 2
Profile: https://github.com/FreEM-corpora

GitHub Events

Total

Fork event: 1

Last Year

Fork event: 1
