freemmax_oa

Corpus for a language model of (early) modern French

https://github.com/freem-corpora/freemmax_oa

Repository

Basic Info
  • Host: GitHub
  • Owner: FreEM-corpora
  • Language: HTML
  • Default Branch: master
  • Size: 158 MB
Statistics
  • Stars: 2
  • Watchers: 2
  • Forks: 5
  • Open Issues: 2
  • Releases: 3
Created about 4 years ago · Last pushed about 3 years ago

README.md

FreEM max OA

A Large Corpus for Early modern French - open access

For more information about the FreEM corpora, see our website.

Description

This repository contains documents from very different sources (Wikipedia, web scraping, XML).

  • Documents found online or provided by colleagues are stored in the 0_source folder. Those provided in .doc / .txt format or found online are loosely encoded in TEI.
  • New teiHeader elements, containing limited but highly structured information, are available in the 1_header folder. These headers are used to generate the table of contents.
  • The final corpus can be generated with python3 build.py. For legal reasons, we are not allowed to modify the files, so we provide the script that modifies them. This script creates:
    • a new version of the transcriptions with a minimal TEI encoding, adapted to a dedicated ODD/schema;
    • cleaned .txt files.

After execution of the script, we obtain the following data:

|-0_source
    |-(.*).xml
    |-(.*).xml
    |-(.*).xml
    |- ...
|-1_header
    |-(.*)_dAlembert.xml
    |-(.*)_dAlembert.xml
    |-(.*)_dAlembert.xml
    |- ...
|-2_TEI
    |-(.*)_dAlembert.xml
    |-(.*)_dAlembert.xml
    |-(.*)_dAlembert.xml
    |- ...
|-3_TXT
    |-(.*)_dAlembert.txt
    |-(.*)_dAlembert.txt
    |-(.*)_dAlembert.txt
    |-(.*)_dAlembert.txt
    |- ...
|-ODD
    |-ODD_clean.ODD
    |-out
        |-ODD_clean.rng
|-scripts
    |-build.xsl.xsl
    |-make_TOC.xsl
    |-1to2.xsl
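
Once the build has run, the resulting layout can be sanity-checked with a few lines of Python. The following is only a minimal sketch: it assumes the four numbered folders and file extensions listed above and that it is run from the repository root. `summarize_corpus` is a hypothetical helper and is not part of the repository.

```python
from pathlib import Path

# Folder names and extensions are taken from the layout listed above.
# The helper itself is hypothetical and not part of the repository.
EXPECTED = {
    "0_source": ".xml",
    "1_header": ".xml",
    "2_TEI": ".xml",
    "3_TXT": ".txt",
}


def summarize_corpus(root):
    """Return {folder: file_count}, with None for a folder that is missing."""
    root = Path(root)
    summary = {}
    for folder, suffix in EXPECTED.items():
        path = root / folder
        if not path.is_dir():
            summary[folder] = None
            continue
        # Count only regular files with the expected extension.
        summary[folder] = sum(
            1 for f in path.iterdir() if f.is_file() and f.suffix == suffix
        )
    return summary


if __name__ == "__main__":
    for folder, count in summarize_corpus(".").items():
        status = "missing" if count is None else f"{count} file(s)"
        print(f"{folder}: {status}")
```

A folder reported as missing, or a 2_TEI / 3_TXT count that differs from the 1_header count, would suggest that the build did not complete.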

Table of contents

A list of the files is available here.

Warning

This corpus is the open access version of the FreEM max corpus. Some (important) corpora have been withdrawn from the available data.

Licences

Licences vary from file to file and from project to project. Please check the <licence> element in the <teiHeader> of each file.
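
Because the licence has to be checked file by file, it can help to list the <licence> elements programmatically. The sketch below is illustrative only: it assumes the files declare the standard TEI namespace (http://www.tei-c.org/ns/1.0) and that licences may carry a target attribute. `read_licences` is a hypothetical helper, not part of the repository, and the 2_TEI folder in the usage loop is taken from the layout described in this README.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Standard TEI namespace; assumes the corpus files declare it.
TEI_LICENCE = "{http://www.tei-c.org/ns/1.0}licence"


def read_licences(xml_source):
    """Return a list of (target, text) pairs, one per <licence> element.

    `xml_source` is a file path or a binary file-like object.
    """
    root = ET.parse(xml_source).getroot()
    licences = []
    for lic in root.iter(TEI_LICENCE):
        # Collapse whitespace so multi-line licence statements stay readable.
        text = " ".join("".join(lic.itertext()).split())
        licences.append((lic.get("target"), text))
    return licences


if __name__ == "__main__":
    # Folder name assumed from the layout described in this README.
    for path in sorted(Path("2_TEI").glob("*.xml")):
        print(path.name, read_licences(str(path)))
```

An empty list for a file means no <licence> element was found under that namespace, in which case the header should be inspected by hand.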

Cite this repository

@software{gabay_simon_2022_6481135,
  author    = {Gabay, Simon and Bartz, Alexandre and Gambette, Philippe and Chagu{\'e}, Alix},
  title     = {{FreEM-corpora/FreEMmax\_OA: FreEM max OA: A Large Corpus for Early modern French - Open access version}},
  month     = apr,
  year      = 2022,
  note      = {If you use this software, please cite it as below.},
  publisher = {Zenodo},
  version   = {1.0.0},
  doi       = {10.5281/zenodo.6481135},
  url       = {https://doi.org/10.5281/zenodo.6481135}
}

@inproceedings{gabay:hal-03596653,
  TITLE        = {{From FreEM to D'AlemBERT}},
  AUTHOR       = {Gabay, Simon and Ortiz Suarez, Pedro and Bartz, Alexandre and Chagu{\'e}, Alix and Bawden, Rachel and Gambette, Philippe and Sagot, Beno{\^i}t},
  URL          = {https://hal.inria.fr/hal-03596653},
  NOTE         = {8 pages, 2 figures, 4 tables},
  BOOKTITLE    = {{Proceedings of the 13th Language Resources and Evaluation Conference}},
  ADDRESS      = {Marseille, France},
  ORGANIZATION = {{European Language Resources Association}},
  YEAR         = {2022},
  MONTH        = Jun,
  HAL_ID       = {hal-03596653},
  HAL_VERSION  = {v1},
}

Please keep me posted if you use this data!

Contact

simon.gabay[at]unige.ch

Owner

  • Name: FreEM-corpora
  • Login: FreEM-corpora
  • Kind: organization

GitHub Events

Total
  • Fork event: 1
Last Year
  • Fork event: 1