freem-semid-norm

FreEM SemiD norm refers both to a normalisation model and to the normalised corpus used to develop it, a dataset of Middle French texts normalised according to semi-diplomatic guidelines.

https://github.com/soniasol/freem-semid-norm

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 9 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.3%) to scientific vocabulary
Last synced: 10 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: soniasol
  • Language: PLSQL
  • Default Branch: main
  • Homepage:
  • Size: 7.81 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created over 1 year ago · Last pushed 12 months ago
Metadata Files
Readme Citation

README.md

FreEM SemiD norm

FreEM SemiD norm (French Early Modern Semi-Diplomatic Normalisation) refers both to:

  • a normalisation model, and
  • the normalised corpus used to develop it — a dataset of Middle French texts, normalised according to semi-diplomatic guidelines.

Funder

This research was conducted as part of the SETAF project, funded by the Swiss National Science Foundation (SNSF). Project number: 205056.

How to cite our work

  • Our paper:

Sonia Solfrini, Mylène Dejouy, Aurélia Marques Oliveira, Pierre-Olivier Beaulnes. « Normaliser le moyen français : du graphématique au semi-diplomatique », actes de CORIA-TALN-RJCRI-RECITAL 2025, juillet 2025, Marseille, France. ⟨hal-05137564⟩.

  • Our corpus:

@misc{FreEM-SemiD-norm_dataset_2025,
  author = {Solfrini, Sonia and Dejouy, Mylène and Marques Oliveira, Aurélia and Beaulnes, Pierre-Olivier},
  title = {{FreEM SemiD norm corpus}},
  month = may,
  year = 2025,
  howpublished = {\url{https://github.com/soniasol/FreEM-SemiD-norm}},
  note = {Accessed Month Day, Year}
}

  • Our model:

@misc{FreEM-SemiD-norm_model_2025,
  author = {Solfrini, Sonia and Gabay, Simon},
  title = {{FreEM SemiD norm model}},
  month = may,
  year = 2025,
  publisher = {Zenodo},
  note = {{v.} 1.0.0},
  doi = {10.5281/zenodo.15551750},
  url = {https://doi.org/10.5281/zenodo.15551750}
}

License

Contact

For questions or contributions, please contact Sonia Solfrini at Sonia.Solfrini@unige.ch.

Dataset

Our corpus is available in the dataset folder. It is organized as follows:

  • corpus-to-process/
    Contains each text in plain .txt format: one file with the original text and one file with the normalised version. A script is included to convert and merge these files into .tsv format.

  • corpus/
    Contains each text in .tsv format. Each file includes two columns:

    • the original lines of text
    • the corresponding normalised lines
  • split/
    Contains the dataset divided into training, validation, and test sets. See the scripts section below for details on how the split was generated.

  • data/
    Contains the split corpus in source–target format:

    • train.src / train.trg
    • dev.src / dev.trg
    • test.src / test.trg
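
The folder layout above describes a small pipeline: paired .txt files are merged into two-column .tsv files, the pairs are divided into three sets, and each set is written as parallel .src/.trg files. The Python sketch below illustrates that flow under stated assumptions; it is not the script shipped in the repository. The function names, the 10%/10% dev/test ratios, the random shuffle and the seed are all hypothetical, and the actual split procedure is the one documented in the scripts folder.

```python
import random
from pathlib import Path


def read_lines(path: Path) -> list[str]:
    """Read a UTF-8 text file and return its lines without trailing newlines."""
    with path.open(encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def merge_to_tsv(orig_txt: Path, norm_txt: Path, out_tsv: Path) -> int:
    """Pair line i of the original file with line i of the normalised file."""
    orig_lines = read_lines(orig_txt)
    norm_lines = read_lines(norm_txt)
    if len(orig_lines) != len(norm_lines):
        raise ValueError(
            f"line count mismatch: {len(orig_lines)} vs {len(norm_lines)}"
        )
    with out_tsv.open("w", encoding="utf-8") as handle:
        for original, normalised in zip(orig_lines, norm_lines):
            if "\t" in original or "\t" in normalised:
                raise ValueError("a tab inside a line would break the two-column layout")
            handle.write(f"{original}\t{normalised}\n")
    return len(orig_lines)


def read_pairs(tsv_path: Path) -> list[tuple[str, str]]:
    """Load (original, normalised) pairs from a two-column .tsv file."""
    pairs = []
    for line in read_lines(tsv_path):
        original, normalised = line.split("\t")  # fails loudly unless exactly two columns
        pairs.append((original, normalised))
    return pairs


def split_pairs(pairs, dev_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle and split pairs. Ratios and seed are assumptions, not the authors' values."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_ratio)
    n_test = int(len(shuffled) * test_ratio)
    return {
        "dev": shuffled[:n_dev],
        "test": shuffled[n_dev:n_dev + n_test],
        "train": shuffled[n_dev + n_test:],
    }


def write_src_trg(splits, data_dir: Path) -> None:
    """Write each split as parallel <name>.src (original) and <name>.trg (normalised)."""
    data_dir.mkdir(parents=True, exist_ok=True)
    for name, pairs in splits.items():
        src_text = "".join(original + "\n" for original, _ in pairs)
        trg_text = "".join(normalised + "\n" for _, normalised in pairs)
        (data_dir / f"{name}.src").write_text(src_text, encoding="utf-8")
        (data_dir / f"{name}.trg").write_text(trg_text, encoding="utf-8")
```

Calling `merge_to_tsv`, then `read_pairs`, `split_pairs` and `write_src_trg` in that order reproduces the shape of the corpus/ and data/ folders. Keeping the pairs together until the last step guarantees that line i of a .src file always corresponds to line i of the matching .trg file.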

A detailed overview of the corpus content, including text titles and metadata, is available in table.csv.

Scripts

See the scripts folder for all scripts used in our experiments, along with a README.md that outlines the steps followed to train and evaluate the model.

Other files

The other-files folder includes additional resources such as subword-tokenized files, BPE vocabularies/models, intermediate outputs, and evaluation results. A README.md in this folder further explains the structure and usage of these files, which support model training and evaluation with Fairseq.
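
For readers unfamiliar with BPE (byte-pair encoding), the toy Python sketch below shows the general idea: the most frequent adjacent symbol pair is merged repeatedly, and the learned merges are then replayed on new words. This is a didactic illustration only. This README does not state which tool produced the BPE files, so the function names, the `</w>` end-of-word marker and the tie-breaking rule are assumptions and not the authors' pipeline.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by its concatenation."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # (an arbitrary but deterministic choice made for this sketch).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the merges in the order they were learned."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

The number of merges controls the size of the subword vocabulary: few merges keep words close to character level, while many merges yield longer units.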

Results

Our results are available in the results folder.

We experimented with multiple LSTM-based model configurations (XS, S, M) and vocabulary sizes. The best results were obtained using the "S" configuration (2 encoder/decoder layers, 256 embedding dim, 512 hidden size) with a vocabulary of 1,000 subword units:

| Configuration | BLEU | TER | ChrF |
|---------------|-------|-------|--------|
| XS | 86.64 | 7.69 | 94.93 |
| S | 87.08 | 7.35 | 95.02 |
| M | 86.18 | 7.76 | 94.70 |
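
The comparison in the table can be checked with a few lines of Python, keeping in mind that BLEU and ChrF are better when higher while TER is better when lower. The sketch also shows one plausible way to express the "S" hyperparameters as Fairseq LSTM command-line flags. The flag names exist in Fairseq's LSTM model, but whether the authors used exactly these flags, and the same sizes on the encoder and decoder sides, is an assumption; the helper names are hypothetical, and the XS and M hyperparameters are omitted because this README does not give them.

```python
# Scores copied from the results table above.
RESULTS = {
    "XS": {"BLEU": 86.64, "TER": 7.69, "ChrF": 94.93},
    "S": {"BLEU": 87.08, "TER": 7.35, "ChrF": 95.02},
    "M": {"BLEU": 86.18, "TER": 7.76, "ChrF": 94.70},
}

# BLEU and ChrF reward overlap (higher is better); TER counts edits (lower is better).
HIGHER_IS_BETTER = {"BLEU": True, "TER": False, "ChrF": True}

# "S" configuration as described in the text: 2 layers, 256 embedding dim, 512 hidden size.
S_CONFIG = {"layers": 2, "embed_dim": 256, "hidden_size": 512}


def best_per_metric(results):
    """Return the name of the best configuration for each metric."""
    best = {}
    for metric, higher in HIGHER_IS_BETTER.items():
        pick = max if higher else min
        best[metric] = pick(results, key=lambda name: results[name][metric])
    return best


def lstm_flags(config):
    """Build Fairseq-style LSTM flags (assumed mapping, same sizes on both sides)."""
    flags = ["--arch", "lstm"]
    for side in ("encoder", "decoder"):
        flags += [
            f"--{side}-layers", str(config["layers"]),
            f"--{side}-embed-dim", str(config["embed_dim"]),
            f"--{side}-hidden-size", str(config["hidden_size"]),
        ]
    return flags


if __name__ == "__main__":
    print(best_per_metric(RESULTS))
    print(" ".join(lstm_flags(S_CONFIG)))
```

The "S" configuration wins on all three metrics, which is why no trade-off between them has to be weighed here.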

Model

The best-performing trained model is available in the Releases section of this repository and on Zenodo (DOI: 10.5281/zenodo.15551750).

Owner

  • Name: Sonia
  • Login: soniasol
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this dataset and/or our scripts, please cite this repository as below."
title: "FreEM SemiD norm (French Early Modern Semi-Diplomatic Normalisation) corpus and scripts"
abstract: >-
  This repository contains both a manually normalised corpus of Middle French texts
  and the scripts used to train and evaluate FreEM SemiD norm, a normalisation model.
type: software
authors:
  - family-names: Solfrini
    given-names: Sonia
    affiliation: University of Geneva
    orcid: 0009-0009-7367-048X
  - family-names: Gabay
    given-names: Simon
    affiliation: University of Geneva
    orcid: 0000-0001-9094-4475
  - family-names: Beaulnes
    given-names: Pierre-Olivier
    affiliation: University of Geneva
    orcid: 0009-0009-2475-6017
  - family-names: Marques Oliveira
    given-names: Aurélia
    affiliation: University of Geneva
    orcid: 0009-0009-9678-9811
  - family-names: Dejouy
    given-names: Mylène
    affiliation: University of Geneva
    orcid: 0009-0000-9696-9868
  - family-names: Solfaroli Camillocci
    given-names: Daniela
    affiliation: University of Geneva
    orcid: 0000-0002-2601-668X
repository-code: https://github.com/soniasol/FreEM-SemiD-norm
url: https://github.com/soniasol/FreEM-SemiD-norm
doi: 10.5281/zenodo.15551750
keywords:
  - Middle French
  - Automatic normalisation
  - Digital humanities
  - Early Modern French
license: MIT
version: "1.0"
date-released: 2025-05-30

GitHub Events

Total
  • Release event: 1
  • Push event: 51
  • Create event: 1
Last Year
  • Release event: 1
  • Push event: 51
  • Create event: 1