Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.9%) to scientific vocabulary
Repository
A processing workflow for automated collation
Statistics
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 17
- Releases: 1
Metadata Files
README.md

This repository contains scripts for a collation processing workflow and its evaluation.
Installing
```bash
# Recommended steps (use virtualenv)
virtualenv env -p python3.8
source env/bin/activate
# end recommended steps

# begin install
pip install -r requirements.txt
```
Language models with pie-extended
Models will be installed automatically if they are not already present.
You can get a list of available models with
```bash
pie-extended list
```
Sample usage
```bash
# Lemmatise raw (txt) files for subsequent collation
python3 main.py
# Collate annotated files in XML
# containing (possibly human-corrected) linguistic information
python3 main.py
# Assign categories (graphematic, flexional, morphosyntactic, lexical) to the variation sites
python3 main.py
# Or, alternatively, do it all in one go
python3 main.py
```
To evaluate the results:
```bash
python eval.py <path_to_gt_xml> <path_to_results_xml> [--print_diff]
```
For simple collation from the txt sources, without preprocessing:
```bash
python main.py <path> [--simple]
```
More information about usage, with examples, is available below.
Format for XML annotated files
If you want to use XML annotated files directly,
they must be in TEI and contain `<w>` elements
with a `@lemma` attribute, and possibly `@pos` and `@msd` attributes,
```xml
<w lemma="mëisme" pos="ADJind" msd="NOMB.=s|GENRE=m|CAS=r">meisme</w>
```
Or, possibly, use a `@type` attribute,
```xml
<w lemma="mëisme" type="ADJind|NOMB.=s|GENRE=m|CAS=r">meisme</w>
```
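As an illustration of the two annotation layouts above, here is a minimal sketch that reads `<w>` elements with Python's standard `xml.etree.ElementTree`. It is not taken from the repository: `read_tokens` is a hypothetical helper, the sample omits the TEI namespace for brevity, and the idea that `@type` holds the part of speech before the first `|` and the morphological description after it is an assumption inferred from the example.

```python
import xml.etree.ElementTree as ET

# Sample input without the TEI namespace (simplification for this sketch).
SAMPLE = """<text>
  <w lemma="mëisme" pos="ADJind" msd="NOMB.=s|GENRE=m|CAS=r">meisme</w>
  <w lemma="mëisme" type="ADJind|NOMB.=s|GENRE=m|CAS=r">meisme</w>
</text>"""


def read_tokens(xml_string):
    """Return one dict per <w> element (hypothetical helper)."""
    tokens = []
    root = ET.fromstring(xml_string)
    for w in root.iter():
        # Compare local names so that namespaced TEI tags also match.
        if w.tag.rsplit("}", 1)[-1] != "w":
            continue
        pos, msd = w.get("pos"), w.get("msd")
        if pos is None and w.get("type"):
            # Assumption: @type is "<pos>|<msd>", split on the first "|".
            pos, _, msd = w.get("type").partition("|")
        tokens.append({
            "form": (w.text or "").strip(),
            "lemma": w.get("lemma"),
            "pos": pos,
            "msd": msd or None,
        })
    return tokens


tokens = read_tokens(SAMPLE)
for token in tokens:
    print(token)
```

Under these assumptions both layouts yield the same dictionary, so later steps would not need to care which one a file uses.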
Examples and additional usage information
Lemmatisation
```bash
# Lemmatise raw (txt) files for subsequent collation
python3 main.py
# Example
python3 main.py data/input --lemmatise --lang fro --engine pie
```
This step takes txt files and produces annotated XML files, which are saved in the directory `lemmat`. Note: when you lemmatise different sources, set the path to the output directory so that the different results are stored separately; the default path is `out`, which produces `out/lemmat`.
The only currently available engine is `pie`, provided by pie-extended. For a list of available models, type
```bash
pie-extended list
```
Collation
```bash
# Collate annotated files in XML
# containing (possibly human-corrected) linguistic information
python3 main.py
# Example
python3 main.py data/input --collate
```
This step takes XML files and collates them. The results are saved as XML and as txt (a table) in the directory `coll` (the default path is `out`, which produces `out/coll`).
Before collating, you might want to correct the XML generated by the previous step. To avoid overwriting, move the XML files to a new directory before editing them, and adjust the path accordingly when launching the command.
Categorisation
```bash
# Assign categories to the variation sites
python3 main.py
# Example
python3 main.py out/coll/out.xml --categorise
```
This step takes the XML result of the collation and assigns a category to each variation site.
The linguistic information on each <rdg> inside the <app> is used to assign the category: for example, if the <rdg>s have the same value of @lemma, @pos and @msd, the variation will be graphematic. The category is stored in the attribute @ana on the <app>. Currently supported categories are graphematic, flexional, morphosyntactic, lexical.
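The decision described above can be sketched as follows. This is not the repository's implementation: only the graphematic rule (identical `@lemma`, `@pos` and `@msd`) comes from the text, while the rules for the other three categories, the placement of the annotations directly on `<rdg>`, the plain category name as the value of `@ana`, and the helper names `categorise` and `annotate` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Sample collation output; attributes directly on <rdg> are an assumption.
SAMPLE = """<root>
  <app>
    <rdg wit="#A" lemma="mëisme" pos="ADJind" msd="CAS=r">meisme</rdg>
    <rdg wit="#B" lemma="mëisme" pos="ADJind" msd="CAS=r">meesme</rdg>
  </app>
  <app>
    <rdg wit="#A" lemma="roi" pos="NOMcom" msd="CAS=n">rois</rdg>
    <rdg wit="#B" lemma="roi" pos="NOMcom" msd="CAS=r">roi</rdg>
  </app>
  <app>
    <rdg wit="#A" lemma="roi" pos="NOMcom" msd="CAS=r">roi</rdg>
    <rdg wit="#B" lemma="conte" pos="NOMcom" msd="CAS=r">conte</rdg>
  </app>
</root>"""


def categorise(readings):
    """Pick a category from a list of (lemma, pos, msd) tuples."""
    lemmas = {r[0] for r in readings}
    poses = {r[1] for r in readings}
    msds = {r[2] for r in readings}
    if len(lemmas) > 1:
        return "lexical"          # assumption: lemmas differ
    if len(poses) > 1:
        return "morphosyntactic"  # assumption: same lemma, POS differs
    if len(msds) > 1:
        return "flexional"        # assumption: only the MSD differs
    return "graphematic"          # from the text: all three are identical


def annotate(xml_string):
    """Set @ana on every <app> and return the parsed root element."""
    root = ET.fromstring(xml_string)
    for app in root.iter("app"):
        readings = [
            (rdg.get("lemma"), rdg.get("pos"), rdg.get("msd"))
            for rdg in app.iter("rdg")
        ]
        app.set("ana", categorise(readings))
    return root


root = annotate(SAMPLE)
categories = [app.get("ana") for app in root.iter("app")]
print(categories)
```

Checking from the coarsest difference (lemma) down to the finest (spelling only) means each variation site receives exactly one category under these assumed rules.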
All together
```bash
python3 main.py
# Example
python3 main.py data/input --lemmatise --lang fro --engine pie --collate --categorise
```
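When the workflow is driven from another script, the same command line can be assembled programmatically. The sketch below only builds the argument list from the flags shown in the examples above; `build_command` is a hypothetical helper, not part of the repository, and it does not run anything.

```python
import shlex


def build_command(path, lemmatise=False, lang=None, engine=None,
                  collate=False, categorise=False):
    """Build the argument list for main.py (hypothetical helper)."""
    cmd = ["python3", "main.py", path]
    if lemmatise:
        cmd.append("--lemmatise")
        # --lang and --engine only appear together with --lemmatise
        # in the examples above, so they are added here only.
        if lang:
            cmd += ["--lang", lang]
        if engine:
            cmd += ["--engine", engine]
    if collate:
        cmd.append("--collate")
    if categorise:
        cmd.append("--categorise")
    return cmd


full = build_command("data/input", lemmatise=True, lang="fro",
                     engine="pie", collate=True, categorise=True)
print(shlex.join(full))
```

The resulting list could be passed to `subprocess.run(full, check=True)` from the repository root; passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.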
Owner
- Name: CondorCompPhil
- Login: CondorCompPhil
- Kind: organization
- Repositories: 2
- Profile: https://github.com/CondorCompPhil
Citation (CITATION.CFF)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Camps"
    given-names: "Jean-Baptiste"
    orcid: "https://orcid.org/0000-0003-0385-7037"
  - family-names: "Ing"
    given-names: "Lucence"
    orcid: "https://orcid.org/0000-0002-8742-3000"
  - family-names: "Spadini"
    given-names: "Elena"
    orcid: "https://orcid.org/0000-0002-4522-2833"
title: "falcon"
version: 0.0.1-alpha
doi:
date-released: 2021-12-13
url: "https://github.com/CondorCompPhil/falcon"
Dependencies
- Levenshtein ==0.16.0
- collatex ==2.2
- jinja2 ==3.0.3
- lxml ==4.6.5
- pie-extended ==0.0.40