falcon

A processing workflow for automated collation

https://github.com/condorcompphil/falcon

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.9%) to scientific vocabulary
Last synced: 6 months ago

Repository

A processing workflow for automated collation

Basic Info
  • Host: GitHub
  • Owner: CondorCompPhil
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 99.4 MB
Statistics
  • Stars: 1
  • Watchers: 2
  • Forks: 0
  • Open Issues: 17
  • Releases: 1
Created almost 7 years ago · Last pushed almost 2 years ago
Metadata Files
Readme Citation

README.md

falcon logo

This repository contains scripts for a collation processing workflow and its evaluation.

Installing

```bash
# Recommended steps (use virtualenv)
virtualenv env -p python3.8
source env/bin/activate
# end recommended steps

# begin install
pip install -r requirements.txt
```

Language models with pie-extended

Models will be installed automatically if they are not already present.

You can get a list of available models with

```bash
pie-extended list
```

Sample usage

```bash
# Lemmatise raw (txt) files for subsequent collation
python3 main.py [--lemmatise] [--lang] [--engine]

# Collate annotated files in XML,
# containing (possibly human-corrected) linguistic information
python3 main.py [--collate]

# Assign categories (graphematic, flexional, morphosyntactic, lexical)
# to the variation sites
python3 main.py [--categorise]

# Or, alternatively, do it all in one go
python3 main.py [--lemmatise] [--lang] [--engine] [--collate] [--categorise]
```

To evaluate the results:

```bash
python eval.py <path_to_gt_xml> <path_to_results_xml> [--print_diff]
```

For simple collation from the txt sources, without preprocessing:

```bash
python main.py <path> [--simple]
```

More information about usage, along with examples, is available below.

Format for XML annotated files

If you want to use XML annotated files directly, they must be in TEI and contain `<w>` tags with a `@lemma` attribute, and possibly `@pos` and `@msd` attributes:

```xml
<w lemma="mëisme" pos="ADJind" msd="NOMB.=s|GENRE=m|CAS=r">meisme</w>
```

Or, possibly, use an `@type` attribute:

```xml
<w lemma="mëisme" type="ADJind|NOMB.=s|GENRE=m|CAS=r">meisme</w>
```
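For reference, such an annotated token can be read with a few lines of Python. This is a minimal sketch using only the standard library (the project itself depends on lxml); the element and attribute names follow the TEI convention shown above:

```python
# Minimal sketch: reading the linguistic annotations from a TEI-style <w> element.
import xml.etree.ElementTree as ET

sample = '<w lemma="mëisme" pos="ADJind" msd="NOMB.=s|GENRE=m|CAS=r">meisme</w>'
w = ET.fromstring(sample)

# The word form is the element text; the annotations are plain attributes.
print(w.text)           # meisme
print(w.get("lemma"))   # mëisme
print(w.get("pos"))     # ADJind
print(w.get("msd"))     # NOMB.=s|GENRE=m|CAS=r
```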

Examples and usage additional info

Lemmatisation

```bash
# Lemmatise raw (txt) files for subsequent collation
python3 main.py [--lemmatise] [--lang] [--engine]

# Example
python3 main.py data/input --lemmatise --lang fro --engine pie
```

This step takes txt files and produces annotated XML files, which will be saved in the directory `lemmat`. Attention: when you lemmatise different sources, define the path to the output directory so that the different results are stored separately; the default path is `out`, which will produce `out/lemmat`.

The only currently available engine is pie, in pie-extended. For a list of available models, type

```bash
pie-extended list
```

Collation

```bash
# Collate annotated files in XML,
# containing (possibly human-corrected) linguistic information
python3 main.py [--collate]

# Example
python3 main.py data/input --collate
```

This step takes XML files and collates them. The results are saved in XML and in txt (as a table) in the directory `coll` (the default path is `out`, which will produce `out/coll`).

Before collating, you might want to correct the XML generated by the previous step. To avoid overwriting, move the XML files to a new directory before editing them, and adjust the path accordingly before launching the command.

Categorisation

```bash
# Assign categories to the variation sites
python3 main.py [--categorise]

# Example
python3 main.py out/coll/out.xml --categorise
```

This step takes the XML result of the collation and assigns a category to each variation site.

The linguistic information on each `<rdg>` inside the `<app>` is used to assign the category: for example, if the `<rdg>`s have the same values of `@lemma`, `@pos` and `@msd`, the variation will be graphematic. The category is stored in the attribute `@ana` on the `<app>`. Currently supported categories are graphematic, flexional, morphosyntactic and lexical.
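As an illustration only (this is not the project's code, and the exact precedence of the checks is an assumption), the decision described above could be sketched in Python as:

```python
# Hypothetical sketch of the categorisation rule: compare @lemma, @pos and @msd
# across the <rdg>s of an <app>. The order of the checks is an assumption.

def categorise(rdgs):
    """rdgs: list of dicts holding the 'lemma', 'pos' and 'msd' of each <rdg>."""
    lemmas = {r.get("lemma") for r in rdgs}
    pos_tags = {r.get("pos") for r in rdgs}
    msds = {r.get("msd") for r in rdgs}
    if len(lemmas) > 1:
        return "lexical"          # different lemmas
    if len(pos_tags) > 1:
        return "morphosyntactic"  # same lemma, different part of speech
    if len(msds) > 1:
        return "flexional"        # same lemma and POS, different morphology
    return "graphematic"          # only the spelling of the readings differs

# Two readings identical in lemma, POS and morphology: graphematic variation.
print(categorise([
    {"lemma": "mëisme", "pos": "ADJind", "msd": "NOMB.=s|GENRE=m|CAS=r"},
    {"lemma": "mëisme", "pos": "ADJind", "msd": "NOMB.=s|GENRE=m|CAS=r"},
]))  # graphematic
```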

All together

```bash
python3 main.py [--lemmatise] [--lang] [--engine] [--collate] [--categorise]

# Example
python3 main.py data/input --lemmatise --lang fro --engine pie --collate --categorise
```

Owner

  • Name: CondorCompPhil
  • Login: CondorCompPhil
  • Kind: organization

Citation (CITATION.CFF)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Camps"
  given-names: "Jean-Baptiste"
  orcid: "https://orcid.org/0000-0003-0385-7037"
- family-names: "Ing"
  given-names: "Lucence"
  orcid: "https://orcid.org/0000-0002-8742-3000"
- family-names: "Spadini"
  given-names: "Elena"
  orcid: "https://orcid.org/0000-0002-4522-2833"
title: "falcon"
version: 0.0.1-alpha
doi: 
date-released: 2021-12-13
url: "https://github.com/CondorCompPhil/falcon"

GitHub Events


Dependencies

requirements.txt pypi
  • Levenshtein ==0.16.0
  • collatex ==2.2
  • jinja2 ==3.0.3
  • lxml ==4.6.5
  • pie-extended ==0.0.40