falcon

A processing workflow for automated collation

https://github.com/condorcompphil/falcon

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.9%) to scientific vocabulary
Last synced: 6 months ago

Repository

A processing workflow for automated collation

Basic Info
  • Host: GitHub
  • Owner: CondorCompPhil
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 99.4 MB
Statistics
  • Stars: 1
  • Watchers: 2
  • Forks: 0
  • Open Issues: 17
  • Releases: 1
Created almost 7 years ago · Last pushed almost 2 years ago
Metadata Files
Readme Citation

README.md

falcon logo

This repository contains scripts for a collation processing workflow and its evaluation.

Installing

```bash
# Recommended steps (use virtualenv)
virtualenv env -p python3.8
source env/bin/activate
# end recommended steps

# begin install
pip install -r requirements.txt
```

Language models with pie-extended

Models will be installed automatically if they are not already present.

You can get a list of available models with

```bash
pie-extended list
```

Sample usage

```bash
# Lemmatise raw (txt) files for subsequent collation
python3 main.py [--lemmatise] [--lang] [--engine]

# Collate annotated files in XML,
# containing (possibly human-corrected) linguistic information
python3 main.py [--collate]

# Assign categories (graphematic, flexional, morphosyntactic, lexical)
# to the variation sites
python3 main.py [--categorise]

# Or, alternatively, do it all in one go
python3 main.py [--lemmatise] [--lang] [--engine] [--collate] [--categorise]
```

To evaluate the results:

```bash
python eval.py <path_to_gt_xml> <path_to_results_xml> [--print_diff]
```

For simple collation from the txt sources, without preprocessing:

```bash
python main.py <path> [--simple]
```

More information about usage, along with examples, is available below.

Format for XML annotated files

If you want to use XML annotated files directly, they must be in TEI and contain `<w>` tags with a `@lemma` attribute, and possibly `@pos` and `@msd` attributes:

```xml
<w lemma="mëisme" pos="ADJind" msd="NOMB.=s|GENRE=m|CAS=r">meisme</w>
```

Or, possibly, use an `@type` attribute:

```xml
<w lemma="mëisme" type="ADJind|NOMB.=s|GENRE=m|CAS=r">meisme</w>
```
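For reference, such an annotated token can be read with a few lines of Python. This is a minimal sketch using only the standard library (the project itself depends on lxml); the element and attribute names follow the TEI convention shown above:

```python
# Minimal sketch: reading the linguistic annotations from a TEI-style <w> element.
import xml.etree.ElementTree as ET

sample = '<w lemma="mëisme" pos="ADJind" msd="NOMB.=s|GENRE=m|CAS=r">meisme</w>'
w = ET.fromstring(sample)

# The word form is the element text; the annotations are plain attributes.
print(w.text)           # meisme
print(w.get("lemma"))   # mëisme
print(w.get("pos"))     # ADJind
print(w.get("msd"))     # NOMB.=s|GENRE=m|CAS=r
```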

Examples and usage additional info

Lemmatisation

```bash
# Lemmatise raw (txt) files for subsequent collation
python3 main.py [--lemmatise] [--lang] [--engine]

# Example
python3 main.py data/input --lemmatise --lang fro --engine pie
```

This step takes txt files and produces annotated XML files, which will be saved in the directory `lemmat`. Attention: when you lemmatise different sources, define the path to the output directory so that the different results are stored separately; the default path is `out`, which will produce `out/lemmat`.

The only currently available engine is pie, in pie-extended. For a list of available models, type

```bash
pie-extended list
```

Collation

```bash
# Collate annotated files in XML,
# containing (possibly human-corrected) linguistic information
python3 main.py [--collate]

# Example
python3 main.py data/input --collate
```

This step takes XML files and collates them. The results are saved in XML and in txt (as a table) in the directory `coll` (the default path is `out`, which will produce `out/coll`).

Before collating, you might want to correct the XML generated by the previous step. To avoid overwriting, move the XML files to a new directory before editing them, and adjust the path accordingly before launching the command.

Categorisation

```bash
# Assign categories to the variation sites
python3 main.py [--categorise]

# Example
python3 main.py out/coll/out.xml --categorise
```

This step takes the XML result of the collation and assigns a category to each variation site.

The linguistic information on each `<rdg>` inside the `<app>` is used to assign the category: for example, if the `<rdg>`s have the same values of `@lemma`, `@pos` and `@msd`, the variation will be graphematic. The category is stored in the attribute `@ana` on the `<app>`. Currently supported categories are graphematic, flexional, morphosyntactic and lexical.
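As an illustration only (this is not the project's code, and the exact precedence of the checks is an assumption), the decision described above could be sketched in Python as:

```python
# Hypothetical sketch of the categorisation rule: compare @lemma, @pos and @msd
# across the <rdg>s of an <app>. The order of the checks is an assumption.

def categorise(rdgs):
    """rdgs: list of dicts holding the 'lemma', 'pos' and 'msd' of each <rdg>."""
    lemmas = {r.get("lemma") for r in rdgs}
    pos_tags = {r.get("pos") for r in rdgs}
    msds = {r.get("msd") for r in rdgs}
    if len(lemmas) > 1:
        return "lexical"          # different lemmas
    if len(pos_tags) > 1:
        return "morphosyntactic"  # same lemma, different part of speech
    if len(msds) > 1:
        return "flexional"        # same lemma and POS, different morphology
    return "graphematic"          # only the spelling of the readings differs

# Two readings identical in lemma, POS and morphology: graphematic variation.
print(categorise([
    {"lemma": "mëisme", "pos": "ADJind", "msd": "NOMB.=s|GENRE=m|CAS=r"},
    {"lemma": "mëisme", "pos": "ADJind", "msd": "NOMB.=s|GENRE=m|CAS=r"},
]))  # graphematic
```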

All together

```bash
python3 main.py [--lemmatise] [--lang] [--engine] [--collate] [--categorise]

# Example
python3 main.py data/input --lemmatise --lang fro --engine pie --collate --categorise
```

Owner

  • Name: CondorCompPhil
  • Login: CondorCompPhil
  • Kind: organization

Citation (CITATION.CFF)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Camps"
  given-names: "Jean-Baptiste"
  orcid: "https://orcid.org/0000-0003-0385-7037"
- family-names: "Ing"
  given-names: "Lucence"
  orcid: "https://orcid.org/0000-0002-8742-3000"
- family-names: "Spadini"
  given-names: "Elena"
  orcid: "https://orcid.org/0000-0002-4522-2833"
title: "falcon"
version: 0.0.1-alpha
doi: 
date-released: 2021-12-13
url: "https://github.com/CondorCompPhil/falcon"

GitHub Events


Dependencies

requirements.txt pypi
  • Levenshtein ==0.16.0
  • collatex ==2.2
  • jinja2 ==3.0.3
  • lxml ==4.6.5
  • pie-extended ==0.0.40