https://github.com/centre-for-humanities-computing/graphematics

A collection of string processing and visualization scripts following the steps outlined in the "Handbuch zur graphematischen Rekonstruktion vormoderner Schreibsprachen".

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

A collection of string processing and visualization scripts following the steps outlined in the "Handbuch zur graphematischen Rekonstruktion vormoderner Schreibsprachen".

Basic Info
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created about 2 years ago · Last pushed 11 months ago
Metadata Files
Readme

README.md

Graphematic Analysis

This repository contains a collection of string processing and visualization scripts for following the steps of the Handbuch zur graphematischen Rekonstruktion.

The Python code in this repository can be used to replicate the analysis outlined in the handbook, with individual scripts responsible for performing separate steps of the analysis.

This approach is not entirely automated and requires manual input and inspection of files at certain steps. Where necessary, those steps are outlined below.

Requirements

To run this code locally on your own computer, it is recommended that you have Python ≥ 3.9 installed. The packages needed to run the code in this repository are listed in the requirements.txt file. These can be installed in the following way:

```bash
# update pip
python -m pip install --upgrade pip

# install the requirements
python -m pip install -r requirements.txt
```

**Advanced users**

We recommend installing packages from the requirements.txt file in a virtual environment to avoid potential conflicts with existing Python installations. A minimal script for this is provided in setup.sh, which should be sufficient for macOS and Linux. For Windows users, we recommend enabling the [Windows Subsystem for Linux](https://learn.microsoft.com/en-us/windows/wsl/about).

Performing the analysis

0. Preparing the document

The initial transcribed document should be saved in Word document format according to the instructions outlined in the Handbuch. There should preferably be no additional formatting, such as footnotes and margin notes.

This initial wordlist should be saved in the folder called data/0_raw_data.

1. Creating wordlists

With the transcribed document in place, we extract a wordlist from it. This wordlist counts how many times each individual word form appears in the document.

To do this, we run the following code, changing the FILENAME parameter to match the name of your own file:

```bash
python src/wordlist_extract.py --filename FILENAME
```

The results from this script are saved in the folder data/1_wordlists.
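The counting step can be sketched as a word-form frequency tally over the document's text. The function below is only an illustration: the whitespace tokenisation and lowercasing are assumptions, and the actual rules live in src/wordlist_extract.py.

```python
from collections import Counter

def count_word_forms(text: str) -> Counter:
    """Count how often each word form appears in a transcribed text.

    Illustrative sketch: whitespace tokenisation and lowercasing are
    assumptions; the real script may apply the Handbuch's own rules.
    """
    tokens = text.lower().split()
    return Counter(tokens)

# Toy transcription as an example
counts = count_word_forms("vnd der vnd daz der vnd")
print(counts.most_common(2))  # [('vnd', 3), ('der', 2)]
```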

2. Segmented wordlists

The next step is to take this extracted wordlist and to create a segmented word list. This requires expert domain knowledge and is done manually, according to the instructions laid out in the Handbuch.

Once completed, the new file should be saved in the folder called data/2_segmented_wordlists.

3. Parsing segments

Our next step takes the segmented wordlist and parses it so that each graph appears in a separate column in an Excel worksheet. This allows the graphs to be manually linked to specific sounds in the next steps of the analysis.

To produce the parsed, segmented wordlist, we run the following:

```bash
python src/word_parser.py --filename FILENAME
```

The results from this script are saved in the folder data/3_parsed_segmented.
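One way to turn segmented word forms into per-graph columns is to split on the segmentation marker and pad the rows to equal length. The separator (`|`) and the helper name below are assumptions made for illustration, not the script's actual interface.

```python
import pandas as pd

def segments_to_columns(words: list[str], sep: str = "|") -> pd.DataFrame:
    """Split segmented word forms so each graph lands in its own column.

    Hypothetical sketch: the separator and column naming are assumptions.
    """
    rows = [w.split(sep) for w in words]
    width = max(len(r) for r in rows)
    # pad shorter words with empty cells so all rows align
    padded = [r + [""] * (width - len(r)) for r in rows]
    cols = [f"graph_{i + 1}" for i in range(width)]
    return pd.DataFrame(padded, columns=cols)

df = segments_to_columns(["h|û|s", "d|a|z", "sch|rei|b|en"])
# the frame could then be written out with df.to_excel(...)
```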

4. Annotating the segments

The outputs from the previous step are manually inspected and annotated according to the instructions outlined in the Handbuch. In the resulting file, we have all of the individual graphs in the document aligned with their specific sound position.

This annotated, segmented wordlist should be saved in the folder data/4_annotated_segmented.

5. Clustering sound positions

We then take the annotated, segmented wordlist and calculate the occurrence of different sound positions in the new document. The output from this is a table showing all occurrences of each individual sound position, the words in which those occur, and the number of occurrences (of each word).

This is created using the following script:

```bash
python src/sound_position.py --filename FILENAME
```

The output from this step is saved in the folder called output/clustered_graph_list.
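The tallying described above can be sketched as grouping annotated graphs by sound position and summing the word counts. The tuple input shape `(sound_position, word, word_count)` is an assumption for illustration; the real script reads the annotated Excel files.

```python
from collections import defaultdict

def sound_position_table(annotations):
    """Group annotated word forms by sound position.

    `annotations` is a list of (sound_position, word, word_count) tuples;
    this input shape is an assumption made for illustration. Returns a
    mapping: sound position -> (total occurrences, [(word, count), ...]).
    """
    grouped = defaultdict(list)
    for position, word, count in annotations:
        grouped[position].append((word, count))
    return {pos: (sum(c for _, c in entries), entries)
            for pos, entries in grouped.items()}

data = [("/a:/", "strâze", 4), ("/a:/", "dâ", 2), ("/e/", "den", 5)]
totals = sound_position_table(data)
# totals["/a:/"] -> (6, [("strâze", 4), ("dâ", 2)])
```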

6. Manually rearranging data

As outlined in the Handbuch, the next step is to manually sort and re-arrange these outputs according to morpheme type and allographs.

The output from this step should be saved in the folder data/5_boxes_raw.

7a. Calculating leading graphs

The next step is to calculate the leading graph - in other words, the allograph which covers more than 50% of occurrences in the original document.

To calculate this value, we run the following, changing the input and output filenames as appropriate:

```bash
python src/leading_graph.py --filename FILENAME --output FILENAME
```

The result from this script will be saved in the folder called data/6_boxes.
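The leading-graph criterion itself is simple enough to sketch directly: an allograph is leading if it accounts for strictly more than half of all occurrences. The function below is an illustration of that criterion only; the actual script reads its input from the files prepared in the previous step.

```python
from typing import Optional

def leading_graph(counts: dict) -> Optional[str]:
    """Return the allograph covering more than 50% of all occurrences,
    or None if no allograph reaches that threshold.

    Sketch of the criterion described above, with counts passed in
    directly rather than read from a file.
    """
    total = sum(counts.values())
    for graph, n in counts.items():
        if n * 2 > total:  # strictly more than 50%
            return graph
    return None

print(leading_graph({"â": 7, "aa": 2, "a": 1}))  # prints â (covers 70%)
print(leading_graph({"â": 5, "aa": 5}))          # prints None (no majority)
```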

7b. Calculating distances

We can finally calculate the graphemic distance between any two graphemes as extracted from a document using the previous steps. To do this, we specify which graphemes we wish to compare.

To do this, we run the following code:

```bash
python3.11 src/graphematic.py --files FILE1 FILE2 ... --outfile FILENAME
```

For example, to compare hypothetical results for {â}_closed_syllable vs {â}_open_syllable:

```bash
python3.11 src/graphematic.py --files {â}_closed_syllable.xlsx {â}_open_syllable.xlsx --outfile results.xlsx
```

In any case, the results are saved in the folder called output/distance.
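The Handbuch defines its own graphemic distance, and src/graphematic.py implements that definition; as a loose illustration of comparing two graphemes' allograph distributions, one could use a total-variation-style distance over relative frequencies. Both the metric and the input shape below are assumptions, not the script's actual method.

```python
def distribution_distance(freq_a: dict, freq_b: dict) -> float:
    """Total-variation distance between two allograph distributions.

    Illustration only: the Handbuch defines its own graphemic distance;
    this sketch merely compares relative allograph frequencies, giving
    0.0 for identical distributions and 1.0 for disjoint ones.
    """
    keys = set(freq_a) | set(freq_b)
    total_a, total_b = sum(freq_a.values()), sum(freq_b.values())
    return 0.5 * sum(
        abs(freq_a.get(k, 0) / total_a - freq_b.get(k, 0) / total_b)
        for k in keys
    )

d = distribution_distance({"â": 8, "aa": 2}, {"â": 3, "aa": 7})
print(d)  # prints 0.5
```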

8. Visualizing vowel and consonant distributions

Finally, we can create simple barplots to show the distribution of vowels, consonants, vowel clusters and consonant clusters in our original document. For this, we only need the segmented wordlist created as part of step two above.

Focusing just on vowels, we can present the results either by ordering the vowel (clusters) alphabetically, or in order of descending size.

To create results with vowels arranged alphabetically:

```bash
python src/vowels_plot.py --filename FILENAME --alphabetical
```

To plot based on descending frequency, simply remove the final flag:

```bash
python src/vowels_plot.py --filename FILENAME
```

For consonants, the relevant script is src/consonants_plot.py, which can be run in the same way.

In each case, the visualizations are saved into the folder called output/graphs. A table of the same results is saved alongside this in the folder called output/frequencies.

Owner

  • Name: Center for Humanities Computing Aarhus
  • Login: centre-for-humanities-computing
  • Kind: organization
  • Email: chcaa@cas.au.dk
  • Location: Aarhus, Denmark

GitHub Events

Total
  • Push event: 9
Last Year
  • Push event: 9

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 30
  • Total Committers: 2
  • Avg Commits per committer: 15.0
  • Development Distribution Score (DDS): 0.133
Past Year
  • Commits: 9
  • Committers: 1
  • Avg Commits per committer: 9.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Ross Deans Kristensen-McLachlan r****m@c****k 26
AndersRoen 2****2@p****k 4
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

requirements.txt pypi
  • matplotlib ==3.8.2
  • numpy ==1.26.4
  • pandas ==2.2.0
  • python_docx ==1.1.0