https://github.com/centre-for-humanities-computing/graphematics
A collection of string processing and visualization scripts following the steps outlined in the "Handbuch zur graphematischen Rekonstruktion vormoderner Schreibsprachen".
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ✓ DOI references (found 1 DOI reference in README)
- ✓ Academic publication links (links to: zenodo.org)
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 13.7%, to scientific vocabulary)
Repository
A collection of string processing and visualization scripts following the steps outlined in the "Handbuch zur graphematischen Rekonstruktion vormoderner Schreibsprachen".
Basic Info
- Host: GitHub
- Owner: centre-for-humanities-computing
- Language: Python
- Default Branch: main
- Homepage: https://link.springer.com/book/10.57088/978-3-7329-8860-0
- Size: 441 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Graphematic Analysis
This repository contains a collection of string processing and visualization scripts that follow the steps of the Handbuch zur graphematischen Rekonstruktion.
The Python code in this repository can be used to replicate the analysis outlined in the handbook, with individual scripts responsible for performing separate steps of the analysis.
This approach is not entirely automated and requires manual input and inspection of files at certain steps. Where necessary, those steps are outlined below.
Requirements
To run this code locally on your own computer, you should have Python ≥ 3.9 installed. The packages required to run the code in this repository are listed in the requirements.txt file and can be installed in the following way:
```bash
# update pip
python -m pip install --upgrade pip

# install the requirements
python -m pip install -r requirements.txt
```
Advanced users
We recommend installing packages from the requirements.txt file in a virtual environment to avoid potential conflicts with existing Python installations. A minimal script for this is provided in setup.sh, which should be satisfactory for macOS and Linux. For Windows users, we recommend enabling the [Windows Subsystem for Linux](https://learn.microsoft.com/en-us/windows/wsl/about).
Performing the analysis
0. Preparing the document
The initial transcribed document should be saved in Word document format according to the instructions outlined in the Handbuch. The document should preferably contain no additional formatting, such as footnotes or margin notes.
This initial document should be saved in the folder called data/0_raw_data.
1. Creating wordlists
With the transcribed document in place, we go through it to extract a wordlist, which counts how many times each individual word form appears in the document.
To do this, we run the following code, changing the FILENAME parameter to match the name of your own file:
```bash
python src/wordlist_extract.py --filename FILENAME
```
The results from this script are saved in the folder data/1_wordlists.
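The counting itself is a plain word-form frequency tally. As an illustration only (this is not the contents of `wordlist_extract.py`, and it uses a plain string where the real script reads a Word document), the idea can be sketched as:

```python
from collections import Counter

def build_wordlist(text: str) -> Counter:
    """Count how many times each individual word form appears."""
    tokens = text.lower().split()
    return Counter(tokens)

# Hypothetical snippet of a transcribed text
wordlist = build_wordlist("thet hus was thet beste hus")
```

Every distinct word form becomes a key, with its number of occurrences as the value, which is exactly the shape of wordlist that the later steps consume.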
2. Segmented wordlists
The next step is to take this extracted wordlist and to create a segmented word list. This requires expert domain knowledge and is done manually, according to the instructions laid out in the Handbuch.
Once completed, the new file should be saved in the folder called data/2_segmented_wordlists.
3. Parsing segments
Our next step takes the segmented wordlist and parses it so that each graph appears in a separate column of an Excel worksheet. This allows the graphs to be manually linked to specific sounds in the next steps of the analysis.
To produce the parsed, segmented wordlist, we run the following:
```bash
python src/word_parser.py --filename FILENAME
```
The results from this script are saved in the folder data/3_parsed_segmented.
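Conceptually, this step spreads each word's segments across columns. A minimal sketch using pandas, under the assumption that segments are delimited by a marker such as `|` (the real segmentation follows the Handbuch, and the actual `word_parser.py` may differ):

```python
import pandas as pd

def parse_segments(words, sep="|"):
    """Place each graph of a segmented word in its own column."""
    rows = [word.split(sep) for word in words]
    frame = pd.DataFrame(rows)
    frame.columns = [f"graph_{i + 1}" for i in range(frame.shape[1])]
    return frame

# Hypothetical segmented word forms
frame = parse_segments(["th|e|t", "h|u|s"])
# frame.to_excel("parsed.xlsx", index=False)  # one graph per column
```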
4. Annotating the segments
The outputs from the previous step are manually inspected and annotated according to the instructions outlined in the Handbuch. In the resulting file, we have all of the individual graphs in the document aligned with their specific sound position.
This annotated, segmented wordlist should be saved in the folder data/4_annotated_segmented.
5. Clustering sound positions
We then take the annotated, segmented wordlist and calculate the occurrence of different sound positions in the new document. The output from this is a table showing all occurrences of each individual sound position, the words in which those occur, and the number of occurrences (of each word).
This is created using the following script:
```bash
python src/sound_position.py --filename FILENAME
```
The output from this step is saved in the folder called output/clustered_graph_list.
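The aggregation this step performs can be pictured as a group-by over the annotated rows. A hypothetical sketch with illustrative data and column names (not the actual `sound_position.py` script):

```python
import pandas as pd

# Hypothetical annotated rows: a graph's sound position, the word
# it occurs in, and that word's count from the wordlist
annotated = pd.DataFrame({
    "sound_position": ["a1", "a1", "u2"],
    "word": ["thet", "that", "hus"],
    "count": [3, 1, 2],
})

# Per sound position: the words it occurs in and total occurrences
table = (
    annotated.groupby("sound_position")
    .agg(words=("word", list), occurrences=("count", "sum"))
    .reset_index()
)
```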
6. Manually rearranging data
As outlined in the Handbuch, the next step is to manually sort and re-arrange these outputs according to morpheme type and allographs.
The output from this step should be saved in the folder data/5_boxes_raw.
7a. Calculating leading graphs
The next step is to calculate the leading graph, that is, the allograph which covers more than 50% of occurrences in the original document.
To calculate this value, we run the following, changing the filenames as needed:
```bash
python src/leading_graph.py --filename FILENAME --output FILENAME
```
The result from this script will be saved in the folder called data/6_boxes.
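The underlying rule, deciding whether one allograph accounts for more than half of all occurrences, can be sketched as follows (an illustration of the definition, not the code of `leading_graph.py`):

```python
from __future__ import annotations

def leading_graph(counts: dict[str, int]) -> str | None:
    """Return the allograph covering more than 50% of occurrences, if any."""
    total = sum(counts.values())
    for graph, n in counts.items():
        if n / total > 0.5:
            return graph
    return None  # no single allograph dominates

# Hypothetical allograph counts for one grapheme
print(leading_graph({"þ": 7, "th": 3}))  # þ covers 70%, so it leads
```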
7b. Calculating distances
We can finally calculate the graphemic distance between any two graphemes extracted from a document using the previous steps. To do this, we specify which graphemes we wish to compare.
To do this, we run the following code:
```bash
python3.11 src/graphematic.py --files FILE1 FILE2 ... --outfile FILENAME
```
For example, to compare hypothetical results for {â}closedsyllable vs {â}opensyllable:
```bash
python3.11 src/graphematic.py --files {â}_closed_syllable.xlsx {â}_open_syllable.xlsx --outfile results.xlsx
```
In any case, the results are saved in the folder called output/distance.
8. Visualizing vowel and consonant distributions
Finally, we can create simple barplots to show the distribution of vowels, consonants, vowel clusters and consonant clusters in our original document. For this, we only need the segmented wordlist created as part of step two above.
Focusing just on vowels, we can present the results either by ordering the vowels (and vowel clusters) alphabetically, or in order of descending frequency.
To create results with vowels arranged alphabetically:
```bash
python src/vowels_plot.py --filename FILENAME --alphabetical
```
To plot based on descending frequency, simply remove the final flag:
```bash
python src/vowels_plot.py --filename FILENAME
```
For consonants, the relevant script is src/consonants_plot.py, which can be run in the same way.
In each case, the visualizations are saved into the folder called output/graphs. A table of the same results is saved alongside this in the folder called output/frequencies.
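The two orderings differ only in how the bars are sorted before plotting. A minimal matplotlib sketch with hypothetical frequencies (not the actual `vowels_plot.py` script):

```python
import matplotlib
matplotlib.use("Agg")  # render to file without a display
import matplotlib.pyplot as plt

# Hypothetical vowel (cluster) frequencies from a segmented wordlist
freqs = {"a": 12, "e": 30, "ei": 5, "o": 9}

def plot_frequencies(freqs, outfile, alphabetical=False):
    if alphabetical:
        items = sorted(freqs.items())                         # a, e, ei, o
    else:
        items = sorted(freqs.items(), key=lambda kv: -kv[1])  # by frequency
    labels, values = zip(*items)
    fig, ax = plt.subplots()
    ax.bar(labels, values)
    ax.set_xlabel("vowel (cluster)")
    ax.set_ylabel("occurrences")
    fig.savefig(outfile)
    plt.close(fig)

plot_frequencies(freqs, "vowels.png", alphabetical=True)
```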
Owner
- Name: Center for Humanities Computing Aarhus
- Login: centre-for-humanities-computing
- Kind: organization
- Email: chcaa@cas.au.dk
- Location: Aarhus, Denmark
- Website: https://chc.au.dk/
- Repositories: 130
- Profile: https://github.com/centre-for-humanities-computing
GitHub Events
Total
- Push event: 9
Last Year
- Push event: 9
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Ross Deans Kristensen-McLachlan | r****m@c****k | 26 |
| AndersRoen | 2****2@p****k | 4 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- matplotlib ==3.8.2
- numpy ==1.26.4
- pandas ==2.2.0
- python_docx ==1.1.0