greek-complexity

Query XML treebank to explore syntactic complexity in Ancient Greek

https://github.com/nevenjovanovic/greek-complexity


Repository

Basic Info
  • Host: GitHub
  • Owner: nevenjovanovic
  • License: cc-by-4.0
  • Language: XQuery
  • Default Branch: main
  • Size: 9.58 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 1
  • Open Issues: 0
  • Releases: 1
Created almost 4 years ago · Last pushed about 3 years ago

README.md

Linguistic complexity in ancient Greek - Sentence complexity and grammar

We query a set of Greek texts, hand-encoded for morphology and syntax (as treebanks) by Vanessa Gorman, to explore the complexity of Greek sentences. The treebanks and queries in this repository are published under a CC-BY license.

Contents

  • The encoded texts (Alpheios dependency scheme), cloned from the Greek-Dependency-Trees repository, are in the data directory
  • Various XQuery scripts to transform and analyze the files are in scripts
  • Reports made by scripts are in info

How to use

Download the files or clone the repository. Install the BaseX XML database.

In BaseX, run the script create-grccomp-db.xq to create the grc-com database. Query the database by running the other scripts in the scripts/xq directory. Adapt the scripts to your own queries as needed.

A list of queries (from simple to complex)

Create DB, get some statistics

  1. Create the grc-com database: create-grccomp-db.xq
  2. Get basic information about the database (number of words, sentences, and documents): db-basic-info.xq
  3. Get stats on sentence length: db-stats-sentence.xq
  4. Get stats on relations: db-stats-relations
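
The kind of figures these first scripts report can be illustrated outside BaseX as well. The following is a minimal Python sketch, not the repository's XQuery: the element names `sentence` and `word`, the inline `SAMPLE` document, and the function name `basic_info` are assumptions made for illustration, and every `word` element (including punctuation) is counted as a word.

```python
import xml.etree.ElementTree as ET
from statistics import mean, median

# Hypothetical two-sentence treebank; the real files are in the data directory.
SAMPLE = """<treebank>
  <sentence id="1">
    <word id="1" form="ὁ" lemma="ὁ" postag="l-s---mn-" head="2" relation="ATR"/>
    <word id="2" form="ἀνὴρ" lemma="ἀνήρ" postag="n-s---mn-" head="3" relation="SBJ"/>
    <word id="3" form="λέγει" lemma="λέγω" postag="v3spia---" head="0" relation="PRED"/>
    <word id="4" form="." lemma="." postag="u--------" head="0" relation="AuxK"/>
  </sentence>
  <sentence id="2">
    <word id="1" form="καλόν" lemma="καλός" postag="a-s---nn-" head="0" relation="PRED"/>
    <word id="2" form="." lemma="." postag="u--------" head="0" relation="AuxK"/>
  </sentence>
</treebank>"""


def basic_info(root):
    """Count sentences and words and summarize sentence length."""
    # One length per sentence: the number of its direct <word> children.
    lengths = [len(s.findall("word")) for s in root.iter("sentence")]
    return {
        "sentences": len(lengths),
        "words": sum(lengths),
        "mean_length": mean(lengths),
        "median_length": median(lengths),
        "max_length": max(lengths),
    }


info = basic_info(ET.fromstring(SAMPLE))
```

To try this on a real file, `ET.parse(path).getroot()` could replace `ET.fromstring(SAMPLE)`, provided the file follows the assumed structure.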

Statistics on syntactic relations

  1. Which parts of speech (POS) take the role of PRED (and similar): list-pred-types.xq
  2. Which parts of speech (POS) take the role of COORD (and similar): list-coord-types.xq
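
As a rough illustration of the idea behind these two queries, the Python sketch below tallies the part of speech (first character of `postag`) of every word whose relation starts with a given prefix. It is not the repository's XQuery; the element names, the `SAMPLE` data, the function name `pos_by_relation`, and the reading of "and similar" as a prefix match (so that `PRED_CO` counts with `PRED`) are all assumptions.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Part-of-speech codes from position 1 of the postag (see the AGDT data format).
POS_NAMES = {
    "n": "noun", "v": "verb", "t": "participle", "a": "adjective",
    "d": "adverb", "l": "article", "g": "particle", "c": "conjunction",
    "r": "preposition", "p": "pronoun", "m": "numeral",
    "i": "interjection", "e": "exclamation", "u": "punctuation",
}

# Hypothetical sample data, for illustration only.
SAMPLE = """<treebank>
  <sentence id="1">
    <word id="1" form="λέγει" lemma="λέγω" postag="v3spia---" head="2" relation="PRED_CO"/>
    <word id="2" form="καὶ" lemma="καί" postag="c--------" head="0" relation="COORD"/>
    <word id="3" form="γράφει" lemma="γράφω" postag="v3spia---" head="2" relation="PRED_CO"/>
  </sentence>
  <sentence id="2">
    <word id="1" form="καλόν" lemma="καλός" postag="a-s---nn-" head="0" relation="PRED"/>
    <word id="2" form="." lemma="." postag="u--------" head="0" relation="AuxK"/>
  </sentence>
</treebank>"""


def pos_by_relation(root, prefix):
    """Count parts of speech of words whose relation starts with `prefix`."""
    counts = Counter()
    for word in root.iter("word"):
        if word.get("relation", "").startswith(prefix):
            code = word.get("postag", "")[:1] or "-"
            # Unknown codes are kept as-is rather than dropped.
            counts[POS_NAMES.get(code, code)] += 1
    return counts


root = ET.fromstring(SAMPLE)
pred_pos = pos_by_relation(root, "PRED")
coord_pos = pos_by_relation(root, "COORD")
```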

Analyse lemmata and their functions

  1. For a subset of sentences (based on number of elements, words etc), list lemmata: list-lemmata.xq
  2. For a lemma in a subset of sentences (based on number of elements), list its syntactic relations: lemma-list-functions.xq
  3. For a specific syntactic relation of a lemma in a subset, list all sentences: relation-lemma-12-18-words.xq
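
The three steps above (select a subset by size, pick a lemma, list its relations) can be sketched in Python as follows. This is an illustration under assumptions, not the repository's code: the element names, the `SAMPLE` data, and the function names are hypothetical, and sentence size is measured here simply as the number of `word` elements. The default bounds of 12 and 18 mirror the defaults mentioned in this README.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample data with deliberately short sentences.
SAMPLE = """<treebank>
  <sentence id="1">
    <word id="1" form="λέγει" lemma="λέγω" postag="v3spia---" head="2" relation="PRED_CO"/>
    <word id="2" form="καὶ" lemma="καί" postag="c--------" head="0" relation="COORD"/>
    <word id="3" form="γράφει" lemma="γράφω" postag="v3spia---" head="2" relation="PRED_CO"/>
  </sentence>
  <sentence id="2">
    <word id="1" form="καὶ" lemma="καί" postag="d--------" head="4" relation="AuxY"/>
    <word id="2" form="ὁ" lemma="ὁ" postag="l-s---mn-" head="3" relation="ATR"/>
    <word id="3" form="ἀνὴρ" lemma="ἀνήρ" postag="n-s---mn-" head="4" relation="SBJ"/>
    <word id="4" form="λέγει" lemma="λέγω" postag="v3spia---" head="0" relation="PRED"/>
    <word id="5" form="." lemma="." postag="u--------" head="0" relation="AuxK"/>
  </sentence>
</treebank>"""


def sentences_by_length(root, min_words=12, max_words=18):
    """Return sentences whose word count lies within the given bounds."""
    return [
        s for s in root.iter("sentence")
        if min_words <= len(s.findall("word")) <= max_words
    ]


def lemma_relations(root, lemma, min_words=12, max_words=18):
    """Count the syntactic relations of `lemma` within the size-limited subset."""
    counts = Counter()
    for sentence in sentences_by_length(root, min_words, max_words):
        for word in sentence.findall("word"):
            if word.get("lemma") == lemma:
                counts[word.get("relation")] += 1
    return counts


root = ET.fromstring(SAMPLE)
# The sample sentences are short, so the bounds are lowered for the demo.
kai_all = lemma_relations(root, "καί", min_words=3, max_words=5)
kai_short = lemma_relations(root, "καί", min_words=3, max_words=3)
```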

Retrieve specific syntactic features

  1. Find sentences with all basic roles (PRED, SBJ, OBJ, ADV): find-sentences-all-basic-roles.xq
  2. Find sentences with ellipsis (a role is missing and is artificially added during annotation), exactly 6 sentence elements: find-ellipsis.xq
  3. Find sentences with 12 words or fewer where PRED is an adjective: find-sentences-with-pred-adj.xq
  4. Find sentences with 12 words or fewer where PRED is a conjunction: find-sentences-with-pred-conj.xq
  5. Find sentences with 15 words or fewer without PRED: find-sentences-no-pred.xq
  6. Find sentences with PRED and COORD dependent on the sentence root: find-pred-coord-0.xq
  7. Find sentences with 12 words or fewer where the article is not ATR (or its variations): find-article-not-atr.xq
  8. Find sentences with COORD by asyndeton (u): find-coord-sentences-asyndeton.xq
  9. Find sentences with PRED_CO: find-coord-pred-co.xq
  10. Find sentences with some number of words where some word has some _CO function: find-suffix-co.xq
  11. Find infinitive used as PRED: find-pred-inf.xq
  12. Find sentences without AuxY: find-sentences-no-auxy.xq
  13. Find sentences with many AuxY: find-sentences-with-many-auxy.xq
  14. Find sentences without OBJ, PNOM, SBJ (and combinations): find-no-sbj-obj-pnom.xq
  15. Find sentences without nouns or adjectives: find-no-nouns.xq
  16. List syntactic roles of participles with frequencies of occurrences: find-participles-roles.xq
  17. Find substantivated participles: find-participles-substantivated.xq
  18. Find substantivated infinitives: find-infinitives-substantivated.xq
  19. Find sentences where the article is head: find-sentences-with-subst-expr.xq
  20. Find sentences with transitive verbs as PRED without OBJ: find-sentences-no-obj.xq; the list of transitive verbs was compiled with find-verbs-obj.xq
  21. Find verbs ruling PNOM which appear without PNOM as well: find-sentences-no-pnom.xq; the list of verbs ruling PNOM was compiled with find-pnom-pred.xq
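
Two of the feature queries listed above (sentences without PRED, and sentences with PRED and COORD dependent on the sentence root) are sketched below in Python to show the general filtering pattern. This is not the repository's XQuery. The element names, the `SAMPLE` data, and the function names are assumptions, as is the decision to treat any relation beginning with `PRED` (for example `PRED_CO`) as a predicate in the first function.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, for illustration only.
SAMPLE = """<treebank>
  <sentence id="1">
    <word id="1" form="λέγει" lemma="λέγω" postag="v3spia---" head="0" relation="PRED"/>
    <word id="2" form="καὶ" lemma="καί" postag="c--------" head="0" relation="COORD"/>
    <word id="3" form="γράφει" lemma="γράφω" postag="v3spia---" head="2" relation="PRED_CO"/>
  </sentence>
  <sentence id="2">
    <word id="1" form="ἄνδρες" lemma="ἀνήρ" postag="n-p---mv-" head="0" relation="ExD"/>
    <word id="2" form="." lemma="." postag="u--------" head="0" relation="AuxK"/>
  </sentence>
  <sentence id="3">
    <word id="1" form="ἀνὴρ" lemma="ἀνήρ" postag="n-s---mn-" head="2" relation="SBJ"/>
    <word id="2" form="λέγει" lemma="λέγω" postag="v3spia---" head="0" relation="PRED"/>
  </sentence>
</treebank>"""


def sentences_without_pred(root, max_words=15):
    """Ids of sentences with at most `max_words` words and no PRED-like relation."""
    found = []
    for sentence in root.iter("sentence"):
        words = sentence.findall("word")
        if len(words) > max_words:
            continue
        # Assumption: PRED, PRED_CO, etc. all count as a predicate.
        if not any(w.get("relation", "").startswith("PRED") for w in words):
            found.append(sentence.get("id"))
    return found


def pred_and_coord_on_root(root):
    """Ids of sentences where both a PRED and a COORD have head 0 (sentence root)."""
    found = []
    for sentence in root.iter("sentence"):
        on_root = {
            w.get("relation")
            for w in sentence.findall("word")
            if w.get("head") == "0"
        }
        if {"PRED", "COORD"} <= on_root:
            found.append(sentence.get("id"))
    return found


root = ET.fromstring(SAMPLE)
no_pred = sentences_without_pred(root)
pred_coord = pred_and_coord_on_root(root)
```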

Results

  • Database: grc-com
  • Date: 2022-06-02+02:00
  • Documents: 153
  • Sentences: 26781
  • Words: 633763
  1. Stats on relations: relations-stats.md
  2. Stats on PRED: pred-stats.md
  3. Stats on COORD: coord-stats.md
  4. Sentences with all basic roles (PRED, SBJ, OBJ, ADV) expressed: sentences-basic-roles.md
  5. Sentences with ellipsis (artificially added elements), 6 sentence elements: sentences-ellipsis-6.md
  6. Sentences with PRED adjective: sentences-pred-adj.md
  7. Sentences with PRED conjunction: sentences-pred-c.md
  8. Sentences without PRED relation: sentences-no-pred.md
  9. Sentences where the article is not ATR: sentences-article-not-atr.md
  10. Sentences with COORD performed by punctuation (asyndeton): sentences-coord-asyndeton.md
  11. Sentences with PRED_CO: sentences-pred-co.md
  12. Sentences with infinitives used as PRED: sentences-inf-pred.md
  13. Sentences without AuxY (particles): sentences-no-auxy.md
  14. Sentences with many AuxY: sentences-many-auxy.md
  15. Sentences without OBJ, PNOM, SBJ (and combinations): no-sbj-obj-pnom.md
  16. Sentences without nouns or adjectives: no-nouns-adj.md
  17. Sentences with transitive verbs (active) as PRED, no OBJ: sentences-trans-no-obj.md
  18. Syntactic roles of participles: roles-participles.md
  19. Sentences with substantivated participles: subst-participles.md
  20. Sentences with substantivated infinitives: subst-inf.md
  21. Sentences where the article is head: article-head.md
  22. Sentences with verbs taking PNOM in which the verbs are PRED but have no PNOM: pnom-no-pnom.md

On a server

  1. Landing page with list of functions
  2. Basic information on treebanks
  3. Retrieve a subset of sentences based on word count (default: 12 to 18 elements)
  4. List lemmata in a subset of sentences (default: 12 to 18 elements)
  5. List relations (sentence functions) for a lemma (default: καί, 12 to 18 elements)
  6. For a relation of a lemma, list sentences in subset (default: καί as PRED, 12 to 18 elements)
  7. Retrieve a subset of sentences without participles
  8. Retrieve a subset of sentences without participles and subordinate conjunctions
  9. Retrieve a subset of sentences without participles, infinitives, and subordinate conjunctions
  10. Retrieve a subset based on number of words, with PRED and COORD dependent on sentence root

Modules and functions for web application (RESTXQ)

  1. Modules (xqm, directory /scripts/webapp/repo/)
    1. Functions for analysing treebanks (in general): grccom-analysis.xqm
    2. Functions for displaying HTML (in general): grccom.xqm
  2. Functions for individual pages (xq, directory /scripts/webapp/app/grccom)
    1. Landing page: grccom-home.xq
    2. Basic information on database: grccom-basic-ana.xq

AGDT data format

For syntactic roles, see the description by Giuseppe G. A. Celano, Guidelines for the Ancient Greek Dependency Treebank 2.0.

```
    Data Format

    The data given in this treebank is provided as an XML document.  Each 
    word contains seven required attributes:

    id: This is a unique identifier, and corresponds to the word's linear 
    position in the sentence.  The first word in a sentence is given 
    id 1.

    cid: This is a canonical identifier for the word within the larger corpus.

    form: The token form of the word.

    lemma: The base lemma from which the word is derived, in Beta Code.

    head: The id of the word's parent.  If a word depends on the sentence 
    root, its head is 0.

    relation: The syntactic relation between the word and its parent.  A 
    catalogue of syntactic tags can be found in the syntactic guidelines 
    described below.

    postag: The morphological analysis for the word.  This field is 9 
    characters long, and corresponds to the following morphological 
    features:

        1:  part of speech

            n   noun
            v   verb
            t   participle
            a   adjective
            d   adverb
            l   article
            g   particle
            c   conjunction
            r   preposition
            p   pronoun
            m   numeral
            i   interjection
            e   exclamation
            u   punctuation

        2:  person

            1   first person
            2   second person
            3   third person

        3:  number

            s   singular
            p   plural
            d   dual

        4:  tense

            p   present
            i   imperfect
            r   perfect
            l   pluperfect
            t   future perfect
            f   future
            a   aorist

        5:  mood

            i   indicative
            s   subjunctive
            o   optative
            n   infinitive
            m   imperative
            p   participle

        6:  voice

            a   active
            p   passive
            m   middle
            e   medio-passive

        7:  gender

            m   masculine
            f   feminine
            n   neuter

        8:  case

            n   nominative
            g   genitive
            d   dative
            a   accusative
            v   vocative
            l   locative

        9:  degree

            c   comparative
            s   superlative

        ---

        For example, the postag for the noun "a)/ndra" is "n-s---ma-", 
        which corresponds to the following features:

        1: n    noun
        2: -
        3: s    singular
        4: -
        5: -
        6: -
        7: m    masculine
        8: a    accusative
        9: -

```
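
The positional encoding described above lends itself to a small lookup-based decoder. The following Python sketch is offered only as an illustration of the format; the names `POSTAG_FIELDS` and `decode_postag` are hypothetical and do not come from the repository, while the code tables are copied from the description above.

```python
# One (feature name, code table) pair per postag position, in order.
POSTAG_FIELDS = [
    ("pos", {"n": "noun", "v": "verb", "t": "participle", "a": "adjective",
             "d": "adverb", "l": "article", "g": "particle",
             "c": "conjunction", "r": "preposition", "p": "pronoun",
             "m": "numeral", "i": "interjection", "e": "exclamation",
             "u": "punctuation"}),
    ("person", {"1": "first person", "2": "second person", "3": "third person"}),
    ("number", {"s": "singular", "p": "plural", "d": "dual"}),
    ("tense", {"p": "present", "i": "imperfect", "r": "perfect",
               "l": "pluperfect", "t": "future perfect", "f": "future",
               "a": "aorist"}),
    ("mood", {"i": "indicative", "s": "subjunctive", "o": "optative",
              "n": "infinitive", "m": "imperative", "p": "participle"}),
    ("voice", {"a": "active", "p": "passive", "m": "middle",
               "e": "medio-passive"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "v": "vocative", "l": "locative"}),
    ("degree", {"c": "comparative", "s": "superlative"}),
]


def decode_postag(postag):
    """Translate a 9-character postag into a dict of named features.

    Positions holding "-" are skipped; unknown codes are kept verbatim.
    """
    if len(postag) != len(POSTAG_FIELDS):
        raise ValueError(f"expected 9 characters, got {len(postag)}: {postag!r}")
    decoded = {}
    for (name, codes), char in zip(POSTAG_FIELDS, postag):
        if char != "-":
            decoded[name] = codes.get(char, char)
    return decoded


# The example from the description above: the noun "a)/ndra".
andra = decode_postag("n-s---ma-")
# → {'pos': 'noun', 'number': 'singular', 'gender': 'masculine', 'case': 'accusative'}
```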

Editor of this repository

  • Neven Jovanović (nevenjovanovic), Department of Classical Philology, Faculty of Humanities and Social Sciences, University of Zagreb; orcid.org/0000-0002-9119-399X

Owner

  • Name: Neven Jovanović
  • Login: nevenjovanovic
  • Kind: user
  • Location: Zagreb, Croatia
  • Company: University of Zagreb, Faculty of Humanities and Social Sciences

Classical philologist, university teacher of Greek and Latin. Digital humanities, textual editing. University of Zagreb, Croatia.

Citation (CITATION.cff)

cff-version: 1.1.0
message: "If you use this collection of scripts and data, please cite it as below."
authors:
  - family-names: Jovanović
    given-names: Neven
    orcid: https://orcid.org/0000-0002-9119-399X
title: "nevenjovanovic/greek-complexity: Complexity in Ancient Greek Treebanks, First Set of Queries (number of elements, lemmata, relations)"
version: v1.0.0
date-released: 2023-01-14
