greek-complexity

Query XML treebank to explore syntactic complexity in Ancient Greek

https://github.com/nevenjovanovic/greek-complexity


Repository

Basic Info

Host: GitHub
Owner: nevenjovanovic
License: cc-by-4.0
Language: XQuery
Default Branch: main
Size: 9.58 MB

Statistics

Stars: 1
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 1

Created about 4 years ago · Last pushed over 3 years ago

Linguistic complexity in ancient Greek - Sentence complexity and grammar

We query a set of Greek texts, hand-annotated for morphology and syntax (as treebanks) by Vanessa Gorman, to explore the complexity of Greek sentences. The treebanks and queries in this repository are published under a CC-BY license.

How to use

Download the files or clone the repository. Install the BaseX XML database.

In BaseX, run the script create-grccomp-db.xq to create the grc-com database. Query the database by running other scripts in the scripts/xq directory. Adapt the scripts to query as needed.

A list of queries (from simple to complex)

Create DB, get some statistics

Create the grc-com database: create-grccomp-db.xq
Get basic information about the database (number of words, sentences, and documents): db-basic-info.xq
Get stats on sentence length: db-stats-sentence.xq
Get stats on relations: db-stats-relations
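
The kind of counts these scripts report can be sketched outside BaseX as well. The Python example below is only an illustration, not a port of the repository's XQuery: it assumes that each treebank file wraps `word` elements in `sentence` elements, it counts punctuation tokens as words, and the sample sentences and the function name `basic_info` are invented for the example.

```python
import statistics
import xml.etree.ElementTree as ET

# Invented toy data in the assumed <sentence>/<word> layout; not taken from the corpus.
SAMPLE = """
<treebank>
  <sentence id="1">
    <word id="1" form="ὁ" lemma="o(" postag="l-s---mn-" relation="ATR" head="2"/>
    <word id="2" form="ἄνθρωπος" lemma="a)/nqrwpos" postag="n-s---mn-" relation="SBJ" head="3"/>
    <word id="3" form="γράφει" lemma="gra/fw" postag="v3spia---" relation="PRED" head="0"/>
    <word id="4" form="." lemma="." postag="u--------" relation="AuxK" head="0"/>
  </sentence>
  <sentence id="2">
    <word id="1" form="γράφει" lemma="gra/fw" postag="v3spia---" relation="PRED" head="0"/>
    <word id="2" form="." lemma="." postag="u--------" relation="AuxK" head="0"/>
  </sentence>
</treebank>
"""


def basic_info(xml_text):
    """Count sentences and words and summarise sentence length (in tokens)."""
    root = ET.fromstring(xml_text)
    sentences = root.findall(".//sentence")
    # Every <word> child is counted, punctuation included.
    lengths = [len(s.findall("word")) for s in sentences]
    return {
        "sentences": len(sentences),
        "words": sum(lengths),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "mean_length": statistics.mean(lengths),
    }


info = basic_info(SAMPLE)
print(info)
```

Whether punctuation should count towards sentence length is a design choice; filtering on a `postag` that starts with `u` would exclude it.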

Statistics on syntactic relations

Which parts of speech (POS) have the role of PRED (and similar): list-pred-types.xq
Which parts of speech (POS) have the role of COORD (and similar): list-coord-types.xq
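
As a rough illustration of what such a listing involves, the Python sketch below tallies the part of speech (first character of `postag`) of every word whose relation starts with a given prefix. Reading "and similar" as a prefix match (so that PRED also covers PRED_CO) is an assumption, as are the toy data and the function name `pos_for_relation`; the repository's XQuery scripts may define the groups differently.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Invented toy data; forms are omitted because only postag and relation matter here.
SAMPLE = """
<treebank>
  <sentence id="1">
    <word id="1" lemma="a)/nqrwpos" postag="n-s---mn-" relation="SBJ" head="2"/>
    <word id="2" lemma="gra/fw" postag="v3spia---" relation="PRED" head="0"/>
  </sentence>
  <sentence id="2">
    <word id="1" lemma="kalo/s" postag="a-s---mn-" relation="PRED" head="0"/>
    <word id="2" lemma="a)/nqrwpos" postag="n-s---mn-" relation="SBJ" head="1"/>
  </sentence>
  <sentence id="3">
    <word id="1" lemma="gra/fw" postag="v3spia---" relation="PRED_CO" head="2"/>
    <word id="2" lemma="kai/" postag="c--------" relation="COORD" head="0"/>
    <word id="3" lemma="o(ra/w" postag="v3spia---" relation="PRED_CO" head="2"/>
  </sentence>
</treebank>
"""


def pos_for_relation(xml_text, prefix):
    """Tally POS codes of words whose relation starts with `prefix`."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for word in root.iter("word"):
        if word.get("relation", "").startswith(prefix):
            # The first postag character encodes the part of speech.
            counts[word.get("postag", "")[:1]] += 1
    return counts


pred_counts = pos_for_relation(SAMPLE, "PRED")
coord_counts = pos_for_relation(SAMPLE, "COORD")
print(pred_counts, coord_counts)
```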

Analyse lemmata and their functions

For a subset of sentences (based on number of elements, words etc), list lemmata: list-lemmata.xq
For a lemma in a subset of sentences (based on number of elements), list its syntactic relations: lemma-list-functions.xq
For a specific syntactic relation of lemma in a subset, list all sentences: relation-lemma-12-18-words.xq
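
The logic behind these lemma queries can be sketched as follows: select sentences by size, then collect the relations in which a lemma occurs. This Python example is a hedged illustration only. The word-count bounds, the toy data, and the function name `lemma_relations` are assumptions, and the sentence size is measured in `word` elements, which may differ from the "number of elements" used by the scripts.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Invented toy data; lemmata are written in Beta Code as in the data format.
SAMPLE = """
<treebank>
  <sentence id="1">
    <word id="1" form="ὁ" lemma="o(" postag="l-s---mn-" relation="ATR" head="2"/>
    <word id="2" form="ἄνθρωπος" lemma="a)/nqrwpos" postag="n-s---mn-" relation="SBJ" head="3"/>
    <word id="3" form="γράφει" lemma="gra/fw" postag="v3spia---" relation="PRED" head="0"/>
  </sentence>
  <sentence id="2">
    <word id="1" form="ἄνθρωπον" lemma="a)/nqrwpos" postag="n-s---ma-" relation="OBJ" head="2"/>
    <word id="2" form="ὁρᾷ" lemma="o(ra/w" postag="v3spia---" relation="PRED" head="0"/>
  </sentence>
  <sentence id="3">
    <word id="1" form="ἄνθρωπε" lemma="a)/nqrwpos" postag="n-s---mv-" relation="ExD" head="0"/>
  </sentence>
</treebank>
"""


def lemma_relations(xml_text, lemma, min_words, max_words):
    """Count the relations of `lemma` in sentences of min_words..max_words words."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for sentence in root.iter("sentence"):
        words = sentence.findall("word")
        if not (min_words <= len(words) <= max_words):
            continue  # sentence is outside the requested size range
        for word in words:
            if word.get("lemma") == lemma:
                counts[word.get("relation")] += 1
    return counts


relations = lemma_relations(SAMPLE, "a)/nqrwpos", 2, 3)
print(relations)
```

In the toy data the one-word sentence is skipped, so only the SBJ and OBJ occurrences of the lemma are counted.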

Retrieve specific syntactic features

Find sentences with all basic roles (PRED, SBJ, OBJ, ADV): find-sentences-all-basic-roles.xq
Find sentences with ellipsis (a role is missing and is artificially added during annotation), exactly 6 sentence elements: find-ellipsis.xq
Find sentences with 12 words or fewer where PRED is an adjective: find-sentences-with-pred-adj.xq
Find sentences with 12 words or fewer where PRED is a conjunction: find-sentences-with-pred-conj.xq
Find sentences with 15 words or fewer without PRED: find-sentences-no-pred.xq
Find sentences with PRED and COORD dependent on the sentence root: find-pred-coord-0.xq
Find sentences with 12 words or fewer where the article is not ATR (or its variations): find-article-not-atr.xq
Find sentences with COORD by asyndeton (u): find-coord-sentences-asyndeton.xq
Find sentences with PRED_CO: find-coord-pred-co.xq
Find sentences with some number of words where some word has some _CO function: find-suffix-co.xq
Find infinitive used as PRED: find-pred-inf.xq
Find sentences without AuxY: find-sentences-no-auxy.xq
Find sentences with many AuxY: find-sentences-with-many-auxy.xq
Find sentences without OBJ, PNOM, SBJ (and combinations): find-no-sbj-obj-pnom.xq
Find sentences without nouns or adjectives: find-no-nouns.xq
List syntactic roles of participles with frequencies of occurrences: find-participles-roles.xq
Find substantivated participles: find-participles-substantivated.xq
Find substantivated infinitives: find-infinitives-substantivated.xq
Find sentences where article is head: find-sentences-with-subst-expr.xq
Find sentences with transitive verbs as PRED without OBJ: find-sentences-no-obj.xq; the list of transitive verbs was compiled with find-verbs-obj.xq
Find verbs ruling PNOM which appear without PNOM as well: find-sentences-no-pnom.xq; the list of verbs ruling PNOM was compiled with find-pnom-pred.xq
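
Most of the queries in this list share one pattern: restrict the sentences by length, then test the words of each sentence for a feature. The Python sketch below shows that pattern for two of the cases (no PRED, PRED as adjective). It is an assumption-laden illustration rather than a rendering of the repository's XQuery: the helper names, the toy data, and the choice to treat any relation starting with PRED as a predicate are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented toy data in the assumed <sentence>/<word> layout.
SAMPLE = """
<treebank>
  <sentence id="1">
    <word id="1" form="ὁ" lemma="o(" postag="l-s---mn-" relation="ATR" head="2"/>
    <word id="2" form="ἄνθρωπος" lemma="a)/nqrwpos" postag="n-s---mn-" relation="SBJ" head="3"/>
    <word id="3" form="γράφει" lemma="gra/fw" postag="v3spia---" relation="PRED" head="0"/>
  </sentence>
  <sentence id="2">
    <word id="1" form="καλὸς" lemma="kalo/s" postag="a-s---mn-" relation="PRED" head="0"/>
    <word id="2" form="ὁ" lemma="o(" postag="l-s---mn-" relation="ATR" head="3"/>
    <word id="3" form="ἄνθρωπος" lemma="a)/nqrwpos" postag="n-s---mn-" relation="SBJ" head="1"/>
  </sentence>
  <sentence id="3">
    <word id="1" form="ἄνθρωπε" lemma="a)/nqrwpos" postag="n-s---mv-" relation="ExD" head="0"/>
  </sentence>
</treebank>
"""


def sentences_matching(xml_text, max_words, test):
    """Return ids of sentences with at most max_words words that satisfy `test`."""
    root = ET.fromstring(xml_text)
    hits = []
    for sentence in root.iter("sentence"):
        words = sentence.findall("word")
        if len(words) <= max_words and test(words):
            hits.append(sentence.get("id"))
    return hits


def no_pred(words):
    """True if no word carries PRED or a PRED-prefixed relation such as PRED_CO."""
    return not any(w.get("relation", "").startswith("PRED") for w in words)


def pred_is_adjective(words):
    """True if some word is PRED and its postag marks it as an adjective (a)."""
    return any(
        w.get("relation") == "PRED" and w.get("postag", "").startswith("a")
        for w in words
    )


without_pred = sentences_matching(SAMPLE, 15, no_pred)
adjective_pred = sentences_matching(SAMPLE, 12, pred_is_adjective)
print(without_pred, adjective_pred)
```

Passing the feature test as a function keeps the length filter in one place, so further cases (for example, words whose `head` is 0) only need another small predicate.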

Results

Database: grc-com
Date: 2022-06-02+02:00
Documents: 153
Sentences: 26781
Words: 633763

Stats on relations: relations-stats.md
Stats on PRED: pred-stats.md
Stats on COORD: coord-stats.md
Sentences with all basic roles (PRED, SBJ, OBJ, ADV) expressed: sentences-basic-roles.md
Sentences with ellipsis (artificially added elements), 6 sentence elements: sentences-ellipsis-6.md
Sentences with PRED adjective: sentences-pred-adj.md
Sentences with PRED conjunction: sentences-pred-c.md
Sentences without PRED relation: sentences-no-pred.md
Sentences where the article is not ATR: sentences-article-not-atr.md
Sentences with COORD performed by punctuation (asyndeton): sentences-coord-asyndeton.md
Sentences with PRED_CO: sentences-pred-co.md
Sentences with infinitives used as PRED: sentences-inf-pred.md
Sentences without AuxY (particles): sentences-no-auxy.md
Sentences with many AuxY: sentences-many-auxy.md
Sentences without OBJ, PNOM, SBJ (and combinations): no-sbj-obj-pnom.md
Sentences without nouns or adjectives: no-nouns-adj.md
Sentences with transitive verbs (active) as PRED, no OBJ: sentences-trans-no-obj.md
Syntactic roles of participles: roles-participles.md
Sentences with substantivated participles: subst-participles.md
Sentences with substantivated infinitives: subst-inf.md
Sentences where article is head: article-head.md
Sentences with verbs taking PNOM in which the verbs are PRED but have no PNOM: pnom-no-pnom.md

On a server

Modules and functions for web application (RESTXQ)

Modules (xqm, directory /scripts/webapp/repo/)
1. Functions for analysing treebanks (in general): grccom-analysis.xqm
2. Functions for displaying HTML (in general): grccom.xqm
Functions for individual pages (xq, directory /scripts/webapp/app/grccom)
1. Landing page: grccom-home.xq
2. Basic information on database: grccom-basic-ana.xq

AGDT data format

For syntactic roles, see the description by Giuseppe G. A. Celano, Guidelines for the Ancient Greek Dependency Treebank 2.0.

```
    Data Format

    The data given in this treebank is provided as an XML document.  Each 
    word contains six required attributes:

    id: This is a unique identifier, and corresponds to the word's linear 
    position in the sentence.  The first word in a sentence is given 
    id 1.

    cid: This is a canonical identifier for the word within the larger corpus.

    form: The token form of the word.

    lemma: The base lemma from which the word is derived, in Beta Code.

    head: The id of the word's parent.  If a word depends on the sentence 
    root, its head is 0.

    relation: The syntactic relation between the word and its parent.  A 
    catalogue of syntactic tags can be found in the syntactic guidelines 
    described below.

    postag: The morphological analysis for the word.  This field is 9 
    characters long, and corresponds to the following morphological 
    features:

        1:  part of speech

            n   noun
            v   verb
            t   participle
            a   adjective
            d   adverb
            l   article
            g   particle
            c   conjunction
            r   preposition
            p   pronoun
            m   numeral
            i   interjection
            e   exclamation
            u   punctuation

        2:  person

            1   first person
            2   second person
            3   third person

        3:  number

            s   singular
            p   plural
            d   dual

        4:  tense

            p   present
            i   imperfect
            r   perfect
            l   pluperfect
            t   future perfect
            f   future
            a   aorist

        5:  mood

            i   indicative
            s   subjunctive
            o   optative
            n   infinitive
            m   imperative
            p   participle

        6:  voice

            a   active
            p   passive
            m   middle
            e   medio-passive

        7:  gender

            m   masculine
            f   feminine
            n   neuter

        8:  case

            n   nominative
            g   genitive
            d   dative
            a   accusative
            v   vocative
            l   locative

        9:  degree

            c   comparative
            s   superlative

        ---

        For example, the postag for the noun "a)/ndra" is "n-s---ma-", 
        which corresponds to the following features:

        1: n    noun
        2: -
        3: s    singular
        4: -
        5: -
        6: -
        7: m    masculine
        8: a    accusative
        9: -

```
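
The positional encoding described above can be decoded mechanically. The following Python sketch copies the code tables from the data format description and expands a postag into named features; the function name `decode_postag` and the handling of unknown codes are assumptions made for this example.

```python
# Feature tables copied from the data format description, in positional order.
FIELDS = [
    ("part of speech", {
        "n": "noun", "v": "verb", "t": "participle", "a": "adjective",
        "d": "adverb", "l": "article", "g": "particle", "c": "conjunction",
        "r": "preposition", "p": "pronoun", "m": "numeral",
        "i": "interjection", "e": "exclamation", "u": "punctuation",
    }),
    ("person", {"1": "first person", "2": "second person", "3": "third person"}),
    ("number", {"s": "singular", "p": "plural", "d": "dual"}),
    ("tense", {
        "p": "present", "i": "imperfect", "r": "perfect", "l": "pluperfect",
        "t": "future perfect", "f": "future", "a": "aorist",
    }),
    ("mood", {
        "i": "indicative", "s": "subjunctive", "o": "optative",
        "n": "infinitive", "m": "imperative", "p": "participle",
    }),
    ("voice", {"a": "active", "p": "passive", "m": "middle", "e": "medio-passive"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("case", {
        "n": "nominative", "g": "genitive", "d": "dative",
        "a": "accusative", "v": "vocative", "l": "locative",
    }),
    ("degree", {"c": "comparative", "s": "superlative"}),
]


def decode_postag(postag):
    """Expand a 9-character postag into a dict of the features that are set."""
    if len(postag) != len(FIELDS):
        raise ValueError(f"postag must have {len(FIELDS)} characters: {postag!r}")
    features = {}
    for (name, codes), char in zip(FIELDS, postag):
        if char == "-":
            continue  # a hyphen means the feature does not apply
        # Unknown codes are reported rather than silently dropped.
        features[name] = codes.get(char, f"unknown code {char!r}")
    return features


decoded = decode_postag("n-s---ma-")
print(decoded)
# {'part of speech': 'noun', 'number': 'singular', 'gender': 'masculine', 'case': 'accusative'}
```

The example reproduces the analysis of "a)/ndra" given in the format description: noun, singular, masculine, accusative.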

Editor of this repository

Neven Jovanović (nevenjovanovic), Department of Classical Philology, Faculty of Humanities and Social Sciences, University of Zagreb; orcid.org/0000-0002-9119-399X

Owner

Name: Neven Jovanović
Login: nevenjovanovic
Kind: user
Location: Zagreb, Croatia
Company: University of Zagreb, Faculty of Humanities and Social Sciences

Website: https://orcid.org/my-orcid?orcid=0000-0002-9119-399X
Repositories: 9
Profile: https://github.com/nevenjovanovic

Classical philologist, university teacher of Greek and Latin. Digital humanities, textual editing. University of Zagreb, Croatia.

Citation (CITATION.cff)

cff-version: 1.1.0
message: "If you use this collection of scripts and data, please cite it as below."
authors:
  - family-names: Jovanović
    given-names: Neven
    orcid: https://orcid.org/0000-0002-9119-399X
title: "nevenjovanovic/greek-complexity: Complexity in Ancient Greek Treebanks, First Set of Queries (number of elements, lemmata, relations)"
version: v1.0.0
date-released: 2023-01-14
