Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.5%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: CCS-ZCU
  • License: cc-by-sa-4.0
  • Language: Jupyter Notebook
  • Default Branch: master
  • Size: 7.03 MB
Statistics
  • Stars: 0
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 9 months ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md

GreLa ETL


Authors

  • Vojtěch Kaše (& team of collaborators)

License

CC-BY-SA 4.0, see attached License.md


Description

This repository is used to create, maintain, and enrich the GreLa corpus.

GreLa is a comprehensive corpus of Greek and Latin literature from the 8th c. BCE to the 17th c. CE. It covers more than 11,000 works, 26,000,000 sentences, and 380,000,000 tokens. It is formed by merging the following corpora:

  • LAGT: Lemmatized Ancient Greek Texts, combining all ancient Greek texts from Perseus Digital Library, First 1,000 Years of Greek, Glaux and OGA.
  • Corpus Corporum: a comprehensive corpus of Latin literature
  • NOSCEMUS: a database of early modern scientific literature
  • EMLAP: Early Modern Latin Alchemical Prints

| subcorpus | works_N | sentences_N | tokens_N    |
|:----------|:--------|:------------|:------------|
| cc        | 7,819   | 11,835,457  | 201,939,293 |
| emlap     | 73      | 220,846     | 3,495,212   |
| lagt      | 1,957   | 2,703,678   | 35,808,742  |
| noscemus  | 996     | 11,802,783  | 139,401,899 |
| vulgate   | 73      | 35,254      | 603,091     |

GreLa is structured as a relational database that currently consists of three tables: works, sentences, and tokens. The tables are linked to each other through the keys grela_id and sentence_id. grela_id combines the subcorpus acronym with the ID of the work in the respective subcorpus (<subcorpus-acronym>_<work-id>, e.g. cc_12710). sentence_id extends grela_id with the positional index of the sentence, starting from 0 (e.g. cc_12710_0 and cc_12710_1 stand for the first two sentences of the work with the ID 12710 in Corpus Corporum).
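The ID convention described above can be taken apart with a few lines of Python. This is only a sketch: the helper names `parse_sentence_id` and `grela_id_of` are hypothetical and not part of GreLa, and it assumes that the subcorpus acronym itself never contains an underscore.

```python
from typing import NamedTuple


class SentenceRef(NamedTuple):
    subcorpus: str  # subcorpus acronym, e.g. "cc"
    work_id: str    # ID of the work within its subcorpus
    position: int   # 0-based index of the sentence within the work


def parse_sentence_id(sentence_id: str) -> SentenceRef:
    """Split a sentence_id such as 'cc_12710_1' into its three parts.

    Assumption: the acronym has no underscore, while the work ID might,
    so we split once from the left and once from the right.
    """
    subcorpus, rest = sentence_id.split("_", 1)
    work_id, position = rest.rsplit("_", 1)
    return SentenceRef(subcorpus, work_id, int(position))


def grela_id_of(sentence_id: str) -> str:
    """Recover the grela_id by dropping the trailing sentence index."""
    return sentence_id.rsplit("_", 1)[0]


print(parse_sentence_id("cc_12710_1"))
# SentenceRef(subcorpus='cc', work_id='12710', position=1)
print(grela_id_of("cc_12710_1"))
# cc_12710
```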

In the tokens table, you can, for instance, search using the lemma and pos_tag fields. You can also retrieve the position of the token within the respective sentence using char_start and char_end.
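As an illustration of how char_start and char_end might be used, the sketch below recovers the surface form of a token from the sentence text. The sentence and token rows are invented for the example, and it assumes that offsets are 0-based with an exclusive char_end; check this against the actual data before relying on it.

```python
def token_span(sentence_text: str, char_start: int, char_end: int) -> str:
    """Return the part of the sentence covered by one token.

    Assumption: offsets are 0-based and char_end is exclusive.
    """
    return sentence_text[char_start:char_end]


# Hypothetical sentence and token rows, invented for illustration.
sentence_text = "Gallia est omnis divisa in partes tres."
tokens = [
    {"lemma": "Gallia", "char_start": 0, "char_end": 6},
    {"lemma": "sum", "char_start": 7, "char_end": 10},
    {"lemma": "divido", "char_start": 17, "char_end": 23},
]


def surface_forms(lemma: str) -> list[str]:
    """Collect the surface forms under which a lemma occurs in the sentence."""
    return [
        token_span(sentence_text, t["char_start"], t["char_end"])
        for t in tokens
        if t["lemma"] == lemma
    ]


print(surface_forms("sum"))     # ['est']
print(surface_forms("divido"))  # ['divisa']
```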

In the works table, we are gradually adding additional metadata for individual works. Most importantly, we offer a date using the fields not_before and not_after. While for early modern works these two attributes are often the same, as the date of publication is known, for works from antiquity, we often have only a rough estimate, which can only be expressed by means of an interval. This dating convention invites a Monte Carlo approach to modeling temporal uncertainty, which we proposed in this paper.
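A minimal sketch of the Monte Carlo idea follows: in each run, every work receives one random date drawn from its [not_before, not_after] interval, and the per-century counts are averaged over all runs. The uniform distribution, the use of negative years for BCE, and the function names are assumptions made for this example; the method in the paper may differ.

```python
import random
from collections import Counter


def sample_date(not_before: float, not_after: float, rng: random.Random) -> float:
    """Draw one plausible date from the interval [not_before, not_after]."""
    return rng.uniform(not_before, not_after)


def simulate_century_counts(works, n_runs=1000, seed=0):
    """Average number of works per century over n_runs simulated datings."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    totals = Counter()
    for _ in range(n_runs):
        for not_before, not_after in works:
            year = sample_date(not_before, not_after, rng)
            # Floor to the start of the century, e.g. -450.0 -> -500.
            century_start = int(year // 100) * 100
            totals[century_start] += 1
    return {c: n / n_runs for c, n in sorted(totals.items())}


# Hypothetical (not_before, not_after) pairs; negative years stand for BCE.
works = [(-450, -350), (-50, 50), (1543, 1543)]
print(simulate_century_counts(works))
```

A work with identical bounds, like the last one, always falls into the same century, while a work with a wide interval is spread across neighbouring centuries in proportion to the overlap.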

The database is implemented using DuckDB, an open-source column-oriented Relational Database Management System (RDBMS) designed to provide high performance on complex queries against large databases.

Database Schema Documentation

Table: sentences

| Column Name | Data Type | Is Nullable | Default Value |
|-------------|-----------|-------------|---------------|
| sentence_id | VARCHAR   | YES         | N/A           |
| grela_id    | VARCHAR   | YES         | N/A           |
| position    | INTEGER   | YES         | N/A           |
| text        | VARCHAR   | YES         | N/A           |

Table: tokens

| Column Name | Data Type | Is Nullable | Default Value |
|-------------|-----------|-------------|---------------|
| sentence_id | VARCHAR   | YES         | N/A           |
| grela_id    | VARCHAR   | YES         | N/A           |
| tokentext   | VARCHAR   | YES         | N/A           |
| lemma       | VARCHAR   | YES         | N/A           |
| pos         | VARCHAR   | YES         | N/A           |
| char_start  | INTEGER   | YES         | N/A           |
| char_end    | INTEGER   | YES         | N/A           |
| tokenid     | BIGINT    | YES         | N/A           |

Table: works

| Column Name        | Data Type | Is Nullable | Default Value |
|--------------------|-----------|-------------|---------------|
| grelasource        | VARCHAR   | YES         | N/A           |
| grela_id           | VARCHAR   | YES         | N/A           |
| author             | VARCHAR   | YES         | N/A           |
| title              | VARCHAR   | YES         | N/A           |
| not_before         | DOUBLE    | YES         | N/A           |
| not_after          | DOUBLE    | YES         | N/A           |
| lagttlgepithet     | VARCHAR   | YES         | N/A           |
| lagtgenre          | VARCHAR   | YES         | N/A           |
| lagtprovenience    | VARCHAR   | YES         | N/A           |
| noscemusplace      | VARCHAR   | YES         | N/A           |
| noscemusgenre      | VARCHAR   | YES         | N/A           |
| noscemusdiscipline | VARCHAR   | YES         | N/A           |
| titleshort         | VARCHAR   | YES         | N/A           |
| emlapnoscemusid    | DOUBLE    | YES         | N/A           |
| placepublication   | VARCHAR   | YES         | N/A           |
| placegeonames      | VARCHAR   | YES         | N/A           |
| authorviaf         | DOUBLE    | YES         | N/A           |
| titleviaf          | DOUBLE    | YES         | N/A           |
| date_random        | DOUBLE    | YES         | N/A           |
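To show how the three tables relate, here is a sketch of a lemma search that joins tokens to sentences and works through sentence_id and grela_id. It uses Python's built-in sqlite3 module purely as a stand-in, so that the example runs without access to the real DuckDB database; the toy tables contain only a subset of the columns, and all rows are invented. The SELECT itself is plain SQL and should carry over to DuckDB, but that has not been verified against the actual GreLa instance.

```python
import sqlite3

# In-memory stand-in for the real database (assumption: sqlite3 instead of DuckDB).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE works (grela_id TEXT, author TEXT, title TEXT,
                        not_before REAL, not_after REAL);
    CREATE TABLE sentences (sentence_id TEXT, grela_id TEXT,
                            position INTEGER, text TEXT);
    CREATE TABLE tokens (sentence_id TEXT, grela_id TEXT, lemma TEXT,
                         char_start INTEGER, char_end INTEGER);
""")

# Hypothetical rows, invented for illustration.
con.execute(
    "INSERT INTO works VALUES (?, ?, ?, ?, ?)",
    ("cc_12710", "Example Author", "Example Title", 1500, 1510),
)
con.executemany("INSERT INTO sentences VALUES (?, ?, ?, ?)", [
    ("cc_12710_0", "cc_12710", 0, "Natura nihil frustra facit."),
    ("cc_12710_1", "cc_12710", 1, "Ars imitatur naturam."),
])
con.executemany("INSERT INTO tokens VALUES (?, ?, ?, ?, ?)", [
    ("cc_12710_0", "cc_12710", "natura", 0, 6),
    ("cc_12710_1", "cc_12710", "ars", 0, 3),
    ("cc_12710_1", "cc_12710", "natura", 13, 20),
])

# Find every sentence containing a given lemma, with its work metadata.
QUERY = """
    SELECT w.author, w.title, s.sentence_id, s.text
    FROM tokens AS t
    JOIN sentences AS s ON s.sentence_id = t.sentence_id
    JOIN works AS w ON w.grela_id = t.grela_id
    WHERE t.lemma = ?
    ORDER BY s.grela_id, s.position
"""
rows = con.execute(QUERY, ("natura",)).fetchall()
for row in rows:
    print(row)
```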

Getting started

GreLa is now accessible via an API. To get started, check this Google Colab notebook.

```python
# Currently, we maintain the database on our CCS-Lab server.
```

How to cite

[once a release is created and published via zenodo, put its citation here]

Acknowledgement

[This work has been supported by ...]

Owner

  • Name: CCS-Lab (Computing Culture & Society)
  • Login: CCS-ZCU
  • Kind: organization
  • Email: kase@kfi.zcu.cz
  • Location: Czech Republic

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
references:
  - type: "software"
authors: 
  affiliation: ""
      family-names: ""
      given-names: ""
      orcid: ""-
license: cc-by-nc-4.0
repository-code: ""
title: ""
version:
doi: 
date-released: 2023-04-27
abstract: ""
keywords:
  - ""

GitHub Events

Total
  • Push event: 6
  • Create event: 1
Last Year
  • Push event: 6
  • Create event: 1

Dependencies

requirements.txt pypi
  • gensim *
  • ipykernel *
  • jupyter *
  • matplotlib *
  • networkx *
  • nltk *
  • pandas *
  • scikit-learn *
  • sddk *
  • seaborn *
  • virtualenv *