grela

https://github.com/ccs-zcu/grela

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (8.5%) to scientific vocabulary

Last synced: 11 months ago

Repository

Basic Info

Host: GitHub
Owner: CCS-ZCU
License: cc-by-sa-4.0
Language: Jupyter Notebook
Default Branch: master
Size: 7.03 MB

Statistics

Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed 11 months ago

Metadata Files

Readme, License, Citation

GreLa ETL

Authors

Vojtěch Kaše (& team of collaborators)

License

CC-BY-SA 4.0, see attached License.md

Description

This repository is used to create, maintain, and enrich the GreLa corpus.

GreLa is a comprehensive corpus of Greek and Latin literature from the 8th c. BCE to the 17th c. CE. It covers roughly 11,000 works, 26,000,000 sentences, and 380,000,000 tokens. It is formed by merging the following corpora:

* LAGT: Lemmatized Ancient Greek Texts, combining all ancient Greek texts from Perseus Digital Library, First 1,000 Years of Greek, Glaux and OGA.
* Corpus Corporum: a comprehensive corpus of Latin literature
* NOSCEMUS: a database of early modern scientific literature
* EMLAP: Early Modern Latin Alchemical Prints

| subcorpus | works_N | sentences_N | tokens_N |
|:------------|:----------|:--------------|:------------|
| cc | 7,819 | 11,835,457 | 201,939,293 |
| emlap | 73 | 220,846 | 3,495,212 |
| lagt | 1,957 | 2,703,678 | 35,808,742 |
| noscemus | 996 | 11,802,783 | 139,401,899 |
| vulgate | 73 | 35,254 | 603,091 |

GreLa is structured as a relational database currently consisting of three tables: works, sentences, and tokens. The tables are linked to each other by the keys grela_id and sentence_id. grela_id combines the subcorpus acronym with the ID of the work in the respective subcorpus (<subcorpus-acronym>_<work-id>, e.g. cc_12710). sentence_id extends grela_id with the positional index of the sentence, starting from 0 (e.g. cc_12710_0 and cc_12710_1 stand for the first two sentences of the work with the ID 12710 in Corpus Corporum).
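The key scheme can be illustrated with a few small helpers. This is a minimal sketch and not part of the GreLa codebase: all function names are hypothetical, and it assumes that the subcorpus acronym is the part before the first underscore and the sentence index is the part after the last one.

```python
def make_grela_id(subcorpus: str, work_id: str) -> str:
    """Combine a subcorpus acronym and a work ID, e.g. 'cc' + '12710'."""
    return f"{subcorpus}_{work_id}"


def make_sentence_id(grela_id: str, index: int) -> str:
    """Append the 0-based positional index of a sentence to a grela_id."""
    return f"{grela_id}_{index}"


def split_sentence_id(sentence_id: str) -> tuple[str, int]:
    """Recover (grela_id, sentence index) by splitting at the last underscore."""
    grela_id, index = sentence_id.rsplit("_", 1)
    return grela_id, int(index)


def subcorpus_of(grela_id: str) -> str:
    """Return the subcorpus acronym, i.e. the part before the first underscore."""
    return grela_id.split("_", 1)[0]


print(make_sentence_id(make_grela_id("cc", "12710"), 0))
print(split_sentence_id("cc_12710_1"))
```

Splitting at the last underscore keeps the helper usable even if a work ID itself contains underscores, which the source neither confirms nor rules out.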

In the tokens table, you can, for instance, search using the lemma and pos_tag fields. You can also retrieve the position of the token within the respective sentence using char_start and char_end.
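A lemma search combined with the character offsets might look like the sketch below. It uses Python's built-in sqlite3 module with a tiny in-memory toy table as a stand-in for the real DuckDB database, so it does not touch the GreLa API. The toy rows, the sentence text, the `toy_1` identifiers, and the reading of char_start and char_end as 0-based offsets with an exclusive end are assumptions made for illustration; the source does not confirm them.

```python
import sqlite3

# Toy sentence text; in GreLa it would come from the sentences table.
sentence_text = "Gallia est omnis divisa in partes tres"

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE tokens (sentence_id TEXT, grela_id TEXT, lemma TEXT, "
    "char_start INTEGER, char_end INTEGER)"
)
# Hypothetical rows: offsets are assumed to be 0-based and end-exclusive.
con.executemany(
    "INSERT INTO tokens VALUES (?, ?, ?, ?, ?)",
    [
        ("toy_1_0", "toy_1", "Gallia", 0, 6),
        ("toy_1_0", "toy_1", "sum", 7, 10),
        ("toy_1_0", "toy_1", "omnis", 11, 16),
    ],
)

# Find every token whose lemma is "sum" and get its position in the sentence.
rows = con.execute(
    "SELECT sentence_id, char_start, char_end FROM tokens WHERE lemma = ?",
    ("sum",),
).fetchall()

# Slice the sentence text with the offsets to recover the surface form.
surface_forms = [sentence_text[start:end] for _, start, end in rows]
print(surface_forms)
```

Searching by lemma rather than by surface form is what lets a query for "sum" find the inflected form in the sentence.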

In the works table, we are gradually adding additional metadata for individual works. Most importantly, we offer a date using the fields not_before and not_after. While for early modern works these two attributes are often the same, as the date of publication is known, for works from antiquity, we often have only a rough estimate, which can only be expressed by means of an interval. This dating convention invites a Monte Carlo approach to modeling temporal uncertainty, which we proposed in this paper.
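The interval dating can be turned into simulated point dates along the following lines. This is only a sketch of the general idea, assuming that each date is drawn uniformly from the interval between not_before and not_after; the source does not say which distribution the referenced paper uses. The function name `sample_dates`, the toy works, and the use of negative years for BCE are hypothetical.

```python
import random


def sample_dates(not_before, not_after, n_sims, rng):
    """Draw n_sims candidate dates uniformly from [not_before, not_after]."""
    return [rng.uniform(not_before, not_after) for _ in range(n_sims)]


# Toy records; negative years stand for BCE (an assumption of this sketch).
works = [
    {"grela_id": "toy_1", "not_before": -50.0, "not_after": 50.0},
    {"grela_id": "toy_2", "not_before": 1543.0, "not_after": 1543.0},
]

rng = random.Random(42)  # fixed seed so the simulation is reproducible
n_sims = 1000
simulated = {
    w["grela_id"]: sample_dates(w["not_before"], w["not_after"], n_sims, rng)
    for w in works
}

# Share of simulations in which toy_1 is dated before year 0.
share_before_zero = sum(d < 0 for d in simulated["toy_1"]) / n_sims
print(round(share_before_zero, 2))
```

A work with a known publication year (not_before equal to not_after) gets the same date in every simulation, while a work with a wide interval spreads across it, so downstream statistics reflect the dating uncertainty.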

The database is implemented using DuckDB, an open-source column-oriented Relational Database Management System (RDBMS) designed to provide high performance on complex queries against large databases.

Database Schema Documentation

Table: `sentences`

Table: `tokens`

| Column Name | Data Type | Is Nullable | Default Value |
|-----------------|-------------|-------------|---------------|
| sentence_id | VARCHAR | YES | N/A |
| grela_id | VARCHAR | YES | N/A |
| token_text | VARCHAR | YES | N/A |
| lemma | VARCHAR | YES | N/A |
| pos | VARCHAR | YES | N/A |
| char_start | INTEGER | YES | N/A |
| char_end | INTEGER | YES | N/A |
| token_id | BIGINT | YES | N/A |

Table: `works`

| Column Name | Data Type | Is Nullable | Default Value |
|-----------------|-------------|-------------|---------------|
| grela_source | VARCHAR | YES | N/A |
| grela_id | VARCHAR | YES | N/A |
| author | VARCHAR | YES | N/A |
| title | VARCHAR | YES | N/A |
| not_before | DOUBLE | YES | N/A |
| not_after | DOUBLE | YES | N/A |
| lagt_tlg_epithet | VARCHAR | YES | N/A |
| lagt_genre | VARCHAR | YES | N/A |
| lagt_provenience | VARCHAR | YES | N/A |
| noscemus_place | VARCHAR | YES | N/A |
| noscemus_genre | VARCHAR | YES | N/A |
| noscemus_discipline | VARCHAR | YES | N/A |
| title_short | VARCHAR | YES | N/A |
| emlap_noscemus_id | DOUBLE | YES | N/A |
| place_publication | VARCHAR | YES | N/A |
| place_geonames | VARCHAR | YES | N/A |
| author_viaf | DOUBLE | YES | N/A |
| title_viaf | DOUBLE | YES | N/A |
| date_random | DOUBLE | YES | N/A |
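A query that joins tokens to works through grela_id and filters by the dating interval could be sketched as follows. The example runs against Python's built-in sqlite3 module with toy tables as a stand-in, not against the real DuckDB database or the GreLa API; the SQL is plain enough that it is expected to work in DuckDB as well, but that has not been verified here. Column names follow the description above (grela_id, sentence_id, lemma, not_before, not_after), and all rows are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE works (grela_id TEXT, author TEXT, title TEXT, "
    "not_before REAL, not_after REAL)"
)
con.execute("CREATE TABLE tokens (grela_id TEXT, sentence_id TEXT, lemma TEXT)")

# Hypothetical toy data, not taken from the corpus.
con.executemany(
    "INSERT INTO works VALUES (?, ?, ?, ?, ?)",
    [
        ("toy_1", "Author A", "Work A", -50.0, 50.0),
        ("toy_2", "Author B", "Work B", 1543.0, 1543.0),
    ],
)
con.executemany(
    "INSERT INTO tokens VALUES (?, ?, ?)",
    [
        ("toy_1", "toy_1_0", "sum"),
        ("toy_1", "toy_1_1", "sum"),
        ("toy_2", "toy_2_0", "sum"),
        ("toy_2", "toy_2_0", "terra"),
    ],
)

# Count occurrences of a lemma per work, restricted to works whose whole
# dating interval lies inside the requested window.
query = """
    SELECT w.grela_id, w.title, COUNT(*) AS n_tokens
    FROM tokens AS t
    JOIN works AS w ON t.grela_id = w.grela_id
    WHERE t.lemma = ? AND w.not_before >= ? AND w.not_after <= ?
    GROUP BY w.grela_id, w.title
    ORDER BY w.grela_id
"""
result = con.execute(query, ("sum", 1500, 1700)).fetchall()
print(result)
```

Requiring both interval bounds to fall inside the window is a conservative choice; a looser filter would keep any work whose interval merely overlaps it.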

Getting started

GreLa is now accessible via an API. To get started, check this Google Colab notebook.

```python
# Currently, we maintain the database on our CCS-Lab server.
```

How to cite

[once a release is created and published via zenodo, put its citation here]

Acknowledgement

[This work has been supported by ...]

Owner

Name: CCS-Lab (Computing Culture & Society)
Login: CCS-ZCU
Kind: organization
Email: kase@kfi.zcu.cz
Location: Czech Republic

Website: https://ccs.zcu.cz
Repositories: 1
Profile: https://github.com/CCS-ZCU

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
references:
  - type: "software"
authors: 
  affiliation: ""
      family-names: ""
      given-names: ""
      orcid: ""-
license: cc-by-nc-4.0
repository-code: ""
title: ""
version:
doi: 
date-released: 2023-04-27
abstract: ""
keywords:
  - ""

GitHub Events

Total

Push event: 6
Create event: 1

Last Year

Push event: 6
Create event: 1

Dependencies

requirements.txt pypi

gensim *
ipykernel *
jupyter *
matplotlib *
networkx *
nltk *
pandas *
scikit-learn *
sddk *
seaborn *
virtualenv *
