Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: Links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (8.5%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: CCS-ZCU
- License: cc-by-sa-4.0
- Language: Jupyter Notebook
- Default Branch: master
- Size: 7.03 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
GreLa ETL
Authors
- Vojtěch Kaše (& team of collaborators)
License
CC-BY-SA 4.0, see attached License.md
Description
This repository is used to create, maintain, and enrich the GreLa corpus.
GreLa is a comprehensive corpus of Greek and Latin literature from the 8th c. BCE to the 17th c. CE. It covers roughly 11,000 works, 26,000,000 sentences, and 380,000,000 tokens. It is formed as a merge of the following corpora:
* LAGT: Lemmatized Ancient Greek Texts, combining all ancient Greek texts from Perseus Digital Library, First 1,000 Years of Greek, Glaux and OGA.
* Corpus Corporum: a comprehensive corpus of Latin literature
* NOSCEMUS: a database of early modern scientific literature
* EMLAP: Early Modern Latin Alchemical Prints
| subcorpus | works_N | sentences_N | tokens_N |
|:------------|:----------|:--------------|:------------|
| cc | 7,819 | 11,835,457 | 201,939,293 |
| emlap | 73 | 220,846 | 3,495,212 |
| lagt | 1,957 | 2,703,678 | 35,808,742 |
| noscemus | 996 | 11,802,783 | 139,401,899 |
| vulgate | 73 | 35,254 | 603,091 |
GreLa is structured as a relational database currently consisting of three tables: works, sentences, and tokens. The tables are linked to each other by the keys grela_id and sentence_id. grela_id combines the subcorpus acronym with the ID of the work in the respective subcorpus (<subcorpus-acronym>_<work-id>, e.g. cc_12710). sentence_id extends grela_id with the positional index of the sentence, starting from 0 (e.g. cc_12710_0 and cc_12710_1 stand for the first two sentences of the work with the ID 12710 in Corpus Corporum).
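The ID scheme can be illustrated with a few small helpers. This is only a sketch: the function names are hypothetical and not part of the GreLa code base, and splitting on the last underscore is an assumption made so that work IDs containing underscores would still parse.

```python
def make_grela_id(subcorpus: str, work_id: str) -> str:
    """Combine a subcorpus acronym and a work ID, e.g. ("cc", "12710") -> "cc_12710"."""
    return f"{subcorpus}_{work_id}"


def make_sentence_id(grela_id: str, position: int) -> str:
    """Append the 0-based sentence position to a grela_id."""
    return f"{grela_id}_{position}"


def split_sentence_id(sentence_id: str) -> tuple[str, int]:
    """Recover (grela_id, position) by splitting on the last underscore."""
    grela_id, position = sentence_id.rsplit("_", 1)
    return grela_id, int(position)


grela_id = make_grela_id("cc", "12710")
first_two = [make_sentence_id(grela_id, i) for i in range(2)]
print(first_two)                        # ['cc_12710_0', 'cc_12710_1']
print(split_sentence_id("cc_12710_1"))  # ('cc_12710', 1)
```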
In the tokens table, you can, for instance, search using the lemma and pos fields. You can also retrieve the position of the token within the respective sentence using char_start and char_end.
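A token lookup of this kind might look as follows. The sketch uses Python's built-in sqlite3 as a stand-in for DuckDB so that it runs without access to the real database. The reduced tables, the demo_1 IDs, and the "NOUN" tag value are invented for illustration, and the part-of-speech column is named pos as in the schema documentation. It also assumes that char_start is 0-based and char_end is exclusive, which is not confirmed here.

```python
import sqlite3

# Toy stand-in for the real database: sqlite3 instead of DuckDB, invented rows,
# and only the columns needed for this example.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sentences (sentence_id TEXT, grela_id TEXT, text TEXT)")
con.execute(
    "CREATE TABLE tokens (sentence_id TEXT, lemma TEXT, pos TEXT, "
    "char_start INTEGER, char_end INTEGER)"
)
con.execute(
    "INSERT INTO sentences VALUES (?, ?, ?)",
    ("demo_1_0", "demo_1", "Gallia est omnis divisa in partes tres."),
)
con.executemany(
    "INSERT INTO tokens VALUES (?, ?, ?, ?, ?)",
    [
        ("demo_1_0", "Gallia", "PROPN", 0, 6),
        ("demo_1_0", "pars", "NOUN", 27, 33),
        ("demo_1_0", "tres", "NUM", 34, 38),
    ],
)

# Find tokens by lemma and part of speech, then fetch the sentence they occur in.
rows = con.execute(
    """
    SELECT t.lemma, t.char_start, t.char_end, s.text
    FROM tokens AS t
    JOIN sentences AS s ON t.sentence_id = s.sentence_id
    WHERE t.lemma = ? AND t.pos = ?
    """,
    ("pars", "NOUN"),
).fetchall()

# Slice the sentence text with the offsets to recover the surface form.
hits = [(lemma, text[start:end]) for lemma, start, end, text in rows]
print(hits)  # [('pars', 'partes')]
```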
In the works table, we are gradually adding additional metadata for individual works. Most importantly, we offer a date using the fields not_before and not_after. While for early modern works these two attributes are often the same, as the date of publication is known, for works from antiquity, we often have only a rough estimate, which can only be expressed by means of an interval. This dating convention invites a Monte Carlo approach to modeling temporal uncertainty, which we proposed in this paper.
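The idea can be sketched in a few lines: each simulation run draws one possible date per work from its interval, and an analysis is then repeated over all runs. Uniform sampling is an assumption made here for simplicity and is not necessarily the distribution used in the paper. The works and their dates are hypothetical.

```python
import random

# Hypothetical works: (grela_id, not_before, not_after); negative years are BCE.
works = [
    ("demo_1", -450.0, -400.0),  # only a rough interval is known
    ("demo_2", 1543.0, 1543.0),  # publication year is known exactly
]


def sample_date(not_before, not_after, rng):
    """Draw one year uniformly from the closed interval [not_before, not_after]."""
    return rng.uniform(not_before, not_after)


def simulate(works, n_runs, seed=0):
    """Return n_runs alternative datings, each mapping grela_id -> sampled year."""
    rng = random.Random(seed)
    return [
        {grela_id: sample_date(nb, na, rng) for grela_id, nb, na in works}
        for _ in range(n_runs)
    ]


runs = simulate(works, n_runs=100)
print(len(runs))          # 100
print(runs[0]["demo_2"])  # 1543.0
```

A work with identical not_before and not_after always receives the same date, so only the uncertain works vary between runs.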
The database is implemented using DuckDB, an open-source column-oriented Relational Database Management System (RDBMS) designed to provide high performance on complex queries against large databases.
Database Schema Documentation
Table: sentences
| Column Name | Data Type | Is Nullable | Default Value |
|-----------------|-------------|-------------|---------------|
| sentence_id | VARCHAR | YES | N/A |
| grela_id | VARCHAR | YES | N/A |
| position | INTEGER | YES | N/A |
| text | VARCHAR | YES | N/A |
Table: tokens
| Column Name | Data Type | Is Nullable | Default Value |
|-----------------|-------------|-------------|---------------|
| sentence_id | VARCHAR | YES | N/A |
| grela_id | VARCHAR | YES | N/A |
| token_text | VARCHAR | YES | N/A |
| lemma | VARCHAR | YES | N/A |
| pos | VARCHAR | YES | N/A |
| char_start | INTEGER | YES | N/A |
| char_end | INTEGER | YES | N/A |
| token_id | BIGINT | YES | N/A |
Table: works
| Column Name | Data Type | Is Nullable | Default Value |
|-----------------|-------------|-------------|---------------|
| grela_source | VARCHAR | YES | N/A |
| grela_id | VARCHAR | YES | N/A |
| author | VARCHAR | YES | N/A |
| title | VARCHAR | YES | N/A |
| not_before | DOUBLE | YES | N/A |
| not_after | DOUBLE | YES | N/A |
| lagt_tlg_epithet | VARCHAR | YES | N/A |
| lagt_genre | VARCHAR | YES | N/A |
| lagt_provenience | VARCHAR | YES | N/A |
| noscemus_place | VARCHAR | YES | N/A |
| noscemus_genre | VARCHAR | YES | N/A |
| noscemus_discipline | VARCHAR | YES | N/A |
| title_short | VARCHAR | YES | N/A |
| emlap_noscemus_id | DOUBLE | YES | N/A |
| place_publication | VARCHAR | YES | N/A |
| place_geonames | VARCHAR | YES | N/A |
| author_viaf | DOUBLE | YES | N/A |
| title_viaf | DOUBLE | YES | N/A |
| date_random | DOUBLE | YES | N/A |
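As an illustration of how the tables combine, the sketch below counts occurrences of a lemma per work, restricted to works whose dating interval overlaps a given period. It again uses Python's built-in sqlite3 as a stand-in for DuckDB, with reduced toy tables and invented rows. The helper name lemma_counts is hypothetical, and treating a work as a match when its interval merely overlaps the period is one possible convention, not the project's documented one.

```python
import sqlite3

# Toy stand-in: sqlite3 instead of DuckDB, reduced columns, invented rows.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE works (grela_id TEXT, author TEXT, title TEXT, "
    "not_before REAL, not_after REAL)"
)
con.execute("CREATE TABLE tokens (grela_id TEXT, sentence_id TEXT, lemma TEXT)")
con.executemany(
    "INSERT INTO works VALUES (?, ?, ?, ?, ?)",
    [
        ("demo_1", "Author A", "Work One", -100.0, 50.0),
        ("demo_2", "Author B", "Work Two", 1600.0, 1600.0),
    ],
)
con.executemany(
    "INSERT INTO tokens VALUES (?, ?, ?)",
    [
        ("demo_1", "demo_1_0", "aqua"),
        ("demo_1", "demo_1_1", "aqua"),
        ("demo_1", "demo_1_1", "ignis"),
        ("demo_2", "demo_2_0", "aqua"),
    ],
)


def lemma_counts(con, lemma, period_start, period_end):
    """Count occurrences of `lemma` per work whose dating interval overlaps the period."""
    return con.execute(
        """
        SELECT w.grela_id, w.title, COUNT(*) AS n
        FROM tokens AS t
        JOIN works AS w ON t.grela_id = w.grela_id
        WHERE t.lemma = ?
          AND w.not_before <= ?  -- the interval starts before the period ends
          AND w.not_after >= ?   -- and ends after the period starts
        GROUP BY w.grela_id, w.title
        ORDER BY w.grela_id
        """,
        (lemma, period_end, period_start),
    ).fetchall()


print(lemma_counts(con, "aqua", 0, 100))      # [('demo_1', 'Work One', 2)]
print(lemma_counts(con, "aqua", 1500, 1700))  # [('demo_2', 'Work Two', 1)]
```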
Getting started
GreLa is now accessible via an API. To get started, check this Google Colab notebook.
```python
# Currently, we maintain the database on our CCS-Lab server.
```
How to cite
[once a release is created and published via zenodo, put its citation here]
Acknowledgement
[This work has been supported by ...]
Owner
- Name: CCS-Lab (Computing Culture & Society)
- Login: CCS-ZCU
- Kind: organization
- Email: kase@kfi.zcu.cz
- Location: Czech Republic
- Website: https://ccs.zcu.cz
- Repositories: 1
- Profile: https://github.com/CCS-ZCU
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
references:
- type: "software"
authors:
affiliation: ""
family-names: ""
given-names: ""
orcid: ""
license: cc-by-nc-4.0
repository-code: ""
title: ""
version:
doi:
date-released: 2023-04-27
abstract: ""
keywords:
- ""
GitHub Events
Total
- Push event: 6
- Create event: 1
Last Year
- Push event: 6
- Create event: 1
Dependencies
- gensim *
- ipykernel *
- jupyter *
- matplotlib *
- networkx *
- nltk *
- pandas *
- scikit-learn *
- sddk *
- seaborn *
- virtualenv *