li_etl
Deduplicated and enriched merge of the EDH and EDCS dataset
Science Score: 46.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to zenodo.org
- ✓ Committers with academic emails: 1 of 6 committers (16.7%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.9%) to scientific vocabulary
Repository
Statistics
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
LI ETL (Latin Inscriptions - Extract, Transform, Load)
Authors
- Vojtěch Kaše, SDAM project, vojtech.kase@gmail.com
- Petra Hermankova, SDAM project, petra@ancientsocialcomplexity.org
- Adela Sobotkova, SDAM project, admin@ancientsocialcomplexity.org
Description
This repository is used to generate two datasets: LIST (Latin Inscriptions in Space and Time, https://zenodo.org/record/7587556#.ZEor6i9BxhF) and LIRE (Latin Inscriptions of the Roman Empire, https://zenodo.org/record/5776109#.ZEosBC9BxpQ). LIRE is a filtered version of LIST with a narrower spatial and temporal scope. Both were created by aggregating the EDH and EDCS epigraphic datasets and enriching them with additional metadata. The repository does not contain the datasets themselves, only the scripts used to generate them (see the scripts subdirectory).
For inscriptions covered by both the EDCS and EDH source datasets, the merged dataset contains attributes from both. For an inscription available in only one source, it contains attributes from that source only. Some crucial attributes shared by both sources:
* clean_text_interpretive_word: text of the inscription
* not_before: start of the dating interval
* not_after: end of the dating interval
* geography: latitude/longitude defining the geospatial position as a point
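The dating attributes above lend themselves to simple temporal filtering. The following is a minimal sketch only, assuming that each inscription is represented as a plain dictionary and that not_before and not_after hold integer years (negative for BCE). The helper `overlaps_interval` and the toy records are hypothetical and are not part of the repository; the real datasets may encode missing dates differently.

```python
def overlaps_interval(record, start, end):
    """Return True if the record's dating interval overlaps [start, end]."""
    not_before = record.get("not_before")
    not_after = record.get("not_after")
    if not_before is None and not_after is None:
        return False  # undated record, cannot be placed in time
    # Assumption: if only one bound is known, treat the date as a single year.
    if not_before is None:
        not_before = not_after
    if not_after is None:
        not_after = not_before
    return not_before <= end and not_after >= start


# Hypothetical toy records, for illustration only.
records = [
    {"id": "a", "not_before": -50, "not_after": 50},
    {"id": "b", "not_before": 101, "not_after": 200},
    {"id": "c", "not_before": 301, "not_after": None},
    {"id": "d", "not_before": None, "not_after": None},
]

# Keep the inscriptions whose dating interval touches the second century CE.
second_century = [r["id"] for r in records if overlaps_interval(r, 101, 200)]
print(second_century)
```

With a geopandas GeoDataFrame, the same overlap condition could be written as a boolean mask over the two columns.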
For other metadata attributes, information cannot be easily transferred between the two sources. For instance, the EDCS attribute inscr_type should carry approximately the same information as type_of_inscription_clean in EDH. However, inscr_type uses a different classification system than EDH and relies on Latin labels for inscription types, among other differences. This project addresses the issue by training and applying a machine learning classification model (see scripts/CLASSIFIER_TRAINING&TESTING.ipynb and scripts/CLASSIFIER-APPLICATION.ipynb). As a result, the dataset is enriched with two additional attributes: type_of_inscription_auto and type_of_inscription_prob.
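One way a user might combine the EDH label with the automatically assigned one is sketched below. This is an illustration under stated assumptions, not the project's method: it assumes that type_of_inscription_auto holds the predicted label and type_of_inscription_prob the associated probability in the range 0 to 1. The function `resolve_type`, the threshold of 0.8, and the example label values are all hypothetical.

```python
def resolve_type(record, min_prob=0.8):
    """Pick an inscription type for a record.

    Prefers the EDH label when present; otherwise accepts the automatic
    label only if its probability reaches min_prob (an arbitrary cutoff).
    """
    manual = record.get("type_of_inscription_clean")
    if manual:
        return manual
    auto = record.get("type_of_inscription_auto")
    prob = record.get("type_of_inscription_prob")
    if auto and prob is not None and prob >= min_prob:
        return auto
    return None  # no sufficiently reliable label available


# Hypothetical records, for illustration only.
edh_record = {"type_of_inscription_clean": "epitaph"}
confident = {"type_of_inscription_auto": "votive inscription",
             "type_of_inscription_prob": 0.93}
uncertain = {"type_of_inscription_auto": "epitaph",
             "type_of_inscription_prob": 0.41}

for rec in (edh_record, confident, uncertain):
    print(resolve_type(rec))
```

The threshold trades coverage for reliability; a suitable value would have to be chosen by inspecting the classifier evaluation in the notebooks.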
For an overview of all metadata attributes, see LIST_v0.4_metadata.csv. For an overview of the data, see the Jupyter notebook 5_DATA-OVERVIEW.ipynb in the scripts subdirectory.
The final datasets are available via Zenodo:
* LIST dataset: https://zenodo.org/record/7870085#.ZEoyjy9BxhE. With the geopandas library imported as gpd, you can load the data directly into your Python environment with: `LIST = gpd.read_parquet("https://zenodo.org/record/7870085/files/LIST_v0-4.parquet?download=1")`
* LIRE dataset: https://zenodo.org/record/7577788#.ZEo3rS9BxhE. With the geopandas library imported as gpd, you can load the data directly into your Python environment with: `LIRE = gpd.read_parquet("https://zenodo.org/record/7577788/files/LIRE_v2-1.parquet?download=1")`
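The two download commands follow the same URL pattern, so they can be wrapped in a small helper. The sketch below is one possible way to do this; the names `zenodo_parquet_url` and `load_dataset` are hypothetical and not part of the repository. Loading requires network access and geopandas with a parquet backend such as pyarrow, and it assumes that the Zenodo URLs given above are still valid.

```python
ZENODO_FILE_URL = "https://zenodo.org/record/{record_id}/files/{filename}?download=1"


def zenodo_parquet_url(record_id, filename):
    """Build the direct download URL for a file in a Zenodo record."""
    return ZENODO_FILE_URL.format(record_id=record_id, filename=filename)


def load_dataset(record_id, filename):
    """Download a parquet file from Zenodo into a GeoDataFrame."""
    # Imported here so the URL helper works without geopandas installed.
    import geopandas as gpd
    return gpd.read_parquet(zenodo_parquet_url(record_id, filename))


# Record IDs and filenames as given in the list above.
LIST_URL = zenodo_parquet_url("7870085", "LIST_v0-4.parquet")
LIRE_URL = zenodo_parquet_url("7577788", "LIRE_v2-1.parquet")
print(LIST_URL)
print(LIRE_URL)
```

Calling `load_dataset("7870085", "LIST_v0-4.parquet")` would then be equivalent to the LIST command above.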
Source Data
The EDCS dataset is accessed and transformed by a series of Python and R scripts in the EDCS ETL repository, created by the SDAM project. The latest version of the dataset (as a JSON file) can be accessed via Sciencedata.dk at the following URL: https://sciencedata.dk/shared/1f5f56d09903fe259c0906add8b3a55e.
The EDH dataset is accessed and transformed by a series of Python and R scripts in the EDH ETL repository and the EDH exploration repository, created by the SDAM project. The latest version of the dataset (as a JSON file) can be accessed via Sciencedata.dk at the following URL: https://sciencedata.dk/shared/b6b6afdb969d378b70929e86e58ad975.
Software
- Python 3
- Jupyter notebooks app/JupyterLab/JupyterHub
- Additional Python 3 libraries listed in requirements.txt
Getting Started
After cloning the repository, we recommend creating a virtual environment with the virtualenv library (named livenv in the commands below), registering it as a Jupyter kernel named li_venv, and running the notebooks with that kernel:
```bash
virtualenv livenv                 # create the virtual environment
source livenv/bin/activate        # activate it
pip install --upgrade pip
pip install -r requirements.txt   # install everything listed in requirements.txt
python -m ipykernel install --user --name=li_venv   # register the environment as a Jupyter kernel
```
Owner
- Name: Social Dynamics in the Ancient Mediterranean
- Login: sdam-au
- Kind: organization
- Location: Aarhus, Denmark
- Website: sdam.au.dk
- Twitter: sdam_au
- Repositories: 7
- Profile: https://github.com/sdam-au
Research group of social complexity and dynamics in the Ancient Mediterranean
Committers
Last synced: about 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| kasev | v****e@g****m | 129 |
| Petra Heřmánková | p****a@g****m | 15 |
| Adela Sobotkova | a****a@f****g | 7 |
| Vojtěch Kaše | v****e@V****e | 1 |
| Vojtěch Kaše | k****v@v****k | 1 |
| Vojtěch Kaše | k****v@e****z | 1 |
Issues and Pull Requests
Last synced: about 1 year ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Dependencies
- geoplot *
- google-auth *
- gspread *
- gspread_dataframe *
- ipython *
- jupyter *
- nltk *
- pyinterval *
- scikit-learn *
- sddk *
- tabulate *
- tempun *