edcs_etl

ETL repository for Epigraphic Database Clauss Slaby (http://www.manfredclauss.de/)

https://github.com/sdam-au/edcs_etl

Repository

Basic Info

Host: GitHub
Owner: sdam-au
License: cc-by-sa-4.0
Language: Shell
Default Branch: master
Size: 107 MB

Statistics

Stars: 0
Watchers: 2
Forks: 1
Open Issues: 0
Releases: 5

Created over 5 years ago · Last pushed almost 4 years ago

ETL workflow for quantitative analysis of inscriptions from the EDCS dataset

ETL

This repository contains scripts for accessing, extracting and transforming epigraphic datasets from the Epigraphic Database Clauss-Slaby. We have developed a series of scripts that merge the data together and streamline them for quantitative analysis of epigraphic trends.

Authors

Petra Heřmánková, SDAM project, petra.hermankova@cas.au.dk

License

CC-BY-SA 4.0

How to cite us

2022 version 2

DATASET 2022: Heřmánková, Petra. (2022). EDCS_text_cleaned_2022_09_12 (v2.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.7072337 https://zenodo.org/record/7072337

SCRIPTS 2022: Petra Heřmánková. (2022). sdam-au/EDCS_ETL: Scripts (v2.0). Zenodo. https://doi.org/10.5281/zenodo.7072355

The 2022 dataset contains 537,286 cleaned and streamlined Latin inscriptions from the Epigraphic Database Clauss Slaby (EDCS, http://www.manfredclauss.de/), aggregated on 2022/09/12, created for the purpose of a quantitative study of epigraphic trends by the Social Dynamics in the Ancient Mediterranean Project (SDAM, http://sdam.au.dk). The dataset contains 27 attributes with original and streamlined data. Compared to the 2021 dataset, there are 36,750 more inscriptions and 2 fewer attributes containing redundant legacy data, so the entire dataset is approximately the same size (465.5 MB in 2022 compared to 451.5 MB in 2021), but some of the attributes are streamlined: some attribute names have changed for better consistency, e.g. Material > material, Latitude > latitude; some attributes are no longer available due to improvements of the Lat Epig tool, e.g. start_yr, notes_dating, inscription_stripped_final; and some new attributes were added due to improvements of the cleaning process, e.g. clean_text_conservative. For a full overview, see the Metadata section.

Metadata

EDCS 2022 dataset metadata with descriptions for all attributes.

2021 version 1

DATASET 2021: Heřmánková, Petra. (2021). EDCS_text_cleaned_2021_03_01 (Version 1.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4888817 https://zenodo.org/record/4888817

SCRIPTS 2021: Petra Heřmánková. (2022). sdam-au/EDCS_ETL: Scripts (v1.1). Zenodo. https://doi.org/10.5281/zenodo.6497148

The 2021 dataset contains 500,536 cleaned and streamlined Latin inscriptions from the Epigraphic Database Clauss Slaby (EDCS, http://www.manfredclauss.de/), aggregated on 2021/03/01, created for the purpose of a quantitative study of epigraphic trends by the Social Dynamics in the Ancient Mediterranean Project (SDAM, http://sdam.au.dk). The dataset contains 29 attributes with original and streamlined data. For a full overview, see the Metadata section.

Metadata

EDCS 2021 dataset metadata with descriptions for all attributes.

Data

The original raw data

The raw data is published as HTML through the web interface at www.manfredclauss.de. The output of the web interface is accessed by a third-party tool, Lat Epig 2.0, developed at Macquarie University in Sydney, and saved in a series of CSV files (TSV files in the 2022 version), one per province.

The scripts access the main dataset via a web interface, transform the data into one dataframe object and save the outcome to the SDAM project directory on sciencedata.dk and on Zenodo. Since the most important data files are in a public folder, you can use and re-run our analyses even without a sciencedata.dk account and access to our team folder. A separate Python package, sddk, was created specifically for accessing sciencedata.dk from Python (see https://github.com/sdam-au/sddk). If you want to save the dataset in a different location, the scripts can be easily modified. You can access the file without having to log in to sciencedata.dk. Here is the path to the file on sciencedata.dk:

SDAM_root/SDAM_data/EDCS/public/EDCS_text_cleaned_[timestamp].json or https://sciencedata.dk/public/1f5f56d09903fe259c0906add8b3a55e/EDCS_text_cleaned_[timestamp].json

To access the files created in previous steps of the ETL process, you can either use the dataset from the public folder or rerun all scripts yourself.
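Fetching the public file without a sciencedata.dk account could look roughly like the Python sketch below, which uses only the standard library. The helper names `dataset_url` and `load_dataset` are hypothetical, the `timestamp` value has to match a file that actually exists in the public folder, and since the internal layout of the JSON is not described here, the loader simply returns whatever `json.load` yields.

```python
import json
import urllib.request

# Public folder URL quoted in this README.
PUBLIC_FOLDER = "https://sciencedata.dk/public/1f5f56d09903fe259c0906add8b3a55e/"


def dataset_url(timestamp: str) -> str:
    """Build the URL of EDCS_text_cleaned_[timestamp].json in the public folder."""
    return f"{PUBLIC_FOLDER}EDCS_text_cleaned_{timestamp}.json"


def load_dataset(timestamp: str):
    """Download and parse the dataset (hypothetical helper).

    The file is several hundred MB, so the download may take a while
    and the parsed object needs a comparable amount of memory.
    """
    with urllib.request.urlopen(dataset_url(timestamp)) as response:
        return json.load(response)
```

For example, `load_dataset("2022_09_12")` would request the 2022 file, assuming the timestamp in the public folder follows the same pattern as the Zenodo file name `EDCS_text_cleaned_2022_09_12`.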

The final (streamlined) dataset

The final dataset produced by the scripts in this repository is called EDCS_text_cleaned_[timestamp].json and is published on Zenodo in all its versions; for details and links, see the How to cite us section above.

Additionally, the identical dataset can be accessed via sciencedata.dk, in the folder SDAM_root/SDAM_data/EDCS/public, or alternatively at https://sciencedata.dk/public/1f5f56d09903fe259c0906add8b3a55e/.

Scripts

Data accessing scripts

The data is accessed via a third-party tool, Lat Epig 2.0, and saved as a series of TSV files, one per Roman province, in the folder data. We then use R to read the TSV files and combine them into one dataframe, exported as a JSON file. Subsequently, we use a series of R scripts for further cleaning and transforming the data. The scripts can be found in the folder scripts and are named according to the sequence in which they should be run.

If you are trying to access the ETL scripts created in 2020-2021 that produced version 1.0 of the dataset (Heřmánková, Petra. (2021). EDCS_text_cleaned_2021_03_01 (Version 1.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4888817 https://zenodo.org/record/4888817), we refer you to releases 1.0 to 1.3 on Zenodo. Because of external dependencies and changes in third-party software and the databases between 2020 and 2022, the ETL scripts have changed since then (release v2.0).

Instructions for accessing the raw data

Clone https://github.com/mqAncientHistory/Lat-Epig repository to your local computer
Change the branch to scrapeprovinces
Make sure you have Docker installed; if not, follow the installation instructions for your OS https://docs.docker.com/engine/install/ and the post-installation steps https://docs.docker.com/engine/install/linux-postinstall/ (Linux)
Run in the terminal: bash dockerScraperAll.sh
The scraper will run on its own (for several hours, depending on your internet connection and your computer, usually around 4-5 hours) and when it's done, the data will appear in the main folder under full_scrape_[today's-date]. All inscriptions are saved as a TSV file and a JSON file, labelled with metadata containing the date of access, the source, the name of the province and their number.
Copy the entire folder to the EDCS_ETL repository for further processing (don't forget to rename the folder to `YYYY_MM_allProvinces` or make the necessary changes in the following scripts).
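The final copy-and-rename step can be sketched in Python as follows. This is only an illustration: `target_folder_name` and `copy_scrape` are hypothetical helpers, the paths are supplied by the caller, and the target name follows the `2022_09_allProvinces` pattern that the merging script expects.

```python
import shutil
from datetime import date
from pathlib import Path


def target_folder_name(scrape_date: date) -> str:
    """Return the folder name in the YYYY_MM_allProvinces convention."""
    return f"{scrape_date:%Y_%m}_allProvinces"


def copy_scrape(scrape_dir: Path, etl_data_dir: Path, scrape_date: date) -> Path:
    """Copy a finished scrape folder into the ETL data folder under the expected name.

    shutil.copytree raises FileExistsError if the destination already exists,
    which protects an earlier scrape from being overwritten.
    """
    destination = etl_data_dir / target_folder_name(scrape_date)
    shutil.copytree(scrape_dir, destination)
    return destination
```

A call such as `copy_scrape(Path("Lat-Epig/full_scrape_..."), Path("EDCS_ETL/data"), date(2022, 9, 12))` would then produce `EDCS_ETL/data/2022_09_allProvinces`; both paths here are placeholders for your local checkout locations.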

Alternatively, if you are using the old (pre-2022) version of the tool, you would use the script 10LatEpig20searchby_provinces.bsh to access the data; in the 2022 version this file is deprecated. The bash script programmatically extracted all non-empty inscriptions from individual provinces into separate CSV files, with a run time of ca. 16-20 hours. The script was to be used within a local instance of the Lat Epig 2.0 tool, and the CSV files were saved within that repository in the folder output.

11rEDCSmergecleanattrs.Rmd

Merging TSV files and cleaning attributes

The current script works with TSV files stored in the YYYY_MM_allProvinces folder. If you wish to work with JSON files, amend the script.

| | File | Source commentary |
| :--- | ---: | ---: |
| input | 2022_09_allProvinces in folder data | containing TSVs with inscriptions in individual provinces, accessed via Epigraphy Scraper Jupyter Notebook |
| output | EDCS_merged_cleaned_attrs_[timestamp].json | |
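The repository performs the merge in R (11rEDCSmergecleanattrs.Rmd). Purely as an illustration of the idea, the Python sketch below reads every per-province TSV in a folder into one list of records and writes it out as JSON. The function names, the `.tsv` extension, the UTF-8 encoding and the absence of quoted fields are assumptions, and the attribute cleaning done by the R script is not reproduced.

```python
import csv
import json
from pathlib import Path


def merge_province_tsvs(folder: Path) -> list[dict]:
    """Read every .tsv file in `folder` and return all rows as one list of dicts."""
    rows = []
    for tsv_path in sorted(folder.glob("*.tsv")):
        with tsv_path.open(newline="", encoding="utf-8") as handle:
            # Assumption: fields are tab-separated and not quoted.
            reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            rows.extend(dict(row) for row in reader)
    return rows


def save_as_json(rows: list[dict], out_path: Path) -> None:
    """Write the merged rows to a single JSON file, keeping non-ASCII characters."""
    with out_path.open("w", encoding="utf-8") as handle:
        json.dump(rows, handle, ensure_ascii=False)
```

Each row keeps the column names of its source file as keys, so files with differing headers simply yield records with differing keys.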

12rEDCScleaning_text.Rmd

Cleaning text of an inscription

| | File | Source commentary |
| :--- | ---: | ---: |
| input | EDCS_merged_cleaned_attrs_[timestamp].json | The current script works with a JSON file containing all inscriptions with their streamlined attributes. |
| output | EDCS_text_cleaned_[timestamp].json | |

The following scripts are exploratory only (they do not change the dataset, only explore its contents)

13rEDCSexploration.Rmd

Exploration of the entire dataset

| | File | Source commentary |
| :--- | ---: | ---: |
| input | EDCS_text_cleaned_[timestamp].json | The current script works with a JSON file containing all inscriptions with their streamlined attributes and cleaned text. |
| output | NA | |
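The exploration itself is done in R (13rEDCSexploration.Rmd). As a rough idea of what a first look at the data might involve, the Python sketch below counts how often each value of one attribute occurs. It assumes the dataset has been loaded as a list of dictionaries, one per inscription, which may not match the actual JSON layout; `count_values` is a hypothetical helper, and `material` is one of the attribute names mentioned for the 2022 version.

```python
from collections import Counter


def count_values(records: list[dict], attribute: str) -> Counter:
    """Count how often each value of `attribute` occurs across all records.

    Records lacking the attribute are counted under None; list values are
    converted to tuples so that they can be used as Counter keys.
    """
    counts = Counter()
    for record in records:
        value = record.get(attribute)
        if isinstance(value, list):
            value = tuple(value)
        counts[value] += 1
    return counts
```

Calling `count_values(records, "material").most_common(10)` would then list the ten most frequent values of that attribute.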

14rEDCStext_exploration.Rmd

Exploration of the text of inscriptions

| | File | Source commentary |
| :--- | ---: | ---: |
| input | EDCS_text_cleaned_[timestamp].json | The current script works with a JSON file containing all inscriptions with their streamlined attributes and cleaned text. |
| output | NA | |

15rEDCStextlemmatizationUDpipe.Rmd

Lemmatization of the text of inscriptions with the UDPipe tool. However, upon closer inspection, the results of this lemmatization were unsatisfactory.

| | File | Source commentary |
| :--- | ---: | ---: |
| input | EDCS_text_cleaned_[timestamp].json | The current script works with a JSON file containing all inscriptions with their streamlined attributes and cleaned text. |
| output | EDCS_text_lemmatized_udpipe_[timestamp].json | |

Related publications

Heřmánková, P., Kaše, V., & Sobotkova, A. (2021). Inscriptions as data: Digital epigraphy in macro-historical perspective. Journal of Digital History, 1(1), 99. https://doi.org/10.1515/jdh-2021-1004 - The article works with version 1 of the dataset, but version 2 follows the same principles. Some attribute names and the contents of the dataset may differ in version 2, reflecting changes made by the EDCS.

Owner

Name: Social Dynamics in the Ancient Mediterranean
Login: sdam-au
Kind: organization
Location: Aarhus, Denmark

Website: sdam.au.dk
Twitter: sdam_au
Repositories: 7
Profile: https://github.com/sdam-au

Research group of social complexity and dynamics in the Ancient Mediterranean

Committers

All Time

Total Commits: 50
Total Committers: 1
Avg Commits per committer: 50.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Petra Heřmánková	p**a@g**m	50

Issues and Pull Requests

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Science Score: 49.0%