edcs_etl
ETL repository for Epigraphic Database Clauss Slaby (http://www.manfredclauss.de/)
Basic Info
- Host: GitHub
- Owner: sdam-au
- License: cc-by-sa-4.0
- Language: Shell
- Default Branch: master
- Size: 107 MB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 1
- Open Issues: 0
- Releases: 5
Metadata Files
README.md
ETL workflow for quantitative analysis of inscriptions from the EDCS dataset
- ETL
This repository contains scripts for accessing, extracting and transforming epigraphic datasets from the Epigraphic Database Clauss-Slaby. We have developed a series of scripts that merge the data together and streamline them for quantitative analysis of epigraphic trends.
Authors
License
How to cite us
2022 version 2
DATASET 2022: Heřmánková, Petra. (2022). EDCS_text_cleaned_2022_09_12 (v2.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.7072337
https://zenodo.org/record/7072337
SCRIPTS 2022: Petra Heřmánková. (2022). sdam-au/EDCS_ETL: Scripts (v2.0). Zenodo. https://doi.org/10.5281/zenodo.7072355
https://doi.org/10.5281/zenodo.7072355
The 2022 dataset contains 537,286 cleaned and streamlined Latin inscriptions from the Epigraphic Database Clauss Slaby (EDCS, http://www.manfredclauss.de/), aggregated on 2022/09/12, created for the purpose of a quantitative study of epigraphic trends by the Social Dynamics in the Ancient Mediterranean Project (SDAM, http://sdam.au.dk). The dataset contains 27 attributes with original and streamlined data. Compared to the 2021 dataset, there are 36,750 more inscriptions and 2 fewer attributes (those containing redundant legacy data were removed). The entire dataset is therefore approximately the same size (465.5 MB in 2022 compared to 451.5 MB in 2021), but some of the attributes have been streamlined: some attribute names have changed for better consistency, e.g. Material > material, Latitude > latitude; some attributes are no longer available due to improvements of the Lat Epig tool, e.g. start_yr, notes_dating, inscription_stripped_final; and some new attributes were added due to improvements of the cleaning process, e.g. clean_text_conservative. For a full overview, see the Metadata section.
Metadata
EDCS 2022 dataset metadata with descriptions for all attributes.
2021 version 1
DATASET 2021: Heřmánková, Petra. (2021). EDCS_text_cleaned_2021_03_01 (Version 1.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4888817
https://zenodo.org/record/4888817
SCRIPTS 2021: Petra Heřmánková. (2022). sdam-au/EDCS_ETL: Scripts (v1.1). Zenodo. https://doi.org/10.5281/zenodo.6497148
https://doi.org/10.5281/zenodo.6497148
The 2021 dataset contains 500,536 cleaned and streamlined Latin inscriptions from the Epigraphic Database Clauss Slaby (EDCS, http://www.manfredclauss.de/), aggregated on 2021/03/01, created for the purpose of a quantitative study of epigraphic trends by the Social Dynamics in the Ancient Mediterranean Project (SDAM, http://sdam.au.dk). The dataset contains 29 attributes with original and streamlined data. For full overview, see the Metadata section.
Metadata
EDCS 2021 dataset metadata with descriptions for all attributes.
Data
The original raw data
The raw data is published as HTML through the web interface at www.manfredclauss.de. The output of the web interface is accessed by a third-party tool, Lat Epig 2.0, developed at Macquarie University in Sydney, and saved as a series of TSV files, one per province (CSV files in the pre-2022 version of the tool).
The scripts access the main dataset via a web interface, transform the data into one dataframe object and save the outcome to the SDAM project directory on sciencedata.dk and on Zenodo. Since the most important data files are in a public folder, you can use them and re-run our analyses even without a sciencedata.dk account or access to our team folder. A separate Python package, sddk, was created specifically for accessing sciencedata.dk from Python (see https://github.com/sdam-au/sddk). If you want to save the dataset in a different location, the scripts can be easily modified. You can access the file without logging in to sciencedata.dk. Here is the path to the file on sciencedata.dk:
SDAM_root/SDAM_data/EDCS/public/EDCS_text_cleaned_[timestamp].json or https://sciencedata.dk/public/1f5f56d09903fe259c0906add8b3a55e/EDCS_text_cleaned_[timestamp].json
To access the files created in previous steps of the ETL process, you can use the dataset from the public folder, or you have to rerun all scripts on your own.
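The public path above can be turned into a small loader. The following is a minimal Python sketch, not part of the repository: the helper names `dataset_url` and `load_records` are hypothetical, the timestamp format (`2022_09_12`) is inferred from the Zenodo file name, and the internal layout of the JSON (a list of records or a column-oriented object) is an assumption that should be checked against the actual file.

```python
import json

# Public folder URL as given in this README.
PUBLIC_FOLDER = "https://sciencedata.dk/public/1f5f56d09903fe259c0906add8b3a55e/"


def dataset_url(timestamp: str) -> str:
    """Build the public URL for a timestamp such as '2022_09_12' (assumed format)."""
    return f"{PUBLIC_FOLDER}EDCS_text_cleaned_{timestamp}.json"


def load_records(path):
    """Load a locally saved copy of the dataset as a list of dicts.

    Assumption: the file is either a list of records or a column-oriented
    object of the form {"column": {"row_id": value}}.
    """
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if isinstance(data, list):
        return data
    if isinstance(data, dict) and all(isinstance(col, dict) for col in data.values()):
        # Collect row ids in first-seen order, then rebuild one dict per row.
        row_ids = list(dict.fromkeys(rid for col in data.values() for rid in col))
        return [{name: col.get(rid) for name, col in data.items()} for rid in row_ids]
    raise ValueError("Unrecognised JSON layout")
```

To fetch the file first, something like `urllib.request.urlretrieve(dataset_url("2022_09_12"), "edcs.json")` should work, assuming a file with that timestamp is present in the public folder; note that the dataset is several hundred MB.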
The final (streamlined) dataset
The dataset produced by the scripts in this repository is called EDCS_text_cleaned_[timestamp].json and is published on Zenodo in all its versions; for details and links, see the How to cite us section above.
The identical dataset is also available on sciencedata.dk, in the folder SDAM_root/SDAM_data/EDCS/public or at https://sciencedata.dk/public/1f5f56d09903fe259c0906add8b3a55e/.
Scripts
Data accessing scripts
The data is accessed via a third-party tool, Lat Epig 2.0, and saved in the folder data as a series of TSV files, one per Roman province. We then use R to read the TSV files and combine them into one dataframe, which is exported as a JSON file. Subsequently, we use a series of R scripts for further cleaning and transforming of the data. The scripts can be found in the folder scripts and are named according to the sequence in which they should be run.
If you are looking for the ETL scripts created in 2020-2021 that produced version 1.0 of the dataset (Heřmánková, Petra. (2021). EDCS_text_cleaned_2021_03_01 (Version 1.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4888817 https://zenodo.org/record/4888817), we refer you to releases 1.0 to 1.3 on Zenodo. Because of external dependencies and changes in third-party software and the databases between 2020 and 2022, the ETL scripts have changed since then (release v2.0).
Instructions for accessing the raw data
- Clone the https://github.com/mqAncientHistory/Lat-Epig repository to your local computer.
- Change the branch to scrapeprovinces.
- Make sure you have Docker installed; if not, follow the installation instructions for your OS (https://docs.docker.com/engine/install/) and the post-installation steps (https://docs.docker.com/engine/install/linux-postinstall/, Linux).
- Run in the terminal: bash dockerScraperAll.sh
- The scraper runs on its own for several hours (usually around 4-5, depending on your internet connection and your computer). When it is done, the data appear in the main folder, in a folder labelled full_scrape_[today's-date]. All inscriptions are saved as TSV and JSON files, labelled with metadata containing the date of access, the source, the name of the province and their number.
- Copy the entire folder to the EDCS_ETL repository for further processing (don't forget to rename the folder to `YYYY_MM_allProvinces` or make the necessary changes in the following scripts).
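The final copy-and-rename step can also be scripted. Below is a minimal Python sketch under stated assumptions: the helper names `target_folder_name` and `copy_scrape` are hypothetical, and the destination is assumed to be the data folder of your local EDCS_ETL clone, following the `YYYY_MM_allProvinces` naming used by the merging script.

```python
import shutil
from datetime import date
from pathlib import Path


def target_folder_name(scrape_date: date) -> str:
    """Return the folder name expected by the merging script, e.g. 2022_09_allProvinces."""
    return f"{scrape_date:%Y_%m}_allProvinces"


def copy_scrape(scrape_dir, data_dir, scrape_date: date) -> Path:
    """Copy a finished full_scrape_* folder into data_dir under the expected name.

    shutil.copytree raises FileExistsError if the destination already exists,
    which protects an earlier scrape from being overwritten by accident.
    """
    destination = Path(data_dir) / target_folder_name(scrape_date)
    shutil.copytree(scrape_dir, destination)
    return destination
```

For example, `copy_scrape("Lat-Epig/full_scrape_2022-09-12", "EDCS_ETL/data", date(2022, 9, 12))` would create `EDCS_ETL/data/2022_09_allProvinces`; both paths here are placeholders, and the exact date format in the scrape folder name depends on the tool.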
Alternatively, if you are using the old (pre-2022) version of the tool, you would use the script 10LatEpig20searchby_provinces.bsh to access the data. This script is deprecated in the 2022 version. The bash script programmatically extracted all non-empty inscriptions from individual provinces into separate CSV files, with a run time of ca. 16-20 hrs. It was meant to be used within a local instance of the Lat Epig 2.0 tool, and the CSV files were saved within that repository in the folder output.
11rEDCSmergecleanattrs.Rmd
Merging TSV files and cleaning attributes
The current script works with TSV files stored in the YYYY_MM_allProvinces folder. If you wish to work with JSON files, amend the script.
|| File | Source commentary |
| :--- | ---: | ---: |
| input | 2022_09_allProvinces in folder data | containing TSVs with inscriptions from individual provinces, accessed via Epigraphy Scraper Jupyter Notebook |
| output | EDCS_merged_cleaned_attrs_[timestamp].json ||
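
The merging step can be illustrated outside R as well. The following Python sketch is a stand-in for the idea, not the logic of 11rEDCSmergecleanattrs.Rmd: the function name `merge_tsv_folder` and the added `source_file` column are hypothetical, and it assumes every TSV has a header row and uses default quoting rules.

```python
import csv
import json
from pathlib import Path


def merge_tsv_folder(folder, output_json) -> int:
    """Read every *.tsv file in folder, stack the rows, and write them as JSON.

    Returns the number of merged records. Files are read in name order so
    that repeated runs produce the same output.
    """
    records = []
    for tsv_path in sorted(Path(folder).glob("*.tsv")):
        with open(tsv_path, encoding="utf-8", newline="") as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                # Hypothetical provenance column: remember which file a row came from.
                row["source_file"] = tsv_path.name
                records.append(row)
    with open(output_json, "w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False)
    return len(records)
```

Keeping the source file name on each record makes it possible to trace an inscription back to its province file after merging; the actual script may handle provenance differently.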
12rEDCScleaning_text.Rmd
Cleaning text of an inscription
|| File | Source commentary |
| :--- | ---: | ---: |
| input| EDCS_merged_cleaned_attrs_[timestamp].json|The current script works with a JSON file containing all inscriptions with their streamlined attributes.|
| output| EDCS_text_cleaned_[timestamp].json||
The following scripts are exploratory only: they do not change the dataset, they only explore its contents.
13rEDCSexploration.Rmd
Exploration of the entire dataset
|| File | Source commentary |
| :--- | ---: | ---: |
| input| EDCS_text_cleaned_[timestamp].json|The current script works with a JSON file containing all inscriptions with their streamlined attributes and cleaned text.|
| output| NA||
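
As an illustration of the kind of read-only exploration meant here, the Python sketch below counts the values of one attribute across all records. It is not taken from 13rEDCSexploration.Rmd: the function name `attribute_counts` is hypothetical, `material` is used only because it is one of the attribute names mentioned above, and the values are assumed to be plain strings or missing.

```python
from collections import Counter


def attribute_counts(records, attribute="material", top=10):
    """Return the `top` most frequent values of `attribute` as (value, count) pairs.

    Missing or empty values are grouped under the label "(missing)".
    The input records are not modified.
    """
    values = (record.get(attribute) for record in records)
    counts = Counter("(missing)" if value in (None, "") else value for value in values)
    return counts.most_common(top)
```

Called on the list of records loaded from EDCS_text_cleaned_[timestamp].json, this gives a quick overview of how an attribute is distributed before any deeper analysis.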
14rEDCStext_exploration.Rmd
Exploration of the text of inscriptions
|| File | Source commentary |
| :--- | ---: | ---: |
| input| EDCS_text_cleaned_[timestamp].json|The current script works with a JSON file containing all inscriptions with their streamlined attributes and cleaned text.|
| output| NA||
15rEDCStextlemmatizationUDpipe.Rmd
Lemmatization of the text of inscriptions with the UDPipe tool. On closer inspection, however, the results of this lemmatization were unsatisfactory.
|| File | Source commentary |
| :--- | ---: | ---: |
| input| EDCS_text_cleaned_[timestamp].json|The current script works with a JSON file containing all inscriptions with their streamlined attributes and cleaned text.|
| output| EDCS_text_lemmatized_udpipe_[timestamp].json||
Related publications
Heřmánková, P., Kaše, V., & Sobotkova, A. (2021). Inscriptions as data: Digital epigraphy in macro-historical perspective. Journal of Digital History, 1(1), 99. https://doi.org/10.1515/jdh-2021-1004. The article works with version 1 of the dataset, but version 2 follows the same principles. Some attribute names, as well as the contents of the dataset, may differ in version 2, reflecting changes made by the EDCS.
Owner
- Name: Social Dynamics in the Ancient Mediterranean
- Login: sdam-au
- Kind: organization
- Location: Aarhus, Denmark
- Website: sdam.au.dk
- Twitter: sdam_au
- Repositories: 7
- Profile: https://github.com/sdam-au
Research group of social complexity and dynamics in the Ancient Mediterranean
Committers
Last synced: about 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Petra Heřmánková | p****a@g****m | 50 |

