edh_etl

This repository contains scripts for accessing, extracting and transforming epigraphic datasets from the Epigraphic Database Heidelberg (https://edh.ub.uni-heidelberg.de/) in a reproducible manner.

https://github.com/sdam-au/edh_etl

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 26 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.5%) to scientific vocabulary

Keywords

dataset epigraphy etl inscriptions jupyter-notebook python r xml
Last synced: 6 months ago

Repository

This repository contains scripts for accessing, extracting and transforming epigraphic datasets from the Epigraphic Database Heidelberg (https://edh.ub.uni-heidelberg.de/) in a reproducible manner.

Basic Info
  • Host: GitHub
  • Owner: sdam-au
  • License: cc-by-sa-4.0
  • Language: Jupyter Notebook
  • Default Branch: master
  • Homepage:
  • Size: 60.3 MB
Statistics
  • Stars: 8
  • Watchers: 3
  • Forks: 1
  • Open Issues: 0
  • Releases: 4
Topics
dataset epigraphy etl inscriptions jupyter-notebook python r xml
Created over 6 years ago · Last pushed almost 3 years ago
Metadata Files
Readme License Citation

README.md

ETL workflow for quantitative analysis of inscriptions from the EDH dataset



Purpose

This repository contains scripts for accessing, extracting, and transforming epigraphic datasets from the Epigraphic Database Heidelberg. The repository will serve as a template for future SDAM collaborative research projects in accessing and analysing large digital datasets.

The scripts access the main dataset via a web API, transform it into one dataframe object, merge and enrich these data with geospatial data and additional data from the XML files, and save the outcome to the SDAM project directory on sciencedata.dk and the finished product on Zenodo. Since the most important data files are in a public folder, you can use and re-run our analyses even without a sciencedata.dk account and access to our team folder. If you face any issues with accessing the data, please contact us at sdam.cas@list.au.dk.

A separate Python package, sddk, was created specifically for accessing sciencedata.dk from Python (see https://github.com/sdam-au/sddk). If you want to save the dataset in a different location, the scripts can easily be modified.
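As a quick illustration, the public file can also be read directly into pandas over HTTPS, without sddk. This is a minimal sketch, not the project's own loader; the exact file name and the JSON orientation are assumptions, so check the public folder listing for the current version:

```python
import pandas as pd

# Public folder on sciencedata.dk -- no account or credentials needed
PUBLIC_FOLDER = "https://sciencedata.dk/public/b6b6afdb969d378b70929e86e58ad975/"
# Hypothetical exact name, following the EDH_text_cleaned_[timestamp].json pattern
FILE_NAME = "EDH_text_cleaned_2022_11_03.json"

# pandas can read the JSON export straight over HTTPS
edh = pd.read_json(PUBLIC_FOLDER + FILE_NAME)
print(edh.shape)
```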

Authors

  • Petra Heřmánková SDAM project, petra.hermankova@cas.au.dk
  • Vojtěch Kaše SDAM project, vojtech.kase@gmail.com

License

CC-BY-SA 4.0

How to cite us

2022 version 2

DATASET 2022: Heřmánková, Petra, & Kaše, Vojtěch. (2022). EDH_text_cleaned_2022_11_03 (v2.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.7303886

SCRIPTS 2022: Heřmánková, Petra, & Kaše, Vojtěch. (2022). sdam-au/EDH_ETL: Scripts (v2.0). Zenodo. https://doi.org/10.5281/zenodo.7303867

The 2022 dataset contains 81,883 cleaned and streamlined Latin inscriptions from the Epigraphic Database Heidelberg (EDH, https://edh-www.adw.uni-heidelberg.de/), aggregated on 2022/11/03 and created for the purpose of a quantitative study of epigraphic trends by the Social Dynamics in the Ancient Mediterranean project (SDAM, http://sdam.au.dk). The dataset contains 69 attributes with original and streamlined data. Compared to the 2021 dataset, there are 407 more inscriptions and 5 fewer attributes (which contained redundant legacy data); the entire dataset is approximately the same size, but some of the attributes are streamlined (260 MB in 2022 compared to 234 MB in 2021). Some attributes were removed because they are no longer available due to changes in the EDH itself, e.g. edh_geography_uri, external_image_uris, fotos, geography, military, social_economic_legal_history, uri; and some new attributes were added due to the streamlining of the ETL process, e.g. pleiades_id. For a full overview, see the Metadata section.

Metadata

EDH 2022 dataset metadata with descriptions for all attributes

2021 version 1

DATASET 2021: Heřmánková, Petra, & Kaše, Vojtěch. (2021). EDH_text_cleaned_2021_01_21 (v1.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4888168

SCRIPTS 2021: Heřmánková, Petra, & Kaše, Vojtěch. (2021). sdam-au/EDH_ETL: Scripts (v2.0). Zenodo. https://doi.org/10.5281/zenodo.6478243

Metadata

EDH 2021 dataset metadata with descriptions for all attributes.

Data

The original raw data

The original data come from two sources:

  1. the EpiDoc XML files available at https://edh.ub.uni-heidelberg.de/data (inscriptions)
  2. the web API available at https://edh.ub.uni-heidelberg.de/data/api (inscriptions and geospatial data)

The scripts merge data from these two sources into a Pandas dataframe, which is then exported as one JSON file for further use. You can access this file without logging in to sciencedata.dk: it lives in the SDAM_root/SDAM_data/EDH/public folder on sciencedata.dk, or alternatively at https://sciencedata.dk/public/b6b6afdb969d378b70929e86e58ad975/.

To access the files created in previous steps of the ETL process, either use the dataset from the public folder or rerun all the scripts on your own.

The final (streamlined) dataset

The dataset produced by the scripts in this repository is called EDH_text_cleaned_[timestamp].json and is published on Zenodo in all its versions; for details and links, see the How to cite us section above.

Additionally, the identical dataset can be accessed via sciencedata.dk: the SDAM_root/SDAM_data/EDH/public folder, or alternatively https://sciencedata.dk/public/b6b6afdb969d378b70929e86e58ad975/.

Scripts

Data accessing scripts

We use Python scripts (Jupyter notebooks) to access the API and extract data from it, to parse the XML files for additional metadata, and to combine these two resources into one dataset. Subsequently, we use both R and Python for further cleaning and transforming of the data. The scripts can be found in the scripts folder and are named according to the sequence in which they should be run.


1_0_py_EXTRACTING-GEOGRAPHIES.ipynb

The data available via the API are easily accessible and can be extracted by means of R or Python in a rather straightforward way. First we extract the geocoordinates from the public API, using script 1_0.

Extracting geographical coordinates

| | File | Source commentary |
| :--- | ---: | ---: |
| input | edhGeographicData.json | containing all EDH geographies, loaded from https://edh.ub.uni-heidelberg.de/data/api |
| output | EDH_geo_dict_[timestamp].json | |
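The gist of this step can be sketched as follows; the layout of the dump (an "items" list whose records carry an "id" field) is an assumption, not a documented fact about the EDH file:

```python
import json

# Assumes a local copy of edhGeographicData.json downloaded from the EDH
# open-data API (see the table above); the "items"/"id" layout is assumed
with open("edhGeographicData.json", encoding="utf-8") as f:
    geo = json.load(f)

# Re-key the geography records by their id so inscriptions can be joined
# against them later (cf. EDH_geo_dict_[timestamp].json)
geo_dict = {str(item["id"]): item for item in geo["items"]}

with open("EDH_geo_dict_2022_11_03.json", "w", encoding="utf-8") as f:
    json.dump(geo_dict, f, ensure_ascii=False)
```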

1_1_py_EXTRACTION_edh-inscriptions-from-web-api.ipynb

As a next step, we use the public API to access and download all the inscriptions. Obtaining the whole dataset of circa 81,000+ inscriptions as a Python dataframe takes about 12 minutes (see the respective script 1_1). We have decided to save the dataframe as a JSON file for reasons of interoperability between Python and R.

Extracting all inscriptions from the API

| | File | Source commentary |
| :--- | ---: | ---: |
| input | requests to https://edh.ub.uni-heidelberg.de/data/api | |
| output | EDH_onebyone[timestamp].json | |
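In outline, the harvest loop looks roughly like this; the per-record route and the "items" wrapper in the response are assumptions (consult https://edh.ub.uni-heidelberg.de/data/api for the actual interface):

```python
import pandas as pd
import requests

BASE = "https://edh.ub.uni-heidelberg.de/data/api"

records = []
for n in range(1, 101):  # the full run walks 80,000+ HD numbers
    # hypothetical per-record route, for illustration only
    resp = requests.get(f"{BASE}/inschrift/id/HD{n:06d}")
    if resp.ok:
        records.extend(resp.json().get("items", []))

# One dataframe, saved as JSON for Python/R interoperability
pd.DataFrame(records).to_json("EDH_onebyone_2022_11_03.json")
```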

1_2_py_EXTRACTION_edh-xml_files.ipynb

However, the dataset from the API is a simplified one (when compared with the records online and in the XML), intended primarily for queries in the web interface. For instance, the API data encode all the information about dating by means of two variables, "not_before" and "not_after". This made us curious about how the data translate dating information like "around the middle of the 4th century CE". Therefore, we decided to enrich the JSON created from the API with data from the original XML files, which also include some additional variables (see script 1_2).

Extracting XML files

| | File | Source commentary |
| :--- | ---: | ---: |
| input | edhEpidocDump_HD[first_number]-HD[last_number].zip | https://edh.ub.uni-heidelberg.de/data/download |
| output | EDH_xml_data_[timestamp].json | |
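For a flavour of what the XML adds, the dating attributes can be pulled from a single EpiDoc file along these lines; the notBefore-custom / notAfter-custom attribute names are typical of EDH EpiDoc but should be treated as an assumption:

```python
from pathlib import Path
from bs4 import BeautifulSoup  # beautifulsoup4 (plus lxml) is in requirements.txt

def dating_from_epidoc(path):
    """Pull the dating information from one EpiDoc XML file (sketch only)."""
    soup = BeautifulSoup(Path(path).read_text(encoding="utf-8"), "xml")
    orig_date = soup.find("origDate")
    if orig_date is None:
        return None
    return {
        "not_before": orig_date.get("notBefore-custom") or orig_date.get("notBefore"),
        "not_after": orig_date.get("notAfter-custom") or orig_date.get("notAfter"),
        # the free-text phrasing the API flattens away,
        # e.g. "around the middle of the 4th century CE"
        "dating_text": orig_date.get_text(strip=True),
    }
```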

1_3_py_MERGING_API_GEO_and_XML.ipynb

To enrich the JSON with the geodata extracted in script 1_0, we have developed script 1_3.

Merging geographies, API, and XML files

| | File | Source commentary |
| :--- | ---: | ---: |
| input 1 | EDH_geographies_raw.json | https://edh.ub.uni-heidelberg.de/data/download |
| input 2 | EDH_onebyone[timestamp].json | |
| input 3 | EDH_xml_data_[timestamp].json | |
| output | EDH_merged_[timestamp].json | |
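Conceptually, the merge is a pair of joins; the join keys used below ("id" for the HD identifier, "geography" for the geography reference) are assumptions, not the scripts' actual column names:

```python
import json
import pandas as pd

api_df = pd.read_json("EDH_onebyone_2022_11_03.json")
xml_df = pd.read_json("EDH_xml_data_2022_11_03.json")

# API and XML records describe the same inscriptions, so join on the
# (assumed) shared HD identifier column
merged = api_df.merge(xml_df, on="id", how="left", suffixes=("", "_xml"))

# Enrich with coordinates from the geography dict built in script 1_0
with open("EDH_geo_dict_2022_11_03.json", encoding="utf-8") as f:
    geo_dict = json.load(f)
merged["coordinates"] = merged["geography"].map(
    lambda g: geo_dict.get(str(g), {}).get("coordinates")
)

merged.to_json("EDH_merged_2022_11_03.json")
```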

1_4_r_DATASET_ATTRIBUTES_CLEANING.Rmd

In the next step we clean and streamline the API attributes in a reproducible way in R (see script 1_4), so they are ready for any future analysis. We keep the original attributes along with the new, clean ones.

Cleaning and streamlining attributes

| | File | Source commentary |
| :--- | ---: | ---: |
| input | EDH_merged_[timestamp].json | The current script works with a JSON file containing all merged inscriptions. |
| output | EDH_attrs_cleaned_[timestamp].json | |

1_5_r_TEXT_INSCRIPTION_CLEANING

The text of the inscriptions is cleaned in script 1_5.

Cleaning and streamlining the text of the inscriptions

| | File | Source commentary |
| :--- | ---: | ---: |
| input | EDH_attrs_cleaned_[timestamp].json | The current script works with a JSON file containing all inscriptions with their streamlined attributes. |
| output | EDH_text_cleaned_[timestamp].json | |
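To give a sense of what this step involves, here is a hedged Python analogue of stripping Leiden-style editorial markup; the pipeline itself does this in R (script 1_5) with far more extensive rules:

```python
import re

def clean_text(text):
    """Illustrative cleanup of Leiden-style editorial markup (sketch only)."""
    # strip editorial brackets and sigla, keeping the enclosed text
    text = re.sub(r"[()\[\]{}<>/=+*?]", "", text)
    # collapse the whitespace left behind
    return re.sub(r"\s+", " ", text).strip()

clean_text("D(is) M(anibus) / [---] Aug(usti) l(iberto)")
# -> 'Dis Manibus --- Augusti liberto'
```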

The following scripts document basic usage for Python and R (they do not change the dataset; they only demonstrate access to the data from both languages).

2_py_PYTHON_USAGE_TEST.ipynb

Script demonstrating how to load the dataset into Python via sciencedata.dk (with or without credentials), using the sddk package.

2_r_R_USAGE_TEST.Rmd

Script demonstrating how to load the dataset into R via sciencedata.dk (without credentials).


Related publications

Heřmánková, P., Kaše, V., & Sobotkova, A. (2021). Inscriptions as data: Digital epigraphy in macro-historical perspective. Journal of Digital History, 1(1), 99. https://doi.org/10.1515/jdh-2021-1004 - The article works with version 1 of the dataset, but version 2 follows the same principles. Some attribute names, as well as the contents of the dataset, may vary in version 2 (reflecting changes made by the EDH).

Owner

  • Name: Social Dynamics in the Ancient Mediterranean
  • Login: sdam-au
  • Kind: organization
  • Location: Aarhus, Denmark

Research group of social complexity and dynamics in the Ancient Mediterranean

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
references:
  - type: "software"
authors: 
  - affiliation: "Aarhus University"
    family-names: "Heřmánková"
    given-names: "Petra"
    orcid: "https://orcid.org/0000-0002-6349-0540"
  - affiliation: "Aarhus University"
    family-names: "Kaše"
    given-names: "Vojtěch"
    orcid: "https://orcid.org/0000-0002-6601-1605"
license: cc-by-nc-4.0
repository-code: "https://github.com/sdam-au/EDH_ETL"
title: "EDH ETL"
version: 1.1
doi: 10.5281/zenodo.6478243
date-released: 2021-06-01
abstract: "This repository contains scripts for accesing, extracting and transforming epigraphic datasets from the Epigraphic Database Heidelberg (https://edh.ub.uni-heidelberg.de/) in a reproducible manner."
keywords:
  - "digital epigraphy"
  - "Latin inscriptions"
  - "digital history"
  - "Roman history"
  - "dataset"


Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 3
  • Total pull requests: 0
  • Average time to close issues: 8 months
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 1.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • kasev (2)

Dependencies

requirements.txt pypi
  • beautifulsoup4 *
  • geopandas *
  • google-auth *
  • gspread *
  • gspread_dataframe *
  • matplotlib *
  • numpy *
  • pandas *
  • sddk *
  • seaborn *