dh2025

Repository containing the data and code for our short paper on the novel beginning study presented at DH2025 in Lisbon

https://github.com/canspinproject/dh2025


Basic Info
  • Host: GitHub
  • Owner: CANSpiNproject
  • Language: HTML
  • Default Branch: main
  • Size: 28.5 MB

README.md

dh2025


This repository contains the data and code for our short paper "They crossed the valley of Catamarca: A study of narrative space in novel openings" presented at DH2025 in Lisbon.

Content

Folders and files

  • cs1 annotation data as .tsv files:
    • /canspin-deu-19
    • /canspin-deu-20 (for legal reasons, this data from the 20th century is only available as shuffled tsv)
    • /canspin-spa-19
    • /canspin-lat-19
  • cs1 annotation data as CATMA project:
    • /CATMA_4AA4ADC0-4C28-54F9-B6A1-5DCEFF34B90B_DH2025_CANSpiN
  • data and documentation of the novel beginning analysis:
    • /novel_beginning_analysis
    • categories.md: documents our definitions of the novel beginning analysis categories and their application to the texts
    • categorization.tsv: contains the novel beginning analysis data
  • data and visualizations derived from the analysis:
    • /results
    • annotation_distribution__<chapter_id>.html.json: contains the distribution of cs1 annotations across a chapter, in text units of 200 tokens (spa-19 and lat-19) or 300 tokens (deu-19 and deu-20)
    • annotation_statistics__first_1000_token.json: documents the relative and absolute cs1 annotation amounts and the most frequent tokens per annotation class for the first 1000 tokens of the first chapter of each text
    • annotation_statistics__whole_chapters.json: documents the relative and absolute cs1 annotation amounts and the most frequent tokens per annotation class for the whole first chapter of each text
    • /visualizations
      • annotation_distribution__<chapter_id>.html/.png: visualizes the distribution data of cs1 annotations for each chapter
      • cs1_annotation_amounts__1000_tokens.html/.png: shows the ratio of annotations to tokens in the first 1000 tokens of each text
      • cs1_annotation_amounts__all_tokens.html/.png: shows the ratio of annotations to tokens in the whole first chapter of each text
      • first_character_event_overview.png: shows the token position at which the first character event occurs in each text
      • first-character-event-cs1-relation__<chapter_id>.png: combines the data on cs1 annotation distribution of each chapter with the first character event data of each chapter
  • bibliography of the short paper:
    • bibliography.bib
  • notebook to recreate analysis results that are already saved in the /results folder:
    • perform_analysis.ipynb
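
The distribution files in /results report cs1 annotations per text unit of 200 or 300 tokens. As a rough illustration of that kind of binning, the following sketch counts annotations per fixed-size unit. The function name, the input format (a list of 0-based token positions), and the toy values are assumptions made for this example. They are not taken from the notebook or from gitma-canspin.

```python
def annotations_per_unit(annotation_positions, chapter_length, unit_size):
    """Count annotations in consecutive text units of `unit_size` tokens.

    `annotation_positions` holds the 0-based token index at which each
    annotation starts (hypothetical input format, not the repository's schema).
    """
    if unit_size <= 0:
        raise ValueError("unit_size must be positive")
    # Ceiling division, so that a shorter final unit is kept.
    n_units = -(-chapter_length // unit_size)
    counts = [0] * n_units
    for pos in annotation_positions:
        if not 0 <= pos < chapter_length:
            raise ValueError(f"position {pos} lies outside the chapter")
        counts[pos // unit_size] += 1
    return counts


# Toy data: a 1000-token chapter split into units of 200 tokens.
toy_positions = [3, 150, 199, 200, 420, 999]
print(annotations_per_unit(toy_positions, 1000, 200))  # [3, 1, 1, 0, 1]
```

The unit size is a parameter, so the same function covers both settings mentioned above (200 and 300 tokens).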

Corpus overview

The corpus consists of the first chapters of eight German, Spanish, and Latin American novels from the 19th and 20th centuries. The data originates from the European Literary Text Collection (ELTeC), the Corpus de novelas hispanoamericanas del siglo XIX (conha19), the Complete Works of Uwe Johnson project (CWUJ), and e-books.

| Corpus | ID | Title | Author | Year | Tokens | Source |
|--------|----|-------|--------|------|--------|--------|
| DEU19 | DEU19001 | Weisse Sclaven oder die Leiden des Volkes | Willkomm, Ernst Adolf | 1845 | 5491 | ELTeC-deu |
| DEU19 | DEU19030 | Die verlorene Handschrift | Freytag, Gustav | 1864 | 7179 | ELTeC-deu |
| DEU20 | DEU20002 | Ansichten eines Clowns | Böll, Heinrich | 1963 | 2689 | E-Book: Kiepenheuer & Witsch 2009 (restricted) |
| DEU20 | DEU20021 | Zwei Ansichten | Johnson, Uwe | 1965 | 744 | CWUJ (restricted) |
| SPA19 | SPA19001 | El Señor de Bembibre | Gil y Carrasco, Enrique | 1855 | 1883 | ELTeC-spa |
| SPA19 | SPA19008 | Los templarios | Mora, Juan de Dios | 1856 | 4309 | ELTeC-spa |
| LAT19 | LAT19004 | El falso Inca. Cronicón de la conquista | Payró, Roberto | 1905 | 1210 | conha19 |
| LAT19 | LAT19041 | El pozo del Yocci | Gorriti, Juana Manuela | 1876 | 1074 | conha19 |

Annotation overview

Classes

The annotation system CANSpiN.CS1 (v1.1.0) is defined in the respective guideline.

Amount

[Figure: annotation_overview]

Usage

To use the notebook perform_analysis.ipynb, install the gitma-canspin package (v1.6.5) by following the instructions in its README. The notebook lets you reproduce the analysis steps we performed. You do not need to run it if you only want to see the analysis results; for those, see our paper and the contents of the /results folder.
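
For readers who want a feel for the statistics stored in /results without running the notebook, here is a minimal sketch of how absolute and relative annotation amounts and the most frequent tokens per class could be computed. It does not use gitma-canspin. The function name, the input format of (class, token) pairs, the class labels `class_a` and `class_b`, and the choice to express relative amounts as annotations per token are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def annotation_statistics(annotations, token_count, top_n=3):
    """Summarize annotations per class.

    `annotations` is an iterable of (class_name, token) pairs; this input
    format is hypothetical and not the schema of the files in /results.
    """
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    tokens_by_class = defaultdict(Counter)
    for class_name, token in annotations:
        # Lowercase so that "Valley" and "valley" count as the same token.
        tokens_by_class[class_name][token.lower()] += 1
    stats = {}
    for class_name, counter in tokens_by_class.items():
        absolute = sum(counter.values())
        stats[class_name] = {
            "absolute": absolute,
            # Assumption: relative amount = annotations per token.
            "relative": absolute / token_count,
            "most_frequent": counter.most_common(top_n),
        }
    return stats


# Toy data with placeholder class names (not actual cs1 classes).
toy_annotations = [
    ("class_a", "valley"),
    ("class_a", "Valley"),
    ("class_a", "house"),
    ("class_b", "crossed"),
]
toy_stats = annotation_statistics(toy_annotations, token_count=1000)
print(toy_stats["class_a"]["absolute"])       # 3
print(toy_stats["class_a"]["most_frequent"])  # [('valley', 2), ('house', 1)]
```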

Licenses

The original texts are in the public domain, with the exception of the German-language novels from the 20th century, which are protected by copyright. Accordingly, the latter data is published here in a derived format as shuffled .tsv only.
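
How the shuffled .tsv files were produced is not documented in this README. Purely as an illustration of the idea, the sketch below shuffles the rows of a TSV file so that the running text can no longer be read in order, while per-row information stays intact. Row-level shuffling, the presence of a header line, the function name, and the toy columns are all assumptions.

```python
import random


def shuffle_tsv_rows(tsv_text, seed=None, has_header=True):
    """Return `tsv_text` with its data rows in random order.

    Assumes one record per line. The header line, if present, stays first.
    """
    lines = tsv_text.rstrip("\n").split("\n")
    header = lines[:1] if has_header else []
    body = lines[1:] if has_header else lines
    rng = random.Random(seed)  # a seed makes the shuffle repeatable
    rng.shuffle(body)
    return "\n".join(header + body) + "\n"


# Toy data with hypothetical column names.
toy_tsv = "token\tclass\nthey\tO\ncrossed\tclass_b\nthe\tO\nvalley\tclass_a\n"
print(shuffle_tsv_rows(toy_tsv, seed=42))
```

Statistics that depend only on counts, such as annotation amounts per class, survive this transformation, whereas position-based measures would have to be derived before shuffling.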

We publish the annotations under the Creative Commons Attribution 4.0 International license and the Jupyter notebook under the GNU General Public License v3.

The Aspekta font used for the creation of visualizations with the Pillow package in the notebook is licensed under the Open Font License 1.1.

Owner

  • Name: Computational Approaches to Narrative Space in 19th and 20th Century Novels
  • Login: CANSpiNproject
  • Kind: organization
  • Location: Germany

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: dh2025
message: >-
  If you use this dataset, please cite it using the metadata
  from this file.
type: dataset
authors:
  - given-names: Nils
    family-names: Kellner
    affiliation: University of Rostock
    orcid: 'https://orcid.org/0009-0002-3966-5635'
  - given-names: Marc
    family-names: Lemke
    affiliation: University of Rostock
    orcid: 'https://orcid.org/0009-0004-8065-8191'
  - given-names: Ulrike
    family-names: Henny-Krahmer
    affiliation: University of Rostock
    orcid: 'https://orcid.org/0000-0003-2852-065X'
  - given-names: Julián C.
    family-names: Spinelli
    affiliation: University of Buenos Aires
    orcid: 'https://orcid.org/0009-0003-0895-815X'
  - given-names: Erik
    family-names: Renz
    affiliation: University of Rostock
    orcid: 'https://orcid.org/0009-0005-8288-7470'
  - given-names: Anika
    family-names: Piotraschke
    affiliation: University of Rostock
    orcid: 'https://orcid.org/0009-0004-3076-5781'
identifiers:
  - type: doi
    value: 10.5281/zenodo.15423438
repository-code: 'https://github.com/CANSpiNproject/dh2025'
url: 'https://www.canspin.uni-rostock.de/en'
abstract: >-
  This is a repository containing the data and code for our short paper 
  "They crossed the valley of Catamarca: A study of narrative space in novel openings" 
  presented at DH2025 in Lisbon.
keywords:
  - CANSpiN
  - SPP 2207
  - Digital Humanities
  - Computational Literary Studies
  - DH2025
license: CC-BY-4.0
commit: b4715de59e57a3646c9e48e84cd98c6faf797ad6
version: 1.0.4
date-released: '2025-05-23'
