cluster_htrc

Identifying the boundaries of main content of fiction and non-fiction works in the HathiTrust Extracted Features dataset.

https://github.com/alucic2/cluster_htrc

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: ieee.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.3%) to scientific vocabulary

Keywords

clustering-algorithm clustering-analysis detecting-paratext-boundaries digital-libraries extracting-features scanned-documents smoothing-methods

Repository

Basic Info
  • Host: GitHub
  • Owner: alucic2
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Size: 240 KB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 5 years ago · Last pushed almost 4 years ago
Metadata Files
Readme License Citation

README.md

Identification of main content in the works included in the HathiTrust Extracted Features dataset

Code for clustering the digitized pages of a work, based on the features available through the HathiTrust Extracted Features dataset v.2.0, with the aim of separating the main content of the work from its paratextual elements.

Reference: A. Lucic, R. Burke and J. Shanahan, "Unsupervised Clustering with Smoothing for Detecting Paratext Boundaries in Scanned Documents," 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2019, pp. 53-56, doi: 10.1109/JCDL.2019.00018. The conference paper is available through the DOI above.

Running the code

The Python libraries required to run the code are listed in the requirements.txt file. The code depends on methods from the htrc-feature-reader Python library, which can be installed with either the pip or the conda package manager: pip install htrc-feature-reader or conda install -c htrc htrc-feature-reader
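Before running the notebooks, it can help to confirm that the installed packages match the pinned versions. The following is a minimal sketch using only the Python standard library; the helper names parse_pin and check_pins are hypothetical and are not part of this repository, and the sketch assumes every requirement line has the form name==version.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_pin(line):
    """Split a 'name ==version' requirement line into (name, version)."""
    name, _, pinned = line.partition("==")
    return name.strip(), pinned.strip()


def check_pins(lines):
    """Compare pinned versions against what is installed in this environment.

    Returns a dict mapping each package name to 'ok', 'missing', or a short
    description of the version mismatch.
    """
    report = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, pinned = parse_pin(line)
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
        else:
            if installed == pinned:
                report[name] = "ok"
            else:
                report[name] = f"installed {installed}, pinned {pinned}"
    return report


# Illustrative call with two of the pins listed for this repository.
print(check_pins(["htrc-feature-reader ==1.81", "numpy ==1.22.3"]))
```

To check a whole file, pass its lines instead, for example check_pins(open("requirements.txt").read().splitlines()).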

Motivation for the development of this method

This work was developed as part of the Reading Chicago Reading project at DePaul University in 2018. The HathiTrust Research Center Advanced Collaborative computational support grant that the project received allowed us to explore a set of in-copyright and out-of-copyright fiction and non-fiction works, related to the analysis of the One Book One Chicago program, that were included in the Extracted Features dataset. To limit the extraction of text features to the main content of a work, we needed to establish where the main content begins and ends in the digitized pages. If paratext elements such as a Table of Contents, Epilogue, Bibliography, or Critical Introduction are not excluded before text measures are extracted from fiction and non-fiction works, they can skew the metrics obtained from the work (e.g. the count of locations or personal names in the work). Paratext boundaries are not a metadata element that consistently accompanies the digital files included in digital libraries. Even when such information exists in the accompanying metadata files, it needs to be verified.

Modeling paratext as the outlier of main work

The conclusion of the work was that paratext elements lend themselves to being modeled as outliers relative to the main work. As the amount of paratext in a volume increases, however, it becomes harder to establish where the main content begins and ends.
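The outlier idea can be illustrated with a small, self-contained sketch. This is not the code in this repository or the exact method of the paper: it assumes a single hypothetical feature (token count per page), uses a simple one-dimensional two-means clustering, treats the larger cluster as main content, smooths the page labels with a sliding majority vote, and reports the longest contiguous run of main-content pages. All function names and the sample counts are made up for illustration.

```python
from statistics import mean


def two_means_1d(values, iterations=20):
    """Cluster non-empty 1-D values into two groups (0 = low, 1 = high)."""
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to the nearer of the two centres.
        labels = [1 if abs(v - hi) < abs(v - lo) else 0 for v in values]
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        if not low or not high:
            break  # all values fell into one group
        new_lo, new_hi = mean(low), mean(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centres stopped moving
        lo, hi = new_lo, new_hi
    return labels


def smooth_labels(labels, window=3):
    """Replace each label by the majority vote of a centred sliding window."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        nearby = labels[max(0, i - half): i + half + 1]
        smoothed.append(1 if 2 * sum(nearby) > len(nearby) else 0)
    return smoothed


def longest_run(labels, target=1):
    """Return (first, last) indices of the longest run of target, or None."""
    best, start = None, None
    for i, lab in enumerate(list(labels) + [None]):  # sentinel closes last run
        if lab == target and start is None:
            start = i
        elif lab != target and start is not None:
            if best is None or i - start > best[1] - best[0] + 1:
                best = (start, i - 1)
            start = None
    return best


def find_main_content(tokens_per_page, window=3):
    """Estimate 0-based (first, last) page indices of the main content."""
    labels = two_means_1d(tokens_per_page)
    # Assumption: paratext is the outlier, so the larger cluster is main content.
    main = 1 if 2 * sum(labels) >= len(labels) else 0
    is_main = [1 if lab == main else 0 for lab in labels]
    return longest_run(smooth_labels(is_main, window))


# Made-up token counts: sparse front matter, dense main text with one
# sparse page (e.g. a chapter break), then sparse back matter.
pages = [5, 12, 40, 310, 295, 20, 305, 320, 298, 315, 300, 290, 308, 30, 10]
print(find_main_content(pages))
```

The smoothing step is what keeps the isolated sparse page inside the main content instead of splitting it into two runs. The sketch also shows the limitation noted above: once paratext pages outnumber main-content pages, the majority-cluster assumption no longer holds.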

Acknowledgment

We thank HathiTrust Research Center for the Advanced Collaborative Support grant and for the use of the HathiTrust Research Data Capsule.

Future work

We plan to continue developing this method to establish the upper bound on the accuracy with which paratext elements can be identified and excluded from digital files. We also plan to explore the degree to which different paratext elements lend themselves to being identified in a work using automated methods.

Owner

  • Name: Ana Lucic
  • Login: alucic2
  • Kind: user
  • Company: University of Illinois at Urbana-Champaign

Citation (citation.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  Identification of main content in the works
  included in the HathiTrust Extracted Features
  dataset
message: 'Please use this citation, if you use this software'
type: software
authors:
  - given-names: Ana
    family-names: Lucic
    affiliation: University of Illinois
  - given-names: Robin
    family-names: Burke
    affiliation: University of Colorado
  - given-names: John
    family-names: Shanahan
    affiliation: DePaul University
repository-code: 'https://github.com/alucic2/cluster_htrc'

Dependencies

requirements.txt pypi
  • collection ==0.1.6
  • htrc-feature-reader ==1.81
  • matplotlib ==3.5.2
  • numpy ==1.22.3
  • pandas ==1.4.2
  • scikit-learn ==1.0.2
  • seaborn ==0.11.2