cluster_htrc
Identifying the boundaries of main content of fiction and non-fiction works in the HathiTrust Extracted Features dataset.
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Identification of main content in the works included in the HathiTrust Extracted Features dataset
Code for clustering the digitized pages of a work, using the features available in the HathiTrust Extracted Features dataset v.2.0, with the aim of separating the main content of the work from its paratextual elements.
Reference: A. Lucic, R. Burke and J. Shanahan, "Unsupervised Clustering with Smoothing for Detecting Paratext Boundaries in Scanned Documents," 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2019, pp. 53-56, doi: 10.1109/JCDL.2019.00018. The conference paper is available through the DOI above.
Running the code
The Python libraries required to run the code are listed in the requirements.txt file. The code depends on the htrc-feature-reader Python library, which can be installed with either the pip or the conda package manager: pip install htrc-feature-reader or conda install -c htrc htrc-feature-reader
Motivation for the development of this method
This work was developed as part of the Reading Chicago Reading project at DePaul University in 2018. The HathiTrust Research Center Advanced Collaborative Support grant that the project received allowed us to explore a set of in-copyright and out-of-copyright fiction and non-fiction works, included in the Extracted Features dataset, that relate to the analysis of the One Book One Chicago program. To limit the extraction of text features to the main content of a work, we needed to establish where the main content begins and ends in the digitized pages. If paratext elements such as a Table of Contents, Epilogue, Bibliography, or Critical Introduction are not excluded before text measures are extracted from fiction and non-fiction works, these elements can skew the metrics obtained from the work (e.g. the count of locations or personal names in the work). Paratext boundaries are not a consistent metadata element accompanying the digital files included in digital libraries. Even when such information exists in the accompanying metadata files, it needs to be verified.
Modeling paratext as the outlier of main work
The conclusion of the work was that paratext elements lend themselves to being modeled as outliers relative to the main work. As the amount of paratext in a volume increases, however, it becomes harder to establish where the main content begins and ends.
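To make the idea of treating paratext as an outlier concrete, here is a minimal sketch of one possible post-processing step. It is not the code from this repository or the exact procedure in the paper. It assumes that each page has already been given a cluster label (for example 0 or 1) by some clustering step, smooths those labels with a sliding majority vote, and then takes the longest run of the most common label as the main content. The function names smooth_labels and main_content_span, the window size, and the example labels are all hypothetical.

```python
from collections import Counter
from itertools import groupby


def smooth_labels(labels, window=3):
    """Replace each page label with the majority label in a centered window.

    Assumption: `labels` holds one cluster label per page, in page order.
    On a tie (possible near the edges), the label seen first in the
    window is kept.
    """
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        neighborhood = labels[max(0, i - half): i + half + 1]
        smoothed.append(Counter(neighborhood).most_common(1)[0][0])
    return smoothed


def main_content_span(labels, main_label=None):
    """Return (start, end) inclusive page indices of the longest run of main_label.

    Assumption: if `main_label` is not given, the most frequent label is
    treated as main content and the other labels as paratext (outliers).
    """
    if not labels:
        return None
    if main_label is None:
        main_label = Counter(labels).most_common(1)[0][0]
    best = None
    position = 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label == main_label and (best is None or length > best[1] - best[0] + 1):
            best = (position, position + length - 1)
        position += length
    return best


# Hypothetical labels for a 13-page volume: 1 = outlier pages, 0 = main content,
# with one isolated mislabeled page in the middle of the main content.
raw_labels = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1]
smoothed_labels = smooth_labels(raw_labels, window=3)
print(main_content_span(raw_labels))
print(main_content_span(smoothed_labels))
```

In this toy example the single mislabeled page splits the main content into two runs, so the unsmoothed labels give a span that is too short; after smoothing, the run is contiguous. When a volume contains a large share of paratext, the most frequent label may no longer correspond to main content, which is one way the difficulty described above can show up.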
Acknowledgment
We thank the HathiTrust Research Center for the Advanced Collaborative Support grant and for the use of the HathiTrust Research Data Capsule.
Future work
We plan to continue developing this method to establish the upper bounds of accuracy with which paratext elements can be identified and excluded from digital files. We also plan to explore the degree to which different paratext elements lend themselves to being identified in a work using automated methods.
Owner
- Name: Ana Lucic
- Login: alucic2
- Kind: user
- Company: University of Illinois at Urbana-Champaign
- Twitter: analucic3000
- Repositories: 2
- Profile: https://github.com/alucic2
Citation (citation.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: >-
Identification of main content in the works
included in the HathiTrust Extracted Features
dataset
message: 'Please use this citation, if you use this software'
type: software
authors:
- given-names: Ana
family-names: Lucic
affiliation: University of Illinois
- given-names: Robin
family-names: Burke
affiliation: University of Colorado
- given-names: John
family-names: Shanahan
affiliation: DePaul University
repository-code: 'https://github.com/alucic2/cluster_htrc'
Dependencies
- collection ==0.1.6
- htrc-feature-reader ==1.81
- matplotlib ==3.5.2
- numpy ==1.22.3
- pandas ==1.4.2
- scikit-learn ==1.0.2
- seaborn ==0.11.2