https://github.com/centre-for-humanities-computing/chicago_corpus
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: centre-for-humanities-computing
- License: mit
- Default Branch: main
- Size: 5.76 MB
Statistics
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
The Chicago Corpus

As part of the Fabula-NET project at the Center for Humanities Computing, Aarhus University, we present a dataset of quality judgments on 9,000 19th- and 20th-century English-language literary novels by 3,166 predominantly Anglophone authors.
The data include annotations of expert opinions and crowd-based resources, allowing comparative analyses of different literary quality evaluations, as well as several textual metrics chosen for their connection with literary reception. A large part of the corpus is subject to copyright (see the available pre-1924 works here). We release quality and reception measures together with stylometric and sentiment data for each of the 9,000 novels to promote future research and comparison. Read the paper presenting this resource.
⚡ Data included
- 9,000 titles
- Author, title & year
- Various textual metrics
- Various reception metrics
For an overview of all included data, see the corpus documentation.
Available formats: .xlsx, .json
🔍 Example
| BOOKID | TITLE | AUTHFIRST | AUTHLAST | PUBLDATE | ... | AVGRATING | SCIFIAWARDS | PULITZER | TRANSLATIONS | ... | PERPLEXITY | MEAN_SENT | READABILITY |
| ------ | ----- | --------- | -------- | -------- | --- | --------- | ----------- | -------- | ------------ | --- | ---------- | --------- | ----------- |
| 6913 | A Clash of Kings | George R. R. | Martin | 1999 | ... | 4.41 | 1 | 0 | 38 | ... | 79.97 | -0.002 | 92.73 |
| 20636 | Dune | Frank | Herbert | 1965 | ... | 4.25 | 1 | 0 | 398 | ... | 72.74 | -0.007 | 85.18 |
| 22741 | Beloved | Toni | Morrison | 1987 | ... | 3.92 | 0 | 1 | 68 | ... | 68.78 | 0.030 | 91.71 |
| 5778 | Misery | Stephen | King | 1987 | ... | 4.20 | 0 | 0 | 74 | ... | 68.09 | -0.032 | 82.54 |
| 86 | The Portrait of a Lady | Henry | James | 1881 | ... | 3.78 | 0 | 0 | 53 | ... | 80.35 | 0.150 | 71.65 |
Above: Example of titles and corresponding values for selected metrics
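As a minimal sketch of how rows like these could be queried once loaded, the snippet below parses a small JSON sample and filters it. The sample values are copied from the example table; the list-of-objects JSON layout is an assumption, since the released `.json` file may be structured differently, and the helper `titles_where` is hypothetical.

```python
import json

# Sample rows copied from the example table. The list-of-objects layout is an
# assumption for illustration; the released .json file may be organised differently.
SAMPLE_JSON = """
[
  {"BOOKID": 6913,  "TITLE": "A Clash of Kings",       "AUTHLAST": "Martin",   "PUBLDATE": 1999, "AVGRATING": 4.41, "PULITZER": 0, "TRANSLATIONS": 38},
  {"BOOKID": 20636, "TITLE": "Dune",                   "AUTHLAST": "Herbert",  "PUBLDATE": 1965, "AVGRATING": 4.25, "PULITZER": 0, "TRANSLATIONS": 398},
  {"BOOKID": 22741, "TITLE": "Beloved",                "AUTHLAST": "Morrison", "PUBLDATE": 1987, "AVGRATING": 3.92, "PULITZER": 1, "TRANSLATIONS": 68},
  {"BOOKID": 5778,  "TITLE": "Misery",                 "AUTHLAST": "King",     "PUBLDATE": 1987, "AVGRATING": 4.20, "PULITZER": 0, "TRANSLATIONS": 74},
  {"BOOKID": 86,    "TITLE": "The Portrait of a Lady", "AUTHLAST": "James",    "PUBLDATE": 1881, "AVGRATING": 3.78, "PULITZER": 0, "TRANSLATIONS": 53}
]
"""

records = json.loads(SAMPLE_JSON)


def titles_where(rows, column, value):
    """Return the titles of all rows whose `column` equals `value` (hypothetical helper)."""
    return [row["TITLE"] for row in rows if row[column] == value]


# Titles flagged with the binary PULITZER proxy.
pulitzer_titles = titles_where(records, "PULITZER", 1)

# Rows ordered by the continuous AVGRATING proxy, highest first.
by_rating = sorted(records, key=lambda row: row["AVGRATING"], reverse=True)

print(pulitzer_titles)
print([row["TITLE"] for row in by_rating])
```

For the full dataset, the same filtering and sorting would apply after reading the released file instead of the inline sample.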
📈 Corpus statistics

The corpus of texts from which we constructed our dataset was assembled by Hoyt Long and Richard Jean So in the Textual Optics Lab. It comprises 9,088 novels published in the United States between 1880 and 2000, selected by the number of libraries holding each title according to the WorldCat catalogue, with preference given to works held by more libraries.
| Titles | Authors | Titles per author |
| ------ | ------- | ----------------- |
| 9088 | 3166 | 2.88 |
Above: Number of titles/authors in the corpus
Below: Mean & SD of some of the included features
| Metric | Word count | Sentence Length | Word length | Type/Token Ratio | Compressibility | Bigram Entropy | Word Entropy | Flesch Ease | Dale Chall New | Mean Sentiment | Std Sentiment | End Sentiment | Beginning Sentiment | Hurst Exponent | Approximate Entropy |
| ------ | ---------- | --------------- | ----------- | ---------------- | --------------- | -------------- | ------------ | ----------- | -------------- | -------------- | ------------- | ------------- | ------------------- | -------------- | ------------------- |
| Mean (µ) | 118584.71 | 86.56 | 3.67 | 0.69 | 2.92 | 14.63 | 9.69 | 82.70 | 5.10 | 0.03 | 0.35 | 0.03 | 0.04 | 0.61 | 1.75 |
| St. dev. (±) | 64746.05 | 29.44 | 0.18 | 0.02 | 0.14 | 0.55 | 0.30 | 6.48 | 0.33 | 0.04 | 0.04 | 0.07 | 0.05 | 0.04 | 0.15 |
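Summary statistics of this kind can be recomputed per feature. The sketch below does so for the five PERPLEXITY values shown in the example table, using Python's standard library. It uses the sample standard deviation (n - 1 denominator), which is an assumption; the README does not state which estimator the reported figures use.

```python
import statistics

# PERPLEXITY values of the five titles in the example table, in table order.
perplexity = [79.97, 72.74, 68.78, 68.09, 80.35]

mean_perplexity = statistics.mean(perplexity)

# Sample standard deviation (n - 1 denominator). Whether the corpus-level table
# uses the sample or the population estimator is not stated, so this is an assumption.
sd_perplexity = statistics.stdev(perplexity)

print(f"mean = {mean_perplexity:.2f}, sd = {sd_perplexity:.2f}")
```

Five titles are far too few to compare against the corpus-level figures; the snippet only illustrates the computation.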
🏆 "Quality", "reader appreciation" or "popularity" metrics

Beyond textual features, we present various "quality proxies", that is, ways of estimating valuation in literary culture, such as whether or not a title is included in bestseller or canon lists. We also include what we call "continuous" proxies, that is, scores per title, such as Goodreads ratings or translation numbers (see the corpus documentation).
Because of the library holdings selection criterion, the corpus comprises much high-quality fiction by authors who have received prestigious distinctions, such as the Nobel Prize (e.g., Toni Morrison) and the National Book Award (e.g., Don DeLillo). Yet library holdings appear to indicate both high distinction and mass popularity, reflecting library users' demand and preferences. The corpus therefore also comprises widely popular novels from mainstream literature (e.g., Agatha Christie) and notable works across the broad spectrum of so-called "genre literature", from mystery to science fiction (e.g., Tolkien, Philip K. Dick). An examination of the relation between various proxies in this corpus is forthcoming.
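One simple way to look at how two continuous proxies relate is a rank correlation. The sketch below computes Spearman's rho between AVGRATING and TRANSLATIONS for the five titles in the example table. The choice of Spearman's rho is ours for illustration and is not taken from the corpus documentation or the paper; the functions `ranks` and `spearman_rho` are hypothetical helpers, and the shortcut formula used assumes there are no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman's rho via the rank-difference formula (valid only without ties)."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Values copied from the example table, in table order.
avg_rating = [4.41, 4.25, 3.92, 4.20, 3.78]
translations = [38, 398, 68, 74, 53]

rho = spearman_rho(avg_rating, translations)
print(round(rho, 3))
```

With only five titles the result says nothing about the corpus as a whole; on the full dataset, ties would be common and a tie-aware implementation would be needed.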
📖 Documentation
| | |
| --- | --- |
| 📄 Paper | The Chicago resource paper. |
| ✏️ Documentation | Detailed description of measures and proxies included in the dataset. |
| 🗂️ Previous works | Publications that have previously used the Chicago Corpus. |
| 🔬 Textual Optics Lab | The Chicago Corpus at the Textual Optics Lab, University of Chicago. |
| 📚 Citation | Bibtex citation. |
| 🔥 EmotionArcs | Emotion Arcs of the Chicago Corpus (a linked dataset). |
| 🔬 CHC | Center for Humanities Computing, hosting the FabulaNET project. |
Owner
- Name: Center for Humanities Computing Aarhus
- Login: centre-for-humanities-computing
- Kind: organization
- Email: chcaa@cas.au.dk
- Location: Aarhus, Denmark
- Website: https://chc.au.dk/
- Repositories: 130
- Profile: https://github.com/centre-for-humanities-computing
GitHub Events
Total
- Watch event: 3
Last Year
- Watch event: 3
Issues and Pull Requests
Last synced: about 1 year ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0