newspaper-front-pages
Science Score: 49.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 3 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (5.8%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: wragge
- License: MIT
- Language: Jupyter Notebook
- Default Branch: main
- Size: 1.54 MB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Trove newspaper front pages
This repository demonstrates how to harvest information about the contents of newspaper front pages from Trove. It then uses the harvested data to explore how the contents of front pages have changed over time.
Notebooks
- Harvest articles published on page one of newspapers – uses the Trove Newspaper Harvester as a library to download information about all articles published on the front pages of newspapers (about 19 million articles)
- Process the harvested data – converts the large ndjson file created by the Trove Newspaper Harvester into parquet formatted datasets
- Exploring changes in the front pages of newspapers – uses the parquet datasets to visualise changes in front pages over time
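The conversion step can be sketched roughly as follows. This is a minimal illustration rather than the code in the notebook: it assumes the ndjson records already carry the column names documented in the Datasets section, whereas the real harvester output may be shaped differently, and the sample records and the `load_ndjson` helper are hypothetical. Reading in chunks keeps memory use bounded when the input is large.

```python
import io

import pandas as pd

# Hypothetical sample records; the real harvester output may use other field names.
SAMPLE_NDJSON = "\n".join(
    [
        '{"article_id": 1, "title": "Shipping news", "newspaper_id": 35, "date": "1900-01-01", "category": "Article", "word_count": 300, "page_id": 100}',
        '{"article_id": 2, "title": "Advertising", "newspaper_id": 35, "date": "1900-01-01", "category": "Advertising", "word_count": 1200, "page_id": 100}',
        '{"article_id": 3, "title": "Cable news", "newspaper_id": 13, "date": "1950-06-15", "category": "Article", "word_count": 800, "page_id": 200}',
    ]
)


def ndjson_to_frames(source, chunksize=100_000):
    """Yield DataFrames of at most `chunksize` records from an ndjson source."""
    reader = pd.read_json(source, lines=True, chunksize=chunksize)
    for chunk in reader:
        yield chunk


def load_ndjson(source, chunksize=100_000):
    """Read an ndjson source in chunks and combine the chunks into one DataFrame."""
    return pd.concat(ndjson_to_frames(source, chunksize), ignore_index=True)


# A tiny chunk size is used here only to exercise the chunked path.
df = load_ndjson(io.StringIO(SAMPLE_NDJSON), chunksize=2)
print(df[["article_id", "category", "word_count"]])
```

With pyarrow installed, calling `df.to_parquet("front_pages.parquet", index=False)` on the result would write it out as a parquet file.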
Datasets
front_pages.parquet
Contains summary information about articles published on the front pages of newspapers. There are 16,398,514 rows of data (274.4 MB). It was created on 2 August 2023. Includes the following columns:
| Column | Description |
|--------|-------------|
| article_id | Trove numeric identifier for article |
| title | title of the article |
| newspaper_id | Trove numeric identifier for the newspaper in which the article was published |
| date | date the article was published |
| category | category of the article, eg: 'Article', 'Advertising' |
| word_count | number of words in the article |
| page_id | Trove numeric identifier for the page on which the article was published |
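As an illustration of how these columns might be used, the sketch below counts articles by publication year and category. The sample rows are hypothetical and stand in for the real data; in practice the dataset could presumably be loaded with `pd.read_parquet("front_pages.parquet")`. The `articles_per_year` helper is not part of the repository.

```python
import pandas as pd

# Hypothetical sample rows that follow the documented column layout.
front_pages = pd.DataFrame(
    {
        "article_id": [1, 2, 3, 4],
        "title": ["Advertising", "Shipping news", "Cable news", "Sport"],
        "newspaper_id": [35, 35, 13, 13],
        "date": ["1900-01-01", "1900-01-01", "1950-06-15", "1950-06-15"],
        "category": ["Advertising", "Article", "Article", "Article"],
        "word_count": [1200, 300, 800, 450],
        "page_id": [100, 100, 200, 200],
    }
)
front_pages["date"] = pd.to_datetime(front_pages["date"])


def articles_per_year(df):
    """Count articles by publication year and category."""
    return (
        df.assign(year=df["date"].dt.year)
        .groupby(["year", "category"])
        .size()
        .reset_index(name="articles")
    )


counts = articles_per_year(front_pages)
print(counts)
```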
front_pages_totals.parquet
Derived from front_pages.parquet by adding together the word counts for articles within each category, giving us the total words per category for each front page. It was created on 2 August 2023. There are 4,351,009 rows of data (35.1 MB). Includes the following columns:
| Column | Description |
|--------|-------------|
| date | date the page was published |
| page_id | Trove numeric identifier for the page |
| newspaper_id | Trove numeric identifier for the newspaper |
| category | article category eg: 'Article', 'Advertising' |
| total | number of words in this category on this page |
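The derivation described above amounts to a grouped sum. The following is a minimal sketch using pandas on hypothetical sample rows, not necessarily how the notebook does it; the `page_totals` and `category_share` helpers are illustrative names. The share calculation is an extra step added here to show one way the totals could feed a comparison of categories on a page.

```python
import pandas as pd

# Hypothetical sample rows with the columns needed from front_pages.parquet.
front_pages = pd.DataFrame(
    {
        "date": ["1900-01-01", "1900-01-01", "1900-01-01", "1950-06-15", "1950-06-15"],
        "page_id": [100, 100, 100, 200, 200],
        "newspaper_id": [35, 35, 35, 13, 13],
        "category": ["Advertising", "Advertising", "Article", "Article", "Article"],
        "word_count": [1200, 400, 300, 800, 450],
    }
)


def page_totals(df):
    """Sum word counts per category for each front page."""
    return (
        df.groupby(["date", "page_id", "newspaper_id", "category"], as_index=False)["word_count"]
        .sum()
        .rename(columns={"word_count": "total"})
    )


def category_share(totals):
    """Add each category's share of the words on its page."""
    page_words = totals.groupby("page_id")["total"].transform("sum")
    return totals.assign(share=totals["total"] / page_words)


totals = page_totals(front_pages)
shares = category_share(totals)
print(shares)
```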
Created by Tim Sherratt, August 2023
Owner
- Name: Tim Sherratt
- Login: wragge
- Kind: user
- Website: https://timsherratt.org
- Repositories: 209
- Profile: https://github.com/wragge
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Dependencies
- altair *
- black *
- duckdb *
- ipywidgets *
- isort *
- jupyter-archive *
- jupyterlab *
- jupyterlab-code-formatter *
- pandas *
- pyarrow *
- python-dotenv *
- requests *
- tqdm *
- trove-newspaper-harvester *
- voila *
- 120 dependencies