newspaper-front-pages

https://github.com/wragge/newspaper-front-pages

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: found
✓ .zenodo.json file: found
✓ DOI references: found 3 DOI reference(s) in README
✓ Academic publication links: links to zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (5.8%) to scientific vocabulary

Last synced: 11 months ago

Repository

Basic Info

Host: GitHub
Owner: wragge
License: MIT
Language: Jupyter Notebook
Default Branch: main
Size: 1.54 MB

Statistics

Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 1

Created almost 3 years ago · Last pushed almost 3 years ago

Metadata Files

Readme License Citation

Trove newspaper front pages

This repository demonstrates how to harvest information about the contents of newspaper front pages from Trove. It then uses the harvested data to explore how the contents of front pages have changed over time.

Notebooks

- Harvest articles published on page one of newspapers: uses the Trove Newspaper Harvester as a library to download information about all articles published on the front pages of newspapers (about 19 million articles).
- Process the harvested data: converts the large NDJSON file created by the Trove Newspaper Harvester into Parquet-formatted datasets.
- Exploring changes in the front pages of newspapers: uses the Parquet datasets to visualise changes in front pages over time.

Datasets

`front_pages.parquet`

Contains summary information about articles published on the front pages of newspapers. There are 16,398,514 rows of data (274.4 MB). It was created on 2 August 2023 and includes the following columns:

| Column | Description |
|--------|-------------|
| article_id | Trove numeric identifier for the article |
| title | title of the article |
| newspaper_id | Trove numeric identifier for the newspaper in which the article was published |
| date | date the article was published |
| category | category of the article, e.g. 'Article', 'Advertising' |
| word_count | number of words in the article |
| page_id | Trove numeric identifier for the page on which the article was published |
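As a rough illustration of how these columns might be used, the sketch below counts front-page articles by year and category with pandas. The function name `articles_per_year` and the sample rows are made up for this example; it also assumes the `date` column can be parsed by `pd.to_datetime`, which should be checked against the real file.

```python
import pandas as pd


def articles_per_year(df: pd.DataFrame) -> pd.DataFrame:
    """Count articles by publication year and category.

    Hypothetical helper: expects the documented `date` and `category` columns.
    """
    # Assumption: `date` holds values that pd.to_datetime can parse
    years = pd.to_datetime(df["date"]).dt.year.rename("year")
    return df.groupby([years, "category"]).size().reset_index(name="articles")


# Tiny invented sample using the documented column names (not real Trove data)
sample = pd.DataFrame(
    {
        "article_id": [1, 2, 3, 4],
        "title": ["Advertising", "Advertising", "SHIPPING NEWS", "LOCAL NEWS"],
        "newspaper_id": [35, 35, 35, 11],
        "date": ["1900-01-01", "1900-01-01", "1950-06-01", "1950-06-01"],
        "category": ["Advertising", "Advertising", "Article", "Article"],
        "word_count": [500, 300, 200, 1000],
        "page_id": [100, 100, 200, 201],
    }
)

counts = articles_per_year(sample)
print(counts)
```

With the dataset downloaded locally, the same function could be applied to `pd.read_parquet("front_pages.parquet")`, which requires `pyarrow` (listed in the repository's requirements).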

`front_pages_totals.parquet`

Derived from `front_pages.parquet` by adding together the word counts for articles within each category, giving the total words per category for each front page. It was created on 2 August 2023. There are 4,351,009 rows of data (35.1 MB). It includes the following columns:

| Column | Description |
|--------|-------------|
| date | date the page was published |
| page_id | Trove numeric identifier for the page |
| newspaper_id | Trove numeric identifier for the newspaper |
| category | article category, e.g. 'Article', 'Advertising' |
| total | number of words in this category on this page |
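The derivation described above (summing word counts per category for each page) can be sketched with a pandas group-by. This is only one possible approach under the documented column names; the repository's own notebook may do it differently, for example with DuckDB. The function name `front_page_totals` and the sample rows are invented for illustration.

```python
import pandas as pd


def front_page_totals(articles: pd.DataFrame) -> pd.DataFrame:
    """Sum word counts per category for each front page.

    Hypothetical helper: expects the documented columns of front_pages.parquet.
    """
    keys = ["date", "page_id", "newspaper_id", "category"]
    return (
        articles.groupby(keys, as_index=False)["word_count"]
        .sum()
        .rename(columns={"word_count": "total"})
    )


# Invented sample rows: one page with three articles, one page with one article
articles = pd.DataFrame(
    {
        "article_id": [1, 2, 3, 4],
        "newspaper_id": [35, 35, 35, 11],
        "date": ["1900-01-01", "1900-01-01", "1900-01-01", "1950-06-01"],
        "category": ["Advertising", "Advertising", "Article", "Article"],
        "word_count": [500, 300, 200, 1000],
        "page_id": [100, 100, 100, 200],
    }
)

totals = front_page_totals(articles)
print(totals)
```

Each output row corresponds to one category on one page, which matches the shape of `front_pages_totals.parquet` as described.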

Created by Tim Sherratt, August 2023

Owner

Name: Tim Sherratt
Login: wragge
Kind: user

Website: https://timsherratt.org
Repositories: 209
Profile: https://github.com/wragge

Issues and Pull Requests

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Dependencies

requirements.in pypi

altair *
black *
duckdb *
ipywidgets *
isort *
jupyter-archive *
jupyterlab *
jupyterlab-code-formatter *
pandas *
pyarrow *
python-dotenv *
requests *
tqdm *
trove-newspaper-harvester *
voila *

requirements.txt pypi

120 dependencies
