top_news
Collecting URLs Daily From News Feeds of Major National News Sites 2022--
Statistics
- Stars: 12
- Watchers: 3
- Forks: 3
- Open Issues: 0
- Releases: 0
README.md
Top News! URLs from News Feeds of Major National News Sites (2022-)
We automatically pull daily news data from major national news sites (ABC, CBS, CNN, LA Times, NBC, NPR, NYT, Politico, ProPublica, USA Today, and WaPo) using GitHub workflows. For the latest version, see the respective JSON files.
As of March 2025, we have about 700k unique URLs.
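Counting unique URLs across the per-site JSON files could look roughly like the sketch below. This is an illustration under stated assumptions, not the repository's aggregation script: it assumes each feed file is a JSON array of objects with a `url` key, and the function name `collect_unique_urls` and the directory layout are hypothetical.

```python
import json
from pathlib import Path


def collect_unique_urls(json_dir):
    """Return the sorted unique URLs found in all *.json files in json_dir.

    Assumption: each file holds a JSON array of objects with a "url" key.
    """
    urls = set()
    for path in sorted(Path(json_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            items = json.load(fh)
        for item in items:
            url = item.get("url")
            if url:  # skip entries without a URL
                urls.add(url.strip())
    return sorted(urls)
```

Using a set keeps the deduplication simple; if the real files use a different structure, only the inner loop needs to change.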
Other Scripts + Data
The script for aggregating the URLs, and the March 2025 dump of URLs (.zip)
The script that downloads the article text, parses some features with newspaper3k (e.g., publication date, authors), and stores the results in a DB is here. The script checks the local DB first, so each run only processes new data.
- The June 2023 full-text dump is here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ZNAKK6
- The March 2025 dump (minus the exceptions listed below) is in the same place.
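The "check the local DB, then process only new data" pattern described above can be sketched as follows. This is a minimal illustration using the standard-library `sqlite3` module, not the repository's script: the table name `stories`, its columns, and the `process_new_urls` and `parse_article` names are all assumptions made for the example.

```python
import sqlite3


def process_new_urls(conn, urls, parse_article):
    """Parse and store only the URLs that are not already in the local DB.

    parse_article is a caller-supplied (hypothetical) function that takes a
    URL and returns a dict with "title", "publish_date", and "text" keys.
    Returns the number of newly stored articles.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stories ("
        "url TEXT PRIMARY KEY, title TEXT, publish_date TEXT, text TEXT)"
    )
    # Load the URLs that were handled in earlier runs.
    seen = {row[0] for row in conn.execute("SELECT url FROM stories")}
    added = 0
    for url in urls:
        if url in seen:
            continue  # already processed, skip the download
        article = parse_article(url)
        conn.execute(
            "INSERT INTO stories (url, title, publish_date, text) "
            "VALUES (?, ?, ?, ?)",
            (url, article["title"], article["publish_date"], article["text"]),
        )
        seen.add(url)
        added += 1
    conn.commit()
    return added
```

In practice, `parse_article` could wrap newspaper3k's `Article` class (calling `download()` and `parse()`, then reading `title`, `publish_date`, and `text`); it is passed in as a parameter here so the sketch runs without network access.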
newspaper3k can't parse USA Today, Politico, and ABC URLs. For those sites, I use a custom Google search to dig up the URLs and get the data. The script is here.
Get Started With Exploring the Data
Some code (from a Jupyter notebook) for exploring the DB:
```python
from itertools import islice

from sqlite_utils import Database

db = Database("../cbs.db")
print("Tables:", db.table_names())
```
Tables: ['../cbs_stories']
Table Schema
```python
table_name = "../cbs_stories"  # yup! it has the ../
schema = db[table_name].schema
print("Schema:\n")
print(schema)
```
```
Schema:

CREATE TABLE ../cbs_stories
```
```python
db_file = "../cbs.db"
table_name = "../cbs_stories"  # yup! it has the ../

db = Database(db_file)

# Preview the first five rows
for row in islice(db[table_name].rows, 5):
    print(f"URL: {row['url']}")
    print(f"Title: {row['title']}")
    print(f"Date: {row['publish_date']}")
    print(f"Text preview: {row['text'][:100]}...\n")
```
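Because the table name contains `../`, it has to be quoted as an identifier whenever it appears in raw SQL. A minimal sketch of the same preview using only the standard-library `sqlite3` module (the helper name `preview_rows` is hypothetical):

```python
import sqlite3


def preview_rows(conn, table_name, limit=5):
    """Return up to `limit` rows, as dicts, from a table whose name needs quoting."""
    # SQLite quotes identifiers with double quotes; any embedded double
    # quote is escaped by doubling it.
    quoted = '"' + table_name.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted} LIMIT ?", (limit,))
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]
```

For example, `preview_rows(sqlite3.connect("../cbs.db"), "../cbs_stories")` would return the first five rows as dictionaries, assuming the DB file is at that path.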
Exporting to Pandas
```python
import pandas as pd

# Option 1: Convert all data to a DataFrame
df = pd.DataFrame(list(db[table_name].rows))

# Option 2: If the table is very large, you might want to limit rows
df = pd.DataFrame(list(islice(db[table_name].rows, 1000)))  # first 1000 rows

# Print info about the DataFrame
print(f"DataFrame shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(df.head())
```
🔗 Adjacent Repositories
- notnews/good_nyt — Patterns in NYT production from 1987 to 2007
- notnews/foxnewstranscripts — Fox News Transcripts 2003--2025
- notnews/uknotnews — Not News: Provision of Apolitical News in the British News Media
- notnews/nbc_transcripts — NBC transcripts 2011--2014
- notnews/hard_news — The Softening of Network Television News
Owner
- Name: Not News
- Login: notnews
- Kind: organization
- Website: http://notnews.github.io
- Repositories: 15
- Profile: https://github.com/notnews
News about news
Citation (Citation.cff)
cff-version: 1.2.0
message: "If you use this dataset, please cite it as below."
authors:
  - family-names: "Sood"
    given-names: "Gaurav"
  - family-names: "Willis"
    given-names: "Derek"
title: "Top News! URLs from News Feeds of Major National News Sites (2022--)"
version: 1.0.0
date-released: 2022-01-01
url: "https://github.com/notnews/topnews"
repository-code: "https://github.com/notnews/topnews"
abstract: >
  A comprehensive collection of news URLs from major national news sources including
  ABC, CBS, CNN, LA Times, NBC, NPR, NYT, Politico, ProPublica, USA Today, and
  WaPo.
keywords:
  - news
  - dataset
  - media
  - journalism
  - content analysis
license: "CC-BY-4.0"
doi: "" # Add a DOI if you have one
references:
  - authors:
      - family-names: "Sood"
        given-names: "Gaurav"
      - family-names: "Willis"
        given-names: "Derek"
    title: "Full-text dump of news articles (March 2025)"
    type: dataset
    year: 2025
    url: "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ZNAKK6"