top_news

Collecting URLs Daily From News Feeds of Major National News Sites 2022--

https://github.com/notnews/top_news

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.3%) to scientific vocabulary

Keywords

abc cbs la-times nbc news newspaper3k npr nyt politico propublica rss-feed usa-today wapo

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: notnews
Language: Jupyter Notebook
Default Branch: main
Homepage:
Size: 1.42 GB

Statistics

Stars: 12
Watchers: 3
Forks: 3
Open Issues: 0
Releases: 0

Created over 3 years ago · Last pushed 6 months ago

Metadata Files

Readme Citation

Top News! URLs from News Feeds of Major National News Sites (2022-)

We use GitHub Workflows to automatically pull daily news data from the feeds of major national news sites: ABC, CBS, CNN, LA Times, NBC, NPR, NYT, Politico, ProPublica, USA Today, and WaPo. For the latest version, see the respective JSON files.
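As a rough illustration of the collection step, the sketch below pulls the `<link>` of each `<item>` out of an RSS 2.0 document and merges unseen URLs into an existing list. It is not the repository's workflow code: the sample feed, `extract_links`, and `merge_links` are hypothetical names, and the real JSON files may be laid out differently.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real run would download the XML from a news site.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>Story A</title><link>https://example.com/a</link></item>
  <item><title>Story B</title><link>https://example.com/b</link></item>
</channel></rss>"""


def extract_links(rss_xml: str) -> list[str]:
    """Return the <link> text of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


def merge_links(existing: list[str], new: list[str]) -> list[str]:
    """Append URLs not seen before, preserving first-seen order."""
    seen = set(existing)
    merged = list(existing)
    for url in new:
        if url not in seen:
            seen.add(url)
            merged.append(url)
    return merged


# Stand-in for the contents of a previously saved JSON file (assumed to be a flat list).
stored = ["https://example.com/a"]
stored = merge_links(stored, extract_links(SAMPLE_RSS))
print(json.dumps(stored, indent=2))
```

Keeping the merge order-preserving means a daily run only ever appends to the stored list, which keeps diffs between runs small.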

As of March 2025, we have about 700k unique URLs.

Other Scripts + Data

- The script for aggregating the URLs and the March 2025 dump of URLs (.zip).
- The script for downloading the article text, parsing some features with newspaper3k (e.g., publication date and authors), and putting the results in a DB is here. The script checks the local DB before incrementally processing new data.
  - The June 2023 full-text dump is here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ZNAKK6
  - The March 2025 dump (minus the exceptions listed below) is in the same place.
- Newspaper3k can't parse USA Today (USAT), Politico, and ABC URLs. I use a custom Google search to dig up the URLs and get the data. The script is here.
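
The "check the local DB before incrementally processing new data" step mentioned above could look roughly like the following. This is a minimal sketch using only the standard-library `sqlite3` module, not the repository's script: the `stories` table, its columns, and `fetch_article` (a stub standing in for the download-and-parse step) are all assumptions made for illustration.

```python
import sqlite3


def fetch_article(url: str) -> dict:
    """Hypothetical stand-in for downloading and parsing one article."""
    return {"url": url, "title": f"Title for {url}", "text": "..."}


def process_new(conn: sqlite3.Connection, urls: list[str]) -> int:
    """Process only URLs that are not already stored; return how many were added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stories (url TEXT PRIMARY KEY, title TEXT, text TEXT)"
    )
    # Load the URLs that were handled in earlier runs.
    done = {row[0] for row in conn.execute("SELECT url FROM stories")}
    added = 0
    for url in urls:
        if url in done:
            continue  # already in the DB, skip the expensive fetch
        article = fetch_article(url)
        conn.execute(
            "INSERT INTO stories (url, title, text) VALUES (?, ?, ?)",
            (article["url"], article["title"], article["text"]),
        )
        done.add(url)
        added += 1
    conn.commit()
    return added


conn = sqlite3.connect(":memory:")  # an on-disk path would persist between runs
first = process_new(conn, ["https://example.com/a", "https://example.com/b"])
second = process_new(conn, ["https://example.com/b", "https://example.com/c"])
print(first, second)
```

The second call skips the URL that the first call already stored, so rerunning the job over an overlapping URL list only fetches what is new.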

Get Started With Exploring the Data

To explore the DB, some code (Jupyter NB) ...

```python
from sqlite_utils import Database
from itertools import islice

db = Database("../cbs.db")
print("Tables:", db.table_names())
```

Tables: ['../cbs_stories']

Table Schema

```python
table_name = "../cbs_stories"  # the table name includes the "../" prefix
schema = db[table_name].schema
print("Schema:\n")
print(schema)
```

```
Schema:

CREATE TABLE ../cbs_stories
```

```python
db_file = "../cbs.db"
table_name = "../cbs_stories"  # yup! it has the ../

db = Database(db_file)

# Print a short preview of the first five stories
for row in islice(db[table_name].rows, 5):
    print(f"URL: {row['url']}")
    print(f"Title: {row['title']}")
    print(f"Date: {row['publish_date']}")
    print(f"Text preview: {row['text'][:100]}...\n")
```
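
Because the table name contains `../`, it has to be quoted as an identifier if you query the file with raw SQL instead of sqlite_utils. The sketch below shows one way to do that with the standard-library `sqlite3` module. It builds a tiny in-memory table so it runs on its own; the two columns and the `quote_identifier` helper are assumptions for illustration, not the dataset's actual schema.

```python
import sqlite3


def quote_identifier(name: str) -> str:
    """Wrap a table name in double quotes, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


table_name = "../cbs_stories"
quoted = quote_identifier(table_name)

# In-memory stand-in for ../cbs.db, with a made-up two-column schema.
conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE {quoted} (url TEXT, title TEXT)")
conn.execute(
    f"INSERT INTO {quoted} VALUES (?, ?)",
    ("https://example.com/a", "Story A"),
)

# Values go through ? placeholders; only the identifier is interpolated.
rows = conn.execute(f"SELECT url, title FROM {quoted} LIMIT 5").fetchall()
for url, title in rows:
    print(url, title)
```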

Exporting to Pandas

```python
import pandas as pd

# Option 1: Convert all data to a DataFrame
df = pd.DataFrame(list(db[table_name].rows))

# Option 2: If the table is very large, you might want to limit rows
df = pd.DataFrame(list(islice(db[table_name].rows, 1000)))  # first 1000 rows

# Print info about the DataFrame
print(f"DataFrame shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(df.head())
```

🔗 Adjacent Repositories

notnews/good_nyt — Patterns in NYT production from 1987 to 2007
notnews/foxnewstranscripts — Fox News Transcripts 2003--2025
notnews/uknotnews — Not News: Provision of Apolitical News in the British News Media
notnews/nbc_transcripts — NBC transcripts 2011--2014
notnews/hard_news — The Softening of Network Television News

Owner

Name: Not News
Login: notnews
Kind: organization

Website: http://notnews.github.io
Repositories: 15
Profile: https://github.com/notnews

News about news

Citation (Citation.cff)

cff-version: 1.2.0
message: "If you use this dataset, please cite it as below."
authors:
  - family-names: "Sood"
    given-names: "Gaurav"
  - family-names: "Willis"
    given-names: "Derek"
title: "Top News! URLs from News Feeds of Major National News Sites (2022--)"
version: 1.0.0
date-released: 2022-01-01
url: "https://github.com/notnews/top_news"
repository-code: "https://github.com/notnews/top_news"
abstract: >
  A comprehensive collection of news URLs from major national news sources including 
  ABC, CBS, CNN, LA Times, NBC, NPR, NYT, Politico, ProPublica, USA Today, and 
  WaPo. 
keywords:
  - news
  - dataset
  - media
  - journalism
  - content analysis
license: "CC-BY-4.0"
doi: "" # Add a DOI if you have one
references:
  - authors:
      - family-names: "Sood"
        given-names: "Gaurav"
      - family-names: "Willis"
        given-names: "Derek"
    title: "Full-text dump of news articles (March 2025)"
    type: dataset
    year: 2025
    url: "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ZNAKK6"
