top_news

Collecting URLs Daily From News Feeds of Major National News Sites 2022--

https://github.com/notnews/top_news

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.3%) to scientific vocabulary

Keywords

abc cbs la-times nbc news newspaper3k npr nyt politico propublica rss-feed usa-today wapo

Repository

Basic Info
  • Host: GitHub
  • Owner: notnews
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 1.42 GB
Statistics
  • Stars: 12
  • Watchers: 3
  • Forks: 3
  • Open Issues: 0
  • Releases: 0
Created over 3 years ago · Last pushed 6 months ago
Metadata Files
Readme Citation

README.md

Top News! URLs from News Feeds of Major National News Sites (2022-)

We automatically pull daily news data from major national news sites (ABC, CBS, CNN, LA Times, NBC, NPR, NYT, Politico, ProPublica, USA Today, and WaPo) using GitHub workflows. For the latest version, please take a look at the respective JSON files.

As of March 2025, we have about 700k unique URLs.
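
Counting unique URLs across sources amounts to deduplicating the pooled feed entries. A minimal sketch of that idea, assuming each feed has already been read into a list of URL strings. The `normalize` rule (dropping the fragment and a trailing slash) and the sample URLs are hypothetical and not taken from the repository's scripts:

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Hypothetical rule: lowercase scheme and host, drop fragment and trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def unique_urls(feeds):
    """Pool URLs from all sources and keep one copy of each normalized URL."""
    seen = set()
    for urls in feeds.values():
        for url in urls:
            seen.add(normalize(url))
    return seen


# Made-up sample data, for illustration only.
feeds = {
    "npr": ["https://www.npr.org/a/", "https://www.npr.org/a#top"],
    "cbs": ["https://www.cbsnews.com/b"],
}
print(len(unique_urls(feeds)))
```

How aggressively to normalize is a judgment call: stripping query strings would merge more near-duplicates but could also collapse distinct articles on sites that put the article ID in the query.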

Other Scripts + Data

  1. The script for aggregating the URLs and March-2025 dump of URLs (.zip)

  2. The script for downloading the article text, parsing features such as publication date and authors with newspaper3k, and storing the results in a DB is here. The script checks the local DB before incrementally processing new data.

    • The June 2023 full-text dump is here: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ZNAKK6
    • The March 2025 dump (minus the exceptions listed below) is in the same place.
  3. Newspaper3k can't parse USA Today, Politico, and ABC URLs. I use a custom Google search to dig up the URLs and get the data. The script is here.
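
The incremental behavior described in item 2 can be sketched with the standard-library `sqlite3` module. This is an illustration under stated assumptions, not the repository's script: the table name `stories`, its columns, and the `fetch_article` stub are hypothetical, and a real run would call newspaper3k where the stub is.

```python
import sqlite3


def fetch_article(url):
    """Hypothetical stand-in for downloading and parsing an article."""
    return {"url": url, "title": f"Title for {url}", "text": "..."}


def process_new_urls(conn, urls):
    """Insert only URLs not already in the local DB; return how many were added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stories (url TEXT PRIMARY KEY, title TEXT, text TEXT)"
    )
    # Load the URLs that were processed in earlier runs.
    existing = {row[0] for row in conn.execute("SELECT url FROM stories")}
    # dict.fromkeys drops duplicates within this batch while keeping order.
    new_urls = [u for u in dict.fromkeys(urls) if u not in existing]
    for url in new_urls:
        article = fetch_article(url)
        conn.execute(
            "INSERT INTO stories (url, title, text) VALUES (:url, :title, :text)",
            article,
        )
    conn.commit()
    return len(new_urls)


conn = sqlite3.connect(":memory:")  # in-memory DB for the example
first = process_new_urls(conn, ["https://example.com/a", "https://example.com/b"])
second = process_new_urls(conn, ["https://example.com/b", "https://example.com/c"])
print(first, second)
```

Loading the existing URLs into a set keeps the check cheap for a few hundred thousand rows. For much larger tables, a per-URL lookup against the primary key would avoid holding every URL in memory.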

Get Started With Exploring the Data

To explore the DB, some code (Jupyter NB) ...

```python
from itertools import islice

from sqlite_utils import Database

db = Database("../cbs.db")
print("Tables:", db.table_names())
```

Tables: ['../cbs_stories']

Table Schema

```python
table_name = "../cbs_stories"  # the table name printed above
schema = db[table_name].schema
print("Schema:\n")
print(schema)
```

```
Schema:

CREATE TABLE ../cbs_stories
```

```python
db_file = "../cbs.db"
table_name = "../cbs_stories"  # yup! it has the ../

db = Database(db_file)

# Preview the first five rows
for row in islice(db[table_name].rows, 5):
    print(f"URL: {row['url']}")
    print(f"Title: {row['title']}")
    print(f"Date: {row['publish_date']}")
    print(f"Text preview: {(row['text'] or '')[:100]}...\n")
```

Exporting to Pandas

```python
import pandas as pd

# Option 1: Convert all data to a DataFrame
df = pd.DataFrame(list(db[table_name].rows))

# Option 2: If the table is very large, you might want to limit rows
df = pd.DataFrame(list(islice(db[table_name].rows, 1000)))  # first 1000 rows

# Print info about the DataFrame
print(f"DataFrame shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(df.head())
```
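
Because the table name contains `../`, it has to be quoted as an identifier in raw SQL. A small sketch using the standard-library `sqlite3` module against an in-memory stand-in for the real DB. The two sample rows are made up, and only the `url` and `title` columns are assumed:

```python
import sqlite3

table_name = "../cbs_stories"
# Double any embedded double quote, then wrap the name in double quotes.
quoted = '"' + table_name.replace('"', '""') + '"'

# In-memory stand-in; point sqlite3.connect at ../cbs.db for the real file.
conn = sqlite3.connect(":memory:")
conn.execute(f"CREATE TABLE {quoted} (url TEXT, title TEXT)")
conn.executemany(
    f"INSERT INTO {quoted} (url, title) VALUES (?, ?)",
    [("https://example.com/a", "A"), ("https://example.com/b", "B")],
)

row_count = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
print("Rows:", row_count)
```

Values still go through `?` placeholders; only the identifier is interpolated, since SQLite does not accept placeholders for table names.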

Owner

  • Name: Not News
  • Login: notnews
  • Kind: organization

News about news

Citation (Citation.cff)

cff-version: 1.2.0
message: "If you use this dataset, please cite it as below."
authors:
  - family-names: "Sood"
    given-names: "Gaurav"
  - family-names: "Willis"
    given-names: "Derek"
title: "Top News! URLs from News Feeds of Major National News Sites (2022--)"
version: 1.0.0
date-released: 2022-01-01
url: "https://github.com/notnews/top_news"
repository-code: "https://github.com/notnews/top_news"
abstract: >
  A comprehensive collection of news URLs from major national news sources including 
  ABC, CBS, CNN, LA Times, NBC, NPR, NYT, Politico, ProPublica, USA Today, and 
  WaPo. 
keywords:
  - news
  - dataset
  - media
  - journalism
  - content analysis
license: "CC-BY-4.0"
doi: "" # Add a DOI if you have one
references:
  - authors:
      - family-names: "Sood"
        given-names: "Gaurav"
      - family-names: "Willis"
        given-names: "Derek"
    title: "Full-text dump of news articles (March 2025)"
    type: dataset
    year: 2025
    url: "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ZNAKK6"