archive_news_cc
Closed Caption Transcripts of News Videos from archive.org 2014--2023
Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (9.9%) to scientific vocabulary
Repository
Closed Caption Transcripts of News Videos from archive.org 2014--2023
Statistics
- Stars: 47
- Watchers: 5
- Forks: 4
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Closed Captions of News Videos from Archive.org
The repository provides scripts for downloading the data and links to the datasets that were built using the scripts:
Downloading the Data from Archive.org
Download closed caption transcripts of nearly 1.3M news shows from http://archive.org.
There are three steps to downloading the transcripts:
1. We start by searching https://archive.org/advancedsearch.php with the query `collection:"tvarchive"`. This gets us a unique identifier for each of the news shows. An identifier is a simple string that combines channel name, show name, date, and time. The current final list of identifiers (2009--Nov. 2017) is posted here.
2. Next, we use the identifier to build the URL where the metadata file and the HTML file with the closed captions are posted. The general base URL is http://archive.org/download followed by the identifier.
3. The third script parses the downloaded metadata and HTML closed-caption files and creates a CSV with the captions along with the metadata.

For instance, for the identifier `CSPAN_20090604_230000` we go to http://archive.org/download/CSPAN_20090604_230000. From http://archive.org/download/CSPAN_20090604_230000/CSPAN_20090604_230000_meta.xml we read the link http://archive.org/details/CSPAN_20090604_230000, from which we get the text of the HTML file. We also store the metadata from the meta XML file.
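To make the first two steps concrete, here is a minimal sketch (not one of the repository's scripts) that queries the archive.org advanced search API for identifiers and builds the corresponding metadata URLs. The query parameters follow archive.org's public JSON search API; the page size of 100 and the limit of five printed URLs are arbitrary choices.

```python
import requests

# Step 1: query the archive.org advanced search API for show identifiers.
search_url = "https://archive.org/advancedsearch.php"
params = {
    "q": 'collection:"tvarchive"',
    "fl[]": "identifier",  # return only the identifier field
    "rows": 100,           # page size; loop over `page` to fetch more
    "page": 1,
    "output": "json",
}
resp = requests.get(search_url, params=params, timeout=30)
resp.raise_for_status()
identifiers = [doc["identifier"] for doc in resp.json()["response"]["docs"]]

# Step 2: build the URL of each show's metadata file from its identifier.
for identifier in identifiers[:5]:
    print(f"http://archive.org/download/{identifier}/{identifier}_meta.xml")
```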
Scripts
Get Show Identifiers
Download Metadata and HTML Files
- Downloads the metadata and HTML files.
- Saves the metadata and HTML files to two separate folders, specified with `--meta` and `--html` respectively. The default folder names are `meta` and `html`.
Parse Metadata and HTML Files
- Parses the metadata and HTML files and saves the results to a CSV.
- Produces a CSV. Here's an example.
Running the Scripts
Get all TV Archive identifiers from archive.org:

```
python get_news_identifiers.py -o ../data/search.csv
```

Download metadata and HTML files for all the shows in the sample input file:

```
python scrape_archive_org.py ../data/search-test.csv
```

This will create two directories, `meta` and `html`, by default in the same folder as the script. We have included the first 25 metadata and the first 25 HTML files. You can change the folder for `meta` by using the `--meta` flag. To change the directory for `html`, use the `--html` flag and specify the new directory. For instance:

```
python scrape_archive_org.py --meta meta-foxnews --html html-foxnews ../data/search-test.csv
```

Use the `-c`/`--compress` option to store and parse the downloaded files in compressed (gzip) format.

Parse and extract meta fields and text from the sample metadata and HTML files:

```
python parse_archive.py ../data/search-test.csv
```
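For a sense of what the parse step involves, here is a minimal sketch, not the repository's parse_archive.py, that pulls a few fields from a downloaded meta XML file and the caption text from the matching HTML file. The tag names (`identifier`, `title`, `date`) are assumptions based on archive.org's `*_meta.xml` layout, and the file paths are illustrative.

```python
from bs4 import BeautifulSoup

def parse_show(meta_path, html_path):
    """Extract a few metadata fields and the caption text for one show."""
    row = {}
    with open(meta_path, encoding="utf-8") as f:
        meta = BeautifulSoup(f.read(), "html.parser")
    # Tag names are assumptions based on archive.org's *_meta.xml files.
    for field in ("identifier", "title", "date"):
        tag = meta.find(field)
        row[field] = tag.get_text(strip=True) if tag else ""
    with open(html_path, encoding="utf-8") as f:
        page = BeautifulSoup(f.read(), "html.parser")
    # The closed captions are the visible text of the downloaded HTML page.
    row["text"] = page.get_text(separator=" ", strip=True)
    return row

# Illustrative paths, matching the default `meta` and `html` folders.
print(parse_show("meta/CSPAN_20090604_230000_meta.xml",
                 "html/CSPAN_20090604_230000.html"))
```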
Data
The data are hosted on Harvard Dataverse.

Dataset Summary:

500k Dataset from 2014:
- CSV: `archive-cc-2014.csv.xza*` (2.7 GB, split into 2GB files)
- HTML: `html-2014.7za*` (10.4 GB, split into 2GB files)

860k Dataset from 2017:
- CSV: `archive-cc-2017.csv.gza*` (10.6 GB, split into 2GB files)
- HTML: `html-2017.tar.gza*` (20.2 GB, split into 2GB files)
- Meta: `meta-2017.tar.gza*` (2.6 GB, split into 2GB files)

917k Dataset from 2022:
- CSV: `archive-cc-2022.csv.gza*` (12.6 GB, split into 2GB files)
- HTML: `html-2022.tar.gza*` (41.1 GB, split into 2GB files)
- Meta: `meta-2022.tar.gz` (2.1 GB)

179k Dataset from 2023:
- CSV: `archive-cc-2023.csv.gz` (1.7 GB)
- HTML: `html-2023.tar.gza*` (7.3 GB, split into 2GB files)
- Meta: `meta-2023.tar.gz` (317 MB)
Please note that the file sizes and splitting information mentioned above are approximate.
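Since several of the archives are split into 2GB pieces (the trailing `a*` glob), the pieces need to be concatenated before decompressing. Assuming they were produced with the standard `split` utility after compression, something like `cat archive-cc-2014.csv.xza* | xz -d > archive-cc-2014.csv` should reconstruct the CSV.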
License
We are releasing the scripts under the MIT License.
Suggested Citation
Please credit Internet Archive for the data.
If you want to refer to this particular corpus so that the research is reproducible, you can cite it as:

Laohaprapanon, Suriyan, and Gaurav Sood. 2017. "archive.org TV News Closed Caption Corpus." https://github.com/notnews/archive_news_cc/
🔗 Adjacent Repositories
- notnews/lacctocsv — Los Angeles Closed-Caption Television News Archive Data to CSV
- notnews/foxnewstranscripts — Fox News Transcripts 2003--2025
- notnews/cnn_transcripts — CNN Transcripts 2000--2025
- notnews/msnbc_transcripts — MSNBC Transcripts: 2003--2022
- notnews/nbc_transcripts — NBC transcripts 2011--2014
Owner
- Name: Not News
- Login: notnews
- Kind: organization
- Website: http://notnews.github.io
- Repositories: 15
- Profile: https://github.com/notnews
News about news
Citation (citation.cff)
cff-version: 1.2.0
message: "If you use this dataset, please cite it as below."
title: "archive.org TV News Closed Caption Corpus"
authors:
- family-names: "Laohaprapanon"
given-names: "Suriyan"
- family-names: "Sood"
given-names: "Gaurav"
date-released: 2023
url: "https://github.com/notnews/archive_news_cc/"
repository-code: "https://github.com/notnews/archive_news_cc/"
type: dataset
GitHub Events
Total
- Watch event: 1
- Push event: 4
- Fork event: 1
Last Year
- Watch event: 1
- Push event: 4
- Fork event: 1
Dependencies
- bs4 *
- pandas *
- requests *
- actions/checkout v4 composite
- gojiplus/adjacent v1.3 composite