scribesalad

A collection of YouTube video transcripts: podcasts (Joe Rogan Experience, Tim Ferriss, Jocko Podcast, ...), lectures (YaleCourses, MIT lectures, ...). A big transcript salad spanning history, geography, science, politics, filmmaking and more.

https://github.com/wa3dbk/scribesalad

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (3.2%) to scientific vocabulary

Keywords

artificial-intelligence history joe-rogan-experience jordan-b-peterson multilingual politics science social-issues transcripts video yalecourses youtube-video youtube-videos-transcripts
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: wa3dbk
  • License: gpl-3.0
  • Default Branch: master
  • Homepage:
  • Size: 20 GB
Statistics
  • Stars: 80
  • Watchers: 9
  • Forks: 19
  • Open Issues: 3
  • Releases: 0
Created about 7 years ago · Last pushed 10 months ago
Metadata Files
Readme License Citation

README.md

ScribeSalad

Without searchable transcripts, many interesting YouTube videos, podcasts, lectures and talks are hard to explore, quote and summarize. ScribeSalad is a multilingual open data project gathering over 940k YouTube video transcripts that discuss social and political issues, psychology, history and scientific topics ranging from biology and mathematics to artificial intelligence: TEDx, Yale courses, MIT lectures, National Geographic, The Joe Rogan Experience, Big Think, IQ squared, Jordan B. Peterson talks, Tim Ferriss, Jocko Podcast and more.

Available transcripts (in English)

Other languages

Arabic (ar), French (fr), German (de), Spanish (es), Russian (ru), Turkish (tr), Portuguese (pt), Italian (it), Japanese (ja), Korean (ko)

Transcription quality

Some of the transcriptions originate from YouTube (subtitles uploaded by the video's owner), while the rest are generated automatically using a high-accuracy large-vocabulary continuous speech recognition system (about 90% accuracy in clean conditions: no background noise, no heavy accents and good-quality audio).

Filenames and formats

The transcripts are identified by their corresponding YouTube video IDs, and each one is available in three formats: text, vtt (WebVTT, the Web Video Text Tracks Format) and srt (SubRip Subtitle Format).

To open the original video, replace "ID" in https://www.youtube.com/watch?v=ID with the transcript filename.
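
The filename-to-URL rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the repository: the function name `video_url`, the extension table, and the sample ID are hypothetical, and the exact file extensions used in the repository are an assumption (the README only names the three formats).

```python
from pathlib import Path

# Assumed extensions for the three formats named in the README (text, vtt, srt).
# The repository may use different ones; adjust as needed.
KNOWN_EXTENSIONS = {".txt", ".vtt", ".srt"}


def video_url(transcript_path: str) -> str:
    """Build the YouTube watch URL for a transcript file named after its video ID."""
    path = Path(transcript_path)
    # Drop a known transcript extension; otherwise treat the whole name as the ID.
    if path.suffix.lower() in KNOWN_EXTENSIONS:
        video_id = path.stem
    else:
        video_id = path.name
    return f"https://www.youtube.com/watch?v={video_id}"


# Hypothetical ID, used only for illustration.
print(video_url("transcripts/abc123XYZ_-.srt"))
```

YouTube video IDs do not contain dots, so stripping a single trailing extension is enough to recover the ID under these assumptions.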

Terms of use

This is an open data project: feel free to fork this repository and to download, share and use any of the transcripts.

TODO

  • Cleaning up transcripts: removing fillers (um, ah, etc.) and repetitions.
  • Re-aligning transcripts: re-aligning transcripts and fixing overlapping timecodes.
  • Topic modeling: automatically discovering the abstract "topics" that occur in each transcript.
  • Speaker identification: who spoke when, and for how long?
  • Creating a search engine: exploring subjects by speaker, topic, channel, etc.
  • Multilingual transcripts: translating all transcripts to other languages.
  • More channels & more videos.
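
The first two TODO items could be approached along the following lines. This is only a sketch under stated assumptions, not the project's planned implementation: the filler list, the `Cue` type, and the function names `clean_text` and `fix_overlaps` are all hypothetical, and cues are assumed to have already been parsed from a vtt or srt file into start and end times in seconds.

```python
import re
from dataclasses import dataclass

# Hypothetical English filler list; a real one would be tuned per language.
FILLERS = re.compile(r"\b(?:um+|uh+|ah+|hmm+)\b[,.]?\s*", re.IGNORECASE)
# A word immediately repeated one or more times, e.g. "the the".
# Note: this also collapses legitimate repetitions such as "that that".
REPEATS = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)


def clean_text(text: str) -> str:
    """Remove filler words and collapse immediate word repetitions."""
    text = FILLERS.sub("", text)
    text = REPEATS.sub(r"\1", text)
    return re.sub(r"\s{2,}", " ", text).strip()


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str


def fix_overlaps(cues: list[Cue]) -> list[Cue]:
    """Clamp each cue's end time so it never runs past the next cue's start."""
    ordered = sorted(cues, key=lambda c: c.start)
    fixed = []
    for i, cue in enumerate(ordered):
        end = cue.end
        if i + 1 < len(ordered):
            end = min(end, ordered[i + 1].start)
        fixed.append(Cue(cue.start, end, cue.text))
    return fixed


print(clean_text("um I I think uh the the idea is good"))
print(fix_overlaps([Cue(0.0, 2.5, "first"), Cue(2.0, 4.0, "second")]))
```

Clamping the end time is the simplest way to remove an overlap; it leaves the text of each cue unchanged and does not attempt true re-alignment against the audio.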

Owner

  • Name: Waad Ben Kheder
  • Login: wa3dbk
  • Kind: user
  • Location: Paris, France

PhD in computer science, R&D engineer at Vocapia Research

GitHub Events

Total
  • Watch event: 5
  • Push event: 75
  • Fork event: 1
Last Year
  • Watch event: 5
  • Push event: 75
  • Fork event: 1

Committers

Last synced: 11 months ago

All Time
  • Total Commits: 3,285
  • Total Committers: 1
  • Avg Commits per committer: 3,285.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 366
  • Committers: 1
  • Avg Commits per committer: 366.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
wa3dbk w****k@g****m 3,285

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 1
  • Total pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • richieM (1)
Pull Request Authors
  • 404-html (2)