scribesalad
A collection of YouTube video transcripts : podcasts (Joe Rogan Experience, Tim Ferris, Jocko Podcast, ...) and lectures (YaleCourses, MIT lectures, ...). A big transcript salad spanning history, geography, science, politics, filmmaking and more.
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (3.2%) to scientific vocabulary
Repository
Statistics
- Stars: 80
- Watchers: 9
- Forks: 19
- Open Issues: 3
- Releases: 0
Metadata Files
README.md
ScribeSalad
Without searchable transcripts, many interesting YouTube videos, podcasts, lectures and talks are hard to explore, quote and summarize. ScribeSalad is a multilingual open data project gathering over 940k YouTube video transcripts that discuss social and political issues, psychology, history and scientific topics ranging from biology and mathematics to artificial intelligence : TEDx, Yale courses, MIT lectures, National Geographic, The Joe Rogan Experience, Big Think, IQ squared, Jordan B. Peterson talks, Tim Ferris, Jocko Podcast and more.
Available transcripts (in English)
A-C : 13OClock Podcast, 2Comics2Cigars, 3Blue1Brown, 8News Now Las Vegas, ABC Science, AI Coffee Break With Letitia, AI lectures & talks, AJ+, Aba & Preach, Aba VERZUZ Preach, Abbey Sharp, Adam Something, Aegon Targaryen, Agadmators Chess Channel, Agt Jake, Akaash Singh Comedy, Al Jazeera English, Alanah Pearce, Alex Hormozi, Alexander Amini, Ali Dawah, All Things Comedy, AltShiftX, Amanda Cerny, Amazon Science, American Enterprise Institute, Andrew Couch, Andrew Huberman, Android Developers, Ani Core, Animated Biology With Arpan, Anime Culture Corner, Anime Uproar, Anna Aakana, Answer the internet, Anthony Pompliano, Apple TV, Are You Garbage, Ari Seff, Arlun Grim, Artof The Problem, Arts At MIT, Arxiv Insights, Asap SCIENCE, Atlas Obscura, Audio Science Review, BAFTA Guru, BBA Science, BBCIdeas, Babel The Language Magazine, Babergh Mid Suffolk District Councils, Bandai Namco Esports, Bandit Games, Bart and Geo, Ben Shapiro, Berlin Science Week, Better Ideas, Better Left Unsaid, Beverly Biology, Big Mood, Big Think, Bill Burr, Bill Burr, Biographics, Bite-sized Philosophy, BrightInsight, Chris D'Elia, Coffee Break, Coffee Break, Coffeezilla, Comics explained, Conan OBrien Needs A Friend, Cracked, CrashCourse
D-I : Dan Carlin, Dose Of Truth, Fire of learning, Future of Life Institute, H3 podcast, Harvard_University, History Hyenas clips, Hugo Larochelle, IQ squared
J : Jocko Podcast, Joe Rogan Clips, Joe Rogan Experience, Joe Rogan MMA Show, Joma Tech, Jordan B. Peterson, Jordan Peterson Fan Clips, Jordan Peterson clips, Jubilee
K-M : Kurzgesagt, Lang Focus, Lex Fridman, Mark Normand, MIT courses, More Chris D'Elia, Motivation Madness
N-R : National Geographic, NativLang, Nerd writer, Nobel minds, No Presh Network, NowYouSeeIt, Pop Culture Detective, RT Documentaries, Rubin Report, Russell Brand
S-V : Skavlan, Siraj Raval, Storytellers, TED, The Linguistics Channel, The Monday Morning Podcast, Theo Von, Theo Von Clips, TheSchoolOfLife, ThinkBigAnimation, TigerBellyClips, Tim Ferris, TFATK, TwoCents, Visual politik
W-Y : Wendover Productions, WhatIf, WhitneyCummings, Wired, Wisecrack, Wolfram, YCombinator, Yale Courses, YangSpeaks, YannLeCun, YannicKilcher, Yeagerists, Your Mom's House Podcast
Other languages
Arabic (ar), French (fr), German (de), Spanish (es), Russian (ru), Turkish (tr), Portuguese (pt), Italian (it), Japanese (ja), Korean (ko)
Transcription quality
Some of the transcriptions originate from YouTube (subtitles uploaded by the video's owner), while the rest are generated automatically using a high-accuracy large-vocabulary continuous speech recognition system (~90% accuracy in clean conditions : no background noise, no heavy accents and good-quality audio).
Filenames and formats
The transcripts are identified by the IDs of the corresponding YouTube videos, and each one is available in three formats : text, vtt (WebVTT, the Web Video Text Tracks Format) and srt (SubRip Subtitle Format).
To open the original video, replace "ID" in https://www.youtube.com/watch?v=ID with the video ID taken from the transcript filename.
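The filename-to-URL rule above can be sketched in a few lines of Python. This is only an illustration: the helper name video_url_from_transcript is hypothetical, and it assumes that each transcript file is named with the video ID followed by a format extension (for example ID.srt). The exact extensions used in the repository are an assumption, not something this README states.

```python
from pathlib import PurePosixPath

# URL pattern quoted in this README; "ID" is replaced by the video ID.
YOUTUBE_WATCH_URL = "https://www.youtube.com/watch?v={video_id}"


def video_url_from_transcript(filename: str) -> str:
    """Build the YouTube URL for a transcript file (hypothetical helper).

    Assumes the file is named "<video ID>.<extension>". Everything before
    the first dot is treated as the ID, which is safe because YouTube
    video IDs do not contain dots.
    """
    name = PurePosixPath(filename).name   # drop any directory part
    video_id = name.split(".", 1)[0]      # drop the extension(s)
    if not video_id:
        raise ValueError(f"no video ID found in {filename!r}")
    return YOUTUBE_WATCH_URL.format(video_id=video_id)


if __name__ == "__main__":
    # "dQw4w9WgXcQ" is a placeholder ID, not a file known to be in the repository.
    print(video_url_from_transcript("dQw4w9WgXcQ.srt"))
    # prints: https://www.youtube.com/watch?v=dQw4w9WgXcQ
```

Splitting on the first dot rather than stripping a single suffix keeps the helper working if a file ever carries a compound extension such as a language tag before the format.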
Terms of use
This is an open data project. Feel free to fork this repository and to download, share and use any of the transcripts.
TODO
- Cleaning up transcripts : removing fillers (hum, ah, etc.) and repetitions.
- Re-aligning transcripts : fixing the alignment and any overlapping timecodes.
- Topic modeling : automatically discovering the abstract "topics" that occur in each transcript.
- Speaker identification : who spoke when, and for how long?
- Creating a search engine : exploring subjects by speaker, topic, channel, etc.
- Multilingual transcripts : translating all transcripts to other languages.
- More channels & more videos.
Owner
- Name: Waad Ben Kheder
- Login: wa3dbk
- Kind: user
- Location: Paris, France
- Twitter: wa3dbk
- Repositories: 3
- Profile: https://github.com/wa3dbk
PhD in computer science, R&D engineer at Vocapia Research
GitHub Events
Total
- Watch event: 5
- Push event: 75
- Fork event: 1
Last Year
- Watch event: 5
- Push event: 75
- Fork event: 1
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 1
- Total pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 1.0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- richieM (1)
Pull Request Authors
- 404-html (2)