https://github.com/coderefinery/video-processing
Processed videos from CodeRefinery (and the workspace while creating them)
Statistics
- Stars: 4
- Watchers: 5
- Forks: 2
- Open Issues: 5
- Releases: 0
Metadata Files
README.md
Video processing files
This repo has the files used to do our video processing. It uses git-annex for the big files; everything else is committed to git. It provides non-YouTube public access to our videos, and is also our working place for releasing videos (so a lot of the instructions below are for those who help process them).
We also made a description of git-annex for data management, targeted to scientists and researchers, if you want to know what's going on behind the scenes.
What is available here?
Browse the repo - course links are below. More can be added later depending on demand.
Getting public copies of videos from git-annex
Raw video data is stored using git-annex and synced around different places (our HPC cluster, the computers that process the videos, the object store Allas provided by CSC). Allas allows you to download the videos you might like:
```console
$ git clone https://github.com/coderefinery/video-processing/
$ cd video-processing
$ git annex get python-for-scicomp-2023/out/day1.1-icebreaker.mkv
get python-for-scicomp-2023/out/day1.1-icebreaker.mkv (from allas...)
```
Only processed videos are available to the general public (the raw private ones are recorded with git-annex in this repo, but not available for download). Also, this is a test setup and everything may be subject to change or deprecation.
(How was this set up? Get the environment variables needed for the git-annex S3 special remote; I did this by running allas_conf on one of the CSC computers. Then run `git annex initremote allas type=S3 encryption=none chunk=50MiB embedcreds=no host=a3s.fi protocol=https bucket=aaltoscicomp-video publicurl=https://aaltoscicomp-video.a3s.fi/ fileprefix=1- public=yes autoenable=true`. It caches the authentication locally on that computer only; it doesn't spread anywhere else.)
How it works
This repository stores the stuff used to process videos for CodeRefinery / Aalto Scientific Computing / etc(?). Here's how it works in general:
- On the streaming computer, videos are placed into `COURSE/raw/*.mkv`.
- git-annex is used to sync everything around.
- Raw videos are copied to the cluster.
- Whisper is used to generate a subtitle file (`.srt`).
- ffmpeg-editlist is used to cut videos into segments and generate the YouTube descriptions.
- The subtitles can be fixed up at the same time as the video is being sliced.
- ffmpeg-editlist runs and splits the video into pieces. This also splits the raw subtitle `.srt` files into subtitles for each sub-part. This allows us to parallelize the subtitle fixing and the video slicing.
- ffmpeg-editlist also generates the descriptions for YouTube.
- A Makefile automates subtitle generation.
- `git annex sync --content` moves all content around as desired, making sure that the cluster has a full copy and other remotes have only what they have requested.
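The Makefile itself is not reproduced here, but the idea of generating subtitles only where they are missing can be sketched in shell. This is a minimal sketch under assumptions: the function name `list_missing_srt` is hypothetical, and it assumes the `COURSE/raw/` layout described above, with each subtitle file sharing the base name of its video. It only prints the videos that would still need a Whisper run; it does not run Whisper.

```shell
# Hypothetical helper (not part of this repository): print every raw video
# in COURSE/raw/ that has no subtitle file with the same base name yet.
list_missing_srt() {
    local course_dir="$1"
    local video
    for video in "$course_dir"/raw/*.mkv; do
        # Skip the literal pattern when the glob matches nothing.
        # The -L test keeps annexed videos that are only broken symlinks here.
        [ -e "$video" ] || [ -L "$video" ] || continue
        if [ ! -e "${video%.mkv}.srt" ]; then
            printf '%s\n' "$video"
        fi
    done
}
```

For example, `list_missing_srt python-for-scicomp-2023` would list the raw videos in that course directory that do not have a matching `.srt` yet.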
Subtitle editing
If you are helping with subtitle editing:
- Find the `COURSE/raw/*.srt` subtitle file and edit it as follows:
- I don't watch all the video, but (very quickly) browse the text. Think 5 minutes (or less) of skimming per hour of video, if there are no changes. Only focus on the important parts that can affect understanding, not on making it a perfect presentation-quality transcript. (I don't watch the video, I assume the transcript is correct except when it's clearly written wrong.)
- Remove all names, replace with `[name]`. Find and replace can be useful here, but note there may be misspellings too, so you may have to try several times as you notice other spellings.
- Fix up any command names (for example, `dash dash argument` becomes `--argument`), capitalization, and other things that affect understanding.
- If you can't understand what someone is trying to say, replace with `[???]` or similar.
- But it doesn't have to be perfect. Getting it done fast is the most important thing. "Normal" speech doesn't have to be made perfect, but do what makes sense (what is worth your time).
- Various subtitle editor programs can make this easier, but it's also just a text file. I've used `subtitleeditor` on Linux, which can play back the video right at each subtitle if you need to hear the original.
- If you notice something very wrong (Whisper has broken, it's not adding punctuation, etc.), then don't try to fix it up, just leave it and make it minimally usable.
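The find-and-replace steps above can be partly scripted. The following is a rough sketch, not a tool from this repository: `clean_srt` is a hypothetical name, and it assumes the names contain no characters that are special to `sed`. Matching is plain substring matching, so a short name can also hit the inside of a longer word; review the result before committing.

```shell
# Hypothetical helper: read a subtitle file on stdin, replace each name
# given as an argument with [name], and turn a spoken "dash dash word"
# into "--word". The result is written to stdout.
clean_srt() {
    local script name
    # Spoken flags: "dash dash amend" -> "--amend"
    script='s/[Dd]ash [Dd]ash \([A-Za-z]\)/--\1/g'
    for name in "$@"; do
        # Plain substring replacement; list misspellings as extra arguments.
        script="$script;s/$name/[name]/g"
    done
    sed -e "$script"
}
```

A possible call, with a placeholder file name and made-up names, is `clean_srt Alice Alise Bob < day1.srt > day1.cleaned.srt`. Writing to a new file keeps the Whisper output untouched until the changes have been checked.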
Slicing the videos
If you are volunteering to help generate the edit list:
- See https://github.com/coderefinery/ffmpeg-editlist for the basics.
- I guess in reality, look at a template file from a previous workshop and copy and modify that.
- Set all the things you see in that file.
- Priorities:
- Roughly split the videos into the different lessons.
- Try to remove the breaks and exercise times. It's OK to have a few seconds before/after: better that than missing some.
- When it's easy, add some chapters into it, so that people can quickly navigate around the video.
- Make sure the description is accurate.
- (see other docs)
git-annex setup for private video files
Raw video files are private and only synced via our cluster.
Only do this if you are pulling the private (raw) big video files to your own computer to view them: otherwise, you can use git normally and the video files appear as broken symbolic links. For the final videos, you can get them using the public copy above.
Privacy notice: the git-annex info on which computers have which files gets publicly distributed through the repository (including through GitHub). The info about your computer is the UUID and the MY-COMPUTER-NAME which is in the repo.
To set up this repo to connect to the Triton cluster:
```
# (pull repo from GitHub)
git remote add triton triton.aalto.fi:/scratch/scicomp/video-processing/
git config remote.triton.annex-shell /share/apps/git-annex/10.20230228.path/git-annex-shell
git annex init MY-COMPUTER-NAME   # set up git-annex
git annex wanted . present        # don't download everything, but keep what is here
git annex sync
git annex get python-for-scicomp/2023/raw/FILE.mkv
```
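Since annexed files whose content is not on your computer appear as broken symbolic links, a plain `find` can show what is still missing locally. This is a minimal sketch, assuming the repository uses symlinks for annexed files as described above; `list_missing_content` is a hypothetical name, not a git-annex command.

```shell
# Hypothetical helper: list symbolic links under a directory whose target
# does not exist, i.e. annexed files whose content is not present locally.
list_missing_content() {
    # "test -e" follows the link, so it fails for a dangling symlink;
    # the "!" turns that failure into a match that gets printed.
    find "$1" -type l ! -exec test -e {} \; -print | sort
}
```

Running `list_missing_content python-for-scicomp-2023` from the top of the clone would print the files in that course directory that still need a `git annex get`.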
Owner
- Name: coderefinery
- Login: coderefinery
- Kind: organization
- Email: support@coderefinery.com
- Website: https://coderefinery.org
- Repositories: 141
- Profile: https://github.com/coderefinery
GitHub Events
Total
- Issues event: 5
- Issue comment event: 1
- Push event: 119
Last Year
- Issues event: 5
- Issue comment event: 1
- Push event: 119