https://github.com/activeinferenceinstitute/journal-utilities
Utilities and Documentation for creating contents for the Active Inference Journal
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: ActiveInferenceInstitute
- License: cc0-1.0
- Language: Python
- Default Branch: main
- Size: 144 MB
Statistics
- Stars: 5
- Watchers: 1
- Forks: 1
- Open Issues: 1
- Releases: 1
Metadata Files
README.md
Journal-Utilities
Utilities and documentation for creating content for the Active Inference Journal: https://github.com/ActiveInferenceInstitute/ActiveInferenceJournal
This repo contains two transcription methods. The first uses the AssemblyAI tools; the second runs WhisperX locally, with SurrealDB for storage.
AssemblyAI Transcription
Installation
Create a Python virtual environment
bash
python -m venv venv
source venv/bin/activate
Install Pandoc and XeLaTeX
Install Pandoc.
Install XeLaTeX for font support:
bash
sudo apt-get install texlive-xetex
Step 1: Download Audio Transcript
Download the m4a audio file from YouTube and upload it to an HTTPS-accessible location.
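One way to script the download is with the yt-dlp Python package. The README does not name a download tool, so yt-dlp, the helper names, and the output naming below are assumptions rather than the project's method. Uploading the file to an HTTPS-accessible location remains a separate step.

```python
from typing import Any


def build_audio_options(output_dir: str = ".") -> dict[str, Any]:
    """Return yt-dlp options that prefer an m4a audio-only stream."""
    return {
        # Prefer an m4a audio stream; fall back to the best available audio.
        "format": "bestaudio[ext=m4a]/bestaudio",
        # Name the file after the YouTube video ID (assumed naming scheme).
        "outtmpl": f"{output_dir}/%(id)s.%(ext)s",
        # Download only the single video even if the URL points into a playlist.
        "noplaylist": True,
    }


def download_audio(video_url: str, output_dir: str = ".") -> None:
    """Download the audio track of one video (needs `pip install yt-dlp`)."""
    import yt_dlp  # imported lazily so build_audio_options works without it

    with yt_dlp.YoutubeDL(build_audio_options(output_dir)) as ydl:
        ydl.download([video_url])
```

Calling `download_audio(url, "audio")` with a video URL would write the audio file into the `audio` directory.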
Step 2: Generate Single Source Transcript
Set up environment variables
Add a new row in the Coda spreadsheet: https://coda.io/d/ActInf-JournaldwYsKMwppRN/Tracking-SpreadsheetsupJk#_luEV4
Run transcription
cd into the session's root folder
bash
cd ActiveInferenceJournal/Courses/PhysicsAsInformationProcessing_ChrisFields/Discussion_1/
Copy the transcribe command that calls 2_audio_to_markdown/SubmitToCloudWhisper.py. It should look like this:
bash
python '/mnt/md0/projects/Journal-Utilities/2_audio_to_markdown/SubmitToCloudWhisper.py' 'cFPIP-06W' '4WKy_TVLReB2KAN6cGr5zk-GvfzfsRAihgK7Kc_Equw' ONLINEPATH 'https://arweave.net' AUTHKEYFILENAME '/mnt/md0/projects/Journal-Utilities/2_audio_to_markdown/authkey.txt' WORD_BOOST_FILE_LIST 'word_boost.txt' SENTIMENT_ANALYSIS False IAB_CATEGORIES False CUSTOM_SPELL_BOOSTED True CUSTOM_SPELLING_FILE_LIST 'custom_spelling.csv' | tee 'trace.txt'
Generate single source transcript
Update the AssemblyAI-generated speaker labels ("A", "B", ...) to the speakers' names (for example "Daniel", "Bleu").
Add words to the word_boost.txt file.
cd into the session's metadata folder:
bash
cd metadata/
Copy the transcribe command that calls 2_audio_to_markdown/sentencesToTranscripts.py. It should look like this:
bash
python '/mnt/md0/projects/Journal-Utilities/2_audio_to_markdown/sentencesToTranscripts.py' 'cFPIP-06W' '/mnt/md0/projects/ActiveInferenceJournal/Courses/PhysicsAsInformationProcessing_ChrisFields/Discussion_6/Metadata' 'cFPIP-06W_4WKy_TVLReB2KAN6cGr5zk-GvfzfsRAihgK7Kc_Equw.sentences.csv' INSPEAKERDIR '/mnt/md0/projects/ActiveInferenceJournal/Courses/PhysicsAsInformationProcessing_ChrisFields/Discussion_6/Metadata' SPEAKERFILE 'cFPIP-06W_4WKy_TVLReB2KAN6cGr5zk-GvfzfsRAihgK7Kc_Equw.speakers.csv' | tee cFPIP-06W.m4a_transcript.json
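The speaker relabeling step in this section (replacing generated labels such as "A" and "B" with names) can also be scripted. The sketch below assumes the labels sit in one column of a CSV file. The column name `speaker`, the sample rows, and the function name are hypothetical, since the layout of the generated CSV files is not described here.

```python
import csv
import io


def relabel_speakers(csv_text: str, mapping: dict[str, str], column: str = "speaker") -> str:
    """Replace generated speaker labels in one CSV column with real names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"column {column!r} not found")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Labels without a mapping entry are left unchanged.
        row[column] = mapping.get(row[column], row[column])
        writer.writerow(row)
    return out.getvalue()


# Hypothetical sample data; the real file may use different columns.
sample = "speaker,text\nA,Hello everyone.\nB,Thanks for joining.\nC,Hi.\n"
print(relabel_speakers(sample, {"A": "Daniel", "B": "Bleu"}))
```

Leaving unmapped labels untouched makes it easy to spot speakers that still need a name.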
Step 5: Markdown to Final Outputs
The parse_markdown function in 5_markdown_to_final/markdown_transcript_parser.py converts the Markdown file into an SRT file and a Markdown file without timestamps. write_output_files saves both files to disk. See tests/test_output_final_artifacts.py for usage.
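For orientation, an SRT file is a sequence of numbered blocks, each with a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range followed by the caption text. The sketch below illustrates that layout only. It is not the repo's parse_markdown, and the function names and the (start, end, text) cue tuples are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues: list[tuple[float, float, str]]) -> str:
    """Render (start, end, text) cues as numbered SRT blocks."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        time_range = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{index}\n{time_range}\n{text}\n")
    # A blank line separates consecutive blocks.
    return "\n".join(blocks)


print(to_srt([(0.0, 2.5, "Hello"), (2.5, 4.0, "World")]))
```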
For a course with multiple lectures, such as Physics as Information Processing, concatenate_markdown_files combines the Markdown files into one file. This file can then be converted to PDF or HTML using pandoc.
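The idea behind combining per-lecture files can be sketched in a few lines. This is an illustration under assumptions, not the repo's concatenate_markdown_files: the function name, the blank-line separator, and the example file names are hypothetical.

```python
from pathlib import Path


def concatenate_markdown(paths: list[Path], output: Path) -> None:
    """Join Markdown files in the given order, separated by one blank line."""
    parts = [p.read_text(encoding="utf-8").strip() for p in paths]
    output.write_text("\n\n".join(parts) + "\n", encoding="utf-8")


# Example call with hypothetical file names:
# concatenate_markdown(
#     [Path("Discussion_1/transcript.md"), Path("Discussion_2/transcript.md")],
#     Path("all_transcripts.md"),
# )
```

The caller controls the lecture order by the order of `paths`, which matters because the combined file is what pandoc converts.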
Convert Markdown to PDF
bash
cd /mnt/md0/projects/ActiveInferenceJournal/Courses/PhysicsAsInformationProcessing_ChrisFields
pandoc --pdf-engine xelatex -f markdown-implicit_figures all_transcripts.md --lua-filter=images/scholarly-metadata.lua --lua-filter=images/author-info-blocks.lua -o all_transcripts.pdf
Convert Markdown to HTML
bash
cd /mnt/md0/projects/ActiveInferenceJournal/Courses/PhysicsAsInformationProcessing_ChrisFields
pandoc -f markdown-implicit_figures all_transcripts.md --lua-filter=images/scholarly-metadata.lua --lua-filter=images/author-info-blocks.lua -o all_transcripts.html
Remove all instances of /mnt/md0/projects/ActiveInferenceJournal/Courses/PhysicsAsInformationProcessing_ChrisFields/ from the HTML file so that the image paths become relative and the images load.
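This path cleanup is a plain string replacement and can be scripted. A minimal sketch, assuming the HTML is UTF-8 encoded; the function names are hypothetical and the prefix is the course path used in the pandoc commands of this section.

```python
from pathlib import Path

# Absolute course path that pandoc embeds in image references.
COURSE_ROOT = (
    "/mnt/md0/projects/ActiveInferenceJournal/Courses/"
    "PhysicsAsInformationProcessing_ChrisFields/"
)


def make_paths_relative(html: str, prefix: str = COURSE_ROOT) -> str:
    """Drop every occurrence of the absolute prefix from the HTML text."""
    return html.replace(prefix, "")


def rewrite_html_file(path: Path, prefix: str = COURSE_ROOT) -> None:
    """Apply make_paths_relative to a file in place."""
    html = path.read_text(encoding="utf-8")
    path.write_text(make_paths_relative(html, prefix), encoding="utf-8")


# Example: rewrite_html_file(Path("all_transcripts.html"))
```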
Database and WhisperX Tools 2024
Installation
Create Conda Environment
bash
conda create --name whisperx python=3.10
conda activate whisperx
Install PyTorch with CUDA 11.8
bash
conda install pytorch==2.0.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia
Install WhisperX
bash
pip install git+https://github.com/m-bain/whisperx.git
pip install python-dotenv
pip install mkl==2024.0 # downgrade to fix `libtorch_cpu.so: undefined symbol: iJIT_NotifyEvent`
pip install surrealdb
pip install pyytdata
Install ffmpeg
bash
wget -O - -q https://github.com/yt-dlp/FFmpeg-Builds/releases/download/latest/ffmpeg-master-latest-linux64-gpl.tar.xz | xz -qdc| tar -x
Set up .env file
Generate a Hugging Face token and accept the user agreement for the following models: Segmentation and Speaker-Diarization-3.1.
bash
cp .env.sample .env
Update the HUGGINGFACE_TOKEN value in .env with your token.
Get a YouTube Data API v3 key from https://console.developers.google.com/apis/ and update the API_KEY value in .env.
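Before running the scripts, it can help to confirm that both values are actually set. The sketch below uses python-dotenv, which is installed in the steps above. The function names are hypothetical, and how the repo's own scripts read .env is not shown here.

```python
import os


def missing_settings(required=("HUGGINGFACE_TOKEN", "API_KEY"), env=None) -> list[str]:
    """Return the names of required settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


def load_settings() -> None:
    """Load .env into the process environment and fail early if values are missing."""
    from dotenv import load_dotenv  # provided by `pip install python-dotenv`

    load_dotenv()  # looks for a .env file and loads its values into os.environ
    missing = missing_settings()
    if missing:
        raise RuntimeError("Missing values in .env: " + ", ".join(missing))
```

Failing early with the names of the missing keys is usually easier to debug than an authentication error later in the pipeline.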
Start Database
bash
surreal start --log trace --user root --pass root --bind 0.0.0.0:8080 rocksdb:///mnt/md0/projects/Journal-Utilities/data/database
Ingest MP4 files into Database, Convert to WAV files
bash
cd src
python ingest_db_create_wav.py
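The WAV conversion part of this step can be pictured as one ffmpeg call per file. The sketch below is an illustration under assumptions, not the contents of ingest_db_create_wav.py: the function names are hypothetical, and 16 kHz mono is chosen because Whisper models work on 16 kHz audio.

```python
import subprocess
from pathlib import Path


def ffmpeg_wav_command(source: Path, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that extracts mono WAV audio next to the source file."""
    target = source.with_suffix(".wav")
    return [
        "ffmpeg",
        "-y",                     # overwrite an existing output file
        "-i", str(source),        # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # mix down to one channel
        "-ar", str(sample_rate),  # resample to the target rate
        str(target),
    ]


def convert_to_wav(source: Path) -> Path:
    """Run ffmpeg (must be on PATH) and return the path of the WAV file."""
    subprocess.run(ffmpeg_wav_command(source), check=True)
    return source.with_suffix(".wav")
```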
Run Transcribe
bash
cd src
python transcribe.py
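Once a transcription with speaker diarization has run, the result can be turned into a readable transcript. The sketch below assumes each segment is a dict with `start`, `end`, and `text` keys plus an optional `speaker` key. That shape, the function name, and the sample data are assumptions for illustration and are not taken from transcribe.py.

```python
def format_segments(segments: list[dict]) -> str:
    """Render segments as '[MM:SS] SPEAKER: text' lines."""
    lines = []
    for seg in segments:
        # Segments without a diarization label get a placeholder name.
        speaker = seg.get("speaker", "UNKNOWN")
        minutes = int(seg["start"] // 60)
        seconds = int(seg["start"] % 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {speaker}: {seg['text'].strip()}")
    return "\n".join(lines)


# Hypothetical sample segments.
sample = [
    {"start": 0.0, "end": 2.1, "text": " Hello.", "speaker": "SPEAKER_00"},
    {"start": 75.4, "end": 80.0, "text": " Hi."},
]
print(format_segments(sample))
```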
Query DB
bash
surreal sql --endpoint http://localhost:8080 --username root --password root --namespace actinf --database actinf
Upgrade DB
bash
sudo surreal upgrade
surreal fix rocksdb://database
Acknowledgements
- Initial Scripts 1 & 2, and initial README contributed by Dave Douglass, November 2022.
- Initial script 5 contributed by Holly Grimm @hollygrimm, December 2023.
Owner
- Name: Active Inference Institute
- Login: ActiveInferenceInstitute
- Kind: user
- Location: Online
- Company: Active Inference Institute
- Website: http://activeinference.org/
- Twitter: InferenceActive
- Repositories: 3
- Profile: https://github.com/ActiveInferenceInstitute
GitHub Events
Total
- Watch event: 2
- Fork event: 1
Last Year
- Watch event: 2
- Fork event: 1
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 1
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: 1 minute
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- hollygrimm (1)
Pull Request Authors
- hollygrimm (1)