https://github.com/activeinferenceinstitute/journal-utilities

Utilities and Documentation for creating contents for the Active Inference Journal

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (10.0%) to scientific vocabulary

Last synced: 9 months ago

Repository


Basic Info

Host: GitHub
Owner: ActiveInferenceInstitute
License: cc0-1.0
Language: Python
Default Branch: main
Size: 144 MB

Statistics

Stars: 5
Watchers: 1
Forks: 1
Open Issues: 1
Releases: 1

Created over 3 years ago · Last pushed about 2 years ago

Metadata Files

Readme License

Journal-Utilities

Utilities and Documentation for creating contents for the Active Inference Journal https://github.com/ActiveInferenceInstitute/ActiveInferenceJournal

There are two transcription methods in this repo. The first uses the AssemblyAI tools; the second runs WhisperX locally and uses SurrealDB for storage.

AssemblyAI Transcription

Installation

Create a python virtual environment

    python -m venv venv
    source venv/bin/activate

Install Pandoc, XeLaTeX

Install Pandoc

Install XeLaTeX for font support:

    sudo apt-get install texlive-xetex

Step 1: Download Audio Transcript

Download the m4a audio file from YouTube and upload it to an HTTPS-accessible location.
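The README does not name a download tool. As one possible approach, the sketch below builds a command line for yt-dlp (an assumption, not something this repo prescribes) that requests an m4a audio-only stream. The function name, URL, and file stem are hypothetical placeholders.

```python
import shlex


def build_audio_download_command(video_url: str, output_stem: str) -> list[str]:
    """Build a yt-dlp command line that fetches an m4a audio-only stream."""
    return [
        "yt-dlp",
        "-f", "bestaudio[ext=m4a]",       # best available m4a audio-only format
        "-o", f"{output_stem}.%(ext)s",   # yt-dlp substitutes the real extension
        video_url,
    ]


if __name__ == "__main__":
    # Placeholder URL and stem; substitute the session's real values.
    cmd = build_audio_download_command(
        "https://www.youtube.com/watch?v=VIDEO_ID", "session_audio"
    )
    print(shlex.join(cmd))
```

Printing the command rather than running it keeps the sketch free of network access; the printed line can be pasted into a shell once yt-dlp is installed.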

Step 2: Generate Single Source Transcript

Setup environment variables

Add a new row to the Coda spreadsheet: https://coda.io/d/ActInf-JournaldwYsKMwppRN/Tracking-SpreadsheetsupJk#_luEV4

Run transcription

cd into the session's root folder:

    cd ActiveInferenceJournal/Courses/PhysicsAsInformationProcessing_ChrisFields/Discussion_1/

Copy the transcribe command that calls 2_audio_to_markdown/SubmitToCloudWhisper.py. It should look like this:

    python '/mnt/md0/projects/Journal-Utilities/2_audio_to_markdown/SubmitToCloudWhisper.py' \
        'cFPIP-06W' '4WKy_TVLReB2KAN6cGr5zk-GvfzfsRAihgK7Kc_Equw' \
        ONLINEPATH 'https://arweave.net' \
        AUTHKEYFILENAME '/mnt/md0/projects/Journal-Utilities/2_audio_to_markdown/authkey.txt' \
        WORD_BOOST_FILE_LIST 'word_boost.txt' \
        SENTIMENT_ANALYSIS False \
        IAB_CATEGORIES False \
        CUSTOM_SPELL_BOOSTED True \
        CUSTOM_SPELLING_FILE_LIST 'custom_spelling.csv' | tee 'trace.txt'
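The example command follows a regular shape: the script path, two positional values, then alternating KEY value pairs. A minimal sketch of assembling it programmatically is shown below. The helper name `build_submit_command` is hypothetical, and the argument order is inferred only from the example command, not checked against the script itself.

```python
import shlex


def build_submit_command(script: str, positional: list[str], options: dict) -> list[str]:
    """Return the argument list: positional values first, then KEY value pairs."""
    cmd = ["python", script, *positional]
    for key, value in options.items():
        cmd += [key, str(value)]  # booleans become "True" / "False"
    return cmd


if __name__ == "__main__":
    root = "/mnt/md0/projects/Journal-Utilities/2_audio_to_markdown"
    cmd = build_submit_command(
        f"{root}/SubmitToCloudWhisper.py",
        ["cFPIP-06W", "4WKy_TVLReB2KAN6cGr5zk-GvfzfsRAihgK7Kc_Equw"],
        {
            "ONLINEPATH": "https://arweave.net",
            "AUTHKEYFILENAME": f"{root}/authkey.txt",
            "WORD_BOOST_FILE_LIST": "word_boost.txt",
            "SENTIMENT_ANALYSIS": False,
            "IAB_CATEGORIES": False,
            "CUSTOM_SPELL_BOOSTED": True,
            "CUSTOM_SPELLING_FILE_LIST": "custom_spelling.csv",
        },
    )
    # Shell-quoted form; append "| tee trace.txt" when running it by hand.
    print(shlex.join(cmd))
```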

Generate single source transcript

Update the AssemblyAI-generated speaker labels ("A", "B", ...) to the speakers' names (for example "Daniel", "Bleu").

Add words to the word_boost.txt file.

cd into the session's metadata folder:

    cd metadata/

Copy the transcribe command that calls 2_audio_to_markdown/sentencesToTranscripts.py. It should look like this:

    python '/mnt/md0/projects/Journal-Utilities/2_audio_to_markdown/sentencesToTranscripts.py' \
        'cFPIP-06W' \
        '/mnt/md0/projects/ActiveInferenceJournal/Courses/PhysicsAsInformationProcessing_ChrisFields/Discussion_6/Metadata' \
        'cFPIP-06W_4WKy_TVLReB2KAN6cGr5zk-GvfzfsRAihgK7Kc_Equw.sentences.csv' \
        INSPEAKERDIR '/mnt/md0/projects/ActiveInferenceJournal/Courses/PhysicsAsInformationProcessing_ChrisFields/Discussion_6/Metadata' \
        SPEAKERFILE 'cFPIP-06W_4WKy_TVLReB2KAN6cGr5zk-GvfzfsRAihgK7Kc_Equw.speakers.csv' | tee cFPIP-06W.m4a_transcript.json
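The speaker relabeling step in this section amounts to mapping generic labels to names. The sketch below illustrates that idea on in-memory rows. The row layout (a `speaker` field and a `text` field) and the sample sentences are made up for illustration, and the actual column layout of the speakers file is not documented here.

```python
def relabel_speakers(rows: list[dict], names: dict) -> list[dict]:
    """Return copies of rows with the 'speaker' field mapped through names.

    Labels without an entry in names are left unchanged.
    """
    return [
        {**row, "speaker": names.get(row["speaker"], row["speaker"])}
        for row in rows
    ]


if __name__ == "__main__":
    # Illustrative rows only; not taken from a real transcript.
    rows = [
        {"speaker": "A", "text": "Welcome, everyone."},
        {"speaker": "B", "text": "Thanks for having me."},
    ]
    for row in relabel_speakers(rows, {"A": "Daniel", "B": "Bleu"}):
        print(f"{row['speaker']}: {row['text']}")
```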

Step 5: Markdown to Final Outputs

The parse_markdown function in 5_markdown_to_final/markdown_transcript_parser.py converts the markdown file into an SRT file and an MD file without timestamps. write_output_files saves the files to disk. See tests/test_output_final_artifacts.py for usage.
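For readers unfamiliar with the SRT format, the sketch below shows how a single subtitle cue is laid out: an index line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, then the text. This is not the repo's parse_markdown; the function names `srt_timestamp` and `srt_cue` are hypothetical and only illustrate the target format.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_cue(index: int, start: float, end: float, text: str) -> str:
    """Render one SRT cue: index line, time range line, then the text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"


if __name__ == "__main__":
    print(srt_cue(1, 0.0, 2.25, "Hello"))
```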

For a course with multiple lectures, such as Physics as Information Processing, concatenate_markdown_files combines the markdown files into one file. This file can then be converted to PDF or HTML using pandoc.
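To make the concatenation step concrete, here is a minimal sketch of joining several markdown files with a blank line between them. It is a stand-in written for illustration, not the repo's concatenate_markdown_files; the name `combine_markdown` and the separator choice are assumptions.

```python
from pathlib import Path


def combine_markdown(paths, output_path) -> str:
    """Join markdown files in the given order, separated by one blank line."""
    parts = [Path(p).read_text(encoding="utf-8").strip() for p in paths]
    combined = "\n\n".join(parts) + "\n"
    Path(output_path).write_text(combined, encoding="utf-8")
    return combined
```

The caller controls the order of `paths`, so lectures appear in the combined file in whatever sequence they are passed.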

Convert Markdown to PDF

    cd /mnt/md0/projects/ActiveInferenceJournal/Courses/PhysicsAsInformationProcessing_ChrisFields
    pandoc --pdf-engine xelatex -f markdown-implicit_figures all_transcripts.md --lua-filter=images/scholarly-metadata.lua --lua-filter=images/author-info-blocks.lua -o all_transcripts.pdf

Convert Markdown to HTML

    cd /mnt/md0/projects/ActiveInferenceJournal/Courses/PhysicsAsInformationProcessing_ChrisFields
    pandoc -f markdown-implicit_figures all_transcripts.md --lua-filter=images/scholarly-metadata.lua --lua-filter=images/author-info-blocks.lua -o all_transcripts.html

Remove all instances of /mnt/md0/projects/ActiveInferenceJournal/Courses/PhysicsAsInformationProcessing_ChrisFields/ from the HTML file so that the image paths become relative and the images load.
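The prefix removal can be done by hand in an editor, or with a few lines of Python. The sketch below is one way to do it; the helper names `strip_path_prefix` and `rewrite_file` are hypothetical.

```python
from pathlib import Path

COURSE_ROOT = (
    "/mnt/md0/projects/ActiveInferenceJournal/Courses/"
    "PhysicsAsInformationProcessing_ChrisFields/"
)


def strip_path_prefix(html: str, prefix: str = COURSE_ROOT) -> str:
    """Remove every occurrence of the absolute prefix, leaving relative paths."""
    return html.replace(prefix, "")


def rewrite_file(path, prefix: str = COURSE_ROOT) -> None:
    """Apply strip_path_prefix to a file in place."""
    p = Path(path)
    p.write_text(
        strip_path_prefix(p.read_text(encoding="utf-8"), prefix),
        encoding="utf-8",
    )
```

Calling `rewrite_file("all_transcripts.html")` from the course folder would update the generated HTML in place.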

Database and WhisperX Tools 2024

Installation

Create Conda Environment

    conda create --name whisperx python=3.10
    conda activate whisperx

Install PyTorch CUDA 11.8

    conda install pytorch==2.0.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia

Install WhisperX

    pip install git+https://github.com/m-bain/whisperx.git
    pip install python-dotenv
    pip install mkl==2024.0  # downgrade to fix `libtorch_cpu.so: undefined symbol: iJIT_NotifyEvent`
    pip install surrealdb
    pip install pyytdata

Install ffmpeg

    wget -O - -q https://github.com/yt-dlp/FFmpeg-Builds/releases/download/latest/ffmpeg-master-latest-linux64-gpl.tar.xz | xz -qdc | tar -x

Setup .env file

Generate a Hugging Face Token and accept the user agreement for the following models: Segmentation and Speaker-Diarization-3.1

    cp .env.sample .env

Update the HUGGINGFACE_TOKEN value in .env with your token.

Get the YouTube Data API v3 Key from https://console.developers.google.com/apis/ and update API_KEY value in .env.
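Before running the tools, it can help to confirm that both settings are present. The sketch below checks for the two keys named above. The helper `missing_settings` is hypothetical; the optional `load_dotenv()` call comes from the python-dotenv package installed earlier, and how the repo's own scripts read .env is not shown here.

```python
import os

# The two keys this README asks you to set in .env.
REQUIRED_KEYS = ("HUGGINGFACE_TOKEN", "API_KEY")


def missing_settings(env) -> list[str]:
    """Return the required keys that are absent or blank in the given mapping."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    try:
        from dotenv import load_dotenv  # provided by python-dotenv
        load_dotenv()  # loads .env from the current directory, if found
    except ImportError:
        pass  # fall back to whatever is already in the environment
    missing = missing_settings(os.environ)
    print("Missing settings:", ", ".join(missing) if missing else "none")
```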

Start Database

    surreal start --log trace --user root --pass root --bind 0.0.0.0:8080 rocksdb:///mnt/md0/projects/Journal-Utilities/data/database

Ingest MP4 files into Database, Convert to WAV files

    cd src
    python ingest_db_create_wav.py
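As an illustration of the MP4 to WAV conversion, the sketch below builds an ffmpeg command that drops the video stream and writes mono 16 kHz audio, the sample rate Whisper models work with. Whether ingest_db_create_wav.py uses these exact settings is an assumption; the function name `build_wav_command` is hypothetical.

```python
from pathlib import Path


def build_wav_command(mp4_path, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that extracts mono WAV audio next to the MP4."""
    src = Path(mp4_path)
    dst = src.with_suffix(".wav")
    return [
        "ffmpeg",
        "-y",                     # overwrite an existing output file
        "-i", str(src),           # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to one channel
        "-ar", str(sample_rate),  # resample the audio
        str(dst),
    ]


if __name__ == "__main__":
    # Placeholder file name; pass the list to subprocess.run to execute it.
    print(" ".join(build_wav_command("session.mp4")))
```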

Run Transcribe

    cd src
    python transcribe.py

Query DB

    surreal sql --endpoint http://localhost:8080 --username root --password root --namespace actinf --database actinf

Upgrade DB

    sudo surreal upgrade
    surreal fix rocksdb://database

Acknowledgements

Initial Scripts 1 & 2, and initial README contributed by Dave Douglass, November 2022.
Initial Scripts 5 contributed by Holly Grimm @hollygrimm, December 2023.

Owner

Name: Active Inference Institute
Login: ActiveInferenceInstitute
Kind: user
Location: Online
Company: Active Inference Institute

Website: http://activeinference.org/
Twitter: InferenceActive
Repositories: 3
Profile: https://github.com/ActiveInferenceInstitute


GitHub Events

Total

Watch event: 2
Fork event: 1

Last Year

Watch event: 2
Fork event: 1

Issues and Pull Requests

Last synced: 9 months ago

All Time

Total issues: 1
Total pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: 1 minute
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0
