https://github.com/activeinferenceinstitute/journal-utilities

Utilities and Documentation for creating contents for the Active Inference Journal

Repository

Basic Info
  • Host: GitHub
  • Owner: ActiveInferenceInstitute
  • License: cc0-1.0
  • Language: Python
  • Default Branch: main
  • Size: 144 MB
Statistics
  • Stars: 5
  • Watchers: 1
  • Forks: 1
  • Open Issues: 1
  • Releases: 1
Created over 3 years ago · Last pushed about 2 years ago

README.md

Journal-Utilities

Utilities and Documentation for creating contents for the Active Inference Journal https://github.com/ActiveInferenceInstitute/ActiveInferenceJournal

There are two transcription methods in this repo. The first uses the AssemblyAI tools, and the second runs WhisperX locally with SurrealDB for storage.


AssemblyAI Transcription

Installation

Create a Python virtual environment:

    python -m venv venv
    source venv/bin/activate

Install Pandoc and XeLaTeX

Install Pandoc.

Install XeLaTeX for font support:

    sudo apt-get install texlive-xetex

Step 1: Download Audio Transcript

Download the m4a audio file from YouTube and upload it to an HTTPS-accessible location.

Step 2: Generate Single Source Transcript

Setup environment variables

Add a new row to the Coda spreadsheet: https://coda.io/d/ActInf-JournaldwYsKMwppRN/Tracking-SpreadsheetsupJk#_luEV4

Run transcription

cd into the session's root folder:

    cd ActiveInferenceJournal/Courses/PhysicsAsInformationProcessing_ChrisFields/Discussion_1/

Copy the transcribe command that calls 2_audio_to_markdown/SubmitToCloudWhisper.py. It should look like this:

    python '/mnt/md0/projects/Journal-Utilities/2_audio_to_markdown/SubmitToCloudWhisper.py' \
        'cFPIP-06W' '4WKy_TVLReB2KAN6cGr5zk-GvfzfsRAihgK7Kc_Equw' \
        ONLINEPATH 'https://arweave.net' \
        AUTHKEYFILENAME '/mnt/md0/projects/Journal-Utilities/2_audio_to_markdown/authkey.txt' \
        WORD_BOOST_FILE_LIST 'word_boost.txt' \
        SENTIMENT_ANALYSIS False \
        IAB_CATEGORIES False \
        CUSTOM_SPELL_BOOSTED True \
        CUSTOM_SPELLING_FILE_LIST 'custom_spelling.csv' | tee 'trace.txt'
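Because the command is long and mixes positional arguments with `KEYWORD value` pairs, it can help to assemble it programmatically. The following is a minimal sketch, not part of the repository: `build_command` is a hypothetical helper, and the only thing it assumes about SubmitToCloudWhisper.py is the argument order visible in the command above.

```python
import shlex


def build_command(script, positional, options):
    """Return a shell command line: script, positional args, then KEYWORD value pairs.

    `options` maps the upper-case keywords seen in the README command
    (ONLINEPATH, AUTHKEYFILENAME, ...) to their values.
    """
    argv = ["python", script, *positional]
    for keyword, value in options.items():
        argv += [keyword, str(value)]
    # shlex.join quotes only the arguments that need quoting.
    return shlex.join(argv)


command = build_command(
    "2_audio_to_markdown/SubmitToCloudWhisper.py",
    ["cFPIP-06W", "4WKy_TVLReB2KAN6cGr5zk-GvfzfsRAihgK7Kc_Equw"],
    {"ONLINEPATH": "https://arweave.net", "SENTIMENT_ANALYSIS": False},
)
print(command)
```

The output can be pasted into a shell and piped through `tee` as in the original command.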

Generate single source transcript

Update the AssemblyAI-generated speaker labels ("A", "B", ...) to the speakers' names ("Daniel", "Bleu", ...).
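The relabeling step amounts to a lookup from letter labels to names. The sketch below illustrates the idea on in-memory rows; the row layout (a `speaker` key and a `text` key) and the `rename_speakers` helper are assumptions for illustration, since the actual layout of the speakers file is not shown here.

```python
def rename_speakers(rows, names):
    """Return copies of `rows` with each speaker label replaced by a name.

    Labels that are missing from `names` are left unchanged.
    """
    return [
        {**row, "speaker": names.get(row["speaker"], row["speaker"])}
        for row in rows
    ]


rows = [
    {"speaker": "A", "text": "Hello and welcome."},
    {"speaker": "B", "text": "Thanks for having me."},
    {"speaker": "C", "text": "A question from the chat."},
]
renamed = rename_speakers(rows, {"A": "Daniel", "B": "Bleu"})
```

Leaving unknown labels untouched makes it easy to spot speakers that still need a name.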

Add words to the word_boost.txt file.

cd into the session's metadata folder:

    cd metadata/

Copy the transcribe command that calls 2_audio_to_markdown/sentencesToTranscripts.py. It should look like this:

    python '/mnt/md0/projects/Journal-Utilities/2_audio_to_markdown/sentencesToTranscripts.py' \
        'cFPIP-06W' \
        '/mnt/md0/projects/ActiveInferenceJournal/Courses/PhysicsAsInformationProcessing_ChrisFields/Discussion_6/Metadata' \
        'cFPIP-06W_4WKy_TVLReB2KAN6cGr5zk-GvfzfsRAihgK7Kc_Equw.sentences.csv' \
        INSPEAKERDIR '/mnt/md0/projects/ActiveInferenceJournal/Courses/PhysicsAsInformationProcessing_ChrisFields/Discussion_6/Metadata' \
        SPEAKERFILE 'cFPIP-06W_4WKy_TVLReB2KAN6cGr5zk-GvfzfsRAihgK7Kc_Equw.speakers.csv' | tee cFPIP-06W.m4a_transcript.json

Step 5: Markdown to Final Outputs

The parse_markdown function in 5_markdown_to_final/markdown_transcript_parser.py converts the markdown file to an SRT and MD file (without timestamps). write_output_files will save the files to disk. Look at tests/test_output_final_artifacts.py for usage.
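For orientation, an SRT file is a sequence of numbered cues, each with a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line followed by the text. The sketch below shows how such output can be produced; it is not the repository's `parse_markdown`, and the helper names and the `(start_ms, end_ms, text)` cue layout are assumptions made for illustration.

```python
def srt_timestamp(ms):
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(cues):
    """Render (start_ms, end_ms, text) tuples as SRT text.

    Cues are numbered from 1 and separated by a blank line.
    """
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(blocks)


srt_text = to_srt([(0, 1500, "Hello."), (1500, 4000, "Welcome.")])
```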

For a course with multiple lectures, such as Physics as Information Processing, concatenate_markdown_files combines the markdown files into one file. This file can then be converted to PDF or HTML using pandoc.
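Conceptually, combining the lecture transcripts is a matter of reading each file in order and joining the contents with a blank line between them. The following is a rough sketch of that idea only; `join_markdown` is a hypothetical stand-in, and the repository's `concatenate_markdown_files` may take different arguments or do additional processing.

```python
from pathlib import Path


def join_markdown(paths, output_path):
    """Concatenate markdown files in the given order and write the result.

    Each file is stripped of leading and trailing whitespace, and files
    are separated by one blank line so headings do not run together.
    """
    parts = [Path(p).read_text(encoding="utf-8").strip() for p in paths]
    combined = "\n\n".join(parts) + "\n"
    Path(output_path).write_text(combined, encoding="utf-8")
    return combined
```

The order of `paths` determines the order of lectures in the combined file, so it is worth sorting the list explicitly before calling the function.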

Convert Markdown to PDF

    cd /mnt/md0/projects/ActiveInferenceJournal/Courses/PhysicsAsInformationProcessing_ChrisFields
    pandoc --pdf-engine xelatex -f markdown-implicit_figures all_transcripts.md --lua-filter=images/scholarly-metadata.lua --lua-filter=images/author-info-blocks.lua -o all_transcripts.pdf

Convert Markdown to HTML

    cd /mnt/md0/projects/ActiveInferenceJournal/Courses/PhysicsAsInformationProcessing_ChrisFields
    pandoc -f markdown-implicit_figures all_transcripts.md --lua-filter=images/scholarly-metadata.lua --lua-filter=images/author-info-blocks.lua -o all_transcripts.html

Remove all instances of /mnt/md0/projects/ActiveInferenceJournal/Courses/PhysicsAsInformationProcessing_ChrisFields/ from the HTML file so that the image paths become relative and the images load.
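The prefix removal can be scripted rather than done by hand. A minimal sketch, assuming the HTML file fits in memory and that a plain string replacement is sufficient; `strip_prefix` is a hypothetical helper, not a function from the repository.

```python
from pathlib import Path

PREFIX = (
    "/mnt/md0/projects/ActiveInferenceJournal/Courses/"
    "PhysicsAsInformationProcessing_ChrisFields/"
)


def strip_prefix(html, prefix=PREFIX):
    """Remove every occurrence of the absolute course path from the HTML."""
    return html.replace(prefix, "")


def strip_prefix_in_file(path, prefix=PREFIX):
    """Rewrite the file at `path` in place with the prefix removed."""
    target = Path(path)
    cleaned = strip_prefix(target.read_text(encoding="utf-8"), prefix)
    target.write_text(cleaned, encoding="utf-8")
```

For example, `strip_prefix_in_file("all_transcripts.html")` would rewrite the pandoc output in the current directory.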


Database and WhisperX Tools 2024

Installation

Create Conda Environment

    conda create --name whisperx python=3.10
    conda activate whisperx

Install PyTorch CUDA 11.8

    conda install pytorch==2.0.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia

Install WhisperX

    pip install git+https://github.com/m-bain/whisperx.git
    pip install python-dotenv
    pip install mkl==2024.0  # downgrade to fix `libtorch_cpu.so: undefined symbol: iJIT_NotifyEvent`
    pip install surrealdb
    pip install pyytdata

Install ffmpeg

    wget -O - -q https://github.com/yt-dlp/FFmpeg-Builds/releases/download/latest/ffmpeg-master-latest-linux64-gpl.tar.xz | xz -qdc | tar -x

Setup .env file

Generate a Hugging Face token and accept the user agreement for the following models: Segmentation and Speaker-Diarization-3.1.

    cp .env.sample .env

Update the HUGGINGFACE_TOKEN value in .env with your token.

Get a YouTube Data API v3 key from https://console.developers.google.com/apis/ and update the API_KEY value in .env.
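Before starting a long transcription run, it can be useful to confirm that both values are actually set. The sketch below is a standard-library-only check and is an assumption about workflow, not what the repository's scripts do (they presumably load the file with python-dotenv, which is installed above). It handles only simple `KEY=VALUE` lines, and `hf_example` is a placeholder value.

```python
def read_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(values, required=("HUGGINGFACE_TOKEN", "API_KEY")):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


example = "# local settings\nHUGGINGFACE_TOKEN=hf_example\nAPI_KEY=\n"
print(missing_keys(read_env(example)))
```

In practice the text would come from the real file, for example `Path(".env").read_text()`.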

Start Database

    surreal start --log trace --user root --pass root --bind 0.0.0.0:8080 rocksdb:///mnt/md0/projects/Journal-Utilities/data/database

Ingest MP4 files into Database, Convert to WAV files

    cd src
    python ingest_db_create_wav.py
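The MP4-to-WAV conversion is typically delegated to ffmpeg. The sketch below builds such a command; it is not taken from ingest_db_create_wav.py, and the mono, 16 kHz output settings are assumptions (16 kHz is the sample rate Whisper models operate on). `wav_command` is a hypothetical helper.

```python
import subprocess
from pathlib import Path


def wav_command(mp4_path, sample_rate=16000):
    """Build an ffmpeg argument list that extracts mono WAV audio from a video.

    The WAV file is written next to the input with the same base name.
    """
    source = Path(mp4_path)
    target = source.with_suffix(".wav")
    return [
        "ffmpeg", "-y",          # overwrite the output if it already exists
        "-i", str(source),       # input video
        "-vn",                   # drop the video stream
        "-ac", "1",              # one audio channel (mono)
        "-ar", str(sample_rate), # output sample rate in Hz
        str(target),
    ]


def convert(mp4_path):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(wav_command(mp4_path), check=True)
```

Keeping command construction separate from execution makes the arguments easy to inspect or log before any file is touched.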

Run Transcribe

    cd src
    python transcribe.py

Query DB

    surreal sql --endpoint http://localhost:8080 --username root --password root --namespace actinf --database actinf

Upgrade DB

    sudo surreal upgrade
    surreal fix rocksdb://database

Acknowledgements

  • Initial Scripts 1 & 2, and initial README contributed by Dave Douglass, November 2022.
  • Initial Scripts 5 contributed by Holly Grimm @hollygrimm, December 2023.

Owner

  • Name: Active Inference Institute
  • Login: ActiveInferenceInstitute
  • Kind: user
  • Location: Online
  • Company: Active Inference Institute

http://activeinference.org/

GitHub Events

Total
  • Watch event: 2
  • Fork event: 1
Last Year
  • Watch event: 2
  • Fork event: 1

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 1
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 minute
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • hollygrimm (1)
Pull Request Authors
  • hollygrimm (1)