datapipe

An audio ETL pipeline for generating datasets from YouTube sources

https://github.com/projecte-aina/datapipe

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.4%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: projecte-aina
  • License: agpl-3.0
  • Language: Python
  • Default Branch: master
  • Size: 16.5 MB
Statistics
  • Stars: 7
  • Watchers: 2
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created almost 4 years ago · Last pushed about 1 year ago

README.md

About

Datapipe is a data processing pipeline that (currently) extracts audio clips from YouTube videos and generates two transcription candidates for each clip: one from a Vosk (Kaldi) model and one from a Wav2Vec2 model. The goal of the software is to ease the generation of datasets for ASR by automatically extracting and processing large audio sources.

Datapipe workflow

Datapipe

Setup cluster

Install k3s

```bash
curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644

# Check for a Ready node (takes about 30 seconds)
k3s kubectl get node

# Copy the k3s kubeconfig so that kubectl can use it
mkdir -p ~/.kube/ && sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config && sudo chown $USER:$USER ~/.kube/config && chmod 600 ~/.kube/config && export KUBECONFIG=~/.kube/config
```

Install kustomize

```bash
curl -s \
  "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash && \
  sudo mv kustomize /usr/local/bin/
```

Create namespace

```bash
kubectl create namespace assistent
```

Encode secret password

```bash
# Get BASE64 encoded password
echo -n "password123#$" | base64 -i -
```

Create the secret file (k8s/postgresql/secret.yaml) and paste the encoded password into it. We recommend keeping the POSTGRES_USER variable set to its default (datapipe).

```yml
apiVersion: v1
kind: Secret
metadata:
  namespace: assistent
  name: datapipe-db-secret
data:
  POSTGRES_USER: "ZGF0YXBpcGU="
  POSTGRES_PASSWORD: "cGFzc3dvcmQxMjMjJA=="
```
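The encoding step and the secret file can also be generated together. The following is a minimal sketch and is not part of datapipe: the helper names `b64` and `render_secret` are hypothetical, and it assumes the user key is spelled `POSTGRES_USER` alongside `POSTGRES_PASSWORD`. The password is the example value from this README and should be replaced with your own.

```python
import base64


def b64(value: str) -> str:
    """Base64-encode a UTF-8 string, like `echo -n VALUE | base64`."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def render_secret(name: str, namespace: str, values: dict) -> str:
    """Render a Kubernetes Secret manifest whose data values are base64-encoded."""
    lines = [
        "apiVersion: v1",
        "kind: Secret",
        "metadata:",
        f"  namespace: {namespace}",
        f"  name: {name}",
        "data:",
    ]
    for key, value in values.items():
        lines.append(f'  {key}: "{b64(value)}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Key names are an assumption; the password is the README's example value.
    manifest = render_secret(
        "datapipe-db-secret",
        "assistent",
        {"POSTGRES_USER": "datapipe", "POSTGRES_PASSWORD": "password123#$"},
    )
    print(manifest)
```

Redirecting the output to k8s/postgresql/secret.yaml would give a file with the same layout as the one shown above.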


Deployment

```bash
make deploy
```

Start using datapipe

Open a shell in any pod that was set up using the projecteaina/datapipe image (for example, one of the converter- or fetcher- pods):

```bash
kubectl -n assistent exec -it fetcher-YOUR_POD_ID -- bash
```

Using the CLI, add a new channel:

```bash
python -m cli add-channel https://www.youtube.com/user/gencat/
```
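Pod names carry a generated suffix, so the value to put in place of YOUR_POD_ID has to be looked up first. As a rough sketch (not part of datapipe), the helper below filters the output of `kubectl get pods -o name` by prefix. The function names are hypothetical, and it assumes `kubectl` is on your PATH and the pods live in the `assistent` namespace.

```python
import subprocess


def pods_with_prefix(kubectl_output: str, prefix: str) -> list:
    """Return pod names starting with `prefix` from `kubectl get pods -o name` output."""
    names = []
    for line in kubectl_output.splitlines():
        # `-o name` prints one resource per line, e.g. "pod/fetcher-abc123".
        name = line.strip()
        if name.startswith("pod/"):
            name = name[len("pod/"):]
        if name.startswith(prefix):
            names.append(name)
    return names


def list_pods(namespace: str) -> str:
    """Run kubectl and return its raw output; requires kubectl on PATH."""
    result = subprocess.run(
        ["kubectl", "-n", namespace, "get", "pods", "-o", "name"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


if __name__ == "__main__":
    for pod in pods_with_prefix(list_pods("assistent"), "fetcher-"):
        print(pod)
```

Any of the printed names can then be passed to `kubectl exec`.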

Setup development environment

Okteto allows you to develop inside a container. When you run okteto up your Kubernetes deployment is replaced by a development container that contains your development tools. Learn more about Okteto

Install okteto

```bash
curl https://get.okteto.com -sSfL | sh
```

If your cluster is not local, set the KUBECONFIG environment variable to the path of your kubeconfig file.

```bash
# Example for setting KUBECONFIG generated by goteleport to access remote cluster
export KUBECONFIG=${HOME?}/teleport-kubeconfig.yaml
```

If you are using a local cluster setup, run the following command instead:

```bash
mkdir -p ~/.kube/ && sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config && sudo chown $USER:$USER ~/.kube/config && chmod 600 ~/.kube/config && export KUBECONFIG=~/.kube/config
```

Select and start the development container

```bash
okteto up
```
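Before running okteto, it can help to confirm which kubeconfig file will be picked up. The sketch below is an illustration only and is not part of datapipe or Okteto. The function name `resolve_kubeconfig` is hypothetical. It mirrors the usual kubectl convention: use the first path listed in KUBECONFIG, otherwise fall back to ~/.kube/config.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_kubeconfig(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the first kubeconfig path from KUBECONFIG, or ~/.kube/config."""
    env = os.environ if env is None else env
    # KUBECONFIG may list several files separated by os.pathsep (":" on Linux).
    paths = [p for p in env.get("KUBECONFIG", "").split(os.pathsep) if p]
    if paths:
        return Path(paths[0]).expanduser()
    return Path.home() / ".kube" / "config"


if __name__ == "__main__":
    path = resolve_kubeconfig()
    status = "found" if path.is_file() else "missing"
    print(f"kubeconfig: {path} ({status})")
```

If the file is reported as missing, re-run the appropriate export or copy command above in the same shell.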

Authors

License

Licensed under the GNU Affero General Public License v3.0. Copy of the license

This tool was initially built by the community and its further development and maintenance is being funded by the Catalan Ministry of the Vice-presidency, Digital Policies and Territory of Generalitat within the framework of Projecte AINA.

Owner

  • Name: Projecte Aina
  • Login: projecte-aina
  • Kind: organization
  • Email: aina@bsc.es

Citation (CITATION.cff)

cff-version: 1.3.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "O'Reilly Ibañez"
  given-names: "Ciaran"
  email: "ciaran@oreilly.cat"
  website: "https://oreilly.cat"
- family-names: "Petrea"
  given-names: "Paul Andrei"
  email: "paul.petrea@bsc.es"
title: "Audio datapipe"
type: "software"
version: 1.0
date-released: 2022-05-05
license: AGPL-3.0
license-url: "https://www.gnu.org/licenses/agpl-3.0.en.html"
url: "https://github.com/projecte-aina/datapipe"

GitHub Events

Total
  • Watch event: 1
  • Push event: 23
  • Fork event: 1
Last Year
  • Watch event: 1
  • Push event: 23
  • Fork event: 1