datapipe
An audio ETL pipeline for generating datasets from YouTube sources
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: projecte-aina
- License: agpl-3.0
- Language: Python
- Default Branch: master
- Size: 16.5 MB
Statistics
- Stars: 7
- Watchers: 2
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
About
Datapipe is a data processing pipeline that (currently) extracts audio clips from YouTube videos and generates two transcription candidates, one with a Vosk (Kaldi) model and one with a Wav2Vec2 model. The goal of the software is to ease the generation of datasets for automatic speech recognition (ASR) by automatically extracting and processing large audio sources.
Datapipe workflow

Setup cluster
Install k3s
```bash
curl -sfL https://get.k3s.io | sh -s - --write-kubeconfig-mode 644

# Check for a Ready node (this takes about 30 seconds)
k3s kubectl get node

# Make the k3s kubeconfig available to kubectl
mkdir -p ~/.kube/ && sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config &&
sudo chown $USER:$USER ~/.kube/config && chmod 600 ~/.kube/config && export KUBECONFIG=~/.kube/config
```
Install kustomize
```bash
curl -s \
"https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash && \
sudo mv kustomize /usr/local/bin/
```
Create namespace
```bash
kubectl create namespace assistent
```
Encode secret password
```bash
# Get the base64-encoded password
echo -n "password123#$" | base64 -i -
```
Create the secret file (k8s/postgresql/secret.yaml) and paste in the encoded password. We recommend keeping the POSTGRES_USER variable set to its default (datapipe).
```yml
apiVersion: v1
kind: Secret
metadata:
  namespace: assistent
  name: datapipe-db-secret
data:
  POSTGRES_USER: "ZGF0YXBpcGU="
  POSTGRES_PASSWORD: "cGFzc3dvcmQxMjMjJA=="
```
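The values under data in a Kubernetes Secret must be base64-encoded. As a cross-check of the shell command above, the following minimal Python sketch encodes the same example credentials using only the standard library. The helper name encode_secret_value is hypothetical and not part of datapipe.

```python
import base64


def encode_secret_value(value: str) -> str:
    """Return the base64 encoding of a UTF-8 string, as used in a Secret's data field."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


# Example values taken from this README; replace them with your own credentials.
encoded_user = encode_secret_value("datapipe")
encoded_password = encode_secret_value("password123#$")

print(encoded_user)
print(encoded_password)
```

The two printed strings should match the values in the secret file above; if they differ, check that no trailing newline was included when the password was encoded.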
Deployment
```bash
make deploy
```
Start using datapipe
Open a shell in any pod that was set up using the projecteaina/datapipe image (for example, a converter- or fetcher- pod).
```bash
kubectl -n assistent exec -it fetcher-YOUR_POD_ID -- bash
```
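The pod ID in the command above differs for every deployment, so it has to be looked up first. One way to do that, sketched here under the assumption that kubectl is on the PATH and configured for the cluster, is to list pod names with `kubectl get pods -o name` and pick the first one with the wanted prefix. The function names pick_pod and find_pod are hypothetical helpers, not part of datapipe.

```python
import subprocess
from typing import Optional


def pick_pod(kubectl_output: str, prefix: str) -> Optional[str]:
    """Return the first pod name that starts with prefix, or None if there is none."""
    for line in kubectl_output.splitlines():
        name = line.strip()
        # `kubectl get pods -o name` prints entries such as "pod/fetcher-abc".
        if name.startswith("pod/"):
            name = name[len("pod/"):]
        if name.startswith(prefix):
            return name
    return None


def find_pod(prefix: str, namespace: str = "assistent") -> Optional[str]:
    """Ask kubectl for the pod names in a namespace and pick one by prefix."""
    result = subprocess.run(
        ["kubectl", "-n", namespace, "get", "pods", "-o", "name"],
        check=True,
        capture_output=True,
        text=True,
    )
    return pick_pod(result.stdout, prefix)
```

Calling `find_pod("fetcher-")` would then return a name that can be passed to `kubectl exec` in place of fetcher-YOUR_POD_ID.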
Add a new channel using the CLI
```bash
python -m cli add-channel https://www.youtube.com/user/gencat/
```
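To register several channels in one go, the CLI call above can be wrapped in a small loop. This is only a sketch: it assumes it runs inside a datapipe pod where `python -m cli` is available, and it assumes one URL per invocation, as in the example above. Whether the CLI accepts several URLs at once is not stated here. The helper names add_channel_command and add_channels are hypothetical.

```python
import subprocess
import sys
from typing import Iterable, List


def add_channel_command(channel_url: str) -> List[str]:
    """Build the argument list for one add-channel call."""
    return [sys.executable, "-m", "cli", "add-channel", channel_url]


def add_channels(channel_urls: Iterable[str]) -> None:
    """Run add-channel once per URL, raising on the first call that fails."""
    for url in channel_urls:
        subprocess.run(add_channel_command(url), check=True)
```

For example, `add_channels(["https://www.youtube.com/user/gencat/"])` runs the same command as the one shown above.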
Setup development environment
Okteto allows you to develop inside a container. When you run okteto up, your Kubernetes deployment is replaced by a development container that contains your development tools. Learn more about Okteto
Install okteto
```bash
curl https://get.okteto.com -sSfL | sh
```
If your cluster is not local, set the KUBECONFIG environment variable to the path of your kubeconfig file.
```bash
# Example: use a KUBECONFIG generated by goteleport to access a remote cluster
export KUBECONFIG=${HOME?}/teleport-kubeconfig.yaml
```
If you are using a local cluster, run the following command instead:
```bash
mkdir -p ~/.kube/ && sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config &&
sudo chown $USER:$USER ~/.kube/config && chmod 600 ~/.kube/config && export KUBECONFIG=~/.kube/config
```
Select and start development container
```bash
okteto up
```
Authors
License
Licensed under the GNU Affero General Public License v3.0. Copy of the license
This tool was initially built by the community, and its further development and maintenance are being funded by the Catalan Ministry of the Vice-presidency, Digital Policies and Territory of the Generalitat within the framework of Projecte AINA.
Owner
- Name: Projecte Aina
- Login: projecte-aina
- Kind: organization
- Email: aina@bsc.es
- Twitter: projecte_aina
- Repositories: 8
- Profile: https://github.com/projecte-aina
Citation (CITATION.cff)
```yml
cff-version: 1.3.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "O'Reilly Ibañez"
    given-names: "Ciaran"
    email: "ciaran@oreilly.cat"
    website: "https://oreilly.cat"
  - family-names: "Petrea"
    given-names: "Paul Andrei"
    email: "paul.petrea@bsc.es"
title: "Audio datapipe"
type: "software"
version: 1.0
date-released: 2022-05-05
license: APGL-3.0
license-url: "https://www.gnu.org/licenses/agpl-3.0.en.html"
url: "https://github.com/projecte-aina/datapipe"
```
GitHub Events
Total
- Watch event: 1
- Push event: 23
- Fork event: 1
Last Year
- Watch event: 1
- Push event: 23
- Fork event: 1