Repository

Basic Info
  • Host: GitHub
  • Owner: ho1447
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 361 KB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 2
  • Releases: 0
Created over 1 year ago · Last pushed about 1 year ago

README.md

Modular Speech Command Recognition System

Table of Contents

  1. Value Proposition
  2. Contributors
  3. System Diagram
  4. Summary of Outside Materials
  5. Summary of Infrastructure Requirements
  6. Detailed Design Plan
  7. Difficulty Points Achieved

Value Proposition

Voice-controlled interfaces are increasingly common in smart home devices, vehicles, and industrial machinery. Most systems today rely on proprietary cloud APIs like Google Assistant or Alexa, which introduce privacy risks, internet dependency, and latency. Our system improves on this by providing a cloud-native machine learning service that enables fast, customizable, and private speech command recognition.

We train and serve models on Chameleon Cloud, exposing a speech recognition API that can be used in existing smart systems. The system supports real-time command detection and is later adaptable for edge deployment.

Current non-ML status: Manual control interfaces, rule-based keyword spotting, or reliance on cloud APIs.

Business metrics: Recognition accuracy, latency per inference, and system responsiveness under noise.

Contributors

| Name | Responsible for | Link to their commits in this repo |
|------|-----------------|------------------------------------|
| All team members | Overall system architecture | |
| Vorrapard Kumthongdee | Model training | https://github.com/ho1447/ML-SysOpsProject/commits/main/?author=vorrapard |
| Iris Ho | Model serving and monitoring | https://github.com/ho1447/ML-SysOpsProject/commits/main/?author=ho1447 |
| Angelina Huang | Data pipeline | https://github.com/ho1447/ML-SysOpsProject/commits/main/?author=phh242 |
| Jay Roy | Continuous X pipeline | https://github.com/ho1447/ML-SysOpsProject/commits/main/?author=jayroy9825 |

System Diagram

System diagram

Summary of Outside Materials

| Material | How it was created | Conditions of use |
|----------|--------------------|-------------------|
| Speech Commands v2 (3.34 GB) | Created by Google, includes 105k+ WAV clips of spoken commands | Free for academic use |
| Background noise data | Packaged with SCv2 dataset for audio augmentation | Free for academic use |
| Wav2Vec2.0 (95M parameters) | Pretrained self-supervised model for audio embeddings (HuggingFace) | Apache 2.0 License |
| SpeechBrain | Open-source toolkit for speech processing (feature extraction, classification) | MIT License |

Summary of Infrastructure Requirements

| Requirement | How many/when | Justification |
|-------------|---------------|---------------|
| m1.medium VMs | 3 for entire project duration | Run API server, monitoring, preprocessing |
| gpu_mi100 | 4-hour block twice a week | Train models like Wav2Vec2.0 or CNN-based classifiers |
| Floating IPs | 1 for entire project duration, 1 for sporadic use | Expose API externally, test in canary/staging environments |
| Persistent Vols | 50 GB | Store dataset, processed features, model artifacts and logs |

Detailed Design Plan

Model Training and Training Platforms

  1. Strategy:

    • Use a three-part model:
      • Feature extraction (Mel spectrograms using torchaudio)
      • Noise classification model (CNN-based)
      • Speech command classification model (MobileNetV2 or Wav2Vec2.0)
    • Train on Google Speech Commands v2 with augmentation
    • Tune hyperparameters with Ray Tune
  2. Tools:

    • Ray Train for distributed training on Chameleon Cloud (instructions for running on Chameleon)
    • MLflow to track experiment runs and parameters
  3. Justification:

    • Enables modular updates and robust performance in noisy conditions
    • Scalable training supports model reuse or extension (e.g., multi-language)
  4. Course links:

    • Unit 4: Training at scale with Ray and augmentation
    • Unit 5: MLflow for experiment logging
    • Difficulty point: Ray Tune for HPO + multi-model setup
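
The augmentation step in the strategy above mixes background noise into clean command clips. A minimal, standard-library sketch of mixing at a chosen signal-to-noise ratio could look like the following. The function names (`rms`, `mix_at_snr`) and the synthetic signals are hypothetical illustrations, not the project's actual training code, which works on real `.wav` data.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Overlay noise on speech, scaling the noise to reach the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise clip must be at least as long as the speech clip")
    noise = noise[:len(speech)]
    speech_rms, noise_rms = rms(speech), rms(noise)
    if noise_rms == 0:
        return list(speech)  # silent noise clip: nothing to mix in
    # SNR_dB = 20 * log10(speech_rms / scaled_noise_rms), solved for the noise gain
    target_noise_rms = speech_rms / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    return [s + gain * n for s, n in zip(speech, noise)]


# Synthetic stand-ins for a one-second, 16 kHz command clip and a noise clip (assumed values)
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(16000)]
mixed = mix_at_snr(speech, noise, snr_db=10)
```

Sampling `snr_db` from a range per clip, rather than fixing it, is one common way to expose the classifier to varied noise levels.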

Model Serving and Monitoring Platforms

  1. Strategy:

    • Package models into a container and expose them via a FastAPI endpoint
    • Run inference with ONNX-optimized models on both a CPU server and an edge device (Raspberry Pi)
    • Compare latency and concurrency behavior
  2. Monitoring:

    • Log prediction confidence, input quality (signal-to-noise)
    • Use a dashboard to visualize misclassification trends and input stats
  3. Course links:

    • Unit 6: Serving via API and edge deployment for a low resource device with latency/concurrency monitoring
    • Unit 7: Log-based and live monitoring of performance
    • Difficulty point: ONNX and edge device deployment + dashboard for model degradation
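
The latency and concurrency comparison mentioned in the serving strategy can be sketched with the standard library alone. The harness below times any prediction callable at different concurrency levels; `fake_predict` is a hypothetical stand-in for a call to the FastAPI endpoint, and the percentile choice (p50, p95) is an assumption rather than something the project specifies.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(predict, payload):
    """Return the wall-clock seconds taken by a single prediction call."""
    start = time.perf_counter()
    predict(payload)
    return time.perf_counter() - start


def measure_latency(predict, payloads, concurrency):
    """Send all payloads through `predict` with a fixed number of worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda p: timed_call(predict, p), payloads))
    if not latencies:
        raise ValueError("at least one payload is required")

    def percentile(q):
        # Nearest-rank style lookup on the sorted latencies
        index = min(len(latencies) - 1, int(q * len(latencies)))
        return latencies[index]

    return {"count": len(latencies), "p50": percentile(0.50), "p95": percentile(0.95)}


def fake_predict(payload):
    """Hypothetical stand-in for an HTTP request to the inference endpoint."""
    time.sleep(0.005)
    return "yes"


for workers in (1, 4):
    print(workers, measure_latency(fake_predict, range(20), workers))
```

Running the same harness against the CPU deployment and the edge device would give directly comparable numbers.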

Data Pipeline

  1. Persistent storage:

    • Object storage bucket on CHI@TACC (21.96 GB): docker-compose-etl.yaml
      • speech_commands_v0.02
      • speech_commands_v0.02_processed
      • speech_commands_v0.02_processed_mel
      • speech_commands_test_set_v0.02
      • speech_commands_test_set_v0.02_processed
      • speech_commands_test_set_v0.02_processed_mel
    • Block storage volume on KVM@TACC (50 GB): docker-compose-block.yaml
      • Minio
      • Postgres
      • MLflow
      • Jupyter
      • Prometheus
      • Grafana
      • Label Studio
  2. Offline data:

    • Training dataset: speech_commands_v0.02_processed and speech_commands_v0.02_processed_mel
    • speech_commands_v0.02

      • Consists of one-second .wav audio files, each containing an English word spoken by different speakers
      • Crowdsourced by Google, where participants were prompted to say a specific command such as "yes", "no", "stop", etc.
      • Also includes realistic background audio files ("doing_the_dishes.wav", "running_tap.wav") which can be mixed into training data to simulate noisy environments
      • Dataset sample
      • A voice assistant such as Alexa could use data like this to wake the assistant, control media playback, etc.
  3. Data pipeline:

    • Retrieves the data from its original source and loads it into the object store: docker-compose-etl.yaml
      • extract-data
        • Downloads speech_commands_v0.02 and speech_commands_test_set_v0.02
        • Unzips speech_commands_v0.02 and speech_commands_test_set_v0.02
      • process-data
        • Normalizes the .wav audio files in speech_commands_v0.02 and speech_commands_test_set_v0.02
        • Overlays speech command audio files with background noise audio files, saving the results to:
          • speech_commands_v0.02_processed
          • speech_commands_test_set_v0.02_processed
        • Generates mel spectrograms for the processed audio files, saving the results to:
          • speech_commands_v0.02_processed_mel
          • speech_commands_test_set_v0.02_processed_mel
      • transform-data
        • Organizes speech_commands_v0.02_processed and speech_commands_v0.02_processed_mel into directories ("training", "validation", "evaluation") according to command labels
          • Assigns each file to a set based on a hash of its filename
          • Training:Validation:Evaluation = 8:1:1
      • load-data
        • Loads training data into the object store
  4. Online data:

    • Sends new data to the FastAPI inference endpoint during "production" use: onlinedatapipeline.py
      • Uses speech_commands_test_set_v0.02_processed as "new" data
      • Shuffles the file paths and sends the files to the FastAPI inference endpoint
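
The hash-based 8:1:1 split described in the transform-data step can be illustrated as follows. This is a sketch under assumptions: the README does not say which hash function or bucketing rule the pipeline uses, so SHA-1 over the full filename and a modulo-10 bucket are illustrative choices, and `assign_split` is a hypothetical name.

```python
import hashlib
from collections import Counter


def assign_split(filename):
    """Map a filename to 'training', 'validation', or 'evaluation' in roughly 8:1:1."""
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10  # stable bucket in 0..9 (assumed bucketing rule)
    if bucket < 8:
        return "training"
    if bucket == 8:
        return "validation"
    return "evaluation"


# Synthetic filenames, used only to show the resulting proportions
counts = Counter(assign_split(f"clip_{i}.wav") for i in range(10000))
print(counts)
```

Because the assignment depends only on the filename, a file stays in the same set across pipeline reruns, even when new files are added to the dataset.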

Continuous X

1. Selecting Site

The Modular-Speech Continuous X Pipeline sets up infrastructure predominantly on KVM@TACC using Chameleon Cloud. We start by selecting the site.

```python
from chi import server, context

context.version = "1.0"
context.choose_site(default="KVM@TACC")
```

This pipeline glues together the Model Training, Evaluation, Serving, and Data Operations components. The ultimate goal is rapid development-to-deployment cycles with iterative improvements—this is the Ops in MLOps.

We'll provision resources and install tooling through infrastructure-as-code:

  • Terraform: Manages our cloud infra declaratively.
  • Ansible: Installs Kubernetes and Argo ecosystem tools.
  • Argo CD: Enables GitOps-based continuous delivery.
  • Argo Workflows: Powers the container-native orchestration of our ML pipelines.

Start by cloning the infrastructure repository:

```bash
git clone --recurse-submodules https://github.com/ho1447/ML-SysOps_Project.git
```

2. Setup Environment

Install Terraform:

```bash
mkdir -p /work/.local/bin
wget https://releases.hashicorp.com/terraform/1.10.5/terraform_1.10.5_linux_amd64.zip
unzip -o -q terraform_1.10.5_linux_amd64.zip
mv terraform /work/.local/bin
rm terraform_1.10.5_linux_amd64.zip
export PATH=/work/.local/bin:$PATH
```

Prepare the path for additional tools:

```bash
export PATH=/work/.local/bin:$PATH
export PYTHONUSERBASE=/work/.local
```

Install Kubespray dependencies:

```bash
PYTHONUSERBASE=/work/.local pip install --user -r ./Modular-Speech/continuous_X/ansible/k8s/kubespray/requirements.txt
```

3. Provision Infrastructure with Terraform

Navigate to the Terraform config directory:

```bash
cd /work/Modular-Speech/continuous_X/tf/kvm/
export PATH=/work/.local/bin:$PATH
unset $(set | grep -o "^OS_[A-Za-z0-9_]*")
```

Initialize and apply configuration:

```bash
terraform init
export TF_VAR_suffix=speech_proj
export TF_VAR_key=id_rsa_chameleon_speech
terraform validate
terraform apply -auto-approve
```

4. Ansible for Configuration Management

Ensure your environment is ready:

```bash
export PATH=/work/.local/bin:$PATH
export PYTHONUSERBASE=/work/.local
```

Check connectivity:

```bash
ansible -i inventory.yml all -m ping
```

Run a hello-world test:

```bash
ansible-playbook -i inventory.yml general/hello_host.yml
```

5. Deploy Kubernetes

SSH and prepare Kubernetes installation:

```bash
cd /work/.ssh/
ssh-add id_rsa_chameleon_speech

cd /work/Modular-Speech/continuous_X/ansible
ansible-playbook -i inventory.yml pre_k8s/pre_k8s_configure.yml
```

Deploy Kubernetes with Kubespray:

```bash
cd ./k8s/kubespray
ansible-playbook -i ../inventory/mycluster --become --become-user=root ./cluster.yml
```

6. Argo CD for Application Deployment

Set up ArgoCD for platform services:

```bash
cd /work/.ssh
ssh-add id_rsa_chameleon_speech

cd /work/Modular-Speech/continuous_X/ansible
ansible-playbook -i inventory.yml argocd/argocd_add_platform.yml
```

Platform includes:

  • MinIO
  • MLflow
  • PostgreSQL
  • Label Studio
  • Grafana
  • Prometheus

Deploy the initial container image for Modular-Speech:

```bash
ansible-playbook -i inventory.yml argocd/workflow_build_init.yml
```

Deploy staging environment:

```bash
ansible-playbook -i inventory.yml argocd/argocd_add_staging.yml
```

Canary and production environments:

```bash
ansible-playbook -i inventory.yml argocd/argocd_add_canary.yml
ansible-playbook -i inventory.yml argocd/argocd_add_prod.yml
```

7. Model Lifecycle - Part 1

To manually trigger training and evaluation:

  • Use the train-model Argo Workflow template.
  • Provide the public IPs for:

    • training endpoint
    • evaluation endpoint
    • MLflow

Model training is triggered through a REST API call that returns a RUN_ID, which the workflow then uses to poll MLflow's API for the run's status.

The evaluation endpoint returns a model version, which is used to tag the container image.
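
The polling step can be sketched with the standard library. The example below queries MLflow's REST endpoint `GET /api/2.0/mlflow/runs/get` and waits for a terminal run status. The helper names, polling interval, and retry limit are assumptions for illustration; the project's Argo Workflow template may implement this differently.

```python
import json
import time
import urllib.parse
import urllib.request

# Run statuses after which MLflow will not update the run further
TERMINAL_STATUSES = {"FINISHED", "FAILED", "KILLED"}


def fetch_run_status(mlflow_url, run_id):
    """Ask the MLflow tracking server for the current status of one run."""
    query = urllib.parse.urlencode({"run_id": run_id})
    url = f"{mlflow_url.rstrip('/')}/api/2.0/mlflow/runs/get?{query}"
    with urllib.request.urlopen(url, timeout=10) as response:
        body = json.load(response)
    return body["run"]["info"]["status"]


def wait_for_run(run_id, get_status, interval_s=10.0, max_polls=360):
    """Poll `get_status(run_id)` until the run reaches a terminal status."""
    for _ in range(max_polls):
        status = get_status(run_id)
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(interval_s)
    raise TimeoutError(f"run {run_id} did not finish after {max_polls} polls")
```

A caller would pass something like `lambda rid: fetch_run_status(mlflow_base_url, rid)` as `get_status`, where `mlflow_base_url` is built from the MLflow public IP supplied to the workflow. Keeping the status lookup injectable also makes the loop easy to test without a live server.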

8. Model Lifecycle - Part 2

Progress through environments:

  • Staging: Test performance and integration.
  • Canary: Serve a subset of real users.
  • Production: Full rollout after validation.

To promote models:

```text
Argo Workflows > promote-model > Submit
```

This copies artifacts and builds new images for each environment using templates like build-container-image.yaml.

9. Teardown with Terraform

To remove infrastructure:

```bash
cd /work/Modular-Speech/continuous_X/tf/kvm
export TF_VAR_suffix=speech_proj
export TF_VAR_key=id_rsa_chameleon_speech
terraform destroy -auto-approve
```

Difficulty Points Achieved

We have satisfied 4 difficulty points across different units in our project proposal, ensuring our approach is robust, scalable, and aligned with the requirements.

Owner

  • Login: ho1447
  • Kind: user

Citation (CITATIONS.bib)

@article{speechcommandsv2,
   author = {{Warden}, P.},
    title = "{Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition}",
  journal = {ArXiv e-prints},
archivePrefix = "arXiv",
   eprint = {1804.03209},
 primaryClass = "cs.CL",
 keywords = {Computer Science - Computation and Language, Computer Science - Human-Computer Interaction},
     year = 2018,
    month = apr,
    url = {https://arxiv.org/abs/1804.03209},
}

GitHub Events

Total
  • Issues event: 2
  • Member event: 3
  • Push event: 72
  • Fork event: 1
  • Create event: 2
Last Year
  • Issues event: 2
  • Member event: 3
  • Push event: 72
  • Fork event: 1
  • Create event: 2