https://github.com/deeprec-ai/extension

DeepRec Extension is an easy-to-use, stable and efficient large-scale distributed training system based on DeepRec.

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.7%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: DeepRec-AI
  • License: apache-2.0
  • Language: C++
  • Default Branch: main
  • Homepage:
  • Size: 1.59 MB
Statistics
  • Stars: 11
  • Watchers: 7
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created about 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License

README.md

DeepRec Extension

Introduction

DeepRec Extension is an easy-to-use, stable and efficient large-scale distributed training system based on DeepRec.

Features

Auto-scaling

Large-scale distributed training tasks involve several roles, such as chief, ps, and worker. The native interfaces for distributed training require users to specify the number of instances and the resource allocation for each role, which is a significant challenge: it is difficult to configure these hyperparameters so that a training task achieves high resource utilization. In many scenarios, values that are too small lead to Out-of-Memory (OOM) errors, while values that are too large waste resources through over-allocation.

Some solutions achieve elastic training by stopping the job and restarting it from a checkpoint. This approach is intrusive to users: halting and resuming training consumes resources and increases the overall training time. The overhead cannot be overlooked, and it is particularly significant when a training task needs frequent elasticity adjustments. Restoring the model from the latest checkpoint also rolls back the model and the training samples, which wastes compute resources.

Dynamic Embedding Server (DES) scales PS nodes up and down without restarting the job. It redistributes parameters and adds or removes servers automatically.
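To make the idea of parameter redistribution concrete, here is a minimal sketch of how one might compute which embedding keys have to move when the number of PS nodes changes. It assumes simple modulo partitioning of integer keys; the README does not describe the partitioning scheme DES actually uses, and the names `assign` and `plan_redistribution` are hypothetical, not part of any DES API.

```python
from collections import defaultdict


def assign(key: int, num_ps: int) -> int:
    """Map an embedding key to a PS index (assumption: modulo partitioning)."""
    return key % num_ps


def plan_redistribution(keys, old_num_ps: int, new_num_ps: int) -> dict:
    """Return {(src_ps, dst_ps): [keys]} for every key whose owner changes."""
    moves = defaultdict(list)
    for key in keys:
        src = assign(key, old_num_ps)
        dst = assign(key, new_num_ps)
        if src != dst:
            moves[(src, dst)].append(key)
    return dict(moves)


if __name__ == "__main__":
    # Scale up from 2 to 3 PS nodes and list the keys that must be transferred.
    plan = plan_redistribution(range(12), old_num_ps=2, new_num_ps=3)
    for (src, dst), moved in sorted(plan.items()):
        print(f"ps{src} -> ps{dst}: {moved}")
```

In this toy run, 8 of the 12 keys change owner. That is the cost of naive modulo partitioning, and it illustrates why moving only the affected parameters while training continues is preferable to restarting the job and reloading everything.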

Gazer

Gazer is a metrics system for DeepRec/TensorFlow. It collects runtime machine load status and graph execution information and reports them to the Master node. The master uses these metrics to make elastic scaling decisions for tasks, and they can also be presented to users via the TensorBoard interface.
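As an illustration of how reported load metrics could drive a scaling decision, the sketch below applies a simple memory-utilization threshold rule. Everything in it is an assumption made for illustration: the `PsLoad` schema, the `decide_ps_count` function, and the threshold values are hypothetical and are not taken from Gazer or the master implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PsLoad:
    """One load sample reported for a PS node (hypothetical schema)."""
    node: str
    mem_used_gb: float
    mem_limit_gb: float


def decide_ps_count(samples: List[PsLoad], high: float = 0.85, low: float = 0.30) -> int:
    """Return +1 to add a PS node, -1 to remove one, 0 to keep the current count."""
    if not samples:
        return 0
    ratios = [s.mem_used_gb / s.mem_limit_gb for s in samples]
    if max(ratios) > high:
        # At least one node is close to its memory limit: scale up before OOM.
        return 1
    if len(samples) > 1 and max(ratios) < low:
        # Every node is mostly idle and more than one exists: scale down.
        return -1
    return 0


if __name__ == "__main__":
    report = [PsLoad("ps0", 9.0, 10.0), PsLoad("ps1", 2.0, 10.0)]
    print(decide_ps_count(report))
```

A real policy would likely smooth the metrics over time and consider more than memory, but the shape is the same: metrics in, a scaling action out.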

Fast-fault-tolerance

With existing checkpoint mechanisms, an unexpected PS node failure requires restarting the entire task and reverting the model to the previous checkpoint. This discards all training progress made between that checkpoint and the node failure. Moreover, inconsistencies between the rollback of the distributed model and the rollback of the samples cause the additional problem of sample loss.

First, we keep the samples and the model consistent by means of an extra checkpoint. Second, when a PS node crashes, we restart only that PS node instead of restarting the whole job. Finally, we back up the model parameters to enable rapid recovery from PS node failures.
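The second and third points can be sketched with a toy in-memory parameter store: each PS shard has a backup copy, and a crashed shard is restored from that backup while the other shards are left untouched. This is a simplified illustration under stated assumptions; the class `ShardedParams` and its methods are hypothetical and do not reflect the actual tf-fault-tolerance implementation.

```python
import copy


class ShardedParams:
    """Toy parameter store: one dict per PS node plus a backup copy of each."""

    def __init__(self, num_ps: int):
        self.primary = [dict() for _ in range(num_ps)]
        self.backup = [dict() for _ in range(num_ps)]

    def update(self, ps: int, name: str, value: float) -> None:
        """Write a parameter value on one PS node."""
        self.primary[ps][name] = value

    def sync_backup(self, ps: int) -> None:
        """Copy the current state of one PS node to its backup."""
        self.backup[ps] = copy.deepcopy(self.primary[ps])

    def crash(self, ps: int) -> None:
        """Simulate an unexpected PS failure: its in-memory state is lost."""
        self.primary[ps] = {}

    def recover(self, ps: int) -> None:
        """Restart only the failed node and reload it from its backup."""
        self.primary[ps] = copy.deepcopy(self.backup[ps])


if __name__ == "__main__":
    store = ShardedParams(num_ps=2)
    store.update(0, "w", 1.0)
    store.update(1, "v", 2.0)
    store.sync_backup(0)
    store.crash(0)
    store.recover(0)
    print(store.primary)
```

Updates made after the last backup sync are still lost on a crash, so the backup interval bounds how much work a single-node recovery can discard.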

Master Controller

TensorFlow training tasks lack a task-level master node that manages the state of all the functionality described above. Taking into account the cloud-native Kubernetes (K8S) resource scheduling ecosystem, we have extended TFJob in Kubeflow by adding a CRD with master capabilities.

How to build

  1. Clone the extension source code and initialize the submodules

```shell
git clone git@github.com:DeepRec-AI/extension.git /workspace/extension && cd /workspace/extension
git submodule update --init --recursive
```

  2. Start the container

```shell
docker run -ti --name deeprec-extension-dev --net=host -v /workspace:/workspace alideeprec/extension-dev:cpu-py36-ubuntu18.04 bash
```

  3. Build all Python wheel modules

```shell
cd /workspace/extension
make gazer des master tft -j32
```

How to deploy

Prerequisites

  1. Go (golang) version >= 1.20.12

  2. kubectl client (see the Kubernetes guide "Install kubectl on Linux")

Installation

  1. Install and configure the kubectl client. By default, kubectl reads its configuration from:

```shell
$HOME/.kube/config
```

  2. Deploy kubeflow-operator

```shell
# clone kubeflow source code
git clone git@github.com:kubeflow/training-operator.git /workspace/training-operator && cd /workspace/training-operator

# install kubeflow CRD
make install

# deploy kubeflow image with v1.7.0
make deploy IMG=kubeflow/training-operator:v1-5525468
```

  3. Build and deploy the aimaster-operator image

```shell
cd /workspace/extension/aimaster_operator/

# install aimaster-operator CRD
make install

# build aimaster-operator image
# {image} is the aimaster-operator image name
make docker-build IMG={image}

# push image to YOUR dockerhub
# such as: Alibaba Container Registry (ACR), https://cr.console.aliyun.com
make docker-push IMG={image}

# deploy aimaster-operator image to k8s
make deploy IMG={image}
```

  4. Build the aimaster image

```shell
cd /workspace/extension

# build aimaster image
make master-build-docker

# push image to YOUR dockerhub
# such as: Alibaba Container Registry (ACR), https://cr.console.aliyun.com
# docker push accepts a single image name, so tag the built image first
docker tag {image} {namespace/image:tag}
docker push {namespace/image:tag}
```

  5. Build the deeprec-extension image

```shell
cd /workspace/extension

# build deeprec-extension image
bash tools/examples/build_docker.sh

# push image to YOUR dockerhub
# such as: Alibaba Container Registry (ACR), https://cr.console.aliyun.com
# docker push accepts a single image name, so tag the built image first
docker tag {image} {namespace/image:tag}
docker push {namespace/image:tag}
```

How to use

```shell
kubectl apply -f tools/examples/extension_test.yaml
```

Latest Images

aimaster-operator

```shell
alideeprec/extension-operator-release:latest
```

aimaster

```shell
alideeprec/extension-aimaster-release:latest
```

extension

Contains: deeprec + estimator + gazer + des + tf-fault-tolerance + deeprec-master

```shell
alideeprec/extension-release:latest
```

Example on ACK

  1. Create an ACK (Alibaba Cloud Container Service for Kubernetes) cluster

  2. Deploy kubeflow-operator

  3. Deploy aimaster-operator

  4. Execute the training job

```shell
kubectl apply -f tools/examples/extension_test.yaml
```

Estimator sample code

train.py

Owner

  • Name: DeepRec-AI
  • Login: DeepRec-AI
  • Kind: organization

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

aimaster_operator/Dockerfile docker
  • gcr.io/distroless/static nonroot build
  • golang 1.16 build
deeprec_master/tools/Dockerfile docker
  • alideeprec/extension-dev aimaster-base-py36-ubuntu18.04 build
tools/dockerfiles/Dockerfile docker
  • alideeprec/extension-dev cpu-py36-ubuntu18.04 build
aimaster_operator/go.mod go
  • cloud.google.com/go v0.54.0
  • github.com/Azure/go-autorest v14.2.0+incompatible
  • github.com/Azure/go-autorest/autorest v0.11.18
  • github.com/Azure/go-autorest/autorest/adal v0.9.13
  • github.com/Azure/go-autorest/autorest/date v0.3.0
  • github.com/Azure/go-autorest/logger v0.2.1
  • github.com/Azure/go-autorest/tracing v0.6.0
  • github.com/PuerkitoBio/purell v1.1.1
  • github.com/PuerkitoBio/urlesc v0.0.0-20170810143723-de5bf2ad4578
  • github.com/beorn7/perks v1.0.1
  • github.com/cespare/xxhash/v2 v2.1.1
  • github.com/davecgh/go-spew v1.1.1
  • github.com/emicklei/go-restful v2.9.5+incompatible
  • github.com/evanphx/json-patch v4.11.0+incompatible
  • github.com/form3tech-oss/jwt-go v3.2.3+incompatible
  • github.com/fsnotify/fsnotify v1.4.9
  • github.com/go-logr/logr v0.4.0
  • github.com/go-logr/zapr v0.4.0
  • github.com/go-openapi/jsonpointer v0.19.5
  • github.com/go-openapi/jsonreference v0.19.5
  • github.com/go-openapi/spec v0.20.3
  • github.com/go-openapi/swag v0.19.14
  • github.com/gogo/protobuf v1.3.2
  • github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da
  • github.com/golang/protobuf v1.5.2
  • github.com/google/go-cmp v0.5.5
  • github.com/google/gofuzz v1.1.0
  • github.com/google/uuid v1.1.2
  • github.com/googleapis/gnostic v0.5.5
  • github.com/imdario/mergo v0.3.12
  • github.com/josharian/intern v1.0.0
  • github.com/json-iterator/go v1.1.11
  • github.com/kubeflow/common v0.4.1
  • github.com/kubeflow/training-operator v1.4.0
  • github.com/mailru/easyjson v0.7.6
  • github.com/matttproud/golang_protobuf_extensions v1.0.2-0.20181231171920-c182affec369
  • github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd
  • github.com/modern-go/reflect2 v1.0.1
  • github.com/nxadm/tail v1.4.8
  • github.com/onsi/ginkgo v1.16.5
  • github.com/onsi/gomega v1.18.1
  • github.com/pkg/errors v0.9.1
  • github.com/prometheus/client_golang v1.11.0
  • github.com/prometheus/client_model v0.2.0
  • github.com/prometheus/common v0.26.0
  • github.com/prometheus/procfs v0.6.0
  • github.com/spf13/pflag v1.0.5
  • go.uber.org/atomic v1.7.0
  • go.uber.org/multierr v1.6.0
  • go.uber.org/zap v1.19.0
  • golang.org/x/crypto v0.0.0-20210220033148-5ea612d1eb83
  • golang.org/x/net v0.0.0-20211209124913-491a49abca63
  • golang.org/x/oauth2 v0.0.0-20200107190931-bf48bf16ab8d
  • golang.org/x/sys v0.0.0-20211216021012-1d35b9e2eb4e
  • golang.org/x/term v0.0.0-20210220032956-6a3ed077a48d
  • golang.org/x/text v0.3.6
  • golang.org/x/time v0.0.0-20210723032227-1f47c861a9ac
  • gomodules.xyz/jsonpatch/v2 v2.2.0
  • google.golang.org/appengine v1.6.7
  • google.golang.org/protobuf v1.26.0
  • gopkg.in/inf.v0 v0.9.1
  • gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7
  • gopkg.in/yaml.v2 v2.4.0
  • gopkg.in/yaml.v3 v3.0.0-20210107192922-496545a6307b
  • k8s.io/api v0.22.6
  • k8s.io/apiextensions-apiserver v0.22.2
  • k8s.io/apimachinery v0.22.6
  • k8s.io/client-go v0.22.6
  • k8s.io/component-base v0.22.2
  • k8s.io/klog/v2 v2.9.0
  • k8s.io/kube-openapi v0.0.0-20200805222855-6aeccd4b50c6
  • k8s.io/utils v0.0.0-20210819203725-bdf08cb9a70a
  • sigs.k8s.io/controller-runtime v0.10.3
  • sigs.k8s.io/structured-merge-diff/v4 v4.2.1
  • sigs.k8s.io/yaml v1.2.0
aimaster_operator/go.sum go
  • 996 dependencies
deeprec_master/setup.py pypi
dynamic_embedding_server/setup.py pypi
gazer/setup.py pypi
tf_fault_tolerance/setup.py pypi