Caliban

Caliban: Docker-based job manager for reproducible workflows - Published in JOSS (2020)

https://github.com/google/caliban

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file (found codemeta.json file)
✓ .zenodo.json file (found .zenodo.json file)
✓ DOI references (found 4 DOI reference(s) in README and JOSS metadata)
✓ Academic publication links (links to: joss.theoj.org)
○ Committers with academic emails
○ Institutional organization owner
✓ JOSS paper metadata (published in Journal of Open Source Software)

Keywords

ai-platform docker google-cloud python3 research-tool

Scientific Fields

Earth and Environmental Sciences Physical Sciences - 40% confidence

Last synced: 6 months ago

Repository

Research workflows made easy, locally and in the Cloud.

Basic Info

Host: GitHub
Owner: google
License: apache-2.0
Language: Python
Default Branch: main
Homepage: https://caliban.readthedocs.io
Size: 2.35 MB

Statistics

Stars: 501
Watchers: 18
Forks: 67
Open Issues: 25
Releases: 6

Topics

ai-platform docker google-cloud python3 research-tool

Created over 5 years ago · Last pushed over 1 year ago

Metadata Files

Readme Changelog Contributing License Codemeta

Caliban

Caliban is a tool that helps researchers launch and track their numerical experiments in an isolated, reproducible computing environment. It was developed by machine learning researchers and engineers, and makes it easy to go from a simple prototype running on a workstation to thousands of experimental jobs running on Cloud.

With Caliban, you can:

Develop your experimental code locally and test it inside an isolated (Docker) environment
Easily sweep over experimental parameters
Submit your experiments as Cloud jobs, where they will run in the same isolated environment
Control and keep track of jobs

Quickstart

Install Docker, make sure it's running, then install Caliban (you'll need Python >= 3.6):

```bash
pip install caliban
```

Train a simple deep learning model on your local machine:

```bash
git clone https://github.com/google/caliban.git && cd caliban/tutorials/basic
caliban run --nogpu mnist.py
```

Sweep over learning rates to find the best one (flags are specified in JSON format):

```bash
echo '{"learning_rate": [0.01, 0.001, 0.0001]}' | caliban run --experiment_config stdin --nogpu mnist.py
```

Next:

See how to submit the experiment to Cloud and use other Caliban features in "Getting Started with Caliban"
See Installation for detailed installation instructions
Read the Command Overview for info on Caliban commands.

Full documentation for Caliban lives at Read The Docs.

Dramatic Interlude

> “Be not afeard; the isle is full of noises, \
> Sounds, and sweet airs, that give delight and hurt not. \
> Sometimes a thousand twangling instruments \
> Will hum about mine ears; and sometime voices, \
> That, if I then had waked after long sleep, \
> Will make me sleep again: and then, in dreaming, \
> The clouds methought would open, and show riches \
> Ready to drop upon me; that, when I waked, \
> I cried to dream again.”
>
> -- Shakespeare, The Tempest

Installation and Prerequisites

Caliban's prerequisites are Docker and Python >= 3.6.

Make sure your Python is up to date:

```bash
$ python --version
Python 3.6.9 # should be >=3.6.0
```

If not, visit "Installing Python 3.6" before proceeding.

Next, install Caliban via pip:

```bash
pip install -U caliban
```

Check that your installation worked by navigating to an empty folder and running `caliban --help`. You should see the usage dialogue:

```bash
$ caliban --help
usage: caliban [-h] [--helpfull] [--version] {shell,notebook,build,run,cloud,cluster,status,stop,resubmit} ...
```

Docker

Caliban executes your code inside a "container", managed by Docker. To get Docker:

On macOS, follow the installation instructions at Docker Desktop and start the newly installed Docker Desktop application.
On Linux, visit the Docker installation instructions. (It's important that you configure sudo-less Docker and that Docker is running on your machine.)

Make sure Docker is correctly installed, configured and running by executing the following command:

```bash
docker run hello-world
```

You should see output that looks like this:

```text
...
Hello from Docker!
This message shows that your installation appears to be working correctly.
...
```
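Both prerequisites can also be checked from a script before you install Caliban. The following is a minimal sketch, not part of Caliban: the function name `check_prerequisites` is hypothetical, and the Docker check only confirms that a `docker` executable is on your `PATH`, not that the daemon is running (the `docker run hello-world` command above remains the real test).

```python
import shutil
import sys

# Minimum Python version stated in this README.
MIN_PYTHON = (3, 6)


def check_prerequisites():
    """Return a dict saying whether each prerequisite looks satisfied."""
    return {
        # Compare only (major, minor) against the documented minimum.
        "python": sys.version_info[:2] >= MIN_PYTHON,
        # shutil.which finds a `docker` executable on PATH; it does not
        # prove that the Docker daemon is actually running.
        "docker": shutil.which("docker") is not None,
    }


for name, ok in check_prerequisites().items():
    print(f"{name}: {'ok' if ok else 'missing or too old'}")
```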

Python 3.6

Make sure your Python version is up to date:

```bash
$ python --version
Python 3.6.9 # should be >=3.6.0
```

If you need to upgrade:

On macOS, install the latest Python version from python.org.
On Linux, run `sudo apt-get update && sudo apt-get install python3.7`.

Cloud Submission and GPUs

Caliban's Read the Docs documentation has instructions on:

Installing the nvidia-docker2 runtime, so you can use Caliban to run jobs that use your Linux machine's GPU.
Setting up a Google Cloud account so you can submit your code to Google's Cloud AI Platform with caliban cloud.

Getting Started with Caliban

In this section we will use Caliban to train an image classification network (implemented in TensorFlow). We will:

Train a neural network on the local machine
Increase the model's accuracy by changing the learning rate with a command-line flag
Sweep across a range of learning rates with Caliban's experiment broadcasting feature
Train the model in the Cloud on Google's AI Platform
Develop code interactively using caliban shell in the exact same environment.

Preparing your Project

Create an empty directory and use curl to download a Python script that trains a basic neural network.

```bash
mkdir demo && cd demo
curl --output mnist.py https://raw.githubusercontent.com/google/caliban/main/tutorials/basic/mnist.py
```

Create a file called requirements.txt to declare tensorflow-cpu as a dependency:

```bash
echo "tensorflow-cpu" > requirements.txt
```

Caliban will automatically make any entry in requirements.txt available when you run your code. See "Declaring Requirements" for more information.

Training the Network

Run this command to train your first ML model:

```bash
caliban run --nogpu mnist.py
```

You should see a stream of output ending in this:

```text
Training model with learning rate=0.1 for 3 epochs.
Epoch 1/3
1875/1875 - 3s - loss: 2.0989 - accuracy: 0.2506
Epoch 2/3
1875/1875 - 3s - loss: 1.9222 - accuracy: 0.2273
Epoch 3/3
1875/1875 - 3s - loss: 2.0777 - accuracy: 0.1938
Model performance:
313/313 - 0s - loss: 2.0973 - accuracy: 0.1858
```

Your model was able to recognize digits from the MNIST dataset with 18.58% accuracy. Can we do better?

Improving the Model

The default learning rate is 0.1. Run the code again with a smaller learning rate by passing a command-line flag, separated from your original command by --:

```bash
$ caliban run --nogpu mnist.py -- --learning_rate 0.01

Training model with learning rate=0.01 for 3 epochs.
Epoch 1/3
1875/1875 - 4s - loss: 0.2676 - accuracy: 0.9221
Epoch 2/3
1875/1875 - 4s - loss: 0.1863 - accuracy: 0.9506
Epoch 3/3
1875/1875 - 4s - loss: 0.1567 - accuracy: 0.9585
Model performance:
313/313 - 0s - loss: 0.1410 - accuracy: 0.9642
```

96% accuracy! Much better! Can we do better still?
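Everything after the `--` separator is handed to your script rather than to Caliban. As a rough illustration of the script side, here is a minimal sketch of a program that accepts a `--learning_rate` flag. It uses `argparse` and a hypothetical `parse_args` helper; it is not the contents of the tutorial's `mnist.py`, which may parse its flags differently. The default of 0.1 mirrors the default mentioned above.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that arrive after the `--` separator."""
    parser = argparse.ArgumentParser(description="Toy training script")
    # Default mirrors the learning rate the tutorial reports as its default.
    parser.add_argument("--learning_rate", type=float, default=0.1)
    return parser.parse_args(argv)


# Simulate the arguments from `... mnist.py -- --learning_rate 0.01`.
args = parse_args(["--learning_rate", "0.01"])
print(f"Training model with learning rate={args.learning_rate}")
```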

Experiment Broadcasting

Caliban's experiment broadcasting feature will allow us to run many jobs with different sets of arguments.

Create a file called experiment.json with a JSON dictionary of the format {"flag_name": ["list", "of", "values"]}:

```bash
echo '{"learning_rate": [0.01, 0.001, 0.0001]}' > experiment.json
```

Pass the config with --experiment_config and run again:

```bash
caliban run --experiment_config experiment.json --nogpu mnist.py
```

You should see accuracies of roughly 0.9493, 0.9723 and 0.9537. Looks like 0.001 is a nice choice.
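To make the idea of broadcasting concrete, the sketch below expands a config of the `{"flag_name": [values]}` form into one set of command-line flags per job. This is an illustration only, not Caliban's implementation: `expand_config` and `to_flags` are hypothetical names, and the Cartesian-product treatment of several list-valued flags is an assumption that this README does not spell out.

```python
import itertools
import json


def expand_config(config):
    """Expand {"flag": [v1, v2, ...]} into one dict per job."""
    keys = list(config)
    # Treat a scalar value as a single-element list.
    value_lists = [v if isinstance(v, list) else [v] for v in config.values()]
    # Assumption: multiple list-valued flags combine as a Cartesian product.
    return [dict(zip(keys, combo)) for combo in itertools.product(*value_lists)]


def to_flags(job):
    """Turn one job dict into a flat list of command-line flags."""
    flags = []
    for name, value in job.items():
        flags += [f"--{name}", str(value)]
    return flags


config = json.loads('{"learning_rate": [0.01, 0.001, 0.0001]}')
for job in expand_config(config):
    print(to_flags(job))
```

With the `experiment.json` from this section, the sketch yields three jobs, one per learning rate, which matches the three accuracies reported above.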

Submitting to Cloud AI Platform

Now it's time to submit the job to Cloud AI Platform.

(NOTE: This section requires a Google Cloud account. You can create a free account with $300 of credit to get started. Follow Caliban's "Getting Started with Google Cloud" documentation, then come back here to proceed.)

Submit the job to AI Platform by changing the word run to cloud:

```bash
caliban cloud --nogpu mnist.py -- --learning_rate 0.01
```

You should see output like this:

```bash
I0615 19:57:43.354172 4563361216 core.py:161] Job 1 - jobId: calibantotoro1, image: gcr.io/research-3141/974a776e6037:latest
I0615 19:57:43.354712 4563361216 core.py:161] Job 1 - Accelerator: {'count': 0, 'type': 'ACCELERATOR_TYPE_UNSPECIFIED'}, machine: 'n1-highcpu-32', region: 'us-central1'
I0615 19:57:43.355082 4563361216 core.py:161] Job 1 - Experiment arguments: ['--learning_rate', '0.01']
I0615 19:57:43.355440 4563361216 core.py:161] Job 1 - labels: {'gpuenabled': 'false', 'tpuenabled': 'false', 'jobname': 'calibantotoro', 'learning_rate': '0_01'}

I0615 19:57:43.356621 4563361216 core.py:324] Submitting request!
I0615 19:57:45.078382 4563361216 core.py:97] Request for job 'calibantotoro202006151957431' succeeded!
I0615 19:57:45.078989 4563361216 core.py:98] Job URL: https://console.cloud.google.com/ai-platform/jobs/calibantotoro202006151957431?projectId=totoro-project
I0615 19:57:45.079524 4563361216 core.py:100] Streaming log CLI command: $ gcloud ai-platform jobs stream-logs calibantotoro202006151957431
Submitting calibantotoro1: 100%|########################################| 1/1 [00:02<00:00, 2.65s/requests]
I0615 19:57:45.405600 4563361216 core.py:673]
I0615 19:57:45.405819 4563361216 core.py:676] Visit https://console.cloud.google.com/ai-platform/jobs/?projectId=research-3141 to see the status of all jobs.
I0615 19:57:45.405959 4563361216 core.py:677]
```

This output means that Caliban has:

Built a Docker container with all of your code
Pushed that container up to Google Cloud's Container Registry
Submitted the job to AI Platform

You can now visit the link in the output that looks like: https://console.cloud.google.com/ai-platform/jobs/calibantotoro202006151957431?projectId=totoro-project to see all of your job's logs.
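If you capture Caliban's output in a script, the job name can be pulled out of the "Request for job ... succeeded!" line and reused with the `gcloud ai-platform jobs stream-logs` command that the log itself suggests. The following is a minimal sketch under the assumption that the log line keeps the format shown above; `stream_logs_command` is a hypothetical helper, not a Caliban function.

```python
import re

# Matches the success line shown in the sample output above.
JOB_PATTERN = re.compile(r"Request for job '([^']+)' succeeded!")


def stream_logs_command(log_text):
    """Build the gcloud log-streaming command for the first job found."""
    match = JOB_PATTERN.search(log_text)
    if match is None:
        return None  # No successful submission found in this text.
    return ["gcloud", "ai-platform", "jobs", "stream-logs", match.group(1)]


sample = (
    "I0615 19:57:45.078382 4563361216 core.py:97] "
    "Request for job 'calibantotoro202006151957431' succeeded!"
)
print(stream_logs_command(sample))
```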

Why do I need Cloud?

With Google Cloud, you can use on-demand GPUs and TPUs and train models on large datasets at very high speeds. You can also customize the machine type that AI Platform uses to run your job. You might need high memory or more CPU, for example.

See Caliban's "Customizing Machines and GPUs" for more information.

Interactive Development with `caliban shell`

caliban shell lets you develop code interactively inside of the exact same environment that your code will have available, locally during caliban run or in the Cloud with caliban cloud.

Run the following command to activate the shell:

```bash
caliban shell --nogpu
```

You should see Caliban's terminal:

```
I0611 12:33:17.551121 4500135360 docker.py:911] Running command: docker run --ipc host -w /usr/app -u 735994:89939 -v /Users/totoro/code/example:/usr/app -it --entrypoint /bin/bash -v /Users/totoro:/home/totoro ab8a7d7db868

[Caliban ASCII-art banner]

You are running caliban shell as user with ID 735994 and group 89939,
which should map to the ID and group for your user on the Docker host. Great!

[totoro@6a9b28990757 /usr/app]$
```

You're now living in an isolated Docker container with your tensorflow-cpu dependency available (and any others you've declared).

Run the python command and check that tensorflow is installed:

```bash
$ python
Python 3.6.9 (default, Nov 7 2019, 10:44:02)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import tensorflow as tf
>>> tf.__version__
'2.2.0'
```

Your home directory and the folder where you ran the command are both mounted into this isolated environment, so any changes you make to either of those directories will be reflected immediately.
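The `docker run` line logged when the shell starts shows how these mounts are expressed: two `-v` arguments for the project folder and the home directory, plus `-u` for your user and group IDs. As a rough illustration, the sketch below rebuilds a similar argument list. It is not Caliban's code: `shell_docker_args` is a hypothetical name, the container paths are copied from the sample log, and `os.getuid`/`os.getgid` are only available on Unix-like systems.

```python
import os
from pathlib import Path


def shell_docker_args(image_id, workdir="/usr/app"):
    """Assemble docker arguments resembling the logged caliban shell command."""
    home = Path.home()
    return [
        "docker", "run", "--ipc", "host",
        "-w", workdir,
        # Run as the host user so files created in the mounts keep your ownership.
        "-u", f"{os.getuid()}:{os.getgid()}",
        # Live-mount the current project folder into the working directory.
        "-v", f"{Path.cwd()}:{workdir}",
        "-it", "--entrypoint", "/bin/bash",
        # Mount the home directory under /home/<name>, as in the sample log.
        "-v", f"{home}:/home/{home.name}",
        image_id,
    ]


print(" ".join(shell_docker_args("ab8a7d7db868")))
```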

Any code you add to the current folder and edit on your computer will be available in this special Caliban shell. Run the example from before like this:

```bash
python mnist.py --learning_rate 0.01
```

If your code runs in caliban shell, you can be almost certain that your code will execute in a Cloud environment, with potentially many GPUs attached and much larger machines available.

What next?

Read the Overview for more information on Caliban's subcommands, then head over to Caliban's documentation site and check out the links on the sidebar.

If you find anything confusing, please feel free to create an issue on our GitHub Issues page, and we'll get you sorted out.

Command Overview

Caliban provides seven subcommands that you run inside some project directory on your machine:

caliban shell generates a Docker image containing any dependencies you've declared in a requirements.txt and/or setup.py in the directory and opens an interactive shell in that directory. The caliban shell environment is nearly identical to the environment that will be available to your code when you submit it to AI Platform; the difference is that your current directory is live-mounted into the container, so you can develop interactively.
caliban notebook starts a Jupyter notebook or lab instance inside of a Docker image containing your dependencies; the guarantee about an environment identical to AI Platform applies here as well.
caliban run packages your directory's code into the Docker image and executes it locally using docker run. If you have a GPU, the instance will attach to it by default - no need to install the CUDA toolkit. The Docker environment takes care of all that. This environment is truly identical to the AI Platform environment. The Docker image that runs locally is the same image that will run in AI Platform.
caliban cloud allows you to submit jobs to AI Platform that will run inside the same Docker image you used with caliban run. You can submit hundreds of jobs at once. Any machine type, GPU count, and GPU type combination you specify will be validated client side, so you'll see an immediate error with suggestions, rather than having to debug by submitting jobs over and over.
caliban build builds the Docker image used in caliban cloud and caliban run without actually running the container or submitting any code.
caliban cluster creates GKE clusters and submits jobs to GKE clusters.
caliban status displays information about all jobs submitted by Caliban, and makes it easy to interact with large groups of experiments. Use caliban status when you need to cancel pending jobs, or re-build a container and resubmit a batch of experiments after fixing a bug.
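Because the tutorial's `caliban run` and `caliban cloud` invocations differ only in the subcommand, a small helper can build either command from the same inputs. This is a minimal sketch with a hypothetical `caliban_command` function; it uses only the flags shown in this README (`--nogpu`, `--experiment_config`, and the `--` separator), and this README only demonstrates `--experiment_config` with `caliban run`. It builds the argument list without executing anything.

```python
import shlex


def caliban_command(mode, script, script_args=(), experiment_config=None, gpu=False):
    """Build an argument list for `caliban run` or `caliban cloud`."""
    if mode not in ("run", "cloud"):
        raise ValueError(f"unsupported mode: {mode!r}")
    cmd = ["caliban", mode]
    if experiment_config is not None:
        cmd += ["--experiment_config", experiment_config]
    if not gpu:
        cmd.append("--nogpu")
    cmd.append(script)
    if script_args:
        # Everything after `--` is passed through to the script itself.
        cmd += ["--", *script_args]
    return cmd


local = caliban_command("run", "mnist.py", ["--learning_rate", "0.01"])
cloud = caliban_command("cloud", "mnist.py", ["--learning_rate", "0.01"])
for cmd in (local, cloud):
    print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to `subprocess.run` once Caliban is installed.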

Disclaimer

This is a research project, not an official Google product. Expect bugs and sharp edges. Please help by trying out Caliban, reporting bugs, and letting us know what you think!

Get Involved + Get Support

Pull requests and bug reports are always welcome! Check out our Contributor's Guide for information on how to get started contributing to Caliban.

The TL;DR is:

send us a pull request,
iterate on the feedback + discussion, and
get a +1 from a Committer

in order to get your PR accepted.

Issues should be reported on the GitHub issue tracker.

To discuss an idea for a new feature or ask us a question, open a GitHub issue; discussion currently happens primarily there, though the project is growing large enough that we may start a Gitter channel soon.

The current list of active committers (who can +1 a pull request) can be found here: COMMITTERS.md

A list of contributors to the project can be found at the project's Contributors page.

Citing Caliban

If Caliban helps you in your research, please consider citing Caliban's associated academic paper:

```bibtex
@article{Ritchie2020,
  doi = {10.21105/joss.02403},
  url = {https://doi.org/10.21105/joss.02403},
  year = {2020},
  publisher = {The Open Journal},
  volume = {5},
  number = {53},
  pages = {2403},
  author = {Sam Ritchie and Ambrose Slone and Vinay Ramasesh},
  title = {Caliban: Docker-based job manager for reproducible workflows},
  journal = {Journal of Open Source Software}
}
```

License

Licensed under the Apache License, Version 2.0.

Owner

Name: Google
Login: google
Kind: organization
Email: opensource@google.com
Location: United States of America

Website: https://opensource.google/
Twitter: GoogleOSS
Repositories: 2,773
Profile: https://github.com/google

Google ❤️ Open Source

JOSS Publication

Caliban: Docker-based job manager for reproducible workflows

Published

September 17, 2020

DOI

10.21105/joss.02403

Volume 5, Issue 53, Page 2403

Authors

Sam Ritchie

Google, United States of America

Ambrose Slone

Google, United States of America

Vinay Ramasesh

Google, United States of America

Editor

Patrick Diehl

CodeMeta (codemeta.json)

{
  "@context": "https://raw.githubusercontent.com/codemeta/codemeta/master/codemeta.jsonld",
  "@type": "Code",
  "author": [
    {
      "@id": "http://orcid.org/0000-0002-0545-6360",
      "@type": "Person",
      "email": "samritchie@google.com",
      "name": "Sam Ritchie",
      "affiliation": "Google"
    },
    {
      "@id": "",
      "@type": "Person",
      "email": "aslone@google.com",
      "name": "Ambrose Slone",
      "affiliation": "Google"
    },
    {
      "@id": "http://orcid.org/0000-0003-0625-3327",
      "@type": "Person",
      "email": "ramasesh@google.com",
      "name": "Vinay Ramasesh",
      "affiliation": "Google"
    }
  ],
  "identifier": "",
  "maintainer": "http://orcid.org/0000-0002-0545-6360",
  "codeRepository": "https://github.com/google/caliban",
  "issueTracker": "https://github.com/google/caliban/issues",
  "datePublished": "2020-06-22",
  "dateModified": "2020-06-22",
  "dateCreated": "2020-06-22",
  "description": "Docker-based job manager for reproducible workflows",
  "keywords": "python, docker, machine learning, reproducibility",
  "license": "Apache 2.0",
  "title": "Caliban",
  "version": "0.2.5"
}

GitHub Events

Total

Watch event: 10
Fork event: 2

Last Year

Watch event: 10
Fork event: 2

Committers

Last synced: 6 months ago

All Time

Total Commits: 238
Total Committers: 9
Avg Commits per committer: 26.444
Development Distribution Score (DDS): 0.336

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Sam Ritchie	s**e@g**m	158
Ambrose Slone	a**e@g**m	37
Ambrose Slone	a****e	22
Sam Ritchie	s**m@m**g	7
Vinay Ramasesh	r**h@g**m	7
Erik Schnetter	s**r@g**m	3
Guy Gur-Ari	g**y@g**t	2
P. Oscar Boykin	j****k	1
Eric Jinks	e**s@g**m	1

Committer Domains (Top 20 + Academic)

google.com: 3 gurari.net: 1 mentat.org: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 29
Total pull requests: 94
Average time to close issues: 3 months
Average time to close pull requests: about 1 month
Total issue authors: 12
Total pull request authors: 11
Average comments per issue: 2.66
Average comments per pull request: 1.67
Merged pull requests: 63
Bot issues: 0
Bot pull requests: 12

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

eschnett (9)
sritchie (5)
arokem (4)
rahimentezari (3)
dthiagarajan (1)
dfurrer (1)
hamelsmu (1)
dmrd (1)
ramasesh (1)
fwilliams (1)
hamzaziizzz (1)
jordanrule (1)

Pull Request Authors

sritchie (41)
ajslone (23)
dependabot[bot] (17)
ramasesh (7)
r0cketdyne (6)
eschnett (6)
guygurari (2)
huntr-helper (1)
sagravat (1)
Jinksi (1)
mohitmishra786 (1)

Top Labels

Issue Labels

bug (2)

Pull Request Labels

dependencies (17)

Packages

Total packages: 1
Total downloads:
- pypi 72 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 11
Total maintainers: 1

pypi.org: caliban

Docker-based job runner for AI research.

Homepage: https://github.com/google/caliban
Documentation: https://caliban.readthedocs.io/
License: Apache-2.0
Latest release: 0.4.2
published over 2 years ago

Versions: 11
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 72 Last month

Rankings

Stargazers count: 2.8%

Forks count: 5.0%

Dependent packages count: 10.1%

Average: 12.7%

Dependent repos count: 21.5%

Downloads: 24.0%

Maintainers (1)

caliban

Last synced: 6 months ago

Dependencies

docs/requirements.txt pypi

sphinx ==3.0.4
sphinx_rtd_theme *

requirements-dev.txt pypi

hypothesis *
ipython *
pre-commit *
pytest ==5.4.3
pytest-cov ==2.10.0
pytest-subprocess ==0.1.5
twine *

tutorials/basic/requirements.txt pypi

tensorflow-cpu *

.github/workflows/coverage.yml actions

actions/cache v2 composite
actions/checkout v2 composite
actions/setup-python v2 composite
codecov/codecov-action v1 composite

.github/workflows/pre-commit.yml actions

actions/checkout v2 composite
actions/setup-python v2 composite
pre-commit/action v2.0.0 composite

.github/workflows/release.yml actions

actions/checkout v2 composite
actions/setup-python v2 composite

.github/workflows/workflow.yml actions

actions/cache v2 composite
actions/checkout v2 composite
actions/setup-python v2 composite

dockerfiles/Dockerfile docker

$BASE_IMAGE latest build

setup.py pypi

tutorials/uv-metrics/setup.py pypi

pyproject.toml pypi
