https://github.com/broadinstitute/arret

Stop overspending on Terra GCS storage

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.3%) to scientific vocabulary

Last synced: 10 months ago

Repository

Stop overspending on Terra GCS storage

Basic Info

Host: GitHub
Owner: broadinstitute
Language: Python
Default Branch: main
Homepage:
Size: 449 KB

Statistics

Stars: 2
Watchers: 8
Forks: 0
Open Issues: 3
Releases: 4

Created almost 2 years ago · Last pushed 11 months ago

Metadata Files

Readme

Arret: Stop overspending on Terra GCS storage

Inspired by automop, this tool deletes unneeded objects from a Terra workspace's GCS bucket in order to reduce storage costs.

It will delete objects in the bucket according to configurable rules, including filters on creation time, size, and name. In order to prevent deleting "active" files, however, any file with a gs:// URL referenced in any data table in the Terra workspace (or other specified workspaces) will not be deleted.
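The protection rule above can be illustrated with a minimal sketch. It assumes data table rows are available as plain Python dictionaries; the function names and the URL pattern are hypothetical and are not taken from arret's source.

```python
import re
from typing import Any

# Hypothetical pattern: a gs:// URL runs until whitespace, a quote, a comma, or a bracket.
GS_URL = re.compile(r"gs://[^\s\"',\[\]]+")


def collect_gs_urls(value: Any) -> set[str]:
    """Recursively gather every gs:// URL found in a data table value."""
    urls: set[str] = set()
    if isinstance(value, str):
        urls.update(GS_URL.findall(value))
    elif isinstance(value, dict):
        for item in value.values():
            urls |= collect_gs_urls(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            urls |= collect_gs_urls(item)
    return urls


def deletable(bucket_urls: list[str], table_rows: list[dict]) -> list[str]:
    """Keep only bucket URLs that no data table row references."""
    protected = collect_gs_urls(table_rows)
    return [url for url in bucket_urls if url not in protected]
```

Any URL returned by `collect_gs_urls` is treated as "active" and is excluded from deletion regardless of the other filters.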

Recommended installation

1. Install the required system dependencies:
   - pyenv
   - Poetry
2. Install the required Python version (developed with 3.12.3, but other 3.12+ versions should work):
   ```shell
   pyenv install "$(cat .python-version)"
   ```
3. Confirm that python maps to the correct version: `python --version`
4. Set the Poetry interpreter and install the Python dependencies:
   ```shell
   poetry env use "$(pyenv which python)"
   poetry install
   ```

A requirements.txt file is also available and kept in sync with Poetry dependencies in case you don't want to use Poetry, or you can use arret via docker: docker pull dmccabe606/arret:latest.

This repo expects that your default GOOGLE_APPLICATION_CREDENTIALS authorizes write access to the Terra workspace (and its bucket).

Running

The arret CLI app requires just a single named argument for the path to a config file. See configs/example.dist.toml for an example:

```toml
gcp_project_id = ""

[terra]
workspace_namespace = ""
workspace_name = ""
other_workspaces = [
    { "workspace_namespace" = "", "workspace_name" = "" }
] # optional

[inventory]
inventory_path = "./data/inventories/inventory.ndjson"

[plan]
plan_path = "./data/plans/plan.duckdb"
days_considered_old = 30 # can be 0
bytes_considered_large = 1e6 # can be 0

[clean]
to_delete_sql = "is_pipeline_logs OR is_old OR is_large"

[batch]
region = "us-central1"
zone = "us-central1-a" # used only to look up CPU and memory for the machine_type
machine_type = "n2-highcpu-4"
boot_disk_mib = 20000 # should be large enough to accommodate the inventory file
max_run_seconds = 1200
provisioning_model = "STANDARD" # or, e.g., "SPOT"
service_account_email = "" # see README
container_image_uri = "docker.io/dmccabe606/arret:latest"
```

All the steps can be run in sequence with the run-all command:

```shell
poetry run python -m arret --config-path="./configs/your_config.toml" run-all
```

Steps

Alternatively, the steps can be run individually:

1. Inventory

```shell
poetry run python -m arret --config-path="./configs/your_config.toml" inventory
```

This will create an .ndjson file containing name, size, and updated datetime for all the blobs in the GCS bucket.
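NDJSON stores one JSON object per line, which makes a large inventory easy to stream. The sketch below shows the general shape of writing and reading such a file; the field names `name`, `size`, and `updated` follow the description above, while the helper functions and sample records are hypothetical.

```python
import io
import json


def write_inventory(records: list[dict], handle) -> None:
    """Write one JSON object per line (NDJSON)."""
    for record in records:
        handle.write(json.dumps(record) + "\n")


def read_inventory(handle) -> list[dict]:
    """Read NDJSON back into a list of dictionaries, skipping blank lines."""
    return [json.loads(line) for line in handle if line.strip()]


# Hypothetical records standing in for blobs listed from a bucket.
records = [
    {"name": "submissions/a/out.bam", "size": 2048, "updated": "2024-06-11T10:42:16+00:00"},
    {"name": "submissions/a/pipelines-logs/x.log", "size": 12, "updated": "2024-06-12T08:00:00+00:00"},
]

buffer = io.StringIO()
write_inventory(records, buffer)
buffer.seek(0)
round_trip = read_inventory(buffer)
```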

2. Plan

```shell
poetry run python -m arret --config-path="./configs/your_config.toml" plan
```

This loads the generated inventory and stores it as a DuckDB database, with additional columns indicating whether blobs are large, old, etc.
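The extra columns can be thought of as simple per-blob predicates. The following sketch computes them in plain Python rather than DuckDB; the inclusive comparisons and the substring test for /pipelines-logs/ are assumptions, and `flag_blob` is a hypothetical helper rather than arret's implementation.

```python
from datetime import datetime, timedelta, timezone


def flag_blob(blob: dict, now: datetime, days_considered_old: int,
              bytes_considered_large: float) -> dict:
    """Return the blob with is_large, is_old, and is_pipeline_logs added."""
    updated = datetime.fromisoformat(blob["updated"])
    return {
        **blob,
        # Assumption: thresholds are inclusive, so a threshold of 0 flags everything.
        "is_large": blob["size"] >= bytes_considered_large,
        "is_old": updated <= now - timedelta(days=days_considered_old),
        "is_pipeline_logs": "/pipelines-logs/" in blob["name"],
    }


# Hypothetical blob name; size and timestamp mirror the example row in this README.
example = flag_blob(
    {"name": "submissions/example/workflow/call-task/output.bam",
     "size": 414776,
     "updated": "2024-06-11 06:42:16-04:00"},
    now=datetime(2024, 8, 1, tzinfo=timezone.utc),
    days_considered_old=30,
    bytes_considered_large=1e6,
)
```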

3. Clean

```shell
poetry run python -m arret --config-path="./configs/your_config.toml" clean
```

This reopens the DuckDB and collects blobs to be deleted.

For the example to_delete_sql SQL string "is_pipeline_logs OR is_old OR is_large", it will delete a blob if any of the following is true:
- blob is inside a /pipelines-logs/ folder
- blob is old (based on days_considered_old)
- blob is large (based on bytes_considered_large)

...except when:
- blob is referenced in any Terra data table in the workspace of interest or any of the other_workspaces

Before the deletion logic is applied, a row in the blobs table might look like this:

```
url               gs://fc-1dfcd8c5-aaaa-aaaa-aaaa-0358fcf90e31/s...
name              submissions/01263c2a-bbbb-bbbb-bbbb-216fd55a4c...
size              414776
updated           2024-06-11 06:42:16-04:00
is_large          False
is_old            True
is_pipeline_logs  False
in_data_table     False
to_delete         False
```

Thus, to delete blobs that are pipeline logs, old, or large, while keeping log files and task scripts, set this config:

```toml
[clean]
to_delete_sql = """
    (is_pipeline_logs OR is_old OR is_large)
    AND (name NOT LIKE '%/script')
    AND (name NOT LIKE '%.log')
"""
```

If you have Terra job submissions in process and your to_delete_sql logic is set to delete "old" objects, make sure that days_considered_old is high enough not to delete task/workflow outputs that might belong to an active job. It's safest to run arret when your workspace has no active jobs at all.

Any DuckDB SELECT syntax can be used to filter the blobs table. The three is_* columns are populated to handle common use cases.
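To experiment with a to_delete_sql expression before pointing it at a real plan, something like the sketch below can help. It deliberately swaps DuckDB for Python's built-in `sqlite3` so that it runs without extra dependencies, and it assumes the expression is combined with a guard on in_data_table; the table layout, the `mark_to_delete` helper, and the sample rows are hypothetical.

```python
import sqlite3


def mark_to_delete(rows: list[tuple], to_delete_sql: str) -> list[str]:
    """Return names of blobs matching the expression and not in any data table."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE blobs (name TEXT, is_large INTEGER, is_old INTEGER, "
        "is_pipeline_logs INTEGER, in_data_table INTEGER)"
    )
    con.executemany("INSERT INTO blobs VALUES (?, ?, ?, ?, ?)", rows)
    # Assumption: referenced blobs are always excluded, whatever the expression says.
    query = (
        f"SELECT name FROM blobs WHERE ({to_delete_sql}) "
        "AND NOT in_data_table ORDER BY name"
    )
    names = [row[0] for row in con.execute(query)]
    con.close()
    return names


# Columns: name, is_large, is_old, is_pipeline_logs, in_data_table
rows = [
    ("a/pipelines-logs/x.log", 0, 0, 1, 0),
    ("a/call-t/script", 0, 1, 0, 0),
    ("a/call-t/out.bam", 1, 1, 0, 0),
    ("a/call-t/ref.bam", 1, 1, 0, 1),
    ("a/call-t/small.txt", 0, 0, 0, 0),
]

strict = mark_to_delete(
    rows,
    "(is_pipeline_logs OR is_old OR is_large) "
    "AND (name NOT LIKE '%/script') AND (name NOT LIKE '%.log')",
)
```

One difference to keep in mind when moving an expression between engines: SQLite's LIKE is case-insensitive for ASCII characters by default, whereas DuckDB's LIKE is case-sensitive.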

Config-free commands

Alternatively, you can omit --config-path and pass named options to the various commands, e.g.:

```shell
poetry run python -m arret run-all \
    --workspace-namespace the-workspace-namespace \
    --workspace-name the-workspace-name \
    --gcp-project-id the-gcp-project-id \
    --inventory-path ./data/inventories/inventory.ndjson \
    --plan-path ./data/plans/plan.duckdb \
    --days-considered-old 30 \
    --bytes-considered-large 1000000 \
    --other-workspaces the-workspace-namespace/workspace-1 \
    --other-workspaces the-workspace-namespace/workspace-2 \
    --other-workspaces the-workspace-namespace/workspace-3 # etc.
```

Runtime

Since inventory generation and blob deletion can take a long time, these steps are multithreaded. Even with many threads available, though, running arret might still take several hours if the Terra workspace has thousands of job submissions. Terra generates lots of small files (especially redundant logs) that must be iterated every time the inventory step runs. One source is a workflow's /pipelines-logs/ folder, which arret deletes, so subsequent runs will be nominally faster if you opt to delete these.
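As a rough illustration of why threads help here, the sketch below fans a per-blob operation out over a thread pool, which suits work dominated by network round trips. It uses `concurrent.futures` from the standard library; `delete_all` and the stand-in `fake_delete` callable are hypothetical and do not reflect how arret structures its workers.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def delete_all(names: list[str], delete_one: Callable[[str], None],
               max_workers: int = 8) -> tuple[list[str], list[str]]:
    """Run delete_one for each name concurrently; return (deleted, failed)."""
    deleted: list[str] = []
    failed: list[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(delete_one, name): name for name in names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                deleted.append(name)
            except Exception:
                failed.append(name)
    return sorted(deleted), sorted(failed)


def fake_delete(name: str) -> None:
    """Stand-in for a real storage call; fails for one name to show error handling."""
    if name.endswith("locked.txt"):
        raise PermissionError(name)


deleted, failed = delete_all(["a/x.log", "a/y.log", "a/locked.txt"], fake_delete)
```

Collecting failures instead of aborting on the first error lets a long run finish and report what still needs attention.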

Remote execution on GCP Batch

To aid in automation and reduce runtime, the run-all command can also be submitted as a GCP Batch job:

```shell
poetry run python -m arret --config-path="./configs/your_config.toml" submit-to-gcp-batch
```

This requires having already created a GCP service account with at least these IAM roles:
- Batch Agent Reporter
- Logs Writer
- Service Usage Consumer
- Storage Object Admin

The service account must also be registered in Terra and belong to a Terra group that has write access to the workspace you're cleaning and read access to workspaces listed in other_workspaces (if any). Note that it might take up to a day for Terra to sync permissions from a newly registered service account to the GCS buckets it should be able to access.

Owner

Name: Broad Institute
Login: broadinstitute
Kind: organization
Location: Cambridge, MA

Website: http://www.broadinstitute.org/
Twitter: broadinstitute
Repositories: 1,083
Profile: https://github.com/broadinstitute

Broad Institute of MIT and Harvard

GitHub Events

Total

Create event: 6
Release event: 2
Issues event: 2
Watch event: 1
Push event: 19
Public event: 1
Pull request event: 1

Last Year

Create event: 6
Release event: 2
Issues event: 2
Watch event: 1
Push event: 19
Public event: 1
Pull request event: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 4
Total pull requests: 1
Average time to close issues: about 2 months
Average time to close pull requests: N/A
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 1

Past Year

Issues: 1
Pull requests: 1
Average time to close issues: 3 days
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 1
