https://github.com/broadinstitute/arret
Stop overspending on Terra GCS storage
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.3%) to scientific vocabulary
Repository
Statistics
- Stars: 2
- Watchers: 8
- Forks: 0
- Open Issues: 3
- Releases: 4
Metadata Files
README.md
Arret: Stop overspending on Terra GCS storage
Inspired by automop, this tool deletes unneeded objects from a Terra workspace's GCS bucket in order to reduce storage costs.
It will delete objects in the bucket according to configurable rules, including filters on creation time, size, and name. In order to prevent deleting "active" files, however, any file with a gs:// URL referenced in any data table in the Terra workspace (or other specified workspaces) will not be deleted.
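As an illustration of the protection rule, here is a minimal sketch of how gs:// URLs could be gathered from data table rows so that the matching blobs are never deleted. This is not arret's actual code: the row layout, the regular expression, and the function name `collect_gs_urls` are assumptions made for the example.

```python
import re
from typing import Any

# Assumed pattern: a gs:// URL runs until whitespace, a comma, or a quote.
GS_URL = re.compile(r"gs://[^\s,'\"]+")


def collect_gs_urls(value: Any) -> set[str]:
    """Recursively gather every gs:// URL found in a nested table value."""
    if isinstance(value, str):
        return set(GS_URL.findall(value))
    if isinstance(value, dict):
        value = list(value.values())
    if isinstance(value, (list, tuple)):
        urls: set[str] = set()
        for item in value:
            urls |= collect_gs_urls(item)
        return urls
    return set()  # numbers, booleans, and None carry no URLs


# Made-up rows standing in for Terra data table entities.
rows = [
    {"sample_id": "s1", "bam": "gs://bucket-a/s1.bam", "reads": 100},
    {"sample_id": "s2", "vcfs": ["gs://bucket-a/s2.vcf", "gs://bucket-b/x.vcf"]},
]
protected = collect_gs_urls(rows)
```

Any blob whose URL appears in a set like `protected` would then be excluded from deletion, whatever the other filters say.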
Recommended installation
1. Install the required system dependencies (the commands below assume pyenv and Poetry are available).
2. Install the required Python version (developed with 3.12.3, but other 3.12+ versions should work):

```shell
pyenv install "$(cat .python-version)"
```

3. Confirm that python maps to the correct version:

```shell
python --version
```

4. Set the Poetry interpreter and install the Python dependencies:

```shell
poetry env use "$(pyenv which python)"
poetry install
```
A requirements.txt file is also available and kept in sync with the Poetry dependencies in case you don't want to use Poetry. Alternatively, you can use arret via Docker: docker pull dmccabe606/arret:latest.
This repo expects that your default GOOGLE_APPLICATION_CREDENTIALS authorizes write access to the Terra workspace (and its bucket).
Running
The arret CLI app requires just a single named argument for the path to a config file. See configs/example.dist.toml for an example:
```toml
gcp_project_id = ""

[terra]
workspace_namespace = ""
workspace_name = ""
other_workspaces = [ { "workspace_namespace" = "", "workspace_name" = "" } ] # optional

[inventory]
inventory_path = "./data/inventories/inventory.ndjson"

[plan]
plan_path = "./data/plans/plan.duckdb"
days_considered_old = 30 # can be 0
bytes_considered_large = 1e6 # can be 0

[clean]
to_delete_sql = "is_pipeline_logs OR is_old OR is_large"

[batch]
region = "us-central1"
zone = "us-central1-a" # used only to look up CPU and memory for the machine_type
machine_type = "n2-highcpu-4"
boot_disk_mib = 20000 # should be large enough to accommodate the inventory file
max_run_seconds = 1200
provisioning_model = "STANDARD" # or, e.g., "SPOT"
service_account_email = "" # see README
container_image_uri = "docker.io/dmccabe606/arret:latest"
```
All the steps can be run in sequence with the run-all command:
```shell
poetry run python -m arret --config-path="./configs/your_config.toml" run-all
```
Steps
Alternatively, the steps can be run individually:
1. Inventory
```shell
poetry run python -m arret --config-path="./configs/your_config.toml" inventory
```
This will create an .ndjson file containing name, size, and updated datetime for all the blobs in the GCS bucket.
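For readers unfamiliar with the format, NDJSON stores one JSON object per line. The following sketch writes and re-reads records with the three fields mentioned above. The helper names and sample blobs are made up for illustration, and arret's real inventory file may contain additional fields.

```python
import io
import json


def write_ndjson(records, fh):
    """Write each record as one JSON object per line."""
    for rec in records:
        fh.write(json.dumps(rec) + "\n")


def read_ndjson(fh):
    """Yield one parsed record per non-empty line."""
    for line in fh:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up blobs with the three fields the inventory step records.
blobs = [
    {"name": "submissions/a/call-x/stdout", "size": 120,
     "updated": "2024-06-11T06:42:16-04:00"},
    {"name": "submissions/a/call-x/out.bam", "size": 5000000,
     "updated": "2024-07-01T12:00:00-04:00"},
]

buffer = io.StringIO()  # stands in for the .ndjson file on disk
write_ndjson(blobs, buffer)
buffer.seek(0)
round_trip = list(read_ndjson(buffer))
```

Because each line is independent, a file like this can be streamed record by record instead of being loaded into memory at once.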
2. Plan
```shell
poetry run python -m arret --config-path="./configs/your_config.toml" plan
```
This loads the generated inventory and stores it as a DuckDB database, with additional columns indicating whether blobs are large, old, etc.
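The flag columns can be pictured with a small pure-Python sketch. This is an assumption about the logic, not arret's implementation (which works in DuckDB): the function `add_flags` is hypothetical, and the strict comparisons are a guess, since the real thresholds may be inclusive.

```python
from datetime import datetime, timedelta, timezone


def add_flags(blob, now, days_considered_old, bytes_considered_large):
    """Return a copy of blob with the three is_* columns added.

    Assumption: strict comparisons; the real thresholds may be inclusive.
    """
    updated = datetime.fromisoformat(blob["updated"])
    cutoff = now - timedelta(days=days_considered_old)
    return {
        **blob,
        "is_large": blob["size"] > bytes_considered_large,
        "is_old": updated < cutoff,
        "is_pipeline_logs": "/pipelines-logs/" in blob["name"],
    }


# Made-up blob name; the size and timestamp mirror the sample row shown
# later in this README.
blob = {
    "name": "submissions/example/workflow/call-task/output.txt",
    "size": 414776,
    "updated": "2024-06-11T06:42:16-04:00",
}
now = datetime(2024, 7, 20, tzinfo=timezone.utc)  # fixed "today" for the example
flagged = add_flags(blob, now, days_considered_old=30, bytes_considered_large=1e6)
```

With these inputs the blob is flagged as old but neither large nor a pipeline log.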
3. Clean
```shell
poetry run python -m arret --config-path="./configs/your_config.toml" clean
```
This reopens the DuckDB database and collects the blobs to be deleted.
For the example to_delete_sql SQL string "is_pipeline_logs OR is_old OR is_large", it will delete a blob if any of the following is true:
- blob is inside a /pipelines-logs/ folder
- blob is old (based on days_considered_old)
- blob is large (based on bytes_considered_large)
...except when:
- blob is referenced in any Terra data table in the workspace of interest or any of the other_workspaces
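Put together, the rule and its exception amount to a simple boolean expression. The sketch below restates it in Python for clarity. The function name `should_delete` is hypothetical, and the column names are taken from the blobs table described in this README.

```python
def should_delete(blob: dict) -> bool:
    """Apply the example rule, then the data table exception."""
    matches_rule = blob["is_pipeline_logs"] or blob["is_old"] or blob["is_large"]
    # A blob referenced in any data table is always kept.
    return bool(matches_rule) and not blob["in_data_table"]


# Two made-up blobs that differ only in whether a data table references them.
old_unreferenced = {"is_pipeline_logs": False, "is_old": True,
                    "is_large": False, "in_data_table": False}
old_referenced = {**old_unreferenced, "in_data_table": True}

decisions = (should_delete(old_unreferenced), should_delete(old_referenced))
```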
Before the deletion logic is applied, a row in the blobs table might look like this:
url gs://fc-1dfcd8c5-aaaa-aaaa-aaaa-0358fcf90e31/s...
name submissions/01263c2a-bbbb-bbbb-bbbb-216fd55a4c...
size 414776
updated 2024-06-11 06:42:16-04:00
is_large False
is_old True
is_pipeline_logs False
in_data_table False
to_delete False
Thus, to delete blobs that are pipeline logs, old, or large, while keeping log files and task scripts, set this config:
```toml
[clean]
to_delete_sql = """
(is_pipeline_logs OR is_old OR is_large)
AND
(name NOT LIKE '%/script')
AND
(name NOT LIKE '%.log')
"""
```
If you have Terra job submissions in process and your to_delete_sql logic is set to delete "old" objects, make sure that days_considered_old is high enough not to delete task/workflow outputs that might belong to an active job. It's safest to run arret when your workspace has no active jobs at all.
Any DuckDB SELECT syntax can be used to filter the blobs table. The three is_* columns are populated to handle common use cases.
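To try out a filter before pointing it at real data, a throwaway table is enough. The sketch below uses Python's built-in `sqlite3` purely as a stand-in for DuckDB so that it runs without extra dependencies. The table contents are made up, and the two engines differ in details (for example, SQLite's LIKE is case-insensitive for ASCII by default, while DuckDB's is case-sensitive).

```python
import sqlite3

# The example filter from the [clean] section above.
TO_DELETE_SQL = """
(is_pipeline_logs OR is_old OR is_large)
AND (name NOT LIKE '%/script')
AND (name NOT LIKE '%.log')
"""

conn = sqlite3.connect(":memory:")  # stand-in for the DuckDB plan database
conn.execute(
    "CREATE TABLE blobs (name TEXT, is_pipeline_logs INTEGER, "
    "is_old INTEGER, is_large INTEGER)"
)
conn.executemany(
    "INSERT INTO blobs VALUES (?, ?, ?, ?)",
    [
        ("submissions/a/call-x/script", 0, 1, 0),   # old, but a task script
        ("submissions/a/call-x/x.log", 0, 1, 0),    # old, but a log file
        ("submissions/a/call-x/out.bam", 0, 1, 1),  # old and large
        ("submissions/a/call-x/new.txt", 0, 0, 0),  # matches no flag
    ],
)

rows = conn.execute(
    f"SELECT name FROM blobs WHERE {TO_DELETE_SQL} ORDER BY name"
).fetchall()
selected = [name for (name,) in rows]
conn.close()
```

Only the old, large BAM file is selected. The script and log file are excluded by the two NOT LIKE clauses, and the new text file matches none of the flags.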
Config-free commands
Alternatively, you can omit --config-path and pass named options to the various commands, e.g.:
```shell
poetry run python -m arret run-all \
  --workspace-namespace the-workspace-namespace \
  --workspace-name the-workspace-name \
  --gcp-project-id the-gcp-project-id \
  --inventory-path ./data/inventories/inventory.ndjson \
  --plan-path ./data/plans/plan.duckdb \
  --days-considered-old 30 \
  --bytes-considered-large 1000000 \
  --other-workspaces the-workspace-namespace/workspace-1 \
  --other-workspaces the-workspace-namespace/workspace-2 \
  --other-workspaces the-workspace-namespace/workspace-3 # etc.
```
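The repeated --other-workspaces option takes namespace/name pairs. As a rough sketch of how such values could be collected and split, here is a standard-library `argparse` version. arret's real CLI may be built with a different library, and `parse_workspace` is a hypothetical helper.

```python
import argparse


def parse_workspace(value: str) -> dict:
    """Split 'namespace/name' into the two fields used in the config file."""
    namespace, sep, name = value.partition("/")
    if not sep or not namespace or not name:
        raise argparse.ArgumentTypeError(
            f"expected 'namespace/name', got {value!r}"
        )
    return {"workspace_namespace": namespace, "workspace_name": name}


parser = argparse.ArgumentParser()
# action="append" gathers every occurrence of the option into one list.
parser.add_argument(
    "--other-workspaces", action="append", type=parse_workspace, default=[]
)

args = parser.parse_args([
    "--other-workspaces", "the-workspace-namespace/workspace-1",
    "--other-workspaces", "the-workspace-namespace/workspace-2",
])
```

The resulting list has the same shape as the other_workspaces array in the TOML config.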
Runtime
Since inventory generation and blob deletion can take a long time, these steps are multithreaded. Even with many threads available, though, running arret might still take several hours if the Terra workspace has thousands of job submissions. Terra generates lots of small files (especially redundant logs) that must be iterated over every time the inventory step runs. One source is a workflow's /pipelines-logs/ folder, which arret can delete, so subsequent runs will be somewhat faster if you opt to delete these.
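The general pattern behind multithreaded deletion is to split the work into batches and hand them to a thread pool. The sketch below shows that pattern only. The batch size, worker count, and the stub `delete_batch` (which deletes nothing) are assumptions, not arret's actual settings or code.

```python
from concurrent.futures import ThreadPoolExecutor


def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def delete_batch(names):
    """Stand-in for a real GCS batch delete; only reports the batch size."""
    return len(names)


def delete_all(names, batch_size=100, max_workers=8):
    """Process all batches concurrently and return the total count handled."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(delete_batch, batched(names, batch_size)))


# Made-up blob names.
names = [f"submissions/fake/{i}.txt" for i in range(1050)]
deleted = delete_all(names)
```

Threads suit this kind of work because each request spends most of its time waiting on the network rather than using the CPU.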
Remote execution on GCP Batch
To aid in automation and reduce runtime, the run-all command can also be submitted as a GCP Batch job:
```shell
poetry run python -m arret --config-path="./configs/your_config.toml" submit-to-gcp-batch
```
This requires having already created a GCP service account with at least these IAM roles:
- Batch Agent Reporter
- Logs Writer
- Service Usage Consumer
- Storage Object Admin
The service account must also be registered in Terra and belong to a Terra group that has write access to the workspace you're cleaning and read access to workspaces listed in other_workspaces (if any). Note that it might take up to a day for Terra to sync permissions from a newly registered service account to the GCS buckets it should be able to access.
Owner
- Name: Broad Institute
- Login: broadinstitute
- Kind: organization
- Location: Cambridge, MA
- Website: http://www.broadinstitute.org/
- Twitter: broadinstitute
- Repositories: 1,083
- Profile: https://github.com/broadinstitute
Broad Institute of MIT and Harvard
GitHub Events
Total
- Create event: 6
- Release event: 2
- Issues event: 2
- Watch event: 1
- Push event: 19
- Public event: 1
- Pull request event: 1
Last Year
- Create event: 6
- Release event: 2
- Issues event: 2
- Watch event: 1
- Push event: 19
- Public event: 1
- Pull request event: 1
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 4
- Total pull requests: 1
- Average time to close issues: about 2 months
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 1
Past Year
- Issues: 1
- Pull requests: 1
- Average time to close issues: 3 days
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 1
Top Authors
Issue Authors
- dpmccabe (4)
Pull Request Authors
- dependabot[bot] (1)