https://github.com/broadinstitute/adapt-pipes
Workflows to run ADAPT on AWS Batch.
Repository
Basic Info
- Host: GitHub
- Owner: broadinstitute
- License: mit
- Language: wdl
- Default Branch: main
- Homepage: https://github.com/broadinstitute/adapt
- Size: 15.6 MB
Statistics
- Stars: 3
- Watchers: 8
- Forks: 0
- Open Issues: 2
- Releases: 0
Metadata Files
README.md
adapt-pipes
Workflows to run ADAPT on AWS Batch.
For more information on ADAPT and on how to run it, please see the ADAPT repository on GitHub.
Setting up Cromwell Server for AWS
Setting up a VPC
1. Go to AWS CloudFormation, and click "Create Stack". If prompted, click "With new resources (standard)".
2. Choose "Template is ready" and "Upload a template file". Upload /cromwell-setup/vpcstack.json, then hit "Next".
3. Name your stack (ex. Cromwell-VPC).
4. Select regions for your availability zones. You must select between 2 and 4 regions. If you are unsure, select us-east-1a, us-east-1b, us-east-1c, and us-east-1d.
5. Select the number of availability zones that matches the number of regions you chose in Step 4.
6. Keep the defaults for the rest of the options on this page, and hit "Next".
7. Add any tags you would like, then hit "Next". Tags will be added to all AWS resources built by the stack and serve as additional metadata.
8. Click "Create Stack".
9. After the stack has finished running, click on the stack name, click on Outputs, and record each Private and Public Subnet ID (in the form subnet-#################) and the VPC ID (in the form vpc-#################). You will need them to set up the Genomics Workflow Core and Cromwell Resources.
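If the AWS Command Line Interface is installed and configured, the stack outputs can also be read from a terminal instead of the console. The following is a minimal sketch that wraps `aws cloudformation describe-stacks`; the function name `stack_outputs` and the stack name in the example are illustrative, and it assumes your default CLI region is the one the stack was created in.

```shell
# List the output keys and values of a CloudFormation stack,
# one tab-separated "key value" pair per line.
# Usage: stack_outputs STACK_NAME
stack_outputs() {
    aws cloudformation describe-stacks \
        --stack-name "$1" \
        --query "Stacks[0].Outputs[].[OutputKey,OutputValue]" \
        --output text
}

# Example (requires AWS credentials; the stack name is whatever you chose):
#   stack_outputs Cromwell-VPC
```

The same function should work for any other stack you create later, since it only depends on the stack name.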
Setting up Genomics Workflow Core and Cromwell Resources
1. Open Installing the Genomics Workflow Core and Cromwell.pdf in /cromwell-setup/ and follow the instructions. Whenever it asks to use VPC subnets, use as many as you can from "Setting up a VPC".
2. If there are issues with running the stacks, try replacing "latest" with "v3.0.2" in any S3 file paths.
3. If it still is not working, upload the contents of /cromwell-setup/cromwell-setup.zip to an S3 bucket, and run the stacks using paths to your personal S3 bucket. These have slight modifications to the templates in Installing the Genomics Workflow Core and Cromwell.pdf that allow AWS Batch to use the optimal instance size rather than selecting from a predefined list of instance types.
4. After the stacks have finished running, click on the core stack name, click on Outputs, and record the DefaultJobQueueArn, the PriorityJobQueueArn, and the S3BucketName. You will need these to set up your input files. Then, click on the resources stack name, click on Outputs, and record the HostName. The HostName will be how you connect to your Cromwell Server.
Running ADAPT on Cromwell and AWS Batch
To run ADAPT, you will need the following values you recorded while building your server. If you didn't record them, they can be found by going to AWS CloudFormation and following the instructions in Step 4 of "Setting up Genomics Workflow Core and Cromwell Resources".
- DefaultJobQueueArn or PriorityJobQueueArn: the Batch queue to run your jobs on. The DefaultJobQueueArn uses Spot instances if capacity is available, then On Demand instances; the PriorityJobQueueArn uses On Demand instances until a limit is reached, at which point it will use Spot instances. The DefaultJobQueueArn costs less, but the PriorityJobQueueArn will work faster. If you do not have access to the CloudFormation stack and need to find the DefaultJobQueueArn or PriorityJobQueueArn, go to the AWS Batch Management Console, click on "Job queues", and look for the queue with "Default" or "Priority" (and likely "Cromwell") in its name. Click on it, and record the ARN (Amazon Resource Name).
- S3BucketName: the S3 bucket where your Cromwell files are. If you do not have access to the CloudFormation stack and need to find the S3BucketName, go to the AWS S3 Management Console and click through the buckets until you find one with a folder called _gwfcore. Record this bucket's name.
- HostName: the URL for your server. If you do not have access to the CloudFormation stack and need to find the HostName, go to the AWS EC2 Management Console, go to your list of instances, and find the one named "cromwell-server" (or something similar). The "Public IPv4 DNS" of this instance is your HostName.
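The console lookups described above can also be approximated with the AWS CLI. The sketch below is only an illustration: the function names are made up for this example, the instance name tag ("cromwell-server") is taken from the description above and may differ in your account, and both functions assume your CLI is configured for the region that holds the resources.

```shell
# Print the ARN of every AWS Batch job queue visible to your credentials.
# Look for the ones with "Default" or "Priority" in their names.
list_job_queue_arns() {
    aws batch describe-job-queues \
        --query "jobQueues[].jobQueueArn" \
        --output text
}

# Print the public DNS name of running EC2 instances with a given Name tag.
# Usage: instance_hostname NAME_TAG
instance_hostname() {
    aws ec2 describe-instances \
        --filters "Name=tag:Name,Values=$1" "Name=instance-state-name,Values=running" \
        --query "Reservations[].Instances[].PublicDnsName" \
        --output text
}

# Example (requires AWS credentials):
#   list_job_queue_arns
#   instance_hostname cromwell-server
```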
Setting up ADAPT Docker images
You may either use our Docker images or create your own. If you would like to use our Docker images, use quay.io/broadinstitute/adaptcloud to use cloud memoization features. Otherwise, or if you're unsure, use quay.io/broadinstitute/adapt.
If you would like to build your own Docker images, do the following:
1. Clone the ADAPT repository to your computer using the following command:
$ git clone https://github.com/broadinstitute/adapt.git
2. Go into the repository, and build the ADAPT Docker image using the following commands:
$ cd adapt
$ docker build . -t adapt
3. If you would like to use cloud memoization features, run the following command:
$ docker build . -t adaptcloud -f ./cloud.Dockerfile
If you are building your own Docker image, you will also need to publish it. You can do this either via DockerHub or via AWS itself. The following are instructions for publishing your image using AWS.
1. Install the AWS Command Line Interface.
2. In the Amazon Elastic Container Registry (ECR) console, click "Create Repository".
3. Name your repository, keep the other options at their defaults, and click "Create Repository".
4. Click on your repository's name, click "View push commands", and then follow the instructions listed there to push your Docker image to AWS.
5. Click back to the ECR home screen, and record the URI of your image.
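The commands shown under "View push commands" are the authoritative ones for your repository. As a rough sketch of what they typically involve, the functions below log in to ECR, tag a local image, and push it. The function names, the account ID, and the repository name in the example are hypothetical placeholders.

```shell
# Build the ECR image URI from its parts.
# Usage: ecr_image_uri ACCOUNT_ID REGION REPOSITORY TAG
ecr_image_uri() {
    printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' "$1" "$2" "$3" "$4"
}

# Log in to ECR, then tag and push a locally built image.
# Usage: push_to_ecr LOCAL_IMAGE ACCOUNT_ID REGION REPOSITORY TAG
push_to_ecr() {
    uri=$(ecr_image_uri "$2" "$3" "$4" "$5")
    registry=${uri%%/*}   # registry host: everything before the first "/"
    aws ecr get-login-password --region "$3" |
        docker login --username AWS --password-stdin "$registry" &&
    docker tag "$1" "$uri" &&
    docker push "$uri"
}

# Example (placeholder account ID and repository name):
#   push_to_ecr adapt 123456789012 us-east-1 adapt latest
```

The URI printed by `ecr_image_uri` has the same shape as the image URI you record from the ECR home screen.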
Setting up Input Files
To send the job to your Cromwell server, you will need two or three files locally:
- a WDL workflow for ADAPT. To design for a single taxon, use single_adapt.wdl. To design for multiple taxa in parallel, use parallel_adapt.wdl.
- a JSON file of inputs to your WDL.
  To design for a single taxon, modify single_adapt_input_template.json. Details on each of the inputs are below:
  - single_adapt.adapt.queueArn: Queue ARN (Amazon Resource Name) of the queue you want the jobs to run on. This should be either the DefaultJobQueueArn or the PriorityJobQueueArn.
  - single_adapt.adapt.taxid: Taxonomic ID of the design to create.
  - single_adapt.adapt.ref_accs: Accession numbers of the reference sequences used by ADAPT for curation; separate multiple with commas.
  - single_adapt.adapt.segment: Segment number of genome to design for; set to 'None' for unsegmented genomes.
  - single_adapt.adapt.obj: Objective (either 'minimize-guides' or 'maximize-activity').
  - single_adapt.adapt.specific: true to be specific against the taxa listed in specificity_taxa, false to not be specific.
  - single_adapt.adapt.image: URI for Docker ADAPT Image to use.
  - single_adapt.adapt.specificity_taxa: Optional, only needed if specific is true. AWS S3 path to file that contains a list of taxa to be specific against. Should have no headings, but be a list of taxonomic IDs in the first column and segment numbers in the second column.
  - single_adapt.adapt.rand_sample: Optional, take a sample of RAND_SAMPLE sequences from the taxa to design for.
  - single_adapt.adapt.rand_seed: Optional, set ADAPT's random seed to get consistent results across runs.
  - single_adapt.adapt.bucket: Optional, S3 bucket for cloud memoization. May include path to put memo in a subfolder; do not include '/' at the end.
  - single_adapt.adapt.memory: Optional, sets the memory each job uses. Defaults to 2GB. If jobs fail unexpectedly, increase this.
  To design for multiple taxa in parallel, modify parallel_adapt_input_template.json. Details on each of the inputs are below:
  - parallel_adapt.queueArn: Queue ARN (Amazon Resource Name) of the queue you want the jobs to run on. This should be either the DefaultJobQueueArn or the PriorityJobQueueArn.
  - parallel_adapt.objs: Array of objective functions to design for; can include any of {"maximize-activity", "minimize-guides"}.
  - parallel_adapt.sps: Array; include "true" in the array to have designs made specific against any other order in the same family that is listed in ALL_TAXA_FILE; include "false" to design nonspecifically.
  - parallel_adapt.taxa_file: AWS S3 path to a TSV file that contains a list of taxa to design for. Headings should be 'family', 'genus', 'species', 'taxid', 'segment', 'refseqs', 'neighbor-count'.
  - parallel_adapt.format_taxa.all_taxa_file: AWS S3 path to a TSV file that contains a list of all taxa to be specific against (note: will only check for specificity within a family). Can be the same file as TAXA_FILE. Headings should be 'family', 'genus', 'species', 'taxid', 'segment', 'refseqs', 'neighbor-count'.
  - parallel_adapt.adapt.image: URI for Docker ADAPT Image to use.
  - parallel_adapt.adapt.bucket: Optional, S3 bucket for cloud memoization. May include path to put memo in a subfolder; do not include '/' at the end.
  - parallel_adapt.adapt.memory: Optional, sets the memory each job uses. Defaults to 2GB. If jobs fail unexpectedly, increase this.
- a configuration file for AWS (optional, only necessary for running workflows through a Cromwell call). Modify anything that says REGION, S3BUCKET, or QUEUEARN in aws-template.conf.
  - REGION should be the region in which your S3 bucket is stored and your job queues are. You should see something like us-east-1 in the DefaultJobQueueArn/PriorityJobQueueArn; this is the region it is in.
  - S3BUCKET should be the S3BucketName.
  - QUEUEARN should be either the DefaultJobQueueArn or the PriorityJobQueueArn.
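Filling in the AWS configuration template can be scripted rather than edited by hand. The sketch below is one possible approach, not part of the repository: it reads the region from the fourth colon-separated field of the queue ARN and substitutes the three placeholders with `sed`. The function names and the output file name are illustrative, and it assumes the bucket and the queue are in the same region.

```shell
# Extract the region (fourth colon-separated field) from an ARN.
# Usage: region_from_arn ARN
region_from_arn() {
    printf '%s\n' "$1" | cut -d: -f4
}

# Replace the REGION, S3BUCKET, and QUEUEARN placeholders of a template
# and print the result to standard output.
# Usage: fill_aws_conf TEMPLATE S3_BUCKET_NAME QUEUE_ARN > OUTPUT
fill_aws_conf() {
    region=$(region_from_arn "$3")
    sed -e "s|REGION|$region|g" \
        -e "s|S3BUCKET|$2|g" \
        -e "s|QUEUEARN|$3|g" "$1"
}

# Example (placeholder bucket name and ARN):
#   fill_aws_conf aws-template.conf my-cromwell-bucket \
#       arn:aws:batch:us-east-1:123456789012:job-queue/default-queue > aws.conf
```

Writing to a new file leaves aws-template.conf untouched, so it can be reused for a different queue or bucket.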
Sending Workflow to Cromwell server
There are three methods to run a workflow on your Cromwell Server: through the Swagger UI, through an HTTP POST command, or through a Cromwell call.
Running your workflow through the Swagger UI
To access the Swagger UI, go to your HostName URL in a web browser. Note that it does not work in Chrome; use Firefox, Safari, Edge, or Internet Explorer instead. You may need to ignore a security warning about a self-signed certificate to access the page; to do so, click "Advanced" or "More Information" and then continue to the webpage.
To run your workflow, click POST /api/workflows/{version}, click "Try it Out", set version to "v1", upload your WDL workflow to workflowSource, upload your JSON input file to workflowInputs, set workflowType to "WDL", set workflowTypeVersion to "1.0", and click "Execute". Record the workflow ID that is returned.
To check the status of your workflow, click GET /api/workflows/{version}/{id}/status, click "Try it Out", set version to "v1", set id to the workflow ID you recorded, and click "Execute".
To get the outputs of your workflow once it has finished running, click GET /api/workflows/{version}/{id}/outputs, click "Try it Out", set version to "v1", set id to the workflow ID you recorded, and click "Execute". You will get S3 paths to the files containing your outputs, which you can access via the S3 dashboard.
You may keep track of the status of each job produced by the workflow by referring to the AWS Batch Dashboard.
Running your workflow through an HTTP POST command
To run your workflow, open a terminal, and run the following command:
$ curl -k -X POST "https://{HostName}/api/workflows/v1" \
-H "accept: application/json" \
-F "workflowSource=@{WDL Workflow}" \
-F "workflowInputs=@{JSON Inputs}"
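The response to this request is a small JSON object that includes the workflow ID. To reuse that ID in later commands, the request can be wrapped in a function that prints only the ID. This is a sketch: the function name `submit_workflow` is made up here, and the `sed` expression assumes the ID appears as an "id" string field in the response.

```shell
# Submit a workflow and print the ID that Cromwell assigns to it.
# -k skips certificate verification (the server uses a self-signed
# certificate); -s hides curl's progress output.
# Usage: submit_workflow HOSTNAME WDL_FILE INPUTS_JSON
submit_workflow() {
    curl -k -s -X POST "https://$1/api/workflows/v1" \
        -H "accept: application/json" \
        -F "workflowSource=@$2" \
        -F "workflowInputs=@$3" |
        sed -n 's/.*"id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example (HOSTNAME is your HostName value):
#   id=$(submit_workflow "$HOSTNAME" single_adapt.wdl single_adapt_input.json)
```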
To check the status of your workflow, run the following command:
$ curl -k -X GET "https://{HostName}/api/workflows/v1/{id}/status"
To get the outputs of your workflow once it has finished running, run the following command:
$ curl -k -X GET "https://{HostName}/api/workflows/v1/{id}/outputs"
You will get S3 paths to the files containing your outputs, which you can access via the S3 dashboard.
You may keep track of the status of each job produced by the workflow by referring to the AWS Batch Dashboard.
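Rather than re-running the status command by hand, the check can be repeated until the workflow finishes. The following sketch polls the status endpoint and stops at one of Cromwell's terminal states (Succeeded, Failed, or Aborted). The function names and the 60-second default interval are arbitrary choices for this example, and the loop has no timeout, so interrupt it if the server becomes unreachable.

```shell
# Print the current status of a workflow (for example Running or Succeeded).
# Usage: workflow_status HOSTNAME WORKFLOW_ID
workflow_status() {
    curl -k -s -X GET "https://$1/api/workflows/v1/$2/status" |
        sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Poll until the workflow reaches a terminal state, then print that state.
# Usage: wait_for_workflow HOSTNAME WORKFLOW_ID [SECONDS_BETWEEN_POLLS]
wait_for_workflow() {
    while :; do
        status=$(workflow_status "$1" "$2")
        case "$status" in
            Succeeded|Failed|Aborted)
                printf '%s\n' "$status"
                return 0
                ;;
        esac
        sleep "${3:-60}"
    done
}

# Example:
#   wait_for_workflow "$HOSTNAME" "$id" 120
```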
Running your workflow through a Cromwell call
First, you will need to download Cromwell. You will only need to download cromwell-54.jar. You will also need to install the AWS Command Line Interface and add the AWSBatchFullAccess permissions policy to your account via IAM (click on "Users", your account name, "Add permissions", "Attach existing policies directly", "AWSBatchFullAccess", "Next: Review", and finally "Add permissions").
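If you prefer the command line to the IAM console, the same policy can be attached with the AWS CLI. This is a sketch under two assumptions: that your credentials are allowed to modify IAM users, and that the managed policy is available at the ARN shown in the code. The function name and the user name in the example are placeholders.

```shell
# Attach the AWS-managed AWSBatchFullAccess policy to an IAM user.
# Usage: grant_batch_access IAM_USER_NAME
grant_batch_access() {
    aws iam attach-user-policy \
        --user-name "$1" \
        --policy-arn "arn:aws:iam::aws:policy/AWSBatchFullAccess"
}

# Example (placeholder user name):
#   grant_batch_access my-user
```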
To run your workflow, open a terminal, and run the following command:
$ java -Dconfig.file={AWS Configuration file} -jar {path to Cromwell jar file} run {WDL Workflow} -i {JSON Inputs}
You will get updates on the status of your workflow in the terminal, as well as the S3 paths to the files of your outputs. You can access these via the S3 dashboard.
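Outputs can also be copied to your computer with the AWS CLI instead of being downloaded from the S3 dashboard. A minimal sketch, with a made-up function name and a placeholder path in the example:

```shell
# Copy one workflow output from S3 into a local directory
# (the current directory by default).
# Usage: fetch_output S3_PATH [DESTINATION_DIR]
fetch_output() {
    aws s3 cp "$1" "${2:-.}/"
}

# Example (use an S3 path reported by Cromwell):
#   fetch_output s3://my-cromwell-bucket/path/to/output.tsv results
```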
Owner
- Name: Broad Institute
- Login: broadinstitute
- Kind: organization
- Location: Cambridge, MA
- Website: http://www.broadinstitute.org/
- Twitter: broadinstitute
- Repositories: 1,083
- Profile: https://github.com/broadinstitute
Broad Institute of MIT and Harvard