Manual Deployment

There are many ways (automated and manual) to deploy, configure, and manage AWS resources, depending on your security posture and familiarity with the AWS ecosystem. If you cannot use our AWS CloudFormation template, this article lists the steps for a fairly straightforward manual deployment of AWS resources for use by Metaflow.

Please note that Metaflow can re-use existing AWS resources - for example, your existing AWS Batch job queue for job execution. The instructions listed here will create these resources from scratch. If you have a strong background in administering AWS resources, you will notice that many of the security policies are fairly permissive and are intended to serve as a starting point for more complex deployments. Please reach out to us if you would like to discuss more involved deployments.

Steps for Manual Deployment

These steps assume that the users of Metaflow have sufficient AWS credentials on their workstation to interact with the AWS resources that are spun up.

Datastore

Metaflow currently supports Amazon S3 as the storage backend for all the data that is generated during the execution of Metaflow flows.

Metaflow stores all flow execution data (user code, pickled object files, etc.) in an S3 folder which is set as the variable METAFLOW_DATASTORE_SYSROOT_S3 in the Metaflow configuration. If you are using metaflow.S3, you can set the variable METAFLOW_DATATOOLS_S3ROOT to store that data in a specific folder in S3.
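
For example, once METAFLOW_DATATOOLS_S3ROOT is configured, the metaflow.S3 client reads and writes objects under that location. A minimal sketch (the s3root below is an illustrative placeholder; inside a flow, S3(run=self) resolves paths under METAFLOW_DATATOOLS_S3ROOT automatically):

from metaflow import S3

# Store and retrieve a small object under an explicit S3 root.
with S3(s3root='s3://metaflow-s3/data-tools/example') as s3:
    s3.put('greeting', 'hello world')   # upload a blob
    print(s3.get('greeting').text)      # prints 'hello world'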

Create a private Amazon S3 bucket

The following instructions will create a private S3 bucket for Metaflow -

  1. Sign in to the AWS Management Console and open the Amazon S3 console.

  2. Choose Create bucket.

  3. In Bucket name, enter a DNS-compliant name for your bucket. Avoid including sensitive information, such as account numbers, in the bucket name. The bucket name is visible in the URLs that point to the objects in the bucket.

  4. In Region, choose the AWS Region where you want the bucket to reside. Choose a Region close to you to minimize latency and costs.

  5. In Bucket settings for Block Public Access, keep the values set to the defaults. By default Amazon S3 blocks all public access to your buckets.

  6. Choose Create bucket.

Amazon S3 Bucket
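
If you prefer to script this step, a minimal boto3 sketch along the same lines could look as follows (the bucket name and region are placeholders; bucket names must be globally unique):

import boto3

region = 'us-east-1'      # placeholder region
bucket = 'metaflow-s3'    # placeholder bucket name

s3 = boto3.client('s3', region_name=region)

# Outside us-east-1, a LocationConstraint must be supplied.
if region == 'us-east-1':
    s3.create_bucket(Bucket=bucket)
else:
    s3.create_bucket(Bucket=bucket,
                     CreateBucketConfiguration={'LocationConstraint': region})

# Keep the bucket private by blocking all public access.
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        'BlockPublicAcls': True,
        'IgnorePublicAcls': True,
        'BlockPublicPolicy': True,
        'RestrictPublicBuckets': True,
    },
)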

In this example, we created a private bucket metaflow-s3. While configuring Metaflow through metaflow configure aws, we can set the following values when prompted -

METAFLOW_DATASTORE_SYSROOT_S3 = s3://metaflow-s3/flows
METAFLOW_DATATOOLS_S3ROOT = s3://metaflow-s3/data-tools
METAFLOW_DEFAULT_DATASTORE=s3

Compute

Metaflow currently supports scaling compute via AWS Batch. Metaflow orchestrates this compute by leveraging Amazon S3 as the storage layer for code artifacts. If you want to use AWS Batch, you would have to configure Amazon S3 first following the instructions listed previously.

Once you have set up your Amazon S3 bucket, you would need to set up an AWS Batch job queue and an IAM role that has permission to access Amazon S3 (and other AWS services). Jobs launched via Metaflow on AWS Batch will assume this role so that they can communicate with Amazon S3.

Before you can create a job queue, you would need to set up a compute environment that your job queue will eventually execute jobs on. There are many ways to create a compute environment depending on your specific use case. In most cases, a managed compute environment is sufficient, where AWS manages your compute resources with sensible defaults. But, if you prefer, you can create an unmanaged compute environment by following these instructions from AWS. If you want to know how to create a managed compute environment, read on!

Compute resources in your compute environments need external network access to communicate with the Amazon ECS service endpoint. However, you might have jobs that you would like to run in private subnets. Creating a VPC with both public and private subnets provides you the flexibility to run jobs in either a public or private subnet. Jobs in the private subnets can access the internet through a NAT gateway. Before creating an AWS Batch compute environment, we would need to create a VPC with both public and private subnets.

Create a VPC

  1. Run the VPC Wizard

    1. Open the Amazon VPC console and in the left navigation pane, choose VPC Dashboard.

    2. Choose Launch VPC Wizard, VPC with Public and Private Subnets, Select.

    3. For VPC name, give your VPC a unique name.

    4. For Elastic IP Allocation ID, choose the ID of an Elastic IP address in your account (allocate one in the Amazon VPC console first if you don't have one); the NAT gateway for the private subnet uses this address.

    5. Choose Create VPC.

    6. When the wizard is finished, choose OK. Note the Availability Zone in which your VPC subnets were created. Your additional subnets should be created in a different Availability Zone. These subnets are not auto-assigned public IPv4 addresses. Instances launched in the public subnet must be assigned a public IPv4 address to communicate with the Amazon ECS service endpoint.

  2. Create Additional Subnets

    The wizard in Step 1 creates a VPC with one public subnet and one private subnet in a single Availability Zone. For greater availability, you should create at least one more public subnet in a different Availability Zone so that your VPC has public subnets across two Availability Zones. To create an additional public subnet:

    1. In the left navigation pane, choose Subnets and then Create Subnet.

    2. For Name tag, enter a name for your subnet, such as Public subnet.

    3. For VPC, choose the VPC that you created earlier.

    4. For Availability Zone, choose an Availability Zone different from the one used by the subnets that the wizard created.

    5. For IPv4 CIDR block, enter a valid CIDR block. For example, the wizard creates CIDR blocks in 10.0.0.0/24 and 10.0.1.0/24 by default. You could use 10.0.2.0/24 for your second public subnet.

    6. Choose Yes, Create.

    7. Select the public subnet that you just created and choose Route Table, Edit.

    8. By default, the private route table is selected. Choose the other available route table so that the 0.0.0.0/0 destination is routed to the internet gateway (igw-xxxxxxxx) and choose Save.

    9. With your second public subnet still selected, choose Subnet Actions, Modify auto-assign IP settings.

    10. Select Enable auto-assign public IPv4 address and choose Save, Close.

Amazon VPC with subnets
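
For reference, the additional public subnet from step 2 can also be created with boto3. A sketch, assuming the VPC ID, public route table ID, and Availability Zone are taken from your own wizard output (the values below are placeholders):

import boto3

ec2 = boto3.client('ec2')

vpc_id = 'vpc-xxxxxxxx'                   # placeholder: the wizard's VPC
public_route_table_id = 'rtb-xxxxxxxx'    # placeholder: the route table with the 0.0.0.0/0 -> igw route

subnet = ec2.create_subnet(VpcId=vpc_id,
                           CidrBlock='10.0.2.0/24',
                           AvailabilityZone='us-east-1b')['Subnet']

# Associate the public route table and auto-assign public IPv4 addresses.
ec2.associate_route_table(RouteTableId=public_route_table_id,
                          SubnetId=subnet['SubnetId'])
ec2.modify_subnet_attribute(SubnetId=subnet['SubnetId'],
                            MapPublicIpOnLaunch={'Value': True})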

Create a managed AWS Batch compute environment

  1. Open the AWS Batch console and from the navigation bar, select the region to use.

  2. In the navigation pane, choose Compute environments, Create environment.

  3. Configure the environment.

    1. For Compute environment type, choose Managed.

    2. For Compute environment name, specify a unique name for your compute environment.

    3. For Service role, choose to have AWS create a new role for you.

    4. For Instance role, choose to have AWS create a new instance profile for you.

    5. For EC2 key pair, leave it empty.

    6. Ensure that Enable compute environment is selected so that your compute environment can accept jobs from the AWS Batch job scheduler.

  4. Configure your compute resources.

    1. For Provisioning model, choose On-Demand to launch Amazon EC2 On-Demand Instances or Spot to use Amazon EC2 Spot Instances.

    2. If you chose to use Spot Instances:

      1. (Optional) For Maximum Price, choose the maximum percentage that a Spot Instance price can be when compared with the On-Demand price for that instance type before instances are launched.

      2. For Spot fleet role, choose an existing Amazon EC2 Spot Fleet IAM role to apply to your Spot compute environment.

    3. For Allowed instance types, choose the Amazon EC2 instance types that may be launched. You can specify instance families to launch any instance type within those families (for example, c5, c5n, or p3), or you can specify specific sizes within a family (such as c5.8xlarge). You can also choose optimal to pick instance types (from the C, M, and R instance families) on the fly that match the demand of your job queues. In order to use GPU scheduling, the compute environment must include instance types from the P or G families.

    4. For Allocation strategy, choose the allocation strategy to use when selecting instance types from the list of allowed instance types. For more information, see Allocation Strategies.

    5. For Minimum vCPUs, choose the minimum number of EC2 vCPUs that your compute environment should maintain, regardless of job queue demand.

    6. For Desired vCPUs, choose the number of EC2 vCPUs that your compute environment should launch with.

    7. For Maximum vCPUs, choose the maximum number of EC2 vCPUs that your compute environment can scale out to, regardless of job queue demand.

    8. (Optional) Check Enable user-specified AMI ID to use your own custom AMI. For more information on custom AMIs, see Compute Resource AMIs.

      1. For AMI ID, paste your custom AMI ID and choose Validate AMI.

  5. Configure networking.

    1. For VPC ID, choose the VPC which you created earlier.

    2. For Subnets, choose which subnets in the selected VPC should host your instances. By default, all subnets within the selected VPC are chosen.

    3. For Security groups, choose a security group to attach to your instances. By default, the default security group for your VPC is chosen.

  6. (Optional) Tag your instances to make it easier to recognize your AWS Batch instances in the Amazon EC2 console.

  7. Choose Create to finish.

AWS Batch compute environment
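
The same managed compute environment can be created with boto3. A sketch, assuming an ecsInstanceRole instance profile already exists and using placeholder subnet, security group, and account IDs:

import boto3

batch = boto3.client('batch')

batch.create_compute_environment(
    computeEnvironmentName='metaflow-compute-env',    # placeholder name
    type='MANAGED',
    state='ENABLED',
    computeResources={
        'type': 'EC2',                                # or 'SPOT'
        'minvCpus': 0,
        'desiredvCpus': 0,
        'maxvCpus': 64,
        'instanceTypes': ['optimal'],
        'subnets': ['subnet-aaaaaaaa', 'subnet-bbbbbbbb'],   # placeholders
        'securityGroupIds': ['sg-xxxxxxxx'],                 # placeholder
        'instanceRole': 'arn:aws:iam::123456789012:instance-profile/ecsInstanceRole',
    },
    # If serviceRole is omitted, AWS Batch uses its service-linked role.
)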

Create an AWS Batch job queue

  1. Open the AWS Batch console and select the region to use.

  2. In the navigation pane, choose Job queues, Create queue.

  3. For Queue name, enter a unique name for your job queue.

  4. Ensure that Enable job queue is selected so that your job queue can accept job submissions.

  5. For Priority, enter an integer value for the job queue's priority. Job queues with a higher integer value are evaluated first when associated with the same compute environment.

  6. In the Connected compute environments for this queue section, select the compute environment that you just created.

  7. Choose Create to finish and create your job queue.

AWS Batch job queue

In this example, we create an AWS Batch job queue metaflow-queue following the steps listed above.
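
Scripted with boto3, the equivalent would be roughly (the compute environment name must match the one created above):

import boto3

batch = boto3.client('batch')

batch.create_job_queue(
    jobQueueName='metaflow-queue',
    state='ENABLED',
    priority=1,
    computeEnvironmentOrder=[
        {'order': 1, 'computeEnvironment': 'metaflow-compute-env'},   # placeholder name
    ],
)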

Create an IAM role for AWS Batch

  1. Open the IAM console and in the navigation pane, choose Roles, Create role.

  2. For Select type of trusted entity section, choose AWS service.

  3. For Choose the service that will use this role, choose Elastic Container Service.

  4. For Select your use case, choose Elastic Container Service Task and choose Next: Permissions.

  5. Next, we will create policies for Amazon S3 (and, if you intend to use AWS Step Functions, Amazon DynamoDB) and attach them to this role

    1. Amazon S3 for data storage

      1. Choose Create Policy to open a new window.

      2. Use the visual service editor to create the policy

        1. For Service, choose S3.

        2. For Actions, add GetObject, PutObject, DeleteObject, and ListBucket as allowed actions

        3. For Resources, under bucket, enter the name of the bucket you created earlier. For object, use the same bucket name and choose Any for the object name. Choose Save changes.

        4. Choose Review policy. On the Review policy page, for Name type your own unique name and choose Create policy to finish.

    2. Amazon DynamoDB - Metaflow uses a DynamoDB table to track execution information for certain steps within AWS Step Functions. If you intend to use AWS Step Functions, you would need to create a policy for Amazon DynamoDB as well.

      1. In the original pane (in Step 4.), Choose Create Policy to open a new window.

      2. Use the visual service editor to create the policy

        1. For Service, choose DynamoDB.

        2. For Actions, add PutItem, GetItem, DeleteItem, and UpdateItem as allowed actions

        3. For Resources, for region, enter the region in which you will create your Amazon DynamoDB table, and for table name, use the name of the table that you will create later while configuring AWS Step Functions. Choose Save changes.

        4. Choose Review policy. On the Review policy page, for Name type your own unique name and choose Create policy to finish.

  6. Click the refresh button in the original pane (in Step 4.) and choose the policies that you just created (in Step 5.). Choose Next: Tags.

  7. For Add tags (optional), enter any metadata tags you want to associate with the IAM role, and then choose Next: Review.

  8. For Role name, enter a name for your role and then choose Create role to finish. Note the ARN of the IAM role you just created.

Amazon S3 policy
Amazon DynamoDB policy
IAM role for AWS Batch
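
If you prefer to script this step, a boto3 sketch along the same lines (the role name metaflow-batch-role and the bucket name metaflow-s3 follow the examples in this article; add the DynamoDB policy the same way if you plan to use AWS Step Functions):

import json
import boto3

iam = boto3.client('iam')

# AWS Batch jobs run as ECS tasks, so ECS tasks must be able to assume the role.
trust = {
    'Version': '2012-10-17',
    'Statement': [{'Effect': 'Allow',
                   'Principal': {'Service': 'ecs-tasks.amazonaws.com'},
                   'Action': 'sts:AssumeRole'}],
}
iam.create_role(RoleName='metaflow-batch-role',
                AssumeRolePolicyDocument=json.dumps(trust))

# Inline policy granting access to the Metaflow bucket.
s3_policy = {
    'Version': '2012-10-17',
    'Statement': [
        {'Effect': 'Allow',
         'Action': ['s3:GetObject', 's3:PutObject', 's3:DeleteObject'],
         'Resource': 'arn:aws:s3:::metaflow-s3/*'},
        {'Effect': 'Allow',
         'Action': 's3:ListBucket',
         'Resource': 'arn:aws:s3:::metaflow-s3'},
    ],
}
iam.put_role_policy(RoleName='metaflow-batch-role',
                    PolicyName='metaflow-s3-access',
                    PolicyDocument=json.dumps(s3_policy))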

In this example, we created an AWS Batch job queue metaflow-queue and an IAM role metaflow-batch-role. While configuring Metaflow through metaflow configure aws, we can set the following values when prompted:

METAFLOW_BATCH_JOB_QUEUE = metaflow-queue
METAFLOW_ECS_S3_ACCESS_IAM_ROLE = arn:aws:iam::xxxxxxxxxx:role/metaflow-batch-role

Metaflow allows setting up some additional defaults for the Docker image that AWS Batch jobs execute with. By default, an appropriate Python image (matching the minor version of the Python interpreter used to launch the flow) is pulled from Docker Hub. You can modify this behavior by pointing to either a specific image or a specific Docker image repository using the following variables:

METAFLOW_BATCH_CONTAINER_REGISTRY = foo
METAFLOW_BATCH_CONTAINER_IMAGE = bar
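
These defaults can also be overridden per step: the @batch decorator accepts an image argument alongside cpu and memory. A small illustrative flow (the image name is a placeholder):

from metaflow import FlowSpec, batch, step

class HelloBatchFlow(FlowSpec):

    @batch(image='python:3.9', cpu=1, memory=4000)   # placeholder image; memory is in MB
    @step
    def start(self):
        print('running on AWS Batch in the image specified above')
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    HelloBatchFlow()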

Metadata

Metaflow ships with a metadata service that tracks all flow executions. This service is an aiohttp service with a SQL datastore as a backend. At a high level, it can be thought of as an index on top of all the data that Metaflow stores in its datastore, which allows users to easily share their results and collaborate with their peers. Deploying this service is not strictly necessary; you can still use Amazon S3 as your storage backend and execute your flows on AWS Batch without it. But for any production deployment, we highly recommend deploying the metadata service since it makes it easy to monitor the state of the Metaflow universe.

The metadata service is available as a Docker image on Docker Hub. There are many ways to deploy this service within AWS. Here we detail the steps to deploy it on AWS Fargate with a PostgreSQL database in Amazon RDS.

Create a VPC

We will deploy the service within a VPC. You can use the same VPC that we created for AWS Batch earlier or, if you don't intend to use AWS Batch, you can create a VPC following the same set of steps.

Create Security Groups

We will create two security groups, one for the AWS Fargate cluster and another for the AWS RDS instance.

  1. Open the EC2 console and from the navigation bar, select the region to use.

  2. Choose Security Groups under Resources.

  3. You will notice that a security group already exists for the VPC that you created previously. Choose Create security group to create a security group for the AWS Fargate cluster that we will create shortly.

  4. Pick a name for your security group for Security group name, add a Description and select the VPC that you created previously under VPC.

  5. For Inbound rules,

    1. Select Custom TCP for Type.

    2. Use 8080 for Port range.

    3. Select Anywhere for Source type.

    4. Choose Add rule and select Custom TCP for Type.

    5. Use 8082 for Port range. This is needed for the migration service to work.

    6. Select Anywhere for Source type.

  6. For Outbound rules,

    1. Select All traffic for Type.

    2. Select Custom for Destination type.

    3. Select 0.0.0.0/0 for Destination.

  7. Choose Create security group.

  8. Take note of the ID of the security group.

  9. Next, we will create a security group for the AWS RDS instance. Choose Copy to new security group.

  10. Pick a name for your security group for Security group name and add a Description. The correct VPC is already selected for you.

  11. For Inbound rules, instead of Custom TCP for Type, choose PostgreSQL and under Source, choose the security group from Step 8.

  12. Choose Create security group and take note of the ID.

Security Group for AWS Fargate
Security Group for AWS RDS
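
A boto3 sketch of the AWS Fargate security group follows (the VPC ID is a placeholder; the AWS RDS group is created the same way, with a single PostgreSQL inbound rule referencing this group's ID instead of the two TCP rules):

import boto3

ec2 = boto3.client('ec2')

sg = ec2.create_security_group(GroupName='metaflow-fargate-sg',       # placeholder name
                               Description='Metaflow metadata service',
                               VpcId='vpc-xxxxxxxx')                   # placeholder VPC ID

# Allow inbound traffic on 8080 (the service) and 8082 (the migration service).
ec2.authorize_security_group_ingress(
    GroupId=sg['GroupId'],
    IpPermissions=[
        {'IpProtocol': 'tcp', 'FromPort': 8080, 'ToPort': 8080,
         'IpRanges': [{'CidrIp': '0.0.0.0/0'}]},
        {'IpProtocol': 'tcp', 'FromPort': 8082, 'ToPort': 8082,
         'IpRanges': [{'CidrIp': '0.0.0.0/0'}]},
    ],
)
# New security groups already allow all outbound traffic to 0.0.0.0/0 by default.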

Create an AWS RDS instance

  1. Open the RDS console and from the navigation bar, select the region to use.

  2. Choose Subnet Groups under Resources.

  3. Choose Create DB Subnet Group to create a DB Subnet Group for your RDS instance within the VPC that you created earlier.

  4. Pick a name for your subnet group for Name, add a Description and select the VPC that you created previously under VPC.

  5. Choose Add all the subnets related to this VPC under Add subnets.

  6. Choose Create.

  7. Choose Databases on the left side pane and then choose Create database.

    1. Choose Standard Create in Choose a database creation method.

    2. In Engine options, choose PostgreSQL for Engine type. Leave the Version untouched.

    3. In Templates, choose Production.

    4. In Settings,

      1. Pick a unique name for DB instance identifier.

      2. Under Credentials Settings,

        1. Pick a Master username.

        2. Pick a Master password and use the same in Confirm password.

    5. In DB instance size, pick the instance you are most comfortable with. Burstable classes are a good option.

    6. In Storage,

      1. Under Storage type, choose General Purpose (SSD).

      2. Allocate initial storage in Allocated storage. 100GiB is a good start.

      3. Check that Enable storage autoscaling is enabled. This will allow your instance to scale up when it runs out of storage.

      4. Choose maximum storage for Maximum storage threshold. Your instance will scale up to a maximum of this limit. 1000GiB is a good number to begin with.

    7. In Availability & durability, ensure that Create a standby instance is enabled.

    8. In Connectivity,

      1. In Virtual private cloud (VPC), choose the VPC that you created previously.

      2. In Additional connectivity configuration

        1. For Subnet group, choose the subnet group you created in Step 6.

        2. Select No under Publicly accessible.

        3. Under VPC security group, choose Choose existing and add both the security groups that you created previously in addition to the default security group.

        4. Choose 5432 as the Database port.

    9. In Database authentication, enable Password authentication.

    10. In Additional configuration,

      1. Under Database options,

        1. Set metaflow as Initial database name.

        2. Select default.postgres11 as the DB parameter group.

      2. Under Backup,

        1. Enable Enable automatic backups

        2. Choose a Backup retention period

        3. Choose a Backup window if you so wish, otherwise check No preference.

        4. Check Copy tags to snapshots.

      3. Under Performance Insights, if you so wish, enable Enable Performance Insights.

      4. Under Retention Period,

        1. Choose a Retention period.

        2. Leave the Master key to default.

      5. Under Monitoring,

        1. Choose 60 seconds for Granularity.

        2. Set default for the Monitoring Role.

      6. Under Log exports, you can choose to export the PostgreSQL log, the Upgrade log, or both.

      7. Under Maintenance, choose Enable auto minor version upgrade if you so wish to.

      8. Under Deletion protection, enable Enable deletion protection.

    11. Choose Create database.

    12. Once the database spins up, note the Endpoint & port under Connectivity & security.

AWS RDS DB subnet group
AWS RDS instance
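
A condensed boto3 sketch of the subnet group and database follows; all identifiers, credentials, and subnet/security group IDs are placeholders, and the sizes mirror the suggestions above:

import boto3

rds = boto3.client('rds')

rds.create_db_subnet_group(
    DBSubnetGroupName='metaflow-db-subnet-group',       # placeholder
    DBSubnetGroupDescription='Subnets for the Metaflow database',
    SubnetIds=['subnet-aaaaaaaa', 'subnet-bbbbbbbb'],    # placeholders
)

rds.create_db_instance(
    DBInstanceIdentifier='metaflow-db',                  # placeholder
    Engine='postgres',
    DBInstanceClass='db.t3.medium',                      # a burstable class
    MasterUsername='metaflow',                           # placeholder credentials
    MasterUserPassword='choose-a-strong-password',
    DBName='metaflow',
    Port=5432,
    AllocatedStorage=100,
    MaxAllocatedStorage=1000,
    StorageType='gp2',
    MultiAZ=True,                                        # create a standby instance
    PubliclyAccessible=False,
    DBSubnetGroupName='metaflow-db-subnet-group',
    VpcSecurityGroupIds=['sg-fargate', 'sg-rds'],        # placeholder security group IDs
    DeletionProtection=True,
)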

Create an IAM role for ECS Fargate Service

  1. Open the IAM console and in the navigation pane, choose Roles, Create role.

  2. For Select type of trusted entity section, choose AWS service.

  3. For Choose the service that will use this role, choose Elastic Container Service.

  4. For Select your use case, choose Elastic Container Service Task and choose Next: Permissions.

  5. Choose AmazonECSTaskExecutionRolePolicy.

  6. Choose Next: Tags.

  7. For Add tags (optional), enter any metadata tags you want to associate with the IAM role, and then choose Next: Review.

  8. For Role name, enter a name for your role and then choose Create role to finish. Note the ARN of the IAM role you just created.

IAM role for AWS Fargate Cluster
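
Equivalently with boto3 (the role name is a placeholder):

import json
import boto3

iam = boto3.client('iam')

trust = {
    'Version': '2012-10-17',
    'Statement': [{'Effect': 'Allow',
                   'Principal': {'Service': 'ecs-tasks.amazonaws.com'},
                   'Action': 'sts:AssumeRole'}],
}
iam.create_role(RoleName='metaflow-ecs-execution-role',   # placeholder name
                AssumeRolePolicyDocument=json.dumps(trust))
iam.attach_role_policy(
    RoleName='metaflow-ecs-execution-role',
    PolicyArn='arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy',
)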

Create an AWS Fargate Cluster

  1. Open the ECS console and from the navigation bar, select the region to use.

  2. Choose Create Cluster under Clusters.

  3. Choose Networking only, Next step.

  4. Pick a name for Cluster name. Don't enable Create VPC. We will use the VPC we have created previously. You can choose to check Enable Container Insights. Choose Create.

  5. Choose View Cluster and choose Task Definitions on the left side pane.

  6. Choose Create new Task Definition, Fargate and Next step.

    1. Under Configure task and container definitions,

      1. Choose a Task Definition Name.

      2. For Task Role, choose the IAM role that you created above.

    2. Under Task execution IAM role, set the Task execution role to ecsTaskExecutionRole if you have it already. Leave it empty otherwise.

    3. Under Task size,

      1. Choose 4 GB for Task memory (GB)

      2. Choose 1 vCPU for Task CPU (vCPU).

    4. Under Container Definitions, choose Add container

      1. Set metaflow-service as the Container name.

      2. Set netflixoss/metaflow_metadata_service as the Image.

      3. Leave other options as is.

      4. Under Advanced container configuration, in Environment variables add the following values

        1. Set Key as MF_METADATA_DB_HOST and the Value as the endpoint value in Step 12. while creating the AWS RDS instance.

        2. Set Key as MF_METADATA_DB_NAME and the Value as metaflow.

        3. Set Key as MF_METADATA_DB_PORT and the Value as 5432.

        4. Set Key as MF_METADATA_DB_USER and the Value as the master username from Step 7.4.2.1. while creating the AWS RDS instance.

        5. Set Key as MF_METADATA_DB_PSWD and the Value as the master password from Step 7.4.2.2. while creating the AWS RDS instance.

      5. Choose Add.

    5. Choose Create.

  7. Choose Clusters in the left side pane and select the cluster you created in Step 4.

  8. Choose Create under Services,

    1. Choose Fargate as Launch type.

    2. Choose the task definition that you created in Step 6. for Task Definition. Pick the latest for Revision.

    3. For Platform version choose Latest.

    4. Leave the Cluster as is (pointing to the cluster that you are configuring).

    5. Pick a name for Service name.

    6. Pick a number for Number of tasks. In this example we will use 1.

    7. Choose Rolling update for Deployment type.

  9. Choose Next step.

  10. For Configure network,

    1. For Cluster VPC, choose the VPC that you have created previously.

    2. Choose all public subnets in that VPC for Subnets.

    3. Choose the two Security groups that you have created previously.

  11. For Load balancing, choose None as Load balancer type. If you would like help setting up a load balancer, please reach out to us.

  12. Choose Next step.

  13. You can configure Service Auto Scaling if you want to do so. We will skip that for now.

  14. Choose Next step and Create Service.

  15. Choose View Service and wait for the task to get to the running state.

  16. Choose the task and copy the Public IP. You can verify that your service is up and running by curling the ping endpoint - curl xxx.xxx.xxx.xxx:8080/ping. You should expect pong as the response. This public IP with port 8080 is the URL of the metadata service.

AWS Fargate cluster
AWS Fargate cluster
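
For reference, the task definition and service configured above map onto the following boto3 sketch; the cluster name, role ARN, subnet and security group IDs, and database values are placeholders that should match what you created earlier:

import boto3

ecs = boto3.client('ecs')

task_def = ecs.register_task_definition(
    family='metaflow-service',
    requiresCompatibilities=['FARGATE'],
    networkMode='awsvpc',
    cpu='1024',      # 1 vCPU
    memory='4096',   # 4 GB
    executionRoleArn='arn:aws:iam::123456789012:role/metaflow-ecs-execution-role',   # placeholder
    containerDefinitions=[{
        'name': 'metaflow-service',
        'image': 'netflixoss/metaflow_metadata_service',
        'portMappings': [{'containerPort': 8080}, {'containerPort': 8082}],
        'environment': [
            {'name': 'MF_METADATA_DB_HOST', 'value': 'your-rds-endpoint'},        # placeholder
            {'name': 'MF_METADATA_DB_NAME', 'value': 'metaflow'},
            {'name': 'MF_METADATA_DB_PORT', 'value': '5432'},
            {'name': 'MF_METADATA_DB_USER', 'value': 'your-master-username'},      # placeholder
            {'name': 'MF_METADATA_DB_PSWD', 'value': 'your-master-password'},      # placeholder
        ],
    }],
)

ecs.create_service(
    cluster='metaflow-metadata-service',       # the Fargate cluster created above
    serviceName='metaflow-service',
    taskDefinition=task_def['taskDefinition']['taskDefinitionArn'],
    desiredCount=1,
    launchType='FARGATE',
    networkConfiguration={'awsvpcConfiguration': {
        'subnets': ['subnet-aaaaaaaa', 'subnet-bbbbbbbb'],   # public subnets in your VPC
        'securityGroups': ['sg-fargate', 'sg-rds'],          # placeholder security group IDs
        'assignPublicIp': 'ENABLED',
    }},
)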

In this example, we created an AWS Fargate cluster metaflow-metadata-service. While configuring Metaflow through metaflow configure aws, we can set the following values when prompted:

METAFLOW_SERVICE_URL = http://xxx.xxx.xxx.xxx:8080
METAFLOW_DEFAULT_METADATA = service

The metadata service in this example is exposed to the internet. Ideally, you would want to put this service behind an API gateway and use authentication in front of it. The AWS CloudFormation template does that automatically for you. If you need help with manual installation, please get in touch.

Scheduling

Using Metaflow, workflows can be directly scheduled on AWS Step Functions. Moreover, from within Metaflow, time-based triggers can be set to execute these deployed workflows via Amazon EventBridge. Metaflow currently also has a dependency on Amazon DynamoDB for tracking metadata for executing specific steps (foreaches) on AWS Step Functions.
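
For example, a flow can declare a time-based trigger with the @schedule decorator and then be deployed to and triggered on AWS Step Functions from the command line (the flow and file names are illustrative):

from metaflow import FlowSpec, schedule, step

@schedule(daily=True)   # run once a day via Amazon EventBridge
class NightlyFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    NightlyFlow()

# Deploy to AWS Step Functions:   python nightlyflow.py step-functions create
# Trigger a run immediately:      python nightlyflow.py step-functions trigger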

Create an IAM role for AWS Step Functions

  1. Open the IAM console and in the navigation pane, choose Roles, Create role.

  2. For Select type of trusted entity section, choose AWS service.

  3. For Choose the service that will use this role, choose Step Functions.

  4. For Select your use case, choose Step Functions and choose Next: Permissions.

  5. Choose Next: Tags.

  6. For Add tags (optional), enter any metadata tags you want to associate with the IAM role, and then choose Next: Review.

  7. For Role name, enter a name for your role and then choose Create role to finish.

  8. In the IAM console, choose Roles and select the role you created in Step 7.

  9. Choose Attach policies. Attach AmazonS3FullAccess, AWSBatchFullAccess, AmazonDynamoDBFullAccess, CloudWatchFullAccess and AmazonEventBridgeFullAccess policies. Please note that Metaflow doesn't need full access to any of these resources and the CloudFormation template tracks the exact set of permissions needed. Please reach out to us if you need any assistance.

  10. Click on Attach policy and note the ARN of the role created.

Create an IAM role for Amazon EventBridge

  1. Open the IAM console and in the navigation pane, choose Roles, Create role.

  2. For Select type of trusted entity section, choose AWS service.

  3. For Choose the service that will use this role, choose CloudWatch Events.

  4. For Select your use case, choose CloudWatch Events and choose Next: Permissions.

  5. Choose Next: Tags.

  6. For Add tags (optional), enter any metadata tags you want to associate with the IAM role, and then choose Next: Review.

  7. For Role name, enter a name for your role and then choose Create role to finish.

  8. In the IAM console, choose Roles and select the role you created in Step 7.

  9. Choose Attach policies. Attach AWSStepFunctionsFullAccess policy. Please note that Metaflow doesn't need full access to this resource and the CloudFormation template tracks the exact set of permissions needed. Please reach out to us if you need any assistance.

  10. Click on Attach policy and note the ARN of the role created.

Create an Amazon DynamoDB table

  1. Open the DynamoDB console and choose Dashboard from the left side pane.

  2. Choose Create Table.

  3. Choose a name for Table name, use pathspec as Primary key and choose String from the dropdown right next to it. Keep Add sort key unchecked.

  4. Choose Create. After the table has been created, under Table details, choose Manage TTL to open the Enable TTL dialog.

  5. Choose ttl for TTL attribute and choose Continue.
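
The equivalent boto3 calls are sketched below; the table name is a placeholder, and on-demand billing is used here for simplicity (the console steps above use the provisioned default):

import boto3

ddb = boto3.client('dynamodb')

table = 'metaflow-step-functions-state'    # placeholder table name

ddb.create_table(
    TableName=table,
    AttributeDefinitions=[{'AttributeName': 'pathspec', 'AttributeType': 'S'}],
    KeySchema=[{'AttributeName': 'pathspec', 'KeyType': 'HASH'}],
    BillingMode='PAY_PER_REQUEST',
)

# Wait for the table to become ACTIVE, then enable TTL on the 'ttl' attribute.
ddb.get_waiter('table_exists').wait(TableName=table)
ddb.update_time_to_live(
    TableName=table,
    TimeToLiveSpecification={'Enabled': True, 'AttributeName': 'ttl'},
)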

While configuring Metaflow through metaflow configure aws, we can set the following values when prompted:

METAFLOW_SFN_IAM_ROLE = [Full ARN of IAM role for AWS Step Functions]
METAFLOW_EVENTS_SFN_ACCESS_IAM_ROLE = [Full ARN of IAM role for AWS EventBridge]
METAFLOW_SFN_DYNAMO_DB_TABLE = [DynamoDB table name]

And that's it! Now you should have a full-blown set-up for using all the cloud functionality of Metaflow! In case you need any help, get in touch.