⚠️ Warning: Manual EMR Studio Creation Required
This Terraform script provisions the backend infrastructure for your EMR Serverless jobs (S3 buckets, IAM roles, etc.). However, it does not create the EMR Studio, the interactive web-based IDE used for development.
After you successfully run terraform apply, you must create the EMR Studio manually using the AWS Console (web UI). You can follow the official AWS tutorial to create an EMR Studio.
Infrastructure Setup with Terraform
This guide explains how to use the provided Terraform scripts to provision the AWS infrastructure for this project. Following these steps will create the S3 buckets, the ECR repository, the Glue database, and all the required IAM roles and permissions.
Understanding the Terraform Files
The infrastructure is defined across three main files in the terraform/ directory.
Here's a breakdown of what each file does:

- `variables.tf`: This file is where you define the input variables for your infrastructure. Think of them as parameters you can pass to your Terraform scripts to customise your deployment. For example, you can change the AWS region, S3 bucket name, or environment (dev, staging, prod) by modifying the variables in this file or by creating a `terraform.tfvars` file to override them.
- `main.tf`: This is the core of your infrastructure definition. It contains the main set of instructions that tell Terraform what to create. It uses the variables from `variables.tf` to configure the resources. In this project, `main.tf` defines:
  - An S3 bucket for job artifacts and logs.
  - A dedicated S3 bucket for the Iceberg data warehouse.
  - An ECR repository for Docker images.
  - An AWS Glue Database to act as the Iceberg catalogue.
  - The IAM roles and policies needed for EMR Serverless to access these resources.
- `outputs.tf`: This file declares the output values that you want to be easily accessible after Terraform has created the infrastructure. When you run `terraform apply`, the values of the outputs defined in this file will be printed to your console. This is particularly useful for getting the ARNs and names of the resources you've just created, which you'll need for your `.env` file.
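Besides editing `variables.tf` or adding a `terraform.tfvars` file, Terraform also reads any environment variable named `TF_VAR_<name>` as the value of the input variable `<name>`. The sketch below assumes the two bucket variables used in this guide's `terraform.tfvars` example (`s3_artifact_bucket_name` and `iceberg_s3_bucket_name`); the helper name `export_bucket_overrides` and the naming scheme it applies are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: export TF_VAR_* overrides for the two bucket variables.
# Terraform maps TF_VAR_<name> to the input variable <name>.
export_bucket_overrides() {
  local suffix="$1"   # any suffix that keeps the bucket names unique
  if [ -z "$suffix" ]; then
    echo "usage: export_bucket_overrides <suffix>" >&2
    return 1
  fi
  # The "emr-artifacts-" and "iceberg-data-" prefixes are illustrative only.
  export TF_VAR_s3_artifact_bucket_name="emr-artifacts-${suffix}"
  export TF_VAR_iceberg_s3_bucket_name="iceberg-data-${suffix}"
}
```

Calling `export_bucket_overrides my-team` before `terraform plan` would set both variables for that shell session. Note that values in a `terraform.tfvars` file take precedence over `TF_VAR_` environment variables, so this approach suits cases where no such file is present.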
Terraform vs. the Deployment Script: What's the Difference?
It is important to understand the distinct roles of Terraform and the `deploy-to-emr` script:

- Terraform (Infrastructure Setup): The Terraform script is responsible for provisioning the foundational, long-lived AWS resources. This includes the IAM roles, S3 buckets, ECR repository, and Glue Database. You typically run this once to set up your environment.
- `deploy-to-emr` script (Application Deployment): The `uv run deploy-to-emr --create-app` command creates the EMR Serverless application itself. This application is tightly coupled to your code and a specific Docker image version. Since the image URI changes with each new build, managing the application lifecycle through the deployment script provides more speed and flexibility for day-to-day development and deployment.
This separation ensures that your stable infrastructure is managed consistently, while your application can be updated and deployed rapidly.
Prerequisites
Before you begin, make sure you have the following installed and configured:
- Terraform: Install Terraform on your local machine.
- AWS CLI: Install and configure the AWS CLI with credentials for your AWS account. You should have permissions to create S3 buckets, ECR repositories, Glue databases, and IAM roles.
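A quick preflight check can confirm that both tools are on your `PATH` before you start. This is a minimal sketch; the function name `check_prereqs` is made up for illustration and is not part of the project.

```bash
#!/usr/bin/env bash
# Report any required command-line tools that are missing from PATH.
# Returns 0 when every tool is found, 1 otherwise.
check_prereqs() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_prereqs terraform aws` prints nothing and returns 0 when both are installed. To confirm that your credentials resolve to the expected account, `aws sts get-caller-identity` is a useful follow-up.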
Provisioning the Infrastructure
The Terraform scripts are located in the terraform/ directory.
- Navigate to the Terraform directory:

  ```bash
  cd terraform
  ```

- Initialise Terraform: This command downloads the necessary provider plug-ins. You need to run it once per working directory, and again if the provider or backend configuration changes.

  ```bash
  terraform init
  ```

- (Optional) Customise Variables: Open the `variables.tf` file to review the available options. You can change the default values here, or create a `terraform.tfvars` file to override them for your specific environment. For example, to change the S3 bucket names, you could create a `terraform.tfvars` file with the following content:

  ```terraform
  s3_artifact_bucket_name = "my-unique-emr-artifacts"
  iceberg_s3_bucket_name = "my-unique-iceberg-data"
  ```

- Plan the Deployment: This command shows you what resources Terraform will create, modify, or destroy. It's a great way to review changes before applying them.

  ```bash
  terraform plan
  ```

- Apply the Configuration: This command will create the resources in your AWS account.

  ```bash
  terraform apply
  ```

  Terraform will show you the plan again and ask for confirmation. Type `yes` and press Enter to proceed.
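If you would rather apply exactly the plan you reviewed, Terraform can save a plan to a file and apply that file later. The sketch below wraps this in two hypothetical helpers, `plan_infra` and `apply_infra`; the `TF_DIR` variable and the `tfplan` file name are assumptions for illustration. It uses Terraform's `-chdir` option so it can be run from the repository root.

```bash
#!/usr/bin/env bash
# Run terraform against the project's Terraform directory
# (defaults to ./terraform; override with TF_DIR).
tf() {
  terraform -chdir="${TF_DIR:-terraform}" "$@"
}

# Initialise, then write the proposed changes to a plan file for review.
plan_infra() {
  tf init && tf plan -out="${1:-tfplan}"
}

# Apply a previously saved plan file. Terraform does not ask for
# confirmation when given a saved plan, so review it first.
apply_infra() {
  tf apply "${1:-tfplan}"
}
```

A typical session would call `plan_infra`, read the printed plan, and then call `apply_infra`. Keeping the two steps separate preserves the review pause that the interactive `yes` prompt otherwise provides.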
Configuring Your .env File
After terraform apply completes successfully, it will print a list of outputs.
These outputs contain the names and ARNs of the resources that were just created.
You will use these values to create your .env file.
- Copy the example file: In the root of the repository, copy the `.env.example` file to a new file named `.env`.

  ```bash
  cp .env.example .env
  ```

- Update `.env` with Terraform outputs: Open the newly created `.env` file and replace the placeholder values with the outputs from the `terraform apply` command.

  Your `terraform apply` output will look something like this:

  ```
  Outputs:

  EMR_EXECUTION_ROLE = "arn:aws:iam::{YOUR_ACCOUNT_ID}:role/emr-spark-uv-emr-execution-role-dev"
  ICEBERG_GLUE_DB = "emr_serverless_iceberg_dev"
  ICEBERG_S3_BUCKET = "your-iceberg-data-bucket-dev"
  IMAGE_URI = "{YOUR_ACCOUNT_ID}.dkr.ecr.eu-west-2.amazonaws.com/emr-pyspark"
  S3_BUCKET = "your-artifacts-bucket-dev"
  ```

  Use these values to update your `.env` file accordingly:

  ```bash
  # .env
  REGION=eu-west-2
  S3_BUCKET="your-artifacts-bucket-dev" # <-- Use the S3_BUCKET output
  EMR_EXECUTION_ROLE="arn:aws:iam::{YOUR_ACCOUNT_ID}:role/emr-spark-uv-emr-execution-role-dev" # <-- Use the EMR_EXECUTION_ROLE output
  IMAGE_URI="{YOUR_ACCOUNT_ID}.dkr.ecr.eu-west-2.amazonaws.com/emr-pyspark:7.9.0" # <-- Use the IMAGE_URI output and add a tag

  # Iceberg Configuration
  ICEBERG_S3_BUCKET="your-iceberg-data-bucket-dev" # <-- Use the ICEBERG_S3_BUCKET output
  ICEBERG_GLUE_DB="emr_serverless_iceberg_dev" # <-- Use the ICEBERG_GLUE_DB output
  ...
  ```
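To avoid copying values by hand, the outputs can also be read back with `terraform output -raw <name>`. The following is a minimal sketch, assuming the five output names listed in this section. The function name `render_env` and the `TF_DIR` variable are hypothetical, and the region and image tag are passed in as arguments because they do not appear among the outputs.

```bash
#!/usr/bin/env bash
# Print .env lines built from Terraform outputs.
# Usage: render_env <region> <image-tag>
render_env() {
  local region="$1" image_tag="$2" tf_dir="${TF_DIR:-terraform}"
  local name value
  echo "REGION=${region}"
  for name in S3_BUCKET EMR_EXECUTION_ROLE IMAGE_URI ICEBERG_S3_BUCKET ICEBERG_GLUE_DB; do
    # -raw prints the bare string value without surrounding quotes.
    value="$(terraform -chdir="$tf_dir" output -raw "$name")" || return 1
    # The IMAGE_URI output carries no tag, so append the one supplied.
    if [ "$name" = "IMAGE_URI" ]; then
      value="${value}:${image_tag}"
    fi
    echo "${name}=\"${value}\""
  done
}
```

Running `render_env eu-west-2 7.9.0` from the repository root prints the lines to the terminal so you can paste them into `.env`. Redirecting the output straight into `.env` would overwrite any other settings carried over from `.env.example`, so printing first is the safer default.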
You are now ready to deploy your EMR Serverless application!
🧹 Cleaning Up
When you are finished with the infrastructure and want to avoid incurring further costs, you can destroy all the resources created by Terraform with a single command:
```bash
terraform destroy
```
A Note on Force Deletion: The ECR repository is configured with
`force_delete = true` and the S3 buckets are configured with `force_destroy = true`.
This means `terraform destroy` will delete these resources and all of their
contents (Docker images and S3 objects). This is convenient for development, but be
cautious if you ever use this script in a production environment, where you
might want to preserve data.
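One way to act on that caution is a small wrapper that refuses to destroy anything labelled as production. This is a hypothetical guard, not part of the project: the function name `destroy_unless_prod` and the `TF_DIR` variable are made up, and the environment label is supplied by the caller rather than read from Terraform.

```bash
#!/usr/bin/env bash
# Run `terraform destroy` only for non-production environment labels.
# Returns 1 for prod and 2 for any label it does not recognise.
destroy_unless_prod() {
  local env_name="$1" tf_dir="${TF_DIR:-terraform}"
  case "$env_name" in
    dev|staging)
      terraform -chdir="$tf_dir" destroy
      ;;
    prod)
      echo "refusing to destroy prod: buckets and images would be deleted" >&2
      return 1
      ;;
    *)
      echo "unknown environment: ${env_name}" >&2
      return 2
      ;;
  esac
}
```

For a dry run in any environment, `terraform plan -destroy` lists what would be removed without deleting anything.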