
πŸ—‚οΈ Business Workflow β€” Config in S3 with Local Fallback

A lightweight, non-technical playbook for keeping configs auditable, versioned, and tied to specific code releases, with automatic fallback to packaged defaults.


🔄 How Config Loads at Runtime

flowchart LR
  A[CLI Flag or .env] -->|CONFIG_S3_URI set| B[Load from S3 via boto3]
  B --> C[Log sha256 + ETag + VersionId]
  A -->|No S3 config| D[Use packaged config.toml]
  C --> E[Config available to Spark job]
  D --> E

📌 Store Configs in a Stable Location

Use a consistent key so jobs always know where to find the latest config:

s3://my-biz-configs/<env>/<pipeline>/config.toml

If you generate date-stamped configs, also upload the same file to .../config.toml as a stable alias.
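
As a rough illustration of the alias idea, the sketch below derives both destinations from the same inputs so the dated copy and the stable alias cannot drift apart. The function name `config_uris` and the `config-YYYY-MM-DD.toml` naming for dated files are assumptions made for this example; only the `<env>/<pipeline>/config.toml` layout comes from the text above.

```python
from datetime import date
from typing import List, Optional


def config_uris(env: str, pipeline: str, stamp: Optional[date] = None,
                bucket: str = "my-biz-configs") -> List[str]:
    """Return every S3 URI the same config file should be uploaded to.

    The stable alias is always included. A date-stamped URI is added
    only when a stamp is given (the dated file name is hypothetical).
    """
    prefix = f"s3://{bucket}/{env}/{pipeline}"
    uris = []
    if stamp is not None:
        uris.append(f"{prefix}/config-{stamp.isoformat()}.toml")
    uris.append(f"{prefix}/config.toml")  # stable alias that jobs read
    return uris
```

Uploading the file once per returned URI keeps `.../config.toml` identical to the newest dated copy.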


📌 Version by Release

On each deploy, copy the config to a release folder and pass that exact object to the job:

s3://my-biz-configs/<env>/<pipeline>/releases/<deployment_id>/config.toml

Then set --config-s3 (or CONFIG_S3_URI) to that URI; this ties the config to the specific code build.
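
A minimal sketch of the copy step, assuming the stable `config.toml` is the source and that `s3_client` is a boto3 S3 client (for example `boto3.client("s3")`). The helper name `pin_config_to_release` is hypothetical; `copy_object` is the standard boto3 call for a server-side copy.

```python
def release_key(env: str, pipeline: str, deployment_id: str) -> str:
    """Key of the per-release config, following the layout above."""
    return f"{env}/{pipeline}/releases/{deployment_id}/config.toml"


def pin_config_to_release(s3_client, bucket: str, env: str,
                          pipeline: str, deployment_id: str) -> str:
    """Copy the stable config into the release folder and return its URI.

    s3_client is expected to behave like boto3.client("s3").
    """
    source_key = f"{env}/{pipeline}/config.toml"
    target_key = release_key(env, pipeline, deployment_id)
    # Server-side copy: the object never leaves S3.
    s3_client.copy_object(
        Bucket=bucket,
        Key=target_key,
        CopySource={"Bucket": bucket, "Key": source_key},
    )
    return f"s3://{bucket}/{target_key}"
```

The returned URI is the value to pass as `--config-s3` (or `CONFIG_S3_URI`) for that deployment.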


📄 Example config.toml

[general]
pipeline_name = "dom-pipeline"
input_path = "s3://my-raw-bucket/data/"
output_path = "s3://my-processed-bucket/results/"

[processing]
max_partitions = 200
enable_deduplication = true
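
To show how a file like this could be checked before deploy, here is a small sketch that parses the example and verifies each field's presence and type. The `REQUIRED` table and `validate_config` are illustrative names, not part of the pipeline; the actual validation schema may differ.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # backport with the same API

EXAMPLE = """
[general]
pipeline_name = "dom-pipeline"
input_path = "s3://my-raw-bucket/data/"
output_path = "s3://my-processed-bucket/results/"

[processing]
max_partitions = 200
enable_deduplication = true
"""

# Hypothetical schema: (section, key) -> expected Python type.
REQUIRED = {
    ("general", "pipeline_name"): str,
    ("general", "input_path"): str,
    ("general", "output_path"): str,
    ("processing", "max_partitions"): int,
    ("processing", "enable_deduplication"): bool,
}


def validate_config(cfg: dict) -> list:
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    for (section, key), expected in REQUIRED.items():
        value = cfg.get(section, {}).get(key)
        if value is None:
            errors.append(f"missing {section}.{key}")
        elif type(value) is not expected:  # strict: bool is not accepted as int
            errors.append(f"{section}.{key} should be {expected.__name__}")
    return errors
```

Calling `validate_config(tomllib.loads(EXAMPLE))` returns an empty list for the example above.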

📋 Production Checklist

  • ✅ Config uploaded to correct path & accessible by EMR role.
  • ✅ Bucket versioning enabled.
  • ✅ sha256 and ETag logged on deploy.
  • ✅ CHANGELOG updated with what changed and why.
  • ✅ Validation schema matches config file.

⚠️ Common Pitfalls

  • Wrong prefix: a path with s3:// included twice will cause the fetch to fail (for example with a 404).
  • IAM perms missing: ensure s3:GetObject and s3:GetObjectVersion are allowed for the EMR role.
  • Cached packaged config: if the S3 config isn't being picked up, check that --config-s3 is passed or that .env is actually being read on EMR.
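
The doubled-prefix mistake can be caught before any S3 call is made. The helper below is a hedged sketch (the name `parse_s3_uri` is an assumption, not an existing function in the project) that splits a URI into bucket and key and rejects a repeated scheme.

```python
SCHEME = "s3://"


def parse_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/key' into (bucket, key), failing fast on bad input."""
    if not uri.startswith(SCHEME):
        raise ValueError(f"not an S3 URI: {uri!r}")
    rest = uri[len(SCHEME):]
    if "://" in rest:
        # Catches inputs such as 's3://s3://bucket/key'.
        raise ValueError(f"scheme appears more than once: {uri!r}")
    bucket, _, key = rest.partition("/")
    if not bucket or not key:
        raise ValueError(f"URI needs both a bucket and a key: {uri!r}")
    return bucket, key
```

Failing at parse time gives a clearer message than whatever error the S3 request would eventually return.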

📥 Running with Config from S3

Option 1: CLI flag (highest priority)

uv run deploy-to-emr --config-s3 s3://my-biz-configs/prod/dom/config.toml

Option 2: .env file

# .env
CONFIG_S3_URI=s3://my-biz-configs/prod/dom/config.toml

Then run:

uv run deploy-to-emr
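
The precedence between the two options can be sketched as follows. This is not the actual deploy-to-emr implementation; `resolve_config_uri` is a hypothetical helper, and it assumes the values from .env have already been loaded into the process environment by whatever mechanism the project uses.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence


def resolve_config_uri(argv: Sequence[str],
                       environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the config URI to use, or None to fall back to the packaged file.

    Order: --config-s3 flag first, then the CONFIG_S3_URI variable.
    """
    parser = argparse.ArgumentParser(prog="deploy-to-emr")
    parser.add_argument("--config-s3", default=None)
    # Ignore any other flags the real CLI may define.
    args, _unknown = parser.parse_known_args(list(argv))
    return args.config_s3 or environ.get("CONFIG_S3_URI") or None
```

With both set, the flag wins; with neither set, the result is `None` and the packaged config applies.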

βš™οΈ Runtime Behaviour

  1. If --config-s3 or CONFIG_S3_URI is set β†’ Fetch from S3, parse TOML in-memory, log sha256/ETag/VersionId.
  2. Otherwise β†’ Use packaged config.toml from emr_dummy module.
  3. Config is broadcast to executors as needed.
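
Steps 1 and 2 above could look roughly like the sketch below. It assumes `config.toml` ships inside the `emr_dummy` package and that `s3_client` behaves like `boto3.client("s3")`; the function names and the log format are illustrative, not the project's actual code.

```python
import hashlib
import logging
from importlib import resources

try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # backport with the same API

log = logging.getLogger("config_loader")


def read_packaged_config() -> bytes:
    # Assumption: config.toml is packaged as data inside emr_dummy.
    return resources.files("emr_dummy").joinpath("config.toml").read_bytes()


def load_config(uri=None, s3_client=None, read_packaged=read_packaged_config) -> dict:
    """Load TOML config from S3 when a URI is given, else from the package."""
    if uri:
        if not uri.startswith("s3://"):
            raise ValueError(f"not an S3 URI: {uri!r}")
        bucket, _, key = uri[len("s3://"):].partition("/")
        if s3_client is None:
            import boto3  # imported lazily so the fallback path needs no AWS SDK
            s3_client = boto3.client("s3")
        resp = s3_client.get_object(Bucket=bucket, Key=key)
        raw = resp["Body"].read()  # kept in memory, never written to disk
        # VersionId is only present when bucket versioning is enabled.
        log.info("config source=%s sha256=%s etag=%s version_id=%s",
                 uri, hashlib.sha256(raw).hexdigest(),
                 resp.get("ETag"), resp.get("VersionId"))
    else:
        raw = read_packaged()
        log.info("config source=packaged sha256=%s",
                 hashlib.sha256(raw).hexdigest())
    return tomllib.loads(raw.decode("utf-8"))
```

Logging the sha256 on both paths makes it possible to confirm from the job logs which exact config a run used.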