Data Engineering with AWS Medallion
Introduction
Modern analytics lives or dies by the quality, freshness, and cost of your data pipelines. In this blog, I’ll first unpack existing, production pipeline and then walk you through a proposed AWS-native redesign based on the Medallion (Bronze/Silver/Gold) pattern. Along the way, I’ll cover pain points, cost levers, security, governance, automation, and how this architecture unlocks scalable analytics for apps, dashboards, and machine learning.
TL;DR
Today: Hevo → Google Cloud Storage (GCS) → Databricks → GCS → consumed by DataBrain, Custify, and apps. Works, but costs and governance are not where we need them.
Tomorrow (proposed): AWS DMS/Streams/API → S3 Bronze/Silver/Gold (Iceberg + Parquet) → Athena, Glue, EMR → Step Functions, EventBridge, Lambda → consumed by Databricks, DataBrain, Custify, and apps. Lower cost, stronger governance, reproducible CDC, and clean, analytics-ready marts.
Problem Statement
What’s not working well enough:
Third-party integration cost & limits: Hevo moves data reliably but adds monthly cost without native control over orchestration and governance.
Paying for underused GCP: We store and process in GCS/Databricks, but governance and lifecycle optimizations are inconsistent, inflating storage and compute.
Data bloat: Without robust ETL/ELT and deduplication at source-of-truth tables, duplicates and redundant snapshots accrue, driving up storage and query costs.
Governance gaps: Limited data partitioning standards, insufficient PII controls, and uneven access management increase risk and cost.
Inconsistent CDC fidelity: We want a repeatable, auditable change feed across heterogeneous systems (MySQL, MongoDB, DynamoDB, Zoho Books).
What we need instead:
Serverless, pay-as-you-go ingestion and transformation.
Clear layered data design (Medallion) for quality, lineage, and reprocessing.
Columnar + compressed storage with lifecycle policies.
Fine-grained security (authN/authZ), robust monitoring, and automated quality checks.
First-class support for CDC and replays for deterministic rebuilds.
Architecture Overview
Existing (high-level)
Place Figure 1 here: Existing Architecture Diagram
Sources → Hevo → GCS (data lake) → Databricks (ETL/ML) → GCS (processed) → DataBrain / Custify / Apps
Sources: MongoDB, DynamoDB, MySQL, Zoho Books
Ingestion: Hevo (full load + CDC)
Storage/Compute: GCS + Databricks (Spark, SQL)
Consumption: DataBrain (embedded analytics), Custify (CS ops), plus internal apps
Proposed (high-level)
Sources → AWS DMS / Streams / API → S3 (Bronze/Silver/Gold, Iceberg + Parquet) → Athena / Glue / EMR → Step Functions + EventBridge + Lambda → Databricks / DataBrain / Custify / Apps
Ingestion: AWS DMS (MySQL, MongoDB), DynamoDB Streams → Firehose/Lambda, Zoho Books API → Lambda/Step Functions
Storage: Amazon S3 layered by Bronze (raw), Silver (clean), Gold (marts) with Iceberg tables in Parquet + compression
Compute: Athena (SQL/CTAS/MERGE), Glue (Spark ETL), EMR (large batch/ML)
Orchestration: Step Functions + EventBridge + S3 events; Lambda for micro-transforms
Governance & Security: Glue Data Catalog, Lake Formation, IAM least privilege, KMS, private networking
Observability & Quality: CloudWatch dashboards/alarms, DLQs, quarantine buckets, DQ suites (Deequ/Great Expectations-style)
Data Sources & Ingestion
Sources:
MongoDB (semi/unstructured ops data)
MySQL (relational business data)
DynamoDB (low-latency NoSQL)
Zoho Books (finance/transactions)
Existing: Hevo does full load + CDC from all supported sources and lands into GCS.
Proposed: a best-tool-for-the-job approach:
AWS DMS for MySQL (CDC via binlog) and MongoDB (CDC via oplog/change streams).
Replication instance runs tasks; endpoints define source/target; tasks handle Full Load + CDC seamlessly.
DMS is a reliable mover (light filters/mappings), not a heavy ETL engine.
DynamoDB: Streams → Kinesis Firehose to S3 (or Lambda to S3) for CDC.
Zoho Books: API → Lambda/Step Functions → S3 (scheduled incremental pulls).
All feeds converge into S3 Bronze with consistent append-only events and CDC fidelity (I/U/D flags, source timestamps, watermarks).
Data Lake & Storage (Medallion)
Why Medallion? It standardizes quality gates and lineage:
Bronze (Raw): immutable, 1:1 source events; perfect for audit and replay.
Silver (Clean): deduplicated, schema-conformed, deletes honored; Parquet + ZSTD/Snappy; Iceberg tables for time travel & MERGE.
Gold (Marts): domain-specific, analytics-ready; Customer 360, finance marts (invoices, payments, AR aging, MRR/ARR), operational KPIs.
Partitioning & Format:
Partition by ingest_date (Bronze) and business time (Silver/Gold:
event_date,invoice_month, etc.).Use Parquet columnar storage for scan reduction; target file sizes 128–512 MB and run small-file compaction.
Data Transformation (ETL / ELT)
Existing (Databricks on GCP):
Schema normalization, timestamp standardization, null handling.
Validation (counts, referential integrity), deduplication, enrichment (currency conversion, joins).
Spark SQL/PySpark for aggregates, joins, ML; reports saved back to GCS; consumed by DataBrain and Custify.
Proposed (AWS):
Athena for SQL transforms, CTAS, and Iceberg MERGE for incremental upserts.
Glue Spark jobs for heavy joins, window functions, compaction, schema enforcement (with job bookmarks).
EMR for very large batch jobs and custom ML runtimes.
Lambda for micro-transforms: manifests, S3 tagging, schema tweaks, pre/post checks.
Example Silver dedup logic (conceptual):
-- Keep one latest version per business key, honoring CDC ordering
SELECT *
FROM (
SELECT
t.*,
ROW_NUMBER() OVER (
PARTITION BY t.business_key
ORDER BY t.cdc_timestamp DESC, t.op_order DESC
) AS rn
FROM bronze_table
WHERE t.op <> 'D' -- map deletes to Iceberg deletes when needed
) s
WHERE rn = 1;
Tools & Technologies
AWS DMS: purpose-built, low-latency mover for Full Load + CDC; reduces custom ingestion code.
S3: durable, cheap, limitless data lake; integrates with Athena/Glue/EMR; lifecycle policies for cost control.
Iceberg on S3: ACID for the lake, schema/time/partition evolution, time travel, and fast MERGE.
Athena: serverless SQL; pay per TB scanned; ideal for ad-hoc, marts, and scheduled transforms.
Glue: managed Spark ETL; bookmarks, crawlers, Data Catalog.
EMR: elastic big data/ML when Spark needs custom runtimes or scale.
Step Functions + EventBridge: resilient orchestration and scheduling; idempotent runs.
Lambda: connectors & micro-transforms (API pulls, manifests, quality checks).
Lake Formation + IAM + KMS: security, least privilege, PII masking, encryption at rest.
Challenges & Solutions
Heterogeneous CDC: Standardize events (_id/PK, op I/U/D, timestamps, watermarks) → Bronze append-only.
Schema evolution: Use Iceberg for add/drop/rename without table rewrites; maintain Glue Catalog.
Duplicate & small files: Silver dedup; compaction jobs to right-size Parquet files.
Partition over/under-fitting: Choose filters BI actually uses (
invoice_month,event_date) to maximize pruning.Athena partition limits: Avoid excessive
IN/ORon partitions (>200); prefer ranges or different partitioning.Late arrivals & lag: Monitor DMS latency, store lag metrics, and make jobs idempotent for safe re-runs.
Orchestration & Automation
EventBridge schedules CDC checks and batch windows.
Step Functions coordinates: DMS task health → Bronze → Silver → Gold → DQ → publish.
S3 PUT events trigger Lambdas for manifests, tagging, and small pre/post transforms.
Idempotency across jobs ensures safe retries and deterministic rebuilds.
Data Quality, Monitoring & Governance
DQ suites: schema checks, null thresholds, ranges, referential integrity, row counts vs DMS stats.
Quarantine paths: bad records → S3 quarantine with reason codes; DLQs via SQS.
CloudWatch dashboards & alarms: DMS latency, Glue/EMR/Athena failures, Lambda errors, Iceberg maintenance jobs.
Catalog & lineage: Glue Data Catalog for every table across Bronze/Silver/Gold; Lake Formation tags for PII; lineage via job metadata.
Security: Authentication & Authorization (the essentials)
Authentication (who are you?):
AWS IAM Identity Center (SSO) backed by your IdP for user auth.
IAM Roles for workloads (DMS, Glue, EMR, Lambda, Step Functions).
Authorization (what can you access?):
Least privilege IAM policies per role: producers (write Bronze), transformers (read Bronze, write Silver/Gold), consumers (read Gold/views).
Lake Formation grants for table/column/row level access; dynamic masking for PII.
KMS: separate CMKs per layer (Bronze/Silver/Gold); bucket policies enforce TLS and deny public access.
Private access: VPC endpoints for S3/Glue/Athena; private subnets for Glue/EMR.
Cost Optimization (where the savings come from)
Storage
Parquet + ZSTD/Snappy → 3–10× smaller vs JSON/CSV; fewer bytes scanned by Athena.
Dedup in Silver → drop redundant history while preserving full fidelity in Bronze.
Lifecycle policies → Bronze rolls down: Standard → Infrequent Access → Glacier (e.g., 30/60/90 days).
Right-sized files (128–512 MB) → fewer file listings and faster scans.
Compute
Push SQL to Athena where possible (no clusters to manage).
Glue job bookmarks to avoid reprocessing; partition pruning to scan less.
Compaction & Z-ordering (Spark/EMR) for heavy marts with multi-column filters.
Platform
- Replace recurring Hevo spend with serverless native ingestion; reduce GCP footprint where underutilized.
Existing Architecture (Deep Dive)
Sources: MongoDB, DynamoDB, MySQL, Zoho Books
Ingestion: Hevo (full + CDC) lands raw/semi-processed into GCS (partitioned by source/timestamp).
Processing: Databricks pulls from GCS → cleans, validates (counts, referential checks), dedups, enriches (currency conversion, joins).
Advanced analytics: Spark SQL/PySpark for aggregations, joins, and optional ML.
Outputs: Reports/datasets (Parquet/CSV/Delta) written back to GCS.
Consumption: DataBrain (embedded analytics), Custify (CS ops), and other apps read from GCS.
Strengths: multi-source CDC, scalable compute (Databricks), decent freshness.
Gaps: third-party ingestion cost; uneven governance and lifecycle; limited dedup in core tables; higher query/storage spend than necessary.
Proposed Architecture (Deep Dive)
Sources & CDC model: Treat every source as an append-only stream with stable keys, op flags (I/U/D), source timestamps, and a CDC watermark.
Ingestion:
AWS DMS for MySQL (binlog) & MongoDB (oplog/change streams) → S3 Bronze.
DynamoDB Streams → Firehose/Lambda → S3 Bronze.
Zoho Books API → Lambda/Step Functions → S3 Bronze.
Storage (S3 Medallion):
Bronze: Raw events, append-only.
Silver: Deduped, conformed, Iceberg tables in Parquet; deletes honored; partitioned by business time.
Gold: Domain marts (Customer 360, Finance, Ops KPIs) denormalized for BI performance.
Compute:
Athena (SQL, CTAS, MERGE INTO Iceberg) for many transforms and reports.
Glue for heavy ETL, compaction, and window analytics.
EMR for very large batch/ML jobs.
Automation: Step Functions orchestrates Bronze→Silver→Gold with EventBridge schedules and S3 events; Lambda for micro-tasks.
Quality & Monitoring: DQ checks, quarantine buckets, CloudWatch alarms, lag dashboards, DLQs.
Security: IAM least privilege, Lake Formation, KMS per layer, private networking.
Final Output / Results
Gold marts supporting:
Customer 360: unified profiles, lifecycle stages, activity aggregates, health scores.
Finance: invoices/payments, AR aging, MRR/ARR, cohort revenue.
Operational KPIs: product usage, funnel metrics, retention.
Consumption paths:
Databricks (reads Silver/Gold Iceberg tables) for ML notebooks and writes back curated outputs.
DataBrain and Custify via Athena views or materialized Parquet exports (with manifests) for fast dashboards and embedded analytics.
Other apps via scheduled S3 exports or JDBC/ODBC to Athena.
Key Learnings
Medallion layering brings clarity: raw → clean → curated, with explicit quality gates.
CDC done right requires a consistent event contract and immutable Bronze for replayability.
Table formats matter: Iceberg unlocks ACID, time travel, schema evolution, and efficient MERGE.
Serverless first (Athena, Glue, Lambda, Step Functions) reduces ops toil and aligns cost to usage.
Governance (Lake Formation + IAM + KMS) is a design pillar—not an afterthought.
Future Improvements
Consumer Data Lakes from Gold sources
Publish the Gold layer as domain data lakes per business area (Customer, Finance, Ops).
Cross-account sharing with Lake Formation; optional data marketplace via DataZone.
Maintain semantic views (Athena) for self-serve analytics.
Real-time streaming everywhere
- Where needed, augment CDC with Kinesis (or Kafka) to power sub-minute freshness and event-driven use cases.
Advanced anomaly detection
- EMR/Databricks notebooks to build ML models for outlier detection (finance anomalies, churn risk, pipeline drift).
SSO & fine-grained security expansion
- IAM Identity Center (SSO), attribute-based access control (ABAC), pervasive column masking for PII.
Cost posture reviews
- Quarterly compaction and partition audits; Heatmaps of Athena scans; automatic tiering rules across all buckets.
Implementation Plan (Phased)
Phase 0 – Foundations
- S3 buckets/prefixes for Bronze/Silver/Gold; KMS keys per layer; Glue Catalog; Lake Formation setup.
Phase 1 – Ingestion
Stand up DMS for MySQL & MongoDB; Streams/Firehose/Lambda for DynamoDB; API → Lambda for Zoho Books.
Land append-only events with consistent metadata.
Phase 2 – Silver
- Build Iceberg Silver tables; implement dedup logic, deletes, and partitioning; first DQ suite; compaction jobs.
Phase 3 – Gold
- Create Customer 360 and Finance marts; define Athena views for DataBricks/DataBrain/Custify; set SLAs (e.g., Customer 360 by 06:00 IST, Finance hourly).
Phase 4 – Automation & Observability
- Step Functions pipelines, EventBridge schedules, S3 triggers; CloudWatch dashboards/alerts; DLQs & quarantine flows.
Phase 5 – Hardening
- Load/perf tests; security audits; cost reviews; user onboarding (SSO); documentation and runbooks.
Conclusion
By migrating from a Hevo + GCS + Databricks flow to an AWS-native Medallion lakehouse—DMS/Streams/API → S3 (Bronze/Silver/Gold, Iceberg + Parquet) → Athena/Glue/EMR, orchestrated with Step Functions—we gain lower cost, stronger governance, and clean, reliable, analytics-ready data.
The result: faster, cheaper queries; deterministic CDC with replay; secure, audited access; and curated marts that power DataBrain, Custify, ML workflows, and downstream apps with confidence.
Project Highlights & Contact
This project was successfully executed with a focus on scalability, efficiency, and cost optimization. Over the course of implementation, I handled 48 TB of data, processed more than 5 billion records in a single collection, and managed 15+ collections across multiple databases. These efforts ensured smooth data flow, optimized storage through Medallion architecture, and delivered reliable reporting pipelines.
If you’d like to know more about my work, collaborate, or discuss data engineering solutions, feel free to reach out to me at:
📧 mugeshkrishna585@gmail.com
Appendix: Quick Glossary
Hevo: Third-party integration platform for moving data to warehouses/lakes.
GCS: Google Cloud Storage (object store in GCP).
Databricks: Unified analytics platform built on Apache Spark.
DataBrain: SQL-based transformation/orchestration for embedded analytics.
Custify: Customer Success platform consuming analytics for retention and growth.
AWS DMS: Managed service for database Full Load + CDC replication.
S3: AWS object storage (our data lake).
Athena: Serverless SQL over S3.
Glue: Serverless ETL (Spark) with Data Catalog, crawlers, and notebooks.
EMR: Elastic big data/ML clusters.
Iceberg: Open table format enabling ACID, schema evolution, time travel on the lake.
Medallion: Bronze (raw) → Silver (clean) → Gold (business marts).