Case Study
Building a Scalable, Cost-Optimised & Resilient Data Platform
Sector: E-commerce & Retail
Executive Summary
- Challenge: A rapidly growing e-commerce company was struggling with fragmented data systems, rising cloud costs, and slow access to critical business insights. Multiple departments (marketing, sales, finance, IT) needed timely, trustworthy data, but the existing data infrastructure was unreliable, siloed, and expensive. The client set an ambitious goal: to unify data from various sources into a single platform capable of delivering real-time analytics with high reliability, all on a cloud budget of about $500 per month.
- Solution: Ray Consulting Group (RCG) designed and built a modern cloud-native data platform following our Discover – Design – Deliver – Improve methodology. In the Discover phase, we gathered requirements and mapped the data landscape. In Design, we architected a modular data platform using open-source technologies and cost-effective AWS services. Key components included an ingestion layer for batch and streaming data (Apache NiFi, Kafka), a scalable data lakehouse on Amazon S3 with Apache Iceberg tables and AWS Glue catalog, a processing layer with Apache Spark and dbt for data transformation, and an orchestration layer with Apache Airflow. A federated SQL query engine (Trino) provided a unified access layer for analysts and applications, with built-in security and data governance (fine-grained IAM controls, data quality checks via Great Expectations). The architecture emphasised resilience (no single point of failure, clustering, auto-recovery) and cost optimisation (serverless and spot instances, auto-scaling, open-source tools to avoid license fees). Throughout Delivery, our agile team implemented the solution in iterative sprints, infrastructure-as-code (Terraform) ensured repeatable deployments, and comprehensive testing/monitoring guaranteed reliability. We also put strong data governance practices in place: encryption, access control by roles, data lineage tracking, and compliance with regulations (e.g. GDPR).
- Results: The new data platform now delivers near real-time insights and reliable daily reports at a fraction of the previous cost. The solution came in under budget, reducing cloud costs by ~30% while exceeding the performance targets. Data that once took days to assemble is available in minutes, enabling faster decision-making (e.g. marketing sees live campaign metrics, sales managers get daily dashboards by morning). The platform operates with 99.9%+ uptime and automatically recovers from failures, dramatically improving trust in data availability. Fine-grained security and quality controls mean stakeholders only see "one version of the truth" and sensitive data is protected – boosting confidence in data-driven decisions and compliance. The client's data engineering team has been upskilled through the project, and they now have a future-proof foundation: the modular architecture can easily scale with growth or be extended to new data sources and analytics needs. This case study showcases RCG's end-to-end data engineering expertise in transforming a complex, costly data ecosystem into a streamlined, cost-efficient "engine of insight."
Background & Challenge
In today's digital economy, businesses often find their data architecture strained as they scale. Our client, a mid-sized company in the online retail sector, experienced exactly this. They had cloud-native infrastructure in place, but as data volumes and user demands grew, data platform costs skyrocketed, and reliability suffered. Different teams were pulling data from disparate systems: a CRM database for customer info, an ERP/BSS for transactions and billing, application logs from web servers, and even third-party APIs for marketing analytics. The data landscape was fragmented across these sources, making it slow and labour-intensive to get a complete picture of business performance.
The leadership realised that to make timely, informed decisions, they needed faster access to trustworthy data across the organisation. Marketing wanted real-time user engagement metrics for agile campaign tuning. Sales and finance wanted daily revenue and cash flow reports without the need for manual spreadsheet work. IT wanted to reduce maintenance overhead from the patchwork of pipelines that frequently broke. On top of that, the company had to control cloud spending – they were concerned that adding more data and users to their existing setup (which included some expensive managed services and ad-hoc processes) would blow through the budget. In fact, they set a hard target: the new solution should run at around $500 per month in cloud costs, significantly less than their current spend, without sacrificing performance.
Key Business Challenge: Build a unified, scalable, and cost-efficient data platform that could consolidate data from multiple sources and deliver near real-time insights with high reliability. This platform needed to meet strict service-level agreements (SLAs) for uptime (no lengthy outages) and data freshness, enforce robust security and governance for sensitive data, and remain within a tight budget. Essentially, the client needed a modern data backbone to transform their complex data ecosystem into a single source of truth in a lean, efficient manner.
RCG was engaged to solve this challenge. We kicked off the project by aligning expectations with stakeholders and planning a path forward using our proven RCG Model (Discover – Design – Deliver – Improve). In the Discover phase, we focused on understanding the business goals, technical constraints, and the current pain points in detail. This set the stage for designing a solution perfectly tailored to the client's needs.
Requirements & Discovery
During the discovery phase, our team conducted intensive requirements-gathering sessions. We interviewed stakeholders across departments (executives, analysts, and engineers) and audited existing data systems. The goal was to map out what the business needed versus what they had, and identify any constraints (budget, compliance, timeline). Through workshops and whiteboarding, we captured several key requirements and pain points:
- Multi-Source Data Integration: The platform must ingest and consolidate data from a variety of sources. This included an on-premises CRM (MySQL) for customer data, a PostgreSQL ERP/BSS containing billing and operational data, application and web logs stored in AWS S3 (as well as some logs in an Elasticsearch cluster), and various third-party APIs (for marketing metrics, social media, etc.). We needed to bring all these into one unified data repository. A special consideration was that some sources would be batch (e.g., nightly database exports) and others streaming (e.g., real-time event capture from the website).
- Mixed Workload Support (Batch + Streaming): The client's use cases spanned batch processing (like daily or weekly reports, monthly financial closes) and real-time or micro-batch streaming (like user activity feeds, live metrics dashboards). The platform had to handle both efficiently. For example, daily summary jobs should not interfere with real-time event processing. This required an architecture that supports scheduling batch ETL jobs as well as always-on streaming data pipelines concurrently.
- Analytics Accessibility & Self-Service: End-users (particularly business analysts and some product managers) should have self-service access to the data through familiar tools (BI dashboards, SQL queries, maybe Excel) without always needing a technical data engineer in the loop. Different teams (marketing, sales, finance, operations) each needed customised data views or data marts relevant to them, but all derived from the same single source of truth. The requirement was to enable fast access to data and insights (e.g., a marketer querying campaign data or a finance user retrieving the latest revenue numbers) with minimal friction.
- Governance, Security & Compliance: Given the sensitive nature of some data (customer PII, financial records), the solution had to enforce strict security and data governance. This included fine-grained access controls (so that, for instance, marketing can see campaign performance but not HR data, finance can see revenue but not personal customer details), as well as data quality checks and lineage tracking. We needed to track where data came from and how it transformed (for transparency and debugging), and ensure that any bad data (corrupt records, privacy-sensitive info) could be identified and handled. Compliance requirements, such as GDPR (for customer data protection) and any industry-specific regulations, had to be supported. For example, data needed to stay in certain regions or be encrypted at rest and in transit.
- Cost Efficiency (Budget ~$500/Month): A hard constraint was cost optimisation. The client's leadership set a target cloud budget of about $500 per month for the data platform operations. This is quite modest given the scope of real-time processing and storage needed, so every design choice had to consider cost. We would prioritise using open-source technologies (no hefty license fees) and managed services that offer a good price-performance ratio. The design would need to smartly leverage cloud cost-saving mechanisms (such as auto-scaling down, using spot instances or reserved instances, and avoiding unnecessary data duplication). Essentially, the platform should scale within budget, growing efficiently as data and user demands increase.
- Reliability, Scalability & Resilience: The client required that the new platform be significantly more reliable and fault-tolerant than their current setup. The target was at least 99.9% uptime for critical components, meaning minimal downtime. This entails designing with no single points of failure (redundant instances, clustering, failover mechanisms). If a server or service goes down, the platform should heal itself or have a standby ready. Additionally, the system should scale seamlessly: as data volume grows 2x or 5x, or as new data sources are added, the architecture should handle it without a complete redesign. Scalability also means handling peak loads. For instance, if there's a traffic spike on the website generating events, the streaming pipeline should absorb it by scaling out. All of this should happen while maintaining performance (so queries and pipelines remain within expected time SLAs).
These requirements painted a clear picture of what the solution must achieve. We documented them carefully and obtained sign-off from all stakeholders to avoid surprises later. We also catalogued the existing technology stack and data assets the client possessed – some would be reused, while others phased out. For instance, the client was using a traditional ETL tool that was expensive; we noted this might be replaced with an open-source alternative. Another example: they had data spread across several S3 buckets and databases – we planned to migrate and reorganise this data into a central repository.
With a firm grasp on the goals and constraints, our RCG team proceeded to the Design phase. In this phase, our architects would blueprint a solution architecture that meets all requirements and stays within budget. We aimed to leverage our full-stack data expertise to choose the right tools and design patterns for this specific scenario.
Solution Architecture & Design
RCG's architects designed a modern, cloud-native data platform architecture that met the client's requirements for scalability, cost, and resilience. We followed a layered architecture approach – separating the system into distinct layers (ingestion, storage, processing, etc.) – to keep the design modular and flexible. The design was also cloud-agnostic at its core: we chose technologies that could, if needed, be ported to another cloud provider with minimal refactoring, thereby avoiding vendor lock-in. Below, we outline the key components and design decisions for each layer of the platform:
Figure 1 illustrates the end-to-end data platform. Data sources (CRM, ERP, logs, APIs) feed into an ingestion layer (batch and streaming pipelines). Ingested data lands in a central data lake storage (S3 with Iceberg), which is accessible through a processing layer (Spark, dbt) and a query layer (Trino). Downstream, users and applications access data via BI dashboards or APIs. Security and monitoring envelop all layers.
Infrastructure & Cloud Foundation
We chose AWS as the primary cloud platform due to the client's familiarity and the available services that align with our needs. All infrastructure was defined as code using Terraform. This Infrastructure-as-Code approach allowed us to spin up consistent environments (dev, test, prod) on demand and ensured reproducibility. For example, the entire VPC, subnets, security groups, EC2 instances, S3 buckets, RDS databases, and other necessary resources were scripted – a new deployment or changes could be applied with Terraform, reducing manual setup errors and enabling quick rollback if needed. The design emphasised using cloud-native but cloud-agnostic components: core storage was S3 (which has equivalents in Azure/GCP), and compute was mostly on EC2 or containerised on ECS/EKS (which can port to other container services). This avoided locking the client to a single cloud vendor's proprietary databases or warehouses. We also right-sized the infrastructure for cost: using small EC2 instances and AWS Fargate for containers where appropriate and leveraging AWS's spot instances for non-critical workloads (like dev/test Spark jobs or less urgent batch tasks) to get steep discounts. We enabled auto-scaling groups for services like Kafka and Spark workers – they scale out under high load and scale in when idle, so we're only paying for what we use. By automating the infrastructure and keeping it lean, we built a strong, cost-efficient foundation for the data platform.
Data Ingestion Layer (Batch and Streaming)
The ingestion layer handles getting data from source systems into the platform. We implemented two parallel paths: one for batch (scheduled periodic loads) and one for real-time streaming data.
- Batch Ingestion: For databases like the CRM and ERP, which didn't need real-time syncing, we used Apache NiFi as a versatile data ingestion tool. NiFi allowed us to create visual pipelines to extract data from MySQL and PostgreSQL on a schedule (for instance, a nightly full extract or more frequent incremental extracts of new records). NiFi handled data routing, transformations (if needed) and loading into the target storage. We configured NiFi flows to pull data in CSV or JSON format and land it into a staging area in Amazon S3. NiFi's drag-and-drop interface with processors for common tasks accelerated development, and its scheduling capability allowed us to orchestrate batch jobs without heavy custom code. We templated the NiFi flows so they were reusable and easy to maintain.
- Initial Bulk Load (Data Migration): As part of setting up the platform, we also performed a one-time bulk data migration of historical data. Using NiFi and database dumps, we migrated years of CRM and ERP data (customer records, transactions, etc.) from the old systems into the new S3 data lake. This seeding of historical data ensured the new platform started with a complete dataset. After this migration, ongoing changes were captured via incremental processes (CDC or periodic diffs), avoiding repeated full loads.
- Streaming Ingestion: For sources that required real-time or near-real-time data, we incorporated Apache Kafka as the backbone. Kafka is a distributed log system ideal for streaming integration. We set up a Kafka cluster (using either AWS MSK or self-managed Kafka on EC2) and created topics for various data streams – e.g., "customer_activity_events", "new_signups", "order_placed". We used Change Data Capture (CDC) connectors (like Debezium) to stream updates from the MySQL and PostgreSQL databases into Kafka in real-time. This means that whenever a new order is placed in the ERP, a change event is published to a Kafka topic within seconds. Similarly, for application logs, we used Logstash (from the ELK stack) to tail log files and send events to Kafka, and some application services were modified to directly produce events to Kafka (for example, user clicks or app events). With Kafka in place, we had a reliable buffer that could transport high-volume event data to consumers asynchronously and in a fault-tolerant way.
- Real-Time Processing: Downstream of Kafka, we leveraged Apache Spark Structured Streaming for processing the streaming data in real-time. Spark streaming jobs run continuously, reading from Kafka topics, performing transformations or aggregations on the fly, and then writing results out to the data lake or other storage. For instance, we built a Spark streaming job to listen to the "customer_activity_events" topic, aggregate events into session data, and update a running count of active users, which could then be queried nearly live. Spark's unified engine allowed us to use the same codebase in batch and streaming modes. We also evaluated Apache Flink for streaming (as it's even more real-time for certain use cases), but given the team's familiarity with Spark and the moderate latency requirements (sub-minute latency was fine), we stayed with Spark for simplicity.
In summary, the ingestion layer combined NiFi (for batch/bulk loads) and Kafka + Spark streaming (for real-time data). This hybrid approach ensured we captured all data changes in a timely manner. A crucial part of the design was decoupling producers from consumers – sources publish data to Kafka or drop files to S3, and our processing jobs pick them up. This isolation adds resilience: if the downstream processing is temporarily down, Kafka will buffer events until it's back, so no data is lost. Likewise, if a source system is unavailable briefly, NiFi can catch up later. The result is a robust data ingestion pipeline feeding the platform continuously.
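To make the streaming path concrete, below is a minimal sketch of the kind of Spark Structured Streaming job described above: it reads the "customer_activity_events" topic from Kafka, parses the JSON payload, and appends the events to an Iceberg table with checkpointing for fault tolerance. The broker addresses, schema, table, and checkpoint path are illustrative assumptions rather than the client's actual configuration.

```python
# Minimal sketch of a streaming ingestion job (names and paths are illustrative).
# Assumes the Iceberg Spark runtime and a catalog named `glue_catalog` are configured on the cluster.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("customer-activity-stream").getOrCreate()

# Expected shape of one Kafka event (illustrative)
event_schema = StructType([
    StructField("customer_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # hypothetical brokers
    .option("subscribe", "customer_activity_events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers the payload as bytes in the `value` column; parse it into typed columns
events = (
    raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Append into an Iceberg table; the checkpoint lets the job resume where it left off after a restart
query = (
    events.writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://rcg-lake/checkpoints/customer_activity/")  # hypothetical path
    .toTable("glue_catalog.analytics.customer_activity_events")
)
query.awaitTermination()
```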
Data Lake Storage & Management
At the heart of the platform lies the data lake, which we built on Amazon S3. S3 was the natural choice for cost-effective, scalable storage – it can store vast amounts of raw and processed data very cheaply (pennies per GB) and is highly durable (99.999999999% durability). However, raw files on S3 alone are not enough for an analytics platform; we needed to impose structure and enable fast queries. For this, we adopted a Lakehouse architecture by using the Apache Iceberg table format on top of S3.
- Apache Iceberg Tables: Iceberg is an open table format that brings database-like features to data lakes. We converted the raw data files into partitioned Iceberg tables, which gave us ACID transactions, schema evolution, and time-travel query capabilities on the data lake. For example, we defined an Iceberg table for CRM customers and another for transactions, each stored as Parquet files in S3 under the hood. Iceberg manages metadata about these files and allows safe concurrent writes. This means multiple jobs can write to the same table without corrupting it – a big improvement over plain CSV/Parquet files. It also allowed the use of SQL MERGE operations for easier upserts (important for CDC data merging; a minimal upsert sketch appears at the end of this subsection). By using Iceberg, we essentially created a data warehouse layer on top of the data lake, achieving the best of both worlds: low-cost storage and robust table management.
- Data Partitioning & Layout: We carefully planned how data is organised in S3 to balance performance and cost. Data in the lake is partitioned by key fields (dates, regions, etc.) so that queries can prune unnecessary data. For instance, the sales transactions table is partitioned by year/month/day, so a query for October 2025 only reads that subset instead of the whole dataset. We also used compression (Parquet format with Snappy compression) to reduce storage size and I/O costs.
- Catalog & Metadata: To make the data discoverable, we set up the AWS Glue Data Catalog as a central metadata repository. Every table in the data lake (Iceberg table) is registered in Glue with its schema and partition info. This enables tools like Spark, Trino, and Athena to easily find and query the data. Analysts can also search the catalog (via a data catalog UI we provided) to see what datasets are available. The catalog also stores metadata for any other sources (for example, if we want to treat an API as an external table, etc.). In essence, the Glue Catalog serves as the data dictionary for the platform.
- Other Storage Needs: While S3/Iceberg covered the main analytical storage, the design also factored in specialised data stores for specific use cases:
- We integrated Amazon OpenSearch (Elasticsearch) for use cases requiring text search or low-latency lookups. Certain log data and processed metrics were indexed into OpenSearch, allowing the client's ops team to quickly search logs or enabling the product team to power a search feature without scanning the entire data lake.
- We proposed AWS Neptune (Graph DB) for any graph-based queries (e.g., if the client wanted to map relationships between customers or do network analysis). In this project, Neptune was an optional component for future use in advanced analytics (it wasn't a core requirement, but we designed with extension in mind).
- For staging and intermediate processing, we used some Amazon RDS (MySQL/Postgres) instances and AWS ElastiCache (Redis) where appropriate. For example, Airflow's metadata database ran on RDS (MySQL), and we used Redis as a message broker for Airflow's Celery executors.
By using S3 + Iceberg as the main data lake, we drastically saved on costs compared to a traditional data warehouse. S3 costs are low, and Iceberg, being open-source, incurred no license cost. The design is also future-proof: if later the client opts for a warehouse like Snowflake or Redshift for certain workloads, this data lake can still serve as the staging area or be directly queried by those tools. But initially, we aimed to see how far the lakehouse could go on its own to meet all analytics needs, thus avoiding those extra costs.
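As an illustration of the CDC upsert pattern described above, the following sketch applies a batch of change records to an Iceberg table using a Spark SQL MERGE statement. The catalog configuration, table, and column names are assumptions made for the sketch, not the client's actual schema.

```python
# Minimal sketch of an Iceberg MERGE upsert with Spark SQL (all names are illustrative).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("crm-upsert")
    # Iceberg catalog wiring; the catalog name and warehouse path are assumptions
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://rcg-lake/warehouse/")
    .getOrCreate()
)

# A batch of CDC change records landed by the ingestion layer (path is hypothetical)
changes = spark.read.parquet("s3://rcg-lake/staging/crm/customers/latest/")
changes.createOrReplaceTempView("customer_changes")

# Upsert: update existing customers, insert new ones. Iceberg's ACID guarantees make this
# safe even when other jobs write to the same table concurrently.
spark.sql("""
    MERGE INTO glue_catalog.crm.customers AS t
    USING customer_changes AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```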
Data Processing & Orchestration
On top of the storage layer, we built a robust processing and orchestration mechanism to transform raw data into insights. This layer is where the data engineering and analytics engineering magic happens: cleaning, joining, aggregating data, and preparing it for consumption.
- Batch Processing with Spark: We deployed Apache Spark for heavy data transformations. Spark is an engine that can handle large-scale data processing using distributed computing. We set up Spark in a cost-conscious way: for instance, using AWS EMR on EC2 spot instances for ephemeral big jobs or experimenting with EMR on EKS (running Spark on Kubernetes) for more control. Spark jobs were written mostly in PySpark (Python API) for readability. An example batch process: every night, a Spark job reads the day's new transactions from the staging area (or directly via Kafka CDC events), joins them with customer data for enrichment, and writes the result as an Iceberg table or updates an existing table. We also used Spark for generating aggregates like monthly sales summaries, which feed certain reports. Spark's scalability ensured that even if we had tens of millions of records, the processing could be parallelised across a cluster and finish within the required window.
- SQL Transformations with dbt: For transforming data in SQL and building a semantic layer for analysts, we introduced dbt (data build tool). dbt allowed our analytics engineers to define models (as SQL select statements) that transform raw base tables into cleaned, business-friendly tables or views. We created a dbt project that sat on top of the Iceberg tables (using Trino or Spark SQL as the execution engine for dbt models). For example, a dbt model might take the raw transactions table and filter out test orders, calculate new fields like total_amount, and produce a "fact_sales" table in the data lake. Another model might join customer info with usage logs to produce a "daily_active_users" table. dbt handles the dependency management, so if the base table is updated, dependent models know to refresh. It also supports tests, so we embedded data quality tests in SQL (like "no nulls in primary key fields") as part of these models. By using dbt, we essentially implemented the analytics engineering layer: creating curated datasets that are directly usable by BI tools or data scientists, with documented definitions and quality checks. This drastically reduces the time an analyst spends wrangling data – the heavy lifting is done in these pipelines, and the data is ready for analysis.
- Orchestration with Airflow: To coordinate all these tasks (ingestion jobs, Spark jobs, dbt models, etc.), we set up Apache Airflow as the workflow orchestrator. Airflow allows us to define DAGs (Directed Acyclic Graphs) of tasks that run in a specific order and can be scheduled or triggered by events. We containerised Airflow and deployed it on AWS (on ECS, or on an EC2 instance with Celery workers for scale). Each pipeline (batch or streaming supervisory tasks) was defined as an Airflow DAG in Python. For example, we had a daily ETL DAG that orchestrated the steps to update sales data: first, extract from source via NiFi or a Python script, then run a Spark job to transform, then run dbt models, then validate data quality, and finally load the results to the presentation layer or notify users. Airflow handles retries (if a task fails, it auto-retries), alerting (sending an email/Slack message if a pipeline fails), and logging for each task. We also used Airflow to run parameterised jobs on demand, such as a "backfill" DAG that could be triggered to recompute data for past dates if needed.
- Data Quality with Great Expectations: Data quality and trust were paramount, so we integrated Great Expectations checks into the data pipelines. Great Expectations is a framework for asserting expected conditions on data. We created expectation suites for critical datasets – for instance, ensuring that the daily sales total is non-negative, or that 95% of records have all required fields present. In Airflow DAGs, after data was loaded into a table, a task would run these validation checks (a minimal validation sketch follows this list). If any data quality test failed, the pipeline could be stopped and flagged for investigation, preventing bad data from flowing downstream or reaching reports. Over time, we worked with the client to add more expectations based on their domain knowledge (like business-rule thresholds and outlier detection), effectively creating an automated data certification process.
- Automation & DevOps: All code for Spark jobs, dbt models, and Airflow pipelines was put in a Git repository. We established a CI/CD pipeline (using GitHub Actions and AWS CodePipeline) that would automatically run tests (including unit tests for data transformations and the Great Expectations checks on sample data) whenever we updated the pipeline code. On a successful test run, the CI/CD would deploy the new or updated Airflow DAGs and related code to the cloud environment. This DevOps approach ensured that the data engineers could develop and deploy iteratively and reliably, with version control and rollback if needed. It also meant the client could continue evolving the platform with minimal risk, as all changes go through automated checks.
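As referenced in the data quality item above, here is a minimal sketch of a Great Expectations check using the classic pandas-based interface (the exact API differs between Great Expectations versions); the dataset and column names are illustrative.

```python
# Minimal sketch of a data quality gate (classic Great Expectations API; names are illustrative).
import great_expectations as ge
import pandas as pd

# In the real pipeline this frame would be read from the lake; a tiny sample stands in here.
sales = pd.DataFrame({
    "order_id": ["A-1", "A-2", "A-3"],
    "total_amount": [120.0, 35.5, 0.0],
    "customer_id": ["C-10", "C-11", "C-12"],
})

df = ge.from_pandas(sales)
df.expect_column_values_to_not_be_null("order_id")                    # primary key must always be present
df.expect_column_values_to_be_between("total_amount", min_value=0)    # sales totals are non-negative
df.expect_column_values_to_not_be_null("customer_id", mostly=0.95)    # tolerate a small fraction of gaps

results = df.validate()
if not results.success:
    # Raising here fails the Airflow task, halting the pipeline before bad data propagates.
    raise ValueError(f"Data quality checks failed: {results.statistics}")
```

In the production pipelines, an equivalent check ran as an Airflow task against the freshly loaded tables, so a failed expectation stopped the DAG before downstream models or reports were affected.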
Overall, the processing and orchestration layer was designed to be robust and flexible. We can add new pipelines by writing a new DAG and some transformation code, without affecting existing ones (thanks to Airflow's modular DAG management and the decoupled nature of the lake and Kafka topics). The combination of Spark and dbt gave us the power to handle both big data processing and business-oriented data modelling. Airflow ties everything together in a schedule, allowing daily, hourly, and real-time flows to coexist in a managed way. This orchestration also made the platform "self-driving" to an extent – once we set up a pipeline, it runs and monitors itself, freeing up human effort from manual data chores.
Query Layer & Data Access Services
Providing fast, convenient access to the data for end-users was a major goal. Instead of having analysts query raw files or build their own extracts, we implemented a robust query layer and data access services:
- Trino (PrestoSQL) Query Engine: We deployed Trino as a federated SQL query engine on top of our data lake and other sources. Trino (formerly PrestoSQL) allows users to run SQL queries that join data across multiple data sources through a single interface. We configured Trino with connectors to our Iceberg tables on S3, to OpenSearch (Elasticsearch), and even to some live databases if needed. This means an analyst or BI tool can connect to Trino and run a single SQL statement that joins, say, customer data in the lake with a real-time index in OpenSearch or a reference table in MySQL. Trino handles fetching data from all those sources and combining it, so the user doesn't have to manually move data around. We set up Trino on a small cluster of EC2 instances (or containers) that could autoscale when multiple users run heavy queries. Importantly, we enabled fine-grained access control in Trino: using tools like Apache Ranger or Trino's own access control configurations, we defined which roles can access which catalogs, schemas, or even specific tables/columns. For example, the HR role might be denied access to the "salary" column in an employee table, or marketing analysts might only see aggregated metrics but not granular personal data. Trino served as a unified query layer or "virtual data warehouse" without incurring the cost of duplicating all data into a separate warehouse technology.
- Semantic Layer & Data Catalog: We implemented a data catalog and discovery tool (options included Amundsen, originally from Lyft, and Azure Purview; we chose Amundsen for its lightweight, open-source nature). This provided a UI where users could search for data sets, view schema information, access sample data, and, most importantly, see data lineage (where the data comes from and which dashboards or queries use it). The catalog was integrated with our pipelines: as part of the dbt models and Airflow jobs, we pushed metadata about table lineage and freshness into the catalog. This increased transparency – for instance, a business user can look up the "monthly_active_users" dataset and see that it's generated by a pipeline every hour from raw web log data, last updated 10 minutes ago, and owned by the data engineering team. Having this context avoids a lot of confusion and duplicate effort. It also promotes a data-driven culture where people can easily find and trust data for their needs.
- BI Dashboards and Tools: For the presentation layer, we made sure the platform connected seamlessly with common BI and analysis tools. We set up an open-source BI tool (Apache Superset) for the client as a quick win to build some dashboards on top of Trino. Superset was configured to query Trino, so users could create charts and dashboards directly on the data lake with good performance. The client's analysts also used Power BI and Tableau in their daily work, so we documented how to connect those tools to the Trino endpoint or directly to the Glue Catalog (for example, by using Athena or ODBC drivers if needed). We ensured that whichever tool they prefer, the live data could be accessed without each team making their own copy. Dashboards for sales, marketing, and finance were created as showcases. For example, a real-time sales dashboard queries the latest sales Iceberg table every 5 minutes via Trino, while a marketing dashboard pulls campaign data combined with web analytics from the data lake.
- Data Subscription API: In addition to human-facing dashboards, we built a lightweight Data API service to allow other applications or services to programmatically retrieve insights from the platform. This was essentially a Flask or FastAPI application that sat in front of the data platform. It exposed REST endpoints for certain common queries or reports. For instance, an endpoint such as `/api/customer/{customer_id}/usage` could return a summary of a customer's recent activity by querying the data lake or OpenSearch index behind the scenes. We implemented this API with caching and pagination as needed, so external systems (like the company's web app or mobile app, or even external partners) could tap into curated data in a secure, controlled manner. The API applies security by requiring API keys or OAuth, ensuring only authorised systems can fetch certain data. This essentially turned the data platform into a service that not only serves internal analysts but also feeds other products or services, enabling data-driven features in the client's customer-facing applications. A minimal sketch of such an endpoint follows this list.
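Below is a minimal sketch of such an endpoint, assuming the FastAPI framework and the Trino Python client; the host, catalog, table, and API-key scheme are placeholders rather than the real deployment.

```python
# Minimal sketch of the Data API in front of the platform (FastAPI + Trino client).
# Host, catalog, schema, table, and auth details are illustrative placeholders.
from fastapi import Depends, FastAPI, Header, HTTPException
import trino

app = FastAPI(title="RCG Data API")

def require_api_key(x_api_key: str = Header(...)) -> None:
    # Stand-in for the real API-key / OAuth check described above.
    if x_api_key != "demo-key":
        raise HTTPException(status_code=401, detail="Invalid API key")

def trino_connection():
    # Connects to the Trino coordinator that fronts the lakehouse tables.
    return trino.dbapi.connect(
        host="trino.internal.example.com",  # hypothetical coordinator
        port=8080,
        user="data-api",
        catalog="iceberg",
        schema="analytics",
    )

@app.get("/api/customer/{customer_id}/usage", dependencies=[Depends(require_api_key)])
def customer_usage(customer_id: str):
    """Return recent daily activity for one customer from a curated lakehouse table."""
    conn = trino_connection()
    cur = conn.cursor()
    cur.execute(
        "SELECT activity_date, events, sessions "
        "FROM daily_customer_activity "
        "WHERE customer_id = ? "
        "ORDER BY activity_date DESC LIMIT 30",
        (customer_id,),
    )
    rows = cur.fetchall()
    return [{"activity_date": str(d), "events": e, "sessions": s} for d, e, s in rows]
```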
By providing multiple access paths (SQL query engine, BI tools, data APIs), we ensured that different types of users could get value from the data platform in the way that suited them. Importantly, all these paths were built on the same single source of truth (the data lake and associated index), so whether someone uses a dashboard or an API or runs an ad-hoc SQL query, they're getting consistent answers. This addressed the prior situation where different teams had conflicting numbers due to siloed spreadsheets and databases.
From a governance perspective, centralising queries through Trino and the API also gave us control: we could audit queries, track usage patterns, and manage access in one place, rather than having direct credentials to the raw databases floating around. We set up logging for Trino queries and API calls so we could see who accessed what data – an important capability for compliance audits and usage optimisation.
Security, Governance & Observability by Design
Throughout the architecture, we wove in security and governance best practices. Rather than treat security as an afterthought, every layer was designed with the principle of least privilege and defence-in-depth:
- Access Controls: We used AWS IAM roles and policies to restrict access to resources. For example, the EC2 instances running Spark or Trino had IAM roles that only allowed them to read/write specific S3 paths (they couldn't access other buckets or sensitive services). Data in S3 was organised such that sensitive data (like PII) was in separate buckets or prefixes with stricter access rules. For user access, we integrated Airflow, Trino, and other tools with the client's SSO/LDAP system where possible, centralising user authentication. Within Trino, we configured SQL-level permissions (as mentioned), and within the BI tools, we set up row-level security on dashboards for any restricted content. Essentially, each user or service only sees the data they are supposed to see.
- Encryption & Data Protection: All data at rest in S3 is encrypted using AWS KMS-managed keys. We enforced TLS encryption for data in transit: Kafka was configured with SSL encryption for data streams, the APIs were only accessible via HTTPS, and internal service communication (such as between Spark and S3, or Trino and S3) ran over secure channels. Additionally, we tokenised or masked sensitive data in non-production environments when copying production data to development for testing. A masking script (using the Faker library or similar) replaced details like names, emails, etc., ensuring compliance with data protection regulations even in development and testing scenarios (a minimal masking sketch follows this list).
- Data Lineage and Auditing: We enabled logging at multiple levels. Every time data was ingested or transformed, the process logs an entry about the number of records processed, timestamps, and any anomalies. Airflow's logging, combined with custom audit logs in Spark/dbt, gives an end-to-end trace of data flows. We also implemented data lineage tracking using a combination of dbt (which inherently tracks dependencies between tables) and the Amundsen data catalog. This means we can pick a data item (say a number on a dashboard) and trace back which pipeline, source file, or upstream system it came from. If something goes wrong or a number looks off, this lineage helps quickly pinpoint where the issue might have arisen.
- Compliance Alignment: Although the client operated in an industry that was not heavily regulated like healthcare or finance, we still adhered to common compliance frameworks. For example, GDPR principles were followed for customer data (we made it easy to delete or anonymise a customer's data across the data lake if a "right to be forgotten" request was received). We also took data residency into account – if in the future they expand to regions with data locality laws, the infrastructure can be deployed in multiple regions to keep data isolated as required. Access logs (via AWS CloudTrail, etc.) are maintained so any access to sensitive data is recorded and reviewable. These measures reassured the client that the new platform wouldn't introduce security or compliance risks but, in fact, would reduce them compared to their previous setup.
- Monitoring & Observability: We established a comprehensive monitoring stack for the platform's operations. All application logs (from NiFi, Kafka, Spark, Airflow, etc.) are collected in a central log management system (using Elasticsearch/Kibana or AWS CloudWatch Logs with subscriptions). This enables the engineering team to search logs in one location when debugging. We also deployed Prometheus to gather metrics from various components – such as JVM metrics from Kafka and Spark, Airflow DAG success/failure counts, CPU and memory usage of nodes, etc. These metrics are visualised in Grafana dashboards that we configured for the client, displaying the health of data pipelines at a glance (e.g., a dashboard might show "Lag on Kafka topics," "Records processed per minute," "Pipeline success rate," etc.). We set up alerts using CloudWatch Alarms and Prometheus Alertmanager – such as if a critical pipeline fails or runs too long, if a server goes down, or if costs for the month are trending high (yes, we even established a cost monitoring alert to stay within budget!). Alerts notify the on-call engineer via email or Slack so immediate action can be taken.
- Resilience and Failover: The architecture inherently provided resilience, such as Kafka clustering, multi-AZ deployment for critical services, and EMR checkpoints for Spark, but we also tested it. During staging, we simulated failures—e.g., turning off a Kafka broker to see if producers and consumers auto-failover to other brokers (they did), or terminating a Spark executor mid-job to ensure the job retries the task on another node. We also tested recovery scenarios: if a pipeline failed due to bad data, does it quarantine the bad data and alert us (using Great Expectations)? If an entire AWS availability zone went down, would our services in another AZ pick up the load? By simulating these scenarios, we fine-tuned configurations—such as increasing the replication factor on Kafka topics, enabling checkpointing in streaming jobs, and using multi-AZ RDS for Airflow metadata—to eliminate single points of failure.
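As referenced in the encryption and data protection item above, here is a minimal sketch of the kind of masking step used when copying data into non-production environments, assuming pandas and the Faker library; the column names and deterministic-seeding scheme are illustrative.

```python
# Minimal sketch of masking PII before loading extracts into dev/test environments.
# Column names and the seeding scheme are illustrative assumptions.
import hashlib

import pandas as pd
from faker import Faker

def mask_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Replace direct identifiers with fake but stable values (same input -> same fake output)."""
    masked = df.copy()

    def fake_for(value: str, field: str) -> str:
        # Seed Faker from a hash of the original value so joins across tables stay consistent.
        seed = int(hashlib.sha256(f"{field}:{value}".encode()).hexdigest(), 16) % (2**32)
        fake = Faker()
        fake.seed_instance(seed)
        return fake.name() if field == "name" else fake.email()

    masked["name"] = masked["name"].map(lambda v: fake_for(str(v), "name"))
    masked["email"] = masked["email"].map(lambda v: fake_for(str(v), "email"))
    return masked

# Example: mask a small extract before copying it to the development environment.
customers = pd.DataFrame({"customer_id": [1, 2],
                          "name": ["Jane Roe", "John Doe"],
                          "email": ["jane@example.com", "john@example.com"]})
print(mask_customers(customers))
```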
In designing the solution, we never lost sight of the cost constraint. Each service introduced was weighed against open-source or serverless alternatives. For example, we considered AWS Kinesis for streaming but chose Kafka on EC2 for lower long-term cost and flexibility. We avoided an expensive data warehouse by using Iceberg on S3. We used as few EC2 instances as possible, leaning on managed services with free tiers (Glue catalog, Lambda for some lightweight tasks, etc.) and efficient use of resources (packing multiple lightweight services like minor APIs on one EC2 if needed). We estimated the cost of each component and ensured the total would stay around $500/month given the expected data volumes and usage patterns. In fact, the final design came out to be very economical – something we validated continuously in the next phase.
Implementation & Delivery
RCG's cross-functional team (data engineer, analytics engineer, cloud architect, data governance lead) collaborated closely to implement this platform using an agile delivery approach. We divided the work into sprints focused on achieving small, incremental goals – for example, one sprint to set up infrastructure and a basic pipeline, another to add streaming capabilities, another to introduce governance features, and so forth. This iterative method enabled the client to see progress early and offer ongoing feedback, ensuring the final product met their expectations. Below are some highlights and key aspects of the implementation process.
- Infrastructure as Code Deployment: In the first sprint, we established the AWS foundation using Terraform scripts prepared during the design phase. With a single command, we could provision the VPC, subnets, security groups, EC2 instances, S3 buckets, RDS databases, and other necessary resources in a development environment. This automated provisioning was vital for ensuring consistency; it also means that future scaling or region expansion can reuse the same scripts. We kept environments isolated (development/test/production) with separate configurations but utilised the same code – allowing us to safely experiment in development and then deploy to production with confidence that it is reproducible. The Terraform code was stored in Git, and changes to infrastructure underwent code reviews, adding an extra layer of checks to prevent costly mistakes. Within minutes, our entire core platform could be deployed or dismantled, providing us with significant agility in testing different configurations.
- Setting Up Pipelines: We then tackled data ingestion pipelines. For the batch pipelines, our engineers built NiFi flows to connect to the MySQL CRM and Postgres ERP. We set up NiFi on an EC2 instance with high memory (to handle large data flows efficiently) and created templates for common ingestion patterns, such as "dump a table to CSV and upload to S3." Using these, we onboarded initial datasets quickly. Each NiFi pipeline was parameterised—for example, we could reuse the same flow for multiple tables by simply changing a table name parameter. We tested these with small data samples, then scaled up to full extracts, ensuring the flows could handle the volume. For streaming, we set up the Kafka cluster and connectors (Debezium for CDC from databases). We began streaming a small table's changes first to validate the end-to-end process. Once confirmed, we expanded to more topics for other tables and events. Meanwhile, we developed the initial Spark Structured Streaming job to consume those Kafka topics. This code was containerised and deployed on EMR. We fine-tuned Spark configurations (such as memory and batch intervals) to ensure it processed events promptly without lag. One challenging aspect was exactly-once processing—we used Kafka offsets and Iceberg’s ACID capabilities to prevent data duplication if a job restarted.
- Data Lake and Iceberg Setup: We set up the S3 bucket structure for the data lake, organising folders for raw and curated data, partitioned by domain. Using sample data, we established the first Iceberg tables (for example, a Customer table). We configured the Glue Data Catalog with the database and table definitions. Afterwards, we wrote a few Spark scripts to convert raw data into Iceberg format, primarily by reading CSV files and writing them as partitioned Parquet files with Iceberg. These scripts were later integrated into data pipelines. An important step was defining the convention and processes for schema changes: since Iceberg supports schema evolution, we documented how to handle a new source column, for example, adding it to the table with a default null and recording it in metadata. We also enabled versioning on the S3 bucket and implemented a backup strategy (periodically snapshotting the metadata catalog) for safety, despite Iceberg's own version history.
- Developing Transformations (Spark & dbt): The team then developed the data transformation logic. For intensive computations, we coded PySpark jobs. Initially, we used Jupyter notebooks to prototype some transformations with sample data, then translated them into PySpark jobs in our codebase. We ensured that logging was included in these jobs (tracking record counts, anomalies, and runtime) so operators could trace what happened later from the logs. For the dbt models, our analytics engineer defined several layers: staging models (which select from raw data and perform light cleaning), intermediate models (joining datasets and applying business logic), and final models (the ones presented to users, like fact and dimension tables, similar to a Kimball warehouse approach). We added tests in dbt for critical fields and used dbt's built-in documentation generation to produce descriptions for each model (such as column details). Running dbt docs provided an internal site where the client's analysts could learn about each data model, which was part of the knowledge transfer.
- Orchestrating with Airflow: We installed Apache Airflow using the official Docker image and configured it to connect to an Amazon RDS MySQL instance for its metadata database. We set up Airflow's executor as Celery with a Redis broker, enabling us to scale out workers when needed for parallel tasks. The team wrote Airflow DAG code for several pipelines. One example is the daily sales ETL DAG, which we described in the design. We used Airflow's built-in operators where possible (e.g., PythonOperator for running small Python functions, SparkSubmitOperator to initiate Spark jobs, etc.). For NiFi flows, we scheduled them outside Airflow (using NiFi's own scheduler), but Airflow included sensors to verify that NiFi jobs were complete or to trigger NiFi through APIs if necessary – in some cases, we initiated NiFi pipeline runs from Airflow to centralise scheduling. As an illustration, here is a simplified example of an Airflow DAG definition in code.
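(The DAG below is a representative sketch of the daily sales pipeline; task names, connection IDs, schedules, and paths are illustrative rather than the client's actual configuration, and the data quality task is shown as a placeholder for the Great Expectations suite.)

```python
# Illustrative daily sales ETL DAG; all identifiers and paths are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,                          # auto-retry failed tasks
    "retry_delay": timedelta(minutes=10),
    "email_on_failure": True,              # alert the on-call engineer
}

def validate_sales_data(**context):
    """Placeholder for running the Great Expectations suite on the freshly loaded table."""
    # In the real pipeline this raises on failure, stopping the DAG before bad data propagates.
    pass

with DAG(
    dag_id="daily_sales_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",         # run nightly at 02:00
    catchup=False,
    default_args=default_args,
) as dag:

    transform_sales = SparkSubmitOperator(
        task_id="spark_transform_sales",
        application="s3://rcg-data-platform/jobs/transform_sales.py",  # hypothetical job path
        conn_id="spark_default",
    )

    run_dbt_models = BashOperator(
        task_id="dbt_run_sales_models",
        bash_command="dbt run --select fact_sales+ --profiles-dir /opt/dbt",
    )

    check_quality = PythonOperator(
        task_id="validate_sales_data",
        python_callable=validate_sales_data,
    )

    transform_sales >> run_dbt_models >> check_quality
```

The retry, alerting, and logging behaviour described earlier applies to each of these tasks through the `default_args` block and Airflow's task-level logging.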
After several iterations and testing cycles, we performed the final deployment to production. The transition from the old system to the new platform was carefully planned to minimise disruption. We operated the new platform alongside the old for a short period ("dual-run") to compare outputs and ensure they matched. Once validated, the old pipelines were decommissioned. The new scalable, unified data platform was now live, reliably providing data to stakeholders. We then moved into the Improve phase, focusing on results monitoring and ongoing improvements.
Results & Impact
The deployment of this scalable, cost-efficient, and resilient data platform delivered exceptional results for the client. It effectively transformed their data operations and generated new business value. Below, we summarise the key results and impacts observed after going live:
- Dramatic Cost Reduction: The platform achieved its goal of cost-efficiency. The total cloud expenditure for running the data platform was approximately $500 per month (about 30% less than their previous spend). By replacing costly legacy ETL tools and optimising resource use (such as auto-scaling clusters and spot instances), we lowered infrastructure costs by an estimated 30%, while managing more data and users than before. These savings freed up budget for other initiatives, and the client's finance team was impressed that we could enhance capabilities while reducing expenses. Additionally, cost monitoring and alerts in place ensure that spending remains predictable – if usage increases, the team will be notified to adjust accordingly, preventing unexpected bills (a common issue in the old setup).
- Faster Insights & Better Decision-Making: What used to take hours or days in the old environment now takes minutes. For instance, marketing campaign performance data that had to be manually assembled from multiple sources is now accessible in a real-time dashboard that refreshes automatically every few minutes. The sales team, which previously waited until midday to receive the previous day's sales figures in a spreadsheet, now wakes up to a morning email summary and live dashboard of sales metrics by 6 AM. This increase in data availability (in some cases from a 24-hour delay to just minutes or near-instant access) has clearly enhanced decision-making speed. Managers can respond to trends immediately – one example: the marketing department detected a dip in user engagement in near-real-time and quickly launched a targeted campaign on the same day to address it, something not achievable before. Across the board, the organisation is more agile and data-driven, making decisions based on current data rather than outdated reports.
- High Reliability & Resilience: Since launch, the platform has maintained over 99.9% uptime, achieving the strict SLA we targeted. There have been no major outages impacting business reporting. Even when minor incidents, such as a node failure or a late data file, occurred, the system's resilience mechanisms activated—clustered services and retries ensured continuity without human intervention. For instance, when one Kafka broker was briefly unreachable due to a network glitch, data producers automatically retried, and other brokers took over, with no data loss or downtime. The client's stakeholders have recognised the reliability – trust in the data platform has grown significantly because it is always available when needed. This reliability contrasted sharply with the old setup, where reports were often delayed by system issues.
- Improved Data Quality & Governance: A significant achievement of the new platform is the boost in trust and transparency of data. Thanks to the data quality frameworks we integrated, the data outputs are effectively "certified". If a pipeline finishes successfully, users are assured the data meets established quality standards (or else it would trigger an alert). This has lessened the time analysts spend double-checking or reconciling figures. One clear outcome: the finance team reported that the quarterly financial reconciliation process (cross-verifying figures from various reports) was halved because the numbers aligned perfectly from the unified platform, unlike before when they had to investigate discrepancies from different source extractions. From a governance standpoint, sensitive data is now clearly managed – for example, only HR can query HR data, only finance can see revenue details, and so on, all enforced by the platform. This principle of least privilege diminishes the risk of unauthorised access or internal data leaks. The client's compliance officer was pleased to see features like audit logs and lineage documentation, simplifying audits and regulatory reporting. In brief, data governance shifted from ad-hoc to systematic: the organisation now has policies enforced by technology, not just trust.
- Enhanced Data Accessibility & Collaboration: The unified nature of the platform has promoted better collaboration and self-service. With the data catalogue and consistent data definitions, different teams are effectively "on the same page." For example, what defines an "active customer" is located in one place (the data model) and universally used, rather than each team creating their own definition. Analysts can independently discover data sets, identify data owners, and request access through a formal process if necessary—all facilitated by the catalogue and governance setup. The barrier to starting new analysis has decreased: a new analyst joined the marketing team and, in her first week, she was able to find the tables she needed and build a report without extensive help from IT, thanks to the documentation and tools available. We also observed an unexpected benefit: because the platform is accessible (with proper guardrails), some tech-savvy business users began learning SQL and conducting more exploration themselves, which uncovered new insights. The client effectively gained a modern data culture where IT and business users share the same platform, rather than submitting requests over a fence.
- Scalability & Future Growth: The delivered solution is designed to support the client's future needs, as demonstrated by existing evidence. Soon after initial deployment, the client acquired a smaller company and needed to integrate a new data source (a MongoDB database from the acquired product). Thanks to the modular architecture, the client's team – with some guidance from us – was able to connect this new source to the pipeline within days: they set up a Kafka connector for MongoDB, transferred the data into the lake, and began incorporating some of it into existing models. The ability to onboard new data sources quickly validated the flexibility of the design. Performance-wise, the platform has headroom: the average system utilisation is around 50%, which means it can handle bursts and growth. If the business doubles in customers or transactions, the data platform can scale by allocating more resources or adding nodes, without needing a redesign. This future-proof feature ensures the client's investment in this platform will continue to benefit them for years. Additionally, because we used largely open-source and standard components, the client isn't tied to a single vendor – they can adapt as new technologies emerge (for example, if in two years a better query engine appears, they can replace Trino; or if they want to implement machine learning, they can easily integrate SageMaker or Databricks on top of the lake).
- Empowered Teams & Agile Operations: By working closely with the client's staff and delivering training, we didn't just provide a black-box system – we empowered the client's team to own and develop the platform. This is a vital outcome: the internal data engineering team enhanced their skills through the project (learning infrastructure-as-code, Kafka, etc.). They now continue to operate in an agile way, managing backlog grooming for new data requests, and they utilise the foundations we established (CI/CD, testing, monitoring) to add features securely. Essentially, we left them with both a product and the ability to oversee it. The "you build it, you run it" DevOps philosophy we introduced has taken root, resulting in faster iterations. For instance, when a business stakeholder asked for a new KPI on a dashboard, the data team was able to update the pipeline and deliver within the next sprint – a much quicker turnaround than the previous system permitted. The confidence and morale of the data team have increased because they have access to modern tools that assist them, rather than hinder them.
In quantitative terms, within a few months of operation, the client reported improvements such as a 20-30% reduction in the time analysts spent on data preparation, a notable increase in on-time report delivery (close to 100% on time versus approximately 70% before), and cost savings that allowed re-investment in other data initiatives. Qualitatively, the platform has become an "analytics hub" for the company – a single place for data, which is a strategic asset. It has also created the foundation for more advanced analytics; for example, the data science team is excited that they now have easy access to clean, well-organised data to develop machine learning models for applications like customer lifetime value and churn prediction, which was previously challenging. In short, the project succeeded on all fronts —cost, speed, reliability, governance, and scalability —fundamentally transforming the client's data capabilities and business potential.
Continuous Improvement & Next Steps
RCG's engagement with the client did not stop at delivering the solution. In line with our "Improve" phase of the RCG model, we established a plan for continuous improvement and value expansion of the data platform:
- Post-Launch Monitoring and Tuning: In the initial weeks after go-live, we closely monitored the platform's performance and usage. We collected feedback from end-users regarding any issues or new requirements. Minor adjustments were made, such as modifying the schedule of some pipelines based on actual data arrival patterns and fine-tuning query caching in Trino to accelerate certain frequent dashboard queries. This ensured the platform continued to operate optimally as real-world loads increased.
- Agile Backlog of Enhancements: Together with the client, we maintained an ongoing backlog of improvements – some were nice-to-have features that didn't make the initial release, while others were new requests as users began exploring the platform. For instance, one enhancement added after launch was a data archival process: we developed a workflow to automatically archive data older than 2 years to an even cheaper storage tier (Amazon Glacier), to further optimise costs as data continued to accumulate (a minimal sketch of one such archival rule follows this list). Another future enhancement in the backlog is integrating a data science notebook environment (like JupyterHub or AWS SageMaker) directly with the data lake, allowing data scientists to prototype models on fresh data easily.
- Scaling and Optimisation: As data volumes grow, we advised the client on medium-term steps such as partitioning strategies and hardware upgrades. We scheduled quarterly architecture reviews to reassess capacity and costs. Due to business growth, we expect to increase the Kafka cluster size and possibly add a few more Spark nodes in the next year – all of which is straightforward thanks to the auto-scaling design. We also look for opportunities to adopt new technologies: for instance, we are considering Apache Flink for certain streaming analytics that might require lower latency than Spark, and Delta Sharing or Lake Formation for data sharing with external partners if the need arises. Our mindset for improvement is to keep the platform modern and efficient as requirements evolve.
- User Training & Data Literacy: We recommended that the client continue investing in data literacy programmes to maximise the platform's value. We established "office hours" and a user group where analysts and business users can share how they use the platform, ask questions, and suggest new datasets or features. This community approach keeps users engaged and ensures the platform remains aligned with business needs. We also provided advanced training to some power users on writing SQL queries on Trino, creating their own Great Expectations data quality rules, and other skills, effectively embedding best practices into the organisation's DNA.
- Governance Oversight: To maintain effective governance, we assisted the client in establishing a Data Governance Council comprising data owners from each department and IT governance personnel. This council meets periodically to review access requests, data quality metrics, and ensure compliance requirements are met as new data is onboarded. The platform itself supplies the metrics (e.g., how many quality incidents were identified, how many queries were executed by each department), enabling the council to make informed decisions (such as "we need to improve quality on X dataset" or "Y team appears to require a new data source to be integrated"). This governance process guarantees the data platform continues to provide trusted data and that any issues regarding data definitions or quality are addressed collaboratively.
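As referenced in the enhancement backlog above, one lightweight way to implement the archival process is an S3 lifecycle rule applied with boto3, transitioning objects under an archive-eligible prefix to Glacier once they reach roughly two years of age; the bucket name, prefix, and threshold are assumptions for the sketch.

```python
# Minimal sketch of the archival policy: transition objects older than ~2 years in the
# raw-data prefix to Glacier via an S3 lifecycle rule. Bucket and prefix are illustrative.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="rcg-data-lake",                 # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 730, "StorageClass": "GLACIER"}  # ~2 years after object creation
                ],
            }
        ]
    },
)
```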
Through these ongoing improvement efforts, the data platform will not stagnate but rather evolve with the business. RCG remains a strategic partner to the client, ready to assist in the next phases—whether it's incorporating more sophisticated analytics, scaling to global operations, or further cost optimisations. The agile, iterative mindset we have instilled means the platform is never "done" in a static sense; it's a living system that can adapt as new challenges and opportunities arise.
Conclusion
In this case study, Ray Consulting Group showcased how a holistic, engineering-led approach can transform a complex data challenge into a success story. We delivered a scalable, cost-efficient, and resilient data platform that now functions as the central nervous system for the client's decision-making. By following our Discover – Design – Deliver – Improve framework, we ensured alignment with business goals from the outset, designed thoughtfully with modern best-of-breed technologies, executed with discipline and agility, and laid the groundwork for ongoing improvement.
This engagement highlights RCG's strengths across the data spectrum: data engineering (building pipelines and infrastructure), analytics engineering (modelling and preparing data for analysis), data migration (seamlessly transferring historical data into a new platform), and data governance (implementing security, quality, and compliance). The only aspects we did not directly undertake were the final BI analytics and data science modelling – but we enabled these by providing clean, ready-to-use data on a silver platter. The client's analytics teams can now concentrate on deriving insights rather than fixing data issues.
Ultimately, the project was delivered on time and within budget, surpassing expected outcomes. The client now possesses a competitive edge – the ability to leverage their data assets swiftly and reliably, without incurring excessive costs. This case study exemplifies how RCG combines the right technology stack, methodology, and collaborative teamwork to resolve real-world data problems. It stands as a testament to being the "best data services team" our client could partner with, and it serves as a blueprint we are prepared to apply to future data challenges.