Case Study
Accelerating Data Freshness with Incremental Loading in Financial Analytics
Sector: Financial Services
Background
A global financial services firm relied on daily dashboards for trading risk and performance metrics. In such real-time financial applications, timely data updates are critical. Yet, their data warehouse ingestion process was doing full loads – rebuilding entire tables from scratch on each batch run. This approach meant that even minor daily changes triggered reprocessing of all historical data, leading to slow refresh cycles. Stakeholders often waited ~3 hours after market close for dashboards to update, delaying decision-making.
Challenge – Slow Data Refresh with Full Loads
The full-load ETL strategy became a bottleneck as data volumes grew. Key issues included:
- Long Processing Time: Each run reloaded the entire dataset, consuming nearly 3 hours nightly. This lag undermined the firm's ability to get near real-time insights, which is unacceptable in environments like stock trading where updates must be timely.
- Resource Inefficiency: Full loads were re-reading and writing terabytes of unchanged historical data, wasting compute and I/O. Incremental changes (often just a few percent of data) were overshadowed by redundant processing of old data.
- Stale Dashboards: Business intelligence reports only refreshed once the full load completed. Data was only as fresh as the last full load, so intraday updates or quick iterations were impossible. In fast-moving financial markets, this latency hampered responsive analysis.
- Maintenance Complexity: While conceptually simple, full reloads offered no easy checkpointing; any failure meant starting over from scratch. In contrast, moving to incremental loading introduces complexity (change tracking, merge logic) and requires careful design.
The challenge was clear: how to speed up data loading, and hence dashboard refreshes, without sacrificing accuracy or completeness.
Solution – Incremental Loading Pipeline Design
To resolve these issues, we redesigned the batch data pipeline to use incremental loading instead of full reloads. The core idea is to process only new or changed records on each run, rather than the entire history. This involved changes across our AWS-based stack, leveraging dbt, Amazon S3, and Redshift Spectrum for an efficient lakehouse-style architecture:
Incremental dbt Models: We refactored our dbt models to use the incremental materialization. The first run still fully populated the tables, but subsequent runs apply a WHERE filter (using the is_incremental() macro) to select only recent data (e.g. the last day of records) for processing. dbt then performs an upsert—inserting new records and updating existing ones—using a merge strategy with a unique key. This keeps the target table in Amazon Redshift in sync with the latest source data without a full rebuild (see the sketches below). According to industry sources, using incremental models "limits the amount of data loaded… significantly reduces loading time" and "vastly improve[s] performance" in ETL workflows.
Data Lake Integration (Amazon S3 + Spectrum): Incoming data from operational databases was continuously ingested into an S3 data lake as parquet files (via AWS DMS CDC streams). We configured Amazon Redshift Spectrum external tables pointing to the S3 landing zone, so that dbt could query new files directly without a separate load step. Redshift Spectrum allows the warehouse to query data in S3 without loading it into Redshift tables, which was ideal for our incremental approach. In each incremental run, dbt would read only the latest S3 partitions (e.g. the last day's partition) via Spectrum and join them with existing warehouse data to merge changes.
Batch Orchestration: The pipeline remained batch-driven (triggered by a nightly cron/Airflow job), but with a much shorter runtime. We preserved the same multi-layer dbt architecture (staging, intermediate, marts), but every layer was now built incrementally. Dependency management in dbt ensured downstream models also reprocessed only the new data. We also implemented data quality checks on incremental loads (e.g. validating record counts for the day, sketched below) to catch discrepancies early, since incremental processes can miss updates if not carefully managed.
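To make the approach concrete, here is a minimal sketch of the Spectrum setup described above. All names are hypothetical (a spectrum_landing external schema, a trades_cdc table, an example S3 bucket, and a placeholder IAM role); the real project's schemas and columns differ.

```sql
-- One-time setup: an external schema backed by the AWS Glue Data Catalog, plus an
-- external table over the date-partitioned parquet files written by DMS.
create external schema if not exists spectrum_landing
from data catalog
database 'landing_zone'
iam_role 'arn:aws:iam::123456789012:role/redshift-spectrum-role';  -- placeholder ARN

create external table spectrum_landing.trades_cdc (
    trade_id   bigint,
    account_id bigint,
    notional   decimal(18,2),
    updated_at timestamp
)
partitioned by (load_date date)
stored as parquet
location 's3://example-bucket/landing/trades_cdc/';

-- New daily partitions are registered as they arrive, e.g.:
alter table spectrum_landing.trades_cdc
add if not exists partition (load_date = '2024-01-15')
location 's3://example-bucket/landing/trades_cdc/load_date=2024-01-15/';
```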
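And a simplified sketch of one incremental dbt model built on that external table. Model, source, and column names are illustrative, the source() call assumes a corresponding sources.yml entry (not shown), and the 'merge' strategy assumes a dbt-redshift version that supports it. The cutoff date is read from a dbt variable so the orchestrator can pass it at run time, as noted in the stack section below.

```sql
-- models/marts/fct_trades.sql  (hypothetical model and column names)
{{
    config(
        materialized = 'incremental',
        unique_key = 'trade_id',
        incremental_strategy = 'merge'
    )
}}

select
    trade_id,
    account_id,
    notional,
    updated_at,
    load_date
from {{ source('spectrum_landing', 'trades_cdc') }}

{% if is_incremental() %}
  -- Only scan S3 partitions newer than what the target already holds; the orchestrator
  -- can pin the upper bound via `dbt run --vars '{cutoff_date: 2024-01-15}'`
  -- (defaults to the run date).
  where load_date > (select max(load_date) from {{ this }})
    and load_date <= date '{{ var("cutoff_date", run_started_at.strftime("%Y-%m-%d")) }}'
{% endif %}
```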
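Finally, a sketch of one record-count check written as a dbt singular test (it returns rows, and therefore fails, when the latest landing partition has more records than the mart). It reuses the hypothetical names above; the exact comparison in practice depends on the CDC semantics of the real feed.

```sql
-- tests/assert_latest_partition_loaded.sql  (hypothetical singular test)
with landing as (
    select load_date, count(*) as landing_rows
    from {{ source('spectrum_landing', 'trades_cdc') }}
    where load_date = (select max(load_date) from {{ ref('fct_trades') }})
    group by load_date
),
mart as (
    select load_date, count(*) as mart_rows
    from {{ ref('fct_trades') }}
    group by load_date
)
select l.load_date, l.landing_rows, m.mart_rows
from landing l
left join mart m using (load_date)
where coalesce(m.mart_rows, 0) < l.landing_rows
```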
Technology Stack
Our solution leveraged a modern cloud analytics stack, using the best tool for each task:
- AWS DMS (Database Migration Service): Used for change data capture, replicating source database changes (in near real-time) to Amazon S3 in parquet format. This provided the incremental feed of new data.
- Amazon S3 Data Lake: Served as the landing zone for raw data. S3 held the historical dataset in partitions (e.g. by date), enabling partition-pruning for incremental loads.
- Amazon Redshift + Spectrum: Redshift was our data warehouse, and Spectrum let us treat S3 data as external tables. Redshift's SQL engine could join warehouse tables with new S3 data on the fly. The internal Redshift tables (dimensions, facts) were updated by merging in the new data.
- dbt (Data Build Tool): Orchestrated the transformations and merge logic. dbt incremental models automate the "MERGE vs INSERT" logic in SQL under the hood (a rough equivalent is sketched after this list), so we didn't have to hand-write upsert statements for Redshift. We configured the unique_key and incremental filter logic per dbt best practices. The result was a simpler development workflow, since dbt handles the heavy lifting of materialization.
- Orchestration & Scheduling: AWS Step Functions or Apache Airflow (via MWAA) triggered the incremental pipeline runs on a schedule (and on-demand if needed). The orchestration was configured to supply the cutoff (e.g. "load data up to yesterday") to the dbt job. Our previous cron job was repointed to run the new incremental dbt project.
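For illustration, the SQL that dbt generates for an incremental run of the model sketched earlier is roughly equivalent to the hand-written upsert below. This is a sketch only: the exact statements depend on the adapter version and configured strategy (newer dbt-redshift adapters can emit a native MERGE), and the analytics target schema is assumed.

```sql
begin;

-- 1. Stage only the rows selected by the model's incremental filter.
create temp table fct_trades__tmp as
select trade_id, account_id, notional, updated_at, load_date
from spectrum_landing.trades_cdc
where load_date > (select max(load_date) from analytics.fct_trades);

-- 2. Delete target rows that are about to be replaced, matched on the unique_key.
delete from analytics.fct_trades
using fct_trades__tmp
where analytics.fct_trades.trade_id = fct_trades__tmp.trade_id;

-- 3. Insert the fresh rows.
insert into analytics.fct_trades
select * from fct_trades__tmp;

commit;
```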
Results – Faster Dashboards and Fresher Data
The impact of moving to incremental loading was dramatic and immediately felt across the organization. By processing only recent changes instead of the entire history, the ETL runtime dropped from ~3 hours to under 1 hour – a 66% reduction in load time. This aligns with the common observation that "incremental loading is of course much faster than a full load." In fact, some data models saw 3x–5x speed improvements.
Key outcomes and benefits:
- Data Freshness: Dashboards now update on an hourly schedule (or even more frequently for critical metrics), as opposed to once overnight. Business teams have access to near real-time data. In the finance sector, this means risk managers can react to market moves on the same day, not the day after. Incremental loading lets our warehouse better support near real-time analytics needs.
- Resource Efficiency: Warehouse compute costs dropped significantly. We eliminated redundant processing of unchanged data, which freed up Redshift capacity for other workloads. As one source notes, limiting loads to new data "reduces loading time" and also "reduces compute costs." In our case, Redshift cluster concurrency and throughput improved, since heavy full-table scans were replaced by targeted upsert operations.
- Scalability: The new pipeline is more scalable. As data volume grows, the incremental load still handles a day's delta quickly, whereas the old full-load runtime would have grown linearly with total data size. This ensures we can onboard more data sources (and we did add 3 new source systems without impacting nightly load schedules).
- Broader Adoption: Seeing the success, other analytics teams began adopting incremental modeling. It proved the viability of a batch-incremental hybrid approach – combining batch scheduling with incremental processing – to balance freshness and reliability.
- Maintenance & Complexity: Despite the initial complexity of implementation (incremental logic, CDC handling, etc.), the system has been running reliably. We put guardrails in place (e.g. a periodic full-refresh fallback and robust monitoring) to mitigate incremental load risks like data drift or missed updates. The result is a maintainable pipeline that meets the firm's data SLAs.
Conclusion
By switching from full loads to incremental loads, the financial firm achieved a step-change improvement in data latency. The project showcases how even batch pipelines can be optimized to deliver near real-time data. This paradigm shift – reconstructing only what's changed rather than everything – can benefit any sector dealing with growing data volumes and the need for timely insights. As our case showed, the investment in a more sophisticated incremental ETL process paid off with faster analytics, happier end-users, and a more efficient data infrastructure. The approach is applicable beyond finance to retail, IoT, or any domain where fresh data is a competitive advantage. Embracing incremental loading has empowered the organization to make data-driven decisions with confidence, knowing that their dashboards reflect the latest state of the business.
Sources: The advantages of incremental vs. full data loads are well documented in industry literature. Modern data tools like dbt further simplify implementing incremental models for warehouses. Amazon's ecosystem (DMS, S3, Redshift Spectrum) enables querying data lakes without heavy reloads, a pattern we leveraged. This case validates those principles in practice – incremental ETL delivered the speed and efficiency gains as promised, transforming the organization's analytics capabilities.