The Great Shift Left: Embracing the Shift Left Data Architecture

“Shift Left” in data architecture refers to moving data quality controls, testing, and validation earlier in the data pipeline, rather than waiting until data reaches its final destination. Think of it as pushing quality gates "leftward" in your data flow.

This article explores the journey of ACME Corp, a fictional company that transformed its data practices by implementing a Shift Left strategy. From establishing data contracts to embracing real-time processing, this case study illustrates how moving data quality checks and processing closer to the source can dramatically improve an organization's data operations and decision-making capabilities.

Whether you're a data professional looking to optimize your company's data architecture or a business leader seeking to understand the latest trends in data management, this article walks you through the potential of the Shift Left approach and its real-world applications.

ACME Corp - A House of Cards

Once upon a time, ACME Corp, a global leader in consumer electronics, was struggling with a common issue that plagued many organizations of its scale: focusing on speed of delivery and treating data quality as an afterthought.

Their architecture looked like this: data validation and quality checks happened only at the end of the pipeline, on the analytics side.

Their traditional data architecture had served them well in the past. However, as the volume and velocity of data surged, cracks began to show and problems started to accumulate:

  • Data issues were discovered weeks after entering the systems
  • Each department had its own data validation rules
  • 40% of data scientists' time was spent cleaning data
  • Trust in data was at an all-time low
  • Cost of fixing data issues: $2M+ annually

The Breaking Point

It was a typical Monday morning when Sarah Chen, ACME Corp's Chief Data Officer, walked into an emergency executive meeting. The company's flagship AI-powered inventory management system had just made a $12 million mistake. The system had ordered a year's worth of raw materials for a product line that was being discontinued next month.

"How did this happen?" demanded the CEO.

The root cause analysis revealed a familiar pattern: bad data had silently crept through their pipelines, propagated across systems, and finally surfaced when it was too late - and too expensive - to fix.

The Decision to “Shift Left”

The inventory system fiasco was the final straw. Sarah and her team spent weeks researching modern data architecture patterns and discovered the "shift left" approach.

What is Shift Left data architecture? Another fancy concept only seen at data conferences? Well, it’s too early to say that. But we can safely say that it is a proactive approach designed to address potential issues earlier in the data lifecycle and to enable real-time, streaming data processing. By shifting responsibilities like data quality checks, processing, and monitoring closer to the data generation point, organizations like ACME could theoretically empower each department to access fresher insights without the typical wait time associated with batch processing.

Moreover, ACME had seen a few case studies showing that by shifting left, data management responsibilities could move closer to developers and analysts. This would allow for faster troubleshooting, reduce rework, and, most importantly, decrease dependency on a single, overburdened data engineering team.

Shift left emphasizes validating and transforming data closer to its source.

The Great Shift Left - Implementing the Shift Left Architecture

Transitioning to a shift left architecture was no small feat, and Sarah’s team knew it meant a reimagining of their traditional setup.

Rather than making a big bang transition, they followed a phased approach towards adoption.

Phase 1: Establishing Data Contracts

Data contracts are formal agreements between data producers and consumers that define the structure, format, and quality expectations of shared data. Think of them as mutual commitments that data producers and consumers uphold, ensuring data consistency and reliability throughout the system.

Firstly, ACME decided to establish data contracts between teams. Every data producer and consumer had to agree on the following (a sketch of such a contract appears after the list):

  • Data format and schema
  • Quality expectations
  • Service level agreements
  • Error handling procedures
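
To make this concrete, here is a minimal sketch of what such a contract might look like when captured declaratively in code. The dataset name, fields, thresholds, and SLA values are all hypothetical and would be negotiated per dataset in practice.

```python
# orders_contract.py - a hypothetical data contract for an "orders" stream.
# Dataset, field names, thresholds, and SLA values are illustrative only.
ORDERS_CONTRACT = {
    "dataset": "orders",
    "owner": "supply-chain-team",
    "schema": {  # expected structure of every record (JSON Schema style)
        "type": "object",
        "required": ["order_id", "sku", "quantity", "order_date"],
        "properties": {
            "order_id": {"type": "string"},
            "sku": {"type": "string"},
            "quantity": {"type": "integer", "minimum": 1},
            "order_date": {"type": "string", "format": "date"},
        },
    },
    "quality": {  # quality expectations
        "max_null_rate": 0.01,      # at most 1% missing values per field
        "max_duplicate_rate": 0.0,  # no duplicate order_ids
    },
    "sla": {  # service level agreement
        "freshness_minutes": 15,
        "availability": "99.9%",
    },
    "error_handling": {  # what happens to bad records
        "on_violation": "route_to_quarantine_topic",
        "notify": "data-quality-alerts@acme.example",
    },
}
```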

To implement data contracts effectively, ACME could leverage several open-source tools:

  • Apache Avro: A data serialization system that provides rich data structures and a compact, fast, binary data format. It's particularly useful for defining schema evolution.
  • JSON Schema: A vocabulary that allows you to annotate and validate JSON documents. It's ideal for defining the structure of JSON-based data contracts.
  • OpenAPI (formerly Swagger): While primarily used for REST APIs, OpenAPI can be adapted to define data contracts, especially for data exchanged via APIs.
  • Protobuf (Protocol Buffers): Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data. It's excellent for defining strict contracts with high performance.
  • Great Expectations: A Python-based tool for validating, documenting, and profiling data. It can be used to implement and enforce data quality expectations within contracts.

The nice thing about these tools is that most of them are declarative, allowing version-controlled and GitOps-based workflows for maintenance.
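
As a small illustration of how a declarative contract turns into an executable check, the sketch below validates a single record against the schema section of the hypothetical contract above, using the open-source jsonschema package.

```python
# validate_record.py - check one record against the contract's schema section.
# Continues the hypothetical ORDERS_CONTRACT; requires `pip install jsonschema`.
from jsonschema import ValidationError, validate

from orders_contract import ORDERS_CONTRACT


def record_is_valid(record: dict) -> bool:
    """Return True if the record satisfies the contract's schema, False otherwise."""
    try:
        validate(instance=record, schema=ORDERS_CONTRACT["schema"])
        return True
    except ValidationError as err:
        print(f"Contract violation: {err.message}")
        return False


if __name__ == "__main__":
    good = {"order_id": "A-1001", "sku": "WIDGET-9", "quantity": 3, "order_date": "2024-05-01"}
    bad = {"order_id": "A-1002", "sku": "WIDGET-9", "quantity": 0}  # missing date, quantity < 1
    print(record_is_valid(good))  # True
    print(record_is_valid(bad))   # False
```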

Phase 2: Quality at source

With the data contracts in place, ACME implemented validation gates at every data entry point.

Rather than waiting until data reached the central warehouse for quality assessments, ACME embedded data validation at the source. Data producers were given responsibility for ensuring quality using lightweight validation scripts integrated into the data streams.

The data contracts established in the previous phase served as the basis for these validations: they explicitly state the expected format, structure, and content of the data, which allows validation scripts to check incoming records against predefined criteria.
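
A lightweight source-side gate might look something like the following sketch, which checks a small batch of records against the quality expectations in the hypothetical contract before the producer publishes them downstream.

```python
# source_quality_gate.py - a lightweight quality gate run by the data producer.
# Continues the hypothetical ORDERS_CONTRACT from the earlier sketches.
from collections import Counter

from orders_contract import ORDERS_CONTRACT


def passes_quality_gate(records: list[dict]) -> bool:
    """Check a batch against the contract's quality expectations before publishing."""
    if not records:
        return True

    quality = ORDERS_CONTRACT["quality"]
    required = ORDERS_CONTRACT["schema"]["required"]

    # Null-rate check: the share of missing values per required field.
    for field in required:
        null_count = sum(1 for r in records if r.get(field) is None)
        if null_count / len(records) > quality["max_null_rate"]:
            print(f"Rejected batch: too many missing values in '{field}'")
            return False

    # Duplicate check on the primary key.
    id_counts = Counter(r.get("order_id") for r in records)
    duplicates = sum(count - 1 for count in id_counts.values())
    if duplicates / len(records) > quality["max_duplicate_rate"]:
        print(f"Rejected batch: {duplicates} duplicate order_id values")
        return False

    return True
```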

Next, the team committed the validations to a central Git repository and integrated them with their CI/CD pipeline (a sketch of a CI-friendly test follows the list below). This approach offered several advantages:

  • Version Control: By checking validations into a Git repository, teams can track changes, collaborate more effectively, and easily roll back to previous versions if needed.
  • Automated Quality Assurance: Integrating validations with the CI/CD pipeline ensures that data quality checks are automatically performed as part of the development and deployment process, reducing the risk of errors slipping through.
  • Consistency and Standardization: Centralizing all validations and applying them automatically helps maintain consistent data quality standards across the organization, reducing the likelihood of disparate validation rules between departments.
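
In the CI/CD pipeline, the same checks can run automatically on every commit; a minimal pytest-style test might look like this (the module names and fixture data continue the hypothetical example):

```python
# tests/test_orders_contract.py - run contract checks in CI on every commit.
# Run with `pytest`; imports the hypothetical modules from the earlier sketches.
from source_quality_gate import passes_quality_gate
from validate_record import record_is_valid

SAMPLE_BATCH = [
    {"order_id": "A-1001", "sku": "WIDGET-9", "quantity": 3, "order_date": "2024-05-01"},
    {"order_id": "A-1002", "sku": "WIDGET-7", "quantity": 1, "order_date": "2024-05-01"},
]


def test_sample_records_satisfy_schema():
    assert all(record_is_valid(r) for r in SAMPLE_BATCH)


def test_sample_batch_passes_quality_gate():
    assert passes_quality_gate(SAMPLE_BATCH)
```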

Phase 3: Data Observability at every stage

Comprehensive data observability gives real-time insight into data health, usage patterns, and potential anomalies, along with clear custodianship during transformations, allowing for proactive problem-solving and continuous improvement of data pipelines. Data lineage provides a clear understanding of how data moves and transforms throughout the system, enabling quick identification of error sources and impact analysis. Defined custodianship ensures accountability at each stage of data processing, reducing the risk of data quality issues slipping through unnoticed.

Next, ACME integrated observability tools to track data lineage, quality, and schema changes. This allowed engineering teams to catch and address issues on the fly, with alerts triggered in real-time.
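
Whatever tool is chosen, the kinds of signals an observability layer emits can be sketched in plain Python: per-batch record counts, null rates, and a schema fingerprint that changes whenever fields are added or removed. All names and thresholds below are illustrative.

```python
# observability.py - emit simple per-batch health signals for a dataset.
# Illustrative only; a real deployment would ship these to a metrics/alerting backend.
import hashlib
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("data-observability")


def schema_fingerprint(record: dict) -> str:
    """Hash of the sorted field names - changes whenever the schema drifts."""
    return hashlib.sha256(",".join(sorted(record)).encode()).hexdigest()[:12]


def emit_batch_metrics(dataset: str, records: list[dict]) -> dict:
    """Compute and log basic health metrics for one batch of records."""
    total_values = sum(len(r) for r in records) or 1
    null_rate = sum(1 for r in records for v in r.values() if v is None) / total_values
    metrics = {
        "dataset": dataset,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "record_count": len(records),
        "null_rate": round(null_rate, 4),
        "schema_fingerprint": schema_fingerprint(records[0]) if records else None,
    }
    log.info(json.dumps(metrics))
    if null_rate > 0.01:  # assumed alert threshold
        log.warning("Null rate above threshold for %s", dataset)
    return metrics
```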

Phase 4: Self-service analytics, data access, and federated governance

Taking things further, the team established self-service data marts and implemented a data catalog, enabling analysts and business teams to easily locate and consume data without waiting on engineering. This was a monumental step, shifting data ownership to teams who could now directly control their data streams. Additionally, this promoted a federated data governance model within ACME departments, empowering each department with governance controls, and ensuring compliance and data stewardship without the usual bottlenecks.

Had ACME Corp invested in a self-service analytics and data access system, they might have averted the $12 million blunder we mentioned earlier. Let's explore how:

  • Real-time data visibility: With self-service analytics, the product management team could have easily accessed up-to-date information about product lifecycles and inventory levels, spotting the discrepancy between the AI system's order and the product's planned discontinuation.
  • Cross-functional data integration: Self-service platforms often integrate data from various departments. This would have allowed for a holistic view of product, inventory, and sales data, making it easier to identify inconsistencies between the AI system's decisions and actual business plans.
  • Empowered decision-making: By giving teams direct access to data, they could have independently verified the AI system's recommendations before approving large purchases, adding a human oversight layer to catch potential errors.
  • Customized alerts and dashboards: Teams could set up personalized alerts for unusual inventory movements or large orders, potentially flagging the issue before it became a multi-million dollar mistake (see the sketch below).
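
As a toy illustration of that last point, an alert rule like the one below might have flagged the inventory order: it checks a proposed purchase against the product's lifecycle status and a simple spend threshold. All fields, statuses, and numbers are invented for the example.

```python
# purchase_alerts.py - a toy alert rule for unusually large or ill-timed orders.
# Fields, statuses, and thresholds are invented for illustration.
def purchase_order_alerts(order: dict, product: dict, avg_monthly_spend: float) -> list[str]:
    """Return human-readable alerts for a proposed purchase order."""
    alerts = []
    if product.get("lifecycle_status") == "discontinuing":
        alerts.append(
            f"Product {product['sku']} is being discontinued on {product.get('end_of_life_date')}"
        )
    if order["total_cost"] > 10 * avg_monthly_spend:
        alerts.append(
            f"Order of ${order['total_cost']:,.0f} is more than 10x the average monthly spend"
        )
    return alerts


if __name__ == "__main__":
    order = {"sku": "WIDGET-9", "total_cost": 12_000_000}
    product = {"sku": "WIDGET-9", "lifecycle_status": "discontinuing", "end_of_life_date": "2024-06-30"}
    for alert in purchase_order_alerts(order, product, avg_monthly_spend=400_000):
        print("ALERT:", alert)
```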

Leveraging edge processing with streaming ETL pipelines

Sarah’s team used Apache Kafka and Apache Flink in the shift left architecture for real-time data collection and processing. Kafka decoupled data producers from consumers, ensuring that data was captured and made available in real time without overwhelming downstream systems.

Apache Flink, a stream processing framework, complemented Kafka by providing powerful, low-latency data processing capabilities. Flink's ability to handle both batch and stream processing within a single engine aligned perfectly with the shift left philosophy. It enabled ACME to perform complex event processing, data transformations, and analytics directly on the streaming data as it flows through the system, rather than waiting for data to be stored in a data lake or warehouse before processing.

The combination of Kafka and Flink enabled ACME to implement data quality checks, apply data contracts, and perform real-time analytics at the edge of their data architecture. This drastically reduced the time to insight, allowed for immediate detection and correction of data issues, and provided a flexible foundation for building and deploying data products.
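
ACME's production pipelines would be Flink jobs, but the core pattern of validating in-flight records and quarantining bad ones can be sketched with a plain Kafka consumer using the kafka-python client. Topic names, the broker address, and the reuse of record_is_valid from the earlier sketch are all assumptions.

```python
# stream_quality_gate.py - validate records in flight and quarantine bad ones.
# Simplified stand-in for a Flink job; requires `pip install kafka-python`.
# Topic names and broker address are assumptions.
import json

from kafka import KafkaConsumer, KafkaProducer

from validate_record import record_is_valid  # hypothetical module from the earlier sketch

consumer = KafkaConsumer(
    "orders.raw",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Route each record as it arrives: clean records go to the validated topic,
# contract violations go to a quarantine topic for later inspection.
for message in consumer:
    record = message.value
    if record_is_valid(record):
        producer.send("orders.validated", record)
    else:
        producer.send("orders.quarantine", record)
```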

Augmented with streaming data, the Shift Left architecture makes high-quality data instantly accessible to the analytics layer. Stream processors like Flink make it possible to implement quality checks and data observability, and to take corrective measures, directly on data in motion.

A New Era at ACME Corp

Like a phoenix rising from the ashes of outdated data practices, ACME Corp emerged transformed. After navigating through a labyrinth of iterations and overcoming numerous cultural challenges, the company finally embraced the shift left architecture in its entirety. The results of this transformation were nothing short of revolutionary:

  • Real-Time Decision-Making: Data-driven decisions were no longer dependent on overnight batch jobs. Marketing could adjust campaigns mid-day based on real-time customer responses, and inventory adjustments could be made immediately to prevent shortages or excess stock.
  • Reduced Engineering Bottlenecks: With validation and processing pushed to the edge, data engineering’s workload shifted from endless bug-fixing to high-value tasks like optimizing pipelines and exploring new technologies. Teams across ACME became more self-sufficient, which in turn empowered them to innovate faster.
  • Cost Savings: By eliminating massive nightly ETL runs, ACME saw a significant reduction in processing costs. Real-time stream processing required far less computational power than the legacy setup.
  • Improved Data Quality: Quality checks at the source meant data reached analysts in a much cleaner state. ACME’s engineers saw a drop in rework, and the trust in data accuracy skyrocketed across departments.

Today, ACME Corp operates with data agility that their competitors envy. By transforming data operations, ACME discovered a powerful truth: sometimes, the key to scaling data-driven success is not to wait for insights, but to bring the insights to the teams who need them most. And with that, they’re ready for whatever the future brings.

How to convince your CXO to embrace Shift Left?

Well, I believe you got the idea and the value proposition behind Shift Left from ACME's inspiring story. While it’s just fiction, real organizations will have to work a little harder to implement Shift Left architectures. Here’s my reasoning:

  • Skepticism and Buy-In: Not all departments will be immediately sold on the shift left concept, especially those accustomed to traditional data workflows. It is better to run pilot projects that demonstrate quick wins and build buy-in from reluctant stakeholders.
  • Skills Gap: Real-time data processing requires skills in streaming technologies, and many team members have only worked with batch-oriented tools. Organizations will have to invest heavily in upskilling their workforce, providing hands-on workshops, and bringing in consultants to bridge the knowledge gap.
  • Tooling and Integration Overheads: Adopting a new architecture requires integrating a suite of new tools, which initially adds complexity to the organizational tech stack. Through trial, error, and refinement, teams can eventually build a cohesive ecosystem that balances simplicity with functionality.
  • Data Governance Complexity: With federated governance, maintaining consistent policies across departments can be challenging.

As Sarah Chen often says:

"In the world of data, quality is not a destination - it's a journey. And that journey begins as far left as you can possibly go.”