The Foundations of Data: Enabling Analytics and AI
Noam Chiger
Analyst with Engineering & Science Expertise | Transforming Complex Data into Business Value
In my previous article, I discussed why organizations must adopt a process-based approach to managing data and its many components. Every modern organization depends on data, but data-driven insights are impossible without a solid storage foundation. Before we can analyze, model, or act on data, we need a reliable way to store it.
In a basic storage setup, an organization typically maintains a single shared storage repository, which all users access in the same way. This setup generally includes:
Common Challenges:
To overcome these limitations, organizations typically start by separating data used for ongoing product operations from data used for analytics. A common first step is creating a read-only replica of the production database, allowing analysts to connect without impacting live systems.
However, this approach introduces a new challenge: the replica is often not analyst-friendly. For example, while the data may be structured for optimal product performance, it might lack normalization or intuitive organization, making it difficult for analysts to locate and use the data effectively for reporting.
As businesses scale, read-only replicas of production databases become limiting. They often require manual extracts, don't support unstructured data, and may not efficiently handle growing analytical workloads. This is where data lakes provide an advantage.
The data analytics infrastructure will be centered around a Data Lake (I’ll explain what that means shortly). Data flows into the lake through streams—sources that continuously insert data (more on this later). Data may undergo enrichment and transformation within the lake before being structured for downstream use.
A Data Warehouse will be the primary resource for daily queries and ad hoc analysis. Additionally, cached datasets from the warehouse will be used in visualization tools like Power BI to optimize performance and enhance reporting.
So, what exactly is a Data Lake, and how does it differ from a Database? Technically, a database could function as a data lake, though it would be an expensive and inefficient choice. At its core, a Data Lake is simply a storage system. However, what truly sets it apart from traditional databases is not just how data is stored, but the intent and structure behind it.
To fully grasp what a Data Lake is, it’s helpful to explore how data storage evolved—from structured SQL databases to modern, cloud-based Data Lakes.
Storage has two fundamental requirements: storing data and retrieving it efficiently. Traditional data warehouses, such as SQL Server, handled this by dividing resources into three main components:
In traditional architectures, compute and storage were tightly integrated on the same system. Compute processed queries and storage housed the data, with occasional overlaps such as caching or temporary disk writes. This division of roles generally held, especially for simpler workloads.
Over time, the industry recognized that compute is significantly more expensive than storage—and it isn't always needed at the same scale. Take Netflix as an example: They must store thousands of movies at all times, but in the early morning hours, when fewer people are watching, they don’t need as much compute power to serve content. However, storage requirements remain constant.
To optimize costs and scalability, new technologies emerged that decoupled storage from compute, allowing organizations to scale storage independently while provisioning compute power only when necessary. This shift paved the way for cloud-based architectures, enabling businesses to manage vast amounts of data more cost-effectively than ever before.
Unlike a database, which only accepts and returns structured data (i.e., fixed columns and rows), a data lake supports unstructured data and, in particular, object storage, much like a file system. However, here’s the catch: while databases enforce structure and require you to maintain it over time, data lakes do not. Without discipline in organizing your data, a lake can quickly turn into a swamp—a chaotic, hard-to-navigate mess where data is difficult to find, trace, and use effectively for analysis, reporting, or AI.
This naturally leads to the next question: How should you structure a data lake to prevent it from becoming a swamp? To answer this, let’s first explore how data flows into the lake.
A data lake serves as the single source of truth for an organization’s relevant data. Every piece of data comes from a source, whether internal or external. Continuing with the lake metaphor, you can think of data as streams flowing into the lake, each originating from different sources.
Data ingestion comes in two main forms: real-time streams and batch processing, which typically runs hourly or daily, depending on the required cadence. Regardless of the underlying data engineering complexity, both approaches result in a partition of the data lake that stores incoming raw data. This partition is usually loosely structured, often using formats like JSON or plain text files.
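As a rough illustration of what "a partition that stores incoming raw data" can look like, here is a minimal sketch that lands a batch of records, unmodified, in a date-partitioned folder. The folder layout (`bronze/<source>/ingest_date=...`), the file name, and the function name `land_raw_batch` are assumptions made for this example, not a standard; a real lake would typically sit on cloud object storage rather than a local directory.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def land_raw_batch(lake_root: Path, source: str, day: date, records: list[dict]) -> Path:
    """Append raw records as JSON lines to a date-partitioned bronze folder.

    Nothing is cleaned or reshaped here: bronze keeps the data as it arrived.
    """
    # Hypothetical layout: <lake>/bronze/<source>/ingest_date=YYYY-MM-DD/
    partition = lake_root / "bronze" / source / f"ingest_date={day.isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    out_file = partition / "part-0000.jsonl"
    with out_file.open("a", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
    return out_file


# A temporary directory stands in for the lake in this sketch.
lake = Path(tempfile.mkdtemp())
path = land_raw_batch(
    lake,
    "web_events",
    date(2024, 1, 15),
    [{"user": {"id": 7}, "event": "login"}],
)
landed_lines = path.read_text(encoding="utf-8").splitlines()
```

A streaming source would call the same kind of function many times per hour with small batches, while a batch job would call it once per run; either way the result is the same loosely structured partition.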
While various terminologies exist, Databricks provides one of the most intuitive frameworks: this raw partition is called the bronze layer. The naming follows a logical progression—as data moves through layers, it becomes increasingly refined and valuable, much like precious metals.
The bronze layer is the least processed but also the most complex in terms of usability. Unlike structured databases, raw JSON files often contain nested structures, requiring additional processing before analysis. This means that working with the bronze layer typically requires coding skills, such as the ability to parse JSON files, rather than relying solely on SQL-based querying, which most analysts are more familiar with.
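To make the point about nested structures concrete, the sketch below flattens one nested JSON record into a single-level row that could be loaded into a table. The record contents and the helper name `flatten` are invented for illustration, and the sketch only handles nested objects, not arrays.

```python
import json


def flatten(record: dict, prefix: str = "") -> dict:
    """Turn nested objects into flat column names joined by underscores."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into the nested object, carrying the parent name along.
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


# One raw line as it might appear in a bronze file (made-up example data).
raw_line = '{"user": {"id": 7, "geo": {"country": "IL"}}, "event": "login"}'
row = flatten(json.loads(raw_line))
# row has the flat keys user_id, user_geo_country, and event
```

This is the kind of step an analyst cannot express comfortably in plain SQL against raw files, which is why bronze data usually needs some code before it becomes queryable.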
The silver layer is the second stage in the data lake. While analysts don’t typically access this layer for daily work, they may use it for exploratory research and new ideas. The key difference between the silver and bronze layers is twofold: (1) the data is now better structured, and (2) it is partially curated for usability. While the silver layer is still stored in files rather than a traditional database, it is organized into structured folders or tables with a consistent schema, making it easier to navigate.
One of the most important aspects of both the bronze and silver layers is replayability: at any point, you should be able to reproduce the transformation process that creates the silver layer from the bronze layer. This is crucial because the bronze layer is vast, and the silver layer only contains a subset of that data. As new use cases arise, you may need to incorporate additional data from the bronze layer into the silver layer. By maintaining replayability, you can modify the transformation pipeline over time without disrupting downstream processes, including the gold layer, which also depends on this flexibility.
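One way to think about replayability is as a pure function from bronze to silver: the same raw input and the same configuration always produce the same output, and bronze itself is never modified. The sketch below is a simplified illustration under that assumption; the function name `build_silver`, the column names, and the sample records are hypothetical.

```python
def build_silver(bronze_records: list[dict], columns: list[str]) -> list[dict]:
    """Rebuild silver rows from bronze, keeping only the requested columns.

    Assumes every column list includes customer_id and ts, which are used
    to give the output a stable order.
    """
    rows = [{col: rec.get(col) for col in columns} for rec in bronze_records]
    return sorted(rows, key=lambda r: (r["customer_id"], r["ts"]))


# Made-up bronze records; bronze holds more fields than silver first needs.
bronze = [
    {"customer_id": 2, "ts": "2024-01-02T10:00", "event": "purchase", "device": "ios"},
    {"customer_id": 1, "ts": "2024-01-01T09:00", "event": "login", "device": "web"},
]

# First version of the pipeline, run twice to show it is repeatable.
silver_v1 = build_silver(bronze, ["customer_id", "ts", "event"])
silver_v1_again = build_silver(bronze, ["customer_id", "ts", "event"])

# A new use case needs the device field: change the config and replay.
silver_v2 = build_silver(bronze, ["customer_id", "ts", "event", "device"])
```

Because the raw records were kept in full, adding `device` later is a configuration change followed by a replay, not a new ingestion project.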
The Gold Layer contains the most frequently accessed and business-ready version of the data. At this stage, additional normalizations and enrichments take place to ensure usability. Some data from the Silver Layer may appear in multiple tables within the Gold Layer.
For example, in the Silver Layer, you might store every single customer activity. However, in the Gold Layer, this data could be structured differently—one table might contain the full historical activity, while another table stores only the most recent activity for each customer. This second table could be overwritten every few hours or updated on a set schedule, depending on business needs. The gold layer is often also loaded into a warehouse for easy access for analysts/reports.
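The customer-activity example can be sketched as two small derivations from the same silver rows: one table with the full history and one with only the latest activity per customer. The field names and sample rows are invented for illustration, and the comparison relies on the assumption that timestamps are ISO-formatted strings in a single consistent format, so that string order matches time order.

```python
def latest_activity_per_customer(silver_rows: list[dict]) -> list[dict]:
    """Keep only the most recent row for each customer."""
    latest: dict = {}
    for row in silver_rows:
        current = latest.get(row["customer_id"])
        # ISO timestamps in one format compare correctly as plain strings.
        if current is None or row["ts"] > current["ts"]:
            latest[row["customer_id"]] = row
    return [latest[cid] for cid in sorted(latest)]


# Made-up silver rows: every single customer activity.
silver = [
    {"customer_id": 1, "ts": "2024-01-01T09:00", "event": "login"},
    {"customer_id": 1, "ts": "2024-01-03T12:30", "event": "purchase"},
    {"customer_id": 2, "ts": "2024-01-02T10:00", "event": "login"},
]

# Gold table 1: full history, ordered for easy reading.
gold_history = sorted(silver, key=lambda r: (r["customer_id"], r["ts"]))

# Gold table 2: one row per customer, rebuilt on whatever schedule fits.
gold_latest = latest_activity_per_customer(silver)
```

The second table is cheap to overwrite in full on each run, which is why a scheduled rebuild is often simpler than updating it row by row.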
With a solid understanding of the lake structure, let's take a closer look at the Bronze Layer and its key characteristics:
The Gold Layer is the primary source of data for the business, but a caching layer is often used to improve performance. This caching is typically managed by BI tools such as Power BI’s Data Flows, which store pre-aggregated data for faster queries.
Since BI tools often have limitations on complex schemas and table relationships, data from the Gold Layer is frequently duplicated across multiple datasets. For example, a business may create separate datasets for different use cases—such as client data, product data, or other business-relevant information.
To enhance performance, datasets may be optimized into smaller, focused subsets. This ensures that dashboards remain responsive and queries return results quickly, preventing long load times for end users.
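The idea of a small, focused, pre-aggregated dataset can be illustrated with a few lines of code: instead of having a dashboard scan every gold row, a much smaller summary is computed once and cached. This is only a sketch of the concept in plain Python with made-up rows and a hypothetical function name; it does not represent how Power BI or any other BI tool implements caching internally.

```python
from collections import Counter


def daily_event_counts(gold_rows: list[dict]) -> list[dict]:
    """Pre-aggregate gold rows into one row per (day, event) pair."""
    # The first 10 characters of an ISO timestamp are the date (YYYY-MM-DD).
    counts = Counter((row["ts"][:10], row["event"]) for row in gold_rows)
    return [
        {"day": day, "event": event, "count": n}
        for (day, event), n in sorted(counts.items())
    ]


# Made-up gold rows for illustration.
gold_rows = [
    {"customer_id": 1, "ts": "2024-01-01T09:00", "event": "login"},
    {"customer_id": 2, "ts": "2024-01-01T11:00", "event": "login"},
    {"customer_id": 1, "ts": "2024-01-02T10:00", "event": "purchase"},
]

dashboard_cache = daily_event_counts(gold_rows)
```

The cached summary grows with the number of days and event types rather than with the number of individual activities, which is what keeps dashboards responsive as the underlying data grows.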
Modern organizations need a storage strategy that balances scalability, flexibility, and performance. By leveraging a structured data lake, businesses can not only store massive amounts of data cost-effectively but also ensure that data is transformed into actionable insights. Whether it’s optimizing operational efficiency, enabling real-time analytics, or supporting AI-driven initiatives, a well-structured data lake is the foundation for data-driven success.
Structuring data goes beyond technical organization; it aligns with crucial principles of prioritization, workflow, and accountability. In the context of Institutional Economics, making informed decisions about what to focus on and understanding why those decisions matter are essential for ensuring that business needs are met and expectations are properly managed.