Modern Data Stack: Will it Stick or Stink? - Part 2


In the previous blog, we discussed how the modern data stack is gaining popularity and has been prominent at leading data events, such as the Snowflake and Databricks summits. We investigated why this is happening and found the answer in some of the pitfalls of the Traditional Data Stack. We then discussed the three core (micro) challenges of the Traditional Data Stack by drawing parallels with a boat example:

  • The long and slow setup of the required infrastructure and the additional time needed for troubleshooting (the trailer “infrastructure” was not ready).
  • Slow responses to new information because of an inability to scale up, sluggish performance, and long time-to-value (it took time to transport the air compressor from the house to the storage unit).
  • The expensive journey required to produce insights (it took multiple trips to discover that the tire needed to be replaced); a manual process adds complexity.

In this blog, we will discuss why enterprises need to be business-focused and understand the associated macro-challenges. Both will help us decide whether the #moderndatastack is here to stick or to stink. (NOTE: we will discuss #datafabric and #datamesh in another blog.)




Before we get started, let's change our vernacular from technology-focused to business-focused. We need to think about business needs. We discussed how businesses need to be competitive. They need to be data-driven to make decisions quickly, understand their customers, and improve their processes (among many other tasks). According to the Forrester Analytics Business Technographics Data and Analytics Survey, 2020, advanced insight-driven businesses were 1.4 times more likely to report increased business innovation compared to beginner-level businesses.

So, the question is this: were enterprises previously not business-focused? The answer is that they were, and they always have been; however, past technology limitations began to dictate business behaviors.

In the 90s, I worked in the data center at Bank of America. We used to have a surplus of computers, storage, etc., and we would provision them over the weekend. I recall having weekly shutdowns to perform maintenance and updates. In today's always-online world, weekly shutdowns would be unthinkable. Imagine Salesforce, the company that pioneered Software-as-a-Service, having a minute of downtime, let alone an hour or a full day. Our sales, marketing, and other teams would go crazy!

That was in the 90s and 2000s. We are now in the 2020s. We don’t have these limitations anymore and need to rethink our (legacy) data stack again, this time with a business focus. That said, let’s talk about the macro-challenges to be considered in this task (NOTE: these are the reasons why enterprises are looking beyond the traditional data stack):

  • The explosion of data
  • The need for a data-driven culture
  • Governance and compliance requirements

Explosion of data:

The amount of data that we produce every day is truly mind-boggling. At our current pace, 2.5 quintillion bytes of data are created each day, and over the last two years alone we have created 90 percent of all the data ever generated in the world.

This is because data is now being collected from more sources and in more forms. We saw growth in the early 2000s with data collection from images and videos. In the 2010s, the pace accelerated with the rise of the Internet of Things (IoT), and now we also have sensor and telemetry data, APIs, and real-time application data. Infrastructure must keep up with this pent-up demand; hence, we have seen technology advances that provide increased storage and compute capacity. Let's take a look at the example of Netflix.

Netflix example:

A couple of years ago, Michelle Winters, Director of Data and Analytics at Netflix, shared how the Netflix Data Engineering & Analytics team manages growth. On any given day, Netflix serves 100 million members who watch 125 million+ hours of content.

Netflix currently manages data in 130 countries and across 4,000 different device types. They write 700 billion events to their streaming ingestion pipeline every single day, and at peak they have written well over a trillion events in a single day. This data is processed and loaded into a data warehouse built entirely on open-source Big Data technologies. That warehouse currently sits at around 60 petabytes and grows at a rate of 300 terabytes per day, and the data is actively used across the company. On average, they perform about 5 petabytes of reads per hour!

The Legacy Data Stack was not designed for this volume!

Legacy Data Stack tools were never designed for volumes like those mentioned above. Let's examine a traditional on-prem database that often gets pressed into OLAP duty: Postgres. Postgres was never designed to handle such huge analytical datasets. Why? Well, you guessed it: technical limitations. The PostgreSQL server is process-based (not threaded); each database session connects to a single PostgreSQL operating system (OS) process. How should I explain this? Just think of a highway with a single lane vs. one with multiple lanes. Which one will be faster? (NOTE: Sorry for the slight contradiction here. Obviously, some legacy tools have caught up. However, the MDS provides an opportunity for even these tools to refresh their stacks, too.)
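
To make the single-lane picture concrete, here is a minimal sketch, assuming a local PostgreSQL instance and the psycopg2 driver, that shows each session being served by its own OS process:

```python
# Minimal sketch (assumes a local PostgreSQL instance and psycopg2):
# every session is served by its own backend OS process.
import psycopg2

conn_a = psycopg2.connect(dbname="postgres", user="postgres")
conn_b = psycopg2.connect(dbname="postgres", user="postgres")

with conn_a.cursor() as cur:
    cur.execute("SELECT pg_backend_pid();")  # OS process serving session A
    pid_a = cur.fetchone()[0]

with conn_b.cursor() as cur:
    cur.execute("SELECT pg_backend_pid();")  # a different process for session B
    pid_b = cur.fetchone()[0]

print(pid_a, pid_b)  # two distinct PIDs: one backend process per connection
conn_a.close()
conn_b.close()
```

Spawning an OS process per connection is far heavier than spawning a thread, which is one reason fanning out to thousands of concurrent analytical sessions was never Postgres's sweet spot.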

You need tools that easily ingest data from various sources (real-time, batch, API, on-prem, cloud, etc.), a single location into which all of that data is loaded, and the ability to transform the data at speed. You also need architectural changes to respond to analytical queries and to process multiple joins over these mammoth datasets, all at speed. (NOTE: Some of the tools in the modern data stack have existed for over a decade.)



Intro to key Modern Data Stack tools

As you can imagine, a company like Netflix uses multiple modern technologies to address the data explosion problem. So, let’s talk about some of the processes in the Modern Data Stack. They can be divided into the following categories:

- Data Ingestion/Integration

- Data Transformation (NOTE: this is a subset of Data Integration)

- Cloud Data Warehouse

- Data Intelligence* (Data Catalog – Alation)

- Analytics* (Business Intelligence – Tableau)

*We will cover these in the next blog.

What is #dataingestion?

First, let's talk about the business problem. Businesses need to be competitive and need fast access to information. They want tools that connect to a vast range of data sources, quickly. This is where data ingestion comes into play.

Data ingestion is the process of moving data from various sources (databases, server logs, third-party apps, etc.) into a data lake or data warehouse. A data warehouse lets you store data from various sources and also manages that stored data. Traditional tools were hard to connect and required special connectors; the process was time-consuming, error-prone, and required coding. Plus, most tools couldn't handle applications like Salesforce, Zendesk, Adobe Marketo, Outreach, and more. To be competitive and provide sales with intelligence, businesses should make sure they ingest data from tools like Salesforce (or Zendesk for support, Marketo for marketing, Outreach for SDRs, etc.).

Ingesting or moving data from sources to your warehouse is easier today, thanks to advancements in technologies such as Kafka. Modern technologies such as Nexla, Fivetran, and Airbyte also provide solutions to this problem.
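
As a minimal illustration of stream-based ingestion, here is a sketch using the kafka-python client; the broker address, topic name, and event shape are assumptions made up for the example:

```python
# Ingestion sketch with kafka-python: append application events to a stream.
# Broker address, topic name, and event fields are illustrative assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# An application event as it might arrive from a source system.
event = {"user_id": 42, "action": "play", "title": "some-show"}

producer.send("app-events", value=event)  # append to the ingestion stream
producer.flush()  # block until the broker acknowledges the event
```

Downstream consumers (a warehouse loader, a real-time dashboard) can then read the same stream independently, which is what makes this pattern scale to event volumes like Netflix's.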

Let's take the example of one of these tools, Nexla, which can rapidly streamline the data pipeline. Nexla does so by auto-generating connectors that bring in data from any application and in the format of your choice. All of this can be accomplished within 24 hours; traditionally, this process could have taken days to weeks.

Wait, let's talk about data integration, too.

Before cloud solutions were developed, data processing and management were designed with the limitations of on-prem systems in mind. The legacy data stack relied on the extract, transform, and load (#etl) process. In other words, it required extracting data from sources such as business applications (e.g., HubSpot or Salesforce) and databases (mainframes, Oracle, #mysql, etc.), transforming the data to prepare it for storage, and then loading it into a target data warehouse. A Data Integration Service, such as Fivetran, Segment, or Airbyte, enables you to move data from your #saas tools into a cloud data warehouse.

However, ETL tends to be a time-consuming process with low data usability, and it's expensive.



  • ETL provides a way to get data from transactional databases into data warehouses, so that data-hungry analysts can query the warehouses while the operational transactional databases serve real customers.
  • Developers need to codify these solutions individually. This results in long backlogs, with the data team always playing catch-up.
  • ETL tools are traditionally not distributed and not designed to take advantage of cloud-native architectures.
  • Analysts are comfortable with SQL. For analysts to ask questions of data, they need the data to be in the warehouse.

A modern data stack reverses the steps of the above process into extract, load, and transform (#elt). This approach lets businesses load data into the warehouse without transforming it first. You can bring data from business applications (e.g., HubSpot or Salesforce) and databases (e.g., Oracle or MySQL) straight into target data warehouses, giving your entire data team (scientists, analysts, and engineers) easy access to the data and the ability to self-serve. Examples of ELT data tools are Matillion, Fivetran (EL), Nexla (ETL/ELT), and dbt Labs (T).
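
To make the ELT pattern concrete, here is a minimal sketch using sqlite3 as a stand-in for a cloud warehouse (table names, fields, and cleanup rules are illustrative, and the JSON functions assume a SQLite build with JSON1, which ships with modern Python). Raw records land untouched; the transform runs later as plain SQL inside the "warehouse":

```python
# ELT sketch: land raw source payloads as-is ("EL"), transform later in SQL ("T").
# sqlite3 stands in for a cloud warehouse; all names are illustrative.
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# "EL": load raw payloads untouched into a landing table.
conn.execute("CREATE TABLE raw_orders (payload TEXT)")
source_records = [
    {"id": 1, "amount": "19.99", "status": "SHIPPED"},
    {"id": 2, "amount": "5.00", "status": "returned"},
]
conn.executemany(
    "INSERT INTO raw_orders VALUES (?)",
    [(json.dumps(r),) for r in source_records],
)

# "T": the transform runs inside the warehouse, on demand, as plain SQL.
conn.execute("""
    CREATE TABLE orders AS
    SELECT json_extract(payload, '$.id') AS order_id,
           CAST(json_extract(payload, '$.amount') AS REAL) AS amount,
           LOWER(json_extract(payload, '$.status')) AS status
    FROM raw_orders
""")
print(conn.execute("SELECT * FROM orders").fetchall())
# [(1, 19.99, 'shipped'), (2, 5.0, 'returned')]
```

Because the raw payloads stay in the warehouse, the transform can be rerun or changed later without re-extracting anything from the source, which is the core advantage over ETL.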



(NOTE: At a high level, #ReverseETL looks very similar to ELT because both move data from point “A” to point “B” automatically. However, ELT reads data from SaaS tools and writes it to data warehouses, whereas Reverse ETL reads data from data warehouses and writes it to SaaS tools. Examples of Reverse ETL tools are Nexla, Census, and Hightouch.)
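
As a rough sketch of that reverse direction, assuming a hypothetical CRM endpoint, token, and warehouse table (not any real product's API), reverse ETL reads modeled records out of the warehouse and pushes them back into an operational SaaS tool:

```python
# Reverse-ETL sketch: read modeled data from the warehouse, push it to a SaaS
# tool. Endpoint, token, table, and columns are hypothetical placeholders.
import sqlite3
import requests

warehouse = sqlite3.connect("warehouse.db")  # stand-in for a cloud warehouse
rows = warehouse.execute(
    "SELECT email, lifetime_value FROM customer_scores"
).fetchall()

for email, ltv in rows:
    # Each warehouse record becomes a contact update in the operational tool.
    requests.post(
        "https://api.example-crm.com/v1/contacts",  # hypothetical endpoint
        headers={"Authorization": "Bearer YOUR_TOKEN"},
        json={"email": email, "lifetime_value": ltv},
        timeout=10,
    )
```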

Let's talk about #DataTransformation

As data comes from various sources, business users want to be sure they can trust and use it easily. They want a format they can use across the organization, and when they want to do analysis, they don't want to waste time prepping the data. This is where data transformation becomes useful.

This is the part of the process in which raw data is converted into user-friendly models by changing the data’s format and structure. Data transformation prepares accumulated data for analytics and makes it more readable. One of the most popular data transformation tools is dbt.
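
As a small illustration of what "changing format and structure" means in practice, here is a sketch with pandas; the column names and cleanup rules are made up for the example, turning inconsistently formatted raw records into an analysis-ready table:

```python
# Transformation sketch with pandas: raw, inconsistently formatted records
# become a typed, deduplicated, analysis-ready table. Names are illustrative.
import pandas as pd

raw = pd.DataFrame({
    "Signup Date": ["2023-01-05", "2023-01-06", "2023-01-06"],
    "PLAN": ["Pro", "free", "FREE"],
    "mrr_usd": ["29.00", "0", "0"],
})

users = (
    raw.rename(columns={"Signup Date": "signup_date", "PLAN": "plan"})
       .assign(
           signup_date=lambda d: pd.to_datetime(d["signup_date"]),  # real dates
           plan=lambda d: d["plan"].str.lower(),        # normalize categories
           mrr_usd=lambda d: d["mrr_usd"].astype(float),  # strings -> numbers
       )
       .drop_duplicates()
)
print(users.dtypes)
```

Tools like dbt express the same idea as SQL models inside the warehouse rather than Python outside it, but the goal is identical: trusted, readable data for analysis.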

What is a Cloud Data Warehouse?

The core of the modern data stack is a #clouddatawarehouse, which serves as the central location in which to collect all of an organization's data. It makes life simpler for business users (including technical users). It helps handle heavier data workloads, optimize pipelines, and shorten query times. Examples of Cloud Data Warehouses are Snowflake, #Redshift, #BigQuery, and #DatabricksDeltaLake. We will cover more about Cloud Data Warehouses in the next blog. For now, let's just say that legacy on-prem data warehouses didn't have the benefits of cloud warehouses, such as pay-as-you-go scalability, low acquisition costs, rapid time-to-benefit, and more.
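
For a feel of how thin the client side becomes, here is a minimal sketch using the Snowflake Python connector; the credentials, warehouse, and table names are placeholders:

```python
# Minimal sketch of querying a cloud data warehouse with the Snowflake Python
# connector. Credentials and object names are placeholders, not real values.
import snowflake.connector

conn = snowflake.connector.connect(
    user="YOUR_USER",
    password="YOUR_PASSWORD",
    account="YOUR_ACCOUNT",    # account identifier, e.g. "xy12345.us-east-1"
    warehouse="ANALYTICS_WH",  # compute is provisioned and billed on demand
)
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM analytics.public.events")
print(cur.fetchone()[0])
cur.close()
conn.close()
```

Note what is absent: no servers to rack, no storage to pre-provision, no weekend maintenance window. Compute spins up when the query runs, which is exactly the pay-as-you-go model the legacy stack lacked.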

That was quite a bit of information. Guess what? We haven’t even defined the Modern Data Stack. I will leave that topic for the last blog in this series. Right now, I am trying to explain the “Why” and “How” instead of the “What”.

Now that we have introduced the associated business needs, introduced three macro-challenges, and discussed advances in some of the underlying technologies used to manage these data volumes, I think there are good reasons to believe the Modern Data Stack will "Stick". We will unravel this further in the next blog, where we will discuss how the Modern Data Stack addresses the need for a data-driven culture and for governance & compliance.


Thank you Stephanie Yuen, Avinash Shahdadpuri, and Saket Saurabh for providing input for this blog.
