Data Stack as a Service (2/3)

This series of articles is a pedagogical experiment to help readers understand the data world as it is, and how it is evolving. It’s intended for non-data experts.

In the first article, we’ve introduced the concept of a Data Stack and its evolution from a Traditional Data Stack (TDS) to a Modern Data Stack (MDS, a controversial term that means “the data stack in 2022”).

Introduction

In this second article, we’ll dig deeper into the Data Management and Delivery pillars of a data platform in 2022.

As usual, I will take a short step back for perspective.

We’ll focus on:

Data evolution from 2000 to 2020 as a qualitative comparison
Companies' data expertise-maturity classification

These perspectives will help us understand structural challenges, our company's position, and how other comparable companies face them.

Then we’ll explain how to upgrade our data-driven company to the next level by implementing Data Management and Data Delivery plans.

Perspectives

Data evolution from 2000 to 2020

Let’s make a quick qualitative comparison of data evolution:

(don’t hesitate to suggest other attributes for this comparison in the comments below).

Expertise-Maturity classification

Data-driven(-ish) organizations don’t all start from the same context. We can classify them along two dimensions: data expertise (people) and data maturity (data culture, organization’s age, achievements).

LL case: In the Low data expertise and Low data maturity case, there is neither a data stack nor a data team. Individuals use Excel or Google Sheets without meaningful collaboration or an actual governance policy on how to create, update, share, and delete data. Tactical and strategic alignment on clear metric definitions, and on how to make them actionable, is complicated. The lack of automation and observability makes processes error-prone and extremely slow.

Typical stack:

Spreadsheets using data from internal and external sources (not always online)
ERP/CRM CSV exports

LH case: In the Low data expertise and High data maturity case, there is a Traditional Data Stack with data analyst(s) and data engineer(s). Some processes and tools quickly become obsolete, and it’s a real pain when organization leaders require new metrics or change the way existing ones are calculated. Such requests take a lot of time to deliver, produce unreliable results, or both, because data preparation (deduplicate, clean, curate, aggregate) and metrics computation are not centralized. For example, in this kind of environment we usually see two to five ways to calculate the organization’s profit, none of them based on the same source of truth, and in the end nobody knows which is the correct one.

Typical stack:

On-premises Hadoop (ecosystem) and SQL warehouses, logically coupled and complex
BI: Excel

"Imagine a TDS like a Christmas light decoration in a box. The lights are needed at the right time for the Christmas party, but you discover that some of the bulbs need to be changed. You have to untangle the whole thing to find the broken bulb. After a while, you finally replace the bulb, but by the time you are done, the party is over."

Credit: https://neptune.ai/blog/modern-data-stack

HL case: In the High data expertise and Low data maturity case, organizations mostly use new-generation BI tools without a complete data platform. They have business and data analysts but no data engineers. They master the analytics part (metrics/KPIs, some processes). Still, most of the time, data flow automation/orchestration and data governance as a whole are missing, even if (unfortunately) BI tools increasingly try to provide their own governance features. We say “unfortunately” because, as we’ll see later, many SaaS providers build their own dashboards or governance features, which makes integration redundant. Reverse ETL tools can help serve SaaS apps without a direct technical integration.

Typical stack:

Using legacy DBMS and other resources
Tableau, Domo, Power BI, or Looker as BI tools

HH case: In the High data expertise and High data maturity case, they already have it all: the modern data platform and experienced data teams. They “only” have to manage the beast, like any other product of the organization.

Typical stack:

Fivetran or Airbyte as ELT/ETL
Snowflake as a data warehouse (or Redshift, BigQuery, …)
dbt as a transformation tool (also centralized metrics coming soon)
Tableau, Domo, Power BI, or Looker as BI tools
Castor or Amundsen to manage data discovery and regulations compliance

Upgrade to the next level

What are the good questions?

First things first: what questions should we ask before creating or upgrading anything in a company?

Management

Is this necessary?
What do our customers want? No, what do they REALLY want?
Did we (the management team) have buy-in from C-suite-level management and investors?
What business problem are we trying to solve? Who are we solving it for?
What is the impact of this work? How much effort will the team undertake vs. how much value will it provide? How can we put a dollar amount on the importance of this project?
Are there any legal constraints to consider in the short, mid, and long term (GDPR, CCPA, SOC 2, MiFID II)?
Can every piece of our existing data stack allow us to be compliant?
Should/could we outsource it? Is there a ready-to-use end-to-end platform that could do it immediately and spare us the risk of building our own stack first? A kind of data-as-a-service offering?
How can we evaluate the ROI? (see a detailed post about data strategy ROI here)

Data Analytics

What are the KPIs (Key Performance Indicators)?
What are the metrics? (While KPIs measure progress toward specific goals, metrics are used to track the performance of particular business processes at an operational level. They help provide context to the implementation of critical business goals.)
How are those KPIs calculated?
How do these KPIs need to be sliced and diced (what are my dimensions)?
Does the data exist for all these KPIs and dimensions? Do I have access to it? Are there keys in place to link these fields together?
What is the use case of the dashboard (or stories)? What decisions should be made upon consultation with the report?

Data Engineering

Can we complete this work today? Can it be done iteratively?
What is the definition of done for each iteration of the work?
Where’s the source data, and what state is it in? What’s the infrastructure like, and what’s the budget for investment? How do we get the business knowledge we need for helpful data transformation and modeling? Is there a data dictionary?
Does our current or desired tech stack allow for what we want to build?
If we managed to get the answers to those questions, what action would we take?
How much maintenance will this require?
Is the incoming data inaccurate?

Data Management Pillar

Now that we have asked the right questions and got some answers, we can move on. Successfully adopting a data-driven approach requires efficient data management, whatever the organization and its expertise-maturity situation. It should be based on four major parts:

Governance: processes, roles, policies, and metrics to achieve its goals
Standards: a shared understanding of data (domain languages & interoperability)
Integration: unification approaches so data is seamlessly reconciled (how we ingest, load, transform, link/join and orchestrate pipelines from different sources)
Quality: data quality specific metrics with periodic audits
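
To make the Quality part more concrete, here is a minimal sketch of two common data quality metrics, completeness and uniqueness, computed on a tiny in-memory sample. The field names (donation_id, email, amount) and the sample rows are hypothetical and only serve the illustration; a real audit would run against the warehouse on a schedule.

```python
from typing import Any


def completeness(rows: list[dict[str, Any]], field: str) -> float:
    """Share of rows where `field` is present and non-empty."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)


def uniqueness(rows: list[dict[str, Any]], key: str) -> float:
    """Share of distinct values among the non-missing values of `key`."""
    values = [r[key] for r in rows if r.get(key) not in (None, "")]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# Hypothetical sample: one duplicated donation_id and two missing emails.
donations = [
    {"donation_id": 1, "email": "a@example.org", "amount": 50},
    {"donation_id": 2, "email": "", "amount": 20},
    {"donation_id": 2, "email": "b@example.org", "amount": 20},
    {"donation_id": 3, "email": None, "amount": 10},
]

report = {
    "email_completeness": completeness(donations, "email"),             # 0.5
    "donation_id_uniqueness": uniqueness(donations, "donation_id"),     # 0.75
}
print(report)
```

Tracking such ratios over time, with agreed thresholds, is one simple way to turn “periodic audits” into something measurable.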

Let’s take an example to clarify all this. An organization XYZ in the humanitarian domain must comply with GDPR in the EU. Concretely, when they ask for donations or ask people to fill out any form, they collect PII (Personally Identifiable Information). Not everybody in the organization should have access to it, or at least it should be anonymized.

To comply with these regulations, XYZ must define data governance:

Roles, e.g., a Country Manager can access all PII of their own country, but not of other countries
Policies, e.g., Privacy Policies (see the bottom of the website homepage)
RLS, i.e., Row Level Security on every listing/report/visualization
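
As an illustration of the role and RLS rules above, here is a minimal sketch of a row-level filter applied before any listing is shown. The User class, the role name "country_manager", and the sample donors are assumptions made for this example; in practice, RLS is usually enforced by the warehouse or the BI tool rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    role: str      # hypothetical role names, e.g. "country_manager"
    country: str   # country the user is responsible for


def visible_rows(user: User, rows: list[dict]) -> list[dict]:
    """Return only the rows this user is allowed to see (deny by default)."""
    if user.role == "country_manager":
        return [r for r in rows if r["country"] == user.country]
    return []


donors = [
    {"donor": "Alice", "country": "FR"},
    {"donor": "Bob", "country": "DE"},
    {"donor": "Chloé", "country": "FR"},
]

manager_fr = User(name="Dana", role="country_manager", country="FR")
analyst = User(name="Eli", role="analyst", country="FR")

print([r["donor"] for r in visible_rows(manager_fr, donors)])  # ['Alice', 'Chloé']
print(visible_rows(analyst, donors))                           # []
```

The deny-by-default branch is a deliberate choice: a role that has not been explicitly granted access sees nothing.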

Suppose they have the following process:

Get social media stats via admin dashboards of its social media tools.
Get email stats via admin dashboards of its CRM reporting tools.
Get tracking stats (Google Analytics, Segment)
Get all funnel ratios (have to join several sources)
Get some metrics (donations by channel, LTV, churn)
Evaluate, time-serialize, and visualize KPIs evolution

How can we comply with GDPR from the beginning to the end of these pipelines?

Admin tools in steps 1, 2, 3, and 4 are not all able to filter by XYZ “Roles and permissions.”
Data people have to transform this data in a centralized data warehouse to apply the rules BEFORE sending it to a user who has specific permissions.
Metrics should be calculated at the same time to keep downstream KPIs (based on these metrics) consistent.
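
One way to picture “apply the rules BEFORE sending” is a small preparation step that masks PII fields for users without the right permission. This is only a sketch under stated assumptions: the field names, the can_see_pii flag, and the secret key are hypothetical, and keyed hashing is a form of pseudonymization, which is weaker than full anonymization.

```python
import hashlib
import hmac

# Hypothetical secret; in a real setup it would come from a secrets manager.
SECRET_KEY = b"replace-me"

# Hypothetical list of columns considered PII in this example.
PII_FIELDS = ("email", "full_name")


def pseudonymize(value: str) -> str:
    """Replace a value with a stable keyed hash (same input, same output)."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def prepare_for_delivery(row: dict, can_see_pii: bool) -> dict:
    """Return a copy of the row, masking PII unless the user may see it."""
    if can_see_pii:
        return dict(row)
    return {
        key: pseudonymize(str(value)) if key in PII_FIELDS else value
        for key, value in row.items()
    }


row = {"email": "alice@example.org", "full_name": "Alice Martin", "amount": 50}

masked = prepare_for_delivery(row, can_see_pii=False)
print(masked["amount"])   # 50, non-PII fields are untouched
print(masked["email"])    # a 16-character hash instead of the address
```

Because the hash is stable, masked rows can still be joined and counted, so metrics computed downstream stay consistent for every audience.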

In addition, different tools use different names or data structures for the same concept. So do people in your company (credits to Castor’s blog):

The finance team defines the “number of customers” as the total number of customers that paid their bills between Jan. 1, 2021, 12:00 AM → Jan. 31, 2021, 11:59 PM.
The sales team defines the “number of customers” as the total number of customers that signed the contract between Jan. 1, 2021, 12:00 AM → Jan. 31, 2021, 11:59 PM.
The marketing team defines the “number of customers” as the total number of customers that are either paying or in the 14-day trial period between Jan. 1, 2021, 12:00 AM → Jan. 31, 2021, 11:59 PM.
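
The three definitions above can be sketched in a few lines to show how the same data yields three different “numbers of customers.” The sample customers and column names are invented for the illustration, and the marketing rule is simplified to “paid in the period, or started a trial in the period.”

```python
from datetime import date

START, END = date(2021, 1, 1), date(2021, 1, 31)

# Hypothetical customer records; None means the event has not happened.
customers = [
    {"id": 1, "signed": date(2021, 1, 5), "paid": date(2021, 1, 20), "trial_start": None},
    {"id": 2, "signed": date(2021, 1, 28), "paid": None, "trial_start": None},
    {"id": 3, "signed": None, "paid": None, "trial_start": date(2021, 1, 25)},
    {"id": 4, "signed": date(2020, 12, 10), "paid": date(2021, 1, 3), "trial_start": None},
    {"id": 5, "signed": None, "paid": None, "trial_start": date(2021, 1, 12)},
    {"id": 6, "signed": date(2021, 1, 15), "paid": None, "trial_start": None},
]


def in_period(day):
    """True if the event happened within the reporting period."""
    return day is not None and START <= day <= END


# Finance: customers that paid their bills in the period.
finance = sum(1 for c in customers if in_period(c["paid"]))

# Sales: customers that signed the contract in the period.
sales = sum(1 for c in customers if in_period(c["signed"]))

# Marketing (simplified): paying customers or customers in trial.
marketing = sum(
    1 for c in customers if in_period(c["paid"]) or in_period(c["trial_start"])
)

print(finance, sales, marketing)  # 2 3 4
```

Three teams, one dataset, three answers: this is the kind of ambiguity that shared Standards are meant to remove.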

Whatever the data project, the first step is reducing noise and emphasizing the information. It requires Standards, Integration, and Quality to succeed (or avoid failure).

We have illustrated so far that we can’t build a data platform by just stitching together different apps and tools. There are numerous traps we could fall into without thinking about a clear data management strategy first.

Now let’s assume we have a nice and clean Data Management Pillar. Data engineers use it to prepare/transform data. They create sophisticated SQL queries that can take hours or days to run. Then, business users and data analysts can access this newly computed data via the Delivery Pillar.

Delivery Pillar

“where” and “how” the data platform distributes its outputs?

It can be:

internally for driving the company (BI, AI, OPs),
externally in B2B and B2B2B cases (multi-tenant or clients of your clients) or any other case implying partners or government stakeholders.

The Delivery Pillar can distribute outputs as dashboards, stories (groups of dashboards), embedded UIs, or API endpoints for better interoperability, or simply by plugging in your BI tool and running SQL queries for the last aggregation before visualization, and so on.

What kind of data are we processing at the Delivery level?

Remember the “What are the good questions?” section above, where we defined metrics and KPIs? Now it’s time to take them from the Data Management storage (usually a Data Warehouse) and use them for further calculations or visualizations.

Metrics can be very different from one company to another, in quality and quantity. Here are some examples:

Always start with as few metrics as possible. A way to filter them is by removing the “vanity metrics” and keeping the “actionable metrics.”

Vanity metrics: make you look good to others but do not help you understand your performance in a way that informs future strategies;
Actionable metrics: data that helps you make decisions and helps your business reach its goals or grow. They’re related to something you can control or repeat meaningfully.

"Think about a marketing landing page for an ebook download. Measuring pageviews doesn’t allow you to make a business decision, but measuring download rate might inspire you to test out different on-page wording, call to action buttons, or styles of form submission."

Credit: https://www.tableau.com/learn/articles/vanity-metrics
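
The landing page example can be expressed as a small calculation. In this sketch, the two page variants and their numbers are made up; the point is only that raw pageviews say little on their own, while the download rate lets us compare variants and decide what to test next.

```python
def download_rate(pageviews: int, downloads: int) -> float:
    """Actionable metric: share of visitors who downloaded the ebook."""
    if pageviews == 0:
        return 0.0
    return downloads / pageviews


# Hypothetical results for two versions of the same landing page.
variants = {
    "A (original wording)": {"pageviews": 1000, "downloads": 50},
    "B (new call to action)": {"pageviews": 800, "downloads": 64},
}

rates = {name: download_rate(**stats) for name, stats in variants.items()}
best = max(rates, key=rates.get)

for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
print("Best variant:", best)
```

Variant A has more pageviews (the vanity metric), yet variant B converts better, which is the information a team can actually act on.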

What’s next?

At this stage, our managers and tech teams (BI, AI, Ops) have their data, but what about everyone else? Employees are not all data analysts or data scientists. Business employees in marketing, finance, or sales need the insights produced by BI and AI experts. These insights are much more powerful if they drive the everyday operations of your teams directly in tools like HubSpot, Salesforce, NetSuite, SAP, Workday, Gainsight, Zendesk, etc.

These operational teams require very pragmatic insights in order to perform their daily duties. Some examples:

Sales wants the list of webinar attendees to import as leads into Salesforce
Support wants to see on Zendesk the data about accounts with premium support
Product wants a Slack feed of customers who have enabled a feature
Accounting wants customer attributes to be synced to NetSuite
Finance wants a CSV of rolled up transaction data to use in Excel or Google Sheets

Reverse ETL, therefore, has emerged as a key part of the modern data stack to close the loop between analysis and action or activation. These tools will redistribute “insight as data” into SaaS Apps or internal applications automatically.

Here is a list of providers in this field: Hevo Activate, Hightouch, Census, Polytomic, Omnata.
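
To show the idea behind Reverse ETL without relying on any vendor, here is a minimal sketch of the “webinar attendees to CRM leads” case. An in-memory SQLite database stands in for the data warehouse, and the table name, the column names, and the target field names in FIELD_MAP are all hypothetical. The actual push to the SaaS app is left out, since it depends on each tool’s API.

```python
import sqlite3

# In-memory SQLite stands in for the data warehouse in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE webinar_attendees (email TEXT, full_name TEXT, webinar TEXT)"
)
conn.executemany(
    "INSERT INTO webinar_attendees VALUES (?, ?, ?)",
    [
        ("ben@example.com", "Ben Okoro", "Data 101"),
        ("ana@example.com", "Ana Silva", "Data 101"),
    ],
)

# Hypothetical mapping from warehouse columns to the CRM's lead fields.
FIELD_MAP = {
    "email": "lead_email",
    "full_name": "lead_name",
    "webinar": "lead_source",
}


def build_lead_payloads(connection: sqlite3.Connection) -> list[dict]:
    """Read modeled rows from the warehouse and reshape them for the CRM."""
    cursor = connection.execute(
        "SELECT email, full_name, webinar FROM webinar_attendees ORDER BY email"
    )
    columns = [col[0] for col in cursor.description]
    return [
        {FIELD_MAP[col]: value for col, value in zip(columns, row)}
        for row in cursor
    ]


payloads = build_lead_payloads(conn)
for payload in payloads:
    print(payload)  # a real sync would send each payload to the CRM here
```

The mapping table is the heart of the sketch: the warehouse stays the source of truth, and each destination only needs its own field mapping.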

Conclusion

We closed the loop, and it’s time to conclude.

In this second article, we’ve seen how the data world has changed structurally within the last 20 years in all dimensions and how challenging it is to design a data platform today. These perspectives helped us understand structural challenges, how other comparable companies face them, and where our company stands relative to them.

Whatever the data strategy, the first step is reducing noise and emphasizing the information, not considering which applications and tools could help nor how to integrate them.

We also learned how to upgrade our data-driven company to the next level by preparing Data Management and Data Delivery plans.

Asking the right questions first is key. Force yourself to do this before any complex task, and ask mentors for help if needed. Everything is easier after that.

We hope this is actionable in your learning process. Don’t hesitate to share your suggestions and questions in the comments below so that we can make this more accessible to future readers.

Links to other articles of this series:

Alexandre Beauvois

Senior Software Engineer

Modern Data Stack as a Service (1/3)

The Evolution of Data Platforms

medium.com
