Trends Shaping The Modern Data Stack
The modern data stack is in a continuous state of flux. Large, complex software systems are now being built around data, with the primary business value coming from the analysis of that data rather than from the software itself. This shift in how data is used has led to the emergence of new roles, bigger budgets for data architecture overhauls, and new companies offering platforms and tools for working with and managing data.
Many companies have sprung up to build products for managing data. These new products and platforms enable near real-time, data-driven decision-making, and new technology trends are emerging for each step of the data lifecycle. Solutions range from automated data pipelines that carry data, to storage systems that house it, to SQL engines that analyze it, to dashboards that make it easy to understand through visual representation.
Many companies are working to be at the forefront of this change, and the discussion has shifted from how to warehouse data to how to use data in daily operations. In this article, we will discuss some of the immediate trends that are shaping the modern data stack.
Metrics Layer:
Typically, a big pharma organization has many brands, each with commercial operations teams that rely heavily on BI tools for their daily, weekly, and monthly reporting. Brand teams generally track the ROI of marketing efforts through various marketing and sales KPIs: customer engagement metrics, prescription trends, price-volume effects, sales and volume by channel, market evolution, and share of voice (SOV), among many others. Reporting of these KPIs is generally similar across brand teams, with some KPIs specific to individual brands.
In a typical current architecture, users create and define their metrics and KPIs in downstream business intelligence applications like Power BI or Tableau. This wastes man-hours recreating and maintaining logic for similar KPIs and metrics used by different teams across the organization. It also leads to an inconsistent understanding of metrics, because different teams may interpret the same metric in different ways, producing incorrect reporting, misalignment, and confusion. A metrics layer can solve these issues throughout the organization to a certain degree.
The metrics layer is a centralized repository of key business metrics that sits between the organization's data storage and its downstream business intelligence tools, and it is where all metric logic lives. It creates a single source of truth for KPIs and metrics: analysts working in BI tools like Power BI or Tableau use the standardized metric and KPI logic maintained in the metrics layer. This standardizes reporting and analysis across the organization, brings consistency to all business analyses, and makes data teams more efficient by reducing repeated analytics work.
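As a rough illustration, here is a minimal sketch of what such a centralized metric registry could look like, assuming a hypothetical in-house Python implementation; the metric names, source tables, and SQL expressions are illustrative, and real metric layers (dbt's semantic layer, MetricFlow, and similar tools) use their own configuration formats.

```python
from dataclasses import dataclass, field

@dataclass
class MetricDefinition:
    """A single, centrally governed metric definition."""
    name: str
    description: str
    sql_expression: str            # aggregation applied in the warehouse
    source_table: str
    allowed_dimensions: list = field(default_factory=list)
    owner: str = "commercial-analytics"

# Hypothetical central registry: every BI tool reads metric logic from here.
METRIC_REGISTRY = {
    "trx_volume": MetricDefinition(
        name="trx_volume",
        description="Total prescriptions (TRx) written in the period",
        sql_expression="SUM(trx_count)",
        source_table="analytics.prescriptions",
        allowed_dimensions=["brand", "channel", "month"],
    ),
    "share_of_voice": MetricDefinition(
        name="share_of_voice",
        description="Brand mentions as a share of category mentions",
        sql_expression="SUM(brand_mentions) / NULLIF(SUM(category_mentions), 0)",
        source_table="analytics.media_activity",
        allowed_dimensions=["brand", "month"],
    ),
}

def compile_metric_query(metric_name: str, dimensions: list) -> str:
    """Render the governed SQL for a metric so every BI tool gets identical logic."""
    metric = METRIC_REGISTRY[metric_name]
    disallowed = [d for d in dimensions if d not in metric.allowed_dimensions]
    if disallowed:
        raise ValueError(f"Dimensions not allowed for {metric_name}: {disallowed}")
    group_by = ", ".join(dimensions)
    return (
        f"SELECT {group_by}, {metric.sql_expression} AS {metric.name}\n"
        f"FROM {metric.source_table}\n"
        f"GROUP BY {group_by}"
    )

# Example: the same TRx definition, reused by any dashboard or report.
print(compile_metric_query("trx_volume", ["brand", "month"]))
```

Because every dashboard compiles its query from the same definition, a change to a KPI's logic is made in exactly one place and propagates everywhere it is used.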
Some benefits of the metric layer are:
1. It democratizes the use of consistent metrics throughout the organization.
2. A single interface containing all metric definitions enhances transparency between technical and non-technical teams.
3. The same metrics can be reused in diverse contexts and in different tools.
4. Tracking changes in KPI and metric logic becomes easier.
The metrics layer was brought into the picture by prominent tech companies like Spotify, Airbnb, Slack, and Uber. These companies were scaling fast, did not have enough manpower, and wanted a single-source-of-truth metrics platform that could give them a consistent, automatically generated view of the various functions within the organization. They achieved this by standardizing the way metrics are created, calculated, served, and used throughout the organization, pioneering metrics platforms that maintain the whole lifecycle of a metric: discovery, definition, planning, calculation, quality, and consumption. This enables data-driven business decisions by supporting rapid metric computation, promoting data democratization, and creating useful features for training ML models.
Reverse ETL/Connectors:
Traditionally, data sitting inside a data warehouse was used for analytical workloads and business intelligence applications. Data teams have since recognized that this data can also power operational analytics by delivering near real-time data, in an automated manner, to the places where it is most useful. The technology that emerged to fill this gap is called reverse ETL: the process of moving data from a data warehouse into third-party systems to make it operational and get better leverage out of an organization's data. Reverse ETL has become a crucial component of the modern data stack because it closes the loop between analysis and action (or activation). Through reverse ETL, warehouse data can drive the everyday operations of teams such as sales, marketing, and finance through tools like NetSuite, HubSpot, SAP, Salesforce, and Workday. It acts as a bridge between your data warehouse and the cloud applications and SaaS products your teams love to use.
Data teams have historically written their own API connectors to push data from the warehouse into operational SaaS products. These APIs are generally not built for real-time data transfer, so teams have to set up batching, retries, and checkpoints to manage traffic, which is time-consuming, and the connectors become hard to maintain as API specifications change over time. Reverse ETL tools emerged to deal with these challenges and make the transfer seamless. They provide out-of-the-box connectors to various systems so that teams do not have to spend time writing and maintaining their own, reducing the engineering burden of pushing warehouse data into ever more systems. They also provide a visual interface for selecting the query that populates standard and custom SaaS fields, and let you define the triggers that start a sync between the data warehouse and the SaaS systems. Reverse ETL tools have many use cases that can make the life of data teams easier.
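To make the mechanics concrete, here is a minimal sketch of the sync loop a reverse ETL pipeline performs, written in Python under stated assumptions: the CRM bulk-upsert endpoint is hypothetical, the warehouse rows are stand-ins, and commercial reverse ETL tools hide the batching and retry handling shown here behind managed connectors.

```python
import time
import requests  # third-party HTTP client, assumed available; any HTTP library works

# Hypothetical destination; a real connector would target a specific SaaS API.
SAAS_ENDPOINT = "https://api.example-crm.invalid/v1/contacts/bulk_upsert"

def fetch_rows() -> list[dict]:
    """Stand-in for a warehouse query (Snowflake, BigQuery, Redshift, ...)."""
    return [
        {"hcp_id": "H001", "hcp_name": "Dr. Rivera", "lead_score": 87},
        {"hcp_id": "H002", "hcp_name": "Dr. Chen", "lead_score": 64},
    ]

def push_batch(batch: list[dict], max_retries: int = 3) -> None:
    """Push one batch to the SaaS API with simple retry and backoff."""
    for attempt in range(1, max_retries + 1):
        response = requests.post(SAAS_ENDPOINT, json={"records": batch}, timeout=30)
        if response.ok:
            return
        time.sleep(2 ** attempt)  # back off before retrying
    raise RuntimeError(f"Batch failed after {max_retries} retries: {response.status_code}")

def reverse_etl_sync(batch_size: int = 200) -> None:
    """The core loop: query the warehouse, batch the rows, push them downstream."""
    rows = fetch_rows()
    for start in range(0, len(rows), batch_size):
        push_batch(rows[start:start + batch_size])

if __name__ == "__main__":
    reverse_etl_sync()
```

Everything in this loop, plus checkpointing, schema mapping, and API-specific rate limits, is exactly the undifferentiated work that reverse ETL products take off a data team's plate.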
Use Cases of Reverse ETL Tools:
A problem faced by the commercial operations team in a pharma company is identifying the most important HCPs (health care providers) for sales reps to focus on. To solve it, analysts typically use SQL to derive the characteristics of high-value HCPs and present the findings in a BI report. To their dismay, they find that the sales reps rarely use these reports to make decisions, and rarely provide any feedback on them.
This is an analytics enablement problem: insights are not reaching business teams inside their usual workflow, where they could inform decisions. Deriving insights from data is the easy part; the last mile, translating those insights into action, is a different game altogether.
The traditional approach would be to train sales reps to consult BI reports during their normal daily workflow, but driving adoption this way is tough. Instead, you can operationalize the analysis by feeding lead scores from the data warehouse into a custom field in the commercial operations software, so reps are armed with your insights at the moment they take action.
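A hypothetical sync configuration for this use case might look like the following; the object names, custom field names, and trigger are illustrative rather than tied to any specific CRM or reverse ETL product.

```python
# Illustrative reverse ETL sync: lead scores computed in the warehouse land in a
# custom field on the CRM record the sales rep already works in every day.
HCP_LEAD_SCORE_SYNC = {
    "source_query": """
        SELECT hcp_id, lead_score, score_reason
        FROM analytics.hcp_lead_scores
    """,
    "destination_object": "crm.contacts",      # where reps plan their calls
    "match_on": {"hcp_id": "external_id"},     # warehouse column -> CRM identifier
    "field_mapping": {
        "lead_score": "custom_lead_score__c",      # hypothetical custom fields
        "score_reason": "custom_score_reason__c",
    },
    "trigger": "daily at 06:00",               # refreshed before reps start their day
}
```

The analysis itself does not change; what changes is that its output shows up inside the tool where the rep already makes decisions.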
Another use case is sending customer data from the warehouse to various SaaS systems to maintain a consistent view of customers across them. A pharma organization can, for example, keep an up-to-date list of high lifetime value HCPs, or of HCPs whose prescription writing is above a particular threshold, by pushing data from the warehouse into its SaaS products.
Another is mirroring product usage data into marketing and customer relationship management SaaS products to improve customer interactions, for example by sending personalized messages alongside product metrics. Syncing customer data into customer support tools can likewise save time when responding to requests and help prioritize messages as they come in.
Reverse ETL also helps with data automation. In any pharma organization, the IT team is flooded with manual requests for data and CSV files from and between departments: the commercial operations team wants the list of medical event attendees to import as leads into its CRM, performance reporting teams building dashboards want to know who is using which dashboard feature so they can improve it, and finance teams want transaction data rolled up in CSV format for Excel or another tool. All of this data is already in the warehouse; it just needs automated reverse ETL pipelines to push it out and keep it in sync with the external tools.
Modern Data Catalogs:
The modern data stack is evolving at lightning speed thanks to innovations in data management, data storage, elastic scale-up and scale-down, fast access, and transformation tooling. One area that has progressed slowly in comparison is building trust in data and giving it context.
Companies are collecting huge amounts of data, with every external and internal touchpoint of the organization creating more. In pharma, for example, commercial operations is becoming increasingly data-driven: data is collected on each healthcare professional and target patient on a nearly daily basis, at varying granularity and complexity, and a table with more than a hundred columns per HCP or patient is now common. To deal with this complexity, pharma organizations employ data stewards who act as a bridge between IT and the business, with the responsibility of providing business users with high-quality data that is easily and consistently accessible. A major part of a data steward's job is maintaining data catalogs, which contain the definitions of all the columns across the numerous data sets the organization maintains. Instilling a consistent definition of each column throughout the organization is a practice the data steward has to drive.
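At its simplest, the column-level glossary a data steward maintains is a shared mapping from fully qualified column names to agreed definitions. The sketch below is illustrative; the tables, columns, and definitions are assumptions, not a real catalog.

```python
# Illustrative column-level catalog entries, kept in one place so every team
# reads a column like "trx_count" the same way.
COLUMN_CATALOG = {
    "hcp_master.npi": {
        "definition": "National Provider Identifier of the health care provider",
        "type": "string",
        "steward": "commercial-data-office",
    },
    "prescriptions.trx_count": {
        "definition": "Total prescriptions dispensed in the period, including refills",
        "type": "integer",
        "steward": "commercial-data-office",
    },
    "media_activity.share_of_voice": {
        "definition": "Brand mentions divided by total category mentions in the period",
        "type": "decimal between 0 and 1",
        "steward": "brand-analytics",
    },
}
```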
But as companies set up large data implementations, they realized that a simple IT inventory of data was not enough; they needed to blend the data inventory with evolving business context. On top of that, data catalogs were difficult to set up and maintain, requiring rigid data governance committees, complex technology setups, and lengthy implementation cycles. The tools were built on legacy monolithic architectures and deployed on-premise, so vendors could not roll out changes with a simple cloud update. Technical debt grew, and metadata management steadily fell behind the rest of the modern data stack.
Modern data catalogs were introduced to keep up with the pace of new-age data tooling companies like Snowflake and Fivetran, which let a company stand up a warehouse and its pipelines in less than 30 minutes. They are designed to run in the cloud and to provide the collaborative experience that is key to today's modern workplace. Some characteristics of modern data catalogs are:
1. Focus on data assets, not just tables: modern metadata management tools have to move beyond the legacy assumption that tables are the only thing to manage. BI dashboards, SQL queries, code snippets, and data models are all data assets that need to be linked and intelligently stored in one place (see the sketch after this list).
2. Collaboration, not silos: metadata management tools should integrate seamlessly with a team's daily workflow. User interface and user experience should not be an afterthought but a deliberate design choice; that is when real-time embedded collaboration comes alive.
3. Single source of truth: modern tools should provide end-to-end visibility of data assets whose information is otherwise spread across data quality, data lineage, and data prep tools. A centralized single source of truth about all data assets is highly desirable.
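Here is a minimal sketch of such an asset-centric catalog, assuming a hypothetical in-house model; the asset names, owners, and lineage links are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class DataAsset:
    """One catalog entry: tables, dashboards, SQL queries, and models are all assets."""
    name: str
    asset_type: str            # "table", "dashboard", "sql_query", "ml_model", ...
    owner: str
    description: str
    upstream: list = field(default_factory=list)   # lineage links to other assets
    tags: list = field(default_factory=list)

CATALOG = [
    DataAsset(
        name="analytics.hcp_engagement",
        asset_type="table",
        owner="data-engineering",
        description="Daily HCP engagement events across channels",
    ),
    DataAsset(
        name="Brand X weekly performance",
        asset_type="dashboard",
        owner="brand-x-analytics",
        description="Weekly KPI dashboard for the Brand X commercial team",
        upstream=["analytics.hcp_engagement", "analytics.prescriptions"],
        tags=["weekly", "commercial"],
    ),
]

def impacted_assets(changed_asset: str) -> list:
    """Walk lineage links to see which downstream assets a change would touch."""
    return [asset.name for asset in CATALOG if changed_asset in asset.upstream]

# Example: a schema change to the engagement table flags the dashboard built on it.
print(impacted_assets("analytics.hcp_engagement"))
```

Because dashboards and queries are first-class assets with lineage links, a change to a warehouse table can be traced to every downstream artifact it affects.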
Data Observability:
Imagine a scenario in which a drug brand team needs to send a quarterly performance report to the pharma organization's CEO. They have worked hard for days to make sure the view they send is accurate. They email the report and, to their horror, receive a reply from the CEO saying the numbers don't look right. As the brand team tries to figure out what went wrong, they realize they have been hit by data leaders' biggest pain: data downtime.
Data downtime refers to periods when incoming data is partial, erroneous, missing, or just plain inaccurate. The problem multiplies as data systems grow larger and more complex, managing a huge ecosystem of sources and consumers. As data downtime increases, the business users who consume data to make decisions start losing trust in it and are left at the mercy of intuition and a limited view of the system.
Data observability is a way to deal with these problems: the systematic monitoring, tracking, and triaging of incidents to prevent data downtime. It helps organizations understand the health of their data. Observability engineering has emerged as a new engineering discipline built on the following three pillars (illustrated in the sketch after this list):
1. Metrics: a numeric representation of data measured over a period of time.
2. Logs: timestamped records of the events that happen at particular points in time, along with the context in which each event happened.
3. Traces: representations of causally related events in distributed environments.
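As a small illustration of the three pillars in code (written here in Python with the standard logging module; the step name, values, and metric format are assumptions), the same pipeline step can emit a metric, a timestamped log line with context, and a trace identifier tying it to related events.

```python
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")

def load_prescriptions(trace_id: str) -> int:
    """One pipeline step, instrumented with metrics, logs, and a trace id."""
    start = time.perf_counter()
    log.info("trace_id=%s step=load_prescriptions status=started", trace_id)
    rows_loaded = 10_000  # stand-in for the real load
    duration_s = time.perf_counter() - start
    # Metric: a numeric measurement over time (printed here; a real system would
    # ship it to a metrics backend).
    print(f"metric pipeline.load.duration_seconds={duration_s:.4f} rows={rows_loaded}")
    log.info("trace_id=%s step=load_prescriptions status=finished", trace_id)
    return rows_loaded

# Trace: the same trace_id ties causally related events together across steps.
trace_id = str(uuid.uuid4())
load_prescriptions(trace_id)
```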
Data observability tools are gaining traction in the industry. They carry out automated monitoring, alerting, and triaging to identify and evaluate data quality and discoverability issues, leading to healthier data pipelines and more productive teams. These platforms stand on the foundational pillars of data observability (a minimal sketch of such checks follows the list):
1. Freshness: tracks how fresh your data tables are and how frequently they are updated.
2. Distribution: tracks whether the values in each column of a table lie within their acceptable range. If a value looks odd, it is flagged so the data team can take corrective action; the more distribution issues there are, the less the data can be trusted.
3. Volume: tracks completeness, checking how complete your tables are and whether the data arriving from the source is complete.
4. Schema: monitors who changes a table's schema and when, because a schema change without a corresponding change in the source data can leave newly introduced columns empty.
5. Lineage: when data breaks, lineage answers where in the pipeline it happened and which upstream sources and downstream consumers are affected.
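As a rough sketch of what automated checks behind the first three pillars might look like, here is a minimal Python example running freshness, volume, and distribution checks against an illustrative table snapshot; the thresholds and field names are assumptions, and production observability tools run such checks continuously against warehouse metadata and query results.

```python
from datetime import datetime, timedelta, timezone

# Stand-in for a table snapshot pulled from the warehouse.
snapshot = {
    "last_loaded_at": datetime.now(timezone.utc) - timedelta(hours=30),
    "row_count": 9_200,
    "expected_row_count": 10_000,
    "null_rate_hcp_id": 0.07,
}

def check_freshness(snap, max_age_hours=24):
    """Freshness: has the table been updated recently enough?"""
    age = datetime.now(timezone.utc) - snap["last_loaded_at"]
    return age <= timedelta(hours=max_age_hours), f"age={age}"

def check_volume(snap, min_ratio=0.95):
    """Volume: did roughly the expected number of rows arrive?"""
    ratio = snap["row_count"] / snap["expected_row_count"]
    return ratio >= min_ratio, f"row ratio={ratio:.2f}"

def check_distribution(snap, max_null_rate=0.01):
    """Distribution: are column values within their acceptable range?"""
    return snap["null_rate_hcp_id"] <= max_null_rate, f"null rate={snap['null_rate_hcp_id']:.1%}"

for name, check in [("freshness", check_freshness),
                    ("volume", check_volume),
                    ("distribution", check_distribution)]:
    ok, detail = check(snapshot)
    print(f"{name:12s} {'PASS' if ok else 'ALERT'} ({detail})")
```

Schema and lineage checks follow the same pattern but are usually driven from warehouse metadata and the pipeline's dependency graph rather than from the data itself.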