Snowflake 101: Data Products #3
Harald Erb
Data Enthusiast & Architect, Regional Principal Sales Engineer at Snowflake
Snowflake Forecasting, consistent Metrics & Data Observability
Part 1 of this mini-series focused on practices for documenting and securing data products, including their implementation in Snowflake. It also used practical examples to present, from the business user's point of view, new types of analysis tools that can make self-service for cross-functional data teams a bit more effective again. Part 2 of the series showed how distributed data teams (data providers) can professionally offer their data products in the Snowflake Data Cloud via a private data exchange accessible throughout the company, and make them available using Snowflake Secure Data Sharing without additional data integration efforts.
Here, in the final part of the series, we look at other concepts that data teams and platform operators can design and implement with Snowflake for the benefit of their end users. The ultimate goal is to increase user adoption and referral rates of their own data products, e.g. measured by Net Promoter Score (NPS), which could be achieved through additional convenience features and quality assurance measures.
In the following three examples, I will again use the "Gasoline Prices Germany" data product virtually constructed in Part 1 and Part 2: on the one hand to illustrate the technical aspects, and on the other hand because it makes the currently breathtaking development of fuel prices in Germany visible.
Building a price forecast as a value-added function
At Snowflake Summit 2022, the Snowflake ML engineering team presented not only Snowpark for Python (now GA!) but also a built-in time series forecasting model (currently in private preview), along with a demo that can be watched here ("Train and deploy models using Python or SQL"). With this new Snowflake built-in feature, you can produce a forecast for a univariate time series by using the function FORECAST().
The picture below shows how it works: A shared Snowflake Native Application (in private preview) has been made available in a schema called SHARED_DATA_APPS.SNOWML (1), and from there you can call and test the FORECAST() function by passing in time series data from a table (2). In our case, the "Gasoline Prices Germany" data product provides everything we need: historical fuel prices of all gas stations in Germany since 2014.
The E10 fuel price forecast created by the table function (3) for a given gas station can then be easily visualized using Snowsight or your favorite visualization tool.
A note about the algorithm used for forecasting: In the current preview release, the FORECAST() function uses a forecasting algorithm based on a combination of a generalized linear model (GLM) and a gradient boosting machine (GBM), and it can produce forecasts over multiple time series at once. In relation to our scenario, this means a forecast can be generated for several gas stations simultaneously with one function call. Furthermore, the FORECAST() function is robust with regard to missing values and misaligned timestamps, and automatically determines the best settings to use for the algorithm. It also supports the extraction of a linear trend component from the time series data, detects patterns due to seasonal components, and automatically performs hyperparameter tuning.
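To give a concrete flavor of such a forecast, here is a minimal sketch. Since the exact signature of the private-preview table function is not public, the sketch uses the SNOWFLAKE.ML.FORECAST class through which Snowflake's time series forecasting is exposed today; the view name V_E10_PRICES and its columns (STATION_UUID, PRICE_DATE, E10) are assumptions for illustration only.

-- Sketch only: uses the SNOWFLAKE.ML.FORECAST class; the preview table function
-- described above may differ. View and column names are assumptions.
CREATE OR REPLACE SNOWFLAKE.ML.FORECAST e10_price_forecast(
  INPUT_DATA        => SYSTEM$REFERENCE('VIEW', 'V_E10_PRICES'),
  SERIES_COLNAME    => 'STATION_UUID',   -- one time series per gas station
  TIMESTAMP_COLNAME => 'PRICE_DATE',
  TARGET_COLNAME    => 'E10'
);

-- Produce a 14-day forecast for all stations contained in the training view
CALL e10_price_forecast!FORECAST(FORECASTING_PERIODS => 14);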
After successfully testing the FORECAST() function with our data product's data, a wrapper function GET_2022_FUEL_PRICES_WITH_FCST() can be created to ensure that a correct data context is used.
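For illustration, a possible shape of such a wrapper is sketched below as a SQL table function. It assumes the forecast results have been persisted to a table E10_FORECASTS and that the fuel-price columns are named as in the earlier sketch; all object and column names are assumptions.

-- Sketch of a possible wrapper (all object and column names are assumptions):
-- returns the 2022 actuals for one gas station together with its forecast.
CREATE OR REPLACE FUNCTION GET_2022_FUEL_PRICES_WITH_FCST(STATION_ID VARCHAR)
RETURNS TABLE (PRICE_DATE DATE, E10 FLOAT, IS_FORECAST BOOLEAN)
AS
$$
    SELECT PRICE_DATE::DATE, E10, FALSE AS IS_FORECAST
    FROM   SHARED_DATA_PRODUCT.FUEL.F_FUEL_PRICES
    WHERE  STATION_UUID = STATION_ID
    AND    YEAR(PRICE_DATE) = 2022
    UNION ALL
    SELECT TS::DATE, FORECAST, TRUE
    FROM   E10_FORECASTS                 -- persisted output of the forecast call
    WHERE  SERIES::VARCHAR = STATION_ID
$$;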
The screenshot below shows a test call of the new function with a well formatted output of a sample forecast.
Considering the development of high-quality data products, this example shows quite well how, beyond a curated dataset, additional value-added functions can be part of a data product and potentially become a unique selling point. Now that we've given some thought to value-added functions and how to make use of machine learning for the benefit of our data product users, it's also important to keep our own house clean with regard to consistent metrics and constantly high data quality!
Achieving Metric Consistency through a Metric Store
One thing in advance: Self-service with data is good and, in particular, business departments should be supported by suitable working environments and tools at all times! Often, however, the feedback loop that moves a good idea or a prototype into a hardened, scalable and company-wide accessible production environment is missing. When it comes to data usage, decision makers consistently complain that different teams report discrepant numbers for simple business questions and that there is no easy way to trace back which number is correct.
In this context, the image below summarizes common tool or team scenarios that can be the reason for inconsistencies of all kinds: multiple uses of the same metric name, different data sources, divergent data transformations, poor data quality, etc. For example, a calculated variable "contribution margin" could be provided in a data mart in a quality-assured manner and independently of the BI tools used. If this is not the case, but the formula components (revenues and variable costs) are available in the data mart, this calculation can be defined in the semantic layer of the company-wide BI platform and used from there by all BI users. However, this requires that all users work with the same BI tool.
Things can get confusing when other BI tools with their own semantic layers, suddenly deployed in a barely coordinated fashion or even declared the new enterprise standard, provide divergent answers based on slightly different tables, metric definitions and business logic. Not to be forgotten are the many Data Science projects that run in parallel to the day-to-day business: they usually start with raw data that is cleaned, transformed and enriched to train Machine Learning models. When the project results are presented later, deviations cannot be ruled out here either, which can undermine the confidence of decision-makers. The good news is that you're not alone with this problem, and we can learn from leading data-driven companies such as Airbnb: their analytics journey, the growing pains they faced in the last few years, and their solution approach with a metric infrastructure are well described in the blog post "How Airbnb Achieved Metric Consistency at Scale".
The concept of a Snowflake-based Metrics Store could look like what Snowflake's technology partner SLED (formerly: Snowboard) has already introduced as part of their Analytics Workspace. SLED is an application built for the Snowflake Data Cloud and has its own user interface allowing you to define, document, quality-assure and discover business metrics directly on top of the data in your Snowflake account.
The advantages of SLED's approach are obvious and described in more detail in this article: You leverage the power of Snowflake when it comes to computing and performance, but also keep governance simple with a single role and privilege management. And finally, you consume all business-critical metrics from Snowflake using SQL, so any tool that can connect to a SQL database can also access your metric store.
To get a practical impression of how to define, document, quality-assure and make metrics findable in SLED, we again use our "Gasoline Prices Germany" data product for better illustration. The figure below shows the homepage of a SLED Analytical Workspace connected to a Snowflake account containing our "SHARED_DATA_PRODUCT" database and its schemas. As a SLED user, you can quickly get an overview of all provided assets, e.g. by using the free-text search or by browsing all accessible database objects in the navigation bar. Alternatively, you can first narrow your search by selecting predefined topics ("data-product", "approved"), or one of the most popular metrics, datasets, etc.
And as you would expect from a professional application, typical areas of responsibility are defined in SLED using a role-based access model.
Next, let's inspect an already existing metric definition to quickly track price trends for E10 fuel in a given region. The foundation for this is a data source prepared in advance in Snowflake (a view in our case), which provides all the required data fields. On this basis, the metric definition specifies a time dimension, further dimensions, and the aggregation to be performed. Meaningful names and descriptions, as well as additionally set tags (topics), later ensure quick discoverability in SLED's catalog.
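Such a prepared data source could, for example, be a view along the following lines; this is a simplified sketch, and the schema, table and column names (D_GAS_STATIONS, PRICE_TS, etc.) are assumptions for illustration.

-- Sketch of a prepared metric data source (all object and column names are
-- assumptions): one row per price change, enriched with the needed dimensions.
CREATE OR REPLACE VIEW SHARED_DATA_PRODUCT.METRICS.V_E10_PRICE_CHANGES AS
SELECT p.PRICE_DATE  AS PRICE_TS,      -- time dimension
       s.STATION_UUID,                 -- gas station dimension
       s.BRAND,
       s.CITY,
       s.POST_CODE,                    -- region dimension
       p.E10         AS E10_PRICE      -- measure to be aggregated
FROM   SHARED_DATA_PRODUCT.FUEL.F_FUEL_PRICES  p
JOIN   SHARED_DATA_PRODUCT.FUEL.D_GAS_STATIONS s
  ON   p.STATION_UUID = s.STATION_UUID;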
As a result, catalog users are presented with a highly interactive metrics preview (1) and can copy & paste the corresponding SQL statement (2) into their favorite IDE or BI tool. Side note: Matching the metric definition, SLED generates so-called "Magic Views" for Snowflake, which are located in the database managed by SLED (3).
In summary: What can be achieved with a metric store, as just shown in the example above? A metric store enables all data consumers to dynamically segment a metric along different dimensions in a declarative way. How the metric is aggregated or sliced is handled transparently for the consumer, who can simply request "E10 fuel price change by month, gas station" to obtain the results. Furthermore, there is now a free choice of tools: any tool that is able to connect to a SQL database can also connect to a metric store.
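From the consumer's perspective, such a request could look like the following query against one of the generated "Magic Views"; the view and column names are assumptions, not SLED's actual naming.

-- Sketch: "E10 fuel price change by month, gas station", consumed via SQL from
-- a SLED-generated view (view and column names are assumptions).
SELECT DATE_TRUNC('month', PRICE_TS) AS MONTH,
       STATION_UUID,
       AVG(E10_PRICE)                AS AVG_E10_PRICE
FROM   SLED_MANAGED_DB.METRICS.E10_PRICE_CHANGE   -- "Magic View"
GROUP BY 1, 2
ORDER BY 1, 2;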
Data observability - How to avoid data issues in the future
So far in this and the previous two blog posts, we have focused on how to provide quality data and additional value-added functions. Let's now assume that the self-service offering of a data team or business domain is widely accepted by its users and increasingly utilized in many areas of an organization. Subsequently, the data team grows and the data stack scales with more data sources, more tables, and more complexity. But as the team becomes more successful, it spends more time combating data quality issues, until doing so becomes the data team's top priority just to secure the success of its work to date. Therefore, it is important for data teams to keep a constant eye on the health of their data products and avoid data downtime caused by an incomplete, incorrect, or otherwise inaccurate data product.
So is it enough to simply monitor data loading and the availability of the data product, or is there more to be done? For a few years now there has been a new category called Data Observability, which has its origin in software observability (especially for distributed systems and microservice architectures). The following quote from Baron Schwartz and Sean McCormick describes the difference from simple monitoring nicely: "Monitoring tells you whether the system works. Observability helps you in not just knowing a problem is happening, but knowing why it's happening and how I can go in and fix it". If you want to know more, a deeper insight into this topic can be found here and here. In terms of Data Observability, this means enabling data quality monitoring that alerts the data team early when pre-defined thresholds or other targets are not met.
The task for the data team is now to implement its own data observability solution that stands on several pillars, discussed in more detail here and here. To better illustrate this, let's return one last time to our "Gasoline Prices Germany" data product example and see how SLED's Analytics Workspace can help to monitor the health of the data product's tables, run data quality checks, and keep track of dependencies via data lineage.
Let's take a look at the state of the largest table, F_FUEL_PRICES, which contains multi-year historical fuel price changes for all gas stations across Germany. In the screenshot below you can see a summary with the most important information about the table itself, i.e. data volume, data freshness, usage (how much, top users), etc. The following two facts about the usability of this table are particularly insightful, especially for business analysts who value consistently good quality when using the data product: (1) the "seal icon" symbolizes that this table has been approved by the owner (an individual or the data team) for external use, and (2) all required DQ checks have been successfully completed. For this user experience, various background tasks can be scheduled in SLED, such as table discovery, parsing Snowflake's (very informative) Query History, or synchronizing metadata between the SLED catalog and Snowflake. Some results of these background tasks are directly visible in the table overview at column level, e.g. information about (3) data lineage and (4) value ranges, which look reasonable in our case.
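To illustrate the Query History part: usage information like "top users of a table" can in principle be derived with a query along these lines against Snowflake's ACCOUNT_USAGE views; this is a simplified sketch and not necessarily how SLED implements it.

-- Simplified sketch: approximate the top users of F_FUEL_PRICES over the last
-- 30 days by scanning the query history for references to the table name.
SELECT USER_NAME,
       COUNT(*) AS QUERY_COUNT
FROM   SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE  START_TIME >= DATEADD('day', -30, CURRENT_TIMESTAMP())
AND    QUERY_TEXT ILIKE '%F_FUEL_PRICES%'
GROUP BY USER_NAME
ORDER BY QUERY_COUNT DESC;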
Diving deeper, it is easy to see how to ensure data freshness, detect schema changes, and diagnose problematic dependencies with data lineage information. Per table, for example, Checks (5) can be activated as needed: standard checks with a simple mouse click, or specific checks with a custom SQL statement. Other interactive visualizations, such as the Timeline (6) to track table refreshes and Lineage to track upstream/downstream dependencies (7), can greatly support the day-to-day operations of a data team.
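A custom SQL check for our data product could, for example, flag implausible prices; the column names are assumptions, and the threshold is of course a matter of business definition.

-- Sketch of a custom data quality check (column names are assumptions):
-- a non-zero result would indicate implausible E10 prices and trigger an alert.
SELECT COUNT(*) AS FAILED_ROWS
FROM   SHARED_DATA_PRODUCT.FUEL.F_FUEL_PRICES
WHERE  E10 <= 0
   OR  E10 > 5;   -- EUR per litre outside a plausible range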
Conclusion: Analysts should do … analysis
Data analysts should do more of the work that matters and less of the work that doesn't. That includes spending more time on the analysis and interpretation of new insights, and less time on finding what data to use and dealing with data issues. Certainly not a particularly new insight: every corporate controller, for example, has wanted more time for actual analysis work for ages, long before anyone was talking about data products. But that is, after all, the goal that a data team should strive to achieve for the benefit of its internal "customers" and business partners interested in fact-based decision making. And that's exactly what this three-part series was intended for: to give hints, with the perspective of the end users in mind, on how their work can be specifically supported. Part 1 started with basic aspects (object tagging and documentation in Snowflake), followed by Part 2 with the focus on building, listing and using Snowflake-based data products, up to this article with related topics such as data observability - to keep the new house clean.
P.S.: A few weeks before the publication of this article, I started my own "electric mobility" adventure. Nevertheless, I will keep tracking the fuel price changes from the approximately 15,000 German gas stations for other demo purposes - this topic is always good for a new story.