Hot off the Presses - Defining Data Products, Taming the AI Frontier, Driving ROI with MDM, Succeeding with Large Language Models, Data Fabric

1. Let’s Be Clear: A Data Asset is Not a Data Product

By Wayne Eckerson

If there is one thing in which our industry excels, it’s concocting new terms for existing things. The latest is “data product.” Today, many people use that term to describe data assets: datasets, SQL queries, dashboards, reports, ML models, or data components. But these types of assets have existed for decades. So, why the name change?

I suppose the term “data product” sounds more important and valuable than “data asset.” Or maybe it’s that a data product sounds like something that a sophisticated data team produces. Or maybe it’s because data is now an intrinsic part of honest-to-goodness digital products that generate revenue. Whatever the case, we shouldn’t call data assets “data products” just because it’s trendy.

What is a Data Product?

I believe there is a subtle but fundamental difference between a data asset and a data product. This might be a tad radical, but bear with me. I think a data product is a data asset that has all the characteristics of something that can be bought and sold in a store. Let me explain.

Most products in the real world are found in digital or brick-and-mortar stores. The store is a central point for customers to shop for products and for sellers to connect with customers, removing the friction between buying and selling. Until they reach a store, products are just assets or inventory. Once in a store, products possess certain characteristics that facilitate the shopping process: they are standardized, packaged, “shoppable,” deliverable, and returnable. (See Figure 1.)

Figure 1. Characteristics of a Product

I believe data products work the same way. A data product is just an asset until it resides in a data store. There, it acquires new characteristics: a SKU, unique metadata (including subscription and delivery options), and terms of service/use that spell out a bidirectional, binding contract as part of a formal transaction. In addition, data producers can charge for the product, give it away for free, or assess a chargeback fee if it’s an internal transaction. In essence, a data product looks, smells, and acts like any product available for purchase in a local grocery, hardware, or retail store—except that it’s digital and data.
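To make the idea concrete, here is a minimal sketch of what a single entry in such a data store might capture. The field names and values are illustrative assumptions, not a standard:

```python
from dataclasses import dataclass

@dataclass
class DataProduct:
    """One 'shelf item' in an internal data store (illustrative fields)."""
    sku: str                     # unique identifier, like a retail SKU
    name: str
    owner: str                   # producing team accountable for the product
    terms_of_use: str            # the bidirectional, binding contract
    delivery_options: list[str]  # e.g., SQL view, file export, subscription
    price: float = 0.0           # 0.0 = free; could also be a chargeback fee

# A hypothetical catalog entry a consumer might browse in the data store.
product = DataProduct(
    sku="CUST-360-001",
    name="Customer 360 (golden records)",
    owner="Data Platform Team",
    terms_of_use="Internal use only; PII masked; refreshed daily.",
    delivery_options=["SQL view", "REST API"],
)
print(f"{product.sku}: {product.name} ({product.price or 'free'})")
```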

The data store makes it easy for data producers to create, publish, and distribute data products and for data consumers to browse, evaluate, compare, and acquire them. The focus here is external, on what customers do with data assets, not internal, on how developers build them, how stewards govern them, or how IT staff monitor them. Companies ought to build data assets with the same rigor as data products. Right? A data asset is a prerequisite for a data product, but a data asset doesn’t become a data product until it lands on the digital shelves of a data store (or the price list of a vendor). Essentially, a data product is defined by its transactional nature and by where it resides and how it is used.

The Role of an Internal Data Store

Most people are familiar with public data marketplaces operated by Amazon, Snowflake, and commercial data providers such as CoreLogic, Acxiom, and LiveRamp. These serve a purpose, mainly to promote commercial data products from data brokers. But they don’t meet the needs of most organizations that want to share data assets broadly within the company, and perhaps externally as well. What our industry really needs are internal data stores that make it easy for internal data producers to create and publish data products that data consumers can find and use.

Benefits. Without an internal data store, data owners become overwhelmed with requests for data that suck up valuable time and resources and inject huge delays into the delivery and consumption of valuable data assets. A data store broadens access to core data assets while eliminating the manual, time-consuming process of reviewing each data request to ensure data security. Data producers create data products once and distribute them many times without human intervention. Data consumers browse, evaluate, and acquire data products without having to request permission and wait for delivery.

>Continue reading here.

2. Enterprise Data and the Taming of the Generative AI Frontier

By Kevin Petrie

Sponsored by Prophecy

At noon on a spring day in 1889, thousands of settlers raced across the prairie to grab a stake in the Oklahoma Land Rush. Farms, schools, and churches soon followed, taming the frontier.

In November 2022, OpenAI opened a new frontier by releasing ChatGPT and demonstrating the possibilities of generative AI. As early adopters rush to embrace AI language models, companies are rapidly devising ways to tame them. The answer might lie in their own enterprise data—and the governance programs they use to control it.

Super smart

Let’s start with the definition of a language model, the heart of generative AI.

A language model (LM) is a type of neural network that summarizes and generates content by learning how words relate to one another. A trained LM produces textual answers to natural language questions, or “prompts,” often returning sentences and paragraphs faster than humans can speak. While ChatGPT gets the headlines, other notable LMs include Bard by Google and various open source models such as LLaMA by Meta and BLOOM by Hugging Face. These “large” language models derive from massive training inputs with billions of “parameters” that describe the interrelationships of words.

Innovative

Language models might unleash a new world of innovation. Knowledge workers of all types already use LMs as productivity assistants. For example, 43% of 61 data practitioners told Eckerson Group in a recent poll that they already use LMs to help document environments, build starter code for data pipelines, and learn new techniques. In another poll, 73% of 40 early adopters said LMs make them up to 30% more productive. (These results filter respondents by job title.) It’s no surprise, then, that Databricks acquired the startup MosaicML for $1.3 billion to help companies put language models into production.

…But wild

This frontier, of course, has a wild side: LMs make things up—i.e., “hallucinate”—when they don’t have the right inputs. An LM generates strings of words that become logical sentences and paragraphs. But it doesn’t “know” anything in the human sense of the term; rather, it takes guesses based on the statistical interrelationships of words it has studied. This becomes a big problem when users pose detailed questions to LMs that lack enterprise-specific context because they were trained only on public data.
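To see the mechanics on a toy scale, the sketch below builds a tiny bigram “language model” from a few made-up sentences (the corpus is an illustrative assumption, and real LMs train on billions of documents). It chains together statistically likely next words, producing fluent-looking output with no grounding in fact, which is hallucination writ small:

```python
import random
from collections import defaultdict

# A tiny stand-in for training data (illustrative only).
corpus = (
    "revenue grew last quarter . revenue fell last year . "
    "the model predicts revenue growth ."
).split()

# Record which words follow which: the statistical
# interrelationships of words described above.
follows = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev].append(nxt)

def generate(start: str, length: int = 8) -> str:
    """Extend a prompt by repeatedly sampling a plausible next word."""
    words = [start]
    for _ in range(length):
        options = follows.get(words[-1])
        if not options:
            break
        words.append(random.choice(options))
    return " ".join(words)

# Fluent-looking output, but the model "knows" nothing about revenue.
print(generate("revenue"))
```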

So when you put an LM to work in your enterprise environment, you create new risks for data quality, privacy, intellectual property, fairness, and explainability. Companies must adapt their data governance programs to mitigate these risks. They must feed accurate data into the LMs as part of the training process, natural-language prompting, or both.
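On the prompting side, one common pattern is to enrich each prompt with vetted records so the model answers from governed enterprise data rather than guessing. A minimal sketch, with hypothetical records and wording:

```python
def build_grounded_prompt(question: str, records: list[str]) -> str:
    """Prepend vetted enterprise records so the model answers from
    governed data instead of guessing from its public training data."""
    context = "\n".join(f"- {r}" for r in records)
    return (
        "Answer using ONLY the records below. If they do not contain "
        "the answer, say so.\n\n"
        f"Records:\n{context}\n\n"
        f"Question: {question}"
    )

# Hypothetical records pulled from a governed, quality-checked source.
records = [
    "Q2 FY24 churn rate (EMEA): 3.1%",
    "Q2 FY24 churn rate (Americas): 2.4%",
]
print(build_grounded_prompt("What was EMEA churn in Q2?", records))
```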

>Continue reading here.

3. Driving ROI with Master Data Management, Part II: Your First Project

By Kevin Petrie

While members of the United Nations speak hundreds of languages, they manage to conduct official business with just six. This proved the best way for them to streamline time, resources, and risk while keeping everyone on the same page. Similarly, companies invest in master data management (MDM) to help different teams conduct their official business. There is no perfect formula; each organization must decide how MDM can deliver the right return on investment given its situation.

Master data management (MDM) comprises practices and tools that aim for a single source of truth with consistent, trusted records for key business entities. MDM tools match and merge data from various source systems to create standard attributes and terms that describe entities such as products, customers, and partners. The resulting “golden records” strengthen data governance programs by reducing duplicates and resolving discrepancies.
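To make “match and merge” concrete, here is a simplified sketch of the idea. The records, matching rule, and survivorship logic are illustrative assumptions, not how any particular MDM tool works:

```python
import re

# Hypothetical records for the same customer in two source systems.
crm = {"name": "ACME Corp.", "phone": "555-0100", "city": "Boston"}
billing = {"name": "Acme Corporation", "phone": "(555) 0100", "city": ""}

def normalize(value: str) -> str:
    """Standardize an attribute so equivalent values compare equal."""
    return re.sub(r"[^a-z0-9]", "", value.lower())

def is_match(a: dict, b: dict) -> bool:
    """A deliberately simple matching rule: same normalized phone."""
    return normalize(a["phone"]) == normalize(b["phone"])

def merge(records: list[dict]) -> dict:
    """Survivorship rule: keep the first non-empty value per attribute."""
    return {
        field: next((r[field] for r in records if r[field]), "")
        for field in records[0]
    }

if is_match(crm, billing):
    golden = merge([crm, billing])
    print(golden)  # {'name': 'ACME Corp.', 'phone': '555-0100', 'city': 'Boston'}
```

Real MDM tools apply far richer matching (fuzzy name comparison, address standardization) and configurable survivorship rules, but the shape of the computation is the same.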

This blog, the second in a series, explores how companies can achieve the right return on investment (ROI) with MDM by selecting the right architectural approach and measuring success while executing their first project. It builds on the first blog, which helps companies prepare an overall business case for MDM. The third and final blog will recommend ways to iterate with subsequent projects based on the successes and lessons learned from the initial project.

The value of MDM

MDM delivers value by reducing the risk, time, and resources required to process data for analytics and operational workloads. Given the inherent complexity of the endeavor, MDM projects tend to make things worse during implementation but improve them afterwards. A successful project should have the following aggregate impact over time:

(Figure: the aggregate impact of a successful MDM project on risk, time, and resources over time.)

As with any technology project, messy details tend to get in the way and threaten ROI. Let’s consider how data teams can meet or exceed ROI goals by selecting the right architectural approach and measuring the right key performance indicators. (To understand the MDM implementation process in detail, also check out the Rapid Delivery Blueprint that Semarchy wrote based on more than 100 projects over a decade.)

Architectural approach

The architectural approach has a big impact on data management processes and therefore on the ROI of an MDM project. The data engineer should work with business owners, data consumers, and IT/CloudOps engineers to evaluate how each approach can streamline risk, time, and resources to support their company’s data environment and priorities.

The primary approaches are registry, consolidation, coexistence, and centralization. Here is a summary of these approaches, which Semarchy also explores in more detail in a recent blog.

>Continue reading here.

4. Should AI Bots Build Your Data Pipelines? Part IV: Guiding Principles for Success with Language Models and Data Engineering

By Kevin Petrie

The irony of adding robots to your team is that they need lots of care and feeding, as do the humans who manage them. This holds true for data teams that use language models to build and manage data pipelines.

This blog, the fourth and final installment of our series, recommends guiding principles for successful implementation of language models to assist data engineering. The first blog defined language models and use cases; the second explored risks; and the third described the emergence of “small language models” that reduce those risks. Together, these blogs explore why and how language models—the most popular form of generative AI—make data engineers more productive as they discover, validate, ingest, transform, and document data for analytics.

To recap, a large language model (LLM) is a type of neural network that learns, summarizes, and generates content based on statistical calculations about how words relate to one another. Once trained, the LLM produces textual answers to natural language prompts, often returning sentences and paragraphs faster than humans speak. Examples of LLMs include ChatGPT from OpenAI, Bard from Google, and BLOOM from Hugging Face. A small language model (SLM) applies the same techniques as an LLM, but also uses fine-tuning, enriched prompts, or augmented outputs to support more specialized use cases in a more governed fashion. Data vendors such as Informatica, TimeXtender, and Illumex offer SLMs. We can expect SLMs to become the standard approach to language models for data engineering.

Data leaders and engineers should adopt five guiding principles to achieve their intended results with language models, including LLMs and SLMs. They should manage these models like assistants; compare costs and benefits; embrace prompt engineering; train their teams; and adapt data governance programs to address language models. Let’s walk through each principle in turn.

Guiding Principle 1. Manage your language model like an assistant

A startup founder recently told me he has a team of six, but it feels like more because they use LLMs to assist with tasks such as brainstorming and content creation. He has the right mindset. Language model tools are employees that we manage to improve team productivity. These tools do not replace humans; on the contrary, they need expert human oversight. An LLM is akin to a 20-year-old savant, long on cognitive powers but short on real-world judgment. He can do great things provided his manager trains him well, inspects his work, and incorporates it into approved organizational processes. Data engineers and other practitioners should manage their language models in a similar way.

Guiding Principle 2. Compare costs and benefits

Any data engineer who has tried LLMs or early-stage SLM offerings understands the primary benefit well: productivity. LLMs can document data environments, build starter code for pipelines, find relevant public datasets, and so on, all of which helps overwhelmed data teams get more work done in less time. Other potential benefits include education on new techniques and the elevation of team members to more strategic roles. For example, data engineers might become more like data architects as they enter natural language commands to design pipelines rather than writing scripts from scratch.
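As a sketch of that workflow, the snippet below asks a hosted LLM for starter pipeline code from a natural language description. It uses the OpenAI Python SDK; the model choice and prompt wording are illustrative assumptions, and the draft still needs expert human review:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# An illustrative natural language request for starter pipeline code.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{
        "role": "user",
        "content": (
            "Write starter Python code for a pipeline that ingests a "
            "CSV of daily orders, drops rows with null order_id, and "
            "loads the result into a Postgres table named orders_clean."
        ),
    }],
)

# The draft goes to a human for review, per Guiding Principle 1.
print(response.choices[0].message.content)
```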

Then we come to costs. Open source software and free or discounted vendor prototypes minimize the upfront costs of language models. However, the bigger costs are the risks language models pose to data quality, privacy, intellectual property (IP), bias, and explainability—and the time it takes your team to mitigate those risks. Practitioners must inspect the quality of language model outputs to ensure they don’t break pipelines, deliver bad data to the business, or compromise privacy. They must ensure they don’t mishandle IP or propagate bias. Given the opacity of language models, teams also might need more time to explain how they handle data so that business owners and external stakeholders, such as investors, auditors, and customers, have peace of mind.
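Some of that inspection can be automated. Here is a minimal sketch of a guardrail that screens model-generated SQL before it enters a pipeline; the function and its checks are illustrative assumptions, and a real validator would run against a replica of the production schema:

```python
import sqlite3

def passes_basic_checks(generated_sql: str) -> bool:
    """Reject model-generated SQL that is non-read-only or
    unparseable before it ever reaches a pipeline."""
    statement = generated_sql.strip().rstrip(";")
    if not statement.lower().startswith("select"):
        return False  # allow read-only queries only
    try:
        # EXPLAIN plans the query without executing it; in practice,
        # run this against a replica of the production schema.
        sqlite3.connect(":memory:").execute(f"EXPLAIN {statement}")
    except sqlite3.Error:
        return False
    return True

print(passes_basic_checks("SELECT 1"))           # True
print(passes_basic_checks("DROP TABLE users;"))  # False
```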

>Continue reading here.

5. Analyst Series - Data Fabric: The Next Step in the Evolution of Data Architectures


By Daniel O'Brien and Jay Piscioneri

Summary

  • Dan and Jay discussed the concept of data fabric, an automated, AI-driven approach to managing modern data environments, and compared it to another new data architecture, data mesh, which distributes responsibility for data to functional domains within a business.
  • They contrasted the two approaches: data mesh emphasizes a domain-oriented organization of data and a granular, distributed approach to managing data products, while data fabric relies on a centralized approach and abstracted data objects to de-emphasize the location and format of data.
  • They explained how artificial intelligence and machine learning are core to the data fabric, automating functions such as identifying sensitive data and preparing pipelines, and how abstracted data objects can be created as either virtualized or persistent objects to manage the volume and velocity of data.
  • Finally, they weighed the benefits and risks of implementing a data fabric, which involves integrating different products to work together in a coordinated way. Jay emphasized that data fabric is not a product but a matter of integration, although product suites are now available from different companies.

>Listen to the podcast episode here.


About Eckerson Group

Eckerson Group is a global research and consulting firm that focuses solely on data analytics. Our experts have substantial experience in data analytics and specialize in data strategy, data architecture, data management, data governance, data science, and data analytics.

Our clients say we are hard-working, insightful, and humble. It stems from our love of data and desire to help organizations optimize their data investments. We see ourselves as a family of continuous learners, interpreting the world of data and analytics for you.

Get more value from your data. Put an expert on your side. Learn what Eckerson Group can do for you!


