Hot off the Presses - Navigating Generative AI, Data Product Development, Small Language Models, DataOps, and FinOps
Generative AI has exploded into the public arena, especially since the release of OpenAI's ChatGPT in November 2022. The product reached one hundred million monthly users after just two months, the fastest adoption of any consumer application in history. This series focuses on generative AI, rather than AI broadly, because data and analytics companies are quickly adopting the technology to enhance their products.
This blog is the first in a series: The Opportunity and Risk of Generative AI. The goal of the series is to help data analytics leaders understand the darker side of generative AI as they consider using it in their own enterprise. The first blog will provide an overview of generative AI and its risks. The second and third blogs will use the Responsible AI framework to examine the regulatory and ethical issues of generative AI. The final blog will offer recommendations and best practices for implementing generative AI in data analytics projects.
In this blog, you will learn what generative AI is and the common structures of generative AI systems. Then you'll get an overview of the types of risks that generative AI poses to companies. Finally, I will offer you two visions of a generative AI future to help you consider how today's governance decisions will determine which vision becomes real.
What is generative AI?
Let's start with the basics. Artificial intelligence (AI) is the ability of a machine to complete a human task reasonably well, often with the goal of matching or outperforming humans. Computers have long been able to do simple tasks like computation much faster and more accurately than humans. But recent advancements in hardware (like graphics cards), algorithms (like artificial neural networks), and data collection (we just have way, way more data these days) have enabled AI to take off.
Generative AI is a type of artificial intelligence that creates some type of digital media: text (in many languages), computer code, images, video, audio, synthetic data, and 3D models. The allure of generative AI is that it can support a huge number of use cases, like speeding up mundane tasks or helping brainstorm ideas. In the data analytics sphere, the most promising use cases include generating metadata and documentation, serving as an advanced digital assistant, and generating synthetic data for training models.
There are three major types of generative AI models:
These designs all share similar characteristics: a large amount of training data, a semi-automated training process, and an undetermined end structure. These characteristics help define the unique risks of generative AI models that we explore in the next section.
Eckerson Group research analyst Dan O'Brien is joined by Kevin Petrie, VP of Research, to discuss FinOps, a cost governance discipline for cloud-based analytics and operational projects.
Dan and Kevin discussed implementing disciplined, cross-functional cloud governance to manage cloud-related costs. Kevin also recommended the FinOps Foundation as a resource for building best practices.
An emerging approach to generative AI will help data engineering teams achieve much-needed productivity gains while controlling risk. This approach centers on what Eckerson Group calls the “small language model.”
This blog, the third in a series about language models for data engineering, describes this technique and an early offering from Informatica. It builds on the first and second blogs, which define use cases as well as risks and governance practices with language models. The fourth blog will conclude the series with guiding principles for data teams to achieve much-needed productivity benefits. Together these blogs explore the opportunity for language models to assist many aspects of data engineering, including data discovery, quality checks, ingestion, transformation, and documentation.
Let’s start by reviewing the definition of a large language model, a term that quickens the pulse of techies everywhere. A large language model (LLM) is a type of neural network that learns, summarizes, and generates content. Once trained, the LLM produces textual answers to natural language prompts, often returning sentences and paragraphs faster than humans speak.
Despite the aura of magic, such capabilities boil down to lots of basic number crunching. An LLM breaks reams of text down into "tokens," each representing a word, part of a word, or punctuation, then assigns a number to each token. During the training phase it studies how all the numbered tokens relate to one another in different contexts, and practices generating the next tokens in a string based on what came before. When OpenAI debuted ChatGPT in November 2022, the world started to comprehend the astonishing potential of its outputs.
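The token-and-number idea above can be sketched in a few lines of Python. This is a toy illustration only: real LLMs use subword tokenizers (such as byte-pair encoding) and neural networks with billions of parameters, whereas this sketch uses whole words and simple bigram counts to show the principle of predicting the next token from what came before.

```python
from collections import defaultdict

def tokenize(text):
    """Split text into word tokens and map each token to an integer ID."""
    tokens = text.lower().split()
    vocab = {}
    ids = []
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # assign a number to each new token
        ids.append(vocab[tok])
    return tokens, ids, vocab

def train_bigrams(tokens):
    """'Study' how tokens relate: count which token follows which."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, token):
    """Generate the most likely next token given the previous one."""
    followers = counts.get(token)
    if not followers:
        return None
    return max(followers, key=followers.get)

corpus = "data teams love data products and data teams ship data pipelines"
tokens, ids, vocab = tokenize(corpus)
model = train_bigrams(tokens)
print(predict_next(model, "data"))  # in this corpus, "data" is most often followed by "teams"
```

An LLM does the same thing at vastly greater scale, replacing the bigram counts with learned neural network weights over billions of token contexts.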
Enter the small language model
A small language model (SLM) uses the same techniques as an LLM, but applies them to a specific domain. It might use pre-trained LLM logic to start—either from a vendor or open source community—and then customize that logic further. The following chart compares LLMs and SLMs in terms of governance and specialization. This builds on a concept that founder Alex Ratner of Snorkel.ai shared at The Future of Data-Centric AI Conference earlier this month (although he did not use the SLM term).
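The "start from pre-trained logic, then customize" workflow can be sketched as follows. This is an illustrative stand-in, not a real fine-tuning API: the general and domain "models" here are just bigram counters, and the weighting scheme is hypothetical, but the shape of the workflow mirrors how an SLM adapts general language knowledge to a narrow domain.

```python
from collections import Counter

def train(corpus):
    """Build a toy bigram-count 'model' from a corpus."""
    words = corpus.lower().split()
    return Counter(zip(words, words[1:]))

# "Pre-trained" general-purpose statistics.
general = train("the bank of the river runs along the valley")

# Small, specific domain corpus (here: lending).
domain = train("the bank approves the loan and the bank audits the loan")

# Customize: weight domain evidence more heavily than general evidence.
# The 3x weight is an arbitrary illustrative choice.
slm = general + Counter({k: 3 * v for k, v in domain.items()})

def next_word(model, word):
    """Predict the most likely next word under a bigram-count model."""
    candidates = {b: c for (a, b), c in model.items() if a == word}
    return max(candidates, key=candidates.get) if candidates else None

print(next_word(general, "bank"))  # general model continues with "of" (river bank)
print(next_word(slm, "bank"))      # customized model prefers the lending-domain usage
```

The point of the sketch: the SLM keeps the general model's broad coverage but lets small amounts of domain data dominate where the domain has an opinion.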
A small language model applies LLM techniques to small, specific domains.
We can view LLMs and SLMs as two ends of a spectrum with overlap in between. Overall, SLMs distinguish themselves from LLMs in one or more of the following ways.
Data pipeline vendors are building SLMs with these capabilities now, often alongside LLMs, to help companies tackle specialized data engineering problems with better governance. This will help data teams boost productivity while reducing risks related to data quality, fairness, and explainability. We should get ready for a boom of small language models in data engineering and many other fields.
In a Complex Data Landscape, You Need DataOps and Data Engineering
The data ecosystem has become a data jungle waiting for a "datastrophe" to happen. As a case in point, Matt Turck's 2023 MAD (Machine Learning, AI, Data) landscape has close to 1,400 listed tools and frameworks, up from a mere 120 tools in 2012. The enterprise data landscape is crowded with shiny objects and dazzling buzzwords, all competing for customer attention and investment dollars. Through most of 2021, a data company got funding every 45 minutes.
Data teams struggle to create data products and deliver a functional, modern data experience in this disparate ecosystem. They need data engineering principles and DataOps processes to reduce the complexity of building end-to-end data products and to navigate and integrate the maze of data tools and frameworks across distributed platforms.
Why organizations need data engineering and DataOps
Data engineering has evolved beyond ETL to include new paradigms such as ELT and ETLT. Data pipelines are growing in number, volume, and complexity, with frameworks and tools constantly evolving and appearing. The lack of a single product or tool to build end-to-end data platforms, coupled with ever-increasing demands for faster delivery, is causing organizations to duct-tape different products and frameworks together. In the process they ignore data engineering and operationalization principles.
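The difference between these paradigms is largely one of ordering: ETL transforms data in flight before loading it, ELT loads raw data and transforms it inside the target, and ETLT does a light transform in flight followed by heavier transformation in the target. A minimal sketch, using hypothetical stand-in functions rather than any real warehouse API:

```python
def extract():
    """Stand-in for pulling rows from a source system."""
    return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "3.0"}]

def transform(rows):
    """Stand-in for business logic: cast string amounts to floats."""
    return [{**r, "amount": float(r["amount"])} for r in rows]

def load(rows, target):
    """Stand-in for writing rows into a target platform."""
    target.extend(rows)
    return target

# ETL: transform in flight, then load the cleaned rows.
etl_result = load(transform(extract()), [])

# ELT: load raw rows first, then transform inside the target.
elt_result = transform(load(extract(), []))

# ETLT: light transform in flight, heavier transform in the target.
etlt_result = transform(load(transform(extract()), []))

print(etl_result == elt_result)  # True: same rows, different ordering
```

In practice the ordering matters because it determines where compute happens, what raw history the target retains, and which tool in the stack owns each transformation.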
Look at Figure 1, which shows the vertical stack of a data platform. Each of these layers is subject to various types of drift, which causes additional complexity in maintaining compatibility, controlling versions, and managing dependencies. Each layer affects the others in subtle but significant ways. This creates numerous challenges for development and operationalization.
Figure 1. Data Platform Layers
Figure 2 shows a single end-to-end pipeline. Notice the number of handshake points between the systems, from ingestion all the way to consumption. Each system runs in a different data center, and most run on distributed platforms. Failures at the network or system level are difficult to identify, isolate, and fix in a coordinated way. For example:
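One recurring pattern at these handshake points is a transient failure between two systems. A minimal sketch of per-stage failure isolation, with a hypothetical stage function and simulated network error rather than any vendor's API, shows the kind of retry-and-log discipline each handoff needs:

```python
import time

def run_with_retry(stage, retries=3, delay=0.0):
    """Run a pipeline stage, retrying transient failures and logging each attempt."""
    for attempt in range(1, retries + 1):
        try:
            return stage()
        except ConnectionError as err:
            # Log the failure at the handshake so it can be isolated to this stage.
            print(f"attempt {attempt} failed at handshake: {err}")
            if attempt == retries:
                raise  # escalate after exhausting retries
            time.sleep(delay)

calls = {"n": 0}

def flaky_ingest():
    """Simulated ingestion step that fails once before succeeding."""
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("network drop between source and staging")
    return "rows ingested"

print(run_with_retry(flaky_ingest))  # succeeds on the second attempt
```

Without this kind of boundary-level instrumentation at every handshake, a failure deep in the pipeline surfaces only at consumption time, where it is hardest to trace back.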
As organizations strive to meet the ever-growing demand for data and analytics, they are adopting data products to streamline delivery and ensure solutions provide value to business stakeholders. Data products are reusable data assets designed for specific uses and delivered according to agreed-upon standards and schedules. They provide tailored results and offer flexibility to cope with ever-changing data technologies and business requirements. As such, they have emerged as a key component of modern analytics strategies.
Developing data products can be challenging. The approach requires a high degree of collaboration with stakeholders. While it's meant to streamline delivery, it also requires governance to protect quality. And creating data products involves many roles that must work together. In this article, we'll discuss four traps that can disrupt data product development and how to avoid falling into them: focusing on quantity over quality, ignoring stakeholder feedback, overlooking collaboration, and putting off governance.
Four Traps of Data Product Development
Trap #1: Focusing on Quantity Over Quality
For most organizations, the demand for data and analytics exceeds their capacity to deliver solutions. If you're considering a data product approach with a huge backlog of requests, beware of the trap of opting for quantity over quality. The iterative approach to developing data products can make it seem like you're blasting through the backlog, getting lots done, and making stakeholders happy. However, that feeling of victory won't last if your data products lack the quality and documentation to maintain your stakeholders' trust.
Minimal Viable Data Product (MVDP). It's important to define a set of standards that a data product must meet before it can be considered "done" and made available to data consumers. These common standards describe a minimal viable data product (MVDP). MVDP standards should ensure that stakeholders can discover, understand, access, and trust the data products they use. An MVDP must include documentation about itself, such as the value it delivers, how it should be used, the product owner, the lineage of its data, service level objectives (SLOs), quality measures, sample datasets, and how to address it through an API or query interface.
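An MVDP "definition of done" can be expressed as a simple checklist gate. The sketch below is illustrative: the field names are derived from the standards listed above, but the record shape and checker are hypothetical, not the schema of any particular data catalog or governance tool.

```python
# MVDP standards a data product must document before it counts as "done".
REQUIRED_MVDP_FIELDS = {
    "value_delivered", "usage_guidance", "product_owner",
    "data_lineage", "slos", "quality_measures",
    "sample_dataset", "access_interface",
}

def mvdp_gaps(product: dict) -> set:
    """Return the MVDP standards a data product has not yet documented."""
    return {f for f in REQUIRED_MVDP_FIELDS if not product.get(f)}

# A draft product with only partial documentation.
draft_product = {
    "value_delivered": "Daily churn scores for retention campaigns",
    "product_owner": "customer-analytics team",
    "access_interface": "REST API endpoint",
}

gaps = mvdp_gaps(draft_product)
print(sorted(gaps))  # standards still missing before release
```

Running a gate like this in the release process prevents the quantity-over-quality trap: a product that clears the backlog but fails the MVDP check never reaches consumers in an untrustworthy state.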
About Eckerson Group
Eckerson Group is a global research and consulting firm that focuses solely on data analytics. Our experts have substantial experience in data analytics and specialize in data strategy, data architecture, data management, data governance, data science, and data analytics.
Our clients say we are hard-working, insightful, and humble. This reputation stems from our love of data and our desire to help organizations optimize their data investments. We see ourselves as a family of continuous learners, interpreting the world of data and analytics for you.
Get more value from your data. Put an expert on your side. Learn what Eckerson Group can do for you!