Hot off the Presses - Navigating Generative AI, Data Product Development, Small Language Models, DataOps, and FinOps
Generative AI has exploded into the public arena, especially since the release of OpenAI's ChatGPT in November 2022. The product reached one hundred million monthly users after just two months, the fastest adoption of any consumer application in history. This series focuses on generative AI, rather than AI broadly, because data and analytics companies are quickly adopting the technology to enhance their products.
This blog is the first in a series: The Opportunity and Risk of Generative AI. The goal of the series is to help data analytics leaders understand the darker side of generative AI as they consider using it in their own enterprise. The first blog will provide an overview of generative AI and its risks. The second and third blogs will use the Responsible AI framework to examine the regulatory and ethical issues of generative AI. The final blog will offer recommendations and best practices for implementing generative AI in data analytics projects.
In this blog, you will learn what generative AI is and the common structures of generative AI systems. Then you'll get an overview of the types of risks that generative AI poses to companies. Finally, I will offer you two visions of a generative AI future to help you consider how today's governance decisions will determine which vision becomes real.
What is generative AI?
Let's start with the basics. Artificial intelligence (AI) is the ability of a machine to complete a human task reasonably well, often with the goal of matching or outperforming humans. Computers have long been able to do simple tasks like computation much faster and more accurately than humans. But recent advancements in hardware (like graphics cards), algorithms (like artificial neural networks), and data collection (we just have way, way more data these days) have enabled AI to take off.
Generative AI is a type of artificial intelligence that creates some type of digital media: text (in many languages), computer code, images, video, audio, synthetic data, and 3D models. The allure of generative AI is that it can support a huge number of use cases, like speeding up mundane tasks or helping brainstorm ideas. In the data analytics sphere, the most promising use cases include generating metadata and documentation, serving as an advanced digital assistant, and generating synthetic data for training models.
There are three major types of generative AI models:
These designs all share similar characteristics: a large amount of training data, a semi-automated training process, and an undetermined end structure. These characteristics help define the unique risks of generative AI models that we explore in the next section.
Eckerson Group research analyst Dan O'Brien is joined by Kevin Petrie, VP of Research, to discuss FinOps, a cost governance discipline for cloud-based analytics and operational projects.
Dan and Kevin discussed implementing disciplined, cross-functional cloud governance to manage cloud-related costs. Kevin also recommended the FinOps Foundation as a resource for building best practices.
An emerging approach to generative AI will help data engineering teams achieve much-needed productivity gains while controlling risk. This approach centers on what Eckerson Group calls the “small language model.”
This blog, the third in a series about language models for data engineering, describes this technique and an early offering from Informatica. It builds on the first and second blogs, which define use cases as well as risks and governance practices with language models. The fourth blog will conclude the series with guiding principles for data teams to achieve much-needed productivity benefits. Together these blogs explore the opportunity for language models to assist many aspects of data engineering, including data discovery, quality checks, ingestion, transformation, and documentation.
Let’s start by reviewing the definition of a large language model, a term that quickens the pulse of techies everywhere. A large language model (LLM) is a type of neural network that learns, summarizes, and generates content. Once trained, the LLM produces textual answers to natural language prompts, often returning sentences and paragraphs faster than humans speak.
Despite the aura of magic, such capabilities boil down to lots of basic number crunching. An LLM breaks reams of text down into "tokens," each representing a word, part of a word, or punctuation, then assigns a number to each token. During the training phase it studies how all the numbered tokens relate to one another in different contexts, and practices generating the next tokens in a string based on what came before. When OpenAI debuted ChatGPT in November 2022, the world started to comprehend the astonishing potential of its outputs.
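The token-and-number idea above can be sketched in a few lines of Python. This is a toy illustration only: real LLMs use subword tokenizers (such as byte-pair encoding) and neural networks with billions of parameters, whereas this sketch uses whole words and simple bigram counts to show the principle of predicting the next token from what came before.

```python
from collections import defaultdict

def tokenize(text):
    """Split text into word tokens and map each token to an integer ID."""
    tokens = text.lower().split()
    vocab = {}
    ids = []
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # assign a number to each new token
        ids.append(vocab[tok])
    return tokens, ids, vocab

def train_bigrams(tokens):
    """'Study' how tokens relate: count which token follows which."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, token):
    """Generate the most likely next token given the previous one."""
    followers = counts.get(token)
    if not followers:
        return None
    return max(followers, key=followers.get)

corpus = "data teams love data products and data teams ship data pipelines"
tokens, ids, vocab = tokenize(corpus)
model = train_bigrams(tokens)
print(predict_next(model, "data"))  # in this corpus, "data" is most often followed by "teams"
```

An LLM does the same thing at vastly greater scale, replacing the bigram counts with learned neural network weights over billions of token contexts.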
Enter the small language model
A small language model (SLM) uses the same techniques as an LLM, but applies them to a specific domain. It might use pre-trained LLM logic to start—either from a vendor or open source community—and then customize that logic further. The following chart compares LLMs and SLMs in terms of governance and specialization. This builds on a concept that founder Alex Ratner of Snorkel.ai shared at The Future of Data-Centric AI Conference earlier this month (although he did not use the SLM term).
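The "start from pre-trained logic, then customize" workflow can be sketched as follows. This is an illustrative stand-in, not a real fine-tuning API: the general and domain "models" here are just bigram counters, and the weighting scheme is hypothetical, but the shape of the workflow mirrors how an SLM adapts general language knowledge to a narrow domain.

```python
from collections import Counter

def train(corpus):
    """Build a toy bigram-count 'model' from a corpus."""
    words = corpus.lower().split()
    return Counter(zip(words, words[1:]))

# "Pre-trained" general-purpose statistics.
general = train("the bank of the river runs along the valley")

# Small, specific domain corpus (here: lending).
domain = train("the bank approves the loan and the bank audits the loan")

# Customize: weight domain evidence more heavily than general evidence.
# The 3x weight is an arbitrary illustrative choice.
slm = general + Counter({k: 3 * v for k, v in domain.items()})

def next_word(model, word):
    """Predict the most likely next word under a bigram-count model."""
    candidates = {b: c for (a, b), c in model.items() if a == word}
    return max(candidates, key=candidates.get) if candidates else None

print(next_word(general, "bank"))  # general model continues with "of" (river bank)
print(next_word(slm, "bank"))      # customized model prefers the lending-domain usage
```

The point of the sketch: the SLM keeps the general model's broad coverage but lets small amounts of domain data dominate where the domain has an opinion.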
A small language model applies LLM techniques to small, specific domains.
We can view LLMs and SLMs as two ends of a spectrum with overlap in between. Overall, SLMs distinguish themselves from LLMs in one or more of the following ways.
Data pipeline vendors are building SLMs with these capabilities now, often alongside LLMs, to help companies tackle specialized data engineering problems with better governance. This will help data teams boost productivity while reducing risks related to data quality, fairness, and explainability. We should get ready for a boom of small language models in data engineering and many other fields.
In a Complex Data Landscape, You Need DataOps and Data Engineering
The data ecosystem has become a data jungle waiting for a "datastrophe" to happen. As a case in point, Matt Turck's 2023 MAD (Machine Learning, AI, Data) landscape has close to 1,400 listed tools and frameworks, up from a mere 120 tools in 2012. The enterprise data landscape is crowded with shiny objects and dazzling buzzwords, all competing for customer attention and investment dollars. Through most of 2021, a data company got funding every 45 minutes.
Data teams struggle to create data products and deliver a functional, modern data experience in this disparate ecosystem. They need data engineering principles and DataOps processes to reduce the complexity of building end-to-end data products and to navigate and integrate the maze of data tools and frameworks across distributed platforms.
Why organizations need data engineering and DataOps
Data engineering has evolved beyond ETL to include new paradigms such as ELT and ETLT. Data pipelines are growing in number, volume, and complexity, with frameworks and tools constantly evolving and appearing. The lack of a single product or tool to build end-to-end data platforms, coupled with ever-increasing demands for faster delivery, is causing organizations to duct-tape different products and frameworks together. In the process they ignore data engineering and operationalization principles.
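The difference between these paradigms is largely one of ordering: ETL transforms data in flight before loading it, ELT loads raw data and transforms it inside the target, and ETLT does a light transform in flight followed by heavier transformation in the target. A minimal sketch, using hypothetical stand-in functions rather than any real warehouse API:

```python
def extract():
    """Stand-in for pulling rows from a source system."""
    return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "3.0"}]

def transform(rows):
    """Stand-in for business logic: cast string amounts to floats."""
    return [{**r, "amount": float(r["amount"])} for r in rows]

def load(rows, target):
    """Stand-in for writing rows into a target platform."""
    target.extend(rows)
    return target

# ETL: transform in flight, then load the cleaned rows.
etl_result = load(transform(extract()), [])

# ELT: load raw rows first, then transform inside the target.
elt_result = transform(load(extract(), []))

# ETLT: light transform in flight, heavier transform in the target.
etlt_result = transform(load(transform(extract()), []))

print(etl_result == elt_result)  # True: same rows, different ordering
```

In practice the ordering matters because it determines where compute happens, what raw history the target retains, and which tool in the stack owns each transformation.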
Look at Figure 1, which shows the vertical stack of a data platform. Each of these layers is subject to various types of drift, which causes additional complexity in maintaining compatibility, controlling versions, and managing dependencies. Each layer affects the others in subtle but significant ways. This creates numerous challenges for development and operationalization.
Figure 1. Data Platform Layers
Figure 2 shows a single end-to-end pipeline. Notice the number of handshake points between the systems, from ingestion all the way to consumption. Each system runs in a different data center, and most run on distributed platforms. Failures at the network or system level are difficult to identify, isolate, and fix in a coordinated way. For example:
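One recurring pattern at these handshake points is a transient failure between two systems. A minimal sketch of per-stage failure isolation, with a hypothetical stage function and simulated network error rather than any vendor's API, shows the kind of retry-and-log discipline each handoff needs:

```python
import time

def run_with_retry(stage, retries=3, delay=0.0):
    """Run a pipeline stage, retrying transient failures and logging each attempt."""
    for attempt in range(1, retries + 1):
        try:
            return stage()
        except ConnectionError as err:
            # Log the failure at the handshake so it can be isolated to this stage.
            print(f"attempt {attempt} failed at handshake: {err}")
            if attempt == retries:
                raise  # escalate after exhausting retries
            time.sleep(delay)

calls = {"n": 0}

def flaky_ingest():
    """Simulated ingestion step that fails once before succeeding."""
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("network drop between source and staging")
    return "rows ingested"

print(run_with_retry(flaky_ingest))  # succeeds on the second attempt
```

Without this kind of boundary-level instrumentation at every handshake, a failure deep in the pipeline surfaces only at consumption time, where it is hardest to trace back.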
As organizations strive to meet the ever-growing demand for data and analytics, they are adopting data products to streamline delivery and ensure solutions provide value to business stakeholders. Data products are reusable data assets designed for specific uses and delivered according to agreed-upon standards and schedules. They provide tailored results and offer flexibility to cope with ever-changing data technologies and business requirements. As such, they have emerged as a key component of modern analytics strategies.
Developing data products can be challenging. The approach requires a high degree of collaboration with stakeholders. While it's meant to streamline delivery, it also requires governance to protect quality. And creating data products involves many roles that must work together. In this article, we'll discuss four traps that can disrupt data product development and how to avoid falling into them: focusing on quantity over quality, ignoring stakeholder feedback, overlooking collaboration, and putting off governance.
Four Traps of Data Product Development
Trap #1: Focusing on Quantity Over Quality
For most organizations, the demand for data and analytics exceeds their capacity to deliver solutions. If you're considering a data product approach with a huge backlog of requests, beware of the trap of opting for quantity over quality. The iterative approach to developing data products can make it seem like you're blasting through the backlog, getting lots done, and making stakeholders happy. However, that feeling of victory won't last if your data products lack the quality and documentation to maintain your stakeholders' trust.
Minimal Viable Data Product (MVDP). It's important to define a set of standards that a data product must meet before it can be considered "done" and made available to data consumers. These common standards describe a minimal viable data product (MVDP). MVDP standards should ensure that stakeholders can discover, understand, access, and trust the data products they use. An MVDP must include documentation about itself, such as the value it delivers, how it should be used, the product owner, the lineage of its data, service level objectives (SLOs), quality measures, sample datasets, and how to address it through an API or query interface.
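An MVDP "definition of done" can be expressed as a simple checklist gate. The sketch below is illustrative: the field names are derived from the standards listed above, but the record shape and checker are hypothetical, not the schema of any particular data catalog or governance tool.

```python
# MVDP standards a data product must document before it counts as "done".
REQUIRED_MVDP_FIELDS = {
    "value_delivered", "usage_guidance", "product_owner",
    "data_lineage", "slos", "quality_measures",
    "sample_dataset", "access_interface",
}

def mvdp_gaps(product: dict) -> set:
    """Return the MVDP standards a data product has not yet documented."""
    return {f for f in REQUIRED_MVDP_FIELDS if not product.get(f)}

# A draft product with only partial documentation.
draft_product = {
    "value_delivered": "Daily churn scores for retention campaigns",
    "product_owner": "customer-analytics team",
    "access_interface": "REST API endpoint",
}

gaps = mvdp_gaps(draft_product)
print(sorted(gaps))  # standards still missing before release
```

Running a gate like this in the release process prevents the quantity-over-quality trap: a product that clears the backlog but fails the MVDP check never reaches consumers in an untrustworthy state.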
About Eckerson Group
Eckerson Group is a global research and consulting firm that focuses solely on data analytics. Our experts have substantial experience in data analytics and specialize in data strategy, data architecture, data management, data governance, data science, and data analytics.
Our clients say we are hard-working, insightful, and humble. This reputation stems from our love of data and our desire to help organizations optimize their data investments. We see ourselves as a family of continuous learners, interpreting the world of data and analytics for you.
Get more value from your data. Put an expert on your side. Learn what Eckerson Group can do for you!