Hot off the Presses - Data Democratization, Data Products, Semantic Layer, Data Modeling, and Generative AI
1. Data Democratization and the Duties of Data Citizenship
For decades we’ve tried to empower enterprise stakeholders with data to spot problems and opportunities and make better decisions about how to respond. Data democratization is the latest buzzword to describe this elusive goal. While there have been advances in data management, governance, and analytics, something keeps getting in the way of achieving data democratization.
The solutions that data industry vendors and analysts often propose are about technical components and approaches, such as data platforms, architectures, formats, programming languages, and now generative AI. This makes sense because we need new technical approaches to address the massive scale and complexity of modern data. But technical solutions, while necessary, are not sufficient.
The Human Factor
One constant barrier to data democratization receives far less attention: the human factor. Thomas Jefferson wrote about the human factor in fostering political democracy:
“Whenever the people are well-informed, they can be trusted with their own government.”
Jefferson said that for democracy to work, the citizenry must have a basic level of education and be well-informed on the issues of the day. The same is true of a data democracy. For data to empower people, people must understand how to use it. They must know the conclusions they can draw from it, the suppositions it does not support, and most importantly, how to care for data to protect its quality and secure it from theft and misuse.
2. Data Products Part II: Data Products Require Product Thinking
According to a recent LinkedIn poll I conducted, 54% of respondents said their organization has implemented data products. That’s a surprisingly high percentage, and it led me to ask how people define a data product. As I said in my prior blog, a data product is not a data asset with better quality or governance. It’s the output of a data product organization that oversees the complete lifecycle of a data product and manages its availability in an automated way.
The organizations that have implemented bona fide data products say that the biggest challenge is creating a “product mindset.” This is ironic because all commercial organizations have product teams that define and package products and services. Yet product thinking and experience are almost non-existent among data and analytics professionals. Hence, adopting a data product strategy is an uphill climb for most data teams.
Product Thinking
So, how do you instill a product mindset among data people? The chief data officer can’t just declare that the data team is now a product organization. They can’t simply appoint people to serve as product managers and owners and expect that a product culture will grow in alien soil. Without ample education and coaching, data professionals and the businesspeople they support will simply resort to old ways of doing things. If there is a product team, it will be in name only.
Here are a few tips to nurture a product mindset and develop a bona fide product organization:
- Seed product experts on data teams.
- Train and coach new product managers.
- Create a central product team.
The good news is that creating a product organization is not rocket science. It doesn’t take an inordinate amount of time to train and coach product managers. However, it’s imperative that the organization properly defines roles and processes to support the management and marketing of data products.
3. Why and How to Enable Data Science with an Independent Semantic Layer
By Kevin Petrie
Sponsored by TimeXtender
Nobel Prize-winning author Thomas Mann observed that “order and simplification are the first steps toward the mastery of a subject.” While Mann died in 1955, his observation perfectly describes the challenge of modern analytics: how do you simplify complex data into something business managers can understand?
The semantic layer is an abstraction layer that aims to do just that. It derives consistent business metrics from underlying data and presents them to BI tools, AI/ML tools such as notebooks, or applications containing analytical functions. Many BI, data warehouse, and data lake products include a semantic layer. However, most practitioners—57% according to a recent poll by Eckerson Group—now prefer an independent semantic layer that can unify data from all their platforms. TimeXtender, for example, offers an independent semantic layer as part of its data management product.
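To make the idea concrete, here is a minimal sketch of how a semantic layer might register a shared metric definition that BI tools, notebooks, and applications all resolve the same way. The registry, metric name, and SQL expression are hypothetical illustrations, not any vendor’s API:

```python
from dataclasses import dataclass


@dataclass
class Metric:
    """A business metric defined once in the semantic layer."""
    name: str
    sql: str          # expression over governed source tables
    description: str


# Hypothetical metric registry shared by every downstream consumer.
SEMANTIC_LAYER = {
    "avg_revenue_per_rep": Metric(
        name="avg_revenue_per_rep",
        sql=(
            "SELECT region, SUM(revenue) / COUNT(DISTINCT rep_id) "
            "FROM sales GROUP BY region"
        ),
        description="Average revenue per sales rep by region",
    ),
}


def get_metric_sql(metric_name: str) -> str:
    """Every tool resolves the same definition, so the metric stays consistent."""
    return SEMANTIC_LAYER[metric_name].sql
```

Because each BI dashboard or notebook calls `get_metric_sql` rather than hand-writing its own query, the metric’s definition cannot drift between tools.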
The need for an independent semantic layer continues to rise as data science gains traction in the enterprise. This blog examines how it supports AI/ML use cases as part of data science projects. We’ll consider the five primary elements of a semantic layer: metrics, caching, metadata management, application programming interfaces (APIs), and access controls.
Metrics
The semantic layer presents metrics such as average revenue per sales rep by region, operating costs per factory, or annual growth in unit sales per customer. These metrics describe market trends and business performance as part of BI projects, and also serve as features for ML models as part of data science projects. For example, data scientists might identify and refine features such as annual sales per customer, transaction size vs. average, or historical market prices. The semantic layer then presents these metrics and their values to ML models that segment customers, detect fraudulent transactions, predict market prices, and so on.
The semantic layer presents metrics that serve as features for machine learning models
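As a rough illustration of metrics doubling as features, the following sketch derives annual sales per customer, one of the metrics mentioned above, from toy transaction data. The table and column names are invented for the example:

```python
import pandas as pd

# Toy transaction data standing in for a governed source table.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "year":        [2023, 2024, 2023, 2024, 2024],
    "amount":      [100.0, 120.0, 80.0, 90.0, 30.0],
})

# Metric: annual sales per customer, as a semantic layer might derive it.
annual_sales = (
    transactions.groupby(["customer_id", "year"])["amount"]
    .sum()
    .rename("annual_sales")
    .reset_index()
)

# The same metric doubles as an ML feature matrix, e.g. for customer
# segmentation: one row per customer, one column per year.
features = (
    annual_sales.pivot(index="customer_id", columns="year", values="annual_sales")
    .fillna(0.0)
)
```

A segmentation or churn model could consume `features` directly, which is the point: the feature values come from the same governed definitions the BI team already trusts.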
Caching
The semantic layer pre-fetches high-priority metrics into cache (i.e., memory) along with their supporting tables, columns, and records. While caching can push up cloud costs, it also reduces latency for real-time ML use cases such as fraud prevention or customer recommendations. Use cases like these might have an ML platform that uses a semantic layer to pre-fetch metrics and calculate feature values in memory based on inputs from a streaming data pipeline. The ML model then uses those feature values to produce its real-time predictions or recommendations. Caching plays a critical role in performance when metrics and features derive from distributed, even far-flung data sources.
Caching speeds the processing of ML features that derive from far-flung data sources
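A minimal sketch of the caching behavior described above, assuming a hypothetical fetch function standing in for queries against distributed sources. The class name and TTL policy are illustrative assumptions:

```python
import time


def fetch_metric_from_sources(name: str) -> float:
    """Stand-in for an expensive query against distributed, far-flung sources."""
    time.sleep(0.01)  # simulate network and query latency
    return 42.0


class MetricCache:
    """Minimal TTL cache sketching the semantic layer's pre-fetch behavior."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, float]] = {}  # name -> (value, expiry)

    def get(self, name: str) -> float:
        entry = self._store.get(name)
        now = time.monotonic()
        if entry and entry[1] > now:
            return entry[0]  # cache hit: served from memory, low latency
        value = fetch_metric_from_sources(name)  # cache miss: fetch and store
        self._store[name] = (value, now + self.ttl)
        return value


cache = MetricCache()
cache.get("avg_revenue_per_rep")  # first call queries the sources
cache.get("avg_revenue_per_rep")  # second call is served from memory
```

A real semantic layer would also pre-fetch high-priority metrics before they are requested and invalidate entries when source data changes; this sketch shows only the latency trade-off at the heart of the technique.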
4. A Fresh Look at Data Modeling Part 1: The What and Why of Data Modeling
By Dave Wells
Many organizations abandoned the practice of data modeling as they shifted from the data management approaches of the past to adopt big data, data lake, and NoSQL technologies. Past practices focused on relational data and were typically limited to logical and physical design for new databases. Today’s data modeling has a much larger scope, driven by many factors: advances in analytics and data science; rapid growth in the volume and variety of data; a shift from working primarily with enterprise-generated data to acquiring lots of external data; the semantic disparity of operational data as operational systems become predominantly SaaS applications; and the pursuit of data lakes and NoSQL technologies.
These factors influence data modeling practices in three significant ways: (1) modeling to understand content and structure of existing and acquired data as well as modeling to design new databases, (2) semantic and conceptual modeling as well as logical and physical modeling, (3) modeling for all types of data including key-value, document-oriented, knowledge graphs, property graphs, etc.?
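To illustrate point (3), here is one hypothetical customer record modeled across several of these data types. The entity, field names, and values are invented for the example; the point is that the same content demands different structural models:

```python
# Relational (the logical/physical design of past practice): fixed columns.
customer_row = ("C001", "Acme Corp", "Boston")

# Key-value: an opaque value stored under a key; structure is the app's problem.
kv_store = {"customer:C001": '{"name": "Acme Corp", "city": "Boston"}'}

# Document-oriented: nested, schema-on-read, related data embedded.
customer_doc = {
    "_id": "C001",
    "name": "Acme Corp",
    "address": {"city": "Boston"},
    "orders": [{"order_id": "O9", "total": 120.0}],
}

# Property graph: nodes and edges, each carrying properties.
nodes = {
    "C001": {"label": "Customer", "name": "Acme Corp"},
    "O9":   {"label": "Order", "total": 120.0},
}
edges = [("C001", "PLACED", "O9")]
```

Each representation answers different questions cheaply, which is exactly why modern data modeling must describe content and structure across all of them rather than assuming rows and columns.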
With those differences in mind, my goal with this article is to make the case that data modeling is not dead. It is more important than ever before. And it is more interesting than ever before.
The Data Modeling Process
Data modeling is the process of constructing data models. That simple definition expresses the reality, but not the complexities, of data modeling. It is important to recognize that a data model is more than a diagram: it is a description of the content and structure of a collection of data, meaning a diagram (or set of diagrams) supported with descriptive text and definitions. Furthermore, it describes that content and structure from a particular perspective – semantic, business, system, or technical. Those perspectives partially align with the multiple levels of data modeling that have been practiced for decades. (See figure 1.)
Figure 1. Levels of Data Modeling Past and Present
5. Generative AI Needs Vigilant Data Cataloging and Governance
By Kevin Petrie
Sponsored by Alation
Our industry’s breathless hype about generative AI tends to overlook the stubborn challenge of data governance. In reality, many GenAI initiatives will fail unless companies properly govern the text files that feed the language models they implement.
Data catalogs offer help. Data teams can use the latest generation of these tools to evaluate and control GenAI inputs on five dimensions: accuracy, explainability, privacy, IP friendliness, and fairness. This blog explores how data catalogs support these tasks, mitigate the risks of GenAI, and increase the odds of success.
What is GenAI?
GenAI refers to a type of artificial intelligence that generates digital content such as text, images, or audio after being trained on a corpus of existing content. The most broadly applicable form of GenAI centers on a language model (LM), which is a type of neural network whose interconnected nodes collaborate to interpret, summarize, and generate text. OpenAI’s release of ChatGPT 3.5 in November 2022 triggered an arms race among LM innovators. Google released Bard, Microsoft integrated OpenAI code into its products, and GenAI specialists such as Hugging Face and Anthropic gained new prominence with their LMs.
Now things get tricky
Companies are embedding LMs into their applications and workflows to boost productivity and gain competitive advantage. They seek to address use cases such as customer service and document processing based on their own domain-specific data, especially natural-language text. But text files introduce risks to data quality, fairness, and privacy. They can cause LMs to hallucinate, propagate bias, or expose sensitive information unless they are properly cataloged and governed.
Data teams, more accustomed to database tables, must get a handle on governing all these PDFs, Google Docs, and other text files to ensure GenAI does more good than harm. And the stakes run high: 46% of data practitioners told Eckerson Group in a recent survey that their company does not have sufficient data quality and governance controls to support its AI/ML initiatives.
Data teams need to govern the natural-language text that feeds GenAI initiatives
Enter the data catalog
The data catalog has long assisted governance by enabling data analysts, scientists, engineers, and stewards to evaluate and control datasets in their environment. It centralizes a wide range of metadata—file names, database schemas, category labels, and more—so data teams can vet data inputs for all types of analytics projects.
Modern catalogs go a step further to evaluate risk and control usage of text files for GenAI initiatives. This helps data teams fine-tune and prompt their LMs with inputs that are accurate, explainable, private, IP-friendly, and fair. (See figure 1.)
Figure 1. Data Catalog Controls for Gen AI
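As a rough sketch of how such a gate might work, the following hypothetical catalog entry scores a text file on the five dimensions and approves it as a GenAI input only if every score clears a threshold. The class, field names, file path, and threshold are illustrative assumptions, not Alation’s API:

```python
from dataclasses import dataclass, field

# The five dimensions on which catalogs can evaluate GenAI inputs.
DIMENSIONS = ("accuracy", "explainability", "privacy", "ip_friendliness", "fairness")


@dataclass
class CatalogEntry:
    """Hypothetical catalog record scoring a text file on the five dimensions."""
    path: str
    scores: dict = field(default_factory=dict)  # dimension -> 0.0..1.0

    def approved_for_genai(self, threshold: float = 0.8) -> bool:
        """Gate GenAI inputs: every dimension must meet the threshold."""
        return all(self.scores.get(d, 0.0) >= threshold for d in DIMENSIONS)


entry = CatalogEntry(
    path="contracts/supplier_agreement.pdf",
    scores={"accuracy": 0.9, "explainability": 0.85, "privacy": 0.95,
            "ip_friendliness": 0.9, "fairness": 0.88},
)
```

A file missing a score on any dimension fails the gate by default, which mirrors the conservative posture the blog argues for: text that has not been evaluated should not feed an LM.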
About Eckerson Group
Eckerson Group is a global research and consulting firm that focuses solely on data analytics. Our experts have substantial experience in data analytics and specialize in data strategy, data architecture, data management, data governance, data science, and data analytics.
Our clients say we are hard-working, insightful, and humble. It stems from our love of data and desire to help organizations optimize their data investments. We see ourselves as a family of continuous learners, interpreting the world of data and analytics for you.
Get more value from your data. Put an expert on your side. Learn what Eckerson Group can do for you!