Building a Data Platform to enable analytics and AI-driven innovation

Build a Data Mesh & Set up MLOps

Businesses realize that as more and more products and services become digitized, there is an opportunity to capture a lot of value by taking better advantage of data. In retail, it could be by avoiding deep discounting by stocking the right items at the right time. In financial services, it could be by identifying unusual activity and behavior faster than the competition. In media, it could be by increasing engagement by offering up more personalized recommendations.

Key Challenges

In my talk at Cloud Next OnAir (see video), I describe a few key challenges that you will have to address in order to lead your company toward data-powered innovation:

  • The size of data that you will employ will increase 30–100% year on year. You are looking at a 5x data growth over the next 3–4 years. Do not build your infrastructure for the data you currently have. Plan for growth.
  • 25% of your data will be streaming data. Avoid the temptation of building a batch data processing platform. You will want to unify batch and stream processing.
  • Data quality degrades the farther the data gets from the originating team. So, you will have to give domain experts control over the data. Don’t centralize data in IT.
  • The greatest value in ML/AI will be obtained by combining data that you have across your organization and even data shared by partners. Breaking silos and building a data culture will be key.
  • Much of your data will be unstructured: images, video, audio (chat), and free-form text. You will be building data and ML pipelines that derive insights from unstructured data.
  • AI/ML skills will be scarce, so you will have to take advantage of packaged AI solutions and systems that democratize machine learning.

The platform that you will need to build will need to address all of these challenges and serve as an enabler of innovation.

In this article, I will summarize the key points from my talk, and delve into technical details that I didn’t have time to cover. I recommend both watching the talk and reading this article because the two are complementary.

The 5-step journey

Based on our experience helping many Google Cloud customers go through a digital transformation journey, there are five steps in the journey:

Step 1: Simplify operations and lower the total cost of ownership

The first step for most enterprises is to find the budget. Moving your enterprise data warehouse and data lakes to the cloud can save you anywhere from 50% to 75%, mostly by reducing the need to spend valuable time doing resource provisioning. Ephemeral and spiky workloads will also benefit from autoscaling and the cloud economics of pay-for-what-you-use.

But when doing this, make sure you are setting yourself up for success, because this is only the first step of the journey. Your goal is not just to save money; it is to drive innovation. By moving to a capable platform, you can save money and, at the same time, gain the ability to handle more data, more unstructured data, and streaming data, and build a data culture (in other words, “modernize your data platform”). Just make sure to pick a platform that is serverless, self-tuning, and highly scalable, provides high-performance streaming ingestion, allows you to operationalize ML without moving data, enables domain experts to “own” the data yet share it broadly with the organization, and does all of this in a robust, secure way.

When it comes to analytics, Google BigQuery is the recommended destination for structured and semi-structured data, and Google Cloud Storage is what we recommend for unstructured data. We have low-risk migration offers to quickly move on-premises data warehouses (Teradata/Netezza/Exadata), Hadoop and Spark workloads, and point data warehouses like Redshift and Snowflake to BigQuery, and similar options exist if you need to capture logs or change streams from transactional databases into the cloud for analytics.
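
As a rough illustration of landing data in these two destinations (the project, dataset, table, bucket, and file names below are hypothetical placeholders, not part of any migration tooling), here is a minimal sketch using the Python client libraries:

```python
# Minimal sketch: load one structured file into BigQuery and one unstructured
# file into Cloud Storage. All names below are hypothetical placeholders.
from google.cloud import bigquery, storage

bq = bigquery.Client(project="my-analytics-project")

# Structured / semi-structured data goes to BigQuery.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer the schema
)
with open("orders_export.csv", "rb") as f:
    load_job = bq.load_table_from_file(
        f, "my-analytics-project.retail.orders", job_config=job_config
    )
load_job.result()  # wait for the load job to finish

# Unstructured data (images, audio, free-form text) goes to Cloud Storage.
gcs = storage.Client(project="my-analytics-project")
bucket = gcs.bucket("my-company-media")
bucket.blob("audio/call-0001.wav").upload_from_filename("call-0001.wav")
```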

Step 2: Break down silos, democratize analytics, and build a data culture

My recommendation to choose the storage layer based on type of data might seem surprising. Shouldn’t you store “raw” data in a data lake, and “clean” data in a data warehouse? No, not a good idea. Data platforms and roles are converging and you need to be aware that traditional terminology like Data Lake and Data Warehouse can lead to status quo bias and bad choices. My recommendation instead is for you to think about what type of data it is, and choose your storage layer. Some of your “raw” data, if it is structured, will be in BigQuery and some of your final, fully produced media clips will reside in Cloud Storage.

Don’t fall into the temptation of centralizing the control of data in order to break down silos. Data quality degrades the further you get from the domain experts. You want to make sure that domain experts create datasets in BigQuery and own buckets in Cloud Storage. This allows for local control, while access to these datasets is governed through Cloud IAM roles and permissions. The use of encryption, access transparency, and masking with Cloud Data Loss Prevention can help ensure org-wide security even though the responsibility for data accuracy lies with the domain teams.
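
As a minimal sketch of this ownership model (the project, dataset, and group names are hypothetical), a domain team can keep ownership of its BigQuery dataset while granting read access to the wider organization:

```python
# Minimal sketch: a domain team shares its dataset broadly while retaining
# ownership. Project, dataset, and group names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="marketing-domain-prj")
dataset = client.get_dataset("marketing-domain-prj.campaign_analytics")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="all-analysts@example.com",  # org-wide analyst group
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])  # patch only the ACL
```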

Each analytics dataset or bucket will be in a single cloud region (or multi-region such as EU or US). Following Zhamak Dehghani’s nomenclature, you could call such a storage layer a “distributed data mesh” to avoid getting sidetracked by the lake vs. warehouse debate.

Encourage teams to provide wide access to their datasets (“default open”). Owners of data control access to it, subject to org-wide data governance policies. IT manages Cloud IAM and can also tag datasets (for privacy, etc.), while permissions on each dataset are managed by its data owners. Upskill your workforce so that they are discovering and tagging datasets through Data Catalog, and building no-code integration pipelines using Data Fusion, to continually increase the breadth and coverage of your data mesh.
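
Data Catalog indexes BigQuery metadata automatically; as a lightweight stand-in for its richer tag templates, here is a minimal sketch of a domain team labeling its dataset so that it is easy to find and filter (the label keys and values are hypothetical):

```python
# Minimal sketch: attach labels to a BigQuery dataset so it is discoverable
# and filterable. Keys and values are hypothetical examples a domain team
# might record; richer metadata would live in Data Catalog tag templates.
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("marketing-domain-prj.campaign_analytics")
dataset.labels = {
    "domain": "marketing",
    "data_owner": "campaign-team",
    "contains_pii": "true",  # a flag that governance / DLP policies can key off
}
client.update_dataset(dataset, ["labels"])
```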

One problem you will run into when you build a democratized data culture is that you will start to see analytics silos. Every place a Key Performance Indicator (KPI) is calculated is one more opportunity for it to be calculated the wrong way. So, encourage data analytics teams to build a semantic layer using Looker and apply governance through that semantic layer:

This has the advantage of being multi-vendor and multi-cloud. The actual queries are carried out by the underlying data warehouse, so there is no data duplication.

Regardless of where you store the data, you should bring compute to that data. On Google Cloud, the compute and storage are separate and you can mix and match. For example, your structured data can be in BigQuery, but you can choose to do your processing using SQL in BigQuery, Java/Python Apache Beam in Cloud Dataflow, or Spark on Cloud Dataproc.

Do not make copies of data.
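
As one concrete illustration of bringing compute to the data, here is a minimal Apache Beam (Python) sketch that reads directly from BigQuery, aggregates, and writes results back, instead of exporting a copy of the data first. The query, table names, and schema are hypothetical, and DirectRunner would be swapped for DataflowRunner (with project and region options) to run on Cloud Dataflow:

```python
# Minimal sketch: process BigQuery data in place with Apache Beam rather than
# exporting a copy first. Table names, query, and schema are hypothetical.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(runner="DirectRunner")  # DataflowRunner in production

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadOrders" >> beam.io.ReadFromBigQuery(
              query="SELECT store_id, sale_amount FROM `retail.orders`",
              use_standard_sql=True)
        | "KeyByStore" >> beam.Map(lambda row: (row["store_id"], row["sale_amount"]))
        | "TotalPerStore" >> beam.CombinePerKey(sum)
        | "ToRows" >> beam.Map(lambda kv: {"store_id": kv[0], "total_sales": kv[1]})
        | "WriteTotals" >> beam.io.WriteToBigQuery(
              "retail.store_totals",
              schema="store_id:STRING,total_sales:FLOAT",
              write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
    )
```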

Step 3: Make decisions in context, faster

The value of a business decision, especially a decision that is made in the long tail, drops with latency and distance. For example, suppose you are able to approve a loan in 1 minute or in 1 day. The 1-minute approval is much, much more valuable than the 1-day turnaround. Similarly, if you are able to make a decision that takes into account spatial context (whether it is based on where the user currently lives, or where they are currently visiting), that decision is much more valuable than one devoid of spatial context.

One goal of your platform should be that you can do GIS, streaming, and machine learning on data without making copies of the data. The principle above, of bringing compute to the data, should apply to GIS, streaming, and ML as well.
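
For the GIS part, a geospatial query can run directly where the data lives. Here is a minimal sketch (the table, columns, and coordinates are hypothetical) of finding stores within five kilometers of a user:

```python
# Minimal sketch: a geospatial query runs in BigQuery, where the data already
# lives. Table, columns, and coordinates are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT store_id,
       ST_DISTANCE(store_location, ST_GEOGPOINT(@lng, @lat)) AS meters_away
FROM `retail.stores`
WHERE ST_DWITHIN(store_location, ST_GEOGPOINT(@lng, @lat), 5000)  -- within 5 km
ORDER BY meters_away
"""
job = client.query(
    sql,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("lng", "FLOAT64", -122.33),
            bigquery.ScalarQueryParameter("lat", "FLOAT64", 47.61),
        ]
    ),
)
for row in job:
    print(row.store_id, round(row.meters_away))
```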

On Google Cloud, you can stream data into BigQuery, and all queries on BigQuery are streaming SQL. Even as you are streaming data into BigQuery, you can carry out time-window transformations (to take into account user and business context) in order to power real-time AI and populate real-time dashboards.
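
A minimal sketch of this pattern (the table name, fields, and the 10-minute window are hypothetical): stream events into BigQuery and immediately query only the most recent window, for example from behind a real-time dashboard:

```python
# Minimal sketch: stream rows into BigQuery and query a recent time window.
# Table name, fields, and the 10-minute window are hypothetical placeholders.
import datetime
from google.cloud import bigquery

client = bigquery.Client()

# Streaming ingestion: rows become queryable within seconds of insertion.
errors = client.insert_rows_json(
    "retail.clickstream",
    [{"user_id": "u123", "event": "add_to_cart",
      "event_ts": datetime.datetime.utcnow().isoformat()}],
)
assert errors == [], errors

# A time-windowed query over only the most recent events.
sql = """
SELECT event, COUNT(*) AS events_last_10_min
FROM `retail.clickstream`
WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE)
GROUP BY event
"""
for row in client.query(sql):
    print(row.event, row.events_last_10_min)
```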

Step 4: Leapfrog with end-to-end AI Solutions

ML/AI is software, and like any software, you should consider whether you should build or whether you can buy. Google Cloud’s strategy in AI is to bring the best of Google’s AI to our customers in the form of APIs (e.g. the Vision API) and building blocks (e.g. AutoML Vision, where you can fine-tune the Vision API on your own data, with the advantage that you need much less of it).
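
As a minimal sketch of the "buy" path (the image file here is a hypothetical placeholder), calling a pretrained API requires no training data at all:

```python
# Minimal sketch: call a pretrained API instead of training a model.
# The image file is a hypothetical placeholder.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("product_photo.jpg", "rb") as f:
    image = vision.Image(content=f.read())

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, round(label.score, 2))
```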

When it comes to AI (arguably, this is true of all tech, but it is particularly apparent in AI because it is so new), every vendor seems to check all the boxes. We really encourage you to look at the quality of the underlying services. In our experience, no competing natural language or text classifier comes close to the Cloud Natural Language API or AutoML Natural Language, and the same holds for our vision, speech-to-text, and other models.

We are also putting together our basic capabilities into higher-value, highly integrated solutions. Contact Center AI, which packages automated call handling, operator assistance, and call analytics into one solution, is one example; Document AI, which ties together form parsing and knowledge extraction, is another.

Step 5: Empower data and ML teams with scaled AI platforms

I recommend that you split your portfolio of AI use cases into three categories. For many problems, pretrained APIs will be sufficient; for others, building blocks such as AutoML will let you customize models with your own data. Build out a data science team to solve the AI problems that will uniquely differentiate you and give you a sustainable advantage.

Once you decide to build a data science team, though, make sure that you enable it to do machine learning efficiently. This will require the ability to experiment on models in notebooks, track ML workflows as experiments, deploy ML models as containers, and do CI/CD for continuous training and evaluation. You should use our ML Pipelines for that; they are well integrated with our data analytics platform and with Cloud AI Platform services.
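
Here is a minimal sketch of what such a pipeline looks like, assuming the Kubeflow Pipelines SDK (v1) that underpins our hosted pipelines; the component bodies and names are hypothetical placeholders for your real preprocessing and training code:

```python
# Minimal sketch: express an ML workflow as a pipeline so it can be versioned,
# scheduled, and re-run for continuous training. Component bodies are
# hypothetical placeholders for real preprocessing / training / evaluation code.
import kfp
from kfp import dsl
from kfp.components import create_component_from_func


def preprocess(source_table: str) -> str:
    # Placeholder: read from BigQuery, write features, return their location.
    return "gs://my-bucket/features-from-" + source_table


def train(features_path: str) -> str:
    # Placeholder: train a model on the features, return the model location.
    return features_path + "/model"


preprocess_op = create_component_from_func(preprocess)
train_op = create_component_from_func(train)


@dsl.pipeline(name="churn-training", description="Continuous-training sketch")
def churn_pipeline(source_table: str = "retail.orders"):
    features = preprocess_op(source_table)
    train_op(features.output)


if __name__ == "__main__":
    # Compile to a spec that the hosted pipelines service can run on a schedule.
    kfp.compiler.Compiler().compile(churn_pipeline, "churn_pipeline.yaml")
```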

At Google Cloud, we will walk with you at every step of this journey. Contact us!

Next Steps

Watch my talk at Cloud Next OnAir (register for the talk here)

This article is crossposted from my Medium blog -- follow me there, and on Twitter.

Here are some articles and white papers that might be useful:

Masil Masilamani

Structural Engineer at Independent consulting

Wow!!

Gaetano Ruggiero

Global Head of Data Management | Building Complex Data Architectural Landscapes

Thanks for this read. I understand the need for a single pane of glass in order to mitigate the risk of incorrect KPI calculations in a distributed analytics ecosystem, where analytics silos may also lead to multiple KPIs having the same semantics. Do you think that master data management, by building a single trusted view of the analytic dimensions provided by multiple domains, can help improve the quality of analytic models?

Venkat Lolla

Senior Director, Engineering & Strategy - Optum Whole Health Solutions

Fantastic read.. we are heading in a similar direction.. streaming analytics..

John Klacynski

Principal Customer Solutions Manager (AWS)

Roberto Pasquier FYI. Different stack, but you guys are on the final leg(s) of the journey!

Zhamak Dehghani

Founder and CEO Nextdata | Creator of Data Mesh | Author | Speaker | Ex-Thoughtworks

Valliappa Lakshmanan, thank you for sharing this article - concisely written and to the point. I'm delighted to see Data Mesh is on your radar. While Google services can certainly implement a data mesh architecture, perhaps better than many other providers, IMHO there is still room for improvement in providing a *frictionless experience* for building domain-driven and decentralized big data architecture. I'm hopeful that Google makes some investment in that area. One thing to note: Data Mesh caters for copying data. In fact, there is no way out of it; if we embrace decentralization, that's inevitable. However, the underlying data platform technology can provide guardrails and standardization, built in, to make sure time-based immutability is baked into the representation of analytical data. Though that's a longer conversation to have another time :) I am looking forward to your session at Google Next.
