Building a Data Platform to enable analytics and AI-driven innovation

Build a Data Mesh & Set up MLOps

Businesses realize that as more and more products and services become digitized, there is an opportunity to capture a lot of value by taking better advantage of data. In retail, it could be by avoiding deep discounting by stocking the right items at the right time. In financial services, it could be by identifying unusual activity and behavior faster than the competition. In media, it could be by increasing engagement by offering up more personalized recommendations.

Key Challenges

In my talk at Cloud Next OnAir (see video), I describe a few key challenges that you will have to address in order to lead your company toward data-powered innovation:

  • The size of data that you will employ will increase 30–100% year on year. You are looking at a 5x data growth over the next 3–4 years. Do not build your infrastructure for the data you currently have. Plan for growth.
  • 25% of your data will be streaming data. Avoid the temptation of building a batch data processing platform. You will want to unify batch and stream processing.
  • Data quality degrades the farther the data gets from the originating team. So, you will have to give domain experts control over the data. Don’t centralize data in IT.
  • The greatest value in ML/AI will be obtained by combining data that you have across your organization and even data shared by partners. Breaking silos and building a data culture will be key.
  • Much of your data will be unstructured: images, video, audio (chat), and free-form text. You will be building data and ML pipelines that derive insights from unstructured data.
  • AI/ML skills will be scarce, so you will have to take advantage of packaged AI solutions and systems that democratize machine learning.

The platform that you will need to build will need to address all of these challenges and serve as an enabler of innovation.

In this article, I will summarize the key points from my talk, and delve into technical details that I didn’t have time to cover. I recommend both watching the talk and reading this article because the two are complementary.

The 5-step journey

Based on our experience helping many Google Cloud customers go through a digital transformation journey, there are five steps in the journey:

Step 1: Simplify operations and lower the total cost of ownership

The first step for most enterprises is to find the budget. Moving your enterprise data warehouse and data lakes to the cloud can save you anywhere from 50% to 75%, mostly by reducing the need to spend valuable time doing resource provisioning. Ephemeral and spiky workloads will also benefit from autoscaling and the cloud economics of pay-for-what-you-use.

But when doing this, make sure you are setting yourself up for success, because this is only the first step of the journey. Your goal is not just to save money; it is to drive innovation. By moving to a capable platform, you can save money and, at the same time, gain the ability to handle more data, more unstructured data, and streaming data, and build a data culture (in other words, “modernize your data platform”). Just make sure to pick a platform that is serverless, self-tuning, and highly scalable, provides high-performance streaming ingestion, allows you to operationalize ML without moving data, enables domain experts to “own” the data yet share it broadly with the organization, and does all of this in a robust, secure way.

When it comes to analytics, Google BigQuery is the recommended destination for structured and semi-structured data, and Google Cloud Storage is what we recommend for unstructured data. We have low-risk migration offers to quickly move on-premises data warehouses (Teradata/Netezza/Exadata), Hadoop and Spark workloads, and point data warehouses like Redshift and Snowflake to BigQuery, and similar options exist if you need to capture logs or change streams from transactional databases into the cloud for analytics.
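
As a rough illustration of landing data in these two destinations (the project, dataset, table, bucket, and file names below are hypothetical placeholders, not part of any migration tooling), here is a minimal sketch using the Python client libraries:

```python
# Minimal sketch: load one structured file into BigQuery and one unstructured
# file into Cloud Storage. All names below are hypothetical placeholders.
from google.cloud import bigquery, storage

bq = bigquery.Client(project="my-analytics-project")

# Structured / semi-structured data goes to BigQuery.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer the schema
)
with open("orders_export.csv", "rb") as f:
    load_job = bq.load_table_from_file(
        f, "my-analytics-project.retail.orders", job_config=job_config
    )
load_job.result()  # wait for the load job to finish

# Unstructured data (images, audio, free-form text) goes to Cloud Storage.
gcs = storage.Client(project="my-analytics-project")
bucket = gcs.bucket("my-company-media")
bucket.blob("audio/call-0001.wav").upload_from_filename("call-0001.wav")
```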

Step 2: Break down silos, democratize analytics, and build a data culture

My recommendation to choose the storage layer based on type of data might seem surprising. Shouldn’t you store “raw” data in a data lake, and “clean” data in a data warehouse? No, not a good idea. Data platforms and roles are converging and you need to be aware that traditional terminology like Data Lake and Data Warehouse can lead to status quo bias and bad choices. My recommendation instead is for you to think about what type of data it is, and choose your storage layer. Some of your “raw” data, if it is structured, will be in BigQuery and some of your final, fully produced media clips will reside in Cloud Storage.

Don’t fall into the temptation of centralizing the control of data in order to break down silos. Data quality degrades the further you get from the domain experts. You want to make sure that domain experts create datasets in BigQuery and own buckets in Cloud Storage. This allows for local control, while access to these datasets is governed through Cloud IAM roles and permissions. The use of encryption, access transparency, and masking with Cloud Data Loss Prevention can help ensure org-wide security even though the responsibility for data accuracy lies with the domain teams.
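
As a minimal sketch of this ownership model (the project, dataset, and group names are hypothetical), a domain team can keep ownership of its BigQuery dataset while granting read access to the wider organization:

```python
# Minimal sketch: a domain team shares its dataset broadly while retaining
# ownership. Project, dataset, and group names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="marketing-domain-prj")
dataset = client.get_dataset("marketing-domain-prj.campaign_analytics")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="all-analysts@example.com",  # org-wide analyst group
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])  # patch only the ACL
```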

Each analytics dataset or bucket will be in a single cloud region (or multi-region such as EU or US). Following Zhamak Dehghani’s nomenclature, you could call such a storage layer a “distributed data mesh” to avoid getting sidetracked by the lake vs. warehouse debate.

Encourage teams to provide wide access to their datasets (“default open”). Owners of data control access to it, subject to org-wide data governance policies. IT manages Cloud IAM and can also tag datasets (for privacy, etc.), while permissions on each dataset are managed by its data owners. Upskill your workforce so that they are discovering and tagging datasets through Data Catalog, and building no-code integration pipelines using Data Fusion, to continually increase the breadth and coverage of your data mesh.
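
Data Catalog indexes BigQuery metadata automatically; as a lightweight stand-in for its richer tag templates, here is a minimal sketch of a domain team labeling its dataset so that it is easy to find and filter (the label keys and values are hypothetical):

```python
# Minimal sketch: attach labels to a BigQuery dataset so it is discoverable
# and filterable. Keys and values are hypothetical examples a domain team
# might record; richer metadata would live in Data Catalog tag templates.
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("marketing-domain-prj.campaign_analytics")
dataset.labels = {
    "domain": "marketing",
    "data_owner": "campaign-team",
    "contains_pii": "true",  # a flag that governance / DLP policies can key off
}
client.update_dataset(dataset, ["labels"])
```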

One problem you will run into when you build a democratized data culture is that you will start to see analytics silos. Every place a Key Performance Indicator (KPI) is calculated is one more opportunity for it to be calculated the wrong way. So, encourage data analytics teams to build a semantic layer using Looker and apply governance through that semantic layer:

This has the advantage of being multi-vendor and multi-cloud. The actual queries are carried out by the underlying data warehouse, so there is no data duplication.

Regardless of where you store the data, you should bring compute to that data. On Google Cloud, the compute and storage are separate and you can mix and match. For example, your structured data can be in BigQuery, but you can choose to do your processing using SQL in BigQuery, Java/Python Apache Beam in Cloud Dataflow, or Spark on Cloud Dataproc.

Do not make copies of data.
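
As one concrete illustration of bringing compute to the data, here is a minimal Apache Beam (Python) sketch that reads directly from BigQuery, aggregates, and writes results back, instead of exporting a copy of the data first. The query, table names, and schema are hypothetical, and DirectRunner would be swapped for DataflowRunner (with project and region options) to run on Cloud Dataflow:

```python
# Minimal sketch: process BigQuery data in place with Apache Beam rather than
# exporting a copy first. Table names, query, and schema are hypothetical.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(runner="DirectRunner")  # DataflowRunner in production

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadOrders" >> beam.io.ReadFromBigQuery(
              query="SELECT store_id, sale_amount FROM `retail.orders`",
              use_standard_sql=True)
        | "KeyByStore" >> beam.Map(lambda row: (row["store_id"], row["sale_amount"]))
        | "TotalPerStore" >> beam.CombinePerKey(sum)
        | "ToRows" >> beam.Map(lambda kv: {"store_id": kv[0], "total_sales": kv[1]})
        | "WriteTotals" >> beam.io.WriteToBigQuery(
              "retail.store_totals",
              schema="store_id:STRING,total_sales:FLOAT",
              write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
    )
```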

Step 3: Make decisions in context, faster

The value of a business decision, especially a decision that is made in the long tail, drops with latency and distance. For example, suppose you are able to approve a loan in 1 minute or in 1 day. The 1-minute approval is much, much more valuable than the 1-day turnaround. Similarly, if you are able to make a decision that takes into account spatial context (whether it is based on where the user currently lives, or where they are currently visiting), that decision is much more valuable than one devoid of spatial context.

One goal of your platform should be that you can do GIS, streaming, and machine learning on data without making copies of the data. The principle above, of bringing compute to the data, should apply to GIS, streaming, and ML as well.
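
For the GIS part, a geospatial query can run directly where the data lives. Here is a minimal sketch (the table, columns, and coordinates are hypothetical) of finding stores within five kilometers of a user:

```python
# Minimal sketch: a geospatial query runs in BigQuery, where the data already
# lives. Table, columns, and coordinates are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT store_id,
       ST_DISTANCE(store_location, ST_GEOGPOINT(@lng, @lat)) AS meters_away
FROM `retail.stores`
WHERE ST_DWITHIN(store_location, ST_GEOGPOINT(@lng, @lat), 5000)  -- within 5 km
ORDER BY meters_away
"""
job = client.query(
    sql,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("lng", "FLOAT64", -122.33),
            bigquery.ScalarQueryParameter("lat", "FLOAT64", 47.61),
        ]
    ),
)
for row in job:
    print(row.store_id, round(row.meters_away))
```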

On Google Cloud, you can stream data into BigQuery, and all queries on BigQuery are streaming SQL. Even as you are streaming data into BigQuery, you can carry out time-window transformations (to take into account user and business context) in order to power real-time AI and populate real-time dashboards.
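
A minimal sketch of this pattern (the table name, fields, and the 10-minute window are hypothetical): stream events into BigQuery and immediately query only the most recent window, for example from behind a real-time dashboard:

```python
# Minimal sketch: stream rows into BigQuery and query a recent time window.
# Table name, fields, and the 10-minute window are hypothetical placeholders.
import datetime
from google.cloud import bigquery

client = bigquery.Client()

# Streaming ingestion: rows become queryable within seconds of insertion.
errors = client.insert_rows_json(
    "retail.clickstream",
    [{"user_id": "u123", "event": "add_to_cart",
      "event_ts": datetime.datetime.utcnow().isoformat()}],
)
assert errors == [], errors

# A time-windowed query over only the most recent events.
sql = """
SELECT event, COUNT(*) AS events_last_10_min
FROM `retail.clickstream`
WHERE event_ts >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 MINUTE)
GROUP BY event
"""
for row in client.query(sql):
    print(row.event, row.events_last_10_min)
```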

Step 4: Leapfrog with end-to-end AI Solutions

ML/AI is software, and like any software, you should consider whether you should build or whether you can buy. Google Cloud’s strategy in AI is to bring the best of Google’s AI to our customers in the form of APIs (e.g. the Vision API) and building blocks (e.g. AutoML Vision, where you can fine-tune the Vision API on your own data, with the advantage that you need much less of it).
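
As a minimal sketch of the "buy" path (the image file here is a hypothetical placeholder), calling a pretrained API requires no training data at all:

```python
# Minimal sketch: call a pretrained API instead of training a model.
# The image file is a hypothetical placeholder.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("product_photo.jpg", "rb") as f:
    image = vision.Image(content=f.read())

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, round(label.score, 2))
```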

When it comes to AI (arguably, this is true of all tech, but it is particularly apparent in AI because it is so new), every vendor seems to check all the boxes. We really encourage you to look at the quality of the underlying services. In our experience, no competing natural language or text classifier comes close to the Cloud Natural Language API or AutoML Natural Language, and the same holds for our vision, speech-to-text, and other models.

We are also putting together our basic capabilities into higher-value, highly integrated solutions. Contact Center AI, which packages automated call handling, operator assistance, and call analytics into one solution, is one example; Document AI, which ties together form parsing and knowledge extraction, is another.

Step 5: Empower data and ML teams with scaled AI platforms

I recommend that you split your portfolio of AI use cases into three categories. For many problems, pretrained APIs will be sufficient; for others, building blocks such as AutoML will let you customize models with your own data. Build out a data science team to solve the AI problems that will uniquely differentiate you and give you a sustainable advantage.

Once you decide to build a data science team, though, make sure that you enable it to do machine learning efficiently. This will require the ability to experiment on models in notebooks, track ML workflows as experiments, deploy ML models as containers, and do CI/CD for continuous training and evaluation. You should use our ML Pipelines for that; they are well integrated with our data analytics platform and with Cloud AI Platform services.
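
Here is a minimal sketch of what such a pipeline looks like, assuming the Kubeflow Pipelines SDK (v1) that underpins our hosted pipelines; the component bodies and names are hypothetical placeholders for your real preprocessing and training code:

```python
# Minimal sketch: express an ML workflow as a pipeline so it can be versioned,
# scheduled, and re-run for continuous training. Component bodies are
# hypothetical placeholders for real preprocessing / training / evaluation code.
import kfp
from kfp import dsl
from kfp.components import create_component_from_func


def preprocess(source_table: str) -> str:
    # Placeholder: read from BigQuery, write features, return their location.
    return "gs://my-bucket/features-from-" + source_table


def train(features_path: str) -> str:
    # Placeholder: train a model on the features, return the model location.
    return features_path + "/model"


preprocess_op = create_component_from_func(preprocess)
train_op = create_component_from_func(train)


@dsl.pipeline(name="churn-training", description="Continuous-training sketch")
def churn_pipeline(source_table: str = "retail.orders"):
    features = preprocess_op(source_table)
    train_op(features.output)


if __name__ == "__main__":
    # Compile to a spec that the hosted pipelines service can run on a schedule.
    kfp.compiler.Compiler().compile(churn_pipeline, "churn_pipeline.yaml")
```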

At Google Cloud, we will walk with you at every step of this journey. Contact us!

Next Steps

Watch my talk at Cloud Next OnAir (register for the talk here)

This article is crossposted from my Medium blog -- follow me there, and on Twitter.

Here are some articles and white papers that might be useful:

Masil Masilamani

Structural Engineer at Independent consulting

Wow!!

Gaetano Ruggiero

Global Head of Data Management | Building Complex Data Architectural Landscapes

Thanks for this read. I understand the need for a single pane of glass in order to mitigate the risk of incorrect KPI calculations in a distributed analytics ecosystem, where analytics silos may also lead to multiple KPIs having the same semantics. Do you think that master data management, by building a single trusted view of the analytic dimensions provided by multiple domains, can help improve the quality of analytic models?

Venkat Lolla

Senior Director, Engineering & Strategy - Optum Whole Health Solutions

Fantastic read.. we are heading in a similar direction.. streaming analytics..

John Klacynski

Principal Customer Solutions Manager (AWS)

Roberto Pasquier FYI. Different stack, but you guys are on the final leg(s) of the journey!

Zhamak Dehghani

Founder and CEO Nextdata | Creator of Data Mesh | Author | Speaker | Ex-Thoughtworks

Valliappa Lakshmanan, thank you for sharing this article - concisely written and to the point. I'm delighted to see Data Mesh is on your radar. While Google services can certainly implement a data mesh architecture, perhaps better than many other providers, IMHO there is still room for improvement in providing a *frictionless experience* for building domain-driven and decentralized big data architecture. I'm hopeful that Google makes some investment in that area. One thing to note: Data Mesh caters for copying data. In fact, there is no way out of it; if we embrace decentralization, that's inevitable. However, the underlying data platform technology can provide guardrails and standardization, built in, to make sure time-based immutability is baked into the representation of analytical data. Though that's a longer conversation to have another time :) I am looking forward to your session at Google Next.
