Art of Data Newsletter - Issue #9

Bartosz Gajda

Databricks - Azure - Python | Staff Azure Data Engineer @ Lingaro

Published: May 22, 2023

Welcome all Data fanatics. In today's issue:

MLOps basics for Data Engineers
Managing BigQuery at Reddit scale
Compass shares the experience of building Data Platform with Databricks
MLOps Best Practices from Walmart
Instacart Ads measurement platform with Lakehouse and Spark
Using graphs to model and analyze the customer journey

Let's dive in!

MLOps Basics - For Data Engineers | 12mins

MLOps (Machine Learning Operations) is the term used to describe the work that Data Engineers take on to enable ML to run at scale in a production environment. It involves automating machine learning tasks such as feature storage, model training, prediction, and analysis. Feature stores represent data in a form that algorithms can understand and process, while automation and tracking keep the ML environment stable. The article describes best practices, including data tracking and automating machine learning tasks.

Wrangling BigQuery at Reddit | 31mins

This Reddit post covers managing a BigQuery instance at Reddit scale. It notes that the fundamentals of database management are similar regardless of scale or platform, but at Reddit's scale the numbers in the logs are much larger. The BigQuery platform at Reddit supports over 100 petabytes of data and is used for data science, machine learning, analytics, experiments, advertising, revenue, safety, and more. As Reddit grew, workload velocity and complexity within BigQuery increased, creating a need for more efficient workload management. The post also discusses how the team navigates its data lake logs to maintain visibility and context while avoiding potential issues.

Enterprise Data Platform @ Compass | 10mins

Compass chose Databricks to build its modern data platform due to its scalability, reliability, security, and ability to support AI, BI, and DI use cases on one platform. The platform has allowed the company to store and manage its analytics data in one place, create an environment for AI, BI, and DI collaboration, and optimize its cost metrics. It has become the go-to place for data and machine learning needs across the company and is still evolving toward its full potential.

Rapid & Reliable ML Experiments using MLOps Best Practices | 14mins

This article explains how to use MLOps best practices for rapid and reliable machine learning experiments. The practices involve using an open source model lifecycle management library to log the parameters, hyperparameters, metrics, and model output artifacts associated with each experiment. Additionally, the YAML library can be used for configuring model parameters, and the DVC library can be used for versioning large data sets and model objects. With these practices in place, data scientists can easily log, query, and manage ML experiment runs and maintain multiple configurations during model development.
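
The logging-and-querying workflow summarized above can be sketched in a few lines. This is only an illustration under stated assumptions: `ExperimentLog`, `Run`, `run_experiments`, and `fake_train` are hypothetical names, the tracker is an in-memory stand-in rather than the API of any real lifecycle library, and JSON is swapped in for YAML so the example needs nothing beyond the standard library.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Run:
    """One experiment run: its configuration and the metrics it produced."""
    run_id: int
    params: dict
    metrics: dict = field(default_factory=dict)


class ExperimentLog:
    """Hypothetical in-memory stand-in for a model lifecycle tracking library."""

    def __init__(self):
        self.runs = []

    def start_run(self, params):
        # Copy the params so later changes to the caller's dict do not alter the log.
        run = Run(run_id=len(self.runs) + 1, params=dict(params))
        self.runs.append(run)
        return run

    def best_run(self, metric, higher_is_better=True):
        # Only runs that actually recorded the metric are comparable.
        scored = [r for r in self.runs if metric in r.metrics]
        if not scored:
            return None
        pick = max if higher_is_better else min
        return pick(scored, key=lambda r: r.metrics[metric])


# Configuration kept outside the code; JSON here stands in for a YAML file.
CONFIG_TEXT = '{"learning_rate": [0.1, 0.01], "max_depth": 3}'


def run_experiments(config_text, train_fn):
    """Run one experiment per learning rate and log params and accuracy."""
    config = json.loads(config_text)
    log = ExperimentLog()
    for lr in config["learning_rate"]:
        params = {"learning_rate": lr, "max_depth": config["max_depth"]}
        run = log.start_run(params)
        run.metrics["accuracy"] = train_fn(params)
    return log


def fake_train(params):
    """Placeholder for real training; returns a made-up deterministic score."""
    return 0.9 if params["learning_rate"] == 0.01 else 0.8


log = run_experiments(CONFIG_TEXT, fake_train)
best = log.best_run("accuracy")
print(best.run_id, best.params)
```

Keeping the configuration in a separate text file means each run's parameters can be versioned alongside the data and model artifacts, which is the role the article assigns to YAML and DVC.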

How we monitor thousands of Spark data pipelines | 21mins

Thanan explains how he implemented an observability and monitoring system for thousands of data pipelines running on Apache Spark. A Spark Listener collected statistics for each event and exported the useful ones to the DTP Internal Server via REST API. SLOs were set for runtime, skew, spill, and failed apps, with tier levels for priority. Example results included a skew issue that was solved with one line of code, and a retry issue that was fixed by adjusting resources and repartitioning.
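
As a rough illustration of tiered SLO checks like the ones described above, the sketch below evaluates one application's statistics against per-tier thresholds. Everything here is an assumption made for the example: the `AppStats` shape, the threshold values in `SLOS`, and the skew definition (slowest task divided by median task) are not taken from the article.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class AppStats:
    """Per-application statistics, in a hypothetical exported shape."""
    name: str
    tier: int                  # 1 = highest priority
    runtime_s: float
    task_durations_s: list
    spilled_bytes: int
    failed: bool


# Hypothetical SLO thresholds per tier; real values would come from the team.
SLOS = {
    1: {"max_runtime_s": 1800, "max_skew": 3.0, "max_spill_bytes": 0},
    2: {"max_runtime_s": 7200, "max_skew": 10.0, "max_spill_bytes": 10 * 1024**3},
}


def skew_ratio(durations):
    """Slowest task divided by the median task; 1.0 means perfectly even."""
    if not durations:
        return 1.0
    mid = median(durations)
    return max(durations) / mid if mid > 0 else float("inf")


def slo_violations(app):
    """Return the names of the SLOs this application breaks for its tier."""
    slo = SLOS[app.tier]
    violations = []
    if app.failed:
        violations.append("failed")
    if app.runtime_s > slo["max_runtime_s"]:
        violations.append("runtime")
    if skew_ratio(app.task_durations_s) > slo["max_skew"]:
        violations.append("skew")
    if app.spilled_bytes > slo["max_spill_bytes"]:
        violations.append("spill")
    return violations


app = AppStats("daily_orders", tier=1, runtime_s=900,
               task_durations_s=[10, 12, 11, 95], spilled_bytes=0, failed=False)
print(slo_violations(app))
```

One straggler task that runs far longer than the rest is enough to trip the skew check here, even though total runtime is well inside its limit.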

How Instacart Ads Modularized Data Pipelines With Lakehouse Architecture and Spark | 15mins

Instacart has implemented a new ads measurement platform using modularized ETL pipelines built with Lakehouse architecture and Spark. This platform has replaced the previous system with Kinesis Firehose, which was becoming increasingly costly and difficult to manage. The new system utilizes the Delta Lake file format and has resulted in significant cost savings and read optimization. Additionally, the modular pipelines with their advanced testability and observability have greatly improved the platform's scalability and effectiveness. With streaming capabilities, Instacart is now able to provide near-real-time metrics. The success of this project was the result of an immense effort by Instacart's Ads Data Pipeline and Data Platform team.

How to use graphs to analyze the customer journey | 8mins

This article from Microsoft advocates using graphs to model and analyze the customer journey. It outlines an end-to-end approach: conceptualizing the customer journey, gathering data, building a graph model, and analyzing it further. The result is a visual way of understanding the customer journey that can guide future development decisions.
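
A minimal sketch of the graph-building step, assuming each journey is available as an ordered list of touchpoints. The sample data and the function names `build_transition_graph` and `transition_probabilities` are hypothetical, and the article may well use a dedicated graph library instead of plain dictionaries.

```python
from collections import Counter, defaultdict

# Hypothetical journeys: each is the ordered list of touchpoints for one customer.
journeys = [
    ["ad", "landing", "signup", "purchase"],
    ["ad", "landing", "exit"],
    ["search", "landing", "signup", "exit"],
]


def build_transition_graph(journeys):
    """Count how often customers move from one touchpoint directly to the next."""
    edges = Counter()
    for path in journeys:
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += 1
    return edges


def transition_probabilities(edges):
    """Normalize edge counts by the outgoing total of each source node."""
    totals = defaultdict(int)
    for (src, _), count in edges.items():
        totals[src] += count
    return {(src, dst): count / totals[src] for (src, dst), count in edges.items()}


edges = build_transition_graph(journeys)
probs = transition_probabilities(edges)
for (src, dst), p in sorted(probs.items()):
    print(f"{src} -> {dst}: {p:.2f}")
```

Nodes are touchpoints and weighted edges are observed transitions, so questions such as "where do customers drop off after the landing page?" become lookups on the outgoing edges of a single node.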

