Art of Data Newsletter - Issue #9
Photo by Pixabay: https://www.pexels.com/photo/clear-light-bulb-355948/

Welcome all Data fanatics. In today's issue:

  • MLOps basics for Data Engineers
  • Managing BigQuery at Reddit scale
  • Compass shares the experience of building Data Platform with Databricks
  • MLOps Best Practices from Walmart
  • Instacart Ads measurement platform with Lakehouse and Spark
  • Using graphs to model and analyze the customer journey

Let's dive in!


MLOps Basics - For Data Engineers | 12mins

MLOps (Machine Learning Operations) describes the work Data Engineers take on so that ML can run at scale in a production environment. It involves automating machine learning tasks such as feature storage, model training, prediction, and analysis. Feature stores represent data in a form that algorithms can understand and process, while automation and tracking keep the ML environment stable. The article describes best practices, including data tracking and automating machine learning tasks.


Wrangling BigQuery at Reddit | 31mins

This Reddit post covers managing a BigQuery instance at Reddit scale. The fundamentals of database management are similar regardless of scale or platform, but at Reddit's scale the numbers in the logs are much larger. The BigQuery platform at Reddit supports over 100 petabytes of data and is used for data science, machine learning, analytics, experiments, advertising, revenue, safety, and more. As Reddit grew, workload velocity and complexity within BigQuery increased, which created the need for more efficient workload management. The post also discusses how the team navigates its data lake logs to maintain visibility and context while avoiding potential issues.


Enterprise Data Platform @ Compass | 10mins

Compass chose Databricks to build its modern data platform due to its scalability, reliability, security, and ability to support AI, BI, and DI use cases on one platform. This has allowed the company to store and manage its analytics data in one place, create an environment for AI, BI, and DI collaboration, and optimize its cost metrics. The platform has become the go-to place for data and machine learning needs across the company and is still evolving.


Rapid & Reliable ML Experiments using MLOps Best Practices | 14mins

This article explains how to use MLOps best practices for rapid and reliable machine learning experiments. The practices involve using an open source model lifecycle management library to log the parameters, hyperparameters, metrics and model output artifacts associated with each experiment. Additionally, the YAML library can be used for configuring model parameters, and the DVC library can be used for versioning large data sets and model objects. With these practices, data scientists can easily log, query and manage ML experiment runs and maintain multiple configurations during model development.
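To make the experiment-logging idea concrete, here is a minimal sketch that uses only the Python standard library. It is not the library the article uses; the function names `log_run`, `load_runs` and `best_run`, the JSON-lines file layout, and the parameter and metric values are all hypothetical and chosen purely for illustration.

```python
import json
import tempfile
from pathlib import Path


def log_run(log_path, params, metrics):
    """Append one experiment run (params + metrics) as a JSON line."""
    record = {"params": params, "metrics": metrics}
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_runs(log_path):
    """Read every logged run back into a list of dicts."""
    with open(log_path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def best_run(runs, metric):
    """Return the run with the highest value for the given metric."""
    return max(runs, key=lambda r: r["metrics"][metric])


# Made-up runs, written to a throwaway directory.
log_file = Path(tempfile.mkdtemp()) / "runs.jsonl"
log_run(log_file, {"learning_rate": 0.1, "max_depth": 3}, {"accuracy": 0.81})
log_run(log_file, {"learning_rate": 0.05, "max_depth": 5}, {"accuracy": 0.86})

runs = load_runs(log_file)
best = best_run(runs, "accuracy")
print(best["params"])
```

A dedicated lifecycle management library would add artifact storage, a query UI and run comparison on top of this, but the core pattern is the same: every run records its configuration next to its results so that runs can be queried later.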


How we monitor thousands of Spark data pipelines | 21mins

Thanan explains how he implemented an observability system for thousands of data pipelines running on Apache Spark. A Spark Listener collected statistics for each event and exported the useful ones to the DTP Internal Server via REST API. SLOs were set for runtime, skew, spill and failed apps, with tier levels for priority. Example results after monitoring included a skew issue that was solved with one line of code, and a retry issue that was fixed by adjusting resources and repartitioning.
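A skew SLO of this kind could be checked roughly as follows. This is a sketch under stated assumptions, not the author's implementation: the skew definition (slowest task divided by median task), the per-tier thresholds in `SKEW_SLO_BY_TIER`, and the sample durations are all hypothetical.

```python
from statistics import median

# Hypothetical thresholds: stricter limits for higher-priority tiers.
SKEW_SLO_BY_TIER = {1: 3.0, 2: 5.0, 3: 10.0}


def skew_ratio(task_durations_ms):
    """Ratio of the slowest task to the median task within one stage."""
    if not task_durations_ms:
        raise ValueError("need at least one task duration")
    mid = median(task_durations_ms)
    return max(task_durations_ms) / mid if mid > 0 else float("inf")


def violates_skew_slo(task_durations_ms, tier):
    """True when the stage's skew ratio exceeds the limit for its tier."""
    return skew_ratio(task_durations_ms) > SKEW_SLO_BY_TIER[tier]


# Made-up task durations for one stage: one straggler among similar tasks.
durations = [100, 110, 95, 105, 900]
print(violates_skew_slo(durations, tier=1))
print(violates_skew_slo(durations, tier=3))
```

Comparing against the median rather than the mean keeps a single straggler from inflating the baseline it is measured against.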


How Instacart Ads Modularized Data Pipelines With Lakehouse Architecture and Spark | 15mins

Instacart has implemented a new ads measurement platform using modularized ETL pipelines built with Lakehouse architecture and Spark. It replaces the previous system, which relied on Kinesis Firehose and was becoming increasingly costly and difficult to manage. The new system uses the Delta Lake format and has resulted in significant cost savings and read optimization. The modular pipelines, with their improved testability and observability, have also made the platform more scalable and effective. With streaming capabilities, Instacart is now able to provide near-real-time metrics. The project was the result of an immense effort by Instacart's Ads Data Pipeline and Data Platform team.


How to use graphs to analyze the customer journey | 8mins

This article from Microsoft advocates using graphs to model and analyze the customer journey. It outlines an end-to-end approach: conceptualizing the customer journey, gathering data, building a graph model, and analyzing it further. The result is a visual way of understanding the customer journey that can guide future development decisions.
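As a rough illustration of the graph-modeling step, the sketch below builds a weighted transition graph from ordered journey events using only the Python standard library. It is not the approach from the Microsoft article; the function names and the sample journeys are hypothetical.

```python
from collections import Counter, defaultdict


def build_journey_graph(journeys):
    """Count transitions between consecutive steps across all journeys."""
    graph = defaultdict(Counter)
    for steps in journeys:
        for src, dst in zip(steps, steps[1:]):
            graph[src][dst] += 1
    return graph


def most_common_next(graph, node):
    """Return the most frequent next step after `node`, or None."""
    if node not in graph or not graph[node]:
        return None
    return graph[node].most_common(1)[0][0]


# Made-up journeys: each list is one customer's ordered touchpoints.
journeys = [
    ["ad_click", "product_page", "cart", "purchase"],
    ["search", "product_page", "cart"],
    ["ad_click", "product_page", "exit"],
]

graph = build_journey_graph(journeys)
print(most_common_next(graph, "product_page"))
```

Nodes are touchpoints and edge weights are transition counts, so questions such as "where do customers go after the product page?" become simple lookups. A graph database or graph library would be the natural next step for larger data sets.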
