Art of Data Newsletter - Issue #16
Welcome, all Data fanatics. In today's issue: real-time streaming ML at Lyft; escaping the 'mid' Data Engineer cycle; how Razorpay cut its Data Platform costs by $2M a year; Mixpanel's 80% BigQuery cost reduction; turning ChatGPT into a business analytics assistant; credit decisioning on the Databricks Lakehouse; six ways to speed up Databricks SQL; and Apache Celeborn, a remote shuffle service for Spark.
Let's dive in!
Lyft has developed capabilities for real-time machine learning (ML) with streaming data, enabling the company's hundreds of ML developers to make timely data-driven decisions, retrain models, and run inference calls. The company built a RealtimeMLPipeline interface that simplifies integrating streaming into ML models; applying it uniformly across environments sped up development, cutting build time to days. Despite the steep learning curve and complex software development involved in streaming applications, Lyft has seen successful adoption of the technology across its business, with use cases ranging from reducing bias in models to computing safety features for drivers.
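Lyft's post does not publish the RealtimeMLPipeline code itself, so everything in the sketch below (the method names, the SQL feature definition) is hypothetical; it only illustrates the idea of defining a streaming feature once and reusing that definition across environments.

```python
# Hypothetical sketch only: Lyft's actual RealtimeMLPipeline API is internal
# and not shown in the post. Names and signatures are invented to illustrate
# the "define a streaming feature once, apply it everywhere" idea.
from dataclasses import dataclass, field


@dataclass
class RealtimeMLPipeline:
    """Collects streaming feature definitions so the same definitions
    can be applied uniformly across dev, staging, and prod."""
    features: dict = field(default_factory=dict)

    def add_feature(self, name: str, sql: str) -> None:
        # In a real system this would register a streaming job (e.g. Flink
        # SQL); here we only store the definition.
        self.features[name] = sql


pipeline = RealtimeMLPipeline()
pipeline.add_feature(
    "rides_last_5m",
    "SELECT region, COUNT(*) FROM rides "
    "GROUP BY region, TUMBLE(rowtime, INTERVAL '5' MINUTE)",
)
print(pipeline.features)
```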
The author, a seasoned Data Engineer, expresses the view that many Data Engineers they have encountered fall into the 'mid' category: average, at best. The author criticizes what they perceive as a lack of initiative, poor-quality work, refusal to learn, and isolationist behavior among these 'mid' Data Engineers. They suggest that to break out of the mid cycle, Data Engineers should show initiative, focus on quality, continuously strive to learn, and be team players rather than working alone. Despite their own desire for solitude, the author emphasizes the damage done by being a 'Lone Wolf' in a team setting.
Reducing Data Platform Cost by $2M | 15mins
Razorpay, a technology company, shares its approach to reducing its Data Platform costs by approximately $2M per year. The company runs its platform extensively on AWS, with Databricks compute billed in DBUs. By evaluating the storage and compute needs of its vast data operation, including transactional data, merchant reporting, and its large S3 storage buckets, Razorpay was able to identify areas of inefficiency and cost savings.
Key strategies included:
1. Identification of unused S3 storage and application of Lifecycle (LC) Policies to manage data inflow and outflow, saving around 2PB of storage (see the sketch after this list).
2. Application of Z-order indexing to optimize data reading and report creation time, reducing time and cost by about 20% and 25% respectively.
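The post does not include Razorpay's actual policy, so the bucket name, prefix, and retention windows below are assumptions; this is a minimal boto3 sketch of an S3 Lifecycle policy that transitions aging objects to cheaper storage classes and eventually expires them.

```python
import boto3

# Hypothetical bucket, prefix, and retention windows; Razorpay's real
# policy is not published in the post.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-expire-raw-events",
                "Filter": {"Prefix": "raw/events/"},
                "Status": "Enabled",
                # Move objects to cheaper storage as they age...
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                # ...and delete them entirely after a year.
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```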
Mixpanel, a user analytics platform, managed to cut its GCP BigQuery costs by 80%. The company tackled a sudden increase in internal data spend by analyzing its BigQuery utilization. Upon finding that the internal data team was incurring five-figure charges a month, they decided to investigate more closely. Mixpanel used BigQuery's INFORMATION_SCHEMA.JOBS view to identify the most intensive and expensive queries, built interactive reports to monitor ongoing spend, optimized the costliest queries, and set up alerts for cost-spike notifications. Mixpanel shared the method and code used, to help others achieve the same result.
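Mixpanel published their own code; the snippet below is not it. It is a generic sketch, assuming the google-cloud-bigquery client and a US-region project, of querying INFORMATION_SCHEMA.JOBS to surface the most expensive recent queries by bytes billed.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP project/credentials

# INFORMATION_SCHEMA.JOBS is regional; adjust `region-us` to your location.
sql = """
SELECT
  user_email,
  job_id,
  total_bytes_billed / POW(1024, 4) AS tib_billed,
  LEFT(query, 120) AS query_preview
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
ORDER BY total_bytes_billed DESC
LIMIT 20
"""

# Iterating the query job waits for and yields the result rows.
for row in client.query(sql):
    print(f"{row.tib_billed:8.2f} TiB  {row.user_email}  {row.query_preview}")
```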
The article discusses how ChatGPT, an AI language model, has gained popularity for generating text-based responses and working with unstructured text data. It highlights that ChatGPT has limitations in handling structured data and quantitative reasoning tasks due to its training focus. The goal is to turn ChatGPT into a powerful business analytics assistant by teaching it to leverage tools and domain knowledge for data analytics.
The implementation uses two agents (a data engineer and a data scientist) that collaborate to solve problems, with tools provided to assist them in their tasks. Streamlit serves as the application platform for visualization and user interaction.
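The article ships its own implementation; the following is only a minimal sketch of the two-agent pattern, assuming the openai Python client (the model name is a placeholder) wrapped in Streamlit. The prompts and the hand-off logic are invented for illustration.

```python
import streamlit as st
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def ask(role_prompt: str, task: str) -> str:
    """One chat completion acting as a single 'agent'."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption; any chat model works
        messages=[
            {"role": "system", "content": role_prompt},
            {"role": "user", "content": task},
        ],
    )
    return resp.choices[0].message.content


st.title("Two-agent analytics assistant (sketch)")
question = st.text_input("Business question")

if question:
    # Agent 1: the "data engineer" drafts SQL against an assumed schema.
    sql = ask("You are a data engineer. Respond with SQL only.", question)
    st.code(sql, language="sql")
    # Agent 2: the "data scientist" interprets the proposed query (not
    # actually executed here) and explains what it would tell the business.
    analysis = ask("You are a data scientist. Explain the analysis.",
                   f"Question: {question}\nProposed SQL: {sql}")
    st.write(analysis)
```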
This use case, shown on the Databricks blog, walks through the technical implementation and architecture of a credit decisioning demo on the Databricks Lakehouse, demonstrating how data unification, feature engineering, and machine learning models can be used to serve underbanked customers effectively. It emphasizes the importance of explainability, fairness, and data democratization in making data accessible to non-data teams and business users.
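The demo's notebooks contain the actual code; as a generic illustration of the explainability piece, here is a minimal sketch, assuming a tree-based scikit-learn model and the shap library, of producing per-feature contributions for a single credit decision. The synthetic data is a stand-in for the demo's real feature tables.

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for a credit-risk training set; the real demo uses
# unified customer data on the Lakehouse.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
model = GradientBoostingClassifier().fit(X, y)

# TreeExplainer computes fast, exact SHAP values for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:1])  # explain one applicant

# Per-feature contribution to this applicant's score, signed.
for i, contribution in enumerate(shap_values[0]):
    print(f"feature_{i}: {contribution:+.4f}")
```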
The article presents six proven methods for improving Databricks SQL performance, based on Adevinta's experience after its acquisition of eBay Classifieds Group. The massive amount of data (a ~500TB Google Analytics table) required robust analytical tools, and despite initial challenges, Databricks SQL was chosen and sped up run times by 300%. The six optimization methods include:
1. Use Delta - Delta 2.0 provides fast query performance, high scalability, and ACID transactions. The 'CONVERT TO DELTA ... NO STATISTICS' syntax speeds up conversion by skipping statistics collection.
2. Be intentional about file layouts - The arrangement of data on disk and the optimization of partitions can drastically improve performance (both points are sketched after this list).
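The article itself is SQL-centric; the sketch below is a generic PySpark illustration, with a hypothetical table path and columns, of the two steps above: converting Parquet to Delta without collecting statistics, then laying files out with OPTIMIZE ... ZORDER BY (the same Z-order technique Razorpay applied).

```python
from pyspark.sql import SparkSession

# Requires the delta-spark package; the path and column names are
# hypothetical.
spark = (
    SparkSession.builder.appName("delta-optimize-sketch")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Step 1: convert an existing Parquet directory to Delta, skipping the
# expensive statistics-collection pass.
spark.sql("CONVERT TO DELTA parquet.`/data/events` NO STATISTICS")

# Step 2: rewrite the file layout so rows with similar (event_date, country)
# values land in the same files, cutting the data scanned per query.
spark.sql("OPTIMIZE delta.`/data/events` ZORDER BY (event_date, country)")
```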
Apache Celeborn is a shuffle service for Spark that aims to decouple storage and computation for shuffling. Shuffle is the process of rearranging data across partitions, which can be resource-intensive and lead to high network I/O overhead. Celeborn is essentially an external shuffle service that manages shuffle files for distributed systems. It has been tested on both YARN and K8s and requires no external dependencies. Unlike Spark's native ESS, Celeborn is not launched on every node and does not tie shuffle data's lifecycle to the executors' lifecycle. Celeborn consists of a master, responsible for resource allocation and state management, and workers that process the shuffle data.
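As a rough illustration, here is how a Spark job might be pointed at a Celeborn cluster. The master endpoint is a placeholder, and the exact config keys and shuffle-manager class vary by Celeborn version, so treat this as a sketch under those assumptions rather than the canonical setup.

```python
from pyspark.sql import SparkSession

# Sketch: config keys follow Celeborn's docs for recent releases, but check
# the version you deploy; the master endpoint below is a placeholder.
spark = (
    SparkSession.builder.appName("celeborn-shuffle-sketch")
    # Route shuffle writes/reads through Celeborn instead of local disk + ESS.
    .config("spark.shuffle.manager",
            "org.apache.spark.shuffle.celeborn.SparkShuffleManager")
    # Where the Celeborn master lives; it allocates workers per shuffle.
    .config("spark.celeborn.master.endpoints", "celeborn-master:9097")
    # Celeborn manages shuffle data itself, so Spark's ESS stays off.
    .config("spark.shuffle.service.enabled", "false")
    .getOrCreate()
)
```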