Art of Data Newsletter - Issue #16
Photo by Emmanuel Codden: https://www.pexels.com/photo/houses-in-a-residential-district-16991419/

Welcome all Data fanatics. In today's issue:

• Building Real-time Machine Learning Foundations at Lyft
• Most Data Engineers are Mid
• Reducing Data Platform Cost by $2M
• How we cut BigQuery costs 80% by hunting costly queries
• Automating data analytics with ChatGPT
• How to Build a Credit Data Platform on the Databricks Lakehouse
• Six tried and tested ways to turbocharge Databricks SQL
• Apache Celeborn — Shuffle Service for Spark

Let's dive in!


Building Real-time Machine Learning Foundations at Lyft | 17mins

Lyft has built capabilities for real-time machine learning (ML) on streaming data, enabling its hundreds of ML developers to make timely data-driven decisions, retrain models, and run inference calls. Its RealtimeMLPipeline interface simplifies integrating streaming data into ML models, and applying it uniformly across environments sped up development, cutting build times to days. Despite the steep learning curve and the complexity of building streaming applications, the technology has been adopted successfully across the business, with use cases ranging from reducing bias in models to computing safety features for drivers.
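The article doesn't publish the RealtimeMLPipeline API itself, but the core of such a system is computing features over a moving window of streamed events. The sketch below is hypothetical (class and method names are mine, not Lyft's) and only illustrates the kind of windowed feature, such as "rides requested in the last minute", that a streaming job would maintain:

```python
from collections import deque

class SlidingWindowFeature:
    """Rolling aggregate over a stream of timestamped events.

    Hypothetical sketch only -- Lyft's actual RealtimeMLPipeline interface
    is not shown in the article.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, value) pairs, oldest first

    def add(self, ts, value):
        self._events.append((ts, value))

    def value(self, now):
        # Evict events that fell out of the window, then aggregate the rest.
        while self._events and self._events[0][0] < now - self.window_seconds:
            self._events.popleft()
        return sum(v for _, v in self._events)

# Count ride requests over a 60-second window.
feature = SlidingWindowFeature(window_seconds=60)
for ts in (10, 50, 90):
    feature.add(ts, 1.0)
print(feature.value(now=100))  # event at ts=10 has expired -> 2.0
```

A production version would read from a stream (e.g. Kafka/Flink) and serve the feature value to inference calls, but the eviction-then-aggregate pattern is the same.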


Most Data Engineers are Mid - by Daniel Beach | 9mins

The author, a seasoned Data Engineer, argues that many Data Engineers they have encountered are 'mid', or average at best. They criticize a lack of initiative, poor-quality work, a refusal to learn, and isolationist behavior among these engineers. To break out of the mid cycle, they suggest Data Engineers should show initiative, focus on quality, keep learning, and be team players rather than working alone. Despite their own preference for solitude, the author stresses the damage a 'Lone Wolf' does in a team setting.


Reducing Data Platform Cost by $2M | 15mins

Razorpay, a technology company, shares how it reduced its Data Platform costs by approximately $2M per year. The company relies heavily on AWS for storage and Databricks (billed in DBUs) for compute. By evaluating the storage and compute needs of its large data operation, including transactional data, merchant reporting, and its extensive S3 buckets, Razorpay identified areas of inefficiency and cost savings.

Key strategies included:

1. Identification of unused S3 storage and application of Lifecycle (LC) Policies to manage data inflow and outflow, saving around 2PB of storage.

2. Application of Z-order indexing to optimize data reading and report creation, reducing time and cost by about 20% and 25% respectively.
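The first strategy maps to S3 Lifecycle configuration. The rule below is illustrative (the prefix, day counts, and storage classes are my placeholders, not Razorpay's actual policy): it tiers aging objects into cheaper storage classes and expires them entirely after a year.

```json
{
  "Rules": [
    {
      "ID": "tier-and-expire-stale-data",
      "Filter": { "Prefix": "tmp/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
```

The second strategy, on Databricks, is a single SQL statement of the form `OPTIMIZE my_table ZORDER BY (some_filter_column)` (table and column names hypothetical), which co-locates rows that are frequently filtered together so queries skip more files.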


How we cut BigQuery costs 80% by hunting costly queries | 27mins

Mixpanel, a user analytics platform, cut its GCP BigQuery costs by 80%. The company tackled a sudden increase in internal data spend by analyzing its BigQuery utilization. Upon discovering that its internal data team was incurring five-figure charges per month, it decided to look into the issue more closely. Mixpanel used BigQuery's INFORMATION_SCHEMA.JOBS view to identify the most intensive and expensive queries, built interactive reports to monitor ongoing spend, optimized the costliest queries, and set up alerts for cost spikes. Mixpanel shared the method and code to help others achieve the same result.
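A query in this spirit is sketched below. It is not Mixpanel's exact code, but the INFORMATION_SCHEMA.JOBS view named in the article does expose these fields; the $5/TiB figure is the historical on-demand rate and should be checked against current pricing:

```sql
-- Rank the most expensive BigQuery jobs from the last 30 days.
SELECT
  user_email,
  query,
  total_bytes_billed,
  total_bytes_billed / POW(1024, 4) * 5.0 AS approx_cost_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
ORDER BY total_bytes_billed DESC
LIMIT 20;
```

Grouping the same data by `user_email` or by a query fingerprint is what turns this into the ongoing spend report and alerting the article describes.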


Automating data analytics with ChatGPT | 10mins

The article discusses how ChatGPT, an AI language model, has gained popularity for text-based responses and working with unstructured text data. It highlights that ChatGPT has limitations in handling structured data and quantitative reasoning tasks due to its training focus. The goal is to turn ChatGPT into a powerful business analytics assistant by teaching it to leverage tools and domain knowledge for data analytics.

The implementation uses two agents (a data engineer and a data scientist) that collaborate to solve problems, with tools provided to assist them in their tasks. Streamlit serves as the application platform for visualization and user interaction.
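The agents-plus-tools pattern can be reduced to a tiny sketch: tools register themselves in a registry, and a plan (which in the article ChatGPT would produce) is executed step by step. Everything below is illustrative; the tool names, stub data, and dispatch logic are mine, not the article's implementation:

```python
TOOLS = {}

def tool(name):
    """Register a function so an agent can invoke it by name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("run_sql")
def run_sql(_prev):
    # Stand-in for a warehouse query issued by the "data engineer" agent.
    return [{"region": "EU", "revenue": 120}, {"region": "US", "revenue": 200}]

@tool("summarize")
def summarize(rows):
    # Stand-in for the "data scientist" agent's analysis of the result.
    total = sum(r["revenue"] for r in rows)
    return f"Total revenue across {len(rows)} regions: {total}"

def run_plan(steps):
    """Run tools in sequence, feeding each one the previous result."""
    result = None
    for name in steps:
        result = TOOLS[name](result)
    return result

print(run_plan(["run_sql", "summarize"]))
# prints: Total revenue across 2 regions: 320
```

In the real system, the plan itself comes from the language model and the tools wrap actual database and plotting calls, but the dispatch loop stays this simple.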


How to Build a Credit Data Platform on the Databricks Lakehouse | 15mins

This use case, shown on the Databricks blog, showcases the technical implementation and architecture of a credit-decisioning demo on the Databricks Lakehouse, demonstrating how data unification, feature engineering, and machine learning models can serve underbanked customers effectively. It emphasizes the importance of explainability, fairness, and data democratization in making data accessible to non-data teams and business users.


Six tried and tested ways to turbocharge Databricks SQL | 15mins

The article discusses six proven methods for enhancing the performance of Databricks SQL, based on Adevinta's experience after its acquisition of eBay Classifieds Group. The massive amount of data (a ~500TB Google Analytics table) required robust analytical tooling, and despite initial challenges, Databricks SQL was chosen, ultimately improving query run times by roughly 300%. The six optimization methods include:

1. Use Delta - Delta 2.0 provides fast query performance, high scalability, and ACID transactions. The 'CONVERT TO DELTA NO STATISTICS' syntax speeds up conversion by skipping statistics collection.

2. Be intentional about file layouts - The arrangement of data on disk and the optimization of partitions can drastically improve performance.
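The first two methods translate to short Databricks SQL statements. The table and column names below are hypothetical placeholders, not Adevinta's schema:

```sql
-- 1. Convert an existing Parquet table to Delta; NO STATISTICS skips the
--    statistics-collection pass, making the conversion itself much faster.
CONVERT TO DELTA analytics.events NO STATISTICS;

-- 2. Be intentional about file layout: compact small files and co-locate
--    rows on a commonly filtered column so queries skip more data.
OPTIMIZE analytics.events ZORDER BY (event_date);
```

Statistics skipped at conversion time can be computed later, once the table is already queryable as Delta.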


Apache Celeborn — Shuffle Service for Spark | 6mins

Apache Celeborn is a shuffle service that decouples storage and compute for shuffling in Spark. Shuffle, the process of rearranging data across partitions, can be resource-intensive and cause high network I/O overhead. Celeborn is essentially an external shuffle service that manages shuffle files for distributed systems. It has been tested on both YARN and K8s and requires no external dependencies. Unlike Spark's native External Shuffle Service (ESS), Celeborn is not launched on every node and does not tie the lifecycle of shuffle data to the lifecycle of executors. Celeborn consists of a master, responsible for resource allocation and state management, and workers that handle shuffle data processing.
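Wiring Spark to Celeborn is configuration-only on the Spark side. The fragment below is a sketch based on Celeborn's documented Spark integration; the endpoint host and port are placeholders, and exact property names should be checked against the docs for your Celeborn version:

```properties
# Route Spark's shuffle through Celeborn instead of the built-in mechanism.
spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager
# Celeborn master(s) that allocate workers and track shuffle state.
spark.celeborn.master.endpoints=celeborn-master:9097
# Celeborn owns the shuffle files, so Spark's node-local ESS is not needed.
spark.shuffle.service.enabled=false
```

Because the shuffle data now lives with Celeborn workers rather than executors, executors can be decommissioned (e.g. on spot nodes or with dynamic allocation) without losing shuffle output.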
