Art of Data Newsletter - Issue #16
Welcome, all Data fanatics. In today's issue: real-time streaming ML at Lyft; escaping the 'mid' Data Engineer cycle; how Razorpay cut its Data Platform costs by $2M a year; Mixpanel's 80% BigQuery cost reduction; turning ChatGPT into a business analytics assistant; credit decisioning on the Databricks Lakehouse; six ways to speed up Databricks SQL; and Apache Celeborn, a remote shuffle service for Spark.
Let's dive in!
Lyft has developed capabilities for real-time machine learning (ML) with streaming data, enabling the company's hundreds of ML developers to make timely data-driven decisions, retrain models, and run inference calls. The company built a RealtimeMLPipeline interface that simplifies integrating streaming into ML models; applying it uniformly across environments sped up development, cutting build time to days. Despite the steep learning curve and complex software development involved in streaming applications, Lyft has seen successful adoption of the technology across its business, with use cases ranging from reducing bias in models to computing safety features for drivers.
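Lyft's post does not publish the RealtimeMLPipeline code itself, so everything in the sketch below (the method names, the SQL feature definition) is hypothetical; it only illustrates the idea of defining a streaming feature once and reusing that definition across environments.

```python
# Hypothetical sketch only: Lyft's actual RealtimeMLPipeline API is internal
# and not shown in the post. Names and signatures are invented to illustrate
# the "define a streaming feature once, apply it everywhere" idea.
from dataclasses import dataclass, field


@dataclass
class RealtimeMLPipeline:
    """Collects streaming feature definitions so the same definitions
    can be applied uniformly across dev, staging, and prod."""
    features: dict = field(default_factory=dict)

    def add_feature(self, name: str, sql: str) -> None:
        # In a real system this would register a streaming job (e.g. Flink
        # SQL); here we only store the definition.
        self.features[name] = sql


pipeline = RealtimeMLPipeline()
pipeline.add_feature(
    "rides_last_5m",
    "SELECT region, COUNT(*) FROM rides "
    "GROUP BY region, TUMBLE(rowtime, INTERVAL '5' MINUTE)",
)
print(pipeline.features)
```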
The author, a seasoned Data Engineer, expresses the view that many Data Engineers they have encountered fall into the 'mid' category: average, at best. The author criticizes what they perceive as a lack of initiative, poor-quality work, refusal to learn, and isolationist behavior among these 'mid' Data Engineers. They suggest that to break out of the mid cycle, Data Engineers should show initiative, focus on quality, continuously strive to learn, and be team players rather than working alone. Despite their own desire for solitude, the author emphasizes the damage done by being a 'Lone Wolf' in a team setting.
Reducing Data Platform Cost by $2M | 15mins
Razorpay, a technology company, shares its approach to reducing its Data Platform costs by approximately $2M per year. The company runs its platform extensively on AWS, with Databricks compute billed in DBUs. By evaluating the storage and compute needs of its vast data operation, including transactional data, merchant reporting, and its large S3 storage buckets, Razorpay was able to identify areas of inefficiency and cost savings.
Key strategies included:
1. Identification of unused S3 storage and application of Lifecycle (LC) Policies to manage data inflow and outflow, saving around 2PB of storage (see the sketch after this list).
2. Application of Z-order indexing to optimize data reading and report creation time, reducing time and cost by about 20% and 25% respectively.
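The post does not include Razorpay's actual policy, so the bucket name, prefix, and retention windows below are assumptions; this is a minimal boto3 sketch of an S3 Lifecycle policy that transitions aging objects to cheaper storage classes and eventually expires them.

```python
import boto3

# Hypothetical bucket, prefix, and retention windows; Razorpay's real
# policy is not published in the post.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-expire-raw-events",
                "Filter": {"Prefix": "raw/events/"},
                "Status": "Enabled",
                # Move objects to cheaper storage as they age...
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                # ...and delete them entirely after a year.
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```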
Mixpanel, a user analytics platform, managed to cut its GCP BigQuery costs by 80%. The company tackled a sudden increase in internal data spend by analyzing its BigQuery utilization. Upon finding that the internal data team was incurring five-figure charges a month, they decided to investigate more closely. Mixpanel used BigQuery's INFORMATION_SCHEMA.JOBS view to identify the most intensive and expensive queries, built interactive reports to monitor ongoing spend, optimized the costliest queries, and set up alerts for cost-spike notifications. Mixpanel shared the method and code used, to help others achieve the same result.
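Mixpanel published their own code; the snippet below is not it. It is a generic sketch, assuming the google-cloud-bigquery client and a US-region project, of querying INFORMATION_SCHEMA.JOBS to surface the most expensive recent queries by bytes billed.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP project/credentials

# INFORMATION_SCHEMA.JOBS is regional; adjust `region-us` to your location.
sql = """
SELECT
  user_email,
  job_id,
  total_bytes_billed / POW(1024, 4) AS tib_billed,
  LEFT(query, 120) AS query_preview
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
ORDER BY total_bytes_billed DESC
LIMIT 20
"""

# Iterating the query job waits for and yields the result rows.
for row in client.query(sql):
    print(f"{row.tib_billed:8.2f} TiB  {row.user_email}  {row.query_preview}")
```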
The article discusses how ChatGPT, an AI language model, has gained popularity for generating text-based responses and working with unstructured text data. It highlights that ChatGPT has limitations in handling structured data and quantitative reasoning tasks due to its training focus. The goal is to turn ChatGPT into a powerful business analytics assistant by teaching it to leverage tools and domain knowledge for data analytics.
The implementation uses two agents (a data engineer and a data scientist) that collaborate to solve problems, with tools provided to assist them in their tasks. Streamlit serves as the application platform for visualization and user interaction.
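The article ships its own implementation; the following is only a minimal sketch of the two-agent pattern, assuming the openai Python client (the model name is a placeholder) wrapped in Streamlit. The prompts and the hand-off logic are invented for illustration.

```python
import streamlit as st
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def ask(role_prompt: str, task: str) -> str:
    """One chat completion acting as a single 'agent'."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption; any chat model works
        messages=[
            {"role": "system", "content": role_prompt},
            {"role": "user", "content": task},
        ],
    )
    return resp.choices[0].message.content


st.title("Two-agent analytics assistant (sketch)")
question = st.text_input("Business question")

if question:
    # Agent 1: the "data engineer" drafts SQL against an assumed schema.
    sql = ask("You are a data engineer. Respond with SQL only.", question)
    st.code(sql, language="sql")
    # Agent 2: the "data scientist" interprets the proposed query (not
    # actually executed here) and explains what it would tell the business.
    analysis = ask("You are a data scientist. Explain the analysis.",
                   f"Question: {question}\nProposed SQL: {sql}")
    st.write(analysis)
```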
This use case, shown on the Databricks blog, walks through the technical implementation and architecture of a credit decisioning demo on the Databricks Lakehouse, demonstrating how data unification, feature engineering, and machine learning models can be used to serve underbanked customers effectively. It emphasizes the importance of explainability, fairness, and data democratization in making data accessible to non-data teams and business users.
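The demo's notebooks contain the actual code; as a generic illustration of the explainability piece, here is a minimal sketch, assuming a tree-based scikit-learn model and the shap library, of producing per-feature contributions for a single credit decision. The synthetic data is a stand-in for the demo's real feature tables.

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for a credit-risk training set; the real demo uses
# unified customer data on the Lakehouse.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
model = GradientBoostingClassifier().fit(X, y)

# TreeExplainer computes fast, exact SHAP values for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:1])  # explain one applicant

# Per-feature contribution to this applicant's score, signed.
for i, contribution in enumerate(shap_values[0]):
    print(f"feature_{i}: {contribution:+.4f}")
```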
The article presents six proven methods for improving Databricks SQL performance, based on Adevinta's experience after its acquisition of eBay Classifieds Group. The massive amount of data (a ~500TB Google Analytics table) required robust analytical tools, and despite initial challenges, Databricks SQL was chosen and sped up run times by 300%. The six optimization methods include:
1. Use Delta - Delta 2.0 provides fast query performance, high scalability, and ACID transactions. The 'CONVERT TO DELTA ... NO STATISTICS' syntax speeds up conversion by skipping statistics collection.
2. Be intentional about file layouts - The arrangement of data on disk and the optimization of partitions can drastically improve performance (both points are sketched after this list).
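The article itself is SQL-centric; the sketch below is a generic PySpark illustration, with a hypothetical table path and columns, of the two steps above: converting Parquet to Delta without collecting statistics, then laying files out with OPTIMIZE ... ZORDER BY (the same Z-order technique Razorpay applied).

```python
from pyspark.sql import SparkSession

# Requires the delta-spark package; the path and column names are
# hypothetical.
spark = (
    SparkSession.builder.appName("delta-optimize-sketch")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Step 1: convert an existing Parquet directory to Delta, skipping the
# expensive statistics-collection pass.
spark.sql("CONVERT TO DELTA parquet.`/data/events` NO STATISTICS")

# Step 2: rewrite the file layout so rows with similar (event_date, country)
# values land in the same files, cutting the data scanned per query.
spark.sql("OPTIMIZE delta.`/data/events` ZORDER BY (event_date, country)")
```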
Apache Celeborn is a shuffle service for Spark that aims to decouple storage and computation for shuffling. Shuffle is the process of rearranging data across partitions, which can be resource-intensive and lead to high network I/O overhead. Celeborn is essentially an external shuffle service that manages shuffle files for distributed systems. It has been tested on both YARN and K8s and requires no external dependencies. Unlike Spark's native ESS, Celeborn is not launched on every node and does not tie shuffle data's lifecycle to the executors' lifecycle. Celeborn consists of a master, responsible for resource allocation and state management, and workers that process the shuffle data.
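As a rough illustration, here is how a Spark job might be pointed at a Celeborn cluster. The master endpoint is a placeholder, and the exact config keys and shuffle-manager class vary by Celeborn version, so treat this as a sketch under those assumptions rather than the canonical setup.

```python
from pyspark.sql import SparkSession

# Sketch: config keys follow Celeborn's docs for recent releases, but check
# the version you deploy; the master endpoint below is a placeholder.
spark = (
    SparkSession.builder.appName("celeborn-shuffle-sketch")
    # Route shuffle writes/reads through Celeborn instead of local disk + ESS.
    .config("spark.shuffle.manager",
            "org.apache.spark.shuffle.celeborn.SparkShuffleManager")
    # Where the Celeborn master lives; it allocates workers per shuffle.
    .config("spark.celeborn.master.endpoints", "celeborn-master:9097")
    # Celeborn manages shuffle data itself, so Spark's ESS stays off.
    .config("spark.shuffle.service.enabled", "false")
    .getOrCreate()
)
```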