Art of Data Newsletter - Issue #12
Welcome, all Data fanatics! In today's issue:
Let's dive in!
This article discusses how open-sourcing AI libraries and models accelerated AI advancements over the last couple of years, and how the trend toward less permissive licenses now threatens the "free for all" model behind those advancements. Large corporations have begun changing the licenses of popular models from permissive to non-commercial ones, which diminishes the aggregate value of compute and data. As a result, value capture in AI risks becoming concentrated among a few major players.
The renewed focus on data management, driven by large language models such as OpenAI's ChatGPT, has increased the pressure on corporate technology chiefs to ensure their data is adequately stored, filtered, and protected for use with AI.
Metis is Airbnb's data management platform, which enables users to search for, discover, manage, and govern data assets. It is made up of three core products: Dataportal, Unified Metadata Service (UMS), and Lineage Service. Together they let users find and manage millions of data assets, including Apache Hive and Trino datasets, metrics and dimensions, charts and dashboards, data models, machine learning features and models, and teams and employees. UMS plays several roles in data integrations: it provides a GraphQL API layer and a centralized relationship graph, and it manages critical business metadata. The Lineage Service, powered by Apache Atlas, holds a lineage graph of over 100 million nodes and 300 million edges to provide end-to-end lineage across these assets.
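Airbnb hasn't published the UMS schema, so the following is a purely hypothetical sketch of what querying a centralized GraphQL metadata layer could look like; the endpoint, dataset name, and fields are all invented for illustration.

```python
import requests

# Hypothetical query: the endpoint and field names below are invented
# purely to illustrate the shape of a GraphQL metadata API like UMS.
query = """
query {
  dataset(name: "core_data.listings") {
    owner
    description
    upstream { name type }
  }
}
"""
resp = requests.post("https://metis.internal/graphql", json={"query": query})
print(resp.json())
```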
In this episode of DataTalks.Club, Boyan Angelov discusses key principles and best practices for data strategy, sharing his background in the field along the way. The episode runs 55 minutes and 49 seconds and is available in English.
Daft is a distributed dataframe library that lets developers work efficiently with a variety of complex data types from different sources. Built on Rust and the Arrow format for performance, it is both Pythonic and fast. It leverages the Ray framework to scale from small workloads to large ones, and it natively supports complex types such as images. It also offers efficient memory usage, out-of-core processing, and high-performance computing. In benchmarks, Daft has proven consistently faster than popular distributed dataframe libraries such as Spark, Modin, and Dask.
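As a rough illustration of the Pythonic API described above, here is a minimal sketch assuming a recent Daft release; the file path and column names are placeholders.

```python
import daft

# daft.context.set_runner_ray()  # attach to a Ray cluster for distributed runs

# Placeholder path; Daft plans lazily and can process data out-of-core.
df = daft.read_parquet("s3://my-bucket/events/*.parquet")

# Filter and aggregate; nothing executes until collect() is called.
result = (
    df.where(daft.col("status") == "ok")
      .groupby("country")
      .agg(daft.col("latency_ms").mean())
      .collect()
)
print(result)
```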
This article covers the knowledge and steps necessary to properly execute a backfill: re-running or updating a data asset to handle greenfield or brownfield adoption, recover from failures, or otherwise fill in irregularities. It also explains why backfills are easier with partitioning, an approach to incremental data management in which each data asset is viewed as a collection of partitions. Partitioning makes it clear what needs to be backfilled, provides parallelism and fault tolerance even when the underlying code is single-threaded, and helps avoid cost and resource overload. Running a backfill entails steps such as managing the data, planning the run, launching it, monitoring it, and verifying the results.
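To make the partitioning idea concrete, here is a minimal Python sketch; process_partition is a hypothetical worker, and the date range is illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

def process_partition(day: date) -> None:
    # Hypothetical worker: rebuild the data asset for one daily partition.
    print(f"backfilling {day}")

# Two months of daily partitions; each one is an independent unit of work.
days = [date(2023, 1, 1) + timedelta(days=i) for i in range(59)]

# Because partitions are independent, the backfill parallelizes even if
# process_partition itself is single-threaded, and a failed day can be
# retried without redoing the rest.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(process_partition, days))
```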
Open table formats provide additional database-like functionality that reduces the optimization and management overhead of data lakes and improves query performance for use cases involving streaming ingestion, batch loads, change data capture, and processing deletes for privacy regulations. This post reviews the features and capabilities of the three most common open table formats - Apache Hudi, Apache Iceberg, and Delta Lake - and offers guidance on deciding which format best fits specific use case requirements.
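As one concrete illustration of that database-like functionality, the sketch below uses Delta Lake's Python MERGE API to apply change-data-capture records, including deletes; Iceberg and Hudi expose comparable operations through their own APIs. The paths, column names, and staging source are illustrative.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Illustrative inputs: an existing Delta table and a batch of CDC records
# with an "op" column marking delete tombstones.
target = DeltaTable.forPath(spark, "/tmp/lake/users")
changes = spark.read.parquet("/tmp/staging/user_changes")

# Apply updates, deletes, and inserts in one atomic MERGE -- the kind of
# operation plain Parquet directories cannot express.
(
    target.alias("t")
    .merge(changes.alias("c"), "t.user_id = c.user_id")
    .whenMatchedDelete(condition="c.op = 'delete'")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```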
Lyft identified and implemented measures to improve the performance of its streaming pipelines. These include tools such as Pyflame and async-profiler for CPU profiling, the Flink dashboard for operator-level record throughput and resource utilization, and a metrics system for task- and operator-level performance, along with strategies for identifying and tackling data skew, window sizing, service latency, and serialization/deserialization overhead. General guidelines, such as avoiding duplicate operations and unnecessary shuffling and enabling Cython for Python-based pipelines, are also suggested. Finally, network speed is a critical factor for pipeline performance, so all services should be deployed with locality to keep instances close together and reduce latency.
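One common skew-mitigation tactic of the kind the post describes, spreading a hot key across parallel subtasks, can be sketched in plain Python as a two-stage "salted" aggregation; the bucket count and records are illustrative, and in Flink the two stages would be separate keyed operators.

```python
import random
from collections import defaultdict

SALT_BUCKETS = 8
records = [("hot_key", 1)] * 1000 + [("cold_key", 1)] * 10

# Stage 1: pre-aggregate per salted key so no single worker owns all of
# hot_key's traffic.
partials = defaultdict(int)
for key, value in records:
    partials[(key, random.randrange(SALT_BUCKETS))] += value

# Stage 2: strip the salt and merge the (at most SALT_BUCKETS) partials
# per key in a much lighter second aggregation.
totals = defaultdict(int)
for (key, _salt), value in partials.items():
    totals[key] += value

print(dict(totals))  # {'hot_key': 1000, 'cold_key': 10}
```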