Art of Data Newsletter - Issue #12
Photo by yentl jacobs: https://www.pexels.com/photo/grayscale-photo-of-concrete-building-157811/

Welcome all Data fanatics. In today's issue:

  • The rapid explosion of #AI may come to an end, due to protective licensing.
  • #LLMs push companies to get their #Data "right"
  • Airbnb announces their new data management platform - Metis
  • Key principles of good #DataStrategy
  • #Daft - new distributed DataFrame library that uses #Rust and #Arrow format to maximize performance.
  • #Backfills in Data and ML scenarios.
  • Amazon Web Services (AWS)'s guide to open table formats (Hudi vs Iceberg vs Delta)

Let's dive in!


The Golden Age of Open Source in AI Is Coming to an End | 11mins

This article discusses how open-sourcing AI libraries and models created a "free for all" model that accelerated AI advancements over the last couple of years, and how a trend towards less permissive licenses now puts that at risk. Large corporations have begun changing the licenses of popular models from permissive to non-commercial ones, which diminishes the aggregate value of the compute and data behind them. As a result, the value capture for AI is at risk of becoming concentrated among a few major players.


Rush to Use Generative AI Pushes Companies to Get Data in Order | 4mins

Large language models, such as those behind OpenAI's ChatGPT, have renewed the focus on data management. This has increased the pressure on corporate technology chiefs to ensure their data is adequately stored, filtered, and protected for use with AI.


Metis: Building Airbnb’s Next Generation Data Management Platform | 12mins

Metis is Airbnb's data management platform, which enables users to search, discover, manage, and govern data assets. It is made up of three core products: Dataportal, Unified Metadata Service (UMS), and Lineage Service. Together they allow users to find and manage millions of data assets, including Apache Hive and Trino datasets, metrics and dimensions, charts and dashboards, data models, machine learning features and models, and teams and employees. UMS plays various roles in data integrations, including providing a GraphQL API layer and a centralized relationship graph, and managing critical business metadata. The Lineage Service is powered by Apache Atlas, which holds a large lineage graph of over 100 million nodes and 300 million edges to provide


Data Strategy: Key Principles and Best Practices - Boyan Angelov - DataTalks.Club

In this episode of DataTalks.Club, Boyan Angelov discusses key principles and best practices for data strategy. He shares his background in the field and information related to data strategy. This episode is 55 minutes and 49 seconds long and available to view in English.


Introducing Daft: A High-Performance Distributed Dataframe Library for Multimodal Data | 8mins

Daft is a distributed dataframe library that enables developers to work efficiently with a variety of complex data types from different sources. It uses Rust and the Arrow format to maximize performance while keeping a Pythonic interface. It leverages the Ray framework to process data at both small and large scales, and it natively supports complex types like images. It also offers benefits such as efficient memory usage, out-of-core processing, and high-performance computing. In the benchmarks presented in the article, Daft was consistently faster than popular distributed dataframe libraries such as Spark, Modin, and Dask.


Backfills in Data & Machine Learning: A Primer | 13mins

This article covers the knowledge and steps necessary to properly execute a backfill: an update of a data asset to deal with greenfield or brownfield usage, to recover from failures, or to otherwise fill in irregularities. The article also explains how backfills become easier with partitioning, an approach to incremental data management in which each data asset is viewed as a collection of partitions. Partitioning helps to understand what needs to be backfilled, to benefit from parallelism and fault tolerance even when single-threaded code is used, and to avoid cost and resource overload. Running a backfill entails steps such as managing the data, planning the backfill, launching it, monitoring it, and verifying the results.
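
The partitioning idea can be illustrated with a minimal Python sketch. It is not taken from the article: the daily partition scheme and the names `plan_backfill`, `run_backfill`, and `job` are assumptions made for illustration. It plans a backfill by finding the missing partitions, then runs a single-threaded job per partition in parallel, so that a failure affects only its own partition.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta
from typing import Callable


def daily_partitions(start: date, end: date) -> list[date]:
    """Return every daily partition key from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def plan_backfill(start: date, end: date, materialized: set[date]) -> list[date]:
    """List the partitions in the range that are missing and need a backfill."""
    return [p for p in daily_partitions(start, end) if p not in materialized]


def run_backfill(
    partitions: list[date],
    job: Callable[[date], None],  # hypothetical single-threaded job for one partition
    max_workers: int = 4,
) -> tuple[list[date], list[date]]:
    """Run the job once per partition in parallel; return (succeeded, failed)."""

    def attempt(partition: date) -> tuple[date, bool]:
        try:
            job(partition)
            return partition, True
        except Exception:
            # A failure is recorded for this partition only; the others continue.
            return partition, False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(attempt, partitions))

    succeeded = [p for p, ok in results if ok]
    failed = [p for p, ok in results if not ok]
    return succeeded, failed
```

The `failed` list can be passed back into `run_backfill` to retry only the partitions that did not complete, which is also a simple way to verify the results at the end.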


Choosing an open table format for your transactional data lake on AWS | 19mins


Open table formats provide additional database-like functionality that simplifies the optimization and management overhead of data lakes, improving query performance for use cases involving streaming ingestion, batch loads, change data capture, and processing deletes for privacy regulations. This post reviews the features and capabilities of the three most common open table formats available to support various use cases - Apache Hudi, Apache Iceberg, and Delta Lake - and provides guidance when making the decision about which format best fits the specific use case requirements.


Profiling & Performance Improvements of Streaming Pipelines | 14mins

Lyft identified and implemented measures to improve the performance of their streaming pipelines. These include using tools such as Pyflame and async-profiler for CPU utilization, the Flink dashboard for operator-level record throughput and resource utilization, and a metrics system for task- and operator-level performance, along with various strategies for identifying and tackling issues with data skew, window size, service latency, and serialization/deserialization. The article also suggests general guidelines for improving performance, such as avoiding duplicate operations and unnecessary shuffling, and enabling Cython for Python-based pipelines. Lastly, network speed is a critical factor for pipeline performance, so it is important to deploy all services with locality in mind to keep instances close and reduce latency.
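
As a rough illustration of what identifying and tackling data skew can look like, here is a small Python sketch. It is not Lyft's implementation: the `skew_ratio` metric and the key-salting approach in `salt_key` are assumptions chosen for illustration, and both function names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable


def skew_ratio(keys: Iterable[Hashable]) -> float:
    """Ratio of the busiest key's record count to the mean count per key.

    A value near 1.0 means records are spread evenly across keys;
    a large value means one key dominates and may overload one worker.
    """
    counts = Counter(keys)
    if not counts:
        raise ValueError("no keys to measure")
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean


def salt_key(key: str, record_id: int, buckets: int) -> str:
    """Spread one hot key over several sub-keys (key salting).

    Records are aggregated per salted key first, then combined per
    original key in a second step.
    """
    return f"{key}#{record_id % buckets}"
```

For example, a stream with eight records for key "a" and one each for "b" and "c" has a skew ratio of 2.4, and salting "a" into four buckets spreads its records over the sub-keys "a#0" to "a#3".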
