Art of Data Newsletter - Issue #14
Photo by eberhard grossgasteiger: https://www.pexels.com/photo/mountain-covered-snow-under-star-572897/


Welcome all Data fanatics. In today's issue:

  • Databricks announces LakehouseIQ - LLM-based Assistant for working with Data
  • Announcement of #DeltaLake 3.0 - query Delta using Hudi or Iceberg formats
  • Materialized Views and Streaming Tables for #DatabricksSQL
  • Reference architectures for Large Language Model (#LLM)
  • Detecting AI-Generated Profile Photos by LinkedIn
  • How to become a valuable #DataEngineer
  • Optimizing costs of a Data Lakehouse
  • Intro to #ApacheIceberg

Let's dive in!


Introducing LakehouseIQ: The AI-Powered Engine that Uniquely Understands Your Business | 6mins

With LakehouseIQ, Databricks is changing how people interact with data by making it accessible, intelligible, and actionable. LakehouseIQ uses information about a company's data, usage patterns, and org chart to understand the business's jargon and deliver a more personalized experience by answering natural language queries. It also provides AI-based automatic data classification, monitoring, and Lakehouse Federation, helping to democratize all data in the enterprise in a secure way.


Announcing Delta Lake 3.0 with New Universal Format and Liquid Clustering | 10mins

Delta Lake 3.0 is the latest release of the Linux Foundation open-source Delta Lake project. It introduces powerful features to improve compatibility and expand the ecosystem. Delta Universal Format (UniForm) allows Delta Lake tables to be read as if they were Iceberg or Hudi, and Delta Kernel simplifies building Delta connectors by providing APIs that hide complex details. Liquid Clustering is a flexible data management technique that adjusts the data layout for the best query performance. These features are designed to help companies move to an open data lakehouse without fear of lock-in, boost performance, and save time and resources on partitioning strategies.


Introducing Materialized Views and Streaming Tables for Databricks SQL | 10mins

Databricks SQL now offers streaming ingestion from any data source and pre-computed Materialized Views (MVs) for improved query speeds. This allows SQL analysts to quickly set up ETL pipelines with simple code, such as creating a streaming table that continuously ingests data from an S3 location and a materialized view that is incrementally updated. The benefits of Streaming Tables and Materialized Views include faster BI dashboards, reduced data processing costs, improved data access control, real-time use cases, and better scalability. Customers who previewed the product have seen significant performance improvements and cost savings.


Emerging Architectures for LLM Applications | 23mins

This article describes a reference architecture for the emerging Large Language Model (LLM) app stack, which combines private data with off-the-shelf LLMs to enable in-context learning, a design pattern for building software with LLMs. The stack includes data preprocessing and embedding, prompt construction and execution, vector databases, orchestration frameworks, proprietary and open-source language models, and operational tools for validation and caching. As the underlying LLM technology advances, this architecture may change substantially.
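The in-context learning pattern described above can be sketched in a few lines: embed documents ahead of time, retrieve the most relevant one for a query, and splice it into the prompt. This is a minimal, self-contained illustration; the tiny hand-made embeddings and the in-memory store stand in for a real embedding model and vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Pretend vector store: (embedding, text) pairs produced at preprocessing time.
STORE = [
    ([0.9, 0.1, 0.0], "Delta Lake 3.0 adds UniForm and Liquid Clustering."),
    ([0.0, 0.2, 0.9], "LinkedIn detects AI-generated profile photos."),
]

def retrieve(query_embedding, k=1):
    """Return the k documents most similar to the query embedding."""
    ranked = sorted(STORE, key=lambda item: cosine(item[0], query_embedding),
                    reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(question, query_embedding):
    """Construct an LLM prompt with the retrieved context prepended."""
    context = "\n".join(retrieve(query_embedding))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("What is new in Delta Lake?", [1.0, 0.0, 0.0])
print(prompt)
```

An orchestration framework automates exactly these steps (plus chunking, caching, and validation) at scale.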


New Approaches For Detecting AI-Generated Profile Photos | 9mins

LinkedIn's Trust Data Team has partnered with academia to research and develop advanced AI-generated profile photo detection techniques. This collaboration has produced a novel approach that detects 99.6% of a common type of AI-generated profile photo while misclassifying a real profile photo as synthetic only 1% of the time. The research also showed that this approach outperforms the state-of-the-art CNN model. These techniques help LinkedIn strengthen its automated anti-abuse defenses.
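The two figures quoted above are a true-positive rate and a false-positive rate. A tiny illustration with made-up counts (the article reports rates only, not raw counts) shows how each is computed:

```python
# TPR = fakes caught / all fakes; FPR = real photos wrongly flagged / all real.
# Counts below are invented solely to reproduce the reported rates.
fakes_total, fakes_caught = 1000, 996   # 99.6% of fakes detected
real_total, real_flagged = 1000, 10     # 1% of real photos misflagged

tpr = fakes_caught / fakes_total
fpr = real_flagged / real_total

print(f"TPR = {tpr:.1%}, FPR = {fpr:.1%}")  # TPR = 99.6%, FPR = 1.0%
```

Reporting both rates matters: a detector tuned only for a high catch rate could trivially flag everything, so the low false-positive rate is what makes the result useful in production.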


How to become a valuable data engineer | 10mins

Becoming a valuable data engineer involves understanding the importance of business impact, mastering the relevant technical skills, and selecting projects that deliver maximum impact. It is also important to develop an understanding of business metrics, the fundamentals of data storage and processing, and orchestration and scheduling. Additionally, describing work experience on a resume using the STAR method helps demonstrate impact.


Optimizing costs of a Data Lakehouse | 6mins

Data engineering teams need to establish a well-defined strategy for storing data and periodically review it to ensure optimal performance and cost efficiency. This strategy should encompass important considerations such as data backups, data version control, handling cold data, and determining the default partition strategy. By actively addressing these aspects, teams can achieve significant benefits, as demonstrated by a use case that resulted in a remarkable 78% reduction in total costs.


Fokko Driesprong - Tip of the Iceberg | 40mins

In the talk, Fokko introduces Iceberg and its history, highlighting the companies that use and contribute to it. He dives into the underlying concepts of Iceberg, including metadata, manifest lists, and manifests, and explains how these concepts benefit the query engine and ensure data correctness. He also discusses the evolution of schema, partitioning, and sorting in Iceberg and how these changes can be made lazily without requiring a complete rewrite of massive tables. Finally, he demonstrates the usage of PyIceberg, a Python library for working with Iceberg.
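The metadata hierarchy mentioned in the talk can be pictured as a small tree: a snapshot points to a manifest list, which points to manifests, which record data files along with per-file statistics. This is a toy model (plain dictionaries, not the real PyIceberg API) showing why the hierarchy benefits the query engine: file-level stats let it skip files without opening them.

```python
# Toy model of Iceberg's metadata tree. Each manifest lists data files
# together with column stats (here, min/max of an "id" column).
manifest_a = {"data_files": [
    {"path": "data/f1.parquet", "min_id": 1, "max_id": 100},
    {"path": "data/f2.parquet", "min_id": 101, "max_id": 200},
]}
manifest_b = {"data_files": [
    {"path": "data/f3.parquet", "min_id": 201, "max_id": 300},
]}
snapshot = {"manifest_list": [manifest_a, manifest_b]}

def plan_scan(snapshot, wanted_id):
    """Prune data files using manifest stats, as a query engine would."""
    return [
        f["path"]
        for manifest in snapshot["manifest_list"]
        for f in manifest["data_files"]
        if f["min_id"] <= wanted_id <= f["max_id"]
    ]

print(plan_scan(snapshot, 150))  # only f2.parquet needs to be read
```

Because each snapshot is an immutable tree like this, schema and partition changes can be recorded lazily in new metadata rather than by rewriting the table's data files.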
