Art of Data Newsletter - Issue #14
Welcome, all data fanatics. In today's issue:
Let's dive in!
With the help of LakehouseIQ, Databricks is revolutionizing how people interact with data by making it accessible, intelligible, and actionable. LakehouseIQ uses information about a company's data, usage patterns, and org chart to understand the business's jargon and deliver a more personalized experience, answering questions posed in natural language. It also provides access to AI-based automatic data classification, monitoring, and Lakehouse Federation, helping to democratize data across the enterprise in a secure way.
Delta Lake 3.0 is the latest release of the Linux Foundation's open source Delta Lake project. It introduces powerful features that improve compatibility and expand the ecosystem. Delta Universal Format (UniForm) allows Delta tables to be read by engines that expect other open table formats, such as Apache Iceberg, and Delta Kernel simplifies building Delta connectors by providing APIs that hide the format's complex details. Liquid Clustering is a flexible data layout technique that automatically adjusts how data is clustered for the best query performance. Together, these features help companies move to an open data lakehouse without fear of lock-in, boost performance, and save the time and resources spent on partitioning strategies.
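A minimal sketch of what two of these features look like in practice, using PySpark SQL: the table name and columns are hypothetical, and the exact clustering syntax and UniForm table property can vary by Delta/Databricks version.

    # Hypothetical example: a Delta table using Liquid Clustering, with UniForm enabled
    # so Iceberg-compatible metadata is written alongside the Delta log.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-3-sketch").getOrCreate()

    # Liquid Clustering: declare clustering columns instead of a rigid partition scheme.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales (
            order_id    BIGINT,
            customer_id BIGINT,
            order_ts    TIMESTAMP,
            amount      DOUBLE
        )
        USING DELTA
        CLUSTER BY (customer_id, order_ts)
    """)

    # UniForm: ask Delta to also generate Iceberg metadata for this table.
    spark.sql("""
        ALTER TABLE sales SET TBLPROPERTIES (
            'delta.universalFormat.enabledFormats' = 'iceberg'
        )
    """)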
Databricks SQL now offers new capabilities for streaming ingestion from any data source and for pre-computed Materialized Views (MVs) that speed up queries. SQL analysts can set up ETL pipelines with a few lines of code, for example creating a Streaming Table that continuously ingests data from an S3 location and a Materialized View on top of it that is incrementally updated. The benefits of Streaming Tables and Materialized Views include faster BI dashboards, reduced data processing costs, improved data access control, support for real-time use cases, and better scalability. Customers who previewed the product have seen significant performance improvements and cost savings.
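A minimal sketch of that pipeline, with the SQL wrapped in PySpark's spark.sql for consistency with the other examples; in practice these statements are typically run from a Databricks SQL warehouse, and the bucket path, table, and view names are hypothetical.

    # Streaming Table: continuously ingest new files landing in an S3 prefix.
    spark.sql("""
        CREATE STREAMING TABLE web_clicks
        AS SELECT * FROM STREAM read_files('s3://my-bucket/web_clicks/')
    """)

    # Materialized View: a pre-computed, incrementally refreshed aggregate over the stream.
    spark.sql("""
        CREATE MATERIALIZED VIEW clicks_per_page
        AS SELECT page, COUNT(*) AS clicks
           FROM web_clicks
           GROUP BY page
    """)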
This article describes a reference architecture for the emerging Large Language Model (LLM) app stack, which combines private data with off-the-shelf LLMs to enable in-context learning, a design pattern for building software with LLMs. The stack includes data preprocessing and embedding, prompt construction and execution, vector databases, orchestration frameworks, proprietary and open-source language models, and operational tools for validation and caching. As the underlying technology advances, this architecture may change substantially.
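A minimal sketch of the in-context learning pattern at the heart of that stack: retrieve relevant private data from a vector database and splice it into the prompt sent to an off-the-shelf LLM. The embed, search, and llm callables below are hypothetical placeholders for whatever embedding model, vector store, and model provider are used, not any specific product's API.

    from typing import Callable, List

    def build_prompt(question: str, context_docs: List[str]) -> str:
        """Prompt construction: ground the model in the retrieved context."""
        context = "\n\n".join(context_docs)
        return (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:"
        )

    def answer(
        question: str,
        embed: Callable[[str], List[float]],              # embedding model
        search: Callable[[List[float], int], List[str]],  # vector database lookup
        llm: Callable[[str], str],                        # off-the-shelf LLM call
        k: int = 4,
    ) -> str:
        query_vector = embed(question)           # 1. embed the user query
        docs = search(query_vector, k)           # 2. retrieve the top-k similar chunks
        prompt = build_prompt(question, docs)    # 3. construct the prompt
        return llm(prompt)                       # 4. execute against the LLM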
LinkedIn's Trust Data Team has partnered with academia to research and develop advanced techniques for detecting AI-generated profile photos. This collaboration produced a novel approach that detects 99.6% of a common type of AI-generated profile photo while misclassifying a real profile photo as synthetic only 1% of the time, and the research showed that it outperforms the state-of-the-art CNN model. These techniques help LinkedIn increase the effectiveness of its automated anti-abuse defenses.
To summarize, becoming a valuable data engineer means understanding business impact, building the relevant technical skills, and selecting projects that deliver maximum impact. It is important to develop an understanding of business metrics, the fundamentals of data storage and processing, and orchestration and scheduling. Finally, describing work experience on a resume using the STAR method helps demonstrate that impact.
Data engineering teams need to establish a well-defined strategy for storing data and periodically review it to ensure optimal performance and cost efficiency. This strategy should encompass important considerations such as data backups, data version control, handling cold data, and determining the default partition strategy. By actively addressing these aspects, teams can achieve significant benefits, as demonstrated by a use case that resulted in a remarkable 78% reduction in total costs.
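A minimal sketch of two of those considerations, a default partition strategy and handling cold data, using PySpark; the paths, column names, and 90-day cutoff are hypothetical illustrations rather than details from the article.

    from datetime import date, timedelta
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("storage-strategy").getOrCreate()

    events = spark.read.json("s3://my-bucket/raw/events/")

    # Default partition strategy: partition by event date, a column most queries filter on.
    (events
     .withColumn("event_date", F.to_date("event_ts"))
     .write
     .partitionBy("event_date")
     .mode("append")
     .parquet("s3://my-bucket/warehouse/events/"))

    # Cold data: copy partitions older than 90 days to a cheaper storage tier / bucket.
    cutoff = date.today() - timedelta(days=90)
    cold = (spark.read.parquet("s3://my-bucket/warehouse/events/")
                 .where(F.col("event_date") < F.lit(cutoff)))
    cold.write.mode("append").parquet("s3://my-archive-bucket/events_cold/")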
In the talk, Fokko introduces Iceberg and its history, highlighting the companies that use and contribute to it. He dives into the underlying concepts of Iceberg, including metadata, manifest lists, and manifests, and explains how these concepts benefit the query engine and ensure data correctness. He also discusses the evolution of schema, partitioning, and sorting in Iceberg, and how these changes can be made lazily without requiring a complete rewrite of massive tables. Finally, he demonstrates PyIceberg, a Python library for working with Iceberg.
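A minimal sketch of what a PyIceberg read looks like; the catalog name, table identifier, and filter below are hypothetical, and a catalog is assumed to be configured (for example in ~/.pyiceberg.yaml).

    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("default")                    # catalog configured for your environment
    table = catalog.load_table("analytics.taxi_trips")   # namespace.table

    # Scan with a row filter and column projection, then materialize as a pandas DataFrame.
    df = (
        table.scan(
            row_filter="trip_distance > 10.0",
            selected_fields=("trip_id", "trip_distance", "fare_amount"),
        )
        .to_pandas()
    )
    print(df.head())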