Art of Data Newsletter - Issue #14
Welcome, all data fanatics. In today's issue:
Let's dive in!
With the help of LakehouseIQ, Databricks is revolutionizing how people interact with data by making it accessible, intelligible, and actionable. LakehouseIQ uses information about a company's data, usage patterns, and org chart to understand the business's jargon and deliver a more personalized experience, answering questions posed in natural language. It also provides access to AI-based automatic data classification, monitoring, and Lakehouse Federation, helping to democratize data across the enterprise in a secure way.
Delta Lake 3.0 is the latest release of the Linux Foundation's open source Delta Lake project. It introduces powerful features that improve compatibility and expand the ecosystem. Delta Universal Format (UniForm) allows Delta tables to be read by engines that expect other open table formats, such as Apache Iceberg, and Delta Kernel simplifies building Delta connectors by providing APIs that hide the format's complex details. Liquid Clustering is a flexible data layout technique that automatically adjusts how data is clustered for the best query performance. Together, these features help companies move to an open data lakehouse without fear of lock-in, boost performance, and save the time and resources spent on partitioning strategies.
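A minimal sketch of what two of these features look like in practice, using PySpark SQL: the table name and columns are hypothetical, and the exact clustering syntax and UniForm table property can vary by Delta/Databricks version.

    # Hypothetical example: a Delta table using Liquid Clustering, with UniForm enabled
    # so Iceberg-compatible metadata is written alongside the Delta log.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-3-sketch").getOrCreate()

    # Liquid Clustering: declare clustering columns instead of a rigid partition scheme.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales (
            order_id    BIGINT,
            customer_id BIGINT,
            order_ts    TIMESTAMP,
            amount      DOUBLE
        )
        USING DELTA
        CLUSTER BY (customer_id, order_ts)
    """)

    # UniForm: ask Delta to also generate Iceberg metadata for this table.
    spark.sql("""
        ALTER TABLE sales SET TBLPROPERTIES (
            'delta.universalFormat.enabledFormats' = 'iceberg'
        )
    """)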
Databricks SQL now offers new capabilities for streaming ingestion from any data source and for pre-computed Materialized Views (MVs) that speed up queries. SQL analysts can set up ETL pipelines with a few lines of code, for example creating a Streaming Table that continuously ingests data from an S3 location and a Materialized View on top of it that is incrementally updated. The benefits of Streaming Tables and Materialized Views include faster BI dashboards, reduced data processing costs, improved data access control, support for real-time use cases, and better scalability. Customers who previewed the product have seen significant performance improvements and cost savings.
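A minimal sketch of that pipeline, with the SQL wrapped in PySpark's spark.sql for consistency with the other examples; in practice these statements are typically run from a Databricks SQL warehouse, and the bucket path, table, and view names are hypothetical.

    # Streaming Table: continuously ingest new files landing in an S3 prefix.
    spark.sql("""
        CREATE STREAMING TABLE web_clicks
        AS SELECT * FROM STREAM read_files('s3://my-bucket/web_clicks/')
    """)

    # Materialized View: a pre-computed, incrementally refreshed aggregate over the stream.
    spark.sql("""
        CREATE MATERIALIZED VIEW clicks_per_page
        AS SELECT page, COUNT(*) AS clicks
           FROM web_clicks
           GROUP BY page
    """)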
This article describes a reference architecture for the emerging Large Language Model (LLM) app stack, which combines private data with off-the-shelf LLMs to enable in-context learning, a design pattern for building software with LLMs. The stack includes data preprocessing and embedding, prompt construction and execution, vector databases, orchestration frameworks, proprietary and open-source language models, and operational tools for validation and caching. As the underlying technology advances, this architecture may change substantially.
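A minimal sketch of the in-context learning pattern at the heart of that stack: retrieve relevant private data from a vector database and splice it into the prompt sent to an off-the-shelf LLM. The embed, search, and llm callables below are hypothetical placeholders for whatever embedding model, vector store, and model provider are used, not any specific product's API.

    from typing import Callable, List

    def build_prompt(question: str, context_docs: List[str]) -> str:
        """Prompt construction: ground the model in the retrieved context."""
        context = "\n\n".join(context_docs)
        return (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:"
        )

    def answer(
        question: str,
        embed: Callable[[str], List[float]],              # embedding model
        search: Callable[[List[float], int], List[str]],  # vector database lookup
        llm: Callable[[str], str],                        # off-the-shelf LLM call
        k: int = 4,
    ) -> str:
        query_vector = embed(question)           # 1. embed the user query
        docs = search(query_vector, k)           # 2. retrieve the top-k similar chunks
        prompt = build_prompt(question, docs)    # 3. construct the prompt
        return llm(prompt)                       # 4. execute against the LLM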
LinkedIn's Trust Data Team has partnered with academia to research and develop advanced techniques for detecting AI-generated profile photos. This collaboration produced a novel approach that detects 99.6% of a common type of AI-generated profile photo while misclassifying a real profile photo as synthetic only 1% of the time, and the research showed that it outperforms the state-of-the-art CNN model. These techniques help LinkedIn increase the effectiveness of its automated anti-abuse defenses.
To summarize, becoming a valuable data engineer means understanding business impact, building the relevant technical skills, and selecting projects that deliver maximum impact. It is important to develop an understanding of business metrics, the fundamentals of data storage and processing, and orchestration and scheduling. Finally, describing work experience on a resume using the STAR method helps demonstrate that impact.
Data engineering teams need to establish a well-defined strategy for storing data and periodically review it to ensure optimal performance and cost efficiency. This strategy should encompass important considerations such as data backups, data version control, handling cold data, and determining the default partition strategy. By actively addressing these aspects, teams can achieve significant benefits, as demonstrated by a use case that resulted in a remarkable 78% reduction in total costs.
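A minimal sketch of two of those considerations, a default partition strategy and handling cold data, using PySpark; the paths, column names, and 90-day cutoff are hypothetical illustrations rather than details from the article.

    from datetime import date, timedelta
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("storage-strategy").getOrCreate()

    events = spark.read.json("s3://my-bucket/raw/events/")

    # Default partition strategy: partition by event date, a column most queries filter on.
    (events
     .withColumn("event_date", F.to_date("event_ts"))
     .write
     .partitionBy("event_date")
     .mode("append")
     .parquet("s3://my-bucket/warehouse/events/"))

    # Cold data: copy partitions older than 90 days to a cheaper storage tier / bucket.
    cutoff = date.today() - timedelta(days=90)
    cold = (spark.read.parquet("s3://my-bucket/warehouse/events/")
                 .where(F.col("event_date") < F.lit(cutoff)))
    cold.write.mode("append").parquet("s3://my-archive-bucket/events_cold/")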
In the talk, Fokko introduces Iceberg and its history, highlighting the companies that use and contribute to it. He dives into the underlying concepts of Iceberg, including metadata, manifest lists, and manifests, and explains how these concepts benefit the query engine and ensure data correctness. He also discusses the evolution of schema, partitioning, and sorting in Iceberg, and how these changes can be made lazily without requiring a complete rewrite of massive tables. Finally, he demonstrates PyIceberg, a Python library for working with Iceberg.
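A minimal sketch of what a PyIceberg read looks like; the catalog name, table identifier, and filter below are hypothetical, and a catalog is assumed to be configured (for example in ~/.pyiceberg.yaml).

    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("default")                    # catalog configured for your environment
    table = catalog.load_table("analytics.taxi_trips")   # namespace.table

    # Scan with a row filter and column projection, then materialize as a pandas DataFrame.
    df = (
        table.scan(
            row_filter="trip_distance > 10.0",
            selected_fields=("trip_id", "trip_distance", "fare_amount"),
        )
        .to_pandas()
    )
    print(df.head())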