Art of Data Newsletter - Issue #11
Welcome, all Data fanatics. In today's issue:
Let's dive in!
Production AI systems are remarkably complex, requiring significant technical knowledge and expertise to implement effectively. Because of that complexity, it is difficult to accurately predict the implications AI will have on the world, such as a wide-reaching takeover of social systems. Moreover, many applications require specific domain knowledge to provide value, so AI cannot always be used to automate tasks away completely.
Uber built Spark Analysers to detect anti-patterns in Spark applications and save compute resources. The system is composed of two components: a Spark event listener that collects runtime data and publishes it to a Kafka topic, and an Analysers component, a Flink application that consumes that data and detects anti-patterns. It currently includes two analysers, the Excessive Partition Scan Analyser and the Duplicate Spark Plan Analyser, and has enabled Uber to save over 60,000 uCores annually.
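The duplicate-plan case comes down to fingerprinting query plans and flagging repeats. Here is a minimal Python sketch of that core idea; Uber's real implementation is a Flink job reading listener events from Kafka, and the class, normalization, and event fields below are assumptions for illustration only.

```python
import hashlib
from collections import defaultdict

def plan_fingerprint(plan_text: str) -> str:
    """Hash a whitespace-normalized plan so identical plans collide."""
    normalized = " ".join(plan_text.split()).lower()
    return hashlib.sha256(normalized.encode()).hexdigest()

class DuplicatePlanDetector:
    """Flags Spark plans executed more than once within one application."""

    def __init__(self):
        self.executions = defaultdict(list)  # (app, fingerprint) -> exec ids

    def observe(self, app_id: str, execution_id: int, plan_text: str):
        key = (app_id, plan_fingerprint(plan_text))
        self.executions[key].append(execution_id)
        if len(self.executions[key]) > 1:
            print(f"[anti-pattern] {app_id}: identical plan ran "
                  f"{len(self.executions[key])} times "
                  f"(executions {self.executions[key]}) - consider caching")

# The same physical plan arriving twice from the event stream
detector = DuplicatePlanDetector()
detector.observe("app-42", 1, "Scan parquet db.events Filter (dt = '2023-07-01')")
detector.observe("app-42", 2, "Scan parquet db.events Filter (dt = '2023-07-01')")
```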
Due to the complexity of Policygenius' ELT (Extract, Load, Transform) architecture, they faced data quality issues. To address them, they implemented a shift-left unit testing process and a post-deployment observability framework to test for constraints such as uniqueness, ranges, and nullability. They also released a scaffolding tool to reduce development time and provided training and pairing. This improved data quality, reduced the risk of data breakages, and decreased time-to-detect from days to hours. They are continuing to expand their test coverage and to identify further ways to improve data quality.
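To make those constraint types concrete, here is a minimal sketch of uniqueness, nullability, and range checks written as pandas assertions; the helper names and toy table are assumptions, not Policygenius' actual framework.

```python
import pandas as pd

def check_unique(df: pd.DataFrame, column: str):
    """Uniqueness constraint: no duplicate values in the column."""
    dupes = int(df[column].duplicated().sum())
    assert dupes == 0, f"{column}: {dupes} duplicate value(s)"

def check_not_null(df: pd.DataFrame, column: str):
    """Nullability constraint: the column contains no nulls."""
    nulls = int(df[column].isna().sum())
    assert nulls == 0, f"{column}: {nulls} null value(s)"

def check_range(df: pd.DataFrame, column: str, lo, hi):
    """Range constraint: every value falls within [lo, hi]."""
    bad = df[~df[column].between(lo, hi)]
    assert bad.empty, f"{column}: {len(bad)} value(s) outside [{lo}, {hi}]"

# Toy policies table exercising all three checks
policies = pd.DataFrame({
    "policy_id": [1, 2, 3],
    "premium": [120.0, 89.5, 240.0],
})
check_unique(policies, "policy_id")
check_not_null(policies, "premium")
check_range(policies, "premium", 0, 10_000)
```

Shifting checks like these left means they run in development and CI before a deploy, while an observability framework runs similar assertions against production data afterwards.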
This article summarises the journey of a self-taught data engineer who transitioned from being a data analyst. They detail the growing need for technical expertise in the field of data engineering, as well as the core and peripheral skills and tools a data engineer requires. The article also highlights the challenges they faced in transitioning, such as difficulties in scalability, mismanagement of expectations and poor data quality. Finally, they provide insight on resources that aspiring data engineers may use to build a comprehensive understanding of the data engineering role.
The article discusses the Write-Audit-Publish (WAP) pattern and its implementation across different technologies. It focuses on the storage layer, particularly data lake and lakehouse storage, and briefly mentions potential application in an RDBMS like Oracle. The author advises caution when selecting tools for implementing the WAP pattern, emphasizing the importance of understanding the pattern itself. While no specific tool is endorsed, the article aims to help readers leverage their existing tools effectively or make informed choices for future implementations. It also mentions lakeFS as a favorable option for WAP, given its design and purpose.
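Independent of any particular tool, the pattern itself is short enough to sketch: write new data to a staging area readers never see, audit it, and only then publish atomically. In this illustrative Python example, SQLite stands in for the storage layer, and the table names and audit rules are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("CREATE TABLE orders_staging (id INTEGER, amount REAL)")

# WRITE: land incoming data in staging, invisible to consumers
conn.executemany("INSERT INTO orders_staging VALUES (?, ?)",
                 [(1, 19.99), (2, 5.00)])

# AUDIT: validate the staged data before anyone can read it
nulls = conn.execute("SELECT COUNT(*) FROM orders_staging "
                     "WHERE id IS NULL OR amount IS NULL").fetchone()[0]
negatives = conn.execute("SELECT COUNT(*) FROM orders_staging "
                         "WHERE amount < 0").fetchone()[0]
if nulls or negatives:
    raise ValueError("audit failed; staged data will not be published")

# PUBLISH: promote the audited rows in a single transaction
with conn:
    conn.execute("INSERT INTO orders SELECT id, amount FROM orders_staging")
    conn.execute("DELETE FROM orders_staging")

print(conn.execute("SELECT * FROM orders").fetchall())
```

In a lake or lakehouse setting the staging area is typically a branch, snapshot, or temporary location rather than a second table, but the write-audit-publish sequence is the same.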
How DoorDash uses XcodeGen to eliminate project merge conflicts - DoorDash Engineering Blog | 19mins
DoorDash faced challenges maintaining an Xcode project for a large team, where merge conflicts in the project file caused costly slowdowns. Using XcodeGen, a command-line tool that generates Xcode projects from a declarative spec, DoorDash was able to manage the intricate business scenarios and demanding requirements of their Dasher app. The tool allowed them to enforce a consistent project structure and configuration, tailor projects to specific needs, and minimize the time and effort spent resolving merge conflicts. XcodeGen is particularly valuable for larger teams dealing with complex project structures and can be useful for others in similar positions.
This blog post discusses an approach to building large technical projects: break the project down into small, manageable pieces and use automated testing to get tangible results quickly. The approach works for real-world and personal projects alike, and each piece should end in a demo that showcases progress. The post notes that experienced engineers can break a project down more effectively, but that this is not a prerequisite for success. Finally, the author suggests using the product yourself frequently to get accurate feedback about what needs improving and to keep motivation going.
Vector databases are high-performance, high-dimensional databases designed to efficiently store and retrieve vectorized data. Their specialized indexing techniques enable high-speed similarity searches, making them ideal for machine learning use cases such as recommendation systems, semantic search, personalized marketing, and image recognition. In the business world, they are transforming how data is stored, analyzed, and turned into insights. As the technology evolves, it holds significant potential for further innovation.
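The core operation is nearest-neighbor search over embedding vectors. The brute-force cosine-similarity sketch below shows the idea; production vector databases replace this linear scan with approximate indexes (e.g. HNSW or IVF), and the data here is random for illustration.

```python
import numpy as np

def top_k_similar(index: np.ndarray, query: np.ndarray, k: int = 3):
    """Return the k most cosine-similar rows of `index` to `query`."""
    index_norm = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = index_norm @ query_norm        # cosine similarity per row
    top = np.argsort(scores)[::-1][:k]      # highest scores first
    return [(int(i), float(scores[i])) for i in top]

# Toy "database": 1,000 embeddings in 64 dimensions
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))
query = rng.normal(size=64)
print(top_k_similar(vectors, query))
```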
How to evaluate dependencies | 4mins
This post outlines a healthier framework for evaluating dependencies and the key considerations to think through before making a decision: start by understanding the dependency's documentation, investigate its code and commit history, consider the community around it, and only then look at GitHub stars, as an afterthought.