Art of Data Newsletter - Issue #11
Welcome, all Data fanatics. In today's issue:
Let's dive in!
Production AI systems are remarkably complex, requiring significant technical knowledge and expertise to implement effectively. Because of that complexity, it is difficult to accurately predict the implications AI will have on the world, such as a wide-reaching takeover of social systems. Moreover, many applications require specific domain knowledge to provide value, so AI cannot always be used to automate tasks away completely.
Uber built Spark Analysers to detect anti-patterns in Spark applications and save compute resources. The system is composed of two components: a Spark event listener that collects runtime data and publishes it to a Kafka topic, and an Analysers component, a Flink application that consumes that data and detects anti-patterns. It currently includes two analysers, the Excessive Partition Scan Analyser and the Duplicate Spark Plan Analyser, and has enabled Uber to save over 60,000 uCores annually.
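The duplicate-plan case comes down to fingerprinting query plans and flagging repeats. Here is a minimal Python sketch of that core idea; Uber's real implementation is a Flink job reading listener events from Kafka, and the class, normalization, and event fields below are assumptions for illustration only.

```python
import hashlib
from collections import defaultdict

def plan_fingerprint(plan_text: str) -> str:
    """Hash a whitespace-normalized plan so identical plans collide."""
    normalized = " ".join(plan_text.split()).lower()
    return hashlib.sha256(normalized.encode()).hexdigest()

class DuplicatePlanDetector:
    """Flags Spark plans executed more than once within one application."""

    def __init__(self):
        self.executions = defaultdict(list)  # (app, fingerprint) -> exec ids

    def observe(self, app_id: str, execution_id: int, plan_text: str):
        key = (app_id, plan_fingerprint(plan_text))
        self.executions[key].append(execution_id)
        if len(self.executions[key]) > 1:
            print(f"[anti-pattern] {app_id}: identical plan ran "
                  f"{len(self.executions[key])} times "
                  f"(executions {self.executions[key]}) - consider caching")

# The same physical plan arriving twice from the event stream
detector = DuplicatePlanDetector()
detector.observe("app-42", 1, "Scan parquet db.events Filter (dt = '2023-07-01')")
detector.observe("app-42", 2, "Scan parquet db.events Filter (dt = '2023-07-01')")
```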
Due to the complexity of Policygenius' ELT (Extract, Load, Transform) architecture, they faced data quality issues. To address them, they implemented a shift-left unit testing process and a post-deployment observability framework to test for constraints such as uniqueness, ranges, and nullability. They also released a scaffolding tool to reduce development time and provided training and pairing. This improved data quality, reduced the risk of data breakages, and decreased time-to-detect from days to hours. They are continuing to expand their test coverage and to identify further ways to improve data quality.
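To make those constraint types concrete, here is a minimal sketch of uniqueness, nullability, and range checks written as pandas assertions; the helper names and toy table are assumptions, not Policygenius' actual framework.

```python
import pandas as pd

def check_unique(df: pd.DataFrame, column: str):
    """Uniqueness constraint: no duplicate values in the column."""
    dupes = int(df[column].duplicated().sum())
    assert dupes == 0, f"{column}: {dupes} duplicate value(s)"

def check_not_null(df: pd.DataFrame, column: str):
    """Nullability constraint: the column contains no nulls."""
    nulls = int(df[column].isna().sum())
    assert nulls == 0, f"{column}: {nulls} null value(s)"

def check_range(df: pd.DataFrame, column: str, lo, hi):
    """Range constraint: every value falls within [lo, hi]."""
    bad = df[~df[column].between(lo, hi)]
    assert bad.empty, f"{column}: {len(bad)} value(s) outside [{lo}, {hi}]"

# Toy policies table exercising all three checks
policies = pd.DataFrame({
    "policy_id": [1, 2, 3],
    "premium": [120.0, 89.5, 240.0],
})
check_unique(policies, "policy_id")
check_not_null(policies, "premium")
check_range(policies, "premium", 0, 10_000)
```

Shifting checks like these left means they run in development and CI before a deploy, while an observability framework runs similar assertions against production data afterwards.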
This article summarises the journey of a self-taught data engineer who transitioned from being a data analyst. They detail the growing need for technical expertise in the field of data engineering, as well as the core and peripheral skills and tools a data engineer requires. The article also highlights the challenges they faced in transitioning, such as difficulties in scalability, mismanagement of expectations and poor data quality. Finally, they provide insight on resources that aspiring data engineers may use to build a comprehensive understanding of the data engineering role.
The article discusses the Write-Audit-Publish (WAP) pattern and its implementation across different technologies. It focuses on the storage layer, particularly data lake and lakehouse storage, and briefly mentions potential application in an RDBMS like Oracle. The author advises caution when selecting tools for implementing the WAP pattern, emphasizing the importance of understanding the pattern itself. While no specific tool is endorsed, the article aims to help readers leverage their existing tools effectively or make informed choices for future implementations. It also mentions lakeFS as a favorable option for WAP, given its design and purpose.
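Independent of any particular tool, the pattern itself is short enough to sketch: write new data to a staging area readers never see, audit it, and only then publish atomically. In this illustrative Python example, SQLite stands in for the storage layer, and the table names and audit rules are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("CREATE TABLE orders_staging (id INTEGER, amount REAL)")

# WRITE: land incoming data in staging, invisible to consumers
conn.executemany("INSERT INTO orders_staging VALUES (?, ?)",
                 [(1, 19.99), (2, 5.00)])

# AUDIT: validate the staged data before anyone can read it
nulls = conn.execute("SELECT COUNT(*) FROM orders_staging "
                     "WHERE id IS NULL OR amount IS NULL").fetchone()[0]
negatives = conn.execute("SELECT COUNT(*) FROM orders_staging "
                         "WHERE amount < 0").fetchone()[0]
if nulls or negatives:
    raise ValueError("audit failed; staged data will not be published")

# PUBLISH: promote the audited rows in a single transaction
with conn:
    conn.execute("INSERT INTO orders SELECT id, amount FROM orders_staging")
    conn.execute("DELETE FROM orders_staging")

print(conn.execute("SELECT * FROM orders").fetchall())
```

In a lake or lakehouse setting the staging area is typically a branch, snapshot, or temporary location rather than a second table, but the write-audit-publish sequence is the same.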
How DoorDash uses XcodeGen to eliminate project merge conflicts - DoorDash Engineering Blog | 19mins
DoorDash faced challenges maintaining an Xcode project for a large team, where merge conflicts in the project file caused costly slowdowns. Using XcodeGen, a command-line tool that generates Xcode projects from a declarative spec, DoorDash was able to manage the intricate business scenarios and demanding requirements of their Dasher app. The tool allowed them to enforce a consistent project structure and configuration, tailor projects to specific needs, and minimize the time and effort spent resolving merge conflicts. XcodeGen is particularly valuable for larger teams dealing with complex project structures and can be useful for others in similar positions.
This blog post discusses an approach to building large technical projects: break the project down into small, manageable pieces and use automated testing to get tangible results quickly. The approach works for real-world and personal projects alike, and each piece should end in a demo that showcases progress. The post notes that experienced engineers can break a project down more effectively, but that this is not a prerequisite for success. Finally, the author suggests using the product yourself frequently to get accurate feedback about what needs improving and to keep motivation going.
Vector databases are high-performance, high-dimensional databases designed to efficiently store and retrieve vectorized data. Their specialized indexing techniques enable high-speed similarity searches, making them ideal for machine learning use cases such as recommendation systems, semantic search, personalized marketing, and image recognition. In the business world, they are transforming how data is stored, analyzed, and turned into insights. As the technology evolves, it holds significant potential for further innovation.
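The core operation is nearest-neighbor search over embedding vectors. The brute-force cosine-similarity sketch below shows the idea; production vector databases replace this linear scan with approximate indexes (e.g. HNSW or IVF), and the data here is random for illustration.

```python
import numpy as np

def top_k_similar(index: np.ndarray, query: np.ndarray, k: int = 3):
    """Return the k most cosine-similar rows of `index` to `query`."""
    index_norm = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = index_norm @ query_norm        # cosine similarity per row
    top = np.argsort(scores)[::-1][:k]      # highest scores first
    return [(int(i), float(scores[i])) for i in top]

# Toy "database": 1,000 embeddings in 64 dimensions
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))
query = rng.normal(size=64)
print(top_k_similar(vectors, query))
```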
How to evaluate dependencies | 4mins
This post outlines a healthier framework for evaluating dependencies and the key considerations to think through before making a decision: start by understanding the dependency's documentation, investigate its code and commit history, consider the community around it, and only then look at GitHub stars, as an afterthought.