Art of Data Newsletter - Issue #15
Welcome all Data fanatics. In today's issue:

- Hoptimator, LinkedIn's end-to-end data pipeline orchestrator
- A plain-language look at vector databases and embeddings
- Maestro, Netflix's workflow orchestrator for big data and ML
- How Spotify decides which listeners should receive in-app messages
- What goes into a self-serve data platform
- How to create valuable data tests
- Why the first and last days of the month (FDOM/LDOM) matter in data engineering
Let's dive in!
LinkedIn has developed an end-to-end data pipeline orchestrator called Hoptimator, designed to support a growing number of use cases and reduce onboarding friction for data pipelines. Hoptimator aims to unify the user experience by leveraging LinkedIn's existing infrastructure. It builds on Flink, a stream processing engine, so developers can express data pipelines and stream processing in the same language and run them on the same runtime.
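The article doesn't include code, but a minimal sketch of the "same language, same runtime" idea, written against plain Flink SQL via PyFlink rather than Hoptimator's actual interface, might look like this (table names and connectors are illustrative):

```python
# Minimal sketch using PyFlink directly (not Hoptimator's API):
# the source, the transformation, and the sink all live in Flink SQL.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Illustrative source: a synthetic stream standing in for a Kafka topic of page views.
t_env.execute_sql("""
    CREATE TABLE page_views (
        member_id BIGINT,
        page STRING
    ) WITH ('connector' = 'datagen', 'rows-per-second' = '5')
""")

# Illustrative sink: print to stdout; a real pipeline would write to another topic or table.
t_env.execute_sql("""
    CREATE TABLE page_view_counts (
        page STRING,
        views BIGINT
    ) WITH ('connector' = 'print')
""")

# The whole pipeline -- ingest, aggregate, deliver -- expressed in one language
# and executed on one runtime (Flink).
t_env.execute_sql("""
    INSERT INTO page_view_counts
    SELECT page, COUNT(*) AS views
    FROM page_views
    GROUP BY page
""").wait()
```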
The article explains the growing popularity of, and investment in, vector databases such as Pinecone and Weaviate. The author, Sven Balnojan, gives a simple explanation of what vector databases and embeddings are and why they matter in the current tech market. Although these companies were founded five years ago, their relevance and appeal have surged only recently.
Vector databases are a type of database management system designed for managing and retrieving high-dimensional, vectorized data. They optimize data retrieval through specific indexing and query techniques, thus significantly reducing computational time. They excel at performing high-speed similarity searches in large datasets, making them useful in fields such as recommendation systems, semantic searches, and anomaly detection. They have wide applications in business, including personalized marketing, image recognition, and bioinformatics. Platforms like Milvus, Pinecone, and Weaviate offer specialized features for different use cases of vector databases. In conclusion, vector databases present an innovative way of managing high-dimensional data, which is crucial in our data-driven world.
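The piece stays conceptual, but the core operation a vector database optimizes, nearest-neighbour search over embeddings, can be sketched in a few lines. The brute-force NumPy version below is for illustration only; real systems such as Milvus, Pinecone, and Weaviate add approximate indexes (e.g. HNSW) so the search scales far beyond a linear scan, and the data here is random:

```python
# Brute-force similarity search over embeddings, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 384))              # 10k documents, 384-dim embeddings
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

query = rng.normal(size=384)
query /= np.linalg.norm(query)

# Cosine similarity reduces to a dot product once vectors are normalized.
scores = corpus @ query
top_k = np.argsort(scores)[-5:][::-1]                # indices of the 5 most similar documents
print(top_k, scores[top_k])
```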
Netflix has launched Maestro, a workflow orchestrator designed to handle various big data and machine learning workflows. Maestro is built to orchestrate workloads at very large scale, addressing problems such as slow performance during periods of high traffic and scaling beyond the limits of available AWS instance types. The platform lets users schedule and manage workflows at massive scale, handling system issues behind a streamlined orchestrator service. The system consists of three main microservices: the Workflow Engine, the Time-Based Scheduling Service, and the Signal Service. With Maestro, users can run millions or even billions of steps in a single workflow instance. The platform also provides multiple domain-specific languages and parameterized workflows for added flexibility.
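The post doesn't show a workflow definition, but a parameterized workflow of the kind described can be pictured roughly as below; the structure and field names are hypothetical and not Maestro's actual DSL:

```python
# Hypothetical sketch of a parameterized workflow definition; field names and
# structure are illustrative, not Maestro's actual DSL.
workflow = {
    "id": "daily_model_training",
    "params": {
        "run_date": "2023-08-01",          # injected at trigger time
        "regions": ["us", "eu", "apac"],   # parameter list used for fan-out
    },
    "steps": [
        {
            "id": "extract",
            "run": "spark-submit extract.py --date {run_date}",
        },
        {
            # A foreach-style step expands into one instance per parameter value,
            # which is how step counts in a single workflow can grow very large.
            "id": "train_per_region",
            "foreach": "regions",
            "run": "spark-submit train.py --date {run_date} --region {item}",
            "depends_on": ["extract"],
        },
        {
            "id": "publish",
            "run": "spark-submit publish.py --date {run_date}",
            "depends_on": ["train_per_region"],
        },
    ],
}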
Spotify uses in-app messaging to communicate with its listeners globally, sending messages through numerous channels such as WhatsApp, SMS, email, push notifications, and various in-line and other in-app formats. However, they noticed that in-app messages affect different users differently, calling for a more selective approach to messaging. To solve this, they developed a machine learning model that decides which users would benefit most from in-app messages.
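The summary doesn't detail the model, but one hedged way to picture "which users would benefit most" is a per-user score from a supervised classifier over engagement features; everything below (features, labels, threshold) is a synthetic placeholder, not Spotify's actual setup:

```python
# Hedged illustration only: score each user on "would an in-app message help?"
# and only message users above a threshold. Features and labels are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic features: [days_since_last_session, sessions_last_week, prior_message_clicks]
X = rng.normal(size=(1_000, 3))
# Synthetic label: 1 if the user engaged after a past message, 0 otherwise.
y = (X @ np.array([0.5, 1.0, 1.5]) + rng.normal(size=1_000) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Score new users and send only to those above a chosen threshold.
new_users = rng.normal(size=(5, 3))
scores = model.predict_proba(new_users)[:, 1]
send_message = scores > 0.6
print(list(zip(scores.round(2), send_message)))
```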
A self-serve data platform refers to processes that let end-users access data, learn about it, and understand its meaning more easily. Building such a platform starts with understanding the end-users, their goals, and their workflows, with the aim of reducing their dependency on data engineers.
A self-serve platform is made up of several components: the datasets the end-users need, data access, data discovery, data documentation, and data lineage. Data quality statistics, data management, and data ownership also play a crucial role.
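To make those components concrete, here is a hedged sketch of what one dataset's entry in such a platform's catalog might carry; the fields mirror the components above and are illustrative, not tied to any particular tool:

```python
# Illustrative only: one dataset's catalog entry covering discovery,
# documentation, lineage, quality statistics, and ownership.
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    name: str                                                # discoverable identifier
    description: str                                         # documentation for end-users
    owner: str                                               # accountable team or person
    upstream: list[str] = field(default_factory=list)        # lineage: where the data comes from
    quality: dict[str, float] = field(default_factory=dict)  # e.g. freshness, null rates


orders = DatasetEntry(
    name="analytics.orders_daily",
    description="One row per order, refreshed every morning at 06:00 UTC.",
    owner="data-platform-team@example.com",
    upstream=["raw.orders", "raw.customers"],
    quality={"freshness_hours": 4.0, "null_rate_order_id": 0.0},
)
print(orders)
```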
How to Create Valuable Data Tests | 13 mins
Creating valuable data tests is more about quality than quantity. Data tests are essential building blocks of a data solution: they validate data quality, but writing more of them does not automatically produce higher-quality data. The article frames data as a product and pipelines as manufacturing systems, and recommends a "fitness for use" perspective that takes the consumer's needs and feedback into account; the data quality dimensions offer useful metrics for tailoring tests. Key tips include preferring "no data" over "wrong data" (stop the pipeline rather than deliver bad output), adding a reconciliation test for consistency, and building tests with margins so they capture real deviations. Finally, avoid chasing the sheer number of tests and stick to a data quality framework.
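As a hedged illustration of the reconciliation and "tests with margins" tips, the sketch below compares source and target row counts and tolerates a small, explicit deviation; the counts and the 0.5% threshold are made up:

```python
# Hedged illustration of two tips above: a reconciliation test between source
# and target, with a margin so small expected deviations don't fail the run.
# The counts and the 0.5% threshold are illustrative.

def reconcile_row_counts(source_count: int, target_count: int, margin: float = 0.005) -> None:
    """Fail loudly (prefer 'no data' over 'wrong data') if counts drift beyond the margin."""
    if source_count == 0:
        raise ValueError("Source is empty; stopping the pipeline rather than loading wrong data.")
    drift = abs(source_count - target_count) / source_count
    if drift > margin:
        raise ValueError(
            f"Row count drift {drift:.2%} exceeds allowed margin {margin:.2%} "
            f"(source={source_count}, target={target_count})."
        )


# Example: 10,000 rows extracted, 9,980 loaded -> 0.20% drift, within the margin.
reconcile_row_counts(source_count=10_000, target_count=9_980)
```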
In data engineering, the first and last days of the month, referred to as FDOM/LDOM, are crucial dates: they provide distinct opportunities for data engineers to validate, process, and optimize their pipelines. The author shares personal experiences with monthly job failures and the lessons drawn from them. FDOM/LDOM matter for tasks such as data ingestion and extraction, data integrity checks, performance optimization, data completeness, data validation and aggregation, and data archiving. Lessons learned include understanding root causes, fortifying the ETL pipeline, emphasizing testing and validation, and fostering collaboration and communication. Advice for future data engineers: adopt a proactive mindset, strive for continuous improvement, and cultivate a problem-solving attitude.
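For reference, FDOM and LDOM can be computed with nothing beyond the Python standard library; the completeness check at the end is an illustrative use, not the author's actual pipeline:

```python
# Computing FDOM/LDOM with the standard library; the completeness check below
# is an illustrative use, not the author's pipeline.
import calendar
from datetime import date


def fdom(d: date) -> date:
    """First day of the month containing d."""
    return d.replace(day=1)


def ldom(d: date) -> date:
    """Last day of the month containing d."""
    return d.replace(day=calendar.monthrange(d.year, d.month)[1])


run_date = date(2023, 2, 15)
print(fdom(run_date), ldom(run_date))   # 2023-02-01 2023-02-28

# Example data-completeness check: every day of the month should have a partition.
expected_days = (ldom(run_date) - fdom(run_date)).days + 1
loaded_days = 28                        # e.g. count of daily partitions actually found
assert loaded_days == expected_days, f"Missing {expected_days - loaded_days} daily partition(s)."
```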