Art of Data Newsletter - Issue #15

Welcome all Data fanatics. In today's issue:

  • LinkedIn explains its new data pipeline orchestrator, Hoptimator
  • Concepts and examples of #VectorDatabases
  • Netflix launches Maestro, a workflow orchestrator for large-scale data and machine learning workflows
  • Spotify explains how it uses machine learning to send in-app messages selectively
  • Building a self-serve data platform
  • How to create valuable data tests

Let's dive in!


Declarative Data Pipelines with Hoptimator | 13mins

LinkedIn has developed Hoptimator, an end-to-end data pipeline orchestrator designed to support a growing number of use cases and reduce onboarding friction for data pipelines. It aims to unify the user experience by building on LinkedIn's existing infrastructure. The tool uses Flink, a stream processing engine, so developers can express data pipelines and stream processing in the same language and run them on the same runtime.

Explain like I’m 5 — Vector Database Hype | 6mins

The article explains the growing popularity of, and investment in, vector databases such as Pinecone and Weaviate. The author, Sven Balnojan, gives a simple explanation of what vector databases and embeddings are and why they matter in the current tech market. Although these companies were founded five years ago, interest in them has surged only recently.

Vector Database: Concepts and examples | 11mins

Vector databases are database management systems designed for storing and retrieving high-dimensional, vectorized data. They rely on specialized indexing and query techniques to cut the computational cost of retrieval, which lets them run fast similarity searches over large datasets. That makes them useful for recommendation systems, semantic search, and anomaly detection, with business applications in personalized marketing, image recognition, and bioinformatics. Platforms like Milvus, Pinecone, and Weaviate offer specialized features for different use cases. The article concludes that vector databases are an effective way to manage high-dimensional data.
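
To make "similarity search" concrete, here is a minimal sketch in plain Python. It is not how Milvus, Pinecone, or Weaviate work internally: it compares the query against every stored vector, which is exactly the brute-force cost that a vector database's indexes are meant to avoid. The item names and three-dimensional embeddings are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)

def top_k(query, items, k=2):
    """Brute-force search: score every stored vector, return the k best matches."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical embeddings; real ones have hundreds or thousands of dimensions.
EMBEDDINGS = {
    "jazz playlist": [0.9, 0.1, 0.0],
    "blues playlist": [0.8, 0.2, 0.1],
    "tax report": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    for name, score in top_k([1.0, 0.0, 0.0], EMBEDDINGS):
        print(f"{name}: {score:.3f}")
```

The scan above is linear in the number of stored vectors, so it only works for small collections. Dedicated systems trade a little accuracy for speed by using approximate nearest-neighbor indexes instead.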

Orchestrating Data/ML Workflows at Scale With Netflix Maestro | 21mins

Netflix has launched Maestro, a workflow orchestrator designed to handle big data and machine learning workflows at large scale. It addresses problems such as slow performance during traffic peaks and workloads that outgrow AWS instance type limits. Users can schedule and manage workflows at massive scale while the orchestrator service handles system issues for them. The system consists of three main microservices: Workflow Engine, Time-Based Scheduling Service, and Signal Service. According to the article, a single workflow instance can run millions or billions of steps. Maestro also provides multiple domain-specific languages and parameterized workflows for added flexibility.

Experimenting with Machine Learning to Target In-App Messaging | 9mins

Spotify communicates with its listeners worldwide through many channels, including WhatsApp, SMS, email, push notifications, in-line formats, and other in-app messaging formats. The team noticed that in-app messages affect users differently, which called for a more selective approach. To address this, they developed a machine learning model that decides which users would benefit most from in-app messages.

What is a self-serve data platform & how to build one | 9mins

A self-serve data platform is a set of processes that lets end users access data, learn about it, and understand its meaning more easily. Building one starts with identifying the end users, their goals, and their workflows, with the aim of reducing their dependency on data engineers.

Creating a self-serve platform consists of different components. These include the data sets needed by the end-users, data access, data discovery, data documentation, and data lineage. Moreover, data quality statistics, data management, and data ownership play a crucial role.

How to Create Valuable Data Tests | 13mins

Valuable data tests are about quality, not quantity. Tests are essential building blocks that validate data quality in data solutions, but adding more of them does not by itself make the data better. The article treats data as a product and pipelines as manufacturing systems, and recommends a 'fitness for use' perspective that takes the consumer's needs and feedback into account. Data quality dimensions offer useful metrics for tailoring tests. Key tips include preferring 'no data' over 'wrong data', running reconciliation tests for consistency, and building margins into tests to capture deviations. The overall advice is to follow a data quality framework instead of chasing test counts.
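
The three tips can be sketched in a few lines of Python. This is one possible reading of them, not the article's own code: the function names, the 10% default tolerance, and the choice to return an empty list when a check fails are all assumptions made for illustration.

```python
def reconcile_row_counts(source_count, target_count):
    """Reconciliation test: the target should hold exactly what the source sent."""
    return source_count == target_count

def within_margin(observed, expected, tolerance=0.10):
    """Margin test: pass when observed deviates from expected by at most `tolerance`.

    The 10% default is an arbitrary illustration; a real threshold should
    come from what the data consumer considers fit for use.
    """
    if expected == 0:
        return observed == 0
    return abs(observed - expected) / abs(expected) <= tolerance

def publish_or_withhold(rows, checks):
    """'No data' over 'wrong data': publish rows only if every check passed."""
    if all(checks):
        return rows
    return []  # withhold the batch instead of publishing suspect data

if __name__ == "__main__":
    rows = [{"order_id": 1}, {"order_id": 2}]
    checks = [
        reconcile_row_counts(source_count=2, target_count=len(rows)),
        within_margin(observed=len(rows), expected=2),
    ]
    print(publish_or_withhold(rows, checks))
```

Returning an empty batch is the simplest way to show the idea. A production pipeline would more likely fail the run and alert the owner so the gap is noticed.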

The Significance of First and Last Day of the Month in Data Engineering | 7mins

In data engineering, the first and last day of the month (FDOM/LDOM) are crucial dates. They give data engineers a recurring opportunity to validate, process, and optimize their pipelines. The author shares his personal experiences with monthly job failures and the lessons drawn from them. FDOM/LDOM matter for tasks such as data ingestion and extraction, data integrity checks, performance optimization, data completeness, data validation and aggregation, and data archiving. Lessons learned include understanding root causes, fortifying the ETL pipeline, emphasizing testing and validation, and fostering collaboration and communication. Advice for future data engineers includes adopting a proactive mindset, striving for continuous improvement, and cultivating a problem-solving attitude.
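
Month-boundary bugs often come from hard-coded day counts, so it can help to compute FDOM and LDOM from the calendar instead. The sketch below uses only the Python standard library. The helper names are hypothetical and are not taken from the article.

```python
import calendar
from datetime import date

def first_day_of_month(d):
    """Return the first calendar day of the month containing `d`."""
    return d.replace(day=1)

def last_day_of_month(d):
    """Return the last calendar day of the month containing `d`.

    calendar.monthrange returns (weekday of day 1, number of days in month),
    so leap years are handled without special cases.
    """
    _, days_in_month = calendar.monthrange(d.year, d.month)
    return d.replace(day=days_in_month)

def is_month_boundary(d):
    """True when `d` is the first or the last day of its month."""
    return d == first_day_of_month(d) or d == last_day_of_month(d)

if __name__ == "__main__":
    run_date = date(2024, 2, 10)
    print(first_day_of_month(run_date), last_day_of_month(run_date))
```

A scheduled job could call `is_month_boundary` on its run date to decide whether to trigger extra month-end validation or archiving steps.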
