Art of Data Newsletter - Issue #18
Welcome, all data fanatics. In today's issue:
- A podcast comparing Google Bard and ChatGPT
- Falling in love with Rust for data engineering
- Common Redis mistakes that lead to production outages
- "Fundamentals of Data Engineering" by Joe Reis and Matt Housley
- How Square measures the value of its internal tools
- Apache Flink vs. Apache Spark for stream processing
Let's dive in!
Google Bard is a new conversational AI model expected to rival ChatGPT and other conversational systems, and it is already causing a stir in the AI community. This podcast episode looks at Bard's distinctive features, its architecture, and its ability to produce human-like responses, compares it with ChatGPT, and highlights where Bard still falls short.
Daniel recounts how he fell in love with the Rust programming language and what it offers data engineering. He credits Rust's appeal to its speed, its memory model, immutability by default, and static typing. He also praises Cargo, Rust's package and dependency manager, for its simplicity and effectiveness, especially after his struggles with Python's packaging ecosystem. That said, Rust has a steep learning curve and is unlikely to displace Python as the go-to language for everyday data engineering tasks. Even so, he argues that learning and using Rust makes you a better engineer.
The article covers common mistakes with Redis that lead to production outages. It starts by explaining how concurrency in the application layer can create contention, since Redis is single-threaded and commands queue up on the server. Suggested remedies are sharding data across multiple Redis instances, or Redis Cluster as the more general approach. It then discusses Lua scripts and functions for logic that must run atomically, noting that the same single-threaded execution makes them a potential source of problems: a slow script blocks every other command. The article also advises setting memory-usage alerts at several thresholds to catch trouble before it becomes a failure. Other tips include understanding the subtleties of Redis' API, serialising objects to JSON strings carefully before storage, and using lists sensibly for large collections.
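For concreteness, here is a minimal sketch using the redis-py client against a local Redis instance; the key names and the bounded-counter Lua script are illustrative and are not taken from the article.

```python
# Minimal sketch with redis-py: JSON serialisation before storage, plus a Lua
# script for logic that must run atomically. Key names and the bounded-counter
# script are illustrative only.
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Serialise objects to JSON strings before storing them as plain values.
profile = {"id": 42, "plan": "pro"}
r.set("user:42:profile", json.dumps(profile))
restored = json.loads(r.get("user:42:profile"))

# Atomic bounded increment: the whole script runs as one unit on the server,
# so no other command can interleave with the read-check-write sequence.
BOUNDED_INCR = """
local current = tonumber(redis.call('GET', KEYS[1]) or '0')
if current < tonumber(ARGV[1]) then
    return redis.call('INCR', KEYS[1])
end
return current
"""
bounded_incr = r.register_script(BOUNDED_INCR)
value = bounded_incr(keys=["jobs:inflight"], args=[10])  # cap the counter at 10
```

Keeping such scripts short matters for exactly the reason above: while a script runs, the single-threaded server does nothing else.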
The book "Fundamentals of Data Engineering" by Joe Reis and Matt Housley is highly recommended for data engineers to gain a deep understanding of the important areas of data engineering. These include the concepts of data generation, ingestion, orchestration, transformation, storage, and governance.
Data engineering is described as the design, implementation, and maintenance of systems and processes that transform raw data into high-quality, consistent information and this intersects with security, data management, DataOps, data architecture, orchestration, and software engineering.
The book also covers the intricacies of data engineering lifecycle such as source systems, the choice of storage, the data ingestion process, the transformation stage, and the effective usage of data.
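As a toy illustration of those lifecycle stages, the sketch below ingests from a CSV string, transforms the rows, and stores them in SQLite; the source, schema, and sink are placeholders rather than examples from the book.

```python
# Toy end-to-end sketch of the lifecycle stages named above: ingest from a
# source system, transform, then store for downstream use. The CSV source and
# SQLite sink are stand-ins chosen only to keep the example self-contained.
import csv
import io
import sqlite3

RAW_CSV = "order_id,amount\n1,19.90\n2,5.00\n3,42.50\n"  # pretend source system

def ingest(raw: str) -> list[dict]:
    """Ingestion: pull raw records out of the source format."""
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows: list[dict]) -> list[tuple[int, float]]:
    """Transformation: cast types and keep only the fields downstream needs."""
    return [(int(r["order_id"]), float(r["amount"])) for r in rows]

def store(records: list[tuple[int, float]]) -> None:
    """Storage: land the cleaned records where consumers can query them."""
    con = sqlite3.connect("orders.db")
    con.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?)", records)
    con.commit()
    con.close()

store(transform(ingest(RAW_CSV)))
```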
Square relies on several internal tools to boost productivity and efficiency, including a communications platform, a customer data platform, and a data platform from Amplitude. These tools are crucial to delivering a superior experience to Square's external customers, so the company tracks their value and effectiveness with specific data and metrics, grouping the tools by the kind of metric needed to measure them: product metrics and operational metrics.
Product metrics gauge product usage and impact, covering user adoption, engagement, satisfaction, and business impact. Operational metrics evaluate how the product or service performs from the provider's perspective, covering internal impact, service levels, and reliability.
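To make the distinction concrete, here is a tiny sketch with made-up numbers and formulas; the article does not publish Square's actual metric definitions or figures.

```python
# Illustrative only: toy formulas for one product metric (adoption) and one
# operational metric (availability). The numbers are made up; they are not
# Square's actual definitions or data.
monthly_active_users = 1_800
eligible_employees = 2_400
adoption_rate = monthly_active_users / eligible_employees  # product metric

minutes_in_month = 30 * 24 * 60
downtime_minutes = 22
availability = 1 - downtime_minutes / minutes_in_month     # operational metric

print(f"Adoption: {adoption_rate:.1%}, availability: {availability:.3%}")
```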
Apache Flink and Apache Spark are both widely used for big data processing and analytics. Spark is lauded for its simplicity, high-level APIs, and capacity to process large amounts of data, while Flink excels at real-time, low-latency stream processing and stateful computations. The post compares how each handles common streaming patterns, their APIs, and their techniques for data preparation, processing, and enrichment. It concludes that both systems are evolving quickly and handle big data well, and recommends choosing between them based on the specific needs of the workload and the surrounding architecture. It also advises using user-defined functions judiciously so they don't slow down the job or cause backpressure.
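As a small illustration of the UDF advice, here is a minimal PySpark Structured Streaming sketch (Spark chosen here only for familiarity; Flink would serve equally well); the built-in rate source and the bucketing UDF are placeholders, not examples from the post.

```python
# Minimal Structured Streaming sketch: apply a small Python UDF to a stream.
# The rate source generates (timestamp, value) rows, which keeps the example
# self-contained; real jobs would read from Kafka or similar.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-stream-sketch").getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Keep UDFs small and cheap: every row pays the Python serialisation cost,
# and a slow UDF shows up as rising batch durations and backpressure.
@udf(StringType())
def bucket(value):
    return "even" if value % 2 == 0 else "odd"

enriched = stream.withColumn("bucket", bucket(col("value")))

query = (enriched.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination(30)  # run briefly for the demo, then stop
query.stop()
```

The same enrichment could be written with Flink's DataStream API; the point either way is that per-record Python logic is the first place to look when a streaming job slows down.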