Art of Data Newsletter - Issue #18
Photo by David Bartus: https://www.pexels.com/photo/photo-lavender-flower-field-under-pink-sky-1166209/

Welcome all Data fanatics. In today's issue:

  • Google's Bard vs OpenAI's ChatGPT
  • Why some Data Engineers love #Rust?
  • 4 ways to shoot yourself in the foot with #Redis
  • 10 things to learn from Joe Reis's Fundamentals of Data Engineering book
  • How to measure the value of internal tools
  • #ApacheSpark vs #ApacheFlink for streaming use cases

Let's dive in!


The LLM Battle Begins: Google Bard vs ChatGPT (Ep. 231) - Data Science at Home Podcast | 25mins

Google Bard is an AI model that is expected to rival ChatGPT and other conversational AI systems. Its distinctive features, advanced structure, and proficiency in producing human-like responses are notable. The model is causing a stir within the AI community, and this podcast compares the two and highlights their weaknesses.


Why (some data people) Love Rust? - by Daniel Beach | 11mins

Daniel discusses how he fell in love with the Rust programming language and its benefits in the world of data engineering. He credits Rust's appeal to its speed, memory model, immutability, and static typing. Cargo, Rust's package and dependency manager, is another feature he commends for its simplicity and effectiveness, especially after experiencing some issues with Python's packaging system. However, he also mentions that Rust has a steep learning curve and may not be the go-to programming language for everyday data engineering tasks, as Python still dominates this field. Despite this, learning and using Rust is deemed beneficial as it can enhance one's skills as an engineer.


Four ways to shoot yourself in the foot with Redis | 9mins

The article discusses common mistakes with Redis that can lead to production outages. It starts by explaining how concurrency in the application layer can create contention, because Redis is single-threaded and commands queue on the server. One suggested solution is sharding data across multiple Redis instances, or using Redis Cluster for a more general approach. Lua scripts or functions for logic that must run atomically are discussed as another potential source of problems, since a long-running script blocks every other command on that single thread. It also advises setting up alerts for memory usage at several thresholds to prevent failure. Other tips include understanding the subtleties of Redis' API, careful serialisation of objects to JSON strings before storage, and smart use of lists for large collections.
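
The sharding idea can be sketched in a few lines of Python. This is a minimal illustration and not the article's own code: the instance addresses, `shard_for`, and `group_by_shard` are hypothetical names, and the hash (CRC32 modulo the shard count) is an assumption chosen for simplicity.

```python
import zlib

# Hypothetical instance addresses, for illustration only.
SHARDS = ["redis-0:6379", "redis-1:6379", "redis-2:6379"]


def shard_for(key, shards=SHARDS):
    """Pick a shard for a key using a stable hash (CRC32) modulo the shard count."""
    index = zlib.crc32(key.encode("utf-8")) % len(shards)
    return shards[index]


def group_by_shard(keys, shards=SHARDS):
    """Group keys by target shard so each instance receives only its own commands."""
    groups = {}
    for key in keys:
        groups.setdefault(shard_for(key, shards), []).append(key)
    return groups


if __name__ == "__main__":
    for shard, keys in group_by_shard(["user:1", "user:2", "cart:9"]).items():
        print(shard, keys)
```

A plain modulo scheme like this remaps most keys whenever the number of shards changes. Redis Cluster avoids that by assigning keys to a fixed set of 16384 hash slots and moving slots between nodes, which is one reason it is the more general approach.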


10 Things I Learned from Reading Fundamentals of Data Engineering | 17mins

The book "Fundamentals of Data Engineering" by Joe Reis and Matt Housley is highly recommended for data engineers to gain a deep understanding of the important areas of data engineering. These include the concepts of data generation, ingestion, orchestration, transformation, storage, and governance.

Data engineering is described as the design, implementation, and maintenance of systems and processes that transform raw data into high-quality, consistent information. It intersects with security, data management, DataOps, data architecture, orchestration, and software engineering.

The book also covers the intricacies of the data engineering lifecycle, such as source systems, the choice of storage, the data ingestion process, the transformation stage, and the effective usage of data.


How To Measure the Value of Internal Tools | Square Corner Blog | 5mins

Square uses several internal tools to boost productivity and efficiency, including a communications platform, a customer data platform, and a data platform by Amplitude. These tools are crucial to providing a superior experience for their external customers. Square monitors the value and effectiveness of these tools using specific data and metrics. They categorize their tools based on the metrics required to track each one, and use both product and operational metrics.

Product metrics gauge product usage and impact, and include user adoption, engagement, satisfaction, and business impact. Operational metrics evaluate the performance and reliability of the product or service from the provider's perspective, including internal impact, service level, and reliability.


A side-by-side comparison of Apache Spark and Apache Flink for common streaming use cases | AWS Big Data Blog | 14mins

Apache Flink and Apache Spark are both widely used for big data processing and analytics. Spark is lauded for its simplicity, high-level APIs, and the capacity to process large amounts of data, while Flink excels at real-time, low-latency data stream processing and stateful computations. The post compares their usage for common streaming patterns, their APIs, and their techniques for handling data preparation, processing, and enrichment. It concludes that both systems are evolving quickly and are effective at handling big data, recommending that the choice between the two should depend on the specific needs of the workload and the surrounding architecture. Advice is also given on using user-defined functions judiciously to avoid slowing down the job or causing backpressure.

The Bard podcast was notably interesting. Thanks!
