Art of Data Newsletter - Issue #17
Photo by Joseph Costa: https://www.pexels.com/photo/brown-concrete-buildings-and-bridge-1462935/


Welcome all Data fanatics. In today's issue:

  • Are Kubernetes' days numbered, and what is the future of managed Kubernetes?
  • The future of observability
  • Finetuning Large Language Models
  • Riverbed: optimizing data access at Airbnb's scale
  • Securely scaling big data access controls at Pinterest
  • Lessons learned running Presto at Meta scale
  • PayPal's data contract template for a data mesh

Let's dive in!


Are Kubernetes days numbered? …and if so — what is the future for… | 7mins

This article discusses the future of Kubernetes, the leading container orchestration system, and Google's managed Kubernetes offerings. The author, a Google Cloud Architect, believes Kubernetes is not likely to become obsolete, but expects a shift towards increasingly managed Kubernetes systems. Despite the various components that can require extra maintenance, most Kubernetes clusters are already run in a 'managed' way through hyperscaler offerings such as GKE, EKS, and AKS. Google currently offers five options for running containers, including GKE Standard, GKE Autopilot, and Cloud Run. The recent focus has been on increasing abstraction and reducing the direct engineer involvement needed to operate a cluster.


The Future of Observability. How is observability changing in recent… | 10mins

Observability has evolved significantly in recent years due to the increased use of microservices and distributed systems in businesses. The trend toward more complex systems requires more automation and advanced observability practices. The rise of cloud-native technology has also changed which companies lead the market. The shift in observability in 2023 and beyond is driven by businesses adopting technologies such as microservices, Kubernetes, and distributed architectures, which provide improved security, scalability, and efficiency. The challenges, however, include prohibitive costs, evolving priorities, and changing expectations for observability. In 2023, observability tools will need to address these challenges through unified observability, integration of observability and business data, vendor-agnostic approaches, and predictive observability and security.


Finetuning Large Language Models | 19mins

In the rapidly developing field of artificial intelligence, efficient use of large language models (LLMs) has become key. LLMs can be applied to new tasks in two main ways: in-context learning and finetuning. In-context learning works when direct access to the model is limited, for instance when the model is used through an API. However, if access to the model is available, adapting and finetuning it usually yields superior outcomes. Conventional adaptation methods include a feature-based approach, updating only the output layers (finetuning I), and updating all layers (finetuning II). Despite being costlier, updating all layers yields the best modeling results.
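The difference between updating only the output layers and updating all layers is easy to see in code. Below is a minimal sketch, assuming a Hugging Face Transformers classification model; the model name and the parameter-name prefixes used for freezing are illustrative assumptions, not taken from the article.

```python
# Sketch of "finetuning I" (train only the task head) vs. "finetuning II"
# (train all layers). Model choice and parameter names are assumptions.
import torch
from transformers import AutoModelForSequenceClassification

model_name = "distilbert-base-uncased"  # hypothetical base model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Finetuning I: freeze the pretrained body, train only the classification head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("pre_classifier", "classifier"))

# Finetuning II: train all layers (costlier, usually the best results).
# for param in model.parameters():
#     param.requires_grad = True

# Optimize only the parameters that are still trainable.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=5e-5
)
```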


Riverbed: Optimizing Data Access at Airbnb’s Scale | 10mins

Airbnb has launched a Lambda-like data framework called Riverbed that targets fast read performance and high availability. Riverbed was created in response to the growing number of queries that access multiple data sources and necessitate complex data transformations. Riverbed addresses challenges faced by Airbnb's payment system, which requires accessing multiple data sources and complex business logic to compute data such as fees, transaction dates, currencies, amounts, and total earnings. Riverbed utilises a combination of Change-Data-Capture (CDC), stream processing, and a database to persist the final results. The system currently processes 2.4 billion events and writes 350 million documents on a daily basis, powering 50+ materialized views across Airbnb.
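As a rough illustration of the CDC-plus-stream-processing pattern described here (not Airbnb's actual Riverbed code; the event shape and field names are assumptions), a consumer can fold change events into a denormalized document that would then be persisted to the serving database:

```python
# Toy sketch of folding change-data-capture events into a materialized view.
from dataclasses import dataclass, field

@dataclass
class PayoutView:
    """Denormalized document (materialized view) keyed by host id."""
    host_id: str
    total_earnings: float = 0.0
    transactions: list = field(default_factory=list)

def apply_cdc_event(views: dict, event: dict) -> None:
    """Fold one CDC event into the in-memory view for its host."""
    view = views.setdefault(event["host_id"], PayoutView(event["host_id"]))
    if event["table"] == "transactions" and event["op"] == "insert":
        view.transactions.append(event["row"])
        view.total_earnings += event["row"]["amount"] - event["row"]["fee"]
    # A real pipeline would upsert the updated document into the sink
    # database here instead of keeping it in memory.

views = {}
apply_cdc_event(views, {
    "table": "transactions", "op": "insert", "host_id": "h-1",
    "row": {"amount": 120.0, "fee": 3.5, "currency": "USD"},
})
print(views["h-1"].total_earnings)  # 116.5
```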


Securely Scaling Big Data Access Controls At Pinterest | 24mins

Pinterest has implemented a Finer Grained Access Control (FGAC) framework to securely manage big data and give users and services access only to the data they need for their work. The system extends and enhances Monarch, Pinterest's Hadoop-based batch processing system. Initially, the system, which deals with large quantities of non-transient data, used dedicated service instances in which entire clusters were granted access to specific datasets.

However, as new datasets with different access requirements were created, new clusters had to be created as well, increasing hardware and maintenance costs. Pinterest therefore switched from a host-centric to a user-centric model, granting individual users access to specific data through a common set of service clusters rather than granting each cluster access to a large superset of data. A toy sketch of the difference follows below.
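The sketch below is purely illustrative (dataset and grant names are hypothetical, not Pinterest's): in the host-centric model authorization is attached to the cluster, while in the user-centric FGAC model it follows the user submitting the job.

```python
# Host-centric: the whole cluster is granted access, so every job running
# on it can read every dataset the cluster can read.
CLUSTER_GRANTS = {"cluster-a": {"payments", "ads"}}

# User-centric (FGAC-style): access follows the user submitting the job,
# even when jobs share a common set of service clusters.
USER_GRANTS = {"alice": {"payments"}, "bob": {"ads"}}

def can_read(user: str, dataset: str) -> bool:
    """Authorize a read based on the user's own grants, not the host's."""
    return dataset in USER_GRANTS.get(user, set())

assert can_read("alice", "payments")
assert not can_read("alice", "ads")
```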


Lessons Learned Running Presto at Meta Scale | 12mins

Meta has scaled up its use of Presto, a free, open source SQL query engine, over the past decade, learning valuable lessons in the process. To accommodate the rapid scaling of Presto to meet growing demand, Meta created an efficient deployment process and used automation to ensure constant availability and reduce manual work. The strategies include a load balancer dubbed Gateway that directs Presto queries, automatically generated configurations for new clusters, and integration with company-wide infrastructure services. Additionally, automated debugging and remediation tools minimize human involvement while improving performance. Two examples of such tools are ‘Bad Host Detection’ for identifying problematic machines and diagnostic tools geared towards queueing issues.
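The mechanics are easy to picture with a small, purely illustrative sketch (not Meta's implementation): a gateway picks a healthy, least-loaded cluster for each query, and a bad-host check flags workers whose recent error rate crosses a threshold.

```python
# Illustrative query routing and bad-host detection; names are assumptions.
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    running_queries: int
    healthy: bool = True

def route_query(clusters: list) -> Cluster:
    """Pick the least-loaded healthy cluster for the next Presto query."""
    candidates = [c for c in clusters if c.healthy]
    if not candidates:
        raise RuntimeError("no healthy Presto clusters available")
    return min(candidates, key=lambda c: c.running_queries)

def is_bad_host(errors: int, total: int, threshold: float = 0.2) -> bool:
    """Flag a worker whose recent error rate exceeds the threshold."""
    return total > 0 and errors / total > threshold

clusters = [Cluster("presto-1", 40), Cluster("presto-2", 12), Cluster("presto-3", 7, healthy=False)]
print(route_query(clusters).name)         # presto-2
print(is_bad_host(errors=30, total=100))  # True
```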

GitHub - paypal/data-contract-template: Template for a data contract used in a data mesh | 2mins

PayPal has introduced a data contract template as part of its Data Mesh implementation. A data contract is a mutual agreement between a data producer and a data consumer. The contract comprises several elements, such as fundamentals, schema, data quality, service-level agreement (SLA), security & stakeholders, and custom properties. PayPal uses these data contracts to manage data interactions. A number of articles have been published about the template, and readers who find new ones are encouraged to add them via a pull request.
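Since the template is a YAML document, a consumer can load a contract and check that the agreed sections are present before registering it in a catalog. The field names below are illustrative examples of the sections listed above, not a verbatim copy of PayPal's template.

```python
# Minimal sketch: load a (hypothetical) data contract and validate its sections.
import yaml  # pip install pyyaml

contract_yaml = """
fundamentals:
  dataset: payments_summary
  version: 1.0.0
schema:
  columns:
    - name: transaction_id
      type: string
quality:
  - rule: transaction_id is never null
sla:
  freshness: "24h"
stakeholders:
  - role: data_producer
    name: payments-team
"""

contract = yaml.safe_load(contract_yaml)

# Check that the sections agreed between producer and consumer are present.
required_sections = {"fundamentals", "schema", "quality", "sla", "stakeholders"}
missing = required_sections - set(contract)
print("missing sections:", missing or "none")
```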

