Art of Data Newsletter - Issue #17
Welcome, all Data fanatics. In today's issue:
Let's dive in!
This article discusses the future of Kubernetes, the leading container orchestration system, and Google's managed Kubernetes offerings. The author, a Google Cloud Architect, believes Kubernetes is unlikely to become obsolete, but expects a future shift towards increasingly managed Kubernetes systems. Despite the various components that potentially require extra maintenance, most Kubernetes clusters are already run in a 'managed' way on the large hyperscalers' offerings, such as GKE, EKS, and AKS. Google currently offers five options for running containers, including GKE Standard vs Autopilot and Cloud Run. Recent focus has been on raising the level of abstraction and reducing direct engineer involvement in operating the cluster.
Observability has evolved significantly in recent years due to the increased use of microservices and distributed systems in businesses. The trend toward complex systems requires more automation and advanced observability practices. Cloud-native technology has also emerged, leading to changes in the companies leading the market. The current shift in observability in 2023 and beyond is driven by businesses adopting new technology such as microservices, Kubernetes, and distributed architecture, which provide improved security, scalability, and efficiency. However, challenges include prohibitive costs, evolving priorities, and changing expectations for observability. In 2023, observability tools will need to address these challenges through unified observability, integration of observability and business data, vendor-agnostic approaches, and predictive observability and security.
Finetuning Large Language Models | 19mins
In the rapidly developing field of artificial intelligence, efficient use of large language models (LLMs) has become key. These LLMs can primarily be used in two ways for new tasks: in-context learning and finetuning. In-context learning can be applied when direct access to the model is limited, for instance, when using the model through an API. However, if access to the model is available, adapting and finetuning it usually yields superior outcomes. Conventional methods of adaptation include a feature-based approach, updating only the output layers, and updating all layers. Despite being costlier, updating all layers (finetuning II) yields superior modeling results.
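The three adaptation strategies above can be sketched as a parameter-selection problem: which weights does the optimizer get to touch? The toy model and helper below are purely illustrative (plain Python dicts standing in for real layers, with invented names), not the article's code:

```python
# Toy illustration of which parameter groups each adaptation strategy trains.
# "backbone" and "head" are invented layer names standing in for a real LLM.

def trainable_parameters(model, strategy):
    """Return the parameter groups the optimizer would update.

    model: dict mapping layer name -> weights
    strategy: 'feature_based' | 'output_layers' | 'all_layers'
    """
    if strategy == "feature_based":
        # Freeze everything; embeddings are extracted from the frozen model
        # and a separate classifier is trained on top of them.
        return {}
    if strategy == "output_layers":   # "finetuning I"
        return {"head": model["head"]}
    if strategy == "all_layers":      # "finetuning II" -- costlier, best results
        return dict(model)
    raise ValueError(f"unknown strategy: {strategy}")

toy_model = {"backbone": [0.1, 0.2, 0.3], "head": [0.5]}
print(sorted(trainable_parameters(toy_model, "output_layers")))  # ['head']
print(sorted(trainable_parameters(toy_model, "all_layers")))     # ['backbone', 'head']
```

In a real framework the same idea is expressed by freezing weights (e.g. disabling gradient tracking on the backbone) rather than by selecting dict entries.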
Airbnb has launched a Lambda-like data framework called Riverbed that targets fast read performance and high availability. Riverbed was created in response to the growing number of queries that access multiple data sources and necessitate complex data transformations. Riverbed addresses challenges faced by Airbnb's payment system, which requires accessing multiple data sources and complex business logic to compute data such as fees, transaction dates, currencies, amounts, and total earnings. Riverbed utilises a combination of Change-Data-Capture (CDC), stream processing, and a database to persist the final results. The system currently processes 2.4 billion events and writes 350 million documents on a daily basis, powering 50+ materialized views across Airbnb.
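The core of the CDC-plus-stream-processing pattern is folding change events from several sources into a precomputed document that reads can hit directly. The sketch below is a hedged illustration; event shapes and field names are invented, and Airbnb's actual Riverbed pipeline runs on dedicated CDC and streaming infrastructure:

```python
# Illustrative fold of change-data-capture (CDC) events into a per-host
# materialized view. Field names ("host_id", "fees", etc.) are invented.

def apply_cdc_event(view, event):
    """Merge one CDC event into the materialized earnings document."""
    doc = view.setdefault(event["host_id"],
                          {"total_earnings": 0.0, "transactions": []})
    if event["source"] == "payments":
        doc["transactions"].append(event["transaction_id"])
        doc["total_earnings"] += event["amount"] - event.get("fees", 0.0)
    return view

view = {}
events = [
    {"source": "payments", "host_id": "h1", "transaction_id": "t1",
     "amount": 120.0, "fees": 10.0},
    {"source": "payments", "host_id": "h1", "transaction_id": "t2",
     "amount": 80.0, "fees": 5.0},
]
for e in events:
    apply_cdc_event(view, e)
print(view["h1"]["total_earnings"])  # 185.0
```

The payoff is at read time: the expensive joins and business logic have already happened, so a query is a single key lookup against the persisted view.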
Pinterest has implemented a Finer Grained Access Control (FGAC) framework to securely manage big data and provide users and services access only to the data they require for their work. The system extends and enhances Monarch, Pinterest's Hadoop-based batch processing system. Initially, to handle large quantities of non-transient data, the system used dedicated service instances, with each cluster granted access to specific datasets.
However, as new datasets requiring different access controls were created, new clusters had to be created as well, increasing hardware and maintenance costs. Therefore, Pinterest switched from a host-centric system to a user-centric system, granting different users access to specific data via a common set of service clusters, thus preventing the creation of large supersets.
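The host-centric vs user-centric distinction above boils down to where the grant is attached: to the cluster or to the caller. This is a minimal sketch with invented user and dataset names; Pinterest's FGAC is integrated into Monarch and real credential and authorization infrastructure:

```python
# Host-centric: each cluster carries a fixed dataset grant, so a new
# access pattern forces a new cluster (invented names throughout).
CLUSTER_GRANTS = {
    "cluster-a": {"ads_data"},
    "cluster-b": {"payments_data"},
}

# User-centric: grants attach to users, so any shared cluster can serve
# the request after checking the caller's own entitlements.
USER_GRANTS = {
    "alice": {"ads_data"},
    "bob": {"ads_data", "payments_data"},
}

def can_access(user, dataset):
    """Authorize against the user's grants, not the cluster's."""
    return dataset in USER_GRANTS.get(user, set())

print(can_access("alice", "payments_data"))  # False
print(can_access("bob", "payments_data"))    # True
```

Because authorization follows the user, adding a dataset with new access rules means adding a grant, not provisioning another cluster.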
Meta has scaled up its use of Presto, a free, open source SQL query engine, over the past decade, learning valuable lessons in the process. To accommodate the rapid scaling of Presto to meet growing demand, Meta created an efficient deployment process and used automation to ensure constant availability and reduce manual tasks. Among these strategies are using a load balancer dubbed Gateway to direct Presto queries, automatically generating configurations for new clusters, and integrating the process with company-wide infrastructure services. Additionally, automated debugging and remediation tools minimize human involvement while improving performance. Two examples of such tools are 'Bad Host Detection' for identifying problematic machines, and diagnostic tools geared towards queueing issues.
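The Gateway idea combines with bad-host detection naturally: the router skips clusters (or hosts) the detector has flagged. The sketch below is an invented simplification with made-up cluster names; Meta's Gateway is tied into its real Presto deployment and health-checking tooling:

```python
# Hedged sketch of a Gateway-style router that round-robins Presto queries
# across clusters, skipping any flagged as unhealthy.
import itertools

class Gateway:
    def __init__(self, clusters):
        self.clusters = clusters              # name -> healthy flag
        self._rr = itertools.cycle(sorted(clusters))

    def route(self):
        """Pick the next healthy cluster, or fail if none remain."""
        for _ in range(len(self.clusters)):
            name = next(self._rr)
            if self.clusters[name]:
                return name
        raise RuntimeError("no healthy Presto cluster available")

gw = Gateway({"presto-1": True, "presto-2": False, "presto-3": True})
print([gw.route() for _ in range(4)])
# ['presto-1', 'presto-3', 'presto-1', 'presto-3']
```

In the real system the healthy flag would be driven by the automated detectors rather than set by hand, which is what lets remediation happen without a human in the loop.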
PayPal has introduced a data contract template as part of its Data Mesh implementation. A data contract is a mutual agreement between a data producer and a consumer. The contract comprises several elements, such as fundamentals, schema, data quality, service-level agreement (SLA), security & stakeholders, and custom properties. PayPal uses these data contracts to manage data interactions. A number of articles have been published about this data contract template, and readers are encouraged to add newly found articles on the subject via a pull request.
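The sections listed above lend themselves to a simple structural check: a contract is only valid if every agreed-upon section is present. The snippet below is a minimal sketch with invented placeholder values; PayPal's actual template (published as YAML) defines many more keys per section:

```python
# Illustrative data contract with the sections named above; all field
# contents are invented placeholders, not PayPal's real template.
REQUIRED_SECTIONS = {
    "fundamentals", "schema", "quality", "sla", "security", "custom_properties",
}

contract = {
    "fundamentals": {"dataset": "orders", "version": "1.0.0"},
    "schema": {"order_id": "string", "amount": "decimal", "currency": "string"},
    "quality": [{"rule": "not_null", "column": "order_id"}],
    "sla": {"freshness": "1h", "availability": "99.9%"},
    "security": {"classification": "confidential", "owner": "data-platform-team"},
    "custom_properties": {},
}

def validate(contract):
    """Fail fast if the producer/consumer agreement is missing a section."""
    missing = REQUIRED_SECTIONS - contract.keys()
    if missing:
        raise ValueError(f"contract missing sections: {sorted(missing)}")
    return True

print(validate(contract))  # True
```

Validating contracts mechanically like this is what turns the agreement from documentation into something producers can be held to in CI.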