Art of Data Newsletter - Issue #17
Welcome, all Data fanatics. In today's issue:
Let's dive in!
This article discusses the future of Kubernetes, the leading container orchestration system, and Google's managed Kubernetes offerings. The author, a Google Cloud Architect, believes Kubernetes is unlikely to become obsolete, but expects a future shift towards increasingly managed Kubernetes systems. Despite the various components that potentially require extra maintenance, most Kubernetes clusters are already run in a 'managed' way on the large hyperscalers' offerings, such as GKE, EKS, and AKS. Google currently offers five options for running containers, including GKE Standard vs Autopilot and Cloud Run. Recent focus has been on raising the level of abstraction and reducing direct engineer involvement in operating the cluster.
Observability has evolved significantly in recent years due to the increased use of microservices and distributed systems in businesses. The trend toward complex systems requires more automation and advanced observability practices. Cloud-native technology has also emerged, leading to changes in the companies leading the market. The current shift in observability in 2023 and beyond is driven by businesses adopting new technology such as microservices, Kubernetes, and distributed architecture, which provide improved security, scalability, and efficiency. However, challenges include prohibitive costs, evolving priorities, and changing expectations for observability. In 2023, observability tools will need to address these challenges through unified observability, integration of observability and business data, vendor-agnostic approaches, and predictive observability and security.
Finetuning Large Language Models | 19mins
In the rapidly developing field of artificial intelligence, efficient use of large language models (LLMs) has become key. These LLMs can primarily be used in two ways for new tasks: in-context learning and finetuning. In-context learning can be applied when direct access to the model is limited, for instance, when using the model through an API. However, if access to the model is available, adapting and finetuning it usually yields superior outcomes. Conventional methods of adaptation include a feature-based approach, updating only the output layers, and updating all layers. Despite being costlier, updating all layers (finetuning II) yields superior modeling results.
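The three adaptation strategies above can be sketched as a parameter-selection problem: which weights does the optimizer get to touch? The toy model and helper below are purely illustrative (plain Python dicts standing in for real layers, with invented names), not the article's code:

```python
# Toy illustration of which parameter groups each adaptation strategy trains.
# "backbone" and "head" are invented layer names standing in for a real LLM.

def trainable_parameters(model, strategy):
    """Return the parameter groups the optimizer would update.

    model: dict mapping layer name -> weights
    strategy: 'feature_based' | 'output_layers' | 'all_layers'
    """
    if strategy == "feature_based":
        # Freeze everything; embeddings are extracted from the frozen model
        # and a separate classifier is trained on top of them.
        return {}
    if strategy == "output_layers":   # "finetuning I"
        return {"head": model["head"]}
    if strategy == "all_layers":      # "finetuning II" -- costlier, best results
        return dict(model)
    raise ValueError(f"unknown strategy: {strategy}")

toy_model = {"backbone": [0.1, 0.2, 0.3], "head": [0.5]}
print(sorted(trainable_parameters(toy_model, "output_layers")))  # ['head']
print(sorted(trainable_parameters(toy_model, "all_layers")))     # ['backbone', 'head']
```

In a real framework the same idea is expressed by freezing weights (e.g. disabling gradient tracking on the backbone) rather than by selecting dict entries.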
Airbnb has launched a Lambda-like data framework called Riverbed that targets fast read performance and high availability. Riverbed was created in response to the growing number of queries that access multiple data sources and necessitate complex data transformations. Riverbed addresses challenges faced by Airbnb's payment system, which requires accessing multiple data sources and complex business logic to compute data such as fees, transaction dates, currencies, amounts, and total earnings. Riverbed utilises a combination of Change-Data-Capture (CDC), stream processing, and a database to persist the final results. The system currently processes 2.4 billion events and writes 350 million documents on a daily basis, powering 50+ materialized views across Airbnb.
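The core of the CDC-plus-stream-processing pattern is folding change events from several sources into a precomputed document that reads can hit directly. The sketch below is a hedged illustration; event shapes and field names are invented, and Airbnb's actual Riverbed pipeline runs on dedicated CDC and streaming infrastructure:

```python
# Illustrative fold of change-data-capture (CDC) events into a per-host
# materialized view. Field names ("host_id", "fees", etc.) are invented.

def apply_cdc_event(view, event):
    """Merge one CDC event into the materialized earnings document."""
    doc = view.setdefault(event["host_id"],
                          {"total_earnings": 0.0, "transactions": []})
    if event["source"] == "payments":
        doc["transactions"].append(event["transaction_id"])
        doc["total_earnings"] += event["amount"] - event.get("fees", 0.0)
    return view

view = {}
events = [
    {"source": "payments", "host_id": "h1", "transaction_id": "t1",
     "amount": 120.0, "fees": 10.0},
    {"source": "payments", "host_id": "h1", "transaction_id": "t2",
     "amount": 80.0, "fees": 5.0},
]
for e in events:
    apply_cdc_event(view, e)
print(view["h1"]["total_earnings"])  # 185.0
```

The payoff is at read time: the expensive joins and business logic have already happened, so a query is a single key lookup against the persisted view.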
Pinterest has implemented a Finer Grained Access Control (FGAC) framework to securely manage big data and provide users and services access only to the data they require for their work. The system extends and enhances Monarch, Pinterest's Hadoop-based batch processing system. Initially, to handle large quantities of non-transient data, the system used dedicated service instances, with each cluster granted access to specific datasets.
However, as new datasets requiring different access controls were created, new clusters had to be created as well, increasing hardware and maintenance costs. Therefore, Pinterest switched from a host-centric system to a user-centric system, granting different users access to specific data via a common set of service clusters, thus preventing the creation of large supersets.
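The host-centric vs user-centric distinction above boils down to where the grant is attached: to the cluster or to the caller. This is a minimal sketch with invented user and dataset names; Pinterest's FGAC is integrated into Monarch and real credential and authorization infrastructure:

```python
# Host-centric: each cluster carries a fixed dataset grant, so a new
# access pattern forces a new cluster (invented names throughout).
CLUSTER_GRANTS = {
    "cluster-a": {"ads_data"},
    "cluster-b": {"payments_data"},
}

# User-centric: grants attach to users, so any shared cluster can serve
# the request after checking the caller's own entitlements.
USER_GRANTS = {
    "alice": {"ads_data"},
    "bob": {"ads_data", "payments_data"},
}

def can_access(user, dataset):
    """Authorize against the user's grants, not the cluster's."""
    return dataset in USER_GRANTS.get(user, set())

print(can_access("alice", "payments_data"))  # False
print(can_access("bob", "payments_data"))    # True
```

Because authorization follows the user, adding a dataset with new access rules means adding a grant, not provisioning another cluster.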
Meta has scaled up its use of Presto, a free, open source SQL query engine, over the past decade, learning valuable lessons in the process. To accommodate the rapid scaling of Presto to meet growing demand, Meta created an efficient deployment process and used automation to ensure constant availability and reduce manual tasks. Among these strategies are using a load balancer dubbed Gateway to direct Presto queries, automatically generating configurations for new clusters, and integrating the process with company-wide infrastructure services. Additionally, automated debugging and remediation tools minimize human involvement while improving performance. Two examples of such tools are 'Bad Host Detection' for identifying problematic machines, and diagnostic tools geared towards queueing issues.
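The Gateway idea combines with bad-host detection naturally: the router skips clusters (or hosts) the detector has flagged. The sketch below is an invented simplification with made-up cluster names; Meta's Gateway is tied into its real Presto deployment and health-checking tooling:

```python
# Hedged sketch of a Gateway-style router that round-robins Presto queries
# across clusters, skipping any flagged as unhealthy.
import itertools

class Gateway:
    def __init__(self, clusters):
        self.clusters = clusters              # name -> healthy flag
        self._rr = itertools.cycle(sorted(clusters))

    def route(self):
        """Pick the next healthy cluster, or fail if none remain."""
        for _ in range(len(self.clusters)):
            name = next(self._rr)
            if self.clusters[name]:
                return name
        raise RuntimeError("no healthy Presto cluster available")

gw = Gateway({"presto-1": True, "presto-2": False, "presto-3": True})
print([gw.route() for _ in range(4)])
# ['presto-1', 'presto-3', 'presto-1', 'presto-3']
```

In the real system the healthy flag would be driven by the automated detectors rather than set by hand, which is what lets remediation happen without a human in the loop.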
PayPal has introduced a data contract template as part of its Data Mesh implementation. A data contract is a mutual agreement between a data producer and a consumer. The contract comprises several elements, such as fundamentals, schema, data quality, service-level agreement (SLA), security & stakeholders, and custom properties. PayPal uses these data contracts to manage data interactions. A number of articles have been published about this data contract template, and readers are encouraged to add newly found articles on the subject via a pull request.
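The sections listed above lend themselves to a simple structural check: a contract is only valid if every agreed-upon section is present. The snippet below is a minimal sketch with invented placeholder values; PayPal's actual template (published as YAML) defines many more keys per section:

```python
# Illustrative data contract with the sections named above; all field
# contents are invented placeholders, not PayPal's real template.
REQUIRED_SECTIONS = {
    "fundamentals", "schema", "quality", "sla", "security", "custom_properties",
}

contract = {
    "fundamentals": {"dataset": "orders", "version": "1.0.0"},
    "schema": {"order_id": "string", "amount": "decimal", "currency": "string"},
    "quality": [{"rule": "not_null", "column": "order_id"}],
    "sla": {"freshness": "1h", "availability": "99.9%"},
    "security": {"classification": "confidential", "owner": "data-platform-team"},
    "custom_properties": {},
}

def validate(contract):
    """Fail fast if the producer/consumer agreement is missing a section."""
    missing = REQUIRED_SECTIONS - contract.keys()
    if missing:
        raise ValueError(f"contract missing sections: {sorted(missing)}")
    return True

print(validate(contract))  # True
```

Validating contracts mechanically like this is what turns the agreement from documentation into something producers can be held to in CI.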