DATA Pill #070 - 3 dbt SQL engines, Machine Learning Platform at Walmart


Hi,

What a week!

A lot of giants shared their data knowledge.

Dig into articles from Pinterest, Walmart and PayPal and, as always, more.

And we will finally solve this dilemma:


ARTICLES

Last Mile Data Processing with Ray | 8 min | ML | Raymond Lee, Qingxian Lai, Karthik Anantha Padmanabhan, Se Won Jang | Pinterest Engineering Blog

The Pinterest team assesses the bottlenecks that slow down ML developer velocity and describes how they integrated Ray, an open-source framework, into their ML Platform. The integration has substantially improved dataset iteration speed, cutting iteration time from days to hours, and has raised GPU utilization to over 90%.

Head-to-head comparison of 3 dbt SQL engines | 8 min | SQL | Niels Claeys | Data Minded Blog

This blog post compares three popular open-source SQL engines (DuckDB, Trino and Spark) for use with dbt in data pipelines. The benchmarking setup uses the TPC-DS benchmark with medium-sized datasets and shows that DuckDB is the fastest in 75% of cases thanks to its single-node advantage, while Trino is the fastest in the remaining 25%. The post also touches on the user experience and the differences in SQL dialects between these engines when integrating them with dbt.


Combining Kedro and Streamlit to build a simple LLM-based Reading Assistant | 10 min | LLM | Piotr Chaberski | Part of Xebia Blog

This article shows how to effectively harness the power of large language models by combining their robust language understanding capabilities with a clean implementation and a user-friendly experience, using commercial LLM APIs, Kedro and Streamlit.

In MORE LINKS you will find scaling Kafka to support PayPal’s data growth, Machine Learning Platform at Walmart, Best Practices for LLM evaluation of RAG applications

{ MORE LINKS }



TUTORIALS

Building a Real-Time Service Marketplace with Confluent Cloud | 9 min | Cloud | Arpita Agarwal | Confluent Tech Blog

How a leading service marketplace overcame challenges and harnessed Confluent Cloud, a managed Apache Kafka solution, to build a centralized streaming platform for event-driven processing. Learn about advanced techniques such as data integrity, security and real-time analytics that empower the creation of a scalable, responsive and reliable ecosystem for tradespeople and clients.

In MORE LINKS you will find how to securely process near-real-time data from Amazon MSK Serverless using an AWS Glue streaming ETL job with IAM authentication

{ MORE LINKS }



NEWS

Introducing Infrastructure Manager: Provision Google Cloud resources with Terraform | 3 min | Cloud | Danny Hammo, Vlad Ouzienko | Google Cloud Blog

Infrastructure Manager allows users to efficiently oversee their Google Cloud infrastructure through Infrastructure as Code (IaC) principles, all powered by Terraform's robust foundation. This approach offers the benefits of both worlds - a managed, streamlined method for deploying, configuring and managing cloud resources using declarative configurations.



TOOLS

SQLLineage | SQL

This tool simplifies identifying source and target tables in SQL commands. It handles all the parser intricacies using libraries like SQLFluff and sqlparse, and generates a user-friendly lineage graph.
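To make "source and target tables" concrete, here is a deliberately naive sketch of the idea in plain Python. It is not how SQLLineage works: the function name `naive_lineage` and both regular expressions are hypothetical, and a regex approach will misfire on CTEs, subqueries, quoted identifiers and clauses such as IF NOT EXISTS, which is exactly why real tools rely on a proper SQL parser.

```python
import re

# Hypothetical patterns for illustration only; a real lineage tool parses the SQL.
# Targets: tables written to by INSERT INTO / CREATE TABLE statements.
TARGET_RE = re.compile(r"\b(?:insert\s+into|create\s+table)\s+([\w.]+)", re.IGNORECASE)
# Sources: tables read from in FROM / JOIN clauses.
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def naive_lineage(sql: str) -> tuple[set[str], set[str]]:
    """Return (source_tables, target_tables) found by simple pattern matching."""
    targets = set(TARGET_RE.findall(sql))
    # A table that is written to is not reported as a source in this sketch.
    sources = set(SOURCE_RE.findall(sql)) - targets
    return sources, targets


sql = "INSERT INTO db1.table11 SELECT * FROM db2.table21 UNION SELECT * FROM db2.table22;"
sources, targets = naive_lineage(sql)
print("sources:", sorted(sources))
print("targets:", sorted(targets))
```

For this statement the sketch reports db2.table21 and db2.table22 as sources and db1.table11 as the target, which is the kind of relationship a lineage graph visualizes.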

In MORE LINKS you will find StarRocks

{ MORE LINKS }



PODCASTS

Productivization of Data | 2 h 43 min | AI | guest: Kristofer Ågren | AIAW Podcast

Let’s explore Telia Division X's future, organizational structure, customer-driven vs. tech-driven development, data product creation while ensuring user privacy, generative AI and LLMs, the changing landscape of transportation and coding in an AI-driven world, the call for an AI race pause and Kristofer's plans.



CONFS EVENTS AND MEETUPS

Google Cloud Summit Poland | On-site | 26th October | Warsaw

Join the most significant Google Cloud event of the year, organized for the first time in Poland at the Palace of Culture and Science in Warsaw. Google Cloud Summit Poland is a free event bringing everyone in the cloud community together.

Discover advancements in artificial intelligence, application modernization, collaboration tools, data cloud solutions, open infrastructure and cutting-edge security measures. The event is designed to propel your digital transformation efforts and enhance your business outcomes.

________________________

Have any interesting content to share in the DATA Pill newsletter?

Join us on GitHub

Dig into previous editions of DATA Pill


Adam from GetInData | Part of Xebia
