DATA Pill #048 - Zero-ETL, Chat GPT and why NOT to use Kubeflow

DATA Pill #048 - Zero-ETL, Chat GPT and why NOT to use Kubeflow

Hi,

From Kafka to Delta Lake using Apache Spark Structured Streaming, AWS Lambda response streaming, and much more.

This week so much great content was created.

Dive into the best of them:



ARTICLES

Having your cake and eating it too: How Vizio built a next-generation data platform to enable BI reporting, real-time streaming, and AI/ML | 6 min | AI/ML | Parveen Jindal, Darren Liu, Alina Smirnova | Personal Blog

Vizio shares their success story of creative problem-solving by utilizing multiple data services and a data warehouse. When they needed to expand their capabilities, they developed a unified platform that consolidates different data platform use cases with linear scaling costs and full observability, setting them up for success with advanced analytics products.

No alt text provided for this image

Zero-ETL, ChatGPT, And The Future of Data Engineering | 9 min | Data Engineering | Barr Moses | Towards Data Science Blog

Let's explore the concept of Zero-ETL, which refers to the ability to directly access data from its source without implementing complex ETL (Extract, Transform, Load) processes. Barr outlines how this concept is being utilized in ChatGPT.?

Read about predictions that the future of data engineering lies in breaking down data silos and directly accessing data from its source using tools such as Zero-ETL, thereby reducing the time, effort, and costs involved in traditional ETL processes.

No alt text provided for this image

Do not use Kubeflow! | 7 min | Machine Learning | Josue Luzardo Gebrim | Personal Blog

It is not rocket science that not all tools are for everyone. Josue explains why he doesn't like to work with Kubeflow and describes 8 great alternatives for those who have similar opinions about this tool.


Enabling large-scale, multi-cloud computing with Dagster | 6 min | Machine Learning |? Fraser Marlow | Dagster.ai Blog

The Empirico team is working on building an innovative platform for automating machine learning experimentation. They are using Dagster to manage and orchestrate complex workflows. A case study where Fraser shares how they have been able to create a powerful platform capable of handling all stages of the machine learning experimentation process that also helped them dramatically reduce the number of bugs in their code, thanks to the tool's advanced testing and debugging features.

No alt text provided for this image


In MORE LINKS you will read about Google launching Non-Incremental Materialized Views For BigQuery.

{ MORE LINKS }



DATA LIBRARY

DeeprETA: An ETA Post-processing System at Scale | 15 min | Data Engineering | Xinyu Hu, Tanmay Binaykiya, Eric Frank, Olcay Cirit | Uber Technologies

How Uber predicts rides ETA? Let’s meet DeeprETA, an accurate and fast travel time prediction system in production. Uber’s team evaluations demonstrate significant improvements over traditional machine learning models. Their findings will benefit researchers and practitioners in similar geospatial-temporal problems. While their hybrid approach is limited to Uber's proprietary routing engine, it is easily adaptable to other routing engines. Read how they continue to refine their model architecture, loss functions, and infrastructure for even greater accuracy improvements.




TUTORIAL

From Kafka to Delta Lake using Apache Spark Structured Streaming | 11 min | Data Streaming | Fabien Pomerol | Michelin Blog

In this one, Fabien shows that configuring a stream consuming Kafka events and appending them in a Delta Lake table with Spark Structured Streaming is quite easy and does not require tons of code. Next, read how Michelin Team set up a pipeline based on a Spark Structured Stream consuming Avro events from an Apache Kafka topic and wrote them to a Delta Lake table.

No alt text provided for this image

In MORE LINKS you will read about introducing AWS Lambda response streaming.

{ MORE LINKS }



TOOLS

Recapit | AI | Recapit

Will AI replace us? Don’t think so, but this one can help you save your time everyday. Recaplt works by using AI technology to summarize news articles from over 60,000 news providers, providing a personalized and easy-to-listen audio news update delivered to your phone every morning based on your selected interests. RecapIt gives you access to a variety of news sources, so you can stay informed with the latest news from around the world.



NEWS

Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM | 8 min | AI | Mike Conover, Matt Hayes, Ankit Mathur, Xiangrui Meng, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia and Reynold Xin | Databricks Blog

A story about how Databricks developed Free Dolly 1.0, a large language model (LLM) trained for less than $30 to exhibit ChatGPT-like human interactivity. Why did they create a new dataset, and how did they do it? Read their journey to create a commercially viable model now.




PODCAST

Serious Public Clouds Invest In Infrastructure With Charles Fitzgerald | 45 min | hosts: Ned Bellavance, Ethan Banks guest: Charles Fitzgerald | Day Two Cloud Podcast

In this episode of Day Two Cloud, you can explore the financial allocation of public clouds and examine what IT and engineering professionals can glean from these spending habits. Additionally, you will learn more about cloud repatriation and its prevalence. Charles Fitzgerald is an expert in Capital Expenditure and authors the Platformonomics blog. He also works as a consultant, strategist, and angel investor.

Discussed subjects:?

  • Why a public cloud’s CapEx matters?
  • How CapEx might indicate a particular strategy or direction?
  • Drawing conclusions from capital spending by the Big 3 Can smaller public clouds compete?




CONFS EVENTS AND MEETUPS

Enabling world changing applications with a modern data architecture | 19h April | 11:30am CET | Webinar

The realm of building and transporting applications is constantly evolving. The arrival of advanced technologies like Kubernetes and the cloud native ecosystem has immensely transformed infrastructure and applications. However, traditional systems lag in agility, calling for modernizing data technologies and methodologies. Failure to do so usually results in the inability to leverage the advantages of this technological revolution.

On the 19th of April, Cockroach Labs and Computacenter will host a comprehensive discussion on the difficulties encountered by contemporary data practices and the means of driving business value through data development.


PaperTalks - The Forward-Forward Algorithm: Some Preliminary Investigations | 27h April | 3pm CET | online meeting

Join the next edition of live meeting with GetInData’s Advanced Analytics Team!? Stay up-to-date with the newest achievements in the world of machine learning, and stay ahead with cutting-edge developments. Don’t miss out on this opportunity to level up your knowledge in data science!


In MORE LINKS you will find "Build Your Own Large Language Model Like Dolly" webinar by Databricks.

{ MORE LINKS }

________________________


Have any interesting content to share in the DATA Pill newsletter?

? Join us on GitHub

? Dig previous editions of DataPill?



Adam from the GetInData | Part of Xebia

Ethan Banks

Packet Pushers Founder & CEO

1 年

That was a solid Day Two Cloud episode. I like Charles' analysis. Thought provoking. Not 100% sure how I felt about his take on GCP, but it definitely got me wondering considering GOOG's track record of handling projects & priorities.

Josue Luzardo Gebrim

Platform and Data Engineer

1 年

Thanks for sharing my post

回复

要查看或添加评论,请登录

社区洞察

其他会员也浏览了