DATA Pill #048 - Zero-ETL, Chat GPT and why NOT to use Kubeflow

Adam Kawa

CEO at GetInData, ex-Spotify | Data & AI for banks, telecoms, retail & more.

Published: April 17, 2023

Hi,

From Kafka to Delta Lake using Apache Spark Structured Streaming, AWS Lambda response streaming, and much more.

So much great content was created this week.

Dive into the best of it:

ARTICLES

Having your cake and eating it too: How Vizio built a next-generation data platform to enable BI reporting, real-time streaming, and AI/ML | 6 min | AI/ML | Parveen Jindal, Darren Liu, Alina Smirnova | Personal Blog

Vizio shares their success story of creative problem-solving by utilizing multiple data services and a data warehouse. When they needed to expand their capabilities, they developed a unified platform that consolidates different data platform use cases with linear scaling costs and full observability, setting them up for success with advanced analytics products.

Zero-ETL, ChatGPT, And The Future of Data Engineering | 9 min | Data Engineering | Barr Moses | Towards Data Science Blog

Let's explore the concept of Zero-ETL, which refers to the ability to directly access data from its source without implementing complex ETL (Extract, Transform, Load) processes. Barr outlines how this concept is being utilized in ChatGPT.

Read about predictions that the future of data engineering lies in breaking down data silos and directly accessing data from its source using tools such as Zero-ETL, thereby reducing the time, effort, and costs involved in traditional ETL processes.

Do not use Kubeflow! | 7 min | Machine Learning | Josue Luzardo Gebrim | Personal Blog

It is no secret that not every tool suits everyone. Josue explains why he doesn't like working with Kubeflow and describes 8 alternatives for those who share his opinion of the tool.

Enabling large-scale, multi-cloud computing with Dagster | 6 min | Machine Learning | Fraser Marlow | Dagster.ai Blog

The Empirico team is building an innovative platform for automating machine learning experimentation, using Dagster to manage and orchestrate complex workflows. In this case study, Fraser shares how they created a platform capable of handling all stages of the machine learning experimentation process, and how the tool's testing and debugging features helped them dramatically reduce the number of bugs in their code.

In MORE LINKS you will read about Google launching Non-Incremental Materialized Views For BigQuery.

{ MORE LINKS }

DATA LIBRARY

DeeprETA: An ETA Post-processing System at Scale | 15 min | Data Engineering | Xinyu Hu, Tanmay Binaykiya, Eric Frank, Olcay Cirit | Uber Technologies

How does Uber predict ride ETAs? Meet DeeprETA, an accurate and fast travel time prediction system running in production. The Uber team's evaluations demonstrate significant improvements over traditional machine learning models. Their findings will benefit researchers and practitioners working on similar geospatial-temporal problems. While their hybrid approach is limited to Uber's proprietary routing engine, it is easily adaptable to other routing engines. Read how they continue to refine their model architecture, loss functions, and infrastructure for even greater accuracy improvements.

TUTORIAL

From Kafka to Delta Lake using Apache Spark Structured Streaming | 11 min | Data Streaming | Fabien Pomerol | Michelin Blog

In this one, Fabien shows that configuring a stream that consumes Kafka events and appends them to a Delta Lake table with Spark Structured Streaming is quite easy and does not require tons of code. Read how the Michelin team set up a pipeline based on a Spark Structured Streaming job that consumes Avro events from an Apache Kafka topic and writes them to a Delta Lake table.
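To give a feel for how little code such a pipeline needs, here is a minimal PySpark sketch. It is not the code from Fabien's article: the helper names (`kafka_source_options`, `start_kafka_to_delta`), the broker address, and all paths are hypothetical, and it assumes an existing SparkSession that already has the Kafka and Delta Lake packages available. Avro decoding is deliberately left out, so the payload is stored as raw bytes.

```python
from typing import Any, Dict


def kafka_source_options(bootstrap_servers: str, topic: str) -> Dict[str, str]:
    """Build the options for Spark's built-in Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # Read the topic from the beginning on the first run.
        "startingOffsets": "earliest",
    }


def start_kafka_to_delta(
    spark: Any,
    options: Dict[str, str],
    table_path: str,
    checkpoint_path: str,
) -> Any:
    """Start a streaming query that appends Kafka records to a Delta table.

    `spark` is assumed to be a SparkSession configured with the Kafka
    and Delta Lake packages; that setup is not shown here.
    """
    events = (
        spark.readStream
        .format("kafka")
        .options(**options)
        .load()
        # Keep the key as text and the payload as raw bytes (no Avro decoding).
        .selectExpr("CAST(key AS STRING) AS key", "value")
    )
    return (
        events.writeStream
        .format("delta")
        .outputMode("append")
        # The checkpoint lets the query resume from its last committed offsets.
        .option("checkpointLocation", checkpoint_path)
        .start(table_path)
    )
```

With a suitably configured session, a call such as `start_kafka_to_delta(spark, kafka_source_options("localhost:9092", "events"), "/tmp/delta/events", "/tmp/checkpoints/events")` would start the query; the address, topic, and paths are placeholders.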

In MORE LINKS you will read about introducing AWS Lambda response streaming.

{ MORE LINKS }

TOOLS

Recapit | AI | Recapit

Will AI replace us? We don't think so, but this one can help you save time every day. Recapit uses AI technology to summarize news articles from over 60,000 news providers, delivering a personalized, easy-to-listen audio news update to your phone every morning based on your selected interests. Recapit gives you access to a variety of news sources, so you can stay informed with the latest news from around the world.

NEWS

Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM | 8 min | AI | Mike Conover, Matt Hayes, Ankit Mathur, Xiangrui Meng, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia and Reynold Xin | Databricks Blog

A story about how Databricks developed Dolly, a large language model (LLM) trained for less than $30 to exhibit ChatGPT-like human interactivity. Why did they create a new dataset, and how did they do it? Read about their journey to create a commercially viable model.

PODCAST

Serious Public Clouds Invest In Infrastructure With Charles Fitzgerald | 45 min | hosts: Ned Bellavance, Ethan Banks guest: Charles Fitzgerald | Day Two Cloud Podcast

In this episode of Day Two Cloud, you can explore the financial allocation of public clouds and examine what IT and engineering professionals can glean from these spending habits. Additionally, you will learn more about cloud repatriation and its prevalence. Charles Fitzgerald is an expert in Capital Expenditure and authors the Platformonomics blog. He also works as a consultant, strategist, and angel investor.

Discussed subjects:

Why a public cloud’s CapEx matters
How CapEx might indicate a particular strategy or direction
Drawing conclusions from capital spending by the Big 3
Can smaller public clouds compete?

CONFS EVENTS AND MEETUPS

Enabling world changing applications with a modern data architecture | 19th April | 11:30am CET | Webinar

The realm of building and transporting applications is constantly evolving. The arrival of advanced technologies like Kubernetes and the cloud native ecosystem has immensely transformed infrastructure and applications. However, traditional systems lag in agility, calling for modernizing data technologies and methodologies. Failure to do so usually results in the inability to leverage the advantages of this technological revolution.

On the 19th of April, Cockroach Labs and Computacenter will host a comprehensive discussion on the difficulties encountered by contemporary data practices and the means of driving business value through data development.

PaperTalks - The Forward-Forward Algorithm: Some Preliminary Investigations | 27th April | 3pm CET | online meeting

Join the next edition of the live meeting with GetInData’s Advanced Analytics Team! Stay up-to-date with the newest achievements in the world of machine learning, and stay ahead with cutting-edge developments. Don’t miss out on this opportunity to level up your knowledge in data science!

In MORE LINKS you will find "Build Your Own Large Language Model Like Dolly" webinar by Databricks.

{ MORE LINKS }

________________________

Have any interesting content to share in the DATA Pill newsletter?

Join us on GitHub

Dig into previous editions of DATA Pill

Adam from GetInData | Part of Xebia

Ethan Banks

Packet Pushers Founder & CEO

1 year ago

That was a solid Day Two Cloud episode. I like Charles' analysis. Thought provoking. Not 100% sure how I felt about his take on GCP, but it definitely got me wondering considering GOOG's track record of handling projects & priorities.

Josue Luzardo Gebrim

Platform and Data Engineer

1 year ago

Thanks for sharing my post
