DATA Pill #048 - Zero-ETL, ChatGPT and why NOT to use Kubeflow
Hi,
From Kafka to Delta Lake using Apache Spark Structured Streaming, AWS Lambda response streaming, and much more.
So much great content was published this week.
Dive into the best of it:
ARTICLES
Having your cake and eating it too: How Vizio built a next-generation data platform to enable BI reporting, real-time streaming, and AI/ML | 6 min | AI/ML | Parveen Jindal, Darren Liu, Alina Smirnova | Personal Blog
Vizio shares their success story of creative problem-solving by utilizing multiple data services and a data warehouse. When they needed to expand their capabilities, they developed a unified platform that consolidates different data platform use cases with linear scaling costs and full observability, setting them up for success with advanced analytics products.
Zero-ETL, ChatGPT, And The Future of Data Engineering | 9 min | Data Engineering | Barr Moses | Towards Data Science Blog
Let's explore the concept of Zero-ETL, which refers to the ability to directly access data from its source without implementing complex ETL (Extract, Transform, Load) processes. Barr outlines how this concept is being utilized in ChatGPT.
Read about predictions that the future of data engineering lies in breaking down data silos and directly accessing data from its source using tools such as Zero-ETL, thereby reducing the time, effort, and costs involved in traditional ETL processes.
Do not use Kubeflow! | 7 min | Machine Learning | Josue Luzardo Gebrim | Personal Blog
It is not rocket science that not all tools are for everyone. Josue explains why he doesn't like to work with Kubeflow and describes 8 great alternatives for those who have similar opinions about this tool.
Enabling large-scale, multi-cloud computing with Dagster | 6 min | Machine Learning | Fraser Marlow | Dagster.ai Blog
The Empirico team is building an innovative platform for automating machine learning experimentation, and they use Dagster to manage and orchestrate complex workflows. In this case study, Fraser shares how they created a powerful platform capable of handling all stages of the machine learning experimentation process, and how the tool's advanced testing and debugging features helped them dramatically reduce the number of bugs in their code.
In MORE LINKS you will read about Google launching Non-Incremental Materialized Views For BigQuery.
DATA LIBRARY
DeeprETA: An ETA Post-processing System at Scale | 15 min | Data Engineering | Xinyu Hu, Tanmay Binaykiya, Eric Frank, Olcay Cirit | Uber Technologies
How does Uber predict ride ETAs? Meet DeeprETA, an accurate and fast travel time prediction system running in production. The Uber team’s evaluations demonstrate significant improvements over traditional machine learning models, and their findings will benefit researchers and practitioners working on similar geospatial-temporal problems. While their hybrid approach is limited to Uber's proprietary routing engine, it is easily adaptable to other routing engines. Read how they continue to refine their model architecture, loss functions, and infrastructure for even greater accuracy improvements.
TUTORIAL
From Kafka to Delta Lake using Apache Spark Structured Streaming | 11 min | Data Streaming | Fabien Pomerol | Michelin Blog
In this one, Fabien shows that configuring a stream that consumes Kafka events and appends them to a Delta Lake table with Spark Structured Streaming is quite easy and does not require tons of code. Read how the Michelin team set up a pipeline based on a Spark Structured Streaming job that consumes Avro events from an Apache Kafka topic and writes them to a Delta Lake table.
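To give a feel for how little code such a stream needs, here is a minimal PySpark sketch. It is not the code from Fabien's article: the function names, broker address, topic, and paths are hypothetical, and it assumes a SparkSession that already has the Kafka source and Delta Lake connectors available. Avro decoding, which the Michelin pipeline performs, is left out, so the Kafka value column stays as raw bytes.

```python
def kafka_source_options(bootstrap_servers, topic):
    """Build the reader options for Spark's built-in Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # Read the topic from the beginning the first time the query starts.
        "startingOffsets": "earliest",
    }


def start_kafka_to_delta(spark, bootstrap_servers, topic, table_path, checkpoint_path):
    """Start a streaming query that appends Kafka records to a Delta table.

    `spark` is an existing SparkSession with the Kafka and Delta Lake
    connectors available (an assumption of this sketch).
    """
    events = (
        spark.readStream
        .format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    # Kafka delivers key and value as binary columns; Avro decoding is omitted here.
    return (
        events.writeStream
        .format("delta")
        .outputMode("append")
        # The checkpoint lets the query resume from where it stopped after a restart.
        .option("checkpointLocation", checkpoint_path)
        .start(table_path)
    )
```

With a suitably configured session, a call such as `start_kafka_to_delta(spark, "broker:9092", "events", "/tmp/delta/events", "/tmp/checkpoints/events")` would return a running streaming query.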
In MORE LINKS you will read about the introduction of AWS Lambda response streaming.
TOOLS
Recapit | AI | Recapit
Will AI replace us? We don’t think so, but this one can help you save time every day. Recapit uses AI to summarize news articles from over 60,000 news providers and delivers a personalized, easy-to-listen audio news update to your phone every morning, based on your selected interests. Recapit gives you access to a variety of news sources, so you can stay informed about the latest news from around the world.
NEWS
Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM | 8 min | AI | Mike Conover, Matt Hayes, Ankit Mathur, Xiangrui Meng, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia and Reynold Xin | Databricks Blog
A story about how Databricks developed Free Dolly 1.0, a large language model (LLM) trained for less than $30 to exhibit ChatGPT-like human interactivity. Why did they create a new dataset, and how did they do it? Read their journey to create a commercially viable model now.
PODCAST
Serious Public Clouds Invest In Infrastructure With Charles Fitzgerald | 45 min | hosts: Ned Bellavance, Ethan Banks | guest: Charles Fitzgerald | Day Two Cloud Podcast
In this episode of Day Two Cloud, you can explore the financial allocation of public clouds and examine what IT and engineering professionals can glean from these spending habits. Additionally, you will learn more about cloud repatriation and its prevalence. Charles Fitzgerald is an expert in Capital Expenditure and authors the Platformonomics blog. He also works as a consultant, strategist, and angel investor.
CONFS EVENTS AND MEETUPS
Enabling world changing applications with a modern data architecture | 19th April | 11:30am CET | Webinar
The realm of building and transporting applications is constantly evolving. The arrival of advanced technologies like Kubernetes and the cloud native ecosystem has immensely transformed infrastructure and applications. However, traditional systems lag in agility, calling for modernizing data technologies and methodologies. Failure to do so usually results in the inability to leverage the advantages of this technological revolution.
On the 19th of April, Cockroach Labs and Computacenter will host a comprehensive discussion on the difficulties encountered by contemporary data practices and the means of driving business value through data development.
PaperTalks - The Forward-Forward Algorithm: Some Preliminary Investigations | 27th April | 3pm CET | online meeting
Join the next edition of the live meeting with GetInData’s Advanced Analytics Team! Stay up-to-date with the newest achievements in the world of machine learning, and stay ahead with cutting-edge developments. Don’t miss out on this opportunity to level up your knowledge in data science!
In MORE LINKS you will find "Build Your Own Large Language Model Like Dolly" webinar by Databricks.
________________________
Have any interesting content to share in the DATA Pill newsletter?
Join us on GitHub
Dig into previous editions of DATA Pill
Adam from GetInData | Part of Xebia