DATA Pill #110 - Optimizing Flink SQL, Let's reproduce GPT-2

Adam Kawa

CEO at GetInData, ex-Spotify | Data & AI for banks, telecoms, retail & more.

Published: June 24, 2024

Welcome to this week's edition of DATA Pill, where we dive into optimizing Flink SQL and explore the process of reproducing GPT-2.

Enjoy!

ARTICLES

4 Tips for Data Quality Validations with Pytest and PySpark | 11 min | Data Quality | Taylor Wagner, Likitha Lokesh | Slalom Build Blog

A recent data software project required extensive testing on transformed data using AWS Glue. PySpark was used for data transformation, and the test automation framework incorporated Pytest alongside PySpark. This cohesive approach ensured high data quality standards, which are crucial for accurate data analysis and ingestion by third-party tools. This blog shares four key takeaways from the experience of enhancing data quality testing in similar projects.

The new wave of Composable Data Systems and the Interface to LLM agents | 11 min | LLM | Howard Chi | WrenAI Blog

Traditional databases use monolithic designs optimized for storage, computing, SQL, and API. There's recently been a shift towards open standards, with vendors like Snowflake and Databricks adopting formats like Apache Iceberg. This blog explores the benefits of composable data systems and integrating large language models into data infrastructure.

Maciej shares expert tips on optimizing Apache Flink SQL jobs for better performance and reliability. He covers strategies for efficient joins, state management, and checkpointing, providing practical advice to enhance data processing workflows.

In MORE LINKS you will read about:

Five Levels Of AI Agents
How does LinkedIn process 4 Trillion Events every day?

{ MORE LINKS }

TUTORIAL

Supercharging Airflow & dbt with Astronomer Cosmos on Azure Container Instances | 6 min | Data Engineering | Daniel van der Ende | Xebia Blog

Learn how to turn opaqueness into transparency by using Astronomer Cosmos to automatically render a dbt project into an Airflow DAG while running dbt on Azure Container Instances.

In MORE LINKS you will read about:

Text-to-SQL Using SingleStore Helios, Groq, and Llama 3
How to Turn a REST API Into a Data Stream with Kafka and Flink

{ MORE LINKS }

PODCAST

Making ETL pipelines a thing of the past | 26 min | AI | Cassandra Shum, Ben Popper | The Stack Overflow Podcast

We chatted with Cassandra Shum, VP of Field Engineering at RelationalAI, about her company’s efforts to create what is called the industry’s first coprocessor for data clouds and language models. The goal is to allow companies to keep all their data where it is today while still tapping into the capabilities of the latest generation of AI tools.

DATA TUBE

Let's reproduce GPT-2 | 4 h | AI | Andrej Karpathy | Personal Channel

This video demonstrates the entire process of reproducing GPT-2 (124M) from scratch. It covers building the GPT-2 network, optimizing its training speed, setting up the training run with GPT-2 and GPT-3 hyperparameters, and reviewing the results the following day. Note that this video builds on knowledge from earlier videos in the Zero to Hero playlist. The process closely resembles the creation of Karpathy's nanoGPT repo; by the end, the code is about 90% similar.

CONFS EVENTS AND MEETUPS

Azure & AI Lowlands '24 | Utrecht | 7-10th October

Azure & AI Lowlands is a single-day event with five tracks around the Microsoft Azure Platform. It is aimed at cloud engineers, Azure developers, AI engineers, and AI enthusiasts.

_______________________

Have any interesting content to share in the DATA Pill newsletter?

Join us on GitHub

Dig into previous editions of DATA Pill

Adam from GetInData | Part of Xebia
