DATA Pill #110 - Optimizing Flink SQL, Let's reproduce GPT-2

Welcome to this week's edition of DATA Pill, where we dive into optimizing Flink SQL and explore the process of reproducing GPT-2.

Enjoy!

ARTICLES

4 Tips for Data Quality Validations with Pytest and PySpark | 11 min | Data Quality | Taylor Wagner, Likitha Lokesh | Slalom Build Blog

A recent data software project required extensive testing on transformed data using AWS Glue. PySpark was used for data transformation, and the test automation framework incorporated Pytest alongside PySpark. This cohesive approach ensured high data quality standards, which are crucial for accurate data analysis and ingestion by third-party tools. This blog shares four key takeaways from the experience of enhancing data quality testing in similar projects.
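
To give a flavour of the approach, here is a minimal sketch of a Pytest + PySpark data quality check. It is not taken from the article: the table and column names (order_id, amount) are hypothetical, and a local SparkSession stands in for the project's AWS Glue setup.

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Local session for tests; a real project would mirror the Glue job's config.
    return SparkSession.builder.master("local[2]").appName("dq-tests").getOrCreate()

@pytest.fixture()
def orders(spark):
    # Stand-in for the transformed output under test.
    return spark.createDataFrame([(1, 10.0), (2, 25.5)], ["order_id", "amount"])

def test_no_null_keys(orders):
    # The key column must never be null after transformation.
    assert orders.filter(orders.order_id.isNull()).count() == 0

def test_amounts_are_positive(orders):
    assert orders.filter(orders.amount <= 0).count() == 0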

The new wave of Composable Data Systems and the Interface to LLM agents | 11 min | LLM | Howard Chi | WrenAI Blog

Traditional databases use monolithic designs that bundle storage, compute, SQL, and APIs. Recently there has been a shift towards open standards, with vendors like Snowflake and Databricks adopting formats like Apache Iceberg. This blog explores the benefits of composable data systems and how to integrate large language models into data infrastructure.


Optimizing Flink SQL: Joins, State Management and Efficient Checkpointing | 14 min | Data Processing | Maciej Maciejko | GetInData | Part of Xebia Blog

Maciej shares expert tips on optimizing Apache Flink SQL jobs for better performance and reliability. He covers strategies for efficient joins, state management, and checkpointing, providing practical advice to enhance data processing workflows.
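
To make that concrete, below is a small PyFlink sketch of the kinds of knobs the post discusses: state TTL, mini-batching and the checkpoint interval. The values are illustrative only and should be tuned per job; the config keys are standard Flink options.

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
conf = t_env.get_config()

# Expire idle join/aggregation state so it does not grow without bound.
conf.set("table.exec.state.ttl", "1 h")

# Buffer records briefly to cut per-record state accesses.
conf.set("table.exec.mini-batch.enabled", "true")
conf.set("table.exec.mini-batch.allow-latency", "5 s")
conf.set("table.exec.mini-batch.size", "5000")

# Checkpoint often enough for fast recovery, rarely enough to limit overhead.
conf.set("execution.checkpointing.interval", "1 min")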

In MORE LINKS you will read about:

  • Five Levels Of AI Agents
  • How does LinkedIn process 4 Trillion Events every day?

{ MORE LINKS }

TUTORIAL

Supercharging Airflow & dbt with Astronomer Cosmos on Azure Container Instances | 6 min | Data Engineering | Daniel van der Ende | Xebia Blog

Learn how to turn opaqueness into transparency by using Astronomer Cosmos to automatically render a dbt project into an Airflow DAG while running dbt on Azure Container Instances.
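
For orientation, here is a minimal, hypothetical sketch of what rendering a dbt project as an Airflow DAG with Cosmos can look like. Paths, profile names and the schedule are placeholders, and the Azure Container Instances execution backend covered in the article needs additional configuration not shown here.

from datetime import datetime
from cosmos import DbtDag, ProjectConfig, ProfileConfig

profile_config = ProfileConfig(
    profile_name="my_profile",   # hypothetical dbt profile
    target_name="dev",
    profiles_yml_filepath="/usr/local/airflow/dbt/profiles.yml",
)

dbt_dag = DbtDag(
    dag_id="my_dbt_project",
    project_config=ProjectConfig("/usr/local/airflow/dbt/my_project"),
    profile_config=profile_config,
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
)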

In MORE LINKS you will read about:

  • Text-to-SQL Using SingleStore Helios, Groq, and Llama 3
  • How to Turn a REST API Into a Data Stream with Kafka and Flink

{ MORE LINKS }

PODCAST

Making ETL pipelines a thing of the past | 26 min | AI | Cassandra Shum, Ben Popper | The Stack Overflow Podcast

We chatted with Cassandra Shum, VP of Field Engineering at RelationalAI, about her company’s efforts to create what is called the industry’s first coprocessor for data clouds and language models. The goal is to allow companies to keep all their data where it is today while still tapping into the capabilities of the latest generation of AI tools.

DATA TUBE

Let's reproduce GPT-2 | 4 h | AI | Andrej Karpathy | Personal Channel

This video demonstrates the entire process of reproducing GPT-2 (124M) from scratch. It covers building the GPT-2 network, optimizing its training for speed, setting up the training run with GPT-2 and GPT-3 hyperparameters, and reviewing the results the following day. Note that the video builds on knowledge from Andrej's earlier Zero to Hero playlist videos, and it closely follows the creation of his nanoGPT repo, ending up about 90% similar to it.
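
If you want a taste before committing four hours, here is a short PyTorch sketch (not Karpathy's code) of the causal self-attention block at the core of the model, with the GPT-2 (124M) hyperparameters: 12 layers, 12 heads, 768-dimensional embeddings and a 1024-token context.

import torch
import torch.nn as nn
import torch.nn.functional as F

N_LAYER, N_HEAD, N_EMBD, BLOCK_SIZE = 12, 12, 768, 1024  # GPT-2 124M sizes

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd: int = N_EMBD, n_head: int = N_HEAD):
        super().__init__()
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # fused q, k, v projection
        self.c_proj = nn.Linear(n_embd, n_embd)      # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        # Reshape to (batch, heads, tokens, head_dim).
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # Causal attention; uses a fused (flash) kernel when available.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)

x = torch.randn(2, 16, N_EMBD)           # quick shape check on random data
print(CausalSelfAttention()(x).shape)    # torch.Size([2, 16, 768])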

CONFS, EVENTS AND MEETUPS

Azure & AI Lowlands '24 | Utrecht | 7-10th October

Azure & AI Lowlands is a single-day event with five tracks around the Microsoft Azure Platform, aimed at cloud engineers, Azure developers, AI engineers and AI enthusiasts.

_______________________

Have any interesting content to share in the DATA Pill newsletter?

Join us on GitHub

Dig into previous editions of DATA Pill

Adam from GetInData | Part of Xebia
