DATA Pill #110 - Optimizing Flink SQL, Let's reproduce GPT-2
Welcome to this week's edition of DATA Pill, where we dive into optimizing Flink SQL and explore the process of reproducing GPT-2.
Enjoy!
ARTICLES
4 Tips for Data Quality Validations with Pytest and PySpark | 11 min | Data Quality | Taylor Wagner, Likitha Lokesh | Slalom Build Blog
A recent data software project required extensive testing on transformed data using AWS Glue. PySpark was used for data transformation, and the test automation framework incorporated Pytest alongside PySpark. This cohesive approach ensured high data quality standards, which are crucial for accurate data analysis and ingestion by third-party tools. This blog shares four key takeaways from the experience of enhancing data quality testing in similar projects.
The new wave of Composable Data Systems and the Interface to LLM agents | 11 min | LLM | Howard Chi | WrenAI Blog
Traditional databases use monolithic designs that handle storage, compute, SQL, and APIs in a single system. Recently, there has been a shift towards open standards, with vendors like Snowflake and Databricks adopting formats like Apache Iceberg. This blog explores the benefits of composable data systems and of integrating large language models into data infrastructure.
Optimizing Flink SQL: Joins, State Management and Efficient Checkpointing | 14 min | Data Processing | Maciej Maciejko | GetInData | Part of Xebia Blog
Maciej shares expert tips on optimizing Apache Flink SQL jobs for better performance and reliability. He covers strategies for efficient joins, state management, and checkpointing, providing practical advice to enhance data processing workflows.
In MORE LINKS you will read about:
TUTORIAL
Supercharging Airflow & dbt with Astronomer Cosmos on Azure Container Instances | 6 min | Data Engineering | Daniel van der Ende | Xebia Blog
Learn how to turn opaqueness into transparency by using Astronomer Cosmos to automatically render a dbt project into an Airflow DAG while running dbt on Azure Container Instances.
In MORE LINKS you will read about:
PODCAST
Making ETL pipelines a thing of the past | 26 min | AI | Cassandra Shum, Ben Popper | The Stack Overflow Podcast
We chatted with Cassandra Shum, VP of Field Engineering at RelationalAI, about her company’s efforts to create what is called the industry’s first coprocessor for data clouds and language models. The goal is to allow companies to keep all their data where it is today while still tapping into the capabilities of the latest generation of AI tools.
DATA TUBE
Let's reproduce GPT-2 | 4 h | AI | Andrej Karpathy | Personal Channel
This video demonstrates the entire process of reproducing GPT-2 (124M) from scratch. It covers building the GPT-2 network, optimizing its training speed, setting up the training run with GPT-2 and GPT-3 hyperparameters, and reviewing the results the following day. Note that this video builds on knowledge from earlier videos in the Zero to Hero playlist. It closely follows the creation of Karpathy's nanoGPT repo, which the resulting code resembles by about 90% at the end.
CONFS EVENTS AND MEETUPS
Azure & AI Lowlands '24 | Utrecht | 7-10th October
Azure & AI Lowlands is a single-day event with five tracks around the Microsoft Azure platform, focusing on cloud engineers, Azure developers, AI engineers, and AI enthusiasts.
_______________________
Have any interesting content to share in the DATA Pill newsletter?
Join us on GitHub
Dig into previous editions of DATA Pill
Adam from GetInData | Part of Xebia