DATA Pill #003: Apache Airflow at Scale, One-stop MLOps portal and more

Adam Kawa

CEO at GetInData, ex-Spotify | Data & AI for banks, telecoms, retail & more.

Published: June 22, 2022

Hi everyone,

Let’s start the third leg of our DATA marathon.

ARTICLES

Lessons Learned From Running Apache Airflow at Scale | 10 min read | Apache Airflow | Megan Parker | Shopify Blog

Challenges in running Airflow at scale + concrete solutions

A combination of GCS and NFS allows for file management that is both performant and easy to use.
Metadata retention policies can reduce degradation of Airflow performance.
A centralized metadata repository can be used to track DAG origins and ownership.
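
To make the retention point concrete, here is a minimal sketch of what a metadata retention policy boils down to: pick a cutoff per table and select everything older than it for deletion. It is plain Python over in-memory rows, not Airflow's own cleanup tooling and not the method from the article; the `RETENTION_DAYS` values, the `created_at` field and the `expired_rows` helper are all hypothetical names chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention windows per metadata table, in days (illustrative values only).
RETENTION_DAYS = {"task_instance": 28, "dag_run": 28, "log": 7}


def expired_rows(rows, table, now):
    """Return the rows of `table` that are older than its retention window."""
    cutoff = now - timedelta(days=RETENTION_DAYS[table])
    return [row for row in rows if row["created_at"] < cutoff]


# Two fake metadata rows: one 40 days old, one 3 days old.
now = datetime(2022, 6, 22, tzinfo=timezone.utc)
rows = [
    {"id": 1, "created_at": now - timedelta(days=40)},
    {"id": 2, "created_at": now - timedelta(days=3)},
]

stale = expired_rows(rows, "task_instance", now)
print([row["id"] for row in stale])
```

In a real deployment the selection would run as a scheduled job against the metadata database, and the windows would be tuned to how far back the team actually needs to look.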

One-stop MLOps portal at LinkedIn | 10 min read | MLOps | LinkedIn Blog

To visualize the entire ML lifecycle, an infrastructure is needed to automatically track every step of the machine learning process. We created a data schema to capture the complete, structured, and well-documented information detailing how machine learning models are produced.

Monitoring Large-Scale Apache Flink Applications, Part 1: Concepts & Continuous Monitoring | 12 min read | Apache Flink | Nico Kruber | Ververica Blog

This post introduces various useful metrics that, set up with proper alerts, can inform you about imminent failures and let you monitor cluster and application health and checkpointing progress. It also covers different ways to track latency and observe your application’s throughput for performance monitoring.

Real-time ingestion to Iceberg with Kafka Connect - Apache Iceberg Sink | 11 min read | Apache Iceberg Sink | Grzegorz Liter | GetInData Blog

GetInData created an Apache Iceberg sink that can be deployed on a Kafka Connect instance. The data format consumed by Apache Iceberg has to represent table-like data and its schema, so we used the format created by Debezium for change data capture.
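
For readers who have not seen a Debezium change event, the sketch below shows why that format suits table-like data: each message carries a `schema` describing the row and a `payload` with the row state `before` and `after` the change plus an `op` code. The event is heavily abbreviated (a real one carries more schema detail and source metadata), and the `row_change` helper is a hypothetical illustration of how a consumer might interpret it, not the sink's actual implementation.

```python
def row_change(event):
    """Map a Debezium-style change event to an (action, row) pair."""
    payload = event["payload"]
    op = payload["op"]  # "c" = create, "u" = update, "d" = delete, "r" = snapshot read
    if op == "d":
        # A delete has no "after" state, so the removed row is taken from "before".
        return "delete", payload["before"]
    return "upsert", payload["after"]


# Abbreviated example event for a made-up two-column table.
event = {
    "schema": {
        "type": "struct",
        "fields": [
            {"field": "before", "type": "struct", "optional": True},
            {"field": "after", "type": "struct", "optional": True},
            {"field": "op", "type": "string", "optional": False},
        ],
    },
    "payload": {
        "before": None,
        "after": {"id": 7, "name": "Ada"},
        "op": "c",
        "ts_ms": 1655856000000,
    },
}

print(row_change(event))
```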

{ MORE LINKS }

____________________

PODCAST

Dataflow Automation | 47 min | The Data Exchange

Jeremiah Lowin, CEO of Prefect, on designing tools that allow teams to build, run, and monitor data pipelines at scale. The episode also discusses the data engineering challenges facing data and ML teams today and the implications of looming trends in machine learning and AI.

{ MORE LINKS }

____________________

DATAtube

Things I Wish I Knew When I Started As A Data Engineer | 15 min | Seattle Data Guy

Lessons and advice after 10 years in data. Don't try to learn all technologies at once - it’s gonna get you nowhere.

{ MORE LINKS }

If you have any feedback, please leave a comment below.

I want this newsletter to reach our tech community and serve its needs.

See you tomorrow!

Adam Kawa from GetInData
