3 Tips for Data Engineers

It is impossible to master all the tech in data engineering. In AWS alone there are over 40 services: S3, Glue, Lambda, CloudFormation, Athena, Redshift, RDS, KMS, Kafka (MSK), Kinesis, EMR, VPC, EC2, and so on. In Azure, same thing. In GCP, same thing. And outside those three there are other major players like Databricks, Snowflake, dbt and Matillion, plus many iPaaS tools like Boomi and MuleSoft.

So my first tip for you, my fellow data engineers, is this: don't try to keep up with them all, because we won't be able to. If we understand one platform end to end, it is not difficult to switch to another platform. For example, I worked with Azure Key Vault. When I had to use KMS (Key Management Service) in AWS or GCP, it didn't take me long to understand them (under an hour), because I had used the same mechanism in Azure.

The same goes for network interfaces and security groups. Because I've used them on one platform, it's quick for me to adapt to another. Same with streaming, i.e. Kafka and Event Hubs. Same with AWS Lambda and Azure Functions. Same with compute nodes and clusters.

Second tip: it's much more effective to learn by doing rather than by reading. There are two reasons for this. One, the tech keeps changing. Two, there is a lot of unwritten stuff. For example, you need to load data from S3 into Databricks. You read a book which suggests using PySpark and a Delta table. But the book was written three years ago, when there was no Auto Loader and when Unity Catalog external locations did not exist. At that time there was no Iceberg either.

The pace of change in data engineering is very quick. All the tech providers keep building new things every day and keep releasing new features every month. A book written three years ago would not pick up the new features released in the last two years. And two years, my friends, is a long time. Just look at what happened in Gen AI in the last two years.

We are not in civil engineering or mechanical engineering, which move at a glacial pace in comparison. We work in data engineering and data science, which move very quickly. The number of academic papers in the US, India and China alone runs to thousands every year, and that's only the science in universities. Imagine the technology created from that science, in thousands of companies.

The second reason is that there is a lot that is not written in books. Imagine a book that explains how to use AWS Lambda for loading data. Well, there are many different use cases. Is it fast-moving data like a stream, or slow-moving like a batch? Are you doing incremental ingestion or not? Do you need to land it in the lake first, or go straight into staging? What security auth do you use? Glue or Step Functions? Athena or Redshift? There are literally hundreds of permutations and there is no way a book can cover every scenario. The best way to learn is by trying out the approach that is appropriate for your particular situation. It is quicker and straight to the point (you don't have to deal with hundreds of scenarios, only one).
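Incremental ingestion is a good one to try out in this spirit. Here is a minimal sketch of the high-water-mark idea in plain Python with toy data (the function and field names are illustrative, not any vendor's API): each run extracts only the rows changed since the last run, then persists the new mark for next time.

```python
from datetime import datetime

# Toy source table: each row carries a last-modified timestamp.
SOURCE = [
    {"id": 1, "updated": datetime(2025, 1, 1)},
    {"id": 2, "updated": datetime(2025, 1, 5)},
    {"id": 3, "updated": datetime(2025, 1, 9)},
]

def incremental_extract(source, high_water_mark):
    """Return the rows changed since the last run, plus the new
    high-water mark to persist for the next run."""
    new_rows = [r for r in source if r["updated"] > high_water_mark]
    new_mark = max((r["updated"] for r in new_rows), default=high_water_mark)
    return new_rows, new_mark

# First run: take everything since the beginning of time.
rows, mark = incremental_extract(SOURCE, datetime.min)

# Second run: nothing new has arrived, so nothing is extracted.
rows2, mark2 = incremental_extract(SOURCE, mark)
```

Whether the mark lives in a control table, a file, or a job parameter is exactly the kind of detail no book decides for you; you find out by building it for your own situation.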

Third tip: just try it out. It's free, it's cloud. It's not like 10 years ago when you had to install things on your own hardware; you don't need any server hardware any more. So if you lack experience in, say, Looker, just head to GCP and try it out (just be very careful with the Looker API, it could cost you thousands of dollars on your credit card). As a guide, be very wary of anything that requires you to enter a credit card number, and walk away from it if you can.

But otherwise just try it out. That's the best way to learn things. All those modern tools are waiting for you to discover them: dbt, Airflow, Glue, Fabric, ADF, Python notebooks, ML models, Qlik, Snowflake and Databricks. If you haven't used them, just try them out. You'll learn a lot by trying them out for a few days.

Make 2025 the year of your learning! Good luck with your learning journey and best wishes for 2025.

Update 2/1/25:

Apologies, I said 3 tips, but this is so fundamental for a data engineer that it is worth mentioning: when building a data platform or pipeline, we need to build it bit by bit. Not in one go, but piece by piece. Start with creating the database first. Then try to ingest one source file or table. Once it's working OK, put it into the Test environment, because what works in Dev might not work in Test. Once you've got it working in Test, put it into the Prod environment. You will learn a lot by doing that, because it requires handover to the prod support team. And that requires security scrutiny and architecture scrutiny. And documentation. Believe me, if you have 100 source files or tables, getting one file through to the prod environment is 70% of the effort. The other 99 files require a lot of hours, but without any technical challenges, because everything has been sorted out by that first file: connectivity, authentication, key management, access, release pipeline, automated testing, monitoring, logging, and many more. I repeat: 70%.
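That first file is where you sort out the plumbing: validation, logging, error handling. A minimal sketch of what "ingest one source file" can look like, in plain Python with a toy CSV (the function name and columns are made up for illustration):

```python
import csv
import io
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def ingest_one_file(file_obj, required_columns):
    """Ingest a single source file: validate the header, load the rows,
    and log the outcome - the plumbing you sort out on file number one."""
    reader = csv.DictReader(file_obj)
    missing = [c for c in required_columns if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = list(reader)
    log.info("loaded %d rows", len(rows))
    return rows

# Get one toy file through end to end first; the other 99 reuse the same path.
sample = io.StringIO("id,name\n1,alice\n2,bob\n")
rows = ingest_one_file(sample, ["id", "name"])
```

Once this pattern has been through Dev, Test and Prod for one file, adding the remaining files is mostly configuration, not engineering.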

Vincent Rainardi

Data Architect & Data Engineer
