3 Tips for Data Engineers

It is impossible to master all the tech in data engineering. In AWS alone there are over 40 services: S3, Glue, Lambda, CloudFormation, Athena, Redshift, RDS, KMS, Kafka (MSK), Kinesis, EMR, VPC, EC2, and so on. In Azure, same thing. In GCP, same thing. And outside those three there are other major players like Databricks, Snowflake, dbt and Matillion, plus many iPaaS tools like Boomi and MuleSoft.

So my first tip for you, my fellow data engineers, is this: don't try to keep up with them all, because we won't be able to. If we understand one platform end to end, it is not difficult to switch to another platform. For example, I worked with Azure Key Vault. When I had to use KMS (Key Management Service) in AWS or GCP, it didn't take me long to understand them (under an hour), because I had used the same mechanism in Azure.

The same goes for network interfaces and security groups. Because I've used them on one platform, it's quick for me to adapt to another. Same with streaming, i.e. Kafka and Event Hubs. Same with AWS Lambda and Azure Functions. Same with compute nodes and clusters.

Second tip: it's much more effective to learn by doing rather than by reading. There are two reasons for this. One, the tech keeps changing. Two, there is a lot of unwritten stuff. For example, you need to load data from S3 into Databricks. You read a book which suggests using PySpark and a Delta table. But the book was written three years ago, when there was no Auto Loader and when Unity Catalog external locations did not exist. At that time there was no Iceberg either.

The pace of change in data engineering is very quick. All the tech providers keep building new things every day and keep releasing new features every month. A book written three years ago would not pick up the new features released in the last two years. And two years, my friends, is a long time. Just look at what happened in Gen AI in the last two years.

We are not in civil engineering or mechanical engineering, which move at a glacial pace in comparison. We work in data engineering and data science, which move very quickly. The number of academic papers in the US, India and China alone runs to thousands every year, and that's only the science in universities. Imagine the technology created from that science, in thousands of companies.

The second reason is that there is a lot that is not written in books. Imagine a book that explains how to use AWS Lambda for loading data. Well, there are many different use cases. Is it fast-moving data like a stream, or slow-moving like a batch? Are you doing incremental ingestion or not? Do you need to land it in the lake first, or go straight into staging? What security auth do you use? Glue or Step Functions? Athena or Redshift? There are literally hundreds of permutations and there is no way a book can cover every scenario. The best way to learn is by trying out the approach that is appropriate for your particular situation. It is quicker and straight to the point (you don't have to deal with hundreds of scenarios, only one).
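Incremental ingestion is a good one to try out in this spirit. Here is a minimal sketch of the high-water-mark idea in plain Python with toy data (the function and field names are illustrative, not any vendor's API): each run extracts only the rows changed since the last run, then persists the new mark for next time.

```python
from datetime import datetime

# Toy source table: each row carries a last-modified timestamp.
SOURCE = [
    {"id": 1, "updated": datetime(2025, 1, 1)},
    {"id": 2, "updated": datetime(2025, 1, 5)},
    {"id": 3, "updated": datetime(2025, 1, 9)},
]

def incremental_extract(source, high_water_mark):
    """Return the rows changed since the last run, plus the new
    high-water mark to persist for the next run."""
    new_rows = [r for r in source if r["updated"] > high_water_mark]
    new_mark = max((r["updated"] for r in new_rows), default=high_water_mark)
    return new_rows, new_mark

# First run: take everything since the beginning of time.
rows, mark = incremental_extract(SOURCE, datetime.min)

# Second run: nothing new has arrived, so nothing is extracted.
rows2, mark2 = incremental_extract(SOURCE, mark)
```

Whether the mark lives in a control table, a file, or a job parameter is exactly the kind of detail no book decides for you; you find out by building it for your own situation.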

Third tip: just try it out. It's free, it's cloud. It's not like 10 years ago when you had to install things on your own hardware; you don't need any server hardware any more. So if you lack experience in, say, Looker, just head to GCP and try it out (just be very careful with the Looker API, it could cost you thousands of dollars on your credit card). As a guide, be very wary of anything that requires you to enter a credit card number, and walk away from it if you can.

But otherwise just try it out. That's the best way to learn things. All those modern tools are waiting for you to discover them: dbt, Airflow, Glue, Fabric, ADF, Python notebooks, ML models, Qlik, Snowflake and Databricks. If you haven't used them, just try them out. You'll learn a lot by trying them out for a few days.

Make 2025 the year of your learning! Good luck with your learning journey and best wishes for 2025.

Update 2/1/25:

Apologies, I said 3 tips, but this is so fundamental for a data engineer that it is worth mentioning: when building a data platform or pipeline, we need to build it bit by bit. Not in one go, but piece by piece. Start with creating the database first. Then try to ingest one source file or table. Once it's working OK, put it into the Test environment, because what works in Dev might not work in Test. Once you've got it working in Test, put it into the Prod environment. You will learn a lot by doing that, because it requires handover to the prod support team. And that requires security scrutiny and architecture scrutiny. And documentation. Believe me, if you have 100 source files or tables, getting one file through to the prod environment is 70% of the effort. The other 99 files require a lot of hours, but without any technical challenges, because everything has been sorted out by that first file: connectivity, authentication, key management, access, release pipeline, automated testing, monitoring, logging, and many more. I repeat: 70%.
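That first file is where you sort out the plumbing: validation, logging, error handling. A minimal sketch of what "ingest one source file" can look like, in plain Python with a toy CSV (the function name and columns are made up for illustration):

```python
import csv
import io
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def ingest_one_file(file_obj, required_columns):
    """Ingest a single source file: validate the header, load the rows,
    and log the outcome - the plumbing you sort out on file number one."""
    reader = csv.DictReader(file_obj)
    missing = [c for c in required_columns if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = list(reader)
    log.info("loaded %d rows", len(rows))
    return rows

# Get one toy file through end to end first; the other 99 reuse the same path.
sample = io.StringIO("id,name\n1,alice\n2,bob\n")
rows = ingest_one_file(sample, ["id", "name"])
```

Once this pattern has been through Dev, Test and Prod for one file, adding the remaining files is mostly configuration, not engineering.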

Vincent Rainardi

Data Architect & Data Engineer
