How to Become a Senior Data Engineer in 2024
This is a long article about the importance of the role, the mindset changes it demands, and its challenges; it is not for people who want to become data engineers in a few days.

The modern world is changing fast. Globalization has dissolved boundaries, the AI/ML buzz is everywhere, digital marketing is driving us crazy, and so is data. Data is the new gold, and every organization, no matter its size or scale, wants to use it as a tool to grow fast and mitigate risks. Market demand for Data Analysts, Data Engineers, Data Scientists, ML Engineers, and AI Engineers is immense. Slowly, every other role in the IT industry is going to merge with these roles, and some roles will become obsolete, making life more complicated for people who work in dedicated roles such as developers, testers, and admins. There is much confusion, and job postings for data roles list tons of skills, expecting someone to be superhuman. After working in IT for over 14 years, I strongly believe each role is important: every organization needs good testers, good developers, and good admins to be successful, because each of them has a different mindset, and it is not easy to expect everything in one person.

Anyway, we can't change the world, so let's focus on Data Engineering.

What is Data Engineering?

IBM definition

Data engineering is the practice of designing and building systems for the aggregation, storage and analysis of data at scale.

Wikipedia definition

Data engineering refers to the building of systems to enable the collection and usage of data.

Informatica definition

Data engineering is the process of discovering, designing and building the data infrastructure to help data owners and data users use and analyze raw data from multiple sources and formats. This allows businesses to use the data to make critical business decisions.

As you can see, there is no standard definition, and that's true: it depends on the company's type, size, business model, domain, architecture, year of founding, and many other factors.

In older companies, such as major banks, you’ll encounter a wide range of data engineering and architecture aspects. However, much of this architecture is legacy, which presents additional challenges. Conversely, modern Software as a Service (SaaS) companies typically have more streamlined models designed for rapid analytics, free from legacy architecture for obvious reasons. The size of an organization significantly impacts its data stack, with larger companies having more specialized roles. However, this is changing, and unfortunately many layoffs are expected soon. In small and medium-sized organizations, a few individuals often handle all aspects of data engineering and, sometimes, the entire data architecture, which makes it even more challenging.

Some businesses, such as those in banking, healthcare, and airlines (highlighted by the significant impact of the recent Microsoft outage), require real-time analytics. In contrast, others, like government agencies handling census analysis and tax filings, can rely on batch analytics, which are not time-sensitive and can be processed after a few hours. However, there is a growing trend towards real-time analytics in IT, driven by the need for rapid action due to factors like security, profitability, and competition. Companies are increasingly leveraging data to gain timely insights and make swift decisions.

Before focusing on the different skills required of Data Engineers, let me make a few things clear. Even though you might be working as a Data Engineer, you are expected to know a lot about many other things: boundaries are blurring, roles are merging, and expectations are huge, which is sometimes problematic.

For example, one job posting mentions no Big Data technologies (since data volumes are low), while another puts a lot of emphasis on them. Some companies have dedicated Power BI or Tableau Analyst roles, so they don't expect a Data Engineer to be very good at building reports; basic knowledge is enough. The majority, however, now expect visualization expertise. Sometimes the role is really that of a Data Warehouse Engineer (no Data Lake or Lakehouse), but the job is posted as Data Engineer.

Top enterprises have multi-cloud models, which increases the complexity of this role. Having worked on almost all the major cloud providers (AWS, Azure, GCP), I can say they offer similar tools and technologies, so the specific cloud matters less than it seems. That said, companies these days often look for experience on a specific cloud, so it's better to get some hands-on exposure to all three if you have time, depending on your aspirational role or dream company.

One piece of advice: during any interview, ask plenty of questions to understand the role and responsibilities, include some data-architecture-level queries, and then decide, based on your experience and knowledge, whether that role suits you.

Skills Required:

Companies often look for a mix of Data Architect, Data Engineer, Data Analyst, and visualization skills these days, and unfortunately it looks like they will add Data Science, AI, and ML skills in a few years, especially for Senior/Lead Data Engineers.

If you don't understand the difference between these roles, please check IBM's article "What Is Data Engineering?"

1) Data Modelling: A good data model is a key element of great data architecture, simple queries, data quality, and analytics, so give data modelling the importance it deserves. I have seen many bad data models in the past that forced me to write convoluted SQL queries full of tricks. Small and medium enterprises generally follow the Kimball approach, while large enterprises follow the Inmon approach. Due to the limitations of data lakes and data warehouses, there is now a shift towards the Lakehouse, which combines the best features of both.

Slowly Changing Dimensions are critical for understanding how historical data is maintained. Modern data warehouses/Lakehouses like Databricks and Snowflake have a time-travel feature, which is quite useful.

Kimball: Star and Snowflake schemas

Inmon

Data Vault & Data Vault 2.0 (a modern approach, especially in Databricks)

NoSQL Modelling

Medallion architecture: Bronze, Silver, and Gold layers
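A Type 2 Slowly Changing Dimension, mentioned above, can be sketched in a few lines of plain Python: instead of overwriting a changed attribute, you close the current row and insert a new version. The table layout, column names, and dates here are illustrative, not from any specific warehouse:

```python
from datetime import date

# Dimension table as a list of rows; validity columns track history.
dim_customer = [
    {"customer_id": 1, "city": "London", "valid_from": date(2020, 1, 1),
     "valid_to": None, "is_current": True},
]

def apply_scd2(dim, customer_id, new_city, change_date):
    """Close the current row for the customer and append the new version."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"] = change_date
            row["is_current"] = False
    dim.append({"customer_id": customer_id, "city": new_city,
                "valid_from": change_date, "valid_to": None,
                "is_current": True})

apply_scd2(dim_customer, 1, "Berlin", date(2024, 6, 1))
current = [r for r in dim_customer if r["is_current"]]
print(current[0]["city"])   # Berlin
print(len(dim_customer))    # 2 -- full history retained
```

The same pattern underlies MERGE-based SCD2 loads in SQL; the point is that history is preserved as rows, not overwritten.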

2) Advanced SQL, Joins and Relational Database principles:

In any analytics database, and even in Spark SQL, you will write complex join queries as a daily routine, unlike with transactional databases. So become an expert in advanced SQL and do lots of hands-on practice.

Inner Join, Left Outer Join, Right Outer Join (use with caution), Full Outer Join

CTEs (Common Table Expressions), for modularity and simplicity. I love them because they let me test one part of a query in isolation when something is wrong.

Window functions: RANK, DENSE_RANK, LAG, LEAD, AVG (rolling averages), etc.

Stored Procedures, T-SQL(SQL Server), Scalar functions

Triggers

Views/ Materialized Views

Aggregate functions like COUNT, SUM, AVG, MAX, MIN

Performance Tuning Techniques

Metadata Checks

All DBMS systems like SQL Server, DB2, and MySQL share similar features, plus a few unique ones. I will write about them in future articles.
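As a small runnable illustration of CTEs and window functions together, here is a sketch against an in-memory SQLite database (the table, columns, and data are made up; SQLite 3.25+ is assumed for window-function support):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EU", 100), ("EU", 300), ("US", 200), ("US", 150)])

query = """
WITH region_totals AS (          -- CTE: easy to test in isolation
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
)
SELECT region,
       total,
       DENSE_RANK() OVER (ORDER BY total DESC) AS rnk
FROM region_totals
ORDER BY rnk
"""
for region, total, rnk in conn.execute(query):
    print(region, total, rnk)
# EU 400 1
# US 350 2
```

If the ranking looks wrong, you can run just the `region_totals` CTE on its own, which is exactly the debugging workflow described above.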


3) Programming Languages: A good understanding of programming languages is very important, as you will write custom scripts to handle complex validation or transformation logic. Trust me, technologies have evolved so much that it is not difficult to learn any programming language if you have a strong will.

Python (expert level): a must

Java: a must for enterprise companies, as REST APIs are still often built with the Spring/Hibernate frameworks

Scala, PySpark, Java (for Big Data)

R (becoming popular)

Python is used everywhere now. There is no need to learn web frameworks like Django; focus on the basics first, then learn advanced concepts like list comprehensions, decorators, and generators.

Learn the pandas and NumPy libraries; they will also help you pick up Spark/PySpark easily, since the core abstraction is similar: a DataFrame, which is just like a table.

If you already know SQL, you will find learning pandas, NumPy, and Polars extremely easy; there is just a little difference in syntax.
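The advanced Python concepts mentioned above can be shown in one small, self-contained sketch; the sample data and function names are illustrative:

```python
import functools
import time

# List comprehension: declarative filter-and-transform, like SQL SELECT/WHERE.
amounts = [120, 45, 300, 80]
large = [a * 1.2 for a in amounts if a >= 100]

# Generator: lazily stream derived values instead of materializing them,
# which matters when inputs are bigger than memory.
def running_total(values):
    total = 0
    for v in values:
        total += v
        yield total

print(list(running_total(amounts)))   # [120, 165, 465, 545]

# Decorator: wrap any function with timing, a common pattern for pipeline steps.
def timed(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        print(f"{fn.__name__} took {time.perf_counter() - start:.4f}s")
        return result
    return wrapper

@timed
def load(rows):
    return len(rows)

load(amounts)
```

Note how the comprehension reads almost like a SQL query over `amounts`, which is why SQL knowledge transfers so directly.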


4) Data Warehousing Concepts:

Data warehouse design has many aspects, and it is crucial to understand the importance of each of them. Data warehouses are mostly built on either the Kimball or the Inmon approach to data modelling.

https://www.geeksforgeeks.org/difference-between-kimball-and-inmon/

Data Integration Challenges

Pipeline design(ETL/ELT)

Historical data use cases

Surrogate keys, Indexing, Partitioning

Staging Importance

Data Marts

Logging, Archiving

Metadata Management

Data Dictionaries

Data Governance, Security, Performance Monitoring

Understanding of Scheduling tools and dependencies

Change Data Capture
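Change Data Capture, the last item above, comes in several flavours. Real CDC tools usually read database transaction logs, but the simplest form, a snapshot diff, conveys the idea; the keys and rows here are invented:

```python
# Compare yesterday's and today's extracts keyed by primary key and
# classify each change as insert, update, or delete.
yesterday = {1: {"name": "Ann", "city": "Oslo"},
             2: {"name": "Bob", "city": "Rome"}}
today     = {1: {"name": "Ann", "city": "Lima"},
             3: {"name": "Cey", "city": "Pune"}}

def diff_snapshots(old, new):
    inserts = [k for k in new if k not in old]
    deletes = [k for k in old if k not in new]
    updates = [k for k in new if k in old and new[k] != old[k]]
    return inserts, updates, deletes

inserts, updates, deletes = diff_snapshots(yesterday, today)
print(inserts, updates, deletes)   # [3] [1] [2]
```

These three change categories are exactly what log-based CDC tools emit as event streams; only the detection mechanism differs.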


5) ETL(Extract, Transform and Load) /ELT(Extract, Load and Transform):

Data pipelines can be designed using either the ETL (Extract, Transform, Load) or the ELT (Extract, Load, Transform) approach. Traditionally, ETL was the predominant method; however, with the advent of cloud computing, organizations now often prefer ELT for simple-to-medium transformation use cases due to its flexibility and scalability. In Databricks, for example, companies prefer ELT when transformations are simple.

Despite this shift, ETL remains widely used, particularly in large enterprises. Typically, you’ll find a combination of both ETL and ELT in organizations, so it’s important to understand the design principles of each.

The most popular ETL tools are:

Azure Data Factory (ADF): very popular due to its easy interface and design

SSIS (SQL Server Integration Services): simple but quite effective, though Microsoft is promoting ADF these days

AWS Glue: the ADF alternative on AWS

Google Cloud Dataflow

Databricks: very popular due to its rich feature set

Informatica PowerCenter

Talend Data Fabric

DataStage (mostly in Enterprises and legacy systems)

Ab Initio (mostly in enterprises and legacy systems)
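Whichever tool you use, the underlying ELT pattern is the same: land the raw data first, then transform it inside the warehouse with SQL. A minimal stdlib-only sketch, using in-memory SQLite as a stand-in warehouse (the data and table names are invented):

```python
import csv
import io
import sqlite3

raw_csv = "order_id,amount\n1,100\n2,250\n3,50\n"

# Extract: read the raw rows as-is (everything is still text).
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Load: dump them untransformed into a staging table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_orders (order_id TEXT, amount TEXT)")
conn.executemany("INSERT INTO staging_orders VALUES (:order_id, :amount)", rows)

# Transform: cast types and filter inside the database, in SQL.
conn.execute("""
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS INTEGER)   AS amount
    FROM staging_orders
    WHERE CAST(amount AS INTEGER) >= 100
""")
print(conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone())
# (2, 350)
```

In an ETL design the casting and filtering would happen before the load instead; the trade-off is where the compute runs.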


6) Big Data:

Big Data knowledge is a must. Any company will use Big Data technologies depending on various factors like the type of data (structured, semi-structured, unstructured), data volumes, machine learning use cases, velocity, etc. Databricks is getting popular, and it truly deserves it, but always remember to choose your toolset based on your company's use cases and future strategy.

Learning Big Data takes time, and it is important to understand the history behind the evolution of Spark, Databricks, and the Lakehouse architecture.

Hadoop

Spark: Spark SQL plus one of its languages (PySpark, Scala, Java)

Hive

HBase

Join Types, RDDs and DataFrames

Data Lake, Data Lakehouse and Performance Monitoring

Use of Notebooks

Partitioning and Bucketing Concepts

Understanding of different file formats and their advantages: Parquet, ORC, CSV, etc.

Data provisioning: data ingestion, consumption, and curation architecture

Databricks: Medallion architecture, Unity Catalog, security, batch and streaming, AI/ML, etc.
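To make the partitioning and bucketing concepts above concrete, here is a small stdlib-only sketch: rows are first grouped by a partition column (think one folder per value), then a high-cardinality key is hashed into a fixed number of buckets so related rows always land together. The column names and bucket count are arbitrary choices for illustration:

```python
import hashlib
from collections import defaultdict

rows = ([{"country": "DE", "user_id": u} for u in range(8)]
        + [{"country": "FR", "user_id": u} for u in range(8, 12)])

NUM_BUCKETS = 4

def bucket_of(user_id, n=NUM_BUCKETS):
    # Stable hash: digest of the key, reduced modulo the bucket count,
    # so the same user_id always maps to the same bucket.
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % n

layout = defaultdict(list)  # (partition value, bucket number) -> rows
for row in rows:
    layout[(row["country"], bucket_of(row["user_id"]))].append(row)

for (country, bucket), part_rows in sorted(layout.items()):
    print(country, bucket, len(part_rows))
```

Partitioning helps queries skip whole directories (`WHERE country = 'DE'`); bucketing helps joins on `user_id` avoid a full shuffle, which is the same reasoning Spark and Hive apply at scale.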


7) Data Visualization :

All Data Engineers should have good experience with data visualization tools like Power BI, Tableau, and Looker, and even Excel basics and Power Pivot. How deep you need to go depends on your company, but these days it is a key skill and should be in every Data Engineer's profile.

I have primarily worked with Power BI, but I also have some experience with Tableau and Looker. Recently, I have developed the skill of creating visually appealing reports, but this takes time. It is crucial to focus on the underlying data and ensure it's of good quality because, regardless of how good your report looks, it must accurately represent the data in the right format. Validating data can often be a challenge, so good knowledge of Excel often helps.

A key decision when working with Power BI is determining whether to handle logic in SQL, the Power Query editor, or DAX expressions.

Major Visual Types and their usage

DAX expressions (learn them slowly: row context, filter context, context switching)

Power Query Editor, M language basics

Star and Snowflake Schema

Measures, Calculated Columns, Tooltips, Bookmarks, Window Functions(Tableau)

DAX Studio for analyzing performance and debugging

Impact analysis, Usage Metrics report, Data security, Access Control, Row level security

All Data Visualization tools are great and have their strengths and limitations.


8) Data Quality:

Data quality plays a very important role in the data world these days. If the quality of data is bad, it will impact the business, limit the potential of Data Scientists and AI/ML Engineers, and might result in reputational damage too. Think of a hypothetical situation where, as a global company, you don't have the right email addresses and phone numbers for some of your customers: you simply can't communicate with them.

Big enterprises monitor data quality using tools like Informatica, Talend, and Ataccama, and have a dedicated Data Quality team to identify such issues, find their root causes, and build a strategy to fix them. Small and medium organizations might not have a whole team, and most of the time the responsibility lies with data engineers. In any situation, an understanding of data quality is a must for playing your role effectively.
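Even without a dedicated tool, basic rule-based checks catch problems like the email/phone example above. A minimal sketch, where the validation patterns and sample records are deliberately simple and illustrative, not production-grade:

```python
import re

# Deliberately simple patterns: real email/phone validation is much stricter.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
PHONE_RE = re.compile(r"^\+?\d{7,15}$")

customers = [
    {"id": 1, "email": "ann@example.com", "phone": "+4915112345678"},
    {"id": 2, "email": "bob@invalid",     "phone": "12345"},
    {"id": 3, "email": None,              "phone": "+14155550100"},
]

def quality_report(rows):
    """Return (record id, issue) pairs for every failed rule."""
    issues = []
    for row in rows:
        if not row["email"] or not EMAIL_RE.match(row["email"]):
            issues.append((row["id"], "bad_email"))
        if not row["phone"] or not PHONE_RE.match(row["phone"]):
            issues.append((row["id"], "bad_phone"))
    return issues

print(quality_report(customers))
# [(2, 'bad_email'), (2, 'bad_phone'), (3, 'bad_email')]
```

Dedicated tools add rule management, profiling, and dashboards on top, but the core idea is the same: codified rules producing an issue log with root-cause hints.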


9) API Architecture and Webhooks :

Understanding the architecture of REST, GraphQL, and SOAP APIs is crucial for designing good solutions and integrating with them. Webhooks are great for event-driven architecture use cases. REST API methods like GET, POST, PUT, DELETE, and PATCH, as well as response codes, are quite important. One obvious task every Data Engineer should master is parsing a complex JSON response.
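A sketch of that JSON-parsing task: flattening a nested, paginated-style response into table-ready rows. The payload shape here is invented for illustration:

```python
import json

payload = """
{
  "meta": {"next_page": null, "count": 2},
  "data": [
    {"id": 1, "user": {"name": "Ann", "tags": ["vip"]}},
    {"id": 2, "user": {"name": "Bob", "tags": []}}
  ]
}
"""

doc = json.loads(payload)

# Flatten the nested records into flat rows ready for loading into a table,
# using .get() with defaults so missing keys do not crash the pipeline.
rows = [
    {
        "id": item["id"],
        "name": item.get("user", {}).get("name"),
        "is_vip": "vip" in item.get("user", {}).get("tags", []),
    }
    for item in doc.get("data", [])
]
print(rows)
# [{'id': 1, 'name': 'Ann', 'is_vip': True}, {'id': 2, 'name': 'Bob', 'is_vip': False}]
```

In a real integration you would also loop while `meta.next_page` is set; the defensive `.get()` style is what keeps ingestion robust against schema drift.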


10) Data Security & Data Governance:

They are very important, and there can be serious risks around data security and data governance, so pay them the utmost attention in this era of cyberattacks. A basic understanding is a must.

Data Governance Tools like Collibra, Purview, Transcend etc

Data governance standards (depends on your domain, e.g. HIPAA, GDPR)

Use of Certificates, API Keys, OAuth, Encryption and Decryption

Data Anonymisation

VPN Concepts

Gateways/Network Security Groups
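As one concrete example of the data anonymisation item above (strictly speaking, pseudonymisation), identifiers can be replaced with a keyed hash so the same input always maps to the same token, joins across tables still work, but the original value cannot be read back. This is a sketch; the key shown is illustrative, and a real system would pull it from a secret store, never from source code:

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me-and-keep-me-out-of-git"  # illustrative only

def pseudonymise(value: str) -> str:
    # Keyed hash (HMAC-SHA256): deterministic for joins, irreversible
    # without the key; truncated here just for readability.
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

email = "ann@example.com"
token1 = pseudonymise(email)
token2 = pseudonymise(email)
print(token1 == token2)   # True: deterministic, so joins still line up
print(email in token1)    # False: the original value is not readable
```

Under GDPR, note that keyed pseudonymisation is reversible by whoever holds the key, so the data is still personal data; true anonymisation requires dropping or generalising the identifier entirely.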


11) CI/CD, Version Control Tools:

Continuous Integration and Continuous Delivery (CI/CD), part of DevOps, is everywhere, so you should understand how to use it properly.

A good understanding of the Version Control tools is required in any role. Some of the popular ones are listed below

GitHub

GitLab

Bitbucket

Important concepts like branch, commit, push, pull, clone, rebase, etc.

12) Other Skills/Tools :

Performance optimization: try to understand how billing works on your cloud platform. One of your responsibilities will be to optimize pipelines, so cost tracking and some DBA knowledge are also important, as companies often do not have a dedicated DBA role these days.

Continuous Learning

Problem Solving and Creativity

Debugging Skills

Types of Testing and TDD/BDD approach

ITSM Basics

Containerization, Kubernetes

Tools like Jira, Confluence, dbt, Postman, VS Code, Visual Studio, PowerShell, Apache Iceberg, and Apache Airflow.

There are no shortcuts. Regardless of how many people sell their crash courses claiming to be experts, you need to do a lot of hands-on work to truly grasp the concepts. Engage in end-to-end projects on any cloud platform and be prepared to invest some money in building your career, beyond just relying on free trials.

The role of a Senior Data Engineer is multifaceted and requires proficiency in various areas. It can often be quite stressful, particularly in smaller companies, since there are fewer people for support and review.

I have realized very few people share real IT industry experience that depicts the true picture. LinkedIn has become like Facebook these days, full of marketing posts with pictures.

I hope you find this article useful. Do your work diligently, share knowledge around you, and leave the rest to the Almighty. Give credit to the colleagues who deserve it, and never think that you know everything.

My apologies for such a long article, but this is my honest attempt to show what exactly is happening in the data world.

Thanks!