What Skills Do Data Engineers Need? The Data Engineer Skill Pyramid

Benjamin Rogojan

Fractional Head Of Data | Reach Out For Data Infra And Strategy Consults

Published: April 15, 2021

With an extensive background in data engineering and analytics, I am asked the same questions again and again. Besides wanting to know the difference between a data engineer and a data scientist, people most often ask: what skills should I learn as a data engineer?

It's an excellent question for new or prospective data engineers, given the opportunities available.

The fact of the matter is, companies need data engineers more than ever before. Approximately 2.5 quintillion bytes of data are created every day, a figure that continues to grow at an accelerating rate. By 2025, experts estimate that the world will create 463 exabytes of data each day. That figure is often quoted as the equivalent of 212,765,957 DVDs per day, although at 4.7 GB per single-layer DVD it works out to roughly 98.5 billion discs.

To better utilize data, companies are now realizing they need to hire data engineers to take their data from point A to point B. That way, data scientists and analysts can easily use it, increasing efficiency and productivity. That is why “data engineer” is one of the fastest-growing job titles, according to a 2019 analysis.

To assist you as a new data engineer, I have created a skill set pyramid, which can be thought of as a hierarchy of skill set needs. This will help you focus on the skills you should learn first, allowing you to build a solid foundation as you move on to more specific skills. Just remember, you do not need to work through the pyramid in a rigid, strict order. You can layer the steps, making progress on several at once as you learn. Let's get started!

Python and SQL

At the base of the pyramid, I recommend learning Structured Query Language (SQL) and some form of coding.

When I say coding, I mean learning the core concepts, such as loops, if statements, functions, and data structures. You need to understand what they are, what they do, and how they operate. Why would you want to use one over the other?
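As a minimal sketch of these core concepts working together, the snippet below counts events with a loop, an if statement, a function, and a dictionary, then compares the result with the standard library's `Counter`. The event names are made-up sample data, and the function name is purely illustrative.

```python
from collections import Counter


def count_events_with_loop(events):
    """Count events per type using a plain loop, an if statement, and a dict."""
    counts = {}
    for event in events:
        if event in counts:
            counts[event] += 1
        else:
            counts[event] = 1
    return counts


# Hypothetical sample data, for illustration only.
events = ["click", "view", "click", "purchase", "view", "click"]

loop_counts = count_events_with_loop(events)
counter_counts = Counter(events)  # the same idea, provided by the standard library

# A list keeps order and duplicates; a set keeps only distinct values and
# answers "is this value present?" without scanning every element.
distinct_events = set(events)

print(loop_counts)
print(dict(counter_counts) == loop_counts)
print("purchase" in distinct_events)
```

Choosing between the list, the dictionary, and the set here is exactly the "why use one over the other" question: each structure makes a different operation cheap.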

To become a successful data engineer, you need to be a proficient programmer. We currently live in the age of Python, which continues to be a standard entry point. It is well suited to websites, scripting, and data work. SQL is the language of data and is used in automation, scripting, and database modeling. Despite its age, it continues to play a pivotal role in managing and processing data.
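To show how the two languages tend to be used together, here is a small sketch that runs a SQL aggregation from Python using the built-in `sqlite3` module. The `orders` table and its rows are hypothetical, and an in-memory SQLite database stands in for whatever database you would actually use.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, for illustration only.
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 15.5), ("alice", 30.0), ("carol", 12.0)],
)

# SQL does the set-based work: group, count, sum, and sort.
rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS order_count, SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer
    ORDER BY total_amount DESC
    """
).fetchall()

# Python handles what happens around the query.
for customer, order_count, total_amount in rows:
    print(customer, order_count, total_amount)

conn.close()
```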

SQL and Python are the two technologies most commonly named in job listings. Whether a data engineer works for Apple or a small startup, they must be an expert in SQL, and Python remains in high demand.

The best languages and technologies for you will depend on what you aim to specialize in. For example, those who are experts in data processing may be highly proficient in Spark or AWS. However, before you reach that point, you need to learn the basics.

ETL and Data Warehousing

The next level includes ETL (extract, transform, load) and ELT (extract, load, transform), the processes that take data from one point to another, typically using a tool or custom code. The data is extracted, often transformed, and then loaded into a data lake or data warehouse. Understanding how to move data is critical for the next set of skills, which involve data warehouses, data lakes, and sometimes data lakehouses, which are growing in popularity.
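The extract, transform, and load steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV text, the `users` table, and the cleaning rules are all invented, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """user_id,signup_date,country
1,2021-04-01,us
2,2021-04-02,DE
3,,us
"""


def extract(raw_text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: drop rows without a signup date and normalize country codes."""
    cleaned = []
    for row in rows:
        if not row["signup_date"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["user_id"]), row["signup_date"], row["country"].upper())
        )
    return cleaned


def load(conn, records):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(user_id INTEGER PRIMARY KEY, signup_date TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))

loaded = conn.execute(
    "SELECT user_id, signup_date, country FROM users ORDER BY user_id"
).fetchall()
print(loaded)
```

Keeping each step in its own function makes it easier to test the cleaning rules without touching a database.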

Data warehouses will help you understand data modeling and why experienced data engineers process data in certain ways. Gaining this insight will allow you to ensure greater consistency, helping companies make more informed decisions.
Data lakes are worth understanding for the role they play in companies: they allow businesses to manage data in a way that is often less expensive and less processing-heavy than data warehousing.
Data lakehouse is a term that has become popular over the past year. Companies find this option appealing because it combines elements of both data warehouses and data lakes.

You can spend a lot of time learning about the three systems above, as there are many best practices in terms of ETLs, data modeling, etc. Don’t rush through this layer of learning, as it is the “meat and potatoes” of data engineering.
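Data modeling is easier to picture with a tiny example. The sketch below uses a star schema, one common warehouse modeling approach (chosen here as an assumption, since the article does not name a specific technique): a central fact table of measurements surrounded by dimension tables that describe them. All table names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two dimension tables describe the "who" and "when"; the fact table
# stores the measurements and points at the dimensions by key.
conn.executescript(
    """
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        customer_name TEXT,
        country TEXT
    );
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,
        calendar_date TEXT,
        month TEXT
    );
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        amount REAL
    );
    INSERT INTO dim_customer VALUES (1, 'Acme', 'US'), (2, 'Globex', 'DE');
    INSERT INTO dim_date VALUES
        (20210401, '2021-04-01', '2021-04'),
        (20210501, '2021-05-01', '2021-05');
    INSERT INTO fact_sales (customer_key, date_key, amount) VALUES
        (1, 20210401, 100.0), (2, 20210401, 40.0), (1, 20210501, 60.0);
    """
)

# Analysts can now slice the same facts by any dimension attribute.
sales_by_country_month = conn.execute(
    """
    SELECT c.country, d.month, SUM(f.amount) AS total_amount
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_key = f.customer_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY c.country, d.month
    ORDER BY c.country, d.month
    """
).fetchall()
print(sales_by_country_month)
```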

Ask yourself critical questions, such as:

What are these three concepts? Where have they evolved from and where are they going?
What is the difference between ETLs and ELTs?
What is the goal of this layer from a business perspective?
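On the ETL versus ELT question, the difference is mainly where and when the transformation happens. In ETL, data is cleaned before it reaches the target system; in ELT, raw data is loaded first and transformed afterwards inside the target system, usually with SQL. The following is a rough ELT sketch using in-memory SQLite as a stand-in for a warehouse, with hypothetical `raw_users` and `users` tables.

```python
import sqlite3

# Hypothetical raw records, exactly as they arrived: all text, one incomplete.
raw_rows = [
    ("1", "2021-04-01", "us"),
    ("2", "2021-04-02", "DE"),
    ("3", "", "us"),
]

conn = sqlite3.connect(":memory:")

# Load: the raw data lands untouched, every column stored as text.
conn.execute("CREATE TABLE raw_users (user_id TEXT, signup_date TEXT, country TEXT)")
conn.executemany("INSERT INTO raw_users VALUES (?, ?, ?)", raw_rows)

# Transform: done afterwards, inside the database, with SQL.
conn.execute(
    """
    CREATE TABLE users AS
    SELECT CAST(user_id AS INTEGER) AS user_id,
           signup_date,
           UPPER(country) AS country
    FROM raw_users
    WHERE signup_date <> ''
    """
)

raw_count = conn.execute("SELECT COUNT(*) FROM raw_users").fetchone()[0]
clean_rows = conn.execute(
    "SELECT user_id, signup_date, country FROM users ORDER BY user_id"
).fetchall()
print(raw_count, clean_rows)
```

Because the raw table is kept, the transformation can be changed and rerun later without extracting the data again, which is one reason ELT is popular with cloud data warehouses.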

Cloud, DevOps, and Data Visualization

Once you gain more experience, the basics behind this step are fairly straightforward. However, when you are first developing data engineering skills, everything can seem overwhelming, simply because there is a lot to learn.

Start by understanding the cloud in terms of VPCs, serverless computing, cloud data warehouses, etc. If you work for a startup in the future, this knowledge will be valuable.
DevOps will help you take code from your environment into a production environment. Become familiar with Git, a tool used for source code management.
For data visualization, pick a tool such as Tableau and learn its best practices as well.

Streaming Data, Distributed Computing, and Specialization

Once you have learned about the first three layers and the concepts within them, you can become more specific with your approach. Since you'll have a background in ETLs and data warehousing, and will be accustomed to working with the cloud, setting up something on AWS Kinesis will come more naturally to you.
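Streaming differs from batch work mainly in that records are handled as they arrive rather than all at once. The sketch below is plain Python and does not use the Kinesis API; it only illustrates one common streaming idea, counting events in fixed, non-overlapping time windows (often called tumbling windows). The event tuples and the function name are hypothetical.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count (timestamp, key) events per fixed, non-overlapping time window."""
    counts = defaultdict(int)
    for timestamp, key in events:  # events are consumed one at a time
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical event stream: (seconds since start, event type).
events = [
    (0, "click"),
    (12, "click"),
    (59, "view"),
    (60, "click"),
    (61, "view"),
    (125, "click"),
]

# iter() mimics a stream that can only be read once, in order.
window_counts = tumbling_window_counts(iter(events), window_seconds=60)
print(window_counts)
```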

At this stage, you can also dive deeper into distributed processing, as well as the pros and cons of using that kind of system.
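The core idea behind distributed processing can be sketched on a single machine: split the data into partitions, let each worker process its own partition independently, then merge the partial results. In this illustration, threads stand in for cluster nodes, and the sample lines and function names are made up; a real system such as Spark adds scheduling, fault tolerance, and data movement over the network.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_partitions):
    """Split records into roughly equal chunks, one per worker."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(lines):
    """Map step: a worker counts words in its own chunk only."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge(partials):
    """Reduce step: combine the partial counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical input, for illustration only.
lines = [
    "data moves from a to b",
    "data lands in a warehouse",
    "a lake stores raw data",
]

with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(count_words, partition(lines, 3)))

word_counts = merge(partials)
print(word_counts.most_common(2))
```

The upside is that work scales out across workers; the cost is the extra partitioning and merging, which is why small datasets are often faster to process on one machine.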

Some data engineers strive to become specialists, working strictly with Microsoft technologies, Azure Data Factory, or another specific stack. Many companies are looking for experts in specific areas, so that is something many new data engineers take into consideration while honing their skills.

The best part of being more knowledgeable is that you have the freedom to choose what you’d like to focus on. Some enjoy building infrastructure components while others prefer building data products.

As a new data engineer, your goal is to help companies better manage their data, and regardless of how big or successful a company is, there will always be data problems. This is great for budding data engineers because it makes strong job security more likely.

In summary, what skills should data engineers have?

You should be able to build and maintain database systems.
You should understand and be fluent in programming languages, especially Python and SQL.
You should know how to find and use warehousing solutions, as well as ETL tools.
You should have a thorough understanding of cloud technology, data visualization, and related areas.
You should also familiarize yourself with the most essential programs, building software-specific skills based on your expertise, for example skills specific to Redshift, Azure, Apache, etc.

Unlike data scientists and data analysts, data engineers are more concerned with preparing data than with analyzing and interpreting it. Although many of the skills across all three titles overlap, data engineers focus on ETLs, data warehousing, advanced programming, scripting, data visualization, and pipelining. In-depth knowledge of SQL is imperative.

Once you hone the skills above, you will have the freedom to master the systems, tools, and models that appeal to you most. Whether you’re interested in managing a company’s Big Data infrastructure or are drawn to machine learning, your career can start immediately. Leverage the power of the basic skills discussed above today!

Simon Späti

Data Engineer, Author & Educator | ssp.sh, dedp.online

3 years ago

Thanks so much, Benjamin, for that overview. I like the pyramid a lot! I also wrote about that on Quora a while back, but it still holds IMO. In case of interest: https://qr.ae/TWhz0P. It has 18k views so far, so I guess it's also not totally off ;-)

Iris Huang

Data Engineer | Business Analyst | Cloud Practitioner | Automation Junkie | Digital Transformation

3 years ago

Totally agree with the outlined layers. I think it's worth pointing out that, from the company/business's perspective, you're very likely to start out working on OLTP more often than OLAP if the data infrastructure in the company isn't quite built and the data literacy isn't high enough. You will be one of the first few to start figuring out what kind of data keeps the business going (e.g., cost reporting data, hopefully at daily frequency but okay at weekly frequency; customer behavior data that shows you the engagement of customers...the list goes on). To this date, frankly, I've done very little OLAP, half probably because I'm more into the infrastructure/raw layer but half definitely because the company's data just isn't quite ready for sophisticated OLAP yet. OLAP itself can have very complex design as well though, so I'm excited to dive into that uncharted territory.


