What Skills Do Data Engineers Need? The Data Engineer Skill Pyramid

With an extensive background in data engineering and analytics, I am asked the same questions again and again. Besides the difference between a data engineer and a data scientist, one of the most common is: what skills should I learn as a data engineer?

It’s an excellent question for new or prospective data engineers, given the opportunities available.

The fact of the matter is, companies need data engineers more than ever before. Approximately 2.5 quintillion bytes of data are created every day, and that figure continues to grow at an accelerating pace. By 2025, experts estimate that the world will create 463 exabytes of data each day. At 4.7 GB per single-layer DVD, that works out to roughly 98.5 billion DVDs per day; the often-quoted figure of 212,765,957 DVDs actually corresponds to about one exabyte.

To better utilize data, companies are now realizing they need to hire data engineers to take their data from point A to point B. That way, data scientists and analysts can easily use it, increasing efficiency and productivity. That is why “data engineer” is one of the fastest-growing job titles, according to a 2019 analysis.

To assist you as a new data engineer, I have created a skill pyramid, which can be thought of as a hierarchy of skills. It will help you focus on the skills you should learn first, so you build a solid foundation before you move on to more specific ones. Just remember that you do not need to work through the pyramid in a rigid, strict order. You can overlap the layers, progressing in several at once as you learn. Let’s get started!

Python and SQL

At the base of the pyramid, I recommend learning Structured Query Language (SQL) and some form of coding.

When I say coding, I mean learning the core concepts, such as loops, if statements, functions, and data structures. You need to understand what they are, what they do, and how they operate. Why would you want to use one over the other?
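As a small illustration of these core concepts, the sketch below uses a loop, an if statement, a function, and two data structures. The function name and the sample events are made up for this example; the point is the question of when one structure fits better than another.

```python
def count_events(events):
    """Count how often each event type appears (loop + dictionary)."""
    counts = {}
    for event in events:          # loop over a list
        if event in counts:       # if statement
            counts[event] += 1
        else:
            counts[event] = 1
    return counts


events = ["click", "view", "click", "purchase", "view", "click"]

counts = count_events(events)

# A list keeps order and duplicates; a set keeps only unique values
# and answers "have I seen this before?" quickly.
unique_events = set(events)

print(counts)              # {'click': 3, 'view': 2, 'purchase': 1}
print(len(unique_events))  # 3
```

The dictionary is the natural choice when you need a value per key, the set when you only care whether something is present.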

To become a successful data engineer, you need to be a proficient programmer. Currently, we live in the age of Python, which continues to be a standard entry point. It is well suited to websites, scripting, and data work. SQL is the language of data: you use it to query, model, and manage databases, and it often sits inside automation scripts. Despite its age, it continues to play a pivotal role in managing and processing data.
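To show how the two languages work together, here is a minimal sketch that runs SQL from Python using the standard-library sqlite3 module. The orders table and its rows are invented for illustration, and an in-memory SQLite database stands in for whatever database you would use in practice.

```python
import sqlite3

# In-memory database so the example leaves no files behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0)],
)

# SQL does the grouping and sorting; Python only sends the query
# and collects the result.
rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    """
).fetchall()

print(rows)  # [('alice', 50.0), ('bob', 12.5)]
conn.close()
```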

SQL and Python are among the most common technologies named in job listings. Whether a data engineer is working for Apple or a small startup, they must be an expert in SQL, and Python also remains in high demand.

The best languages and technologies for you will depend on what you aim to specialize in. For example, those who are experts in data processing may be highly proficient in Spark or AWS. However, before you reach that point, you need to learn the basics.


ETL and Data Warehousing

The next level includes ETL (extract, transform, load) and ELT (extract, load, transform), the processes that take data from one point to another, typically using a tool or custom code. The data is extracted, often transformed, and then loaded into a data lake or data warehouse. Understanding how to move data is critical for the next set of skills, which concern data warehouses, data lakes, and sometimes data lakehouses, which are growing in popularity.

  • Data warehouses will help you understand data modeling and why experienced data engineers process data in certain ways. Gaining this insight will allow you to ensure greater consistency, helping companies make more informed decisions.
  • Data lakes are worth understanding for the role they play in companies: they let businesses manage data in a way that is often less expensive and less process-heavy than data warehousing.
  • Data lakehouse is a term that has become popular over the past year. Companies find this option appealing because it combines elements of both data warehouses and data lakes.

You can spend a lot of time learning about the three systems above, as there are many best practices in terms of ETLs, data modeling, etc. Don’t rush through this layer of learning, as it is the “meat and potatoes” of data engineering.
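To make the extract, transform, and load steps concrete, here is a minimal sketch using only the Python standard library. The CSV content, the orders table, and the cleaning rules are assumptions made up for this example, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this might be a file or an API response.
RAW_CSV = """order_id,customer,amount
1,Alice,30.00
2,bob,12.50
3,ALICE,
"""


def extract(text):
    """Extract: read rows from the CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize names, cast types, drop rows with no amount."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cleaned.append(
            (int(row["order_id"]), row["customer"].lower(), float(row["amount"]))
        )
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(conn, transform(extract(RAW_CSV)))

loaded = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)  # [(1, 'alice', 30.0), (2, 'bob', 12.5)]
```

Keeping each step in its own function makes it easier to test the transformation rules without touching a database.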

Ask yourself critical questions, such as:

  • What are these three concepts? Where have they evolved from and where are they going?
  • What is the difference between ETLs and ELTs?
  • What is the goal of this layer from a business perspective?
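One way to explore the ETL versus ELT question is to see the ELT order in code: the raw data is loaded first, untouched, and the transformation then runs inside the database as SQL. This is only a sketch under assumptions; the table names and rows are invented, and an in-memory SQLite database stands in for a cloud data warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a cloud data warehouse

# Load: raw values land untouched, everything stored as text.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", "Alice", "30.00"), ("2", "bob", "12.50"), ("3", "ALICE", "")],
)

# Transform: runs inside the database, after loading.
conn.execute(
    """
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           LOWER(customer)           AS customer,
           CAST(amount AS REAL)      AS amount
    FROM raw_orders
    WHERE amount <> ''
    """
)

raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
clean = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()

print(raw_count)  # 3
print(clean)      # [(1, 'alice', 30.0), (2, 'bob', 12.5)]
```

Because the raw table is kept, the cleaning rules can be changed and re-run later without going back to the source system, which is one reason ELT is attractive when storage and compute in the warehouse are cheap.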

Cloud, DevOps, and Data Visualization

Once you gain more experience, the basics behind this step are fairly straightforward. However, when you are first developing data engineer skills, everything can seem overwhelming — only because there is a lot to learn.

  • Start by understanding the cloud in terms of virtual private clouds (VPCs), serverless computing, cloud data warehouses, etc. If you work for a startup in the future, this knowledge will be valuable.
  • DevOps will help you take code from your own environment into a production environment. Become familiar with Git, a tool used for source code management.
  • For data visualization, pick a tool such as Tableau and learn its best practices as well.

Streaming Data, Distributed Computing, and Specialization

Once you have learned the first three layers and the concepts within them, you can become more specific with your approach. Since you’ll have a background in ETLs and data warehousing, and will be accustomed to working with the cloud, setting up something on AWS Kinesis will come more naturally to you.
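A core idea in streaming work is grouping events into time windows. The sketch below shows a tumbling (fixed-size, non-overlapping) window count in plain Python. It processes an in-memory list rather than a live stream and does not use the Kinesis API; the function name and the sample events are hypothetical.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Group (timestamp, key) events into fixed windows and count per key."""
    windows = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        # Each event belongs to the window that starts at this boundary.
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


# Hypothetical events: (seconds since start, event type).
events = [
    (1, "click"), (4, "view"), (9, "click"),
    (12, "click"), (19, "view"), (21, "click"),
]

result = tumbling_window_counts(events, window_seconds=10)
print(result)
# {0: {'click': 2, 'view': 1}, 10: {'click': 1, 'view': 1}, 20: {'click': 1}}
```

A real streaming system applies the same grouping continuously as events arrive and also has to deal with late or out-of-order data, which this sketch ignores.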

At this stage, you can also dive deeper into distributed processing, as well as the pros and cons of using that kind of system.
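The basic pattern behind distributed processing can be sketched on a single machine: split the data into partitions, let each worker process its own partition, then merge the partial results. In this illustration threads stand in for separate machines, and the function names and sample lines are made up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(partition):
    """Map step: each worker counts words in its own partition."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts


def distributed_word_count(lines, num_partitions=2):
    # Split the data into partitions, one per worker.
    partitions = [lines[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    # Reduce step: merge the partial results into one answer.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


lines = ["data moves fast", "data engineers move data", "fast pipelines"]
totals = distributed_word_count(lines)
print(totals["data"], totals["fast"])  # 3 2
```

The trade-off shows up even here: the work parallelizes cleanly, but the partitioning and merging are extra steps that only pay off once the data is large enough.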

Some data engineers strive to become specialists, working strictly with one ecosystem or tool, such as Microsoft or Azure Data Factory, and the list goes on. Many companies are looking for experts in specific areas, so that is something many new data engineers take into consideration while honing their skills.

The best part of being more knowledgeable is that you have the freedom to choose what you’d like to focus on. Some enjoy building infrastructure components while others prefer building data products.

As a new data engineer, your goal is to help companies better manage their data, and regardless of how big or successful a company is, there will always be data problems. This is great for budding data engineers because it makes strong job security more likely.

In summary, what skills should data engineers have?

  • Be able to build and maintain database systems.
  • Understand and be fluent in programming languages, especially Python and SQL.
  • Know how to find and use warehousing solutions, as well as ETL tools.
  • Have a thorough understanding of cloud technology, data visualization, etc.
  • Familiarize yourself with the most essential programs and build software-specific skills based on your expertise, for example, skills that are specific to Redshift, Azure, Apache, etc.

Unlike data scientists and data analysts, data engineers are more concerned with preparing data than with analyzing and interpreting it. Although many of the skills across all three titles overlap, data engineers focus on ETLs, data warehousing, advanced programming, scripting, data visualization, and pipelining. In-depth knowledge of SQL is imperative.

Once you hone the skills above, you will have the freedom to master the systems, tools, and models that appeal to you most. Whether you’re interested in managing a company’s Big Data infrastructure or are drawn to machine learning, your career can start immediately. Leverage the power of the basic skills discussed above today!

Simon Späti

Data Engineer, Author & Educator | ssp.sh, dedp.online

3 years ago

Thanks so much, Benjamin, for that overview. I like the pyramid a lot! I also wrote about that on Quora a while back, but it still holds IMO. In case of interest: https://qr.ae/TWhz0P. It has 18k views so far, so I guess it's also not totally off ;-)

Iris Huang

Data Engineer | Business Analyst | Cloud Practitioner | Automation Junkie | Digital Transformation

3 years ago

Totally agree with the outlined layers. I think it's worth pointing out that, from the company's or business's perspective, you're very likely to start by working on OLTP more often than OLAP if the company's data infrastructure isn't quite built and data literacy isn't high enough. You will be one of the first few to start figuring out what kind of data keeps the business going (e.g., cost reporting data, hopefully at daily frequency but okay at weekly frequency; customer behavior data that shows you the engagement of customers... the list goes on). To this date, frankly, I've done very little OLAP, half probably because I'm more into the infrastructure/raw layer, but half definitely because the company's data just isn't quite ready for sophisticated OLAP yet. OLAP itself can have a very complex design as well, though, so I'm excited to dive into that uncharted territory.

More articles by Benjamin Rogojan
