The role of data engineering has evolved significantly over the years, adapting to ever-changing technologies and business needs. I started in the world of data engineering as a Data Warehouse/ETL Developer, which is now just one part of what the role of a Data Engineer entails. While reflecting on my own journey and listening to Seattle Data Guy (Benjamin Rogojan), I was inspired to write a few words on this with some research and reflection.
The Early Days - Evolution of Data Engineering as a Role
Initially, data engineering was closely tied to database administration and data warehousing.
Data engineers primarily focused on:
Designing and maintaining data storage systems
Ensuring data quality and consistency
Creating ETL (Extract, Transform, Load) processes
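For context, an "ETL process" in those days usually meant a hand-written job along the lines of the minimal sketch below (file and table names are hypothetical, not from any real system): extract from a source, apply a few cleansing rules, and load into a warehouse table.

```python
# A minimal, illustrative ETL sketch (hypothetical file and table names),
# not a production pipeline: extract from a source, transform in memory, load to a target.
import csv
import sqlite3

def extract(csv_path):
    # Extract: read raw rows from a CSV export of a source system
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: basic cleansing and type casting, with a simple quality check
    cleaned = []
    for row in rows:
        if not row.get("order_id"):
            continue  # drop rows that fail the quality check
        cleaned.append((int(row["order_id"]), row["customer"].strip().upper(), float(row["amount"])))
    return cleaned

def load(rows, db_path="warehouse.db"):
    # Load: write the conformed rows into a warehouse-style fact table
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS fact_orders (order_id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))
```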
Data engineering as a formal job title is relatively new, less than 15 years old (regardless of what you might see as a requirement in the job spec). While the actual work has been around for decades, the role only gained prominence in the last decade, largely due to the rise of big data technologies like Hadoop and the specialisation of teams.
Emergence of Big Data
Early Years (2011-2013) – Rise of Hadoop and Data Science
- Teradata, Vertica and Oracle (including Exadata) were, and in many places still are, the most popular choices.
- Hadoop became popular for managing large datasets on cheap hardware, inspired by Google's papers on the Google File System and MapReduce.
- Companies sought to emulate big tech like Google and adopted Hadoop through solutions like Cloudera and Hortonworks.
- Data scientists were initially expected to manage Hadoop and perform data processing, which often pushed them into data engineering tasks.
Specialisation of Roles (2014-2015) – Emergence of Data Engineer Title
- By 2014-2015, the term "data engineer" began to emerge as a distinct title, allowing data scientists to focus more on machine learning and analytics while engineers handled data processing. Even so, Data Warehouse Developer and ETL Developer remained the most common job titles until around 2018.
With the rise of big data, the role expanded to include:
- Handling large-scale distributed systems (e.g., the Hadoop ecosystem)
- Implementing batch and stream processing frameworks
- Developing data pipelines for real-time analytics (Spark, Flume, etc.)
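To make the last two points concrete, here is a minimal sketch of a streaming aggregation with Spark Structured Streaming. It uses the built-in "rate" source so it runs without external systems; a real pipeline would read from Kafka, files or a message queue instead.

```python
# Illustrative Structured Streaming sketch: the built-in "rate" source stands in
# for a real event feed, and events are counted in 10-second windows.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

# A toy unbounded source that emits rows with a timestamp and a value
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Aggregate events into 10-second windows, the core pattern behind real-time analytics
windowed = events.groupBy(window("timestamp", "10 seconds")).agg(count("*").alias("events"))

# Continuously print updated counts; a production job would write to a sink instead
query = windowed.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```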
Then comes the "Cloud Era"
Hadoop’s Decline (2015-2016) – Shift to More Manageable Solutions
- Hadoop’s complexity and the high cost of maintaining it led to its decline in favour of easier-to-use and more cost-effective cloud solutions.
- The need for faster data processing and fewer engineers drove the shift toward technologies like Hive and Presto, which allowed data engineers to work with SQL rather than writing MapReduce jobs.
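To illustrate the shift: an aggregation that once meant a hand-written MapReduce job becomes a single SQL statement run against the cluster. Hive and Presto pioneered this; the same idea is sketched below with Spark SQL to keep the example in Python (paths, table and column names are hypothetical).

```python
# Illustrative only: a SQL aggregation over distributed data, the kind of work
# that previously meant writing a MapReduce job by hand.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-over-mapreduce").getOrCreate()

# Register a dataset (e.g., files on HDFS or object storage) as a table
events = spark.read.parquet("hdfs:///data/events")  # hypothetical path
events.createOrReplaceTempView("events")

daily_counts = spark.sql("""
    SELECT event_date, COUNT(*) AS event_count
    FROM events
    GROUP BY event_date
""")
daily_counts.show()
```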
Public Release of Airflow (2015)
- The release of Airflow revolutionised data pipeline management by offering a more standardised approach, reducing the need for custom solutions.
- While it’s now seen as an industry-standard tool, many companies still maintain custom-built alternatives.
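What that standardisation looks like in practice: a pipeline is declared as a DAG of tasks in ordinary Python, and Airflow handles scheduling, retries and dependencies. A minimal sketch, assuming Airflow 2.x (task names and logic are hypothetical placeholders):

```python
# A minimal, illustrative Airflow DAG; task logic is a placeholder.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")

def load():
    print("write data to the warehouse")

with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # newer Airflow versions prefer the `schedule` argument
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # declare the dependency: extract runs before load
```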
As cloud computing became prevalent, data engineers adapted to:
- Managing cloud-based data infrastructure
- Implementing serverless architectures
- Utilising managed services for data processing and storage
Cloud Data Warehouses (2017-2018) – Adoption of Redshift and BigQuery
- Redshift and BigQuery grew in popularity as cloud data warehouses, although both had limitations and quirks that frustrated users.
- Redshift’s inability to scale easily and BigQuery’s non-standard SQL were significant pain points.
- The era marked a shift towards cloud-based solutions for managing large-scale data processing.
Rise of Snowflake and Databricks (2019-2020)
- Snowflake and Databricks became dominant players, capturing market share with their different approaches: Snowflake offering a traditional data warehouse experience, and Databricks appealing more to data science teams with a notebook-driven, Python-first approach.
- Both platforms grew significantly in market presence, contributing to the "modern data stack".
Modern Data Engineering / Modern Data Stack - 2020 to present
Today, data engineering encompasses a broader range of responsibilities:
- Data Architecture: Designing scalable and efficient data systems
- Data Integration: Connecting various data sources and ensuring seamless data flow
- Data Governance: Implementing data quality, security, and compliance measures
- Machine Learning Operations (MLOps): Supporting ML model deployment and monitoring. It is not common for data engineering and MLOps to sit in one team at large companies, but combining them is an emerging trend, helped by a few interchangeable skills.
- Data Democratization: Creating self-service data platforms for non-technical users
The Modern Data Stack (2020-Present)
- The modern data stack, fuelled by venture capital and startups, grew as companies sought to emulate tech giants like Google and Facebook.
- Tools were developed to enable data analytics for smaller companies, although enterprises largely stuck with established solutions like Informatica or Azure Data Factory.
- Databricks, Snowflake, BigQuery, Azure Synapse/Fabric and Redshift seem to cover the majority of implementations across the world, depending on the size of the organisation.
- Storage continues to become cheaper
- Decoupled storage and compute, sometimes with a multi-platform approach.
- A war for open file and table formats (Parquet, Iceberg, Delta Lake, Hudi); see the sketch after this list.
- GenAI and SQL-generation features supporting rapid development.
- Data Management is still a challenge.
- Data Mesh and Data products
- Serverless products
- No DBAs; mostly self-serve databases and data ecosystems
- Rise of platform teams
- The modern data stack is driven by specialised, often fragmented tools, though the term itself suggests a continuous cycle of innovation.
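As hinted above, here is a rough sketch of what decoupled storage and compute plus open file formats look like in practice (bucket and path names are hypothetical, and the S3 connector is assumed to be configured): the data lives as Parquet files in object storage, and any engine that reads the format can be attached to it.

```python
# Illustrative sketch of decoupled storage and compute: data sits in object storage
# as an open file format (Parquet here), and compute engines are attached as needed.
# Bucket and path names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("decoupled-storage-compute").getOrCreate()

# Read Parquet files straight from object storage; no database server owns the data
orders = spark.read.parquet("s3a://example-data-lake/raw/orders/")

# Transform with one engine...
summary = orders.groupBy("region").sum("amount")

# ...and write back to the same shared storage, where another engine
# (a warehouse, Trino, DuckDB, etc.) can query the same files
summary.write.mode("overwrite").parquet("s3a://example-data-lake/curated/orders_by_region/")
```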
Where are we going from here?