The role of data engineering has evolved significantly over the years, adapting to ever-changing technologies and business needs. I started in the world of data engineering as a Data Warehouse/ETL Developer, which is now just one part of what the role of a Data Engineer entails. While reflecting on my own journey and listening to Seattle Data Guy (Benjamin Rogojan), I was inspired to write a few words on this with some research and reflection.
The Early Days - Evolution of Data Engineering as a Role
Initially, data engineering was closely tied to database administration and data warehousing.
Data engineers primarily focused on:
Designing and maintaining data storage systems
Ensuring data quality and consistency
Creating ETL (Extract, Transform, Load) processes
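For context, an "ETL process" in those days usually meant a hand-written job along the lines of the minimal sketch below (file and table names are hypothetical, not from any real system): extract from a source, apply a few cleansing rules, and load into a warehouse table.

```python
# A minimal, illustrative ETL sketch (hypothetical file and table names),
# not a production pipeline: extract from a source, transform in memory, load to a target.
import csv
import sqlite3

def extract(csv_path):
    # Extract: read raw rows from a CSV export of a source system
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: basic cleansing and type casting, with a simple quality check
    cleaned = []
    for row in rows:
        if not row.get("order_id"):
            continue  # drop rows that fail the quality check
        cleaned.append((int(row["order_id"]), row["customer"].strip().upper(), float(row["amount"])))
    return cleaned

def load(rows, db_path="warehouse.db"):
    # Load: write the conformed rows into a warehouse-style fact table
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS fact_orders (order_id INTEGER, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(transform(extract("orders.csv")))
```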
Data engineering as a formal job title is relatively new, less than 15 years old (regardless of what you might see as a requirement in the job spec). While the actual work has been around for decades, the role only gained prominence in the last decade, largely due to the rise of big data technologies like Hadoop and the specialisation of teams.
Emergence of Big Data
Early Years (2011-2013) – Rise of Hadoop and Data Science
- Teradata, Vertica and Oracle (including Exadata) were, and in many places still are, the most popular choices.
- Hadoop became popular for managing large datasets on cheap hardware, inspired by Google's papers on the Google File System and MapReduce.
- Companies sought to emulate big tech like Google and adopted Hadoop through solutions like Cloudera and Hortonworks.
- Data scientists were initially expected to manage Hadoop and perform data processing, which often pushed them into data engineering tasks.
Specialisation of Roles (2014-2015) – Emergence of Data Engineer Title
- By 2014-2015, the term "data engineer" began to emerge as a distinct title, allowing data scientists to focus more on machine learning and analytics while engineers handled data processing. Even so, Data Warehouse Developer and ETL Developer remained the most common job titles until around 2018.
With the rise of big data, the role expanded to include:
- Handling large-scale distributed systems (e.g., the Hadoop ecosystem)
- Implementing batch and stream processing frameworks
- Developing data pipelines for real-time analytics (Spark, Flume, etc.)
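To make the last two points concrete, here is a minimal sketch of a streaming aggregation with Spark Structured Streaming. It uses the built-in "rate" source so it runs without external systems; a real pipeline would read from Kafka, files or a message queue instead.

```python
# Illustrative Structured Streaming sketch: the built-in "rate" source stands in
# for a real event feed, and events are counted in 10-second windows.
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

# A toy unbounded source that emits rows with a timestamp and a value
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Aggregate events into 10-second windows, the core pattern behind real-time analytics
windowed = events.groupBy(window("timestamp", "10 seconds")).agg(count("*").alias("events"))

# Continuously print updated counts; a production job would write to a sink instead
query = windowed.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```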
Then comes the "Cloud Era"
Hadoop’s Decline (2015-2016) – Shift to More Manageable Solutions
- Hadoop’s complexity and the high cost of maintaining it led to its decline in favour of easier-to-use and more cost-effective cloud solutions.
- The need for faster data processing and fewer engineers drove the shift toward technologies like Hive and Presto, which allowed data engineers to work with SQL rather than writing MapReduce jobs.
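To illustrate the shift: an aggregation that once meant a hand-written MapReduce job becomes a single SQL statement run against the cluster. Hive and Presto pioneered this; the same idea is sketched below with Spark SQL to keep the example in Python (paths, table and column names are hypothetical).

```python
# Illustrative only: a SQL aggregation over distributed data, the kind of work
# that previously meant writing a MapReduce job by hand.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-over-mapreduce").getOrCreate()

# Register a dataset (e.g., files on HDFS or object storage) as a table
events = spark.read.parquet("hdfs:///data/events")  # hypothetical path
events.createOrReplaceTempView("events")

daily_counts = spark.sql("""
    SELECT event_date, COUNT(*) AS event_count
    FROM events
    GROUP BY event_date
""")
daily_counts.show()
```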
Public Release of Airflow (2015)
- The release of Airflow revolutionised data pipeline management by offering a more standardised approach, reducing the need for custom solutions.
- While it’s now seen as an industry-standard tool, many companies still maintain custom-built alternatives.
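What that standardisation looks like in practice: a pipeline is declared as a DAG of tasks in ordinary Python, and Airflow handles scheduling, retries and dependencies. A minimal sketch, assuming Airflow 2.x (task names and logic are hypothetical placeholders):

```python
# A minimal, illustrative Airflow DAG; task logic is a placeholder.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")

def load():
    print("write data to the warehouse")

with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # newer Airflow versions prefer the `schedule` argument
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # declare the dependency: extract runs before load
```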
As cloud computing became prevalent, data engineers adapted to:
- Managing cloud-based data infrastructure
- Implementing serverless architectures
- Utilising managed services for data processing and storage
Cloud Data Warehouses (2017-2018) – Adoption of Redshift and BigQuery
- Redshift and BigQuery grew in popularity as cloud data warehouses, although both had limitations and quirks that frustrated users.
- Redshift’s inability to scale easily and BigQuery’s non-standard SQL were significant pain points.
- The era marked a shift towards cloud-based solutions for managing large-scale data processing.
Rise of Snowflake and Databricks (2019-2020)
- Snowflake and Databricks became dominant players, capturing market share with their different approaches: Snowflake offering a traditional data warehouse experience, and Databricks appealing more to data science teams with a notebook-driven, Python-first approach.
- Both platforms grew significantly in market presence, contributing to the "modern data stack".
Modern Data Engineering / Modern Data Stack - 2020 to present
Today, data engineering encompasses a broader range of responsibilities:
- Data Architecture: Designing scalable and efficient data systems
- Data Integration: Connecting various data sources and ensuring seamless data flow
- Data Governance: Implementing data quality, security, and compliance measures
- Machine Learning Operations (MLOps): Supporting ML model deployment and monitoring. It is not common for data engineering and MLOps to sit in one team at large companies, but combining them is an emerging trend, helped by a few interchangeable skills.
- Data Democratization: Creating self-service data platforms for non-technical users
The Modern Data Stack (2020-Present)
- The modern data stack, fuelled by venture capital and startups, grew as companies sought to emulate tech giants like Google and Facebook.
- Tools were developed to enable data analytics for smaller companies, although enterprises largely stuck with established solutions like Informatica or Azure Data Factory.
- Databricks, Snowflake, BigQuery, Azure Synapse/Fabric and Redshift seem to cover the majority of implementations across the world, depending on the size of the organisation.
- Storage continues to become cheaper
- Decoupled storage and compute, sometimes with a multi-platform approach.
- A war for open file and table formats (Parquet, Iceberg, Delta Lake, Hudi); see the sketch after this list.
- GenAI and SQL-generation features supporting rapid development.
- Data Management is still a challenge.
- Data Mesh and Data products
- Serverless products
- No DBAs; mostly self-serve databases and data ecosystems
- Rise of platform teams
- The modern data stack is driven by specialised, often fragmented tools, though the term itself suggests a continuous cycle of innovation.
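As hinted above, here is a rough sketch of what decoupled storage and compute plus open file formats look like in practice (bucket and path names are hypothetical, and the S3 connector is assumed to be configured): the data lives as Parquet files in object storage, and any engine that reads the format can be attached to it.

```python
# Illustrative sketch of decoupled storage and compute: data sits in object storage
# as an open file format (Parquet here), and compute engines are attached as needed.
# Bucket and path names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("decoupled-storage-compute").getOrCreate()

# Read Parquet files straight from object storage; no database server owns the data
orders = spark.read.parquet("s3a://example-data-lake/raw/orders/")

# Transform with one engine...
summary = orders.groupBy("region").sum("amount")

# ...and write back to the same shared storage, where another engine
# (a warehouse, Trino, DuckDB, etc.) can query the same files
summary.write.mode("overwrite").parquet("s3a://example-data-lake/curated/orders_by_region/")
```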
Where are we going from here?