Modern Data Engineering Core Building Blocks: Abstraction, Automation, and Orchestration

In the ever-evolving landscape of data engineering, the essential building blocks of abstraction, automation, and orchestration are transforming how organisations manage and derive value from their data. These foundational elements play a pivotal role in enhancing operational efficiency, scalability, and agility across the data lifecycle. Implementing a modern data ecosystem along architectural concepts such as Data Mesh (a distributed, decentralised, domain-based approach), Data Fabric (a centralised, interconnected approach), or a hybrid of the two relies heavily on these three core building blocks. Let's dive into them.

Abstraction, the art of simplifying complexity, proves critical across the data lifecycle. Consider the ingestion phase, where diverse data sources contribute to the data pool. Abstraction layers harmonise the integration process, abstracting away the intricate details of varied data formats. For instance, tools like Apache NiFi provide a visual interface that abstracts data ingestion complexities, facilitating seamless integration of disparate sources into a unified platform. Ab Initio, Informatica, Talend, Dataprep (GCP), and Glue (AWS) are similar tools that simplify complex, heavy-lifting ingestion and integration jobs for developers. Data virtualisation leverages abstraction to offer a unified view of disparate data sources. With tools like Denodo, organisations can abstract underlying data structures, providing a simplified, virtual layer for data consumers. This allows analysts to access and analyse data without being encumbered by the nuances of different source systems, promoting flexibility and ease of use.
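
As a concrete illustration, here is a minimal sketch of the idea behind such an abstraction layer, independent of any particular product: consumers request a dataset by logical name, and the layer hides whether it physically lives in CSV, JSON, or a database table. The source registry, dataset names, and paths are all hypothetical.

```python
import csv
import json
import sqlite3
from pathlib import Path

# Hypothetical source registry: logical dataset name -> physical details.
# In a real platform this metadata would live in a catalogue, not in code.
SOURCES = {
    "orders":    {"kind": "csv",    "path": "data/orders.csv"},
    "customers": {"kind": "json",   "path": "data/customers.json"},
    "payments":  {"kind": "sqlite", "path": "data/app.db", "table": "payments"},
}

def read_dataset(name: str) -> list[dict]:
    """Return rows as dicts, hiding the source format from the caller."""
    src = SOURCES[name]
    if src["kind"] == "csv":
        with open(src["path"], newline="") as f:
            return list(csv.DictReader(f))
    if src["kind"] == "json":
        return json.loads(Path(src["path"]).read_text())
    if src["kind"] == "sqlite":
        conn = sqlite3.connect(src["path"])
        conn.row_factory = sqlite3.Row
        rows = conn.execute(f"SELECT * FROM {src['table']}").fetchall()
        return [dict(r) for r in rows]
    raise ValueError(f"Unsupported source kind: {src['kind']}")

# Consumers work with logical names; the physical format is abstracted away.
orders = read_dataset("orders")
```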

Subsequently, in the data transformation phase of modern data engineering, abstraction plays a pivotal role. Tools like Apache Spark employ abstraction via high-level APIs, including DataFrame and Spark SQL, enabling data engineers to concentrate on transformations without grappling with the complexities of the distributed processing beneath. Rapid delivery can be achieved by building a framework driven by declarative configuration files such as XML, YAML, or JSON, adept at handling diverse developer requirements.
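
To illustrate, below is a minimal sketch of a config-driven transformation step, assuming a PySpark environment with PyYAML available; the YAML schema (source, transform, target) and the storage paths are invented for illustration rather than taken from any specific framework.

```python
# A minimal sketch of a declarative, config-driven transformation using
# PySpark's DataFrame API. The config keys and paths are placeholders.
import yaml
from pyspark.sql import SparkSession

config = yaml.safe_load("""
source:
  path: s3://my-bucket/raw/orders/       # placeholder location
  format: json
transform:
  filter: "status = 'COMPLETED'"
  select: [order_id, customer_id, amount]
target:
  path: s3://my-bucket/curated/orders/   # placeholder location
  format: parquet
""")

spark = SparkSession.builder.appName("config-driven-etl").getOrCreate()

# The framework interprets the declarative config; developers only edit YAML.
df = spark.read.format(config["source"]["format"]).load(config["source"]["path"])
df = df.filter(config["transform"]["filter"]).select(*config["transform"]["select"])
df.write.mode("overwrite").format(config["target"]["format"]).save(config["target"]["path"])
```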

Within the storage layer, a range of products abstracts the underlying constructs by accommodating various data types (structured, semi-structured, and unstructured) and file formats such as Parquet, Avro, ORC, and key-value stores, among others. This suite of products spans structured databases (RDBMS such as Oracle, Cloud SQL, and Teradata), semi-structured/NoSQL databases (MongoDB, Cassandra, Bigtable, HBase), and object storage solutions for unstructured data such as audio, video, and image files (S3, GCS, NetApp), offering comprehensive storage abstraction capabilities.
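
A small sketch of this format abstraction, again assuming PySpark: the same DataFrame write and read API is used regardless of whether the data lands as Parquet or ORC. The paths and sample data are placeholders.

```python
# The same DataFrame API covers different on-disk formats.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-abstraction").getOrCreate()
events = spark.createDataFrame(
    [("e1", "click", 3), ("e2", "view", 7)],
    ["event_id", "event_type", "duration_sec"],
)

# Identical write calls, different physical layouts on disk.
events.write.mode("overwrite").parquet("/tmp/events_parquet")
events.write.mode("overwrite").orc("/tmp/events_orc")

# Readers do not need to care which format was chosen.
parquet_back = spark.read.parquet("/tmp/events_parquet")
orc_back = spark.read.orc("/tmp/events_orc")
```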

The data analysis and exploration phase leverages tools like Colab and Jupyter notebooks for interactive data analysis, followed by ML development and visualisation in BI tools such as Tableau, Power BI, ThoughtSpot, and Looker; these abstract away the underlying complexities, simplifying the work of developers and analysts.
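
For example, a typical exploratory notebook cell might look like the sketch below, using pandas; the file name and column names are purely illustrative.

```python
# Illustrative notebook cell: quick profiling of a dataset before deeper modelling.
import pandas as pd

sales = pd.read_csv("sales_sample.csv", parse_dates=["order_date"])  # placeholder file
print(sales.describe(include="all"))             # summary statistics
print(sales.isna().mean().sort_values())         # missing-value ratio per column
monthly = sales.groupby(sales["order_date"].dt.to_period("M"))["amount"].sum()
monthly.plot(kind="bar", title="Monthly sales")  # renders inline in Jupyter/Colab
```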

In the realm of modern data engineering and machine learning (ML) lifecycle ecosystems, orchestration refers to the coordination and management of the complex workflows, tasks, and processes involved in data processing and ML model development. It plays a pivotal role in streamlining operations, ensuring efficiency, and facilitating collaboration among developers, data scientists, and other stakeholders.

Key Aspects of Orchestration:

Seamless coordination of various tasks and processes such as data extraction, transformation, loading (ETL), model training, evaluation, and deployment (e.g. Talend, Ab Initio, Spark, TensorFlow Extended, Kubeflow, SageMaker, NiFi, Luigi, Vertex AI).

Job scheduling tools, from AutoSys (centralised and rigid) to Airflow (flexible, distributed, open source), automate the execution of tasks and workflows, reducing manual intervention. Scheduled workflows ensure timely execution of jobs, optimising resource utilisation and meeting business requirements while ensuring dependency management (a minimal Airflow sketch follows these key aspects).

Orchestration facilitates scalability by managing the parallel execution of tasks. This is particularly beneficial in handling large datasets or when running multiple ML experiments concurrently.

Orchestration tools provide robust error handling mechanisms, allowing for graceful recovery from failures. Detailed logging assists in troubleshooting and monitoring the execution of workflows.

Orchestration fosters collaboration between different teams involved in the data engineering and ML lifecycle. Developers, data scientists, and operations teams can work cohesively within a unified orchestration framework.
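
Putting these aspects together, below is a minimal sketch of an orchestrated workflow, assuming a recent Apache Airflow 2.x installation: a daily schedule automates execution, retries provide error handling, and two downstream tasks fan out in parallel once the transformation succeeds. The task names and Python callables are placeholders for real pipeline steps.

```python
# Minimal Airflow DAG sketch: scheduled execution, retries, and parallel fan-out.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():   print("extract raw data")        # placeholder step
def transform(): print("transform and validate")  # placeholder step
def train():     print("train ML model")          # placeholder step
def load():      print("load curated data")       # placeholder step

with DAG(
    dag_id="daily_etl_and_training",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                             # automated, scheduled execution
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},  # error handling
) as dag:
    t_extract   = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_train     = PythonOperator(task_id="train_model", python_callable=train)
    t_load      = PythonOperator(task_id="load", python_callable=load)

    # Training and loading run in parallel once the transformation succeeds.
    t_extract >> t_transform >> [t_train, t_load]
```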

Embracing scalability in modern data and analytics engineering is greatly facilitated by container orchestration tools like Kubernetes (K8s). The transition from batch to real-time analytics forms the foundation of contemporary business decision systems, and it is sustained by the resilient infrastructure these orchestration platforms provide.

In conclusion, the symbiotic integration of abstraction, automation, and orchestration empowers organisations to navigate the complexities of the data lifecycle. Real-world examples illustrate how these building blocks enhance efficiency, scalability, and reliability, propelling data engineering into a new era of innovation and value creation.
