Data Engineering — Aamir P

AAMIR P

Senior Software Engineer at Tiger Analytics | Padma Shri Award nominee for the year 2023 | Author of 25+ books | Badminton Player | Udemy Instructor | Public Speaker | Podcaster | Chess Player | Coder | Yoga Volunteer |

Published: March 29, 2024

Hello readers!

In this article, we will see a basic workflow of Data Engineering. Let's see how data is stored, governed, etc. in a lakehouse.

A lakehouse is a data architecture that combines elements of data lakes and data warehouses. It is built on open source and open standards, offers one consistent data platform across clouds, and unifies data warehousing and AI use cases in a single platform.

Unity Catalog provides centralised governance for the platform. Delta Lake is a file-based, open-source storage format that runs on top of data lakes.

Delta Live Tables is a declarative framework for building maintainable and testable data processing pipelines. You can define the transformations to perform on your data.

Delta Lake provides ACID transactions, data versioning, and indexing. It is an open-source project that enables building a data lakehouse on top of existing cloud storage.

Unity Catalog provides governance and auditing.

Delta Lake brings ACID to Object Storage:-

Atomicity means all transactions either succeed or fail completely.

Consistency guarantees relate to how a given state of data is observed by simultaneous operations.

Isolation refers to how simultaneous operations conflict with one another. The isolation guarantees that Delta Lake provides do differ from other systems.

Durability means that committed changes are permanent.

Problems solved by ACID

. Appending data is hard.

. Modifying existing data is difficult.

. Jobs fail midway.

. Real-time operations are hard.

. Keeping historical data versions is costly.
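To make the ACID ideas above a little more concrete, here is a toy sketch in plain Python. It is not Delta Lake's implementation or API: the class `ToyTable`, its methods, and the `_toy_log` directory are hypothetical names invented for illustration. It only mimics the general idea of an append-only, versioned commit log, where a write becomes visible all at once or not at all, and older versions stay readable. Conflicts between concurrent writers are not handled here.

```python
import json
import os
import tempfile


class ToyTable:
    """Toy versioned table: each commit is one numbered JSON file in a log directory."""

    def __init__(self, path):
        self.log_dir = os.path.join(path, "_toy_log")  # hypothetical directory name
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        # Committed versions are the numbered .json files present in the log directory.
        return sorted(
            int(name.split(".")[0])
            for name in os.listdir(self.log_dir)
            if name.endswith(".json")
        )

    def latest_version(self):
        versions = self._versions()
        return versions[-1] if versions else -1

    def commit(self, rows):
        """Append rows as a new version; the commit appears all at once or not at all."""
        version = self.latest_version() + 1
        final_path = os.path.join(self.log_dir, f"{version:020d}.json")
        tmp_path = final_path + ".tmp"
        try:
            with open(tmp_path, "w") as f:
                json.dump(rows, f)
        except Exception:
            # The write failed midway: discard the partial file so readers never see it.
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise
        os.replace(tmp_path, final_path)  # atomic rename publishes the commit
        return version

    def read(self, version=None):
        """Return all rows up to and including `version` (default: latest)."""
        if version is None:
            version = self.latest_version()
        rows = []
        for v in self._versions():
            if v <= version:
                with open(os.path.join(self.log_dir, f"{v:020d}.json")) as f:
                    rows.extend(json.load(f))
        return rows


with tempfile.TemporaryDirectory() as base:
    table = ToyTable(base)
    table.commit([{"id": 1}])
    table.commit([{"id": 2}])
    try:
        table.commit([{"id": {3}}])  # a set is not JSON-serialisable, so this write fails
    except TypeError:
        pass
    latest_rows = table.read()                  # rows from both successful commits
    first_version_rows = table.read(version=0)  # an older version is still readable
```

The failed third commit leaves no trace in the log, which is the atomicity property in miniature, and reading `version=0` shows how keeping every commit makes historical versions cheap to query.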

Your data is stored in a cloud storage system such as AWS, Azure, or GCP. This is called cloud object storage. To keep your data separate from the services that Databricks manages, the platform is split into a control plane and a data plane.

The control plane hosts the managed backend services. The data plane is where the data is processed, and information there is encrypted. Data engineers work with both batch and streaming processing.

A metastore is a collection of metadata with a link to the cloud storage.

A schema (also called a database) is a container that holds tables, views, and functions.

Workflows is a fully managed cloud-based general-purpose task orchestration service.

. Allows you to build simple ETL/ML task orchestration.

. Reduces infrastructure overhead.

. Integrates easily with external tools.

. Cloud-provider independent.

. Enables re-using clusters to reduce cost and startup time.

Workflows offers two main services:-

. Workflow Jobs: workflows for every job.

. Delta Live Tables (DLT): automated data pipelines for Delta Lake.

A DLT pipeline can run as a task in a workflow.
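As a rough illustration of what task orchestration means, the sketch below runs a few dependent tasks in dependency order using only the Python standard library (`graphlib`, Python 3.9+). It does not use the Databricks Workflows API; the task names (`ingest`, `transform`, `train`, `report`) and the `run_workflow` helper are hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies):
    """Run each task after all of its upstream tasks; return the execution order.

    tasks: dict mapping task name -> zero-argument callable
    dependencies: dict mapping task name -> set of upstream task names
    """
    # Include every task, even those that have no upstream dependencies.
    graph = {name: dependencies.get(name, set()) for name in tasks}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        tasks[name]()
    return order


log = []
tasks = {
    "ingest": lambda: log.append("ingest"),
    "transform": lambda: log.append("transform"),
    "train": lambda: log.append("train"),
    "report": lambda: log.append("report"),
}
dependencies = {
    "transform": {"ingest"},
    "train": {"transform"},
    "report": {"transform"},
}

# "ingest" runs first, then "transform"; "train" and "report" follow in either order.
order = run_workflow(tasks, dependencies)
```

Declaring the dependencies as data, rather than hard-coding the call sequence, is what lets an orchestrator decide the order and reject circular definitions (here `graphlib` raises `CycleError`).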

Databricks Repos:-

Git Versioning

. Native integration with GitHub, GitLab, etc.

. Used in UI-based workflows

CI/CD Integration

. API surface to integrate with automation

. Simplifies the dev/staging/prod multi-workspace story

Enterprise Ready

. Allow lists to avoid exfiltration

. Secret detection to avoid leaking keys

A cluster is a collection of VM instances.

All-purpose Clusters

Analyze data collaboratively using interactive notebooks.

Job Clusters

Run automated jobs. The Databricks job scheduler creates job clusters when running jobs.

Cluster modes

There are two modes namely:-

Single node

Low-cost single-instance cluster catering to single-node machine learning workloads and lightweight exploratory analysis.

Standard (Multi-Node)

Default mode for workloads developed in any supported language (requires at least 2 VM instances).

Runtime Version

. Standard

Includes Apache Spark and many other components and updates that provide an optimised big data analytics experience.

. Photon

An optional add-on to optimise Spark queries (e.g. SQL, DataFrame).

. Machine Learning

Adds popular machine learning libraries like TensorFlow, Keras, PyTorch, and XGBoost.

Cluster Policies

. Standardise cluster configurations

. Simplify the user experience

. Enforce correct tagging

. Prevent excessive use and control cost
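A cluster policy can be thought of as a set of rules that a requested cluster configuration must satisfy. The sketch below is a simplified validator in plain Python, not the Databricks policy engine. The `check_cluster` helper, the rule kinds (`fixed`, `allowlist`, `range`), and the attribute names and values are all assumptions chosen for illustration.

```python
def check_cluster(config, policy):
    """Return a list of human-readable policy violations (empty if compliant)."""
    violations = []
    for attribute, rule in policy.items():
        value = config.get(attribute)
        kind = rule["type"]
        if kind == "fixed" and value != rule["value"]:
            violations.append(f"{attribute} must be {rule['value']!r}")
        elif kind == "allowlist" and value not in rule["values"]:
            violations.append(f"{attribute} must be one of {rule['values']}")
        elif kind == "range":
            if value is None or not (rule["min"] <= value <= rule["max"]):
                violations.append(
                    f"{attribute} must be between {rule['min']} and {rule['max']}"
                )
    return violations


# Hypothetical policy: cap idle time, restrict instance sizes, enforce a team tag.
policy = {
    "autotermination_minutes": {"type": "range", "min": 10, "max": 60},
    "node_type": {"type": "allowlist", "values": ["small", "medium"]},
    "tags.team": {"type": "fixed", "value": "data-eng"},
}

compliant = {"autotermination_minutes": 30, "node_type": "small", "tags.team": "data-eng"}
oversized = {"autotermination_minutes": 240, "node_type": "xlarge", "tags.team": "data-eng"}

compliant_violations = check_cluster(compliant, policy)  # no rules broken
oversized_violations = check_cluster(oversized, policy)  # breaks the range and allowlist rules
```

The range rule corresponds to controlling cost, the allowlist to standardising configurations, and the fixed tag to enforcing correct tagging.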

Databricks Notebooks are used to create visualisations based on query results or DataFrames. You can use SQL, Python, or Scala.

Let's get to know each other!

https://lnkd.in/gdBxZC5j

Get my books, podcasts, placement preparation, etc.

https://linktr.ee/aamirp

Get my Podcasts on Spotify

https://lnkd.in/gG7km8G5

Catch me on Medium

https://lnkd.in/g2qqaWe2

Udemy

https://lnkd.in/gJjYNpNw

Follow me on Instagram

https://lnkd.in/gkf3KPDQ

YouTube

https://lnkd.in/gX2qYSVB

Subscribe to my Channel for more useful content.
