Data Engineering — Aamir P
Hello readers!
In this article, we will walk through a basic Data Engineering workflow and see how data is stored, governed, and processed in a lakehouse.
Lakehouse refers to a data architecture that combines elements of data lakes and data warehouses. It is built on open source and open standards, gives you one consistent data platform across clouds, and unifies data warehousing and AI use cases on a single platform.
Unity Catalog provides centralised governance for the platform. Delta Lake runs on top of existing data lakes; it is a file-based, open-source storage format that brings reliability guarantees to the data stored there.
Delta Live Tables is a declarative framework for building maintainable and testable data processing pipelines. You can define the transformations to perform on your data.
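To make that concrete, here is a minimal DLT sketch in Python, assuming it runs inside a Databricks DLT pipeline; the table names and source path are hypothetical.

```python
# Minimal Delta Live Tables sketch; runs inside a DLT pipeline on
# Databricks. Table names and the source path are hypothetical.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders ingested from cloud storage")
def orders_raw():
    # Read a hypothetical directory of JSON files from the lake.
    return spark.read.format("json").load("/mnt/raw/orders/")

@dlt.table(comment="Orders cleaned for downstream consumption")
def orders_clean():
    # Declare a transformation over the table above; DLT resolves
    # the dependency graph and materialises the results for us.
    return dlt.read("orders_raw").where(col("order_total") > 0)
```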
Delta Lake provides ACID transactions, data versioning and indexing. It is an open-source project that enables building a data lakehouse on top of existing cloud storage.
Unity Catalog provides governance and auditing.
Delta Lake brings ACID to Object Storage:-
Atomicity means all transactions either succeed or fail completely.
Consistency guarantees relate to how a given state of data is observed by simultaneous operations.
Isolation refers to how simultaneous operations conflict with one another. The isolation guarantees that Delta Lake provides differ from those of other systems.
Durability means that committed changes are permanent.
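A minimal PySpark sketch of what these guarantees buy you, assuming a Databricks notebook and a hypothetical storage path: the write either commits atomically to the Delta transaction log or not at all, and readers always see a consistent snapshot.

```python
# Hypothetical path; the write below commits atomically via the
# Delta transaction log, so it either fully succeeds or fully fails.
df = spark.range(100).withColumnRenamed("id", "event_id")
df.write.format("delta").mode("overwrite").save("/mnt/delta/events")

# Concurrent readers observe only committed snapshots of the table.
events = spark.read.format("delta").load("/mnt/delta/events")
print(events.count())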
Problems solved by ACID
. Hard to append data.
. Difficult to modify existing data.
. Jobs failing midway.
. Real-time operations are hard.
. Costly to keep historical data versions (the sketch below shows how Delta Lake addresses the first and last of these).
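Under the same assumptions as the previous example (same hypothetical path), appends become a single atomic commit, and historical versions remain queryable without keeping separate copies.

```python
# Appending is now a simple atomic commit to the same table.
new_rows = spark.range(100, 110).withColumnRenamed("id", "event_id")
new_rows.write.format("delta").mode("append").save("/mnt/delta/events")

# Time travel: query the table as of an earlier committed version,
# without having stored a separate historical copy of the data.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/events")
print(v0.count())  # counts rows as of version 0
```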
Your data is stored in a cloud storage system such as AWS, Azure, or GCP. This is called cloud object storage. To separate the data from Databricks, we use the control plane and the data plane.
The control plane hosts the managed backend services. In the data plane, the data is processed. Data in the data plane is encrypted. Data Engineers work with both batch and streaming processing.
A metastore is a collection of metadata plus a link to the cloud storage location where the data lives.
A schema is a container (database) that holds tables, views, and functions.
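To make the hierarchy concrete, here is a hedged sketch of creating these objects with SQL from Python, assuming a Unity Catalog-enabled workspace; all object names are made up.

```python
# Metastore -> catalog -> schema -> table. All names are made up;
# assumes a Unity Catalog-enabled workspace.
spark.sql("CREATE CATALOG IF NOT EXISTS demo_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS demo_catalog.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_catalog.sales.orders (
        order_id    BIGINT,
        order_date  DATE,
        order_total DOUBLE
    )
""")
# Tables are addressed by the three-level name catalog.schema.table.
spark.table("demo_catalog.sales.orders").printSchema()
```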
Workflows is a fully managed, cloud-based, general-purpose task orchestration service.
. Allows you to build simple ETL/ML task orchestration.
. Reduces infrastructure overhead.
. Integrates easily with external tools.
. Cloud-provider independent.
. Enables re-using clusters to reduce cost and startup time.
The two main task types in Workflows are:-
. Workflow Jobs: workflows for every job.
. Delta Live Tables: automated data pipelines for Delta Lake.
A DLT pipeline can itself run as a task inside a workflow job, as the sketch below shows.
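As a hedged sketch of both ideas, the payload below creates a two-task job through the Jobs API (2.1), where a notebook task is followed by a DLT pipeline task; the workspace URL, token, paths, pipeline ID, and cluster values are all placeholders.

```python
import requests

# All URLs, tokens, paths, and IDs below are placeholders.
job_spec = {
    "name": "nightly-etl",
    "job_clusters": [{
        "job_cluster_key": "etl_cluster",
        "new_cluster": {                     # reused by the job's tasks
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
    }],
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/etl/ingest"},
            "job_cluster_key": "etl_cluster",
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            # A Delta Live Tables pipeline running as a job task.
            "pipeline_task": {"pipeline_id": "<your-pipeline-id>"},
        },
    ],
}

resp = requests.post(
    "https://<workspace-url>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=job_spec,
)
print(resp.json())  # returns the new job_id on success
```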
Databricks Repos:-
Git Versioning
. Native integration with GitHub, GitLab, etc.
. Used in UI-based workflows.
CI/CD Integration
. API surface to integrate with automation (see the sketch below).
. Simplifies the dev/staging/prod multi-workspace story.
Enterprise Ready
. Allow lists to avoid exfiltration.
. Secret detection to avoid leaking keys.
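A hedged sketch of that automation surface, assuming the Repos REST API; the workspace URL, token, repo URL, and paths are placeholders.

```python
import requests

# Placeholders throughout: workspace URL, token, repo URL, and paths.
base = "https://<workspace-url>/api/2.0/repos"
headers = {"Authorization": "Bearer <personal-access-token>"}

# Clone a Git repository into the workspace.
repo = requests.post(base, headers=headers, json={
    "url": "https://github.com/<org>/<project>.git",
    "provider": "gitHub",
    "path": "/Repos/ci/project",
}).json()

# Later, e.g. from a CD script, pin the checkout to a release branch.
requests.patch(f"{base}/{repo['id']}", headers=headers,
               json={"branch": "release"})
```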
A cluster is a collection of VM instances.
All-purpose Clusters
Analyze data collaboratively using interactive notebooks.
Job Clusters
Run automated jobs. The Databricks job scheduler creates job clusters when running jobs.
Cluster modes
There are two modes, namely:-
Single node
Low-cost single-instance cluster catering to single-node machine learning workloads and lightweight exploratory analysis.
Standard (Multi-Node)
Default mode for workloads developed in any supported language (requires at least two VM instances). A sketch of both cluster specs follows.
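Here is a hedged sketch of how the two modes differ in a cluster spec, the kind of JSON you would send to the Clusters API; the runtime version and node type are illustrative placeholders.

```python
# Illustrative specs only; runtime version and node type are placeholders.
single_node = {
    "cluster_name": "explore-single",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 0,  # driver only, no workers
    # Commonly used settings that mark the cluster as single node:
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

standard = {
    "cluster_name": "etl-standard",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,  # two workers in addition to the driver
}
```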
Runtime Version
. Standard
Apache Spark and many other components and updates to provide an optimised big data analytics experience.
. Photon
An optional add-on that optimises Spark queries (e.g. SQL and DataFrame operations).
. Machine Learning
Adds popular machine learning libraries like TensorFlow, Keras, PyTorch, and XGBoost.
Cluster Policies
. Standardise cluster configurations.
. Simplify the user experience.
. Enforce correct tagging.
. Prevent excessive use and control cost (see the example policy below).
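A hedged example of what a policy definition can look like, i.e. the JSON a workspace admin attaches to a policy; the attribute values are illustrative.

```python
# Illustrative cluster policy definition; values are placeholders.
policy_definition = {
    # Pin the runtime so every cluster uses a vetted version.
    "spark_version": {"type": "fixed", "value": "13.3.x-scala2.12"},
    # Cap autoscaling to prevent excessive use and control cost.
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
    # Enforce correct tagging for cost attribution.
    "custom_tags.team": {"type": "fixed", "value": "data-eng"},
}
```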
Databricks notebooks are used to create visualisations based on query results or DataFrames. You can use SQL, Python, or Scala.
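A minimal notebook sketch, reusing the hypothetical table from the Unity Catalog example above: run a query and hand the result to display(), then switch the rendered output to a chart in the notebook UI.

```python
# Hypothetical table from the earlier Unity Catalog example.
df = spark.sql("""
    SELECT order_date, SUM(order_total) AS revenue
    FROM demo_catalog.sales.orders
    GROUP BY order_date
    ORDER BY order_date
""")
display(df)  # switch the output to a line chart in the notebook UI
```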
Let's get to know each other!
Get my books, podcasts, placement preparation, etc.
Get my Podcasts on Spotify
Catch me on Medium
Follow me on Instagram
Subscribe to my Channel for more useful content.