Data Engineering — Aamir P

Hello readers!

In this article, we will see a basic workflow of Data Engineering. Let's see how data is stored, governed, etc. in a lakehouse.

A lakehouse is a data architecture that combines elements of data lakes and data warehouses. It is built on open source and open standards, provides one consistent data platform across clouds, and unifies data warehousing and AI use cases in a single platform.

Unity Catalog provides centralised governance for the platform. Delta Lake runs on top of data lakes; it is a file-based, open-source storage format that brings reliability to the data stored in them.

Delta Live Tables is a declarative framework for building maintainable and testable data processing pipelines. You can define the transformations to perform on your data.
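
To make this concrete, here is a minimal sketch of a DLT pipeline in Python. The table names, landing path, and columns are hypothetical, and code like this only executes inside a configured DLT pipeline, not as a standalone script.

```python
# A hedged sketch of a Delta Live Tables pipeline (Python API).
# Table names, the landing path, and columns are hypothetical.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders ingested incrementally from cloud storage.")
def orders_raw():
    # Auto Loader picks up new files as they arrive.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/landing/orders")  # hypothetical landing zone
    )

@dlt.table(comment="Cleaned orders; rows failing the expectation are dropped.")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def orders_clean():
    return dlt.read_stream("orders_raw").select(
        "order_id", col("order_ts").cast("timestamp"), "amount"
    )
```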

Delta Lake provides ACID transactions, data versioning, and indexing. It is an open-source project that enables building a data lakehouse on top of existing cloud storage.

Unity Catalog provides governance and auditing.

Delta Lake brings ACID to Object Storage:-

Atomicity means all transactions either succeed or fail completely.

Consistency guarantees relate to how a given state of data is observed by simultaneous operations.

Isolation refers to how simultaneous operations conflict with one another. The isolation guarantees that Delta Lake provides do differ from those of other systems.

Durability means that committed changes are permanent.

Problems solved by ACID

. Hard to append data.

. Modification of existing data difficult.

. Jobs failing midway.

. Real-time operations hard.

. Costly to keep historical data versions.
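
As a hedged illustration of how Delta Lake addresses these pain points, here is a minimal PySpark sketch; the path, schema, and sample rows are hypothetical, and it assumes a cluster with Delta Lake available (as on Databricks, where `spark` is predefined).

```python
# A minimal sketch of atomic appends, ACID upserts, and time travel on Delta Lake.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()  # already defined in a Databricks notebook
path = "/tmp/delta/events"  # hypothetical table location

# Appending data is atomic: a job failing midway leaves no partial writes behind.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.format("delta").mode("append").save(path)

# Modifying existing data is a single ACID MERGE (upsert).
updates = spark.createDataFrame([(2, "B"), (3, "c")], ["id", "value"])
target = DeltaTable.forPath(spark, path)
(target.alias("t")
 .merge(updates.alias("u"), "t.id = u.id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Historical versions are kept in the transaction log: time travel to version 0.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```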

Your data is stored in a cloud storage system such as AWS, Azure, or GCP. This is called cloud object storage. To separate the data from Databricks, the platform is split into a control plane and a data plane.

The control plane hosts the managed backend services. In the data plane, the data is processed. Information in the data plane is encrypted. Data engineers work with both batch and stream processing.

A metastore is a collection of metadata together with a link to the cloud storage location.

A schema is a container (database) that holds tables, views, and functions.
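
As a hedged sketch of how these containers fit together, the following uses Unity Catalog's three-level namespace (catalog.schema.table) from a notebook; the catalog, schema, table, and group names are hypothetical.

```python
# Unity Catalog three-level namespace: catalog.schema.table.
spark.sql("CREATE CATALOG IF NOT EXISTS demo")
spark.sql("CREATE SCHEMA IF NOT EXISTS demo.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.sales.orders (
        order_id BIGINT,
        order_date DATE,
        amount DOUBLE
    )
""")
# Centralised governance: grants are managed and audited by Unity Catalog.
spark.sql("GRANT SELECT ON TABLE demo.sales.orders TO `analysts`")
```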

Workflows is a fully managed, cloud-based, general-purpose task orchestration service.

. Allows you to build simple ETL/ML task orchestration.

. Reduces infrastructure overhead.

. Integrates easily with external tools.

. Cloud-provider independent.

. Enables re-using clusters to reduce cost and startup time.

The two main components of Workflows are:-

. Workflow Jobs: workflows for every job.

. Delta Live Tables: automated data pipelines for Delta Lake.

A DLT pipeline can also run as a task in a workflow job.
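
As a hedged sketch of how such a Workflows job is defined, here is a minimal call to the Jobs API 2.1 that creates a one-task job on a fresh job cluster; the workspace URL, token, notebook path, and node type are placeholders you would substitute for your environment.

```python
# A hedged sketch: create a Workflows job with one notebook task via the Jobs API 2.1.
import requests

host = "https://<your-workspace>.cloud.databricks.com"  # placeholder
token = "<personal-access-token>"  # placeholder

job_spec = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/etl/ingest"},  # placeholder path
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",  # AWS example node type
                "num_workers": 2,
            },
        }
    ],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
print(resp.json())  # returns {"job_id": ...} on success
```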

Databricks Repos:-

Git Versioning

. Native integration with GitHub, GitLab, etc.

. Used in UI-based workflows.

CI/CD Integration

. API surface to integrate with automation (see the sketch after this section)

. Simplifies the dev/staging/prod multi-workspace story

Enterprise Ready

. Allow lists to avoid exfiltration

. Secret detection to avoid leaking keys
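
As a hedged sketch of that automation surface, a CI/CD pipeline can pin a workspace repo to a branch through the Repos API once tests pass; the workspace URL, token, and repo ID are placeholders.

```python
# A hedged sketch: point a workspace repo at a release branch via the Repos API.
import requests

host = "https://<your-workspace>.cloud.databricks.com"  # placeholder
headers = {"Authorization": "Bearer <personal-access-token>"}  # placeholder

# Update an existing workspace repo to the head of the given branch.
resp = requests.patch(
    f"{host}/api/2.0/repos/<repo-id>",  # placeholder repo ID
    headers=headers,
    json={"branch": "release"},
)
print(resp.json())
```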

A cluster is a collection of VM instances.

All-purpose Clusters

Analyze data collaboratively using interactive notebooks.

Job Clusters

Run automated jobs. The Databricks job scheduler creates job clusters when running jobs.

Cluster modes

There are two modes namely:-

Single node

Low-cost single-instance cluster catering to single-node machine learning workloads and lightweight exploratory analysis.

Standard (Multi-Node)

Default mode for workloads developed in any supported language (requires at least 2 VM instances).

Runtime Version

. Standard

Includes Apache Spark and many other components and updates that provide an optimised big-data analytics experience.

. Photon

An optional add-on to optimise Spark queries (e.g. SQL, DataFrame).

. Machine Learning

Adds popular machine learning libraries like TensorFlow, Keras, PyTorch, and XGBoost.

Cluster Policies

. Standardise cluster configurations

. Simplify the user experience

. Enforce correct tagging

. Prevent excessive use and control cost (see the policy sketch below)
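
As a hedged sketch, a policy is a JSON definition registered through the Cluster Policies API; the rules and values below are illustrative, not a recommended standard.

```python
# A hedged sketch: register a cluster policy that pins the runtime,
# caps cluster size, and enforces a team tag.
import json
import requests

policy_definition = {
    "spark_version": {"type": "fixed", "value": "13.3.x-scala2.12"},  # standardise runtime
    "num_workers": {"type": "range", "maxValue": 8},                  # control cost
    "custom_tags.team": {"type": "fixed", "value": "data-eng"},       # enforce tagging
}

resp = requests.post(
    "https://<your-workspace>.cloud.databricks.com/api/2.0/policies/clusters/create",  # placeholder host
    headers={"Authorization": "Bearer <personal-access-token>"},  # placeholder token
    json={"name": "standard-etl-policy", "definition": json.dumps(policy_definition)},
)
print(resp.json())  # returns {"policy_id": ...} on success
```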

Databricks Notebooks are used to create visualisations based on query results or DataFrames. You can use SQL, Python, or Scala.
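
As a minimal sketch, a notebook cell can run a query and hand the result to `display()`, which renders a grid with built-in charting; the table name reuses the hypothetical demo.sales.orders from above.

```python
# A minimal sketch: query a table and visualise the result in a notebook.
df = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM demo.sales.orders
    GROUP BY order_date
    ORDER BY order_date
""")
display(df)  # Databricks notebook function; pick a chart type from the result panel
```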

Let's get to know each other!

https://lnkd.in/gdBxZC5j


Get my books, podcasts, placement preparation, etc.

https://linktr.ee/aamirp


Get my Podcasts on Spotify

https://lnkd.in/gG7km8G5


Catch me on Medium

https://lnkd.in/g2qqaWe2


Udemy

https://lnkd.in/gJjYNpNw


Follow me on Instagram

https://lnkd.in/gkf3KPDQ


YouTube

https://lnkd.in/gX2qYSVB


Subscribe to my Channel for more useful content.
