Data Engineering — Aamir P

Hello readers!

In this article, we will walk through a basic Data Engineering workflow on Databricks and see how data is stored, governed, and processed in a lakehouse.

A lakehouse is a data architecture that combines elements of data lakes and data warehouses. It is built on open source and open standards, offers one consistent data platform across clouds, and unifies data warehousing and AI use cases in a single platform.

Unity Catalog provides centralised governance for the platform. Delta Lake runs on top of data lakes. It is a file-based, open-source storage format that adds reliability features, such as transactions and versioning, to the data stored there.

Delta Live Tables is a declarative framework for building maintainable and testable data processing pipelines. You can define the transformations to perform on your data.
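To make "declarative" a bit more concrete, here is a minimal plain-Python sketch. It is not the Delta Live Tables API: the `table` decorator, the `materialize` helper, and the sample rows are all hypothetical. The idea it shows is that you only declare what each table is, and a small runner works out the order from the declared inputs.

```python
import inspect

# Hypothetical registry: table name -> function that defines the table.
_TABLES = {}

def table(fn):
    """Register a function as the definition of a table (toy decorator)."""
    _TABLES[fn.__name__] = fn
    return fn

@table
def raw_orders():
    # Made-up sample rows standing in for ingested data.
    return [
        {"id": 1, "amount": 20.0},
        {"id": 2, "amount": -5.0},  # invalid row
        {"id": 3, "amount": 12.5},
    ]

@table
def clean_orders(raw_orders):
    # The parameter name declares the upstream table.
    return [row for row in raw_orders if row["amount"] > 0]

@table
def order_total(clean_orders):
    return [{"total": sum(row["amount"] for row in clean_orders)}]

def materialize(name, cache):
    """Build a table after building the tables named by its parameters."""
    if name not in cache:
        fn = _TABLES[name]
        upstream = [materialize(dep, cache) for dep in inspect.signature(fn).parameters]
        cache[name] = fn(*upstream)
    return cache[name]

results = {}
materialize("order_total", results)
print(results["order_total"])
```

Nothing in the table definitions says "run this first, then that". The order falls out of the dependencies, which is the property that makes such pipelines easier to maintain and test.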

Delta Lake provides ACID transactions, data versioning and indexing. It is an open-source project that enables building a data lakehouse on top of existing cloud storage.

Unity Catalog provides governance and auditing.

Delta Lake brings ACID to object storage:

Atomicity means each transaction either succeeds completely or fails completely.

Consistency guarantees relate to how a given state of data is observed by simultaneous operations.

Isolation refers to how simultaneous operations conflict with one another. The isolation guarantees that Delta Lake provides differ from those of other systems.

Durability means that committed changes are permanent.
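The properties above can be sketched with a toy table built on plain files. This is only an illustration of the general idea of an ordered commit log, not Delta Lake's real format or conflict handling. The `ToyTable` class, the `_log` directory name, and the `fail_before_commit` flag are all made up for the example.

```python
import json
import os
import tempfile
import uuid

class ToyTable:
    """Toy table: data files plus an ordered commit log (illustrative only)."""

    def __init__(self, root):
        self.root = root
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        return sorted(int(name.split(".")[0]) for name in os.listdir(self.log_dir))

    def append(self, rows, fail_before_commit=False):
        versions = self._versions()
        version = versions[-1] + 1 if versions else 0
        data_file = f"part-{uuid.uuid4().hex}.json"
        # Step 1: write the data file. Readers ignore it until it is committed.
        with open(os.path.join(self.root, data_file), "w") as f:
            json.dump(rows, f)
        if fail_before_commit:
            raise RuntimeError("simulated job failure before commit")
        # Step 2: publish the commit. Mode "x" fails if the file already exists,
        # so two writers cannot both claim the same version number.
        with open(os.path.join(self.log_dir, f"{version:08d}.json"), "x") as f:
            json.dump({"add": data_file}, f)
        return version

    def read(self, as_of=None):
        """Return rows from committed files only, optionally up to a version."""
        rows = []
        for version in self._versions():
            if as_of is not None and version > as_of:
                break
            with open(os.path.join(self.log_dir, f"{version:08d}.json")) as f:
                commit = json.load(f)
            with open(os.path.join(self.root, commit["add"])) as f:
                rows.extend(json.load(f))
        return rows

with tempfile.TemporaryDirectory() as root:
    toy = ToyTable(root)
    toy.append([{"id": 1}])                              # version 0
    try:
        toy.append([{"id": 2}], fail_before_commit=True)  # job dies midway
    except RuntimeError:
        pass                                             # nothing was committed
    toy.append([{"id": 3}])                              # version 1
    latest = toy.read()
    first_version = toy.read(as_of=0)

print(latest)
print(first_version)
```

The failed append leaves a stray data file behind, but because no commit entry points at it, readers never see half-written data. Reading with `as_of=0` shows how keeping the log also gives you older versions of the table.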

Problems solved by ACID

. Appending data is hard.

. Modifying existing data is difficult.

. Jobs fail midway.

. Real-time operations are hard.

. Keeping historical data versions is costly.

Your data is stored in a cloud storage system on AWS, Azure or GCP. This is called Cloud Object Storage. To keep your data separate from Databricks itself, the platform is split into a control plane and a data plane.

The control plane hosts the managed backend services. The data plane is where the data is processed, and information in the data plane is encrypted. Data Engineers work with both batch and stream processing.

The metastore is a collection of metadata and holds a link to the cloud storage.

A schema (also called a database) is a container that holds tables, views, and functions.

Workflows is a fully managed, cloud-based, general-purpose task orchestration service.

. Allows you to build simple ETL/ML task orchestration.

. Reduces infrastructure overhead.

. Easily integrate with external tools.

. Cloud-provider independent

. Enables re-using clusters to reduce cost and startup time.

Workflows has two main parts:

Workflow Jobs: workflows for every job.

Delta Live Tables (DLT): automated data pipelines for Delta Lake.

DLT is a task in a workflow.
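Task orchestration boils down to running tasks in an order that respects their dependencies. The sketch below shows that idea with Python's standard `graphlib` module. The job definition, the task names, and the `run_job` helper are hypothetical and are not the Databricks Jobs API.

```python
from graphlib import TopologicalSorter

# Hypothetical job: task name -> names of the tasks it depends on.
job = {
    "ingest": [],
    "clean": ["ingest"],
    "train_model": ["clean"],
    "build_report": ["clean"],
    "notify": ["train_model", "build_report"],
}

def run_job(job, run_task):
    """Run every task once all of its upstream tasks have finished."""
    sorter = TopologicalSorter(job)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies loop
    executed = []
    while sorter.is_active():
        # Tasks returned together have no unmet dependencies, so a real
        # orchestrator could run them in parallel. Here they run in
        # alphabetical order to keep the output deterministic.
        for task in sorted(sorter.get_ready()):
            run_task(task)
            executed.append(task)
            sorter.done(task)
    return executed

def run_task(name):
    print(f"running {name}")

order = run_job(job, run_task)
```

In this toy job, `train_model` and `build_report` both become ready as soon as `clean` finishes, and `notify` waits for both of them.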

Databricks Repos:

Git Versioning

. Native integration with GitHub, GitLab, etc.

. Used in UI-based workflows

CI/CD Integration

. API surface to integrate with automation

. Simplifies the dev/staging/prod multi-workspace story

Enterprise Ready

. Allow lists to avoid exfiltration

. Secret detection to avoid leaking keys
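To give a feel for what secret detection means, here is a toy scanner that flags lines that look like credentials before they are committed. It is not how Databricks Repos implements the feature. The two patterns and the `find_secrets` helper are assumptions chosen for illustration, and the key in the sample is fake.

```python
import re

# Illustrative patterns only; real scanners use far larger rule sets.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}

def find_secrets(text):
    """Return (line_number, label) for every line that matches a pattern."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, label))
    return findings

sample = "\n".join([
    "region = 'us-east-1'",
    "key_id = 'AKIAABCDEFGHIJKLMNOP'",  # fake key used only for the demo
    "print('hello')",
])

findings = find_secrets(sample)
print(findings)
```

A check like this would typically run before a commit is accepted, so that a flagged line can be fixed before the key ever reaches the remote repository.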

A cluster is a collection of VM instances.

All-purpose Clusters

Analyze data collaboratively using interactive notebooks.

Job Clusters

Run automated jobs. The Databricks job scheduler creates job clusters when running jobs.

Cluster modes

There are two modes:

Single node

Low-cost single-instance cluster catering to single-node machine learning workloads and lightweight exploratory analysis.

Standard (Multi-Node)

Default mode for workloads developed in any supported language (requires at least 2 VM instances).

Runtime Version

. Standard

Apache Spark and many other components and updates to provide an optimised big data analytics experience.

. Photon

An optional add-on to optimise Spark queries (e.g. SQL, DataFrame).

. Machine Learning

Adds popular machine learning libraries like TensorFlow, Keras, PyTorch, and XGBoost.

Cluster Policies

. Standardise cluster configurations

. Simplify the user experience

. Enforce correct tagging

. Prevent excessive use and control cost
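The goals above can be pictured as a set of rules that every requested cluster configuration is checked against. The sketch below is a simplified evaluator written for this article, not Databricks' own. The rule types (`fixed`, `allowlist`, `range`) are modelled on the style of cluster policy definitions, but treat the exact attribute names and values as assumptions and check the official documentation before relying on them.

```python
# Hypothetical policy: attribute path -> rule.
policy = {
    "autotermination_minutes": {"type": "range", "maxValue": 60},
    "node_type_id": {"type": "allowlist", "values": ["small", "medium"]},
    "custom_tags.team": {"type": "fixed", "value": "data-eng"},
}

def lookup(config, path):
    """Resolve a dotted path such as 'custom_tags.team' in a nested dict."""
    value = config
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

def violations(config, policy):
    """Return a message for every rule the cluster config breaks."""
    problems = []
    for path, rule in policy.items():
        value = lookup(config, path)
        if rule["type"] == "fixed" and value != rule["value"]:
            problems.append(f"{path} must be {rule['value']!r}")
        elif rule["type"] == "allowlist" and value not in rule["values"]:
            problems.append(f"{path} must be one of {rule['values']}")
        elif rule["type"] == "range":
            if value is None or value > rule["maxValue"]:
                problems.append(f"{path} must be at most {rule['maxValue']}")
    return problems

good = {
    "autotermination_minutes": 30,
    "node_type_id": "small",
    "custom_tags": {"team": "data-eng"},
}
bad = {
    "autotermination_minutes": 240,  # would keep an idle cluster running
    "node_type_id": "small",
    "custom_tags": {},               # missing the required team tag
}

good_problems = violations(good, policy)
bad_problems = violations(bad, policy)
print(good_problems)
print(bad_problems)
```

The `range` rule is what keeps cost under control, the `fixed` rule enforces tagging, and the `allowlist` rule standardises which machine types people can pick.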

Databricks Notebooks are used to create visualisations based on query results or DataFrames. You can use SQL, Python, Scala or R.

Let's get to know each other!

https://lnkd.in/gdBxZC5j


Get my books, podcasts, placement preparation, etc.

https://linktr.ee/aamirp


Get my Podcasts on Spotify

https://lnkd.in/gG7km8G5


Catch me on Medium

https://lnkd.in/g2qqaWe2


Udemy

https://lnkd.in/gJjYNpNw


Follow me on Instagram

https://lnkd.in/gkf3KPDQ


YouTube

https://lnkd.in/gX2qYSVB


Subscribe to my Channel for more useful content.
