Data Engineering — Aamir P
Hello readers!
In this article, we will walk through a basic Data Engineering workflow and see how data is stored, governed, and processed in a lakehouse.
Lakehouse refers to a data architecture that combines elements of data lakes and data warehouses. It is built on open source and open standards, gives you one consistent data platform across clouds, and unifies data warehousing and AI use cases on a single platform.
Unity Catalog provides centralised governance for the platform. Delta Lake runs on top of existing data lakes; it is a file-based, open-source storage format that brings reliability guarantees to the data stored there.
Delta Live Tables is a declarative framework for building maintainable and testable data processing pipelines. You can define the transformations to perform on your data.
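To make that concrete, here is a minimal DLT sketch in Python, assuming it runs inside a Databricks DLT pipeline; the table names and source path are hypothetical.

```python
# Minimal Delta Live Tables sketch; runs inside a DLT pipeline on
# Databricks. Table names and the source path are hypothetical.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders ingested from cloud storage")
def orders_raw():
    # Read a hypothetical directory of JSON files from the lake.
    return spark.read.format("json").load("/mnt/raw/orders/")

@dlt.table(comment="Orders cleaned for downstream consumption")
def orders_clean():
    # Declare a transformation over the table above; DLT resolves
    # the dependency graph and materialises the results for us.
    return dlt.read("orders_raw").where(col("order_total") > 0)
```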
Delta Lake provides ACID transactions, data versioning and indexing. It is an open-source project that enables building a data lakehouse on top of existing cloud storage.
Unity Catalog provides governance and auditing.
Delta Lake brings ACID to Object Storage:-
Atomicity means all transactions either succeed or fail completely.
Consistency guarantees relate to how a given state of data is observed by simultaneous operations.
Isolation refers to how simultaneous operations conflict with one another. The isolation guarantees that Delta Lake provides differ from those of other systems.
Durability means that committed changes are permanent.
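A minimal PySpark sketch of what these guarantees buy you, assuming a Databricks notebook and a hypothetical storage path: the write either commits atomically to the Delta transaction log or not at all, and readers always see a consistent snapshot.

```python
# Hypothetical path; the write below commits atomically via the
# Delta transaction log, so it either fully succeeds or fully fails.
df = spark.range(100).withColumnRenamed("id", "event_id")
df.write.format("delta").mode("overwrite").save("/mnt/delta/events")

# Concurrent readers observe only committed snapshots of the table.
events = spark.read.format("delta").load("/mnt/delta/events")
print(events.count())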
Problems solved by ACID
. Hard to append data.
. Difficult to modify existing data.
. Jobs failing midway.
. Real-time operations are hard.
. Costly to keep historical data versions (the sketch below shows how Delta Lake addresses the first and last of these).
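Under the same assumptions as the previous example (same hypothetical path), appends become a single atomic commit, and historical versions remain queryable without keeping separate copies.

```python
# Appending is now a simple atomic commit to the same table.
new_rows = spark.range(100, 110).withColumnRenamed("id", "event_id")
new_rows.write.format("delta").mode("append").save("/mnt/delta/events")

# Time travel: query the table as of an earlier committed version,
# without having stored a separate historical copy of the data.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/events")
print(v0.count())  # counts rows as of version 0
```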
Your data is stored in a cloud storage system such as AWS, Azure, or GCP. This is called cloud object storage. To separate the data from Databricks, we use the control plane and the data plane.
The control plane hosts the managed backend services. In the data plane, the data is processed. Data in the data plane is encrypted. Data Engineers work with both batch and streaming processing.
A metastore is a collection of metadata plus a link to the cloud storage location where the data lives.
A schema is a container (database) that holds tables, views, and functions.
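To make the hierarchy concrete, here is a hedged sketch of creating these objects with SQL from Python, assuming a Unity Catalog-enabled workspace; all object names are made up.

```python
# Metastore -> catalog -> schema -> table. All names are made up;
# assumes a Unity Catalog-enabled workspace.
spark.sql("CREATE CATALOG IF NOT EXISTS demo_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS demo_catalog.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo_catalog.sales.orders (
        order_id    BIGINT,
        order_date  DATE,
        order_total DOUBLE
    )
""")
# Tables are addressed by the three-level name catalog.schema.table.
spark.table("demo_catalog.sales.orders").printSchema()
```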
Workflows is a fully managed, cloud-based, general-purpose task orchestration service.
. Allows you to build simple ETL/ML task orchestration.
. Reduces infrastructure overhead.
. Integrates easily with external tools.
. Cloud-provider independent.
. Enables re-using clusters to reduce cost and startup time.
The two main task types in Workflows are:-
. Workflow Jobs: workflows for every job.
. Delta Live Tables: automated data pipelines for Delta Lake.
A DLT pipeline can itself run as a task inside a workflow job, as the sketch below shows.
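As a hedged sketch of both ideas, the payload below creates a two-task job through the Jobs API (2.1), where a notebook task is followed by a DLT pipeline task; the workspace URL, token, paths, pipeline ID, and cluster values are all placeholders.

```python
import requests

# All URLs, tokens, paths, and IDs below are placeholders.
job_spec = {
    "name": "nightly-etl",
    "job_clusters": [{
        "job_cluster_key": "etl_cluster",
        "new_cluster": {                     # reused by the job's tasks
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
        },
    }],
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/etl/ingest"},
            "job_cluster_key": "etl_cluster",
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            # A Delta Live Tables pipeline running as a job task.
            "pipeline_task": {"pipeline_id": "<your-pipeline-id>"},
        },
    ],
}

resp = requests.post(
    "https://<workspace-url>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=job_spec,
)
print(resp.json())  # returns the new job_id on success
```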
Databricks Repos:-
Git Versioning
. Native integration with GitHub, GitLab, etc.
. Used in UI-based workflows.
CI/CD Integration
. API surface to integrate with automation (see the sketch below).
. Simplifies the dev/staging/prod multi-workspace story.
Enterprise Ready
. Allow lists to avoid exfiltration.
. Secret detection to avoid leaking keys.
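A hedged sketch of that automation surface, assuming the Repos REST API; the workspace URL, token, repo URL, and paths are placeholders.

```python
import requests

# Placeholders throughout: workspace URL, token, repo URL, and paths.
base = "https://<workspace-url>/api/2.0/repos"
headers = {"Authorization": "Bearer <personal-access-token>"}

# Clone a Git repository into the workspace.
repo = requests.post(base, headers=headers, json={
    "url": "https://github.com/<org>/<project>.git",
    "provider": "gitHub",
    "path": "/Repos/ci/project",
}).json()

# Later, e.g. from a CD script, pin the checkout to a release branch.
requests.patch(f"{base}/{repo['id']}", headers=headers,
               json={"branch": "release"})
```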
A cluster is a collection of VM instances.
All-purpose Clusters
Analyze data collaboratively using interactive notebooks.
Job Clusters
Run automated jobs. The Databricks job scheduler creates job clusters when running jobs.
Cluster modes
There are two modes, namely:-
Single node
Low-cost single-instance cluster catering to single-node machine learning workloads and lightweight exploratory analysis.
Standard (Multi-Node)
Default mode for workloads developed in any supported language (requires at least two VM instances). A sketch of both cluster specs follows.
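Here is a hedged sketch of how the two modes differ in a cluster spec, the kind of JSON you would send to the Clusters API; the runtime version and node type are illustrative placeholders.

```python
# Illustrative specs only; runtime version and node type are placeholders.
single_node = {
    "cluster_name": "explore-single",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 0,  # driver only, no workers
    # Commonly used settings that mark the cluster as single node:
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}

standard = {
    "cluster_name": "etl-standard",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,  # two workers in addition to the driver
}
```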
Runtime Version
. Standard
Apache Spark and many other components and updates to provide an optimised big data analytics experience.
. Photon
An optional add-on that optimises Spark queries (e.g. SQL and DataFrame operations).
. Machine Learning
Adds popular machine learning libraries like TensorFlow, Keras, PyTorch, and XGBoost.
Cluster Policies
. Standardise cluster configurations.
. Simplify the user experience.
. Enforce correct tagging.
. Prevent excessive use and control cost (see the example policy below).
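A hedged example of what a policy definition can look like, i.e. the JSON a workspace admin attaches to a policy; the attribute values are illustrative.

```python
# Illustrative cluster policy definition; values are placeholders.
policy_definition = {
    # Pin the runtime so every cluster uses a vetted version.
    "spark_version": {"type": "fixed", "value": "13.3.x-scala2.12"},
    # Cap autoscaling to prevent excessive use and control cost.
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
    # Enforce correct tagging for cost attribution.
    "custom_tags.team": {"type": "fixed", "value": "data-eng"},
}
```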
Databricks notebooks are used to create visualisations based on query results or DataFrames. You can use SQL, Python, or Scala.
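A minimal notebook sketch, reusing the hypothetical table from the Unity Catalog example above: run a query and hand the result to display(), then switch the rendered output to a chart in the notebook UI.

```python
# Hypothetical table from the earlier Unity Catalog example.
df = spark.sql("""
    SELECT order_date, SUM(order_total) AS revenue
    FROM demo_catalog.sales.orders
    GROUP BY order_date
    ORDER BY order_date
""")
display(df)  # switch the output to a line chart in the notebook UI
```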
Let's get to know each other!
Get my books, podcasts, placement preparation, etc.
Get my Podcasts on Spotify
Catch me on Medium
Follow me on Instagram
Subscribe to my Channel for more useful content.