登录查看更多内容

Microsoft Azure Data Fundamentals (DP-900) Exam: An Overview of Tools and Concepts

Yasin Sharaf

Analyst @ Baitulmaal · 3x Azure Certified

发布日期: 2023年6月16日

Data Ingestion

ETL (Extract, Transform, Load): Extracts from source, transforms (column mapping, cleansing, validating, wrangling, etc), and loads into target destination.
ELT (Extract, Load, Transform): Extracts from source, loads raw data into target, and transforms within destination.

Tools: Azure Data Factory (drag/drop pipelines), SSIS (drag/drop pipelines for SQL Server), Data Catalog (discovering data sources)

Data Types

Relational: consists of tables, columns, and records.
Columnar: Groups a collection of data regarding the same entity under the same column.
Graph: Maps out relationships between entities using nodes and edges. Useful for highly connected data.
Document: Stores data as name/value pairs in formats such as JSON and AVRO. Has a flexible schema.
Key-Value: Stores data as dictionaries for simple lookups and quick ingestion.
Object: Files such as audio, video, and images.

Data Storage (Structured): Structured data is stored in tabular format which can be queried using SQL.

SQL DB: Relational database used for consuming transactional data (OLTP) in a normalized format.
Dedicated SQL Pool (formerly DW): Used to store fact/dimension tables for analytical processing (explained below).

Tools: SSMS (GUI for SQL Server), Azure Data Studio (cross-platform lightweight GUI), SSDT

Data Storage (Semi/Unstructured): Semi-structured can include key-value pairs in documents, and does not follow relational database protocols such as normalization. Unstructured data can include audio, video, images and the likes.

Cassandra DB: Stores column-family data.
Azure Cosmos DB: Database which can hold and query any type of data. Data stored as documents and queried using the appropriate API. Allows for using SQL to query non-relational data. APIs: SQL/Core, NoSQL (documents), Table, MongoDB, Cassandra (col-family), Gremlin (graphs)
Azure Data Lake: Captures high volumes of raw data.
Azure Files: Used to store files like .docx and .png which can be shared with others.
Azure BLOB (binary large objects) Storage: Holds large amounts of unstructured data.
Azure Table Storage: Stores data using key-value pairs. Keys are grouped into partition keys and row keys.
Azure Catalog: Stores metadata.

Tools: Data Migration Assistant (for moving data to cloud), PolyBase (querying using T-SQL)

Data Processing

Batch: Periodic processing of data. Batches can be scheduled for ingestion manually or using a trigger which executes upon an event.

Architecture: Ingestion --> Storage --> Processing --> Analytical Store --> Reporting/Analysis

Stream: Processes and captures data as it is generated in the source. Useful for IoT, gaming, and stock data.

Architecture: Real-Time Ingestion --> Storage/Processing (real-time) --> Analytical Store --> Reporting/Analysis

Tools: Event Hubs (stream), IoT Hubs (stream), Apache Kafka (stream), U-SQL (batch), Apache Hive/Pig/Spark (batch), HDInsight (HDFS/big data), Databricks

领英推荐

Data Engineering | A comprehensive understanding of…

Himanshu Ramchandani 2 年前

Azure Data Engineer Interview questions with Answers…

Sharad Sonwane 4 个月前

What Skills Should Every Data Engineer Have in 2025? ??

WalkingTree Resources Pvt. Ltd. 2 个月前

Data Analysis and Reporting (Synapse Analytics)

Data Warehousing: This type of storage uses an OLAP (Online Analytical Processing) model where fact and dimension entities are created using stored data for more efficient queries when performing analytics and reporting. Star schemas create denormalized tables (easier joins) while snowflake schemas further normalize/break down dimension tables (less redundancy).
Complex Analysis: Can perform various types of analytics including descriptive for BI and prepping/training predictive models for ML (descriptive, predictive, prescriptive, diagnostic, cognitive).
Reporting: Visualization of data using KPIs, charts, and maps.

Tools: Dedicated SQL Pool (warehousing), Spark (processing), Data Factory (ingestion), Link (real-time connection to cosmos/sql db), PowerBI (reporting)

Data Redundancy/Disaster Recovery: the way data is copied and stored across availability zones (separate data centers)

Zone-redundant Storage (ZRS): data copied across 3 availability zones in primary region
Locally-Redundant Storage (LRS): data copied 3 times in local storage
Geo-Redundant Storage (GRS): data copied 3 times locally, copies to region 2, then copies 3 times in that storage (LRS locally, LRS region 2). Replicated to region 2 asynchronously (not real time). Region 2 does not have read/write access.
Geo-Zone Redundant Storage (GZRS): like GRS, but uses ZRS in primary region rather than LRS (ZRS locally, LRS region 2)
Geo-Zone Redundant Storage (GZRS-RA): Gives read access to region 2 (ability to view/query data)

Service Types

SaaS/Software-as-a-Service: Apps that are ready to use and offered as a subscription. Fully administered and managed by the provider.
PaaS/Platform-as-a-Service: Offers a complete set of tools and services to develop applications (such as SQL DB). Includes managed instances with minimal administration efforts regarding backups, security patches, and updates.
IaaS/Infrastructure-as-a-Service: Provides virtual machines (complete virtual operating system) giving you the most control over your environment, but also the most responsibility. Updates, patches, etc. have to be done manually and administered in-house.

SQL Commands

DML/Data Manipulation Language (for making changes to data): INSERT, UPDATE, DELETE, MERGE
DDL/Data Definition Language/ (for altering schema of database): CREATE, ALTER, TRUNCATE, DROP
DCL/Data Control Language (for managing user permissions): GRANT, DENY, REVOKE
TCL/Transaction Control Language (for managing transactions/set of DML statements): COMMIT, ROLLBACK, SAVEPOINT
DQL/Data Query Language (for querying/viewing data): SELECT

Sample SQL SELECT syntax: This is a simple SELECT statement I wrote to create a fact table from a sample database in the star schema (refer to data warehousing above) for completed sales orders. This type of query denormalizes relational data and allows for simple retrieval based on the use case of the business unit/department in a company.

select
? ? sales_orders.order_id,
? ? sales_orders.customer_id,
? ? sales_orders.order_status,
? ? sales_orders.order_date,
? ? sales_orders.shipped_date,
? ? extract(day from sales_orders.shipped_date - sales_orders.order_date) as shipping_speed,
? ? sales_orders.store_id,
? ? sales_orders.staff_id,
? ? sales_staffs.manager_id,
? ? round(amount,2) as amount
from 
? ? sales_staffs 
? ? join sales_orders using (staff_id)
? ? join payment_amount using (order_id)
where order_status = 4
order by order_date desc 

-- SELECT contains columns to be retrieved
-- FROM contains tables that contain those columns
-- JOIN matches columns from multiple tables based on a matching key
-- WHERE filters the data based on conditions
-- ORDER BY determines the order based on a column (ascending, descending)

For more detailed explanations of any services or concepts, refer to Microsoft's official documentation:

https://learn.microsoft.com/en-us/azure/?product=popular

Microsoft Azure Data Fundamentals (DP-900) Exam: An Overview of Tools and Concepts

Yasin Sharaf

Analyst @ Baitulmaal · 3x Azure Certified

领英推荐

社区洞察

其他会员也浏览了

Top 9 Important Tools that every Data Engineer Needs

SNOW-SQL in SNOWFLAKE

Databricks SQL - The new Cloud Data Ware(Lake)house

Mastering Azure Data Factory: A Comprehensive Guide to Modern Data Integration

AZURE DATA FACTORY ACTIVITIES AND ITS TYPES

Navigating the Transition from SQL/Data Warehouse Developer to Data Lakehouse Developer with Databricks

How to Get Started With ADF As a Beginner?

Azure Synapse

Logical Data Warehouse With Azure Synapse Serverless SQL - Incremental Data Loading

Build a Secure and Scalable Data Lakehouse on Snowflake.