Microsoft Azure Data Fundamentals (DP-900) Exam: An Overview of Tools and Concepts

Data Ingestion

  • ETL (Extract, Transform, Load): Extracts from source, transforms (column mapping, cleansing, validating, wrangling, etc), and loads into target destination.
  • ELT (Extract, Load, Transform): Extracts from source, loads raw data into target, and transforms within destination.

Tools: Azure Data Factory (drag/drop pipelines), SSIS (drag/drop pipelines for SQL Server), Data Catalog (discovering data sources)

?

Data Types

  • Relational: consists of tables, columns, and records.
  • Columnar: Groups a collection of data regarding the same entity under the same column.
  • Graph: Maps out relationships between entities using nodes and edges. Useful for highly connected data.
  • Document: Stores data as name/value pairs in formats such as JSON and AVRO. Has a flexible schema.
  • Key-Value: Stores data as dictionaries for simple lookups and quick ingestion.
  • Object: Files such as audio, video, and images.

?

Data Storage (Structured): Structured data is stored in tabular format which can be queried using SQL.

  • SQL DB: Relational database used for consuming transactional data (OLTP) in a normalized format.
  • Dedicated SQL Pool (formerly DW): Used to store fact/dimension tables for analytical processing (explained below).

Tools: SSMS (GUI for SQL Server), Azure Data Studio (cross-platform lightweight GUI), SSDT

?

Data Storage (Semi/Unstructured): Semi-structured can include key-value pairs in documents, and does not follow relational database protocols such as normalization. Unstructured data can include audio, video, images and the likes.

  • Cassandra DB: Stores column-family data.
  • Azure Cosmos DB: Database which can hold and query any type of data. Data stored as documents and queried using the appropriate API. Allows for using SQL to query non-relational data. APIs: SQL/Core, NoSQL (documents), Table, MongoDB, Cassandra (col-family), Gremlin (graphs)
  • Azure Data Lake: Captures high volumes of raw data.
  • Azure Files: Used to store files like .docx and .png which can be shared with others.
  • Azure BLOB (binary large objects) Storage: Holds large amounts of unstructured data.
  • Azure Table Storage: Stores data using key-value pairs. Keys are grouped into partition keys and row keys.
  • Azure Catalog: Stores metadata.

Tools: Data Migration Assistant (for moving data to cloud), PolyBase (querying using T-SQL)

?

Data Processing

  • Batch: Periodic processing of data. Batches can be scheduled for ingestion manually or using a trigger which executes upon an event.

Architecture: Ingestion --> Storage --> Processing --> Analytical Store --> Reporting/Analysis

  • Stream: Processes and captures data as it is generated in the source. Useful for IoT, gaming, and stock data.

Architecture: Real-Time Ingestion --> Storage/Processing (real-time) --> Analytical Store --> Reporting/Analysis

?

Tools: Event Hubs (stream), IoT Hubs (stream), Apache Kafka (stream), U-SQL (batch), Apache Hive/Pig/Spark (batch), HDInsight (HDFS/big data), Databricks

?

Data Analysis and Reporting (Synapse Analytics)

  • Data Warehousing: This type of storage uses an OLAP (Online Analytical Processing) model where fact and dimension entities are created using stored data for more efficient queries when performing analytics and reporting. Star schemas create denormalized tables (easier joins) while snowflake schemas further normalize/break down dimension tables (less redundancy).
  • Complex Analysis: Can perform various types of analytics including descriptive for BI and prepping/training predictive models for ML (descriptive, predictive, prescriptive, diagnostic, cognitive).
  • Reporting: Visualization of data using KPIs, charts, and maps.

Tools: Dedicated SQL Pool (warehousing), Spark (processing), Data Factory (ingestion), Link (real-time connection to cosmos/sql db), PowerBI (reporting)


Data Redundancy/Disaster Recovery: the way data is copied and stored across availability zones (separate data centers)

  • Zone-redundant Storage (ZRS): data copied across 3 availability zones in primary region
  • Locally-Redundant Storage (LRS): data copied 3 times in local storage
  • Geo-Redundant Storage (GRS): data copied 3 times locally, copies to region 2, then copies 3 times in that storage (LRS locally, LRS region 2). Replicated to region 2 asynchronously (not real time). Region 2 does not have read/write access.
  • Geo-Zone Redundant Storage (GZRS): like GRS, but uses ZRS in primary region rather than LRS (ZRS locally, LRS region 2)
  • Geo-Zone Redundant Storage (GZRS-RA): Gives read access to region 2 (ability to view/query data)

?

Service Types

  • SaaS/Software-as-a-Service: Apps that are ready to use and offered as a subscription. Fully administered and managed by the provider.
  • PaaS/Platform-as-a-Service: Offers a complete set of tools and services to develop applications (such as SQL DB). Includes managed instances with minimal administration efforts regarding backups, security patches, and updates.
  • IaaS/Infrastructure-as-a-Service: Provides virtual machines (complete virtual operating system) giving you the most control over your environment, but also the most responsibility. Updates, patches, etc. have to be done manually and administered in-house.

?

SQL Commands

  • DML/Data Manipulation Language (for making changes to data): INSERT, UPDATE, DELETE, MERGE
  • DDL/Data Definition Language/ (for altering schema of database): CREATE, ALTER, TRUNCATE, DROP
  • DCL/Data Control Language (for managing user permissions): GRANT, DENY, REVOKE
  • TCL/Transaction Control Language (for managing transactions/set of DML statements): COMMIT, ROLLBACK, SAVEPOINT
  • DQL/Data Query Language (for querying/viewing data): SELECT


Sample SQL SELECT syntax: This is a simple SELECT statement I wrote to create a fact table from a sample database in the star schema (refer to data warehousing above) for completed sales orders. This type of query denormalizes relational data and allows for simple retrieval based on the use case of the business unit/department in a company.

select
? ? sales_orders.order_id,
? ? sales_orders.customer_id,
? ? sales_orders.order_status,
? ? sales_orders.order_date,
? ? sales_orders.shipped_date,
? ? extract(day from sales_orders.shipped_date - sales_orders.order_date) as shipping_speed,
? ? sales_orders.store_id,
? ? sales_orders.staff_id,
? ? sales_staffs.manager_id,
? ? round(amount,2) as amount
from 
? ? sales_staffs 
? ? join sales_orders using (staff_id)
? ? join payment_amount using (order_id)
where order_status = 4
order by order_date desc 

-- SELECT contains columns to be retrieved
-- FROM contains tables that contain those columns
-- JOIN matches columns from multiple tables based on a matching key
-- WHERE filters the data based on conditions
-- ORDER BY determines the order based on a column (ascending, descending)        



?

?

For more detailed explanations of any services or concepts, refer to Microsoft's official documentation:

https://learn.microsoft.com/en-us/azure/?product=popular

Milad Zazai

Entwickler im Bereich Künstliche Intelligenz

1 年

Thanks for sharing

回复

要查看或添加评论,请登录

社区洞察

其他会员也浏览了