Microsoft Fabric: Unified Integrated Analytics
Amit Chandak
Microsoft Fabric: New Age Analytics | Lakehouse | Warehouse | Dataflow Gen2 | Power BI | Data Pipeline | Spark | Notebook | KQL
Evolution of Data
We live in an era of data and data-driven decision-making. Modern organizations depend on data and data engineering processes to make quality decisions. It all started with the emergence of RDBMS in the 1970s.
Relational Database Management Systems (RDBMS) have played a pivotal role in the evolution of data management and data engineering since their inception. From their humble beginnings to their current prominence, RDBMS have transformed the way organizations store, manage, and retrieve data.
With the need for analyzing large volumes of data, data warehousing, and OLAP technologies have gained prominence. We explore the integration of these technologies with RDBMS and their role in decision support systems.
The explosion of unstructured and semi-structured data challenged the relational model's suitability. This led to the emergence of data lakes and NoSQL databases with their various models, which have significantly impacted the data engineering landscape. Recognizing the strengths and weaknesses of both approaches, RDBMS and NoSQL technologies began to converge, combining the benefits of both worlds.
With the evolution of machine learning and AI capabilities, the integration of predictive analytics, data mining, and machine learning algorithms has become part of the data engineering process. This has enabled advanced analytics and automation.
In the age of technological advancements, data has become the centerpiece of discussions. AI and BI have revolutionized industries, all driven by the immense amount of data collected over the past few decades. AI models like ChatGPT have emerged, providing intelligence that rivals human capabilities. It seems like a thrilling chapter in human progress.
However, amidst this transformative era, organizations grapple with data engineering challenges. Despite the promises of AI and BI, outcomes often fall short, whether through delays or the failure to achieve objectives. Why is this happening just as we enter the era of AI and BI? The answer lies in the enigmatic nature of the data engineering process, particularly during its initial phases.
The Challenges
Data engineering, the foundation of successful data analysis and insights, involves various stages such as data collection, transformation, cleaning, and integration. It is an intricate process, much like solving a complex puzzle. The initial phase, known as data wrangling, serves as the gateway to the entire journey. Yet data engineering often becomes a black box.
Imagine an organization embarking on a data engineering project. They possess a wealth of data collected over the years, encompassing customer information, sales figures, market trends, and more. Excitement surges as they aim to unlock the hidden potential within this treasure trove. Yet, as the data engineers delve into the project, they encounter numerous obstacles.
The first challenge they face is data quality. The collected data often comes from diverse sources, each with its own format, structure, and accuracy. It is like gathering puzzle pieces from different sets, each with its own unique shape and color scheme. Aligning these pieces becomes an arduous task, as data engineers struggle to ensure consistency and reliability. Missing values, outliers, and inconsistent formatting plague the dataset, introducing complexities that must be resolved.
The next hurdle is data integration. Organizations often possess multiple data sources, each providing valuable insights into different aspects of their operations. However, merging these sources seamlessly becomes a daunting endeavor. Incompatible data formats, conflicting definitions, and varying granularity levels hinder the creation of a unified view. The jigsaw puzzle grows more perplexing as data engineers strive to establish relationships and linkages between disparate datasets.
Furthermore, the sheer volume of data overwhelms engineers. With each passing day, the data landscape expands exponentially. Organizations collect vast amounts of structured and unstructured data, including social media feeds, sensor readings, and customer interactions. It is as if the puzzle keeps growing in size, with new pieces added constantly. Managing and processing big data becomes a challenge, requiring robust infrastructure and efficient algorithms.
In the midst of these challenges, deadlines loom, and organizational stakeholders eagerly await actionable insights. However, data engineering operates behind closed doors, shrouded in complexity and uncertainty. It is no surprise that delays occur, leaving stakeholders frustrated and questioning the effectiveness of the AI/BI revolution.
Data Replication
Data engineering processes often become complex due to the current technological landscape and choices made regarding data storage. Let’s examine how data flows from its source to the end user for consumption.
Multiple Data Sources -> Data Lake -> Data Warehouse -> BI Tools
Multiple Data Sources -> Data Lake -> AI Engine -> Data Lake -> Data Warehouse -> BI Tools
Multiple data sources feed into a data lake, which serves as a central repository for raw data. From the data lake, the data is then transformed and loaded into a data warehouse, which acts as a structured storage solution for analytics and reporting. Finally, business intelligence (BI) tools access the data from the data warehouse for analysis and visualization.
In some cases, an additional step involves routing the data from the data lake to an AI engine for machine learning and AI modeling. This allows advanced data processing and modeling capabilities.
Data lakes are particularly suitable for data science, ML, and AI modeling because they typically store data in open, columnar file formats. They provide the flexibility and scalability required for these types of analytical workloads.
Data warehouses, on the other hand, store data in proprietary formats that vary from one database to another. When BI tools interact with data warehouses, they typically operate in import mode to optimize performance, which creates yet another copy of the data in the BI tool's own proprietary format. This redundancy can lead to multiple copies of the data throughout the process.
As a result, we end up with multiple copies of data, each in a proprietary format, requiring data validation and quality assurance at each layer. The process can be seen as a black box, since the quality assurance team and end users have limited visibility until the data reaches a BI tool. This lack of transparency can cause delays and limited data availability at certain stages unless a dedicated technical QA team is involved. I also often feel that the process is not well aligned with a true Agile/Scrum process.
And this count does not even include the replicas created purely to achieve high performance.
Microsoft Fabric
Microsoft has taken a significant leap forward to address unnecessary data replication. Their goal is to ensure that data can be stored and accessed by all their tools in a standardized format. This approach eliminates data duplication within the Microsoft ecosystem, allowing others to read and utilize the data as well.
Microsoft Fabric aims to revolutionize data engineering and data analytics by promoting the concept of keeping only one copy of data, stored in a central lake (OneLake) using the industry-standard Delta Parquet format. This means organizations can avoid unnecessary data replication and have a true single source of data.
The advantage of this approach is that it enables seamless compatibility with various Microsoft tools, including data pipelines, dataflows, notebooks, the SQL engine, the Kusto engine, and Power BI. All of these tools can read and write data in the standardized Delta Parquet format, ensuring consistency and ease of use across the entire Microsoft ecosystem.
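To make the "one copy" idea concrete, here is a minimal sketch of a Fabric notebook cell that writes a Delta table into a Lakehouse. It assumes a notebook attached to a Lakehouse, where a spark session is predefined; the table and column names are illustrative.

```python
# A minimal sketch, assuming a Fabric notebook attached to a Lakehouse
# (the predefined `spark` session is available; names are illustrative).
from pyspark.sql import Row

# Create a small DataFrame and save it as a Delta table in the Lakehouse.
orders = spark.createDataFrame([
    Row(order_id=1, region="East", amount=120.50),
    Row(order_id=2, region="West", amount=75.00),
])
orders.write.format("delta").mode("overwrite").saveAsTable("orders")

# The same Delta Parquet files are now readable by Spark, the SQL endpoint,
# and Power BI -- no engine-specific copy is made.
spark.table("orders").show()
```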
Image source: Microsoft
By adopting this unified approach, organizations can now create a more efficient and streamlined data engineering process. They no longer need to replicate data solely to work with different technologies. This eliminates redundancies, simplifies data management, and promotes a true single source of data for analysis and decision-making.
In summary,?Microsoft’s Fabric represents a significant step towards redefining the data analytics landscape. It allows for the storage and utilization of data in a standardized format, eliminating unnecessary replication and enabling seamless integration with various Microsoft tools. This advancement empowers organizations to work with a true single source of data, streamlining their workflows and maximizing their data assets.
“The Microsoft Fabric platform provides customers with a SaaS-ified, open, lake-centric, full-featured data, analytics, and AI platform that meets all of their data analytics needs.”
Microsoft Fabric targets different sets of users, such as C-level executives, department heads, data scientists, business analysts, and data analysts, by providing a single source of truth, democratized access, reduced duplication, less complexity, greater visibility, an integrated environment, enhanced productivity, and better collaboration. Data, the new oil, becomes a new advantage.
These are the key features Microsoft Fabric provides:
1. Unified Analytics Platform
2. Lake-Centric and Open
3. Empower every user
Image source: Microsoft
Components of Microsoft Fabric:
OneLake
All Fabric services are built on the foundation of the data lake. The Microsoft Fabric lake is called OneLake.
There are two primary ways to start using Fabric:
Lakehouse:
Microsoft Fabric Lakehouse is a cutting-edge data architecture platform designed to efficiently store, manage, and analyze both structured and unstructured data within a unified storage system. By seamlessly integrating structured and unstructured data, the Lakehouse enables organizations to derive valuable insights from their data assets in a single, consolidated location. Notably, the Lakehouse simplifies data access through the automatic generation of a read-only SQL endpoint and a default dataset upon creation. This SQL endpoint provides users with convenient access to the data for analysis and querying purposes.
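As an illustration, here is a minimal sketch of loading a file into a Lakehouse table from a notebook. It assumes the file was already uploaded to the Lakehouse's Files area; the path and schema are illustrative.

```python
# A minimal sketch, assuming a notebook attached to the Lakehouse and a CSV
# already uploaded under Files/ (path and column layout are illustrative).
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("Files/raw/customers.csv")   # relative path inside the Lakehouse
)

# Writing as a Delta table registers it automatically, so it shows up in
# the read-only SQL endpoint and the default dataset.
df.write.format("delta").mode("overwrite").saveAsTable("customers")
```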
Warehouse (Synapse Data Warehouse):
The Warehouse provides comprehensive support for transactional Data Definition Language (DDL) and Data Manipulation Language (DML). It functions as a data warehouse with robust T-SQL capabilities similar to those found in enterprise data warehouses. Managed by a SQL engine, the Warehouse can be updated through pipelines, SQL, Spark, and shortcuts. Files are stored in the Delta Parquet format and are presented as tables. Users can utilize SQL for data updates and transformations.
Unlike the SQL endpoint, where tables and data are generated automatically, the Warehouse gives you complete control over creating tables, loading data, performing transformations, and executing queries, through either the Microsoft Fabric portal or T-SQL commands. Additionally, the Warehouse provides a default dataset for your convenience.
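For illustration, the following hedged sketch runs T-SQL against the Warehouse from Python via pyodbc. The server and database names are placeholders; the real connection string comes from the Warehouse settings in Fabric, and interactive Azure AD login is only one of the possible authentication modes.

```python
# A hedged sketch of working with the Warehouse over its SQL connection
# string using pyodbc. Server and database names are placeholders -- copy
# the real connection string from the Warehouse settings in Fabric.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-warehouse-connection-string>;"
    "Database=<your-warehouse-name>;"
    "Authentication=ActiveDirectoryInteractive;"
)
cur = conn.cursor()

# Full DDL/DML control, unlike the read-only SQL endpoint of a Lakehouse.
cur.execute("CREATE TABLE dbo.sales (id INT, region VARCHAR(20), amount DECIMAL(10,2));")
cur.execute("INSERT INTO dbo.sales VALUES (1, 'East', 120.50), (2, 'West', 75.00);")
conn.commit()

for row in cur.execute("SELECT region, SUM(amount) AS total FROM dbo.sales GROUP BY region;"):
    print(row.region, row.total)
```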
What does this mean for data engineering and data analytics processes?
With the default dataset, you can bring data into Power BI as soon as it arrives in the Lakehouse or Warehouse. This enables the quality assurance team and end users to assess raw data quality and verify the underlying logic directly within Power BI.
As a data engineer, you have the flexibility to bring in data using data pipelines and dataflows, and to transform data using options such as SQL and Spark/notebooks (Python/SQL/Scala), depending on your technology preferences. The choice of technology depends on the following (see the sketch after this list):
1. How much data you have overall.
2. How large individual datasets are.
3. Which technology landscape you are comfortable with.
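As a small illustration of that choice, here is the same aggregation written once with the Spark DataFrame API and once in Spark SQL; the table and column names are assumptions carried over from the earlier sketch.

```python
# The same aggregation two ways, for teams with different preferences
# (table and column names are illustrative).
from pyspark.sql import functions as F

# Option 1: Spark DataFrame API
by_region = (
    spark.table("orders")
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount"))
)
by_region.show()

# Option 2: the equivalent Spark SQL, for teams more at home in SQL
spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM orders
    GROUP BY region
""").show()
```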
In Microsoft Fabric, you can embark on your data journey from the Lakehouse, which provides a default SQL endpoint and a Power BI dataset. You can enhance the dataset by adding measures specific to your analytical needs.
Alternatively, you can start your journey directly from SQL-managed warehouses, which also come with a default Power BI dataset.
To avoid unnecessary data replication, users can create shortcuts in the Lakehouse, which let them access and view data from other Lakehouses and Warehouses. This promotes data efficiency and reduces redundancy.
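Consuming a shortcut from a notebook is no different from reading a local table. A minimal sketch, assuming a shortcut named sales_shortcut has already been created through the Fabric UI:

```python
# A minimal sketch; "sales_shortcut" is an assumed shortcut name created
# beforehand in the Fabric UI. Once it exists, it reads like a local table.
df = spark.table("sales_shortcut")
df.groupBy("region").count().show()

# File-based shortcuts under Files/ are read the same way as local files,
# e.g. spark.read.parquet("Files/shared_data/")
```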
Now, let’s delve into some further details and explore the intricacies of this unified data analytics platform.
Lakehouse
Microsoft Fabric Lakehouse is a data architecture platform that enables storing, managing, and analyzing structured and unstructured data in a single location. It offers flexibility and scalability to handle large volumes of data with various processing and analysis tools.
1. Store, manage, and analyze all data in one location for easy sharing across the enterprise.
2. Flexible and scalable solution for handling diverse data types and sizes.
3. Ingest data from multiple sources and store it in an open format.
4. Easily import data from different sources.
5. Directly upload files from your computer.
6. Connect to 100+ data sources and apply transformations using Dataflows.
7. Copy data at large scale through the Copy activity in pipelines.
8. Store data in an open format for accessibility.
9. Use a fully managed process for converting files into tables.
The Lakehouse creates a serving layer with an auto-generated SQL endpoint and default dataset, allowing users to work directly on delta tables for a seamless experience from data ingestion to reporting.
1. Tables are automatically discovered and registered.
2. SQL endpoint and semantic dataset enable working on delta tables.
3. The auto-generated SQL endpoint is read-only and limited in T-SQL capabilities.
4. Only tables in Delta format are accessible via the SQL endpoint (a conversion sketch follows this list).
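Since only Delta tables surface through the SQL endpoint, a common step is converting raw files into Delta tables. A minimal sketch, assuming Parquet files already sit under the Lakehouse Files area (the path and table name are illustrative):

```python
# A minimal sketch; the landing path and table name are assumptions.
raw = spark.read.parquet("Files/landing/events/")

# saveAsTable writes Delta Parquet under Tables/ and registers the table,
# making it queryable from the SQL endpoint and the default dataset.
raw.write.format("delta").mode("overwrite").saveAsTable("events")
```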
Data Factory — A persona for Data Engineers
Data Factory in Microsoft Fabric provides a cutting-edge data integration experience that empowers both citizen and professional developers to ingest, prepare, and transform data from a wide range of data sources. These sources include databases, data warehouses, Lakehouses, real-time data, and more. With intelligent transformations and a comprehensive set of activities, users can efficiently manipulate and optimize their data.
One of the key features introduced in Data Factory is Fast Copy, which brings rapid data movement capabilities to both dataflows and data pipelines. With Fast Copy, users can seamlessly transfer data between their preferred data stores with blazing speed. This functionality is particularly significant as it allows users to bring data into the Microsoft Fabric Lakehouse and Data Warehouse for advanced analytics and insights.
Data Factory offers two primary high-level features: dataflows and pipelines.
Dataflows provide a robust data transformation environment with a dataflows designer offering over 300 transformations. This comprehensive set of transformations ensures that users can easily and flexibly manipulate their data, surpassing the capabilities of other tools. Moreover, Data Factory leverages intelligent AI-based data transformations to further enhance the efficiency and effectiveness of data manipulation.
Data pipelines deliver powerful data orchestration capabilities out-of-the-box. Users can leverage these capabilities to compose flexible data workflows that align with the specific needs of their enterprise. The rich set of data orchestration features empowers users to seamlessly manage and optimize their data processing workflows, enabling smooth data movement and transformations.
Data Science — A persona for Data Scientists
Microsoft Fabric provides comprehensive Data Science experiences that empower users to perform end-to-end data science workflows, enabling data enrichment and generating valuable business insights. With these experiences, users can seamlessly execute a wide range of activities throughout the entire data science process. This includes tasks such as data exploration, preparation, and cleansing, as well as experimentation, modeling, model scoring, and serving predictive insights to BI reports.
The Data Science experience within Microsoft Fabric enables users to effortlessly build, deploy, and operationalize machine learning models. It seamlessly integrates with Azure Machine Learning, offering built-in capabilities for experiment tracking and model registry. This integration enhances the workflow for data scientists, enabling them to enrich organizational data with predictions derived from machine learning models. These predictions can then be seamlessly integrated into business intelligence (BI) reports, allowing business analysts to incorporate valuable insights into their reporting processes.
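As an illustration of the experiment-tracking workflow, here is a hedged sketch for a Fabric notebook, which ships with an MLflow-backed tracking store; the dataset, model, and experiment name are all illustrative.

```python
# A hedged sketch of experiment tracking in a Fabric notebook, which comes
# with an MLflow-backed tracking store. All names below are illustrative.
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data stands in for real organizational data.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("churn-experiment")
with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X_train, y_train)
    mlflow.log_metric("accuracy", model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, "model")  # tracked for later scoring
```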
OneLake
Microsoft Fabric’s foundation lies in the data lake, which is also known as OneLake. OneLake is an integral part of the Fabric service, providing a centralized location to store all organizational data, ensuring seamless operation across various experiences.
OneLake, similar to OneDrive, is automatically included with every Microsoft Fabric tenant. It serves as the single repository for all analytics data, catering to the entire organization. This logical data lake, OneLake, is built on top of Azure Data Lake Storage (ADLS) Gen2, offering a unified Software-as-a-Service (SaaS) experience for both professional and citizen developers. By simplifying the user experience, OneLake eliminates the need for users to have a deep understanding of infrastructure concepts like resource groups, Role-Based Access Control (RBAC), Azure Resource Manager, redundancy, or regions. In fact, having an Azure account is not even a requirement to utilize OneLake.
OneLake embraces openness at every level, leveraging the capabilities of Azure Data Lake Storage Gen2. It supports any type of file, whether it is structured or unstructured. Within the Fabric ecosystem, all data items such as data warehouses and lakehouses automatically store their data in OneLake in the Delta Parquet format. This means that regardless of whether a data engineer loads data into a lakehouse using Spark or a SQL developer loads data into a fully transactional data warehouse using T-SQL, all contributions are directed towards building the same unified data lake. Tabular data, specifically, is stored in OneLake in the Delta Parquet format.
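Because OneLake exposes ADLS Gen2-compatible endpoints, data can also be addressed directly with an abfss:// URI. A minimal sketch, with placeholder workspace and lakehouse names:

```python
# A minimal sketch of OneLake's openness via its ADLS Gen2-compatible
# endpoint; workspace, lakehouse, and table names are placeholders.
path = (
    "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/"
    "<lakehouse>.Lakehouse/Tables/orders"
)
df = spark.read.format("delta").load(path)
df.printSchema()
```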
You can find more details here: https://learn.microsoft.com/en-us/fabric/
How to enable Fabric: https://learn.microsoft.com/en-us/fabric/admin/fabric-switch
Microsoft Fabric license details: https://learn.microsoft.com/en-us/fabric/enterprise/licenses
Microsoft Fabric administration: https://learn.microsoft.com/en-us/fabric/admin/microsoft-fabric-admin
Microsoft is revolutionizing the data landscape, offering an unparalleled data experience that removes concerns about the ever-changing technological landscape. Moreover, it provides enhanced visibility into the data engineering process, aligning it more effectively with Agile/Scrum methodologies. Let’s delve into the exciting realm of data engineering with the public preview.
Microsoft Fabric: A Brief Introduction
How to enable Microsoft Fabric on Power BI Service | Start the Fabric (Preview) trial
Microsoft Fabric - What it is, Why Use, Who Should, When and How to use
Microsoft Fabric: Various Components| Microsoft Fabric Create Workspace, Lakehouse, Warehouse
Microsoft Fabric: How to load data in Lakehouse using Dataflow Gen2
Microsoft Fabric: How to load data in Warehouse using Dataflow Gen2
Introduction to Microsoft Fabric in Hindi | What it is, Why Use, Who Should, When and How to use
To take your Fabric journey to the next level, I will be providing more examples in my upcoming blogs and videos.
Complete Power BI in one Video — 11 Hours
Click here to access all my blogs and videos in a jiffy via an exclusive glossary using Power BI. Please like, share, and comment on these blogs. I would appreciate your suggestions for improvement, challenges, and topic ideas so that I can explore these in more depth.
In addition, I have over 500 videos on my YouTube channel related to Power BI, Tableau, and Incorta. With the help of these videos, you will learn hundreds of DAX and Power Query functions, in addition to hundreds of Power BI use cases that will assist you in becoming an expert in Power BI. Make sure you subscribe, like, and share it with your friends.