How modern data-analytics architecture works with Azure Databricks

We designed this modern data architecture around the following criteria:

  • Unify data, analytics, and AI workloads.
  • Run efficiently and reliably at any scale.
  • Provide insights through analytics dashboards, operational reports, or advanced analytics.

The Architecture

Dataflow

  1. Azure Databricks ingests raw streaming data from Azure Event Hubs.
  2. Azure Data Factory loads raw batch data into Data Lake Storage.
  3. For data storage:

  • Data Lake Storage houses data of all types, such as structured, unstructured, and semi-structured. It also stores batch and streaming data.
  • Delta Lake forms the curated layer of the data lake. It stores the refined data in an open-source format.
  • Azure Databricks works well with architectures that organize data into layers, as shown below:

- Bronze: Holds raw data.

- Silver: Contains cleaned, filtered data.

- Gold: Stores aggregated data that's useful for business analytics.
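To make the three layers concrete, here is a minimal sketch in plain Python rather than Spark. The sample readings, the field names, and the `to_silver` and `to_gold` helpers are all hypothetical; in practice each layer would be a Delta table and each step a Databricks job.

```python
from collections import defaultdict

# Bronze: raw events exactly as they arrived (hypothetical sample data).
bronze = [
    {"device": "a", "temp": "21.5"},
    {"device": "a", "temp": "bad"},    # malformed reading
    {"device": "b", "temp": "19.0"},
    {"device": "b", "temp": "23.0"},
    {"device": None, "temp": "20.0"},  # missing device id
]


def to_silver(rows):
    """Clean and filter: drop rows with no device or a non-numeric reading."""
    silver_rows = []
    for row in rows:
        if not row["device"]:
            continue
        try:
            temp = float(row["temp"])
        except ValueError:
            continue
        silver_rows.append({"device": row["device"], "temp": temp})
    return silver_rows


def to_gold(rows):
    """Aggregate: average temperature per device, ready for reporting."""
    readings = defaultdict(list)
    for row in rows:
        readings[row["device"]].append(row["temp"])
    return {device: sum(vals) / len(vals) for device, vals in readings.items()}


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

The point of the split is that bronze is never modified, so silver and gold can always be rebuilt from it if the cleaning or aggregation rules change.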


4. The analytical platform ingests data from disparate batch and streaming sources, such as IoT Hub.

Data scientists use this data for these tasks:

- Data preparation.

- Data exploration.

- Model preparation.

- Model training.

5. MLflow manages parameter, metric, and model tracking in data science code runs. The coding possibilities are flexible:

  • Code can be in SQL, Python, R, and Scala.
  • Code can use popular open-source libraries and frameworks such as Koalas, pandas, and scikit-learn, which are pre-installed and optimized.
  • Practitioners can optimize for performance and cost with single-node and multi-node compute options.
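As a rough illustration of what parameter, metric, and model tracking means in step 5, the sketch below records three training runs and picks the best one. It is a plain-Python stand-in, not the MLflow API: the `Run` class, the `train` function, and the toy data set are assumptions made for this example.

```python
from dataclasses import dataclass, field

# Toy data set following y = 2x (hypothetical).
DATA = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


@dataclass
class Run:
    """One training run: its parameters, its metrics, and the resulting model."""
    params: dict
    metrics: dict = field(default_factory=dict)
    model: object = None


def train(learning_rate, epochs):
    """Fit the slope w of y = w * x by gradient descent; return (w, mse)."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in DATA) / len(DATA)
        w -= learning_rate * grad
    mse = sum((w * x - y) ** 2 for x, y in DATA) / len(DATA)
    return w, mse


runs = []
for lr in (0.01, 0.05, 0.1):
    w, mse = train(lr, epochs=20)
    runs.append(Run(params={"learning_rate": lr, "epochs": 20},
                    metrics={"mse": mse},
                    model=w))

# Compare runs on the tracked metric and keep the best model.
best = min(runs, key=lambda run: run.metrics["mse"])
print(best.params, best.metrics)
```

A tracking service does the same bookkeeping, but persists each run so that results can be compared across notebooks, users, and time.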

6. Machine learning models are available in several formats:

  • Azure Databricks stores information about models in the MLflow Model Registry. The registry makes models available through batch, streaming, and REST APIs.
  • The solution can also deploy models to Azure Machine Learning web services or Azure Kubernetes Service (AKS).

7. Services that work with the data connect to a single underlying data source to ensure consistency. For instance, users can run SQL queries on the data lake with Azure Databricks SQL Analytics. This service:

  • Provides a query editor and catalog, the query history, basic dashboarding, and alerting.
  • Uses integrated security that includes row-level and column-level permissions.
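The idea behind row-level and column-level permissions can be sketched in a few lines of plain Python. The table contents, the role names, and the `POLICY` structure are hypothetical and only illustrate the concept; the actual service enforces such rules inside the query engine.

```python
# Hypothetical table contents.
ROWS = [
    {"region": "EU", "customer": "Ada", "revenue": 120, "email": "ada@example.com"},
    {"region": "US", "customer": "Bob", "revenue": 80, "email": "bob@example.com"},
    {"region": "EU", "customer": "Cy", "revenue": 50, "email": "cy@example.com"},
]

# Hypothetical policy: which rows and which columns each role may see.
POLICY = {
    "eu_analyst": {
        "row_filter": lambda row: row["region"] == "EU",  # row-level rule
        "columns": ("region", "customer", "revenue"),      # column-level rule
    },
    "admin": {
        "row_filter": lambda row: True,
        "columns": ("region", "customer", "revenue", "email"),
    },
}


def query(role):
    """Return only the rows and columns that the role is allowed to read."""
    rule = POLICY[role]
    return [
        {column: row[column] for column in rule["columns"]}
        for row in ROWS
        if rule["row_filter"](row)
    ]


print(query("eu_analyst"))
```

Because every consumer reads the same underlying table through such a policy, two users can run the same query and each see only what they are entitled to.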

8. Power BI generates analytical and historical reports and dashboards from the unified data platform.

9. Users can export gold data sets out of the data lake into Azure Synapse via the optimized Synapse connector. SQL pools in Azure Synapse provide a data warehousing and compute environment.

10. The solution uses Azure services for collaboration, performance, reliability, governance, and security:

  • Microsoft Purview provides data discovery services, sensitive data classification, and governance insights across the data estate.
  • Azure DevOps offers continuous integration and continuous deployment (CI/CD) and other integrated version control features.
  • Azure Key Vault securely manages secrets, keys, and certificates.
  • Azure Active Directory (Azure AD) provides single sign-on (SSO) for Azure Databricks users. Azure Databricks supports automated user provisioning with Azure AD for these tasks:

- Creating new users.

- Assigning each user an access level.

- Removing users and denying them access.

  • Azure Monitor collects and analyzes Azure resource telemetry. By proactively identifying problems, this service maximizes performance and reliability.
  • Azure Cost Management and Billing provide financial governance services for Azure workloads.
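The three Azure AD provisioning tasks listed above (creating users, assigning access levels, removing users) amount to reconciling the directory with the workspace. The sketch below shows that reconciliation in plain Python; the `plan_sync` helper, the user names, and the access levels are hypothetical, and the real integration is handled by the provisioning service rather than by hand-written code.

```python
def plan_sync(directory, workspace):
    """Compare the directory (source of truth) with the workspace.

    Both arguments map a user name to an access level.
    Returns the users to create, the users to update, and the users to remove.
    """
    create = {user: level for user, level in directory.items()
              if user not in workspace}
    update = {user: level for user, level in directory.items()
              if user in workspace and workspace[user] != level}
    remove = sorted(user for user in workspace if user not in directory)
    return create, update, remove


# Hypothetical state of the directory and of the workspace.
directory = {"ada": "admin", "bob": "user", "cy": "user"}
workspace = {"bob": "admin", "cy": "user", "dan": "user"}

create, update, remove = plan_sync(directory, workspace)
print(create, update, remove)
```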


Definition of the Components used in this Architecture:

  1. Azure Databricks is a data analytics platform. Its fully managed Spark clusters process large streams of data from multiple sources. Azure Databricks cleans and transforms structureless data sets. It combines the processed data with structured data from operational databases or data warehouses. Azure Databricks also trains and deploys scalable machine learning and deep learning models.
  2. Event Hubs is a big data streaming platform. As a platform as a service (PaaS), this event ingestion service is fully managed.
  3. Data Factory is a hybrid data integration service. You can use this fully managed, serverless solution to create, schedule, and orchestrate data transformation workflows.
  4. Data Lake Storage is a scalable and secure data lake for high-performance analytics workloads. This service can manage multiple petabytes of information while sustaining hundreds of gigabits of throughput. The data may be structured, semi-structured, or unstructured. It typically comes from multiple, heterogeneous sources like logs, files, and media.
  5. Azure Databricks SQL Analytics runs queries on data lakes. This service also visualizes data in dashboards.
  6. Machine Learning is a cloud-based environment that helps you build, deploy, and manage predictive analytics solutions. With these models, you can forecast behavior, outcomes, and trends.
  7. AKS is a highly available, secure, and fully managed Kubernetes service. AKS makes it easy to deploy and manage containerized applications.
  8. Azure Synapse is an analytics service for data warehouses and big data systems. This service integrates with Power BI, Machine Learning, and other Azure services.
  9. Azure Synapse connectors provide a way to access Azure Synapse from Azure Databricks. These connectors efficiently transfer large volumes of data between Azure Databricks clusters and Azure Synapse instances.
  10. SQL pools provide a data warehousing and compute environment in Azure Synapse. The pools are compatible with Azure Storage and Data Lake Storage.
  11. Delta Lake is a storage layer that uses an open file format. This layer runs on top of cloud storage such as Data Lake Storage. Delta Lake supports data versioning, rollback, and transactions for updating, deleting, and merging data.
  12. MLflow is an open-source platform for the machine learning lifecycle. Its components monitor machine learning models during training and running. MLflow also stores models and loads them in production.
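The Delta Lake entry above mentions versioning, rollback, and merging. A toy model in plain Python can show how these fit together: every change is committed as a new snapshot, so older versions stay readable and can be restored. The `VersionedTable` class and its methods are assumptions made for this illustration and are not the Delta Lake API.

```python
import copy


class VersionedTable:
    """Toy table keyed by id; every commit appends a new snapshot."""

    def __init__(self):
        self._versions = [{}]  # version 0 is the empty table

    @property
    def version(self):
        return len(self._versions) - 1

    def read(self, version=None):
        """Return a copy of the latest snapshot, or of an older version."""
        index = self.version if version is None else version
        return copy.deepcopy(self._versions[index])

    def merge(self, rows, key="id"):
        """Upsert: update rows whose key exists, insert the rest, in one commit."""
        snapshot = self.read()
        for row in rows:
            snapshot[row[key]] = dict(row)
        self._versions.append(snapshot)

    def delete(self, predicate):
        """Remove every row matching the predicate, in one commit."""
        snapshot = {k: v for k, v in self.read().items() if not predicate(v)}
        self._versions.append(snapshot)

    def rollback(self, version):
        """Restore an older version by committing it as the newest snapshot."""
        self._versions.append(self.read(version))


table = VersionedTable()
table.merge([{"id": 1, "qty": 5}, {"id": 2, "qty": 7}])  # version 1
table.merge([{"id": 2, "qty": 9}, {"id": 3, "qty": 1}])  # version 2: update 2, insert 3
table.delete(lambda row: row["qty"] < 5)                 # version 3: drops id 3
table.rollback(1)                                        # version 4: same data as version 1
print(table.version, table.read())
```

Note that the rollback does not erase history: versions 2 and 3 remain readable, which is what makes auditing and reproducing earlier results possible.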


Reporting and Governing Components:

  1. Power BI is a collection of software services and apps. These services create and share reports that connect and visualize unrelated sources of data. Together with Azure Databricks, Power BI can provide root cause determination and raw data analysis.
  2. Microsoft Purview manages on-premises, multicloud, and software as a service (SaaS) data. This governance service maintains data landscape maps. Features include automated data discovery, sensitive data classification, and data lineage.
  3. Azure DevOps is a DevOps orchestration platform. This SaaS provides tools and environments for building, deploying, and collaborating on applications.
  4. Azure Key Vault stores and controls access to secrets such as tokens, passwords, and API keys. Key Vault also creates and controls encryption keys and manages security certificates.
  5. Azure AD offers cloud-based identity and access management services. These features provide a way for users to sign in and access resources.
  6. Azure Monitor collects and analyzes data on environments and Azure resources. This data includes app telemetry, such as performance metrics and activity logs.
  7. Azure Cost Management and Billing manage cloud spending. By using budgets and recommendations, this service organizes expenses and shows how to reduce costs.

Real-life use case that has implemented this solution:

Swiss Re Group implemented this solution for its Property & Casualty Reinsurance division.

Fahad Ali

Pre-Sales Solutions Architect & Technical Storyteller | Microsoft Azure Certified | Data & Analytics Lead: Compelling Presentations, Agile Data Solutions | Turning Data into Growth Engines for Cloud-First Businesses

For a more in-depth treatment, the same article is available on the Microsoft architecture site: https://learn.microsoft.com/en-us/azure/architecture/solution-ideas/articles/azure-databricks-modern-analytics-architecture
