Hadoop to Azure Databricks Migration


Problem or Pain point statement:

  1. Ongoing support and maintenance challenges that include setting up servers, networking, and storage, installing software, and configuring best practices for the deployed technologies.
  2. An operations engineering team is required for ongoing upgrades, patches, and maintenance.
  3. Technical challenges such as small-file performance.
  4. Oozie, the packaged service for scheduling and workflow automation, was too complex and difficult to use, forcing customers to choose their own enterprise schedulers.
  5. The immutable nature of datasets stored in HDFS required extensive training to teach staff how to store and process data in HDFS.

What Databricks brings to the picture:

  1. A unified platform for data and AI: One cloud platform for massive scale data engineering and collaborative data science.
  2. Shared Notebooks: Collaborate in different languages from Python to Scala to SQL, and share code via notebooks with revision history and GitHub integration.
  3. Data Integration: Integrating with source systems is made easy with a wide range of connectors.
  4. Reliable and Scalable Clusters: Databricks provides automated cluster management, spinning up clusters, determining their optimal size for the job, and scaling them down when the job is done.
  5. Job Scheduling: Databricks allows for job scheduling via Notebooks. There is no need to rewrite your code for production; your working Notebook can be put right into production, and you can chain Notebooks to create a workflow that enables a component architecture.
  6. Delta Lake: Implemented as traditional Parquet files with a transaction log that defines which data and files are the most recent, so when a job queries the datasets, users are presented with accurate, consistent data. Other key features include ACID transactions, scalable metadata handling, time travel, schema enforcement, schema evolution, audit history, and full DML support offering UPDATE, DELETE, and MERGE INTO capabilities (see the sketch after this list).
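
As a hedged illustration of how this might look in a Databricks notebook (where `spark` is predefined), the sketch below writes a Delta table, upserts into it with MERGE, and reads an earlier version with time travel. The table and column names are assumptions made for the example, not part of the migration plan.

```python
# Minimal Delta Lake sketch for a Databricks notebook (`spark` is predefined there).
# Table and column names (customers_delta, id, email) are illustrative assumptions.
from delta.tables import DeltaTable

# Write an initial DataFrame as a Delta table: Parquet files plus a transaction log.
spark.createDataFrame([(1, "old@example.com")], ["id", "email"]) \
    .write.format("delta").mode("overwrite").saveAsTable("customers_delta")

# MERGE INTO: upsert a batch of incoming records into the Delta table.
updates_df = spark.createDataFrame(
    [(1, "alice@example.com"), (2, "bob@example.com")], ["id", "email"])

target = DeltaTable.forName(spark, "customers_delta")
(target.alias("t")
    .merge(updates_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Time travel: query the table as it existed at an earlier version.
previous = spark.sql("SELECT * FROM customers_delta VERSION AS OF 0")
```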


The Migration Plan: a platform migration can see cost savings of ~30%

Current State Evaluation: Gather information about the current Hadoop environment, including tools and technologies, data sources, use cases, resources, integrations, and service level agreements. Review HDInsight sizings.

Create a complete inventory of the tools and technologies used in the current environment, native as well as third party. This helps identify tools that can be retired during migration because Databricks has a built-in equivalent. Evaluate whether the remaining tools can be integrated seamlessly into a cloud-native environment.

Identify the data sources that deliver results. They can be external to Hadoop (e.g., databases, ERP, CRM, streaming) or internal data sources used for integrations feeding data out of the platform. Identify data sets that can be marked as unnecessary to move to the cloud.

Fully understand the applications running in the current Hadoop environment. This will better define the amount of effort required to migrate the on-premises environment to the cloud.

Understanding the tools and processes in existing staff's daily workflows, and making alternatives available, will ensure that people feel comfortable using Databricks.

For external applications accessing the Hadoop environment, confirm that the cloud-native Databricks environment can serve data to them. Authentication and authorization access patterns will need to be evaluated; however, external applications connecting over JDBC or ODBC should work out of the box.

Security and governance configurations must be collected from tools like Sentry, Ranger, and HDFS ACLs. Use of Kerberos, Active Directory, encryption at rest, and encryption in transit must be taken into account. MapR filesystem permissions, ACLs, and Access Control Expressions must also be reviewed.

Implementation: Using the information gathered during the evaluation phase, a prioritized list of data sources, applications, and tools will be selected for migration. Divide the migration evaluation into two indicative pipelines (medium and large size) to support costing and identify possible challenges. The most likely approach to meet timelines will be lift and shift, with iterative optimizations.

Storage Migration: Using the data source inventory, move data stored in HDFS to the cloud vendor's storage layer (Blob Storage or S3). Apply the same information architecture to the new cloud storage file system; the resulting folder and file structure should be a one-to-one match of the HDFS file system. You may face challenges migrating role-based access controls to the new cloud storage system; sometimes a tool will need to be developed to migrate these policies for each cloud vendor. A verification sketch follows below.
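
A minimal verification sketch, assuming ADLS Gen2 (abfss) as the target, an illustrative storage account and container, and a cluster that can still list the on-premises HDFS paths (if it cannot, an exported HDFS manifest would be compared instead). It only checks that the copied folder structure mirrors HDFS; the bulk copy itself is typically done with a tool such as DistCp or a vendor migration service.

```python
# Hypothetical structure check for a Databricks notebook (`dbutils` is predefined there).
# The namenode address, storage account, container, and paths are illustrative assumptions.

hdfs_root = "hdfs://onprem-namenode:8020/data/warehouse"
adls_root = "abfss://datalake@examplestorage.dfs.core.windows.net/data/warehouse"

def list_relative_paths(root):
    """Recursively list file paths under `root`, relative to it."""
    paths, stack = set(), [root]
    while stack:
        current = stack.pop()
        for entry in dbutils.fs.ls(current):
            if entry.isDir():
                stack.append(entry.path)
            else:
                paths.add(entry.path[len(root):])
    return paths

# After the copy completes, the two listings should match one to one.
missing = list_relative_paths(hdfs_root) - list_relative_paths(adls_root)
print(f"{len(missing)} files are not yet present in the cloud storage layer")
```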

Hive Metastore Migration: The next step is to migrate the Hive Metastore from Hadoop to Databricks. The Hive Metastore contains the location and structure of all the data assets in the Hadoop environment, and migrating it is required for users to query tables in Databricks notebooks using SQL statements. During the migration process, the locations of the underlying datasets will need to be updated to reference the Databricks file system, as in the sketch below.
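
A hedged sketch of repointing a single migrated table at its new cloud location; the database, table, and storage path are illustrative assumptions.

```python
# Hypothetical example for a Databricks notebook: update a migrated Hive table so its
# metadata points at the new cloud storage location. Names and paths are illustrative.
spark.sql("""
    ALTER TABLE sales_db.transactions
    SET LOCATION 'abfss://datalake@examplestorage.dfs.core.windows.net/data/warehouse/sales_db/transactions'
""")

# If the table is partitioned, rediscover its partitions and refresh cached metadata.
spark.sql("MSCK REPAIR TABLE sales_db.transactions")
spark.catalog.refreshTable("sales_db.transactions")
```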

HiveQL/Impala Migration: Look at how Hive could potentially be moved to Spark SQL. Customers often use Hive SQL or Impala files to execute pieces of a data pipeline or workflow. After migrating both the storage assets and the Hive Metastore, these types of workflow items can use Spark SQL within a notebook, for example:
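
The example below is illustrative: a HiveQL-style aggregation run unchanged through Spark SQL in a notebook. The table, columns, and target table are assumptions made for the sketch.

```python
# Hypothetical illustration: a HiveQL/Impala query executed as Spark SQL in a notebook.
# Table and column names are illustrative assumptions.
daily_revenue = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM sales_db.transactions
    WHERE order_date >= '2023-01-01'
    GROUP BY order_date
""")

# Persist the result as a Delta table so downstream workflow steps can consume it.
daily_revenue.write.format("delta").mode("overwrite").saveAsTable("sales_db.daily_revenue")
```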

MapR DB: Convert heavy OLTP application data pipelines into Spark Streaming applications on Databricks, landing data in Azure Cosmos DB or DynamoDB. Traditional analytics can be implemented as Spark Streaming into Delta Lake, as sketched below.
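
For the analytics path, a minimal Structured Streaming sketch, assuming the MapR feed has been replaced by a Kafka-compatible source; the broker address, topic, checkpoint path, and target table are illustrative assumptions, and `toTable` assumes a recent Databricks Runtime (Spark 3.1+).

```python
# Hypothetical Structured Streaming sketch: ingest events from a Kafka-compatible source
# and land them in a Delta Lake table. Broker, topic, paths, and table are assumptions.
from pyspark.sql.functions import col

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1.example.com:9092")
    .option("subscribe", "orders")
    .load())

parsed = events.select(
    col("key").cast("string"),
    col("value").cast("string"),
    col("timestamp"))

(parsed.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/delta/_checkpoints/orders_raw")
    .toTable("ops_db.orders_raw"))
```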

MapR Streams (MapR Event Store): These types of integrations will need to be migrated to a Spark Streaming application.

Apache Kudu: Use cases will primarily be migrated to Delta Lake, as it offers comparable capabilities. Impala scripts and Spark applications utilizing Kudu will be migrated accordingly.

Apache HBase: Use cases that are primarily OLTP driven will be converted to Spark Streaming applications utilizing Azure Cosmos DB or DynamoDB for storage. Analytics applications can be migrated to Spark Streaming and Delta Lake.

Apache Solr: Most cases that utilize Solr will migrate to Amazon Elasticsearch Service or Azure Cognitive Search.

Apache Spark Application Migration: Any Apache Spark version 1 application will need to be refactored to Apache Spark version 2, as Databricks does not support Apache Spark version 1. Under a lift-and-shift model, as long as the source data is available to Databricks and the application can be packaged as a JAR, these applications should run on Databricks. A typical refactor is sketched below.
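
The most common refactor is replacing SQLContext/HiveContext with SparkSession. A hedged before/after sketch follows, with an assumed application name and data path.

```python
# Illustrative Spark 1.x -> 2.x refactor (application name and path are assumptions).

# Spark 1.x style, no longer supported on Databricks:
#   from pyspark import SparkContext
#   from pyspark.sql import HiveContext
#   sc = SparkContext(appName="etl-job")
#   sqlContext = HiveContext(sc)
#   df = sqlContext.read.parquet("/data/events")

# Spark 2.x+ style: a single SparkSession entry point.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-job").enableHiveSupport().getOrCreate()
df = spark.read.parquet("/data/events")
df.createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) AS n FROM events").show()
```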

Apache Spark to Delta Lake Migration: Applications that perform a small number of updates or deletes are prime candidates for this refactoring. Other considerations are performance, ACID transactions, schema enforcement, and data consistency. Converting flat files to the Delta format removes some of the small-file challenges and provides better compression, performance, and SQL access, for example:
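
A hedged sketch of the flat-file-to-Delta step; the paths and table names are illustrative, and OPTIMIZE assumes Databricks (or a Delta Lake version that supports it).

```python
# Hypothetical conversion sketch. Paths and table names are illustrative assumptions.

# Convert an existing Parquet dataset to Delta in place (adds the transaction log).
spark.sql("CONVERT TO DELTA parquet.`/data/warehouse/events`")

# Or rewrite a flat-file (CSV) dataset as a managed Delta table.
(spark.read.option("header", "true").csv("/data/raw/events_csv")
    .write.format("delta").mode("overwrite").saveAsTable("analytics_db.events"))

# Compact many small files into fewer, larger ones to address the small-file problem.
spark.sql("OPTIMIZE analytics_db.events")
```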

Look at VM types to see if further optimizations to performance and cost can be made.

Where possible, move more workloads from interactive to automated clusters (lower cost), for example by scheduling notebooks as jobs, as sketched below.
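
A heavily hedged sketch of scheduling a working notebook as an automated (jobs-compute) job via the Databricks Jobs API 2.1; the workspace URL, token handling, notebook path, node type, and schedule are illustrative assumptions.

```python
# Hypothetical sketch: create a scheduled job that runs a notebook on a job cluster,
# which is billed at automated (rather than interactive) rates. All values are assumptions.
import os
import requests

workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

job_spec = {
    "name": "nightly-etl",
    "tasks": [{
        "task_key": "etl",
        "notebook_task": {"notebook_path": "/Repos/team/etl/nightly_load"},
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 4,
        },
    }],
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(f"{workspace_url}/api/2.1/jobs/create", headers=headers, json=job_spec)
resp.raise_for_status()
print("Created job", resp.json()["job_id"])
```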

Review pipeline efficiency and code; this is an ongoing optimization likely to continue after migration.

Validation Phase: The final phase is to validate the outcome of the migration from Hadoop to Databricks. This step should be performed using traditional A -> B testing: the customer runs the existing Hadoop implementation alongside the cloud-native Databricks offering, and scripts and processes are developed to ensure the results delivered in Hadoop match those in Databricks. As applications and use cases are cleared and all checkouts have been performed, they can be shut down in Hadoop. A comparison sketch follows below.
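
A minimal comparison sketch, assuming the Hadoop result set has been exported to cloud storage and that both outputs share the same columns; the paths and table names are illustrative assumptions.

```python
# Hypothetical validation sketch: compare output produced by the legacy Hadoop pipeline
# (exported to cloud storage) with the same output produced on Databricks.
# Paths and table names are assumptions; columns must match in name and type.
hadoop_result = spark.read.parquet("/mnt/validation/hadoop/daily_revenue")
databricks_result = spark.table("sales_db.daily_revenue").select(hadoop_result.columns)

print("Hadoop rows:    ", hadoop_result.count())
print("Databricks rows:", databricks_result.count())

# Rows present on one side but not the other; both sets should be empty.
only_in_hadoop = hadoop_result.exceptAll(databricks_result)
only_in_databricks = databricks_result.exceptAll(hadoop_result)
assert only_in_hadoop.count() == 0 and only_in_databricks.count() == 0, "Results differ"
```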
