Azure Databricks: An Intro
For those of you who are familiar with the cloud and the machine learning field, Azure and Databricks are two terms you have probably heard quite often. For those who have not heard of them before, Azure is Microsoft’s cloud platform offering IaaS, PaaS and many other services, while Databricks is a Unified Analytics Platform from Matei Zaharia and the rest of the team behind Apache Spark. Databricks is also available on AWS, but for the purposes of this article I will primarily be touching on the Azure variant in this brief intro.
The value proposition outlined by Databricks is that it helps “accelerate innovation by unifying data science, engineering and business”. The key benefit I see in Databricks is the value it brings to both data engineers and data scientists: it allows complex ETL pipelines, as well as ecosystem integration across a variety of services such as Hadoop, Kafka, Parquet and TensorFlow, to be carried out seamlessly, reducing the time taken to move the latest AI algorithms from development and QA environments into production. (Yep, I am a fan, as you may have gathered from the above sentence :) )
With the cloud offerings of Databricks on both Azure and AWS, infrastructure complexity is minimized to a great degree. The Pay-As-You-Go model for spinning up clusters and the auto-scaling feature allow the data analytics/business intelligence and IT teams of an organization to “reap the benefits of a fully managed service” and “focus more on innovation”, to quote verbatim from the Databricks website.
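To make the auto-scaling point a little more concrete, here is a minimal sketch of defining such a cluster through the Databricks REST API (clusters/create). The workspace URL, token, runtime version and VM size below are placeholders I have assumed for illustration, not values from any particular workspace:

```python
import requests

# Placeholders - replace with your own workspace URL and personal access token.
WORKSPACE_URL = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

# Cluster spec with auto-scaling: Databricks adds or removes workers between
# min_workers and max_workers based on load, and the cluster terminates itself
# after 30 idle minutes, so you only pay for what you actually use.
cluster_spec = {
    "cluster_name": "autoscaling-demo",
    "spark_version": "5.5.x-scala2.11",   # example runtime version
    "node_type_id": "Standard_DS3_v2",    # example Azure VM size
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(response.json())  # returns the new cluster_id on success
```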
In addition, the 99.99% SLA offered by most cloud vendors enables organizations to confidently run their business-critical applications. Azure Databricks offers integration with Azure Active Directory (AD) and also integrates with Azure databases and data stores such as SQL Data Warehouse, Cosmos DB, Data Lake Store and Blob Storage.
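As a small example of what that integration looks like in practice, the sketch below reads a Parquet dataset from Blob Storage inside a Databricks notebook (where `spark` and `dbutils` are already provided). The storage account, container, secret scope and path are placeholders I have assumed:

```python
# Placeholders - substitute your own storage account and container names.
storage_account = "<storage-account>"
container = "<container>"

# Authenticate to Blob Storage with an account key kept in a Databricks secret scope,
# rather than hard-coding the key in the notebook.
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.blob.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="storage-account-key"),
)

# Read a Parquet dataset straight out of Blob Storage into a Spark DataFrame.
df = spark.read.parquet(
    f"wasbs://{container}@{storage_account}.blob.core.windows.net/events/"
)
df.show(5)
```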
In later articles I plan to cover a bit more on how Spark works on Databricks, as well as a project being outlined with the Analytics/BI team at John Keells IT to build a Change-Data-Capture (CDC) mechanism for database logs using Apache Kafka with Databricks. That project is still in its infancy, and we are debating the pros and cons of the approach.
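Since that project is still being scoped, the following is only a rough sketch of the general pattern rather than our actual design: Spark Structured Streaming consuming change records from a Kafka topic and landing them as a Delta table. The broker addresses, topic name and storage paths are assumed placeholders, and it runs in a Databricks notebook where `spark` is available:

```python
# Consume change records from a (hypothetical) CDC topic on Kafka.
changes = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "db-change-log")   # placeholder topic name
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka delivers key/value as binary; cast to strings before downstream parsing.
parsed = changes.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")

# Continuously append the parsed records to a Delta table, with a checkpoint
# location so the stream can recover from failures.
(
    parsed.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/cdc-demo")
    .start("/mnt/delta/cdc-demo")
)
```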