Apache Spark vs. Databricks
Apache Spark is one of the main data processing engines in the data lakehouse architecture. It offers speed and ease of use across a wide range of use cases.
However, Spark on its own lacks many essential features that are needed in practice.
Databricks builds on top of Spark and has created an ecosystem that supports end-to-end solution architecture. Databricks was founded by the original creators of Apache Spark. It is a commercial product, but it offers a free Community Edition with many features. Below are the key features that Databricks brings to the table:
ACID Transactions via Delta Lake Integration
ACID transactions guarantee that each read, write, or modification of a table has the following properties:
Atomicity: Either the entire statement is executed, or none of it is executed.
Consistency: Transactions change tables only in predefined, predictable ways, so that corruption or errors in your data do not create unintended consequences for the integrity of the table.
Isolation: When multiple users are reading and writing from the same table all at once, isolation of their transactions ensures that the concurrent transactions don't interfere with or affect one another.
Durability: Ensures that changes to your data made by successfully executed transactions will be saved, even in the event of system failure.
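To make atomicity and isolation more concrete, here is a toy sketch of an append-only commit log, loosely modeled on the way Delta Lake records each table version as a zero-padded, numbered JSON file. It is not Delta Lake's implementation: the function names `commit` and `snapshot` and the flat `add`/`remove` action format are simplifications chosen for illustration, and a local directory stands in for cloud storage.

```python
import json
import os
import tempfile


def commit(log_dir, version, actions):
    """Try to record `actions` as table version `version`.

    Returns True on success, or False if another writer already
    committed that version (a real writer would re-read and retry).
    """
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # Mode "x" fails if the file already exists, so only one
        # writer can claim a given version number.
        with open(path, "x") as f:
            for action in actions:
                f.write(json.dumps(action) + "\n")
        return True
    except FileExistsError:
        return False


def snapshot(log_dir):
    """Replay the log in version order and return the live data files."""
    live = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"])
                elif "remove" in action:
                    live.discard(action["remove"])
    return live


with tempfile.TemporaryDirectory() as log_dir:
    commit(log_dir, 0, [{"add": "a.parquet"}, {"add": "b.parquet"}])
    lost_race = commit(log_dir, 0, [{"add": "x.parquet"}])  # version 0 is taken
    commit(log_dir, 1, [{"remove": "a.parquet"}, {"add": "c.parquet"}])
    print(lost_race, sorted(snapshot(log_dir)))  # False ['b.parquet', 'c.parquet']
```

Readers only ever see whole commits, and a writer that loses the race for a version number changes nothing. A production system would also need to guard against a reader observing a half-written commit file, which this sketch does not attempt.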
Unity Catalog for Metadata Management
Cluster Management
Databricks provides cluster management options including displaying, editing, starting, terminating, deleting, controlling access, and monitoring performance and logs. We can also use the Clusters API to manage compute programmatically.
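As a rough illustration of programmatic access, the sketch below builds a request for the Clusters API list endpoint (`/api/2.0/clusters/list`) using only the Python standard library. The workspace URL and token shown are placeholders, and the helper names `build_list_clusters_request` and `list_clusters` are hypothetical; check the API reference for your workspace before relying on it.

```python
import json
import urllib.request


def build_list_clusters_request(host, token):
    """Build (but do not send) a GET request that lists clusters."""
    url = f"{host.rstrip('/')}/api/2.0/clusters/list"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def list_clusters(host, token):
    """Send the request and return the decoded JSON response."""
    request = build_list_clusters_request(host, token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Placeholder values; substitute your own workspace URL and access token.
request = build_list_clusters_request(
    "https://example.cloud.databricks.com", "dapi-example-token"
)
print(request.get_method(), request.full_url)
```

Separating request construction from sending makes it possible to inspect or log the call without touching a live workspace.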
Secure Cloud Storage Integration
Databricks uses cloud object storage to store data files and tables. During workspace deployment, Databricks configures a cloud object storage location known as the DBFS root. Databricks also supports configuring connections to other cloud object storage locations.
Notebooks and Workspace
Notebooks are the primary tool for creating data science and machine learning workflows and collaborating with colleagues. Databricks notebooks provide real-time coauthoring in multiple languages, automatic versioning, and built-in data visualizations.
Photon Query Engine
Photon is a vectorized query engine written in C++ that leverages data and instruction-level parallelism available in CPUs.
It’s 100% compatible with Apache Spark APIs, which means you don’t have to rewrite your existing code (SQL, Python, R, Scala) to benefit from its advantages.
Photon is an ANSI-compliant engine. It initially focused on SQL workloads, but since launch its scope has grown to cover more ingestion sources, formats, APIs, and methods.
Automation Tools
Databricks Workflows supports scheduling jobs, triggering them, or running them continuously when building pipelines for real-time streaming data. Databricks Workflows also provides advanced monitoring capabilities and efficient resource allocation for automated jobs.
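The scheduled and continuous modes described above can be sketched as a job definition of the kind accepted by the Jobs API create endpoint (`/api/2.1/jobs/create`). This is a minimal sketch under stated assumptions: the helper `build_job_spec`, the job names, and the notebook paths are hypothetical, and compute settings are omitted, so a real job would also need a cluster or other compute configuration.

```python
import json


def build_job_spec(name, notebook_path, cron=None, timezone="UTC"):
    """Return a job definition with one notebook task.

    With `cron` (a Quartz cron expression) the job runs on a schedule;
    without it, the job is marked to run continuously.
    """
    spec = {
        "name": name,
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {"notebook_path": notebook_path},
            }
        ],
    }
    if cron is not None:
        spec["schedule"] = {
            "quartz_cron_expression": cron,
            "timezone_id": timezone,
            "pause_status": "UNPAUSED",
        }
    else:
        spec["continuous"] = {"pause_status": "UNPAUSED"}
    return spec


# Hypothetical jobs: a nightly batch at 02:00 and a continuous streaming job.
nightly = build_job_spec("nightly-etl", "/Shared/etl", cron="0 0 2 * * ?")
streaming = build_job_spec("event-stream", "/Shared/stream")
print(json.dumps(nightly, indent=2))
```

The resulting dictionary would be sent as the JSON body of the create request; keeping it as plain data makes the job definition easy to review and version.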