What is a Data Lake?
A Data Lake is a storage repository that holds vast amounts of raw data in its native format.
Advantages:
- Cost-efficient and scalable.
- Can store any kind of data (structured, unstructured, semi-structured).
Challenges:
- No ACID guarantees.
- No data validation, leading to data quality issues.
- Difficult to maintain historical versions.
- No DML operations (updates, deletes).
What is Delta Lake?
Delta Lake builds on a data lake, adding a transactional layer that solves the challenges mentioned above.
- Delta Lake is an open-source storage layer that works with Apache Spark. It is installed as a library on the Spark cluster rather than running as a separate service (see the setup in the sketch below).
- Combines Parquet (storage format) with transaction logs to enable ACID transactions.
Delta Lake = Parquet + Transaction Logs:
- Transaction logs ensure consistency and reliability: a commit is recorded in the log only after the underlying data files have been written successfully.
Key Operations in Delta Lake:
Write Operation:
- Data is written to part files.
- A transaction log is updated after successful writes.
- Write → Part files + Transaction log (e.g., 0000.json).
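To make this concrete, here is a minimal PySpark sketch, assuming the open-source delta-spark package is installed (pip install delta-spark); the table path /tmp/delta/events and the sample data are purely illustrative:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Configure a local Spark session with the Delta Lake extensions.
builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Write: Parquet part files land in the table directory, and the first
# commit file is recorded under _delta_log/ only after the part files
# have been written successfully.
df.write.format("delta").save("/tmp/delta/events")
```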
Append Operation:
- New data is added to part files.
- The transaction log is updated with each append operation.
- Append → New part files + a new transaction log entry (e.g., 0001.json alongside 0000.json).
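Continuing the same hypothetical session, an append adds part files and a second commit:

```python
more = spark.createDataFrame([(3, "carol")], ["id", "name"])

# Append: new part files are added, and a second commit file is written
# to _delta_log/; the earlier commit is left untouched.
more.write.format("delta").mode("append").save("/tmp/delta/events")
```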
Handling Failures in Delta Lake:
Scenario 1: Job Failing While Appending Data
- Data Lake: No atomicity; partial data may remain.
- Delta Lake: Uses transaction logs to ensure consistency and prevent reading partial data.
Scenario 2: Job Failing While Overwriting Data
- Data Lake: Overwrite deletes the existing files before the new ones are fully written, so a mid-job failure loses the previous data and leaves the table in an inconsistent state.
- Delta Lake: New files are written first; the old files are not deleted. The transaction log is updated only after the write succeeds, so if the job fails, the log still points at the previous files and no data is lost.
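A sketch of this behavior, continuing the same hypothetical table (the version number passed to versionAsOf assumes the write history above):

```python
replacement = spark.createDataFrame([(10, "dave")], ["id", "name"])

# Overwrite: new part files are written first; the commit marks the old
# files as removed in the log but does not physically delete them.
replacement.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# The previous snapshot is still readable, so a failed overwrite can
# never leave the table half-written.
old = spark.read.format("delta").option("versionAsOf", 1).load("/tmp/delta/events")
old.show()
```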
Scenario 3: Simultaneous Reads and Writes
- Delta Lake: Transaction logs ensure that only successfully committed data is visible to readers, preventing inconsistent reads.
Updates and Deletes in Delta Lake:
Updates:
- A new part file is created with updated records.
- Unchanged records are copied from the old part file to the new one.
- The transaction log is updated to add the new file and remove the old one.
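A minimal update sketch using the DeltaTable Python API (the condition and column values are illustrative):

```python
from delta.tables import DeltaTable

table = DeltaTable.forPath(spark, "/tmp/delta/events")

# Update rewrites each affected part file: matching rows get the new
# value, unchanged rows are copied over, and the commit swaps the
# old file for the new one in the log.
table.update(
    condition="id = 10",
    set={"name": "'david'"},  # value is a SQL expression, hence the inner quotes
)
```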
Deletes:
- A new part file is created without the deleted records.
- Remaining records are copied to the new part file.
- The transaction log is updated to add the new file and remove the old one.
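Deletes use the same API, continuing the sketch above:

```python
# Delete also uses copy-on-write: surviving rows are copied into new
# part files, and the old files are logically removed in the commit.
table.delete(condition="id = 10")
```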
Scenario 4: Appending Data with a Different Schema
- Schema Evolution: By default, Delta Lake enforces the table schema and rejects writes that do not match it; with schema evolution enabled, compatible schema changes are merged into the table (see the sketch after this list).
- Version History: Delta Lake supports Time Travel, enabling you to query previous versions of the data (also shown below).
- Data Quality: Schema enforcement and atomic commits ensure that only validated, fully written data becomes visible.
- Performance: Delta Lake uses file-level statistics stored in the transaction log to skip irrelevant files, improving query performance.
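Sketches of schema evolution and Time Travel, again continuing the hypothetical table (the age column is an invented schema change):

```python
from delta.tables import DeltaTable

# Schema evolution: without mergeSchema this append would be rejected,
# because the new `age` column does not match the table schema.
wider = spark.createDataFrame([(4, "erin", 30)], ["id", "name", "age"])
(wider.write.format("delta")
      .mode("append")
      .option("mergeSchema", "true")
      .save("/tmp/delta/events"))

# Time travel: query any earlier committed version of the table.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")
v0.show()

# The commit history records every version, operation, and timestamp.
DeltaTable.forPath(spark, "/tmp/delta/events").history().show()
```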
Summary of Key Benefits:
- ACID Transactions: Guarantees consistency and reliability.
- Schema Enforcement and Evolution: Handles schema changes effectively.
- Time Travel: Ability to query historical versions.
- Support for Updates/Deletes: DML operations are possible.
- Data Quality: Ensures data integrity with commit logs.