Creating a Local Data Lakehouse using Spark/Minio/Dremio/Nessie
Alex Merced
Co-Author of “Apache Iceberg: The Definitive Guide” | Senior Tech Evangelist at Dremio | LinkedIn Learning Instructor | Tech Content Creator
Data is becoming the cornerstone of modern businesses. As businesses scale, so does their data, and this leads to the need for efficient data storage, retrieval, and processing systems. This is where Data Lakehouses come into the picture.
What is a Data Lakehouse?
A Data Lakehouse combines the best of data lakes and data warehouses: the flexible, low-cost raw storage of a data lake and the structured, SQL-based querying of a data warehouse. This hybrid model lets you store vast amounts of structured and unstructured data while still running fast analytical queries against it, so businesses can derive meaningful insights directly from their data.
We will implement a basic lakehouse locally on your laptop with the following components: MinIO for S3-compatible object storage, Apache Iceberg as the table format, Nessie as the catalog, Apache Spark (running in a Jupyter notebook) for ingestion, and Dremio as the query engine.
Creating a Docker Compose File and Running It
Building a Data Lakehouse on your local machine requires orchestrating multiple services, and Docker Compose is a perfect tool for this task. In this section we will create a docker-compose.yml file to define and run the multi-container Docker application for our Data Lakehouse environment.
Create the Compose File: In a directory of your choice, create a file named docker-compose.yml with the following contents:
version: "3.9"
services:
dremio:
platform: linux/x86_64
image: dremio/dremio-oss:latest
ports:
- 9047:9047
- 31010:31010
- 32010:32010
container_name: dremio
minioserver:
image: minio/minio
ports:
- 9000:9000
- 9001:9001
environment:
MINIO_ROOT_USER: minioadmin
MINIO_ROOT_PASSWORD: minioadmin
container_name: minio
command: server /data --console-address ":9001"
spark_notebook:
image: alexmerced/spark33-notebook
ports:
- 8888:8888
env_file: .env
container_name: notebook
nessie:
image: projectnessie/nessie
container_name: nessie
ports:
- "19120:19120"
networks:
default:
name: iceberg_env
driver: bridge
Create Environment File:
Alongside your docker-compose.yml file, create a .env file to manage your environment variables. Start with the template below and fill in the details as you progress.
# Fill in Details
# AWS_REGION is used by Spark
AWS_REGION=us-east-1
# This must match if using minio
MINIO_REGION=us-east-1
# Used by pyIceberg
AWS_DEFAULT_REGION=us-east-1
# AWS Credentials (these can be MinIO credentials, to be filled in later)
AWS_ACCESS_KEY_ID=XXXXXXXXXXXXXXX
AWS_SECRET_ACCESS_KEY=xxxxxxx
# If using MinIO, this should be the API address of the MinIO server (plain HTTP, since this local setup has no TLS)
AWS_S3_ENDPOINT=http://minioserver:9000
# Location where files will be written when creating new tables
WAREHOUSE=s3a://warehouse/
# URI of the Nessie catalog (plain HTTP for this local setup)
NESSIE_URI=http://nessie:19120/api/v1
Run Docker Compose:
Open four terminals so you can see each service's logs clearly, one terminal per service. In each, make sure you are in the directory containing the docker-compose.yml file, and then run the commands below, starting with MinIO:
docker-compose up minioserver
Head over to localhost:9001 in your browser and log in with the credentials minioadmin/minioadmin. Once you are logged in, do the following: create a bucket named warehouse (matching the WAREHOUSE=s3a://warehouse/ entry in your .env file), then create an access key and copy its key and secret into the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY entries of your .env file.
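If you prefer to script the bucket creation instead of clicking through the console, the sketch below does the same thing from your host machine. It assumes boto3 is installed locally (it is not part of this stack) and uses the root minioadmin credentials; any access key you create in the console would work the same way.
import boto3

# Connect to MinIO's S3 API from the host (port 9000 from the compose file)
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

# Create the bucket referenced by WAREHOUSE=s3a://warehouse/ in .env
s3.create_bucket(Bucket="warehouse")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
With the bucket in place and your .env credentials filled in, start the remaining services: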
docker-compose up nessie
docker-compose up spark_notebook
When this container starts, look in its logs for output that looks like the following, and copy and paste the URL (your token will differ) into your browser.
notebook  |     or http://127.0.0.1:8888/?token=9db2c8a4459b4aae3132dfabdf9bf4396393c608816743a9
docker-compose up dremio
After completing these steps, you will have Dremio, Minio, and a Spark Notebook running in separate Docker containers on your laptop, ready for further Data Lakehouse operations!
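Before moving on, here is an optional sanity check: from a notebook cell you can confirm that Nessie and MinIO are reachable inside the Docker network using only the Python standard library. The endpoints below assume the service names from the docker-compose.yml above.
import urllib.request

# Hit Nessie's config endpoint and MinIO's liveness probe from inside the notebook container
for name, url in [
    ("nessie", "http://nessie:19120/api/v1/config"),
    ("minio", "http://minioserver:9000/minio/health/live"),
]:
    with urllib.request.urlopen(url, timeout=5) as resp:
        print(f"{name}: HTTP {resp.status}")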
Creating a Table in Spark
From the Jupyter environment you opened in the browser earlier, create a new notebook and add the following code.
import pyspark
from pyspark.sql import SparkSession
import os
## DEFINE SENSITIVE VARIABLES
NESSIE_URI = os.environ.get("NESSIE_URI") ## Nessie Server URI
WAREHOUSE = os.environ.get("WAREHOUSE") ## BUCKET TO WRITE DATA TO
AWS_ACCESS_KEY = os.environ.get("AWS_ACCESS_KEY_ID") ## AWS CREDENTIALS (matches the .env variable name)
AWS_SECRET_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY") ## AWS CREDENTIALS (matches the .env variable name)
AWS_S3_ENDPOINT = os.environ.get("AWS_S3_ENDPOINT") ## MINIO ENDPOINT
print(AWS_S3_ENDPOINT)
print(NESSIE_URI)
print(WAREHOUSE)
conf = (
    pyspark.SparkConf()
        .setAppName('app_name')
        .set('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.3_2.12:0.67.0,software.amazon.awssdk:bundle:2.17.178,software.amazon.awssdk:url-connection-client:2.17.178')
        .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions')
        .set('spark.sql.catalog.nessie', 'org.apache.iceberg.spark.SparkCatalog')
        .set('spark.sql.catalog.nessie.uri', NESSIE_URI)
        .set('spark.sql.catalog.nessie.ref', 'main')
        .set('spark.sql.catalog.nessie.authentication.type', 'NONE')
        .set('spark.sql.catalog.nessie.catalog-impl', 'org.apache.iceberg.nessie.NessieCatalog')
        .set('spark.sql.catalog.nessie.s3.endpoint', AWS_S3_ENDPOINT)
        .set('spark.sql.catalog.nessie.warehouse', WAREHOUSE)
        .set('spark.sql.catalog.nessie.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
        .set('spark.hadoop.fs.s3a.access.key', AWS_ACCESS_KEY)
        .set('spark.hadoop.fs.s3a.secret.key', AWS_SECRET_KEY)
)
## Start Spark Session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
print("Spark Running")
## Create a Table
spark.sql("CREATE TABLE nessie.names (name STRING) USING iceberg;").show()
## Insert Some Data
spark.sql("INSERT INTO nessie.names VALUES ('Alex Merced'), ('Dipankar Mazumdar'), ('Jason Hughes')").show()
## Query the Data
spark.sql("SELECT * FROM nessie.names;").show()
Run the code; it creates the table, which you can confirm by opening the MinIO dashboard and looking inside the warehouse bucket.
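If you want to peek under the hood first, Iceberg exposes metadata tables you can query from the same Spark session; the optional sketch below inspects the nessie.names table created above, and the LIST REFERENCES statement comes from the Nessie SQL extensions configured earlier.
## Inspect the snapshots and data files Iceberg just wrote
spark.sql("SELECT snapshot_id, operation, summary FROM nessie.names.snapshots").show(truncate=False)
spark.sql("SELECT file_path, record_count FROM nessie.names.files").show(truncate=False)

## List the branches and tags known to the Nessie catalog
spark.sql("LIST REFERENCES IN nessie").show()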
Querying the Data in Dremio
Now that the data has been created, a catalog like Nessie makes your Apache Iceberg tables visible from tool to tool. Open Dremio at localhost:9047, create your admin account, and add your Nessie catalog as a source; you can then immediately query the data it contains.
Fill out the first two tabs of the new-source dialog (the original screenshots are not reproduced here; the exact labels vary slightly by Dremio version): on the General tab, name the source, set the Nessie endpoint URL to http://nessie:19120/api/v2 (older Dremio versions expect /api/v1; if you see a "Make sure that Nessie endpoint URL is valid" error, try the other path), and select no authentication. On the Storage tab, enter the same MinIO access key and secret key from your .env file, set the root path to /warehouse, uncheck the option to encrypt the connection, and add the connection properties fs.s3a.path.style.access = true, fs.s3a.endpoint = minioserver:9000, and dremio.s3.compat = true.
You will now see the names table in the Dremio dashboard under the newly connected source.
Click on it and run a query from the SQL Runner.
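If you prefer to query Dremio programmatically, the compose file also exposes Dremio's Arrow Flight port (32010). The sketch below is optional and makes a few assumptions: pyarrow is installed where you run it, it runs on your host machine (hence localhost), and "username"/"password" are the admin credentials you created on first login to Dremio.
from pyarrow import flight

# Connect to Dremio's Arrow Flight endpoint (use dremio:32010 from inside the Docker network)
client = flight.FlightClient("grpc+tcp://localhost:32010")
token = client.authenticate_basic_token("username", "password")
options = flight.FlightCallOptions(headers=[token])

# Run the same query against the Nessie-backed table and fetch the results
descriptor = flight.FlightDescriptor.for_command("SELECT * FROM nessie.names")
info = client.get_flight_info(descriptor, options)
reader = client.do_get(info.endpoints[0].ticket, options)
print(reader.read_all())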
Keep in mind that Dremio also has full DML support for Apache Iceberg, so you can run CREATE, INSERT, UPDATE, and DELETE statements against these tables directly from Dremio.
So now you get the performance of Dremio and Apache Iceberg along with the git-like capabilities that the Nessie catalog brings, allowing you to isolate ingestion on branches, create zero-copy clones for experimentation, and roll back your catalog for disaster recovery, all on your data lakehouse.
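To make that concrete, here is a minimal sketch of a branch-and-merge workflow run from the same Spark session configured earlier. The branch name etl_experiment is purely illustrative; the statements come from the Nessie Spark SQL extensions loaded in the notebook configuration above.
## Create an isolated branch and write to it without touching main
spark.sql("CREATE BRANCH IF NOT EXISTS etl_experiment IN nessie FROM main")
spark.sql("USE REFERENCE etl_experiment IN nessie")
spark.sql("INSERT INTO nessie.names VALUES ('Test User')")

## main is unchanged until the branch is merged
spark.sql("USE REFERENCE main IN nessie")
spark.sql("SELECT * FROM nessie.names").show()   # no 'Test User' row yet

## Merge the experiment back into main and re-check
spark.sql("MERGE BRANCH etl_experiment INTO main IN nessie")
spark.sql("SELECT * FROM nessie.names").show()   # now includes 'Test User'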