Maximizing Efficiency and Productivity with Great Expectations on Databricks


Data quality is crucial for organizations relying on data-driven insights to make informed decisions. Accurate, consistent, and reliable data is vital to the success of analytics and machine learning projects. Great Expectations, an open-source Python library, offers a robust framework for validating and documenting data quality expectations. When combined with the capabilities of Databricks, a cloud-based big data processing and analytics platform, organizations can unlock significant benefits in terms of efficiency and productivity. In this article, we will explore how to leverage Great Expectations on Databricks, providing concrete examples to establish robust data quality practices.

Setting Up Great Expectations on Databricks:

To begin, you need to install Great Expectations and prepare a Databricks notebook. Follow these steps:

Step 1: Install Great Expectations using pip:

pip install great_expectations         
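
If you are working directly in a Databricks notebook, the same installation can be done in a notebook cell, which scopes the library to the current session:

%pip install great_expectations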

Step 2: Create a new notebook in Databricks.

Step 3: Import the necessary libraries:

import pandas as pd
from great_expectations.dataset import PandasDataset

Generating Mock Data:

Before we dive into the details, let's generate some mock data to work with. You can use an online service such as Mockaroo (https://www.mockaroo.com/) to generate realistic-looking mock data. For the purposes of this article, we will use a mock dataset called "MOCK_DATA.csv" with the following columns: ['age', 'first_name', 'last_name', 'company_name', 'country', 'income', 'job_title']. If you prefer to stay inside the notebook, a local alternative is sketched below.
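
If you prefer not to rely on an external service, an equivalent mock file can be generated directly in the notebook. The sketch below uses only pandas and the standard library, with made-up value ranges, and writes the file to the DBFS path used later in this article (assuming the /dbfs FUSE mount is available on your cluster):

import random
import pandas as pd

# Build 1,000 mock rows with the same columns as MOCK_DATA.csv
rows = [{
    "age": random.randint(18, 80),
    "first_name": f"First{i}",
    "last_name": f"Last{i}",
    "company_name": f"Company {i % 50}",
    "country": random.choice(["United States", "France", "Morocco", "Germany", "India"]),
    "income": round(random.uniform(20000, 150000), 2),
    "job_title": random.choice(["Engineer", "Analyst", "Manager"]),
} for i in range(1000)]

# Write the CSV to DBFS so spark.read can load it from /FileStore/tables/MOCK_DATA.csv
pd.DataFrame(rows).to_csv("/dbfs/FileStore/tables/MOCK_DATA.csv", index=False)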


Defining Expectations on Databricks:

Once you have Great Expectations installed and Databricks set up, you can define your data quality expectations. Let's consider an example where you have a Spark DataFrame containing customer information:

Step 1: Convert the Spark DataFrame to a Pandas DataFrame:

df = spark.read.format("csv").option("header", "true").load("/FileStore/tables/MOCK_DATA.csv")
pandas_df = df.toPandas()         
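
Converting with toPandas() pulls the whole table onto the driver, which is fine for this small mock file but can become a bottleneck for large tables. As an alternative (a brief sketch, not used in the rest of this walkthrough), Great Expectations also ships a Spark-backed dataset class, so the same expectation methods can run without leaving Spark:

from great_expectations.dataset import SparkDFDataset

# Wrap the Spark DataFrame directly -- no conversion to pandas required
spark_dataset = SparkDFDataset(df)
spark_dataset.expect_column_values_to_not_be_null("first_name")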

Step 2: Convert the 'age' and 'income' columns to numeric types:

pandas_df['age'] = pd.to_numeric(pandas_df['age'], errors='coerce') 
pandas_df['income'] = pd.to_numeric(pandas_df['income'], errors='coerce')         

Step 3: Create a Great Expectations dataset:

dataset = PandasDataset(pandas_df)         

Step 4: Define expectations using Great Expectations:

dataset.expect_column_values_to_be_between('age', min_value=35) # Expect age to be at least 35 
dataset.expect_column_values_to_be_between('income', min_value=0) # Expect income to be greater than or equal to 0 
dataset.expect_column_values_to_not_be_null('first_name') # Expect first_name to not be null 
dataset.expect_column_values_to_not_be_null('last_name') # Expect last_name to not be null         
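
Beyond range and null checks, the same dataset object exposes many other expectation methods. A couple of additional examples that fit this schema (the allowed country values below are illustrative, not part of the mock file's specification):

dataset.expect_column_to_exist('job_title') # the column should be present
dataset.expect_column_values_to_be_in_set('country', ['United States', 'France', 'Morocco']) # illustrative allowed set
dataset.expect_column_value_lengths_to_be_between('first_name', min_value=1, max_value=50) # sanity check on name length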

Validating Data Quality Expectations:

Once you have defined your expectations, it's time to validate them against your data:

Step 1: Validate the dataset:

results = dataset.validate()         

Step 2: Review the validation results:

print(results["success"])  # True only if every expectation passed

for result in results["results"]:  # inspect each expectation individually
    print(result["expectation_config"]["expectation_type"], "->", result["success"])

Generating an HTML Report:

To generate a comprehensive HTML report of the validation results, follow these steps:

Step 1: Import the necessary libraries:

from great_expectations.render.renderer import ValidationResultsPageRenderer 
from great_expectations.render.view import DefaultJinjaPageView         

Step 2: Generate the HTML report:

renderer = ValidationResultsPageRenderer()
document = renderer.render(results)
view = DefaultJinjaPageView()
html = view.render(document)         

Step 3: Write the HTML report to a file:

with open('validation_report.html', 'w') as f: 
    f.write(html)         

Step 4: Display the HTML report in Databricks:

displayHTML(html)         
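
The report file written above lives on the driver's local disk and disappears when the cluster terminates. If you want the report to persist and be downloadable, one option (assuming the standard /dbfs FUSE mount and FileStore path) is to write it to DBFS instead:

# Persist the report to DBFS so it survives cluster restarts
with open('/dbfs/FileStore/validation_report.html', 'w') as f:
    f.write(html)

# Files under /FileStore can then be downloaded from
# https://<your-workspace-url>/files/validation_report.html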

Automating Data Quality Checks:

To ensure ongoing data quality, you can automate the validation process using Databricks Jobs and Great Expectations:

Step 1: Create a new Databricks Job.

Step 2: In the Job, schedule the notebook so the validation code runs at regular intervals.

Step 3: Set up Slack or email notifications for failed expectations, for example:

# Validate the dataset
results = dataset.validate()

# Check if any expectations have failed
if not results["success"]:
    # Prepare the email content
    recipient_email = 'data-team@example.com'  # placeholder -- replace with your alert recipient
    subject = 'Data Quality Alert: Failed Expectations'
    content = 'The validation of your dataset has failed. Please review the attached validation report for details.'

    # Send the email notification (see the helper sketch below)
    send_email_notification(recipient_email, subject, content)

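The snippet above assumes a send_email_notification helper defined elsewhere in the notebook. A minimal sketch of such a helper, using Python's standard smtplib (the SMTP host, port, sender address, and credentials are placeholders you would replace with your own):

import smtplib
from email.message import EmailMessage

def send_email_notification(recipient_email, subject, content):
    """Send a plain-text alert email. All SMTP settings below are placeholders."""
    msg = EmailMessage()
    msg["From"] = "alerts@example.com"  # placeholder sender
    msg["To"] = recipient_email
    msg["Subject"] = subject
    msg.set_content(content)

    # Assumes an SMTP relay that is reachable from the Databricks driver
    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()
        server.login("alerts@example.com", "app-password")  # placeholder credentials
        server.send_message(msg)
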
By incorporating Great Expectations into your Databricks workflows, you can establish robust data quality practices, automate validation processes, and ensure ongoing data accuracy and consistency. Leveraging the power of Great Expectations within Databricks empowers organizations to make more confident and reliable data-driven decisions.

Remember, data quality is a continuous effort. Regularly review and update your expectations as your data evolves. With Great Expectations on Databricks, you can effectively manage data quality at scale, enabling more accurate and meaningful insights for your business.
