Stat Rules

Kirk Haslbeck

Field CTO, Data Quality at Collibra

Published: May 1, 2024

Want Data Quality Rules without long runtimes and table scans?

In the world of data quality there is a common understanding about scanning large amounts of data: typically, the more data you scan (health check), the more coverage you achieve. The challenge becomes scale and cost. In most DQ rule engines, each new rule adds another scan or workload to the engine. Each rule is typically a shared-nothing, independent operation that validates itself and generates break records where applicable. Side note: the DQ world has advanced, and many checks now find changes, trends, or anomalies rather than just break records, but that is beside the point for now.

So what happens when you have successfully added 450 rules to your pipeline?

Most rule engines handle this by re-scanning the dataset 450 times, once per rule. In the database world this is known as table scanning. Hundreds of table scans add up quickly and can make your pipelines run 10X slower. Your compute and cloud costs could also increase 10X.
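
The scan arithmetic can be made concrete with a toy model. The sketch below is not CDQ code or any real rule engine: `ScanCountingDataset`, `run_rules_independently`, and the generated rule predicates are hypothetical names that exist only to count how many full passes a shared-nothing design makes.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, int]


class ScanCountingDataset:
    """Hypothetical in-memory dataset that counts full passes over its rows."""

    def __init__(self, rows: List[Row]) -> None:
        self._rows = rows
        self.scans = 0

    def scan(self) -> Iterator[Row]:
        # Every call models one full table scan.
        self.scans += 1
        return iter(self._rows)


def run_rules_independently(
    dataset: ScanCountingDataset,
    rules: Dict[str, Callable[[Row], bool]],
) -> Dict[str, int]:
    """Shared-nothing model: each rule scans the dataset itself and counts its breaks."""
    breaks: Dict[str, int] = {}
    for name, predicate in rules.items():
        breaks[name] = sum(1 for row in dataset.scan() if not predicate(row))
    return breaks


dataset = ScanCountingDataset([{"id": i, "amount": i * 10} for i in range(1000)])

# 450 made-up rules; the default argument binds each threshold to its own lambda.
rules = {f"rule_{n}": (lambda row, n=n: row["amount"] >= n) for n in range(450)}

breaks = run_rules_independently(dataset, rules)
print(dataset.scans)  # one scan per rule
```

In this model the number of scans grows linearly with the number of rules, which is the cost pattern described above.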

Stat Rules to the rescue

Collibra Data Quality (CDQ) offers statistical rules that do not require any additional table scan or compute operation. CDQ achieves this by passing over the dataset only once, read-only, using parallel executors during that single pass. This guarantees the fastest profiling on both files and tables, and it also collects all the "Stats" needed to run rules in a subsequent stage. With this approach an end user can type SQL, or nowadays SpeeQL using GenAI, and the rule does not need to go back to the dataset and trigger a table scan. Instead it does a lookup against the internal metric system, which provides two important benefits: 1) sub-second response times on any stat rule, and 2) zero added cost on any cloud or compute tier.
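
The "profile once, then look up" idea can be sketched in a few lines. This is an illustration under stated assumptions, not CDQ's implementation: `profile_once` and the stat key names (`rowCount`, `ID.uniqueCount`) are hypothetical, the pass is single-threaded rather than parallel, and exact sets stand in for whatever structures a real profiler would use.

```python
from typing import Dict, Iterable, Set

Row = Dict[str, object]


def profile_once(rows: Iterable[Row]) -> Dict[str, int]:
    """Collect the row count and per-column unique counts in a single pass."""
    row_count = 0
    seen: Dict[str, Set[object]] = {}
    for row in rows:
        row_count += 1
        for column, value in row.items():
            seen.setdefault(column, set()).add(value)
    stats = {"rowCount": row_count}
    for column, values in seen.items():
        stats[f"{column}.uniqueCount"] = len(values)
    return stats


rows = [
    {"ID": 1, "GENDER": "MALE"},
    {"ID": 2, "GENDER": "FEMALE"},
    {"ID": 2, "GENDER": "MALE"},  # duplicate ID on purpose
]

# The only pass over the data happens here.
stats = profile_once(rows)

# Each rule is now a dictionary lookup against the collected stats, not a scan.
rule_results = {
    "empty_dataset": stats["rowCount"] == 0,
    "id_not_unique": stats["ID.uniqueCount"] != stats["rowCount"],
}
print(rule_results)
```

Adding another rule to `rule_results` costs one more lookup and no further pass over `rows`.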

Examples below:

$daysSinceLastRun > 4

Alert when dataset has not run for 5 days.

$daysWithoutData > 4

Alert when dataset has 0 rows

$rowCount = 0

Type Checking

@dataset.fname.$type = 'String'

Dataset column "ID" unique check

ID.$uniqueCount != $rowCount

DISTRIBUTION

Dataset Column (GENDER) doesn't have 60/40 ratio

GENDER['MALE'].$uniqueRatio not between 40 and 60
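
One way to read the distribution rule is as a check on the percentage of rows that hold a given value. The article does not define `$uniqueRatio`, so that reading is an assumption, and `value_percentages` below is a hypothetical helper, not a CDQ function. The sketch computes the percentages from value counts that a profiling pass could have collected, then applies the same "not between 40 and 60" test.

```python
from collections import Counter
from typing import Dict, Iterable


def value_percentages(values: Iterable[str]) -> Dict[str, float]:
    """Return each distinct value's share of all rows, as a percentage."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: 100.0 * count / total for value, count in counts.items()}


# Made-up sample column: 7 MALE rows and 3 FEMALE rows.
genders = ["MALE"] * 7 + ["FEMALE"] * 3

percentages = value_percentages(genders)
male_pct = percentages.get("MALE", 0.0)

# The rule breaks when the MALE share falls outside the 40 to 60 band.
rule_breaks = not (40 <= male_pct <= 60)
print(male_pct, rule_breaks)
```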

RUNTIME

Alert when job runs too long

$totalRuntimeSeconds > 45
$totalRuntimeMinutes > 2

For a more complete list, check out the link below.

https://productresources.collibra.com/docs/collibra/latest/Content/DataQuality/DQCoreComponents/Stat%20Rules_1.htm

