Stat Rules

Want Data Quality Rules without long runtimes and table scans?

In data quality, there is a common understanding about scanning large amounts of data: typically, the more data you scan (health check), the more coverage you achieve. The challenge becomes scale and cost. In most DQ rule engines, each new rule adds another scan or workload to the engine. Each rule is typically a shared-nothing, independent operation that validates itself and generates break records where applicable. Side note: the DQ world has advanced, and many checks now find changes, trends, or anomalies rather than just break records, but that is beside the point for now.

So what happens when you have successfully added 450 rules to your pipeline?

Most rule engines handle this by re-scanning the dataset 450 times, once per rule. In the database world this is known as table scanning. Hundreds of table scans add up quickly and can make your pipelines run 10X slower. Your compute and cloud costs could also increase 10X.
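To make the arithmetic concrete, here is a minimal Python sketch that counts row reads when every rule scans the data independently, compared with reading the data once. The `CountingDataset` class, the sample rows, and the placeholder rule are all hypothetical; they illustrate the scaling only and are not taken from any particular rule engine.

```python
class CountingDataset:
    """Hypothetical in-memory table that counts every row it hands out."""

    def __init__(self, rows):
        self._rows = rows
        self.rows_read = 0

    def scan(self):
        # Each call is one full "table scan" over all rows.
        for row in self._rows:
            self.rows_read += 1
            yield row


rows = [{"id": i, "amount": i * 1.5} for i in range(1_000)]
rule_count = 450

# One independent scan per rule, as a shared-nothing engine would do.
per_rule = CountingDataset(rows)
for _ in range(rule_count):
    # Placeholder rule: count rows with a negative amount.
    break_records = sum(1 for row in per_rule.scan() if row["amount"] < 0)

# Baseline: read the same data exactly once.
single = CountingDataset(rows)
row_count = sum(1 for _ in single.scan())

print(per_rule.rows_read)  # 450000
print(single.rows_read)    # 1000
```

Row reads grow linearly with the number of rules in the first case and stay flat in the second, which is where the runtime and cost gap comes from.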

Stat Rules to the rescue

Collibra Data Quality (CDQ) offers statistical rules that do not require any additional table scan or compute operation. CDQ achieves this by passing over the dataset only once, using parallel executors during that single pass. This makes profiling fast on both files and tables and also collects all the "Stats" needed to run rules in a subsequent stage. With this approach an end user can type SQL, or nowadays SpeeQL using GenAI, and the rule does not need to go back out to the dataset and produce a table scan. Instead it does a lookup against the internal metric system, which provides two important benefits: 1) sub-second response times on any stat rule, and 2) zero cost to any cloud or compute tier server.
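The idea of "collect stats once, then answer rules by lookup" can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CDQ's implementation: `profile`, `STAT_RULES`, and `evaluate` are hypothetical names, the pass is sequential rather than parallel, and exact distinct counts are held in memory for simplicity.

```python
from collections import defaultdict


def profile(rows):
    """Collect row and column stats in a single pass over `rows`."""
    row_count = 0
    seen_values = defaultdict(set)  # column name -> distinct values
    seen_types = defaultdict(set)   # column name -> Python type names
    for row in rows:                # the only pass over the data
        row_count += 1
        for column, value in row.items():
            seen_values[column].add(value)
            seen_types[column].add(type(value).__name__)
    return {
        "rowCount": row_count,
        "columns": {
            column: {
                "uniqueCount": len(values),
                "types": sorted(seen_types[column]),
            }
            for column, values in seen_values.items()
        },
    }


# Hypothetical stat rules: each one is a lookup against the stats dict.
# The column rules assume the referenced columns exist in the stats.
STAT_RULES = {
    "dataset has 0 rows": lambda s: s["rowCount"] == 0,
    "ID is not unique": lambda s: s["columns"]["ID"]["uniqueCount"] != s["rowCount"],
    "fname is not a string": lambda s: s["columns"]["fname"]["types"] != ["str"],
}


def evaluate(stats, rules):
    """Return the names of the rules that fire, without touching the data."""
    return [name for name, fires in rules.items() if fires(stats)]


rows = [
    {"ID": 1, "fname": "Ada"},
    {"ID": 2, "fname": "Grace"},
    {"ID": 2, "fname": "Linus"},
]
stats = profile(rows)
print(evaluate(stats, STAT_RULES))  # ['ID is not unique']
```

Adding another rule to `STAT_RULES` adds one more dictionary lookup, not one more pass over `rows`.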

Examples below:

Alert when dataset has not run, or has had no data, for 5 or more days
$daysSinceLastRun > 4
$daysWithoutData > 4

Alert when dataset has 0 rows
$rowCount = 0

Type checking
@dataset.fname.$type = 'String'

Dataset column "ID" unique check
ID.$uniqueCount != $rowCount



DISTRIBUTION

Dataset Column (GENDER) doesn't have 60/40 ratio
GENDER['MALE'].$uniqueRatio not between 40 and 60
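A distribution check like this one can be answered from per-value counts gathered during profiling. The Python sketch below is one possible reading of the rule, assuming the ratio means the percentage of rows holding a given value. The helper names and the sample counts are hypothetical.

```python
from collections import Counter


def value_share_percent(value_counts, value):
    """Percentage of rows holding `value`, from precomputed counts."""
    total = sum(value_counts.values())
    if total == 0:
        return 0.0
    return 100.0 * value_counts.get(value, 0) / total


def outside_band(share, low, high):
    """True when `share` is not between `low` and `high` (inclusive)."""
    return not (low <= share <= high)


# Counts that a profiling pass might have stored for column GENDER.
gender_counts = Counter({"MALE": 70, "FEMALE": 30})

male_share = value_share_percent(gender_counts, "MALE")
print(male_share)                        # 70.0
print(outside_band(male_share, 40, 60))  # True
```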


RUNTIME

Alert when job runs too long
$totalRuntimeSeconds > 45
$totalRuntimeMinutes > 2        
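Runtime and freshness rules need only job metadata, not the data itself. Here is a minimal Python sketch, assuming the engine records start and finish timestamps for each run. The `job_stats` function and the timestamps are hypothetical, and the metric names simply mirror the rules above; how CDQ derives them is not stated here.

```python
from datetime import datetime


def job_stats(started_at, finished_at, now):
    """Derive runtime and freshness metrics from job timestamps."""
    runtime_seconds = (finished_at - started_at).total_seconds()
    return {
        "totalRuntimeSeconds": runtime_seconds,
        "totalRuntimeMinutes": runtime_seconds / 60,
        "daysSinceLastRun": (now - finished_at).days,
    }


stats = job_stats(
    started_at=datetime(2024, 1, 1, 8, 0, 0),
    finished_at=datetime(2024, 1, 1, 8, 3, 0),
    now=datetime(2024, 1, 7, 9, 0, 0),
)

print(stats["totalRuntimeSeconds"] > 45)  # True
print(stats["totalRuntimeMinutes"] > 2)   # True
print(stats["daysSinceLastRun"] > 4)      # True
```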


For a more complete list, check out the link below.

https://productresources.collibra.com/docs/collibra/latest/Content/DataQuality/DQCoreComponents/Stat%20Rules_1.htm


Sabbir Samdani, MBA

Enterprise Platform Product Manager | human oriented, business focused | Data Product Executive | AI/ML DQ and Data Observability thought leader | SaaS | Ex-Soc Gen | Ex-JPMC | Ex-Bank of Tokyo

6 months ago

excellent.

Gideon Kory, CFA

Artificially Intelligent. Bringing together people, ideas, and data. I am because we are.

6 months ago

SAY "We need to cut costs. ONE MORE TIME!"

Stijn (Stan) Christiaens

Co-founder & Chief Data Citizen at Collibra

6 months ago

Zero cost stat rules rule. They should give me a bonus for this.
