Stat Rules
Want Data Quality Rules without long runtimes and table scans?
In data quality there is a common understanding about scanning large amounts of data: typically, the more data you scan (health check), the more coverage you achieve. The challenge is scale and cost. In most DQ rule engines, each new rule adds another scan or workload to the engine. Each rule is typically a shared-nothing, independent operation that validates itself and generates break records where applicable. Side note: the DQ world has advanced, and many checks now find changes, trends, or anomalies rather than just break records, but that is beside the point for now.
So what happens when you have successfully added 450 rules to your pipeline?
Most rule engines handle this by re-scanning the dataset 450 times, once per rule. In the database world this is known as table scanning. Hundreds of table scans add up quickly and can make your pipelines run 10X slower. Your compute and cloud costs could also increase 10X.
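To make the scaling concrete, here is a minimal Python sketch of the scan-per-rule pattern. It is not any particular engine's implementation: the `scan_per_rule` helper, the toy rows, and the placeholder rule are all hypothetical, and it only counts how many rows get read.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def scan_per_rule(rows: List[Row], rules: List[Callable[[Row], bool]]) -> int:
    """Evaluate every rule with its own full pass and return total rows read."""
    rows_read = 0
    for rule in rules:
        for row in rows:  # one full scan of the dataset per rule
            rows_read += 1
            rule(row)  # break-record handling is omitted in this sketch
    return rows_read


# Hypothetical toy dataset of 1,000 rows and 450 copies of a trivial rule.
rows = [{"id": i, "gender": "MALE" if i % 2 else "FEMALE"} for i in range(1_000)]
rules = [lambda r: r["id"] is not None] * 450

total_reads = scan_per_rule(rows, rules)
print(total_reads)  # 450000
```

With 450 rules, a 1,000-row dataset is read 450,000 times over. The work grows linearly with the number of rules, even though the data itself has not changed.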
Stat Rules to the rescue
Collibra Data Quality (CDQ) offers statistical rules that do not require any additional table scan or compute operation. They achieve this by passing over the dataset once, in a read-once fashion, using parallel executors during that single pass. This gives fast profiling on both files and tables and also collects all the "Stats" needed to run rules in a subsequent stage. With this approach an end user can type SQL, or nowadays SpeeQL using GenAI, and the rule does not need to go back out to the dataset and trigger a table scan. Instead it does a lookup against the internal metric system, which provides two really important benefits: 1) sub-second response times on any stat rule, and 2) zero added cost on any cloud or compute tier.
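The idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CDQ's actual implementation: `profile_once`, the shape of the `stats` dictionary, and the rule names are hypothetical, and the pass is single-threaded rather than spread across parallel executors. The dataset is a generator, so it can only be read once.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def profile_once(rows: Iterable[Dict[str, object]]) -> Dict[str, object]:
    """Collect row and per-column stats in a single pass over the data."""
    row_count = 0
    value_counts = defaultdict(Counter)
    for row in rows:  # the only time the dataset is read
        row_count += 1
        for column, value in row.items():
            value_counts[column][value] += 1
    return {
        "rowCount": row_count,
        "columns": {
            column: {"uniqueCount": len(counts)}
            for column, counts in value_counts.items()
        },
    }


# A generator is exhausted after one pass, so no rule can re-scan it.
rows = ({"ID": i, "GENDER": "MALE" if i % 2 else "FEMALE"} for i in range(1_000))
stats = profile_once(rows)

# Hypothetical stat rules: each one is a lookup against the collected stats.
stat_rules = {
    "empty_dataset": lambda s: s["rowCount"] == 0,
    "id_not_unique": lambda s: s["columns"]["ID"]["uniqueCount"] != s["rowCount"],
}
results = {name: rule(stats) for name, rule in stat_rules.items()}
print(results)  # {'empty_dataset': False, 'id_not_unique': False}
```

Adding more rules to `stat_rules` adds only dictionary lookups. The number of passes over the data stays at one.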
Examples below:
Alert when dataset has not run for 5 days
$daysSinceLastRun > 4
Alert when dataset has had no data for 5 days
$daysWithoutData > 4
Alert when dataset has 0 rows
$rowCount = 0
Type Checking
@dataset.fname.$type = 'String'
Dataset column "ID" unique check
ID.$uniqueCount != $rowCount
DISTRIBUTION
Dataset Column (GENDER) doesn't have 60/40 ratio
GENDER['MALE'].$uniqueRatio not between 40 and 60
RUNTIME
Alert when job runs too long
$totalRuntimeSeconds > 45
$totalRuntimeMinutes > 2
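As a rough illustration of what rules like these check, the sketch below evaluates three of them against a stored metrics record. Everything here is hypothetical: the `metrics` layout, the helper names, and the sample numbers are assumptions, and the ratio is assumed to mean the percentage of rows holding a given value. The real stat names and semantics are defined by CDQ.

```python
from datetime import date
from typing import Dict

# Hypothetical metrics record, as it might look after a profiling pass.
metrics = {
    "last_run": date(2024, 1, 1),
    "total_runtime_seconds": 30,
    "row_count": 1000,
    "value_counts": {"GENDER": {"MALE": 700, "FEMALE": 300}},
}


def days_since_last_run(m: Dict[str, object], today: date) -> int:
    """Whole days between the last recorded run and today."""
    return (today - m["last_run"]).days


def value_ratio_pct(m: Dict[str, object], column: str, value: str) -> float:
    """Percentage of rows holding `value` in `column` (assumed meaning)."""
    if m["row_count"] == 0:
        return 0.0
    return 100.0 * m["value_counts"][column].get(value, 0) / m["row_count"]


today = date(2024, 1, 6)
alerts = {
    # $daysSinceLastRun > 4
    "stale": days_since_last_run(metrics, today) > 4,
    # GENDER['MALE'] ratio not between 40 and 60
    "gender_skew": not (40 <= value_ratio_pct(metrics, "GENDER", "MALE") <= 60),
    # $totalRuntimeSeconds > 45
    "slow_job": metrics["total_runtime_seconds"] > 45,
}
print(alerts)  # {'stale': True, 'gender_skew': True, 'slow_job': False}
```

Each check is a comparison against numbers that already exist, which is why no rule needs to touch the underlying table.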
For a more complete list, check out the link below.