Optimizer¶
The optimizer is the main performance lever. Instead of running N queries
against the same table (one per check), it compiles batchable checks into
a single SELECT with one aggregate per metric.
- File: `provero-core/src/provero/core/optimizer.py`
- Entry points: `plan_batch(table, checks)`, `execute_batch(connection, plan)`
Why Batching Matters¶
A naive implementation of a 30-check suite runs 30 queries. Each query pays per-round-trip latency (TCP, parser, planner, first row) that dominates runtime on small-to-medium tables. With batching, all of those aggregates fold into a single scan of the table.
A concrete example. Without batching:
```sql
SELECT COUNT(*) FROM orders WHERE order_id IS NULL;
SELECT COUNT(DISTINCT order_id) FROM orders;
SELECT MIN(amount), MAX(amount) FROM orders;
```
With batching:
```sql
SELECT
  SUM(CASE WHEN "order_id" IS NULL THEN 1 ELSE 0 END) AS nn_order_id_null,
  COUNT("order_id") AS uniq_order_id_total,
  COUNT(DISTINCT "order_id") AS uniq_order_id_distinct,
  MIN("amount") AS range_amount_min,
  MAX("amount") AS range_amount_max,
  COUNT(*) AS _total
FROM "orders"
```
One query. One scan.
What Can Be Batched¶
Six check types are batchable today, declared in `_BATCHABLE_TYPES`: `not_null`, `completeness`, `unique`, `range`, `row_count`, `accepted_values`.
Every other check (`freshness`, `regex`, `custom_sql`, `referential_integrity`, `anomaly`, `row_count_change`, etc.) is appended to `plan.non_batchable` and runs as its own query in the engine's non-batchable path.
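The routing itself is a simple set-membership split. A minimal sketch, assuming checks are plain dicts with a `type` key (only the set name `_BATCHABLE_TYPES` comes from the source; the check shape and function are hypothetical):

```python
# Hypothetical sketch of the batchable/non-batchable split.
_BATCHABLE_TYPES = {
    "not_null", "completeness", "unique",
    "range", "row_count", "accepted_values",
}

def route(checks):
    """Split checks into those that share one scan and those that cannot."""
    batchable, non_batchable = [], []
    for check in checks:
        bucket = batchable if check["type"] in _BATCHABLE_TYPES else non_batchable
        bucket.append(check)
    return batchable, non_batchable
```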
What Cannot Be Batched¶
A few edge cases are routed to non-batchable even within batchable types:

- `accepted_values` with no `values`. Without a list to compare against, there is no metric to emit.
Most non-batchable checks are non-batchable because they need a self-join (`referential_integrity`), a different scan strategy (`freshness` on one column does not share a scan with another table), or historical data (`anomaly` and `row_count_change` both read from the result store).
Building the Plan¶
plan_batch(table, checks) walks the check list and, for each batchable
check, appends one or more BatchedMetric entries to the plan:
```python
@dataclass
class BatchedMetric:
    alias: str        # SQL column alias, e.g. "nn_order_id_null"
    expression: str   # SQL expression, e.g. "SUM(CASE WHEN ... END)"
    check_config: CheckConfig
```
The plan itself holds both the metrics and the checks routed out:
```python
@dataclass
class BatchPlan:
    table: str
    metrics: list[BatchedMetric]
    non_batchable: list[CheckConfig]
```
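To make the shapes concrete, here is a heavily simplified sketch of the planning walk, handling only the `not_null` branch and routing everything else out. The field names follow the dataclasses above; the dict-based check shape and the function body are my assumptions, not the real implementation:

```python
from dataclasses import dataclass, field

@dataclass
class BatchedMetric:
    alias: str
    expression: str
    check_config: dict          # stand-in for the real CheckConfig

@dataclass
class BatchPlan:
    table: str
    metrics: list = field(default_factory=list)
    non_batchable: list = field(default_factory=list)

def plan_batch_sketch(table, checks):
    """Emit one metric per not_null check; route everything else out."""
    plan = BatchPlan(table=table)
    for check in checks:
        if check["type"] == "not_null":
            col = check["column"]
            plan.metrics.append(BatchedMetric(
                alias=f"nn_{col}_null",
                expression=f'SUM(CASE WHEN "{col}" IS NULL THEN 1 ELSE 0 END)',
                check_config=check,
            ))
        else:
            plan.non_batchable.append(check)
    return plan
```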
Per-check metric emission¶
| Check | Emitted expression(s) |
|---|---|
| `not_null: order_id` | `SUM(CASE WHEN "order_id" IS NULL THEN 1 ELSE 0 END) AS nn_order_id_null` |
| `unique: order_id` | `COUNT("order_id") AS uniq_order_id_total`, `COUNT(DISTINCT "order_id") AS uniq_order_id_distinct` |
| `range: {column: amount, min: 0, max: 1000}` | `MIN("amount")`, `MAX("amount")`, `SUM(CASE WHEN "amount" < 0 OR "amount" > 1000 THEN 1 ELSE 0 END)` |
| `accepted_values: {column: status, values: [a, b]}` | `SUM(CASE WHEN "status" NOT IN ('a','b') AND "status" IS NOT NULL THEN 1 ELSE 0 END)` |
| `row_count: {min: 1}` | `COUNT(*) AS _row_count` |
| `completeness: {column: email}` | `COUNT("email") AS comp_email_nonnull` |
A `COUNT(*) AS _total` metric is always added if no other total is present, so every batch has a denominator for ratio-based checks like `completeness`.
Alias sanitization¶
Column names are passed through `_safe_alias()`, which replaces dots with `__dot__` and spaces with underscores. This keeps aliases unambiguous when a column name contains a dot (e.g. `a.b` vs `a_b` would otherwise collide).
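A sketch of that substitution (the name `_safe_alias` comes from the source; this body is a plausible reimplementation, not the actual one):

```python
def safe_alias(column: str) -> str:
    # Dots become "__dot__" so "a.b" and "a_b" stay distinct;
    # spaces become plain underscores.
    return column.replace(".", "__dot__").replace(" ", "_")
```

With a naive dot-to-underscore replacement, `a.b` and `a_b` would both map to `a_b`; the `__dot__` marker keeps them apart.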
Building the Query¶
`build_batch_query(plan)` assembles the metrics into one SELECT, with one subtle optimization: identical expressions are deduplicated. A `COUNT(*)` used by both `row_count` and `completeness` appears once in the final query under a single alias.

The FROM clause goes through `quote_identifier()` (see SQL safety) so identifiers cannot inject SQL.
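The assembly plus deduplication can be sketched like this, assuming metrics arrive as `(alias, expression)` pairs and simplifying `quote_identifier()` to plain double-quoting (both simplifications are mine):

```python
def build_batch_query_sketch(table, metrics):
    """metrics: list of (alias, expression) pairs."""
    # Dedup identical expressions, keeping the first alias seen for each;
    # dict insertion order preserves the original metric order.
    first_alias = {}
    for alias, expr in metrics:
        first_alias.setdefault(expr, alias)
    select_list = ",\n  ".join(
        f"{expr} AS {alias}" for expr, alias in first_alias.items()
    )
    return f'SELECT\n  {select_list}\nFROM "{table}"'
```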
Executing the Batch¶
`execute_batch(connection, plan)` runs the single query and interprets the result row back into one `CheckResult` per original check.
Handling NULL results¶
`SUM` over an empty table returns NULL in most dialects (`COUNT` itself returns 0). The executor coerces every `None` in the row dict to 0 before interpretation, which means a check on an empty table does not crash: it just reports `observed=0`.
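A sketch of that coercion step (the row-dict shape is an assumption):

```python
def coerce_nulls(row: dict) -> dict:
    """Turn NULL aggregates from an empty table into zeros."""
    return {alias: 0 if value is None else value for alias, value in row.items()}
```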
Dedup via processed_checks¶
Some checks emit multiple metrics (e.g. `unique` emits both a total and a distinct count). The executor iterates over `plan.metrics` but uses a `processed_checks` set keyed by `check_type:column` to ensure only one `CheckResult` per original check is produced.
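A sketch of that dedup loop, assuming each metric carries its check's type and column (the real code reads these from `BatchedMetric.check_config`; this flattened shape is mine):

```python
def interpret_sketch(metrics, row):
    """metrics: list of dicts with 'check_type', 'column', 'alias' keys."""
    results = []
    processed_checks = set()
    for m in metrics:
        key = f"{m['check_type']}:{m['column']}"
        if key in processed_checks:
            continue  # second metric of a multi-metric check (e.g. unique)
        processed_checks.add(key)
        results.append((key, row.get(m["alias"], 0)))
    return results
```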
Failing rows query¶
When a check fails, the executor also populates `CheckResult.failing_rows_query` with SQL the user can copy-paste to inspect the bad rows. For example:
- `not_null: order_id` sets `SELECT * FROM "orders" WHERE "order_id" IS NULL`
- `unique: order_id` sets `SELECT "order_id", COUNT(*) FROM "orders" GROUP BY "order_id" HAVING COUNT(*) > 1`
The engine (not the optimizer) later runs this with `LIMIT 5` to populate `failing_rows_sample`, so reports can show actual rows, not just counts.
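Generating those queries is straightforward templating; a sketch covering the two cases above (the function name and the `None` fallback are mine):

```python
def failing_rows_query(check_type, table, column):
    """Build a copy-pasteable query that selects the rows a check flagged."""
    if check_type == "not_null":
        return f'SELECT * FROM "{table}" WHERE "{column}" IS NULL'
    if check_type == "unique":
        return (
            f'SELECT "{column}", COUNT(*) FROM "{table}" '
            f'GROUP BY "{column}" HAVING COUNT(*) > 1'
        )
    return None  # other check types build their own queries
```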
Subtle correctness: unique uses COUNT(col)¶
The `unique` check compares `COUNT(col)` against `COUNT(DISTINCT col)`, not `COUNT(*)`. This matters because `COUNT(DISTINCT)` excludes NULLs in most dialects. If the total were `COUNT(*)` and the column allowed NULLs, the check would report false positives (a column full of unique values plus two NULLs would appear to have a "duplicate").
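The NULL semantics are easy to verify in SQLite, which follows the standard here. Three unique values plus two NULLs give `COUNT(*) = 5` but `COUNT(id) = COUNT(DISTINCT id) = 3`, so the `COUNT(col)` comparison passes while a `COUNT(*)` comparison would flag phantom duplicates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,), (None,), (None,)])

total_rows, total_nonnull, distinct = conn.execute(
    "SELECT COUNT(*), COUNT(id), COUNT(DISTINCT id) FROM t"
).fetchone()
# total_rows = 5, total_nonnull = 3, distinct = 3:
# COUNT(id) == COUNT(DISTINCT id)  -> unique check correctly passes
# COUNT(*)  != COUNT(DISTINCT id)  -> would be a false "duplicate"
```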
This correctness fix is in PR #128.
Failure Modes¶
If the batch query itself fails (bad table, permissions, etc.), the engine catches the exception and emits a single `CheckResult` with `check_type="batch"` and `status=ERROR`. The user sees a hint suggesting `--no-optimize` to isolate the broken check. This prevents a single malformed column from silently taking down the entire suite.

If a single check in the batch has invalid params (e.g. `range` with a non-numeric `min`), `plan_batch()` itself raises `ValueError`. The engine catches it, emits an ERROR result, and continues with `plan=None` so that non-batchable checks still run.
Disabling the Optimizer¶
Pass `optimize=False` to `run_suite()` (or `--no-optimize` on the CLI) to skip batching entirely. Every check then runs as its own query through the engine's single-check path.
This is primarily a debugging tool: if a batch fails, rerunning without optimization makes the specific failing check obvious.