Anomaly Detection

Provero includes built-in statistical anomaly detection that compares current metric values against historical baselines. All detection algorithms use Python's standard library only, with no dependency on scipy or other scientific computing packages.

Overview

Anomaly detection works by collecting metric values over time through the result store and flagging values that deviate significantly from the historical pattern. This means you need to run `provero run` on a regular schedule (for example hourly or daily) to build up a baseline before anomaly checks become effective.

A minimum of 5 historical data points is required before any detection method will flag anomalies. Until that threshold is reached, anomaly checks return a SKIP status.

Configuration

The anomaly check type accepts these parameters:

Parameter	Type	Required	Default	Description
`metric`	string	Yes		Metric to monitor: `row_count`, `null_count`, `null_rate`, `distinct_count`, `mean`, `min`, `max`
`column`	string	No		Column name (required for column-level metrics)
`method`	string	No	`mad`	Detection method: `zscore`, `mad`, `iqr`
`sensitivity`	string	No	`medium`	Sensitivity level: `low`, `medium`, `high`
`threshold`	float	No		Direct threshold override (bypasses sensitivity mapping)

checks:
  - anomaly:
      metric: row_count
      method: mad
      sensitivity: medium

Available Metrics

Provero queries the current value directly from the data source using SQL, then compares it against stored history.

Metric	Requires Column	SQL Used
`row_count`	No	`COUNT(*)`
`null_count`	Yes	`COUNT(*) FILTER (WHERE column IS NULL)`
`null_rate`	Yes	Ratio of null rows to total rows
`distinct_count`	Yes	`COUNT(DISTINCT column)`
`mean`	Yes	`AVG(column)`
`min`	Yes	`MIN(column)`
`max`	Yes	`MAX(column)`
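As a rough illustration of the table above, the sketch below computes `row_count`, `null_count`, `null_rate`, and `distinct_count` against an in-memory SQLite table. The `orders` table, its columns, and the `column_metrics` helper are hypothetical, and the query uses a portable `SUM(CASE ...)` in place of `FILTER`; it is not Provero's actual query code.

```python
import sqlite3


def column_metrics(conn: sqlite3.Connection, table: str, column: str) -> dict:
    """Compute a few column-level metrics with plain SQL (illustrative only)."""
    # Identifiers are interpolated directly here; real code should validate
    # or quote them before building the statement.
    row = conn.execute(
        f"""
        SELECT
            COUNT(*) AS row_count,
            SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) AS null_count,
            COUNT(DISTINCT {column}) AS distinct_count
        FROM {table}
        """
    ).fetchone()
    row_count, null_count, distinct_count = row
    null_count = null_count or 0  # SUM over an empty table yields NULL
    null_rate = null_count / row_count if row_count else 0.0
    return {
        "row_count": row_count,
        "null_count": null_count,
        "null_rate": null_rate,
        "distinct_count": distinct_count,
    }


# Hypothetical sample data: four rows, two of them with a NULL email.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "a@x.io"), (2, None), (3, "a@x.io"), (4, None)],
)
metrics = column_metrics(conn, "orders", "email")
```

Note that `COUNT(DISTINCT column)` ignores NULLs, so a column with many missing values can show a low `distinct_count` and a high `null_rate` at the same time.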

Detection Methods

Z-Score

The Z-Score method calculates how many standard deviations the current value is from the historical mean.

Formula: z = |current - mean| / stdev

Best for: Metrics that follow a roughly normal distribution with stable variance.

Limitation: Sensitive to outliers in the historical data, since both mean and standard deviation are affected by extreme values.

checks:
  - anomaly:
      metric: row_count
      method: zscore
      sensitivity: medium
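The Z-Score formula can be sketched with the standard library's `statistics` module. The `zscore` helper is hypothetical, and using the sample standard deviation (`statistics.stdev`) and the zero-variance handling are assumptions, since the text above does not specify them.

```python
import statistics


def zscore(current: float, history: list[float]) -> float:
    """Return |current - mean| / stdev for the given history."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)  # sample stdev; an assumption
    if stdev == 0:
        # Flat history: any different value is infinitely far away (assumption).
        return 0.0 if current == mean else float("inf")
    return abs(current - mean) / stdev


history = [100, 102, 98, 101, 99]
score = zscore(130, history)  # far above the `medium` threshold of 3.0
```

A value is treated as anomalous when the score exceeds the configured threshold.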

MAD (Median Absolute Deviation)

MAD is the default method. It uses the median instead of the mean and computes `MAD = median(|x_i - median(x)|) * 1.4826`. The scaling factor 1.4826 makes MAD comparable to the standard deviation for normally distributed data.

Formula: modified_z = |current - median| / MAD

Best for: Most use cases. MAD is robust to outliers in the historical data, making it a safer default than Z-Score.

checks:
  - anomaly:
      column: daily_revenue
      metric: mean
      method: mad
      sensitivity: high
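A minimal sketch of the modified Z-score follows, again using only `statistics`. The `mad_score` helper and its zero-MAD handling are assumptions. The sample history deliberately contains one extreme value so the MAD result can be contrasted with a plain Z-score on the same data.

```python
import statistics

MAD_SCALE = 1.4826  # makes MAD comparable to stdev for normal data


def mad_score(current: float, history: list[float]) -> float:
    """Return |current - median| / (scaled MAD) for the given history."""
    median = statistics.median(history)
    mad = statistics.median([abs(x - median) for x in history]) * MAD_SCALE
    if mad == 0:
        # More than half the history is identical (assumed handling).
        return 0.0 if current == median else float("inf")
    return abs(current - median) / mad


# History with one outlier (500) left over from an earlier incident.
history = [100, 102, 98, 101, 99, 500]

robust = mad_score(130, history)
# Plain Z-score on the same data, for comparison.
plain = abs(130 - statistics.mean(history)) / statistics.stdev(history)
```

With this history, the outlier inflates the mean and standard deviation enough that the plain Z-score for 130 stays below 2.0, while the MAD-based score is well above 3.0.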

IQR (Interquartile Range)

The IQR method uses the first (Q1) and third (Q3) quartiles of the historical values to define acceptable bounds: [Q1 - k*IQR, Q3 + k*IQR], where k is the threshold. A current value outside these bounds is flagged as an anomaly.

Formula: IQR = Q3 - Q1, bounds = Q1 - k*IQR to Q3 + k*IQR

Best for: Metrics with skewed distributions where median-based methods might still be too sensitive on one side.

checks:
  - anomaly:
      column: order_count
      metric: distinct_count
      method: iqr
      sensitivity: low
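The bounds can be sketched with `statistics.quantiles`. The helper names are hypothetical, and the choice of the `inclusive` quartile method is an assumption; different quartile conventions give slightly different bounds on small samples.

```python
import statistics


def iqr_bounds(history: list[float], k: float) -> tuple[float, float]:
    """Return (Q1 - k*IQR, Q3 + k*IQR) for the given history."""
    # quantiles(n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(history, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_anomaly(current: float, history: list[float], k: float = 3.0) -> bool:
    """Flag values that fall outside the IQR bounds."""
    lower, upper = iqr_bounds(history, k)
    return current < lower or current > upper


history = [10, 12, 14, 16, 18]
bounds = iqr_bounds(history, 3.0)  # Q1 = 12, Q3 = 16, IQR = 4
```

Because the bounds are anchored at Q1 and Q3 separately, they need not be symmetric around the median.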

Sensitivity Levels

Each sensitivity level maps to a numeric threshold used by the detection algorithm. A lower threshold flags smaller deviations, so `high` sensitivity produces the most anomalies and `low` sensitivity the fewest.

Sensitivity	Threshold	Effect
`high`	2.0	Flags smaller deviations, more false positives
`medium`	3.0	Balanced default
`low`	4.0	Only flags large deviations, fewer false positives

You can also set a threshold value directly to bypass the sensitivity mapping:

checks:
  - anomaly:
      metric: row_count
      method: zscore
      threshold: 2.5
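The precedence between `threshold` and `sensitivity` could be resolved roughly as follows. The `resolve_threshold` function and the error raised for an unknown level are hypothetical; only the mapping values come from the table above.

```python
from typing import Optional

# Mapping taken from the sensitivity table.
SENSITIVITY_THRESHOLDS = {"high": 2.0, "medium": 3.0, "low": 4.0}


def resolve_threshold(
    sensitivity: str = "medium", threshold: Optional[float] = None
) -> float:
    """Return the numeric threshold, preferring an explicit override."""
    if threshold is not None:
        # A direct threshold bypasses the sensitivity mapping entirely.
        return float(threshold)
    if sensitivity not in SENSITIVITY_THRESHOLDS:
        raise ValueError(f"unknown sensitivity: {sensitivity!r}")
    return SENSITIVITY_THRESHOLDS[sensitivity]
```

For example, `resolve_threshold("low", threshold=2.5)` returns 2.5 because the override wins over the `low` mapping.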

How the Result Store Feeds History

Anomaly detection depends on the result store to track metric values across runs. Each time you run `provero run`, the engine:

1. Queries the current metric value from the data source.
2. Loads historical values from the SQLite result store.
3. Runs the detection algorithm.
4. Saves the current value to the store for future comparisons.
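The steps above can be sketched with `sqlite3` and the MAD method. Everything about the storage layout here is an assumption: the `metric_history` table, its columns, the `check_and_store` helper, the `PASS`/`FAIL` status names, and the choice to save the value even when the check is skipped. Step 1 is represented by the `current` argument.

```python
import sqlite3
import statistics

MIN_HISTORY = 5  # minimum data points before detection runs


def check_and_store(
    conn: sqlite3.Connection, metric_key: str, current: float, threshold: float = 3.0
) -> str:
    """Load history, run MAD detection, then persist the current value."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metric_history (metric_key TEXT, value REAL)"
    )
    # Step 2: load historical values in insertion order.
    rows = conn.execute(
        "SELECT value FROM metric_history WHERE metric_key = ? ORDER BY rowid",
        (metric_key,),
    )
    history = [value for (value,) in rows]

    # Step 3: run the detection algorithm, or skip if history is too short.
    if len(history) < MIN_HISTORY:
        status = "SKIP"
    else:
        median = statistics.median(history)
        mad = statistics.median([abs(v - median) for v in history]) * 1.4826
        if mad == 0:
            score = 0.0 if current == median else float("inf")
        else:
            score = abs(current - median) / mad
        status = "FAIL" if score > threshold else "PASS"

    # Step 4: save the current value for future comparisons.
    conn.execute("INSERT INTO metric_history VALUES (?, ?)", (metric_key, current))
    conn.commit()
    return status


conn = sqlite3.connect(":memory:")
values = [100, 102, 98, 101, 99, 100, 250]
statuses = [check_and_store(conn, "orders.row_count", v) for v in values]
```

In this run the first five calls are skipped while the baseline builds up, the sixth passes, and the jump to 250 on the seventh is flagged.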

The result store is enabled by default and persists to a local SQLite database. You can disable it with `--no-store`, but anomaly checks will then return SKIP.

# Normal run: persists results and enables anomaly detection
provero run

# Disable persistence (anomaly checks will SKIP)
provero run --no-store

Use `provero history` to inspect stored results:

provero history --suite orders_quality

Row Count Change Detection

The `row_count_change` check is a simpler alternative to full anomaly detection that compares the current row count against the previous run.

Parameter	Type	Required	Default	Description
`max_decrease`	string	No	`50%`	Maximum allowed decrease percentage
`max_increase`	string	No	`500%`	Maximum allowed increase percentage

checks:
  - row_count_change:
      max_decrease: "10%"
      max_increase: "200%"

This check does not require a minimum number of data points. On the first run, it passes with a "first run" note. On subsequent runs, it compares against the most recent stored row count.
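A minimal sketch of this comparison is shown below. The `parse_percent` and `row_count_change` helpers are hypothetical, as are the assumptions that percentages are relative to the previous row count and that a previous count of zero passes only when the current count is also zero.

```python
from typing import Optional


def parse_percent(text: str) -> float:
    """Convert a string such as '10%' into the fraction 0.10."""
    return float(text.strip().rstrip("%")) / 100.0


def row_count_change(
    current: int,
    previous: Optional[int],
    max_decrease: str = "50%",
    max_increase: str = "500%",
) -> tuple[bool, str]:
    """Return (passed, note) for the change between two row counts."""
    if previous is None:
        return True, "first run"
    if previous == 0:
        # Relative change is undefined; this handling is an assumption.
        return current == 0, "previous row count was 0"

    change = (current - previous) / previous
    if change < -parse_percent(max_decrease):
        return False, f"row count decreased by {-change:.0%}"
    if change > parse_percent(max_increase):
        return False, f"row count increased by {change:.0%}"
    return True, f"row count changed by {change:+.0%}"
```

With `max_decrease: "10%"` and `max_increase: "200%"`, a drop from 100 to 85 rows fails, a drop to 95 passes, and a jump to 350 fails.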

Complete Example

source:
  type: duckdb
  table: read_parquet('data/orders/*.parquet')

checks:
  # Basic volume check
  - row_count:
      min: 1

  # Row count stability between runs
  - row_count_change:
      max_decrease: "20%"
      max_increase: "300%"

  # Anomaly detection on row count trend
  - anomaly:
      metric: row_count
      method: mad
      sensitivity: medium

  # Anomaly detection on column metrics
  - anomaly:
      column: total_amount
      metric: mean
      method: mad
      sensitivity: high

  - anomaly:
      column: customer_id
      metric: distinct_count
      method: iqr
      sensitivity: low

  - anomaly:
      column: email
      metric: null_rate
      method: zscore
      sensitivity: medium