preprocessing.target_corruption.detect_target_corruption

preprocessing.target_corruption.detect_target_corruption(
    df,
    *,
    targets,
    range_mw,
    step_mw,
    window_days,
    deviation_mw=None,
    deviation_ref=None,
    deviation_slots=2,
)

Detect physically-impossible target-column corruption in the native frame.

Applies up to three independent rules on the native-cadence (e.g. 15-min) series within a rolling look-back window ending at the last observed target timestamp:

Range rule (sub-hourly cadence only): an hour is flagged when intra-hour max - intra-hour min > range_mw for any target column. Vacuously skipped for hourly-or-coarser cadence (intra-hour range is undefined on a single slot per hour).
Step rule: an hour is flagged when any |adjacent-slot diff| that touches that hour exceeds step_mw for any target column. Applies to all cadences.
Deviation rule (dropout-only, all cadences): an hour is flagged when target − reference < -deviation_mw holds for at least deviation_slots consecutive native-cadence slots within the scan window, where the reference is a published companion column such as the ENTSO-E day-ahead "Forecasted Load". The rule is asymmetric by design: the known corruption class is exclusively a dropout below the day-ahead forecast, whereas actuals above the forecast are ordinary under-forecasting. A NaN in either column yields a NaN difference, and that comparison evaluates to False, so the publication-lag frontier (forecast published, actual not yet) never flags and a data gap breaks a consecutive run. On hourly-or-coarser cadence the sustained-run requirement collapses to a single slot. The rule is silently skipped when deviation_ref is None or missing from the frame (mirroring how absent target columns are skipped).
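The three rules above can be approximated per slot with plain pandas. The following is a minimal sketch, not the library's implementation: the helper names `range_rule`, `step_rule` and `deviation_rule` are hypothetical, the series is assumed to be sub-hourly with a `DatetimeIndex`, and the window restriction and the expansion to full hours are left out.

```python
import numpy as np
import pandas as pd


def range_rule(s: pd.Series, range_mw: float) -> pd.Series:
    """Per-slot flag: the slot's calendar hour has max - min > range_mw."""
    by_hour = s.groupby(s.index.floor("h"))
    spread = by_hour.transform("max") - by_hour.transform("min")
    return spread > range_mw


def step_rule(s: pd.Series, step_mw: float) -> pd.Series:
    """Per-slot flag: a jump larger than step_mw arrives at or leaves the slot."""
    jump = s.diff().abs() > step_mw               # NaN diffs compare False
    return jump | jump.shift(-1, fill_value=False)


def deviation_rule(target, reference, deviation_mw, deviation_slots):
    """Per-slot flag: dropout below the reference sustained for a full run."""
    below = (target - reference) < -deviation_mw  # NaN difference -> False
    run_id = (~below).cumsum()                    # new id at every non-dropout slot
    run_len = below.astype(int).groupby(run_id).transform("sum")
    return below & (run_len >= deviation_slots)


# Illustrative data (hypothetical values): a two-slot, 12 GW dropout plus a gap.
idx = pd.date_range("2026-06-07", periods=8, freq="15min", tz="UTC")
forecast = pd.Series(48_000.0, index=idx)
actual = forecast.copy()
actual.iloc[2:4] = 36_000.0
actual.iloc[6] = np.nan

range_hits = range_rule(actual, range_mw=10_000)
step_hits = step_rule(actual, step_mw=6_000)
dev_hits = deviation_rule(actual, forecast, deviation_mw=8_000, deviation_slots=2)
print(int(range_hits.sum()), int(step_hits.sum()), int(dev_hits.sum()))
```

In this sketch the step rule marks both slots on either side of an oversized jump, which is one reading of "touches that hour"; the NaN slot never flags under any rule because every comparison against NaN is False.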

Flags are OR-ed across target columns. ALL native-cadence slots of a flagged calendar hour are marked True in the returned boolean Series, so downstream NaN-ing operates on full hours rather than individual sub-hourly slots.
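The OR across targets and the expansion to full calendar hours can be illustrated as follows. This is a sketch under assumptions: `slot_hits` is a hypothetical frame of per-slot rule results (one boolean column per target), and the column names are invented for the example.

```python
import pandas as pd

idx = pd.date_range("2026-06-03", periods=12, freq="15min", tz="UTC")

# Hypothetical per-slot rule hits for two target columns.
slot_hits = pd.DataFrame(False, index=idx, columns=["load_a", "load_b"])
slot_hits.loc[idx[5], "load_a"] = True   # hit at 01:15
slot_hits.loc[idx[9], "load_b"] = True   # hit at 02:15

# OR across target columns, then mark every slot of a flagged hour.
any_target = slot_hits.any(axis=1)
hour = idx.floor("h")
flagged_hours = hour[any_target.to_numpy()]
mask = pd.Series(hour.isin(flagged_hours), index=idx)

print(mask.sum(), "slots flagged")
```

Here the 01:00 and 02:00 hours are each marked in full, while the untouched 00:00 hour stays False, so a downstream step can NaN whole hours at once.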

The detector is inert (returns all-False) unless window_days is set AND at least one of range_mw / step_mw / deviation_mw is set. If the data is shorter than window_days, the window is clamped to df.index.min() without raising.
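The inert guard and the window clamping described above might look like the sketch below. The helper `scan_window` is hypothetical and not part of the package; it assumes all target columns are present, and treating a frame with no observed target values as inert is also an assumption.

```python
import numpy as np
import pandas as pd


def scan_window(df, targets, window_days, *thresholds):
    """Return (start, end) of the scan window, or None when nothing is scanned."""
    # Inert unless window_days is set AND at least one threshold is set.
    if window_days is None or all(t is None for t in thresholds):
        return None
    observed = df[list(targets)].dropna(how="all")
    if observed.empty:  # assumption: no observed target -> nothing to scan
        return None
    end = observed.index.max()  # last observed target timestamp
    # Clamp to the first available timestamp instead of raising.
    start = max(end - pd.Timedelta(days=window_days), df.index.min())
    return start, end


idx = pd.date_range("2026-06-03", periods=48, freq="15min", tz="UTC")
df = pd.DataFrame({"load": 55_000.0}, index=idx)
df.iloc[44:, 0] = np.nan  # trailing slots not yet published

window = scan_window(df, ["load"], 3, 5_000, None, None)
inert = scan_window(df, ["load"], None, 5_000, 8_000, None)
print(window, inert)
```

With only twelve hours of data and a three-day look-back, the start is clamped to the first index entry, and the end sits at the last non-NaN target slot rather than at the end of the frame.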

Parameters

Name	Type	Description	Default
df	pd.DataFrame	Native-cadence `DataFrame` indexed by a `DatetimeIndex`. Must contain all columns listed in `targets`.	required
targets	Sequence[str]	Sequence of target column names to inspect.	required
range_mw	Optional[float]	Maximum allowed intra-hour range (MW). `None` skips the range rule.	required
step_mw	Optional[float]	Maximum allowed absolute adjacent-slot difference (MW). `None` skips the step rule.	required
window_days	Optional[int]	Number of days before the last observed target to include in the scan. `None` makes the detector inert.	required
deviation_mw	Optional[float]	Maximum allowed dropout below the reference column (MW, positive magnitude): slots with `target − reference < -deviation_mw` are candidates. `None` skips the deviation rule.	`None`
deviation_ref	Optional[str]	Name of the reference column (e.g. `"Forecasted Load"`). The rule is skipped when `None` or when the column is absent from `df`. The reference column itself is never checked as a target by this rule.	`None`
deviation_slots	int	Minimum number of consecutive native-cadence slots the dropout must persist before any hour is flagged. The default of `2` reflects that a single-slot blip is more likely a metering glitch than the oscillating dropout class. Clamped to `1` on hourly-or-coarser cadence.	`2`

Returns

Name	Type	Description
	pd.Series	Boolean `pd.Series` aligned to `df.index`. `True` means the slot belongs to a flagged calendar hour. All-`False` when the detector is inert or no corruption is found.

Examples

import pandas as pd
import numpy as np
from spotforecast2_safe.preprocessing.target_corruption import (
    detect_target_corruption,
)

# 15-min cadence starting at 00:00 UTC; one 11 GW dropout at 01:15 inside the window
idx = pd.date_range("2026-06-03", periods=48, freq="15min", tz="UTC")
vals = [55_000.0] * 48
vals[5] = 44_000.0          # 11 GW step drop at 01:15 -> flags the 01:00 hour
df = pd.DataFrame({"load": vals}, index=idx)

mask = detect_target_corruption(
    df, targets=["load"], range_mw=5_000, step_mw=8_000, window_days=3
)
# Slots in the 01:00 hour (index 4-7) are flagged
assert mask.iloc[4:8].all(), "Slots in the flagged hour must be True"
assert not mask.iloc[8:].any(), "Subsequent clean slots must be False"
print("flagged:", mask.sum(), "slots")

flagged: 4 slots

# Deviation rule: a sub-threshold dropout the dynamics rules miss.
import pandas as pd
import numpy as np
from spotforecast2_safe.preprocessing.target_corruption import (
    detect_target_corruption,
)

idx = pd.date_range("2026-06-07", periods=16, freq="15min", tz="UTC")
forecast = pd.Series(48_000.0, index=idx)
actual = forecast.copy()
# Two consecutive slots sit 11.6 GW below the forecast; each step is
# only 5.8 GW, under the 6 GW step limit, and there is no range breach.
actual.iloc[4] = forecast.iloc[4] - 5_800.0
actual.iloc[5] = forecast.iloc[5] - 11_600.0
actual.iloc[6] = forecast.iloc[6] - 11_600.0
actual.iloc[7] = forecast.iloc[7] - 5_800.0
# Publication-lag frontier: forecast published, actual not yet.
actual.iloc[12:] = np.nan
df = pd.DataFrame({"Actual Load": actual, "Forecasted Load": forecast})

dyn_only = detect_target_corruption(
    df, targets=["Actual Load"],
    range_mw=15_000, step_mw=6_000, window_days=3,
)
with_dev = detect_target_corruption(
    df, targets=["Actual Load"],
    range_mw=15_000, step_mw=6_000, window_days=3,
    deviation_mw=8_000, deviation_ref="Forecasted Load",
)
assert not dyn_only.any(), "dynamics rules miss the dropout"
assert with_dev.iloc[4:8].any(), "deviation rule catches it"
assert not with_dev.iloc[12:].any(), "NaN frontier never flags"
print("dynamics-only:", int(dyn_only.sum()), "| with deviation:",
      int(with_dev.sum()))

dynamics-only: 0 | with deviation: 4