preprocessing.target_corruption.detect_target_corruption

preprocessing.target_corruption.detect_target_corruption(
    df,
    *,
    targets,
    range_mw,
    step_mw,
    window_days,
    deviation_mw=None,
    deviation_ref=None,
    deviation_slots=2,
)

Detect physically impossible target-column corruption in the native frame.

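Applies up to three independent rules on the native-cadence (e.g. 15-min) series within a rolling look-back window ending at the last observed target timestamp: a range rule (intra-hour range above range_mw), a step rule (absolute adjacent-slot difference above step_mw), and an optional deviation rule (a sustained dropout of more than deviation_mw below the deviation_ref column).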

Flags are OR-ed across target columns. ALL native-cadence slots of a flagged calendar hour are marked True in the returned boolean Series, so downstream NaN-ing operates on full hours rather than individual sub-hourly slots.
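The range and step rules, and the expansion of a hit to its full calendar hour, can be illustrated with plain pandas. This is a minimal sketch under assumptions, not the library's implementation: the helper name sketch_dynamics_flags is hypothetical, it handles a single target column, and it ignores the look-back window.

```python
import pandas as pd


def sketch_dynamics_flags(series, range_mw, step_mw):
    """Hypothetical illustration of the range and step rules for one column."""
    hour = series.index.floor("h")  # calendar hour of every slot
    grouped = series.groupby(hour)
    # Range rule: (max - min) within a calendar hour exceeds range_mw.
    range_hit = (grouped.transform("max") - grouped.transform("min")) > range_mw
    # Step rule: absolute difference to the previous slot exceeds step_mw.
    # The first slot has no predecessor; NaN compares as False.
    step_hit = series.diff().abs() > step_mw
    # Any hit inside an hour marks every slot of that hour.
    return (range_hit | step_hit).groupby(hour).transform("any")


idx = pd.date_range("2026-06-03", periods=48, freq="15min", tz="UTC")
vals = [55_000.0] * 48
vals[5] = 44_000.0  # 11 GW drop at 01:15
flags = sketch_dynamics_flags(pd.Series(vals, index=idx), range_mw=5_000, step_mw=8_000)
```

With several target columns, the per-column results would be combined with a logical OR, as described above.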

The detector is inert (returns all-False) unless window_days is set AND at least one of range_mw / step_mw / deviation_mw is set. If the data is shorter than window_days, the window is clamped to df.index.min() without raising.
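The inert condition and the window clamping can be sketched as follows. Both helper names are hypothetical, and the sketch assumes that "last observed target timestamp" means the last index value at which any target column is non-NaN; the actual function may resolve this differently.

```python
import pandas as pd


def sketch_is_inert(window_days, range_mw=None, step_mw=None, deviation_mw=None):
    """Hypothetical check: inert without a window or without any threshold."""
    no_threshold = all(v is None for v in (range_mw, step_mw, deviation_mw))
    return window_days is None or no_threshold


def sketch_scan_window(df, targets, window_days):
    """Hypothetical helper: (start, end) of the scan window, or None."""
    if window_days is None:
        return None
    # Assumption: "observed" means at least one target is non-NaN.
    observed = df[list(targets)].dropna(how="all")
    if observed.empty:
        return None
    end = observed.index.max()
    # Clamp to the first available timestamp instead of raising.
    start = max(end - pd.Timedelta(days=window_days), df.index.min())
    return start, end


idx = pd.date_range("2026-06-03", periods=48, freq="15min", tz="UTC")
df = pd.DataFrame({"load": [55_000.0] * 48}, index=idx)
window = sketch_scan_window(df, ["load"], window_days=3)
```

Here the frame covers only 12 hours, so the 3-day window is clamped and starts at df.index.min().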

Parameters

Name Type Description Default
df pd.DataFrame Native-cadence DataFrame indexed by a DatetimeIndex. Must contain all columns listed in targets. required
targets Sequence[str] Sequence of target column names to inspect. required
range_mw Optional[float] Maximum allowed intra-hour range (MW). None skips the range rule. required
step_mw Optional[float] Maximum allowed absolute adjacent-slot difference (MW). None skips the step rule. required
window_days Optional[int] Number of days before the last observed target to include in the scan. None makes the detector inert. required
deviation_mw Optional[float] Maximum allowed dropout below the reference column (MW, positive magnitude): slots with target − reference < -deviation_mw are candidates. None skips the deviation rule. None
deviation_ref Optional[str] Name of the reference column (e.g. "Forecasted Load"). The rule is skipped when None or when the column is absent from df. The reference column itself is never checked as a target by this rule. None
deviation_slots int Minimum number of consecutive sub-hourly slots the dropout must sustain before any hour is flagged. A single-slot blip is more likely a metering glitch than the oscillating dropout class. Clamped to 1 on hourly-or-coarser cadence. 2
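The interaction of deviation_mw and deviation_slots can be pictured as a run-length test on the per-slot dropout flag. The sketch below is an illustration under assumptions, not the library's code: sketch_sustained_dropout is a hypothetical name, and it returns per-slot candidates only, without the expansion to full calendar hours.

```python
import numpy as np
import pandas as pd


def sketch_sustained_dropout(target, reference, deviation_mw, deviation_slots=2):
    """Hypothetical illustration: slots inside a sufficiently long dropout run."""
    # Candidate slots: target sits more than deviation_mw below the reference.
    # NaN targets compare as False, so unpublished slots never qualify.
    below = (target - reference) < -deviation_mw
    # Start a new run id whenever the flag changes value.
    run_id = below.ne(below.shift()).cumsum()
    # Number of True slots in each run (0 for runs of False).
    run_len = below.groupby(run_id).transform("sum")
    return below & (run_len >= deviation_slots)


idx = pd.date_range("2026-06-07", periods=16, freq="15min", tz="UTC")
forecast = pd.Series(48_000.0, index=idx)
actual = forecast.copy()
actual.iloc[[4, 7]] -= 5_800.0
actual.iloc[[5, 6]] -= 11_600.0
actual.iloc[12:] = np.nan  # not yet published

two_slots = sketch_sustained_dropout(actual, forecast, 8_000, deviation_slots=2)
three_slots = sketch_sustained_dropout(actual, forecast, 8_000, deviation_slots=3)
```

Only slots 5 and 6 exceed the 8 GW threshold, so the run qualifies with deviation_slots=2 but not with deviation_slots=3.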

Returns

Name Type Description
pd.Series Boolean pd.Series aligned to df.index. True means the slot belongs to a flagged calendar hour. All-False when the detector is inert or no corruption is found.
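Because the mask is aligned to df.index, it can be used directly for the downstream NaN-ing step. The following is a minimal sketch of one way to do that; the mask is built by hand here as a stand-in for the detector's return value, and the column name "load" is illustrative.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2026-06-03", periods=8, freq="15min", tz="UTC")
df = pd.DataFrame({"load": [55_000.0] * 8}, index=idx)

# Stand-in for the returned mask: every slot of the 01:00 hour is flagged.
mask = pd.Series(idx.hour == 1, index=idx)

# Blank out the target column on flagged slots, leaving the input untouched.
cleaned = df.copy()
cleaned.loc[mask, ["load"]] = np.nan
```

Working on a copy keeps the original frame available for comparison or auditing.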

Examples

import pandas as pd
import numpy as np
from spotforecast2_safe.preprocessing.target_corruption import (
    detect_target_corruption,
)

# 15-min cadence; one 11 GW dropout at 01:15 inside the window
idx = pd.date_range("2026-06-03", periods=48, freq="15min", tz="UTC")
vals = [55_000.0] * 48
vals[5] = 44_000.0          # 11 GW step drop -> flags the 01:00 hour
df = pd.DataFrame({"load": vals}, index=idx)

mask = detect_target_corruption(
    df, targets=["load"], range_mw=5_000, step_mw=8_000, window_days=3
)
# Slots in the 01:00 hour (index 4-7) are flagged
assert mask.iloc[4:8].all(), "Slots in the flagged hour must be True"
assert not mask.iloc[8:].any(), "Subsequent clean slots must be False"
print("flagged:", mask.sum(), "slots")

flagged: 4 slots

# Deviation rule: a sub-threshold dropout the dynamics rules miss.
import pandas as pd
import numpy as np
from spotforecast2_safe.preprocessing.target_corruption import (
    detect_target_corruption,
)

idx = pd.date_range("2026-06-07", periods=16, freq="15min", tz="UTC")
forecast = pd.Series(48_000.0, index=idx)
actual = forecast.copy()
# Two consecutive slots 11.6 GW below the forecast, stepping by
# only 5.8 GW per slot — below a 6 GW step rule, no range breach.
actual.iloc[4] = forecast.iloc[4] - 5_800.0
actual.iloc[5] = forecast.iloc[5] - 11_600.0
actual.iloc[6] = forecast.iloc[6] - 11_600.0
actual.iloc[7] = forecast.iloc[7] - 5_800.0
# Publication-lag frontier: forecast published, actual not yet.
actual.iloc[12:] = np.nan
df = pd.DataFrame({"Actual Load": actual, "Forecasted Load": forecast})

dyn_only = detect_target_corruption(
    df, targets=["Actual Load"],
    range_mw=15_000, step_mw=6_000, window_days=3,
)
with_dev = detect_target_corruption(
    df, targets=["Actual Load"],
    range_mw=15_000, step_mw=6_000, window_days=3,
    deviation_mw=8_000, deviation_ref="Forecasted Load",
)
assert not dyn_only.any(), "dynamics rules miss the dropout"
assert with_dev.iloc[4:8].any(), "deviation rule catches it"
assert not with_dev.iloc[12:].any(), "NaN frontier never flags"
print("dynamics-only:", int(dyn_only.sum()), "| with deviation:",
      int(with_dev.sum()))

dynamics-only: 0 | with deviation: 4