preprocessing.curate_data.remove_duplicate_timestamps

```python
preprocessing.curate_data.remove_duplicate_timestamps(
    df,
    time_col='Time (UTC)',
    agg='mean',
)
```

Resolve duplicate timestamps across all data columns. Groups rows that share the same time_col value and collapses them using the chosen aggregation. All columns except time_col are aggregated. The resulting frame is sorted chronologically, re-indexed, and returned.
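The behavior described above can be sketched in plain pandas. This is an illustrative equivalent only, not the library's actual implementation, which may differ internally:

```python
import pandas as pd

def dedupe_timestamps_sketch(df: pd.DataFrame,
                             time_col: str = "Time (UTC)",
                             agg="mean") -> pd.DataFrame:
    """Illustrative equivalent: collapse rows sharing a timestamp, then sort."""
    if time_col not in df.columns:
        raise KeyError(f"{time_col!r} not found in dataframe")
    return (
        df.groupby(time_col, as_index=False)  # group rows with identical timestamps
          .agg(agg)                           # collapse every non-time column
          .sort_values(time_col)              # chronological order
          .reset_index(drop=True)             # fresh 0..n-1 index
    )
```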

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `pd.DataFrame` | Input dataframe containing `time_col` and one or more data columns. | *required* |
| `time_col` | `str` | Name of the column that holds timestamps. | `'Time (UTC)'` |
| `agg` | `str \| Callable` | Aggregation applied when collapsing duplicate rows. Accepts any string recognised by `pandas.core.groupby.GroupBy.agg` (`"mean"`, `"median"`, `"min"`, `"max"`, `"sum"`, `"std"`, `"var"`, `"first"`, `"last"`), as well as `"mode"` (most frequent value per group) or any custom callable. | `'mean'` |
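Note that `"mode"` is not a native pandas aggregation string, so the function presumably translates it to a callable internally. A hedged sketch of how such a mapping could behave (the helper name `_mode_agg` is hypothetical, not part of the package):

```python
import pandas as pd

def _mode_agg(s: pd.Series):
    """Hypothetical helper: most frequent value in a group (ties -> first mode)."""
    return s.mode().iloc[0]

df = pd.DataFrame({"t": ["x", "x", "x"], "v": [1, 1, 2]})
# Where agg="mode" is requested, a callable like this can be passed to groupby:
result = df.groupby("t", as_index=False).agg(_mode_agg)
```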

Returns

| Type | Description |
| --- | --- |
| `pd.DataFrame` | Deduplicated dataframe with unique `time_col` rows, sorted ascending by timestamp. |

Raises

| Type | Description |
| --- | --- |
| `KeyError` | If `time_col` is not present in `df`. |
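The error path can be exercised as below. Since this page only documents the behavior, the snippet uses a local stand-in (`check_time_col` is hypothetical) rather than the packaged function:

```python
import pandas as pd

def check_time_col(df: pd.DataFrame, time_col: str) -> None:
    """Minimal sketch of the validation the docs describe."""
    if time_col not in df.columns:
        raise KeyError(f"Column {time_col!r} not found in dataframe")

df = pd.DataFrame({"ts": ["2026-01-01"], "v": [1.0]})
try:
    check_time_col(df, "Time (UTC)")  # "Time (UTC)" is absent -> KeyError
except KeyError as exc:
    print(f"caught: {exc}")
```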

Examples

Mean-aggregate two data columns with the default time column:

```python
>>> import pandas as pd
>>> from spotforecast2_safe.preprocessing.curate_data import remove_duplicate_timestamps
>>> df = pd.DataFrame(
...     {
...         "Time (UTC)": [
...             "2026-01-01 00:00:00",
...             "2026-01-01 00:00:00",
...             "2026-01-01 01:00:00",
...         ],
...         "Load A": [100.0, 120.0, 130.0],
...         "Load B": [200.0, 220.0, 210.0],
...     }
... )
>>> out = remove_duplicate_timestamps(df)
>>> print(f"len(out): {len(out)}")
len(out): 2
>>> print(f"Load A: {float(out.loc[0, 'Load A'])}")
Load A: 110.0
>>> print(f"Load B: {float(out.loc[0, 'Load B'])}")
Load B: 210.0
```

Median aggregation on a custom time column:

```python
>>> import pandas as pd
>>> from spotforecast2_safe.preprocessing.curate_data import remove_duplicate_timestamps
>>> df2 = pd.DataFrame(
...     {
...         "ts": ["2026-01-01", "2026-01-01", "2026-01-02"],
...         "value": [10.0, 30.0, 20.0],
...     }
... )
>>> out2 = remove_duplicate_timestamps(
...     df2, time_col="ts", agg="median"
... )
>>> print(f"Value: {float(out2.loc[0, 'value'])}")
Value: 20.0
```
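Since `agg` also accepts a custom callable, the same collapse can use, for example, a peak-to-peak reducer. The sketch below expresses this with plain pandas groupby semantics as a stand-in for the documented call `remove_duplicate_timestamps(df3, time_col="ts", agg=spread)`; `spread` is an illustrative name, not part of the package:

```python
import pandas as pd

df3 = pd.DataFrame(
    {
        "ts": ["2026-01-01", "2026-01-01", "2026-01-02"],
        "value": [10.0, 30.0, 20.0],
    }
)

def spread(s: pd.Series) -> float:
    """Custom reducer: max - min within each duplicated timestamp."""
    return s.max() - s.min()

# Illustrative equivalent of agg=spread on the duplicated timestamps:
out3 = (
    df3.groupby("ts", as_index=False)
       .agg(spread)
       .sort_values("ts")
       .reset_index(drop=True)
)
```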