preprocessing.curate_data.remove_duplicate_timestamps

```python
preprocessing.curate_data.remove_duplicate_timestamps(
    df,
    time_col='Time (UTC)',
    agg='mean',
)
```

Resolve duplicate timestamps across all data columns. Groups rows that share the same time_col value and collapses them using the chosen aggregation. All columns except time_col are aggregated. The resulting frame is sorted chronologically, re-indexed, and returned.
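The behavior described above can be sketched in plain pandas. This is an illustrative equivalent only, not the library's actual implementation, which may differ internally:

```python
import pandas as pd

def dedupe_timestamps_sketch(df: pd.DataFrame,
                             time_col: str = "Time (UTC)",
                             agg="mean") -> pd.DataFrame:
    """Illustrative equivalent: collapse rows sharing a timestamp, then sort."""
    if time_col not in df.columns:
        raise KeyError(f"{time_col!r} not found in dataframe")
    return (
        df.groupby(time_col, as_index=False)  # group rows with identical timestamps
          .agg(agg)                           # collapse every non-time column
          .sort_values(time_col)              # chronological order
          .reset_index(drop=True)             # fresh 0..n-1 index
    )
```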

Parameters

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | `pd.DataFrame` | Input dataframe containing `time_col` and one or more data columns. | *required* |
| `time_col` | `str` | Name of the column that holds timestamps. | `'Time (UTC)'` |
| `agg` | `str \| Callable` | Aggregation applied when collapsing duplicate rows. Accepts any string recognised by `pandas.core.groupby.GroupBy.agg` (`"mean"`, `"median"`, `"min"`, `"max"`, `"sum"`, `"std"`, `"var"`, `"first"`, `"last"`), as well as `"mode"` (most frequent value per group) or any custom callable. | `'mean'` |
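Note that `"mode"` is not a native pandas aggregation string, so the function presumably translates it to a callable internally. A hedged sketch of how such a mapping could behave (the helper name `_mode_agg` is hypothetical, not part of the package):

```python
import pandas as pd

def _mode_agg(s: pd.Series):
    """Hypothetical helper: most frequent value in a group (ties -> first mode)."""
    return s.mode().iloc[0]

df = pd.DataFrame({"t": ["x", "x", "x"], "v": [1, 1, 2]})
# Where agg="mode" is requested, a callable like this can be passed to groupby:
result = df.groupby("t", as_index=False).agg(_mode_agg)
```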

Returns

| Type | Description |
| --- | --- |
| `pd.DataFrame` | Deduplicated dataframe with unique `time_col` rows, sorted ascending by timestamp. |

Raises

| Type | Description |
| --- | --- |
| `KeyError` | If `time_col` is not present in `df`. |
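The error path can be exercised as below. Since this page only documents the behavior, the snippet uses a local stand-in (`check_time_col` is hypothetical) rather than the packaged function:

```python
import pandas as pd

def check_time_col(df: pd.DataFrame, time_col: str) -> None:
    """Minimal sketch of the validation the docs describe."""
    if time_col not in df.columns:
        raise KeyError(f"Column {time_col!r} not found in dataframe")

df = pd.DataFrame({"ts": ["2026-01-01"], "v": [1.0]})
try:
    check_time_col(df, "Time (UTC)")  # "Time (UTC)" is absent -> KeyError
except KeyError as exc:
    print(f"caught: {exc}")
```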

Examples

Mean-aggregate two data columns with the default time column:

```python
>>> import pandas as pd
>>> from spotforecast2_safe.preprocessing.curate_data import remove_duplicate_timestamps
>>> df = pd.DataFrame(
...     {
...         "Time (UTC)": [
...             "2026-01-01 00:00:00",
...             "2026-01-01 00:00:00",
...             "2026-01-01 01:00:00",
...         ],
...         "Load A": [100.0, 120.0, 130.0],
...         "Load B": [200.0, 220.0, 210.0],
...     }
... )
>>> out = remove_duplicate_timestamps(df)
>>> print(f"len(out): {len(out)}")
len(out): 2
>>> print(f"Load A: {float(out.loc[0, 'Load A'])}")
Load A: 110.0
>>> print(f"Load B: {float(out.loc[0, 'Load B'])}")
Load B: 210.0
```

Median aggregation on a custom time column:

```python
>>> import pandas as pd
>>> from spotforecast2_safe.preprocessing.curate_data import remove_duplicate_timestamps
>>> df2 = pd.DataFrame(
...     {
...         "ts": ["2026-01-01", "2026-01-01", "2026-01-02"],
...         "value": [10.0, 30.0, 20.0],
...     }
... )
>>> out2 = remove_duplicate_timestamps(
...     df2, time_col="ts", agg="median"
... )
>>> print(f"Value: {float(out2.loc[0, 'value'])}")
Value: 20.0
```
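Since `agg` also accepts a custom callable, the same collapse can use, for example, a peak-to-peak reducer. The sketch below expresses this with plain pandas groupby semantics as a stand-in for the documented call `remove_duplicate_timestamps(df3, time_col="ts", agg=spread)`; `spread` is an illustrative name, not part of the package:

```python
import pandas as pd

df3 = pd.DataFrame(
    {
        "ts": ["2026-01-01", "2026-01-01", "2026-01-02"],
        "value": [10.0, 30.0, 20.0],
    }
)

def spread(s: pd.Series) -> float:
    """Custom reducer: max - min within each duplicated timestamp."""
    return s.max() - s.min()

# Illustrative equivalent of agg=spread on the duplicated timestamps:
out3 = (
    df3.groupby("ts", as_index=False)
       .agg(spread)
       .sort_values("ts")
       .reset_index(drop=True)
)
```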