Resolve Duplicate Timestamps

remove_duplicate_timestamps() collapses rows that share the same timestamp into a single row by applying an aggregation function to every data column. By default the time column is "Time (UTC)" and the aggregation is "mean"; both are configurable through the time_col and agg arguments.

Basic usage: multiple data columns, mean aggregation

import pandas as pd
from spotforecast2_safe.preprocessing.curate_data import remove_duplicate_timestamps

df = pd.DataFrame(
    {
        "Time (UTC)": [
            "2026-01-01 00:00:00",
            "2026-01-01 00:00:00",   # duplicate
            "2026-01-01 01:00:00",
        ],
        "Load A": [100.0, 120.0, 130.0],
        "Load B": [200.0, 220.0, 210.0],
    }
)

clean_df = remove_duplicate_timestamps(df=df)
clean_df
            Time (UTC)  Load A  Load B
0  2026-01-01 00:00:00   110.0   210.0
1  2026-01-01 01:00:00   130.0   210.0

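Both Load A and Load B are averaged over the two rows stamped 00:00:00: Load A → (100.0 + 120.0) / 2 = 110.0, Load B → (200.0 + 220.0) / 2 = 210.0. The 01:00:00 row has no duplicate and passes through unchanged.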
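For intuition, the same result can be reproduced in plain pandas. This is only a sketch, assuming the function behaves like a group-by on the time column; it is not the library's actual implementation, and sketch_df is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Time (UTC)": [
            "2026-01-01 00:00:00",
            "2026-01-01 00:00:00",   # duplicate
            "2026-01-01 01:00:00",
        ],
        "Load A": [100.0, 120.0, 130.0],
        "Load B": [200.0, 220.0, 210.0],
    }
)

# Assumption: collapsing duplicates is equivalent to grouping by the
# time column and averaging every remaining column.
# as_index=False keeps "Time (UTC)" as a regular column.
sketch_df = df.groupby("Time (UTC)", as_index=False).mean()
```

Here sketch_df has two rows, one per distinct timestamp, with the averaged values shown in the output above.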

Custom time column

Pass time_col when the timestamp column has a different name:

df2 = pd.DataFrame(
    {
        "measurement_time": [
            "2026-03-01 06:00:00",
            "2026-03-01 06:00:00",
            "2026-03-01 07:00:00",
        ],
        "sensor_1": [10.0, 14.0, 12.0],
        "sensor_2": [5.0, 7.0, 6.0],
    }
)

clean_df2 = remove_duplicate_timestamps(
    df=df2,
    time_col="measurement_time",
)
clean_df2
      measurement_time  sensor_1  sensor_2
0  2026-03-01 06:00:00      12.0       6.0
1  2026-03-01 07:00:00      12.0       6.0
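Before collapsing, it can help to see which rows would be merged. The sketch below uses only standard pandas (Series.duplicated) and does not call the library; dup_mask and dup_rows are illustrative names.

```python
import pandas as pd

df2 = pd.DataFrame(
    {
        "measurement_time": [
            "2026-03-01 06:00:00",
            "2026-03-01 06:00:00",
            "2026-03-01 07:00:00",
        ],
        "sensor_1": [10.0, 14.0, 12.0],
        "sensor_2": [5.0, 7.0, 6.0],
    }
)

# keep=False flags every member of a duplicate group,
# not only the second and later occurrences.
dup_mask = df2["measurement_time"].duplicated(keep=False)

# Rows that share a timestamp with at least one other row.
dup_rows = df2[dup_mask]
```

For df2, dup_rows contains the two 06:00:00 rows, which are the ones averaged into the single 06:00:00 row above.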

Alternative aggregation functions

Supported string values: "mean" (default), "median", "min", "max", "sum", "std", "var", "first", "last", "mode". Any callable is also accepted.

df3 = pd.DataFrame(
    {
        "Time (UTC)": [
            "2026-01-01 00:00:00",
            "2026-01-01 00:00:00",
            "2026-01-01 00:00:00",
            "2026-01-01 01:00:00",
        ],
        "load": [10.0, 10.0, 90.0, 55.0],
    }
)

results = {}
for fn in ("mean", "median", "min", "max", "mode"):
    out = remove_duplicate_timestamps(
        df=df3.copy(), agg=fn
    )
    results[fn] = float(out.loc[0, "load"])

pd.DataFrame.from_dict(results, orient="index", columns=["00:00 value"])
        00:00 value
mean      36.666667
median    10.000000
min       10.000000
max       90.000000
mode      10.000000
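Since any callable is also accepted, a custom reducer could presumably be passed as agg. The sketch below defines a hypothetical reducer, value_range, and applies it with a plain pandas group-by. Passing value_range to remove_duplicate_timestamps is assumed to behave similarly, but that is not confirmed by the examples above.

```python
import pandas as pd

df3 = pd.DataFrame(
    {
        "Time (UTC)": [
            "2026-01-01 00:00:00",
            "2026-01-01 00:00:00",
            "2026-01-01 00:00:00",
            "2026-01-01 01:00:00",
        ],
        "load": [10.0, 10.0, 90.0, 55.0],
    }
)


def value_range(values: pd.Series) -> float:
    """Hypothetical reducer: spread between largest and smallest value."""
    return float(values.max() - values.min())


# Plain-pandas stand-in for a callable aggregation: one value per timestamp.
range_df = (
    df3.groupby("Time (UTC)")["load"]
    .agg(value_range)
    .reset_index()
)
```

For the 00:00:00 group the range is 90.0 - 10.0 = 80.0; the single 01:00:00 row yields 0.0. A reducer like this must return one scalar per group so that each timestamp still maps to exactly one row.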