Appendix E — Datasets

E.1 The Diabetes Data Set

This section describes the Diabetes data set, a PyTorch Dataset for regression that is derived from the Diabetes data set in scikit-learn (sklearn). Ten baseline variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) were obtained for each of n = 442 diabetes patients, together with the response of interest, a quantitative measure of disease progression one year after baseline.

E.1.1 Data Exploration of the sklearn Diabetes Data Set

Each of the ten feature variables has been mean centered and scaled by the standard deviation times the square root of n_samples (i.e., the sum of squares of each column totals 1). The target can be plotted against each feature as follows:

from sklearn.datasets import load_diabetes
from spotpython.plot.xy import plot_y_vs_X

data = load_diabetes()
X, y = data.data, data.target
plot_y_vs_X(X, y, nrows=5, ncols=2, figsize=(20, 15))
The feature s3_hdl shows a different behavior than the other features: it has a negative slope. HDL (high-density lipoprotein) cholesterol, sometimes called “good” cholesterol, absorbs cholesterol in the blood and carries it back to the liver. The liver then flushes it from the body. High levels of HDL cholesterol can lower your risk for heart disease and stroke.
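The scaling and the sign of the s3_hdl trend can be checked numerically. The following sketch is an addition for illustration; it only relies on the standard sklearn and numpy APIs (load_diabetes, numpy.allclose, numpy.corrcoef) and uses column index 6 for s3_hdl, which matches the feature order of the sklearn data set.

import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)

# Columns are mean centered and scaled so that each column's sum of squares is 1
print(np.allclose(X.mean(axis=0), 0.0))        # expected: True
print(np.allclose((X ** 2).sum(axis=0), 1.0))  # expected: True

# s3_hdl is column 6; its linear association with the target should be negative
print(np.corrcoef(X[:, 6], y)[0, 1])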
E.1.2 Generating the PyTorch Data Set
spotpython provides a Diabetes class to load the diabetes data set. The Diabetes class is a subclass of torch.utils.data.Dataset. It loads the diabetes data set from sklearn and returns it as a torch.utils.data.Dataset object, so that features and targets can be accessed as torch.tensors. The following code loads the data set and prints its size and feature names:
from spotpython.data.diabetes import Diabetes

data_set = Diabetes()
print(len(data_set))
print(data_set.names)
442
['age', 'sex', 'bmi', 'bp', 's1_tc', 's2_ldl', 's3_hdl', 's4_tch', 's5_ltg', 's6_glu']
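Because Diabetes is a torch.utils.data.Dataset, individual samples can be indexed and the data set can be wrapped in a standard DataLoader. The sketch below assumes that indexing returns a (features, target) pair of tensors, which is the usual convention for regression data sets; the batch size of 32 is chosen only for illustration.

from torch.utils.data import DataLoader
from spotpython.data.diabetes import Diabetes

data_set = Diabetes()

# Assumption: data_set[i] returns a (features, target) pair of torch tensors
X0, y0 = data_set[0]
print(X0.shape, y0)

# Mini-batch access via a standard PyTorch DataLoader
loader = DataLoader(data_set, batch_size=32, shuffle=True)
X_batch, y_batch = next(iter(loader))
print(X_batch.shape)  # expected: torch.Size([32, 10])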
E.2 The Friedman Drift Dataset
E.2.1 The Friedman Drift Dataset as Implemented in river
We describe the Friedman synthetic dataset with concept drifts as implemented in the river package; see also Friedman (1991) and Ikonomovska, Gama, and Džeroski (2011). Each observation is composed of ten features. Each feature value is sampled uniformly in [0, 1]. Only the first five features are relevant. The target is defined by different functions depending on the type of the drift. Global Recurring Abrupt drift will be used, i.e., the concept drift appears over the whole instance space.
The target is defined by the following function: \[ y = 10 \sin(\pi x_0 x_1) + 20 (x_2 - 0.5)^2 + 10 x_3 + 5 x_4 + \epsilon, \] where \(\epsilon \sim \mathcal{N}(0, 1)\) is normally distributed noise.
If the Global Recurring Abrupt drift variant of the Friedman Drift dataset is used, the target function changes at two points in time, namely \(p_1\) and \(p_2\). At the first point, the concept changes to: \[ y = 10 \sin(\pi x_3 x_5) + 20 (x_1 - 0.5)^2 + 10 x_0 + 5 x_2 + \epsilon, \] At the second point of drift the old concept reoccurs. This can be implemented as follows, see https://riverml.xyz/latest/api/datasets/synth/FriedmanDrift/:
def __iter__(self):
    rng = random.Random(self.seed)

    i = 0
    while True:
        x = {i: rng.uniform(a=0, b=1) for i in range(10)}
        y = self._global_recurring_abrupt_gen(x, i) + rng.gauss(mu=0, sigma=1)

        yield x, y
        i += 1

def _global_recurring_abrupt_gen(self, x, index: int):
    if index < self._change_point1 or index >= self._change_point2:
        # The initial concept is recurring
        return (
            10 * math.sin(math.pi * x[0] * x[1]) + 20 * (x[2] - 0.5) ** 2 + 10 * x[3] + 5 * x[4]
        )
    else:
        # Drift: the positions of the features are swapped
        return (
            10 * math.sin(math.pi * x[3] * x[5]) + 20 * (x[1] - 0.5) ** 2 + 10 * x[0] + 5 * x[2]
        )
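Drawing a few samples from the stream shows the format that river produces: each x is a dict with the feature indices 0 to 9 as keys, and y is a float. The drift positions (100, 200) and the seed below are chosen only for illustration; itertools.islice is used to truncate the otherwise infinite stream.

import itertools
from river.datasets import synth

dataset = synth.FriedmanDrift(drift_type='gra', position=(100, 200), seed=42)

for x, y in itertools.islice(dataset, 3):
    print({k: round(v, 3) for k, v in x.items()}, round(y, 3))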
spotpython requires the specification of a train and a test data set. These data sets can be generated as follows:
from river.datasets import synth
import pandas as pd
import numpy as np
from spotriver.utils.data_conversion import convert_to_df

seed = 123
shuffle = True
n_train = 6_000
n_test = 4_000
n_samples = n_train + n_test
target_column = "y"

dataset = synth.FriedmanDrift(
    drift_type='gra',
    position=(n_train/4, n_train/2),
    seed=123
)
train = convert_to_df(dataset, n_total=n_train)
train.columns = [f"x{i}" for i in range(1, 11)] + [target_column]

dataset = synth.FriedmanDrift(
    drift_type='gra',
    position=(n_test/4, n_test/2),
    seed=123
)
test = convert_to_df(dataset, n_total=n_test)
test.columns = [f"x{i}" for i in range(1, 11)] + [target_column]
import matplotlib.pyplot as plt

def plot_data_with_drift_points(data, target_column, n_train, title=""):
    indices = range(len(data))
    y_values = data[target_column]

    plt.figure(figsize=(10, 6))
    plt.plot(indices, y_values, label="y Value", color='blue')

    drift_points = [n_train / 4, n_train / 2]
    for dp in drift_points:
        plt.axvline(x=dp, color='red', linestyle='--', label=f'Drift Point at {int(dp)}')

    # Deduplicate the legend entries produced by the repeated axvline labels
    handles, labels = plt.gca().get_legend_handles_labels()
    by_label = dict(zip(labels, handles))
    plt.legend(by_label.values(), by_label.keys())

    plt.xlabel('Index')
    plt.ylabel('Target Value (y)')
    plt.title(title)
    plt.grid(True)
    plt.show()

plot_data_with_drift_points(train, target_column, n_train, title="Training Data with Drift Points")
plot_data_with_drift_points(test, target_column, n_test, title="Testing Data with Drift Points")
E.2.2 The Friedman Drift Data Set from spotpython
A data generator for the Friedman Drift dataset is implemented in the spotpython package, see friedman.py. The spotpython version is a simplified version of the river implementation. It allows the generation of constant input values for the features, which is useful for visualizing the concept drifts. For productive use, the river version should be used.

Plotting the first 100 samples of the Friedman Drift dataset, we cannot see the concept drifts at \(p_1\) and \(p_2\). The drift can be visualized by plotting the target values over time for constant features, e.g., if \(x_0\) is set to \(1\) and all other features are set to \(0\). This is illustrated in the following plot.
from spotpython.data.friedman import FriedmanDriftDataset

def plot_friedman_drift_data(n_samples, seed, change_point1, change_point2, constant=True):
    data_generator = FriedmanDriftDataset(n_samples=n_samples, seed=seed, change_point1=change_point1, change_point2=change_point2, constant=constant)
    data = [data for data in data_generator]
    indices = [i for _, _, i in data]

    values = {f"x{i}": [] for i in range(5)}
    values["y"] = []
    for x, y, _ in data:
        for i in range(5):
            values[f"x{i}"].append(x[i])
        values["y"].append(y)

    plt.figure(figsize=(10, 6))
    for label, series in values.items():
        plt.plot(indices, series, label=label)
    plt.xlabel('Index')
    plt.ylabel('Value')
    plt.axvline(x=change_point1, color='k', linestyle='--', label='Drift Point 1')
    plt.axvline(x=change_point2, color='r', linestyle='--', label='Drift Point 2')
    plt.legend()
    plt.grid(True)
    plt.show()

plot_friedman_drift_data(n_samples=100, seed=42, change_point1=50, change_point2=75, constant=False)
plot_friedman_drift_data(n_samples=100, seed=42, change_point1=50, change_point2=75, constant=True)