sea

SEA

Bases: SyntheticDataset

SEA synthetic dataset.

Implementation of the data stream with abrupt drift described in [1]. Each observation is composed of 3 features, of which only the first two are relevant. The target is binary, and is positive if the sum of the first two features exceeds a threshold. There are 4 thresholds to choose from, one per variant. Concept drift can be introduced by switching the threshold at any point during the stream.

  • Variant 0: True if att1 + att2 > 8

  • Variant 1: True if att1 + att2 > 9

  • Variant 2: True if att1 + att2 > 7

  • Variant 3: True if att1 + att2 > 9.5
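The variant rules above can be written as one small function. This is a standalone sketch, not part of the `spotriver` API: the names `THRESHOLDS` and `sea_label` are hypothetical, and only the threshold values come from the list above.

```python
# Thresholds copied from the variant list above; the names are illustrative.
THRESHOLDS = {0: 8, 1: 9, 2: 7, 3: 9.5}


def sea_label(att1: float, att2: float, variant: int = 0) -> bool:
    """Return the noise-free SEA label for the two relevant features."""
    if variant not in THRESHOLDS:
        raise ValueError("Unknown variant, possible choices are: 0, 1, 2, 3")
    # The third feature is irrelevant, so it is not an argument here.
    return att1 + att2 > THRESHOLDS[variant]


# The same point can change class when the variant (threshold) changes.
point = (5.0, 4.0)  # att1 + att2 = 9.0
labels = {v: sea_label(*point, variant=v) for v in THRESHOLDS}
print(labels)  # {0: True, 1: False, 2: True, 3: False}
```

Note that the comparison is strict, so a point whose sum equals the threshold exactly is labelled `False`.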

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| variant | int | The variant of the data stream to use. Can be 0, 1, 2, or 3. | 0 |
| noise | float | The probability of generating label noise. | 0.0 |
| seed | int | Random seed for reproducibility. | None |

Returns:

| Type | Description |
| --- | --- |
| Generator | A generator of features and labels. |

Examples:

>>> from spotriver.data.synth import SEA
>>> dataset = SEA(variant=0, seed=42)
>>> for x, y in dataset.take(5):
...     print(x, y)
{0: 6.39426, 1: 0.25010, 2: 2.75029} False
{0: 2.23210, 1: 7.36471, 2: 6.76699} True
{0: 8.92179, 1: 0.86938, 2: 4.21921} True
{0: 0.29797, 1: 2.18637, 2: 5.05355} False
{0: 0.26535, 1: 1.98837, 2: 6.49884} False

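The description mentions that drift is introduced by switching the threshold mid-stream. One way to do this is to chain fixed-length segments from different variants, as in the sketch below. It uses only the standard library and re-implements the generator loop so that it runs on its own. `sea_stream` and `drifting_stream` are hypothetical helper names, and giving each segment the seed `seed + k` is an assumption made here for reproducibility, not something the library prescribes.

```python
import itertools
import random

THRESHOLDS = {0: 8, 1: 9, 2: 7, 3: 9.5}


def sea_stream(variant=0, seed=None):
    """Endless noise-free SEA stream for a single variant."""
    rng = random.Random(seed)
    threshold = THRESHOLDS[variant]
    while True:
        x = {i: rng.uniform(0, 10) for i in range(3)}
        yield x, x[0] + x[1] > threshold


def drifting_stream(variants, segment_length, seed=None):
    """Concatenate one segment per variant; each boundary is an abrupt drift."""
    for k, variant in enumerate(variants):
        # Assumption: derive a distinct seed per segment so segments differ.
        segment_seed = None if seed is None else seed + k
        yield from itertools.islice(sea_stream(variant, segment_seed), segment_length)


# Threshold 8 for the first 1000 samples, then 9.5 for the next 1000.
stream = list(drifting_stream([0, 3], segment_length=1000, seed=42))
before = sum(y for _, y in stream[:1000]) / 1000
after = sum(y for _, y in stream[1000:]) / 1000
print(f"positive rate before drift: {before:.2f}, after drift: {after:.2f}")
```

Because the higher threshold makes positives rarer, the share of `True` labels drops after the switch. A drift detector or adaptive model would be expected to react to that change.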
References

[1]: A Streaming Ensemble Algorithm (SEA) for Large-Scale Classification

Source code in spotriver/data/synth/sea.py
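import random

# Assumption: `datasets` is River's datasets package; the original import lines
# (source lines 1-4) are not shown in this listing.
from river import datasets


class SEA(datasets.base.SyntheticDataset):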
    """SEA synthetic dataset.

    Implementation of the data stream with abrupt drift described in [1]. Each observation is
    composed of 3 features. Only the first two features are relevant. The target is binary, and is
    positive if the sum of the first two features exceeds a threshold. There are 4 thresholds to
    choose from, one per variant. Concept drift can be introduced by switching the threshold at
    any point during the stream.

    * **Variant 0**: `True` if `att1 + att2 > 8`

    * **Variant 1**: `True` if `att1 + att2 > 9`

    * **Variant 2**: `True` if `att1 + att2 > 7`

    * **Variant 3**: `True` if `att1 + att2 > 9.5`

    Args:
        variant (int): The variant of the data stream to use. Can be 0, 1, 2, or 3.
        noise (float): The probability of generating label noise.
        seed (int): Random seed for reproducibility.

    Returns:
        (Generator): A generator of features and labels.

    Examples:
        >>> from spotriver.data.synth import SEA
        >>> dataset = SEA(variant=0, seed=42)
        >>> for x, y in dataset.take(5):
        ...     print(x, y)
            {0: 6.39426, 1: 0.25010, 2: 2.75029} False
            {0: 2.23210, 1: 7.36471, 2: 6.76699} True
            {0: 8.92179, 1: 0.86938, 2: 4.21921} True
            {0: 0.29797, 1: 2.18637, 2: 5.05355} False
            {0: 0.26535, 1: 1.98837, 2: 6.49884} False

    References:
        [1]: A Streaming Ensemble Algorithm (SEA) for Large-Scale Classification
    """

    def __init__(self, variant=0, noise=0.0, seed: int = None):
        super().__init__(n_features=3, task=datasets.base.BINARY_CLF)

        if variant not in (0, 1, 2, 3):
            raise ValueError("Unknown variant, possible choices are: 0, 1, 2, 3")

        self.variant = variant
        self.noise = noise
        self.seed = seed
        self._threshold = {0: 8, 1: 9, 2: 7, 3: 9.5}[variant]

    def __iter__(self):
        rng = random.Random(self.seed)

        while True:
            x = {i: rng.uniform(0, 10) for i in range(3)}
            y = x[0] + x[1] > self._threshold

            if self.noise and rng.random() < self.noise:
                y = not y

            yield x, y

    @property
    def _repr_content(self):
        return {**super()._repr_content, "Variant": str(self.variant)}
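The effect of the `noise` parameter can be checked empirically. The sketch below copies the loop from `__iter__` into a standalone generator (the names `sea_stream` and `flip_rate` are hypothetical) and counts how often the emitted label disagrees with the noise-free rule. It also illustrates a consequence of the `if self.noise and ...` guard: the extra random draw only happens when `noise` is non-zero, so with the same seed the features of a noisy stream diverge from those of a clean stream after the first sample.

```python
import itertools
import random


def sea_stream(threshold=8, noise=0.0, seed=None):
    """Standalone copy of the generator loop shown in the source above."""
    rng = random.Random(seed)
    while True:
        x = {i: rng.uniform(0, 10) for i in range(3)}
        y = x[0] + x[1] > threshold
        # The extra draw is consumed only when noise is non-zero.
        if noise and rng.random() < noise:
            y = not y
        yield x, y


def flip_rate(noise, n=20_000, seed=42, threshold=8):
    """Fraction of labels that disagree with the noise-free rule."""
    sample = itertools.islice(sea_stream(threshold, noise, seed), n)
    flipped = sum(y != (x[0] + x[1] > threshold) for x, y in sample)
    return flipped / n


clean_rate = flip_rate(0.0)
noisy_rate = flip_rate(0.2)
print(clean_rate, round(noisy_rate, 3))  # 0.0 and a value close to 0.2

# Same seed, different noise: the first sample matches, later ones do not.
clean = list(itertools.islice(sea_stream(noise=0.0, seed=42), 2))
noisy = list(itertools.islice(sea_stream(noise=0.2, seed=42), 2))
same_first = clean[0][0] == noisy[0][0]
same_second = clean[1][0] == noisy[1][0]
print(same_first, same_second)
```

If identical features across noise levels matter for an experiment, one option is to draw the noise from a second, separately seeded `random.Random` instance. That would be a change to the behavior shown above, not something the current code does.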