
split

calculate_data_split(test_size, full_size, verbosity=0, stage=None)

Calculates the split sizes for training, validation, and test datasets. Returns a tuple containing the sizes (full_train_size, val_size, train_size, test_size), where full_train_size is the size of the full dataset minus the test set. If test_size is a float, the returned values are proportions of the full dataset; if it is an int, they are absolute sample counts.

Note

The first return value is full_train_size, i.e., the size of the full dataset minus the test set.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `test_size` | float or int | The size of the test set. Can be a float for a proportion or an int for an absolute number of test samples. | required |
| `full_size` | int | The size of the full dataset. | required |
| `verbosity` | int | The level of verbosity for debug output. Defaults to 0. | `0` |
| `stage` | str | The stage of setup, for debug output if needed. | `None` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `tuple` | tuple | A tuple containing the sizes (full_train_size, val_size, train_size, test_size). |

Examples:

>>> from spotpython.utils.split import calculate_data_split
>>> # Using proportion for test size
>>> calculate_data_split(0.2, 1000)
(0.8, 0.16, 0.64, 0.2)
>>> # Using absolute number for test size
>>> calculate_data_split(200, 1000)
(800, 160, 640, 200)

Raises:

| Type | Description |
| --- | --- |
| ValueError | If test_size (given as an int) exceeds full_size. |
| ValueError | If the sizes are not correct, i.e., full_size != train_size + val_size + test_size. |

Source code in spotpython/utils/split.py
def calculate_data_split(test_size, full_size, verbosity=0, stage=None) -> tuple:
    """
    Calculates the split sizes for training, validation, and test datasets.
    Returns a tuple containing the sizes (full_train_size, val_size, train_size, test_size),
    where full_train_size is the size of the full dataset minus the test set.

    Note:
        The first return value is full_train_size, i.e.,
        the size of the full dataset minus the test set.

    Args:
        test_size (float or int):
            The size of the test set.
            Can be a float for proportion or an int for absolute number of test samples.
        full_size (int):
            The size of the full dataset.
        verbosity (int, optional):
            The level of verbosity for debug output. Defaults to 0.
        stage (str, optional):
            The stage of setup, for debug output if needed.

    Returns:
        tuple: A tuple containing the sizes (full_train_size, val_size, train_size, test_size).

    Examples:
        >>> from spotpython.utils.split import calculate_data_split
        >>> # Using proportion for test size
        >>> calculate_data_split(0.2, 1000)
        (0.8, 0.16, 0.64, 0.2)
        >>> # Using absolute number for test size
        >>> calculate_data_split(200, 1000)
        (800, 160, 640, 200)

    Raises:
        ValueError: If test_size (given as an int) exceeds full_size.
        ValueError: If the sizes are not correct, i.e., full_size != train_size + val_size + test_size.
    """
    if isinstance(test_size, float):
        full_train_size = round(1.0 - test_size, 2)
        val_size = round(full_train_size * test_size, 2)
        train_size = 1.0 - test_size - val_size
        # check if the proportions are correct, i.e., full_train_size + test_size == 1.0
        if full_train_size + test_size != 1.0:
            raise ValueError(f"full_train_size ({full_train_size}) + test_size ({test_size}) != 1.0")
    else:
        # test_size is considered an int, training size calculation directly based on it
        # everything is calculated as an int
        # return values are also ints
        # check if test_size does not exceed full_size
        if test_size > full_size:
            raise ValueError(f"test_size ({test_size}) > full_size ({full_size})")
        full_train_size = full_size - test_size
        val_size = int(full_train_size * test_size / full_size)
        train_size = full_train_size - val_size
        # check if the sizes are correct, i.e., full_size == train_size + val_size + test_size
        if train_size + val_size + test_size != full_size:
            raise ValueError(f"full_size ({full_size}) != train_size ({train_size}) + val_size ({val_size}) + test_size ({test_size})")

    if verbosity > 0:
        print(f"stage: {stage}")
    if verbosity > 1:
        print(f"full_train_size: {full_train_size}")
        print(f"val_size: {val_size}")
        print(f"train_size: {train_size}")
        print(f"test_size: {test_size}")

    return full_train_size, val_size, train_size, test_size
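Viewed side by side, the two branches differ mainly in their units: the float branch works in proportions of the full dataset, while the int branch works in absolute sample counts. A standalone sketch of this logic (mirroring the function above, not importing spotpython itself):

```python
def split_sizes(test_size, full_size):
    # Sketch mirroring calculate_data_split's two branches.
    if isinstance(test_size, float):
        # Float mode: all values are proportions of the full dataset.
        full_train = round(1.0 - test_size, 2)
        # Validation takes the same fraction of the remainder as the test set.
        val = round(full_train * test_size, 2)
        train = 1.0 - test_size - val
    else:
        # Int mode: all values are absolute sample counts.
        full_train = full_size - test_size
        val = int(full_train * test_size / full_size)
        train = full_train - val
    return full_train, val, train, test_size

print(split_sizes(0.2, 1000))  # proportions: (0.8, 0.16, 0.64, 0.2)
print(split_sizes(200, 1000))  # counts: (800, 160, 640, 200)
```

Note that 0.2 and 200 describe the same split of 1000 samples; only the units of the returned tuple change.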

compute_lengths_from_fractions(fractions, dataset_length)

Compute lengths of dataset splits from given fractions.

Given a list of fractions that sum up to 1, compute the lengths of each corresponding partition of a dataset with a specified length. Each length is determined as floor(frac * dataset_length). Any remaining items (due to flooring) are distributed among the partitions in a round-robin fashion.
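The floor-then-round-robin scheme described above can be sketched in a few lines (a simplified illustration without the validation and warnings of the real function):

```python
import math

def lengths_from_fractions(fractions, n):
    # Floor each fraction of n, then hand out the leftover items
    # one at a time, cycling through the splits from index 0.
    lengths = [int(math.floor(f * n)) for f in fractions]
    for i in range(n - sum(lengths)):
        lengths[i % len(lengths)] += 1
    return lengths

print(lengths_from_fractions([0.2, 0.3, 0.5], 10))  # [2, 3, 5]
```

Because the remainder is spread round-robin, the computed lengths always sum to n even when the fractions do not divide it evenly.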

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `fractions` | List[float] | A list of fractions that should sum to 1. | required |
| `dataset_length` | int | The length of the dataset. | required |

Returns:

| Type | Description |
| --- | --- |
| List[int] | A list of lengths corresponding to each fraction. |

Raises:

| Type | Description |
| --- | --- |
| ValueError | If the fractions do not sum to 1. |
| ValueError | If any fraction is outside the range [0, 1]. |
| ValueError | If the sum of computed lengths does not equal the dataset length. |

Examples:

>>> from spotpython.utils.split import compute_lengths_from_fractions
>>> dataset_length = 5
>>> fractions = [0.2, 0.3, 0.5]
>>> compute_lengths_from_fractions(fractions, dataset_length)
[2, 1, 2]

In this example, 'dataset_length' is 5 and the 'fractions' specify the desired size distribution. Flooring yields [1, 1, 2] (floor(0.2 * 5), floor(0.3 * 5), floor(0.5 * 5)); the single leftover item is distributed round-robin starting at index 0, so the function returns [2, 1, 2].
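The up-front validation described under Raises can be reproduced with math.isclose; a minimal sketch of those checks (not calling spotpython itself):

```python
import math

def validate_fractions(fractions):
    # Mirrors the checks in compute_lengths_from_fractions:
    # the fractions must sum to 1 (within floating-point tolerance)
    # and each fraction must lie in [0, 1].
    if not math.isclose(sum(fractions), 1):
        raise ValueError("Fractions must sum up to 1.")
    for i, frac in enumerate(fractions):
        if frac < 0 or frac > 1:
            raise ValueError(f"Fraction at index {i} is not between 0 and 1")

validate_fractions([0.2, 0.3, 0.5])   # passes silently
try:
    validate_fractions([0.5, 0.6])
except ValueError as err:
    print(err)  # Fractions must sum up to 1.
```

Using math.isclose rather than an exact equality check tolerates small floating-point error in the sum of the fractions.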

Source code in spotpython/utils/split.py
import math
import warnings
from typing import List


def compute_lengths_from_fractions(fractions: List[float], dataset_length: int) -> List[int]:
    """Compute lengths of dataset splits from given fractions.

    Given a list of fractions that sum up to 1, compute the lengths of each
    corresponding partition of a dataset with a specified length. Each length is
    determined as `floor(frac * dataset_length)`. Any remaining items (due to flooring)
    are distributed among the partitions in a round-robin fashion.

    Args:
        fractions (List[float]): A list of fractions that should sum to 1.
        dataset_length (int): The length of the dataset.

    Returns:
        List[int]: A list of lengths corresponding to each fraction.

    Raises:
        ValueError: If the fractions do not sum to 1.
        ValueError: If any fraction is outside the range [0, 1].
        ValueError: If the sum of computed lengths does not equal the dataset length.

    Examples:
        >>> from spotpython.utils.split import compute_lengths_from_fractions
        >>> dataset_length = 5
        >>> fractions = [0.2, 0.3, 0.5]
        >>> compute_lengths_from_fractions(fractions, dataset_length)
        [2, 1, 2]

        In this example, 'dataset_length' is 5 and the 'fractions' specify the
        desired size distribution. Flooring gives [1, 1, 2]; the single
        leftover item goes to the split at index 0, yielding [2, 1, 2].

    """
    if not math.isclose(sum(fractions), 1) or sum(fractions) > 1:
        raise ValueError("Fractions must sum up to 1.")

    subset_lengths: List[int] = []
    for i, frac in enumerate(fractions):
        if frac < 0 or frac > 1:
            raise ValueError(f"Fraction at index {i} is not between 0 and 1")
        n_items_in_split = int(math.floor(dataset_length * frac))
        subset_lengths.append(n_items_in_split)

    remainder = dataset_length - sum(subset_lengths)

    # Add 1 to all the lengths in a round-robin fashion until the remainder is 0
    for i in range(remainder):
        idx_to_add_at = i % len(subset_lengths)
        subset_lengths[idx_to_add_at] += 1

    lengths = subset_lengths
    for i, length in enumerate(lengths):
        if length == 0:
            warnings.warn(f"Length of split at index {i} is 0. This might result in an empty dataset.")

    if sum(lengths) != dataset_length:
        raise ValueError("Sum of computed lengths does not equal the input dataset length!")

    return lengths