split
calculate_data_split(test_size, full_size, verbosity=0, stage=None)
Calculates the split sizes for the training, validation, and test datasets and returns them as a tuple (full_train_size, val_size, train_size, test_size). As the examples below show, the returned sizes are proportions when test_size is a float and sample counts when test_size is an int.
Note
The first return value is full_train_size, i.e., the size of the full dataset minus the test set.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
test_size | float or int | The size of the test set. Can be a float for proportion or an int for absolute number of test samples. | required |
full_size | int | The size of the full dataset. | required |
verbosity | int | The level of verbosity for debug output. Defaults to 0. | 0 |
stage | str | The stage of setup, for debug output if needed. | None |
Returns:
Name | Type | Description |
---|---|---|
tuple | tuple | A tuple containing the sizes (full_train_size, val_size, train_size, test_size). |
Examples:
>>> from spotpython.utils.split import calculate_data_split
>>> # Using proportion for test size
>>> calculate_data_split(0.2, 1000)
(0.8, 0.16, 0.64, 0.2)
>>> # Using absolute number for test size
>>> calculate_data_split(200, 1000)
(800, 160, 640, 200)
Raises:
Type | Description |
---|---|
ValueError | If the sizes are not correct, i.e., full_size != train_size + val_size + test_size. |
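The arithmetic behind the two documented examples can be sketched as follows. This is not the library's source: split_sizes_sketch is a hypothetical helper, and the rule that the validation share equals the test share applied to the remaining data is an assumption inferred only from the example outputs above. Debug output and the ValueError check are left out.

```python
import math
from typing import Tuple, Union

Number = Union[float, int]


def split_sizes_sketch(test_size: Number, full_size: int) -> Tuple[Number, Number, Number, Number]:
    """Hypothetical helper that reproduces the documented example outputs."""
    if isinstance(test_size, float):
        # Proportion case: every returned value is a fraction of the full dataset.
        full_train_size = 1.0 - test_size
        # Assumption: the validation share of the remaining data equals test_size.
        val_size = full_train_size * test_size
        train_size = full_train_size - val_size
    else:
        # Absolute case: every returned value is a number of samples.
        full_train_size = full_size - test_size
        # Same assumed rule, expressed with integer arithmetic.
        val_size = full_train_size * test_size // full_size
        train_size = full_train_size - val_size
    return full_train_size, val_size, train_size, test_size


counts = split_sizes_sketch(200, 1000)     # (800, 160, 640, 200)
fractions = split_sizes_sketch(0.2, 1000)  # approximately (0.8, 0.16, 0.64, 0.2)

# val, train, and test are disjoint and together cover the whole dataset.
assert sum(counts[1:]) == 1000
assert math.isclose(sum(fractions[1:]), 1.0)
```

Note that full_train_size is not part of the coverage check, because it already contains both the training and the validation samples.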
Source code in spotpython/utils/split.py
compute_lengths_from_fractions(fractions, dataset_length)
Compute lengths of dataset splits from given fractions.
Given a list of fractions that sum to 1, compute the length of each
corresponding partition of a dataset with a specified length. Each length is
determined as floor(frac * dataset_length). Any remaining items (due to flooring)
are distributed among the partitions in a round-robin fashion.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fractions | List[float] | A list of fractions that should sum to 1. | required |
dataset_length | int | The length of the dataset. | required |
Returns:
Type | Description |
---|---|
List[int] | A list of lengths corresponding to each fraction. |
Raises:
Type | Description |
---|---|
ValueError | If the fractions do not sum to 1. |
ValueError | If any fraction is outside the range [0, 1]. |
ValueError | If the sum of computed lengths does not equal the dataset length. |
Examples:
>>> from spotpython.utils.split import compute_lengths_from_fractions
>>> dataset_length = 5
>>> fractions = [0.2, 0.3, 0.5]
>>> compute_lengths_from_fractions(fractions, dataset_length)
[1, 1, 3]
In this example, ‘dataset_length’ is 5 and the ‘fractions’ specify the desired size distribution. The function calculates partitions of lengths [1, 1, 3] based on the given fractions.
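The floor-then-distribute procedure described above can be sketched independently of the library. The function name lengths_from_fractions_sketch is hypothetical, and the choice to hand the first leftover item to the first partition is an assumption; the library may distribute leftovers in a different order, so results for inputs with a remainder need not match its output.

```python
import math
from typing import List


def lengths_from_fractions_sketch(fractions: List[float], dataset_length: int) -> List[int]:
    """Hypothetical re-implementation of the floor-plus-round-robin idea."""
    if not math.isclose(sum(fractions), 1.0):
        raise ValueError("Fractions must sum to 1.")
    if any(frac < 0 or frac > 1 for frac in fractions):
        raise ValueError("Each fraction must lie in the range [0, 1].")

    # Step 1: floor each partition length.
    lengths = [math.floor(frac * dataset_length) for frac in fractions]

    # Step 2: hand out the items lost to flooring, one partition at a time.
    # Assumption: the first leftover item goes to the first partition.
    remainder = dataset_length - sum(lengths)
    for i in range(remainder):
        lengths[i % len(lengths)] += 1

    if sum(lengths) != dataset_length:
        raise ValueError("Computed lengths do not add up to the dataset length.")
    return lengths


# The floors already sum to the dataset length, so nothing is redistributed.
no_remainder = lengths_from_fractions_sketch([0.5, 0.25, 0.25], 8)  # [4, 2, 2]

# The floors are 4, 3, 3 (sum 10), so one leftover item is redistributed.
with_remainder = lengths_from_fractions_sketch([0.4, 0.3, 0.3], 11)
assert sum(with_remainder) == 11
```

Whatever the distribution order, the returned lengths always add up to dataset_length, which is the property the final ValueError check guards.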
Source code in spotpython/utils/split.py