Growing Window method

`timecave.validation_methods.prequential.GrowingWindow(splits, ts, fs=1, gap=0, weight_function=constant_weights, params=None)`

Bases: BaseSplitter

Implements every variant of the Growing Window method.

This class implements the Growing Window method and its variants, including Gap Growing Window and Weighted Growing Window. Use the `gap` parameter to obtain the former and the `weight_function` argument to obtain the latter.

Parameters:

Name	Type	Description	Default
`splits`	`int`	The number of folds used to partition the data.	required
`ts`	`ndarray | Series`	Univariate time series.	required
`fs`	`float | int`	Sampling frequency (Hz).	`1`
`gap`	`int`	Number of folds separating the validation set from the training set. If this value is set to zero, the validation set will be adjacent to the training set.	`0`
`weight_function`	`callable`	Fold weighting function. Check the weights module for more details.	`constant_weights`
`params`	`dict`	Parameters to be passed to the weighting functions.	`None`

Attributes:

Name	Type	Description
`n_splits`	`int`	The number of splits.
`sampling_freq`	`int | float`	The series' sampling frequency (Hz).

Methods:

Name	Description
`split`	Split the time series into training and validation sets.
`info`	Provide additional information on the validation method.
`statistics`	Compute relevant statistics for both training and validation sets.
`plot`	Plot the partitioned time series.

Raises:

Type	Description
`TypeError`	If `gap` is not an integer.
`ValueError`	If `gap` is a negative number.
`ValueError`	If `gap` surpasses the limit imposed by the number of folds.
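
The three error conditions above can be illustrated with a small stand-alone check. This is only a sketch: `check_gap` is a hypothetical helper, not the library's `_check_gap`, and the upper limit used here (`gap <= splits - 2`, so that at least one iteration remains) is an assumption inferred from the examples on this page.

```python
def check_gap(gap, splits):
    """Hypothetical stand-in for the gap validation described above."""
    if not isinstance(gap, int):
        raise TypeError("'gap' must be an integer.")
    if gap < 0:
        raise ValueError("'gap' must be non-negative.")
    # Assumed limit: splits - gap - 1 iterations are run, and at least one must remain.
    if gap > splits - 2:
        raise ValueError("'gap' is too large for the number of folds.")
    return gap


print(check_gap(1, 5))  # 1
```

With 5 folds, this sketch accepts gaps from 0 to 3 and rejects anything larger.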

1. Vitor Cerqueira, Luis Torgo, and Igor Mozetič. Evaluating time series forecasting models: An empirical study on performance estimation methods. Machine Learning, 109(11):1997–2028, 2020.

Source code in timecave/validation_methods/prequential.py

def __init__(
    self,
    splits: int,
    ts: np.ndarray | pd.Series,
    fs: float | int = 1,
    gap: int = 0,
    weight_function: callable = constant_weights,
    params: dict = None,
) -> None:

    super().__init__(splits, ts, fs)
    self._check_gap(gap)
    self._gap = gap
    self._splitting_ind = self._split_ind()
    self._weights = weight_function(self.n_splits, self._gap, 1, params)

    return

`info()`

Provide some basic information on the training and validation sets.

This method displays the number of splits, the fold size, the maximum and minimum training set sizes, the gap, and the weights that will be used to compute the error estimate.

Examples:

>>> import numpy as np
>>> from timecave.validation_methods.prequential import GrowingWindow
>>> ts = np.ones(10);
>>> splitter = GrowingWindow(5, ts);
>>> splitter.info();
Growing Window method
---------------------
Time series size: 10 samples
Number of splits: 5
Fold size: 2 to 2 samples (20.0 to 20.0 %)
Minimum training set size: 2 samples (20.0 %)
Maximum training set size: 8 samples (80.0 %)
Gap: 0
Weights: [1. 1. 1. 1.]
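
The sizes reported above follow from simple integer arithmetic on the number of samples and folds. The sketch below mirrors that arithmetic in plain Python; `fold_summary` is a hypothetical name used for illustration, and the gap is not taken into account.

```python
def fold_summary(n_samples, splits):
    """Sketch of the size arithmetic reported by info() (gap not considered)."""
    min_fold = n_samples // splits      # size of the smaller folds
    remainder = n_samples % splits      # number of folds holding one extra sample
    max_fold = min_fold + (1 if remainder else 0)

    # The first fold is always one of the largest, so it sets the minimum training size.
    min_train = max_fold
    # The last fold is only used for validation, so it never enters the training set.
    max_train = min_fold * (splits - remainder - 1) + max_fold * remainder
    return min_fold, max_fold, min_train, max_train


print(fold_summary(10, 5))  # (2, 2, 2, 8)
print(fold_summary(17, 5))  # (3, 4, 4, 14)
```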

Source code in timecave/validation_methods/prequential.py

def info(self) -> None:
    """
    Provide some basic information on the training and validation sets.

    This method displays the number of splits, the fold size, the maximum and minimum training set sizes, the gap, 
    and the weights that will be used to compute the error estimate.

    Examples
    --------
    >>> import numpy as np
    >>> from timecave.validation_methods.prequential import GrowingWindow
    >>> ts = np.ones(10);
    >>> splitter = GrowingWindow(5, ts);
    >>> splitter.info();
    Growing Window method
    ---------------------
    Time series size: 10 samples
    Number of splits: 5
    Fold size: 2 to 2 samples (20.0 to 20.0 %)
    Minimum training set size: 2 samples (20.0 %)
    Maximum training set size: 8 samples (80.0 %)
    Gap: 0
    Weights: [1. 1. 1. 1.]
    """

    min_fold_size = int(np.floor(self._n_samples / self.n_splits))
    max_fold_size = min_fold_size

    remainder = self._n_samples % self.n_splits

    if remainder != 0:

        max_fold_size += 1

    min_fold_size_pct = np.round(min_fold_size / self._n_samples * 100, 2)
    max_fold_size_pct = np.round(max_fold_size / self._n_samples * 100, 2)

    max_train = (
        min_fold_size * (self.n_splits - remainder - 1) + max_fold_size * remainder
    )
    max_train_pct = np.round(max_train / self._n_samples * 100, 2)

    print("Growing Window method")
    print("---------------------")
    print(f"Time series size: {self._n_samples} samples")
    print(f"Number of splits: {self.n_splits}")
    print(
        f"Fold size: {min_fold_size} to {max_fold_size} samples ({min_fold_size_pct} to {max_fold_size_pct} %)"
    )
    print(
        f"Minimum training set size: {max_fold_size} samples ({max_fold_size_pct} %)"
    )
    print(f"Maximum training set size: {max_train} samples ({max_train_pct} %)")
    print(f"Gap: {self._gap}")
    print(f"Weights: {np.round(self._weights, 3)}")

    return

`plot(height, width)`

Plot the partitioned time series.

This method allows the user to plot the partitioned time series. The training and validation sets are plotted using different colours.

Parameters:

Name	Type	Description	Default
`height`	`int`	The figure's height.	required
`width`	`int`	The figure's width.	required

Examples:

>>> import numpy as np
>>> from timecave.validation_methods.prequential import GrowingWindow
>>> ts = np.ones(100);
>>> splitter = GrowingWindow(5, ts);
>>> splitter.plot(10, 10);

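Figure (grow_plot): the partitioned time series, with the training and validation sets shown in different colours.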

Source code in timecave/validation_methods/prequential.py

def plot(self, height: int, width: int) -> None:
    """
    Plot the partitioned time series.

    This method allows the user to plot the partitioned time series. The training and validation sets are plotted using different colours.

    Parameters
    ----------
    height : int
        The figure's height.

    width : int
        The figure's width.

    Examples
    --------
    >>> import numpy as np
    >>> from timecave.validation_methods.prequential import GrowingWindow
    >>> ts = np.ones(100);
    >>> splitter = GrowingWindow(5, ts);
    >>> splitter.plot(10, 10);

    ![grow_plot](../../../images/Grow_plot.png)
    """

    fig, axs = plt.subplots(self.n_splits - self._gap - 1, 1, sharex=True)
    fig.set_figheight(height)
    fig.set_figwidth(width)
    fig.supxlabel("Samples")
    fig.supylabel("Time Series")
    fig.suptitle("Growing Window method")

    # plt.subplots returns an array of axes only when there is more than one subplot.
    if self.n_splits - self._gap - 1 > 1:

        for it, (training, validation, weight) in enumerate(self.split()):

            axs[it].scatter(training, self._series[training], label="Training set")
            axs[it].scatter(
                validation, self._series[validation], label="Validation set"
            )
            axs[it].set_title("Iteration: {} Weight: {}".format(it + 1, np.round(weight, 3)))
            axs[it].set_ylim([self._series.min() - 1, self._series.max() + 1])
            axs[it].set_xlim([- 1, self._n_samples + 1])
            axs[it].legend()

    else:

        for (training, validation, weight) in self.split():

            axs.scatter(training, self._series[training], label="Training set")
            axs.scatter(
                validation, self._series[validation], label="Validation set"
            )
            axs.set_title("Iteration: {} Weight: {}".format(1, np.round(weight, 3)))
            axs.legend()

    plt.show()

    return

`split()`

Split the time series into training and validation sets.

This method splits the series' indices into disjoint sets containing the training and validation indices. At every iteration, an array of training indices and another one containing the validation indices are generated. Note that this method is a generator. To access the indices, use the built-in next() function or a for loop.

Yields:

Type	Description
`ndarray`	Array of training indices.
`ndarray`	Array of validation indices.
`float`	Weight assigned to the error estimate.
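
Because `split()` is a generator, nothing is computed until it is iterated. The toy generator below is a stand-in written for illustration only (equal-sized folds, constant weights); it shows the two ways of consuming such a generator, `next()` and a `for` loop.

```python
def toy_split(n_samples, fold_size):
    """Toy generator yielding (train, validation, weight) triples."""
    for start in range(fold_size, n_samples, fold_size):
        train = list(range(start))
        validation = list(range(start, start + fold_size))
        yield train, validation, 1.0


gen = toy_split(10, 2)

# next() pulls a single iteration by hand.
first_train, first_val, first_weight = next(gen)
print(first_train, first_val)  # [0, 1] [2, 3]

# A for loop (or comprehension) consumes whatever is left in the generator.
remaining = [(train, val) for train, val, _ in gen]
print(len(remaining))  # 3
```

Once exhausted, a generator yields nothing further; call the method again to iterate a second time.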

Examples:

>>> import numpy as np
>>> from timecave.validation_methods.prequential import GrowingWindow
>>> ts = np.ones(10);
>>> splitter = GrowingWindow(5, ts); # Split the data into 5 different folds
>>> for ind, (train, val, _) in enumerate(splitter.split()):
... 
...     print(f"Iteration {ind+1}");
...     print(f"Training set indices: {train}");
...     print(f"Validation set indices: {val}");
Iteration 1
Training set indices: [0 1]
Validation set indices: [2 3]
Iteration 2
Training set indices: [0 1 2 3]
Validation set indices: [4 5]
Iteration 3
Training set indices: [0 1 2 3 4 5]
Validation set indices: [6 7]
Iteration 4
Training set indices: [0 1 2 3 4 5 6 7]
Validation set indices: [8 9]

If the number of samples is not divisible by the number of folds, the first folds will contain more samples:

>>> ts2 = np.ones(17);
>>> splitter = GrowingWindow(5, ts2);
>>> for ind, (train, val, _) in enumerate(splitter.split()):
... 
...     print(f"Iteration {ind+1}");
...     print(f"Training set indices: {train}");
...     print(f"Validation set indices: {val}");
Iteration 1
Training set indices: [0 1 2 3]
Validation set indices: [4 5 6 7]
Iteration 2
Training set indices: [0 1 2 3 4 5 6 7]
Validation set indices: [ 8  9 10]
Iteration 3
Training set indices: [ 0  1  2  3  4  5  6  7  8  9 10]
Validation set indices: [11 12 13]
Iteration 4
Training set indices: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13]
Validation set indices: [14 15 16]
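
The fold boundaries in this example can be reproduced with a few lines of plain Python. `fold_boundaries` is a hypothetical helper, not part of the library; it assumes, as the output above suggests, that the extra samples go to the first folds.

```python
from itertools import accumulate


def fold_boundaries(n_samples, splits):
    """Hypothetical helper: fold sizes and the end index of each fold."""
    base, extra = divmod(n_samples, splits)
    # The first `extra` folds receive one additional sample each.
    sizes = [base + 1 if i < extra else base for i in range(splits)]
    return sizes, list(accumulate(sizes))


sizes, ends = fold_boundaries(17, 5)
print(sizes)  # [4, 4, 3, 3, 3]
print(ends)   # [4, 8, 11, 14, 17]
```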

If a gap is specified (Gap Growing Window), the validation set is no longer adjacent to the training set. Note that the larger the gap between these two sets, the fewer iterations are run (`splits - gap - 1` in total):

>>> splitter = GrowingWindow(5, ts, gap=1);
>>> for ind, (train, val, _) in enumerate(splitter.split()):
... 
...     print(f"Iteration {ind+1}");
...     print(f"Training set indices: {train}");
...     print(f"Validation set indices: {val}");
Iteration 1
Training set indices: [0 1]
Validation set indices: [4 5]
Iteration 2
Training set indices: [0 1 2 3]
Validation set indices: [6 7]
Iteration 3
Training set indices: [0 1 2 3 4 5]
Validation set indices: [8 9]
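
In terms of folds rather than sample indices, the pattern above can be sketched as follows. `gap_iterations` is a hypothetical helper written for this illustration; it assumes `splits - gap - 1` iterations, which matches the three iterations printed above.

```python
def gap_iterations(splits, gap):
    """Sketch: folds used for training and validation at each iteration."""
    plan = []
    for i in range(splits - gap - 1):
        train_folds = list(range(i + 1))  # folds 0..i form the training set
        val_fold = i + 1 + gap            # skip `gap` folds after the training set
        plan.append((train_folds, val_fold))
    return plan


for train_folds, val_fold in gap_iterations(5, 1):
    print(train_folds, val_fold)
# [0] 2
# [0, 1] 3
# [0, 1, 2] 4
```

With folds of two samples, fold 2 covers indices 4 and 5, which is the first validation set shown above.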

Weights can be assigned to the error estimates (Weighted Growing Window method). The parameters for the weighting functions must be passed to the class constructor:

>>> from timecave.validation_methods.weights import exponential_weights
>>> splitter = GrowingWindow(5, ts, weight_function=exponential_weights, params={"base": 2});
>>> for ind, (train, val, weight) in enumerate(splitter.split()):
... 
...     print(f"Iteration {ind+1}");
...     print(f"Training set indices: {train}");
...     print(f"Validation set indices: {val}");
...     print(f"Weight: {np.round(weight, 3)}");
Iteration 1
Training set indices: [0 1]
Validation set indices: [2 3]
Weight: 0.067
Iteration 2
Training set indices: [0 1 2 3]
Validation set indices: [4 5]
Weight: 0.133
Iteration 3
Training set indices: [0 1 2 3 4 5]
Validation set indices: [6 7]
Weight: 0.267
Iteration 4
Training set indices: [0 1 2 3 4 5 6 7]
Validation set indices: [8 9]
Weight: 0.533
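
The weights printed above are consistent with powers of the base normalised to sum to one (1, 2, 4, 8 divided by 15). The sketch below re-derives them under that assumption; `normalised_exponential_weights` is a hypothetical function and not the library's `exponential_weights`.

```python
def normalised_exponential_weights(n_weights, base):
    """Hypothetical re-derivation of the weights shown above."""
    raw = [base ** i for i in range(n_weights)]  # 1, 2, 4, 8 for base 2
    total = sum(raw)
    return [w / total for w in raw]


weights = normalised_exponential_weights(4, 2)
print([round(w, 3) for w in weights])  # [0.067, 0.133, 0.267, 0.533]
```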

Source code in timecave/validation_methods/prequential.py

def split(self) -> Generator[tuple[np.ndarray, np.ndarray, float], None, None]:
    """
    Split the time series into training and validation sets.

    This method splits the series' indices into disjoint sets containing the training and validation indices.
    At every iteration, an array of training indices and another one containing the validation indices are generated.
    Note that this method is a generator. To access the indices, use the built-in `next()` function or a `for` loop.

    Yields
    ------
    np.ndarray
        Array of training indices.

    np.ndarray
        Array of validation indices.

    float
        Weight assigned to the error estimate.

    Examples
    --------
    >>> import numpy as np
    >>> from timecave.validation_methods.prequential import GrowingWindow
    >>> ts = np.ones(10);
    >>> splitter = GrowingWindow(5, ts); # Split the data into 5 different folds
    >>> for ind, (train, val, _) in enumerate(splitter.split()):
    ... 
    ...     print(f"Iteration {ind+1}");
    ...     print(f"Training set indices: {train}");
    ...     print(f"Validation set indices: {val}");
    Iteration 1
    Training set indices: [0 1]
    Validation set indices: [2 3]
    Iteration 2
    Training set indices: [0 1 2 3]
    Validation set indices: [4 5]
    Iteration 3
    Training set indices: [0 1 2 3 4 5]
    Validation set indices: [6 7]
    Iteration 4
    Training set indices: [0 1 2 3 4 5 6 7]
    Validation set indices: [8 9]

    If the number of samples is not divisible by the number of folds, the first folds will contain more samples:

    >>> ts2 = np.ones(17);
    >>> splitter = GrowingWindow(5, ts2);
    >>> for ind, (train, val, _) in enumerate(splitter.split()):
    ... 
    ...     print(f"Iteration {ind+1}");
    ...     print(f"Training set indices: {train}");
    ...     print(f"Validation set indices: {val}");
    Iteration 1
    Training set indices: [0 1 2 3]
    Validation set indices: [4 5 6 7]
    Iteration 2
    Training set indices: [0 1 2 3 4 5 6 7]
    Validation set indices: [ 8  9 10]
    Iteration 3
    Training set indices: [ 0  1  2  3  4  5  6  7  8  9 10]
    Validation set indices: [11 12 13]
    Iteration 4
    Training set indices: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13]
    Validation set indices: [14 15 16]

    If a gap is specified (Gap Growing Window), the validation set will no longer be adjacent to the training set.
    Keep in mind that the larger the gap between these two sets, the fewer iterations are run:

    >>> splitter = GrowingWindow(5, ts, gap=1);
    >>> for ind, (train, val, _) in enumerate(splitter.split()):
    ... 
    ...     print(f"Iteration {ind+1}");
    ...     print(f"Training set indices: {train}");
    ...     print(f"Validation set indices: {val}");
    Iteration 1
    Training set indices: [0 1]
    Validation set indices: [4 5]
    Iteration 2
    Training set indices: [0 1 2 3]
    Validation set indices: [6 7]
    Iteration 3
    Training set indices: [0 1 2 3 4 5]
    Validation set indices: [8 9]

    Weights can be assigned to the error estimates (Weighted Growing Window method). 
    The parameters for the weighting functions must be passed to the class constructor:

    >>> from timecave.validation_methods.weights import exponential_weights
    >>> splitter = GrowingWindow(5, ts, weight_function=exponential_weights, params={"base": 2});
    >>> for ind, (train, val, weight) in enumerate(splitter.split()):
    ... 
    ...     print(f"Iteration {ind+1}");
    ...     print(f"Training set indices: {train}");
    ...     print(f"Validation set indices: {val}");
    ...     print(f"Weight: {np.round(weight, 3)}");
    Iteration 1
    Training set indices: [0 1]
    Validation set indices: [2 3]
    Weight: 0.067
    Iteration 2
    Training set indices: [0 1 2 3]
    Validation set indices: [4 5]
    Weight: 0.133
    Iteration 3
    Training set indices: [0 1 2 3 4 5]
    Validation set indices: [6 7]
    Weight: 0.267
    Iteration 4
    Training set indices: [0 1 2 3 4 5 6 7]
    Validation set indices: [8 9]
    Weight: 0.533
    """

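    # zip() stops at the shorter of the two sequences, which bounds the number of iterations.
    for i, (ind, weight) in enumerate(zip(self._splitting_ind[:-1], self._weights)):

        # Boundaries of the validation fold, shifted `gap` folds past the training set.
        gap_ind = self._splitting_ind[i + self._gap]
        gap_end_ind = self._splitting_ind[i + self._gap + 1]

        # The training set always starts at the first sample and grows with each iteration.
        train = self._indices[:ind]
        validation = self._indices[gap_ind:gap_end_ind]

        yield (train, validation, weight)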

`statistics()`

Compute relevant statistics for both training and validation sets.

This method computes relevant time series features, such as mean, strength-of-trend, etc., for the whole time series, the training set and the validation set. It can and should be used to ensure that the characteristics of both the training and validation sets are, statistically speaking, similar to those of the time series one wishes to forecast. If this is not the case, using the validation method will most likely lead to a poor assessment of the model's performance.

Returns:

Type	Description
`DataFrame`	Relevant features for the entire time series.
`DataFrame`	Relevant features for the training set.
`DataFrame`	Relevant features for the validation set.

Raises:

Type	Description
`ValueError`	If the time series is composed of fewer than three samples.
`ValueError`	If the folds comprise fewer than two samples.

Examples:

>>> import numpy as np
>>> from timecave.validation_methods.prequential import GrowingWindow
>>> ts = np.hstack((np.ones(5), np.zeros(5)));
>>> splitter = GrowingWindow(5, ts);
>>> ts_stats, training_stats, validation_stats = splitter.statistics();
Frequency features are only meaningful if the correct sampling frequency is passed to the class.
>>> ts_stats
   Mean  Median  Min  Max  Variance  P2P_amplitude  Trend_slope  Spectral_centroid  Spectral_rolloff  Spectral_entropy  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
0   0.5     0.5  0.0  1.0      0.25            1.0    -0.151515           0.114058               0.5           0.38717            1.59099            0.111111              0.111111
>>> training_stats
       Mean  Median  Min  Max  Variance  P2P_amplitude   Trend_slope  Spectral_centroid  Spectral_rolloff  Spectral_entropy  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
0  1.000000     1.0  1.0  1.0  0.000000            0.0 -7.850462e-17           0.000000               0.0          0.000000                inf            0.000000              0.000000
0  1.000000     1.0  1.0  1.0  0.000000            0.0 -8.214890e-17           0.000000               0.0          0.000000                inf            0.000000              0.000000
0  0.833333     1.0  0.0  1.0  0.138889            1.0 -1.428571e-01           0.125000               0.5          0.792481           0.931695            0.200000              0.200000
0  0.625000     1.0  0.0  1.0  0.234375            1.0 -1.785714e-01           0.122818               0.5          0.600876           1.383496            0.142857              0.142857
>>> validation_stats
   Mean  Median  Min  Max  Variance  P2P_amplitude   Trend_slope  Spectral_centroid  Spectral_rolloff  Spectral_entropy  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
0   1.0     1.0  1.0  1.0      0.00            0.0 -7.850462e-17               0.00               0.0               0.0                inf                 0.0                   0.0
0   0.5     0.5  0.0  1.0      0.25            1.0 -1.000000e+00               0.25               0.5               0.0                inf                 1.0                   1.0
0   0.0     0.0  0.0  0.0      0.00            0.0  0.000000e+00               0.00               0.0               0.0                inf                 0.0                   0.0
0   0.0     0.0  0.0  0.0      0.00            0.0  0.000000e+00               0.00               0.0               0.0                inf                 0.0                   0.0
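
The Mean and Variance columns above can be cross-checked with the standard library. This is a sketch under two assumptions: folds of two samples (10 samples, 5 folds, no gap) and population variance, which is what the printed values correspond to. `fold_means_and_variances` is a hypothetical helper, not part of the library.

```python
from statistics import fmean, pvariance


def fold_means_and_variances(series, fold_size):
    """Hypothetical helper: (train mean, train var, val mean, val var) per iteration."""
    rows = []
    for end in range(fold_size, len(series), fold_size):
        train = series[:end]               # growing training window
        val = series[end:end + fold_size]  # the fold that follows it
        rows.append((fmean(train), pvariance(train), fmean(val), pvariance(val)))
    return rows


ts = [1.0] * 5 + [0.0] * 5
for row in fold_means_and_variances(ts, 2):
    print([round(x, 6) for x in row])
# [1.0, 0.0, 1.0, 0.0]
# [1.0, 0.0, 0.5, 0.25]
# [0.833333, 0.138889, 0.0, 0.0]
# [0.625, 0.234375, 0.0, 0.0]
```

The training mean drifts from 1.0 towards the series mean of 0.5 as the window grows, while the validation folds differ sharply from one another, which is the kind of mismatch this method is meant to reveal.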

Source code in timecave/validation_methods/prequential.py

def statistics(self) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Compute relevant statistics for both training and validation sets.

    This method computes relevant time series features, such as mean, strength-of-trend, etc., for the whole time series, the training set and the validation set.
    It can and should be used to ensure that the characteristics of both the training and validation sets are, statistically speaking, similar to those of the time series one wishes to forecast.
    If this is not the case, using the validation method will most likely lead to a poor assessment of the model's performance.

    Returns
    -------
    pd.DataFrame
        Relevant features for the entire time series.

    pd.DataFrame
        Relevant features for the training set.

    pd.DataFrame
        Relevant features for the validation set.

    Raises
    ------
    ValueError
        If the time series is composed of fewer than three samples.

    ValueError
        If the folds comprise fewer than two samples.

    Examples
    --------
    >>> import numpy as np
    >>> from timecave.validation_methods.prequential import GrowingWindow
    >>> ts = np.hstack((np.ones(5), np.zeros(5)));
    >>> splitter = GrowingWindow(5, ts);
    >>> ts_stats, training_stats, validation_stats = splitter.statistics();
    Frequency features are only meaningful if the correct sampling frequency is passed to the class.
    >>> ts_stats
       Mean  Median  Min  Max  Variance  P2P_amplitude  Trend_slope  Spectral_centroid  Spectral_rolloff  Spectral_entropy  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
    0   0.5     0.5  0.0  1.0      0.25            1.0    -0.151515           0.114058               0.5           0.38717            1.59099            0.111111              0.111111
    >>> training_stats
           Mean  Median  Min  Max  Variance  P2P_amplitude   Trend_slope  Spectral_centroid  Spectral_rolloff  Spectral_entropy  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
    0  1.000000     1.0  1.0  1.0  0.000000            0.0 -7.850462e-17           0.000000               0.0          0.000000                inf            0.000000              0.000000
    0  1.000000     1.0  1.0  1.0  0.000000            0.0 -8.214890e-17           0.000000               0.0          0.000000                inf            0.000000              0.000000
    0  0.833333     1.0  0.0  1.0  0.138889            1.0 -1.428571e-01           0.125000               0.5          0.792481           0.931695            0.200000              0.200000
    0  0.625000     1.0  0.0  1.0  0.234375            1.0 -1.785714e-01           0.122818               0.5          0.600876           1.383496            0.142857              0.142857
    >>> validation_stats
       Mean  Median  Min  Max  Variance  P2P_amplitude   Trend_slope  Spectral_centroid  Spectral_rolloff  Spectral_entropy  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
    0   1.0     1.0  1.0  1.0      0.00            0.0 -7.850462e-17               0.00               0.0               0.0                inf                 0.0                   0.0
    0   0.5     0.5  0.0  1.0      0.25            1.0 -1.000000e+00               0.25               0.5               0.0                inf                 1.0                   1.0
    0   0.0     0.0  0.0  0.0      0.00            0.0  0.000000e+00               0.00               0.0               0.0                inf                 0.0                   0.0
    0   0.0     0.0  0.0  0.0      0.00            0.0  0.000000e+00               0.00               0.0               0.0                inf                 0.0                   0.0
    """

    if self._n_samples <= 2:

        raise ValueError(
            "Basic statistics can only be computed if the time series comprises more than two samples."
        )

    if int(np.floor(self._n_samples / self.n_splits)) < 2:

        raise ValueError(
            "The folds are too small to compute most meaningful features."
        )

    print("Frequency features are only meaningful if the correct sampling frequency is passed to the class.")

    full_features = get_features(self._series, self.sampling_freq)
    training_stats = []
    validation_stats = []

    for (training, validation, _) in self.split():

        training_feat = get_features(self._series[training], self.sampling_freq)
        training_stats.append(training_feat)

        validation_feat = get_features(self._series[validation], self.sampling_freq)
        validation_stats.append(validation_feat)

    training_features = pd.concat(training_stats)
    validation_features = pd.concat(validation_stats)

    return (full_features, training_features, validation_features)