
Holdout method

timecave.validation_methods.OOS.Holdout(ts, fs=1, validation_size=0.3)

Bases: BaseSplitter

Implements the classic Holdout method.

This class implements the classic Holdout method, which splits the time series into two disjoint sets: one used for training, and another one used for validation purposes. Note that the larger the validation set, the smaller the training set, and vice-versa. As this is an Out-of-Sample method, the training indices precede the validation ones.
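The index logic behind this split can be sketched in a few lines. The snippet below mirrors it in plain Python (the `holdout_indices` helper is illustrative, not part of `timecave`): the split index is the rounded training fraction of the series length, and the two index sets are contiguous, with the training indices first.

```python
# Illustrative sketch of the Holdout index split (not part of timecave):
# the first (1 - validation_size) fraction of indices is used for training,
# the remainder for validation.
def holdout_indices(n_samples: int, validation_size: float = 0.3):
    split_ind = int(round((1 - validation_size) * n_samples))
    indices = list(range(n_samples))
    return indices[:split_ind], indices[split_ind:]

train, val = holdout_indices(10, 0.3)
print(train)  # [0, 1, 2, 3, 4, 5, 6]
print(val)    # [7, 8, 9]
```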

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `ts` | `ndarray \| Series` | Univariate time series. | *required* |
| `fs` | `float \| int` | Sampling frequency (Hz). | `1` |
| `validation_size` | `float` | Validation set size (relative to the time series size). | `0.3` |

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `n_splits` | `int` | The number of splits. |
| `sampling_freq` | `int \| float` | The series' sampling frequency (Hz). |

Methods:

| Name | Description |
| --- | --- |
| `split` | Split the time series into training and validation sets. |
| `info` | Provide additional information on the validation method. |
| `statistics` | Compute relevant statistics for both training and validation sets. |
| `plot` | Plot the partitioned time series. |

Raises:

| Type | Description |
| --- | --- |
| `TypeError` | If the validation size is not a float. |
| `ValueError` | If the validation size does not lie in the ]0, 1[ interval. |

See also

RepeatedHoldout: Perform several iterations of the Holdout method with a randomised validation set size.

Notes

The classic Holdout method consists of splitting the time series in two different sets: one for training and one for validation. This method preserves the temporal order of observations: the oldest set of observations is used for training, while the most recent data is used for validating the model. The model's error on the validation set data is used as an estimate of its true error.

*(Figure: Out-of-Sample partitioning scheme.)*

This method's computational cost is negligible.
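To make the error-estimation step concrete, here is a hedged sketch that pairs a Holdout-style split with a naive last-value forecast; the mean absolute error on the validation set then serves as the estimate of the model's true error. Plain Python stands in for the library, and all names are illustrative.

```python
# Sketch: estimate a model's error on a Holdout validation set.
# The "model" is a naive last-value forecast; the validation-set MAE
# is the error estimate. Purely illustrative, plain Python.
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
split_ind = int(round((1 - 0.3) * len(series)))  # same rounding rule as Holdout
train, val = series[:split_ind], series[split_ind:]

forecast = train[-1]  # naive forecast: repeat the last training observation
mae = sum(abs(v - forecast) for v in val) / len(val)
print(mae)  # 2.0 (validation values 8, 9, 10 vs forecast 7)
```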

Source code in timecave/validation_methods/OOS.py
def __init__(
    self, ts: np.ndarray | pd.Series, fs: float | int = 1, validation_size: float = 0.3
) -> None:

    super().__init__(2, ts, fs)
    self._check_validation_size(validation_size)
    self._val_size = validation_size

    return

info()

Provide some basic information on the training and validation sets.

This method displays the time series size along with those of the training and validation sets.

Examples:

>>> import numpy as np
>>> from timecave.validation_methods.OOS import Holdout
>>> ts = np.ones(10);
>>> splitter = Holdout(ts);
>>> splitter.info();
Holdout method
--------------
Time series size: 10 samples
Training set size: 7 samples (70.0 %)
Validation set size: 3 samples (30.0 %)
Source code in timecave/validation_methods/OOS.py
def info(self) -> None:
    """
    Provide some basic information on the training and validation sets.

    This method displays the time series size along with those of the training and validation sets.

    Examples
    --------
    >>> import numpy as np
    >>> from timecave.validation_methods.OOS import Holdout
    >>> ts = np.ones(10);
    >>> splitter = Holdout(ts);
    >>> splitter.info();
    Holdout method
    --------------
    Time series size: 10 samples
    Training set size: 7 samples (70.0 %)
    Validation set size: 3 samples (30.0 %)
    """

    print("Holdout method")
    print("--------------")
    print(f"Time series size: {self._n_samples} samples")
    print(
        f"Training set size: {int(np.round((1 - self._val_size) * self._n_samples))} samples ({np.round(1 - self._val_size, 4) * 100} %)"
    )
    print(
        f"Validation set size: {int(np.round(self._val_size * self._n_samples))} samples ({np.round(self._val_size, 4) * 100} %)"
    )

    return
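The set sizes reported by `info()` come from rounding the split index, so for series lengths that do not divide evenly they may differ slightly from `n_samples * validation_size`. A small illustrative sketch (`set_sizes` is a hypothetical helper, not part of `timecave`):

```python
# Sketch of how the reported set sizes are derived (hypothetical helper):
# the training size is rounded first; the validation set gets the remainder.
def set_sizes(n_samples: int, validation_size: float):
    train = int(round((1 - validation_size) * n_samples))
    return train, n_samples - train

print(set_sizes(10, 0.3))  # (7, 3)
print(set_sizes(7, 0.3))   # (5, 2): 0.7 * 7 = 4.9 rounds up to 5
```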

plot(height, width)

Plot the partitioned time series.

This method allows the user to plot the partitioned time series. The training and validation sets are plotted using different colours.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `height` | `int` | The figure's height. | *required* |
| `width` | `int` | The figure's width. | *required* |

Examples:

>>> import numpy as np
>>> from timecave.validation_methods.OOS import Holdout
>>> ts = np.arange(1, 11);
>>> splitter = Holdout(ts);
>>> splitter.plot(10, 10);

![Holdout_plot_image](../../../images/Holdout_plot.png)

Source code in timecave/validation_methods/OOS.py
def plot(self, height: int, width: int) -> None:
    """
    Plot the partitioned time series.

    This method allows the user to plot the partitioned time series. The training and validation sets are plotted using different colours. 

    Parameters
    ----------
    height : int
        The figure's height.

    width : int
        The figure's width.

    Examples
    --------
    >>> import numpy as np
    >>> from timecave.validation_methods.OOS import Holdout
    >>> ts = np.arange(1, 11);
    >>> splitter = Holdout(ts);
    >>> splitter.plot(10, 10);

    ![Holdout_plot_image](../../../images/Holdout_plot.png)
    """

    split = self.split()
    training, validation, _ = next(split)

    fig = plt.figure(figsize=(width, height))  # figsize expects (width, height)
    ax = fig.add_subplot(1, 1, 1)
    ax.scatter(training, self._series[training], label="Training set")
    ax.scatter(validation, self._series[validation], label="Validation set")
    ax.set_xlabel("Samples")
    ax.set_ylabel("Time Series")
    ax.set_title("Holdout method")
    ax.legend()
    plt.show()

    return

split()

Split the time series into training and validation sets.

This method splits the series' indices into two disjoint sets: one containing the training indices, and another one with the validation indices. Note that this method is a generator. To retrieve the indices, pass the generator to the built-in next() function or iterate over it with a for loop.

Yields:

| Type | Description |
| --- | --- |
| `ndarray` | Array of training indices. |
| `ndarray` | Array of validation indices. |
| `float` | Used for compatibility reasons. Irrelevant for this method. |

Examples:

>>> import numpy as np
>>> from timecave.validation_methods.OOS import Holdout
>>> ts = np.ones(10);
>>> splitter = Holdout(ts);
>>> for train, val, _ in splitter.split():
...     
...     # Print the training indices and their respective values
...     print(f"Training indices: {train}");
...     print(f"Training values: {ts[train]}");
...     
...     # Do the same for the validation indices
...     print(f"Validation indices: {val}");
...     print(f"Validation values: {ts[val]}");
Training indices: [0 1 2 3 4 5 6]
Training values: [1. 1. 1. 1. 1. 1. 1.]
Validation indices: [7 8 9]
Validation values: [1. 1. 1.]
Source code in timecave/validation_methods/OOS.py
def split(self) -> Generator[tuple[np.ndarray, np.ndarray, float], None, None]:
    """
    Split the time series into training and validation sets.

    This method splits the series' indices into two disjoint sets: one containing the training indices, and another one with the validation indices.
    Note that this method is a generator. To access the indices, use the `next()` method or a `for` loop.

    Yields
    ------
    np.ndarray
        Array of training indices.

    np.ndarray
        Array of validation indices.

    float
        Used for compatibility reasons. Irrelevant for this method.

    Examples
    --------
    >>> import numpy as np
    >>> from timecave.validation_methods.OOS import Holdout
    >>> ts = np.ones(10);
    >>> splitter = Holdout(ts);
    >>> for train, val, _ in splitter.split():
    ...     
    ...     # Print the training indices and their respective values
    ...     print(f"Training indices: {train}");
    ...     print(f"Training values: {ts[train]}");
    ...     
    ...     # Do the same for the validation indices
    ...     print(f"Validation indices: {val}");
    ...     print(f"Validation values: {ts[val]}");
    Training indices: [0 1 2 3 4 5 6]
    Training values: [1. 1. 1. 1. 1. 1. 1.]
    Validation indices: [7 8 9]
    Validation values: [1. 1. 1.]
    """

    split_ind = int(np.round((1 - self._val_size) * self._n_samples))

    train = self._indices[:split_ind]
    validation = self._indices[split_ind:]

    yield (train, validation, 1.0)
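Since `split()` yields exactly once, the generator is exhausted after a single `next()` call. The sketch below reproduces that behaviour with a stand-in generator (illustrative, plain Python; not the library's implementation):

```python
# Illustrative stand-in for Holdout.split(): a generator with a single yield.
def split(n_samples: int, validation_size: float = 0.3):
    split_ind = int(round((1 - validation_size) * n_samples))
    indices = list(range(n_samples))
    yield indices[:split_ind], indices[split_ind:], 1.0  # float kept for compatibility

gen = split(10)
train, val, weight = next(gen)  # the one and only split
print(train, val, weight)
print(next(gen, None))          # None: the generator is already exhausted
```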

statistics()

Compute relevant statistics for both training and validation sets.

This method computes relevant time series features, such as mean, strength-of-trend, etc. for the whole time series as well as for the training and validation sets. It can and should be used to ensure that the characteristics of both the training and validation sets are, statistically speaking, similar to those of the time series one wishes to forecast. If this is not the case, using the validation method will most likely lead to a poor assessment of the model's performance.

Returns:

| Type | Description |
| --- | --- |
| `DataFrame` | Relevant features for the entire time series. |
| `DataFrame` | Relevant features for the training set. |
| `DataFrame` | Relevant features for the validation set. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the time series is composed of fewer than three samples. |

Examples:

>>> import numpy as np
>>> from timecave.validation_methods.OOS import Holdout
>>> ts = np.hstack((np.ones(5), np.zeros(5)));
>>> splitter = Holdout(ts, validation_size=0.5);
>>> ts_stats, training_stats, validation_stats = splitter.statistics();
Frequency features are only meaningful if the correct sampling frequency is passed to the class.
>>> ts_stats
   Mean  Median  Min  Max  Variance  P2P_amplitude  Trend_slope  Spectral_centroid  Spectral_rolloff  Spectral_entropy  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
0   0.5     0.5  0.0  1.0      0.25            1.0    -0.151515           0.114058               0.5           0.38717            1.59099            0.111111              0.111111
>>> training_stats
   Mean  Median  Min  Max  Variance  P2P_amplitude   Trend_slope  Spectral_centroid  Spectral_rolloff  Spectral_entropy  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
0   1.0     1.0  1.0  1.0       0.0            0.0 -1.050792e-16                0.0               0.0               0.0                inf                 0.0                   0.0
>>> validation_stats
   Mean  Median  Min  Max  Variance  P2P_amplitude  Trend_slope  Spectral_centroid  Spectral_rolloff  Spectral_entropy  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
0   0.0     0.0  0.0  0.0       0.0            0.0          0.0                  0               0.0               0.0                inf                 0.0                   0.0
Source code in timecave/validation_methods/OOS.py
def statistics(self) -> tuple[pd.DataFrame]:
    """
    Compute relevant statistics for both training and validation sets.

    This method computes relevant time series features, such as mean, strength-of-trend, etc. for both the whole time series, the training set and the validation set.
    It can and should be used to ensure that the characteristics of both the training and validation sets are, statistically speaking, similar to those of the time series one wishes to forecast.
    If this is not the case, using the validation method will most likely lead to a poor assessment of the model's performance.

    Returns
    -------
    pd.DataFrame
        Relevant features for the entire time series.

    pd.DataFrame
        Relevant features for the training set.

    pd.DataFrame
        Relevant features for the validation set.

    Raises
    ------
    ValueError
        If the time series is composed of less than three samples.

    Examples
    --------
    >>> import numpy as np
    >>> from timecave.validation_methods.OOS import Holdout
    >>> ts = np.hstack((np.ones(5), np.zeros(5)));
    >>> splitter = Holdout(ts, validation_size=0.5);
    >>> ts_stats, training_stats, validation_stats = splitter.statistics();
    Frequency features are only meaningful if the correct sampling frequency is passed to the class.
    >>> ts_stats
       Mean  Median  Min  Max  Variance  P2P_amplitude  Trend_slope  Spectral_centroid  Spectral_rolloff  Spectral_entropy  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
    0   0.5     0.5  0.0  1.0      0.25            1.0    -0.151515           0.114058               0.5           0.38717            1.59099            0.111111              0.111111
    >>> training_stats
       Mean  Median  Min  Max  Variance  P2P_amplitude   Trend_slope  Spectral_centroid  Spectral_rolloff  Spectral_entropy  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
    0   1.0     1.0  1.0  1.0       0.0            0.0 -1.050792e-16                0.0               0.0               0.0                inf                 0.0                   0.0
    >>> validation_stats
       Mean  Median  Min  Max  Variance  P2P_amplitude  Trend_slope  Spectral_centroid  Spectral_rolloff  Spectral_entropy  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
    0   0.0     0.0  0.0  0.0       0.0            0.0          0.0                  0               0.0               0.0                inf                 0.0                   0.0
    """

    if self._n_samples <= 2:

        raise ValueError(
            "Basic statistics can only be computed if the time series comprises more than two samples."
        )

    print("Frequency features are only meaningful if the correct sampling frequency is passed to the class.")

    split = self.split()
    training, validation, _ = next(split)

    full_feat = get_features(self._series, self.sampling_freq)

    if self._series[training].shape[0] >= 2:

        training_feat = get_features(self._series[training], self.sampling_freq)

    else:

        training_feat = pd.DataFrame(columns=full_feat.columns)
        warn("Training and validation set statistics can only be computed if each of these comprise two or more samples.")

    if self._series[validation].shape[0] >= 2:

        validation_feat = get_features(self._series[validation], self.sampling_freq)

    else:

        validation_feat = pd.DataFrame(columns=full_feat.columns)
        warn("Training and validation set statistics can only be computed if each of these comprise two or more samples.")

    return (full_feat, training_feat, validation_feat)
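The warning about dissimilar sets can be illustrated with the same level-shifted series used in the example above: comparing even a single summary statistic across the partitions exposes the problem. A plain-Python sketch (illustrative only, not the library's implementation):

```python
# Sketch: a level-shifted series makes the training and validation sets
# statistically dissimilar -- exactly the situation statistics() helps detect.
series = [1.0] * 5 + [0.0] * 5            # ones followed by zeros
split_ind = int(round((1 - 0.5) * len(series)))
train, val = series[:split_ind], series[split_ind:]

train_mean = sum(train) / len(train)
val_mean = sum(val) / len(val)
print(train_mean, val_mean)  # 1.0 0.0 -> the two sets are not comparable
```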