
hv Block Cross-validation method

timecave.validation_methods.CV.hvBlockCV(ts, fs=1, h=0, v=0)

Bases: BaseSplitter

Implements the hv Block Cross-validation method.

This class implements the hv Block Cross-validation method. It is similar to the BlockCV class, but it does not support weight generation. Consequently, in order to implement a weighted version of this method, the user must implement their own derived class or compute the weights separately.
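Because split() always yields a constant weight of 1.0, a weighted variant can be emulated by discarding that weight and combining the per-fold errors with weights computed separately. The sketch below is illustrative only: the fold errors and the size-based weighting scheme are assumptions, not part of timecave.

```python
import numpy as np

# Hypothetical per-fold validation errors (assumption: these would come from
# evaluating a model on each fold produced by hvBlockCV.split()).
fold_errors = np.array([0.8, 0.5, 0.6, 0.9, 0.7])

# Example weighting scheme (also an assumption): weight each fold by its
# validation set size, then normalise so the weights sum to one.
val_sizes = np.array([2.0, 3.0, 3.0, 3.0, 2.0])
weights = val_sizes / val_sizes.sum()

weighted_error = float(weights @ fold_errors)  # size-weighted error estimate
plain_error = float(fold_errors.mean())        # what the constant 1.0 weights imply
```

Any other weighting scheme (e.g. discounting folds whose training sets are heavily truncated) can be substituted in the same way.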

Parameters:

Name Type Description Default
ts ndarray | Series

Univariate time series.

required
fs float | int

Sampling frequency (Hz).

1
h int

Controls the number of samples that are removed from the training set. The h samples immediately preceding and following the validation set are not used for training.

0
v int

Controls the size of the validation set. \(2v + 1\) samples will be used for validation.

0

Attributes:

Name Type Description
n_splits int

The number of splits.

sampling_freq int | float

The series' sampling frequency (Hz).

Methods:

Name Description
split

Split the time series into training and validation sets.

info

Provide additional information on the validation method.

statistics

Compute relevant statistics for both training and validation sets.

plot

Plot the partitioned time series.

Raises:

Type Description
TypeError

If either h or v is not an integer.

ValueError

If either h or v is negative.

ValueError

If the sum of h and v is larger than half the number of samples in the series.

Warning

Being a variant of the leave-one-out CV procedure, this method generates one split per sample and is therefore computationally intensive.

See also

Block CV: The original Block CV method, which partitions the series into equally sized folds. No training samples are removed.

Notes

The hv Block Cross-validation method is essentially a leave-one-out version of the BlockCV method. There are, however, two nuances: the first one is that the \(h\) samples immediately following and preceding the validation set are removed from the training set; the second one is that more than one sample can be used for validation. More specifically, the validation set comprises \(2v + 1\) samples. Note that, if \(h = v = 0\), the method boils down to the classic leave-one-out cross-validation procedure. The average error on the validation sets is taken as the estimate of the model's true error. This method does not preserve the temporal order of the observations.

The method was first proposed by Racine [1].
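The reduction to leave-one-out CV can be checked with a small self-contained reimplementation of the index logic (an illustrative sketch, not the timecave source):

```python
import numpy as np

def hv_block_splits(n: int, h: int, v: int):
    """Illustrative reimplementation of the hv-block index logic: for each
    centre i, the validation set spans [i - v, i + v] (clipped to the series),
    and the h samples flanking it on each side are also withheld from training."""
    indices = np.arange(n)
    for i in range(n):
        val = indices[max(i - v, 0):min(i + v + 1, n)]
        removed = indices[max(i - v - h, 0):min(i + v + h + 1, n)]
        train = np.setdiff1d(indices, removed)
        yield train, val

# With h = v = 0, every validation set is a single sample and every training
# set contains all remaining samples: classic leave-one-out CV.
loo = list(hv_block_splits(5, h=0, v=0))
```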

References

[1] Jeff Racine. Consistent cross-validatory model-selection for dependent data: hv-block cross-validation. Journal of Econometrics, 99(1):39–61, 2000.

Source code in timecave/validation_methods/CV.py
def __init__(
    self,
    ts: np.ndarray | pd.Series,
    fs: float | int = 1,
    h: int = 0,
    v: int = 0,
) -> None:

    super().__init__(ts.shape[0], ts, fs)
    self._check_hv(h, v)
    self._h = h
    self._v = v

    return

info()

Provide some basic information on the training and validation sets.

This method displays the number of splits, the values of the h and v parameters, and the maximum and minimum sizes of both the training and validation sets.

Examples:

>>> import numpy as np
>>> from timecave.validation_methods.CV import hvBlockCV
>>> ts = np.ones(10);
>>> splitter = hvBlockCV(ts, h=2, v=2);
>>> splitter.info();
hv-Block CV method
------------------
Time series size: 10 samples
Number of splits: 10
Minimum training set size: 1 samples (10.0 %)
Maximum training set size: 5 samples (50.0 %)
Minimum validation set size: 3 samples (30.0 %)
Maximum validation set size: 5 samples (50.0 %)
h: 2
v: 2
Source code in timecave/validation_methods/CV.py
def info(self) -> None:
    """
    Provide some basic information on the training and validation sets.

    This method displays the number of splits, the values of the `h` and `v` 
    parameters, and the maximum and minimum sizes of both the training and validation sets.

    Examples
    --------
    >>> import numpy as np
    >>> from timecave.validation_methods.CV import hvBlockCV
    >>> ts = np.ones(10);
    >>> splitter = hvBlockCV(ts, h=2, v=2);
    >>> splitter.info();
    hv-Block CV method
    ------------------
    Time series size: 10 samples
    Number of splits: 10
    Minimum training set size: 1 samples (10.0 %)
    Maximum training set size: 5 samples (50.0 %)
    Minimum validation set size: 3 samples (30.0 %)
    Maximum validation set size: 5 samples (50.0 %)
    h: 2
    v: 2
    """

    min_train_size = self._n_samples - 2 * (self._h + self._v + 1) + 1
    max_train_size = self._n_samples - self._h - self._v - 1
    min_val_size = self._v + 1
    max_val_size = 2 * self._v + 1

    min_train_pct = np.round(min_train_size / self._n_samples * 100, 2)
    max_train_pct = np.round(max_train_size / self._n_samples * 100, 2)
    min_val_pct = np.round(min_val_size / self._n_samples * 100, 2)
    max_val_pct = np.round(max_val_size / self._n_samples * 100, 2)

    print("hv-Block CV method")
    print("------------------")
    print(f"Time series size: {self._n_samples} samples")
    print(f"Number of splits: {self.n_splits}")
    print(
        f"Minimum training set size: {min_train_size} samples ({min_train_pct} %)"
    )
    print(
        f"Maximum training set size: {max_train_size} samples ({max_train_pct} %)"
    )
    print(f"Minimum validation set size: {min_val_size} samples ({min_val_pct} %)")
    print(f"Maximum validation set size: {max_val_size} samples ({max_val_pct} %)")
    print(f"h: {self._h}")
    print(f"v: {self._v}")

    return

plot(height, width)

Plot the partitioned time series.

This method allows the user to plot the partitioned time series. The training and validation sets are plotted using different colours.

Parameters:

Name Type Description Default
height int

The figure's height.

required
width int

The figure's width.

required

Examples:

>>> import numpy as np
>>> from timecave.validation_methods.CV import hvBlockCV
>>> ts = np.ones(6);
>>> splitter = hvBlockCV(ts, h=1, v=1);
>>> splitter.plot(10, 10);

(Figure: hv-Block CV partitioned series — hvBlock_plot.png)

Source code in timecave/validation_methods/CV.py
def plot(self, height: int, width: int) -> None:
    """
    Plot the partitioned time series.

    This method allows the user to plot the partitioned time series. The training and validation sets are plotted using different colours.

    Parameters
    ----------
    height : int
        The figure's height.

    width : int
        The figure's width.

    Examples
    --------
    >>> import numpy as np
    >>> from timecave.validation_methods.CV import hvBlockCV
    >>> ts = np.ones(6);
    >>> splitter = hvBlockCV(ts, h=1, v=1);
    >>> splitter.plot(10, 10);

    ![hv](../../../images/hvBlock_plot.png)
    """

    fig, axs = plt.subplots(self.n_splits, 1, sharex=True)
    fig.set_figheight(height)
    fig.set_figwidth(width)
    fig.supxlabel("Samples")
    fig.supylabel("Time Series")
    fig.suptitle("hv-Block CV method")

    for it, (training, validation, _) in enumerate(self.split()):

        axs[it].scatter(training, self._series[training], label="Training set")
        axs[it].scatter(
            validation, self._series[validation], label="Validation set"
        )
        axs[it].set_title("Fold {}".format(it + 1))
        axs[it].set_ylim([self._series.min() - 1, self._series.max() + 1])
        axs[it].set_xlim([- 1, self._n_samples + 1])
        axs[it].legend()

    plt.show()

    return

split()

Split the time series into training and validation sets.

This method splits the series' indices into disjoint sets containing the training and validation indices. At every iteration, an array of training indices and another one containing the validation indices are generated. Note that this method is a generator. To access the indices, use the next() method or a for loop.
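Since split() is a generator, folds can also be pulled one at a time with next(). The toy generator below merely mimics the (train, val, weight) triples that split() yields, so the example stays self-contained:

```python
import numpy as np

def toy_split(n: int = 5):
    # Stand-in for splitter.split(): yields (train, val, weight) triples.
    indices = np.arange(n)
    for i in range(n):
        val = indices[i:i + 1]
        train = np.setdiff1d(indices, val)
        yield train, val, 1.0

folds = toy_split()
train, val, weight = next(folds)  # first fold, computed on demand
```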

Yields:

Type Description
ndarray

Array of training indices.

ndarray

Array of validation indices.

float

Weight assigned to the error estimate.

Examples:

>>> import numpy as np
>>> from timecave.validation_methods.CV import hvBlockCV
>>> ts = np.ones(10);
>>> splitter = hvBlockCV(ts, h=2, v=1); # Use 3 samples for validation; remove 2-4 samples from the training set
>>> for ind, (train, val, _) in enumerate(splitter.split()):
... 
...     print(f"Iteration {ind+1}");
...     print(f"Training set indices: {train}");
...     print(f"Validation set indices: {val}");
Iteration 1
Training set indices: [4 5 6 7 8 9]
Validation set indices: [0 1]
Iteration 2
Training set indices: [5 6 7 8 9]
Validation set indices: [0 1 2]
Iteration 3
Training set indices: [6 7 8 9]
Validation set indices: [1 2 3]
Iteration 4
Training set indices: [7 8 9]
Validation set indices: [2 3 4]
Iteration 5
Training set indices: [0 8 9]
Validation set indices: [3 4 5]
Iteration 6
Training set indices: [0 1 9]
Validation set indices: [4 5 6]
Iteration 7
Training set indices: [0 1 2]
Validation set indices: [5 6 7]
Iteration 8
Training set indices: [0 1 2 3]
Validation set indices: [6 7 8]
Iteration 9
Training set indices: [0 1 2 3 4]
Validation set indices: [7 8 9]
Iteration 10
Training set indices: [0 1 2 3 4 5]
Validation set indices: [8 9]
Source code in timecave/validation_methods/CV.py
def split(self) -> Generator[tuple[np.ndarray, np.ndarray, float], None, None]:
    """
    Split the time series into training and validation sets.

    This method splits the series' indices into disjoint sets containing the training and validation indices.
    At every iteration, an array of training indices and another one containing the validation indices are generated.
    Note that this method is a generator. To access the indices, use the `next()` method or a `for` loop.

    Yields
    ------
    np.ndarray
        Array of training indices.

    np.ndarray
        Array of validation indices.

    float
        Weight assigned to the error estimate.

    Examples
    --------
    >>> import numpy as np
    >>> from timecave.validation_methods.CV import hvBlockCV
    >>> ts = np.ones(10);
    >>> splitter = hvBlockCV(ts, h=2, v=1); # Use 3 samples for validation; remove 2-4 samples from the training set
    >>> for ind, (train, val, _) in enumerate(splitter.split()):
    ... 
    ...     print(f"Iteration {ind+1}");
    ...     print(f"Training set indices: {train}");
    ...     print(f"Validation set indices: {val}");
    Iteration 1
    Training set indices: [4 5 6 7 8 9]
    Validation set indices: [0 1]
    Iteration 2
    Training set indices: [5 6 7 8 9]
    Validation set indices: [0 1 2]
    Iteration 3
    Training set indices: [6 7 8 9]
    Validation set indices: [1 2 3]
    Iteration 4
    Training set indices: [7 8 9]
    Validation set indices: [2 3 4]
    Iteration 5
    Training set indices: [0 8 9]
    Validation set indices: [3 4 5]
    Iteration 6
    Training set indices: [0 1 9]
    Validation set indices: [4 5 6]
    Iteration 7
    Training set indices: [0 1 2]
    Validation set indices: [5 6 7]
    Iteration 8
    Training set indices: [0 1 2 3]
    Validation set indices: [6 7 8]
    Iteration 9
    Training set indices: [0 1 2 3 4]
    Validation set indices: [7 8 9]
    Iteration 10
    Training set indices: [0 1 2 3 4 5]
    Validation set indices: [8 9]
    """

    for i, _ in enumerate(self._indices):

        validation = self._indices[
            np.fmax(i - self._v, 0) : np.fmin(i + self._v + 1, self._n_samples)
        ]
        h_ind = self._indices[
            np.fmax(i - self._v - self._h, 0) : np.fmin(
                i + self._v + self._h + 1, self._n_samples
            )
        ]
        train = np.array([el for el in self._indices if el not in h_ind])

        yield (train, validation, 1.0)

statistics()

Compute relevant statistics for both training and validation sets.

This method computes relevant time series features, such as the mean, strength-of-trend, etc., for the whole time series, the training set, and the validation set. It can and should be used to ensure that the characteristics of the training and validation sets are, statistically speaking, similar to those of the time series one wishes to forecast. If they are not, the validation method will most likely yield a poor assessment of the model's performance.
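The kind of check this supports can be sketched with a simple comparison of fold means against the full-series mean (an illustrative proxy only; the real method computes a much richer feature set):

```python
import numpy as np

# Two full periods of a sine wave as a stand-in series (assumption).
ts = np.sin(np.linspace(0.0, 4.0 * np.pi, 50))
full_mean = ts.mean()

# Hypothetical folds: five contiguous, equally sized blocks.
fold_means = np.array([ts[10 * i:10 * (i + 1)].mean() for i in range(5)])

# Flag folds whose mean deviates from the series mean by more than one
# standard deviation; such folds are unlikely to be representative.
deviations = np.abs(fold_means - full_mean)
suspect = deviations > ts.std()
```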

Returns:

Type Description
DataFrame

Relevant features for the entire time series.

DataFrame

Relevant features for the training set.

DataFrame

Relevant features for the validation set.

Raises:

Type Description
ValueError

If the time series comprises fewer than three samples.

ValueError

If a fold comprises fewer than two samples.

Examples:

>>> import numpy as np
>>> from timecave.validation_methods.CV import hvBlockCV
>>> ts = np.hstack((np.ones(5), np.zeros(5)));
>>> splitter = hvBlockCV(ts, h=2, v=2);
>>> ts_stats, training_stats, validation_stats = splitter.statistics();
Frequency features are only meaningful if the correct sampling frequency is passed to the class.
The training set is too small to compute most meaningful features.
The training set is too small to compute most meaningful features.
>>> ts_stats
   Mean  Median  Min  Max  Variance  P2P_amplitude  Trend_slope  Spectral_centroid  Spectral_rolloff  Spectral_entropy  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
0   0.5     0.5  0.0  1.0      0.25            1.0    -0.151515           0.114058               0.5           0.38717            1.59099            0.111111              0.111111
>>> training_stats
   Mean  Median  Min  Max  Variance  P2P_amplitude   Trend_slope  Spectral_centroid  Spectral_rolloff  Spectral_entropy  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
0   0.0     0.0  0.0  0.0       0.0            0.0  0.000000e+00                0.0               0.0               0.0                inf                 0.0                   0.0
0   0.0     0.0  0.0  0.0       0.0            0.0  0.000000e+00                0.0               0.0               0.0                inf                 0.0                   0.0
0   0.0     0.0  0.0  0.0       0.0            0.0  0.000000e+00                0.0               0.0               0.0                inf                 0.0                   0.0
0   0.0     0.0  0.0  0.0       0.0            0.0  0.000000e+00                0.0               0.0               0.0                inf                 0.0                   0.0
0   1.0     1.0  1.0  1.0       0.0            0.0 -7.850462e-17                0.0               0.0               0.0                inf                 0.0                   0.0
0   1.0     1.0  1.0  1.0       0.0            0.0  8.985767e-17                0.0               0.0               0.0                inf                 0.0                   0.0
0   1.0     1.0  1.0  1.0       0.0            0.0 -8.214890e-17                0.0               0.0               0.0                inf                 0.0                   0.0
0   1.0     1.0  1.0  1.0       0.0            0.0 -1.050792e-16                0.0               0.0               0.0                inf                 0.0                   0.0
>>> validation_stats
   Mean  Median  Min  Max  Variance  P2P_amplitude   Trend_slope  Spectral_centroid  Spectral_rolloff  Spectral_entropy  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
0   1.0     1.0  1.0  1.0      0.00            0.0  8.985767e-17           0.000000               0.0          0.000000                inf                0.00                  0.00
0   1.0     1.0  1.0  1.0      0.00            0.0 -8.214890e-17           0.000000               0.0          0.000000                inf                0.00                  0.00
0   1.0     1.0  1.0  1.0      0.00            0.0 -1.050792e-16           0.000000               0.0          0.000000                inf                0.00                  0.00
0   0.8     1.0  0.0  1.0      0.16            1.0 -2.000000e-01           0.100000               0.4          0.630930           0.923760                0.25                  0.25
0   0.6     1.0  0.0  1.0      0.24            1.0 -3.000000e-01           0.109017               0.4          0.347041           1.131371                0.25                  0.25
0   0.4     0.0  0.0  1.0      0.24            1.0 -3.000000e-01           0.134752               0.4          0.347041           1.131371                0.25                  0.25
0   0.2     0.0  0.0  1.0      0.16            1.0 -2.000000e-01           0.200000               0.4          1.000000           0.923760                0.25                  0.25
0   0.0     0.0  0.0  0.0      0.00            0.0  0.000000e+00           0.000000               0.0          0.000000                inf                0.00                  0.00
0   0.0     0.0  0.0  0.0      0.00            0.0  0.000000e+00           0.000000               0.0          0.000000                inf                0.00                  0.00
0   0.0     0.0  0.0  0.0      0.00            0.0  0.000000e+00           0.000000               0.0          0.000000                inf                0.00                  0.00
Source code in timecave/validation_methods/CV.py
def statistics(self) -> tuple[pd.DataFrame]:
    """
    Compute relevant statistics for both training and validation sets.

    This method computes relevant time series features, such as mean, strength-of-trend, etc. for both the whole time series, the training set and the validation set.
    It can and should be used to ensure that the characteristics of both the training and validation sets are, statistically speaking, similar to those of the time series one wishes to forecast.
    If this is not the case, using the validation method will most likely lead to a poor assessment of the model's performance.

    Returns
    -------
    pd.DataFrame
        Relevant features for the entire time series.

    pd.DataFrame
        Relevant features for the training set.

    pd.DataFrame
        Relevant features for the validation set.

    Raises
    ------
    ValueError
        If the time series is composed of less than three samples.

    ValueError
        If the folds comprise less than two samples.

    Examples
    --------
    >>> import numpy as np
    >>> from timecave.validation_methods.CV import hvBlockCV
    >>> ts = np.hstack((np.ones(5), np.zeros(5)));
    >>> splitter = hvBlockCV(ts, h=2, v=2);
    >>> ts_stats, training_stats, validation_stats = splitter.statistics();
    Frequency features are only meaningful if the correct sampling frequency is passed to the class.
    The training set is too small to compute most meaningful features.
    The training set is too small to compute most meaningful features.
    >>> ts_stats
       Mean  Median  Min  Max  Variance  P2P_amplitude  Trend_slope  Spectral_centroid  Spectral_rolloff  Spectral_entropy  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
    0   0.5     0.5  0.0  1.0      0.25            1.0    -0.151515           0.114058               0.5           0.38717            1.59099            0.111111              0.111111
    >>> training_stats
       Mean  Median  Min  Max  Variance  P2P_amplitude   Trend_slope  Spectral_centroid  Spectral_rolloff  Spectral_entropy  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
    0   0.0     0.0  0.0  0.0       0.0            0.0  0.000000e+00                0.0               0.0               0.0                inf                 0.0                   0.0
    0   0.0     0.0  0.0  0.0       0.0            0.0  0.000000e+00                0.0               0.0               0.0                inf                 0.0                   0.0
    0   0.0     0.0  0.0  0.0       0.0            0.0  0.000000e+00                0.0               0.0               0.0                inf                 0.0                   0.0
    0   0.0     0.0  0.0  0.0       0.0            0.0  0.000000e+00                0.0               0.0               0.0                inf                 0.0                   0.0
    0   1.0     1.0  1.0  1.0       0.0            0.0 -7.850462e-17                0.0               0.0               0.0                inf                 0.0                   0.0
    0   1.0     1.0  1.0  1.0       0.0            0.0  8.985767e-17                0.0               0.0               0.0                inf                 0.0                   0.0
    0   1.0     1.0  1.0  1.0       0.0            0.0 -8.214890e-17                0.0               0.0               0.0                inf                 0.0                   0.0
    0   1.0     1.0  1.0  1.0       0.0            0.0 -1.050792e-16                0.0               0.0               0.0                inf                 0.0                   0.0
    >>> validation_stats
       Mean  Median  Min  Max  Variance  P2P_amplitude   Trend_slope  Spectral_centroid  Spectral_rolloff  Spectral_entropy  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
    0   1.0     1.0  1.0  1.0      0.00            0.0  8.985767e-17           0.000000               0.0          0.000000                inf                0.00                  0.00
    0   1.0     1.0  1.0  1.0      0.00            0.0 -8.214890e-17           0.000000               0.0          0.000000                inf                0.00                  0.00
    0   1.0     1.0  1.0  1.0      0.00            0.0 -1.050792e-16           0.000000               0.0          0.000000                inf                0.00                  0.00
    0   0.8     1.0  0.0  1.0      0.16            1.0 -2.000000e-01           0.100000               0.4          0.630930           0.923760                0.25                  0.25
    0   0.6     1.0  0.0  1.0      0.24            1.0 -3.000000e-01           0.109017               0.4          0.347041           1.131371                0.25                  0.25
    0   0.4     0.0  0.0  1.0      0.24            1.0 -3.000000e-01           0.134752               0.4          0.347041           1.131371                0.25                  0.25
    0   0.2     0.0  0.0  1.0      0.16            1.0 -2.000000e-01           0.200000               0.4          1.000000           0.923760                0.25                  0.25
    0   0.0     0.0  0.0  0.0      0.00            0.0  0.000000e+00           0.000000               0.0          0.000000                inf                0.00                  0.00
    0   0.0     0.0  0.0  0.0      0.00            0.0  0.000000e+00           0.000000               0.0          0.000000                inf                0.00                  0.00
    0   0.0     0.0  0.0  0.0      0.00            0.0  0.000000e+00           0.000000               0.0          0.000000                inf                0.00                  0.00
    """

    if self._n_samples <= 2:

        raise ValueError(
            "Basic statistics can only be computed if the time series comprises more than two samples."
        )

    print("Frequency features are only meaningful if the correct sampling frequency is passed to the class.")

    full_features = get_features(self._series, self.sampling_freq)
    training_stats = []
    validation_stats = []

    for training, validation, _ in self.split():

        if self._series[training].shape[0] >= 2:

            training_feat = get_features(self._series[training], self.sampling_freq)
            training_stats.append(training_feat)

        else:

            print(
                "The training set is too small to compute most meaningful features."
            )

        if self._series[validation].shape[0] >= 2:

            validation_feat = get_features(
                self._series[validation], self.sampling_freq
            )
            validation_stats.append(validation_feat)

        else:

            print(
                "The validation set is too small to compute most meaningful features."
            )

    training_features = pd.concat(training_stats)
    validation_features = pd.concat(validation_stats)

    return (full_features, training_features, validation_features)