
Markov Cross-validation method

timecave.validation_methods.markov

This module contains the Markov cross-validation method.

Classes:

MarkovCV
    Implements the Markov cross-validation method.

MarkovCV(ts, p, seed=1)

Bases: BaseSplitter

Implements the Markov cross-validation method.


Parameters:

ts : ndarray | Series (required)
    Univariate time series.

p : int (required)
    p-order autocorrelation.

seed : int, default 1
    Random seed.

Attributes:

n_splits : int
    The number of splits.

sampling_freq : int | float
    The series' sampling frequency (Hz).

Methods:

split
    Split the time series into training and validation sets.

info
    Provide additional information on the validation method.

statistics
    Compute relevant statistics for both training and validation sets.

plot
    Plot the partitioned time series.

Raises:

TypeError
    If seed is not an integer.

TypeError
    If p is not an integer.

ValueError
    If p is not positive.
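
The checks listed above can be sketched as follows. This is only an illustration of the documented behaviour: `check_p` and `check_seed` are hypothetical stand-ins, and the library's private `_check_p` and `_check_seed` methods may be implemented differently (for instance, in how they treat NumPy integer types).

```python
def check_p(p) -> None:
    """Hypothetical stand-in for the validation of p described in the Raises list."""
    # Note: bool is a subclass of int in Python, so this sketch accepts True/False.
    if not isinstance(p, int):
        raise TypeError("'p' must be an integer.")
    if p <= 0:
        raise ValueError("'p' must be positive.")


def check_seed(seed) -> None:
    """Hypothetical stand-in for the validation of seed described in the Raises list."""
    if not isinstance(seed, int):
        raise TypeError("'seed' must be an integer.")


# Try a valid value, a non-positive value and a non-integer value.
for candidate in (2, 0, 1.5):
    try:
        check_p(candidate)
        print(f"p={candidate!r}: accepted")
    except (TypeError, ValueError) as err:
        print(f"p={candidate!r}: {type(err).__name__}")
```

Running both checks before any other work means an invalid argument fails early, before any partitions are computed.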

Notes

The Markov cross-validation method partitions the data so that every partition can be regarded as a Markov process. It relies on the linear autocorrelation order, p, to ensure that the samples within the training set and within the validation set are neither too close together nor too far apart.

[Figure: Markov]

For a thorough discussion of the method, see [1].

References
[1] Gaoxia Jiang and Wenjian Wang. Markov cross-validation for time series model evaluations. Information Sciences, 375:219–233, 2017.

Source code in timecave/validation_methods/markov.py
def __init__(self, ts: np.ndarray | pd.Series, p: int, seed: int = 1) -> None:
    self._check_seed(seed)
    self._check_p(p)

    if p % 3 == 0:
        self._m = math.floor(2 * p / 3) + 1
    else:
        self._m = math.floor(2 * p / 3) + 2

    self.n_subsets = (
        2 * self._m
    )  # total number of subsets (training + tests subsets)
    splits = 2 * self._m  # due to 2-fold CV
    super().__init__(splits, ts, 1)
    self._p = p
    self._seed = seed
    self._suo = {}
    self._sue = {}
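
The relationship between `p` and the number of splits can be read off the constructor above. The sketch below isolates that arithmetic in a standalone function; `markov_n_splits` is a hypothetical helper name, not part of the library.

```python
import math


def markov_n_splits(p: int) -> int:
    """Number of splits implied by the constructor logic (hypothetical helper)."""
    # The number of subset pairs, m, depends on whether p is a multiple of 3.
    if p % 3 == 0:
        m = math.floor(2 * p / 3) + 1
    else:
        m = math.floor(2 * p / 3) + 2
    # Each subset pair is used twice (two-fold CV), hence 2 * m splits.
    return 2 * m


for p in (1, 2, 3, 6):
    print(f"p={p}: {markov_n_splits(p)} splits")
```

For `p=2` this gives 6 splits, which matches the `info()` and `split()` examples on this page; for `p=1` it gives 4, which matches the four rows per table in the `statistics()` example.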

sampling_freq: int | float property

Get the time series' sampling frequency.

This property returns the time series' sampling frequency, in Hertz (this is set on initialisation). Since it is implemented as a property, the value can simply be accessed as an attribute using dot notation.

Returns:

int | float
    The time series' sampling frequency (Hz).

info()

Provide some basic information on the training and validation sets.

This method displays the number of splits and the number of observations per set.

Examples:

>>> import numpy as np
>>> from timecave.validation_methods.markov import MarkovCV
>>> ts = np.ones(10);
>>> splitter = MarkovCV(ts, p=2);
>>> splitter.info();
Markov CV method
---------------
Time series size: 10 samples
Number of splits: 6
Number of observations per set: 1 to 3
Source code in timecave/validation_methods/markov.py
def info(self) -> None:
    """
    Provide some basic information on the training and validation sets.

    This method displays the number of splits and the number of observations per set.

    Examples
    --------
    >>> import numpy as np
    >>> from timecave.validation_methods.markov import MarkovCV
    >>> ts = np.ones(10);
    >>> splitter = MarkovCV(ts, p=2);
    >>> splitter.info();
    Markov CV method
    ---------------
    Time series size: 10 samples
    Number of splits: 6
    Number of observations per set: 1 to 3
    """

    self._markov_partitions()

    lengths = []
    for i in range(1, len(self._suo.items()) + 1):
        lengths.extend([len(self._suo[i]), len(self._sue[i])])

    print("Markov CV method")
    print("---------------")
    print(f"Time series size: {self._n_samples} samples")
    print(f"Number of splits: {self.n_splits}")
    print(f"Number of observations per set: {min(lengths)} to {max(lengths)}")
    pass
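
To make the "Number of observations per set" line concrete, the sketch below repeats the length computation on plain dictionaries. The index sets are copied from the `split()` example on this page (10 samples, `p=2`, default seed); the variable names mirror the private attributes `_suo` and `_sue` but are otherwise illustrative.

```python
# Subset pairs copied from the split() example (10 samples, p=2).
suo = {1: [8], 2: [3], 3: [1, 2, 7]}
sue = {1: [0, 6], 2: [5], 3: [4, 9]}

# Collect the size of every subset, as info() does.
lengths = []
for i in range(1, len(suo) + 1):
    lengths.extend([len(suo[i]), len(sue[i])])

summary = f"Number of observations per set: {min(lengths)} to {max(lengths)}"
print(summary)  # Number of observations per set: 1 to 3
```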

plot(height, width)

Plot the partitioned time series.

This method allows the user to plot the partitioned time series. The training and validation sets are plotted using different colours.

Parameters:

height : int (required)
    The figure's height.

width : int (required)
    The figure's width.

Examples:

>>> import numpy as np
>>> from timecave.validation_methods.markov import MarkovCV
>>> ts = np.ones(100);
>>> splitter = MarkovCV(ts, p=1);
>>> splitter.plot(10, 10);

[Figure: markov_plot]

Source code in timecave/validation_methods/markov.py
def plot(self, height: int, width: int) -> None:
    """
    Plot the partitioned time series.

    This method allows the user to plot the partitioned time series. The training and validation sets are plotted using different colours.

    Parameters
    ----------
    height : int
        The figure's height.

    width : int
        The figure's width.

    Examples
    --------
    >>> import numpy as np
    >>> from timecave.validation_methods.markov import MarkovCV
    >>> ts = np.ones(100);
    >>> splitter = MarkovCV(ts, p=1);
    >>> splitter.plot(10, 10);

    ![markov_plot](../../../images/Markov_plot.png)
    """

    fig, axs = plt.subplots(self.n_splits, 1, sharex=True)
    fig.set_figheight(height)
    fig.set_figwidth(width)
    fig.supxlabel("Samples")
    fig.supylabel("Time Series")
    fig.suptitle("Markov CV method")

    for it, (training, validation, _) in enumerate(self.split()):

        axs[it].scatter(training, self._series[training], label="Training set")
        axs[it].scatter(
            validation, self._series[validation], label="Validation set"
        )
        axs[it].set_title("Iteration {}".format(it + 1))
        axs[it].legend()

    plt.subplots_adjust(hspace=0.5)
    plt.show()

    return

split()

Split the time series into training and validation sets.

This method splits the series' indices into disjoint sets containing the training and validation indices. At every iteration, an array of training indices and another one containing the validation indices are generated. Note that this method is a generator: to access the indices, iterate over it with a for loop or call the built-in next() function on it.

Yields:

ndarray
    Array of training indices.

ndarray
    Array of validation indices.

float
    Used for compatibility reasons (always 1.0 in this implementation). Irrelevant for this method.

Examples:

>>> import numpy as np
>>> from timecave.validation_methods.markov import MarkovCV
>>> ts = np.ones(10);
>>> splitter = MarkovCV(ts, p=2);
>>> for ind, (train, val, _) in enumerate(splitter.split()):
... 
...     print(f"Iteration {ind+1}");
...     print(f"Training set indices: {train}");
...     print(f"Validation set indices: {val}");
Iteration 1
Training set indices: [8]
Validation set indices: [0 6]
Iteration 2
Training set indices: [0 6]
Validation set indices: [8]
Iteration 3
Training set indices: [3]
Validation set indices: [5]
Iteration 4
Training set indices: [5]
Validation set indices: [3]
Iteration 5
Training set indices: [1 2 7]
Validation set indices: [4 9]
Iteration 6
Training set indices: [4 9]
Validation set indices: [1 2 7]
Source code in timecave/validation_methods/markov.py
def split(self) -> Generator[tuple[np.ndarray, np.ndarray, float], None, None]:
    """
    Split the time series into training and validation sets.

    This method splits the series' indices into disjoint sets containing the training and validation indices.
    At every iteration, an array of training indices and another one containing the validation indices are generated.
    Note that this method is a generator. To access the indices, use a `for` loop or the built-in `next()` function.

    Yields
    ------
    np.ndarray
        Array of training indices.

    np.ndarray
        Array of validation indices.

    float
        Used for compatibility reasons. Irrelevant for this method.

    Examples
    --------
    >>> import numpy as np
    >>> from timecave.validation_methods.markov import MarkovCV
    >>> ts = np.ones(10);
    >>> splitter = MarkovCV(ts, p=2);
    >>> for ind, (train, val, _) in enumerate(splitter.split()):
    ... 
    ...     print(f"Iteration {ind+1}");
    ...     print(f"Training set indices: {train}");
    ...     print(f"Validation set indices: {val}");
    Iteration 1
    Training set indices: [8]
    Validation set indices: [0 6]
    Iteration 2
    Training set indices: [0 6]
    Validation set indices: [8]
    Iteration 3
    Training set indices: [3]
    Validation set indices: [5]
    Iteration 4
    Training set indices: [5]
    Validation set indices: [3]
    Iteration 5
    Training set indices: [1 2 7]
    Validation set indices: [4 9]
    Iteration 6
    Training set indices: [4 9]
    Validation set indices: [1 2 7]
    """

    self._markov_partitions()
    for i in range(1, len(self._suo.items()) + 1):
        train, validation = self._suo[i], self._sue[i]
        yield (train, validation, 1.0)
        train, validation = self._sue[i], self._suo[i]
        yield (train, validation, 1.0)  # two-fold cross validation
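
The two-fold pattern in the loop above (each subset pair is yielded twice, with the roles swapped) can be reproduced without the library. The sketch below is a simplified stand-in that uses plain lists instead of NumPy arrays; `two_fold_splits` is a hypothetical name, and the index sets are copied from the example output above.

```python
from typing import Iterator


def two_fold_splits(suo: dict, sue: dict) -> Iterator[tuple]:
    """Yield every subset pair twice, swapping training and validation roles."""
    for i in range(1, len(suo) + 1):
        yield suo[i], sue[i], 1.0  # first fold
        yield sue[i], suo[i], 1.0  # second fold: roles swapped


# Subset pairs copied from the split() example (10 samples, p=2).
suo = {1: [8], 2: [3], 3: [1, 2, 7]}
sue = {1: [0, 6], 2: [5], 3: [4, 9]}

gen = two_fold_splits(suo, sue)
first = next(gen)      # pull a single split with the built-in next()
remaining = list(gen)  # a for loop or list() consumes the rest

print(first)           # ([8], [0, 6], 1.0)
print(len(remaining))  # 5
```

Because the generator is consumed as it is iterated, a fresh call is needed to go through the splits again.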

statistics()

Compute relevant statistics for both training and validation sets.

This method computes relevant time series features, such as the mean and the strength of trend, for the whole time series, the training set and the validation set. It can and should be used to check that the characteristics of both the training and validation sets are, statistically speaking, similar to those of the time series one wishes to forecast. If this is not the case, using the validation method will most likely lead to a poor assessment of the model's performance.

Returns:

DataFrame
    Relevant features for the entire time series.

DataFrame
    Relevant features for the training set.

DataFrame
    Relevant features for the validation set.

Raises:

ValueError
    If the time series is composed of fewer than three samples.

ValueError
    If the folds comprise fewer than two samples.

Examples:

Frequency-domain features are not computed for the Markov CV method:

>>> import numpy as np
>>> from timecave.validation_methods.markov import MarkovCV
>>> ts = np.hstack((np.ones(5), np.zeros(5)));
>>> splitter = MarkovCV(ts, p=1);
>>> ts_stats, training_stats, validation_stats = splitter.statistics();
Frequency features are only meaningful if the correct sampling frequency is passed to the class.
>>> ts_stats
   Mean  Median  Min  Max  Variance  P2P_amplitude  Trend_slope  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
0   0.5     0.5  0.0  1.0      0.25            1.0    -0.151515            1.59099            0.111111              0.111111
>>> training_stats
   Mean  Median  Min  Max  Variance  P2P_amplitude  Trend_slope  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
0   0.5     0.5  0.0  1.0      0.25            1.0         -1.0                inf            1.000000              1.000000
0   0.5     0.5  0.0  1.0      0.25            1.0         -1.0                inf            1.000000              1.000000
0   0.5     0.5  0.0  1.0      0.25            1.0         -1.0                inf            1.000000              1.000000
0   0.5     0.5  0.0  1.0      0.25            1.0         -0.4            1.06066            0.333333              0.333333
>>> validation_stats
   Mean  Median  Min  Max  Variance  P2P_amplitude  Trend_slope  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
0   0.5     0.5  0.0  1.0      0.25            1.0         -1.0                inf            1.000000              1.000000
0   0.5     0.5  0.0  1.0      0.25            1.0         -1.0                inf            1.000000              1.000000
0   0.5     0.5  0.0  1.0      0.25            1.0         -0.4            1.06066            0.333333              0.333333
0   0.5     0.5  0.0  1.0      0.25            1.0         -1.0                inf            1.000000              1.000000
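
The kind of comparison this method is meant to support can be sketched with the standard library alone. The example below is illustrative only: the index sets are hypothetical (they are not produced by `MarkovCV`), the `summarise` and `similar` helpers are not part of the library, and the tolerance is an arbitrary choice.

```python
from statistics import mean, pvariance

# Same series as in the example above: five ones followed by five zeros.
ts = [1.0] * 5 + [0.0] * 5

# Hypothetical training/validation index sets, chosen by hand for illustration.
train_idx = [0, 3, 6, 9]
val_idx = [1, 4, 5, 8]


def summarise(values: list) -> dict:
    """Compute two of the features reported by statistics()."""
    return {"Mean": mean(values), "Variance": pvariance(values)}


def similar(a: dict, b: dict, tol: float = 0.1) -> bool:
    """Return True if every feature differs by at most tol (arbitrary threshold)."""
    return all(abs(a[key] - b[key]) <= tol for key in a)


full = summarise(ts)
train = summarise([ts[i] for i in train_idx])
val = summarise([ts[i] for i in val_idx])

print(full)                  # {'Mean': 0.5, 'Variance': 0.25}
print(similar(full, train))  # True
print(similar(full, val))    # True
```

If either check fails for a given split, the corresponding set does not resemble the full series on these features, which is the situation the method description warns about.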
Source code in timecave/validation_methods/markov.py
def statistics(self) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Compute relevant statistics for both training and validation sets.

    This method computes relevant time series features, such as mean, strength-of-trend, etc. for the whole time series, the training set and the validation set.
    It can and should be used to ensure that the characteristics of both the training and validation sets are, statistically speaking, similar to those of the time series one wishes to forecast.
    If this is not the case, using the validation method will most likely lead to a poor assessment of the model's performance.

    Returns
    -------
    pd.DataFrame
        Relevant features for the entire time series.

    pd.DataFrame
        Relevant features for the training set.

    pd.DataFrame
        Relevant features for the validation set.

    Raises
    ------
    ValueError
        If the time series is composed of fewer than three samples.

    ValueError
        If the folds comprise fewer than two samples.

    Examples
    --------

    Frequency-domain features are not computed for the Markov CV method:

    >>> import numpy as np
    >>> from timecave.validation_methods.markov import MarkovCV
    >>> ts = np.hstack((np.ones(5), np.zeros(5)));
    >>> splitter = MarkovCV(ts, p=1);
    >>> ts_stats, training_stats, validation_stats = splitter.statistics();
    Frequency features are only meaningful if the correct sampling frequency is passed to the class.
    >>> ts_stats
       Mean  Median  Min  Max  Variance  P2P_amplitude  Trend_slope  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
    0   0.5     0.5  0.0  1.0      0.25            1.0    -0.151515            1.59099            0.111111              0.111111
    >>> training_stats
       Mean  Median  Min  Max  Variance  P2P_amplitude  Trend_slope  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
    0   0.5     0.5  0.0  1.0      0.25            1.0         -1.0                inf            1.000000              1.000000
    0   0.5     0.5  0.0  1.0      0.25            1.0         -1.0                inf            1.000000              1.000000
    0   0.5     0.5  0.0  1.0      0.25            1.0         -1.0                inf            1.000000              1.000000
    0   0.5     0.5  0.0  1.0      0.25            1.0         -0.4            1.06066            0.333333              0.333333
    >>> validation_stats
       Mean  Median  Min  Max  Variance  P2P_amplitude  Trend_slope  Strength_of_trend  Mean_crossing_rate  Median_crossing_rate
    0   0.5     0.5  0.0  1.0      0.25            1.0         -1.0                inf            1.000000              1.000000
    0   0.5     0.5  0.0  1.0      0.25            1.0         -1.0                inf            1.000000              1.000000
    0   0.5     0.5  0.0  1.0      0.25            1.0         -0.4            1.06066            0.333333              0.333333
    0   0.5     0.5  0.0  1.0      0.25            1.0         -1.0                inf            1.000000              1.000000
    """

    columns = [
        "Mean",
        "Median",
        "Min",
        "Max",
        "Variance",
        "P2P_amplitude",
        "Trend_slope",
        "Strength_of_trend",
        "Mean_crossing_rate",
        "Median_crossing_rate",
    ]

    if self._n_samples <= 2:

        raise ValueError(
            "Basic statistics can only be computed if the time series comprises more than two samples."
        )

    print("Frequency features are only meaningful if the correct sampling frequency is passed to the class.")

    full_features = get_features(self._series, self._fs)[columns]
    training_stats = []
    validation_stats = []

    for training, validation, _ in self.split():

        training_feat = get_features(self._series[training], self._fs)
        training_stats.append(training_feat[columns])

        validation_feat = get_features(self._series[validation], self._fs)
        validation_stats.append(validation_feat[columns])

    training_features = pd.concat(training_stats)
    validation_features = pd.concat(validation_stats)

    return (full_features, training_features, validation_features)