The authors of the popular book An Introduction to Statistical Learning claim that, in k-fold cross-validation, the variance of the estimate of a model's test MSE increases as k increases. At the extreme, leave-one-out cross-validation (LOOCV) should exhibit maximum variance.
I know that this claim is somewhat controversial; I have read a few papers about it (Bengio and Grandvalet, “No Unbiased Estimator of the Variance of K-Fold Cross-Validation”; Kohavi, “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection”) and followed the recent discussion on Cross Validated.
I would like to set up a computational experiment to independently verify this claim. My idea is to work on synthetic data (analogously to what the author of the Cross Validated answer did). To this end, I will generate a roughly linear dataset: y = 1.5 * x + e, where e ~ N(0,1) is the error term. Say my dataset consists of n points. I can then perform k-fold cross-validation for every value of k from 2 to n. For a fixed k, I train k models and obtain k MSEs, say MSE_1, …, MSE_k; the MSE estimate associated with k is then the average of these k values.
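A minimal, pure-Python sketch of this inner loop (the helper names `fit_line` and `kfold_mse` are my own, and I'm assuming plain least-squares simple linear regression as the model being cross-validated):

```python
import random

def fit_line(xs, ys):
    """Closed-form ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx
    return a, my - a * mx

def kfold_mse(xs, ys, k, seed=0):
    """Average test MSE over the k folds of k-fold cross-validation."""
    n = len(xs)
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # random fold assignment
    fold_mses = []
    for f in range(k):
        test = set(idx[f::k])                 # every k-th index -> fold f
        train = [i for i in idx if i not in test]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_errs = [(ys[i] - (a * xs[i] + b)) ** 2 for i in test]
        fold_mses.append(sum(sq_errs) / len(sq_errs))
    return sum(fold_mses) / k                 # average of MSE_1 .. MSE_k

# Synthetic data: y = 1.5 x + e, e ~ N(0, 1)
rng = random.Random(42)
n = 100
xs = [rng.uniform(0, 10) for _ in range(n)]
ys = [1.5 * x + rng.gauss(0, 1) for x in xs]

for k in (2, 5, 10, n):                       # k = n is LOOCV
    print(f"k={k:3d}  CV-MSE={kfold_mse(xs, ys, k):.3f}")
```

Since the model is correctly specified, each of these CV estimates should land near the irreducible error Var(e) = 1, with fluctuations that depend on k; those fluctuations are exactly what the experiment is meant to measure.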
By repeating the above experiment a large number of times, say m, generating new data each time, I could then get a fairly accurate value for the “true” estimate of the MSE given by each value of k: the estimate associated with a fixed k would be the average, over the m simulations, of the MSE associated with k in each simulation.
I would like, however, to decompose this into variance and squared bias. I have the feeling that knowing the underlying distribution of the error should allow me to compute both, but I am unsure how to proceed.
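One way this could be made concrete (a sketch; the helper name `bias_variance` and the toy numbers are my own): since e ~ N(0,1), the irreducible error is Var(e) = 1, and the target the CV estimator is aiming at — the expected test MSE of a model fit on n points — is roughly 1 plus a small parameter-estimation term, and can be approximated by Monte Carlo with a large independent test set. Given the m estimates for a fixed k and that target, the decomposition follows from the exact identity E[(est − target)²] = bias² + variance:

```python
import statistics

def bias_variance(estimates, target):
    """Decompose the mean squared error of an estimator around a known
    target: mean((est - target)^2) = bias^2 + variance (exact identity)."""
    bias = statistics.mean(estimates) - target
    variance = statistics.pvariance(estimates)   # spread over the m runs
    return bias, variance

# Toy check with made-up CV estimates and an assumed target of 1.0
ests = [0.9, 1.1, 1.05, 0.95, 1.2, 0.8]
bias, var = bias_variance(ests, 1.0)
mse_of_estimator = statistics.mean([(e - 1.0) ** 2 for e in ests])
print(bias ** 2 + var, mse_of_estimator)         # the two sides agree
```

In the experiment, `estimates` would be the m values of the CV MSE from fresh datasets, and `variance` is then precisely the quantity the book's claim is about. The subtle part, as Bengio and Grandvalet discuss, is which target one chooses (the expected test error over datasets of size n, versus the test error of the one fitted model at hand); the decomposition itself is the same either way.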
Can someone shed light on this?
Also, what if instead of generating new data at each of the m iterations, I worked on the same dataset and simply shuffled it before applying k-fold CV? How would that affect the robustness of my results?