Computationally proving (or disproving) ISML’s claim that variance increases with k in k-fold cross-validation

The authors of the popular book Introduction to Statistical Machine Learning claim that, in k-fold cross-validation, the variance of the estimate of a model’s MSE increases as k increases. In the extreme case, leave-one-out cross-validation should exhibit maximum variance.

I know that this claim is somewhat controversial, and I have read a few papers about it (Bengio and Grandvalet, “No Unbiased Estimator of the Variance of K-Fold Cross-Validation”; Kohavi, “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection”) and followed the recent discussion on CrossValidated.

I would like to set up a computational experiment to independently verify this claim. My idea is to work on synthetic data (analogously to what the author of the CrossValidated answer has done). To this end, I will generate a roughly linear dataset: y = 1.5 * x + e, where e ~ N(0,1) is the error. Let’s say my dataset consists of n points. I can then perform k-fold cross-validation for all values of k from 2 to n. For a fixed k, I will train the k models and obtain k MSEs, say MSE_1, …, MSE_k. The MSE associated with k will then be the average of these MSEs.

By repeating the above experiment a large number of times, say m, generating new data each time, I could then get a pretty accurate value for the “true” estimate of the MSE given by each value of k. The estimate associated with a fixed k would be the average, over the m simulations, of the MSE associated with k in each simulation.
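The experiment described above could be sketched roughly as follows. This is only an illustration under assumptions that are not in the post: x is drawn uniformly on [0, 10], n = 20, m = 200, only a few values of k are tried, and the helper names (fit_line, kfold_mse, simulate) are made up. It reports the mean and the variance of the CV estimate across the m datasets; estimating the bias would additionally need a reference value for the true prediction error, which is not computed here.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a single predictor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def kfold_mse(xs, ys, k, rng):
    """Average held-out MSE over k folds of a shuffled dataset."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k near-equal folds
    fold_mses = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test]
        fold_mses.append(statistics.fmean(errors))
    return statistics.fmean(fold_mses)

def simulate(n=20, m=200, ks=(2, 5, 10, 20), seed=0):
    """Mean and variance of the CV estimate over m freshly generated datasets."""
    rng = random.Random(seed)
    estimates = {k: [] for k in ks}
    for _ in range(m):
        xs = [rng.uniform(0.0, 10.0) for _ in range(n)]  # assumed design for x
        ys = [1.5 * x + rng.gauss(0.0, 1.0) for x in xs]
        for k in ks:
            estimates[k].append(kfold_mse(xs, ys, k, rng))
    return {k: (statistics.fmean(v), statistics.variance(v)) for k, v in estimates.items()}

results = simulate()
for k, (mean_est, var_est) in results.items():
    print(f"k={k:2d}  mean CV-MSE={mean_est:.3f}  variance={var_est:.4f}")
```

With k equal to n (here 20) the loop performs leave-one-out cross-validation, so the last row is the extreme case mentioned in the claim.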

I would like, however, to decompose this into variance and bias (squared). I have the feeling that knowing the underlying distribution of the error should allow me to calculate both the bias and the variance, but I am unsure how to proceed.

Can someone shed light on this?

Also, what if instead of generating new data at each of the m iterations, I just work on the same dataset, but I simply shuffle it before applying k-fold? How would that impact the robustness of my results?

submitted by /u/PotatoResearch

Interpreting odds ratios from an ordinal logistic regression with categorical predictor variables

So the title pretty much sums it up. How would I interpret ORs from an ordinal logistic regression given categorical predictors? Would I just treat them the same as in ordinary logistic regression with categorical predictors?


submitted by /u/phraudsta

Reasons why a SAR model is not significantly reducing spatial autocorrelation

What could be the possible reasons for a spatial autoregressive (SAR) model (lagged on the dependent variable) not significantly reducing autocorrelation in the residuals? I checked for spatial dependence between the residuals, added the lagged term, fitted the SAR model, and then checked for autocorrelation in the residuals again. The same number of regions still reject the null hypothesis of no spatial autocorrelation under the Moran’s/Geary’s tests.

submitted by /u/msspezza

How to recode into different variable?

Statistics Question

I have been trying to work out how SPSS works, and it is pretty OK in terms of what to use and how to get to those settings. But I am stuck on how to recode one variable into another.

For example, I have to recode inch into foot. I know that you have to use a table in which you convert the values (from your own data) into the new values (conversion table of cm-inch/2.54). As I was doing that, I realized that some of the values I collected are the same, and SPSS would not let me use the same value as an input twice. So I just continued with all the other values that are not the same. When I ran the syntax and went to the dataset tab, it gave me nothing but a variable with the label “foot”. Does anyone know what I am doing wrong?

This is the syntax:

COMPUTE highinch=Height/2.54.

VARIABLE LABELS highinch 'COMPUTE highinch=Height/2.54'.


RECODE highinch (37=23) (39=24.5) (36=23) (40=25) (43=27.5) INTO foot.

VARIABLE LABELS foot 'american shoe size'.



Edit: I have tried using a different value for each data point and it doesn’t work.

submitted by /u/Hakunara-10

Efficient and Robust Scale Estimation [PDF] – A set of slides describing a highly efficient method of robustly estimating the scale parameter from a set of data

The scale estimator they suggest is interesting:

  • Given a sample X₁, …, Xₙ take all the pairs (Xᵢ, Xⱼ)
  • For each pair (Xᵢ, Xⱼ) compute the pair averages: (Xᵢ + Xⱼ)/2
  • Finally, the scale is estimated by taking the Interquartile Range (IQR) of all the computed pair averages.
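The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the slides’ implementation: it assumes that “all the pairs” means pairs with i < j, it leaves the result unscaled (no consistency constant is applied), and the function name pairwise_mean_iqr and the contaminated test sample are made up for the example.

```python
import random
import statistics
from itertools import combinations

def pairwise_mean_iqr(sample):
    """IQR of all pairwise averages (X_i + X_j) / 2 for i < j, left unscaled."""
    pair_means = [(a + b) / 2 for a, b in combinations(sample, 2)]
    q1, _, q3 = statistics.quantiles(pair_means, n=4)
    return q3 - q1

rng = random.Random(0)
clean = [rng.gauss(0.0, 1.0) for _ in range(200)]
contaminated = clean[:190] + [50.0] * 10  # 5% gross outliers (illustrative)

# Compare the robust estimate with the ordinary standard deviation.
print(pairwise_mean_iqr(clean), statistics.stdev(clean))
print(pairwise_mean_iqr(contaminated), statistics.stdev(contaminated))
```

On the contaminated sample the ordinary standard deviation is inflated by the outliers, while the IQR of the pairwise averages moves comparatively little.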

If this seems weird, the sample variance can actually be derived in a similar way.

It can be shown that the variance of a distribution is equal to E[(X – Y)²/2], where X and Y are two i.i.d. random variables drawn from that distribution. This is the case because:

E[(X – Y)²]

= E[X² – XY + Y² – YX]

= E[X²] – E[X]E[Y] + E[Y²] – E[X]E[Y]

= (E[X²] – E[X]²) + (E[Y²] – E[Y]²)

= 2Var[X]

(Remember that E[XY] = E[X]E[Y] because X and Y are independent, and that E[X] = E[Y] and E[X²] = E[Y²] because X and Y are from the same distribution.)

So given an i.i.d sample X₁, …, Xₙ from some probability distribution we can estimate the variance by taking some pair (Xᵢ, Xⱼ) and computing (Xᵢ – Xⱼ)² /2.

While this estimator will give us an unbiased estimate of σ², it’s not a very good one. What we need to do is combine all these crappy little estimators together into one good one. Here is a better estimator of σ²:

  • Given a sample X₁, …, Xₙ take all the pairs (Xᵢ, Xⱼ)
  • For each pair (Xᵢ, Xⱼ) compute a crappy estimate of σ² given by sᵢⱼ² = (Xᵢ – Xⱼ)²/2
  • To combine all the crappy estimators into one good estimator we take the average of all the sᵢⱼ². Call the average s².

It isn’t too hard to show that s² is equal to the usual unbiased sample variance (the one with “n-1”).
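A quick numerical check of this identity, as a sketch only: it assumes pairs are taken with i < j, and the function name pairwise_variance and the simulated sample are made up for the illustration.

```python
import random
import statistics
from itertools import combinations

def pairwise_variance(sample):
    """Average of (X_i - X_j)^2 / 2 over all pairs with i < j."""
    terms = [(a - b) ** 2 / 2 for a, b in combinations(sample, 2)]
    return statistics.fmean(terms)

rng = random.Random(1)
data = [rng.gauss(10.0, 3.0) for _ in range(50)]

s2_pairs = pairwise_variance(data)
s2_usual = statistics.variance(data)  # the usual estimator, dividing by n - 1
print(s2_pairs, s2_usual)
```

The two printed numbers should agree up to floating-point rounding, because the sum of (Xᵢ – Xⱼ)² over all pairs with i < j equals n times the sum of squared deviations from the sample mean.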

submitted by /u/Candid_Cryptographer

Feeling so lost in my Probability class

I’m a sophomore stats major, and I’ve kinda fucked up. The midterm for my probability class is on Monday and I’ve missed far too many classes to get a grip on what’s going on now. It’s an early morning class and I’ve given myself too many excuses not to go, so now I’m feeling straight hopeless about this midterm. I know the concepts of a PDF, PMF and CDF, but after that the math is just escaping me. I’ve always done well in my math classes, but fuck, this one you can’t fuck around in, which I’m learning the brutally hard way. I need urgent help in this class; if anyone knows an online service to learn probability I will be eternally grateful. The textbook I’m using isn’t even the right edition and is cryptic as fuck, so it’s taking me hours just to comprehend a single chapter. As a stats major I feel like I fucked myself. SOS!!!!

submitted by /u/comicholdinghands

I have an exam tomorrow and I’m stuck on a practice question

12 of the top 20 finishers in the 2009 PGA Championship used a Titleist brand golf ball. Suppose these results are representative of the probability that a randomly selected PGA Tour player uses a Titleist brand golf ball. For a sample of 15 PGA Tour players:

a). Compute the probability that exactly 10 of the 15 PGA Tour players use a Titleist golf ball.

Am I supposed to be using the binomial distribution? I’m confused.
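If the binomial model is the intended one (an assumption: 15 independent players, each using a Titleist ball with probability 12/20 = 0.6), part a) could be computed along these lines. The function name binomial_pmf is made up for the sketch.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

p_titleist = 12 / 20  # estimated from the top-20 finishers
prob = binomial_pmf(10, 15, p_titleist)
print(round(prob, 4))
```

Under these assumptions the probability comes out at roughly 0.186.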

submitted by /u/markofj12

Simulated data and permutation tests:


So I am working with my secondary advisor on learning more about permutation tests as they relate to n of 1 clinical trials, and I am having trouble translating it into code.

I understand the theory behind the test and the sampling process, but could anyone present a simple applied example with similar outcomes?

Take for example: intervention= drug X on pain in pediatric patients

outcome= analog scale of pain from 1-10, for a group of 6 patients.

Also say you know the baseline means for each patient and need to simulate (i.e., don’t know) the post-intervention values.
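One possible reading of this setup, sketched in Python: a paired sign-flip permutation test across the six patients. Everything numeric here is an assumption made for illustration (the baseline scores, the simulated average drop of 2 points, the noise level). A strict n-of-1 design would instead permute treatment periods within a single patient, which this sketch does not do.

```python
import random
import statistics
from itertools import product

rng = random.Random(42)

# Hypothetical baseline pain scores (1-10 scale) for 6 patients.
baseline = [7, 6, 8, 5, 7, 9]
# Simulated post-intervention scores: assumed average drop of 2 points plus
# noise, rounded and clipped to the 1-10 scale.
post = [min(10, max(1, round(b - 2 + rng.gauss(0, 1)))) for b in baseline]

diffs = [p - b for p, b in zip(post, baseline)]
observed = statistics.fmean(diffs)

# Under the null of no effect, each patient's difference is equally likely to
# carry either sign, so enumerate all 2^6 = 64 sign assignments exactly.
perm_means = [
    statistics.fmean([s * d for s, d in zip(signs, diffs)])
    for signs in product((1, -1), repeat=len(diffs))
]

# One-sided p-value: share of sign assignments whose mean difference is at
# least as low (as much pain reduction) as the observed one.
p_value = sum(m <= observed for m in perm_means) / len(perm_means)
print(diffs, observed, p_value)
```

With only six patients the smallest attainable one-sided p-value is 1/64, which is worth keeping in mind when choosing a significance level.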

submitted by /u/the1whowalks

What exactly is the problem with nonindependence of observations in regression?

I realize that I’ve never actually asked why this is an issue.

I know that one problem is that you might be missing explanations for the effects you observe. E.g., if you test ten kids from New York and ten kids from Houston, and find that the New York kids have a higher typing speed, BUT all of the New York kids went to the same school, then you have a hard time making the case that New York vs. Houston is explaining the effect. Instead, the effect could be arising due to specific schools.

Is there another reason other than the one I alluded to above?

submitted by /u/UnderwaterDialect