I just had a small insight about regression when there are missing values in the target. I'm surprised I missed this in the regression courses I've attended, and I'm curious what this technique is called and what its pros/cons are.

If we have a data matrix x with n observations and f features, and y with p target values, then regression is to find W such that

    y ≈ xW

Since xᵀx is symmetric/square, and therefore often invertible, we can estimate the least-squares solution as

    xᵀy = xᵀxW
    (xᵀx)⁻¹xᵀy = (xᵀx)⁻¹xᵀxW
    (xᵀx)⁻¹xᵀy = W

where (xᵀx)⁻¹ is (f × f) and xᵀy is (f × p).

Note that we can calculate (xᵀx)⁻¹ without y! So even if some y_i = NaN, we can still use the (non-missing) feature values of those rows for this part. An experiment shows it works well, here with n=50, f=19 features plus a bias column, p=1 (torch is just like numpy):

```python
import torch

torch.manual_seed(1)
x = torch.randn(50, 20)
x[:, 0] = 1  # Bias term.
y = torch.randn(50, 1)

# m = torch.randn(50) > 0.5   # Target (m)issing at random
m = y.sum(-1) > 0.5           # Target (m)issing in a biased way

# Use all rows (impossible in real life due to missingness!)
W_opt = x.t().mm(x).inverse().mm(x.t()).mm(y).t()

# Use all rows to calculate some stuff (proposed)
W_use = x.t().mm(x).inverse().mm(x[~m].t()).mm(y[~m]).t()

# Use only rows with no missing target (regular technique)
W_drop = x[~m].t().mm(x[~m]).inverse().mm(x[~m].t()).mm(y[~m]).t()

# Impute missing targets with the observed mean (also commonplace)
y_hat = y * 1
y_hat[m] = y[~m].mean()
W_impute = x.t().mm(x).inverse().mm(x.t()).mm(y_hat).t()

# Calculate sum of squared errors
print('W_opt   ', (y - W_opt.mm(x.t()).t()).pow(2).sum())
print('W_use   ', (y - W_use.mm(x.t()).t()).pow(2).sum())
print('W_drop  ', (y - W_drop.mm(x.t()).t()).pow(2).sum())
print('W_impute', (y - W_impute.mm(x.t()).t()).pow(2).sum())
```

With the target missing whenever it's above a threshold (as above):

    >> W_opt    tensor(25.7269)
    >> W_use    tensor(33.9462)
    >> W_drop   tensor(49.8913)
    >> W_impute tensor(41.5520)

With the target missing at random:

    >> W_opt    tensor(25.7269)
    >> W_use    tensor(28.2667)
    >> W_drop   tensor(58.3970)
    >> W_impute tensor(28.4812)

Can anyone put this into an academic context? What is this called, and why haven't I heard about it before?

TL;DR: Using all feature data for part of the regression calculation works even with NaNs in the target value.

submitted by /u/ragulpr
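One numerical side note on the snippet above: explicitly forming (xᵀx)⁻¹ can be ill-conditioned. The same "all rows for xᵀx, observed rows for xᵀy" estimator can instead be computed with a linear solve. A minimal sketch of my own (in NumPy, which torch mirrors, on synthetic data analogous to the example; not part of the original post):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((50, 20))
x[:, 0] = 1.0  # Bias term.
y = rng.standard_normal((50, 1))
m = y.sum(-1) > 0.5  # Target (m)issing in a biased way

# Proposed estimator: all rows for x^T x, only observed rows for x^T y.
# Solving (x^T x) W = x[~m]^T y[~m] avoids the explicit inverse.
W_use = np.linalg.solve(x.T @ x, x[~m].T @ y[~m]).T

# It agrees with the explicit-inverse formulation up to float tolerance.
W_use_inv = (np.linalg.inv(x.T @ x) @ x[~m].T @ y[~m]).T
assert np.allclose(W_use, W_use_inv)
```

The solve is both faster and more stable than inverting xᵀx and multiplying, and it makes no difference to which rows enter each factor.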