[Q] Power analysis question for percent reduction?

How would I find the required sample size for alpha = 0.05, power = 0.80 for the following situation? We are interested in the percent reduction of arterial volume before and after treatment in the same individual. There will be a placebo group and a treatment group. From previous literature, the placebo group should have 0% change and the treatment group is expected to see a 2% change with a group standard deviation of 0.5%. We are only interested in the percent reduction because different individuals will have different artery sizes so absolute differences wouldn’t make sense to compare.

When I put those numbers into Stata’s power analysis, it says only 3 subjects are needed. This both makes sense and makes me wary: a 2% change is essentially 4 SDs away from 0%, so that’s quite a large effect, yet 3 subjects seems awfully low. Can anyone shed some light on this?
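(For a sanity check outside Stata: the same numbers can be run through base R's power.t.test, assuming the analysis is a two-sample t-test on the percent changes, which may not be exactly what Stata is computing.)

    # two-sample t-test: 2% vs 0% mean change, common SD of 0.5%
    power.t.test(delta = 2, sd = 0.5, sig.level = 0.05, power = 0.80,
                 type = "two.sample")
    # reports n of roughly 3 per group: with a standardized effect of
    # delta/sd = 4, very few subjects are needed on paper

The tiny n is a direct consequence of the huge assumed effect size, so the real question is whether the 0.5% SD from the literature is believable.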

submitted by /u/ButtholePlungerz

[Q] What is the difference between Generalised Cross Validation and K-Fold Cross Validation?

Hey folks,

I just implemented 5-fold cross-validation to determine the optimal penalty value for a ridge regression (code below). I am using the lm.ridge function from the MASS library.

I double-checked the results of my own 5-fold cross-validation function against the generalised cross-validation that is built into lm.ridge. To my surprise, the two optimal penalty values are quite far apart (a difference of about 4.6).

It got me curious: why are the results so far apart? Can the difference in the optimal lambda value be explained by the difference between the two methods?

    # Ridge regression
    set.seed(3)
    library(MASS)
    grid = 10^seq(10, -2, length = 100)  # grid of lambda/penalty values
    ridge_res = rep(NA, 100)             # CV error for each lambda

    # adapt lm cross-validation for the ridge grid; lambda is now passed
    # in explicitly instead of being read from a global j
    cross_val_ridge = function(data, k, lambda) {
      require(MASS)
      set.seed(1)  # student number as seed
      cv_index = sample(rep(1:k, length = nrow(data)), nrow(data))
      cv_test_e = rep(NA, k)  # empty vector to store CV errors in
      for (i in 1:k) {
        cv_train = data[cv_index != i, ]
        cv_test = data[cv_index == i, ]
        cv_lm = lm.ridge(MEDV ~ ., data = cv_train, lambda = lambda)
        # compute predictions by hand: intercept + X %*% slope coefficients
        X_test = as.matrix(cv_test[, names(cv_test) != "MEDV"])
        pred.ridge = coef(cv_lm)[1] + X_test %*% coef(cv_lm)[-1]
        cv_test_e[i] = mean((cv_test$MEDV - pred.ridge)^2)
      }
      return(mean(cv_test_e))
    }

    for (j in 1:100) {
      ridge_res[j] = cross_val_ridge(train, k = 5, lambda = grid[j])
    }
    which.min(ridge_res)  # index 76
    grid[76]              # optimal lambda as per own 5-fold CV: 8.111308

    # double check using generalised cross-validation
    ridge = lm.ridge(MEDV ~ ., data = train, lambda = grid)
    which.min(ridge$GCV)  # index 79
    grid[79]              # optimal lambda as per GCV: 3.511192
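For context on GCV itself: generalised cross-validation is not a k-fold scheme at all. For ridge regression with smoother matrix H(lambda), it is a rotation-invariant approximation to leave-one-out CV, roughly

    GCV(lambda) = n * RSS(lambda) / (n - tr(H(lambda)))^2

so a 5-fold estimate and lm.ridge's GCV minimize different criteria over different effective resamplings. If the CV curve is flat near its minimum, a gap such as 8.1 vs. 3.5 in lambda is entirely plausible.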

submitted by /u/deniz_sen

[Q] What sample size is appropriate to detect a small failure rate?

I have a process with a month-to-month failure rate of 0.6% and a month-to-month standard deviation of 0.2%. If we make a change to the process that we expect will reduce the failure rate to 0.3%, how many parts need to be processed to have 95% confidence that we succeeded?
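One hedged starting point in R, reading "95% confidence that we succeeded" as a two-proportion test at alpha = 0.05 with 95% power to detect the drop from 0.6% to 0.3% (which may not be the intended framing):

    # two-proportion power calculation: baseline 0.6% vs improved 0.3%
    power.prop.test(p1 = 0.006, p2 = 0.003,
                    sig.level = 0.05, power = 0.95)
    # returns roughly 13,000 parts per group; rare events need large n

Note this framing treats failures as binomial and ignores the month-to-month SD; if months vary more than binomial noise allows, the required n only grows.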

submitted by /u/I_ate_it_all

[Q] Nuts and bolts of multivariate analysis

Hi all,

I understand that doing a multivariable analysis controls for potential relationships among the predictor variables, but how is that actually done?

Does the stats software basically calculate an odds ratio for every possible combination of predictor variables to check for a relationship?

If such a relationship is found, how is that information then used to derive the adjusted odds ratio?
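As a concrete illustration of the mechanics (a sketch with made-up variable names, not any particular package's internals): the adjusted odds ratio is not assembled from pairwise odds ratios. All predictors are fit in a single logistic regression, and each exponentiated coefficient is that variable's odds ratio with the others held fixed.

    # hypothetical data frame d with a binary outcome and covariates
    fit = glm(died ~ treatment + age + sex, data = d, family = binomial)
    exp(coef(fit))     # adjusted odds ratios, each holding the others fixed
    exp(confint(fit))  # profile-likelihood CIs (uses the MASS package)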

submitted by /u/1Surgeon

[Q] Idk how to phrase this, but does rejecting a null always mean there is sufficient evidence to support the claim, and does failing to reject it mean there isn’t sufficient evidence to support the claim?

I’m always bugged by this question on our homework, and I’m starting to see the pattern. I just don’t know if my assumptions are going to lead me down the right track. Thanks in advance!

submitted by /u/DeathLigerLion

[Q] How to propensity score match 3 treatment groups in R?

I am trying to propensity score match in R with 3 treatment groups. There is a good tutorial using the MatchIt package for binary treatments, but I do not understand how to do this for 3 groups. From my search, MatchIt is not able to handle 3 groups, which leaves me with twang and CBPS. I simply do not understand how to extract my matched sets from either of these packages. Can anyone please help and point me in the direction of some sort of tutorial on how I can do this?
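In case it helps to see the shape of it, here is a rough sketch with twang's mnps (multinomial propensity scores). One caveat: for 3+ groups these packages produce weights rather than matched sets, so the extraction step pulls weights, not pairs. Variable names are hypothetical:

    library(twang)
    library(survey)
    # treat is a factor with 3 levels; x1 and x2 are hypothetical covariates
    fit = mnps(treat ~ x1 + x2, data = d, estimand = "ATE",
               stop.method = "es.mean", n.trees = 3000)
    d$w = get.weights(fit, stop.method = "es.mean")
    # analyze the outcome with the weights via a survey design
    dsn = svydesign(ids = ~1, weights = ~w, data = d)
    summary(svyglm(outcome ~ treat, design = dsn))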

submitted by /u/Vervain7

[Q] Probabilistic graphical models, factorization and parameters

In a “deep learning and graphical models” class, we recently started an intro unit on probabilistic graphical models. The first section is about factoring joint distributions and determining the total number of parameters needed.

For example, here’s an unanswered question from our lecture slides:

—–

“Let A be a random variable (RV) with support {0, 1, 2}. Similarly, let B, C and D be random variables with supports {0, 1}, {1, 2, 3} and {10, 20}, respectively.

  1. Write down the joint distribution of A, B, C and D in a factored form. How many numbers (parameters) are needed to fully specify this joint distribution? Write down all factorizations that are possible for P(A, B, C) and P(A, B).

  2. If we know that P(A|B, C, D) = P(A|B) and P(C|D) = P(C), then what is the number of parameters needed to specify the joint distribution P(A, B, C, D)?”

—–

Can someone point me in the right direction on reading resources (or, better yet, explain how one should make sense of these sorts of questions)?
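For what it's worth, the first part seems to come down to counting cells: the joint over (A, B, C, D) has 3 × 2 × 3 × 2 = 36 outcomes, and since the probabilities must sum to 1, that is 36 − 1 = 35 free parameters. The second part is the same counting applied factor by factor. For example, under the ordering P(A, B, C, D) = P(A|B, C, D) P(B|C, D) P(C|D) P(D), the given independencies shrink the first factor to P(A|B) (2 × (3 − 1) = 4 parameters) and the third to P(C) (2 parameters), while P(B|C, D) contributes 6 and P(D) contributes 1, for 13 in total. Bishop's Pattern Recognition and Machine Learning (chapter 8) and Koller & Friedman's Probabilistic Graphical Models both cover exactly this kind of counting.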

submitted by /u/jbuddy_13

[Q] How to test the “strength” of correlation between two sets of data (at given points along the data)?

Hey all. I’m not great at statistics, but I’m still fascinated by them. I was wondering if there happened to be a test that would measure the correlation between two data series at given points along them, so the result could be charted underneath the series.

Something like the correlation coefficient but not just measuring if the two are moving “kinda the same”.

For example, if the two are moving perfectly relative to one another, give it a 1. If they’re moving sorta relative, give it a 0.8, etc.

Unless I’m using it wrong, the correlation I get is always 1 if they’re moving in the same direction, even if by wildly different rates.

Thanks! (hope I explained it ok, I’m really not great at this, admire the beauty of stats though!)
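One possible reading of the question is a rolling-window correlation, which gives the co-movement at each point along the data rather than one global number. A sketch using the zoo package (the window width of 20 is arbitrary):

    library(zoo)
    # x and y are two numeric series of equal length
    roll_cor = rollapply(cbind(x, y), width = 20,
                         FUN = function(z) cor(z[, 1], z[, 2]),
                         by.column = FALSE)
    plot(roll_cor, type = "l")  # correlation within each window

On the "always 1" issue: Pearson correlation is scale-free, so two series moving in the same direction at wildly different rates can still correlate near 1. If agreement in magnitude matters, Lin's concordance correlation coefficient is one alternative to look up.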

submitted by /u/loracsr

[Q] How to test effect of treatments in non normally distributed and differently shaped data?

Hi all,

I want to find out whether the data differ across three treatments; my hypothesis is that they do.

Unfortunately, the data collection process is very time- and money-consuming, so I have a very small sample size for each treatment (though always bigger than 5) and different sample sizes across treatments. The data are not normally distributed (so I excluded parametric tests) and the groups are not similarly shaped (so I excluded Kruskal-Wallis, Mann-Whitney and Mood’s median test, as they would not be testing my hypothesis but would instead be testing mean ranks).

Does anyone have any idea of what test I could use to test my hypothesis given the data?

Thanks very much in advance!
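One option that drops both the normality and the similar-shape assumptions is a permutation test on a statistic chosen to match the hypothesis (the ANOVA F below, but a difference in means or medians works the same way). A bare-bones sketch, assuming a numeric vector y and a three-level treatment factor g:

    set.seed(1)
    obs_F = anova(lm(y ~ g))$`F value`[1]  # observed test statistic
    perm_F = replicate(10000,
      anova(lm(y ~ sample(g)))$`F value`[1])  # shuffle treatment labels
    mean(perm_F >= obs_F)  # permutation p-value

With samples this small the test is valid but not powerful, so a non-significant result is weak evidence either way.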

submitted by /u/Ordep22

[Q] How to interpret a negative coefficient from a logistic regression?

I have a solid understanding of statistics. I was recently dabbling with some data at work, and I was interested in whether years since your last promotion (my predictor) can predict your job status (whether you are active in the company, coded as 0, or have terminated, coded as 1).

I ran a Pearson correlation on the two variables and noticed a significant but weak negative correlation. My sample size is a bit over 530, with 80-something terminated employees. I couldn’t fully understand the negative correlation, but all I knew was that something was going on. The main assumption I wanted to test is whether the years since you last received a promotion can explain the odds of you leaving the company.

I then used a logistic regression to create a prediction equation. The predictor was significant, of course; my coefficient was negative, and the odds ratio was about 1.

Something to point out is that these are all employees who have practically never received a promotion. They range from your most senior hires to new hires, with length of service (LOS) ranging from less than a year to over 30 years (13 upper outliers).

I suspect that somewhere along the way there are some groups distorting this correlation, which could explain the negative coefficient.

Anyone have any insight into what could possibly be going on?
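For reference on the mechanics (a sketch, not the exact model above): with termination coded as 1, a negative coefficient means each additional year since promotion multiplies the odds of termination by exp(b) < 1, i.e. longer-unpromoted employees were less likely to have terminated in this sample.

    # hypothetical recreation: status is 0 = active, 1 = terminated
    fit = glm(status ~ years_since_promo, data = d, family = binomial)
    coef(fit)       # negative slope on the log-odds scale
    exp(coef(fit))  # odds ratio per extra year, just under 1 here
    # one survivor-bias check: employees with many unpromoted years must
    # have stayed long enough to accumulate them, which alone can induce
    # a negative association with termination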

submitted by /u/iFlipsy