When conducting a statistical test to see if two skewed distributions are significantly different, is it better to perform the test on the medians of the two distributions or the means of their log transforms? I’m guessing this depends on the problem at hand, but I’m just wondering in the general sense if there is a “best practices” approach to this. Thanks.
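Not a ruling on best practice, but the two options are cheap to try side by side: a rank-based test on the raw skewed data versus a t-test on the logs. A minimal sketch in Python (assuming SciPy is available; the lognormal samples are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
a = rng.lognormal(mean=0.0, sigma=1.0, size=200)  # skewed sample 1
b = rng.lognormal(mean=0.5, sigma=1.0, size=200)  # skewed sample 2, shifted median

# Option 1: rank-based (Mann-Whitney U) test on the raw, skewed data
u_stat, p_rank = stats.mannwhitneyu(a, b, alternative="two-sided")

# Option 2: t-test on the log-transformed data; for lognormal data the
# log-mean corresponds to the (log of the) median, and the transform
# restores approximate normality
t_stat, p_log = stats.ttest_ind(np.log(a), np.log(b))
```

If the two approaches disagree noticeably on your real data, that itself is informative about how well the log transform symmetrizes the distributions.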
Hello all, I would like some advice please. First, however, I'd like to state off the bat that I'm not schooled in traditional statistics but rather in quantitative psychology. I've recently been accepted to start data science in health sciences for grad school, so I'm trained in applied statistics rather than statistical theory.
I am currently writing a paper on robust statistics and learning about the differences between OLS regression and robust regression. My question is: are there any R datasets you would recommend for a comparison where OLS regression won't really work (outliers or heteroscedasticity) but a robust regression would?
Any advice appreciated, take care.
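Not a dataset recommendation, but it's also easy to simulate data where OLS breaks and a robust fit doesn't, which makes the comparison very visible. A hedged sketch in Python with synthetic data, using the Theil-Sen estimator (median of pairwise slopes) as the robust stand-in; in R the analogous exercise would be `MASS::rlm` versus `lm`:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 60
x = np.sort(rng.uniform(0.0, 10.0, n))
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, n)  # true slope is 3
y[-8:] -= 35.0  # contaminate the 8 highest-leverage points with big outliers

# OLS slope: pulled toward the outliers
ols_slope, ols_intercept = np.polyfit(x, y, 1)

# Theil-Sen slope: median of all pairwise slopes, resistant to the outliers
i, j = np.triu_indices(n, k=1)
ts_slope = np.median((y[j] - y[i]) / (x[j] - x[i]))
```

With roughly 13% contamination at high leverage, the OLS slope is dragged well away from 3 while the median-of-slopes estimate stays close, which is exactly the contrast a paper on robust regression wants to illustrate.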
I’m doing data analysis on something that has happened over three years. I want to find the most frequently spoken language among a group of people over those three years. Should I use the average or the relative frequency in my calculation? What’s the difference?
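For what it's worth, "relative frequency" just means normalizing each count by that year's total, so years with more respondents don't dominate the comparison. A toy sketch in Python with made-up records:

```python
from collections import Counter

# hypothetical data: one spoken-language entry per person per year
records = {
    2021: ["en", "en", "fr", "ar"],
    2022: ["en", "fr", "fr", "ar", "ar"],
    2023: ["ar", "ar", "en"],
}

# relative frequency per year adjusts for different group sizes
for year, langs in records.items():
    counts = Counter(langs)
    total = len(langs)
    rel = {lang: c / total for lang, c in counts.items()}
    # e.g. in 2023, "ar" has count 2 but relative frequency 2/3
```

Averaging raw counts across years would weight a year with many respondents more heavily; averaging the relative frequencies treats each year equally.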
Before I do any further analysis, I need to do a manipulation check for my experiment.
I have 107 participants randomly assigned to two groups (experimental and control). For the manipulation check I need to run an independent-samples t-test. Do I need to meet the normality assumptions, etc. before doing this, or can that wait until after I’ve done the check?
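Not an answer on whether the check can wait, but the assumption checks themselves are only a couple of lines, so they are cheap to run first. A sketch in Python with simulated manipulation-check scores (assuming SciPy; the group sizes and means here are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
experimental = rng.normal(5.2, 1.0, 54)  # hypothetical scores, n = 54
control = rng.normal(4.5, 1.0, 53)       # hypothetical scores, n = 53

# normality check within each group (Shapiro-Wilk)
p_norm_exp = stats.shapiro(experimental).pvalue
p_norm_ctl = stats.shapiro(control).pvalue

# Welch's t-test, which does not assume equal variances
t, p = stats.ttest_ind(experimental, control, equal_var=False)
```

If the Shapiro-Wilk p-values are small and the samples look clearly non-normal, a rank-based alternative such as `stats.mannwhitneyu` is the usual fallback.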
I’m going to make up an example because I think it’s easier to communicate than what I really did which involves MCMC and Bayesian inference.
Assume you have data on people’s earnings, their age, and their shoe size. Assume the only relevant people are those aged 1-10, and the only possible shoe sizes are 1-10. These are categorical variables you use in a model.
Let’s say you have a data set consisting of exactly one observation for each case where age <= shoe size. So, for example, you have the income of a person with shoe size 10 at ages 1-10 (10 observations), the income of a person with shoe size 9 at ages 1-9 (9 observations)… down to a person with shoe size 1 at age 1 (1 observation). Thus you don’t have any observations for people whose age is greater than their shoe size, but you want to be able to predict, for example, the income of a person with shoe size 3 at age 10. Let’s say that in all your observations the salaries are below 20,000, and you know (both from experience and from the data) that for a given shoe size, a higher age makes more money.
You construct a model where log(salary) = age + shoe size, where age and shoe size are categorical variables (a binary dummy variable for each possible level). You estimate the coefficients, and now you can predict salary for any combination of age and shoe size.
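Under this setup, the additive structure is what lets the model fill in cells it has never seen, like (age 10, shoe size 3). A minimal sketch of that dummy-coded fit in Python (pure NumPy; the generating coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# one observation per cell with age <= shoe size (55 rows, the staircase design)
cells = [(age, shoe) for shoe in range(1, 11) for age in range(1, shoe + 1)]
ages = np.array([a for a, _ in cells])
shoes = np.array([s for _, s in cells])
# hypothetical generating process: log-salary additive in age and shoe size
log_salary = 7.0 + 0.15 * ages + 0.05 * shoes + rng.normal(0.0, 0.05, len(cells))

def one_hot(v, levels=10):
    m = np.zeros((v.size, levels))
    m[np.arange(v.size), v - 1] = 1.0
    return m

# intercept + dummies for levels 2..10 of each factor (level 1 is the baseline)
X = np.hstack([np.ones((len(cells), 1)), one_hot(ages)[:, 1:], one_hot(shoes)[:, 1:]])
coef, *_ = np.linalg.lstsq(X, log_salary, rcond=None)

# predict the never-observed cell: age 10, shoe size 3
x_new = np.zeros(X.shape[1])
x_new[0] = 1.0               # intercept
x_new[1 + (10 - 2)] = 1.0    # age-10 dummy (age dummies occupy columns 1..9)
x_new[10 + (3 - 2)] = 1.0    # shoe-3 dummy (shoe dummies occupy columns 10..18)
pred_salary = np.exp(x_new @ coef)
```

The staircase design is connected (every shoe size shares ages with its neighbors), so the additive coefficients are identifiable even though the (age 10, shoe 3) cell itself was never observed.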
You create another model, and to compare the two models you look at an information criterion like AIC (Akaike Information Criterion); in my specific case I’m looking at DIC (Deviance Information Criterion). Let’s say the AIC for model 2 is SUBSTANTIALLY lower than for model 1.
However, when you use model 2 to estimate the salary of someone with shoe size 3 at age 10, as mentioned earlier, you get an estimate of 850,000, way higher than you know is possible in this fictional world where pretty much everyone is making less than 20,000. When you get the predictions from model 1, on the other hand, none of the estimates exceed 30,000, i.e., much more reasonable predictions.
Thus I definitely can’t use model 2 for my prediction, because as a subject-matter expert I know its output isn’t feasible. But from a statistical standpoint I feel at a loss, because its DIC was so much lower (19 compared to 60), which is a huge difference for DIC.
TL;DR: what do you do when the statistics point overwhelmingly to a model that isn’t producing feasible predictions?
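One reassurance that this situation is possible in principle: an information criterion scores in-sample fit (with a complexity penalty) and says nothing about behavior where you have no data. A toy Python sketch where a more flexible model extrapolates absurdly, whatever the AIC comparison says (all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 25)
y = 1.0 + 0.5 * x + rng.normal(0.0, 0.2, x.size)  # truly linear data

def aic(y, yhat, k):
    # Gaussian AIC up to an additive constant: n * log(RSS / n) + 2k
    n = y.size
    rss = np.sum((y - yhat) ** 2)
    return n * np.log(rss / n) + 2 * k

lin = np.polyfit(x, y, 1)    # the "right" model
flex = np.polyfit(x, y, 9)   # deliberately over-flexible model

aic_lin = aic(y, np.polyval(lin, x), 2)
aic_flex = aic(y, np.polyval(flex, x), 10)

# both fit the observed range [0, 5]; now extrapolate to x = 10
truth_at_10 = 1.0 + 0.5 * 10.0
pred_lin = np.polyval(lin, 10.0)
pred_flex = np.polyval(flex, 10.0)
```

Your shoe-size example is the same phenomenon: DIC is computed over the observed staircase of cells, while the 850,000 prediction lives entirely in the unobserved region, where DIC never looked.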
I had the pleasure of hearing the results of some research today where the speaker was presenting an ordinal categorical logistic model. The model was trying to guess individuals’ income tiers from several independent variables, with their known incomes as the outcome. It had a concordance ratio of 78%, but only 64% of the guessed categories matched reality. I’m told that the concordance ratio is the better-accepted measure of goodness of fit but, as mostly a layman, I want to reach for the plain old percent correct, and so I’m feeling a little, uh, discord here. Any input?
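For what it's worth, the two numbers answer different questions: accuracy scores exact category matches, while concordance asks whether the model ranks higher-income people higher. A toy Python sketch (made-up tiers and scores) that reproduces the same pattern of high concordance with lower accuracy:

```python
import numpy as np
from itertools import combinations

y_true = np.array([1, 1, 2, 2, 3, 3])                 # true income tiers
scores = np.array([0.2, 0.4, 0.3, 0.6, 0.7, 0.9])     # model scores
y_pred = np.array([1, 2, 1, 2, 3, 3])                 # predicted tiers

# accuracy: fraction of exact category matches
accuracy = np.mean(y_pred == y_true)

# concordance: among pairs with different true tiers, the fraction where
# the higher tier also received the higher score (ties count as half)
conc, total = 0.0, 0
for i, j in combinations(range(len(y_true)), 2):
    if y_true[i] != y_true[j]:
        total += 1
        s = (scores[j] - scores[i]) * (y_true[j] - y_true[i])
        conc += 1.0 if s > 0 else 0.5 if s == 0 else 0.0
concordance = conc / total
```

Here the model orders people almost perfectly (concordance ~92%) even though only ~67% of its hard category calls are exactly right, which is roughly the 78%-versus-64% tension in the talk: the ordering can be good while the cut points between tiers are off.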
There is a previous study in which participants made a binary decision. A follow-up study has them make a decision that taps into the same construct, but on a Likert scale (from 1-7).
I am wondering how one would use the effect size from the logistic regression (i.e., the beta value) to conduct an a priori power analysis for a study that will be analyzed using linear regression.
In other words, what does a beta value of X in logistic regression equate to, in terms of a beta value for linear regression? Is there a sensible way to make this conversion?
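One commonly cited bridge, hedged because it assumes the binary outcome reflects an underlying latent logistic variable, is to treat the logistic beta as a log odds ratio and convert it to a standardized mean difference via d = beta * sqrt(3) / pi. With a d in hand you can run an ordinary power analysis as a rough approximation for the Likert-outcome study. A sketch using only the Python standard library, with a made-up beta:

```python
import math
from statistics import NormalDist

beta_logistic = 0.8  # hypothetical log-odds coefficient from the binary study

# latent-variable conversion: log odds ratio -> approximate Cohen's d
d = beta_logistic * math.sqrt(3) / math.pi

# two-group normal-approximation power analysis:
# n per group for alpha = .05 (two-sided) and power = .80
z_alpha = NormalDist().inv_cdf(1 - 0.05 / 2)
z_power = NormalDist().inv_cdf(0.80)
n_per_group = math.ceil(2 * ((z_alpha + z_power) / d) ** 2)
```

Two caveats worth flagging: the conversion is only as good as the latent-logistic assumption, and a two-group d is a proxy for, not an exact match to, the power of a linear regression on a 1-7 outcome, so treating the resulting n as a lower-bound sanity check is safer than treating it as exact.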
I am currently trying to solidify my major. I’m going into my sophomore year, so it is important that I have it decided. My main issue is that there is very little info to be found about what you actually DO in the real world once you graduate with a degree in a given field. I loved the statistics classes I took in high school, and I’ve come to be very interested in the subject. However, I’d like to hear from real people out there about their jobs. So if any of you who are statisticians or work in the stats field would be willing to tell me what exactly you do at work, I would really appreciate it. Thanks.