Is this paper incorrectly omitting the use of false discovery rate correction methods?

See this paper; Table 3 is where I’m focusing. They used Mann-Whitney p-values with a cutoff of .05 but don’t seem to make any correction for false discovery rate, which seems wrong given the large number of comparisons they made (268 in total).

Am I right in saying that using this p-value cutoff without correcting for false discovery rate probably gave them some erroneous results?
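For scale: if all 268 null hypotheses were true, a .05 cutoff would be expected to flag about 268 × .05 ≈ 13 comparisons by chance alone. A minimal sketch of one common correction, the Benjamini-Hochberg step-up procedure, is below. The p-values are made up for illustration and are not taken from the paper, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    # Sort indices by ascending p-value, remembering the original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * q.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_k = rank
    # Reject the hypotheses holding the largest_k smallest p-values.
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, NOT the ones in the paper's Table 3.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
naive = [p <= 0.05 for p in p_vals]
bh = benjamini_hochberg(p_vals, q=0.05)
print(sum(naive), "flagged at raw p <= 0.05")
print(sum(bh), "flagged after Benjamini-Hochberg")
```

With these made-up values, fewer comparisons survive the correction than pass the raw cutoff, which is the pattern the question is worried about.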

submitted by /u/runninggartman

Interpreting statistical significance and statistical importance in survey data.

People have been raving about this using obvious examples and conceptual approaches and I still can’t see how to explain this.

I am working on a customer feedback datasheet and I have binary data I wish to correlate to a Net Promoter Score.

The statistical importance methods I have implemented, such as Shapley values on an OLS model and Jackson’s Relative Importance on a multinomial ordered logistic regression model, agree with each other and give me a sketch of which variables influenced the outcome the most.

The p-values from the significance tests, though, I find hard to interpret in words I can explain to a customer.

Say, for instance, Shapley and Jackson both conclude that waiting time is driving a low NPS outcome because it has a large negative weight. I can then directly conclude that the people who complained about waiting time gave low scores.

But when the test comes back with a high p-value, like 0.3, what does that even mean? Why is the variable important according to Shapley, the beta coefficients, and Jackson, while its p-value is essentially saying this might as well be luck? How do I explain that to someone who just wants to know what’s wrong with his service?
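One way to see how the two can disagree: an importance measure reflects how large the estimated effect is, while the p-value also depends on how precisely that effect is estimated. The sketch below uses a normal-approximation Wald test with made-up numbers (the coefficient and standard errors are assumptions, chosen so the first p-value lands near the 0.3 in the question).

```python
import math


def wald_p_value(coef, std_err):
    """Two-sided p-value for H0: coef = 0, using a normal approximation."""
    z = coef / std_err
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical numbers: the same estimated effect, measured with different precision.
coef = -1.5
p_noisy = wald_p_value(coef, std_err=1.45)    # e.g. few respondents mention waiting time
p_precise = wald_p_value(coef, std_err=0.40)  # e.g. many respondents mention it

print(f"noisy estimate:   p = {p_noisy:.3f}")
print(f"precise estimate: p = {p_precise:.5f}")
```

In words a customer might accept: the data suggest waiting time hurts the score a lot, but there are too few or too variable responses to rule out that the pattern arose by chance.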

Pictures and results here

submitted by /u/dariusnailedit

From my experience, few statisticians choose to go the social research route, or work in something social science related. Why is that the case?

Is it because the pay is subpar compared to tech, pharma, or finance? I personally find social research extremely interesting, and I also think my services would be more impactful there because the statistics knowledge of social scientists tends to be even less developed than in the areas above.

But no statistician I know works in that area, and I read similar things online. Is it because of the pay?

submitted by /u/Asosas

Using a Machine Learning Model in a Web Application Client

Current trends seem to indicate that software engineers will increasingly be asked to apply machine learning models to production software. While the development of the models remains with Data Scientists, trained models are often tossed over to Software Engineers. This creates a set of challenges for Software Engineers. Consider a situation where a model needs to be applied to a Web Application. Models are often delivered in Python while the client side of the Web Application runs on JavaScript. How can Software Engineers apply a Python model on the Web in JavaScript? How could they minimise the data security footprint? Could the model operate offline?

Luckily, it is becoming easier to apply machine learning models on the Web. Libraries such as TensorFlow.js enable the use of models with JavaScript. Additionally, packaging the model with the Web Application client allows the model to operate offline. This also means data does not need to leave the user’s machine for predictions to be made. This is a big data security win 🙂

submitted by /u/whitezl0

How do I find correlation between categorical features and target variable?

I have a linear regression model. I have done one-hot encoding for the nominal categorical features and used the pandas category type for the ordinal features. So now, how do I measure the correlation between these features and the target variable, so I can see whether there is any relationship?
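For a nominal feature and a numeric target, one option is the correlation ratio (eta squared): the share of the target's variance explained by category membership. The sketch below is plain Python with hypothetical data and names; for ordinal features, a rank correlation such as Spearman's on the ordered codes is another option.

```python
from collections import defaultdict


def correlation_ratio(categories, values):
    """Eta squared: share of the target's variance explained by category membership."""
    groups = defaultdict(list)
    for cat, val in zip(categories, values):
        groups[cat].append(val)
    grand_mean = sum(values) / len(values)
    # Between-group sum of squares: group size times squared distance of group mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    return ss_between / ss_total if ss_total else 0.0


# Hypothetical feature and target, for illustration only.
city = ["A", "A", "B", "B", "C", "C"]
price = [10.0, 12.0, 20.0, 22.0, 30.0, 32.0]
eta_sq = correlation_ratio(city, price)
print(f"eta squared = {eta_sq:.3f}")
```

Values near 1 mean the categories separate the target well; values near 0 mean the group means are almost identical.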

submitted by /u/Ghjjj4433

Proper repeat statement for SAS PROC genmod

I am trying to implement Poisson regression with log link and with robust error variance for survey data.

Here is working code for non-survey data that I tested; it works as intended:

proc genmod data = eyestudy;
  class carrot id;
  model lenses = carrot / dist = poisson link = log;
  repeated subject = id / type = unstr;
  estimate 'Beta' carrot 1 -1 / exp;
run;

The code above and more information about Poisson regression with log link and robust error variance, but for non-survey data, is here:

Below is an example of how to use PROC genmod for survey analysis (but with dist=binomial link=identity and, I think, without robust error variance):

proc genmod data=nis10;
  class seqnumt estiapt10;
  model r_tet_not_utd = / dist=binomial link=identity;
  weight provwt;
  repeated subject=seqnumt(estiapt10);
  where sex = 2;
run;

Here the strata variable is estiapt10, the cluster variable is seqnumt, and the weight variable is provwt.

The code above and more information about survey data analysis are here:

My strata variable is CSTRATM, my cluster variable is CPSUM, and my weight variable is PATWT. The dependent variable is DIETNUTR and the independent variable is age_group_var. My data is in sas_stata. So I tried this code:

proc genmod data=sas_stata;
  class age_group_var id CPSUM CSTRATM;
  model DIETNUTR = age_group_var / dist = poisson link = log;
  weight PATWT;
  repeated subject = id / type = unstr;
  repeated subject = CPSUM(CSTRATM);
  estimate 'Beta' age_group_var 1 -1 / exp;
run;

but it gave me a warning:

WARNING: Only the last REPEATED statement is used. 

As I understand it after reading the articles above and some other material, I am doing everything right except for the repeated statement. For Poisson regression with log link and robust error variance on survey data, I assume there should be some combination of the two repeated statements in my code above. I tried several ways of combining those repeated statements, but without any luck.

So my question is: What is the code for Poisson regression with log link and with robust error variance for survey data?

submitted by /u/vasili111

STATA not detecting multicollinearity?

I have a study in which I investigate seizures vs a lot of variables, two of which are presence of headache and age.

Both have been found to correlate significantly with seizures.

However, when I run a regression of these two variables against each other (headache vs. age), I find a statistically significant relationship as well: headaches are significantly more common the lower the age.

Shouldn’t this be detected as multicollinearity by STATA? It seems to me that age itself has nothing to do with seizures, and that what matters is whether people have a headache or not (or the other way around, or younger people have a higher risk of headache, whatever it is).
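A note on what software checks: Stata drops a predictor automatically only when it is perfectly collinear with the others. A statistically significant but moderate association between two predictors is usually judged with a diagnostic such as the variance inflation factor (VIF), for which rule-of-thumb thresholds of about 5 or 10 are often quoted. With exactly two predictors, VIF reduces to 1 / (1 − r²). The sketch below uses made-up data, not the study's.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical data: headaches (1 = present) are more common at younger ages.
age = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
headache = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]

r = pearson_r(age, headache)
vif = vif_two_predictors(age, headache)
print(f"r = {r:.3f}, VIF = {vif:.3f}")
```

Here the two predictors are clearly related, yet the VIF stays far below the usual warning levels, so both can remain in the same model and each is then adjusted for the other.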

submitted by /u/OssToYouGoodSir

Test Two Means hypothesis (updated)

I tried to make a post about this before and realized I didn’t provide enough information. Sorry about that! Here’s the updated version. I’m trying to complete my homework and am a bit ahead of the class on this in doing so. I can wait until the teacher goes over it in class but I’m kinda antsy at the same time. We’re relying on StatCrunch only in this class, but that program isn’t giving me the correct answer for one of the parts I need and I can’t understand what I need to do. My mind went blank on some steps from previous lessons.

A study was done on proctored and nonproctored tests. The results are shown in the table. Assume that the two samples are independent simple random samples selected from normally distributed populations, and do not assume that the population standard deviations are equal. Complete parts (a) and (b) below. Use a 0.01 significance level for both parts.

Group 1: n = 31 x̅ = 75.44 s = 10.05

Group 2: n = 35 x̅ = 81.69 s = 20.03

a. Test the claim that students taking nonproctored tests get a higher mean score than those taking proctored tests.

(This one I got correct.) H0: μ1 = μ2, H1: μ1 < μ2

The test statistic, t, is -1.64. (Round to two decimals as needed. I got this one correct.)

The P-value is .053 (Round to three decimals as needed. Also correct.)

Fail to reject H0: there is insufficient evidence to support the claim.

Now I am told to construct a confidence interval for testing the claim. I have seen the example formulas, I have used a variety of online calculators, and I have had my math whiz (but not stats whiz) boyfriend learn this on the spot to help piece it together. The calculators all give me different answers. I was able to do this for the first round of hypothesis testing with only one sample, but I’m having real trouble with the two-sample data.

I don’t need to be given the answer, just the baby steps on how to get there. =) Thanks guys! Sorry for the confusion of the previous post!
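The steps for an unequal-variance (Welch) interval can be sketched as follows: compute the difference in means, the standard error, and the degrees of freedom, then look up a critical t value and add and subtract the margin of error. Many textbooks pair a one-tailed test at α = 0.01 with a 98% interval (0.01 in each tail). The function names are hypothetical, and the critical value below is an assumed, approximate table value; substitute whatever your table or software gives. Because the summary statistics are rounded, intermediate results may differ slightly from StatCrunch's, and some textbooks use the smaller of n1 − 1 and n2 − 1 as the degrees of freedom instead.

```python
import math


def welch_se_and_df(n1, s1, n2, s2):
    """Standard error of (mean1 - mean2) and Welch-Satterthwaite degrees of freedom."""
    v1 = s1 ** 2 / n1
    v2 = s2 ** 2 / n2
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return se, df


def confidence_interval(mean1, mean2, se, t_crit):
    """Interval for mu1 - mu2: (difference - margin, difference + margin)."""
    diff = mean1 - mean2
    margin = t_crit * se
    return diff - margin, diff + margin


# Step 1: standard error and degrees of freedom from the summary statistics.
se, df = welch_se_and_df(n1=31, s1=10.05, n2=35, s2=20.03)

# Step 2: look up the critical t for that df with 0.01 in each tail.
t_crit = 2.40  # ASSUMED approximate value; replace with your own lookup

# Step 3: difference in means plus or minus the margin of error.
lower, upper = confidence_interval(75.44, 81.69, se, t_crit)
print(f"se = {se:.4f}, df = {df:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```

If the interval contains 0, that agrees with failing to reject H0 in part (a).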

submitted by /u/NomaNomNom

Which base do I use if I find significant results in one base of a categorical variable during regression, but not others?

I have a categorical variable with say 5 levels/categories. I am running a logistic regression with seizure no/yes as the dependent variable.

Question 1: In my categorical variable, I get significant results when using SOME categories as the base but not others. The categories are primary cancer types, so no one category seems a better fit than the others as a base. Which one should I use?

Question 2: What if I had an ordered categorical variable such as “tumor size” with the smallest size as the first level and the largest as the fifth level? Is there a “rule” that the first level (smallest size) should be the base here?

Question 3: What if I have a categorical variable in which SEVERAL options may apply, such as “tumor location” where the tumor can be in your liver or your lungs, but it might also be in BOTH. What type of variable do I use here? I am aware that I could dummy code them into 0/1 variables but I’ve heard that I need to be careful with this as technically the tumor locations are a category and should be viewed as such.

Question 4: How would I present this type of data? Say I find that 1 cancer type used as a base gives significantly lower risk for all other bases (cancer types). Do I have to run the same thing for every base and then describe how each base has a higher or lower risk than the other?

You don’t have to answer all questions, all help is appreciated, thank you!
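On Questions 1 and 4, it may help to see that changing the base does not change the fitted model, only which pairwise comparisons are printed. With a single categorical predictor, each logistic regression coefficient equals the log odds ratio of that category against the base, so the comparisons can be computed directly from counts. The counts and names below are made up for illustration.

```python
import math

# Hypothetical counts (seizure yes, seizure no) per primary cancer type.
counts = {"lung": (30, 70), "breast": (10, 90), "colon": (20, 80)}


def log_odds(yes, no):
    """Log odds of seizure within one category."""
    return math.log(yes / no)


def contrasts(base):
    """Log odds ratio of every other category versus the chosen base."""
    base_value = log_odds(*counts[base])
    return {
        cat: log_odds(*c) - base_value for cat, c in counts.items() if cat != base
    }


vs_lung = contrasts("lung")
vs_breast = contrasts("breast")

# The colon-vs-breast comparison is the same whichever base was used to get it.
colon_vs_breast_via_lung = vs_lung["colon"] - vs_lung["breast"]
print(vs_lung)
print(vs_breast)
print(colon_vs_breast_via_lung, vs_breast["colon"])
```

Because every pairwise difference can be recovered from any base, a common approach is to report one overall test for the variable (which does not depend on the base) and then the specific pairwise comparisons of interest.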

submitted by /u/OssToYouGoodSir