What does it mean if my p-value shoots up when I use a +1 in my logarithmic transformations?

And what's the difference between the p-value of a parameter estimate and the p-value for the variance of the regression model? When I run linear regressions I'm given both, but only the parameter p-value shoots up. I'm using StatCrunch.

Table 1 (Log +1 is used for both variables): https://gyazo.com/fdf3db316fe8c54f36276bae8305a686

Table 2 (Log + 1 is only used for the x variable): https://gyazo.com/23ce00eb8eafb52e8b19f5510451b63d

submitted by /u/T00Human

Is this correctly written academically?

“If a correlation coefficient was ≥0.2 but <0.4, the correlation was regarded weak. If it was ≥ 0.4 but < 0.7, the correlation was regarded moderate. If it was ≥ 0.7 but <0.9, the correlation was regarded strong.”?

submitted by /u/miliseconds

I’m running many regressions each with 4 variables. The t-scores associated with every variable often seem to add up to zero. Is this a bug?

EDIT: This was resolved by mean-centering Var2. The takeaway: mean-center your variables when running interaction analyses.


In my dataset, I measured a dependent variable 3,000 times for each subject (time series data). For each time point, I then ran a regression with predictors Var1, Var2, Var1*Var2 (an interaction), and an intercept, and plotted the t-values of the regression beta coefficients for each predictor. The results can be found here: https://imgur.com/a/CBabOIV. Strangely, the t-values often sum to 0 and appear symmetric about y = 0. This mostly occurs between time = 1500-3000, where we would expect the weakest effect. Are t-values like this expected if the data is all noise (as it might be for that time range)? I did some sanity checks, and I don't think there are any bugs in my code.

If this is not a bug, why would this occur?


Also, I know that running regressions at 3,000 time points raises a multiple-comparisons problem. We are accounting for that, but at present we just want to rule out a bug in the analysis (and to understand these results).

submitted by /u/FireBoop

I am very confused about when it is okay to use Log-Log transformations and what exactly that means FOR MY SPECIFIC DATA SET.

In my e-marketing data set, I am looking to see how good a predictor Goal Conversion Rate is of Revenue.

Now, Goal Conversion Rate is given as a percentage, but it’s not really a proportion because it can go beyond 100 (some of my entries are at 500), so we can treat it as a count.

The difficulty is that there are a LOT of goal conversion values clustered around 0 (its distribution is really wack), so I've transformed it as Log2(1 + Goal Conversion Rate). Otherwise there is very little to glean, because I end up producing regression graphs that look like this: https://gyazo.com/4ee3489409c8099bc6b5c0b1f75e181c

However, if I only use Log2(Revenue), I get a negative correlation, which is no good.

So is it okay for me to use Log2(1 + Revenue)? The correlation I achieve with this expression is much higher: 0.77, as opposed to 0.26 when the +1 is not added to Revenue. Am I introducing some form of redundancy here? Should I look at the standard error and use that figure to decide which combination of transformations to use? For reference, here is what my regression graph looks like when both variables are transformed with +1: https://gyazo.com/0dd762014adb1eb010a001ed36add705 There are 400 zero values here for Log2(1 + Goal Conversion Rate).

If someone could better inform me as to what exactly these transformations do, especially when the +1 is added, and what results that creates, I would highly appreciate it. I'm just a bit confused, and working through my confusion has not been productive.

Miscellaneous info about my goals just to give a better sense of what I’m doing: I have hypothesized that Goal conversion correlates very poorly (below 0.15 or negative) with the sources that are not a significant source of income, but exhibits stronger correlation coefficients (0.30 and above) with Google/Organic, Gdeals, Googleplex, and Google Sites. Here is the data set for reference: https://docs.google.com/spreadsheets/d/1pIapZXgaScU44SFhwOaBcM1BBURvTB4p6wjSre5OP2Q/edit?usp=sharing
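One mechanical point the question turns on can be shown with made-up numbers (not the poster's spreadsheet): log2(0) is undefined, so zero-revenue rows either get dropped or become -inf, whereas log2(1 + x) maps 0 to 0 and keeps every row. The two correlations are therefore computed on different subsets of the data.

```python
# Illustration of log2(x) vs. log2(1 + x) on data with many exact zeros.
import numpy as np

rng = np.random.default_rng(2)
conv = rng.exponential(scale=30.0, size=500)       # skewed "Goal Conversion Rate"
revenue = 0.5 * conv * rng.lognormal(sigma=1.0, size=500)
revenue[rng.random(500) < 0.4] = 0.0               # many zero-revenue rows

x = np.log2(1 + conv)

# Version 1: log2(Revenue) -- only valid on the strictly positive rows
pos = revenue > 0
r_drop = np.corrcoef(x[pos], np.log2(revenue[pos]))[0, 1]

# Version 2: log2(1 + Revenue) -- all rows kept, zeros map to 0
r_keep = np.corrcoef(x, np.log2(1 + revenue))[0, 1]

print(f"log2(Revenue), zeros dropped: r = {r_drop:.2f}")
print(f"log2(1 + Revenue), all rows:  r = {r_keep:.2f}")
```

Because the subsets differ, the 0.26 and 0.77 figures are not directly comparable; the +1 version answers a question that includes the zero-revenue sources, which may or may not be the question you want to ask.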

submitted by /u/T00Human

Robust Gaussian fitter

I'm trying to fit a local Gaussian to a given empirical distribution, focused on the local mode. I know there are multiple modes, but I don't know how many. I just want the Gaussian that best fits the data around a given observation.

Let's say the true density is 0.8·N(0,2) + 0.2·N(5,2); given 0 as input, the algorithm should output 0.8·N(0,2).
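One possible sketch (iterative trimming; the function and parameter names here are made up, not a standard routine): repeatedly fit a Gaussian to the points within a few standard deviations of the current centre, correcting the standard deviation for truncation so the estimate is consistent on a pure Gaussian.

```python
# Trimmed local Gaussian fit: locks onto the component nearest the query
# point when the mixture's modes are well separated.
import math
import numpy as np

def local_gaussian(data, x0, width=2.0, n_iter=50):
    data = np.asarray(data, dtype=float)
    phi = math.exp(-width**2 / 2) / math.sqrt(2 * math.pi)  # N(0,1) pdf at width
    mass = math.erf(width / math.sqrt(2))                   # P(|Z| < width)
    debias = math.sqrt(1 - 2 * width * phi / mass)          # std shrink from trimming
    mu, sigma = x0, data.std() / 2                          # crude initial scale
    for _ in range(n_iter):
        window = data[np.abs(data - mu) < width * sigma]
        if len(window) < 2:
            break
        mu, sigma = window.mean(), window.std() / debias
    weight = len(window) / (mass * len(data))               # component's share of points
    return weight, mu, sigma

rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(0, 1, 8000), rng.normal(8, 1, 2000)])
print(local_gaussian(data, 0.0))   # close to (0.8, 0.0, 1.0)
```

Caution: for the heavily overlapping mixture in the question (modes at 0 and 5 with sigma = 2), a trimmed fit like this is biased by the neighbouring component; there, an EM fit such as sklearn's GaussianMixture, keeping the component nearest the query point, is likely more reliable.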

submitted by /u/hmiemad

Sampling methods for over-representing outliers?

Disclaimer: I’m not a statistician, just a data engineer.

In the context of a 'big data' project, I'm building small samples of large datasets in which I try to select 'special cases'. The rationale is that I can quickly test my data-processing code against such samples, so I want to be confident that rare 'pathological cases' in the data will be represented in them.

The way I do this is that I have a function that emits a list of 'features' for each data entry (e.g. the list of keys in a JSON document), and I make rare features likely to be selected, hoping that pathological cases are correlated with rare features. E.g. I set the probability of an entry being included in the sample to 10/M, where M is the number of entries having the rarest feature of that entry (so that the expected number of selected entries having a given feature is at least 10), and I hope that the resulting sample will be of reasonable cardinality.

Is there a scientific name for such sampling techniques? Can you recommend material about them?
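For what it's worth, independently deciding each entry's inclusion with its own probability is usually called Poisson sampling (unequal-probability sampling). A toy sketch of the scheme described above, with names of my own invention:

```python
# Poisson sampling with per-entry inclusion probability 10 / M, where M is the
# frequency of the entry's rarest feature: rare-feature entries are almost
# surely kept, common ones are heavily downsampled.
import random
from collections import Counter

def rare_feature_sample(entries, features_of, target=10, seed=0):
    rng = random.Random(seed)
    counts = Counter(f for e in entries for f in features_of(e))
    sample = []
    for e in entries:
        m = min(counts[f] for f in features_of(e))   # frequency of rarest feature
        if rng.random() < min(1.0, target / m):
            sample.append(e)
    return sample

# Toy data: JSON-like dicts whose feature list is their set of keys.
entries = [{"a": 1, "b": 2} for _ in range(10_000)] + [{"a": 1, "weird_key": 3}] * 5
sample = rare_feature_sample(entries, lambda e: list(e.keys()))
print(len(sample))   # ~10 common entries plus all 5 rare ones
```

Here the 5 entries carrying the rare key get inclusion probability 1, while the 10,000 common entries each get probability 10/10,000.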

submitted by /u/vvvvalvalval

Item Response Theory Question

Our research team developed a new measure to test a latent variable in a certain population. We collected data on the same participants over four time points. I am currently running a graded response model to test how well the items perform, with the classical test theory and IRT analyses run separately for each time point.

Is there a method to combine all four data points for each participant on the measurement instrument and run the graded response model?

submitted by /u/SubstancelessPsyche

Student: Choice of statistical test in paired samples

I've been struggling to pick the correct statistical test for a set of data results. I have narrowed it down to the paired Student's t-test and the Wilcoxon signed-rank test. I understand that the t-test is mostly used for parametric data and the Wilcoxon for non-parametric data; however, I am struggling to determine whether the data is normally distributed, which will ultimately decide which method to use.

The sample size is also small, so might the Wilcoxon test be the better option for that reason too? Any pointers on where to look, or guidance in the right direction, would be greatly appreciated. The question is below:

In a clinical study, the effects of two new novel antihypertensive agents (Drug A and Drug B) on the lowering of the diastolic blood pressure have been compared. The results are shown in the table below.

Is there a difference between the antihypertensive effects of the two therapeutic agents?

Lowering of Diastolic Blood Pressure (mm Hg)

Subject   Drug A   Drug B
1         12.5     13.5
2         13.0     14.0
3         12.5     12.5
4         17.5     12.0
5         13.5     13.5
6         12.5     14.0
7         17.5     11.5
8         22.5     12.5
9         13.5     13.0
10        12.5     13.0
11        12.0     13.5
12        13.5     14.0
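One way to make the call in practice, sketched with scipy (not part of the original question): test the paired differences for normality, then run both tests on the table's data and compare.

```python
# Normality check on the paired differences, then both candidate tests.
from scipy import stats

drug_a = [12.5, 13.0, 12.5, 17.5, 13.5, 12.5, 17.5, 22.5, 13.5, 12.5, 12.0, 13.5]
drug_b = [13.5, 14.0, 12.5, 12.0, 13.5, 14.0, 11.5, 12.5, 13.0, 13.0, 13.5, 14.0]
diffs = [a - b for a, b in zip(drug_a, drug_b)]

print(stats.shapiro(diffs))             # normality of the differences
print(stats.ttest_rel(drug_a, drug_b))  # paired Student's t-test
print(stats.wilcoxon(drug_a, drug_b))   # Wilcoxon signed-rank test
```

If Shapiro rejects normality of the differences (plausible here, given the large positive differences for subjects 4, 7 and 8), the Wilcoxon test is the safer choice; note that scipy's `wilcoxon` drops the zero differences (subjects 3 and 5) under its default zero_method.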

submitted by /u/Hab92