In my e-marketing data set, I want to see how good a predictor Goal Conversion Rate is for Revenue.
Now, Goal Conversion Rate is given as a percentage, but it’s not really a proportion because it can go beyond 100 (some of my entries are at 500), so we can treat it as a count.
The difficulty is that a LOT of the goal conversion rates are clustered around 0 (the distribution is heavily skewed), so I’ve log-transformed the variable as Log2(1 + Goal Conversion Rate). Otherwise, there is very little to glean, because I end up producing regression graphs that look like this: https://gyazo.com/4ee3489409c8099bc6b5c0b1f75e181c
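To make the transformation concrete, here is a minimal Python sketch of what Log2(1 + x) does to a few values. The sample rates are made up for illustration and are not taken from my data set, and `log2_plus_one` is just a hypothetical helper name.

```python
import math

def log2_plus_one(x):
    """Return log2(1 + x). Defined for x >= 0 and maps 0 to 0."""
    return math.log2(1 + x)

# Hypothetical goal conversion rates in percent (NOT from my data set)
rates = [0, 1, 3, 100, 500]

# Zeros stay at 0, while large values are compressed toward the rest
transformed = [log2_plus_one(r) for r in rates]

for r, t in zip(rates, transformed):
    print(f"{r:>4} -> {t:.3f}")
```

The point of the + 1 is that a rate of 0 maps to 0 instead of being undefined, while a value like 500 is pulled in to just under 9.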
However, if I only use Log2(Revenue), I get a negative correlation, which is no good.
So is it okay for me to use Log2(1 + Revenue)? The correlation I get with this expression is much higher: 0.77, as opposed to 0.26 when the + 1 is not added to Revenue. Am I introducing some kind of redundancy here? Should I look at the standard error and use that figure to decide which combination of transformations to use? For reference, here is what my regression graph looks like when both variables are transformed with the + 1: https://gyazo.com/0dd762014adb1eb010a001ed36add705 There are 400 zero values here for Log2(1 + Goal Conversion Rate).
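Here is a rough Python sketch of the comparison I am describing. It assumes that rows with zero Revenue have to be dropped when using Log2(Revenue), since log2(0) is undefined, so the two correlations may end up being computed on different sets of rows. The (rate, revenue) pairs are made up and are not my actual data, and `pearson` is a hand-written helper rather than anything from my spreadsheet.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up (goal conversion rate, revenue) pairs, NOT my real data
rows = [(0, 0), (0, 0), (0, 5), (2, 0), (5, 20),
        (10, 40), (50, 300), (500, 1000)]

# Variant A: log2(1 + x) on both variables, so every row is kept
xa = [math.log2(1 + r) for r, _ in rows]
ya = [math.log2(1 + v) for _, v in rows]

# Variant B: log2(Revenue) is undefined at 0, so (by assumption)
# the zero-revenue rows are dropped before transforming
kept = [(r, v) for r, v in rows if v > 0]
xb = [math.log2(1 + r) for r, _ in kept]
yb = [math.log2(v) for _, v in kept]

r_a = pearson(xa, ya)
r_b = pearson(xb, yb)
print(f"log2(1 + Revenue): r = {r_a:.2f} on {len(rows)} rows")
print(f"log2(Revenue):     r = {r_b:.2f} on {len(kept)} rows")
```

If that assumption holds for my data, then part of the gap between the two correlations could simply come from which rows are included, not only from the + 1 itself.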
If someone could explain what exactly these transformations do, especially when + 1 is added, and what results that produces, I would highly appreciate it. I’m a bit confused, and working through my confusion on my own has not been productive.
Miscellaneous info about my goals just to give a better sense of what I’m doing: I have hypothesized that Goal conversion correlates very poorly (below 0.15 or negative) with the sources that are not a significant source of income, but exhibits stronger correlation coefficients (0.30 and above) with Google/Organic, Gdeals, Googleplex, and Google Sites. Here is the data set for reference: https://docs.google.com/spreadsheets/d/1pIapZXgaScU44SFhwOaBcM1BBURvTB4p6wjSre5OP2Q/edit?usp=sharing