I have a dataset with the following columns: month_of_year, is_cloudy, temperature.
I want to test whether there’s a significant difference in temperature between cloudy and non-cloudy days. However, my data points aren’t evenly distributed, and I want to make sure that both groups (cloudy and non-cloudy) have the same proportions of month_of_year values. Because it is rarely cloudy where I am, I have many more data points for non-cloudy days.
How would I go about preparing the data for this test? (I’m using Python.)
My plan was to do the following:
- Get dummy variables for the months of the year.
- Compute the proportion of cloudy days that fall in each month (i.e. the mean of each month dummy variable).
- Sample the non-cloudy dataset to get the same proportions of the dummy variables.
- Run a z-test on the two datasets to see if the difference in temperature is significant.
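
For concreteness, here is a minimal sketch of that plan in pandas. It rests on several assumptions that are not part of my real data: the synthetic DataFrame is made up purely so the code runs, is_cloudy is treated as a boolean (or 0/1) flag, and the helper names match_month_proportions and two_sample_z_test are hypothetical. It also groups on month_of_year directly instead of building dummy variables, which should give the same proportions, and it computes the two-sample z statistic by hand rather than calling a library routine.

```python
import math

import numpy as np
import pandas as pd


def match_month_proportions(df, seed=0):
    """Subsample non-cloudy rows so their month mix equals that of the cloudy rows."""
    is_cloudy = df["is_cloudy"].astype(bool)  # assumption: boolean or 0/1 flag
    cloudy = df[is_cloudy]
    clear = df[~is_cloudy]

    cloudy_counts = cloudy["month_of_year"].value_counts()
    clear_counts = (
        clear["month_of_year"].value_counts().reindex(cloudy_counts.index, fill_value=0)
    )

    # Largest whole multiple k of the cloudy per-month counts that every month can supply.
    k = int((clear_counts // cloudy_counts).min())
    if k < 1:
        raise ValueError("At least one month has fewer non-cloudy than cloudy rows.")

    # Draw k * (cloudy count) non-cloudy rows per month, without replacement.
    parts = [
        clear[clear["month_of_year"] == month].sample(
            n=k * int(n_cloudy), random_state=seed
        )
        for month, n_cloudy in cloudy_counts.items()
    ]
    return cloudy, pd.concat(parts)


def two_sample_z_test(a, b):
    """Two-sided z-test for a difference in means, using sample variances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    se = math.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    z = (a.mean() - b.mean()) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p


# Synthetic stand-in data (made up for illustration only).
rng = np.random.default_rng(0)
n = 5000
month = rng.integers(1, 13, size=n)
cloudy_flag = rng.random(n) < 0.1  # cloudy days are rare
temperature = (
    15 + 10 * np.sin((month - 4) * np.pi / 6) - 2 * cloudy_flag + rng.normal(0, 3, n)
)
df = pd.DataFrame(
    {"month_of_year": month, "is_cloudy": cloudy_flag, "temperature": temperature}
)

cloudy, clear_matched = match_month_proportions(df)
z, p = two_sample_z_test(cloudy["temperature"], clear_matched["temperature"])
print(f"z = {z:.2f}, p = {p:.3g}")
```

Sampling a whole multiple of the cloudy count from each month keeps the month proportions exactly equal in the two groups, at the cost of discarding some non-cloudy rows.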