Word Vectors with Tidy Data Principles

Julia Silge wrote an awesome blog post a few months back about creating something similar to word2vec, breaking it down into easy-to-follow steps.

The results are interesting and could be expanded following her workflow (perhaps in another /r/statistics post?).

Link: https://juliasilge.com/blog/tidy-word-vectors/
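The linked post is not reproduced here, so what follows is only a sketch of one standard count-based building block for word vectors, pointwise mutual information (PMI), on a toy corpus. Whether it matches the post's steps is an assumption. The corpus, the document-level co-occurrence rule, and the function name `pmi` are all made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus (hypothetical): each inner list is one short "document".
docs = [["cat", "dog"], ["cat", "dog"], ["cat", "fish"], ["bird", "fish"]]

word_counts = Counter()
pair_counts = Counter()
for doc in docs:
    words = sorted(set(doc))          # count each word once per document
    word_counts.update(words)
    pair_counts.update(combinations(words, 2))


def pmi(a, b):
    """log of p(a, b) / (p(a) * p(b)), with probabilities as document shares."""
    n = len(docs)
    p_ab = pair_counts[tuple(sorted((a, b)))] / n
    if p_ab == 0:
        return float("-inf")          # the pair never co-occurs
    return math.log(p_ab / ((word_counts[a] / n) * (word_counts[b] / n)))


print(pmi("cat", "dog"))   # positive: the pair co-occurs more than chance
print(pmi("cat", "fish"))  # negative: the pair co-occurs less than chance
```

A matrix of such PMI values can then be reduced to low-dimensional vectors with a matrix factorization, which is the step this sketch leaves out.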

submitted by /u/boshiby

What are the most commonly used heuristics when you have to still estimate probability but have zero degrees of freedom?

Hopefully this doesn’t read like /r/woahdude, but I’m curious about the Fermi paradox. I’ve read about a heuristic that states when you have a sample of one you should assume that it is the most common example of the population. What other heuristics apply, and what guesses using these heuristics can we make about alien life using us as our single data point?

submitted by /u/flylib_capitalism

Regression: IV-DV non-significant but moderation highly significant

I am researching the effect of an innovation type on firm performance. In my quantile regression model, the relationship between innovation and firm performance is non-significant (p = 0.7). However, the moderator, firm size, is highly significant (p < 0.001). Is this possible? And how should this be interpreted? I don’t get how firm size could influence the relationship between innovation and firm performance if there is initially no relationship… Thanks!
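A toy illustration of how this pattern can arise: if the effect of innovation has opposite signs for small and large firms, the two effects can cancel in the pooled data while the interaction is strong. The numbers below are invented, and the sketch uses ordinary least-squares slopes rather than the quantile regression mentioned in the question.

```python
def slope(x, y):
    """Least-squares slope of y on x (simple regression)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Hypothetical, centered data: -1 = low, 1 = high.
innovation = [-1, 1, -1, 1]
firm_size = [-1, -1, 1, 1]
# Performance depends ONLY on the product (a pure crossover interaction).
performance = [i * s for i, s in zip(innovation, firm_size)]


def slope_within(size):
    """Slope of performance on innovation among firms of one size."""
    rows = [(i, p) for i, s, p in zip(innovation, firm_size, performance) if s == size]
    return slope([r[0] for r in rows], [r[1] for r in rows])


overall = slope(innovation, performance)
small = slope_within(-1)
large = slope_within(1)
print(overall, small, large)  # 0.0 -1.0 1.0
```

Here the pooled slope is zero even though innovation matters in both subgroups, so a non-significant main effect alongside a significant moderation is not a contradiction.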

submitted by /u/Tinselmob

What statistical test am I thinking of?

I’m not sure if there is a test but I feel like there should be: Say I have a survey with 4 categorical options for favourite food group (meat, milk, veggies, bread). I survey 1000 people and the response counts come back as follows:

  • Meat 480
  • Veggies 450
  • Milk 40
  • Bread 30

How do I test the hypothesis that the preference for meat is actually higher than veggies in the population?
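One common way to frame this, offered as a sketch rather than the definitive answer: under the null hypothesis that meat and veggies are equally popular, each respondent who picked one of the two is equally likely to have picked either, so the meat count among those respondents is Binomial(n_meat + n_veg, 0.5). The code below uses the normal approximation to that binomial; the function name is made up, and an exact binomial test would be a reasonable alternative.

```python
import math


def meat_vs_veggies(n_meat, n_veg):
    """One-sided z-test of H0: p_meat = p_veg against H1: p_meat > p_veg.

    Uses the normal approximation: under H0 the difference in counts has
    mean 0 and variance approximately n_meat + n_veg.
    """
    z = (n_meat - n_veg) / math.sqrt(n_meat + n_veg)
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)
    return z, p_one_sided


z, p = meat_vs_veggies(480, 450)
print(round(z, 3), round(p, 3))
```

With these counts the z statistic is a little under 1, so the survey alone would not support the claim that meat is preferred over veggies in the population at conventional significance levels.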

submitted by /u/0876

ANOVA or t-test?

Originally, I had planned on running an ANOVA. However, after a test run of my test and reading about ANOVA versus t-test, I thought that I should be using a t-test because my independent variable has only two groups. I just finished a meeting with a member of my thesis committee to finalize my data plan before I start running tests on SPSS. He first seemed to think I should use an ANOVA. I explained why I thought I should use a t-test and he repeated that I should use an ANOVA because of the three groups. I told him that I didn’t understand and we talked through it at which point he told me to use a t-test.

Can I get some feedback from you all? I am currently reading a textbook about ANOVA and t-tests because I am concerned that there is something I don’t understand about why he first told me to use an ANOVA.

I am looking at test scores from 2nd to 10th grade in three subject areas. The independent variable is whether students were enrolled in a specific program that lasted from K – 4th grade. So the two groups are Enrolled and Not Enrolled. I am then looking at their test scores to see if enrollment in the program is correlated with higher, equal, or lower scores. I am NOT comparing the scores between grades or between subject areas.
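For what it is worth, with exactly two groups the two procedures coincide: the one-way ANOVA F statistic equals the square of the pooled-variance (Student) t statistic, so they give the same p-value. A minimal sketch with invented scores, assuming the equal-variance t-test (Welch's t-test would not match exactly):

```python
from statistics import mean


def pooled_t(a, b):
    """Student's two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_within = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    pooled_var = ss_within / (na + nb - 2)
    return (ma - mb) / (pooled_var * (1 / na + 1 / nb)) ** 0.5


def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical test scores for one grade and one subject.
enrolled = [78, 85, 90, 72, 88]
not_enrolled = [70, 75, 80, 68, 74, 79]

t = pooled_t(enrolled, not_enrolled)
f = one_way_f([enrolled, not_enrolled])
print(t ** 2, f)  # the two values agree up to rounding error
```

An ANOVA only becomes necessary when the factor has three or more levels, or when several factors (for example grade or subject) enter the same model.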

submitted by /u/LalalalaLola1

Comparing two groups’ answers to same question


I’m doing my master’s thesis and I’m really not that good at statistics.

I have surveyed two groups, one from each country, asking them what’s important to them when buying certain products. They needed to rate different factors (price, quality, etc.) from 1 to 5. My hypothesis is that the marketing approach of the company needs to be different for each group.

How do I go about testing this? The mean values alone don’t help that much, as I need to find out which factors are most important to each group and then somehow show statistically whether the groups differ (or not).
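One possible approach, sketched under assumptions: compare the two groups factor by factor with a permutation test, which makes few distributional assumptions about 1 to 5 ratings. The ratings below are invented, the function name is hypothetical, and a rank-based test such as Mann-Whitney U would be another reasonable choice.

```python
import random


def permutation_p_value(a, b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in mean rating."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # reassign group labels at random
        pa, pb = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(pa) / len(pa) - sum(pb) / len(pb))
        if diff >= observed - 1e-12:             # tolerance for float rounding
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical "importance of price" ratings from the two countries.
price_country_a = [5, 5, 4, 5, 4, 5, 4, 5, 5, 4]
price_country_b = [2, 3, 2, 1, 3, 2, 2, 3, 1, 2]

p_price = permutation_p_value(price_country_a, price_country_b)
print(p_price)
```

Running this once per factor means several tests, so some adjustment for multiple comparisons (for example a Bonferroni correction) is worth considering.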


submitted by /u/reljic

forecasting using price

Hi All

I need some help:

I’m trying to forecast sales for some electronics products, and I have the following information:

  1. Historical sales
  2. Historical prices
  3. Historical prices for other products in the category

I can provide the future prices of the product as well as future prices of the other products in the category too.

Can you please suggest what the easiest way is for me to forecast for this? I only have excel as a tool. I know there are other tools such as R and the like but I am short on time.

It will most likely require some kind of multiple regression, but considering my lack of command of stats, I am struggling to implement this.

You would see that I have mentioned prices of other products in the category, and that is because if a normally expensive product is slightly reduced in price, it will impact the sales of the current product as the value proposition is not the same for the customer, so it is important to look at all products prices as well.

Your help on this matter is very much appreciated.

If you reckon it would be impossible in Excel, what is the next best alternative that is easiest to implement?
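As a rough sketch of the multiple-regression idea outside Excel: fit sales on the product's own price and one competing product's price by ordinary least squares, then plug in the future prices. All figures are invented, the helper names are hypothetical, and the model ignores trend and seasonality, which real sales data would likely need.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def fit_ols(predictors, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via normal equations."""
    x = [[1.0] + list(row) for row in predictors]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Hypothetical history: (own price, competitor price) per period.
history = [(10, 20), (12, 20), (11, 18), (9, 19), (13, 21), (10, 17)]
# Sales invented to follow an exact linear rule so the fit is easy to check.
sales = [200 - 3 * own + 1.5 * other for own, other in history]

coefs = fit_ols(history, sales)
forecast = predict(coefs, (11, 19))  # planned future prices
print([round(c, 4) for c in coefs], round(forecast, 2))
```

In Excel itself, the LINEST function or the Regression tool in the Analysis ToolPak fits the same kind of model, with one column per price series.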

Thank you for reading.

submitted by /u/utopianaura

How to explain an overall effect in interaction model?

I fit the following model:

Y ~ a + b*age + c*treatment + d*(age*treatment) 

Treatment and age were quantitative variables that were centered at their sample means prior to fitting the model. So in this model, “c” is the effect of treatment at the average age in the sample. If d>0, then the treatment effect is higher at older ages than at younger ages. And if d<0, then the treatment effect is higher at younger ages than at older ages.

My hypothesis was that the treatment had an effect on Y and/or the treatment effect was heterogeneous with age. So I tested the null hypothesis c=d=0, which means there is no treatment effect at any age (using an F-test with 2 and n-4 degrees of freedom).

This F-test was significant. The estimates of c and d were each positive, which would seem to suggest that the treatment has a positive effect on Y at average age in the sample, and this effect is higher at older ages than younger ages.

Then, I also performed tests of c and d separately. Neither test was significant. So this means:

  1. I failed to reject the null hypothesis that the treatment has no effect at the average age in the sample (H0: c=0)

  2. I also failed to reject the null hypothesis that the treatment effect is the same across all age levels (H0: d=0).

  3. But I rejected the null hypothesis that the treatment has no effect at the average age (centered age zero) AND the treatment effect is the same across all age levels.

So, this result is tricky to interpret because the data provide evidence of a treatment effect. What can I say about this kind of result? I can claim that the data provide evidence that the treatment either had a main effect or had an effect that was heterogeneous with age, but I cannot point to a specific one.
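This combination is arithmetically possible, as a small sketch with invented numbers shows. It uses a large-sample Wald chi-square test in place of the F-test, and hypothetical estimates c = d = 1.8 with standard errors of 1 and zero covariance; none of these values come from the actual model.

```python
import math


def two_sided_p(z):
    """Two-sided normal p-value for a single coefficient."""
    return math.erfc(abs(z) / math.sqrt(2))


def joint_wald(c, d, var_c, var_d, cov_cd):
    """Wald statistic and p-value for H0: c = d = 0 (chi-square, 2 df)."""
    det = var_c * var_d - cov_cd ** 2
    wald = (c * c * var_d - 2 * c * d * cov_cd + d * d * var_c) / det
    p_value = math.exp(-wald / 2)  # chi-square survival function with 2 df
    return wald, p_value


# Hypothetical estimates and (co)variances.
c_hat, d_hat = 1.8, 1.8
var_c, var_d, cov_cd = 1.0, 1.0, 0.0

p_c = two_sided_p(c_hat / math.sqrt(var_c))
p_d = two_sided_p(d_hat / math.sqrt(var_d))
wald, p_joint = joint_wald(c_hat, d_hat, var_c, var_d, cov_cd)
print(round(p_c, 4), round(p_d, 4), round(wald, 2), round(p_joint, 4))
```

Each coefficient alone falls just short of the 5% threshold, but the joint statistic pools the evidence from both and crosses it, which is the situation described above.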

I plotted the total estimated treatment effect, along with 95% confidence bands, as a function of age. It is the line c+d*age on the vertical axis and age on the horizontal axis (with age=0 corresponding to the sample mean). Its slope d and y-intercept c are both positive:

  1. Among younger subjects, the confidence bands intersect with the x-axis implying the treatment was not significantly associated with outcome at those age levels.
  2. Among older subjects, the confidence bands did not intersect with the x-axis, implying the treatment was significantly associated with outcome at those age levels.
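For reference, the pointwise band around c+d*age follows from Var(c + d*age) = Var(c) + age^2 * Var(d) + 2 * age * Cov(c, d). The sketch below uses invented estimates and a normal critical value of 1.96; a t quantile with n-4 degrees of freedom would be the exact choice for the model above.

```python
import math


def treatment_effect_band(age_centered, c, d, var_c, var_d, cov_cd, crit=1.96):
    """Estimated treatment effect at a centered age with a pointwise band."""
    effect = c + d * age_centered
    se = math.sqrt(var_c + age_centered ** 2 * var_d + 2 * age_centered * cov_cd)
    return effect - crit * se, effect, effect + crit * se


# Hypothetical estimates: c = 0.5 (SE 0.3), d = 0.1 (SE 0.06), zero covariance.
params = dict(c=0.5, d=0.1, var_c=0.09, var_d=0.0036, cov_cd=0.0)

young = treatment_effect_band(-10, **params)  # 10 years below the mean age
older = treatment_effect_band(10, **params)   # 10 years above the mean age
print(young)
print(older)
```

With these numbers neither c nor d is individually significant, yet the band excludes zero at the older age and includes it at the younger one, matching the pattern in the plot.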

Any thoughts?

submitted by /u/Aemon12

AP Stats Project Ideas

So I am currently in AP Stats and I am required to do a final project. It can be an observational or experimental study about whatever I want. I have to make sure it includes the general population, parameter, and null and alternative hypotheses. I have to design the experiment, do data analysis, and do inference tests. So pretty much the basics of AP Stats, if you guys are familiar with the class.

The issue is I have no creative inspiration for what to do. I want to do something involving the way people think, but I am not really sure what would be interesting. Maybe something with perceptions, but honestly I am blanking for ideas. Any of you guys have any cool, reasonably simple things to study?

submitted by /u/Her_Royal_Bitch

Are information criteria still an active area of research? Are parsimonious models always better?

In my classes there is always some mention of AIC/BIC. Certainly model complexity is an important thing to penalize for in the sense that you don’t want to include variables that are unnecessary.
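For concreteness, the two criteria differ only in how hard they penalize each extra parameter: AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where k is the number of parameters, n the sample size, and L the maximized likelihood. The log-likelihoods below are invented to show a case where the two disagree.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion: penalty of 2 per parameter."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion: penalty of ln(n) per parameter."""
    return k * math.log(n) - 2 * log_likelihood


n = 100
# Hypothetical fits: (maximized log-likelihood, number of parameters).
simple = (-120.0, 3)
complex_model = (-117.0, 5)

aic_simple, aic_complex = aic(*simple), aic(*complex_model)
bic_simple, bic_complex = bic(*simple, n), bic(*complex_model, n)
print(aic_simple, aic_complex)                       # 246.0 244.0
print(round(bic_simple, 2), round(bic_complex, 2))
```

Lower is better for both, so here AIC favors the larger model and BIC the smaller one; BIC's per-parameter penalty exceeds AIC's whenever ln(n) > 2, that is, for n of 8 or more.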

Is there any research into just how much you should penalize complexity? Is complexity ever a good thing? Do domain-specific fields such as genetics or neuroscience treat model complexity differently, for example?

Have there been really complex models that – counter to our intuition – have been accepted by the academic community?

submitted by /u/sugarhilldt2