## Question about the sample size of data limiting the usefulness of sites like Youtube and music streaming sites

So, as the sample size approaches infinity, the estimate of the mean approaches the true mean, and the confidence interval around that estimate becomes narrower.
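A minimal sketch of that statement, assuming a normal population with made-up parameters (`TRUE_MEAN` and `TRUE_SD` are hypothetical). It only illustrates how the interval for a mean shrinks roughly like 1/sqrt(n); it says nothing about how any recommendation system actually works.

```python
import math
import random
import statistics

def mean_ci(sample, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    se = statistics.stdev(sample) / math.sqrt(n)
    return mean, z * se

random.seed(0)  # fixed seed so the run is reproducible
TRUE_MEAN, TRUE_SD = 10.0, 3.0  # hypothetical population values

half_widths = {}
for n in (100, 10_000):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n)]
    mean, half = mean_ci(sample)
    half_widths[n] = half
    print(f"n={n:>6}  mean={mean:.3f}  95% CI half-width={half:.3f}")
```

Going from 100 to 10,000 observations should shrink the half-width by about a factor of 10, since the sample size grew by a factor of 100.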

Given the above, is this the reason why predictive algorithms for sites like youtube, or music streaming sites have increasingly destroyed their usefulness for people who are unsure what they want to watch?

There is a limit to what videos I care to watch. I specifically like a particular content creator, not necessarily an entire game. However, youtube has recently been restricting results that I see in suggestions to things either directly related to the last video I watched, or directly related to videos I watch normally. It has become less useful. There used to be a fun game where you click on suggestions until you got to some really random video, and the comments were always filled with ‘I’ve found the dark place on youtube again’ kind of comments, or ‘How the hell did I end up here?’

Nowadays, it feels like the suggestions are the same 200 videos, reordered based on the last video you watched. Is this due to machine learning creating these very tight confidence intervals of what content I want to view?

I can’t find new things on youtube anymore without specifically wanting to find them, and it is sad. There was an old algorithm that did magical things with suggestions where the level of vagueness in the relation to a previous video was much more entertaining to sift through.

Not sure if this is a good place to ask this question, but I figured it was something that could be discussed from a statistical perspective.

submitted by /u/Dulout

## Word Vectors with Tidy Data Principles

Julia Silge wrote an awesome blog post a few months back about creating something similar to word2vec, breaking it down into easy-to-follow steps.

The results are interesting and could be expanded following her workflow (perhaps in another /r/statistics post?).

submitted by /u/boshiby

## What are the most commonly used heuristics when you have to still estimate probability but have zero degrees of freedom?

Hopefully this doesn’t read like /r/woahdude, but I’m curious about the Fermi paradox. I’ve read about a heuristic that states when you have a sample of one you should assume that it is the most common example of the population. What other heuristics apply, and what guesses using these heuristics can we make about alien life using us as our single data point?

submitted by /u/flylib_capitalism

## Regression: IV-DV non.sig. but moderation highly sig.

I am researching the effect of an innovation type on firm performance. In my quantile regression model, the relationship between innovation and firm performance is non-significant (p = 0.7). However, the moderator firm size is highly significant (p < 0.001). Is this possible? And how should this be interpreted? I don’t get how firm size should influence the relationship between innovation and firm performance if there is no relationship to begin with… Thanks!

submitted by /u/Tinselmob

## What statistical test am I thinking of?

I’m not sure if there is a test but I feel like there should be: Say I have a survey with 4 categorical options for favourite food group (meat, milk, veggies, bread). I survey 1000 people and the number of responses come back as follows:

• Meat 480
• Veggies 450
• Milk 40

How do I test the hypothesis that the preference for meat is actually higher than veggies in the population?
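One common way to frame this, offered as a sketch rather than the only valid test: look only at the respondents who chose meat or veggies. If the two were equally preferred in the population, each of those respondents would pick meat with probability 0.5, so the meat count follows a Binomial(n, 0.5) distribution and a one-sided exact binomial test applies. The helper name `binom_upper_tail` is made up for this example.

```python
from math import comb

def binom_upper_tail(k, n):
    """P(X >= k) for X ~ Binomial(n, 0.5), computed exactly with integers."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

meat, veggies = 480, 450          # counts from the survey above
n_either = meat + veggies         # respondents who chose meat or veggies

# Under H0 (equal preference), each of these respondents
# picks meat with probability 0.5.
p_value = binom_upper_tail(meat, n_either)
print(f"one-sided p-value = {p_value:.3f}")
```

A small p-value would count as evidence that meat is preferred over veggies. A large one means a 480 vs 450 split is consistent with equal preference.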

submitted by /u/0876

## ANOVA or t-test?

Originally, I had planned on running an ANOVA. However, after a test run of my test and reading about ANOVA versus t-test, I thought that I should be using a t-test because my independent variable has only two groups. I just finished a meeting with a member of my thesis committee to finalize my data plan before I start running tests on SPSS. He first seemed to think I should use an ANOVA. I explained why I thought I should use a t-test and he repeated that I should use an ANOVA because of the three groups. I told him that I didn’t understand and we talked through it at which point he told me to use a t-test.

Can I get some feedback from you all? I am currently reading a textbook about ANOVA and t-tests because I am concerned that there is something I don’t understand about why he first told me to use an ANOVA.

I am looking at test scores from 2nd to 10th grade in three subject areas. The independent variable is whether students were enrolled in a specific program that lasted from K – 4th grade. So the two groups are Enrolled and Not Enrolled. I am then looking at their test scores to see if enrollment in the program is correlated with higher, equal, or lower scores. I am NOT comparing the scores between grades or between subject areas.
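For what it's worth, with exactly two groups the choice matters less than it might seem: a one-way ANOVA and a pooled-variance (Student's) t-test give the same answer, because the ANOVA F statistic equals the square of the t statistic. The sketch below checks this numerically. The scores are made up for illustration, and the function names are hypothetical.

```python
import statistics

def pooled_t(a, b):
    """Student's two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    diff = statistics.fmean(a) - statistics.fmean(b)
    return diff / (pooled_var * (1 / na + 1 / nb)) ** 0.5

def one_way_f(*groups):
    """One-way ANOVA F statistic for any number of groups."""
    values = [x for g in groups for x in g]
    grand = statistics.fmean(values)
    ss_between = sum(len(g) * (statistics.fmean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up test scores, for illustration only.
enrolled = [78, 85, 90, 72, 88, 95]
not_enrolled = [70, 82, 75, 68, 80]

t = pooled_t(enrolled, not_enrolled)
f = one_way_f(enrolled, not_enrolled)
print(f"t^2 = {t * t:.6f}, F = {f:.6f}")
```

The equivalence holds for the pooled-variance t-test. Welch's unequal-variance version will generally differ slightly from the ANOVA result.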

submitted by /u/LalalalaLola1

## Comparing two groups’ answers to same question

Hi!

I’m doing my master’s thesis and I’m really not that good at statistics.

I have surveyed two groups, one from each country, asking them what’s important to them when buying certain products. They needed to rate different factors (price, quality, etc.) from 1 to 5. My hypothesis is that the marketing approach of the company needs to be different for each group.

How do I go about testing this? The mean values don’t help that much, as I need to find out which factors are most important to each group and then somehow show statistically that they differ (or not).
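One option that is often suggested for 1 to 5 ratings, since they are ordinal, is a Mann-Whitney U test run separately for each factor. The sketch below is a plain-Python version using the tie-corrected normal approximation. The ratings are invented, `mann_whitney` is a hypothetical helper, and the approximation assumes reasonably large groups with at least two distinct rating values.

```python
import math

def mann_whitney(x, y):
    """Mann-Whitney U test with tie-corrected normal approximation.

    Returns (U for x, z, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted(x + y)
    # Average rank for each distinct value (ties share the mean of their ranks).
    rank_of, tie_term, i = {}, 0, 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i                                # size of this tie group
        rank_of[pooled[i]] = (i + 1 + j) / 2     # mean of ranks i+1 .. j
        tie_term += t ** 3 - t
        i = j
    u1 = sum(rank_of[v] for v in x) - n1 * (n1 + 1) / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u1 - n1 * n2 / 2) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))         # two-sided p-value
    return u1, z, p

# Invented ratings of one factor (say, price) in the two countries.
country_a = [5, 4, 4, 5, 3, 4, 5, 4]
country_b = [3, 2, 4, 3, 3, 2, 4, 3]

u, z, p = mann_whitney(country_a, country_b)
print(f"U = {u}, z = {z:.2f}, p = {p:.4f}")
```

Running one test per factor means several tests in total, so a multiple-comparison correction such as Bonferroni is worth considering before calling any single factor different.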

Thanks!

submitted by /u/reljic

## forecasting using price

Hi All

I need some help:

I’m trying to forecast sales for some electronics products, and I have the following information:

1. Historical sales
2. Historical prices
3. Historical prices for other products in the category

I can provide the future prices of the product as well as future prices of the other products in the category too.

Can you please suggest the easiest way for me to forecast this? I only have Excel as a tool. I know there are other tools such as R and the like, but I am short on time.

It will most likely require some kind of multiple regression, but given my lack of command of stats, I am struggling to implement this.

You will see that I have mentioned the prices of other products in the category. That is because if a normally expensive product is slightly reduced in price, it will affect the sales of the current product, as the value proposition is no longer the same for the customer. So it is important to look at all products’ prices as well.
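To make the multiple-regression idea concrete, here is a minimal sketch that regresses sales on the product's own price and one competing product's price, then plugs in planned future prices. All numbers are made up, the function names are hypothetical, and a linear relationship is an assumption. In Excel, the built-in `LINEST` function or the Regression tool in the Analysis ToolPak fits the same kind of model.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                     # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(rows, y):
    """Least-squares coefficients for y ~ 1 + columns of rows (normal equations)."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

# Made-up history: (own price, competitor price) -> units sold.
history = [(100, 120), (95, 120), (90, 115), (100, 110), (105, 125), (98, 118)]
sales = [200, 215, 222, 185, 195, 207]

intercept, b_own, b_other = ols(history, sales)
# Forecast for hypothetical planned prices: own = 97, competitor = 119.
forecast = intercept + b_own * 97 + b_other * 119
print(f"forecast units = {forecast:.1f}")
```

With more competing products, each one adds a column. With only a short history, though, adding many price columns can easily overfit.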

Your help on this matter is very much appreciated.

If you reckon it would be impossible in Excel, what is the next best alternative that is easiest to implement?

submitted by /u/utopianaura

## How to explain an overall effect in interaction model?

I fit the following model:

``Y ~ a + b*age + c*treatment + d*(age*treatment) ``

Treatment and age were quantitative variables that were centered at their sample means prior to fitting the model. So in this model, “c” is the effect of treatment at the average age in the sample. If d>0, then the treatment effect is higher at older ages than at younger ages. And if d<0, then the treatment effect is higher at younger ages than at older ages.

My hypothesis was that the treatment had an effect on Y and/or the treatment effect was heterogeneous with age. So I tested the null hypothesis c=d=0, which means there is no treatment effect at any age (using an F-test with 2 and n-4 degrees of freedom).
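For reference, the joint test described here is the usual nested-model F-test. Writing RSS_1 for the residual sum of squares of the full model and RSS_0 for that of the reduced model `Y ~ a + b*age` (this notation is introduced only for this note), the statistic under standard linear-model assumptions is:

```latex
F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/2}{\mathrm{RSS}_1/(n-4)}
\;\sim\; F_{2,\,n-4} \quad \text{under } H_0\colon c = d = 0
```

The 2 in the numerator is the number of coefficients set to zero, and n-4 is the residual degrees of freedom of the full model with its four coefficients a, b, c, d.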

This F-test was significant. The estimates of c and d were each positive, which would seem to suggest that the treatment has a positive effect on Y at average age in the sample, and this effect is higher at older ages than younger ages.

Then, I also performed tests of c and d separately. Neither test was significant. So this means:

1. I failed to reject the null hypothesis that the treatment has no effect at the average age in the sample (H0: c=0)

2. I also failed to reject the null hypothesis that the treatment effect is the same across all age levels (H0: d=0).

3. But I rejected the null hypothesis that the treatment has no effect at the average age (age = 0 after centering) AND the treatment effect is the same across all age levels.

So, this result is tricky to interpret, because the data do provide evidence of a treatment effect. What can I say about this kind of result? I can claim that the data provide evidence that the treatment either had a main effect or had an effect that was heterogeneous with age, but I cannot point to a specific one.

I plotted the total estimated treatment effect, along with 95% confidence bands, as a function of age. It is the line c+d*age on the vertical axis against age on the horizontal axis (with age=0 corresponding to the sample mean). Its slope d and y-intercept c are both positive:

1. Among younger subjects, the confidence bands intersect the x-axis, implying the treatment was not significantly associated with the outcome at those age levels.
2. Among older subjects, the confidence bands do not intersect the x-axis, implying the treatment was significantly associated with the outcome at those age levels.
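The pointwise band in such a plot is usually built from the variance of the linear combination c+d*age. A sketch of the standard formula, assuming ordinary least squares with the usual assumptions (hats denote estimates):

```latex
\operatorname{Var}\!\big(\hat{c} + \hat{d}\,\mathrm{age}\big)
  = \operatorname{Var}(\hat{c})
  + \mathrm{age}^2 \operatorname{Var}(\hat{d})
  + 2\,\mathrm{age}\,\operatorname{Cov}(\hat{c}, \hat{d}),
\qquad
\text{band: } \big(\hat{c} + \hat{d}\,\mathrm{age}\big)
  \pm t_{0.975,\,n-4}\,
  \sqrt{\operatorname{Var}\!\big(\hat{c} + \hat{d}\,\mathrm{age}\big)}
```

The covariance term is one reason the separate tests of c and d can both be non-significant while the joint test rejects: the joint test uses the full covariance of the two estimates, whereas each separate test looks at one coefficient in isolation.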

Any thoughts?

submitted by /u/Aemon12