Question: working with a large data set, assessing accuracy

I’m trying to do some simple data analysis related to credit card fraud detection, and I was wondering what experts consider when assessing accuracy on a highly imbalanced data set.

The data set I am looking at has over 284k observations; however, fewer than 0.2% are fraudulent.
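To make the problem concrete: with under 0.2% positives, a model that always predicts “not fraud” is over 99.8% accurate, so plain accuracy tells you almost nothing; people usually look at precision, recall, and the precision-recall AUC instead. A minimal scikit-learn sketch on synthetic stand-in data (everything below is made up for illustration):

    # Why plain accuracy misleads at ~0.2% positives, and what to report instead.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import (accuracy_score, average_precision_score,
                                 classification_report)
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the fraud data: ~0.2% positive class.
    X, y = make_classification(n_samples=100_000, weights=[0.998],
                               flip_y=0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    scores = clf.predict_proba(X_te)[:, 1]

    print("accuracy:", accuracy_score(y_te, pred))             # high almost by default
    print("PR-AUC:  ", average_precision_score(y_te, scores))  # sensitive to the rare class
    print(classification_report(y_te, pred, digits=3))         # per-class precision/recall

PR-AUC tends to be more informative than ROC-AUC here because it focuses on how the model behaves on the rare positive class.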

submitted by /u/omgouda

Penn State vs. Texas A&M for an online MS degree

Got accepted to both. Which should I attend?

I currently work in quality control in food manufacturing, but I am interested in working in the energy sector. I’m mainly interested in optimizing production processes and product formulations, if my interests have any relevance.

My background is a dual BS in chemistry and biology. I got a minor in mathematics with an emphasis in modeling. I am familiar with C and have some superficial knowledge of C++.

Pros and cons I see:

  1. Texas A&M is significantly cheaper (my employer will cover between roughly a third and 75% of it, depending on how many credits I take each semester).

  2. Penn State is six credits shorter than A&M and has no qualifying exam to stress over.

  3. Texas A&M seems to have a larger variety of courses to choose from.

  4. Penn State doesn’t appear to make you focus on SAS the way A&M does, and I’m interested in learning R. My current employer doesn’t care what you use, but I can’t predict what future companies will want.

Any pro tips for me?

submitted by /u/TwigsthePnoDude

Modelers age 40+ in a corporate setting

My question is: what happens to modelers who work in industry as they get older? In my experience, looking at bosses one or two levels up, some of them haven’t run a regression in at least 10 years, and I honestly don’t respect their knowledge of pure stats and modeling. I doubt they’ve read a stats book in at least 20 years. Everyone at work sits in a cubicle or at an open desk, nobody has books, and nobody spends work time reading; it would look really weird if you did. Some of my bosses have made comments like “in my previous life, when I did all these textbook models”… but how and why did they move away from that? It seems to happen to everyone.

I don’t understand what they do all day, yet now I fear I am coming to that same crossroads as a mid-30-something modeler. I think they get bogged down in the business details and business functions, and use that knowledge to steer the modeling efforts of others. I used to not care at all about the business specifics; I always thought knowing the academic details was the hard part, and that picking up the business info as needed would be easy.

But now I find I have a real lack of knowledge about the business, and with my fixed time at work (plus what I do outside of work) I just can’t keep up with both. I can devote my time to learning more about the business, to not forgetting a core of stats knowledge, or to learning new stats. Pick two. I’ve come to accept that I can’t compete with a 25-year-old who just finished grad school; they will run me over in technical knowledge. Just working back through an old linear algebra book takes me weeks, and that’s only to not forget things, let alone learn new ones, and it eats time I could spend learning the industry. There is no way I can compete on technical knowledge, and that will only get worse as each year goes by. However, the new grads know nothing about the business, so I guess that is the material I should focus on? I have seen all kinds of problematic modeling efforts and mistakes that show a misunderstanding of general business knowledge. I feel like I should just give up on my hope of staying sharp on textbook stats, but it feels kind of sad to give that up.

Anyone else face this before?

submitted by /u/cooked23

Why is independence of variables such an important thing in statistics?

I understand that if two variables were dependent, one could be partly determined by the other, but how does variables being independent make things easier for us? I am currently studying regression analysis, in which the disturbance terms are assumed to be independent, and this assumption becomes the foundation for a lot of other proofs. What would change if the disturbances were actually dependent?
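For concreteness, here’s a toy simulation sketch (my own example, not from any textbook) of one thing that changes: with independent disturbances the textbook OLS standard error delivers roughly the advertised 95% confidence-interval coverage, while with AR(1)-dependent disturbances it can be badly off, because the variance formula it relies on drops all the covariance terms.

    # Coverage of the nominal 95% CI for an OLS slope, with and without
    # independence of the disturbances.
    import numpy as np

    rng = np.random.default_rng(0)
    n, rho, reps = 200, 0.8, 1000
    x = np.linspace(0, 1, n)

    def slope_and_se(y):
        """OLS slope and its textbook SE (which assumes independent errors)."""
        xc = x - x.mean()
        beta = (xc * (y - y.mean())).sum() / (xc ** 2).sum()
        resid = (y - y.mean()) - beta * xc
        s2 = (resid ** 2).sum() / (n - 2)
        return beta, np.sqrt(s2 / (xc ** 2).sum())

    for label, dependent in [("independent", False), ("AR(1) dependent", True)]:
        covered = 0
        for _ in range(reps):
            e = rng.normal(0, 1, n)
            if dependent:                     # e_t = rho * e_{t-1} + innovation
                for t in range(1, n):
                    e[t] += rho * e[t - 1]
            y = 2.0 + 3.0 * x + e             # true slope is 3
            beta, se = slope_and_se(y)
            covered += abs(beta - 3.0) < 1.96 * se
        print(f"{label}: CI covers the true slope {covered / reps:.1%} of the time")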

submitted by /u/Wickedmittal

Is it appropriate to use the likelihood ratio test to compare models that are weighted?

The weights are inverse probability of treatment weights (not sampling weights or some other type), and the models being compared are multilevel negative binomial and Poisson models.

submitted by /u/makemeking706

Question about how random my “mental-math random number algorithm” really is

I was curious how I would roll, say, a 20-sided die given nothing but my mind, or pencil and paper. Humans are notoriously biased when asked to produce random numbers, and I couldn’t find many good algorithms after a rudimentary Google search, so I decided to come up with one.

I turned to the most natural thing – language – requiring the user to know nothing but the letters of the alphabet and a couple of words. Most mental-math random number generation algorithms require memorizing a specific set of numbers or steps – who has time for that?

Looking at the environment and “counting the number of X” doesn’t work either, as it biases us towards objects that occur more than once and doesn’t produce an even distribution.

The algorithm and results are detailed here. Here’s the algorithm if you’re too lazy to read it (a Python sketch follows the steps):

  1. Start with a seed (1). The seed serves to “balance” our output distribution, whereas the rest of the algorithm allows for a relatively random output order.

  2. Start with a mod space (must be odd). If you want an even mod space, start with an odd one and take the first n-1 values.

  3. Think of any normal English word or sentence. Lowercase letters only.

  4. Add up all the letters in the sentence (indexed by their position in the alphabet).

  5. Increment the seed and add it to this sum.

  6. Mod the sum by the space you want the random number to be in.
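
To make the steps concrete, here’s a minimal Python sketch of the algorithm as written above (the function and variable names are mine):

    # Steps 1-6 of the mental-math RNG. The seed persists across calls so the
    # outputs cycle evenly through the mod space.
    seed = 1  # step 1

    def mental_rand(text, mod_space):
        """Sum letter indices (a=1, ..., z=26), add the incremented seed, mod m."""
        global seed
        assert mod_space % 2 == 1, "step 2: use an odd mod space"
        letter_sum = sum(ord(c) - ord('a') + 1                  # step 4
                         for c in text.lower() if c.isalpha())  # step 3
        seed += 1                                               # step 5
        return (letter_sum + seed) % mod_space                  # step 6

    for word in ["statistics", "random", "die", "twenty"]:
        print(word, "->", mental_rand(word, 21))                # outputs in 0-20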

I feel a bit iffy about the idea of a seed and incrementing it; that’s obviously the source of the “even” distribution we’re getting, as the results describe. After all, if you start with a number and increment it N times, taking everything mod m, you’ll have an even N/m items in every bin!

But with very predictable output. Thus, the rest of the algorithm serves to randomize the order of the output – i.e. where the simple incrementation algorithm would output [0…m-1] repeatedly, this would output something more random. And it does – I’m just not sure how to test how random this output is.
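
One standard first check, assuming you collect a batch of outputs, is a chi-square goodness-of-fit test against the uniform distribution. Note that it only tests whether the bins fill evenly, not whether the order is predictable; a runs test or serial-correlation check would be the next step. A minimal sketch (the stand-in outputs here are simulated, since the real ones would come from running the algorithm by hand):

    # Chi-square uniformity check: are the m bins hit roughly equally often?
    import numpy as np
    from scipy.stats import chisquare

    m = 21
    rng = np.random.default_rng(0)
    outputs = rng.integers(0, m, size=500)    # stand-in for hand-generated outputs

    observed = np.bincount(outputs, minlength=m)
    stat, p = chisquare(observed)             # default expectation is uniform
    print(f"chi2 = {stat:.2f}, p = {p:.3f}")  # small p = evidence against uniformity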

Anyways – what are your thoughts? Reasonably random for an algorithm that can be done mentally or on a small scrap of paper?

What are some twists I could introduce to increase the randomness? Perhaps the seed could serve to permute the input in some fashion? If most of the structure comes from the order of the words and the order of the letters, I could define a seed that messes with each?

submitted by /u/tusing

What test would you use in this situation?

I’m just an undergraduate who has only taken 3 statistics classes: intro, intro biostatistics, and an intermediate stats class. So I don’t really know that much!

Consider a large lecture (300+ students) split in half based on whether students use electronics for note-taking, where you want to compare the two sides’ grades on an exam. Students can move between sides, so there’s some potential overlap between the samples. Students sign a sheet each class indicating which side they sat on. I was thinking of an independent t-test, removing the students who have sat on more than one side, but I feel like that may throw out too many.
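
For concreteness, here’s a minimal sketch of the comparison I had in mind (the column names and numbers are made up):

    # Welch's t-test on exam scores, dropping students who switched sides.
    import pandas as pd
    from scipy.stats import ttest_ind

    df = pd.DataFrame({
        "side": ["electronics", "electronics", "no_electronics",
                 "no_electronics", "electronics", "no_electronics"],
        "switched": [False, False, False, True, False, False],
        "exam": [78.0, 85.5, 90.0, 88.0, 72.5, 81.0],
    })

    stayers = df[~df["switched"]]            # remove students who sat on both sides
    a = stayers.loc[stayers["side"] == "electronics", "exam"]
    b = stayers.loc[stayers["side"] == "no_electronics", "exam"]
    t, p = ttest_ind(a, b, equal_var=False)  # Welch's version: variances may differ
    print(f"t = {t:.2f}, p = {p:.3f}")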

Attendance data is also being taken, and I think it would be interesting to see if there’s any association between electronics use (or not) and attendance.

submitted by /u/carlyslayjedsen

Question about sample size limiting the usefulness of sites like YouTube and music streaming services

So, as the sample size approaches infinity, the estimate of the mean approaches the true mean, and the confidence intervals around it become tighter.
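
That first part is just the law of large numbers; a tiny simulation sketch shows the sample mean settling down and the interval shrinking like 1/sqrt(n):

    # The sample mean converges and its standard error shrinks like 1/sqrt(n).
    import numpy as np

    rng = np.random.default_rng(0)
    for n in [10, 1_000, 100_000]:
        sample = rng.normal(5.0, 2.0, size=n)    # true mean is 5.0
        se = sample.std(ddof=1) / np.sqrt(n)
        print(f"n={n:>7}: mean={sample.mean():.3f}, "
              f"95% CI half-width={1.96 * se:.4f}")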

Given the above, is this the reason why the predictive algorithms on sites like YouTube and music streaming services have increasingly destroyed their own usefulness for people who are unsure what they want to watch?

There is a limit to what videos I care to watch: I like a particular content creator, not necessarily an entire game. However, YouTube has recently been restricting the suggestions I see to things either directly related to the last video I watched or directly related to the videos I normally watch. It has become less useful. There used to be a fun game where you clicked through suggestions until you reached some really random video, and the comments were always filled with “I’ve found the dark place on YouTube again” or “How the hell did I end up here?”

Nowadays, the suggestions feel like the same 200 videos, reordered based on the last video you watched. Is this due to machine learning building very tight confidence intervals around what content I want to view?

I can’t find new things on YouTube anymore without specifically searching for them, and it is sad. The old algorithm did magical things with suggestions; the vaguer connection to the previous video made them much more entertaining to sift through.

Not sure if this is a good place to ask this question, but I figured it was something that could be discussed from a statistical perspective.

submitted by /u/Dulout

Word Vectors with Tidy Data Principles

Julia Silge wrote an awesome blog post a few months back about creating something similar to word2vec, breaking it down into easy-to-follow steps.

The results are interesting and could be expanded following her workflow (perhaps in another /r/statistics post?).

Link: https://juliasilge.com/blog/tidy-word-vectors/

submitted by /u/boshiby