Good morning, statisticians of Reddit!

Long time lurker, first time poster!

I want to preemptively apologize for the less-than-adequate way I am about to explain my question. This will inevitably be like going to a mechanic and making strange noises with my mouth to tell him what is wrong with my car.

Thanks in advance for any insight!

My dataset contains ~~longitudinal cohort~~ data spanning 12 years for a high school.

Students per graduating class = ~450.

The association I am trying to explore is whether students enrolled in the STEM program (yes or no; 1 or 0) have a higher graduation rate (% of students who graduate on time with their cohort, in 4 years) than students not enrolled.

I would like to first do some form of clustering, stratifying, or whatever is appropriate to distill the student population down into two homogeneous groups – with the only difference between the groups being the dichotomous independent variable.

GROUP A (IV = YES) = Graduation Rate %

GROUP B (IV = NO) = Graduation Rate %

Both groups would be clustered or stratified to have the same general makeup. The only difference is whether or not they are enrolled in the STEM program.
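To make the picture in my head a bit more concrete, here is a rough Python sketch of what I imagine by "stratifying": bucket students by the factors I want to hold constant, then compare on-time graduation rates for STEM vs. non-STEM inside each bucket. The records, field order, and category labels are all made up for illustration (my real data does not look like this yet), and I am not claiming this is the right method.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration only:
# (race, gender, income_band, in_stem, graduated_on_time)
students = [
    ("A", "F", "low", 1, 1),
    ("A", "F", "low", 0, 1),
    ("A", "F", "low", 0, 0),
    ("B", "M", "high", 1, 1),
    ("B", "M", "high", 1, 0),
    ("B", "M", "high", 0, 1),
]

def stratified_rates(records):
    """Return {stratum: {in_stem: graduation rate}} for strata that contain both groups."""
    # For each stratum, keep [graduated, total] counts per STEM flag (0 or 1).
    counts = defaultdict(lambda: {0: [0, 0], 1: [0, 0]})
    for race, gender, income, in_stem, graduated in records:
        cell = counts[(race, gender, income)][in_stem]
        cell[0] += graduated
        cell[1] += 1

    rates = {}
    for stratum, groups in counts.items():
        # Skip strata where one group is empty: there is nothing to compare.
        if groups[0][1] and groups[1][1]:
            rates[stratum] = {g: groups[g][0] / groups[g][1] for g in (0, 1)}
    return rates

rates = stratified_rates(students)
for stratum, by_group in rates.items():
    print(stratum, "non-STEM:", by_group[0], "STEM:", by_group[1])
```

With ~450 students per class, some buckets would probably end up tiny or empty for one group, which is part of why I am unsure this is the right way to go.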

My hypothesis is that, controlling for factors like race, gender, income, etc. (by clustering or stratifying), students enrolled in the STEM program have higher graduation rates than those not enrolled. This is as far as I want to go at this point. If something happens to show up and it appears to be a statistically significant (?) result, I will dig deeper into the WHY.

**My questions are:**

**How can I make these groups, and what method is most appropriate?**

**What statistical test would I use to determine if being enrolled in a STEM program is associated with graduation rates (a 4-year percentage ranging from 0-100%)?**
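For the second question, the only thing I can picture so far is a plain two-proportion z-test comparing the two graduation rates, sketched below in Python. This is just my guess at a candidate, not something I know to be appropriate: it ignores the stratification entirely, and the counts in the example are invented.

```python
import math

def two_proportion_z_test(grad_a, n_a, grad_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = grad_a / n_a
    p_b = grad_b / n_b
    # Pooled graduation rate under the null hypothesis of no difference.
    pooled = (grad_a + grad_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Invented counts: 90 of 100 STEM students vs. 80 of 100 non-STEM students graduate on time.
z, p = two_proportion_z_test(90, 100, 80, 100)
print(f"z = {z:.3f}, p = {p:.4f}")
```

If there is a better test that respects the strata, or a regression-style approach, that is exactly the kind of pointer I am hoping for.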

Thanks again for any help or direction!

submitted by /u/LurkAndWork