I recently came across a problem in some work I was doing and I was wondering how incorrect the following would be:
I ran 2 models, each with 1 DV and 1000 IVs. The IVs were exactly the same in both models; only the DV changed.
Within the vector of IVs, the first 500 came from one data source and the last 500 came from another.
Now, the researchers wanted to know, for each dependent variable, whether it was more “influenced” by data source 1 than by data source 2. Their theory was that the DV for model 1 would be more impacted by the data from data source 1, and the DV for model 2 more impacted by data source 2. All IV data were dichotomous (coded 0/1), and the two DVs were on the same scale.
One suggestion put forward, which I did not have an answer for, was to stack the coefficients from the two models, create a dummy code for the model they came from (1 or 2), a dummy code for the data source the coefficient came from (source 1 or source 2), and an interaction between the two. The interaction would then indicate whether coefficients from one source were larger for one model than for the other.
What would be the problems with such a method?
Here is a more in-depth explanation of the experiment, so that no one spends time responding without all the information.
The study is a classic machine learning experiment of “will the machine beat the judge,” but in an attempt to bring machine learning further into academia. Participants submitted a resume, and experts rated how well they believed each participant would do at a certain task based on their resume. The participants then completed the task and received a score. A random forest model was used to predict task performance from the same resume. To no one's surprise, the random forest did better than the human raters at predicting participant performance. A second random forest was then trained to predict the judges' scores from the same resume (essentially an automatic rating system). So we have 2 models: one that models the actual outcome and one that models judge scores.
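A minimal sketch of this two-model setup, using scikit-learn. All names and numbers here are placeholders (a toy sample size, 20 features instead of 1000, and synthetic 0/1 resume features), just to show the shape of the pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 300, 20                          # stand-ins for the real sample size / 1000 features
X = rng.integers(0, 2, size=(n, p))     # dichotomous (0/1) resume features

# Toy stand-ins for the two outcomes: actual task score and judge rating
task_score = X[:, : p // 2].sum(axis=1) + rng.normal(size=n)
judge_score = X[:, p // 2 :].sum(axis=1) + rng.normal(size=n)

# Model 1: predicts the actual outcome; Model 2: predicts the judges' scores
outcome_model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, task_score)
judge_model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, judge_score)

# One importance per feature from each forest (impurity-based, sums to 1)
imp_outcome = outcome_model.feature_importances_
imp_judge = judge_model.feature_importances_
```

These impurity-based importances are what would get stacked and compared in the step below.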
Now, you can derive feature importance from a random forest (here is a reasonable summary). So we have feature importances from the judge-based model (showing where judges put their weight) and from the outcome model (showing what mattered for optimal performance according to the random forest). Now comes the comparison: we want to figure out where the judges went wrong and how they could change how they evaluate the information. We can see the judges did not weight the features exactly as the outcome model did, but raw feature importances are hard to interpret. What we can do, however, is categorize where the features came from. For example, did a feature come from easily digestible demographic information, or from a more complicated open-response question? In general we hypothesized that human raters would focus more on the easily digestible demographic information, while the open-response questions would receive more weight from the outcome model. The only way we could think of to test this was what is described above: stack the feature importances and fit the following model:
feature_importance = model + easily_digestible_info + model * easily_digestible_info.
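The stacked-importance regression above could be sketched like this with statsmodels. The importances here are random placeholders (a Dirichlet draw, so they sum to 1 like forest importances), and I am assuming the "easily digestible" features are simply the first half of the feature vector:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
p = 20  # stand-in for the real 1000 features

# Placeholder importances; in practice these come from the two fitted forests
imp_judge = rng.dirichlet(np.ones(p))
imp_outcome = rng.dirichlet(np.ones(p))

# Stack the two importance vectors with dummies for model and feature category
df = pd.DataFrame({
    "importance": np.concatenate([imp_judge, imp_outcome]),
    "model": np.repeat(["judge", "outcome"], p),              # which forest the row came from
    "easy": np.tile((np.arange(p) < p // 2).astype(int), 2),  # 1 = easily digestible info
})

# The model:easy interaction is the term of interest
fit = smf.ols("importance ~ model * easy", data=df).fit()
print(fit.summary())
```

A positive (or negative) interaction coefficient would say that the importance gap between easy and hard features differs between the judge model and the outcome model, which is the hypothesis being tested.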
Any other suggestions would be greatly appreciated.