I ran a survey to find out whether people can distinguish human-drawn images from computer-generated images (30 images in each category). I showed every participant one image at a time, and the participant rated the statement “I think this image is computer-generated” on a 5-point Likert scale.
My hypothesis: computer-generated images and human-drawn images cannot be distinguished.
To test this, I think I should calculate the correct identification per image (the human-drawn image was considered human-drawn / the computer-generated image was considered computer-generated). To do this, I encoded the 5-point Likert scale as numerical values, i.e. strongly disagree is 0, disagree is 1, neutral is 2, agree is 3, and strongly agree is 4. To calculate the correct identification, I can compute, per participant, the difference between the given rating and the expected value (4 for computer-generated images and 0 for human-drawn images) and average these differences per image.
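To make the scoring concrete, here is a minimal sketch of how I compute these per-image scores. All data here is made up (random ratings), and the array layout (rows = participants, columns = images) is just one possible choice:

```python
import numpy as np

# Hypothetical ratings matrix: rows = participants, columns = images.
# Entries are the encoded Likert responses to "I think this image is
# computer-generated": 0 = strongly disagree ... 4 = strongly agree.
rng = np.random.default_rng(0)
n_participants, n_images = 20, 30
cg_ratings = rng.integers(0, 5, size=(n_participants, n_images))     # computer-generated images
human_ratings = rng.integers(0, 5, size=(n_participants, n_images))  # human-drawn images

# Distance from the "perfect" answer: 4 for computer-generated images,
# 0 for human-drawn ones. Averaging over participants (axis 0) gives one
# score per image: 0 = always identified correctly, 4 = always wrong.
cg_error_per_image = np.abs(4 - cg_ratings).mean(axis=0)
human_error_per_image = np.abs(0 - human_ratings).mean(axis=0)

print(cg_error_per_image.shape, human_error_per_image.shape)  # (30,) (30,)
```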
Then I have 30 (averaged) values for guessing correctly in each category (human-drawn and computer-generated). Now I would like to apply the Mann–Whitney U test to check whether there are more correct guesses for the computer-generated images than for the human-drawn images (e.g. because the computer-generated images are too simple). This would indicate that participants were able to distinguish the two types of images.
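The test itself would then look something like the following sketch, using `scipy.stats.mannwhitneyu`. The per-image scores here are random placeholders; note that with my scoring, "more correct guesses for CG images" means *lower* error scores, hence the one-sided `alternative="less"`:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical per-image error scores (one value per image, averaged over
# participants; 0 = always identified correctly, 4 = always wrong).
rng = np.random.default_rng(1)
cg_errors = rng.uniform(0, 4, size=30)     # computer-generated images
human_errors = rng.uniform(0, 4, size=30)  # human-drawn images

# One-sided test: are computer-generated images identified *better*
# (i.e. do they have lower error scores) than human-drawn images?
stat, p = mannwhitneyu(cg_errors, human_errors, alternative="less")
print(f"U = {stat}, p = {p:.3f}")
```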
Am I doing this right or am I missing something?
Sorry for the wall of text 🙂
/e: Maybe to explain why I think this might be wrong: even if there was just one participant, I could still apply this test and get significant results. This bothers me, since I think the number of participants should influence the significance, not the number of images.