1) Study 1 chose the subject of humor. Humor is subjective, changes over time, and varies with a person’s age. This is not a small problem; it makes Study 1 close to useless. To their credit, Dunning/Kruger recognize this and mention that interrater variability was indeed high. But why was humor chosen to begin with? This study was a waste of resources.
2) Study 2 asks people to say how many questions they think they got right. This question will never generate a significantly different result than the generic ability question, because it is just a restatement of that question. Indeed, the results confirm this: if anything, ALL participants across the board expressed slightly lower confidence, but otherwise gave similar answers. This question tells you nothing, because nothing has happened between taking the test and answering it that would give anyone a reason to lower their confidence – regardless of their score. It does NOT tell you whether the incompetent are miscalibrated due to their own performance rather than their estimate of others.
3) Study 3, Phase 1: Grammar is not “clear and decisive” as they assert. Grammar rules vary and change – not nearly as much as humor, so it is not fatal to this study, but it is not as ironclad as they claim (I know they focused on ASWE, but that distinction is unlikely to be meaningful to test takers, who think of grammar more generically). General note: tests of logic and grammar are likely to produce high confidence in test takers, because those subjects are strongly associated with a person’s sense of their own intelligence. I could go into this more, but the ego demands that people see themselves as at least average – this is one reason for the “above-average effect”.
In this study, participants were again asked to rate how they performed. Like in Study 2, this question was useless, and again the answers tracked almost exactly with their original perception. Of course they did. Participants were also asked, after the test, to rate how they did relative to their peers. And *again*, this is nothing more than re-asking the original question. Why would you expect a different answer just because a test was taken in between, when the taker does not yet know how they scored? These questions seem designed on the assumption that merely taking the test should make people realize how wrong they are. But how is that supposed to happen if they have not been shown they are wrong? It cannot – at least not on this type of test.
4) Study 3, Phase 2: This design takes the previous mistakes and supercharges them. It asks participants to “grade” the exams of others. Without any knowledge that their own answers are wrong, just how do the authors expect the graders to recognize the correct answers? The obvious logic is this: if I answer ‘A’, you answer ‘B’, and the correct answer is ‘B’ but no one has told me that, I will of course still think ‘A’ is correct and mark your answer of ‘B’ as wrong. Merely seeing that someone else answered ‘B’ is not going to change my answer from ‘A’. I already considered ‘B’ and determined (incorrectly) that it was wrong. Why would I see it again and decide it is right just because my peer chose it? Perhaps if ALL the papers gave the answer ‘B’ I would start to lose confidence, but this is unlikely to happen, and unlikely to be a pattern the grader would recognize.
So I do not believe this answers Prediction 3 at all. They observed no such behavior here. The authors expect incompetent people to pick up that their competent peers are correct through some unknown force in the universe. The logic really appears to be that they expect someone to change their answer every time someone else gives a different one. For a host of reasons, this does not happen. Again, the additional question asking people to re-grade themselves after grading others is useless. It tells you nothing; it is only a restatement of what they have believed all along. There is NO intervening event that would change their minds.
The discussion in this section even acknowledges that their overall theory has problems, made glaring by the underestimation among the top quintile of scorers. They try to explain this away, but I do not think they succeed. They do demonstrate one interesting result: after grading, the bottom quintile do not change their self-perceptions by a significant margin, but the top quintile increase theirs significantly. The authors make a lot of this, but I think the explanation is really simple.
I could write a longer discussion of this, but the plain fact is the study shows that NO ONE has any clue how their peers will perform. As a result, EVERYONE judges their own rank to be within a very narrow band, and EVERYONE has confidence in their answers. On seeing their peers’ answers, the top quintile’s confidence is reinforced, encouraging them to raise their self-assessment. The bottom quintile, meanwhile, get no such confirmation from their peers’ answers, but remain confident in their own, so their estimates stay about the same.
The key point is that the driving force is that EVERYONE is confident in their answers and NO ONE knows how their peers will perform. Even the highest performers are likely confident in their WRONG answers. The results seen here are nothing more than a statistical tautology. It is not surprising that people will be confident in their answers and that confidence will not be shaken simply because their peers disagree. In fact, we teach people they should not give up on their answers just because the crowd disagrees. Absent proof they are wrong, the typical reaction is to remain confident.
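The “statistical tautology” claim above can be illustrated with a small simulation. This is a sketch of my argument, not the authors’ data: all of the numbers (the 55–70 estimation band, the sample size) are assumptions chosen to model the narrow-band heuristic I am describing. If everyone estimates their percentile in the same narrow band regardless of actual ability, the classic chart – low scorers “overestimating”, high scorers “underestimating” – falls out automatically.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Actual ability: whatever the true distribution, ranking the scores
# spreads the actual percentiles evenly from 0 to 100.
actual = rng.normal(size=n)
actual_pct = 100.0 * actual.argsort().argsort() / (n - 1)

# Assumption: EVERYONE estimates their rank in the same narrow band
# (here 55-70), independent of actual ability, because no one has any
# information about peers and falls back on the same heuristics.
estimated_pct = rng.uniform(55, 70, size=n)

# Group by actual-performance quintile and compare the means.
order = np.argsort(actual_pct)
for q in range(5):
    idx = order[q * n // 5 : (q + 1) * n // 5]
    print(f"quintile {q + 1}: actual {actual_pct[idx].mean():5.1f}  "
          f"estimated {estimated_pct[idx].mean():5.1f}")
```

The bottom quintile comes out “overconfident” and the top quintile “underconfident”, yet no psychological mechanism connects estimates to ability here – by construction, the estimates contain no information at all.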
5) Study 4. This study is interesting because its findings contradict the way Dunning/Kruger is often deployed to criticize people. The study put a selected group of participants through a “training” session on the topic. Unsurprisingly, after training, the bottom two quintiles lowered their estimates of their abilities and test scores, while the top quintile raised theirs a bit. Is anyone surprised by this? The authors theorize that this proves that teaching people competence makes them better able to recognize their faults – gives them better “meta-cognition”.
This seems wrong to me. A simpler explanation is that EVERYONE gives an estimate in a certain narrow range, and the authors broke participants out of that range by DEMONSTRATING to them that they were wrong. The training showed them that many of their answers were in fact wrong. And interestingly, Dunning/Kruger is often deployed to shout down people who question “experts” – the idea being that such people think they are better than the experts because of some unique psychological trait that goes along with being stupid.
But the data in Study 4 shows clearly that *when confronted with evidence they are wrong, people adjust their self-confidence*. According to the Study 4 data, if people who question scientists (anti-vaxxers, for instance) were suffering from some “Dunning/Kruger” effect, they would adjust their confidence in themselves when confronted with evidence they are wrong. Yet, this does not happen. The reality is the effect is bogus and anti-vaxxers challenge scientists not because they are stupid, but for other reasons (mistrust being a big one). But it feels so much better to call people stupid and have a study to back it up.
Dunning/Kruger is widely deployed to say that people are stupid and will NEVER realize it because they are too stupid to realize it. The study does not actually say this, and the study itself refutes this idea. The “stupid” people in the study adjust their self-perception quite dramatically when confronted with their “stupidity”.
On a more general note, all 4 of the studies seem designed in a way that they could NEVER generate a null result. And their results are open to many interpretations. Dunning/Kruger then choose the interpretation that matches their hypothesis without refuting other interpretations that make more sense.
As I said, the one glaring piece of data that cannot be denied is that EVERYONE judges their own ranking among peers not on any rational basis, but using psychological and social heuristics that narrow the range of responses to a small subset of the total range. No one will rank their ability on a general test of intelligence as below average, because to do so would be to believe you are stupid as a general proposition. Talk about ego crushing – a person could not live like that. So the bottom half of the scale is gone as an option. Typically, the responses cluster somewhere just above average. It appears that the highest performers do in fact recognize they are higher performers, so their self-assessment is somewhat higher, though still lower than their actual ability – no doubt limited by the self-doubt that people are subject to. Once they see the responses of their peers, they realize more confidence is warranted and adjust their estimates up. (The same is true of self-assessment of their raw score. Everyone will believe they are a bit above average – especially on a test of general knowledge they have not prepared for.)
But there is no psychological phenomenon that causes dumb people to rate themselves higher. It’s simply an artifact of them being objectively dumber while at the same time being just as clueless as everyone else about how their peers perform. When higher scorers see their peers’ answers, their confidence is bolstered. While lower performers are not affected by peer answers (at least not enough to change their scores substantially), they do respond to being shown they are wrong. But in the study, the lack of response to peer answers by low performers is not evidence of some unique phenomenon. It is simply that people (wisely) do not discard their answers just because someone else answered differently – they have to be shown they are wrong. But, contrary to how Dunning/Kruger is usually deployed – when shown they are wrong, they do adjust their perception!
The idea that people should change their scores when confronted by group disagreement, otherwise they are dumb, is very strange indeed. In fact, human progress likely depends on people rejecting this idea. The greatest thinkers have often had to fight through massive group disagreement. We should all be grateful they did not change their mind just because the majority disagreed.
If I had to summarize what this study means, it would be this: People are not a good judge of their ability relative to peers. All people will judge their ability within a narrow range, regardless of their actual ranking, because they have NO information to work with and resort to psychological/social heuristics to assess their ranking. The highest performers, who likely have some training in the field and therefore should be more confident, will indeed express more confidence in their ranking, tempered by a general hesitancy to overestimate.
As an aside, I think how people estimate their ranking in relation to peers is quite similar to how estimates are done in software development. We attempt to use information on past performance to produce future estimates, but the utility of that is questionable/marginal. It seems the only way to really get a handle on such estimating is to limit the scope of work as much as possible so that when estimates fail, they do not fail by large increments. The estimates themselves largely rely on heuristics that are more art than science (“tell me how long it will take you, then double it”). Experts admit that software estimation is to some extent guessing because we lack the information necessary to make estimates with any precision. Same with self-perception among peers. Of course, there are other factors at play, but lack of information is a big one.
Here is a funny exercise: while I don’t trust the numbers from the humor study too much, the chart is just too perfect. Take away anything below 50 and above 80 – justified because no one will give responses in those ranges, for social/psychological reasons. Now expand the remaining values back to a 0–100 scale and overlay the actual scores. My guess is the two data series now look pretty similar. This doesn’t work for the other studies, because those studies mainly find everyone giving the same perception except for the highest performers. But it is a logical way to look at the data and make real sense of it, if you believe the results from Study 1. People psychologically block out 0–50 and 80–100, so the real scale is 50–80. Once you realize that, you can make a chart that looks like people are *REALLY* good at assessing their own humor. I wouldn’t rely on the data from that study, though.
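The rescaling step above can be sketched concretely. The perceived ratings below are my own placeholders (one per actual-score quintile, confined to the 50–80 band), NOT Study 1’s published figures – the point is only to show the mechanics of expanding the usable band back to a full 0–100 scale.

```python
import numpy as np

# Hypothetical perceived-ability ratings, one per actual-score quintile,
# confined to the 50-80 band people will actually use (placeholder values).
perceived = np.array([58.0, 60.0, 63.0, 66.0, 70.0])
# Hypothetical actual quintile scores on a 0-100 scale (placeholder values).
actual = np.array([12.0, 30.0, 50.0, 70.0, 88.0])

# Linearly expand the usable 50-80 band back to the full 0-100 scale.
rescaled = (perceived - 50.0) / (80.0 - 50.0) * 100.0
print(np.round(rescaled, 1))  # [26.7 33.3 43.3 53.3 66.7]
```

With these placeholder inputs, the rescaled perceptions become monotone and track the actual scores far more closely than the raw 58–70 band does – which is the shape my guess predicts the real Study 1 data would take.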
I’m sure you think I’m just a know-it-all – whatever, it doesn’t really bother me. But there has been plenty of criticism of Dunning/Kruger, and follow-up studies have put big dents in its armor. There is a much more involved discussion here: https://www.talyarkoni.org/blog/2010/07/07/what-the-dunning-kruger-effect-is-and-isnt/.
David Dunning even appears in the comments to dispute a few points and there is some good back and forth.
Having said all this, I am not asserting that nothing like a “Dunning/Kruger” effect exists in some people. It’s always possible. But I see no evidence in their study that proves what they say, and more pedestrian psycho-social explanations fit their data a lot better. And I think “Dunning/Kruger” has become a meme and gone relatively unquestioned because it tells such a pleasing story for the people in the top quintile: “Everyone else is a bunch of overconfident idiots!” Oh-so-seductive, fallacious reasoning. The problem with deploying it as widely as many do is that its acceptance seems to be confirmation bias writ large, and its use strikes me as exceedingly arrogant (and fallacious).