OK, it’s answer time for these questions. First, a little background. This is the paper, or rather, here it is to download. The questions were asked of over 100 psychology researchers and 400 students and virtually none of them got all the answers right, with more wrong than right answers overall.

The questions were modelled on a paper by Gigerenzer who had done a similar investigation into the misinterpretation of p-values arising in null hypothesis significance testing. Confidence intervals are often recommended as an improvement over p-values, but as this research shows, they are just as prone to misinterpretation.

Some of my commenters argued that one or two of the questions were a bit unclear or otherwise unsatisfactory, but the instructions were quite clear, and the point was not whether one might think the statement probably right, but whether it could be deduced as correct from the stated experimental result. I do have my own doubts about statement 5, as I suspect that some scientists would assert that “We can be 95% confident” is exactly synonymous with “I have a 95% confidence interval”. That’s a confidence trick, of course, but that’s what confidence intervals are anyway. No untrained member of the public could ever guess what a confidence interval is.

Anyway, the answer, for those who have not yet guessed, is that all of the statements were false, broadly speaking because they were making probabilistic statements about the parameter of interest, which simply cannot be deduced from a frequentist confidence interval. Under repetition of an experiment, 95% of confidence intervals will contain the parameter of interest (assuming they are correctly constructed and all auxiliary hypotheses are true) but that doesn’t mean that, ONCE YOU HAVE CREATED A SPECIFIC INTERVAL, the parameter has a 95% probability of lying in that specific range.
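The repeated-sampling property is easy to check by simulation. Here is a minimal sketch (the normal data-generating model, sample size, and use of the 1.96 normal approximation are my own illustrative choices, not from the paper):

```python
import math
import random
import statistics

random.seed(1)

TRUE_MEAN = 10.0  # known here only because we chose the data-generating process
N, REPS = 30, 10_000
covered = 0

for _ in range(REPS):
    sample = [random.gauss(TRUE_MEAN, 2.0) for _ in range(N)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)
    lo, hi = m - 1.96 * se, m + 1.96 * se  # normal approximation to the t interval
    covered += lo <= TRUE_MEAN <= hi

print(covered / REPS)  # long-run coverage close to 0.95
```

The ~95% figure is a property of the procedure across repetitions; it says nothing about whether the one interval you happen to have published contains the parameter.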

In reading around the topic, I found one paper which had an example which is similar to my own favourite. We can generate valid confidence intervals for an unknown parameter with the following procedure: with probability 0.95, say “the whole number line”, otherwise say “the empty set”. If you repeat this many times, the long-run coverage frequency tends to 0.95, as 95% of the intervals do include the true parameter value. However, for a given example, we can state with absolute certainty whether the parameter is either in or outside the interval, so we will never be able to say, once we have generated an interval, that there is 95% probability that the parameter lies inside that interval.
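The pathological procedure above takes a couple of lines to simulate (a sketch; the parameter value and representation of the empty set as `None` are arbitrary choices for illustration):

```python
import random

random.seed(0)

def silly_interval():
    """With probability 0.95 return the whole number line, else the empty set."""
    if random.random() < 0.95:
        return (float("-inf"), float("inf"))
    return None  # the empty set

THETA = 3.7  # any fixed parameter value works
REPS = 100_000
hits = 0

for _ in range(REPS):
    ci = silly_interval()
    hits += ci is not None and ci[0] <= THETA <= ci[1]

print(hits / REPS)  # long-run coverage near 0.95
# Yet for any single realised interval we know with certainty whether
# THETA is inside it (whole line) or outside it (empty set), so the
# "95% probability" never attaches to a specific interval.
```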

(Someone is now going to raise the issue of Schrödinger’s interval, where the interval is calculated automatically, and sealed in a box. Yes, in this situation we can place 95% probability on that specific interval containing the parameter, but it’s not the situation we usually have where someone has published a confidence interval, and it’s not the situation in the quiz).

And how about my readers? These questions were asked on both blogs (here and here) and also on twitter, gleaning a handful of replies in each place. Votes here and on twitter were majority wrong (and no-one got them all right). Interestingly, all three of the commenters on the Empty Blog were basically correct, though two of them gave slightly ambiguous replies; I think their intent was right. Maybe it helps that I’ve been going on about this for years there 🙂

Shame I missed the quiz, I might have done reasonably well ;o)

https://stats.stackexchange.com/questions/26450/why-does-a-95-confidence-interval-ci-not-imply-a-95-chance-of-containing-the/26457#26457

I suspect in most cases though the confidence interval would also be a credible interval for some reasonable prior, so in practice confidence intervals may still be better than NHSTs, but perhaps not for the right reason?

Well I was confident that some people would know the difference 🙂

As for “most cases”, well…possibly. There are cases where it’s ok, cases where it is clearly ridiculous, and (worst of all) cases where it isn’t quite right but passes the sniff test, so people end up believing something they shouldn’t…

Some of the seemingly natural climate science cases fail the latter test.

The infamous Douglass et al on Tropical Troposphere trends? ;o)

The real problem is that scientists can’t safely use a “statistics cookbook” approach to data analysis, and really do need to actually understand the methods they are applying (you also need to know what it is you are trying to cook before looking up the recipe).

“Anyway, the answer, for those who have not yet guessed, is that all of the statements were false … ”

And I scored 100%, so BFD!

“I’ll say false to all six statements.”

https://julesandjames.blogspot.com/2019/05/blueskiesresearchorguk-how-confident.html?showComment=1558858223857#c3221431842353993306

It took me less than 60 seconds even; sticking “true mean” in all six statements was a dead giveaway, as no one knows the true mean except in statistical experiments where the “true mean” is known from the selected PDF/CDF.

It has been over 40 years since my formal training in statistics, but the “true mean” of the population is not the sample mean. Same as it ever was.