@johnfkraus
Last active May 5, 2025 14:12
Are LLMs Good at Making Judgements Under Uncertainty?


(Work in progress by John Kraus, 4/25/25)

Here are some snippets of evidence regarding how well LLMs perform in making judgements under uncertainty.

Prompt:

The mean IQ of the population of eighth graders in a city is known to be 100. You have selected a random sample of 50 children for a study of educational achievements. The first child tested has an IQ of 150. What do you expect the mean IQ to be for the whole sample? 1

Correct Answer: 101.

Perplexity.ai and ChatGPT both answered 100 instead of the correct value, 101, demonstrating their susceptibility to the gambler's fallacy: the expectation that subsequent draws will compensate for an extreme early one. In fact, the remaining 49 children are expected to average the population mean of 100, so the high first score pulls the expected sample mean up to 101.
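The arithmetic behind the correct answer can be sketched in a few lines. The only assumption is that the 49 untested children are an unbiased draw from the population, so their expected average is the population mean:

```python
# Expected mean of the full sample, given one observed score (150)
# and 49 as-yet-unseen children expected to average the population
# mean (100). The sample does not "compensate" for the outlier.
population_mean = 100
first_child_iq = 150
sample_size = 50

expected_mean = (first_child_iq + (sample_size - 1) * population_mean) / sample_size
print(expected_mean)  # 101.0
```

Answering 100 amounts to assuming the remaining children will average slightly below 100 to cancel out the 150, which is exactly the compensation belief the fallacy describes.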

(Wrong) answer from Perplexity.ai, 4/25/25, pplx.ai/share:

The mean IQ of the population is 100. When you select a random sample of 50 children, the expected mean IQ for your sample is also 100, regardless of the IQ of the first child tested. This is because, in random sampling, each child is equally likely to be selected, and the expected value of the sample mean is equal to the population mean. The fact that the first child has an IQ of 150 does not change the expected mean for the whole sample; it is just one data point out of 50. So, the expected mean IQ for the whole sample is 100.


(Wrong) answer from ChatGPT, 4/25/25:

To determine the expected mean IQ for the whole sample of 50 children, we need to understand that the mean IQ of the population is 100. Since the sample is random, and we don't know the IQ scores of the other 49 children, we can predict the mean IQ of the sample based on the available information.

Key Points:

• Population Mean IQ: 100

• Sample Size: 50 children

• First Child's IQ: 150 (this is the score of just one child in the sample)

The expected mean IQ for the sample is the average of all the children's IQs, and since we have only one child’s score (150) and we don’t know the others, we would expect the mean to be close to the population mean, assuming the sample is random. We can calculate the expected mean IQ of the whole sample as the population mean (100) because this is the central tendency for the entire population. Adding one score of 150 into a sample of 50 children shouldn't dramatically alter the expected mean if the sample is random, but technically, it will increase the mean slightly. To find the expected mean IQ for the whole sample, we use the following reasoning:

• The first child's score is 150, and we expect the other 49 children's scores to be distributed around the population mean of 100.

Since we don't know the exact scores of the other 49 children, we can say:

Expected Mean IQ ≈ 100

This is because the first child's score (150) is only one data point, and the rest of the 49 data points are expected to be close to 100, so the overall sample mean will not stray far from 100.
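A simple Monte Carlo simulation confirms that the expected sample mean is 101, not 100. The distribution of the other 49 scores is an assumption here (normal with mean 100 and SD 15, the conventional IQ scaling); the result depends only on their mean:

```python
import random

random.seed(0)

# Each trial: one child scores 150, the other 49 are independent
# draws from the population distribution. Average the 50-child
# sample mean over many trials.
trials = 20_000
total = 0.0
for _ in range(trials):
    rest = [random.gauss(100, 15) for _ in range(49)]
    total += (150 + sum(rest)) / 50

print(round(total / trials, 2))  # close to 101, not 100
```

The simulated value settles near 101, matching the arithmetic (150 + 49 × 100) / 50 and contradicting both models' answer of 100.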

Hugging Face Falcon Model Performance

Here are a couple of screenshots of AI hallucinations induced in AWS SageMaker using this model: "huggingface-llm-falcon-7b-instruct-bf16".

The prompts are adapted from research 2 on the conjunction fallacy: the tendency to judge two events as more likely to occur together than either event is to occur alone. (Tversky, A., & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review, 90, 293-315.)
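Why the conjunction fallacy is a fallacy can be shown by enumeration: for any events A and B, P(A and B) ≤ P(A), because every outcome satisfying both conditions also satisfies each one alone. A toy sketch over hypothetical counts, using the bank-teller/feminist traits from the cited paper's famous "Linda" problem:

```python
# Conjunction rule: P(A and B) <= P(A). Enumerate a tiny, made-up
# population and compare the probability of one trait against the
# probability of the conjunction of two traits.
people = [
    {"bank_teller": True,  "feminist": True},
    {"bank_teller": True,  "feminist": False},
    {"bank_teller": False, "feminist": True},
    {"bank_teller": False, "feminist": False},
    {"bank_teller": False, "feminist": True},
]

n = len(people)
p_teller = sum(p["bank_teller"] for p in people) / n
p_both = sum(p["bank_teller"] and p["feminist"] for p in people) / n

print(p_teller, p_both)  # 0.4 0.2
assert p_both <= p_teller  # the conjunction can never be more likely
```

Whatever counts are plugged in, the conjunction never comes out more probable than either trait alone; a model that ranks the conjunction higher is violating this rule.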

These hallucinations suggest that this Falcon LLM is susceptible to the conjunction fallacy.

[Screenshots: Falcon-7B-Instruct responses in SageMaker exhibiting the conjunction fallacy; images not preserved]

TO BE CONTINUED...

Footnotes

  1. Tversky, A., & Kahneman, D. (1971). Belief in the Law of Small Numbers. Psychological Bulletin, 76(2), 105-110.

  2. Tversky, A., & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in probability judgment. Psychological Review, 90, 293-315.
