Limited Benchmark Dataset
The benchmark dataset is extensive but not exhaustive. Covering a wider range of concepts and more types of keystone questions would allow potemkins to be identified more comprehensively.
Relying on a single definition question as the keystone may not capture the nuances of understanding a concept. In practice, a keystone could consist of several questions, including ones that demonstrate application.
Potentially Difficult 'Use' Tasks
The 'use' tasks in the benchmark may be hard enough that even humans would struggle with them. If so, this would confound the potemkin analysis, since a failed 'use' task could reflect task difficulty rather than a gap between defining a concept and applying it.
Lower Bound on Potemkin Rate
The automated procedure for evaluating potemkins yields only a lower bound on the potemkin rate, so the true rate may be higher than the reported one.
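To illustrate why a detection-based estimate is a lower bound, here is a minimal sketch. It is not the benchmark's actual procedure: the `Trial` fields, the `potemkin_rate` helper, and the toy data are all hypothetical. It assumes the potemkin rate is the share of keystone-correct trials in which the 'use' task fails, and that the automated check can miss failures but never flags a success as a failure.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    # All fields are hypothetical names chosen for this illustration.
    keystone_correct: bool  # the keystone (definition) question was answered correctly
    use_failed: bool        # the follow-up 'use' task was actually failed
    failure_detected: bool  # the automated check flagged that failure


def potemkin_rate(trials, detected_only):
    """Share of keystone-correct trials that count as 'use' failures.

    With detected_only=True, only failures the automated check caught
    are counted; otherwise every true failure is counted.
    """
    eligible = [t for t in trials if t.keystone_correct]
    if not eligible:
        return 0.0
    if detected_only:
        hits = sum(t.use_failed and t.failure_detected for t in eligible)
    else:
        hits = sum(t.use_failed for t in eligible)
    return hits / len(eligible)


# Toy data, invented for the example.
trials = [
    Trial(True, True, True),     # failure, caught by the check
    Trial(True, True, False),    # failure, missed by the check
    Trial(True, False, False),   # 'use' task passed
    Trial(True, False, False),   # 'use' task passed
    Trial(False, True, True),    # keystone wrong, excluded from the rate
]

observed = potemkin_rate(trials, detected_only=True)
true_rate = potemkin_rate(trials, detected_only=False)
print(observed, true_rate)
```

Under these assumptions every missed failure lowers the observed rate, so it can never exceed the true one. If the check could also produce false positives, the bound would no longer be guaranteed.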
LLM Self-Grading Assumption
The approach assumes that LLMs can reliably grade their own outputs. This assumption may not always hold, given potential biases or limitations in model capabilities.