Limited Dataset Size
The study used only 50 base multiple-choice questions, rewritten into 250 variants. This small dataset limits the generalizability of the findings to a wider range of tasks and knowledge domains.
Reliance on a Single Model
Experiments primarily relied on ChatGPT-4o. The paper acknowledges that different LLM architectures and training corpora may respond differently, so the findings may not transfer to other models without further validation.
Narrow Performance Metric
The evaluation focused solely on accuracy in a multiple-choice setting. It did not assess other important qualities of LLM performance such as fluency, reasoning, coherence, or helpfulness.
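To make concrete how narrow this metric is: accuracy in this setting reduces to exact-match scoring of the selected option, as in the minimal sketch below. The function and variable names are hypothetical and this is not the paper's evaluation code; it only illustrates that qualities like fluency or reasoning are invisible to such a score.

```python
# Illustrative sketch only: exact-match accuracy on multiple-choice answers,
# the single kind of metric the study reports. Names are hypothetical.
def multiple_choice_accuracy(predictions, gold_labels):
    """Fraction of questions where the model's chosen option matches the answer key."""
    assert len(predictions) == len(gold_labels)
    correct = sum(p.strip().upper() == g.strip().upper()
                  for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

# Example: 4 of 5 selected options match the key, so accuracy is 0.8.
print(multiple_choice_accuracy(["A", "C", "B", "D", "A"],
                               ["A", "C", "B", "D", "B"]))
```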
Constrained Politeness Operationalization
The definitions of 'politeness' and 'rudeness' relied on specific linguistic cues (prompt prefixes), which may not capture the full sociolinguistic spectrum of tone or account for cross-cultural differences. This could lead to a simplified picture of how politeness actually manifests.
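To illustrate why prefix-based operationalization is constrained: it amounts to prepending one fixed cue per tone level to an otherwise identical question. The sketch below uses invented prefixes and tone labels for illustration, not the paper's actual wording, and shows how this flattens tone to a handful of surface cues.

```python
# Illustrative sketch, not the study's materials: a fixed prefix per tone level
# is prepended to a base question, so "tone" is reduced to a few surface cues.
TONE_PREFIXES = {
    "very_polite": "Would you be so kind as to answer the following question? ",
    "polite":      "Please answer the following question. ",
    "neutral":     "",
    "rude":        "Answer this if you can manage it: ",
    "very_rude":   "You'd better not get this wrong: ",
}

def make_variants(base_question: str) -> dict:
    """Prepend each tone prefix to a base question, yielding one variant per tone."""
    return {tone: prefix + base_question for tone, prefix in TONE_PREFIXES.items()}

variants = make_variants(
    "Which planet is closest to the Sun? (A) Venus (B) Mercury (C) Mars (D) Earth"
)
for tone, prompt in variants.items():
    print(f"[{tone}] {prompt}")
```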
Ethical Implications of Findings
The authors acknowledge that the finding that rude prompts yielded better results could encourage the deployment of hostile or toxic interfaces, degrading user experience and normalizing harmful communication. This is a significant concern for responsible AI development.