Limited LLM Selection and Computational Constraints
The study acknowledges computational limitations and primarily experiments with smaller LLMs due to budget and latency constraints. This limits the generalizability of the findings, as larger, more sophisticated LLMs might exhibit different behaviors.
Subjectivity of Narrative Evaluation
Narrative generation tasks are evaluated using LLM-as-a-judge scoring for qualities like "believability" or "emotional tone." This introduces subjectivity and potential biases from the judging LLM, making it difficult to objectively assess the impact of personality priming on narrative quality.
Limited Scope of Game Theory Experiments
While the game theory component is interesting, it focuses only on classic two-player games (Prisoner's Dilemma, Hawk-Dove) and simplified communication protocols. More complex game scenarios and multi-agent interactions would be needed to fully understand how personality influences strategic decision-making in LLMs.
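To make the scope concern concrete, here is a minimal sketch of the two games named above as 2x2 payoff tables, with a brute-force search for pure-strategy Nash equilibria. The payoff numbers, the dictionary layout, and the helper name `pure_nash_equilibria` are illustrative assumptions; they are not taken from the study.

```python
from itertools import product

# Payoffs are (row player, column player).
# All numbers are illustrative assumptions, not values from the study.
PRISONERS_DILEMMA = {
    ("C", "C"): (3, 3),   # both cooperate
    ("C", "D"): (0, 5),   # row cooperates, column defects
    ("D", "C"): (5, 0),   # row defects, column cooperates
    ("D", "D"): (1, 1),   # both defect
}

HAWK_DOVE = {
    ("H", "H"): (-1, -1),  # both escalate and pay the cost of fighting
    ("H", "D"): (2, 0),    # hawk takes the resource
    ("D", "H"): (0, 2),
    ("D", "D"): (1, 1),    # doves share the resource
}


def pure_nash_equilibria(game):
    """Return action pairs where neither player gains by deviating alone."""
    rows = sorted({r for r, _ in game})
    cols = sorted({c for _, c in game})
    equilibria = []
    for r, c in product(rows, cols):
        row_payoff, col_payoff = game[(r, c)]
        # Row player cannot do better by switching rows, holding c fixed.
        row_ok = all(game[(alt, c)][0] <= row_payoff for alt in rows)
        # Column player cannot do better by switching columns, holding r fixed.
        col_ok = all(game[(r, alt)][1] <= col_payoff for alt in cols)
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria


if __name__ == "__main__":
    print("Prisoner's Dilemma:", pure_nash_equilibria(PRISONERS_DILEMMA))
    print("Hawk-Dove:", pure_nash_equilibria(HAWK_DOVE))
```

Under these assumed payoffs, the Prisoner's Dilemma has one pure equilibrium (mutual defection) and Hawk-Dove has two asymmetric ones. Each game has only four outcomes, so there is little room for a personality effect to show up beyond a shift in cooperation or escalation rates.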
Lack of Human Evaluation
The study lacks human evaluation of the LLM outputs, particularly for the narrative generation task. Human judgment of creativity, emotional expressiveness, and overall story quality would provide a more nuanced perspective.
MBTI's Scientific Validity
The study relies on the MBTI framework, which has known limitations in scientific validity and reliability. A better-validated personality model, such as the Big Five (five-factor model), could strengthen the findings.