Difficulty with Intricate Game Rules
The method struggled most with Gin Rummy, an imperfect-information game whose rules require intricate, multi-step procedural subroutines. This points to a limitation in reliably translating very complex natural language rules into correct executable code, and it shows up as lower world-model accuracy on such games.
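To make the difficulty concrete, below is a minimal, illustrative sketch (not the authors' generated code) of one such subroutine: checking whether a group of cards forms a valid Gin Rummy meld and scoring the remaining deadwood. A synthesized world model has to get many interlocking routines like this exactly right before game states evolve correctly.

```python
# Illustrative sketch only: one of the procedural subroutines a Gin Rummy
# world model must implement. Cards are (rank, suit) tuples, rank 1-13.
# A meld is a set (3+ cards of equal rank) or a run (3+ consecutive ranks
# in one suit, aces low); unmatched cards count as deadwood.

def is_set(cards):
    return len(cards) >= 3 and len({rank for rank, _ in cards}) == 1

def is_run(cards):
    ranks = sorted(rank for rank, _ in cards)
    same_suit = len({suit for _, suit in cards}) == 1
    consecutive = all(b - a == 1 for a, b in zip(ranks, ranks[1:]))
    return len(cards) >= 3 and same_suit and consecutive

def is_valid_meld(cards):
    return is_set(cards) or is_run(cards)

def deadwood_points(hand, melds):
    """Points for unmatched cards: face cards count 10, others their rank."""
    melded = {card for meld in melds for card in meld}
    return sum(min(rank, 10) for rank, suit in hand
               if (rank, suit) not in melded)
```

Even this fragment omits the harder combinatorial step of partitioning a hand into melds so as to minimize deadwood, which the generated code must also handle.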
Reliance on LLM for Code Generation
The approach depends entirely on the LLM's ability to synthesize correct, verifiable Python code from natural language rule descriptions. An iterative refinement process helps, but the initial quality of the generated code and the complexity of the game rules remain a significant upstream bottleneck.
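For context, the refinement process can be pictured as a generate-check-repair loop. The sketch below is a hedged illustration under assumed interfaces: the `llm_generate` and `run_checks` callables are hypothetical stand-ins for the model call and the verification step, not the paper's actual API.

```python
from typing import Callable, List

def synthesize_world_model(
    rule_text: str,
    llm_generate: Callable[[str], str],      # prompt -> candidate Python source
    run_checks: Callable[[str], List[str]],  # source -> list of failure messages
    max_iterations: int = 10,
) -> str:
    """Generate code for the rules, then iteratively repair it until checks pass."""
    prompt = f"Write Python code implementing these game rules:\n{rule_text}"
    source = llm_generate(prompt)
    for _ in range(max_iterations):
        failures = run_checks(source)        # e.g., unit tests or trajectory replays
        if not failures:                     # all checks pass: accept the code
            return source
        # Feed the failures back so the model can repair its own output.
        prompt = (
            "The following code failed some checks.\n"
            f"Code:\n{source}\n"
            "Failures:\n" + "\n".join(failures) + "\nReturn corrected code."
        )
        source = llm_generate(prompt)
    return source                            # best effort once the budget is spent
```

The loop makes the dependency explicit: if the initial generation is far off, or the checks cannot localize the error, refinement alone may not recover a correct model.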
High Computational Cost for Refinement
Achieving high code accuracy, especially for complex games like Gin Rummy, required a large number of LLM calls for iterative refinement (e.g., 500 calls), which translates directly into computational expense.
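As a rough illustration of how this scales, the arithmetic below uses placeholder token counts and prices; only the 500-call figure comes from the reported results, and actual cost depends on the model and prompt sizes used.

```python
# Back-of-the-envelope cost estimate for a refinement budget. Token counts
# and per-token prices are placeholders, not figures from the paper.
calls = 500                   # refinement calls cited for complex games (Gin Rummy)
tokens_per_call = 4_000       # assumed prompt + completion tokens per call
price_per_1k_tokens = 0.01    # assumed blended price in USD per 1K tokens

total_tokens = calls * tokens_per_call
estimated_cost = total_tokens / 1_000 * price_per_1k_tokens
print(f"{total_tokens:,} tokens ~= ${estimated_cost:.2f} per game")
# -> 2,000,000 tokens ~= $20.00 per game (under these placeholder assumptions)
```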
Limited Scope of Current Work
The current research focuses on two-player games and does not yet incorporate active/online learning of the world model or extend to open-world games with free-form text or visual interfaces. This limits its immediate applicability to more dynamic and less structured real-world scenarios.
Author-Created Novel Games
While the "out-of-distribution" games were created by the authors to avoid contamination from existing LLM training data, there is a potential for inadvertent alignment between these custom games and the LLM's internal representations, which might not fully reflect the challenges of truly novel, externally sourced OOD games.