Training large Transformer architectures for extended periods, as required by grokking, can be prohibitively expensive, limiting practical applicability and scalability.
Difficulty with Rare/Low-Frequency Relations
Achieving full generalization across all relations is challenging, particularly for rare or low-frequency relations, because they require substantial data augmentation that is difficult to supply at the needed scale.
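The imbalance can be made concrete with a back-of-the-envelope calculation: to bring every relation up to a uniform example count, the rarest relations consume almost the entire augmentation budget. The relation names and frequencies below are hypothetical, chosen only to illustrate the skew:

```python
from collections import Counter

def augmentation_needed(relation_counts, target):
    """Synthetic examples needed per relation to reach a uniform target count."""
    return {r: max(0, target - c) for r, c in relation_counts.items()}

# Hypothetical relation frequencies from a skewed corpus.
counts = Counter({"spouse": 5000, "director": 1200, "award_nominee": 40})
need = augmentation_needed(counts, target=5000)
print(need)  # the rare relation needs over 100x its original count
```

A rare relation such as the illustrative `award_nominee` must be augmented far beyond its natural frequency, which is exactly where generating high-quality synthetic examples is hardest.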
Sparse and Disconnected Knowledge Graphs
Real-world knowledge graphs are often sparse and disconnected, which inherently limits the number of multi-hop paths the model can learn from, hindering complex circuit formation.
Natural Language Challenges
Ambiguous references, unevenly distributed relations, and disjoint sub-graphs in real-world text make generating high-quality synthetic data for augmentation non-trivial and prone to noise.
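The ambiguity issue in particular is easy to illustrate: when one surface form maps to several entities, a template-generated question has no unique answer and injects noise if kept. A minimal filtering sketch, with a hypothetical alias table and function names:

```python
# Hypothetical alias table: one surface form may refer to several entities.
ALIASES = {
    "John Smith": ["John Smith (director)", "John Smith (explorer)"],
    "Curie": ["Marie Curie"],
}

def unambiguous_qa(template, mention, answer):
    """Keep a template-generated QA pair only if its subject mention
    resolves to exactly one entity; otherwise drop it as noise."""
    if len(ALIASES.get(mention, [])) != 1:
        return None
    return (template.format(entity=mention), answer)

print(unambiguous_qa("Who directed the film by {entity}?", "John Smith", "n/a"))  # None: ambiguous
print(unambiguous_qa("Where was {entity} born?", "Curie", "Warsaw"))
```

Real pipelines need far more than alias lookup (coreference, context-sensitive disambiguation), which is precisely why clean synthetic augmentation from natural text is non-trivial.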
Factuality Drift/Distortion
While synthetic data boosted generalization, there is a risk that factually incorrect synthetic examples could distort the model's real-world knowledge or cause factual fragility in high-stakes domains such as medical or legal reasoning.
Limited Scope of Benchmarks
The experiments primarily used Wikipedia-based QA, so the method's behavior on longer reasoning chains, specialized domains, and temporal reasoning remains unexplored.
Partial Understanding of Mechanisms
Although emergent generalization circuits are observed, the precise mechanics of how these circuits form within Transformers remain only partially understood, making targeted optimization difficult.
Inconsistencies in Dataset
The 2WikiMultiHopQA dataset itself contains grammatical inconsistencies and inconsistent ground-truth formats for some questions, which can cap accuracy below 100% regardless of the grokking mechanism.