Architectural Generalizability
The optimal TRM architecture (e.g., MLP versus self-attention mixing) is task-specific: a single "tiny network" design does not transfer unchanged across problem types, so some architectural tuning remains necessary for each new task.
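As an illustration (a hypothetical sketch, not the paper's implementation), the task-specific choice often amounts to swapping the token-mixing block inside an otherwise fixed tiny network: a fixed MLP mixer versus content-dependent self-attention.

```python
import numpy as np

def mlp_mix(tokens, w):
    # MLP mixer: mix information across the token dimension with a
    # fixed learned matrix (a natural fit for small, fixed-size grids).
    return np.tanh(w @ tokens)

def self_attention_mix(tokens, wq, wk, wv):
    # Single-head self-attention: content-dependent mixing, better
    # suited when the relevant structure varies per input.
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
tokens = rng.normal(size=(9, 4))  # 9 tokens of width 4
out_mlp = mlp_mix(tokens, rng.normal(size=(9, 9)))
out_attn = self_attention_mix(tokens, *(rng.normal(size=(4, 4)) for _ in range(3)))
# Both mixers are drop-in replacements with the same output shape,
# yet which one works best is an empirical, per-task choice.
assert out_mlp.shape == out_attn.shape == (9, 4)
```

The rest of the network can stay identical; only this mixing block is tuned per task, which is exactly the kind of residual architecture search the limitation describes.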
Lack of Theoretical Justification for Recursion's Benefit
The paper concedes that it offers no theoretical explanation for *why* deep recursion with small networks is so effective compared to larger, deeper models, attributing the gain to overfitting prevention without a formal proof.
Deterministic, Supervised-Only Outputs
TRM is a supervised model that produces a single deterministic answer. As the authors note, this limits its applicability to tasks where multiple valid answers exist or where generative capabilities are desired.
Small-Data Regime
The performance benefits are demonstrated primarily on small datasets (roughly 1,000 examples per task). The findings may not transfer, or carry the same significance, for problems with much larger training sets, where larger models typically thrive.
Computational Memory for Full Backpropagation
Although TRM simplifies ACT training, it backpropagates through the full recursion, so activations for every recursion step must be kept in memory. Increasing the number of recursion steps significantly can therefore trigger Out Of Memory (OOM) errors, a practical limitation for scaling.
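A minimal sketch (not the authors' code) of why full backpropagation through the recursion is memory-hungry: reverse-mode autodiff must cache every intermediate activation until the backward pass, so the cache grows linearly with the number of recursion steps.

```python
def recurse_full_backprop(z, x, step_fn, n_steps):
    """Run n_steps of z = step_fn(z, x), caching each intermediate
    the way an autodiff engine would for the backward pass."""
    cached = []           # stand-in for the autograd tape
    for _ in range(n_steps):
        z = step_fn(z, x)
        cached.append(z)  # memory grows linearly with n_steps
    return z, len(cached)

step = lambda z, x: z + x  # toy update standing in for the tiny network
_, cached_16 = recurse_full_backprop(0.0, 1.0, step, 16)
_, cached_64 = recurse_full_backprop(0.0, 1.0, step, 64)
# 4x the recursion steps means 4x the cached activations: with real
# tensor activations this is what eventually produces OOM errors.
```

Gradient checkpointing or truncated backpropagation would trade compute or gradient fidelity for memory, but the full-backprop formulation described here pays the linear memory cost directly.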