Reliance on Synthetic Data Composition
The core training relies on synthetically composed GSM8K math problems, which the authors themselves acknowledge are 'relatively artificial.' Although transfer to harder benchmarks is demonstrated, the method's effectiveness may not carry over to real-world long-horizon tasks, where clearly delineated atomic problems and simple chaining structures are not readily available.
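To make the concern concrete, the kind of serial composition at issue can be sketched as follows. This is a hypothetical illustration of chaining atomic problems (where each step's answer feeds the next), not the authors' actual data pipeline; the function names and problem template are invented for this sketch.

```python
import random

def make_atomic_step(x: int) -> tuple[str, int]:
    """Wrap an input quantity x in a simple one-step word problem.
    (Illustrative template only, not the paper's GSM8K composition.)"""
    a = random.randint(2, 9)
    b = random.randint(1, 9)
    question = f"Take the previous answer ({x}), multiply it by {a}, then add {b}."
    return question, x * a + b

def compose_chain(seed: int, horizon: int) -> tuple[list[str], int]:
    """Chain `horizon` atomic steps; a correct final answer requires
    every intermediate step to be solved correctly."""
    steps, value = [], seed
    for _ in range(horizon):
        question, value = make_atomic_step(value)
        steps.append(question)
    return steps, value

random.seed(0)
steps, answer = compose_chain(seed=3, horizon=4)
print(len(steps), answer)
```

The point of the sketch is that such chains have an unambiguous dependency structure and a checkable final answer, which is exactly what realistic long-horizon tasks often lack.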
Computational Cost Trade-off
The paper notes that data distributions skewed toward shorter, 'cheaper' problems can match the performance of more expensive distributions only by spending 'more training compute.' This implies a trade-off in which savings on data cost are paid for in computation, which could be a limitation in resource-constrained environments.
Simplified Theoretical Model
The theoretical analysis relies on a simplified model of long-horizon correctness. While this provides valuable insights into sample complexity, it is a simplification that may not fully capture the complexities of real-world LLM reasoning.
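To illustrate why such simplified models are tractable yet limited, a common idealization (stated here as an assumption for illustration, not necessarily the paper's exact model) treats each of the $h$ serial steps as succeeding independently with per-step accuracy $p$:

```latex
\Pr[\text{chain of } h \text{ steps correct}] \;=\; p^{h},
\qquad \text{so } p = 0.99,\; h = 100
\;\Rightarrow\; 0.99^{100} \approx 0.37 .
```

This captures the exponential decay that motivates long-horizon training, but real LLM reasoning violates the independence assumption: errors can be correlated across steps, and models can sometimes detect and recover from earlier mistakes.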
Limited Skill and Dependency Diversity
The discussion section points to future work on incorporating new sources of atomic skills beyond GSM8K and expanding the serial dependency structure. This suggests current limitations in the diversity of skills learned and the complexity of dependencies the method can handle.
Dependence on Base Model Choice
While the method shows significant improvements on Instruct models (Qwen-2.5-3B, Qwen-2.5-7B, Llama-3.2-3B), the LLM landscape evolves rapidly. The reported gains are relative to these chosen base models, and the magnitude of the improvement may need re-evaluation against newer, more capable foundation models.