Limited Scope of Evaluation Tasks
The method was evaluated primarily on factual question-answering tasks. While it is effective there, the authors acknowledge that real-world continual learning also involves more complex tasks such as reasoning and coding, where the current solution may not apply directly. This limits how far the reported benefits generalize to broader LLM applications.
Reliance on Specific Memory Layer Architecture
The proposed method is tightly coupled to the 'memory layer models' cited as Meta internal research. Its applicability is therefore restricted to LLMs that already incorporate this specific architecture; it is not a universal finetuning strategy for arbitrary LLMs. A sketch of the kind of layer the method presupposes is given below.
TF-IDF Ranking for Sparsity
The paper ranks memory slots for updating with a TF-IDF score, which works for the chosen tasks. However, the authors note that 'more sophisticated scoring functions or granularities' may be needed for other tasks or for finer-grained updates, so the optimality of TF-IDF across all continual learning scenarios is not guaranteed. A rough sketch of the ranking idea follows.
Scalability to Larger Models/Tasks
The experiments were conducted on a 1.3B-parameter model. While the results are promising, scaling to much larger LLMs (e.g., 70B+ parameters) and to more diverse, complex continual learning scenarios could introduce challenges that the current study does not address.
Hyperparameter Sensitivity and Optimizer Choice
The paper reports sensitivity to the choice of optimizer (AdamW vs. SGD) and learning rate, which requires careful tuning. The method may therefore be sensitive to hyperparameter choices in new, unseen continual learning settings, potentially affecting its robustness; a toy configuration sketch is shown below.