Limited Range of Evaluated Tasks
While the paper tests EVOL-RL on several mathematical and reasoning datasets, the conclusions would be strengthened by evaluating performance on a wider range of tasks, including natural language generation, commonsense reasoning, and other domains beyond math and logical problem-solving.
Computational Cost of Sampling
The paper reports drawing 64 samples per instance and using a smaller subset of 32 for the policy update. This sampling and generation process likely carries a significant computational cost, especially for larger models or more complex tasks, making the method less accessible to researchers with limited resources. A discussion of the computational demands and potential optimizations would be beneficial; a rough budget estimate is sketched below.
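As a back-of-envelope illustration of how the generation budget scales, the sketch below multiplies out the decoded tokens per training step. Only the 64-samples-per-instance figure comes from the paper; the prompt count and average completion length are hypothetical placeholders.

```python
def generation_cost(num_prompts: int,
                    samples_per_prompt: int = 64,
                    avg_tokens_per_sample: int = 1024) -> int:
    """Decoded tokens per training step. Only samples_per_prompt
    reflects the paper's setting; the other values are hypothetical."""
    return num_prompts * samples_per_prompt * avg_tokens_per_sample

# With a hypothetical 256 prompts per step and ~1k-token completions,
# each step decodes ~16.8M tokens before any update is applied.
print(f"{generation_cost(256):,} decoded tokens per step")
```

Even under these placeholder settings, generation dominates the per-step cost, which is why a discussion of optimizations (e.g., fewer samples or shorter completions) would be valuable.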
Choice of Novelty Metric
The novelty score is computed from the mean and max cosine similarity between embedding vectors, which may not fully capture the diversity of the underlying reasoning chains. The paper offers only intuitive justification for this choice rather than a rigorous comparison against alternative metrics (e.g., edit distance or Jaccard index); a sketch contrasting the two families of metrics follows.
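To make the comparison concrete, here is a minimal sketch of an embedding-based novelty score alongside a surface-level Jaccard alternative. The paper's exact formulation is not reproduced here; defining novelty as one minus the mean pairwise cosine similarity is an assumption, as are the function names.

```python
import numpy as np

def novelty_cosine(embeddings: np.ndarray) -> np.ndarray:
    """Novelty as 1 minus mean cosine similarity to the other
    samples in the group (an assumed reading of the paper's
    mean/max scheme, not its exact formula)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                       # pairwise cosine similarities
    n = embeddings.shape[0]
    mean_sim = (sims.sum(axis=1) - 1.0) / (n - 1)  # drop self-similarity
    return 1.0 - mean_sim

def novelty_jaccard(token_sets: list[set[str]]) -> list[float]:
    """Surface-level alternative: 1 minus mean Jaccard overlap of
    token sets, one baseline such a comparison could include."""
    scores = []
    for i, a in enumerate(token_sets):
        overlaps = [len(a & b) / len(a | b)
                    for j, b in enumerate(token_sets) if j != i]
        scores.append(1.0 - sum(overlaps) / len(overlaps))
    return scores
```

An ablation reporting how the chosen embedding-based score correlates (or disagrees) with such surface-level metrics would make the design choice far more convincing.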
Comparison with other Diversity-Promoting Techniques
The paper does not extensively discuss or compare EVOL-RL with existing diversity-promoting techniques for language models, such as entropy regularization or diverse decoding strategies, especially in the reinforcement learning setting. A more thorough empirical comparison would strengthen the paper's contributions.
Limited Evaluation of Adaptability
The paper presents adaptability as a potential benefit of EVOL-RL, but it does not provide dedicated experiments or analysis assessing how well the model adapts to new or unseen tasks, or whether EVOL-RL yields performance gains in continual learning scenarios.