Simulated User Data for Reasoning Tasks
The paper synthesized conversations by simulating users for reasoning tasks, rather than using genuine human interactions. This limits the ecological validity of the reported gains in reasoning, as the feedback was not organic real-world user input.
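For intuition, a minimal sketch of what LLM-driven user simulation for a reasoning task might look like is given below; the prompt wording, the gpt-4o model choice, and the simulate_user_followup helper are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of LLM-based user simulation for reasoning tasks.
# Model name, prompt wording, and helper name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def simulate_user_followup(problem: str, assistant_answer: str) -> str:
    """Ask an LLM to role-play the user reacting to a candidate solution."""
    prompt = (
        "You are role-playing a user who posed the following reasoning problem.\n"
        f"Problem: {problem}\n"
        f"Assistant's answer: {assistant_answer}\n"
        "Reply as the user would: point out mistakes, ask for clarification, "
        "or confirm the answer. Keep it short."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed simulator model; the paper's choice may differ
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content
```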
Reliance on LLM for Data Classification
A GPT-4o model was used to classify user follow-up messages into different feedback types. This introduces a dependency on another large language model, which could propagate its biases or classification errors into the training data used for RLHI.
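A hedged sketch of such an LLM-based classification step is shown below; the label set, prompt, and classify_followup helper are assumptions made for illustration, and only the use of GPT-4o comes from the paper.

```python
# Sketch of classifying user follow-ups into coarse feedback types with an LLM.
# The category list and prompt are assumptions, not the paper's exact taxonomy.
from openai import OpenAI

client = OpenAI()

FEEDBACK_TYPES = ["positive", "negative", "correction", "clarification_request", "other"]

def classify_followup(followup: str) -> str:
    """Map a user follow-up message to one of the assumed feedback categories."""
    prompt = (
        "Classify the following user follow-up message into exactly one of these "
        f"categories: {', '.join(FEEDBACK_TYPES)}.\n"
        f"Message: {followup}\n"
        "Answer with the category name only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # classifier model reported by the paper
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in FEEDBACK_TYPES else "other"
```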
Reliance on LLM Judges for Evaluation
A significant portion of the evaluation, including personalization and instruction-following, relied on other large language models (an OpenAI o3-based judge and GPT-4 Turbo) rather than exclusively human judges. This risks circular evaluation, in which models are judged against criteria learned by similar models, potentially missing human nuances or introducing LLM-specific biases.
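A rough sketch of the pairwise LLM-as-judge scoring this refers to is shown below; the JUDGE_MODEL placeholder, prompt wording, and verdict parsing are assumptions rather than the paper's exact protocol.

```python
# Rough sketch of pairwise LLM-as-judge evaluation; prompt and parsing are assumptions.
from openai import OpenAI

client = OpenAI()
JUDGE_MODEL = "gpt-4-turbo"  # stand-in; the paper uses an o3-based judge and GPT-4 Turbo

def judge_pair(prompt: str, response_a: str, response_b: str) -> str:
    """Return 'A', 'B', or 'tie' according to the judge model's stated preference."""
    judge_prompt = (
        "You are comparing two assistant responses to the same user prompt.\n"
        f"Prompt: {prompt}\n\nResponse A:\n{response_a}\n\nResponse B:\n{response_b}\n\n"
        "Which response better follows the user's instructions and preferences? "
        "Answer with exactly one of: A, B, tie."
    )
    reply = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    verdict = reply.choices[0].message.content.strip()
    return verdict if verdict in {"A", "B", "tie"} else "tie"
```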
Noisy Real-World Interaction Data
The paper acknowledges that real-world human interaction data is inherently noisy, often containing low-quality prompts, harmful feedback, or inconsistent signals. Filtering techniques are applied to mitigate this, which means the models do not learn from purely raw, unfiltered organic interaction.
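A minimal sketch of the kind of heuristic filtering this implies is shown below; the thresholds, keyword list, and keep_interaction helper are invented for illustration and are not the paper's actual filters.

```python
# Illustrative filtering heuristics for noisy interaction data; thresholds and
# the keyword list are placeholders, not the paper's actual criteria.
HARMFUL_KEYWORDS = {"kill", "bomb", "self-harm"}  # placeholder list

def keep_interaction(prompt: str, feedback: str) -> bool:
    """Drop interactions with very short prompts, empty feedback, or harmful content."""
    if len(prompt.split()) < 3:          # low-quality / underspecified prompt
        return False
    if not feedback.strip():             # no usable feedback signal
        return False
    text = (prompt + " " + feedback).lower()
    if any(word in text for word in HARMFUL_KEYWORDS):  # crude harmfulness screen
        return False
    return True

# Usage:
# filtered = [ex for ex in raw_interactions
#             if keep_interaction(ex["prompt"], ex["feedback"])]
```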
Single Base Model
All models were initialized from Llama-3.1-8B-Instruct. While this is common practice, the observed improvements are specific to this base model and might not generalize identically, or as effectively, to other foundation models.
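For reference, a minimal sketch of initializing from this checkpoint with Hugging Face transformers is shown below; the tooling choice is an assumption, and the paper's actual training stack and fine-tuning details are not covered here.

```python
# Minimal sketch: load the Llama-3.1-8B-Instruct checkpoint as the starting point,
# assuming the Hugging Face transformers library; RLHI fine-tuning itself is omitted.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")
```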