Limited Task and Instruction Complexity
The benchmark focuses solely on keyword inclusion in a business report, which is a narrow and potentially unrealistic proxy for real-world instruction following. In practice, LLMs are often given far more complex instructions that involve reasoning, tool use, and interaction.
Limited Domain and Language
The study uses only one domain (business reports based on SEC filings), which raises concerns about generalizability. The findings may not hold for instructions in other domains or for instructions with different linguistic structures.
Oversimplified Evaluation Metric
The evaluation method, based on keyword matching, doesn't capture the nuances of instruction following, such as intent understanding, coherence, and factual accuracy.
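To make this concern concrete, here is a minimal Python sketch of a keyword-inclusion scorer. It is not the study's implementation: the function name `keyword_inclusion_rate`, the case-insensitive whole-word matching rule, and the sample keywords and texts are all assumptions made for illustration.

```python
import re


def keyword_inclusion_rate(text: str, keywords: list[str]) -> float:
    """Hypothetical metric: fraction of required keywords found in text.

    Matching is case-insensitive and whole-word; this rule is an assumption,
    not the matching rule used in the study under review.
    """
    if not keywords:
        return 1.0
    lowered = text.lower()
    hits = sum(
        1
        for kw in keywords
        if re.search(r"\b" + re.escape(kw.lower()) + r"\b", lowered)
    )
    return hits / len(keywords)


# Illustrative inputs (invented for this sketch, not taken from the benchmark).
keywords = ["revenue", "liquidity", "guidance"]
coherent = "Revenue grew this quarter, liquidity remains strong, and guidance was raised."
stuffed = "revenue liquidity guidance"

coherent_score = keyword_inclusion_rate(coherent, keywords)
stuffed_score = keyword_inclusion_rate(stuffed, keywords)
print(coherent_score, stuffed_score)
```

Under these assumptions, both the coherent sentence and the bare keyword list receive the same score, so a metric of this kind cannot by itself separate a usable report from keyword stuffing, nor can it check intent, coherence, or factual accuracy.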
Lack of Analysis on Instruction Interactions
The study does not analyze how different types of instructions interact or how instruction phrasing affects the results. Both are crucial in real-world scenarios, where multiple guidelines must be followed simultaneously.
Lack of Practical Recommendations
While the study identifies degradation patterns, it does not offer concrete recommendations or strategies for mitigating them. Practical advice for prompt engineering and LLM development would strengthen the impact of the findings.