Limited Persona Coverage
The study acknowledges testing only a limited set of personas, focusing on sex, ethnicity, and migrant status while excluding other attributes such as gender identity, sexual orientation, religion, and age. This narrow scope may not fully capture the complexity of bias in LLMs and could miss other forms of discrimination.
Limited Benchmark and Language
The evaluation relies solely on the MMLU benchmark and a salary negotiation scenario, both in English. A single benchmark and a single language may not be representative of LLM performance and bias across the diverse tasks and languages in which these models are used.
Noisy Generative Evaluation
Although statistical tests were applied, the generative evaluation method used in Experiments 1 and 2 is known to be noisy and sensitive to minor prompt changes. Because each combination was run only once, this noise could undermine the reliability of the reported differences.
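As an illustration of the kind of robustness check that would address this concern, the sketch below repeats each persona and model combination over several prompt paraphrases and seeds and reports the spread of scores rather than a single point estimate. The `query_model` function and the paraphrase templates are hypothetical placeholders, not the paper's actual evaluation code.

```python
# Minimal sketch of a variance check for the single-run concern: repeat each
# persona x model combination over several prompt paraphrases and seeds, then
# report the mean and spread of scores instead of one point estimate.
import statistics

def query_model(model: str, persona: str, prompt: str, seed: int) -> float:
    """Hypothetical call returning an accuracy score for one run."""
    raise NotImplementedError

# Hypothetical paraphrases of the persona instruction.
PARAPHRASES = [
    "You are {persona}. Answer the following question.",
    "Answer the following question as {persona}.",
    "Adopt the persona of {persona} and respond.",
]

def score_with_variance(model: str, persona: str, seeds=range(5)):
    scores = [
        query_model(model, persona, template.format(persona=persona), seed)
        for template in PARAPHRASES
        for seed in seeds
    ]
    # A standard deviation comparable to the persona gaps themselves would
    # suggest that single-run differences may be noise rather than bias.
    return statistics.mean(scores), statistics.stdev(scores)
```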
Limited Generalizability of Salary Negotiation
The salary negotiation scenario in Experiment 3 is limited to a single US city (Denver) and a specific job title ('Specialist'). The results might not generalize to other locations, job titles, or cultural contexts, limiting their external validity.
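To probe external validity, the same negotiation prompt could be swept over a grid of cities and job titles. The sketch below is illustrative only: `ask_for_salary`, the city list, and the job-title list are assumptions rather than details from the paper.

```python
# Illustrative sweep over cities and job titles to test whether persona-related
# salary gaps generalise beyond the Denver/'Specialist' setting.
from itertools import product

CITIES = ["Denver", "New York", "Berlin", "Bangalore"]        # assumed examples
JOB_TITLES = ["Specialist", "Software Engineer", "Nurse"]      # assumed examples

PROMPT = (
    "You are a {persona} applying for a {job} position in {city}. "
    "What starting salary should you ask for in the negotiation?"
)

def ask_for_salary(model: str, prompt: str) -> float:
    """Hypothetical helper that parses a salary figure from the model's reply."""
    raise NotImplementedError

def sweep(model: str, personas: list[str]) -> dict:
    results = {}
    for persona, city, job in product(personas, CITIES, JOB_TITLES):
        prompt = PROMPT.format(persona=persona, job=job, city=city)
        results[(persona, city, job)] = ask_for_salary(model, prompt)
    # Comparing persona gaps across (city, job) cells would show whether the
    # effect holds beyond a single location and title.
    return results
```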
Limited Socio-Economic Factors
The study focuses on a single socio-economic factor (the pay gap) and does not explore other relevant factors, such as wealth, education, or social status, which could also influence or reflect LLM bias.
Limited Model Selection
The study is limited to five commercially available large language models, which may affect the generalizability of the results.