The study was funded by Google LLC, and all authors are employees of Alphabet (Google's parent company) who may own stock. Because the paper evaluates PH-LLM, a model built on Google's proprietary Gemini LLM, the developers are effectively evaluating their own product, a direct conflict of interest that introduces a high risk of bias.
Subjectivity and potential for rater bias in evaluations
Expert evaluation of long-form text is inherently subjective. A single expert often rated all candidate responses, so raters could potentially recognize a model's or an expert's writing style, introducing conscious or unconscious bias.
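One way to reduce this risk would be to randomize and blind the assignment of candidate responses across a pool of raters. The sketch below is a minimal illustration under assumed data structures (a responses_by_case mapping whose entries carry hypothetical source and text fields); it is not the protocol used in the paper.

```python
import random

def assign_blinded_ratings(responses_by_case, raters, seed=0):
    """Shuffle candidate responses within each case and rotate cases across a
    pool of raters, so no single rater scores every response and provenance
    (model vs. expert) stays hidden behind anonymous labels."""
    rng = random.Random(seed)
    assignments = []
    for case_id, responses in responses_by_case.items():
        shuffled = list(responses)
        rng.shuffle(shuffled)              # remove ordering cues to provenance
        rater = rng.choice(raters)         # spread cases over the rater pool
        assignments.append({
            "case_id": case_id,
            "rater": rater,
            # raters only ever see anonymous labels ...
            "blinded": {f"candidate_{i+1}": r["text"]
                        for i, r in enumerate(shuffled)},
            # ... while the true sources are kept separately for later scoring
            "provenance": {f"candidate_{i+1}": r["source"]
                           for i, r in enumerate(shuffled)},
        })
    return assignments

# Hypothetical input: two candidate responses for one case study
cases = {"case_001": [{"source": "PH-LLM", "text": "..."},
                      {"source": "expert", "text": "..."}]}
print(assign_blinded_ratings(cases, raters=["rater_A", "rater_B", "rater_C"]))
```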
Small human expert sample for comparison
Only five sleep medicine experts and five athletic trainers were recruited to take the examinations. This is a very small sample against which to benchmark an LLM evaluated on hundreds of questions and case studies, and it weakens claims of 'exceeding human experts'.
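To see why n = 5 is fragile, consider the uncertainty on the mean score of five raters. The sketch below uses made-up scores (not the paper's data) and assumes the comparison of interest is mean exam score.

```python
import random
import statistics

def bootstrap_mean_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of a small sample."""
    rng = random.Random(seed)
    boot_means = sorted(
        statistics.mean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    lower = boot_means[int(n_boot * alpha / 2)]
    upper = boot_means[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper

# Made-up exam scores (percent correct) for five experts; NOT the paper's data.
expert_scores = [71, 68, 80, 75, 62]
print(bootstrap_mean_ci(expert_scores))
# The 95% interval spans many percentage points, so a single point comparison
# against the model's score says little about the broader expert population.
```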
Confabulations and incorrect referencing
Despite finetuning, the model occasionally still produced confabulations (fabricated or distorted information) and incorrectly referenced user data, which is a critical safety concern for deployment in health applications.
Unrepresentative study samples
The case studies and patient-reported outcome (PRO) prediction data were drawn from self-selected wearable device users and may not represent the general population; the samples show demographic skews (e.g., more women in the PRO cohort, more men in the fitness cohort, and over-representation of older age groups). Race and ethnicity data were also not collected.
Limited scope of sensor data utilization
The study primarily used textual representations of aggregated daily-resolution sensor data, rather than raw time-series data, potentially limiting the depth and nuance of insights that could be derived from wearable devices.
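For intuition, a daily-aggregate textual encoding might look like the hypothetical sketch below (the schema and values are illustrative, not the paper's); the aggregation step necessarily discards intra-day dynamics such as minute-level heart rate or sleep-stage transitions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DailySummary:
    """Hypothetical daily aggregates; field names are illustrative only."""
    date: str
    total_sleep_min: int   # whole-night total; sleep-stage structure is lost
    resting_hr_bpm: int    # single daily value; minute-level HR is lost
    steps: int

def to_prompt_text(days: List[DailySummary]) -> str:
    """Flatten daily aggregates into the kind of textual block an LLM consumes."""
    return "\n".join(
        f"{d.date}: sleep {d.total_sleep_min} min, "
        f"resting HR {d.resting_hr_bpm} bpm, {d.steps} steps"
        for d in days
    )

days = [DailySummary("2024-05-01", 412, 58, 9300),
        DailySummary("2024-05-02", 365, 61, 12100)]
print(to_prompt_text(days))
```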
Suboptimal performance in specific fitness task
PH-LLM was outperformed by both the base Gemini Ultra 1.0 model and human experts on the 'training load' section of the fitness case studies, a specific area where fine-tuning did not yield superior performance.
Lack of user preference studies
The study did not include blinded user preference studies assessing how the model's recommendations compare with human coaching from the user's perspective, an evaluation that is crucial for real-world applicability.
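Such a study could take the form of blinded pairwise comparisons with real users; the sketch below is a hypothetical outline (the function names, data, and stubbed-out choice step are all assumptions, not the paper's design).

```python
import random
from collections import Counter

def simulate_user_choice(rng, shown):
    """Stub for the real UI step; here the 'user' simply picks at random."""
    return rng.choice(["A", "B"])

def run_preference_trial(pairs, seed=0):
    """Blinded pairwise comparison: each user sees a model plan and a human
    coach's plan in random order, labelled only A/B, and picks one."""
    rng = random.Random(seed)
    tallies = Counter()
    for user_id, (model_resp, coach_resp) in pairs.items():
        options = [("model", model_resp), ("coach", coach_resp)]
        rng.shuffle(options)                    # hide which side is which
        shown = {"A": options[0], "B": options[1]}
        choice = simulate_user_choice(rng, shown)
        tallies[shown[choice][0]] += 1          # tally wins by true source
    return tallies

pairs = {"user_01": ("model plan ...", "coach plan ..."),
         "user_02": ("model plan ...", "coach plan ...")}
print(run_preference_trial(pairs))
```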
Generalizability concerns for safe deployment
The authors themselves caution that significant work remains to ensure LLMs are reliable, safe, and equitable for personal health applications in real-world settings, highlighting that the current model is not yet ready for broad deployment.