Labor-Intensive Task Creation
Creating each task is highly manual and time-consuming, requiring 3-5 hours of expert effort per task, which limits how easily the benchmark can be scaled or extended.
Benchmark Difficulty for Smaller Models
MCPMark's high difficulty makes it less useful for evaluating and guiding the development of smaller, more efficient AI models, potentially hindering progress for that segment of research.
Simplified Agent Framework
The evaluation uses a minimal agent framework to avoid bias, but this means it lacks optimizations found in production systems. This might not fully reflect how models perform within more advanced, real-world agentic setups.
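For illustration, the sketch below shows the kind of single-loop, tool-calling harness a "minimal agent framework" implies: no retries, planning modules, or long-term memory. The function names (call_model, call_mcp_tool) and data types are hypothetical stand-ins, not MCPMark's actual code.

```python
# Minimal sketch of a single-loop tool-calling agent (assumed structure,
# not MCPMark's implementation).
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class ModelTurn:
    text: str
    tool_calls: list[ToolCall] = field(default_factory=list)


def call_model(messages: list[dict], tools: list[dict]) -> ModelTurn:
    """Hypothetical LLM call; a real harness would hit a provider API here."""
    return ModelTurn(text="done")  # stub so the sketch runs end to end


def call_mcp_tool(call: ToolCall) -> str:
    """Hypothetical dispatch of one tool call to an MCP server."""
    return f"(result of {call.name})"  # stub


def run_agent(task_prompt: str, tools: list[dict], max_steps: int = 20) -> list[dict]:
    """Plain observe-act loop: query the model, execute any requested tools,
    feed results back, and stop when the model stops calling tools."""
    messages = [{"role": "user", "content": task_prompt}]
    for _ in range(max_steps):
        turn = call_model(messages, tools)
        messages.append({"role": "assistant", "content": turn.text})
        if not turn.tool_calls:  # model considers the task finished
            break
        for call in turn.tool_calls:
            result = call_mcp_tool(call)
            messages.append({"role": "tool", "name": call.name, "content": result})
    return messages
```

Production agent stacks typically add retry logic, sub-agent planning, and context management on top of such a loop, which is exactly what this minimal setup omits.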
Limited Environment Coverage
Although the benchmark's five MCP environments (Notion, GitHub, Filesystem, PostgreSQL, Playwright) are diverse, they cover only a slice of the digital tools agents encounter in practice; a broader range of environments could surface further challenges.
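As a concrete, illustrative aside, MCP environments are usually wired in as independent server processes, so extending coverage is largely a matter of registering more servers. The package names below are commonly published MCP servers and may not match MCPMark's actual setup.

```python
# Illustrative only: the five environments expressed in the conventional
# MCP client-configuration shape (assumed packages, not MCPMark's config).
MCP_SERVERS = {
    "filesystem": {"command": "npx", "args": ["-y", "@modelcontextprotocol/server-filesystem", "/workspace"]},
    "github": {"command": "npx", "args": ["-y", "@modelcontextprotocol/server-github"]},
    "postgres": {"command": "npx", "args": ["-y", "@modelcontextprotocol/server-postgres", "postgresql://localhost/mydb"]},
    "playwright": {"command": "npx", "args": ["-y", "@playwright/mcp@latest"]},
    "notion": {"command": "npx", "args": ["-y", "@notionhq/notion-mcp-server"]},
}
# A sixth environment (e.g., email or calendar) would be one more entry here,
# which suggests broader coverage is technically feasible.
```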
Lack of Fine-Grained Difficulty Gradient
The current tasks are uniformly challenging, making it difficult to assess incremental improvements or tailor evaluations for models at different stages of development.
No Testing of Ambiguity Handling
All tasks are clearly and fully specified, so the benchmark does not test an agent's ability to handle vague instructions, ask clarifying questions, or infer user intent, all of which are critical in real-world interaction.
Implicit Conflict of Interest
The benchmark heavily features and evaluates proprietary models from major AI companies (e.g., OpenAI's GPT-5, Anthropic's Claude), with the top performer being 'gpt-5-medium'. While the authors are primarily from academic institutions and tooling companies, the central role and favorable performance of specific commercial products could create the appearance of an indirect conflict of interest.