Dependency on Pre-trained 3D Foundation Model
The method relies on MASt3R, a large pre-trained 3D foundation model. This benefits performance, but it also means the approach is not self-contained and inherits potential biases or limitations from MASt3R's pre-training.
Domain Generalization Requires Calibration/Fine-tuning
Although the method is robust, performance regressed when a model trained on an indoor dataset (ScanNet++) was applied directly to an outdoor dataset (MapFree). Recalibration or fine-tuning was required for optimal performance in significantly different visual domains.
No Explicit Discussion of Computational Cost for Deployment
The paper reports training time and the time for a single forward pass, but it does not analyze the real-time computational demands of robotic deployment beyond inference speed. This could matter in real-world applications on constrained hardware.
Limited Exploration of Failure Modes
The paper highlights successes, especially in challenging scenarios such as perceptual instance aliasing. However, a more detailed discussion or qualitative analysis of specific failure modes, beyond those visible in the baseline comparisons, would give a more complete picture of the method's limitations.