
Computer Vision and Pattern Recognition

The automatic extraction and interpretation of information from images and video, including object detection, facial recognition, medical imaging analysis, autonomous vehicles, and visual understanding systems.

33 papers

Papers

WORLDMIRROR: UNIVERSAL 3D WORLD RECONSTRUCTION WITH ANY-PRIOR PROMPTING

This paper introduces WorldMirror, a novel AI model that can reconstruct 3D scenes from images and various "hints" like camera data or depth maps, generating multiple 3D representations simultaneously. It achieves state-of-the-art performance across diverse 3D reconstruction tasks by flexibly integrating these priors, although it shows suboptimal performance on dynamic scenes due to training data limitations. The model demonstrates strong generalization and efficiency, showcasing a promising direction for universal 3D scene understanding.

Computer Vision and Pattern Recognition Oct 23, 10:33 AM

Do generative video models understand physical principles?

This paper introduces Physics-IQ, a comprehensive real-world benchmark for testing whether generative video models truly understand physical principles such as gravity or fluid dynamics. Across a range of current models (e.g., Sora, VideoPoet), physical understanding proves severely limited even when the generated videos look highly realistic, leading the authors to conclude that visual realism does not imply physical understanding, a significant gap in current AI capabilities.

Computer Vision and Pattern Recognition Oct 12, 08:12 PM

SegMASt3R: Geometry Grounded Segment Matching

This paper presents SegMASt3R, a novel method for matching coherent image segments across extreme viewpoint changes using 3D foundation models. The approach significantly outperforms state-of-the-art methods on wide-baseline segment matching benchmarks and demonstrates practical utility in robotic navigation and 3D instance mapping. While highly effective, its generalization to vastly different visual domains (e.g., indoor to outdoor) still benefits from recalibration or fine-tuning.

Computer Vision and Pattern Recognition Oct 10, 01:12 PM

LARGE SCALE DIFFUSION DISTILLATION VIA SCORE-REGULARIZED CONTINUOUS-TIME CONSISTENCY

This paper introduces rCM, a new method that fixes quality issues of previous consistency models, enabling faster and better large-scale image and video generation. The technique combines forward-divergence consistency distillation with reverse-divergence score distillation, and the authors (from NVIDIA and Tsinghua University) show that rCM accelerates diffusion sampling by up to 50x on large proprietary text-to-image and text-to-video models while delivering competitive quality and superior diversity in only a few sampling steps.

Computer Vision and Pattern Recognition Oct 10, 12:06 PM

GLVD: Guided Learned Vertex Descent

This paper introduces GLVD, a new hybrid method for creating high-fidelity 3D face reconstructions from just a few images. It combines local neural fields with global 3D keypoint guidance to achieve accurate and adaptable geometry without relying on rigid prior models. GLVD delivers state-of-the-art accuracy while significantly reducing reconstruction time.

Computer Vision and Pattern Recognition Oct 08, 04:32 PM

VISUAL ODOMETRY WITH TRANSFORMERS

This paper introduces VoT, an end-to-end transformer model for monocular visual odometry, which directly predicts camera motion from video sequences without relying on complex, hand-crafted components or post-processing. VoT demonstrates competitive accuracy across indoor and outdoor datasets, significant speed improvements (3x faster), and robust scaling, though its performance may be limited in dynamic environments due to training on static data.
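
To make the "end-to-end" claim concrete, here is a minimal sketch of a pose-regression transformer. It is not the authors' architecture (module sizes, tokenization, and the 6-DoF head are placeholder choices): two stacked frames are patch-embedded, encoded, and a single learned token is regressed to relative camera motion.

```python
# Hypothetical sketch, not the paper's model: patch-embed two stacked RGB
# frames, run a transformer encoder, regress 6-DoF relative motion from a
# learned [CLS]-style token. All sizes are illustrative.
import torch
import torch.nn as nn

class TinyVOTransformer(nn.Module):
    def __init__(self, patch=16, dim=256, depth=4, heads=8, img=224):
        super().__init__()
        n = (img // patch) ** 2
        self.embed = nn.Conv2d(6, dim, kernel_size=patch, stride=patch)  # 6 = 2 RGB frames
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))              # learned positions
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 6)        # (tx, ty, tz) + axis-angle rotation

    def forward(self, pair):                 # pair: (B, 6, H, W)
        tok = self.embed(pair).flatten(2).transpose(1, 2)
        tok = torch.cat([self.cls.expand(tok.size(0), -1, -1), tok], 1) + self.pos
        return self.head(self.encoder(tok)[:, 0])

pose = TinyVOTransformer()(torch.randn(1, 6, 224, 224))   # -> (1, 6) motion vector
```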

Computer Vision and Pattern Recognition Oct 07, 06:54 PM

Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields

This paper investigates the performance of visual-only versus visual-geometry semantic features in 3D scene representations (radiance fields) for robotic tasks like object localization and camera pose estimation. While visual-geometry features show finer spatial details, they surprisingly perform similarly for object localization and actually *underperform* visual-only features in camera pose estimation. The findings suggest that current visual-only features are more versatile for these applications.

Computer Vision and Pattern Recognition Oct 06, 04:19 PM

Visual Humanoid Loco-Manipulation via Motion Tracking and Generation

This paper introduces VisualMimic, a framework enabling humanoid robots to perform various physical tasks like pushing and kicking objects by using their whole bodies and visual perception. It successfully transfers skills learned in virtual simulations to real-world robots, allowing them to adapt to different environments without extra human help. The approach advances humanoid robot control by integrating egocentric vision with hierarchical whole-body control.

Computer Vision and Pattern Recognition Oct 04, 10:52 AM

A SCENE IS WORTH A THOUSAND FEATURES: FEED-FORWARD CAMERA LOCALIZATION FROM A COLLECTION OF IMAGE FEATURES

This paper introduces FastForward, a novel computer vision method for quickly and accurately determining a camera's exact location and orientation in a 3D scene. By representing scenes as a sparse collection of image features and using a single feed-forward neural network pass, FastForward significantly reduces the time and resources required for mapping a scene while achieving state-of-the-art or comparable accuracy to existing methods across diverse indoor and outdoor environments. The approach also demonstrates robust generalization to unseen domains and varying scale ranges thanks to a scene and scale normalization technique.

Computer Vision and Pattern Recognition Oct 02, 02:36 PM

CMT: MID-TRAINING FOR EFFICIENT LEARNING OF CONSISTENCY, MEAN FLOW, AND FLOW MAP MODELS

The paper introduces Consistency Mid-Training (CMT), a novel intermediate training stage designed to significantly improve the efficiency, stability, and performance of flow map models for vision generation. CMT acts as a bridge between pre-training (diffusion models) and post-training (flow map models), providing a trajectory-consistent initialization that reduces total training cost (data and GPU time) by up to 98% compared to baselines, while achieving state-of-the-art FID scores on various image generation benchmarks. The theoretical analysis confirms that CMT provides a strong starting point for flow map post-training, minimizing gradient bias and accelerating convergence.

Computer Vision and Pattern Recognition Oct 02, 10:46 AM

Understanding the Effective Receptive Field in Deep Convolutional Neural Networks

This paper introduces the concept of an "effective receptive field" (ERF) in deep convolutional neural networks (CNNs), showing it's smaller than the theoretical receptive field and follows a Gaussian distribution. The authors analyze how ERF size is affected by factors like network depth, non-linear activations, and skip connections, and they suggest ways to increase ERF size, such as modified initialization and architectural changes.
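
The paper's measurement idea is easy to reproduce: backpropagate a unit gradient from a single central output unit and inspect where the input gradient mass lands. A minimal PyTorch sketch (the 10-layer conv stack and sizes are placeholders):

```python
# Empirical ERF measurement in the spirit of the paper: seed a gradient of 1
# at the central output pixel, backpropagate, and look at the input gradient.
import torch
import torch.nn as nn

net = nn.Sequential(*[nn.Conv2d(1, 1, 3, padding=1) for _ in range(10)])
x = torch.randn(1, 1, 64, 64, requires_grad=True)
y = net(x)
grad_seed = torch.zeros_like(y)
grad_seed[0, 0, 32, 32] = 1.0          # unit gradient at the central output unit
y.backward(grad_seed)
erf = x.grad.abs().squeeze()           # 64x64 map; mass concentrates in a Gaussian-like blob
print(f"ERF mass within the central 16x16: {erf[24:40, 24:40].sum() / erf.sum():.2%}")
```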

Computer Vision and Pattern Recognition Sep 11, 08:15 PM

Tutorial on Diffusion Models for Imaging and Vision

This tutorial provides a comprehensive overview of diffusion models, tracing their development from variational autoencoders (VAEs) to denoising diffusion probabilistic models (DDPMs) and score-matching Langevin dynamics (SMLD). It also explores the connection between diffusion models and stochastic differential equations (SDEs), providing insights into their underlying principles and behavior.
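
For orientation, two of the central objects the tutorial connects can be stated compactly: the DDPM forward corruption in closed form, and the reverse-time SDE whose drift involves the score. (Standard formulas, reproduced here for reference.)

```latex
% DDPM forward corruption in closed form:
q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\big),
\qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s .
% Reverse-time SDE for sampling, driven by the score \nabla_x \log p_t(x):
\mathrm{d}x = \big[ f(x, t) - g(t)^2 \, \nabla_x \log p_t(x) \big] \mathrm{d}t + g(t)\, \mathrm{d}\bar{w} .
```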

Computer Vision and Pattern Recognition Sep 11, 05:49 PM

Sensory Robustness through Top-Down Feedback and Neural Stochasticity in Recurrent Vision Models

This study explores how top-down feedback and simulated neural noise (dropout) affect the performance of convolutional recurrent neural networks (ConvRNNs) on image classification. The authors found that only when top-down feedback and dropout were combined did the ConvRNNs become more robust to noisy or manipulated images, outperforming models with either feature alone.

Computer Vision and Pattern Recognition Sep 09, 08:51 PM

B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens

B-VLLM improves video understanding in large language models by cleverly selecting key frames and visual details, balancing spatial and temporal information. It shows good performance on various video benchmarks but has limitations in handling multi-round conversations about the same video, requiring repeated processing and adding computational cost.

Computer Vision and Pattern Recognition Sep 06, 08:27 PM

MobileCLIP2: Improving Multi-Modal Reinforced Training

This paper introduces MobileCLIP2, a family of smaller and faster image-text models based on CLIP, optimized for mobile devices. By improving the training data and process, MobileCLIP2 achieves state-of-the-art zero-shot image classification accuracy on ImageNet-1k while being significantly smaller and faster than comparable models. Notably, some variants trade off a small amount of retrieval performance for improved classification accuracy.

Computer Vision and Pattern Recognition Aug 29, 07:33 PM

Self-Rewarding Vision-Language Model via Reasoning Decomposition

This paper introduces Vision-SR1, a method to improve how AI understands images and text by having it check its own work. Specifically, the AI generates a description of an image, then tries to answer a related question using *only* that description, without looking back at the image. This helps it learn to pay more attention to visual details and avoid taking language shortcuts. The method showed improved accuracy on several tasks, though further investigation is needed to isolate the source of these improvements.
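
Schematically, the self-reward loop looks like the sketch below; `generate` is a stubbed stand-in for a vision-language model call, not any particular library's API:

```python
# Minimal sketch of the self-reward decomposition described above, with a
# hypothetical model interface (generate() is a placeholder, not a real API).
def generate(prompt, image=None):
    return "stub output"          # stand-in for a vision-language model call

def self_reward(image, question, gold_answer):
    # Step 1: describe the image.
    description = generate("Describe this image in detail.", image=image)
    # Step 2: answer the question from the description ALONE (no image),
    # so the reward only flows through what the description captured.
    answer = generate(f"Context: {description}\nQuestion: {question}\nAnswer:")
    return 1.0 if answer.strip() == gold_answer.strip() else 0.0
```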

Computer Vision and Pattern Recognition Aug 28, 03:09 AM

Registration beyond Points: General Affine Subspace Alignment via Geodesic Distance on Grassmann Manifold

This paper proposes a new method for registering lines and planes in 3D by minimizing the geodesic distance on the Grassmann manifold, which offers a more theoretically sound and robust approach compared to existing methods that rely on Euclidean distances or point approximations. Experimental results on object registration, RGB-D odometry, and camera pose estimation demonstrate improved accuracy and convergence, especially in the presence of outliers.
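
For reference, the geodesic distance between two k-dimensional subspaces with orthonormal bases U and V is computed from their principal angles; the formula below is the textbook linear-subspace case, while the paper's contribution is building a registration objective on this distance that handles affine lines and planes:

```latex
% Principal angles between span(U) and span(V), U and V orthonormal bases:
\cos\theta_i = \sigma_i\!\left(U^{\top} V\right), \quad i = 1, \dots, k .
% Geodesic (arc-length) distance on the Grassmann manifold Gr(k, n):
d_{\mathrm{geo}}\big(\operatorname{span}(U), \operatorname{span}(V)\big)
  = \big\lVert (\theta_1, \dots, \theta_k) \big\rVert_2 .
```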

Computer Vision and Pattern Recognition Aug 12, 12:40 PM

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

This paper introduces Cambrian-1, a family of open-source multimodal large language models (MLLMs) focused on improving visual understanding. Cambrian-1 achieves state-of-the-art performance on several benchmarks, matching or exceeding some proprietary models. The authors also develop a new vision-centric benchmark and propose a more efficient connector design for vision and language integration.

Computer Vision and Pattern Recognition Aug 11, 04:51 PM

Meta CLIP 2: A Worldwide Scaling Recipe

This paper introduces Meta CLIP 2, a new model trained on a massive dataset of image-text pairs from various languages, resulting in improved performance on both English and multilingual tasks. The key innovation is a scaling recipe involving metadata, curation, and training capacity adjustments. The model achieves state-of-the-art results on several multilingual benchmarks, including XM3600, Babel-ImageNet, and CVQA.

Computer Vision and Pattern Recognition Aug 09, 12:40 PM

Streaming 4D Visual Geometry Transformer

This paper introduces StreamVGGT, a causal transformer model that reconstructs 4D spatial-temporal geometry from video in real-time. By caching historical tokens and using causal attention, it processes video frames incrementally, offering faster inference than traditional methods while maintaining competitive accuracy thanks to knowledge distillation from a more computationally expensive teacher model.
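
The caching pattern the summary describes can be illustrated in a few lines (a generic sketch, not the authors' implementation): keys and values from past frames are retained, so each incoming frame attends to history without re-encoding it.

```python
# Generic token-cache sketch: causal by construction, since queries from the
# current frame only ever see cached (past + current) keys and values.
import torch
import torch.nn.functional as F

class TokenCache:
    def __init__(self):
        self.k, self.v = None, None

    def step(self, q, k_new, v_new):
        # Append this frame's keys/values to the running cache.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=1)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=1)
        return F.scaled_dot_product_attention(q, self.k, self.v)

cache = TokenCache()
for frame_tokens in torch.randn(5, 1, 16, 32):      # 5 frames, 16 tokens each, dim 32
    fused = cache.step(frame_tokens, frame_tokens, frame_tokens)
```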

Computer Vision and Pattern Recognition Jul 17, 06:58 AM

Segment Anything

This paper introduces the Segment Anything Model (SAM), a promptable segmentation model capable of generating masks from various input prompts like points, boxes, and text. SAM is trained on SA-1B, a massive dataset containing over 1 billion masks, enabling impressive zero-shot transfer capabilities to a diverse range of segmentation tasks. The authors demonstrate SAM's effectiveness through experiments on edge detection, object proposal generation, instance segmentation, and text-to-mask prediction.
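
The released `segment_anything` package exposes this prompting interface directly; a point-prompt example (the checkpoint path, placeholder image, and click coordinates are illustrative, and the weights must be downloaded first):

```python
# Point-prompted segmentation with the released segment_anything package.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image_rgb = np.zeros((768, 1024, 3), dtype=np.uint8)  # placeholder; load a real image here
predictor.set_image(image_rgb)                        # HxWx3 uint8 RGB array
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),              # one foreground click
    point_labels=np.array([1]),                       # 1 = foreground, 0 = background
    multimask_output=True,                            # ambiguity-aware 3-mask output
)
best_mask = masks[np.argmax(scores)]
```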

Computer Vision and Pattern Recognition Jul 15, 08:05 AM

Fine-Grained Image Analysis with Deep Learning: A Survey

The survey examines deep learning advances in fine-grained image analysis (FGIA), arguing for a broader definition that encompasses both recognition and retrieval. It presents a taxonomy of techniques, evaluates performance on benchmarks, and outlines future research directions, including a more precise definition of the field, new datasets, 3D applications, robust representations, and interpretability.

Computer Vision and Pattern Recognition Jul 14, 05:20 PM

Object Detection in Aerial Images: A Large-Scale Benchmark and Challenges

This paper introduces DOTA, a massive dataset for object detection in aerial images, featuring 1.8 million object instances across 18 categories with oriented bounding box annotations. Using this dataset, they benchmark 10 state-of-the-art object detection algorithms across 70+ configurations, providing a valuable resource for researchers in the field and demonstrating the unique challenges of aerial object detection.

Computer Vision and Pattern Recognition Jul 14, 05:20 PM

High-Resolution Image Synthesis with Latent Diffusion Models

This paper introduces Latent Diffusion Models (LDMs), a new approach to image synthesis that reduces the computational demands of traditional diffusion models while maintaining high-quality results. By operating in the latent space of a pre-trained autoencoder, LDMs achieve faster training and sampling while also enabling flexible conditioning on various inputs like text or bounding boxes.
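
The core idea fits in one training step: noise and denoise the autoencoder's latents rather than pixels. A conceptual sketch with placeholder modules (not the paper's implementation):

```python
# Conceptual latent-diffusion training step: the standard eps-prediction
# diffusion loss, computed on the frozen autoencoder's latents.
import torch
import torch.nn.functional as F

def ldm_training_step(x, encoder, unet, alphas_bar):
    with torch.no_grad():
        z0 = encoder(x)                                   # pixels -> latents (frozen AE)
    t = torch.randint(0, len(alphas_bar), (x.size(0),))
    eps = torch.randn_like(z0)
    a = alphas_bar[t].view(-1, 1, 1, 1)
    zt = a.sqrt() * z0 + (1 - a).sqrt() * eps             # noise the latent, not the image
    return F.mse_loss(unet(zt, t), eps)                   # eps-prediction objective

# Stub run: strided slicing as a fake "encoder", zero net as a fake "unet".
loss = ldm_training_step(torch.randn(2, 3, 64, 64),
                         lambda x: x[:, :, ::4, ::4],
                         lambda z, t: torch.zeros_like(z),
                         torch.linspace(0.99, 0.01, 1000))
```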

Computer Vision and Pattern Recognition Jul 14, 05:20 PM

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

This paper proposes a novel vision Transformer architecture, the Swin Transformer, which computes self-attention within shifted local windows, reducing computational complexity to linear in image size. Experiments on ImageNet, COCO, and ADE20K demonstrate state-of-the-art performance across image classification, object detection, and semantic segmentation.
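
The two mechanisms in the title can be shown in isolation (a simplified sketch; the full model also adds relative position bias and attention masking for the shifted case):

```python
# The core Swin trick in isolation: partition feature maps into fixed windows
# for local attention, and cyclically shift between blocks so windows overlap.
import torch

def window_partition(x, w):
    # (B, H, W, C) -> (num_windows*B, w*w, C); attention runs per window,
    # so cost grows linearly with image size instead of quadratically.
    B, H, W, C = x.shape
    x = x.view(B, H // w, w, W // w, w, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)

x = torch.randn(1, 8, 8, 32)
shifted = torch.roll(x, shifts=(-2, -2), dims=(1, 2))   # the "shifted window" step
windows = window_partition(shifted, w=4)                # (4, 16, 32)
```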

Computer Vision and Pattern Recognition Jul 14, 05:20 PM

Semantic interpretation of architectural and archaeological geometries: Point cloud segmentation for HBIM parameterisation

This paper applies Brodu and Lague's morphological segmentation algorithm (CANUPO) to classify architectural components from terrestrial laser scanning data of a historical palace façade. The algorithm performed well on large complex shapes but struggled with smaller details, suggesting its applicability for general façade analysis but limitations for fine-grained modeling.

Computer Vision and Pattern Recognition Jul 14, 05:09 PM