Paper Summary
Paperzilla title
Just Use Average Pooling, Bro: PoolFormer Challenges Attention in Vision Transformers
This paper proposes PoolFormer, a Vision Transformer architecture that replaces the attention mechanism with simple spatial pooling. Surprisingly, PoolFormer achieves competitive performance on ImageNet, object detection, and semantic segmentation tasks, suggesting the general architecture of Vision Transformers matters more than the specific token mixer used.
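To make the idea concrete, below is a minimal PyTorch sketch of a PoolFormer-style block based on the paper's description; the pooling window, GroupNorm choice, and layer names here are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class PoolingMixer(nn.Module):
    """Token mixer: average pooling over a local window, minus the identity."""
    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2, count_include_pad=False)

    def forward(self, x):
        # Subtracting x keeps only the "mixing" component, mirroring the paper's formulation.
        return self.pool(x) - x

class PoolFormerBlock(nn.Module):
    """MetaFormer-style block with pooling in place of self-attention (NCHW feature maps)."""
    def __init__(self, dim: int, pool_size: int = 3, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)   # normalization over all channels per sample
        self.mixer = PoolingMixer(pool_size)
        self.norm2 = nn.GroupNorm(1, dim)
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(            # channel MLP implemented with 1x1 convolutions
            nn.Conv2d(dim, hidden, 1), nn.GELU(), nn.Conv2d(hidden, dim, 1))

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))    # token-mixing sub-block
        x = x + self.mlp(self.norm2(x))      # channel-MLP sub-block
        return x

# Example: a 64-channel feature map of spatial size 14x14.
x = torch.randn(2, 64, 14, 14)
print(PoolFormerBlock(64)(x).shape)          # torch.Size([2, 64, 14, 14])
```

The point of the sketch is that the residual structure and channel MLP of the standard Transformer block are kept unchanged; average pooling (minus the identity) simply stands in for self-attention as the token mixer.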
Possible Conflicts of Interest
The authors report working at Sea AI Lab and the National University of Singapore; however, they state the work was completed during an internship. They may still have close ties to these organizations.
Identified Weaknesses
Lack of strong baseline comparisons
The authors do not compare PoolFormer against modern convolutional neural networks, so it is unclear how this very simple model fares against similarly sized CNNs.
Inconsistent model size comparison
The authors compare models by parameter count, but parameters alone are an incomplete measure of model cost. FLOPs or inference latency should also be reported; a minimal measurement sketch follows this list.
Limited reproducibility details
The code is provided but has not been released through a standard model hub such as PyTorch Hub, and the training details are very limited.
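To illustrate why parameter count alone is a weak proxy for cost, here is a minimal PyTorch sketch (not from the paper; the stand-in model and iteration counts are placeholders) that reports both parameters and a rough per-image inference latency:

```python
import time
import torch
import torchvision.models as models

# Hypothetical stand-in model; the same measurements would apply to a PoolFormer checkpoint.
model = models.resnet50().eval()

# Parameter count: easy to report, but says little about runtime cost.
n_params = sum(p.numel() for p in model.parameters())

# Rough inference latency on a single 224x224 image (CPU here for portability).
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    for _ in range(3):                      # warm-up iterations
        model(x)
    start = time.perf_counter()
    for _ in range(10):
        model(x)
    latency_ms = (time.perf_counter() - start) / 10 * 1000

print(f"params: {n_params / 1e6:.1f}M, latency: {latency_ms:.1f} ms/image")
```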
Rating Explanation
This paper proposes an extremely simple architecture, PoolFormer, which achieves competitive performance on image classification tasks. The paper's strength lies in its simplicity and its clear message: the core architecture of Vision Transformers may matter more than the token mixers commonly used. While it does not surpass the state of the art in many cases, its efficiency makes it quite promising for real-world applications.
Good to know
This is our free standard analysis. Paperzilla Pro fact-checks every citation, researches author backgrounds and funding sources, and uses advanced AI reasoning for more thorough insights.
File Information
Original Title:
MetaFormer Is Actually What You Need for Vision
Uploaded:
July 14, 2025 at 10:35 AM
© 2025 Paperzilla. All rights reserved.