MetaFormer Is Actually What You Need for Vision

★ ★ ★ ★ ☆

Paper Summary

Paperzilla title
Just Use Average Pooling, Bro: PoolFormer Challenges Attention in Vision Transformers

This paper proposes PoolFormer, a Vision Transformer architecture that replaces the attention mechanism with simple spatial pooling. Surprisingly, PoolFormer achieves competitive performance on ImageNet, object detection, and semantic segmentation tasks, suggesting the general architecture of Vision Transformers matters more than the specific token mixer used.
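The token mixer described above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the authors' implementation (PoolFormer uses a stride-1 average-pooling layer inside a standard Transformer block): the mixer is simply local average pooling with the input subtracted, so only neighbor information flows through it and the block's residual connection adds the input back.

```python
import numpy as np

def pool_mixer(x, pool_size=3):
    """PoolFormer-style token mixer: local average pooling minus identity.

    x: feature map of shape (H, W, C).
    Each output position is the mean over a pool_size x pool_size
    neighborhood (clipped at the borders), minus the input itself.
    """
    H, W, C = x.shape
    pad = pool_size // 2
    out = np.zeros_like(x, dtype=float)
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - pad), min(H, i + pad + 1)
            j0, j1 = max(0, j - pad), min(W, j + pad + 1)
            # Mean over the valid neighborhood only (no zero padding).
            out[i, j] = x[i0:i1, j0:j1].mean(axis=(0, 1))
    return out - x  # subtract identity: the mixer carries only neighbor info

# On a constant input the neighborhood mean equals the input,
# so the mixer outputs zeros everywhere.
print(np.allclose(pool_mixer(np.ones((4, 4, 2))), 0.0))  # → True
```

Note how cheap this is compared to attention: no learned parameters and no quadratic interaction between tokens, which is exactly why the paper's competitive results are surprising.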

Explain Like I'm Five

Scientists found a new way to help computers understand pictures better. They learned that a simple trick, like blurring parts of an image, works just as well as complex ones, meaning the basic design of the computer program is most important.

Possible Conflicts of Interest

The authors report working at Sea AI Lab and the National University of Singapore, though they state the work was completed during an internship. They may still have close ties to these organizations.

Identified Limitations

Lack of strong baseline comparisons
The authors do not compare their model against modern convolutional neural networks, so it is unclear how this very simple model fares against similarly sized CNNs.
Inconsistent model size comparison
The authors compare model sizes by parameter count alone, which is not a consistent measure of model cost; they should also report FLOPs or inference time.
Reproducibility
Code is provided but has not been published to a standard model hub such as PyTorch Hub, and training details are very limited.

Rating Explanation

This paper proposes an extremely simple architecture, PoolFormer, which achieves competitive performance on image classification tasks. The paper’s strength lies in its simplicity and its strong message: the overall architecture of Vision Transformers may matter more than the specific token mixer used. While it does not surpass the state of the art in most cases, its efficiency makes it quite promising for real-world applications.

Good to know

This is the Starter analysis. Paperzilla Pro fact-checks every citation, researches author backgrounds and funding sources, and uses advanced AI reasoning for more thorough insights.


Topic Hierarchy

Domain: Computer Science
Field: Computer Vision

File Information

Original Title: MetaFormer Is Actually What You Need for Vision
Uploaded: July 14, 2025 at 10:35 AM
Privacy: Public