MetaFormer Is Actually What You Need for Vision

★ ★ ★ ★ ☆

Paper Summary

Paperzilla title
Just Use Average Pooling, Bro: PoolFormer Challenges Attention in Vision Transformers

This paper proposes PoolFormer, a Vision Transformer architecture that replaces the attention mechanism with simple spatial pooling. Surprisingly, PoolFormer achieves competitive performance on ImageNet, object detection, and semantic segmentation tasks, suggesting the general architecture of Vision Transformers matters more than the specific token mixer used.
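The token mixer described above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the authors' implementation (PoolFormer uses a stride-1 average-pooling layer inside a standard Transformer block): the mixer is simply local average pooling with the input subtracted, so only neighbor information flows through it and the block's residual connection adds the input back.

```python
import numpy as np

def pool_mixer(x, pool_size=3):
    """PoolFormer-style token mixer: local average pooling minus identity.

    x: feature map of shape (H, W, C).
    Each output position is the mean over a pool_size x pool_size
    neighborhood (clipped at the borders), minus the input itself.
    """
    H, W, C = x.shape
    pad = pool_size // 2
    out = np.zeros_like(x, dtype=float)
    for i in range(H):
        for j in range(W):
            i0, i1 = max(0, i - pad), min(H, i + pad + 1)
            j0, j1 = max(0, j - pad), min(W, j + pad + 1)
            # Mean over the valid neighborhood only (no zero padding).
            out[i, j] = x[i0:i1, j0:j1].mean(axis=(0, 1))
    return out - x  # subtract identity: the mixer carries only neighbor info

# On a constant input the neighborhood mean equals the input,
# so the mixer outputs zeros everywhere.
print(np.allclose(pool_mixer(np.ones((4, 4, 2))), 0.0))  # → True
```

Note how cheap this is compared to attention: no learned parameters and no quadratic interaction between tokens, which is exactly why the paper's competitive results are surprising.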

Explain Like I'm Five

Scientists found a new way to help computers understand pictures better. They learned that a simple trick, like blurring parts of an image, works just as well as complex ones, meaning the basic design of the computer program is most important.

Possible Conflicts of Interest

The authors report working at Sea AI Lab and the National University of Singapore, though they state the work was completed during an internship. They may still have close ties to these organizations.

Identified Limitations

Lack of strong baseline comparisons
The authors do not compare their model against modern convolutional neural networks, so it is unclear how this very simple model fares against similarly sized CNNs.
Inconsistent model size comparison
The authors compare model sizes by parameter count alone, which is not a consistent measure of model cost; they should also report FLOPs or inference time.
Reproducibility
Code is provided but has not been published to a standard model hub such as PyTorch Hub, and training details are very limited.

Rating Explanation

This paper proposes an extremely simple architecture, PoolFormer, which achieves competitive performance on image classification tasks. The paper’s strength lies in its simplicity and its strong message: the overall architecture of Vision Transformers may matter more than the specific token mixer used. While it does not surpass the state of the art in most cases, its efficiency makes it quite promising for real-world applications.

Good to know

This is the Starter analysis. Paperzilla Pro fact-checks every citation, researches author backgrounds and funding sources, and uses advanced AI reasoning for more thorough insights.


Topic Hierarchy

Domain: Computer Science
Field: Computer Vision

File Information

Original Title: MetaFormer Is Actually What You Need for Vision
Uploaded: July 14, 2025 at 10:35 AM
Privacy: Public