Per-Pixel Classification is NOT All You Need for Semantic Segmentation
arXiv 2021

* Work partly done during an internship at Facebook AI Research.



Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.

Is per-pixel classification all you need?

Semantic segmentation has been dominated by the per-pixel classification formulation since the seminal work on Fully Convolutional Networks (FCNs). However, this formulation has limitations. For example, it struggles with a large number of classes, and it cannot handle instance-level segmentation, which requires a dynamic number of outputs. In contrast, mask classification is more general, and it once dominated the field with methods like O2P and SDS. Given this perspective, a natural question emerges: can a single mask classification model simplify the landscape of effective approaches to semantic- and instance-level segmentation tasks? And can such a mask classification model outperform existing per-pixel classification methods for semantic segmentation?



To address both questions, we propose a simple MaskFormer module that seamlessly converts any existing per-pixel classification model into a mask classification method. Using the set prediction mechanism proposed in DETR, MaskFormer employs a Transformer decoder to compute a set of pairs, each consisting of a class prediction and a mask embedding vector. The mask embedding vector is used to get the binary mask prediction via a dot product with the per-pixel embedding obtained from an underlying fully-convolutional network. The new model solves both semantic- and instance-level segmentation tasks in a unified manner: no changes to the model, losses and training procedure are required. Specifically, for semantic and panoptic segmentation tasks alike, MaskFormer is supervised with the same per-pixel binary mask loss and a single classification loss per mask. Finally, we design a simple inference strategy to blend MaskFormer outputs into a task-dependent prediction format.
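The mask prediction step and one possible inference rule for semantic segmentation can be sketched in a few lines of plain Python. This is a minimal illustration, not the released implementation: the function names (`predict_masks`, `semantic_inference`), the nested-list data layout, the use of a sigmoid on the dot product, and the convention that the last class logit stands for "no object" are all assumptions made for this example. The blending rule shown (weight each mask by its class probabilities, sum over queries, take the per-pixel argmax) is one plausible reading of the inference strategy; the paper gives the exact procedure.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_masks(mask_embeddings, pixel_embeddings):
    """Dot each mask embedding with every per-pixel embedding.

    mask_embeddings:  N vectors of length C (one per predicted segment).
    pixel_embeddings: H x W grid of vectors of length C.
    Returns N x H x W mask probabilities (sigmoid is an assumption here).
    """
    return [
        [
            [sigmoid(sum(q * p for q, p in zip(query, pixel))) for pixel in row]
            for row in pixel_embeddings
        ]
        for query in mask_embeddings
    ]


def semantic_inference(class_logits, mask_probs):
    """Blend N (class, mask) pairs into one H x W semantic label map.

    class_logits: N x (K + 1); the last entry is assumed to mean "no object".
    mask_probs:   N x H x W, as returned by predict_masks.
    """
    # Keep only the K real classes when scoring pixels.
    class_probs = [softmax(logits)[:-1] for logits in class_logits]
    num_queries = len(mask_probs)
    num_classes = len(class_probs[0])
    height, width = len(mask_probs[0]), len(mask_probs[0][0])

    label_map = []
    for y in range(height):
        row = []
        for x in range(width):
            # Score of class c at this pixel: sum over queries of
            # (class probability) * (mask probability).
            scores = [
                sum(class_probs[i][c] * mask_probs[i][y][x] for i in range(num_queries))
                for c in range(num_classes)
            ]
            row.append(max(range(num_classes), key=scores.__getitem__))
        label_map.append(row)
    return label_map


# Toy example: 2 queries, 2 classes (+ "no object"), a 1 x 2 image, C = 2.
mask_embeddings = [[4.0, 0.0], [0.0, 4.0]]
pixel_embeddings = [[[1.0, -1.0], [-1.0, 1.0]]]
class_logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]

masks = predict_masks(mask_embeddings, pixel_embeddings)
print(semantic_inference(class_logits, masks))
```

In this toy case the first query claims the left pixel and votes for class 0, while the second claims the right pixel and votes for class 1. Note that the class label is predicted once per mask rather than once per pixel, which is the distinction the text draws between mask classification and per-pixel classification.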

Please check the paper for a detailed description of the MaskFormer model.

MaskFormer for Semantic Segmentation

We evaluate MaskFormer on five semantic segmentation datasets with varying numbers of categories: Cityscapes (19 classes), Mapillary Vistas (65 classes), ADE20K (150 classes), COCO-Stuff-10K (171 classes), and ADE20K-Full (847 classes). While MaskFormer performs on par with per-pixel classification models on Cityscapes, which has a few diverse classes, the new model demonstrates superior performance on datasets with larger vocabularies. We hypothesize that a single class prediction per mask models fine-grained recognition better than per-pixel class predictions. MaskFormer achieves a new state of the art on ADE20K (55.6 mIoU) with a Swin-Transformer backbone, outperforming a per-pixel classification model with the same backbone by 2.1 mIoU, while being more efficient (10% reduction in parameters and 40% reduction in FLOPs).

Please check the paper for detailed experimental results and ablation studies.

MaskFormer for Panoptic Segmentation

We study MaskFormer's ability to solve instance-level tasks using two panoptic segmentation datasets: COCO and ADE20K. MaskFormer outperforms a more complex DETR model with the same backbone and the same post-processing. Moreover, MaskFormer achieves a new state of the art on COCO (52.7 PQ), outperforming the prior state of the art by 1.6 PQ. Our experiments highlight MaskFormer's ability to unify instance- and semantic-level segmentation.

Please check the paper for detailed experimental results and ablation studies.



We thank Ross Girshick for insightful comments and suggestions.

Work of UIUC authors BC and AS was supported in part by NSF under Grants #1718221, 2008387, 2045586, MRI #1725729, NIFA award 2020-67021-32799 and Cisco Systems Inc. (Gift Award CG 1377144 - thanks for access to Arcetri).

The website template was borrowed from Michaël Gharbi.