Masked-attention Mask Transformer for Universal Image Segmentation
ArXiv 2021

* Work done during an internship at Facebook AI Research.

Abstract

Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).

Mask2Former

Mask2Former adopts the same meta architecture as MaskFormer, with our proposed Transformer decoder replacing the standard one. The key component of our Transformer decoder is a masked attention operator, which extracts localized features by constraining cross-attention to the foreground region of the predicted mask for each query, instead of attending to the full feature map. To handle small objects, we propose an efficient multi-scale strategy that utilizes high-resolution features: it feeds successive feature maps from the pixel decoder’s feature pyramid into successive Transformer decoder layers in a round-robin fashion. Finally, we incorporate optimization improvements that boost model performance without introducing additional computation.
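
To make the two ideas above concrete, here is a minimal pure-Python sketch of masked cross-attention for a single query, together with a round-robin assignment of feature-pyramid levels to decoder layers. It is an illustration under stated assumptions, not the released implementation: the function names, the 0.5 foreground threshold, the fallback to full attention when the predicted mask is empty, and the toy inputs are all choices made for this example.

```python
import math


def masked_cross_attention(query, keys, values, mask_probs, threshold=0.5):
    """Attend from one query to flattened feature locations inside its mask.

    query:      list of d floats
    keys:       one list of d floats per feature location
    values:     one list of floats per feature location
    mask_probs: predicted foreground probability per location for this query
    threshold:  assumed cutoff for "foreground" (hypothetical default)
    """
    allowed = [p >= threshold for p in mask_probs]
    if not any(allowed):
        # Assumption of this sketch: an empty mask falls back to
        # attending over the full feature map.
        allowed = [True] * len(keys)

    scale = 1.0 / math.sqrt(len(query))
    # Locations outside the mask get a logit of -inf, i.e. zero weight.
    logits = [
        scale * sum(q * k for q, k in zip(query, key)) if ok else -math.inf
        for key, ok in zip(keys, allowed)
    ]

    # Numerically stable softmax; exp(-inf) evaluates to 0.0.
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


def round_robin_levels(num_layers, num_levels):
    """Index of the feature-pyramid level consumed by each decoder layer."""
    return [layer % num_levels for layer in range(num_layers)]


if __name__ == "__main__":
    query = [1.0, 0.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    values = [[10.0], [20.0], [30.0]]
    mask_probs = [0.9, 0.2, 0.7]  # the middle location is background

    output, weights = masked_cross_attention(query, keys, values, mask_probs)
    print(weights)  # the masked-out location receives zero weight
    print(output)
    print(round_robin_levels(6, 3))
```

In this toy input the second location lies outside the predicted mask, so it contributes nothing to the output regardless of how well its key matches the query. The level schedule simply cycles through the available feature maps, so every resolution is revisited as the decoder gets deeper.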

Please check the paper for a detailed description of the Mask2Former model.

Mask2Former for Universal Image Segmentation

We study Mask2Former's ability to solve any image segmentation task using four popular datasets (COCO, ADE20K, Cityscapes and Mapillary Vistas) and three tasks (panoptic, instance and semantic segmentation). Mask2Former, with a single architecture, outperforms the best of the specialized models on each task and dataset.

Please check the paper for detailed experimental results and ablation studies.

BibTeX

Acknowledgments

The website template was borrowed from Michaël Gharbi.