Pointly-Supervised Instance Segmentation
CVPR 2022

* Work done during an internship at Facebook AI Research.



We propose point-based instance-level annotation, a new form of weak supervision for instance segmentation. It combines the standard bounding box annotation with labeled points that are uniformly sampled inside each bounding box. We show that the existing instance segmentation models developed for full mask supervision, like Mask R-CNN, can be seamlessly trained with the point-based annotation without any major modifications. In our experiments, Mask R-CNN models trained on COCO, PASCAL VOC, Cityscapes, and LVIS with only 10 annotated points per object achieve 94%-98% of their fully-supervised performance. The new point-based annotation is approximately 5 times faster to collect than object masks, making high-quality instance segmentation more accessible for new data.

Inspired by the new annotation form, we propose a modification to the PointRend instance segmentation module. For each object, the new architecture, called Implicit PointRend, generates parameters for a function that makes the final point-level mask prediction. Implicit PointRend is more straightforward than PointRend and uses a single point-level mask loss. Our experiments show that the new module is more suitable for the proposed point-based supervision.
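The annotation form described above (a bounding box plus labeled points sampled uniformly inside it) can be sketched in a few lines. This is a minimal illustration, not the authors' annotation tool: the function name `sample_point_annotations`, the box convention, and the use of a binary mask to stand in for the human annotator's object/background label are all assumptions made for the example.

```python
import random

def sample_point_annotations(box, mask, num_points=10, seed=0):
    """Sample points uniformly inside a box and attach a label to each.

    box:  (x0, y0, x1, y1) integer pixel coordinates, x1/y1 exclusive.
    mask: 2D list of 0/1 values indexed as mask[y][x]. Assumption: it
          stands in for the annotator who would label each point as
          object (1) or background (0).
    """
    x0, y0, x1, y1 = box
    rng = random.Random(seed)
    points = []
    for _ in range(num_points):
        x = rng.uniform(x0, x1)
        y = rng.uniform(y0, y1)
        # Label comes from the pixel that contains the sampled point.
        px = min(int(x), x1 - 1)
        py = min(int(y), y1 - 1)
        points.append((x, y, mask[py][px]))
    return points

# Toy 6x6 image with a 2x2 object, enclosed by a slightly larger box.
mask = [[0] * 6 for _ in range(6)]
for yy in range(2, 4):
    for xx in range(2, 4):
        mask[yy][xx] = 1
box = (1, 1, 5, 5)
points = sample_point_annotations(box, mask, num_points=10)
```

Each entry of `points` is an `(x, y, label)` triple. Together with `box`, these triples make up the whole annotation for one object in this sketch.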

Mask R-CNN trained with point-based supervision


Mask R-CNN, as well as other existing instance segmentation models developed for full mask supervision, can be seamlessly trained with 10 annotated points per object and achieve 94%-98% of their fully-supervised performance on various datasets. Even though the Mask R-CNN ResNet-50-FPN model is trained with only 10 annotated points per object on the COCO dataset, it learns to segment instance boundaries well.
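One plausible way to train a mask head from sparse points is to read the predicted mask logits at each annotated location and apply the usual binary cross-entropy only there. The sketch below assumes bilinear interpolation for the read-out and a pixel-center coordinate convention. The function names and these conventions are illustrative assumptions, not the released training code.

```python
import math

def bilinear_sample(grid, x, y):
    """Bilinearly interpolate a 2D grid at a continuous (x, y) location.

    Assumption: grid[r][c] holds the value at pixel center
    (c + 0.5, r + 0.5); locations beyond the centers are clamped.
    """
    h, w = len(grid), len(grid[0])
    fx = min(max(x - 0.5, 0.0), w - 1.0)
    fy = min(max(y - 0.5, 0.0), h - 1.0)
    c0, r0 = int(fx), int(fy)
    c1, r1 = min(c0 + 1, w - 1), min(r0 + 1, h - 1)
    ax, ay = fx - c0, fy - r0
    top = grid[r0][c0] * (1 - ax) + grid[r0][c1] * ax
    bottom = grid[r1][c0] * (1 - ax) + grid[r1][c1] * ax
    return top * (1 - ay) + bottom * ay

def bce_with_logit(logit, label):
    """Numerically stable binary cross-entropy for a single logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def point_mask_loss(mask_logits, points):
    """Mean BCE over annotated points given as (x, y, label) triples
    expressed in the coordinate frame of the predicted mask grid."""
    losses = [
        bce_with_logit(bilinear_sample(mask_logits, x, y), label)
        for x, y, label in points
    ]
    return sum(losses) / len(losses)

# Toy 2x2 prediction: confident "object" in the top-left cell only.
mask_logits = [[4.0, -4.0], [-4.0, -4.0]]
agreeing = [(0.5, 0.5, 1), (1.5, 1.5, 0)]
disagreeing = [(0.5, 0.5, 0), (1.5, 1.5, 1)]
loss_good = point_mask_loss(mask_logits, agreeing)
loss_bad = point_mask_loss(mask_logits, disagreeing)
```

In this toy case `loss_good` is smaller than `loss_bad`, since the first set of point labels agrees with the prediction and the second contradicts it. The loss never touches pixels without a label, which is why a model built for dense mask targets can be reused unchanged.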

Please check the paper for results of different models with point-based supervision.

Annotation time and model performance trade-off


We compare the new point-based supervision with other forms of supervision for instance segmentation under the same annotation budget, which we measure as the time required to label the training data. To match annotation times between different supervision forms, we train Mask R-CNN models on subsets ranging from 10% to 100% of COCO train2017. Observe that Mask R-CNN trained with the new point-based supervision significantly outperforms models trained with either full mask supervision or weak bounding box supervision under the same annotation budget.
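The budget matching can be illustrated with simple arithmetic. The sketch below takes the roughly 5x collection speedup quoted above as its only input and assumes, for illustration, that annotation time scales linearly with the number of annotated objects. The helper name `equivalent_point_fraction` is hypothetical.

```python
def equivalent_point_fraction(mask_fraction, speedup=5.0):
    """Fraction of a dataset that could be point-annotated in the time
    needed to mask-annotate `mask_fraction` of it (capped at 1.0).

    Assumption: per-object annotation time is constant, so time is
    proportional to the number of annotated objects.
    """
    if not 0.0 <= mask_fraction <= 1.0:
        raise ValueError("mask_fraction must lie in [0, 1]")
    return min(1.0, mask_fraction * speedup)

# Budgets expressed as the share of the data that could receive full masks.
budgets = [0.1, 0.2, 0.5]
point_coverage = [equivalent_point_fraction(b) for b in budgets]
```

Under these assumptions, a budget that pays for masks on about a fifth of the data already covers point annotations for all of it.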

Please check the paper for more analysis of annotation time.

Implicit PointRend


Inspired by the new point-based annotation, we propose a simplified version of the PointRend module, which we name Implicit PointRend. For each detected object, instead of a coarse mask prediction, the new architecture predicts parameters for a point head function that can make a point-wise object mask prediction for any point given its position and corresponding image features. The new module has a single point-level mask loss, which simplifies its implementation in comparison with PointRend. Our experiments show that Implicit PointRend performs much better than PointRend with point-based annotation. It is also very competitive when trained with mask supervision.
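The idea of predicting parameters for a per-object point head can be sketched as follows. This is a simplified illustration under stated assumptions: the two-layer MLP, the ReLU nonlinearity, the flat parameter layout, and the raw (unencoded) point position are choices made for the example, and the names `split_parameters` and `point_head` are hypothetical. In the real module the parameter vector would come from a learned head; here it is written by hand.

```python
def num_parameters(in_dim, hidden_dim):
    """Length of the flat parameter vector for the 2-layer MLP below."""
    return hidden_dim * in_dim + hidden_dim + hidden_dim + 1

def split_parameters(params, in_dim, hidden_dim):
    """Unpack a flat per-object parameter vector into (w1, b1, w2, b2)."""
    n_w1 = hidden_dim * in_dim
    w1 = [params[i * in_dim:(i + 1) * in_dim] for i in range(hidden_dim)]
    b1 = params[n_w1:n_w1 + hidden_dim]
    w2 = params[n_w1 + hidden_dim:n_w1 + 2 * hidden_dim]
    b2 = params[n_w1 + 2 * hidden_dim]
    return w1, b1, w2, b2

def point_head(params, position, features, hidden_dim=2):
    """Mask logit for one point, computed with object-specific parameters.

    position: (x, y) of the point, e.g. relative to the object's box.
    features: image features sampled at that point.
    """
    x = list(position) + list(features)
    w1, b1, w2, b2 = split_parameters(params, len(x), hidden_dim)
    hidden = [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)  # ReLU
        for row, b in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

# Two objects share the same point and features but carry different
# parameters, so the same function family yields different predictions.
position, features = (0.3, 0.1), (0.0, 0.0)
params_a = [1.0, 0.0, 0.0, 0.0,  0.0, 1.0, 0.0, 0.0,  0.0, 0.0,  1.0, -1.0,  0.5]
params_b = [1.0, 0.0, 0.0, 0.0,  0.0, 1.0, 0.0, 0.0,  0.0, 0.0,  -1.0, 1.0,  -0.5]
logit_a = point_head(params_a, position, features)
logit_b = point_head(params_b, position, features)
```

Because the output is a single logit per queried point, one point-level loss on those logits is enough to supervise the whole module in this sketch, with no separate coarse-mask term.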

Please check the paper for more results of Implicit PointRend.



We would like to thank Ross Girshick, Piotr Dollár, Alex Berg, Yuxin Wu, Tamara Berg, and Elisa Berger for useful discussions and advice.

The website template was borrowed from Michaël Gharbi.