OMG-Seg: Is One Model Good Enough For All Segmentation?

OMG-Seg is One Model that is Good enough to efficiently and effectively handle all segmentation tasks: image semantic, instance, and panoptic segmentation, their video counterparts, open-vocabulary settings, and prompt-driven interactive segmentation.

Sophia · Feb 8, 2024

Xiangtai Li¹, Haobo Yuan¹, Wei Li¹, Henghui Ding¹, Size Wu¹, Wenwei Zhang¹,², Yining Li², Kai Chen², Chen Change Loy¹
¹S-Lab, Nanyang Technological University, ²Shanghai Artificial Intelligence Laboratory

Abstract

In this work, we address various segmentation tasks, each traditionally tackled by distinct or partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently and effectively handle all the segmentation tasks, including image semantic, instance, and panoptic segmentation, as well as their video counterparts, open-vocabulary settings, prompt-driven interactive segmentation like SAM, and video object segmentation. To our knowledge, this is the first model to handle all these tasks in one model and achieve good enough performance.

We show that OMG-Seg, a transformer-based encoder-decoder architecture with task-specific queries and outputs, can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead across various tasks and datasets. We rigorously evaluate the inter-task influences and correlations during co-training. Both the code and models will be publicly available.

Model Scope Comparison

We include several representative methods here. Our proposed OMG-Seg can perform various segmentation tasks in one model. SS: Semantic Segmentation. IS: Instance Segmentation. PS: Panoptic Segmentation. VSS: Video Semantic Segmentation. VIS: Video Instance Segmentation. VPS: Video Panoptic Segmentation. VOS: Video Object Segmentation. Open-Set: Open-Vocabulary Segmentation.

Method: OMG-Seg

(a) OMG-Seg follows the architecture of Mask2Former, containing a backbone (CLIP visual encoder), a pixel decoder, and a mask decoder. It differs from Mask2Former in two parts: a shared mask decoder for both image and video segmentation, and a visual prompt encoder. We use two types of mask queries: semantic queries, which produce instance/semantic masks or mask tubes, and location queries, which encode box or point prompts. (b) One decoder layer in the mask decoder. The location queries skip the self-attention operation, as they are conditioned only on the image content and the location prompts. (c) The forward pass of OMG-Seg in training and inference. We use CLIP's text encoder to represent category names and classify masks by calculating the cosine similarity between mask features and text embeddings.
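
The classification step described in (c) can be illustrated with a small sketch. This is not the authors' implementation: the function names (`l2_normalize`, `cosine_similarity`, `classify_masks`), the 3-dimensional toy vectors, and the three class names are all assumptions made for illustration. In the real model the text embeddings would come from CLIP's text encoder and the mask features from the mask decoder. The sketch only shows the idea of assigning each mask the category whose text embedding has the highest cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity: dot product of the two L2-normalized vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def classify_masks(mask_features, text_embeddings, class_names):
    """Assign each mask the class whose text embedding is most similar.

    Hypothetical helper: returns one (class_name, score) pair per mask.
    """
    results = []
    for feat in mask_features:
        scores = [cosine_similarity(feat, emb) for emb in text_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        results.append((class_names[best], scores[best]))
    return results


# Toy stand-ins for CLIP text embeddings (one per category name).
class_names = ["cat", "dog", "road"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Toy stand-ins for per-mask features produced by the mask decoder.
mask_features = [[0.9, 0.1, 0.0], [0.0, 0.2, 2.0]]

predictions = classify_masks(mask_features, text_embeddings, class_names)
for name, score in predictions:
    print(f"{name}: {score:.3f}")
```

Because the category set enters only through the text embeddings, swapping in a different list of category names changes the label space without touching the mask features, which is what makes this style of classifier suitable for open-vocabulary settings.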
