Recognizing scenes and objects in 3D from a single image is a long-standing goal of computer vision, with applications in robotics and AR/VR. Building on the success of 2D recognition, Meta AI revisits 3D object detection with Omni3D, a large benchmark that re-purposes and combines existing datasets, resulting in 234,000 images annotated with more than 3 million instances and 98 categories.

Cube R-CNN, a new model trained on Omni3D, is designed to generalize across camera and scene types with a unified approach. It outperforms prior work on both the larger Omni3D and existing benchmarks. The authors also report that Omni3D is a powerful dataset for 3D object recognition: it improves single-dataset performance and can accelerate learning on new, smaller datasets via pre-training.

The Omni3D dataset comprises 234k RGB images with 3M oriented 3D box annotations, covering indoor and outdoor scenes and a range of focal lengths and resolutions. To support the new 3D AP metric, the authors implemented a fast and accurate 3D IoU algorithm.
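To make the idea of 3D IoU concrete, here is a minimal sketch for the simplified case of axis-aligned boxes. This is not the authors' algorithm, which handles oriented 3D boxes; the function name `aabb_iou_3d` and the `(xmin, ymin, zmin, xmax, ymax, zmax)` box layout are assumptions chosen for illustration.

```python
def aabb_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes.

    Each box is (xmin, ymin, zmin, xmax, ymax, zmax).
    Simplified illustration only: oriented boxes need a more
    involved intersection computation.
    """
    intersection = 1.0
    for axis in range(3):
        lo = max(box_a[axis], box_b[axis])
        hi = min(box_a[axis + 3], box_b[axis + 3])
        if hi <= lo:
            # No overlap along this axis means no overlap at all.
            return 0.0
        intersection *= hi - lo

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    union = volume(box_a) + volume(box_b) - intersection
    return intersection / union


unit_cube = (0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
shifted_cube = (0.5, 0.0, 0.0, 1.5, 1.0, 1.0)
iou = aabb_iou_3d(unit_cube, shifted_cube)
```

In this example the two unit cubes overlap in half their volume, so the intersection is 0.5, the union is 1.5, and the IoU is one third.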

Cube R-CNN builds on Faster R-CNN (as implemented in detectron2) and adds a 3D head that estimates a virtual 3D cuboid, which is then compared to the 3D ground-truth (GT) vertices. A key feature of Cube R-CNN is that it predicts in a virtual camera space, in which the effective image resolution and focal length are kept consistent across different camera sensors. Resolving this ambiguity between camera sensors is critical for scaling to large and diverse 3D object datasets.
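One way to picture the virtual camera space is as a rescaling of metric depth so that every image behaves as if it came from the same camera. The sketch below assumes the scaling takes the form z_v = z * (f_v / f) * (H / H_v), where f and H are the real focal length and image height and f_v and H_v are the virtual ones. The function names and the exact form of the scaling are assumptions for illustration, not a copy of the released implementation.

```python
def to_virtual_depth(depth, focal, image_height, virtual_focal, virtual_height):
    """Map metric depth to an assumed virtual camera space.

    Assumed form: z_v = z * (f_v / f) * (H / H_v).
    """
    return depth * (virtual_focal / focal) * (image_height / virtual_height)


def from_virtual_depth(virtual_depth, focal, image_height, virtual_focal, virtual_height):
    """Invert the mapping above to recover metric depth."""
    return virtual_depth * (focal / virtual_focal) * (virtual_height / image_height)


# Hypothetical camera: focal length 1000 px, image height 500 px,
# mapped to a virtual camera with focal length 500 px and height 500 px.
z_virtual = to_virtual_depth(10.0, 1000.0, 500.0, 500.0, 500.0)
z_metric = from_virtual_depth(z_virtual, 1000.0, 500.0, 500.0, 500.0)
```

Under this assumption, two cameras that see an object at the same apparent size produce the same virtual depth, which is what lets a single model train on images from different sensors.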

The Cube R-CNN model trained on Omni3D can also predict 3D objects on unseen COCO images.