Abstract

Humans universally dislike the task of cleaning up a messy room. If machines were to help us with this task, they must understand human criteria for regular arrangements, such as several types of symmetry, co-linearity or co-circularity, spacing uniformity in linear or circular patterns, and further inter-object relationships that relate to style and functionality. Previous approaches for this task relied on human input to explicitly specify goal state, or synthesized scenes from scratch -- but such methods do not address the rearrangement of existing messy scenes without providing a goal state. In this paper, we present LEGO-Net, a data-driven transformer-based iterative method for learning regular rearrangement of objects in messy rooms. LEGO-Net is partly inspired by diffusion models -- it starts with an initial messy state and iteratively "de-noises'' the position and orientation of objects to a regular state while reducing the distance traveled. Given randomly perturbed object positions and orientations in an existing dataset of professionally-arranged scenes, our method is trained to recover a regular re-arrangement. Results demonstrate that our method is able to reliably rearrange room scenes and outperform other methods. We additionally propose a metric for evaluating regularity in room arrangements using number-theoretic machinery.

Method overview

LEGO-Net takes an input messy scene and attempts to clean the scene via iterative denoising. Given the current scene state, it computes the denoising gradient towards the clean manifold, and makes changes to the scene accordingly. This denoising step is taken repeatedly until the scene is "regular". On the right, we show our backbone transformer block f_theta that computes the denoising gradient at each step. This transformer takes the scene attributes of the current state and outputs 2D transformations of each object that would make the scene "cleaner".