🎨 ReStyle3D

Scene-level Appearance Transfer with Semantic Correspondences

ReStyle3D Facilitates Fast Redesign for Indoor Scenes

ReStyle3D method teaser figure

Overview of ReStyle3D. Given an interior design image (the style image) and a 3D scene captured as a video or multi-view images, ReStyle3D first transfers the appearance to a single view based on semantic correspondences, then lifts the stylization to the remaining viewpoints through 3D-aware style lifting, achieving multi-view-consistent appearance transfer with fine-grained details.

Abstract

We introduce ReStyle3D, a novel framework for scene-level appearance transfer from a single style image to a real-world scene represented by multiple views. The method combines explicit semantic correspondences with multi-view consistency to achieve precise and coherent stylization. Unlike conventional stylization methods that apply a reference style globally, ReStyle3D uses open-vocabulary segmentation to establish dense, instance-level correspondences between the style and real-world images. This ensures that each object is stylized with semantically matched textures. ReStyle3D first transfers the style to a single view using a training-free semantic-attention mechanism in a diffusion model. It then lifts the stylization to additional views via a learned warp-and-refine network guided by monocular depth and pixel-wise correspondences. Experiments show that ReStyle3D consistently outperforms prior methods in structure preservation, perceptual style similarity, and multi-view coherence. User studies further validate its ability to produce photo-realistic, semantically faithful results. Our code, pretrained models, and dataset will be publicly released to support new applications in interior design, virtual staging, and 3D-consistent stylization.
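To make the correspondence step concrete, here is a minimal sketch, assuming a hypothetical per-instance segmentation format: instances detected in the style image and in a scene view are paired by their open-vocabulary label, and the matched pairs are turned into a dense mask stating which style pixels each scene pixel may borrow appearance from. The `Instance` container and the `match_instances` / `correspondence_mask` helpers are illustrative names, not part of the released code.

```python
# Hedged sketch: pair instances in the style image with instances in a scene
# view by their open-vocabulary label, so each object can later receive a
# semantically matched texture. Data format and helper names are assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class Instance:
    label: str         # open-vocabulary class name, e.g. "rug" or "glass table"
    mask: np.ndarray   # boolean mask of shape (H, W)

def match_instances(style_insts, scene_insts):
    """Pair style and scene instances that share the same label.

    When several instances share a label, larger masks are paired first;
    this is just one simple heuristic for the sketch.
    """
    pairs = []
    for label in {inst.label for inst in scene_insts}:
        style_group = sorted((i for i in style_insts if i.label == label),
                             key=lambda i: int(i.mask.sum()), reverse=True)
        scene_group = sorted((i for i in scene_insts if i.label == label),
                             key=lambda i: int(i.mask.sum()), reverse=True)
        pairs.extend(zip(style_group, scene_group))
    return pairs

def correspondence_mask(pairs, style_hw, scene_hw):
    """Build a dense (scene pixel x style pixel) validity matrix.

    Entry [p, q] is True when scene pixel p and style pixel q belong to a
    matched instance pair; attention can then be restricted to these entries.
    """
    m = np.zeros((scene_hw[0] * scene_hw[1], style_hw[0] * style_hw[1]), dtype=bool)
    for style_inst, scene_inst in pairs:
        q = style_inst.mask.reshape(-1)   # style pixels in this instance
        p = scene_inst.mask.reshape(-1)   # scene pixels in this instance
        m[np.ix_(p, q)] = True
    return m
```

In practice such a mask is only tractable at diffusion-latent resolution, where the token grid is small; at that resolution it can be used directly to bias attention, as in the pipeline figure below.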

How it works

Method Stage 1

Stage 1: Semantic Appearance Transfer

Method Stage 2

Stage 2: Multi-view Stylization

Two-stage approach of the ReStyle3D pipeline. The style and source images are first noised back to step T using DDPM inversion [2024]. During generation of the stylized output, the extended self-attention layer transfers style information from the style latent to the output latent. This process is further guided by a semantic matching mask, which allows precise control. Stereo correspondences are extracted from the original image pair and used to warp the stylized image to the second view. To complete the pixels left missing by warping, we train a warp-and-refine model that fills in the stylized image. This model is applied across multiple views within our auto-regressive framework.
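As a rough illustration of the extended self-attention step described above, the following is a minimal sketch, assuming plain single-head attention over flattened latent tokens; the function name, tensor shapes, and the exact way the semantic matching mask enters the attention scores are illustrative assumptions, not the released implementation. Output-latent queries attend jointly to output tokens (preserving structure) and to style tokens (importing appearance), while the mask blocks style tokens that belong to unmatched instances.

```python
# Hedged sketch of masked extended self-attention (PyTorch); single head,
# no projections, for clarity only.
import math
import torch

def masked_extended_self_attention(q_out, k_out, v_out, k_sty, v_sty, sem_mask):
    """q_out, k_out, v_out: (B, N_out, D) tokens of the output latent.
    k_sty, v_sty:          (B, N_sty, D) tokens of the style latent.
    sem_mask:              (N_out, N_sty) bool, True where the output token's
                           instance matches the style token's instance.
    """
    d = q_out.shape[-1]
    n_out = k_out.shape[1]

    # Extended self-attention: output queries attend jointly to output tokens
    # (structure) and to style tokens (appearance).
    k = torch.cat([k_out, k_sty], dim=1)                  # (B, N_out + N_sty, D)
    v = torch.cat([v_out, v_sty], dim=1)
    scores = q_out @ k.transpose(-2, -1) / math.sqrt(d)   # (B, N_out, N_out + N_sty)

    # Semantic matching mask: forbid attention to style tokens whose instance
    # does not correspond to the query's instance, so each object only draws
    # appearance from its semantic counterpart in the style image.
    scores[..., n_out:] = scores[..., n_out:].masked_fill(~sem_mask, float("-inf"))

    attn = torch.softmax(scores, dim=-1)
    return attn @ v                                        # (B, N_out, D)

# Toy smoke test with random tensors.
B, N_out, N_sty, D = 1, 64, 48, 32
q = torch.randn(B, N_out, D)
k1, v1 = torch.randn(B, N_out, D), torch.randn(B, N_out, D)
k2, v2 = torch.randn(B, N_sty, D), torch.randn(B, N_sty, D)
mask = torch.rand(N_out, N_sty) > 0.5
print(masked_extended_self_attention(q, k1, v1, k2, v2, mask).shape)  # (1, 64, 32)
```

The warp-and-refine stage is complementary: the stylized first view is forward-warped to the next view using the extracted correspondences, and the refinement network only needs to complete the pixels the warp leaves empty before the process repeats auto-regressively for the remaining views.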



Semantic Appearance Transfer

Comparison with other methods
Image appearance transfer results. Our method enables precise appearance transfer between semantically corresponding elements, as evidenced by the green rug and glass table (first row), the textured cabinet (second row), and the bedsheets (third row).

End-to-end Results

Applying different styles to the same scene

Transforming different rooms to the same style

Reconstruction on Stylized Images

Stylized Reconstruction

Please make sure "Use hardware acceleration when available" is enabled in Chrome and that WebGL is enabled in Safari.

Citation