We present LightMover, a framework for controllable light manipulation in single images that leverages video diffusion priors to produce physically plausible illumination changes without re-rendering the scene. We formulate light editing as a sequence-to-sequence prediction problem in visual token space: given an image and light-control tokens, the model adjusts light position, color, and intensity together with the resulting reflections, shadows, and falloff from a single view. This unified treatment of spatial (movement) and appearance (color, intensity) controls improves both manipulation quality and illumination understanding. We further introduce an adaptive token-pruning mechanism that preserves spatially informative tokens while compactly encoding non-spatial attributes, reducing control sequence length by 41% without sacrificing editing fidelity. To train the framework, we construct a scalable rendering pipeline that generates large numbers of image pairs across varied light positions, colors, and intensities while keeping scene content consistent with the original image. LightMover enables precise, independent control over light position, color, and intensity, and achieves high PSNR and strong semantic consistency (DINO, CLIP) across editing tasks.
LightMover enables comprehensive control over light sources, including position, color, and intensity. You can drag light sources to new positions, adjust their color temperature, and control their brightness by moving the sliders.
LightMover repurposes a pre-trained image-to-video diffusion transformer within a sequence-to-sequence generation framework, casting light editing as prediction over visual tokens. The model takes multi-condition frames, including the reference image, object crop, movement map, and optional color/intensity controls. These are processed by a video VAE and a diffusion transformer to predict the edited frame with photometrically consistent lighting changes.
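The conditioning scheme above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the latent grid size, token dimension, and the random stand-in for the video VAE encoder are all assumptions.

```python
import numpy as np

# Hypothetical latent shapes; the actual VAE stride and channel count
# used by LightMover are not specified here.
TOKENS_PER_FRAME = 16 * 16   # e.g. a 16x16 latent grid per frame
DIM = 64                     # latent channel dimension

def encode(frame):
    """Stand-in for the video VAE encoder: frame -> (tokens, dim)."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((TOKENS_PER_FRAME, DIM))

def build_condition_sequence(reference, object_crop, movement_map,
                             color_ctrl=None, intensity_ctrl=None):
    """Concatenate per-condition token blocks into one input sequence
    for the diffusion transformer."""
    frames = [reference, object_crop, movement_map]
    if color_ctrl is not None:
        frames.append(color_ctrl)
    if intensity_ctrl is not None:
        frames.append(intensity_ctrl)
    return np.concatenate([encode(f) for f in frames], axis=0)

seq = build_condition_sequence("ref", "crop", "move", color_ctrl="warm")
print(seq.shape)  # -> (1024, 64): four condition frames of 256 tokens each
```

The transformer then attends jointly over this sequence and the noisy target-frame tokens, so every condition influences the denoised output.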
A key innovation is our adaptive token-pruning strategy. Spatial controls (movement maps) retain fine-grained tokens in small, localized regions and are downsampled elsewhere. Non-spatial controls (color and intensity) are represented with learnable compression ratios. This reduces the control sequence length by 41% while maintaining accurate spatial and illumination control.
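One simple way to realize the pruning idea is to keep full-resolution tokens inside the salient region of the movement map and average-pool the rest. The fixed 2x2 pooling and the toy grid below are assumptions for illustration; the paper uses learnable compression ratios rather than a hard-coded one.

```python
import numpy as np

def prune_tokens(tokens_2d, keep_mask, pool=2):
    """Keep every token in blocks that touch the salient region;
    collapse each fully non-salient pool x pool block to one token.
    tokens_2d: (H, W, D) token grid; keep_mask: (H, W) bool."""
    H, W, D = tokens_2d.shape
    fine, pooled = [], []
    for i in range(0, H, pool):
        for j in range(0, W, pool):
            block = tokens_2d[i:i+pool, j:j+pool]
            if keep_mask[i:i+pool, j:j+pool].any():
                fine.extend(block.reshape(-1, D))       # salient: keep all
            else:
                pooled.append(block.mean(axis=(0, 1)))  # background: pool
    return np.stack(fine + pooled)

# Toy example: a 16x16 grid where an 8x8 region marks the moved light.
H = W = 16
grid = np.zeros((H, W, 4))
mask = np.zeros((H, W), dtype=bool)
mask[4:12, 4:12] = True
out = prune_tokens(grid, mask)
print(f"{len(out)} tokens instead of {H * W}")  # -> 112 tokens instead of 256
```

The reduction achieved depends on how much of the frame is salient; the paper's 41% figure refers to the full control sequence with learned ratios, not this fixed-pool toy.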
To allow the transformer to interpret different signals correctly, we introduce MSPE, which integrates four orthogonal positional subspaces: Spatial, Temporal, Condition-Type, and Frame-Role encoding. This enables the model to reason jointly over spatial alignment and condition interdependence.
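A sketch of how four orthogonal subspaces can be combined: each subspace gets its own block of channels, and the blocks are concatenated per token. The table sizes, dimensions, and the choice of concatenation over summation are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
SUB = 8  # per-subspace channels (illustrative); full encoding is 4 * SUB

# One lookup table per subspace; sizes below are placeholders.
spatial    = rng.standard_normal((256, SUB))  # token position within a frame
temporal   = rng.standard_normal((8, SUB))    # frame index in the sequence
cond_type  = rng.standard_normal((5, SUB))    # reference / crop / movement / color / intensity
frame_role = rng.standard_normal((2, SUB))    # condition frame vs. predicted frame

def mspe(pos, t, ctype, role):
    """Concatenate the four subspace encodings so each occupies its own
    orthogonal block of channels."""
    return np.concatenate([spatial[pos], temporal[t],
                           cond_type[ctype], frame_role[role]])

emb = mspe(pos=17, t=0, ctype=2, role=0)  # e.g. a movement-map token
print(emb.shape)  # -> (32,)
```

Because the subspaces occupy disjoint channels, the model can attend to spatial alignment and condition type independently.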
We construct a scalable rendering pipeline using Blender to generate 32,000 synthetic image pairs spanning combinations of light position, color, and intensity, allowing the model to learn causal illumination effects such as shadow shifting and reflection brightening. We also use the "LightMove-A" dataset of real-world image triplets for evaluation.
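The parameter sweep behind such a pipeline can be enumerated as a Cartesian grid. The factorization below (20 positions x 40 colors x 40 intensities = 32,000) and the sampling ranges are assumptions for illustration; the paper does not specify how the 32,000 pairs are factored.

```python
from itertools import product
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical grid: 20 positions x 40 colors x 40 intensities = 32,000.
positions   = [tuple(p) for p in rng.uniform(-3, 3, size=(20, 3))]  # (x, y, z)
colors      = [tuple(c) for c in rng.uniform(0, 1, size=(40, 3))]   # RGB
intensities = list(np.linspace(50, 2000, 40))                       # light power

configs = list(product(positions, colors, intensities))
print(len(configs))  # -> 32000 render configurations
```

In Blender, each configuration would drive one render of the fixed scene, e.g. by setting `light.location`, `light.data.color`, and `light.data.energy` through the `bpy` API before rendering the pair.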
LightMover accurately handles complex light-matter interactions, including surface shading, reflections, and shadows.
@inproceedings{zhou2026lightmover,
title={LightMover: Generative Light Movement with Color and Intensity Controls},
author={Zhou, Gengze and Wang, Tianyu and Kim, Soo Ye and Shu, Zhixin and Yu, Xin and Hold-Geoffroy, Yannick and Chaturvedi, Sumit and Wu, Qi and Lin, Zhe and Cohen, Scott},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}