We present LightMover, a framework for controllable light manipulation in single images that leverages video diffusion priors to produce physically plausible illumination changes without re-rendering the scene. We formulate light editing as a sequence-to-sequence prediction problem in visual token space: given an image and light-control tokens, the model adjusts light position, color, and intensity together with the resulting reflections, shadows, and falloff from a single view. This unified treatment of spatial (movement) and appearance (color, intensity) controls improves both manipulation quality and illumination understanding. We further introduce an adaptive token-pruning mechanism that preserves spatially informative tokens while compactly encoding non-spatial attributes, reducing control sequence length by 41% without sacrificing editing fidelity. To train the framework, we construct a scalable rendering pipeline that generates large numbers of image pairs across varied light positions, colors, and intensities while keeping scene content consistent with the original image. LightMover enables precise, independent control over light position, color, and intensity, and achieves high PSNR and strong semantic consistency (DINO, CLIP) across editing tasks.
LightMover enables comprehensive control over light sources, including position, color, and intensity. You can drag light sources to new positions, adjust their color temperature, and control their brightness by moving the sliders.
LightMover repurposes a pre-trained image-to-video diffusion transformer within a sequence-to-sequence generation framework, casting light editing as prediction over visual tokens. The model takes multi-condition frames, including the reference image, object crop, movement map, and optional color/intensity controls. These are processed by a video VAE and a diffusion transformer to predict the edited frame with photometrically consistent lighting changes.
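The conditioning scheme above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the latent grid size, token dimension, and the random stand-in for the video VAE encoder are all assumptions.

```python
import numpy as np

# Hypothetical latent shapes; the actual VAE stride and channel count
# used by LightMover are not specified here.
TOKENS_PER_FRAME = 16 * 16   # e.g. a 16x16 latent grid per frame
DIM = 64                     # latent channel dimension

def encode(frame):
    """Stand-in for the video VAE encoder: frame -> (tokens, dim)."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((TOKENS_PER_FRAME, DIM))

def build_condition_sequence(reference, object_crop, movement_map,
                             color_ctrl=None, intensity_ctrl=None):
    """Concatenate per-condition token blocks into one input sequence
    for the diffusion transformer."""
    frames = [reference, object_crop, movement_map]
    if color_ctrl is not None:
        frames.append(color_ctrl)
    if intensity_ctrl is not None:
        frames.append(intensity_ctrl)
    return np.concatenate([encode(f) for f in frames], axis=0)

seq = build_condition_sequence("ref", "crop", "move", color_ctrl="warm")
print(seq.shape)  # -> (1024, 64): four condition frames of 256 tokens each
```

The transformer then attends jointly over this sequence and the noisy target-frame tokens, so every condition influences the denoised output.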
A key innovation is our adaptive token-pruning strategy. Spatial controls (movement maps) retain fine-grained tokens in small, localized regions and are downsampled elsewhere. Non-spatial controls (color and intensity) are represented with learnable compression ratios. This reduces the control sequence length by 41% while maintaining accurate spatial and illumination control.
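One simple way to realize the pruning idea is to keep full-resolution tokens inside the salient region of the movement map and average-pool the rest. The fixed 2x2 pooling and the toy grid below are assumptions for illustration; the paper uses learnable compression ratios rather than a hard-coded one.

```python
import numpy as np

def prune_tokens(tokens_2d, keep_mask, pool=2):
    """Keep every token in blocks that touch the salient region;
    collapse each fully non-salient pool x pool block to one token.
    tokens_2d: (H, W, D) token grid; keep_mask: (H, W) bool."""
    H, W, D = tokens_2d.shape
    fine, pooled = [], []
    for i in range(0, H, pool):
        for j in range(0, W, pool):
            block = tokens_2d[i:i+pool, j:j+pool]
            if keep_mask[i:i+pool, j:j+pool].any():
                fine.extend(block.reshape(-1, D))       # salient: keep all
            else:
                pooled.append(block.mean(axis=(0, 1)))  # background: pool
    return np.stack(fine + pooled)

# Toy example: a 16x16 grid where an 8x8 region marks the moved light.
H = W = 16
grid = np.zeros((H, W, 4))
mask = np.zeros((H, W), dtype=bool)
mask[4:12, 4:12] = True
out = prune_tokens(grid, mask)
print(f"{len(out)} tokens instead of {H * W}")  # -> 112 tokens instead of 256
```

The reduction achieved depends on how much of the frame is salient; the paper's 41% figure refers to the full control sequence with learned ratios, not this fixed-pool toy.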
To allow the transformer to interpret different signals correctly, we introduce MSPE, which integrates four orthogonal positional subspaces: Spatial, Temporal, Condition-Type, and Frame-Role encoding. This enables the model to reason jointly over spatial alignment and condition interdependence.
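A sketch of how four orthogonal subspaces can be combined: each subspace gets its own block of channels, and the blocks are concatenated per token. The table sizes, dimensions, and the choice of concatenation over summation are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
SUB = 8  # per-subspace channels (illustrative); full encoding is 4 * SUB

# One lookup table per subspace; sizes below are placeholders.
spatial    = rng.standard_normal((256, SUB))  # token position within a frame
temporal   = rng.standard_normal((8, SUB))    # frame index in the sequence
cond_type  = rng.standard_normal((5, SUB))    # reference / crop / movement / color / intensity
frame_role = rng.standard_normal((2, SUB))    # condition frame vs. predicted frame

def mspe(pos, t, ctype, role):
    """Concatenate the four subspace encodings so each occupies its own
    orthogonal block of channels."""
    return np.concatenate([spatial[pos], temporal[t],
                           cond_type[ctype], frame_role[role]])

emb = mspe(pos=17, t=0, ctype=2, role=0)  # e.g. a movement-map token
print(emb.shape)  # -> (32,)
```

Because the subspaces occupy disjoint channels, the model can attend to spatial alignment and condition type independently.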
We construct a scalable rendering pipeline using Blender to generate 32,000 synthetic image pairs spanning combinations of light position, color, and intensity, allowing the model to learn causal illumination effects such as shadow shifting and reflection brightening. We also use the "LightMove-A" dataset of real-world image triplets for evaluation.
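The parameter sweep behind such a pipeline can be enumerated as a Cartesian grid. The factorization below (20 positions x 40 colors x 40 intensities = 32,000) and the sampling ranges are assumptions for illustration; the paper does not specify how the 32,000 pairs are factored.

```python
from itertools import product
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical grid: 20 positions x 40 colors x 40 intensities = 32,000.
positions   = [tuple(p) for p in rng.uniform(-3, 3, size=(20, 3))]  # (x, y, z)
colors      = [tuple(c) for c in rng.uniform(0, 1, size=(40, 3))]   # RGB
intensities = list(np.linspace(50, 2000, 40))                       # light power

configs = list(product(positions, colors, intensities))
print(len(configs))  # -> 32000 render configurations
```

In Blender, each configuration would drive one render of the fixed scene, e.g. by setting `light.location`, `light.data.color`, and `light.data.energy` through the `bpy` API before rendering the pair.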
LightMover accurately handles complex light-matter interactions, including surface shading, reflections, and shadows.
@inproceedings{zhou2026lightmover,
title={LightMover: Generative Light Movement with Color and Intensity Controls},
author={Zhou, Gengze and Wang, Tianyu and Kim, Soo Ye and Shu, Zhixin and Yu, Xin and Hold-Geoffroy, Yannick and Chaturvedi, Sumit and Wu, Qi and Lin, Zhe and Cohen, Scott},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2026}
}