ST2360D : Spatial-to-Temporal Consistency for Training-free 360 Monocular Depth Estimation

NeurIPS 2025

Zidong Cao¹     Jinjing Zhu¹    Hao Ai²    Lutao Jiang¹    Yuanhuiyi Lyu¹    Hui Xiong¹,³†
¹AI Thrust, HKUST(GZ)      ²University of Birmingham      ³HKUST
† Corresponding author

ST2360D is a training-free pipeline that leverages the temporal consistency of video depth estimation models to keep depth consistent across the perspective patches extracted from a single 360° image.

An easy-to-follow overview of our pipeline.

Abstract

360° monocular depth estimation plays a crucial role in scene understanding owing to its 180° × 360° field-of-view (FoV). To mitigate the distortions introduced by equirectangular projection, existing methods typically divide 360° images into distortion-less perspective patches. However, since these patches are processed independently, depth inconsistencies are often introduced due to scale drift among patches. Recently, video depth estimation (VDE) models have leveraged temporal consistency for stable depth predictions across frames. Inspired by this, we propose to represent a 360° image as a sequence of perspective frames, mimicking the viewpoint adjustments users make when exploring a 360° scene in virtual reality. Thus, the spatial consistency among perspective depth patches can be enhanced by exploiting the temporal consistency inherent in VDE models. To this end, we introduce a training-free pipeline for 360° monocular depth estimation, called ST2360D. Specifically, ST2360D transforms a 360° image into perspective video frames, predicts video depth maps using VDE models, and seamlessly merges these predictions into a complete 360° depth map. To generate sequenced perspective frames that align with VDE models, we propose two tailored strategies. First, a spherical-uniform sampling (SUS) strategy is proposed to facilitate uniform sampling of perspective views across the sphere, avoiding oversampling in polar regions, which typically contain limited structural detail. Second, a latitude-guided scanning (LGS) strategy is introduced to organize the frames into a coherent sequence, starting from the equator, prioritizing low-latitude slices, and progressively moving toward higher latitudes. Extensive experiments demonstrate that ST2360D achieves strong zero-shot capability on several datasets, supporting resolutions up to 4K.
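The two frame-generation strategies described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the Fibonacci lattice used for near-uniform sampling, the band count, the north-before-south tie-break, and the function names `spherical_uniform_views`, `latitude_band`, and `latitude_guided_order` are all assumptions made for illustration.

```python
import math


def spherical_uniform_views(n):
    """Return n (lat, lon) view centres in radians, roughly uniform on the sphere.

    Assumption: a Fibonacci lattice stands in for spherical-uniform sampling.
    Spacing z = sin(lat) evenly gives each view an equal share of the sphere's
    area, so the poles are not oversampled.
    """
    golden = (1 + math.sqrt(5)) / 2
    views = []
    for i in range(n):
        z = 1 - (2 * i + 1) / n  # evenly spaced in z, strictly inside (-1, 1)
        lat = math.asin(z)
        lon = (2 * math.pi * i / golden) % (2 * math.pi) - math.pi
        views.append((lat, lon))
    return views


def latitude_band(lat, n_bands):
    """Index of the |latitude| band: 0 is the equatorial band."""
    band_height = (math.pi / 2) / n_bands
    return min(int(abs(lat) / band_height), n_bands - 1)


def latitude_guided_order(views, n_bands=5):
    """Order views from the equator outward (hypothetical ordering rule).

    Views are grouped by |latitude| band, lowest band first; inside a band the
    northern views come before the southern ones, each swept by longitude.
    """
    def key(view):
        lat, lon = view
        return (latitude_band(lat, n_bands), lat < 0, lon)

    return sorted(views, key=key)


if __name__ == "__main__":
    frames = latitude_guided_order(spherical_uniform_views(40))
    for lat, lon in frames[:5]:
        print(f"lat={math.degrees(lat):6.1f}  lon={math.degrees(lon):7.1f}")
```

The resulting ordered list of view centres would then be rendered into perspective frames and passed to a VDE model as if it were a video; how neighbouring frames should overlap is left open in this sketch.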

[Figure: ST2360D pipeline]
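Both extracting perspective frames from a 360° image and merging the predicted depth back rely on the mapping between perspective pixels and equirectangular coordinates. The sketch below shows one common way to write that mapping (a pinhole camera rotated to a given view centre). The axis conventions, the pixel-centre convention, and the function names `pixel_to_sphere` and `sphere_to_equirect` are assumptions for illustration, not details taken from the paper.

```python
import math


def pixel_to_sphere(u, v, width, height, fov_deg, lat0, lon0):
    """Map a perspective pixel (u, v) to (lat, lon) in radians on the sphere.

    Assumed conventions: x right, y up, z forward; the view centre
    (lat0, lon0) is where the optical axis points; fov_deg is the
    horizontal field of view.
    """
    cx, cy = (width - 1) / 2, (height - 1) / 2
    f = (width / 2) / math.tan(math.radians(fov_deg) / 2)  # focal length in pixels

    # Ray through the pixel in camera coordinates, normalised to unit length.
    x, y, z = u - cx, -(v - cy), f
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm

    # Pitch the camera up by lat0 (rotation about the x axis).
    y, z = (y * math.cos(lat0) + z * math.sin(lat0),
            -y * math.sin(lat0) + z * math.cos(lat0))
    # Yaw the camera by lon0 (rotation about the y axis).
    x, z = (x * math.cos(lon0) + z * math.sin(lon0),
            -x * math.sin(lon0) + z * math.cos(lon0))

    lat = math.asin(max(-1.0, min(1.0, y)))
    lon = math.atan2(x, z)
    return lat, lon


def sphere_to_equirect(lat, lon, eq_width, eq_height):
    """Map (lat, lon) in radians to continuous equirectangular pixel coordinates."""
    x = (lon / (2 * math.pi) + 0.5) * eq_width
    y = (0.5 - lat / math.pi) * eq_height
    return x, y


if __name__ == "__main__":
    lat, lon = pixel_to_sphere(50, 50, 101, 101, 90.0,
                               math.radians(30), math.radians(-45))
    print(sphere_to_equirect(lat, lon, 2048, 1024))
```

Sampling the equirectangular image at the returned coordinates for every pixel of a frame yields the perspective view. The same mapping tells a merging step where each perspective depth value lands in the 360° depth map; how overlapping predictions are blended is not covered here.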

Comparison with SOTA monocular 360 depth methods

The quantitative comparison with SOTA monocular 360 depth estimation methods is shown below.

[Figures: quantitative comparison with SOTA monocular 360 depth estimation methods]