ST²360D

An easy-to-understand overview of our pipeline.

Abstract

360° monocular depth estimation plays a crucial role in scene understanding owing to its 180° × 360° field-of-view (FoV). To mitigate the distortions brought by equirectangular projection, existing methods typically divide 360° images into distortion-less perspective patches. However, since these patches are processed independently, depth inconsistencies are often introduced due to scale drift among patches. Recently, video depth estimation (VDE) models have leveraged temporal consistency for stable depth predictions across frames. Inspired by this, we propose to represent a 360° image as a sequence of perspective frames, mimicking the viewpoint adjustments users make when exploring a 360° scene in virtual reality. Thus, the spatial consistency among perspective depth patches can be enhanced by exploiting the temporal consistency inherent in VDE models. To this end, we introduce a training-free pipeline for 360° monocular depth estimation, called ST²360D. Specifically, ST²360D transforms a 360° image into perspective video frames, predicts video depth maps using VDE models, and seamlessly merges these predictions into a complete 360° depth map. To generate sequenced perspective frames that align with VDE models, we propose two tailored strategies. First, a spherical-uniform sampling (SUS) strategy is proposed to facilitate uniform sampling of perspective views across the sphere, avoiding oversampling in polar regions, which typically contain limited structural detail. Second, a latitude-guided scanning (LGS) strategy is introduced to organize the frames into a coherent sequence, starting from the equator, prioritizing low-latitude slices, and progressively moving toward higher latitudes. Extensive experiments demonstrate that ST²360D achieves strong zero-shot capability on several datasets, supporting resolutions up to 4K.
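The two strategies named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a Fibonacci lattice as one possible way to place view centers roughly uniformly on the sphere (standing in for SUS), and a simple sort by absolute-latitude band as one possible equator-first ordering (standing in for LGS). The function names, the `band_deg` parameter, and the 30° band width are hypothetical choices made for illustration.

```python
import math


def spherical_uniform_views(n):
    """Return n (lat, lon) view centers in degrees, roughly uniform on the sphere.

    Assumption: a Fibonacci lattice is used here as a stand-in for
    spherical-uniform sampling; the paper's exact scheme may differ.
    """
    golden = (1 + math.sqrt(5)) / 2
    views = []
    for i in range(n):
        # Spacing z = sin(lat) evenly gives equal-area latitude coverage,
        # so the poles are not oversampled.
        z = 1 - (2 * i + 1) / n
        lat = math.degrees(math.asin(z))
        # Golden-ratio steps in longitude spread the points around the sphere.
        lon = (360.0 * i / golden) % 360.0 - 180.0
        views.append((lat, lon))
    return views


def latitude_guided_order(views, band_deg=30.0):
    """Order views equator-first (hypothetical stand-in for latitude-guided scanning).

    Views are grouped into bands of absolute latitude; low-latitude bands come
    first, and each band is swept by hemisphere and then by longitude.
    """
    def key(view):
        lat, lon = view
        band = int(abs(lat) // band_deg)
        return (band, lat < 0, lon)

    return sorted(views, key=key)


views = spherical_uniform_views(100)
frames = latitude_guided_order(views)
polar = sum(1 for lat, _ in views if abs(lat) > 60.0)
print(f"{polar} of {len(views)} views lie above 60 degrees latitude")
```

By contrast, a regular grid over an equirectangular image would place a third of its rows above 60° latitude, even though that region covers only about 13% of the sphere's area. In a full pipeline, each ordered view center would then be rendered as a perspective frame and passed to a VDE model as one frame of a video.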

Comparison with SOTA monocular 360 depth methods

The quantitative comparison with SOTA monocular 360 depth estimation methods is shown below.

ST²360D: Spatial-to-Temporal Consistency for Training-free 360 Monocular Depth Estimation

NeurIPS 2025

Zidong Cao¹ Jinjing Zhu¹ Hao Ai² Lutao Jiang¹ Yuanhuiyi Lyu¹ Hui Xiong¹,³†

¹AI Thrust, HKUST(GZ) ²University of Birmingham ³HKUST

† Corresponding author

ST²360D is a training-free pipeline that leverages temporal consistency in video depth estimation models for depth consistency across different perspective patches from a single 360 image.
