ST2360D is a training-free pipeline that exploits the temporal consistency of video depth estimation models to keep depth consistent across the perspective patches extracted from a single 360° image.
360° monocular depth estimation plays a crucial role in scene understanding owing to its 180° × 360° field-of-view (FoV). To mitigate the distortions introduced by equirectangular projection, existing methods typically divide 360° images into less-distorted perspective patches. However, because these patches are processed independently, scale drift among patches often leads to depth inconsistencies. Recently, video depth estimation (VDE) models have leveraged temporal consistency to produce stable depth predictions across frames. Inspired by this, we propose to represent a 360° image as a sequence of perspective frames, mimicking the viewpoint adjustments users make when exploring a 360° scene in virtual reality. The spatial consistency among perspective depth patches can then be enhanced by exploiting the temporal consistency inherent in VDE models. To this end, we introduce a training-free pipeline for 360° monocular depth estimation, called ST2360D. Specifically, ST2360D transforms a 360° image into perspective video frames, predicts video depth maps using VDE models, and seamlessly merges these predictions into a complete 360° depth map. To generate sequenced perspective frames that align with VDE models, we propose two tailored strategies. First, a spherical-uniform sampling (SUS) strategy samples perspective views uniformly across the sphere, avoiding oversampling in polar regions, which typically contain limited structural detail. Second, a latitude-guided scanning (LGS) strategy organizes the frames into a coherent sequence, starting from the equator, prioritizing low-latitude slices, and progressively moving toward higher latitudes. Extensive experiments demonstrate that ST2360D achieves strong zero-shot capability on several datasets, supporting resolutions up to 4K.
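To make the two frame-generation ideas concrete, here is a minimal sketch of how view centres might be sampled and ordered. It is not the ST2360D implementation: the text above does not specify the exact sampling rule, so a Fibonacci lattice stands in for spherical-uniform sampling, and the ordering rule (30° latitude bands, northern hemisphere before southern, then by longitude) is an assumption chosen only to illustrate an equator-first scan. The function names `fibonacci_sphere_views` and `latitude_guided_order` and the parameter `band_deg` are hypothetical.

```python
import math


def fibonacci_sphere_views(n):
    """Return n (lat, lon) view centres in degrees, roughly uniform on the sphere.

    Assumption: a Fibonacci lattice is used as a stand-in for SUS.
    """
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    views = []
    for i in range(n):
        # Spacing points evenly in z = sin(lat) gives equal-area coverage,
        # so the poles are not oversampled.
        z = 1.0 - (2.0 * i + 1.0) / n
        lat = math.degrees(math.asin(z))
        lon = math.degrees((i * golden_angle) % (2.0 * math.pi)) - 180.0
        views.append((lat, lon))
    return views


def latitude_guided_order(views, band_deg=30.0):
    """Order views equator-first (an assumed stand-in for LGS).

    Views are grouped into bands of |latitude|; low bands come first.
    Within a band, northern views precede southern ones, each swept by longitude.
    """
    def key(view):
        lat, lon = view
        band = int(abs(lat) // band_deg)
        return (band, lat < 0, lon)

    return sorted(views, key=key)


if __name__ == "__main__":
    frames = latitude_guided_order(fibonacci_sphere_views(20))
    for lat, lon in frames:
        print(f"lat={lat:7.2f}  lon={lon:8.2f}")
```

Because the sampling is uniform in area, about half of the views land within ±30° of the equator, and the ordering places those views at the start of the sequence before moving toward the poles. In a full pipeline, each (lat, lon) pair would define the centre of one perspective frame passed to the VDE model.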
The quantitative comparison with state-of-the-art (SOTA) monocular 360° depth estimation methods is shown below.