We propose a method to learn explicit, class-conditioned spatial priors for object placement in natural scenes by distilling the implicit placement knowledge encoded in text-conditioned diffusion models. Prior work relies either on manually annotated data, which is inherently limited in scale, or on inpainting-based object-removal pipelines, whose artifacts promote shortcut learning. To address these limitations, we introduce a fully automated and scalable framework that evaluates dense object placements on high-quality real backgrounds using a diffusion-based inpainting pipeline. With this pipeline, we construct HiddenObjects, a large-scale dataset of 27M placement annotations across 27k distinct scenes, with ranked bounding-box insertions per image and object category. Experimental results show that our spatial priors outperform sparse human annotations on a downstream image editing task (3.90 vs. 2.68 VLM-Judge score), and significantly surpass existing placement baselines and zero-shot Vision-Language Models. Furthermore, we distill these priors into a lightweight model for fast, practical inference (230,000× faster than the full pipeline). Our dataset and code will be made publicly available upon release.
27M
Placement annotations
27k
Real-world background scenes from Places365
50
Foreground Object categories from COCO
3.77 ms
Inference latency after distillation
3.90
vs. 2.68 for sparse human annotations via ImgEdit-Judge
We construct HiddenObjects by pairing 27k background images from 126 scene categories of Places365 with 50 COCO foreground categories across 10 macro-classes (food, furniture, animal, vehicle, …). Each scene–category pair is annotated with 1,004 ranked bounding box proposals, yielding 27M placement annotations in total.
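As a sanity check, the reported counts are mutually consistent if each scene is annotated for roughly one foreground category (our reading of the numbers, not a stated pipeline detail):

```python
# Sanity check on the reported dataset scale.
# Assumption (ours): ~one annotated foreground category per scene.
scenes = 27_000
proposals_per_scene = 1_004  # ranked bounding box proposals per scene
total = scenes * proposals_per_scene
print(f"{total:,} placement annotations")  # 27,108,000, i.e. ~27M
```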
| Dataset | # Boxes | Pos. / inst. | Neg. / inst. | Real Img. | Dense | Scalable | Clean BG |
|---|---|---|---|---|---|---|---|
| DreamEditBench | 220 | 1 | 0 | ✓ | ✗ | ✗ | ✓ |
| MureCom | 640 | 1 | 0 | ✓ | ✗ | ✗ | ✓ |
| OPA | 0.07M | 1.9 | 3.8 | ✓ | ✗ | ✗ | ✓ |
| OPA-ext | 0.15M | 2.2 | 4.3 | ✓ | ✗ | ✗ | ✓ |
| PIPE | 0.89M | 1 | 0 | ✗ | ✗ | ✓ | ✗ |
| SAM-FB | 3M | 1 | 0 | ✗ | ✗ | ✓ | ✗ |
| HiddenObjects | 27M | 77.7 | 926.3 | ✓ | ✓ | ✓ | ✓ |
Placement annotations are overlaid on each image: green = accepted · red = rejected.
① Inpainter
Qwen-Image + ControlNet synthesises class-consistent object insertions at each bounding box in a sliding-window grid. ControlNet conditioning enforces scene geometry, so physically implausible placements fail to produce the object.
② Verifier
Grounded-SAM-2 detects whether the target object was successfully inpainted in the proposed region. Low-confidence detections, wrong categories, and object replacements are discarded.
③ Ranker
ImageReward assigns a human-preference score to each verified insertion. The ranked scores define the final spatial prior distribution over all valid locations.
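The three stages can be sketched as a single loop over candidate boxes. The model calls below are hypothetical stubs standing in for Qwen-Image + ControlNet, Grounded-SAM-2, and ImageReward; the real pipeline operates on images, not placeholders:

```python
from dataclasses import dataclass

@dataclass
class Placement:
    box: tuple    # (x, y, w, h) in pixels
    score: float  # preference score from the ranker

# Hypothetical stubs for the three models in the pipeline.
def inpaint(background, category, box):        # (1) Inpainter
    return background                          # stub: would return the edited image

def verify(image, category, box, thresh=0.5):  # (2) Verifier
    return True                                # stub: would run open-vocab detection

def reward(image, category):                   # (3) Ranker
    return 0.0                                 # stub: would return a preference score

def annotate(background, category, boxes):
    """Evaluate every candidate box; keep and rank the verified insertions."""
    placements = []
    for box in boxes:
        edited = inpaint(background, category, box)
        if not verify(edited, category, box):
            continue  # discard failed or wrong-category insertions
        placements.append(Placement(box, reward(edited, category)))
    # Higher reward first: the ranked scores form the spatial prior.
    return sorted(placements, key=lambda p: p.score, reverse=True)
```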
Example spatial prior creation. Our pipeline performs an exhaustive sliding-window search on a kitchen scene for pizza insertions.
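A minimal sketch of such a sliding-window grid (box size and stride are illustrative choices, not the pipeline's actual hyperparameters):

```python
def sliding_window_boxes(img_w, img_h, box_w, box_h, stride):
    """Enumerate candidate (x, y, w, h) boxes on a regular grid.

    Box size and stride are illustrative assumptions; the real
    pipeline's grid parameters are not specified here.
    """
    boxes = []
    for y in range(0, img_h - box_h + 1, stride):
        for x in range(0, img_w - box_w + 1, stride):
            boxes.append((x, y, box_w, box_h))
    return boxes

grid = sliding_window_boxes(512, 512, 128, 128, 64)
print(len(grid))  # 7 positions per axis -> 49 candidate boxes at this scale
```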
Composite Overlay of Inpaintings. Multiple valid inpaintings are aggregated into a single composite scene. Accepted objects are segmented and overlaid on the source background. Insertions are ordered by preference ranking so the highest ranked object remains unoccluded at the top of the stack.
We evaluate the annotations via inpainting conditioned on input bounding boxes using ImgEdit-Judge (Qwen2.5-VL, scale 1–5) across three axes: Prompt Compliance (PC), Visual Naturalness (VN), and Physical & Detail Coherence (PDC).
| Method | Test Set | PC ↑ | VN ↑ | PDC ↑ | Avg ↑ |
|---|---|---|---|---|---|
| Raw Background | HiddenObjects | 1.04 | 1.03 | 1.03 | 1.04 |
| Full Mask | HiddenObjects | 1.62 | 1.60 | 1.59 | 1.60 |
| Random BBox | HiddenObjects | 2.73 | 2.62 | 2.62 | 2.65 |
| Ours (Pipeline) | HiddenObjects | 3.83 | 3.63 | 3.63 | 3.69 |
| Raw Background | OPA | 1.00 | 1.00 | 1.00 | 1.00 |
| Full Mask | OPA | 1.66 | 1.66 | 1.66 | 1.66 |
| Human Annotation | OPA | 2.72 | 2.66 | 2.66 | 2.68 |
| Random BBox | OPA | 2.97 | 2.84 | 2.84 | 2.89 |
| Ours (Pipeline) | OPA | 4.05 | 3.83 | 3.83 | 3.90 |
Our pipeline outperforms all baselines, including human annotations (3.90 vs. 2.68 average score). Manual placement labels are inherently ambiguous: under the same judge they score no better than random bounding box proposals (2.68 vs. 2.89).
Qualitative comparison of annotation quality. Our pipeline produces spatially coherent placements that respect scene geometry, while baselines often place objects in implausible locations or fail to produce realistic composites.
We distill our spatial priors into a lightweight transformer that predicts ranked bounding boxes directly from a background image and a text category. The distilled model runs at 3.77 ms/image, a 230,000× speedup over the full pipeline, while preserving the quality of the learned priors.
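As a back-of-the-envelope check, the reported latency and speedup together imply that the full pipeline spends on the order of fifteen minutes per image:

```python
# Implied full-pipeline cost from the two reported numbers.
distilled_ms = 3.77      # distilled model latency per image
speedup = 230_000        # reported speedup factor
pipeline_s = distilled_ms * speedup / 1000  # ms -> s
print(f"{pipeline_s:.1f} s = {pipeline_s / 60:.1f} min per image")
```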
Distilled model architecture. A transformer encoder–decoder processes a background image and a CLIP text embedding for the target category, and directly predicts a set of ranked bounding boxes as the spatial prior.
We benchmark the distilled model against zero-shot VLMs and dedicated placement models on the HiddenObjects test set.
| Method | Train Set | mAP (%) | IoU50@1 (%) | IoU@1 (%) | IoU50@5 (%) | IoU@5 (%) |
|---|---|---|---|---|---|---|
| *Zero-shot Vision-Language Models* | | | | | | |
| Qwen2.5-VL-72B | — | 1.3 | 7.9 | 14.0 | 12.7 | 18.8 |
| Qwen2.5-VL-7B | — | 1.4 | 8.6 | 13.9 | 10.1 | 16.5 |
| InternVL3-8B | — | 0.3 | 3.3 | 11.2 | 9.2 | 18.7 |
| LLaVA-OV-7B | — | 1.0 | 11.7 | 20.9 | 16.3 | 24.7 |
| *Dedicated Object Placement Models* | | | | | | |
| BootPlace | Cityscapes | 0.1 | 0.9 | 6.0 | 0.9 | 6.0 |
| TerseNet | OPA | 1.4 | 20.9 | 29.5 | 20.9 | 29.5 |
| GracoNet | OPA | 2.1 | 23.7 | 35.9 | 23.7 | 35.9 |
| PlaceNet | OPA | 2.9 | 43.5 | 42.4 | 43.5 | 42.4 |
| Ours | OPA | 28.0 | 36.4 | 33.0 | 51.9 | 57.6 |
| Ours | HiddenObjects | 56.6 | 62.9 | 55.2 | 79.1 | 67.7 |
Our distilled model outperforms all baselines by a large margin (56.6% vs. 2.9% mAP), demonstrating that dense reward-weighted spatial priors are essential for learning robust placement distributions.
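For reference, a plausible reading of the IoU@k columns is the best IoU achieved by any of the model's top-k ranked boxes against the ground truth; the sketch below implements that reading, which is our assumption rather than a definition stated here:

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def iou_at_k(gt_box, ranked_preds, k):
    """Best IoU achieved by any of the top-k ranked predicted boxes."""
    return max(iou(gt_box, p) for p in ranked_preds[:k])
```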
@article{schouten2026hiddenobjects,
  title   = {HiddenObjects: Scalable Diffusion-Distilled Spatial Priors for Object Placement},
  author  = {Schouten, Marco and Siglidis, Ioannis and Belongie, Serge and Papadopoulos, Dim P.},
  journal = {arXiv preprint},
  year    = {2026}
}