Envisage: Diffusion-Based Rhinoplasty Goal Visualization with Mask-Decomposed Evaluation

Abstract

Localized generative editing needs localized evaluation: full-image identity metrics are structurally confounded under hard-composited edits. We present Envisage, a FLUX.1-Fill inpainting reference pipeline for rhinoplasty goal visualization from a single frontal photograph. The pipeline combines 8 rhinoplasty clinical presets (the released framework also includes 8 blepharoplasty and 8 rhytidectomy presets), MediaPipe masks, and hard-mask compositing. The composite preserves outside-mask pixels by construction, so full-face identity scores are dominated by copied pixels rather than by the diffusion backbone. Because full-face identity metrics cannot grade localized edits, we introduce SurgicalScore, a mask-decomposed 0-1 protocol scoring edit direction, edit magnitude, masked LPIPS, realism, and outside-mask preservation; SSraw assigns 0.919 [0.918, 0.920] to a perfect-predictor control , anchoring the ceiling. On N=211, the paired ArcFace gain (output-to-GT minus input-to-GT) is negative for all methods (Envisage -0.048 smallest, vs. ICEdit -0.139, Kontext -0.242, InstructPix2Pix -0.294; p < 1e-4), with external validation on a 457-pair ASPS/PCA corpus showing a larger negative gap. With SurgicalScore, Envisage achieves the highest score (0.599 [0.579, 0.619]) and leads on both metrics, but the all-negative ArcFace gap shows that full-face identity is poorly aligned with localized surgical accuracy under hard compositing. A 5-seed GT-oracle (an upper bound, not a deployable result) reduces the residual ArcFace gap by 73% (-0.054 to -0.015), with positive output-to-GT gain on 33.9% of cases, indicating candidate-space headroom for a learned ranker. For localized edits, progress should be measured with edit-region fidelity rather than full-face identity metrics. We release Envisage, SurgicalScore, preset definitions, and matched split manifests.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…