Visual-prompt detection — Moondream 2 fine-tune

This is image-as-prompt detection: instead of asking the model in words ("find all the dogs"), you give it a picture of one example, and it finds every other instance of that thing in your target image.

The fine-tune wires a single mean-pooled vision embedding of the query image into the same prompt slot the tokenizer would normally fill with a class name.
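To make the idea concrete, here is a minimal sketch of mean pooling followed by slot filling. It is not the fine-tune's actual code: the function names, the toy dimensions, and the prefix/suffix embeddings are all hypothetical, and plain Python lists stand in for tensors so the example has no dependencies.

```python
from statistics import fmean

def mean_pool(patch_embeddings):
    """Average a list of equal-length patch vectors into one vector."""
    return [fmean(column) for column in zip(*patch_embeddings)]

def fill_prompt_slot(prefix, query_vector, suffix):
    """Put the single pooled vector where class-name token embeddings would sit."""
    return prefix + [query_vector] + suffix

# Toy data: 4 patches of dimension 3 (a real encoder produces far more of both).
patches = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 2.0],
    [1.0, 4.0, 2.0],
    [3.0, 4.0, 2.0],
]
prefix = [[0.1, 0.1, 0.1]]  # hypothetical embeddings that precede the slot
suffix = [[0.9, 0.9, 0.9]]  # hypothetical embeddings that follow the slot

query_vector = mean_pool(patches)
prompt = fill_prompt_slot(prefix, query_vector, suffix)
print(query_vector)  # [2.0, 2.0, 2.0]
print(len(prompt))   # 3
```

However many patches the query crop produces, the slot always receives exactly one vector, so the prompt length does not depend on the crop size.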

Upload an image, paint over one example object with the brush, then click Detect. The painted region is auto-cropped and used as the visual prompt, and the model then draws boxes around every other matching instance in the same image.
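One plausible way to auto-crop a painted region is to take the bounding box of the brush mask and slice the image to it. The sketch below assumes exactly that. The demo's real cropping logic (padding, thresholds, mask format) is not described here, so the helper names and the nested-list image representation are illustrative only.

```python
def mask_bbox(mask):
    """Return (left, top, right, bottom) around painted pixels, or None.

    Right and bottom are exclusive, so they can be used directly as slice ends.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None  # nothing was painted
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return cols[0], rows[0], cols[-1] + 1, rows[-1] + 1

def crop(image, box):
    """Slice a row-major image (list of rows) to the given box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]

# Toy 5x4 mask: 1 marks a painted pixel.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
# Toy image whose pixel value encodes its position as 10*y + x.
image = [[10 * y + x for x in range(5)] for y in range(4)]

box = mask_bbox(mask)
print(box)               # (1, 1, 4, 3)
print(crop(image, box))  # [[11, 12, 13], [21, 22, 23]]
```

An empty mask returns None rather than a zero-sized box, which lets a caller ask the user to paint something before running detection.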

Target image — paint one example object

Max objects to detect (1 to 50)

Detections

Auto-extracted query crop

Tips: the model was trained on LVIS objects and generalizes best when the query is a clean, tightly cropped example. Heavy occlusion or unusual viewpoints in the query lower recall.