Visual-prompt detection — Moondream 2 fine-tune
This is image-as-prompt detection: instead of asking the model in words ("find all the dogs"), you give it a picture of one example, and it finds every other instance of that thing in your target image.
The fine-tune feeds a single mean-pooled vision embedding of the query image into the prompt slot that would normally hold the token embeddings of a class name.
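To make the mean-pooling and prompt-slot idea concrete, here is a minimal NumPy sketch. It is not the fine-tune's actual code: the function names (`mean_pool`, `splice_visual_prompt`), the embedding width, and the random arrays standing in for encoder and text embeddings are all assumptions made for illustration.

```python
import numpy as np

def mean_pool(patch_embeddings: np.ndarray) -> np.ndarray:
    """Average (num_patches, dim) patch embeddings into one (dim,) vector."""
    return patch_embeddings.mean(axis=0)

def splice_visual_prompt(prefix: np.ndarray,
                         query_vec: np.ndarray,
                         suffix: np.ndarray) -> np.ndarray:
    """Place the pooled query vector where the class-name embeddings would go.

    prefix and suffix are (n, dim) embeddings of the prompt text around the
    slot; the result is a single (n_prefix + 1 + n_suffix, dim) sequence.
    """
    return np.concatenate([prefix, query_vec[None, :], suffix], axis=0)

# Hypothetical stand-ins: a real pipeline would take these from the vision
# encoder and the text embedding table, with a much larger dim.
rng = np.random.default_rng(0)
dim = 8
query_patches = rng.normal(size=(16, dim))  # vision features of the query crop
prefix = rng.normal(size=(3, dim))          # prompt embeddings before the slot
suffix = rng.normal(size=(2, dim))          # prompt embeddings after the slot

query_vec = mean_pool(query_patches)
prompt = splice_visual_prompt(prefix, query_vec, suffix)
print(prompt.shape)  # (6, 8)
```

Because the query is pooled to one vector, the slot always has length one regardless of the crop's size, at the cost of discarding the spatial layout of the query.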
Upload an image, paint over one example object with the brush, then hit Detect. The painted region is auto-cropped and used as the visual prompt — the model will draw boxes around every other matching instance in the same image.
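The auto-crop step can be sketched as taking the bounding box of the painted pixels. This is an assumption about how such a crop might be computed, not the demo's confirmed implementation; `crop_to_painted_region` and its `pad` parameter are hypothetical names.

```python
import numpy as np

def crop_to_painted_region(image: np.ndarray, mask: np.ndarray, pad: int = 0):
    """Crop image to the bounding box of the painted pixels in mask.

    image: (H, W, C) array; mask: (H, W) boolean array, True where painted.
    Returns the crop and its box as (left, top, right, bottom), with the
    right and bottom edges exclusive.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("nothing was painted, so there is no region to crop")
    h, w = mask.shape
    # Expand by pad pixels, clamped to the image bounds.
    top = max(int(ys.min()) - pad, 0)
    bottom = min(int(ys.max()) + 1 + pad, h)
    left = max(int(xs.min()) - pad, 0)
    right = min(int(xs.max()) + 1 + pad, w)
    return image[top:bottom, left:right], (left, top, right, bottom)

# Toy example: a 100x120 image with a painted rectangle.
image = np.zeros((100, 120, 3), dtype=np.uint8)
mask = np.zeros((100, 120), dtype=bool)
mask[20:40, 30:70] = True

crop, box = crop_to_painted_region(image, mask, pad=2)
print(crop.shape, box)  # (24, 44, 3) (28, 18, 72, 42)
```

A small pad keeps a little context around the object, which may help when the brush stroke does not cover the object's edges.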
The examples are (query crop, target image) pairs sampled from the LVIS validation split. Click one to load it, then hit Detect to run the model.
Tips: the model was trained on LVIS objects, and it generalizes best when the query crop is a clean, tightly cropped example of a single object. Heavy occlusion or unusual viewpoints in the query lower recall.