Multimodal models like CLIP are everywhere, yet their adversarial robustness remains poorly understood.
The cross-modal threat
CLIP maps images and text to the same embedding space. This enables zero-shot classification, image search, and more. But it also means adversarial perturbations can transfer across modalities.
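The shared-space mechanic can be sketched with toy vectors. In a real system the embeddings come from CLIP's image and text encoders; the hand-made unit vectors and prompt strings below are hypothetical stand-ins used only to show how zero-shot classification reduces to nearest-text-embedding by cosine similarity.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

# Hypothetical embeddings standing in for CLIP encoder outputs.
image_emb = normalize(np.array([0.9, 0.1, 0.0]))
text_embs = {
    "a photo of a cat": normalize(np.array([1.0, 0.0, 0.1])),
    "a photo of a dog": normalize(np.array([0.0, 1.0, 0.1])),
}

# Zero-shot classification: pick the text embedding closest to the image.
scores = {label: float(image_emb @ emb) for label, emb in text_embs.items()}
prediction = max(scores, key=scores.get)
print(prediction)
```

Because both modalities land in one space, this same similarity computation is what an attacker can target: moving an image embedding toward an arbitrary text embedding changes which prompt wins.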
Our attack
We extend projected gradient descent (PGD) to the multimodal setting: perturb an image to push its embedding toward a target text embedding. The perturbation is imperceptible, yet the model's behavior changes dramatically.
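The attack loop can be sketched as follows. This is a minimal illustration, not our released toolkit: a fixed linear map plus L2 normalization stands in for CLIP's image tower (so the gradient is computable by hand), and the epsilon, step size, and step count are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for an image encoder: linear map + L2 normalization.
W = rng.normal(size=(64, 256))

def embed(x):
    z = W @ x
    return z / np.linalg.norm(z)

def cos_grad(x, t):
    """Gradient of cos(embed(x), t) w.r.t. x, assuming t is unit-norm."""
    u = W @ x
    n = np.linalg.norm(u)
    du = t / n - (u @ t) * u / n**3   # d cos / d u
    return W.T @ du                   # chain through the linear map

def targeted_pgd(x0, t, eps=0.05, alpha=0.01, steps=40):
    """L-inf PGD pushing embed(x) toward the target text embedding t."""
    x = x0.copy()
    for _ in range(steps):
        x = x + alpha * np.sign(cos_grad(x, t))  # signed gradient ascent
        x = np.clip(x, x0 - eps, x0 + eps)       # project back to eps-ball
    return x

x0 = rng.normal(size=256)                # "image"
t = rng.normal(size=64)                  # "target text embedding"
t /= np.linalg.norm(t)

x_adv = targeted_pgd(x0, t)
print(embed(x0) @ t, embed(x_adv) @ t)   # similarity before vs. after
```

Against a real CLIP model the gradient would come from autodiff through the image encoder, but the structure is the same: ascend on cross-modal similarity, then project to keep the perturbation within the imperceptibility budget.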
Transfer analysis
The surprising finding: attacks crafted for CLIP transfer to other vision-language models at a 67% success rate. The shared embedding space creates shared vulnerabilities.
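The transfer metric itself is simple bookkeeping: for each adversarial example crafted against the surrogate, record whether it also fools each target model, then average. The per-example booleans below are made-up placeholder data, not our measured results.

```python
# Hypothetical records: did an attack crafted against CLIP (the surrogate)
# also fool each target model? One boolean per adversarial example.
results = {
    "target_model_a": [True, True, False, True, True, False],
    "target_model_b": [True, False, True, True, False, True],
}

transfer_rates = {
    model: sum(hits) / len(hits) for model, hits in results.items()
}
for model, rate in transfer_rates.items():
    print(f"{model}: {rate:.0%} transfer success")
```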
Implications
As multimodal models are deployed in production (content moderation, search, accessibility), understanding their failure modes becomes critical. Our toolkit provides reproducible evaluation.