Multimodal models like CLIP are everywhere, yet their adversarial robustness remains poorly understood.
The cross-modal threat
CLIP maps images and text to the same embedding space. This enables zero-shot classification, image search, and more. But it also means adversarial perturbations can transfer across modalities.
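The shared-space mechanic can be sketched with toy vectors. In a real system the embeddings come from CLIP's image and text encoders; the hand-made unit vectors and prompt strings below are hypothetical stand-ins used only to show how zero-shot classification reduces to nearest-text-embedding by cosine similarity.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

# Hypothetical embeddings standing in for CLIP encoder outputs.
image_emb = normalize(np.array([0.9, 0.1, 0.0]))
text_embs = {
    "a photo of a cat": normalize(np.array([1.0, 0.0, 0.1])),
    "a photo of a dog": normalize(np.array([0.0, 1.0, 0.1])),
}

# Zero-shot classification: pick the text embedding closest to the image.
scores = {label: float(image_emb @ emb) for label, emb in text_embs.items()}
prediction = max(scores, key=scores.get)
print(prediction)
```

Because both modalities land in one space, this same similarity computation is what an attacker can target: moving an image embedding toward an arbitrary text embedding changes which prompt wins.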
Our attack
We extend projected gradient descent (PGD) to the multimodal setting: perturb an image to push its embedding toward a target text embedding. The perturbation is imperceptible, yet the model's behavior changes dramatically.
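The attack loop can be sketched as follows. This is a minimal illustration, not our released toolkit: a fixed linear map plus L2 normalization stands in for CLIP's image tower (so the gradient is computable by hand), and the epsilon, step size, and step count are arbitrary example values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for an image encoder: linear map + L2 normalization.
W = rng.normal(size=(64, 256))

def embed(x):
    z = W @ x
    return z / np.linalg.norm(z)

def cos_grad(x, t):
    """Gradient of cos(embed(x), t) w.r.t. x, assuming t is unit-norm."""
    u = W @ x
    n = np.linalg.norm(u)
    du = t / n - (u @ t) * u / n**3   # d cos / d u
    return W.T @ du                   # chain through the linear map

def targeted_pgd(x0, t, eps=0.05, alpha=0.01, steps=40):
    """L-inf PGD pushing embed(x) toward the target text embedding t."""
    x = x0.copy()
    for _ in range(steps):
        x = x + alpha * np.sign(cos_grad(x, t))  # signed gradient ascent
        x = np.clip(x, x0 - eps, x0 + eps)       # project back to eps-ball
    return x

x0 = rng.normal(size=256)                # "image"
t = rng.normal(size=64)                  # "target text embedding"
t /= np.linalg.norm(t)

x_adv = targeted_pgd(x0, t)
print(embed(x0) @ t, embed(x_adv) @ t)   # similarity before vs. after
```

Against a real CLIP model the gradient would come from autodiff through the image encoder, but the structure is the same: ascend on cross-modal similarity, then project to keep the perturbation within the imperceptibility budget.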
Transfer analysis
The surprising finding: attacks crafted for CLIP transfer to other vision-language models at a 67% success rate. The shared embedding space creates shared vulnerabilities.
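The transfer metric itself is simple bookkeeping: for each adversarial example crafted against the surrogate, record whether it also fools each target model, then average. The per-example booleans below are made-up placeholder data, not our measured results.

```python
# Hypothetical records: did an attack crafted against CLIP (the surrogate)
# also fool each target model? One boolean per adversarial example.
results = {
    "target_model_a": [True, True, False, True, True, False],
    "target_model_b": [True, False, True, True, False, True],
}

transfer_rates = {
    model: sum(hits) / len(hits) for model, hits in results.items()
}
for model, rate in transfer_rates.items():
    print(f"{model}: {rate:.0%} transfer success")
```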
Implications
As multimodal models are deployed in production (content moderation, search, accessibility), understanding their failure modes becomes critical. Our toolkit provides reproducible evaluation.