CLIP: Learning Transferable Visual Models from Natural Language Supervision
Presented by: Aman Singhal
Venue: Large Language and Vision Models Symposium, NYU Center for Data Science
Year: 2024
Abstract
This presentation covers CLIP, a model that learns visual concepts directly from natural language supervision rather than from a fixed set of object categories. CLIP is trained on 400 million image-text pairs collected from the internet using a simple task: predicting which caption matches which image. This approach enables zero-shot transfer to downstream tasks specified by natural language descriptions, matching the ImageNet accuracy of a supervised ResNet-50 without using any of its 1.28 million labeled training examples.
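The matching task described above corresponds to a symmetric contrastive objective over a batch of image-text pairs: each image should score highest against its own caption, and vice versa. The sketch below is a minimal PyTorch illustration of that objective, not the authors' implementation; the batch size, embedding dimension, and fixed temperature are assumptions for the example (in the paper the temperature is a learned parameter).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings.

    image_features, text_features: [batch, dim] tensors; row i of each is a pair.
    """
    # L2-normalize so dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [batch, batch] similarity matrix; entry (i, j) scores image i against text j.
    logits = image_features @ text_features.t() / temperature

    # The correct caption for image i is text i, so the targets lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

if __name__ == "__main__":
    # Toy usage with random embeddings standing in for encoder outputs.
    batch, dim = 8, 512
    img = torch.randn(batch, dim)   # e.g. output of an image encoder
    txt = torch.randn(batch, dim)   # e.g. output of a text encoder
    print(clip_contrastive_loss(img, txt).item())
```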
Presentation Slides
Key Topics
- Vision-language pretraining architecture
- Contrastive learning from image-text pairs
- Training on 400 million web-scraped pairs
- Zero-shot transfer to downstream tasks (see the sketch after this list)
- Natural language as a flexible supervision signal
- Comparison with supervised learning approaches
- ImageNet performance without training examples
- Applications in multimodal AI and computer vision
- Impact on subsequent vision-language models
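To illustrate the zero-shot transfer topic above: because CLIP embeds images and text in a shared space, a classifier can be built purely from class names written as prompts. The sketch below uses OpenAI's released clip package (assuming it is installed); the checkpoint name, image path, class names, and prompt template are placeholder choices for illustration, not details from the talk.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Build a "classifier" purely from natural language class descriptions.
class_names = ["dog", "cat", "bird"]                      # placeholder classes
prompts = [f"a photo of a {name}" for name in class_names]
text_tokens = clip.tokenize(prompts).to(device)

# Placeholder image path; replace with a real file.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

    # Cosine similarity between the image and each prompt embedding.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for name, prob in zip(class_names, similarity[0].tolist()):
    print(f"{name}: {prob:.3f}")
```

No task-specific training data is used here; changing the list of class names is enough to define a new classifier, which is the sense in which natural language acts as the supervision signal.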