CLIP: Learning Transferable Visual Models from Natural Language Supervision
Presented by: Aman Singhal
Venue: Large Language and Vision Models Symposium, NYU Center for Data Science
Year: 2024
Abstract
This presentation covers CLIP, a model that learns visual concepts directly from natural language supervision rather than from a fixed set of object categories. CLIP is trained on 400 million image-text pairs collected from the internet using a simple task: predicting which caption matches which image. This approach enables zero-shot transfer to downstream tasks specified by natural language descriptions, matching the ImageNet accuracy of a supervised ResNet-50 without using any of its 1.28 million labeled training examples.
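The matching task described above corresponds to a symmetric contrastive objective over a batch of image-text pairs: each image should score highest against its own caption, and vice versa. The sketch below is a minimal PyTorch illustration of that objective, not the authors' implementation; the batch size, embedding dimension, and fixed temperature are assumptions for the example (in the paper the temperature is a learned parameter).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings.

    image_features, text_features: [batch, dim] tensors; row i of each is a pair.
    """
    # L2-normalize so dot products are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [batch, batch] similarity matrix; entry (i, j) scores image i against text j.
    logits = image_features @ text_features.t() / temperature

    # The correct caption for image i is text i, so the targets lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2

if __name__ == "__main__":
    # Toy usage with random embeddings standing in for encoder outputs.
    batch, dim = 8, 512
    img = torch.randn(batch, dim)   # e.g. output of an image encoder
    txt = torch.randn(batch, dim)   # e.g. output of a text encoder
    print(clip_contrastive_loss(img, txt).item())
```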
Presentation Slides
Key Topics
- Vision-language pretraining architecture
- Contrastive learning from image-text pairs
- Training on 400 million web-scraped pairs
- Zero-shot transfer to downstream tasks (see the sketch after this list)
- Natural language as a flexible supervision signal
- Comparison with supervised learning approaches
- ImageNet performance without training examples
- Applications in multimodal AI and computer vision
- Impact on subsequent vision-language models
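To illustrate the zero-shot transfer topic above: because CLIP embeds images and text in a shared space, a classifier can be built purely from class names written as prompts. The sketch below uses OpenAI's released clip package (assuming it is installed); the checkpoint name, image path, class names, and prompt template are placeholder choices for illustration, not details from the talk.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Build a "classifier" purely from natural language class descriptions.
class_names = ["dog", "cat", "bird"]                      # placeholder classes
prompts = [f"a photo of a {name}" for name in class_names]
text_tokens = clip.tokenize(prompts).to(device)

# Placeholder image path; replace with a real file.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

    # Cosine similarity between the image and each prompt embedding.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for name, prob in zip(class_names, similarity[0].tolist()):
    print(f"{name}: {prob:.3f}")
```

No task-specific training data is used here; changing the list of class names is enough to define a new classifier, which is the sense in which natural language acts as the supervision signal.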