CLIP: Learning Transferable Visual Models from Natural Language Supervision

Presented by: Aman Singhal

Venue: Large Language and Vision Models Symposium, NYU Center for Data Science

Year: 2024

Abstract

This talk presents CLIP, a model that learns visual concepts directly from natural language supervision rather than from a fixed set of object categories. CLIP is trained on 400 million image-text pairs collected from the internet with a simple pretraining task: predicting which caption goes with which image. Because downstream tasks can then be specified in natural language, the model transfers zero-shot, matching the ImageNet accuracy of the original ResNet-50 without using any of its 1.28 million labeled training examples.
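The caption-matching task described above amounts to a symmetric contrastive objective: within a batch, each image embedding should score highest against its own caption's embedding, and each caption against its own image. Below is a minimal PyTorch sketch of that objective under simplified assumptions (random tensors stand in for encoder outputs, and a fixed temperature replaces CLIP's learned one); it is illustrative, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    # Unit-normalize so dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Similarity matrix: entry (i, j) scores image i against caption j.
    logits = image_features @ text_features.t() / temperature

    # The correct caption for image i is caption i, so targets are the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Classify the right caption for each image and the right image for each
    # caption, then average the two directions.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2

# Toy usage: random tensors stand in for encoder outputs (batch of 8, dim 512).
images = torch.randn(8, 512)
texts = torch.randn(8, 512)
print(clip_contrastive_loss(images, texts).item())
```

Averaging the image-to-text and text-to-image terms keeps the objective symmetric, so both encoders are pushed to align matching pairs and separate mismatched ones.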

Presentation Slides

View slides in new window

Key Topics

  • Vision-language pretraining architecture
  • Contrastive learning from image-text pairs
  • Training on 400 million web-scraped pairs
  • Zero-shot transfer to downstream tasks (see the classification sketch after this list)
  • Natural language as flexible supervision signal
  • Comparison with supervised learning approaches
  • ImageNet performance without training examples
  • Applications in multimodal AI and computer vision
  • Impact on subsequent vision-language models
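As a concrete illustration of the zero-shot transfer point above, the sketch below classifies an image by scoring it against natural-language prompts of the form "a photo of a {label}" and taking a softmax over the candidates. It assumes the Hugging Face transformers port of CLIP; the checkpoint name, label set, and image URL are placeholder choices.

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint name, candidate labels, and image URL are placeholders.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["cat", "dog", "car"]
prompts = [f"a photo of a {label}" for label in labels]  # class names as text

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Tokenize the prompts and preprocess the image in one call.
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax over the
# candidate prompts gives a distribution over the labels.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Swapping in a different label set requires no retraining, only new prompts, which is the sense in which natural language acts as a flexible supervision signal.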