rgb_language_cap

Spatially aware vision-language model for image captioning, built on ViT and GPT-2 and trained on COCO.

  • Type: AI Model
  • Key Features: Generates captions describing spatial relationships between objects in an image.
  • Technical Categories: Computer Vision, Natural Language Processing, Vision-Language Models
  • Sectors: Content Creation, Accessibility, Image Indexing
  • Research Areas: Image Captioning, Visual Scene Understanding
  • Type of License: Apache-2.0
