rgb_language_cap

Spatially aware vision-language model for image captioning, trained on COCO, that combines a ViT image encoder with a GPT-2 text decoder.

Type: AI Model
Key Features: Generates captions describing spatial relationships between objects in an image.
Technical Categories: Computer Vision, Natural Language Processing, Vision-Language Models
Sectors: Content Creation, Accessibility, Image Indexing
Research Areas: Image Captioning, Visual Scene Understanding
Type of License: Apache-2.0
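
Example: a minimal sketch of how a ViT + GPT-2 captioner of this kind is typically run with the Hugging Face transformers library. This is an illustration under stated assumptions, not the model's documented usage: it assumes the weights are saved in VisionEncoderDecoderModel format, and the names caption_image, normalize_caption, and the checkpoint argument are hypothetical placeholders introduced here.

```python
def normalize_caption(raw: str) -> str:
    """Collapse runs of whitespace and trim the ends of a decoded caption."""
    return " ".join(raw.split())


def caption_image(image_path: str, checkpoint: str, max_length: int = 32) -> str:
    """Generate one caption for the image at image_path.

    `checkpoint` is a placeholder: a local directory or hub id of a
    ViT + GPT-2 checkpoint in VisionEncoderDecoderModel format (assumed).
    """
    # Imported lazily so normalize_caption works without these packages.
    from PIL import Image
    from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
    processor = ViTImageProcessor.from_pretrained(checkpoint)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    # ViT encodes the RGB image; GPT-2 decodes the caption token by token.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    output_ids = model.generate(pixel_values, max_length=max_length)

    text = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
    return normalize_caption(text)
```

Calling caption_image("photo.jpg", checkpoint) with a compatible checkpoint would return a single caption string; the image path and the max_length default of 32 are illustrative choices, not values taken from the model.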