Spatially aware vision-language model for image captioning, trained on COCO, that pairs a ViT image encoder with a GPT-2 text decoder.
- Type: AI Model
- Key Features: Generates captions describing spatial relationships between objects in an image.
- Technical Categories: Computer Vision, Natural Language Processing, Vision-Language Models
- Sectors: Content Creation, Accessibility, Image Indexing
- Research Areas: Image Captioning, Visual Scene Understanding
- Type of License: Apache-2.0
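
To make "spatial relationships between objects" concrete, the following is a minimal, rule-based sketch of the kind of phrase such a caption might contain. It is not the model's method: the model generates captions end to end from pixels, whereas this illustration assumes that object boxes are already available in COCO's `[x, y, width, height]` format (origin at the top-left corner). The function names `center`, `spatial_relation`, and `describe` are hypothetical and chosen only for this example.

```python
from typing import Sequence, Tuple


def center(box: Sequence[float]) -> Tuple[float, float]:
    """Return the center of a COCO-style box [x, y, width, height]."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)


def spatial_relation(box_a: Sequence[float], box_b: Sequence[float]) -> str:
    """Coarse relation of A relative to B, using the dominant center offset.

    Image coordinates grow downward, so a smaller y means "above".
    Ties (including identical centers) fall back to the horizontal axis.
    """
    ax, ay = center(box_a)
    bx, by = center(box_b)
    dx, dy = ax - bx, ay - by
    if abs(dx) >= abs(dy):
        return "to the left of" if dx < 0 else "to the right of"
    return "above" if dy < 0 else "below"


def describe(name_a: str, box_a: Sequence[float],
             name_b: str, box_b: Sequence[float]) -> str:
    """Build a short caption-like phrase for two labeled boxes."""
    return f"a {name_a} {spatial_relation(box_a, box_b)} a {name_b}"


# Illustrative boxes only; these values do not come from any dataset.
print(describe("dog", [10, 100, 50, 40], "bench", [200, 90, 120, 60]))
# prints "a dog to the left of a bench"
```

A learned captioner can express far richer relations (for example "behind", "on top of", or "next to") than this four-way rule, which is only meant to show what a spatial relation in a caption refers to.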