One piece at a time: Assembling textual video captions from single words and image patches

Recording video with modern capturing devices, such as digital cameras, web cameras, or mobile-phone cameras, has been commonplace for many years now. The reasons we capture video are numerous. Most people use their mobile devices to record important (and less important) moments of their lives, and over a few years they can accumulate hundreds or thousands of videos and images. Video capture also has operational applications, such as video-based surveillance, where a camera observes a place of interest so that we can monitor what happens in the surrounding area. Why would we capture video in this case? Shop owners, for instance, use surveillance cameras to watch people moving through their shops, for security or for business-management reasons. There can be more to it than that: another idea is to predict when and where visitors to a very large shop or a museum should be serviced by the staff. The full utility of video for AI-based analysis is difficult to appreciate. Whenever we need to manage small or large collections of video, one important question arises: how can we summarise, in text, the essential semantic visual information contained in a collection of videos?

By VectorMine from Getty Images

As humans, we instantly perceive elements of our surrounding environment without significant effort. Perceiving the visual world is essential for us to function within our communities. The human brain perceives the world visually by receiving information sensed through our eyes and transmitted via the optic nerve. This statement is true, but it is far too coarse a description of how vision works in our species: it reveals none of the underlying complexity. In fact, to this day it is not fully understood how the brain processes visual information and makes sense of it.

Although important scientific questions about the underpinnings of visual processing in the human brain remain open, computer scientists have for years been developing explanations, algorithms and mathematical tools that recreate visual analysis and understanding from visual data of different sorts (e.g., images and video). Making sense of video, for instance, may require detecting and localising objects, tracking moving targets, or taking a streaming video of a road scene from a roof-mounted camera and finding where the vehicles, pedestrians and road signs are. These are only a few examples of the computer vision problems and applications studied by researchers and practitioners.

Intermediate processes taking place in the visual cortex of a biological brain (top), and an artificial neural network-based analog for object recognition (bottom) that uses a CNN, intermediate non-linear feature transformations and a fully connected layer computing a distribution of confidences over several object categories for an input image. Image adapted from [7].

The top of the figure above depicts a model of how the biological brain recognises objects, and the bottom shows an artificial analog for the same task. In the biological pathway, the human eye senses a green tree through the retina and passes the signals on to the optic nerve. The different cues in this image (such as motion, depth information and colour) are processed by the Lateral Geniculate Nucleus (LGN) in the thalamus. This layer-by-layer signal propagation through the LGN is one explanation of how the brain encodes raw sensory information from the environment. The outcome of this encoding stage is passed on to the layers of the visual cortex, which finally enables the brain to perceive the picture of a green tree. In contrast, at the bottom of the figure, we see how a relatively modern neural network analog works. At the first step, raw visual data are captured by a camera that senses the visible part of the electromagnetic spectrum. The three layers of pixel intensities, one per colour component (R for red, G for green and B for blue), are then passed through a deep convolutional neural network. This forward pass encodes the raw input as a regular grid of numbers, where each number is addressed by a tuple of coordinates telling us its position along each dimension of the grid. This arrangement of numbers gives us what we call “features”: a numerical “signature” or “fingerprint” of what the camera captured from a fixed position in the environment. Another deep neural network then decodes these features by transforming them sequentially and non-linearly through several processing layers. By forward-propagating information through these layers, the model computes a distribution over the likely categories of objects in the image. How is this made possible? Simple: the model was trained to adapt its parameters to learn this association from raw images to distributions of object categories, by minimising the categorisation error over a set of image-category pairs.
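To make the artificial half of this pipeline concrete, here is a minimal Python sketch of the "camera image, CNN features, category distribution" chain. It is only an illustration, not the specific network in the figure: it assumes PyTorch and torchvision are installed and uses a pretrained ResNet-18 as a stand-in encoder, and the input file name is hypothetical.

```python
# A minimal sketch of the "RGB image -> CNN features -> category distribution" pipeline.
# The pretrained ResNet-18 is an illustrative stand-in, not the article's exact network.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained CNN mapping an RGB image to confidences over 1000 ImageNet categories.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("tree.jpg").convert("RGB")   # hypothetical input image
x = preprocess(image).unsqueeze(0)              # shape: (1, 3, 224, 224)

with torch.no_grad():
    logits = model(x)                           # raw confidences per category
    probs = torch.softmax(logits, dim=1)        # distribution over object categories

top_prob, top_idx = probs.topk(3)
print(top_idx, top_prob)                        # the 3 most likely categories and their probabilities
```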

Writing computer programs that can tell us, in text, what is depicted in a video has been an important and general-purpose computer vision application. Although humans have long imagined intelligent machines that visually perceive the world around them, being able to take in a video and robustly generate a text description of what it depicts became possible only recently, with the development of robust algorithms. Before this point in the history of AI, attacking the video-to-text task with older scientific techniques and methods was practically impossible. In recent years, the most important scientific concepts used to create effective video captioning algorithms have been borrowed from the subfield of AI called deep learning. Moreover, computer hardware that accelerates numerical computation has become widely available, so that deep learning models with very large numbers of parameters can be trained on raw data; Graphics Processing Units (GPUs) are the go-to hardware technology for developing deep learning-based models. In the era before the proliferation of deep learning (roughly before 2006, the year Deep Belief Networks were introduced), there were still important techniques and concepts used to devise successful algorithms, but their capabilities fell short of those exhibited by deep learning models. The deep learning-based video captioning algorithms that exist today also have an enormous number of parameters, that is, numbers, which was atypical of the older algorithms (or models) designed for exactly the same task. Although the widest adoption of deep learning can be dated to around 2006, it is worth noting that the LeNet deep CNN model developed by Yann LeCun was published in 1998, and that basic elements of deep learning, such as the backpropagation algorithm for tuning model parameters, were developed in the eighties and nineties.

The parameters of a deep learning model are found by algorithms that perform what is called function optimization; through such algorithms, a model can hopefully be made to work sufficiently well on its intended task. (Research on explainable AI has, in parallel, contributed methods that can explain why a model produced a particular output, e.g., a classification or regression decision.) In video captioning, many successful systems, such as SwinBERT [1], follow this approach to train a model on a large dataset of videos harvested from the Web, where each video is associated with a text caption of one or more sentences written by a human. What is significant here is that the designer of a video captioning deep learning algorithm can take such a large dataset of videos and annotations and, after an amount of training time that varies with the amount of data and the size of the model, come up with a good model that can be presented with videos it has never seen during training and generate relatively accurate captions for them.
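The training recipe sketched above boils down to minimising a captioning error over many video-caption pairs by gradient descent. The following toy Python/PyTorch sketch shows that idea under deliberately artificial assumptions: the "captioning model" and the random data are placeholders invented for illustration, not SwinBERT's actual training code.

```python
# A toy, self-contained sketch of "minimise the captioning error over video-caption pairs".
# The model and data are tiny placeholders, not SwinBERT itself.
import torch
import torch.nn as nn

vocab_size, feat_dim = 100, 32

class ToyCaptioner(nn.Module):
    """Placeholder model: maps a video feature vector to word scores at each caption position."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(feat_dim, 16)
        self.head = nn.Linear(16, vocab_size)

    def forward(self, video_feats, seq_len):
        h = torch.relu(self.proj(video_feats))                      # (batch, 16)
        return self.head(h).unsqueeze(1).expand(-1, seq_len, -1)    # (batch, seq_len, vocab)

model = ToyCaptioner()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Fake "dataset": random video features paired with random token ids standing in for captions.
videos = torch.randn(8, feat_dim)
captions = torch.randint(0, vocab_size, (8, 5))

for step in range(100):
    logits = model(videos, captions.size(1))                        # predicted word scores
    loss = loss_fn(logits.reshape(-1, vocab_size), captions.reshape(-1))
    optimizer.zero_grad()
    loss.backward()     # backpropagation computes how to change each parameter
    optimizer.step()    # the update nudges the parameters to lower the captioning error
```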

An ordered sequence of words that describes a video should ideally resemble what a real human would say when describing that video. But is it technically trivial to generate such a caption by means of an algorithm? The answer is a mixed “yes and no”. It is partly “yes”, because scientists have already come up with capable algorithms for the task, even if they are still not perfect. It is partly “no”, because the problem of generating a video caption is ill-posed: we cannot define precisely what the caption of a video really is, and therefore we cannot write an exact algorithm whose answers are unquestionably correct. To see why, note that you would typically make different statements after watching the same video, depending on which details you wanted to highlight. So how can an algorithm decide a priori what to say about a video when there are several possible statements we could make? It cannot, and it may well miss saying something about the video that is, in fact, important to a human observer. Facing this ill-posedness, we settle for defining video captioning through a mathematical model that captures the task convincingly but is not the optimal one; as we already noted, we do not even know what the optimal model would be in the first place! To train this suboptimal model, the designer again relies on a large dataset of raw videos and associated text annotations, so that the model learns to perform the task well.

To come to terms with the idea that mathematical models of reality are suboptimal in some sense, it helps to recall a quote attributed to George E. P. Box: “All models are wrong, but some are useful”. Video captioning systems built on deep learning concepts have been shown, empirically, to be useful and reliable for the task. Their results achieve a good level of utility whenever we can quantify goodness, even though we know these models are not globally perfect models of the visual world sensed through a camera.

To get a glimpse of how a real video captioning system works, we will give a simple yet comprehensive summary of a model called SwinBERT. It was developed by researchers at Microsoft and presented at CVPR 2022 (see reference [1]), a top-tier computer vision conference. The implementation of the model is publicly available on GitHub [3].

To understand how SwinBERT works, it helps to remember that in the physical world matter is made of small pieces organised hierarchically: small pieces combine with other small pieces to create bigger chunks. Sand, for example, is made of very tiny rocks; thousands or millions of such grains arranged together make up a patch of sand. In the same way, small pieces tie together to form larger pieces of matter, and each piece occupies a particular position in space. In video captioning, SwinBERT provides a model that is trained to relate pieces of visual information (that is, image patches) to sequences of words. To make this happen, SwinBERT reuses two important earlier ideas. The first is the VidSwin transformer model, which represents a video as 3D voxel patches and extracts features that describe the visual content of the image sequence; VidSwin-generated representations can be used for classification tasks such as action recognition, among others. VidSwin predates SwinBERT; it was also created at Microsoft and presented in a 2022 CVPR paper [2]. The second is the module that helps generate word sequences (that is, sentences): BERT (Bidirectional Encoder Representations from Transformers) [5], developed by Devlin and collaborators.
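To make the "pieces of visual information" idea concrete, here is a small sketch of how a video tensor can be cut into 3D patches, small blocks spanning a few frames in time and a small spatial window, before any transformer processing. The clip dimensions and patch sizes below are illustrative assumptions, not VidSwin's exact configuration.

```python
# Cutting a video into 3D patches: each patch spans a few frames and a small spatial
# window, so it is a little "piece" of the video, much like a mosaic tile.
# Patch sizes here are illustrative assumptions, not VidSwin's exact configuration.
import torch

T_frames, H, W, C = 8, 224, 224, 3      # a short clip: 8 RGB frames of 224x224 pixels
pt, ph, pw = 2, 16, 16                  # assumed patch size: 2 frames x 16 x 16 pixels

video = torch.randn(T_frames, H, W, C)  # random stand-in for a real clip

# Reshape so that each (pt x ph x pw x C) block becomes one flattened patch vector.
patches = video.reshape(T_frames // pt, pt, H // ph, ph, W // pw, pw, C)
patches = patches.permute(0, 2, 4, 1, 3, 5, 6)      # group the patch grid dimensions first
patches = patches.reshape(-1, pt * ph * pw * C)     # (num_patches, patch_dim)

print(patches.shape)   # 4 * 14 * 14 = 784 patches, each a 1536-dimensional vector
```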

By Alea Mosaic

To begin with, in order to understand what VidSwin does, imagine a colourful mosaic created by an artist. To depict a scene, the mosaic designer takes very small coloured pieces of rock, each with its own colour and texture, and stitches these small, colourful pieces together, piece by piece, to form objects such as a dolphin and the sea, as in the mosaic image on the left. Each mosaic patch typically belongs to a single object: in the mosaic on the left, some patches belong to the dolphin, others to the background, and others to the sea surface. Patches belonging to the same object are usually near each other and often adjacent, whereas patches from different objects are generally not adjacent, except where the boundaries of the objects touch. If we represented the adjacency of image patches as a graph, we would naturally obtain a planar, undirected graph. For the mosaic depicted above, for example, we could say that “the dolphin hovers over the seawater”. Now imagine that the same designer creates a series of slightly altered mosaics starting from the original one. The patches of each mosaic lie at the same positions, but their colour content changes from one mosaic to the next, so that the appearance of the dolphin is displaced over time, giving the impression that the dolphin is moving. Naturally, as we step through the mosaics from first to last, some image patches are correlated spatially (because they are adjacent within the same image), while patches belonging to different mosaics may be correlated temporally. VidSwin aims to model these dynamic patch relationships by adopting a transformer that performs self-attention on the 3D patches both spatially and temporally, producing refined 3D patch embeddings that sit well in feature space. These embeddings are further transformed several times by self-attention layers in order to robustly model the dependencies among them at different scales of attention; each such layer applies multi-head self-attention followed by a non-linear transformation computed by a feed-forward neural network. VidSwin finally outputs spatio-temporal features that numerically describe small consecutive frame segments of the video.
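As a rough illustration of the kind of processing applied to those patch embeddings, the sketch below runs a sequence of patch embeddings through one multi-head self-attention layer followed by a feed-forward network, using standard PyTorch modules. It is a generic transformer block written for illustration, not VidSwin's shifted-window implementation, and the dimensions are assumed values.

```python
# A generic transformer block over patch embeddings: multi-head self-attention followed
# by a feed-forward network. VidSwin restricts attention to shifted 3D windows; this
# simplified sketch attends over all patches at once, purely for illustration.
import torch
import torch.nn as nn

embed_dim, num_heads, num_patches = 96, 3, 784

patch_embeddings = torch.randn(1, num_patches, embed_dim)   # (batch, patches, features)

attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
feed_forward = nn.Sequential(
    nn.Linear(embed_dim, 4 * embed_dim), nn.GELU(), nn.Linear(4 * embed_dim, embed_dim),
)
norm1, norm2 = nn.LayerNorm(embed_dim), nn.LayerNorm(embed_dim)

# Self-attention lets every patch look at every other patch and reweight its own features.
x = patch_embeddings
attended, _ = attention(norm1(x), norm1(x), norm1(x))
x = x + attended                      # residual connection

# The feed-forward network applies a non-linear transformation to each refined patch.
x = x + feed_forward(norm2(x))

print(x.shape)                        # still (1, 784, 96): refined patch embeddings
```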

From Canva by shutter_m from Getty Images

The language modelling module of SwinBERT, BERT [5], has as its ultimate goal to generate the sequence of words that best describes the visual content of the video as captured by VidSwin. BERT captures the relationships between the words in a sentence by considering the importance of an anchor word given the words that appear both to its left and to its right; for this reason, BERT is said to take into account the bidirectional context of words. Using this bidirectional context, BERT is pre-trained on a large corpus of text, so that it can later be fine-tuned on other text corpora to serve downstream tasks. Like every deep learning model, BERT is trained by optimising objective functions; in its case there are two. The first is the Masked Language Model (MLM) objective: the model randomly picks some words of a sentence and masks them out, and BERT must infer the masked (that is, missing) words correctly, again using the bidirectional context described above. The second is Next Sentence Prediction (NSP), which drives BERT’s parameters towards understanding the relationships between pairs of sentences.
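To see the masked-word objective in action, one can run a small experiment with a publicly available BERT checkpoint through the Hugging Face transformers library (assumed to be installed). This only illustrates the MLM idea itself; it is not part of SwinBERT's own code, and the example sentence is made up.

```python
# Masked word prediction with a public BERT checkpoint via the `transformers` library.
# This illustrates the MLM objective only; it is not SwinBERT's training code.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT uses the words on both sides of [MASK] (the bidirectional context) to guess it.
for prediction in fill_mask("A dolphin [MASK] over the surface of the sea."):
    print(f"{prediction['token_str']:>10s}  score={prediction['score']:.3f}")
```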

The mapping of visual information to word sequences through a multi-modal transformer model with a cross-attention layer in SwinBERT

Now that we have described how the VidSwin transformer generates spatiotemporal visual features from frame segments of a video, and how BERT generates textual features, it is time to describe how these two elements are combined to form the SwinBERT model. The key ingredient is a model that can bridge both worlds, the visual and the textual, so that we can go from a VidSwin-based visual representation to a textual representation, which is the desired output of SwinBERT. This is the role of the multimodal transformer: it fuses the visual and textual representations into a joint representation that relies on simple, sparse interactions between visual and textual elements. Such sparse interactions between the two modalities are also easier to interpret, whereas everything-versus-everything interactions are more expensive and often unnecessarily complicated. SwinBERT avoids the latter through a multimodal transformer that employs a key element for processing multimodal data: the cross-attention layer. A plain transformer model [4] instead relies on self-attention layers, which compute dense relationships among tokens of a single modality. In SwinBERT’s multimodal transformer, the text and visual elements are combined into a common representation through learned combinations of one another; after passing through a feed-forward neural network, this representation feeds a seq2seq-style sequence generation procedure [6] that produces the video caption.
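The cross-attention idea can be sketched in a few lines: the queries come from the textual (caption) side, while the keys and values come from the video features produced by the visual encoder. The dimensions and the single attention layer below are illustrative assumptions, not SwinBERT's exact architecture.

```python
# Cross-attention between the two modalities: text tokens query the video features.
# Dimensions and the single layer are illustrative assumptions, not SwinBERT's exact design.
import torch
import torch.nn as nn

embed_dim, num_heads = 512, 8
num_text_tokens, num_video_tokens = 20, 784

text_tokens = torch.randn(1, num_text_tokens, embed_dim)     # caption-side embeddings
video_tokens = torch.randn(1, num_video_tokens, embed_dim)   # VidSwin-style visual features

cross_attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# Queries come from the text; keys and values come from the video, so each word
# position gathers the visual evidence that is most relevant to it.
fused, attention_weights = cross_attention(
    query=text_tokens, key=video_tokens, value=video_tokens
)

print(fused.shape)              # (1, 20, 512): text tokens enriched with visual context
print(attention_weights.shape)  # (1, 20, 784): how much each word attends to each patch
```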

References

[1] Lin et al., SwinBERT: end-to-end transformers with sparse attention for video captioning, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022, pp. 17949-17958.

[2] Liu et al., Video Swin Transformer, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022, pp. 3202-3211.

[3] Code accessed at https://github.com/microsoft/SwinBERT

[4] Vaswani et al., Attention is all you need, in Advances in Neural Information Processing Systems (NeurIPS) 2017.

[5] Devlin et al., BERT: Pre-training of deep bidirectional transformers for language understanding, in Proceedings of NAACL-HLT 2019, pp. 4171-4186.

[6] Sutskever et al., Sequence to sequence learning with neural networks, in Advances in Neural Information Processing Systems (NeurIPS) 2014.

[7] Zhang and Lee, Robot bionic vision technologies: A review, in Applied Sciences, 2022.


Sotiris Karavarsamis

Research Assistant at Visual Computing Lab (VCL)@CERTH/ITI
