A Vision Transformer (ViT) is a transformer that is targeted at vision processing tasks such as image recognition.
Vision Transformers

Transformers found their initial applications in natural language processing (NLP) tasks, as demonstrated by language models such as BERT and GPT-3. By contrast, the typical image processing system uses a convolutional neural network (CNN). Well-known projects include Xception, ResNet, EfficientNet, DenseNet, and Inception.
Transformers measure the relationships between pairs of input tokens (words in the case of text strings), termed attention. The cost is quadratic in the number of tokens. For images, the basic unit of analysis is the pixel. However, computing relationships for every pixel pair in a typical image is prohibitive in terms of memory and computation. Instead, ViT divides the image into small sections, or patches (e.g., 16x16 pixels), and computes relationships among these sections, at a drastically reduced cost. Each section is flattened into a linear sequence and multiplied by the embedding matrix. The result, with a position embedding added, is fed to the transformer as one element of the input sequence. The position embeddings are learnable vectors.
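The step from image to token sequence can be illustrated with a minimal NumPy sketch. It assumes a 224x224 RGB input and an embedding width of 64, both chosen only for illustration, and it uses random matrices as stand-ins for the embedding matrix and position embeddings that a real model would learn. The helper name `patchify` is hypothetical.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    # Split both spatial axes into (grid, patch), then group the grid axes first.
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # assumed input size
patch, dim = 16, 64                 # dim is an illustrative embedding width

patches = patchify(image, patch)                       # (196, 768)
embed = rng.normal(0, 0.02, (patches.shape[1], dim))   # stand-in for the learned embedding matrix
pos = rng.normal(0, 0.02, (patches.shape[0], dim))     # stand-in for learned position embeddings
tokens = patches @ embed + pos                         # (196, 64): the transformer input

# Attention cost grows with the number of token pairs.
pixel_pairs = (224 * 224) ** 2      # every pixel attending to every pixel
patch_pairs = patches.shape[0] ** 2 # every patch attending to every patch
```

Under these assumptions the sequence has 196 tokens instead of 50,176 pixels, so the number of pairs that attention must score falls by a factor of 256 squared.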
As in the case of BERT, a fundamental role in classification tasks is played by the class token, a special token whose output is the only input to the final MLP head, since it has been influenced by all the other tokens.
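A rough sketch of how a class token might be wired in is shown below. All sizes are illustrative, the `encoder` function is a hypothetical placeholder that merely mixes every token with the sequence mean (a real encoder would use attention layers), and the head is reduced to a single random matrix standing in for a trained MLP.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim, num_classes = 196, 64, 10   # illustrative sizes

patch_tokens = rng.normal(size=(num_patches, dim))   # stand-in for embedded patches
cls_token = np.zeros((1, dim))                       # a learnable vector in a real model

# Prepend the class token so that it is processed together with every patch.
sequence = np.concatenate([cls_token, patch_tokens], axis=0)   # (197, dim)

def encoder(x):
    """Hypothetical placeholder: mixes each token with the sequence mean."""
    return x + x.mean(axis=0, keepdims=True)

encoded = encoder(sequence)

# Only the class token's output (position 0) is passed to the head.
w_head = rng.normal(0, 0.02, (dim, num_classes))   # stand-in for the MLP head
logits = encoded[0] @ w_head                       # (num_classes,)
predicted_class = int(np.argmax(logits))
```

Even with this crude placeholder, changing any single patch changes the class token's output, which is the property the classification head relies on.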
The most common architecture, used for image classification, employs only the transformer encoder to transform the input tokens. Other applications also use the decoder part of the traditional transformer architecture.
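The core operation that the encoder applies to the input tokens is self-attention. The following is a minimal single-head sketch in NumPy, not a full encoder layer: it omits multiple heads, layer normalization, residual connections, and the feed-forward sublayer, and the projection matrices are random stand-ins for learned weights. The token count of 197 assumes 196 patches plus one class token.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over (n, d) tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (n, n): one score per token pair
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 197, 64                                # assumed: 196 patches + 1 class token
tokens = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(0, d ** -0.5, (d, d)) for _ in range(3))

out, weights = self_attention(tokens, w_q, w_k, w_v)
```

The `scores` matrix has one entry per pair of tokens, which is where the quadratic cost in the number of tokens comes from.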
History
Transformers, introduced in 2017 in the well-known paper "Attention Is All You Need", spread widely in natural language processing and soon became one of the most widely used and promising architectures in the field.
In 2020, transformers were adapted for computer vision tasks with the paper "An Image Is Worth 16x16 Words". The idea is to break an input image into a series of patches which, once transformed into vectors, are treated as words in a normal transformer.
Whereas in natural language processing the attention mechanism captures the relationships between different words of the text being analysed, in computer vision the Vision Transformer instead captures the relationships between different portions of an image.
In 2021 a pure transformer model demonstrated better performance and greater efficiency than CNNs on image classification.
A study in June 2021 added a transformer back end to ResNet, which dramatically reduced costs and increased accuracy.
In the same year, some important variants of the Vision Transformer were proposed. These variants are mainly intended to be more efficient, more accurate, or better suited to a specific domain. Among the most relevant is the Swin Transformer, which, through some modifications to the attention mechanism and a multi-stage approach, achieved state-of-the-art results on some object detection datasets such as COCO. Another interesting variant is the TimeSformer, designed for video understanding tasks and able to capture spatial and temporal information through the use of divided space-time attention.
Vision Transformers have also moved out of the lab and into one of the most important fields of computer vision, autonomous driving.
Comparison with Convolutional Neural Networks
ViT performance depends on choices including the optimizer, dataset-specific hyperparameters, and network depth. CNNs are much easier to optimize.
A variation on a pure transformer is to marry a transformer to a CNN stem (front end). A typical ViT stem uses a 16x16 convolution with a stride of 16. By contrast, a stem built from 3x3 convolutions with stride 2 increases stability and also improves accuracy.
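The two kinds of stem can be compared with simple output-size arithmetic. The sketch below assumes a 224-pixel input side, a padding of 1 for the 3x3 convolutions, and a stack of four of them. The count and padding are assumptions chosen so that both stems reach the same grid, not details taken from the text.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

side = 224  # assumed input resolution

# Patchify stem: one 16x16 convolution with stride 16.
patchify_side = conv_out(side, kernel=16, stride=16)

# Convolutional stem: four 3x3 convolutions with stride 2 (padding 1 assumed).
conv_side = side
for _ in range(4):
    conv_side = conv_out(conv_side, kernel=3, stride=2, padding=1)

print(patchify_side, conv_side)  # 14 14
```

Both stems hand the transformer a 14x14 grid of tokens. The difference is that the convolutional stem gets there in four overlapping steps rather than one non-overlapping jump.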
The CNN translates from the basic pixel level to a feature map. A tokenizer translates the feature map into a series of tokens that are then fed into the transformer, which applies the attention mechanism to produce a series of output tokens. Finally, a projector reconnects the output tokens to the feature map, which allows the analysis to exploit potentially significant pixel-level details. Because the transformer operates on a small set of tokens rather than on every position of the feature map, the number of tokens that need to be analyzed drops drastically, reducing costs accordingly.
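One plausible shape for the tokenizer and projector is sketched below. The text does not specify how either is built, so both forms are assumptions: the tokenizer pools the feature map into a few tokens using softmax weights over spatial positions, and the projector lets each position attend to the tokens and adds the result back. All weights are random stand-ins, the sizes are illustrative, and the transformer that would update the tokens in between is omitted.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
hw, c, num_tokens = 14 * 14, 32, 8            # illustrative sizes
feature_map = rng.normal(size=(hw, c))        # flattened CNN feature map

# Tokenizer (assumed form): each token is a weighted average of position features.
w_a = rng.normal(0, c ** -0.5, (c, num_tokens))
assign = softmax(feature_map @ w_a, axis=0)   # (hw, num_tokens), columns sum to 1
tokens = assign.T @ feature_map               # (num_tokens, c)

# A transformer would update the tokens here; omitted in this sketch.

# Projector (assumed form): each position attends to the tokens,
# and the attended result is added back to the feature map.
w_q = rng.normal(0, c ** -0.5, (c, c))
w_k = rng.normal(0, c ** -0.5, (c, c))
pixel_to_token = softmax((feature_map @ w_q) @ (tokens @ w_k).T, axis=1)  # (hw, num_tokens)
refined_map = feature_map + pixel_to_token @ tokens                       # (hw, c)
```

With these sizes the transformer would see 8 tokens instead of 196 positions, while the refined feature map keeps its full spatial resolution.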
CNNs and Vision Transformers differ in many ways, chiefly in their architectures.
In particular, CNNs achieve excellent results even when trained on data volumes smaller than those required by Vision Transformers.
This different behaviour seems to derive from the inductive biases present in CNNs, which help these networks grasp the particularities of the analysed images more quickly but, on the other hand, limit them by making it harder to capture global relations.
Vision Transformers, by contrast, are free from these biases, which allows them to capture global and longer-range relations as well, at the cost of training that is more demanding in terms of data.
Vision Transformers also proved to be much more robust to input image distortions such as adversarial patches or permutations.
However, committing to one architecture over the other is not always the wisest choice: excellent results have been obtained in several computer vision tasks with hybrid architectures that combine convolutional layers with Vision Transformers.
The Role of Self-Supervised Learning
The considerable need for data during the training phase has made it essential to find alternative methods to train these models, and a central role is now played by self-supervised methods. Using these approaches, it is possible to train a neural network in an almost autonomous way, allowing it to deduce the peculiarities of a specific problem without having to build a large dataset or provide it with accurately assigned labels. Being able to train a Vision Transformer without a huge vision dataset at one's disposal could be the key to the widespread dissemination of this promising new architecture.
Applications
Vision Transformers have been used in many computer vision tasks with excellent, and in some cases state-of-the-art, results.
Among the most relevant areas of application are:
* Image Classification
* Object Detection
* Video Deepfake Detection
* Image segmentation
* Anomaly detection
* Image Synthesis
* Cluster analysis
* Autonomous Driving
Implementations
There are many open-source implementations of the Vision Transformer and its variants available online. The main versions of this architecture have been implemented in PyTorch, but implementations have also been made available for TensorFlow.
See also
* Transformer (machine learning model)
* Attention (machine learning)
* Perceiver
* Deep learning
* PyTorch
* TensorFlow
External links
* Coccomini, Davide (2021-05-03). "On DINO, Self-Distillation with no labels". Towards Data Science. https://towardsdatascience.com/on-dino-self-distillation-with-no-labels-c29e9365e382 (subscription required). Retrieved 2021-10-03.