In artificial neural networks, a convolutional layer is a type of network layer that applies a convolution operation to the input. Convolutional layers are some of the primary building blocks of convolutional neural networks (CNNs), a class of neural network most commonly applied to images, video, audio, and other data that have the property of uniform translational symmetry.
The convolution operation in a convolutional layer involves sliding a small window (called a kernel or filter) across the input data and computing the dot product between the values in the kernel and the input at each position. This process creates a feature map that represents detected features in the input.
Concepts
Kernel
Kernels, also known as filters, are small matrices of weights that are learned during the training process. Each kernel is responsible for detecting a specific feature in the input data. The size of the kernel is a hyperparameter that affects the network's behavior.
Convolution
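For a 2D input $x$ and a 2D kernel $w$, the 2D convolution operation can be expressed as:
$$y[i, j] = \sum_{m=0}^{k_h - 1} \sum_{n=0}^{k_w - 1} x[i + m, j + n] \, w[m, n]$$
where $k_h$ and $k_w$ are the height and width of the kernel, respectively.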
This generalizes immediately to n-dimensional convolutions. Commonly used convolutions are 1D (for audio and text), 2D (for images), and 3D (for spatial objects and videos).
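As a minimal sketch of the sliding dot product described above, the following pure-Python function computes a 2D feature map with stride 1 and no padding. The names `conv2d_valid`, `x`, `w`, `k_h`, and `k_w` are illustrative assumptions, and the nested loops favour readability over speed. Like most deep learning libraries, the sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d_valid(x, w):
    """Slide kernel w over 2D input x with stride 1 and no padding."""
    k_h, k_w = len(w), len(w[0])
    out_h = len(x) - k_h + 1
    out_w = len(x[0]) - k_w + 1
    y = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Dot product between the kernel and the input window at (i, j).
            y[i][j] = sum(
                x[i + m][j + n] * w[m][n]
                for m in range(k_h)
                for n in range(k_w)
            )
    return y


# A 4x4 input whose entries increase by 1 along rows and by 4 down columns.
x = [[float(4 * r + c) for c in range(4)] for r in range(4)]
# A 2x2 kernel that subtracts the lower-right neighbour from each entry.
w = [[1.0, 0.0], [0.0, -1.0]]
y = conv2d_valid(x, w)  # 3x3 feature map
```

Because each entry of this input differs from its lower-right neighbour by 5, every value in the resulting 3x3 feature map is -5.0.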
Stride
Stride determines how the kernel moves across the input data. A stride of 1 means the kernel shifts by one pixel at a time, while a larger stride (e.g., 2 or 3) results in less overlap between convolutions and produces smaller output feature maps.
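The effect of stride can be illustrated in one dimension by listing where each kernel window starts. This is a hedged sketch: `window_starts` is a hypothetical helper, and it assumes no padding.

```python
def window_starts(n, k, stride=1):
    """Start indices of every length-k window that fits in a length-n input."""
    return list(range(0, n - k + 1, stride))


starts_s1 = window_starts(7, 3, stride=1)  # [0, 1, 2, 3, 4]
starts_s2 = window_starts(7, 3, stride=2)  # [0, 2, 4]
starts_s3 = window_starts(7, 3, stride=3)  # [0, 3]
```

The number of start positions is the length of the output along that axis, so increasing the stride from 1 to 3 shrinks the output from 5 entries to 2 for this input.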
Padding
Padding involves adding extra pixels around the edges of the input data. It serves two main purposes:
* Preserving spatial dimensions: Without padding, each convolution reduces the size of the feature map.
* Handling border pixels: Padding ensures that border pixels are given equal importance in the convolution process.
Common padding strategies include:
* No padding/valid padding: This strategy typically causes the output to shrink.
* Same padding: Any method that keeps the output size equal to the input size is a same padding strategy.
* Full padding: Any method that ensures each input entry is convolved over the same number of times is a full padding strategy.
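One way to see how the three strategies differ is to count how many kernel windows cover each input entry. The sketch below assumes a 1D input, stride 1, an odd kernel length, and the same amount of padding on both sides; `visit_counts` is a hypothetical helper written for this illustration.

```python
def visit_counts(n, k, pad):
    """Count how many kernel windows cover each of n input entries (1D, stride 1)."""
    counts = [0] * n
    padded_len = n + 2 * pad
    for start in range(padded_len - k + 1):
        for pos in range(start, start + k):
            idx = pos - pad  # index into the unpadded input
            if 0 <= idx < n:
                counts[idx] += 1
    return counts


n, k = 5, 3
valid = visit_counts(n, k, pad=0)             # no padding, 3 windows
same = visit_counts(n, k, pad=(k - 1) // 2)   # one entry per side, 5 windows
full = visit_counts(n, k, pad=k - 1)          # two entries per side, 7 windows
```

With these values, `valid` is `[1, 2, 3, 2, 1]`, `same` is `[2, 3, 3, 3, 2]`, and `full` is `[3, 3, 3, 3, 3]`: only full padding covers border entries as often as interior ones.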
Common padding algorithms include:
* Zero padding: Add zero entries to the borders of input.
* Mirror/reflect/symmetric padding: Reflect the input array on the border.
* Circular padding: Cycle the input array back to the opposite border, like a torus.
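The three padding algorithms can be sketched for a 1D list as follows. `pad_1d` and its mode names are assumptions made for this example rather than any library's API, and the "reflect" branch implements the variant that mirrors about the edge entry without repeating it (the symmetric variant would repeat the edge entry).

```python
def pad_1d(values, pad, mode):
    """Return values with `pad` extra entries on each side.

    Assumes pad <= len(values) - 1 for "reflect" and pad <= len(values)
    for "circular".
    """
    n = len(values)
    if mode == "zero":
        left, right = [0] * pad, [0] * pad
    elif mode == "reflect":
        # Mirror about the first and last entries without repeating them.
        left = [values[i] for i in range(pad, 0, -1)]
        right = [values[n - 1 - i] for i in range(1, pad + 1)]
    elif mode == "circular":
        # Wrap around: the left border comes from the end, and vice versa.
        left = values[n - pad:]
        right = values[:pad]
    else:
        raise ValueError(f"unknown padding mode: {mode}")
    return left + values + right


data = [1, 2, 3, 4]
zero_padded = pad_1d(data, 2, "zero")          # [0, 0, 1, 2, 3, 4, 0, 0]
reflect_padded = pad_1d(data, 2, "reflect")    # [3, 2, 1, 2, 3, 4, 3, 2]
circular_padded = pad_1d(data, 2, "circular")  # [3, 4, 1, 2, 3, 4, 1, 2]
```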
The exact arithmetic relating input size, kernel size, stride, and padding to output size is intricate; see Dumoulin and Visin (2018) for details.
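For the common case of symmetric padding and no dilation, the output length along one axis follows the widely used relation floor((n + 2p - k) / s) + 1, where n is the input length, k the kernel length, s the stride, and p the padding per side. The function below is a small sketch of that relation only; `conv_output_size` and its parameter names are assumptions for illustration, and it does not cover dilation or asymmetric padding.

```python
def conv_output_size(n, k, stride=1, pad=0):
    """Output length along one axis for input length n and kernel length k.

    pad is the number of entries added to each side of the input.
    """
    if n + 2 * pad < k:
        raise ValueError("kernel does not fit inside the padded input")
    return (n + 2 * pad - k) // stride + 1


no_padding = conv_output_size(7, 3)                       # 5
strided = conv_output_size(7, 3, stride=2)                # 3
same_size = conv_output_size(5, 3, pad=1)                 # 5
downsampled = conv_output_size(224, 7, stride=2, pad=3)   # 112
```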
Variants
Standard
Standard convolution is the basic form described above, in which each kernel is applied to the entire input volume.
Depthwise separable
Depthwise separable convolution decomposes a single standard convolution into two steps: a depthwise convolution that filters each input channel independently, and a pointwise convolution (1×1 convolution) that combines the outputs of the depthwise convolution across channels. This factorization significantly reduces computational cost.
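The saving can be made concrete by counting weights. The sketch below assumes square k×k kernels, ignores bias terms, and uses hypothetical helper names; the channel counts are arbitrary example values.

```python
def standard_conv_weights(k, c_in, c_out):
    """Weights in a standard convolution: one k x k x c_in kernel per output channel."""
    return k * k * c_in * c_out


def separable_conv_weights(k, c_in, c_out):
    """Weights in a depthwise separable convolution (biases ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution mixing channels
    return depthwise + pointwise


standard = standard_conv_weights(3, 64, 128)     # 73728
separable = separable_conv_weights(3, 64, 128)   # 8768
ratio = standard / separable                     # roughly 8.4
```

For a 3×3 layer mapping 64 channels to 128, the separable form needs roughly one eighth of the weights; the number of multiplications per output position shrinks by the same factor.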
It was first developed by Laurent Sifre during an internship at Google Brain in 2013 as an architectural variation on AlexNet to improve convergence speed and model size.
Dilated
Dilated convolution, or atrous convolution, introduces gaps between kernel elements, allowing the network to capture a larger receptive field without increasing the kernel size.
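As a rough illustration, the input positions sampled by a dilated 1D kernel and the span they cover can be computed as below. The helper names are hypothetical, and the sketch follows the common convention that a dilation of 1 corresponds to an ordinary convolution.

```python
def dilated_taps(k, dilation=1):
    """Input offsets sampled by a length-k kernel with the given dilation."""
    return [i * dilation for i in range(k)]


def effective_kernel_size(k, dilation=1):
    """Span of input covered by one application of the dilated kernel."""
    return k + (k - 1) * (dilation - 1)


taps_d1 = dilated_taps(3, dilation=1)           # [0, 1, 2]
taps_d2 = dilated_taps(3, dilation=2)           # [0, 2, 4]
span_d2 = effective_kernel_size(3, dilation=2)  # 5
span_d4 = effective_kernel_size(3, dilation=4)  # 9
```

A 3-tap kernel with dilation 4 therefore spans 9 input positions while still using only 3 weights.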
Transposed
Transposed convolution, also known as deconvolution, fractionally strided convolution, and upsampling convolution, is a convolution where the output tensor is larger than its input tensor. It is often used in encoder-decoder architectures for upsampling, for example in image generation, semantic segmentation, and super-resolution tasks.
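A minimal 1D sketch shows how the output grows: each input entry scatters a scaled copy of the kernel into the output, with copies spaced `stride` positions apart. The function name `conv_transpose_1d` is an assumption for this example, and the sketch omits padding and output cropping.

```python
def conv_transpose_1d(x, w, stride=1):
    """1D transposed convolution without padding.

    The output has length (len(x) - 1) * stride + len(w).
    """
    n, k = len(x), len(w)
    y = [0.0] * ((n - 1) * stride + k)
    for i, value in enumerate(x):
        for j, weight in enumerate(w):
            # Overlapping contributions are summed.
            y[i * stride + j] += value * weight
    return y


upsampled = conv_transpose_1d([1.0, 2.0, 3.0], [1.0, 1.0], stride=2)
# upsampled == [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
```

Here a length-3 input becomes a length-6 output; with this particular kernel and stride the operation reduces to nearest-neighbour upsampling by a factor of 2.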
History
The concept of convolution in neural networks was inspired by the visual cortex in biological brains. Early work by Hubel and Wiesel in the 1960s on the cat's visual system laid the groundwork for artificial convolutional networks.
An early convolutional neural network was developed by Kunihiko Fukushima in 1969. It had mostly hand-designed kernels inspired by convolutions in mammalian vision. In 1979 he improved it to the Neocognitron, which learns all convolutional kernels by unsupervised learning (in his terminology, "self-organized by 'learning without a teacher'").
In 1998, Yann LeCun et al. introduced LeNet-5, an early influential CNN architecture for handwritten digit recognition, trained on the MNIST dataset.
Olshausen and Field (1996) showed that the localized, oriented, bandpass receptive fields of simple cells in the mammalian primary visual cortex could be recreated by fitting sparse linear codes for natural scenes. Similar receptive fields were later found in the lowest-level kernels of trained CNNs.
The field saw a resurgence in the 2010s with the development of deeper architectures and the availability of large datasets and powerful GPUs. AlexNet, developed by Alex Krizhevsky et al. in 2012, was a catalytic event in modern deep learning.
See also
* Convolutional neural network
* Pooling layer
* Feature learning
* Deep learning
* Computer vision