In neural networks, a pooling layer is a kind of network layer that downsamples and aggregates information that is dispersed among many vectors into fewer vectors. It has several uses: it removes redundant information, reducing the amount of computation and memory required; it makes the model more robust to small variations in the input; and it increases the receptive field of neurons in later layers of the network.


Convolutional neural network pooling

Pooling is most commonly used in convolutional neural networks (CNN). Below is a description of pooling in 2-dimensional CNNs. The generalization to n dimensions is immediate. As notation, we consider a tensor x \in \R^{H \times W \times C}, where H is height, W is width, and C is the number of channels. A pooling layer outputs a tensor y \in \R^{H' \times W' \times C}. We define two variables f, s called "filter size" (also "kernel size") and "stride". Sometimes it is necessary to use a different filter size and stride for the horizontal and vertical directions; in such cases we define four variables f_H, f_W, s_H, s_W. The receptive field of an entry in the output tensor y is the set of all entries in x that can affect that entry.
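
The output spatial size H', W' is determined by the input size, filter size, and stride. A minimal sketch, assuming the common no-padding ("valid") convention (the helper name is illustrative, not from the source):

```python
# Sketch: output spatial size of a pooling layer under "valid" (no-padding) semantics.
def pooled_size(n: int, f: int, s: int) -> int:
    """Number of pooling windows of size f and stride s that fit along an axis of length n."""
    return (n - f) // s + 1

# Example: a 32x32 feature map pooled with f = 2, s = 2 yields a 16x16 output.
assert pooled_size(32, 2, 2) == 16
assert pooled_size(7, 3, 2) == 3   # windows start at offsets 0, 2, 4
```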


Max pooling

Max Pooling (MaxPool) is commonly used in CNNs to reduce the spatial dimensions of feature maps. Define

\mathrm{MaxPool}(x, f, s)_{0, 0, c} = \max(x_{0:f-1,\; 0:f-1,\; c})

where 0:f-1 means the range 0, 1, \dots, f-1. Note that we need to avoid the off-by-one error. The next input is

\mathrm{MaxPool}(x, f, s)_{0, 1, c} = \max(x_{0:f-1,\; s:s+f-1,\; c})

and so on. The receptive field of y_{0,1,c} is x_{0:f-1,\; s:s+f-1,\; c}, so in general,

\mathrm{MaxPool}(x, f, s)_{i, j, c} = \max(x_{is:is+f-1,\; js:js+f-1,\; c})

If the horizontal and vertical filter sizes and strides differ, then in general,

\mathrm{MaxPool}(x, f, s)_{i, j, c} = \max(x_{i s_H : i s_H + f_H - 1,\; j s_W : j s_W + f_W - 1,\; c})

More succinctly, we can write y_{i,j,c} = \max(x_{is:is+f-1,\; js:js+f-1,\; c}). If H is not expressible as ks + f where k is an integer, then computing the boundary entries of the output tensor would require max pooling to take as inputs entries lying outside the tensor. How those non-existent entries are handled depends on the padding convention used. Global Max Pooling (GMP) is a specific kind of max pooling where the output tensor has shape \R^{C} and the receptive field of y_c is all of x_{:, :, c}. That is, it takes the maximum over each entire channel. It is often used just before the final fully connected layers in a CNN classification head.
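
As a concrete illustration, here is a minimal, unoptimized NumPy sketch of the definition above, restricted to full ("valid") windows so the padding question does not arise; names are illustrative, not from the source:

```python
import numpy as np

def max_pool_2d(x: np.ndarray, f: int, s: int) -> np.ndarray:
    """Naive 2-D max pooling of a (H, W, C) tensor with filter size f and stride s."""
    H, W, C = x.shape
    H_out = (H - f) // s + 1
    W_out = (W - f) // s + 1
    y = np.empty((H_out, W_out, C), dtype=x.dtype)
    for i in range(H_out):
        for j in range(W_out):
            # Receptive field of y[i, j, :] is x[i*s : i*s + f, j*s : j*s + f, :].
            y[i, j, :] = x[i * s:i * s + f, j * s:j * s + f, :].max(axis=(0, 1))
    return y

def global_max_pool(x: np.ndarray) -> np.ndarray:
    """Global max pooling: one maximum per channel, output shape (C,)."""
    return x.max(axis=(0, 1))
```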


Average pooling

Average pooling (AvgPool) is similarly defined:

\mathrm{AvgPool}(x, f, s)_{i, j, c} = \mathrm{mean}(x_{is:is+f-1,\; js:js+f-1,\; c}) = \frac{1}{f^2} \sum_{k=0}^{f-1}\sum_{l=0}^{f-1} x_{is+k,\; js+l,\; c}

Global Average Pooling (GAP) is defined similarly to GMP: the output has shape \R^{C}, and each entry y_c is the average over the entire channel x_{:, :, c}. It was first proposed in Network-in-Network. Like GMP, it is often used just before the final fully connected layers in a CNN classification head.
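
Average pooling reuses the same windowing as the max-pooling sketch above, with the maximum replaced by the mean; global average pooling then reduces to a single NumPy call (illustrative sketch):

```python
import numpy as np

def global_avg_pool(x: np.ndarray) -> np.ndarray:
    """Global average pooling: mean over the spatial axes of a (H, W, C) tensor."""
    return x.mean(axis=(0, 1))   # output shape (C,)
```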


Interpolations

There are some interpolations of max pooling and average pooling.

Mixed Pooling is a linear combination of max pooling and average pooling:

\mathrm{MixedPool}(x, f, s, w) = w\, \mathrm{MaxPool}(x, f, s) + (1-w)\, \mathrm{AvgPool}(x, f, s)

where w \in [0, 1] is either a hyperparameter, a learnable parameter, or randomly sampled anew every time.

Lp Pooling is like average pooling, but uses an Lp norm average instead of the plain average:

y_k = \left(\frac{1}{N} \sum_i |x_i|^p\right)^{1/p}

where the sum runs over the N entries x_i in the receptive field of y_k, and p \geq 1 is a hyperparameter. If all activations are non-negative, then average pooling is the case p = 1, and max pooling is the limit p \to \infty. Square-root pooling is the case p = 2.

Stochastic pooling samples a random activation x_i from the receptive field with probability \frac{x_i}{\sum_j x_j}, where the sum again runs over the receptive field. It is the same as average pooling in expectation.

Softmax pooling is like max pooling, but uses a softmax-weighted average, i.e. \frac{\sum_i e^{\beta x_i} x_i}{\sum_i e^{\beta x_i}} where \beta > 0. Average pooling is the limit \beta \downarrow 0, and max pooling is the limit \beta \uparrow \infty.

Local Importance-based Pooling generalizes softmax pooling by \frac{\sum_i g(x_i)\, x_i}{\sum_i g(x_i)}, where g is a learnable function.
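
The following NumPy sketch (illustrative names, operating on a single flattened receptive field) shows Lp pooling and softmax pooling, and checks the limiting behaviour mentioned above:

```python
import numpy as np

def lp_pool(window: np.ndarray, p: float) -> float:
    """Lp pooling over one receptive field (any shape), with hyperparameter p >= 1."""
    v = np.abs(window).ravel()
    return float(np.mean(v ** p) ** (1.0 / p))

def softmax_pool(window: np.ndarray, beta: float) -> float:
    """Softmax pooling: a softmax-weighted average with inverse temperature beta > 0."""
    v = window.ravel()
    w = np.exp(beta * (v - v.max()))        # shift by the max for numerical stability
    return float(np.sum(w * v) / np.sum(w))

# Sanity checks on non-negative activations:
v = np.array([0.1, 0.5, 1.0, 2.0])
assert np.isclose(lp_pool(v, 1.0), v.mean())                    # p = 1 recovers average pooling
assert np.isclose(lp_pool(v, 64.0), v.max(), atol=0.1)          # large p approaches max pooling
assert np.isclose(softmax_pool(v, 100.0), v.max(), atol=1e-3)   # large beta approaches max pooling
```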


Other poolings

Spatial pyramidal pooling applies max pooling (or any other form of pooling) in a pyramid structure. That is, it applies global max pooling, then applies max pooling to the image divided into 4 equal parts, then 16, etc. The results are then concatenated. It is a hierarchical form of global pooling, and similar to global pooling, it is often used just before a classification head.
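
A minimal sketch of spatial pyramid pooling with max pooling, assuming roughly equal grid cells per level (the function name and level choice are illustrative):

```python
import numpy as np

def spatial_pyramid_pool(x: np.ndarray, levels=(1, 2, 4)) -> np.ndarray:
    """Spatial pyramid (max) pooling sketch for a (H, W, C) tensor.

    At level L the feature map is split into an L x L grid of roughly equal
    cells, each cell is globally max-pooled, and all results are concatenated.
    """
    H, W, C = x.shape
    parts = []
    for L in levels:
        # Cell boundaries along each axis (roughly equal parts).
        hs = np.linspace(0, H, L + 1, dtype=int)
        ws = np.linspace(0, W, L + 1, dtype=int)
        for i in range(L):
            for j in range(L):
                cell = x[hs[i]:hs[i + 1], ws[j]:ws[j + 1], :]
                parts.append(cell.max(axis=(0, 1)))     # one (C,) vector per cell
    return np.concatenate(parts)   # fixed length: C * sum(L*L for L in levels)
```
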
Region of Interest Pooling (also known as RoI pooling) is a variant of max pooling used in R-CNNs for object detection. It is designed to take an arbitrarily sized input matrix and output a fixed-size output matrix.

Covariance pooling computes the covariance matrix of the channel vectors \{x_{i, j, :}\}_{i, j}, which is then flattened to a C^2-dimensional vector y. Global covariance pooling is used similarly to global max pooling. As average pooling computes the average, which is a first-degree statistic, and covariance is a second-degree statistic, covariance pooling is also called "second-order pooling". It can be generalized to higher-order poolings.

Blur Pooling means applying a blurring method before downsampling. For example, Rect-2 blur pooling means taking an average pooling at f = 2, s = 1, then taking every second pixel (identity with s = 2).
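
Global covariance pooling can be sketched in a few lines of NumPy (illustrative, not a reference implementation):

```python
import numpy as np

def global_covariance_pool(x: np.ndarray) -> np.ndarray:
    """Global covariance (second-order) pooling of a (H, W, C) tensor.

    The H*W channel vectors x[i, j, :] are treated as samples; their C x C
    covariance matrix is computed and flattened into a C^2-dimensional vector.
    """
    H, W, C = x.shape
    samples = x.reshape(H * W, C)          # one row per spatial location
    cov = np.cov(samples, rowvar=False)    # (C, C) covariance across channels
    return cov.ravel()                     # flattened to shape (C**2,)
```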


Vision Transformer pooling

In Vision Transformers (ViT), there are the following common kinds of poolings.

BERT-like pooling uses a dummy [CLS] token ("classification"). For classification, the output at [CLS] is the classification token, which is then processed by a LayerNorm-feedforward-softmax module into a probability distribution, which is the network's predicted class probability distribution. This is the approach used by the original ViT and by Masked Autoencoder.

Global average pooling (GAP) does not use the dummy token, but simply takes the average of all output tokens as the classification token. It was mentioned in the original ViT paper as being equally good.

Multihead attention pooling (MAP) applies a multiheaded attention block to pooling. Specifically, it takes as input a list of vectors x_1, x_2, \dots, x_n, which might be thought of as the output vectors of a layer of a ViT. It then applies a feedforward layer \mathrm{FF} to each vector, resulting in a matrix V = [\mathrm{FF}(x_1), \dots, \mathrm{FF}(x_n)]. This is then sent to a multiheaded attention, resulting in \mathrm{MultiheadedAttention}(Q, V, V), where Q is a matrix of trainable parameters. This was first proposed in the Set Transformer architecture. Later papers demonstrated that GAP and MAP both perform better than BERT-like pooling.
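
As a rough illustration, the sketch below implements a single-head simplification of MAP in NumPy; the feedforward stand-in W_ff, the query matrix Q, and the single-head restriction are assumptions for illustration, not the Set Transformer implementation:

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(X: np.ndarray, Q: np.ndarray, W_ff: np.ndarray) -> np.ndarray:
    """Single-head sketch of attention pooling.

    X    : (n, d) output tokens of a ViT layer
    Q    : (m, d) trainable query matrix (m pooled vectors; often m = 1)
    W_ff : (d, d) stand-in for the feedforward layer applied to each token
    """
    V = X @ W_ff                            # feedforward applied to every token
    scores = Q @ V.T / np.sqrt(V.shape[1])  # (m, n) scaled dot-product scores
    A = softmax(scores, axis=-1)            # attention weights over the n tokens
    return A @ V                            # (m, d) pooled representation
```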


Graph neural network pooling

In graph neural networks (GNN), there are also two forms of pooling: global and local. Global pooling can be reduced to a local pooling where the receptive field is the entire output.

# Local pooling: a local pooling layer coarsens the graph via downsampling. Local pooling is used to increase the receptive field of a GNN, in a similar fashion to pooling layers in convolutional neural networks. Examples include k-nearest neighbours pooling, top-k pooling, and self-attention pooling.
# Global pooling: a global pooling layer, also known as a ''readout'' layer, provides a fixed-size representation of the whole graph. The global pooling layer must be permutation invariant, so that permutations in the ordering of graph nodes and edges do not alter the final output. Examples include element-wise sum, mean, or maximum.

Local pooling layers coarsen the graph via downsampling. We present here several learnable local pooling strategies that have been proposed. In each case, the input graph is represented by a matrix \mathbf{X} of node features and the graph adjacency matrix \mathbf{A}; the output is the new matrix \mathbf{X}' of node features and the new graph adjacency matrix \mathbf{A}'.


Top-k pooling

We first set

:\mathbf{y} = \frac{\mathbf{X}\mathbf{p}}{\|\mathbf{p}\|}

where \mathbf{p} is a learnable projection vector. The projection vector \mathbf{p} computes a scalar projection value for each graph node. The top-k pooling layer can then be formalised as follows:

:\mathbf{X}' = (\mathbf{X} \odot \text{sigmoid}(\mathbf{y}))_{\mathbf{i}}
:\mathbf{A}' = \mathbf{A}_{\mathbf{i}, \mathbf{i}}

where \mathbf{i} = \text{top}_k(\mathbf{y}) is the subset of nodes with the top-k highest projection scores, \odot denotes element-wise matrix multiplication, and \text{sigmoid}(\cdot) is the sigmoid function. In other words, the nodes with the top-k highest projection scores are retained in the new adjacency matrix \mathbf{A}'. The \text{sigmoid}(\cdot) operation makes the projection vector \mathbf{p} trainable by backpropagation, which otherwise would produce discrete outputs.
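
A NumPy sketch of the top-k pooling equations above (function and variable names are illustrative):

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def top_k_pool(X: np.ndarray, A: np.ndarray, p: np.ndarray, k: int):
    """Top-k graph pooling following the formulas above.

    X : (n, d) node feature matrix
    A : (n, n) adjacency matrix
    p : (d,)   learnable projection vector
    k : number of nodes to retain
    """
    y = X @ p / np.linalg.norm(p)               # scalar projection score per node
    idx = np.argsort(y)[-k:]                    # indices of the top-k scoring nodes
    X_new = X[idx] * sigmoid(y[idx])[:, None]   # gate retained features by sigmoid(y)
    A_new = A[np.ix_(idx, idx)]                 # adjacency restricted to retained nodes
    return X_new, A_new
```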


Self-attention pooling

We first set

:\mathbf{y} = \text{GNN}(\mathbf{X}, \mathbf{A})

where \text{GNN} is a generic permutation equivariant GNN layer (e.g., GCN, GAT, MPNN). The self-attention pooling layer can then be formalised as follows:

:\mathbf{X}' = (\mathbf{X} \odot \mathbf{y})_{\mathbf{i}}
:\mathbf{A}' = \mathbf{A}_{\mathbf{i}, \mathbf{i}}

where \mathbf{i} = \text{top}_k(\mathbf{y}) is the subset of nodes with the top-k highest projection scores, and \odot denotes element-wise matrix multiplication. The self-attention pooling layer can be seen as an extension of the top-k pooling layer. Differently from top-k pooling, the self-attention scores computed in self-attention pooling account both for the graph features and the graph topology.
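
Since only the score computation changes relative to top-k pooling, a sketch needs just a stand-in GNN scorer; here a one-layer mean-aggregation scorer is assumed purely for illustration:

```python
import numpy as np

def self_attention_pool(X: np.ndarray, A: np.ndarray, w: np.ndarray, k: int):
    """Self-attention graph pooling sketch.

    A one-layer mean-aggregation GNN (a stand-in for the generic GNN scorer)
    produces one attention score per node; the top-k scoring nodes are kept
    and their features gated by the scores, as in the formulas above.
    """
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)  # node degrees (including self-loops)
    y = (A_hat / deg) @ X @ w               # (n,) score per node from neighbourhood means
    idx = np.argsort(y)[-k:]                # keep the top-k scoring nodes
    X_new = X[idx] * y[idx][:, None]        # gate retained features by their scores
    A_new = A[np.ix_(idx, idx)]
    return X_new, A_new
```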


History

In the early 20th century, neuroanatomists noticed a certain motif where multiple neurons synapse onto the same neuron. This was given a functional explanation as "local pooling", which makes vision translation-invariant. Hartline (1940) gave supporting evidence for the theory through electrophysiological experiments on the receptive fields of retinal ganglion cells. The Hubel and Wiesel experiments showed that the vision system in cats is similar to a convolutional neural network, with some cells summing over inputs from the lower layer. See Westheimer (1965) for citations to this early literature. During the 1970s, to explain the effects of depth perception, some researchers, such as Julesz and Chang (1976), proposed that the vision system implements a disparity-selective mechanism by global pooling, where the outputs from matching pairs of retinal regions in the two eyes are pooled in higher-order cells. In artificial neural networks, max pooling was used in 1990 for speech processing (1-dimensional convolution).


See also

* Convolutional neural network
* Subsampling
* Image scaling
* Feature extraction
* Region of interest
* Graph neural network

