Model compression is a machine learning technique for reducing the size of trained models. Large models can achieve high accuracy, but often at the cost of significant resource requirements. Compression techniques aim to shrink models without a significant loss in performance. Smaller models require less storage space, and consume less memory and compute during inference. Compressed models enable deployment on resource-constrained devices such as smartphones, embedded systems, edge computing devices, and consumer electronics. Efficient inference is also valuable for large corporations that serve large models over an API, allowing them to reduce computational costs and improve response times for users. Model compression is not to be confused with knowledge distillation, in which a ''separate'', smaller "student" model is trained to imitate the input-output behavior of a larger "teacher" model.


Techniques

Several techniques are employed for model compression.


Pruning

Pruning sparsifies a large model by setting some parameters to exactly zero, effectively reducing the number of parameters. This allows the use of sparse matrix operations, which are faster than dense ones. Pruning criteria can be based on the magnitudes of parameters, the statistical pattern of neural activations, Hessian values, etc.
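
As an illustration, here is a minimal magnitude-based pruning sketch in NumPy. The function name, the 90% sparsity level, and the random weights are arbitrary choices for this example, not a reference implementation.

<syntaxhighlight lang="python">
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude entries so that roughly `sparsity`
    (a fraction in [0, 1]) of the weights become exactly zero."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask

# Example: prune 90% of a random weight matrix.
w = np.random.randn(256, 256).astype(np.float32)
w_sparse = magnitude_prune(w, sparsity=0.9)
print(f"fraction of zeros: {(w_sparse == 0).mean():.2f}")
</syntaxhighlight>

In practice, pruning is usually applied iteratively and followed by finetuning, so the remaining weights can compensate for the removed ones.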


Quantization

Quantization reduces the numerical precision of weights and activations. For example, instead of storing weights as 32-bit floating-point numbers, they can be represented using 8-bit integers. Low-precision parameters take up less space and require less compute for arithmetic. It is also possible to quantize some parameters more aggressively than others: for example, a less important parameter can be stored at 8-bit precision while another, more important parameter is kept at 16-bit precision. Inference with such models requires mixed-precision arithmetic. Quantization can also be applied during training (rather than only after training). PyTorch implements automatic mixed precision (AMP), which performs autocasting and gradient (loss) scaling.
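
A minimal sketch of symmetric per-tensor int8 quantization follows. The function names and the quantization scheme are chosen for illustration only; production frameworks typically use per-channel scales and calibrated ranges rather than the raw maximum.

<syntaxhighlight lang="python">
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization of float32 values to int8.
    Returns the int8 tensor and the scale needed to dequantize."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)  # 32-bit weights: 4 bytes each
q, scale = quantize_int8(w)                        # 8-bit weights: 1 byte each
error = np.abs(w - dequantize(q, scale)).mean()
print(f"mean absolute quantization error: {error:.5f}")
</syntaxhighlight>

For mixed precision during training, a typical PyTorch AMP loop (sketch only; assumes a CUDA GPU, and the model, data, and hyperparameters here are placeholders) combines autocasting with gradient scaling:

<syntaxhighlight lang="python">
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()     # handles loss/gradient scaling

x = torch.randn(64, 1024, device="cuda")
y = torch.randn(64, 1024, device="cuda")

for step in range(100):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # autocast: run eligible ops in float16
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()        # backpropagate the scaled loss
    scaler.step(optimizer)               # unscales gradients, then steps
    scaler.update()                      # adjust the scale factor
</syntaxhighlight>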


Low-rank factorization

Weight matrices can be approximated by low-rank matrices. Let W be a weight matrix of shape m \times n. A low-rank approximation is W \approx UV^T, where U and V are matrices of shapes m \times k and n \times k respectively. When k is small, this both reduces the number of parameters needed to represent W approximately and accelerates matrix multiplication by W. Low-rank approximations can be found by singular value decomposition (SVD). The choice of rank for each weight matrix is a hyperparameter, jointly optimized as a mixed discrete-continuous optimization problem. The rank of weight matrices may also be pruned after training, taking into account the effect of activation functions like ReLU on the implicit rank of the weight matrices.
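
A short sketch of SVD-based truncation, assuming NumPy (the matrix sizes and the rank k = 32 are arbitrary choices for this example):

<syntaxhighlight lang="python">
import numpy as np

def low_rank_approx(W: np.ndarray, k: int):
    """Return factors U_k (m x k) and V_k (n x k) such that W ≈ U_k @ V_k.T,
    using the top-k singular triplets."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_k = U[:, :k] * s[:k]          # fold the singular values into U
    V_k = Vt[:k, :].T               # shape (n, k)
    return U_k, V_k

m, n, k = 1024, 512, 32
W = np.random.randn(m, n).astype(np.float32)
U_k, V_k = low_rank_approx(W, k)

# Parameter count drops from m*n to k*(m+n); W @ x becomes U_k @ (V_k.T @ x).
print(W.size, U_k.size + V_k.size)
print(np.linalg.norm(W - U_k @ V_k.T) / np.linalg.norm(W))
</syntaxhighlight>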


Training

Model compression may be decoupled from training; that is, a model is first trained without regard for how it might be compressed, then compressed afterwards. However, compression may also be combined with training. The "train big, then compress" method trains a large model for a small number of training steps (fewer than if it were trained to convergence), then heavily compresses it. It has been found that, at the same compute budget, this method yields a better model than a lightly compressed, small model. In Deep Compression, the compression has three steps (a sketch of the weight-clustering step appears after this list):
* First loop (pruning): prune all weights lower than a threshold, then finetune the network, then prune again, etc.
* Second loop (quantization): cluster weights, then enforce weight sharing among all weights in each cluster, then finetune the network, then cluster again, etc.
* Third step: use Huffman coding to losslessly compress the model.
The SqueezeNet paper reported that Deep Compression achieved a compression ratio of 35 on AlexNet, and a ratio of ~10 on SqueezeNets.
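
A rough sketch of the weight-sharing (clustering) step, assuming scikit-learn's KMeans; the codebook size of 16, the random weights, and the omission of centroid finetuning are simplifications for this example and not the Deep Compression reference implementation:

<syntaxhighlight lang="python">
import numpy as np
from sklearn.cluster import KMeans

def cluster_weights(W: np.ndarray, n_clusters: int = 16):
    """Share weights within clusters: each weight is replaced by its
    cluster centroid, so only a small codebook of centroids plus a
    per-weight cluster index needs to be stored."""
    flat = W.reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(flat)
    codebook = km.cluster_centers_.ravel()      # n_clusters float values
    indices = km.labels_.astype(np.uint8)       # log2(n_clusters) bits per weight
    W_shared = codebook[indices].reshape(W.shape)
    return W_shared, codebook, indices

W = np.random.randn(128, 128).astype(np.float32)
W_shared, codebook, indices = cluster_weights(W, n_clusters=16)
# 16 clusters -> 4-bit indices per weight instead of 32-bit floats.
</syntaxhighlight>

After clustering, the index array is highly repetitive, which is what makes the final Huffman-coding step effective.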

