The swish function is a mathematical function defined as follows:
swish(x) = x · sigmoid(βx) = x / (1 + e^(−βx)),
where β is either a constant or a trainable parameter
depending on the model. For β = 1, the function becomes equivalent to the Sigmoid Linear Unit
or SiLU, first proposed alongside the
GELU in 2016. The SiLU was later rediscovered in 2017 as the Sigmoid-weighted Linear Unit (SiL) function used in
reinforcement learning.
The SiLU/SiL was then rediscovered as the swish over a year after its initial discovery, originally proposed without the learnable parameter β, so that β implicitly equalled 1. The swish paper was later updated to propose the activation with the learnable parameter β, though in practice researchers usually fix β = 1 and do not use the learnable parameter. For β = 0, the function turns into the scaled linear function f(x) = x/2.
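As a concrete illustration of these special cases, the following is a minimal NumPy sketch of the definition above (an illustrative example, not an implementation from the cited papers; the names swish and beta are chosen here for clarity):

import numpy as np

def swish(x, beta=1.0):
    # swish(x) = x * sigmoid(beta * x) = x / (1 + exp(-beta * x))
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-2.0, 0.0, 2.0])
print(swish(x, beta=1.0))  # beta = 1: the SiLU, x * sigmoid(x)
print(swish(x, beta=0.0))  # beta = 0: the scaled linear function x/2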
With β → ∞, the
sigmoid component approaches a 0-1 function, so swish approaches the
ReLU function. Thus, it can be viewed as a smoothing function which nonlinearly
interpolates between a linear function and the ReLU function.
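The two endpoints of this interpolation can be made explicit; as a brief sketch using the notation above, with σ the logistic sigmoid:

\[
\operatorname{swish}_\beta(x) = x\,\sigma(\beta x), \qquad
\lim_{\beta \to 0} \operatorname{swish}_\beta(x) = x\,\sigma(0) = \frac{x}{2}, \qquad
\lim_{\beta \to \infty} \operatorname{swish}_\beta(x) = \max(0, x) = \operatorname{ReLU}(x),
\]

since σ(βx) tends to 1 for x > 0 and to 0 for x < 0 as β → ∞.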
The swish function is non-monotonic, which may have influenced the proposal of other activation functions with this property, such as
Mish.
For positive values, swish is a particular case of the sigmoid shrinkage function defined in
(see the doubly parameterized sigmoid shrinkage form given by Equation (3) of this reference).
Applications
In 2017, after performing analysis on
ImageNet data, researchers from
Google
indicated that using the function as an
activation function in
artificial neural networks improves performance compared to the ReLU and sigmoid functions.
It is believed that one reason for the improvement is that the swish function helps alleviate the
vanishing gradient problem during
backpropagation.
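As a hedged illustration of this point (not an analysis from the cited sources), the derivative of swish for β = 1 can be written as f′(x) = f(x) + sigmoid(x)·(1 − f(x)), which tends to 1 for large positive inputs, whereas the sigmoid's own derivative tends to 0 there:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish_grad(x):
    # derivative of x * sigmoid(x): f'(x) = f(x) + sigmoid(x) * (1 - f(x))
    f = x * sigmoid(x)
    return f + sigmoid(x) * (1.0 - f)

x = np.array([2.0, 5.0, 10.0])
print(swish_grad(x))                    # approaches 1: the gradient does not saturate
print(sigmoid(x) * (1.0 - sigmoid(x)))  # approaches 0: the sigmoid gradient vanishes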
References
Hendrycks, Dan; Gimpel, Kevin (2016). "Gaussian Error Linear Units (GELUs)". arXiv:1606.08415 [cs.LG].
Elfwing, Stefan; Uchibe, Eiji; Doya, Kenji (2017-11-02). "Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning". arXiv:1702.03118v3 [cs.LG].
Ramachandran, Prajit; Zoph, Barret; Le, Quoc V. (2017-10-27). "Searching for Activation Functions". arXiv:1710.05941v2 [cs.NE].
Serengil, Sefik Ilkin (2018-08-21). "Swish as Neural Networks Activation Function". Machine Learning, Math. https://sefiks.com/2018/08/21/swish-as-neural-networks-activation-function/ (archived from the original on 2020-06-18).