The swish function is a mathematical function defined as follows:
swish(x) = x · sigmoid(βx) = x / (1 + e^(−βx)),
where β is either a constant or a trainable parameter
depending on the model. For β = 1, the function becomes equivalent to the Sigmoid Linear Unit
or SiLU, first proposed alongside the
GELU in 2016. The SiLU was later rediscovered in 2017 as the Sigmoid-weighted Linear Unit (SiL) function used in
reinforcement learning.
The SiLU/SiL was then rediscovered as the swish over a year after its initial discovery, originally proposed without the learnable parameter β, so that β implicitly equalled 1. The swish paper was later updated to propose the activation with the learnable parameter β, though in practice researchers usually fix β = 1 and do not use the learnable parameter. For β = 0, the function turns into the scaled linear function f(x) = x/2.
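As a concrete illustration of these special cases, the following is a minimal NumPy sketch of the definition above (an illustrative example, not an implementation from the cited papers; the names swish and beta are chosen here for clarity):

import numpy as np

def swish(x, beta=1.0):
    # swish(x) = x * sigmoid(beta * x) = x / (1 + exp(-beta * x))
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-2.0, 0.0, 2.0])
print(swish(x, beta=1.0))  # beta = 1: the SiLU, x * sigmoid(x)
print(swish(x, beta=0.0))  # beta = 0: the scaled linear function x/2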
With β → ∞, the
sigmoid component approaches a 0-1 function, so swish approaches the
ReLU function. Thus, it can be viewed as a smoothing function which nonlinearly
interpolates between a linear function and the ReLU function.
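The two endpoints of this interpolation can be made explicit; as a brief sketch using the notation above, with σ the logistic sigmoid:

\[
\operatorname{swish}_\beta(x) = x\,\sigma(\beta x), \qquad
\lim_{\beta \to 0} \operatorname{swish}_\beta(x) = x\,\sigma(0) = \frac{x}{2}, \qquad
\lim_{\beta \to \infty} \operatorname{swish}_\beta(x) = \max(0, x) = \operatorname{ReLU}(x),
\]

since σ(βx) tends to 1 for x > 0 and to 0 for x < 0 as β → ∞.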
The swish function is non-monotonic, which may have influenced the proposal of other activation functions with this property, such as
Mish.
For positive values, swish is a particular case of the sigmoid shrinkage function defined in
(see the doubly parameterized sigmoid shrinkage form given by Equation (3) of this reference).
Applications
In 2017, after performing analysis on
ImageNet data, researchers from
Google
indicated that using the function as an
activation function in
artificial neural networks improves performance compared to the ReLU and sigmoid functions.
It is believed that one reason for the improvement is that the swish function helps alleviate the
vanishing gradient problem during
backpropagation.
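As a hedged illustration of this point (not an analysis from the cited sources), the derivative of swish for β = 1 can be written as f′(x) = f(x) + sigmoid(x)·(1 − f(x)), which tends to 1 for large positive inputs, whereas the sigmoid's own derivative tends to 0 there:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish_grad(x):
    # derivative of x * sigmoid(x): f'(x) = f(x) + sigmoid(x) * (1 - f(x))
    f = x * sigmoid(x)
    return f + sigmoid(x) * (1.0 - f)

x = np.array([2.0, 5.0, 10.0])
print(swish_grad(x))                    # approaches 1: the gradient does not saturate
print(sigmoid(x) * (1.0 - sigmoid(x)))  # approaches 0: the sigmoid gradient vanishes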
References
Hendrycks, Dan; Gimpel, Kevin (2016). "Gaussian Error Linear Units (GELUs)". arXiv:1606.08415 [cs.LG].
Elfwing, Stefan; Uchibe, Eiji; Doya, Kenji (2017-11-02). "Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning". arXiv:1702.03118v3 [cs.LG].
Ramachandran, Prajit; Zoph, Barret; Le, Quoc V. (2017-10-27). "Searching for Activation Functions". arXiv:1710.05941v2 [cs.NE].
Serengil, Sefik Ilkin (2018-08-21). "Swish as Neural Networks Activation Function". Machine Learning, Math. https://sefiks.com/2018/08/21/swish-as-neural-networks-activation-function/ (archived from the original on 2020-06-18).