AlexNet

AlexNet is a convolutional neural network (CNN) architecture developed for image classification tasks, notably achieving prominence through its performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). It classifies images into 1,000 distinct object categories and is regarded as the first widely recognized application of deep convolutional networks in large-scale visual recognition. Developed in 2012 by Alex Krizhevsky in collaboration with Ilya Sutskever and his Ph.D. advisor Geoffrey Hinton at the University of Toronto, the model contains 60 million parameters and 650,000 neurons. The original paper's primary result was that the depth of the model was essential for its high performance; this made training computationally expensive, but it was made feasible by the use of graphics processing units (GPUs). The three formed team SuperVision and submitted AlexNet to the ImageNet Large Scale Visual Recognition Challenge on September 30, 2012. The network achieved a top-5 error of 15.3%, more than 10.8 percentage points better than that of the runner-up. The architecture influenced a large body of subsequent work in deep learning, especially in applying neural networks to computer vision.


Architecture

AlexNet contains eight layers: the first five are convolutional layers, some of them followed by max-pooling layers, and the last three are fully connected layers. The network, except for the last layer, is split into two copies, each run on one GPU, because the network did not fit in the VRAM of a single Nvidia GTX 580 3GB GPU. The entire structure can be written as

(CONV → RN → MP)² → (CONV³ → MP) → (FC → DO)² → Linear → softmax

where
* CONV = convolutional layer (with ReLU activation)
* RN = local response normalization
* MP = max-pooling
* FC = fully connected layer (with ReLU activation)
* Linear = fully connected layer (without activation)
* DO = dropout

Notably, convolutional layers 3, 4, and 5 were connected to one another without any intervening pooling or normalization. AlexNet used the non-saturating ReLU activation function, which trained better than tanh and sigmoid.
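A minimal PyTorch-style sketch of this layer stack may make it concrete. This is a single-GPU variant for illustration only: the filter counts follow the 2012 paper, but the padding choices follow common reimplementations, the two-GPU split is omitted, and the softmax is left to the loss/inference step.

```python
import torch
import torch.nn as nn

class AlexNet(nn.Module):
    """Single-GPU sketch of the AlexNet layer stack (filter counts from the 2012 paper)."""
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            # CONV1 -> ReLU -> local response norm -> max pool
            # padding=2 so a 224x224 input yields 55x55 feature maps
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # CONV2 -> ReLU -> local response norm -> max pool
            nn.Conv2d(96, 256, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # CONV3, CONV4, CONV5: no pooling or normalization in between
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            # (FC -> DO)^2 -> Linear; softmax is applied at inference time
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)
```

In the original implementation the convolutional and fully connected layers were duplicated across the two GPUs, with the two halves communicating only at certain layers; the sketch above merges them into a single stack.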


Training

The ImageNet training set contained 1.2 million images. The model was trained for 90 epochs over a period of five to six days using two Nvidia GTX 580 GPUs (3 GB each). These GPUs have a theoretical performance of 1.581 TFLOPS in float32 and were priced at US$500 upon release. Each forward pass of AlexNet requires approximately 1.43 GFLOPs, so the two GPUs together were theoretically capable of performing over 2,200 forward passes per second under ideal conditions.

The dataset images were stored in JPEG format and took up 27 GB of disk space. The neural network took up 2 GB of RAM on each GPU, and around 5 GB of system RAM during training. The GPUs were responsible for training, while the CPUs were responsible for loading images from disk and for data augmentation.

AlexNet was trained with momentum gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. The learning rate started at 10⁻² and was manually decreased 10-fold whenever validation error appeared to stop decreasing. It was reduced three times during training, ending at 10⁻⁵.

It used two forms of data augmentation, both computed on the fly on the CPU and thus "computationally free":
* Each image from ImageNet was first scaled so that its shorter side was of length 256. The central 256×256 patch was then cropped out and normalized: the pixel values were divided so that they fall between 0 and 1, then the per-channel means [0.485, 0.456, 0.406] were subtracted and the result divided by the per-channel standard deviations [0.229, 0.224, 0.225]. These are the mean and standard deviations of ImageNet, so this whitens the input data.
* Random 224×224 patches (and their horizontal reflections) were extracted from the 256×256 crop, increasing the size of the training set 2048-fold.
* The RGB value of each image was randomly shifted along the three principal directions of the RGB values of its pixels.

The resolution 224×224 was picked because 256 − 16 − 16 = 224: given a 256×256 image, framing out a width of 16 pixels on each of its four sides leaves a 224×224 image.

It used local response normalization, and dropout regularization with drop probability 0.5. All weights were initialized as Gaussians with mean 0 and standard deviation 0.01. Biases in convolutional layers 2, 4, and 5, and in all fully connected layers, were initialized to the constant 1 to avoid the dying ReLU problem.

At test time, to use a trained AlexNet to predict the class of an image, the image is first scaled so that its shorter side is of length 256, and the central 256×256 patch is cropped out. Then five 224×224 patches (the four corner patches and the center patch) as well as their horizontal reflections are extracted, 10 patches in all. The network's predicted probabilities on all 10 patches are averaged, and that is the final prediction.
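A minimal sketch of this preprocessing, random cropping, and test-time ten-crop pipeline, written with torchvision-style transforms, is shown below. The PCA-based RGB shift is omitted for brevity, and the names train_transform and test_transform are illustrative.

```python
import torch
from torchvision import transforms

# Per-channel ImageNet statistics used to whiten the input (values from the text above).
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

# Training-time pipeline: rescale so the shorter side is 256, take the central
# 256x256 patch, then sample a random 224x224 crop and its horizontal reflection.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),  # scales pixel values to [0, 1]
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

# Test-time pipeline: ten crops (four corners + center, plus their reflections);
# the network's predicted probabilities over the ten crops are averaged.
test_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.TenCrop(224),
    transforms.Lambda(lambda crops: torch.stack(
        [transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD)(transforms.ToTensor()(c))
         for c in crops])),
])
```

The optimizer described above corresponds roughly to torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=5e-4) in a modern framework, with the learning rate divided by 10 by hand whenever validation error plateaus.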


ImageNet competition

The version they used to enter the 2012 ImageNet competition was an ensemble of 7 AlexNets. Specifically, they trained 5 AlexNets of the previously described architecture (with 5 CONV layers) on the ILSVRC-2012 training set (1.2 million images). They also trained 2 variant AlexNets, obtained by adding one extra CONV layer over the last pooling layer; these were first trained on the entire ImageNet Fall 2011 release (15 million images in 22,000 categories) and then fine-tuned on the ILSVRC-2012 training set. The final prediction of the 7-network ensemble was obtained by averaging the predicted probabilities of the individual networks.
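A minimal sketch of this ensemble-averaging step, assuming models is a list of trained networks and x is a batch of preprocessed images (softmax is applied per model before averaging):

```python
import torch
import torch.nn.functional as F

def ensemble_predict(models, x):
    """Average the predicted class probabilities of several networks."""
    with torch.no_grad():
        probs = [F.softmax(m(x), dim=1) for m in models]  # one (batch, 1000) tensor per model
    return torch.stack(probs).mean(dim=0)                 # average over the ensemble
```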


History


Previous work

In 1980, Kunihiko Fukushima proposed an early CNN named the neocognitron, which was trained by an unsupervised learning algorithm. LeNet-5 (Yann LeCun et al., 1989) was trained by supervised learning with the backpropagation algorithm, with an architecture that is essentially the same as AlexNet on a small scale. Max pooling was used in 1990 for speech processing (essentially a one-dimensional CNN), and was first used for image processing in the Cresceptron of 1992.

During the 2000s, as GPU hardware improved, some researchers adapted it for general-purpose computing, including neural network training. K. Chellapilla et al. (2006) trained a CNN on a GPU that was 4 times faster than an equivalent CPU implementation. Raina et al. (2009) trained a deep belief network with 100 million parameters on an Nvidia GeForce GTX 280 at up to 70 times the speed of a CPU. A deep CNN by Dan Cireșan et al. (2011) at IDSIA was 60 times faster than an equivalent CPU implementation. Between May 15, 2011, and September 10, 2012, their CNN won four image competitions and achieved state of the art on multiple image databases. According to the AlexNet paper, Cireșan's earlier network is "somewhat similar". Both were written with CUDA to run on GPUs.


Computer vision

During the 1990–2010 period, neural networks were not better than other machine learning methods such as kernel regression, support vector machines, AdaBoost, and structured estimation. For computer vision in particular, much progress came from manual feature engineering, such as SIFT features, SURF features, HoG features, and bags of visual words. It was a minority position in computer vision that features could be learned directly from data, a position which became dominant after AlexNet.

In 2011, Geoffrey Hinton started reaching out to colleagues asking, "What do I have to do to convince you that neural networks are the future?", and Jitendra Malik, a sceptic of neural networks, recommended the PASCAL Visual Object Classes challenge. Hinton said its dataset was too small, so Malik recommended the ImageNet challenge instead.

The ImageNet dataset, which became central to AlexNet's success, was created by Fei-Fei Li and her collaborators beginning in 2007. Aiming to advance visual recognition through large-scale data, Li built a dataset far larger than earlier efforts, ultimately containing over 14 million labeled images across 22,000 categories. The images were labeled using Amazon Mechanical Turk and organized via the WordNet hierarchy. Initially met with skepticism, ImageNet later became the foundation of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) and a key resource in the rise of deep learning.

Sutskever and Krizhevsky were both graduate students. Before 2011, Krizhevsky had already written cuda-convnet to train small CNNs on CIFAR-10 with a single GPU. Sutskever convinced Krizhevsky, who was skilled at GPGPU programming, to train a CNN on ImageNet, with Hinton serving as principal investigator. Krizhevsky therefore extended cuda-convnet for multi-GPU training. AlexNet was trained on two Nvidia GTX 580 GPUs in Krizhevsky's bedroom at his parents' house. During 2012, Krizhevsky performed hyperparameter optimization on the network until it won the ImageNet competition later the same year. Hinton commented, "Ilya thought we should do it, Alex made it work, and I got the Nobel Prize". At the 2012 European Conference on Computer Vision, following AlexNet's win, researcher Yann LeCun described the model as "an unequivocal turning point in the history of computer vision".

AlexNet's success in 2012 was enabled by the convergence of three developments that had matured over the previous decade: large-scale labeled datasets, general-purpose GPU computing, and improved training methods for deep neural networks. The availability of ImageNet provided the data necessary for training deep models on a broad range of object categories. Advances in GPU programming through Nvidia's CUDA platform enabled practical training of large models. Together with algorithmic improvements, these factors enabled AlexNet to achieve high performance on large-scale visual recognition benchmarks. Reflecting on its significance over a decade later, Fei-Fei Li stated in a 2024 interview: "That moment was pretty symbolic to the world of AI because three fundamental elements of modern AI converged for the first time". While AlexNet and LeNet share essentially the same design and algorithm, AlexNet is much larger than LeNet and was trained on a much larger dataset on much faster hardware; over a period of 20 years, both data and compute had become cheaply available.


Subsequent work

AlexNet is highly influential, resulting in much subsequent work on using CNNs for computer vision and on using GPUs to accelerate deep learning. As of early 2025, the AlexNet paper has been cited over 172,000 times according to Google Scholar.

At the time of publication, there was no framework available for GPU-based neural network training and inference. The codebase for AlexNet was released under a BSD license and was commonly used in neural network research for several subsequent years.

In one direction, subsequent works aimed to train increasingly deep CNNs that achieve increasingly higher performance on ImageNet. In this line of research are GoogLeNet (2014), VGGNet (2014), Highway networks (2015), and ResNet (2015). Another direction aimed to reproduce the performance of AlexNet at a lower cost. In this line of research are SqueezeNet (2016), MobileNet (2017), and EfficientNet (2019).

Geoffrey Hinton, Ilya Sutskever, and Alex Krizhevsky formed DNNResearch soon afterwards and sold the company, along with the AlexNet source code, to Google. There have been improvements and reimplementations of AlexNet, but the original 2012 version, as it was at the time of its ImageNet win, has been released under the BSD-2 license via the Computer History Museum.

