In mathematics, matrix calculus is a specialized notation for doing multivariable calculus, especially over spaces of matrices. It collects the various partial derivatives of a single function with respect to many variables, and/or of a multivariate function with respect to a single variable, into vectors and matrices that can be treated as single entities. This greatly simplifies operations such as finding the maximum or minimum of a multivariate function and solving systems of differential equations. The notation used here is commonly used in statistics and engineering, while the tensor index notation is preferred in physics.

Two competing notational conventions split the field of matrix calculus into two separate groups. The two groups can be distinguished by whether they write the derivative of a scalar with respect to a vector as a column vector or a row vector. Both of these conventions are possible even when the common assumption is made that vectors should be treated as column vectors when combined with matrices (rather than row vectors). A single convention can be somewhat standard throughout a single field that commonly uses matrix calculus (e.g. econometrics, statistics, estimation theory and machine learning). However, even within a given field different authors can be found using competing conventions. Authors of both groups often write as though their specific convention were standard. Serious mistakes can result when combining results from different authors without carefully verifying that compatible notations have been used. Definitions of these two conventions and comparisons between them are collected in the layout conventions section.
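As a small illustration of how the two conventions differ, consider the linear function $f(\mathbf{x}) = \mathbf{a}^\mathsf{T}\mathbf{x}$ with a constant column vector a. This example is an illustrative choice, not one taken from the text; the two results below are transposes of each other, which is the kind of mismatch that causes errors when formulas from different sources are combined.

```latex
% Convention 1: derivative of a scalar by a column vector written as a row vector
\frac{\partial\, \mathbf{a}^\mathsf{T}\mathbf{x}}{\partial \mathbf{x}} = \mathbf{a}^\mathsf{T}

% Convention 2: the same derivative written as a column vector
\frac{\partial\, \mathbf{a}^\mathsf{T}\mathbf{x}}{\partial \mathbf{x}} = \mathbf{a}
```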

Scope

Matrix calculus refers to a number of different notations that use matrices and vectors to collect the derivative of each component of the dependent variable with respect to each component of the independent variable. In general, the independent variable can be a scalar, a vector, or a matrix while the dependent variable can be any of these as well. Each different situation will lead to a different set of rules, or a separate calculus, using the broader sense of the term. Matrix notation serves as a convenient way to collect the many derivatives in an organized way.

As a first example, consider the gradient from vector calculus. For a scalar function of three independent variables, $f(x_1, x_2, x_3)$, the gradient is given by the vector equation

$$\nabla f = \frac{\partial f}{\partial x_1} \hat{x}_1 + \frac{\partial f}{\partial x_2} \hat{x}_2 + \frac{\partial f}{\partial x_3} \hat{x}_3,$$

where $\hat{x}_i$ represents a unit vector in the $x_i$ direction for $1\le i \le 3$. This type of generalized derivative can be seen as the derivative of a scalar, ''f'', with respect to a vector, $\mathbf{x}$, and its result can be easily collected in vector form.

$$\nabla f = \left( \frac{\partial f}{\partial \mathbf{x}} \right)^\mathsf{T} =
  \begin{bmatrix}
    \frac{\partial f}{\partial x_1} &
    \frac{\partial f}{\partial x_2} &
    \frac{\partial f}{\partial x_3} \\
  \end{bmatrix}^\textsf{T}.$$
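The gradient above can be checked numerically. The following is a minimal sketch using central differences; the function `f` and the helper name `gradient` are hypothetical choices made for illustration, not part of the article.

```python
def gradient(f, x, h=1e-6):
    """Central-difference estimate of [df/dx_1, ..., df/dx_n] at the point x."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def f(x):
    # Hypothetical example: f(x1, x2, x3) = x1^2 + 3*x2 + x1*x3.
    x1, x2, x3 = x
    return x1 ** 2 + 3 * x2 + x1 * x3


# The analytic gradient is [2*x1 + x3, 3, x1], which is [5, 3, 1] at (1, 2, 3).
grad_f = gradient(f, [1.0, 2.0, 3.0])
```

The returned list holds the three partial derivatives in order, so it plays the role of the collected vector form of the gradient.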

More complicated examples include the derivative of a scalar function with respect to a matrix, known as the gradient matrix, which collects the derivative with respect to each matrix element in the corresponding position in the resulting matrix. In that case the scalar must be a function of each of the independent variables in the matrix. As another example, if we have an ''n''-vector of dependent variables, or functions, of ''m'' independent variables we might consider the derivative of the dependent vector with respect to the independent vector. The result could be collected in an ''m×n'' matrix (or its ''n×m'' transpose, depending on the layout convention) consisting of all of the possible derivative combinations.

There are a total of nine possibilities using scalars, vectors, and matrices. Notice that as we consider higher numbers of components in each of the independent and dependent variables we can be left with a very large number of possibilities. The six kinds of derivatives that can be most neatly organized in matrix form are the derivatives of a scalar, a vector, or a matrix by a scalar; of a scalar or a vector by a vector; and of a scalar by a matrix. Here, we have used the term "matrix" in its most general sense, recognizing that vectors are simply matrices with one column, and scalars are matrices with one row and one column. Moreover, we have used bold lowercase letters to indicate vectors and bold capital letters for matrices. This notation is used throughout.

Notice that we could also talk about the derivative of a vector with respect to a matrix, or any of the other remaining combinations. However, these derivatives are most naturally organized in a tensor of rank higher than 2, so that they do not fit neatly into a matrix. In the following three sections we will define each one of these derivatives and relate them to other branches of mathematics. See the layout conventions section for more detail.

Relation to other derivatives

The matrix derivative is a convenient notation for keeping track of partial derivatives for doing calculations. The Fréchet derivative is the standard way in the setting of functional analysis to take derivatives with respect to vectors. In the case that a matrix function of a matrix is Fréchet differentiable, the two derivatives will agree up to translation of notations. As is the case in general for partial derivatives, some formulae may extend under weaker analytic conditions than the existence of the derivative as approximating linear mapping.
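To make "agree up to translation of notations" concrete, here is a short worked example for the scalar function $f(\mathbf{X}) = \operatorname{tr}(\mathbf{X}^\mathsf{T}\mathbf{X})$. The function and the increment symbol H are illustrative assumptions; the matrix-calculus form is written in the numerator layout that this article introduces in later sections.

```latex
% Expand f(X + H) and isolate the part that is linear in the increment H.
f(\mathbf{X} + \mathbf{H}) - f(\mathbf{X})
  = 2\operatorname{tr}\left(\mathbf{X}^\mathsf{T}\mathbf{H}\right)
  + \operatorname{tr}\left(\mathbf{H}^\mathsf{T}\mathbf{H}\right)

% The linear part is the Frechet derivative at X applied to H.
Df(\mathbf{X})[\mathbf{H}] = 2\operatorname{tr}\left(\mathbf{X}^\mathsf{T}\mathbf{H}\right)

% The same information in matrix-calculus notation (numerator layout).
\frac{\partial f}{\partial \mathbf{X}} = 2\mathbf{X}^\mathsf{T},
\qquad
Df(\mathbf{X})[\mathbf{H}]
  = \operatorname{tr}\left(\frac{\partial f}{\partial \mathbf{X}}\,\mathbf{H}\right)
```

The remainder term $\operatorname{tr}(\mathbf{H}^\mathsf{T}\mathbf{H})$ is quadratic in H, so it vanishes faster than H and does not contribute to the derivative.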

Usages

Matrix calculus is used for deriving optimal stochastic estimators, often involving the use of Lagrange multipliers. This includes the derivation of:
* Kalman filter
* Wiener filter
* Expectation-maximization algorithm for Gaussian mixture
* Gradient descent
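As an illustration of the last item, the sketch below runs gradient descent on the least-squares objective $f(\mathbf{x}) = \lVert \mathbf{A}\mathbf{x} - \mathbf{b} \rVert^2$, whose gradient written as a column vector is $2\mathbf{A}^\mathsf{T}(\mathbf{A}\mathbf{x} - \mathbf{b})$. The objective, data, step size, iteration count, and all helper names are assumptions chosen for this example.

```python
def matvec(A, x):
    """Matrix-vector product A x for a list-of-rows matrix."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def least_squares_gradient(A, b, x):
    """Column-vector gradient 2 * A^T (A x - b) of f(x) = ||A x - b||^2."""
    residual = [p - q for p, q in zip(matvec(A, x), b)]
    return [2 * g for g in matvec(transpose(A), residual)]


def gradient_descent(A, b, x0, step=0.05, iterations=500):
    """Repeatedly step against the gradient, starting from x0."""
    x = list(x0)
    for _ in range(iterations):
        grad = least_squares_gradient(A, b, x)
        x = [x_i - step * g_i for x_i, g_i in zip(x, grad)]
    return x


# Hypothetical data chosen so that A x = b holds exactly at x = (1, 2).
A = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
b = [1.0, 4.0, 3.0]
x_min = gradient_descent(A, b, [0.0, 0.0])
```

The step size must be small enough for the iteration to converge; the value used here suits this particular A and is not a general recommendation.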

Notation

The vector and matrix derivatives presented in the sections to follow take full advantage of matrix notation, using a single variable to represent a large number of variables. In what follows we will distinguish scalars, vectors and matrices by their typeface. We will let ''M''(''n'',''m'') denote the space of real ''n×m'' matrices with ''n'' rows and ''m'' columns. Such matrices will be denoted using bold capital letters: A, X, Y, etc. An element of ''M''(''n'',1), that is, a column vector, is denoted with a boldface lowercase letter: a, x, y, etc. An element of ''M''(1,1) is a scalar, denoted with lowercase italic typeface: ''a'', ''t'', ''x'', etc. X^T denotes matrix transpose, tr(X) is the trace, and det(X) or |X| is the determinant. All functions are assumed to be of differentiability class ''C''¹ unless otherwise noted. Generally letters from the first half of the alphabet (a, b, c, ...) will be used to denote constants, and from the second half (t, x, y, ...) to denote variables.

NOTE: As mentioned above, there are competing notations for laying out systems of partial derivatives in vectors and matrices, and no standard appears to be emerging yet. The next two introductory sections use the numerator layout convention simply for the purposes of convenience, to avoid overly complicating the discussion. The section after them discusses layout conventions in more detail. It is important to realize the following:
# Despite the use of the terms "numerator layout" and "denominator layout", there are actually more than two possible notational choices involved. The reason is that the choice of numerator vs. denominator (or in some situations, numerator vs. mixed) can be made independently for scalar-by-vector, vector-by-scalar, vector-by-vector, and scalar-by-matrix derivatives, and a number of authors mix and match their layout choices in various ways.
# The choice of numerator layout in the introductory sections below does not imply that this is the "correct" or "superior" choice. There are advantages and disadvantages to the various layout types. Serious mistakes can result from carelessly combining formulas written in different layouts, and converting from one layout to another requires care to avoid errors. As a result, when working with existing formulas the best policy is probably to identify whichever layout is used and maintain consistency with it, rather than attempting to use the same layout in all situations.

Alternatives

The tensor index notation with its Einstein summation convention is very similar to the matrix calculus, except one writes only a single component at a time. It has the advantage that one can easily manipulate arbitrarily high rank tensors, whereas tensors of rank higher than two are quite unwieldy with matrix notation. All of the work here can be done in this notation without use of the single-variable matrix notation. However, many problems in estimation theory and other areas of applied mathematics would result in too many indices to properly keep track of, pointing in favor of matrix calculus in those areas. Also, Einstein notation can be very useful in proving the identities presented here (see section on differentiation) as an alternative to typical element notation, which can become cumbersome when the explicit sums are carried around. Note that a matrix can be considered a tensor of rank two.
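As a brief sketch of how index notation can be used to prove an identity, take the scalar $f = \mathbf{a}^\mathsf{T}\mathbf{X}\mathbf{b}$ with constant vectors a and b. The example and the use of the Kronecker delta $\delta$ are illustrative choices; repeated indices are summed.

```latex
% Write the scalar one component at a time.
f = a_i X_{ij} b_j

% Differentiate with respect to a single entry X_{kl}.
\frac{\partial f}{\partial X_{kl}}
  = a_i \,\delta_{ik}\,\delta_{jl}\, b_j
  = a_k b_l

% Collect the entries into a matrix.
% Same shape as X (entry (k,l) holds a_k b_l):      a b^T
% Transposed shape (entry (l,k) holds a_k b_l):     b a^T
\frac{\partial\, \mathbf{a}^\mathsf{T}\mathbf{X}\mathbf{b}}{\partial \mathbf{X}}
  = \mathbf{a}\mathbf{b}^\mathsf{T}
  \quad\text{or}\quad
  \mathbf{b}\mathbf{a}^\mathsf{T}
```

The component result $a_k b_l$ is unambiguous; only the final step of arranging it into a matrix depends on the layout convention.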

Derivatives with vectors

Because vectors are matrices with only one column, the simplest matrix derivatives are vector derivatives. The notations developed here can accommodate the usual operations of vector calculus by identifying the space ''M''(''n'',1) of ''n''-vectors with the Euclidean space R^''n'', and the scalar ''M''(1,1) is identified with R. The corresponding concept from vector calculus is indicated at the end of each subsection.

NOTE: The discussion in this section assumes the numerator layout convention for pedagogical purposes. Some authors use different conventions. The section on layout conventions discusses this issue in greater detail. The identities given further down are presented in forms that can be used in conjunction with all common layout conventions.

Vector-by-scalar

The derivative of a vector $\mathbf{y} = \begin{bmatrix} y_1 & y_2 & \cdots & y_m \end{bmatrix}^\mathsf{T}$, by a scalar ''x'' is written (in numerator layout notation) as

$$\frac{\partial \mathbf{y}}{\partial x} =
  \begin{bmatrix}
    \frac{\partial y_1}{\partial x}\\
    \frac{\partial y_2}{\partial x}\\
    \vdots\\
    \frac{\partial y_m}{\partial x}\\
  \end{bmatrix}.$$

In vector calculus, the derivative of a vector y with respect to a scalar ''x'' is known as the tangent vector of the vector y, $\frac{\partial \mathbf{y}}{\partial x}$. Notice here that y: R¹ → R^''m''.

Example: Simple examples of this include the velocity vector in Euclidean space, which is the tangent vector of the position vector (considered as a function of time). Also, the acceleration is the tangent vector of the velocity.
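The velocity example can be sketched numerically. The helix trajectory and the helper name `tangent_vector` below are hypothetical choices for illustration; the derivative is estimated component by component with central differences.

```python
import math


def position(t):
    # Hypothetical trajectory: a helix r(t) = (cos t, sin t, t).
    return [math.cos(t), math.sin(t), t]


def tangent_vector(y, x, h=1e-6):
    """Central-difference estimate of dy/dx, one entry per component of y."""
    ahead, behind = y(x + h), y(x - h)
    return [(a - b) / (2 * h) for a, b in zip(ahead, behind)]


# The analytic velocity is (-sin t, cos t, 1), which is (0, 1, 1) at t = 0.
velocity = tangent_vector(position, 0.0)
```

The result has one entry per component of the position vector, matching the column layout of the vector-by-scalar derivative.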

Scalar-by-vector

The derivative of a scalar ''y'' by a vector $\mathbf{x} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}^\mathsf{T}$, is written (in numerator layout notation) as

$$\frac{\partial y}{\partial \mathbf{x}} =
  \begin{bmatrix}
    \frac{\partial y}{\partial x_1} &
    \frac{\partial y}{\partial x_2} &
    \cdots &
    \frac{\partial y}{\partial x_n}
  \end{bmatrix}.$$

In vector calculus, the gradient of a scalar field ''f'' in the space R^''n'' (whose independent coordinates are the components of x) is the transpose of the derivative of a scalar by a vector.

$$\nabla f = \begin{bmatrix}\frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} = \left( \frac{\partial f}{\partial \mathbf{x}} \right)^\mathsf{T}$$

For example, in physics, the electric field is the negative vector gradient of the electric potential.

The directional derivative of a scalar function ''f''(x) of the space vector x in the direction of the unit vector u (represented in this case as a column vector) is defined using the gradient as follows.

$$\nabla_{\mathbf{u}} f(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{u}$$

Using the notation just defined for the derivative of a scalar with respect to a vector we can re-write the directional derivative as $\nabla_\mathbf{u} f = \frac{\partial f}{\partial \mathbf{x}} \mathbf{u}.$ This type of notation will be nice when proving product rules and chain rules that come out looking similar to what we are familiar with for the scalar derivative.
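The directional derivative written as a row vector times a column vector can be compared with a direct finite-difference estimate along u. This is a minimal sketch; the scalar field, the point, and the direction are hypothetical values picked for the example.

```python
def f(x):
    # Hypothetical scalar field f(x1, x2) = x1^2 * x2.
    return x[0] ** 2 * x[1]


def derivative_row(x):
    # Analytic df/dx in numerator layout: the row [2*x1*x2, x1^2].
    return [2 * x[0] * x[1], x[0] ** 2]


def directional_derivative(x, u):
    # Row vector times column vector: (df/dx) u.
    return sum(d_i * u_i for d_i, u_i in zip(derivative_row(x), u))


x = [1.0, 2.0]
u = [3 / 5, 4 / 5]  # a unit vector
h = 1e-6

along_u = directional_derivative(x, u)

# Central difference of f along the line x + s*u, evaluated at s = 0.
ahead = [x_i + h * u_i for x_i, u_i in zip(x, u)]
behind = [x_i - h * u_i for x_i, u_i in zip(x, u)]
numeric = (f(ahead) - f(behind)) / (2 * h)
```

At this point the row is [4, 1], so the product with u is 4(3/5) + 1(4/5) = 3.2, and the finite-difference estimate should agree to within rounding.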

Vector-by-vector

Each of the previous two cases can be considered as an application of the derivative of a vector with respect to a vector, using a vector of size one appropriately. Similarly we will find that the derivatives involving matrices will reduce to derivatives involving vectors in a corresponding way.

The derivative of a vector function (a vector whose components are functions) $\mathbf{y} = \begin{bmatrix} y_1 & y_2 & \cdots & y_m \end{bmatrix}^\mathsf{T}$, with respect to an input vector, $\mathbf{x} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}^\mathsf{T}$, is written (in numerator layout notation) as

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} =
  \begin{bmatrix}
    \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n}\\
    \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n}\\
    \vdots & \vdots & \ddots & \vdots\\
    \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n}\\
  \end{bmatrix}.$$

In vector calculus, the derivative of a vector function y with respect to a vector x whose components represent a space is known as the pushforward (or differential), or the Jacobian matrix. The pushforward along a vector function f with respect to vector v in R^''n'' is given by $d\,\mathbf{f}(\mathbf{v}) = \frac{\partial \mathbf{f}}{\partial \mathbf{v}} d\,\mathbf{v}.$
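The numerator-layout Jacobian can be assembled column by column from finite differences. The sketch below uses the polar-to-Cartesian map as a hypothetical example; the helper name `jacobian` is likewise an assumption for illustration.

```python
import math


def polar_to_cartesian(x):
    # Hypothetical vector function y(r, theta) = (r cos(theta), r sin(theta)).
    r, theta = x
    return [r * math.cos(theta), r * math.sin(theta)]


def jacobian(f, x, h=1e-6):
    """Numerator-layout Jacobian: entry [i][j] estimates dy_i/dx_j (m rows, n columns)."""
    m = len(f(x))
    J = [[0.0] * len(x) for _ in range(m)]
    for j in range(len(x)):
        ahead, behind = list(x), list(x)
        ahead[j] += h
        behind[j] -= h
        y_ahead, y_behind = f(ahead), f(behind)
        for i in range(m):
            J[i][j] = (y_ahead[i] - y_behind[i]) / (2 * h)
    return J


# The analytic Jacobian is [[cos t, -r sin t], [sin t, r cos t]],
# which is [[1, 0], [0, 2]] at r = 2, theta = 0.
J = jacobian(polar_to_cartesian, [2.0, 0.0])
```

Each row of `J` follows one component of y and each column follows one component of x, as in the matrix displayed above.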

Derivatives with matrices

There are two types of derivatives with matrices that can be organized into a matrix of the same size. These are the derivative of a matrix by a scalar and the derivative of a scalar by a matrix. These can be useful in minimization problems found in many areas of applied mathematics and have adopted the names tangent matrix and gradient matrix respectively after their analogs for vectors. Note: The discussion in this section assumes the numerator layout convention for pedagogical purposes. Some authors use different conventions. The section on layout conventions discusses this issue in greater detail. The identities given further down are presented in forms that can be used in conjunction with all common layout conventions.

Matrix-by-scalar

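The derivative of a matrix function Y by a scalar ''x'' is known as the tangent matrix and is given (in numerator layout notation) by

$$\frac{\partial \mathbf{Y}}{\partial x} =
\begin{bmatrix}
\frac{\partial y_{11}}{\partial x} & \frac{\partial y_{12}}{\partial x} & \cdots & \frac{\partial y_{1n}}{\partial x}\\
\frac{\partial y_{21}}{\partial x} & \frac{\partial y_{22}}{\partial x} & \cdots & \frac{\partial y_{2n}}{\partial x}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial y_{m1}}{\partial x} & \frac{\partial y_{m2}}{\partial x} & \cdots & \frac{\partial y_{mn}}{\partial x}\\
\end{bmatrix}.$$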
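A small worked instance may help: the derivative is taken entry by entry, and the result has the same shape as Y. The 2×2 matrix function below is a hypothetical example chosen for illustration.

```latex
\mathbf{Y}(x) =
\begin{bmatrix}
x^2 & \sin x \\
1   & e^{x}
\end{bmatrix}
\quad\Longrightarrow\quad
\frac{\partial \mathbf{Y}}{\partial x} =
\begin{bmatrix}
2x & \cos x \\
0  & e^{x}
\end{bmatrix}
```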

Scalar-by-matrix

The derivative of a scalar function ''y'' of a ''p''×''q'' matrix X of independent variables, with respect to the matrix X, is given (in numerator layout notation) by

$$\frac{\partial y}{\partial \mathbf{X}} =
\begin{bmatrix}
\frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{21}} & \cdots & \frac{\partial y}{\partial x_{p1}}\\
\frac{\partial y}{\partial x_{12}} & \frac{\partial y}{\partial x_{22}} & \cdots & \frac{\partial y}{\partial x_{p2}}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial y}{\partial x_{1q}} & \frac{\partial y}{\partial x_{2q}} & \cdots & \frac{\partial y}{\partial x_{pq}}\\
\end{bmatrix}.$$

Important examples of scalar functions of matrices include the trace of a matrix and the determinant.

In analog with vector calculus this derivative is often written as the following.

$$\nabla_\mathbf{X} y(\mathbf{X}) = \frac{\partial y(\mathbf{X})}{\partial \mathbf{X}}$$

Also in analog with vector calculus, the directional derivative of a scalar ''f''(X) of a matrix X in the direction of matrix Y is given by

$$\nabla_\mathbf{Y} f = \operatorname{tr} \left(\frac{\partial f}{\partial \mathbf{X}} \mathbf{Y}\right).$$

It is the gradient matrix, in particular, that finds many uses in minimization problems in estimation theory, particularly in the derivation of the Kalman filter algorithm, which is of great importance in the field.
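The layout of the scalar-by-matrix derivative and the trace form of the directional derivative can both be checked on the trace example. The sketch below assumes the hypothetical function $f(\mathbf{X}) = \operatorname{tr}(\mathbf{A}\mathbf{X})$ with a constant matrix A, for which the numerator-layout derivative is A itself; all helper names and data are illustrative.

```python
def matmul(A, B):
    """Product of two list-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def trace(M):
    return sum(M[i][i] for i in range(len(M)))


def scalar_by_matrix(f, X, h=1e-6):
    """Numerator layout: a q x p matrix whose [j][i] entry estimates df/dx_ij."""
    p, q = len(X), len(X[0])
    D = [[0.0] * p for _ in range(q)]
    for i in range(p):
        for j in range(q):
            ahead = [row[:] for row in X]
            behind = [row[:] for row in X]
            ahead[i][j] += h
            behind[i][j] -= h
            D[j][i] = (f(ahead) - f(behind)) / (2 * h)
    return D


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # constant, 2 x 3
X = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]  # variable, p = 3 rows and q = 2 columns
Y = [[0.5, 1.0], [0.0, 2.0], [1.0, 1.0]]  # direction, same shape as X


def trace_AX(M):
    # Hypothetical scalar function f(X) = tr(A X).
    return trace(matmul(A, M))


D = scalar_by_matrix(trace_AX, X)    # expected to reproduce A (2 x 3)
directional = trace(matmul(D, Y))    # tr((df/dX) Y), which equals tr(A Y) here
```

Because the derivative is laid out as a ''q''×''p'' matrix, the product with the ''p''×''q'' direction Y is square, which is what makes the trace in the directional derivative well defined.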

Other matrix derivatives

The three types of derivatives that have not been considered are those involving vectors-by-matrices, matrices-by-vectors, and matrices-by-matrices. These are not as widely considered and a notation is not widely agreed upon.

Layout conventions

This section discusses the similarities and differences between notational conventions that are used in the various fields that take advantage of matrix calculus. Although there are largely two consistent conventions, some authors find it convenient to mix the two conventions in forms that are discussed below. After this section, equations will be listed in both competing forms separately.

The fundamental issue is that the derivative of a vector with respect to a vector, i.e. $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$, is often written in two competing ways. If the numerator y is of size ''m'' and the denominator x of size ''n'', then the result can be laid out as either an ''m×n'' matrix or ''n×m'' matrix, i.e. the elements of y laid out in columns and the elements of x laid out in rows, or vice versa. This leads to the following possibilities:
# ''Numerator layout'', i.e. lay out according to y and x^T (i.e. contrarily to x). This is sometimes known as the ''Jacobian formulation''. This corresponds to the ''m×n'' layout in the previous example.
# ''Denominator layout'', i.e. lay out according to y^T and x (i.e. contrarily to y). This is sometimes known as the ''Hessian formulation''. Some authors term this layout the ''gradient'', in distinction to the ''Jacobian'' (numerator layout), which is its transpose. (However, ''gradient'' more commonly means the derivative $\frac{\partial y}{\partial \mathbf{x}}$, regardless of layout.) This corresponds to the ''n×m'' layout in the previous example.
# A third possibility sometimes seen is to insist on writing the derivative as $\frac{\partial \mathbf{y}}{\partial \mathbf{x}^\mathsf{T}}$ (i.e. the derivative is taken with respect to the transpose of x) and follow the numerator layout. This makes it possible to claim that the matrix is laid out according to both numerator and denominator. In practice this produces results the same as the numerator layout.

When handling the gradient $\frac{\partial y}{\partial \mathbf{x}}$ and the opposite case $\frac{\partial \mathbf{y}}{\partial x}$, we have the same issues. To be consistent, we should do one of the following:
# If we choose numerator layout for $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$, we should lay out the gradient $\frac{\partial y}{\partial \mathbf{x}}$ as a row vector, and $\frac{\partial \mathbf{y}}{\partial x}$ as a column vector.
# If we choose denominator layout for $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$, we should lay out the gradient $\frac{\partial y}{\partial \mathbf{x}}$ as a column vector, and $\frac{\partial \mathbf{y}}{\partial x}$ as a row vector.
# In the third possibility above, we write $\frac{\partial y}{\partial \mathbf{x}^\mathsf{T}}$ and $\frac{\partial \mathbf{y}}{\partial x}$, and use numerator layout.

Not all math textbooks and papers are consistent in this respect throughout. That is, sometimes different conventions are used in different contexts within the same book or paper. For example, some choose denominator layout for gradients (laying them out as column vectors), but numerator layout for the vector-by-vector derivative $\frac{\partial \mathbf{y}}{\partial \mathbf{x}}$.

Similarly, when it comes to scalar-by-matrix derivatives $\frac{\partial y}{\partial \mathbf{X}}$ and matrix-by-scalar derivatives $\frac{\partial \mathbf{Y}}{\partial x}$, consistent numerator layout lays out according to Y and X^T, while consistent denominator layout lays out according to Y^T and X. In practice, however, following a denominator layout for $\frac{\partial \mathbf{Y}}{\partial x}$, and laying the result out according to Y^T, is rarely seen because it makes for ugly formulas that do not correspond to the scalar formulas. As a result, the following layouts can often be found:
# ''Consistent numerator layout'', which lays out $\frac{\partial \mathbf{Y}}{\partial x}$ according to Y and $\frac{\partial y}{\partial \mathbf{X}}$ according to X^T.
# ''Mixed layout'', which lays out $\frac{\partial \mathbf{Y}}{\partial x}$ according to Y and $\frac{\partial y}{\partial \mathbf{X}}$ according to X.
# Use the notation $\frac{\partial y}{\partial \mathbf{X}^\mathsf{T}}$, with results the same as consistent numerator layout.

In the following formulas, we handle the five possible combinations $\frac{\partial y}{\partial \mathbf{x}}, \frac{\partial \mathbf{y}}{\partial x}, \frac{\partial \mathbf{y}}{\partial \mathbf{x}}, \frac{\partial y}{\partial \mathbf{X}}$ and $\frac{\partial \mathbf{Y}}{\partial x}$ separately. We also handle cases of scalar-by-scalar derivatives that involve an intermediate vector or matrix. (This can arise, for example, if a multi-dimensional parametric curve is defined in terms of a scalar variable, and then a derivative of a scalar function of the curve is taken with respect to the scalar that parameterizes the curve.) For each of the various combinations, we give numerator-layout and denominator-layout results, except in the cases above where denominator layout rarely occurs. In cases involving matrices where it makes sense, we give numerator-layout and mixed-layout results. As noted above, cases where vector and matrix denominators are written in transpose notation are equivalent to numerator layout with the denominators written without the transpose.

Keep in mind that various authors use different combinations of numerator and denominator layouts for different types of derivatives, and there is no guarantee that an author will consistently use either numerator or denominator layout for all types. Match up the formulas below with those quoted in the source to determine the layout used for that particular type of derivative, but be careful not to assume that derivatives of other types necessarily follow the same kind of layout.

When taking derivatives with an aggregate (vector or matrix) denominator in order to find a maximum or minimum of the aggregate, it should be kept in mind that using numerator layout will produce results that are transposed with respect to the aggregate. For example, in attempting to find the maximum likelihood estimate of a multivariate normal distribution using matrix calculus, if the domain is a ''k''×1 column vector, then the result using the numerator layout will be in the form of a 1×''k'' row vector. Thus, either the results should be transposed at the end or the denominator layout (or mixed layout) should be used.

The results of operations will be transposed when switching between numerator-layout and denominator-layout notation.
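The relationship between the two layouts for a vector-by-vector derivative can be sketched in a few lines: both hold the same partial derivatives, arranged as transposes of each other. The vector function and the helper names below are hypothetical and chosen only to make the shapes visible.

```python
def partials(x):
    # Hypothetical y(x) = (x1*x2, x1 + x2, x2^2): m = 3 outputs, n = 2 inputs.
    # partials(x)[i][j] holds the analytic value of dy_i/dx_j.
    x1, x2 = x
    return [[x2, x1], [1.0, 1.0], [0.0, 2 * x2]]


def numerator_layout(x):
    # m x n: rows follow the components of y, columns follow those of x.
    return partials(x)


def denominator_layout(x):
    # n x m: rows follow the components of x, columns follow those of y.
    return [list(col) for col in zip(*partials(x))]


x = [2.0, 3.0]
N = numerator_layout(x)    # 3 x 2
D = denominator_layout(x)  # 2 x 3, the transpose of N
shape_N = (len(N), len(N[0]))
shape_D = (len(D), len(D[0]))
```

When a formula from a source produces a matrix of the "wrong" shape, comparing against both arrangements like this is one way to identify which layout the source uses.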

Numerator-layout notation

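Using numerator-layout notation, we have:

$$\begin{align}
  \frac{\partial y}{\partial \mathbf{x}} &= \begin{bmatrix}
    \frac{\partial y}{\partial x_1} &
    \frac{\partial y}{\partial x_2} &
    \cdots &
    \frac{\partial y}{\partial x_n}
  \end{bmatrix}. \\
  \frac{\partial \mathbf{y}}{\partial x} &= \begin{bmatrix}
    \frac{\partial y_1}{\partial x} \\
    \frac{\partial y_2}{\partial x} \\
    \vdots \\
    \frac{\partial y_m}{\partial x} \\
  \end{bmatrix}. \\
  \frac{\partial \mathbf{y}}{\partial \mathbf{x}} &= \begin{bmatrix}
    \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\
    \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\
    \vdots & \vdots & \ddots & \vdots \\
    \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n} \\
  \end{bmatrix}. \\
  \frac{\partial y}{\partial \mathbf{X}} &= \begin{bmatrix}
    \frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{21}} & \cdots & \frac{\partial y}{\partial x_{p1}} \\
    \frac{\partial y}{\partial x_{12}} & \frac{\partial y}{\partial x_{22}} & \cdots & \frac{\partial y}{\partial x_{p2}} \\
    \vdots & \vdots & \ddots & \vdots \\
    \frac{\partial y}{\partial x_{1q}} & \frac{\partial y}{\partial x_{2q}} & \cdots & \frac{\partial y}{\partial x_{pq}} \\
  \end{bmatrix}.
\end{align}$$

The following definitions are only provided in numerator-layout notation:

$$\begin{align}
  \frac{\partial \mathbf{Y}}{\partial x} &= \begin{bmatrix}
    \frac{\partial y_{11}}{\partial x} & \frac{\partial y_{12}}{\partial x} & \cdots & \frac{\partial y_{1n}}{\partial x} \\
    \frac{\partial y_{21}}{\partial x} & \frac{\partial y_{22}}{\partial x} & \cdots & \frac{\partial y_{2n}}{\partial x} \\
    \vdots & \vdots & \ddots & \vdots \\
    \frac{\partial y_{m1}}{\partial x} & \frac{\partial y_{m2}}{\partial x} & \cdots & \frac{\partial y_{mn}}{\partial x} \\
  \end{bmatrix}. \\
  d\mathbf{X} &= \begin{bmatrix}
    dx_{11} & dx_{12} & \cdots & dx_{1n} \\
    dx_{21} & dx_{22} & \cdots & dx_{2n} \\
    \vdots  & \vdots  & \ddots & \vdots \\
    dx_{m1} & dx_{m2} & \cdots & dx_{mn} \\
  \end{bmatrix}.
\end{align}$$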

Denominator-layout notation

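Using denominator-layout notation, we have:

\begin{align}
  \frac{\partial y}{\partial \mathbf{x}} &= \begin{bmatrix}
    \frac{\partial y}{\partial x_1} \\
    \frac{\partial y}{\partial x_2} \\
    \vdots \\
    \frac{\partial y}{\partial x_n} \\
  \end{bmatrix}. \\
  \frac{\partial \mathbf{y}}{\partial x} &= \begin{bmatrix}
    \frac{\partial y_1}{\partial x} &
    \frac{\partial y_2}{\partial x} &
    \cdots &
    \frac{\partial y_m}{\partial x}
  \end{bmatrix}. \\
  \frac{\partial \mathbf{y}}{\partial \mathbf{x}} &= \begin{bmatrix}
    \frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\
    \frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_2} \\
    \vdots & \vdots & \ddots & \vdots \\
    \frac{\partial y_1}{\partial x_n} & \frac{\partial y_2}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n} \\
  \end{bmatrix}. \\
  \frac{\partial y}{\partial \mathbf{X}} &= \begin{bmatrix}
    \frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{12}} & \cdots & \frac{\partial y}{\partial x_{1q}} \\
    \frac{\partial y}{\partial x_{21}} & \frac{\partial y}{\partial x_{22}} & \cdots & \frac{\partial y}{\partial x_{2q}} \\
    \vdots & \vdots & \ddots & \vdots \\
    \frac{\partial y}{\partial x_{p1}} & \frac{\partial y}{\partial x_{p2}} & \cdots & \frac{\partial y}{\partial x_{pq}} \\
  \end{bmatrix}.
\end{align}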
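The difference between the two layouts can be seen numerically. The following is a minimal sketch, assuming NumPy and a central-difference approximation; the helper `jacobian_numerator` and the example map `f` are hypothetical names chosen for illustration and are not part of the notation defined above.

```python
import numpy as np

def jacobian_numerator(f, x, eps=1e-6):
    """Central-difference Jacobian in numerator layout: row i, column j holds dy_i/dx_j."""
    x = np.asarray(x, dtype=float)
    m = f(x).size
    jac = np.zeros((m, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = eps
        jac[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
    return jac

def f(x):
    # Hypothetical map from R^3 to R^2, chosen only for illustration.
    return np.array([x[0] * x[1], np.sin(x[2])])

x0 = np.array([1.0, 2.0, 0.5])
J_num = jacobian_numerator(f, x0)  # m x n: one row per output component
J_den = J_num.T                    # n x m: denominator layout is the transpose

print(J_num.shape, J_den.shape)
```

With `m = 2` outputs and `n = 3` inputs, the numerator-layout result has two rows and three columns, and the denominator-layout result is its transpose.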

Identities

As noted above, in general, the results of operations will be transposed when switching between numerator-layout and denominator-layout notation.

To help make sense of all the identities below, keep in mind the most important rules: the chain rule, product rule and sum rule. The sum rule applies universally, and the product rule applies in most of the cases below, provided that the order of matrix products is maintained, since matrix products are not commutative. The chain rule applies in some of the cases, but unfortunately does ''not'' apply in matrix-by-scalar derivatives or scalar-by-matrix derivatives (in the latter case, mostly involving the trace operator applied to matrices). In the latter case, the product rule can't quite be applied directly, either, but the equivalent can be done with a bit more work using the differential identities.

The following identities adopt the following conventions:
* the scalars ''a'', ''b'', ''c'', ''d'', and ''e'' are constant with respect to, and the scalars ''u'' and ''v'' are functions of, one of the scalar ''x'', the vector '''x''', or the matrix '''X''';
* the vectors '''a''', '''b''', '''c''', '''d''', and '''e''' are constant with respect to, and the vectors '''u''' and '''v''' are functions of, one of ''x'', '''x''', or '''X''';
* the matrices '''A''', '''B''', '''C''', '''D''', and '''E''' are constant with respect to, and the matrices '''U''' and '''V''' are functions of, one of ''x'', '''x''', or '''X'''.

Vector-by-vector identities

This is presented first because all of the operations that apply to vector-by-vector differentiation apply directly to vector-by-scalar or scalar-by-vector differentiation simply by reducing the appropriate vector in the numerator or denominator to a scalar. :

Scalar-by-vector identities

The fundamental identities are placed above the thick black line.

:{| class="wikitable" style="text-align: center;"
|+ Identities: scalar-by-vector <math>\frac{\partial y}{\partial \mathbf{x}} = \nabla_\mathbf{x} y</math>
! scope="col" width="150" | Condition
! scope="col" width="200" | Expression
! scope="col" width="200" | Numerator layout,<br />i.e. by '''x'''<sup>T</sup>; result is row vector
! scope="col" width="200" | Denominator layout,<br />i.e. by '''x'''; result is column vector
|-
| ''a'' is not a function of '''x'''
| <math>\frac{\partial a}{\partial \mathbf{x}} =</math>
| <math>\mathbf{0}^\top</math> (here <math>\mathbf{0}</math> refers to a column vector of all 0's, of size ''n'', where ''n'' is the length of '''x''')
| <math>\mathbf{0}</math>
|-
| ''a'' is not a function of '''x''',<br />''u'' = ''u''('''x''')
| <math>\frac{\partial au}{\partial \mathbf{x}} =</math>
| colspan=2 | <math>a\frac{\partial u}{\partial \mathbf{x}}</math>
|-
| ''u'' = ''u''('''x'''), ''v'' = ''v''('''x''')
| <math>\frac{\partial (u+v)}{\partial \mathbf{x}} =</math>
| colspan=2 | <math>\frac{\partial u}{\partial \mathbf{x}} + \frac{\partial v}{\partial \mathbf{x}}</math>
|-
| ''u'' = ''u''('''x'''), ''v'' = ''v''('''x''')
| <math>\frac{\partial uv}{\partial \mathbf{x}} =</math>
| colspan=2 | <math>u\frac{\partial v}{\partial \mathbf{x}} + v\frac{\partial u}{\partial \mathbf{x}}</math>
|-
| ''u'' = ''u''('''x''')
| <math>\frac{\partial g(u)}{\partial \mathbf{x}} =</math>
| colspan=2 | <math>\frac{\partial g(u)}{\partial u} \frac{\partial u}{\partial \mathbf{x}}</math>
|-
| ''u'' = ''u''('''x''')
| <math>\frac{\partial f(g(u))}{\partial \mathbf{x}} =</math>
| colspan=2 | <math>\frac{\partial f(g)}{\partial g} \frac{\partial g(u)}{\partial u} \frac{\partial u}{\partial \mathbf{x}}</math>
|-
| '''u''' = '''u'''('''x'''), '''v''' = '''v'''('''x''')
| <math>\frac{\partial (\mathbf{u} \cdot \mathbf{v})}{\partial \mathbf{x}} = \frac{\partial \mathbf{u}^\top \mathbf{v}}{\partial \mathbf{x}} =</math>
| <math>\mathbf{u}^\top\frac{\partial \mathbf{v}}{\partial \mathbf{x}} + \mathbf{v}^\top\frac{\partial \mathbf{u}}{\partial \mathbf{x}}</math><br /><math>\frac{\partial \mathbf{u}}{\partial \mathbf{x}}, \frac{\partial \mathbf{v}}{\partial \mathbf{x}}</math> in numerator layout
| <math>\frac{\partial \mathbf{u}}{\partial \mathbf{x}}\mathbf{v} + \frac{\partial \mathbf{v}}{\partial \mathbf{x}}\mathbf{u}</math><br /><math>\frac{\partial \mathbf{u}}{\partial \mathbf{x}}, \frac{\partial \mathbf{v}}{\partial \mathbf{x}}</math> in denominator layout
|-
| '''u''' = '''u'''('''x'''), '''v''' = '''v'''('''x'''),<br />'''A''' is not a function of '''x'''
| <math>\frac{\partial (\mathbf{u} \cdot \mathbf{A}\mathbf{v})}{\partial \mathbf{x}} = \frac{\partial \mathbf{u}^\top\mathbf{A}\mathbf{v}}{\partial \mathbf{x}} =</math>
| <math>\mathbf{u}^\top\mathbf{A}\frac{\partial \mathbf{v}}{\partial \mathbf{x}} + \mathbf{v}^\top \mathbf{A}^\top\frac{\partial \mathbf{u}}{\partial \mathbf{x}}</math><br /><math>\frac{\partial \mathbf{u}}{\partial \mathbf{x}}, \frac{\partial \mathbf{v}}{\partial \mathbf{x}}</math> in numerator layout
| <math>\frac{\partial \mathbf{u}}{\partial \mathbf{x}}\mathbf{A}\mathbf{v} + \frac{\partial \mathbf{v}}{\partial \mathbf{x}}\mathbf{A}^\top\mathbf{u}</math><br /><math>\frac{\partial \mathbf{u}}{\partial \mathbf{x}}, \frac{\partial \mathbf{v}}{\partial \mathbf{x}}</math> in denominator layout
|-
|
| <math>\frac{\partial^2 f}{\partial\mathbf{x} \partial\mathbf{x}^\top} =</math>
| <math>\mathbf{H}^\top</math>
| <math>\mathbf{H}</math>, the Hessian matrix
|- style="border-top: 3px solid;"
| '''a''' is not a function of '''x'''
| <math>\frac{\partial (\mathbf{a}\cdot\mathbf{x})}{\partial \mathbf{x}} = \frac{\partial (\mathbf{x}\cdot\mathbf{a})}{\partial \mathbf{x}} = \frac{\partial \mathbf{a}^\top\mathbf{x}}{\partial \mathbf{x}} = \frac{\partial \mathbf{x}^\top\mathbf{a}}{\partial \mathbf{x}} =</math>
| <math>\mathbf{a}^\top</math>
| <math>\mathbf{a}</math>
|-
| '''A''' is not a function of '''x''',<br />'''b''' is not a function of '''x'''
| <math>\frac{\partial \mathbf{b}^\top\mathbf{A}\mathbf{x}}{\partial \mathbf{x}} =</math>
| <math>\mathbf{b}^\top\mathbf{A}</math>
| <math>\mathbf{A}^\top\mathbf{b}</math>
|-
| '''A''' is not a function of '''x'''
| <math>\frac{\partial \mathbf{x}^\top\mathbf{A}\mathbf{x}}{\partial \mathbf{x}} =</math>
| <math>\mathbf{x}^\top\left(\mathbf{A} + \mathbf{A}^\top\right)</math>
| <math>\left(\mathbf{A} + \mathbf{A}^\top\right)\mathbf{x}</math>
|-
| '''A''' is not a function of '''x''',<br />'''A''' is symmetric
| <math>\frac{\partial \mathbf{x}^\top\mathbf{A}\mathbf{x}}{\partial \mathbf{x}} =</math>
| <math>2\mathbf{x}^\top\mathbf{A}</math>
| <math>2\mathbf{A}\mathbf{x}</math>
|-
| '''A''' is not a function of '''x'''
| <math>\frac{\partial^2 \mathbf{x}^\top\mathbf{A}\mathbf{x}}{\partial\mathbf{x} \partial\mathbf{x}^\top} =</math>
| colspan=2 | <math>\mathbf{A} + \mathbf{A}^\top</math>
|-
| '''A''' is not a function of '''x''',<br />'''A''' is symmetric
| <math>\frac{\partial^2 \mathbf{x}^\top\mathbf{A}\mathbf{x}}{\partial\mathbf{x} \partial\mathbf{x}^\top} =</math>
| colspan=2 | <math>2\mathbf{A}</math>
|-
|
| <math>\frac{\partial (\mathbf{x} \cdot \mathbf{x})}{\partial \mathbf{x}} = \frac{\partial \mathbf{x}^\top\mathbf{x}}{\partial \mathbf{x}} = \frac{\partial \left\Vert \mathbf{x} \right\Vert^2}{\partial \mathbf{x}} =</math>
| <math>2\mathbf{x}^\top</math>
| <math>2\mathbf{x}</math>
|-
| '''a''' is not a function of '''x''',<br />'''u''' = '''u'''('''x''')
| <math>\frac{\partial (\mathbf{a} \cdot \mathbf{u})}{\partial \mathbf{x}} = \frac{\partial \mathbf{a}^\top\mathbf{u}}{\partial \mathbf{x}} =</math>
| <math>\mathbf{a}^\top\frac{\partial \mathbf{u}}{\partial \mathbf{x}}</math><br /><math>\frac{\partial \mathbf{u}}{\partial \mathbf{x}}</math> in numerator layout
| <math>\frac{\partial \mathbf{u}}{\partial \mathbf{x}}\mathbf{a}</math><br /><math>\frac{\partial \mathbf{u}}{\partial \mathbf{x}}</math> in denominator layout
|-
| '''a''', '''b''' are not functions of '''x'''
| <math>\frac{\partial \; \mathbf{a}^\top\mathbf{x}\mathbf{x}^\top\mathbf{b}}{\partial \; \mathbf{x}} =</math>
| <math>\mathbf{x}^\top\left(\mathbf{a}\mathbf{b}^\top + \mathbf{b}\mathbf{a}^\top\right)</math>
| <math>\left(\mathbf{a}\mathbf{b}^\top + \mathbf{b}\mathbf{a}^\top\right)\mathbf{x}</math>
|-
| '''A''', '''b''', '''C''', '''D''', '''e''' are not functions of '''x'''
| <math>\frac{\partial \; (\mathbf{A}\mathbf{x} + \mathbf{b})^\top \mathbf{C} (\mathbf{D}\mathbf{x} + \mathbf{e})}{\partial \; \mathbf{x}} =</math>
| <math>(\mathbf{D}\mathbf{x} + \mathbf{e})^\top \mathbf{C}^\top \mathbf{A} + (\mathbf{A}\mathbf{x} + \mathbf{b})^\top \mathbf{C} \mathbf{D}</math>
| <math>\mathbf{D}^\top \mathbf{C}^\top (\mathbf{A}\mathbf{x} + \mathbf{b}) + \mathbf{A}^\top\mathbf{C}(\mathbf{D}\mathbf{x} + \mathbf{e})</math>
|-
| '''a''' is not a function of '''x'''
| <math>\frac{\partial \; \|\mathbf{x} - \mathbf{a}\|}{\partial \; \mathbf{x}} =</math>
| <math>\frac{(\mathbf{x} - \mathbf{a})^\top}{\|\mathbf{x} - \mathbf{a}\|}</math>
| <math>\frac{\mathbf{x} - \mathbf{a}}{\|\mathbf{x} - \mathbf{a}\|}</math>
|}
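Two rows of the scalar-by-vector table can be checked numerically. The sketch below assumes NumPy and a central-difference gradient; `grad_fd` is a hypothetical helper, and the random `A`, `x` and `a` are illustrative test data. Results are stored as 1-D arrays, which correspond to the column vectors of the denominator layout; the numerator-layout results are their transposes.

```python
import numpy as np

def grad_fd(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # deliberately not symmetric
x = rng.standard_normal(4)
a = rng.standard_normal(4)

# d(x^T A x)/dx = (A + A^T) x in denominator layout
quad_fd = grad_fd(lambda v: v @ A @ v, x)
quad_id = (A + A.T) @ x

# d||x - a||/dx = (x - a) / ||x - a|| in denominator layout
norm_fd = grad_fd(lambda v: np.linalg.norm(v - a), x)
norm_id = (x - a) / np.linalg.norm(x - a)

print(np.allclose(quad_fd, quad_id, atol=1e-6), np.allclose(norm_fd, norm_id, atol=1e-6))
```

Because `A` is not symmetric here, the simplified form `2 * A @ x` would not match the finite-difference gradient; it applies only under the symmetry condition stated in the table.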

Vector-by-scalar identities

:{| class="wikitable" style="text-align: center;"
|+ Identities: vector-by-scalar <math>\frac{\partial \mathbf{y}}{\partial x}</math>
! scope="col" width="150" | Condition
! scope="col" width="100" | Expression
! scope="col" width="100" | Numerator layout, i.e. by '''y''',<br />result is column vector
! scope="col" width="100" | Denominator layout, i.e. by '''y'''<sup>T</sup>,<br />result is row vector
|-
| '''a''' is not a function of ''x''
| <math>\frac{\partial \mathbf{a}}{\partial x} =</math>
| colspan=2 | <math>\mathbf{0}</math>
|-
| ''a'' is not a function of ''x'',<br />'''u''' = '''u'''(''x'')
| <math>\frac{\partial a\mathbf{u}}{\partial x} =</math>
| colspan=2 | <math>a\frac{\partial \mathbf{u}}{\partial x}</math>
|-
| '''A''' is not a function of ''x'',<br />'''u''' = '''u'''(''x'')
| <math>\frac{\partial \mathbf{A}\mathbf{u}}{\partial x} =</math>
| <math>\mathbf{A}\frac{\partial \mathbf{u}}{\partial x}</math>
| <math>\frac{\partial \mathbf{u}}{\partial x}\mathbf{A}^\top</math>
|-
| '''u''' = '''u'''(''x'')
| <math>\frac{\partial \mathbf{u}^\top}{\partial x} =</math>
| colspan=2 | <math>\left(\frac{\partial \mathbf{u}}{\partial x}\right)^\top</math>
|-
| '''u''' = '''u'''(''x''), '''v''' = '''v'''(''x'')
| <math>\frac{\partial (\mathbf{u} + \mathbf{v})}{\partial x} =</math>
| colspan=2 | <math>\frac{\partial \mathbf{u}}{\partial x} + \frac{\partial \mathbf{v}}{\partial x}</math>
|-
| '''u''' = '''u'''(''x''), '''v''' = '''v'''(''x'')
| <math>\frac{\partial (\mathbf{u}^\top \times \mathbf{v})}{\partial x} =</math>
| <math>\left(\frac{\partial \mathbf{u}}{\partial x}\right)^\top \times \mathbf{v} + \mathbf{u}^\top \times \frac{\partial \mathbf{v}}{\partial x}</math>
| <math>\frac{\partial \mathbf{u}}{\partial x} \times \mathbf{v} + \mathbf{u}^\top \times \left(\frac{\partial \mathbf{v}}{\partial x}\right)^\top</math>
|-
| rowspan=2 | '''u''' = '''u'''(''x'')
| rowspan=2 | <math>\frac{\partial \mathbf{g}(\mathbf{u})}{\partial x} =</math>
| <math>\frac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}} \frac{\partial \mathbf{u}}{\partial x}</math>
| <math>\frac{\partial \mathbf{u}}{\partial x} \frac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}}</math>
|-
| colspan=2 | Assumes consistent matrix layout; see below.
|-
| rowspan=2 | '''u''' = '''u'''(''x'')
| rowspan=2 | <math>\frac{\partial \mathbf{f}(\mathbf{g}(\mathbf{u}))}{\partial x} =</math>
| <math>\frac{\partial \mathbf{f}(\mathbf{g})}{\partial \mathbf{g}} \frac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}} \frac{\partial \mathbf{u}}{\partial x}</math>
| <math>\frac{\partial \mathbf{u}}{\partial x} \frac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}} \frac{\partial \mathbf{f}(\mathbf{g})}{\partial \mathbf{g}}</math>
|-
| colspan=2 | Assumes consistent matrix layout; see below.
|-
| '''U''' = '''U'''(''x''), '''v''' = '''v'''(''x'')
| <math>\frac{\partial (\mathbf{U} \times \mathbf{v})}{\partial x} =</math>
| <math>\frac{\partial \mathbf{U}}{\partial x} \times \mathbf{v} + \mathbf{U} \times \frac{\partial \mathbf{v}}{\partial x}</math>
| <math>\mathbf{v}^\top \times \left(\frac{\partial \mathbf{U}}{\partial x}\right) + \frac{\partial \mathbf{v}}{\partial x} \times \mathbf{U}^\top</math>
|}

NOTE: The formulas involving the vector-by-vector derivatives <math>\frac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}}</math> and <math>\frac{\partial \mathbf{f}(\mathbf{g})}{\partial \mathbf{g}}</math> (whose outputs are matrices) assume the matrices are laid out consistently with the vector layout, i.e. numerator-layout matrix when numerator-layout vector and vice versa; otherwise, transpose the vector-by-vector derivatives.
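The linear-map row and the chain-rule row of the vector-by-scalar table can be illustrated in numerator layout. This is a sketch under stated assumptions: NumPy is used, the functions `u`, `g` and their hand-written derivatives `du_dx`, `dg_du` are hypothetical examples, and `ddx` is a simple central-difference helper.

```python
import numpy as np

def u(x):
    return np.array([np.sin(x), x ** 2, np.exp(x)])

def du_dx(x):
    # Vector-by-scalar derivative, stored as a 1-D array (a column in numerator layout).
    return np.array([np.cos(x), 2 * x, np.exp(x)])

def g(w):
    return np.array([w[0] * w[1], w[2] ** 2])

def dg_du(w):
    # Numerator-layout Jacobian of g: row i, column j holds dg_i/du_j.
    return np.array([[w[1], w[0], 0.0],
                     [0.0, 0.0, 2 * w[2]]])

def ddx(f, x, eps=1e-6):
    """Central-difference derivative with respect to the scalar x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

A = np.array([[1.0, -2.0, 0.5],
              [0.0, 3.0, 1.0]])
x0 = 0.3

# d(A u)/dx = A du/dx
linear_fd = ddx(lambda x: A @ u(x), x0)
linear_id = A @ du_dx(x0)

# d g(u)/dx = (dg/du)(du/dx), with dg/du in numerator layout
chain_fd = ddx(lambda x: g(u(x)), x0)
chain_id = dg_du(u(x0)) @ du_dx(x0)

print(np.allclose(linear_fd, linear_id, atol=1e-6), np.allclose(chain_fd, chain_id, atol=1e-6))
```

In denominator layout the same chain rule reads right to left: the row vector for the derivative of `u` multiplies the transposed Jacobian, which is why the table lists the factors in reverse order.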

Scalar-by-matrix identities

Note that exact equivalents of the scalar product rule and chain rule do not exist when applied to matrix-valued functions of matrices. However, the product rule of this sort does apply to the differential form (see below), and this is the way to derive many of the identities below involving the trace function, combined with the fact that the trace function allows transposing and cyclic permutation, i.e.:

\begin{align}
     \operatorname{tr}(\mathbf{A}) &= \operatorname{tr}\left(\mathbf{A^\top}\right) \\
  \operatorname{tr}(\mathbf{ABCD}) &= \operatorname{tr}(\mathbf{BCDA}) = \operatorname{tr}(\mathbf{CDAB}) = \operatorname{tr}(\mathbf{DABC})
\end{align}

For example, to compute

\frac{\partial \operatorname{tr}(\mathbf{AXBX^\top C})}{\partial \mathbf{X}}:

\begin{align}
  d\operatorname{tr}(\mathbf{AXBX^\top C})
    &= d\operatorname{tr}\left(\mathbf{CAXBX^\top}\right) = \operatorname{tr}\left(d\left(\mathbf{CAXBX^\top}\right)\right) \\
    &= \operatorname{tr}\left(\mathbf{CAX}\, d\left(\mathbf{BX^\top}\right) + d(\mathbf{CAX})\,\mathbf{BX^\top}\right) \\
    &= \operatorname{tr}\left(\mathbf{CAX}\, d\left(\mathbf{BX^\top}\right)\right) + \operatorname{tr}\left(d(\mathbf{CAX})\,\mathbf{BX^\top}\right) \\
    &= \operatorname{tr}\left(\mathbf{CAXB}\, d\left(\mathbf{X^\top}\right)\right) + \operatorname{tr}\left(\mathbf{CA}(d\mathbf{X})\mathbf{BX^\top}\right) \\
    &= \operatorname{tr}\left(\mathbf{CAXB} (d\mathbf{X})^\top\right) + \operatorname{tr}\left(\mathbf{CA}(d\mathbf{X})\mathbf{BX^\top}\right) \\
    &= \operatorname{tr}\left(\left(\mathbf{CAXB} (d\mathbf{X})^\top\right)^\top\right) + \operatorname{tr}\left(\mathbf{CA}(d\mathbf{X})\mathbf{BX^\top}\right) \\
    &= \operatorname{tr}\left((d\mathbf{X})\mathbf{B^\top X^\top A^\top C^\top}\right) + \operatorname{tr}\left(\mathbf{CA}(d\mathbf{X})\mathbf{BX^\top}\right) \\
    &= \operatorname{tr}\left(\mathbf{B^\top X^\top A^\top C^\top}(d\mathbf{X})\right) + \operatorname{tr}\left(\mathbf{BX^\top}\mathbf{CA}(d\mathbf{X})\right) \\
    &= \operatorname{tr}\left(\left(\mathbf{B^\top X^\top A^\top C^\top} + \mathbf{BX^\top}\mathbf{CA}\right)d\mathbf{X}\right)
\end{align}

Therefore,

\frac{\partial \operatorname{tr}\left(\mathbf{AXBX^\top C}\right)}{\partial \mathbf{X}} = \mathbf{B^\top X^\top A^\top C^\top} + \mathbf{BX^\top}\mathbf{CA} .

(numerator layout)

(For the last step, see the Conversion from differential to derivative form section.)

:{| class="wikitable" style="text-align: center;"
|+ Identities: scalar-by-matrix <math>\frac{\partial y}{\partial \mathbf{X}}</math>
! scope="col" width="175" | Condition
! scope="col" width="10" | Expression
! scope="col" width="100" | Numerator layout, i.e. by '''X'''<sup>T</sup>
! scope="col" width="100" | Denominator layout, i.e. by '''X'''
|-
| ''a'' is not a function of '''X'''
| <math>\frac{\partial a}{\partial \mathbf{X}} =</math>
| <math>\mathbf{0}^\top</math> (here <math>\mathbf{0}</math> refers to a matrix of all 0's, of the same shape as '''X''')
| <math>\mathbf{0}</math>
|-
| ''a'' is not a function of '''X''', ''u'' = ''u''('''X''')
| <math>\frac{\partial au}{\partial \mathbf{X}} =</math>
| colspan=2 | <math>a\frac{\partial u}{\partial \mathbf{X}}</math>
|-
| ''u'' = ''u''('''X'''), ''v'' = ''v''('''X''')
| <math>\frac{\partial (u+v)}{\partial \mathbf{X}} =</math>
| colspan=2 | <math>\frac{\partial u}{\partial \mathbf{X}} + \frac{\partial v}{\partial \mathbf{X}}</math>
|-
| ''u'' = ''u''('''X'''), ''v'' = ''v''('''X''')
| <math>\frac{\partial uv}{\partial \mathbf{X}} =</math>
| colspan=2 | <math>u\frac{\partial v}{\partial \mathbf{X}} + v\frac{\partial u}{\partial \mathbf{X}}</math>
|-
| ''u'' = ''u''('''X''')
| <math>\frac{\partial g(u)}{\partial \mathbf{X}} =</math>
| colspan=2 | <math>\frac{\partial g(u)}{\partial u} \frac{\partial u}{\partial \mathbf{X}}</math>
|-
| ''u'' = ''u''('''X''')
| <math>\frac{\partial f(g(u))}{\partial \mathbf{X}} =</math>
| colspan=2 | <math>\frac{\partial f(g)}{\partial g} \frac{\partial g(u)}{\partial u} \frac{\partial u}{\partial \mathbf{X}}</math>
|-
| rowspan=2 | '''U''' = '''U'''('''X''')
| rowspan=2 | <math>\frac{\partial g(\mathbf{U})}{\partial X_{ij}} =</math>
| <math>\operatorname{tr}\left( \frac{\partial g(\mathbf{U})}{\partial \mathbf{U}} \frac{\partial \mathbf{U}}{\partial X_{ij}}\right)</math>
| <math>\operatorname{tr}\left( \left(\frac{\partial g(\mathbf{U})}{\partial \mathbf{U}}\right)^\top \frac{\partial \mathbf{U}}{\partial X_{ij}}\right)</math>
|-
| colspan=2 | Both forms assume ''numerator'' layout for <math>\frac{\partial \mathbf{U}}{\partial X_{ij}}</math>, i.e. mixed layout if denominator layout for '''X''' is being used.
|- style="border-top: 3px solid;"
| '''a''' and '''b''' are not functions of '''X'''
| <math>\frac{\partial \mathbf{a}^\top\mathbf{X}\mathbf{b}}{\partial \mathbf{X}} =</math>
| <math>\mathbf{b}\mathbf{a}^\top</math>
| <math>\mathbf{a}\mathbf{b}^\top</math>
|-
| '''a''' and '''b''' are not functions of '''X'''
| <math>\frac{\partial \mathbf{a}^\top \mathbf{X}^\top \mathbf{b}}{\partial \mathbf{X}} =</math>
| <math>\mathbf{a}\mathbf{b}^\top</math>
| <math>\mathbf{b}\mathbf{a}^\top</math>
|-
| '''a''', '''b''' and '''C''' are not functions of '''X'''
| <math>\frac{\partial (\mathbf{X}\mathbf{a} + \mathbf{b})^\top \mathbf{C}(\mathbf{X} \mathbf{a} + \mathbf{b})}{\partial \mathbf{X}} =</math>
| <math>\left(\left(\mathbf{C} + \mathbf{C}^\top\right)(\mathbf{X} \mathbf{a} + \mathbf{b})\mathbf{a}^\top\right)^\top</math>
| <math>\left(\mathbf{C} + \mathbf{C}^\top\right)(\mathbf{X} \mathbf{a} + \mathbf{b})\mathbf{a}^\top</math>
|-
| '''a''', '''b''' and '''C''' are not functions of '''X'''
| <math>\frac{\partial (\mathbf{X}\mathbf{a})^\top \mathbf{C}(\mathbf{X}\mathbf{b})}{\partial \mathbf{X}} =</math>
| <math>\left(\mathbf{C}\mathbf{X}\mathbf{b}\mathbf{a}^\top + \mathbf{C}^\top\mathbf{X}\mathbf{a}\mathbf{b}^\top\right)^\top</math>
| <math>\mathbf{C}\mathbf{X}\mathbf{b}\mathbf{a}^\top + \mathbf{C}^\top\mathbf{X}\mathbf{a}\mathbf{b}^\top</math>
|- style="border-top: 3px solid;"
|
| <math>\frac{\partial \operatorname{tr}(\mathbf{X})}{\partial \mathbf{X}} =</math>
| colspan=2 | <math>\mathbf{I}</math>
|-
| '''U''' = '''U'''('''X'''), '''V''' = '''V'''('''X''')
| <math>\frac{\partial \operatorname{tr}(\mathbf{U} + \mathbf{V})}{\partial \mathbf{X}} =</math>
| colspan=2 | <math>\frac{\partial \operatorname{tr}(\mathbf{U})}{\partial \mathbf{X}} + \frac{\partial \operatorname{tr}(\mathbf{V})}{\partial \mathbf{X}}</math>
|-
| ''a'' is not a function of '''X''',<br />'''U''' = '''U'''('''X''')
| <math>\frac{\partial \operatorname{tr}(a\mathbf{U})}{\partial \mathbf{X}} =</math>
| colspan=2 | <math>a\frac{\partial \operatorname{tr}(\mathbf{U})}{\partial \mathbf{X}}</math>
|-
| '''g'''('''X''') is any polynomial with scalar coefficients, or any matrix function defined by an infinite polynomial series (e.g. e<sup>'''X'''</sup>, sin('''X'''), cos('''X'''), ln('''X'''), etc. using a Taylor series); ''g''(''x'') is the equivalent scalar function, ''g''′(''x'') is its derivative, and '''g'''′('''X''') is the corresponding matrix function
| <math>\frac{\partial \operatorname{tr}(\mathbf{g}(\mathbf{X}))}{\partial \mathbf{X}} =</math>
| <math>\mathbf{g}'(\mathbf{X})</math>
| <math>\left(\mathbf{g}'(\mathbf{X})\right)^\top</math>
|-
| '''A''' is not a function of '''X'''
| <math>\frac{\partial \operatorname{tr}(\mathbf{AX})}{\partial \mathbf{X}} = \frac{\partial \operatorname{tr}(\mathbf{XA})}{\partial \mathbf{X}} =</math>
| <math>\mathbf{A}</math>
| <math>\mathbf{A}^\top</math>
|-
| '''A''' is not a function of '''X'''
| <math>\frac{\partial \operatorname{tr}\left(\mathbf{AX^\top}\right)}{\partial \mathbf{X}} = \frac{\partial \operatorname{tr}\left(\mathbf{X^\top A}\right)}{\partial \mathbf{X}} =</math>
| <math>\mathbf{A}^\top</math>
| <math>\mathbf{A}</math>
|-
| '''A''' is not a function of '''X'''
| <math>\frac{\partial \operatorname{tr}\left(\mathbf{X^\top AX}\right)}{\partial \mathbf{X}} =</math>
| <math>\mathbf{X}^\top\left(\mathbf{A} + \mathbf{A}^\top\right)</math>
| <math>\left(\mathbf{A} + \mathbf{A}^\top\right)\mathbf{X}</math>
|-
| '''A''' is not a function of '''X'''
| <math>\frac{\partial \operatorname{tr}(\mathbf{X^{-1}A})}{\partial \mathbf{X}} =</math>
| <math>-\mathbf{X}^{-1}\mathbf{A}\mathbf{X}^{-1}</math>
| <math>-\left(\mathbf{X}^{-1}\right)^\top\mathbf{A}^\top\left(\mathbf{X}^{-1}\right)^\top</math>
|-
| '''A''', '''B''' are not functions of '''X'''
| <math>\frac{\partial \operatorname{tr}(\mathbf{AXB})}{\partial \mathbf{X}} = \frac{\partial \operatorname{tr}(\mathbf{BAX})}{\partial \mathbf{X}} =</math>
| <math>\mathbf{BA}</math>
| <math>\mathbf{A^\top B^\top}</math>
|-
| '''A''', '''B''', '''C''' are not functions of '''X'''
| <math>\frac{\partial \operatorname{tr}\left(\mathbf{AXBX^\top C}\right)}{\partial \mathbf{X}} =</math>
| <math>\mathbf{BX^\top CA} + \mathbf{B^\top X^\top A^\top C^\top}</math>
| <math>\mathbf{A^\top C^\top XB^\top} + \mathbf{CAXB}</math>
|-
| ''n'' is a positive integer
| <math>\frac{\partial \operatorname{tr}\left(\mathbf{X}^n\right)}{\partial \mathbf{X}} =</math>
| <math>n\mathbf{X}^{n-1}</math>
| <math>n\left(\mathbf{X}^{n-1}\right)^\top</math>
|-
| '''A''' is not a function of '''X''',<br />''n'' is a positive integer
| <math>\frac{\partial \operatorname{tr}\left(\mathbf{A}\mathbf{X}^n\right)}{\partial \mathbf{X}} =</math>
| <math>\sum_{i=0}^{n-1} \mathbf{X}^i\mathbf{A}\mathbf{X}^{n-i-1}</math>
| <math>\sum_{i=0}^{n-1} \left(\mathbf{X}^i\mathbf{A}\mathbf{X}^{n-i-1}\right)^\top</math>
|-
|
| <math>\frac{\partial \operatorname{tr}\left(e^\mathbf{X}\right)}{\partial \mathbf{X}} =</math>
| <math>e^\mathbf{X}</math>
| <math>\left(e^\mathbf{X}\right)^\top</math>
|-
|
| <math>\frac{\partial \operatorname{tr}(\sin(\mathbf{X}))}{\partial \mathbf{X}} =</math>
| <math>\cos(\mathbf{X})</math>
| <math>(\cos(\mathbf{X}))^\top</math>
|- style="border-top: 3px solid;"
|
| <math>\frac{\partial |\mathbf{X}|}{\partial \mathbf{X}} =</math>
| <math>\operatorname{cofactor}(\mathbf{X})^\top = |\mathbf{X}|\mathbf{X}^{-1}</math>
| <math>\operatorname{cofactor}(\mathbf{X}) = |\mathbf{X}|\left(\mathbf{X}^{-1}\right)^\top</math>
|-
| ''a'' is not a function of '''X'''
| <math>\frac{\partial \ln |a\mathbf{X}|}{\partial \mathbf{X}} =</math>
| <math>\mathbf{X}^{-1}</math>
| <math>\left(\mathbf{X}^{-1}\right)^\top</math>
|-
| '''A''', '''B''' are not functions of '''X'''
| <math>\frac{\partial |\mathbf{AXB}|}{\partial \mathbf{X}} =</math>
| <math>|\mathbf{AXB}|\mathbf{X}^{-1}</math>
| <math>|\mathbf{AXB}|\left(\mathbf{X}^{-1}\right)^\top</math>
|-
| ''n'' is a positive integer
| <math>\frac{\partial \left|\mathbf{X}^n\right|}{\partial \mathbf{X}} =</math>
| <math>n\left|\mathbf{X}^n\right|\mathbf{X}^{-1}</math>
| <math>n\left|\mathbf{X}^n\right|\left(\mathbf{X}^{-1}\right)^\top</math>
|-
| (see pseudo-inverse)
| <math>\frac{\partial \ln \left|\mathbf{X}^\top\mathbf{X}\right|}{\partial \mathbf{X}} =</math>
| <math>2\mathbf{X}^{+}</math>
| <math>2\left(\mathbf{X}^{+}\right)^\top</math>
|-
| (see pseudo-inverse)
| <math>\frac{\partial \ln \left|\mathbf{X}^\top\mathbf{X}\right|}{\partial \mathbf{X}^{+}} =</math>
| <math>-2\mathbf{X}</math>
| <math>-2\mathbf{X}^\top</math>
|-
| '''A''' is not a function of '''X''',<br />'''X''' is square and invertible
| <math>\frac{\partial \left|\mathbf{X^\top}\mathbf{A}\mathbf{X}\right|}{\partial \mathbf{X}} =</math>
| <math>2\left|\mathbf{X^\top}\mathbf{A}\mathbf{X}\right|\mathbf{X}^{-1} = 2\left|\mathbf{X^\top}\right| |\mathbf{A}| |\mathbf{X}|\mathbf{X}^{-1}</math>
| <math>2\left|\mathbf{X^\top}\mathbf{A}\mathbf{X}\right|\left(\mathbf{X}^{-1}\right)^\top</math>
|-
| '''A''' is not a function of '''X''',<br />'''X''' is non-square,<br />'''A''' is symmetric
| <math>\frac{\partial \left|\mathbf{X^\top}\mathbf{A}\mathbf{X}\right|}{\partial \mathbf{X}} =</math>
| <math>2\left|\mathbf{X^\top}\mathbf{A}\mathbf{X}\right|\left(\mathbf{X^\top A^\top X}\right)^{-1}\mathbf{X^\top A^\top}</math>
| <math>2\left|\mathbf{X^\top}\mathbf{A}\mathbf{X}\right|\mathbf{AX}\left(\mathbf{X^\top AX}\right)^{-1}</math>
|-
| '''A''' is not a function of '''X''',<br />'''X''' is non-square,<br />'''A''' is non-symmetric
| <math>\frac{\partial \left|\mathbf{X^\top}\mathbf{A}\mathbf{X}\right|}{\partial \mathbf{X}} =</math>
| <math>\left|\mathbf{X^\top}\mathbf{A}\mathbf{X}\right| \Big(\left(\mathbf{X^\top AX}\right)^{-1}\mathbf{X^\top A} + \left(\mathbf{X^\top A^\top X}\right)^{-1}\mathbf{X^\top A^\top}\Big)</math>
| <math>\left|\mathbf{X^\top}\mathbf{A}\mathbf{X}\right| \Big(\mathbf{AX}\left(\mathbf{X^\top AX}\right)^{-1} + \mathbf{A^\top X}\left(\mathbf{X^\top A^\top X}\right)^{-1}\Big)</math>
|}
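The trace identity derived before the table, and the log-determinant row, can be checked entry by entry. The following sketch assumes NumPy and central differences; `grad_wrt_matrix` is a hypothetical helper, and the random matrices are illustrative test data. The finite-difference result places the derivative with respect to entry (''i'', ''j'') at position (''i'', ''j''), which is the denominator layout; the numerator-layout results are the transposes.

```python
import numpy as np

def grad_wrt_matrix(f, X, eps=1e-6):
    """Central-difference derivative of scalar f; entry (i, j) is df/dX[i, j]."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(1)
n = 3
A, B, C = (rng.standard_normal((n, n)) for _ in range(3))
# Small perturbation of 3*I, so that X is well conditioned for this test.
X = 3.0 * np.eye(n) + 0.3 * rng.standard_normal((n, n))

# tr(A X B X^T C): denominator layout A^T C^T X B^T + C A X B
trace_fd = grad_wrt_matrix(lambda M: np.trace(A @ M @ B @ M.T @ C), X)
trace_den = A.T @ C.T @ X @ B.T + C @ A @ X @ B
trace_num = B @ X.T @ C @ A + B.T @ X.T @ A.T @ C.T   # numerator layout

# ln|X|: denominator layout (X^-1)^T; slogdet returns (sign, log|det|)
logdet_fd = grad_wrt_matrix(lambda M: np.linalg.slogdet(M)[1], X)
logdet_den = np.linalg.inv(X).T

print(np.allclose(trace_fd, trace_den, atol=1e-6), np.allclose(logdet_fd, logdet_den, atol=1e-6))
```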

Matrix-by-scalar identities

:{| class="wikitable" style="text-align: center;"
|+ Identities: matrix-by-scalar <math>\frac{\partial \mathbf{Y}}{\partial x}</math>
! scope="col" width="175" | Condition
! scope="col" width="100" | Expression
! scope="col" width="100" | Numerator layout, i.e. by '''Y'''
|-
| '''U''' = '''U'''(''x'')
| <math>\frac{\partial a\mathbf{U}}{\partial x} =</math>
| <math>a\frac{\partial \mathbf{U}}{\partial x}</math>
|-
| '''A''', '''B''' are not functions of ''x'',<br />'''U''' = '''U'''(''x'')
| <math>\frac{\partial \mathbf{AUB}}{\partial x} =</math>
| <math>\mathbf{A}\frac{\partial \mathbf{U}}{\partial x}\mathbf{B}</math>
|-
| '''U''' = '''U'''(''x''), '''V''' = '''V'''(''x'')
| <math>\frac{\partial (\mathbf{U}+\mathbf{V})}{\partial x} =</math>
| <math>\frac{\partial \mathbf{U}}{\partial x} + \frac{\partial \mathbf{V}}{\partial x}</math>
|-
| '''U''' = '''U'''(''x''), '''V''' = '''V'''(''x'')
| <math>\frac{\partial (\mathbf{U}\mathbf{V})}{\partial x} =</math>
| <math>\mathbf{U}\frac{\partial \mathbf{V}}{\partial x} + \frac{\partial \mathbf{U}}{\partial x}\mathbf{V}</math>
|-
| '''U''' = '''U'''(''x''), '''V''' = '''V'''(''x'')
| <math>\frac{\partial (\mathbf{U} \otimes \mathbf{V})}{\partial x} =</math>
| <math>\mathbf{U} \otimes \frac{\partial \mathbf{V}}{\partial x} + \frac{\partial \mathbf{U}}{\partial x} \otimes \mathbf{V}</math>
|-
| '''U''' = '''U'''(''x''), '''V''' = '''V'''(''x'')
| <math>\frac{\partial (\mathbf{U} \circ \mathbf{V})}{\partial x} =</math>
| <math>\mathbf{U} \circ \frac{\partial \mathbf{V}}{\partial x} + \frac{\partial \mathbf{U}}{\partial x} \circ \mathbf{V}</math>
|-
| '''U''' = '''U'''(''x'')
| <math>\frac{\partial \mathbf{U}^{-1}}{\partial x} =</math>
| <math>-\mathbf{U}^{-1} \frac{\partial \mathbf{U}}{\partial x}\mathbf{U}^{-1}</math>
|-
| '''U''' = '''U'''(''x'', ''y'')
| <math>\frac{\partial^2 \mathbf{U}^{-1}}{\partial x \partial y} =</math>
| <math>\mathbf{U}^{-1}\left(\frac{\partial \mathbf{U}}{\partial x}\mathbf{U}^{-1}\frac{\partial \mathbf{U}}{\partial y} - \frac{\partial^2 \mathbf{U}}{\partial x \partial y} + \frac{\partial \mathbf{U}}{\partial y}\mathbf{U}^{-1}\frac{\partial \mathbf{U}}{\partial x}\right)\mathbf{U}^{-1}</math>
|-
| '''A''' is not a function of ''x'', '''g'''('''X''') is any polynomial with scalar coefficients, or any matrix function defined by an infinite polynomial series (e.g. e<sup>'''X'''</sup>, sin('''X'''), cos('''X'''), ln('''X'''), etc.); ''g''(''x'') is the equivalent scalar function, ''g''′(''x'') is its derivative, and '''g'''′('''X''') is the corresponding matrix function
| <math>\frac{\partial \, \mathbf{g}(x\mathbf{A})}{\partial x} =</math>
| <math>\mathbf{A}\mathbf{g}'(x\mathbf{A}) = \mathbf{g}'(x\mathbf{A})\mathbf{A}</math>
|-
| '''A''' is not a function of ''x''
| <math>\frac{\partial e^{x\mathbf{A}}}{\partial x} =</math>
| <math>\mathbf{A}e^{x\mathbf{A}} = e^{x\mathbf{A}}\mathbf{A}</math>
|}

Further see Derivative of the exponential map.
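The inverse rule and the product rule for matrix-by-scalar derivatives can be checked on a small example. This is a sketch assuming NumPy; the matrix functions `U`, `V` and their hand-written derivatives `dU`, `dV` are hypothetical and chosen only so that `U` is invertible at the test point.

```python
import numpy as np

def U(x):
    return np.array([[2.0 + x, np.sin(x)],
                     [x ** 2, 3.0 + np.cos(x)]])

def dU(x):
    return np.array([[1.0, np.cos(x)],
                     [2 * x, -np.sin(x)]])

def V(x):
    return np.array([[np.exp(x), 1.0],
                     [x, x ** 3]])

def dV(x):
    return np.array([[np.exp(x), 0.0],
                     [1.0, 3 * x ** 2]])

def ddx(F, x, eps=1e-6):
    """Central-difference derivative of a matrix-valued function of a scalar."""
    return (F(x + eps) - F(x - eps)) / (2 * eps)

x0 = 0.7
U_inv = np.linalg.inv(U(x0))

# d(U^-1)/dx = -U^-1 (dU/dx) U^-1
inv_fd = ddx(lambda x: np.linalg.inv(U(x)), x0)
inv_id = -U_inv @ dU(x0) @ U_inv

# d(UV)/dx = U dV/dx + dU/dx V, with the factor order preserved
prod_fd = ddx(lambda x: U(x) @ V(x), x0)
prod_id = U(x0) @ dV(x0) + dU(x0) @ V(x0)
prod_swapped = dV(x0) @ U(x0) + V(x0) @ dU(x0)   # wrong order, for comparison

print(np.allclose(inv_fd, inv_id, atol=1e-6), np.allclose(prod_fd, prod_id, atol=1e-6))
```

The `prod_swapped` expression does not match the finite-difference result, which illustrates why the order of matrix products must be maintained when applying the product rule.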

Scalar-by-scalar identities

With vectors involved

:{| class="wikitable" style="text-align: center;"
|+ Identities: scalar-by-scalar, with vectors involved
! scope="col" width="150" | Condition
! scope="col" width="10" | Expression
! scope="col" width="150" | Any layout (assumes dot product ignores row vs. column layout)
|-
| '''u''' = '''u'''(''x'')
| <math>\frac{\partial g(\mathbf{u})}{\partial x} =</math>
| <math>\frac{\partial g(\mathbf{u})}{\partial \mathbf{u}} \cdot \frac{\partial \mathbf{u}}{\partial x}</math>
|-
| '''u''' = '''u'''(''x''), '''v''' = '''v'''(''x'')
| <math>\frac{\partial (\mathbf{u} \cdot \mathbf{v})}{\partial x} =</math>
| <math>\mathbf{u} \cdot \frac{\partial \mathbf{v}}{\partial x} + \frac{\partial \mathbf{u}}{\partial x} \cdot \mathbf{v}</math>
|}

With matrices involved

:{| class="wikitable" style="text-align: center;"
|+ Identities: scalar-by-scalar, with matrices involved. This book uses a mixed layout, i.e. by '''Y''' in <math>\frac{\partial \mathbf{Y}}{\partial x}</math>, by '''X''' in <math>\frac{\partial y}{\partial \mathbf{X}}</math>.
! scope="col" width="175" | Condition
! scope="col" width="100" | Expression
! scope="col" width="100" | Consistent numerator layout,<br />i.e. by '''Y''' and '''X'''<sup>T</sup>
! scope="col" width="100" | Mixed layout,<br />i.e. by '''Y''' and '''X'''
|-
| '''U''' = '''U'''(''x'')
| <math>\frac{\partial |\mathbf{U}|}{\partial x} =</math>
| colspan=2 | <math>|\mathbf{U}|\operatorname{tr}\left(\mathbf{U}^{-1}\frac{\partial \mathbf{U}}{\partial x}\right)</math>
|-
| '''U''' = '''U'''(''x'')
| <math>\frac{\partial \ln|\mathbf{U}|}{\partial x} =</math>
| colspan=2 | <math>\operatorname{tr}\left(\mathbf{U}^{-1}\frac{\partial \mathbf{U}}{\partial x}\right)</math>
|-
| '''U''' = '''U'''(''x'')
| <math>\frac{\partial^2 |\mathbf{U}|}{\partial x^2} =</math>
| colspan=2 | <math>|\mathbf{U}|\left[
  \operatorname{tr}\left(\mathbf{U}^{-1}\frac{\partial^2 \mathbf{U}}{\partial x^2}\right) +
  \operatorname{tr}^2\left(\mathbf{U}^{-1}\frac{\partial \mathbf{U}}{\partial x}\right) -
  \operatorname{tr}\left(\left(\mathbf{U}^{-1}\frac{\partial \mathbf{U}}{\partial x}\right)^2\right)
\right]</math>
|-
| '''U''' = '''U'''(''x'')
| <math>\frac{\partial g(\mathbf{U})}{\partial x} =</math>
| <math>\operatorname{tr}\left( \frac{\partial g(\mathbf{U})}{\partial \mathbf{U}} \frac{\partial \mathbf{U}}{\partial x}\right)</math>
| <math>\operatorname{tr}\left( \left(\frac{\partial g(\mathbf{U})}{\partial \mathbf{U}}\right)^\top \frac{\partial \mathbf{U}}{\partial x}\right)</math>
|-
| '''A''' is not a function of ''x'', '''g'''('''X''') is any polynomial with scalar coefficients, or any matrix function defined by an infinite polynomial series (e.g. e<sup>'''X'''</sup>, sin('''X'''), cos('''X'''), ln('''X'''), etc.); ''g''(''x'') is the equivalent scalar function, ''g''′(''x'') is its derivative, and '''g'''′('''X''') is the corresponding matrix function
| <math>\frac{\partial \operatorname{tr}(\mathbf{g}(x\mathbf{A}))}{\partial x} =</math>
| colspan=2 | <math>\operatorname{tr}\left(\mathbf{A}\mathbf{g}'(x\mathbf{A})\right)</math>
|-
| '''A''' is not a function of ''x''
| <math>\frac{\partial \operatorname{tr}\left(e^{x\mathbf{A}}\right)}{\partial x} =</math>
| colspan=2 | <math>\operatorname{tr}\left(\mathbf{A}e^{x\mathbf{A}}\right)</math>
|}
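The determinant and log-determinant rows of the table above can be verified with a finite-difference check. This sketch assumes NumPy; the matrix function `U` and its derivative `dU` are hypothetical examples, chosen so that the determinant is positive near the test point and the logarithm is defined.

```python
import numpy as np

def U(x):
    return np.array([[2.0 + x, np.sin(x)],
                     [x ** 2, 3.0 + np.cos(x)]])

def dU(x):
    return np.array([[1.0, np.cos(x)],
                     [2 * x, -np.sin(x)]])

x0, eps = 0.7, 1e-6
det = np.linalg.det

# tr(U^-1 dU/dx), the factor shared by both identities
trace_term = np.trace(np.linalg.inv(U(x0)) @ dU(x0))

# d|U|/dx = |U| tr(U^-1 dU/dx)
det_fd = (det(U(x0 + eps)) - det(U(x0 - eps))) / (2 * eps)
det_id = det(U(x0)) * trace_term

# d ln|U|/dx = tr(U^-1 dU/dx)
logdet_fd = (np.log(det(U(x0 + eps))) - np.log(det(U(x0 - eps)))) / (2 * eps)
logdet_id = trace_term

print(np.isclose(det_fd, det_id, atol=1e-6), np.isclose(logdet_fd, logdet_id, atol=1e-6))
```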

Identities in differential form

It is often easier to work in differential form and then convert back to normal derivatives. This only works well using the numerator layout. In these rules, ''a'' is a scalar.

:{| class="wikitable" style="text-align: center;"
|+ Differential identities: scalar involving matrix
! Condition !! Expression !! Result (numerator layout)
|-
|
| <math>d(\operatorname{tr}(\mathbf{X})) =</math>
| <math>\operatorname{tr}(d\mathbf{X})</math>
|-
|
| <math>d(|\mathbf{X}|) =</math>
| <math>|\mathbf{X}|\operatorname{tr}\left(\mathbf{X}^{-1}d\mathbf{X}\right) = \operatorname{tr}(\operatorname{adj}(\mathbf{X})d\mathbf{X})</math>
|-
|
| <math>d(\ln|\mathbf{X}|) =</math>
| <math>\operatorname{tr}\left(\mathbf{X}^{-1}d\mathbf{X}\right)</math>
|}

:{| class="wikitable" style="text-align: center;"
|+ Differential identities: matrix
! Condition !! Expression !! Result (numerator layout)
|-
| A is not a function of X || <math>d(\mathbf{A}) =</math> || <math>0</math>
|-
| ''a'' is not a function of X || <math>d(a\mathbf{X}) =</math> || <math>a\,d\mathbf{X}</math>
|-
| || <math>d(\mathbf{X} + \mathbf{Y}) =</math> || <math>d\mathbf{X} + d\mathbf{Y}</math>
|-
| || <math>d(\mathbf{X}\mathbf{Y}) =</math> || <math>(d\mathbf{X})\mathbf{Y} + \mathbf{X}(d\mathbf{Y})</math>
|-
| (Kronecker product) || <math>d(\mathbf{X} \otimes \mathbf{Y}) =</math> || <math>(d\mathbf{X})\otimes\mathbf{Y} + \mathbf{X}\otimes(d\mathbf{Y})</math>
|-
| (Hadamard product) || <math>d(\mathbf{X} \circ \mathbf{Y}) =</math> || <math>(d\mathbf{X})\circ\mathbf{Y} + \mathbf{X}\circ(d\mathbf{Y})</math>
|-
| || <math>d\left(\mathbf{X}^\top\right) =</math> || <math>(d\mathbf{X})^\top</math>
|-
| || <math>d\left(\mathbf{X}^{-1}\right) =</math> || <math>-\mathbf{X}^{-1}\left(d\mathbf{X}\right)\mathbf{X}^{-1}</math>
|-
| (conjugate transpose) || <math>d\left(\mathbf{X}^{\rm H}\right) =</math> || <math>(d\mathbf{X})^{\rm H}</math>
|-
| ''n'' is a positive integer || <math>d\left(\mathbf{X}^n\right) =</math> || <math>\sum_{i=0}^{n-1} \mathbf{X}^i (d\mathbf{X})\mathbf{X}^{n-i-1}</math>
|-
| || <math>d \left(e^{\mathbf{X}}\right) =</math> || <math>\int_0^1 e^{a\mathbf{X}} (d\mathbf{X}) e^{(1-a)\mathbf{X}} \, da</math>
|-
| || <math>d \left(\log{\mathbf{X}}\right) =</math> || <math>\int_0^\infty (\mathbf{X}+z \, \mathbf{I})^{-1} (d\mathbf{X}) (\mathbf{X}+z \, \mathbf{I})^{-1} \, dz</math>
|-
| <math>\mathbf{X} = \sum_i \lambda_i \mathbf{P}_i</math> is diagonalizable, <math>\mathbf{P}_i \mathbf{P}_j = \delta_{ij} \mathbf{P}_i</math>, and ''f'' is differentiable at every eigenvalue <math>\lambda_i</math>
| <math>d \left(f(\mathbf{X})\right) =</math>
| <math>\sum_{ij} \mathbf{P}_i (d\mathbf{X}) \mathbf{P}_j \begin{cases}
  f'(\lambda_i) & \lambda_i = \lambda_j \\
  \frac{f(\lambda_i) - f(\lambda_j)}{\lambda_i - \lambda_j} & \lambda_i \neq \lambda_j
\end{cases}</math>
|}
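Three rows of the table above (inverse, integer power, and the general function of a diagonalizable matrix) can be compared against finite differences. This is a rough sketch rather than a definitive implementation, assuming NumPy; the helper names `directional` and `apply_fn` are hypothetical. For the last row it makes two simplifying assumptions: the matrix is symmetric positive definite, so the eigenvectors are orthonormal and each projector is the outer product of an eigenvector with itself, and the scalar function is the logarithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
# Assumed test data: X is near 5*I, so it is invertible and well conditioned.
X = 5.0 * np.eye(n) + 0.3 * rng.standard_normal((n, n))
dX = rng.standard_normal((n, n))

def directional(F, M, dM, t=1e-5):
    """Central-difference estimate of the differential of F at M along dM."""
    return (F(M + t * dM) - F(M - t * dM)) / (2 * t)

# d(X^{-1}) = -X^{-1} (dX) X^{-1}
Xinv = np.linalg.inv(X)
inv_pred = -Xinv @ dX @ Xinv
inv_fd = directional(np.linalg.inv, X, dX)

# d(X^p) = sum_{i=0}^{p-1} X^i (dX) X^{p-i-1}, here with p = 3
p = 3
mpow = np.linalg.matrix_power
pow_pred = sum(mpow(X, i) @ dX @ mpow(X, p - i - 1) for i in range(p))
pow_fd = directional(lambda M: mpow(M, p), X, dX)

# Last row, restricted (by assumption) to symmetric positive-definite S,
# where P_i = q_i q_i^T for orthonormal eigenvectors q_i, and f = log.
S = (X + X.T) / 2
dS = (dX + dX.T) / 2

def apply_fn(M, f):
    """f(M) = sum_i f(lambda_i) P_i for a symmetric matrix M."""
    lam, Q = np.linalg.eigh(M)
    return (Q * f(lam)) @ Q.T

lam, Q = np.linalg.eigh(S)
li, lj = np.meshgrid(lam, lam, indexing="ij")
same = np.isclose(li, lj)
gap = np.where(same, 1.0, li - lj)   # placeholder 1.0 avoids 0/0 where eigenvalues coincide
# f'(lambda_i) where eigenvalues coincide, divided difference otherwise
G = np.where(same, 1.0 / li, (np.log(li) - np.log(lj)) / gap)
# sum_ij P_i (dS) P_j G_ij equals Q (G * (Q^T dS Q)) Q^T
spec_pred = Q @ (G * (Q.T @ dS @ Q)) @ Q.T
spec_fd = directional(lambda M: apply_fn(M, np.log), S, dS)

for name, a, b in [("inverse", inv_pred, inv_fd),
                   ("power", pow_pred, pow_fd),
                   ("f(X)", spec_pred, spec_fd)]:
    print(name, np.max(np.abs(a - b)))
```

The printed values are the largest entrywise discrepancies between each identity and its finite-difference estimate. The symmetric restriction is only a convenience for the check; the table row itself is stated for any diagonalizable matrix.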

In the last row,

\delta_{ij}

is the

Kronecker delta

and

(\mathbf{P}_k)_{ij} = (\mathbf{Q})_{ik} (\mathbf{Q}^{-1})_{kj}

is the set of mutually orthogonal projection operators that project onto the ''k''-th eigenvector of X. Q is the matrix of

eigenvectors of

\mathbf{X} = \mathbf{Q} \mathbf{\Lambda} \mathbf{Q}^{-1}

, and

(\mathbf{\Lambda})_{ii} = \lambda_i

are the eigenvalues. The matrix function

f(\mathbf{X})

is defined in terms of the scalar function

f(x)

for diagonalizable matrices by

f(\mathbf{X}) = \sum_i f(\lambda_i) \mathbf{P}_i

where

\mathbf{X} = \sum_i \lambda_i \mathbf{P}_i

with

\mathbf{P}_i \mathbf{P}_j = \delta_{ij} \mathbf{P}_i

. To convert a differential to ordinary derivative form, first bring it into one of the following canonical forms, and then use these identities:

:{| class="wikitable" style="text-align: center;"
|+ Conversion from differential to derivative form
! Canonical differential form !! Equivalent derivative form (numerator layout)
|-
| <math>dy = a\,dx</math> || <math>\frac{dy}{dx} = a</math>
|-
| <math>dy = \mathbf{a}^\top d\mathbf{x}</math> || <math>\frac{dy}{d\mathbf{x}} = \mathbf{a}^\top</math>
|-
| <math>dy = \operatorname{tr}(\mathbf{A}\,d\mathbf{X})</math> || <math>\frac{dy}{d\mathbf{X}} = \mathbf{A}</math>
|-
| <math>d\mathbf{y} = \mathbf{a}\,dx</math> || <math>\frac{d\mathbf{y}}{dx} = \mathbf{a}</math>
|-
| <math>d\mathbf{y} = \mathbf{A}\,d\mathbf{x}</math> || <math>\frac{d\mathbf{y}}{d\mathbf{x}} = \mathbf{A}</math>
|-
| <math>d\mathbf{Y} = \mathbf{A}\,dx</math> || <math>\frac{d\mathbf{Y}}{dx} = \mathbf{A}</math>
|}
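As an illustration of the trace canonical form, take y = ln|X|. Its differential is dy = tr(X⁻¹ dX), so reading off A = X⁻¹ gives dy/dX = X⁻¹ in numerator layout. The sketch below, which assumes NumPy and uses a hypothetical helper `partials`, estimates every partial derivative by finite differences and shows that the numerator-layout derivative is the transpose of the array of partials indexed like X.

```python
import numpy as np

def partials(f, X, t=1e-6):
    """G[i, j] is a central-difference estimate of d f / d X[i, j]."""
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = t
            G[i, j] = (f(X + E) - f(X - E)) / (2 * t)
    return G

rng = np.random.default_rng(1)
# Assumed test matrix near 5*I, so that det(X) > 0.
X = 5.0 * np.eye(3) + 0.3 * rng.standard_normal((3, 3))

y = lambda M: np.log(np.linalg.det(M))

A = np.linalg.inv(X)   # from dy = tr(A dX) with A = X^{-1}
G = partials(y, X)     # array of partials, indexed like X

# Numerator layout places d y / d X[j, i] at position (i, j), i.e. G.T.
print(np.max(np.abs(G.T - A)))
```

In denominator layout the same derivative would be written as the transpose, which is why the layout convention has to be fixed before reading a derivative off a differential.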

Applications

Matrix differential calculus is used in statistics, particularly for the statistical analysis of multivariate distributions, especially the multivariate normal distribution and other elliptical distributions. It is used in regression analysis to compute, for example, the ordinary least squares regression formula for the case of multiple explanatory variables.
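The ordinary least squares case can be sketched with the differential approach. The symbols here are assumptions introduced for illustration: `X` is the design matrix, `y` the response vector, and `beta` the coefficient vector. The sum of squared residuals is S(β) = (y − Xβ)ᵀ(y − Xβ), its differential is dS = −2(y − Xβ)ᵀX dβ, and setting the resulting derivative to zero gives the normal equations XᵀXβ = Xᵀy. The code below, assuming NumPy and synthetic data, solves them and compares the result with NumPy's own least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 50, 3
# Synthetic data (assumption): random design matrix, known coefficients, small noise.
X = rng.standard_normal((m, k))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(m)

# Normal equations X^T X beta = X^T y, obtained from dS/d(beta) = 0.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Derivative of S at the solution, -2 (y - X beta)^T X; it should vanish.
grad = -2 * (y - X @ beta_hat) @ X

# Cross-check against NumPy's least-squares solver.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(beta_hat)
print(np.max(np.abs(grad)))
```

Solving the normal equations directly is used here only because it mirrors the derivation; for ill-conditioned design matrices a QR- or SVD-based solver such as `np.linalg.lstsq` is generally the safer choice.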


External links

Software

* MatrixCalculus.org, a website for evaluating matrix calculus expressions symbolically
* NCAlgebra, an open-source Mathematica package that has some matrix calculus functionality
* SymPy supports symbolic matrix derivatives in its matrix expression module, as well as symbolic tensor derivatives

Information

* Mike Brookes, Imperial College London.
* Matrix Differentiation (and some other stuff), Randal J. Barnes, Department of Civil Engineering, University of Minnesota.
* Notes on Matrix Calculus, Paul L. Fackler, North Carolina State University.
* Matrix Differential Calculus (slide presentation), Zhang Le, University of Edinburgh.
* Introduction to Vector and Matrix Differentiation (notes on matrix differentiation, in the context of econometrics), Heino Bohn Nielsen.
* A note on differentiating matrices (notes on matrix differentiation), Pawel Koval, from Munich Personal RePEc Archive.
* Vector/Matrix Calculus. More notes on matrix differentiation.
* Matrix Identities (notes on matrix differentiation), Sam Roweis.
