In statistics, econometrics, epidemiology, genetics and related disciplines, causal graphs (also known as path diagrams, causal Bayesian networks or DAGs) are probabilistic graphical models used to encode assumptions about the data-generating process. Causal graphs can be used for communication and for inference. As communication devices, the graphs provide a formal and transparent representation of the causal assumptions that researchers may wish to convey and defend. As inference tools, the graphs enable researchers to estimate effect sizes from non-experimental data, derive testable implications of the assumptions encoded, test for external validity, and manage missing data and selection bias.

Causal graphs were first used by the geneticist Sewall Wright under the rubric "path diagrams". They were later adopted by social scientists and, to a lesser extent, by economists. These models were initially confined to linear equations with fixed parameters. Modern developments have extended graphical models to non-parametric analysis, achieving a generality and flexibility that has transformed causal analysis in computer science, epidemiology, and social science.


Construction and terminology

The causal graph can be drawn in the following way. Each variable in the model has a corresponding vertex or node and an arrow is drawn from a variable ''X'' to a variable ''Y'' whenever ''Y'' is judged to respond to changes in ''X'' when all other variables are being held constant. Variables connected to ''Y'' through direct arrows are called ''parents'' of ''Y'', or "direct causes of ''Y''," and are denoted by ''Pa(Y)''. Causal models often include "error terms" or "omitted factors" which represent all unmeasured factors that influence a variable ''Y'' when ''Pa(Y)'' are held constant. In most cases, error terms are excluded from the graph. However, if the graph author suspects that the error terms of any two variables are dependent (e.g. the two variables have an unobserved or latent common cause) then a bidirected arc is drawn between them. Thus, the presence of latent variables is taken into account through the correlations they induce between the error terms, as represented by bidirected arcs.
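The parent relation described above can be sketched in a few lines of Python; the variable names and edges here are hypothetical, chosen only to illustrate the terminology:

```python
# Minimal sketch of a causal graph: directed edges as (cause, effect) pairs.
# An edge (X, Y) means "Y responds to changes in X, all else held constant".
edges = [("X", "Z"), ("Z", "Y"), ("X", "Y")]

# Bidirected arcs (dependent error terms, e.g. a latent common cause)
# are kept separately from the directed edges.
bidirected = [("Z", "Y")]

def parents(node, edges):
    """Return Pa(node): the direct causes of node, sorted for readability."""
    return sorted(src for src, dst in edges if dst == node)

print(parents("Y", edges))  # ['X', 'Z']
print(parents("Z", edges))  # ['X']
```

A dictionary mapping each node to its parent list would serve equally well; the point is only that Pa(Y) is read off the graph as the set of arrow tails pointing into Y.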


Fundamental tools

A fundamental tool in graphical analysis is d-separation, which allows researchers to determine, by inspection, whether the causal structure implies that two sets of variables are independent given a third set. In recursive models without correlated error terms (sometimes called ''Markovian''), these conditional independences represent all of the model's testable implications.
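One such testable implication can be illustrated by simulation. In the hypothetical chain X → Z → Y (a Markovian model, with all names and coefficients chosen for illustration), X and Y are d-separated by Z, so their partial correlation given Z should vanish in linear-Gaussian data. A minimal sketch:

```python
import numpy as np

# Simulate the chain X -> Z -> Y with standard-normal errors.
rng = np.random.default_rng(0)
n = 100_000
X = rng.normal(size=n)
Z = 0.8 * X + rng.normal(size=n)
Y = 0.8 * Z + rng.normal(size=n)

def partial_corr(a, b, c):
    """Correlation of a and b after regressing out c (least-squares residuals)."""
    M = np.column_stack([np.ones_like(c), c])
    ra = a - M @ np.linalg.lstsq(M, a, rcond=None)[0]
    rb = b - M @ np.linalg.lstsq(M, b, rcond=None)[0]
    return np.corrcoef(ra, rb)[0, 1]

print(np.corrcoef(X, Y)[0, 1])   # clearly nonzero: X and Y are marginally dependent
print(partial_corr(X, Y, Z))     # near zero: X and Y are independent given Z
```

The vanishing partial correlation is exactly the independence that d-separation reads off the graph; in an observed data set, a clearly nonzero value would refute this model.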


Example

Suppose we wish to estimate the effect of attending an elite college on future earnings. Simply regressing earnings on college rating will not give an unbiased estimate of the target effect because elite colleges are highly selective, and students attending them are likely to have qualifications for high-earning jobs prior to attending the school. Assuming that the causal relationships are linear, this background knowledge can be expressed in the following structural equation model (SEM) specification.

Model 1:
\begin{align}
Q_1 &= U_1\\
C &= a \cdot Q_1 + U_2\\
Q_2 &= c \cdot C + d \cdot Q_1 + U_3\\
S &= b \cdot C + e \cdot Q_2 + U_4,
\end{align}

where Q_1 represents the individual's qualifications prior to college, Q_2 represents qualifications after college, C contains attributes representing the quality of the college attended, and S the individual's salary.

Figure 1 is a causal graph that represents this model specification. Each variable in the model has a corresponding node or vertex in the graph. Additionally, for each equation, arrows are drawn from the independent variables to the dependent variables. These arrows reflect the direction of causation. In some cases, we may label the arrow with its corresponding structural coefficient, as in Figure 1.

If Q_1 and Q_2 are unobserved or latent variables, their influence on C and S can be attributed to their error terms. By removing them, we obtain the following model specification:

Model 2:
\begin{align}
C &= U_C \\
S &= \beta C + U_S
\end{align}

The background information specified by Model 1 implies that the error term of S, U_S, is correlated with the error term of C, U_C. As a result, we add a bidirected arc between ''S'' and ''C'', as in Figure 2. Since U_S is correlated with U_C, and therefore with C, C is endogenous and \beta is not identified in Model 2.

However, if we include the strength of an individual's college application, A, as shown in Figure 3, we obtain the following model:

Model 3:
\begin{align}
Q_1 &= U_1\\
A &= a \cdot Q_1 + U_2 \\
C &= b \cdot A + U_3\\
Q_2 &= e \cdot Q_1 + d \cdot C + U_4\\
S &= c \cdot C + f \cdot Q_2 + U_5,
\end{align}

By removing the latent variables from the model specification we obtain:

Model 4:
\begin{align}
A &= U_A \\
C &= b \cdot A + U_C\\
S &= \beta \cdot C + U_S,
\end{align}

with U_A correlated with U_S. Now, \beta is identified and can be estimated using the regression of S on C and A. This can be verified using the ''single-door criterion'', a necessary and sufficient graphical condition for the identification of structural coefficients, like \beta, using regression.
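The identification claim can be checked numerically. Simulating Model 3 with all structural coefficients set to 1 (a hypothetical choice made only for this sketch) gives a total effect of C on S of \beta = c + f \cdot d = 2 after the latent Q_2 is absorbed into the error term. Regressing S on both C and A then recovers \beta, while regressing S on C alone does not, because C is endogenous:

```python
import numpy as np

# Simulate Model 3 with a = b = c = d = e = f = 1 (hypothetical values),
# so the identified coefficient is beta = c + f*d = 2.
rng = np.random.default_rng(1)
n = 200_000
Q1 = rng.normal(size=n)            # latent prior qualifications
A  = Q1 + rng.normal(size=n)       # application strength
C  = A + rng.normal(size=n)        # college quality
Q2 = Q1 + C + rng.normal(size=n)   # latent post-college qualifications
S  = C + Q2 + rng.normal(size=n)   # salary

def ols(y, *xs):
    """Least-squares coefficients of y on an intercept plus the given regressors."""
    X = np.column_stack([np.ones(len(y)), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0]

beta_naive = ols(S, C)[1]      # biased: ignores the unblocked path through Q1
beta_hat   = ols(S, C, A)[1]   # consistent for beta = 2 (single-door criterion)
print(beta_naive, beta_hat)
```

Conditioning on A blocks the spurious dependence between C and U_S (both trace back to the latent Q_1), which is why the second regression, but not the first, recovers \beta.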

