Wednesday, December 5, 2012

Path Analysis

Stata do file

* Path analysis is an interesting statistical method that can be used to identify complex relationships between explanatory variables and an outcome variable.

* As with all statistical methods, the modelling framework is essential to derive reasonable results.

* Conveniently, I am only interested in simulating data, so as usual my data will conform perfectly to the model's specifications.

* Imagine the following model.

* All of the boxes are observable variables.  The arrows indicate the causal direction of the effects.

* There are two exogenous variables: A and D.   These variables are not influenced by any other variables in the model.

* All other variables are endogenous.

* Each path coefficient (the p labels on the arrows) represents the direct effect of a one-unit change in one variable on another. The first letter names the variable being affected and the second names the variable causing the change, so pHG is the effect of G on H.

* This framework is convenient because it allows us to identify a "total effect", which combines the direct and indirect effects of a variable on the outcome variable.

* The variable we are primarily interested in explaining is H.

* The variable G has only a direct effect on H (pHG).

* While the variable C only has an indirect effect on H (pFC*pHF).

* The indirect effect is a product because C has a pFC effect on F and F has a pHF effect on H. The change in H resulting from a change in C is therefore how much F changes as a result of C multiplied by how much that change in F affects H.

* Variables can have both an indirect and a direct effect.

* B for instance has the direct effect: pHB
* Indirect effects: pCB*pFC*pHF + pEB*pHE
* Total effect: pHB + pCB*pFC*pHF + pEB*pHE

* The key feature of this particular example is that all of the arrows are one-directional, which makes a great deal of inference possible that otherwise would not be.

* Usually we cannot say that when trying to explain H with explanatory variables A through G that A causes B and B causes H.

* However, if we do the work to identify reasonable pathways then this type of analysis could be quite interesting.

* Let's generate our data.


* Let's imagine 6000 youth in our sample.

* Start from an empty dataset so that the variables below can be created.
clear
set obs 6000

* Let's first specify our effects

* pEA = .3
* pEB = .13

* pHA = .2
* pHB = .2
* pHE = .3
* pHG = 1.1
* pHF = .2

* pBA = .5

* pCB = .2
* pCD = .1

* pGD = .2

* pFC = .76
* pFB = .4

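* For B we can calculate our true effects:

* B Direct: pHB = .2

* Indirect effects: pCB*pFC*pHF + pEB*pHE
* Indirect effects: .2*.76*.2 + .13*.3 = .0694

* Total effect: .0694 + .2 = .2694

* Note that this expression leaves out the B -> F -> H path (pFB*pHF = .4*.2 = .08) implied by the effects listed above. Including it gives an indirect effect of .1494 and a total effect of .3494.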
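
* A minimal sketch of the arithmetic above, not part of the original do file.
* The scalar names (t_pHB, t_pCB and so on) are made up here for illustration.
scalar t_pHB = .2
scalar t_pCB = .2
scalar t_pFC = .76
scalar t_pHF = .2
scalar t_pEB = .13
scalar t_pHE = .3

* Indirect effect of B on H, using the same expression as above.
scalar t_indirect = t_pCB*t_pFC*t_pHF + t_pEB*t_pHE
display "B's true indirect effect = " t_indirect

* Total effect = direct effect + indirect effect.
scalar t_total = t_pHB + t_indirect
display "B's true total effect    = " t_total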

gen A =                rnormal()
gen B = A*.5 +         rnormal()
gen D =                rnormal()
gen C = B*.2 + D*.1 +  rnormal()
gen E = A*.3 + B*.13 + rnormal()
gen F = B*.4 + C*.76 + rnormal()
gen G = D*.2 +         rnormal()
gen H = E*.3 + A*.2 + B*.2 + F*.2 + G*1.1 + rnormal()

* Simulation done.

* To estimate the different effects we simply run OLS for each endogenous variable.

* B is the endogenous variable here, so it is regressed on A.
reg B A
  local pBA = _b[A]

reg C B D
  local pCB = _b[B]
  local pCD = _b[D]

reg G D
  local pGD = _b[D]

reg F C B
  local pFB = _b[B]
  local pFC = _b[C]

reg E A B
  local pEA = _b[A]
  local pEB = _b[B]

reg H A B E F G
  local pHA = _b[A]
  local pHB = _b[B]
  local pHE = _b[E]
  local pHF = _b[F]
  local pHG = _b[G]
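
* A quick check, added here as a sketch: list the estimated path coefficients
* that enter B's effect on H next to the true values used in the simulation.
* With 6000 observations the two columns should be close but not identical.
di "pHB: estimate = " %6.4f `pHB' "   true = .2"
di "pCB: estimate = " %6.4f `pCB' "   true = .2"
di "pFC: estimate = " %6.4f `pFC' "   true = .76"
di "pHF: estimate = " %6.4f `pHF' "   true = .2"
di "pEB: estimate = " %6.4f `pEB' "   true = .13"
di "pHE: estimate = " %6.4f `pHE' "   true = .3"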

* To estimate the indirect effect of, say, B on H, we just plug our estimates into the equation.

* B direct effect: pHB
* Indirect effects: pCB*pFC*pHF + pEB*pHE
* Total effect: pHB + pCB*pFC*pHF + pEB*pHE

di "B's estimated indirect effect = `pCB'*`pFC'*`pHF' + `pEB'*`pHE'"
di "B's estimated indirect effect = " `pCB'*`pFC'*`pHF' + `pEB'*`pHE'

* This turns out to be close to the true value of .0694 calculated above.

di "B's total estimated effect on H is " `pHB' + `pCB'*`pFC'*`pHF' + `pEB'*`pHE'
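
* As a sketch, the gap between each estimate and its true value can be displayed directly.
* The values .0694 and .2694 come from the calculation earlier in this file.
* The local name ind_hat is made up here for illustration.
local ind_hat = `pCB'*`pFC'*`pHF' + `pEB'*`pHE'
di "Indirect effect: estimate - true = " `ind_hat' - .0694
di "Total effect:    estimate - true = " (`pHB' + `ind_hat') - .2694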

* It is possible to use the user-written command pathreg to make things easier.

* Install it by typing the following command: findit pathreg
pathreg (H A B E F G) (G D) (C B D) (B A) (E A B) (F B C)

* This command does not currently calculate all of the indirect and direct effects.
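
* A possible alternative, sketched here as an assumption and not used elsewhere in this do file:
* Stata 12 and later include a built-in sem command, and its estat teffects
* postestimation command reports direct, indirect and total effects.
* The same six equations are written with the arrow pointing at the dependent variable.
sem (H <- A B E F G) (G <- D) (C <- B D) (B <- A) (E <- A B) (F <- B C)
estat teffects

* estat teffects sums every indirect path implied by the equations, so the indirect
* effect it reports for B on H also includes the route through F alone (pFB*pHF).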

* I am not sure of the best way to calculate the standard errors of the different effect estimates.

* My guess is that, since this is just a series of fast OLS regressions, the easiest thing to do would be to bootstrap the entire process.

* This would require slightly more code but would definitely be easy to do from this point.
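
* A minimal sketch of such a bootstrap, under a few assumptions: the program name
* b_effects, the 200 replications and the seed are arbitrary choices made here.
capture program drop b_effects
program define b_effects, rclass
    * Re-estimate each equation that feeds into B's effect on H.
    regress C B D
    local pCB = _b[B]
    regress F C B
    local pFC = _b[C]
    regress E A B
    local pEB = _b[B]
    regress H A B E F G
    * Same expressions as above, evaluated on the current (re)sample.
    local indirect = `pCB'*`pFC'*_b[F] + `pEB'*_b[E]
    return scalar indirect = `indirect'
    return scalar total    = _b[B] + `indirect'
end

* bootstrap resamples the observations, reruns b_effects on each resample and
* reports bootstrap standard errors for the two returned statistics.
bootstrap indirect=r(indirect) total=r(total), reps(200) seed(12345): b_effects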


  1. Stata has (relatively new, I think) SEM features:

    1. Yeah I know, I am a version behind the times :D