* Path analysis is a statistical method that can be used to identify complex relationships between a set of variables and an outcome variable.

* As with all statistical methods, the modelling framework is essential for deriving reasonable results.

* Conveniently, I am only interested in simulating data, so as usual my data will conform perfectly to the model's specifications.

* Imagine the following model.

* All of the boxes are observable variables. The arrows indicate the causal direction of the effects.

* There are two exogenous variables: A and D. These variables are not influenced by any other variables in the model.

* All other variables are endogenous.

* Each of the arrows represents the direct effect of a one-unit change in one variable on the variable it points to. I write pXY for the direct effect of Y on X.

* This framework is convenient because it allows us to identify a "total effect", which combines the direct and indirect effects of a variable on the outcome variable.

* The variable we are primarily interested in explaining is H.

* The variable G has only a direct effect on H (pHG).

* The variable C, by contrast, has only an indirect effect on H (pFC*pHF).

* The indirect effect is a product because C has a pFC effect on F, and F has a pHF effect on H. A change in H resulting from a change in C is therefore how much F changes as a result of C, multiplied by how much that change in F affects H.
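
* As a small worked example, using the values assigned further below (pFC = .76 and pHF = .2), C's indirect effect on H works out to about .152.

di "C's true indirect effect on H = " .76*.2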

* Variables can have both an indirect and a direct effect.

* B, for instance, has the direct effect: pHB

* Indirect effects (one term per path from B to H, including B->F->H): pCB*pFC*pHF + pFB*pHF + pEB*pHE

* Total effect: pHB + pCB*pFC*pHF + pFB*pHF + pEB*pHE

* The key feature of this particular example is that all of the arrows are one-directional, which makes a great deal of inference possible that otherwise would not be.

* Usually, when trying to explain H with explanatory variables A through G, we cannot say that A causes B and B causes H.

* However, if we do the work to identify reasonable pathways, then this type of analysis could be quite interesting.

* Let's generate our data.

clear

* Let's imagine 6000 youth in our sample.

set obs 6000
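
* Optionally, set a seed so that the random draws below are reproducible. The value 101 is arbitrary and my own choice; it is not part of the original simulation.

set seed 101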

* Let's first specify our effects

* pEA = .3

* pEB = .13

* pHA = .2

* pHB = .2

* pHE = .3

* pHG = 1.1

* pHF = .2

* pBA = .5

* pCB = .2

* pCD = .1

* pGD = .2

* pFC = .76

* pFB = .4

* For B we can calculate our true effects:

* B Direct: pHB = .2

* Indirect effects: pCB*pFC*pHF + pFB*pHF + pEB*pHE

* Indirect effects: .2*.76*.2 + .4*.2 + .13*.3 = .1494

* Total effect: .1494 + .2 = .3494
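
* As a check on this arithmetic, here is a short sketch that sums the product of coefficients over every path from B to H implied by the data-generating equations below (B->C->F->H, B->F->H and B->E->H). The scalar names t_ind and t_tot are my own.

scalar t_ind = .2*.76*.2 + .4*.2 + .13*.3

scalar t_tot = .2 + t_ind

di "True indirect effect of B on H = " scalar(t_ind)

di "True total effect of B on H = " scalar(t_tot)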

gen A = rnormal()

gen B = A*.5 + rnormal()

gen D = rnormal()

gen C = B*.2 + D*.1 + rnormal()

gen E = A*.3 + B*.13 + rnormal()

gen F = B*.4 + C*.76 + rnormal()

gen G = D*.2 + rnormal()

gen H = E*.3 + A*.2 + B*.2 + F*.2 + G*1.1 + rnormal()

* Simulation done.

* To estimate the different effects, we simply run OLS for each endogenous variable.

reg B A

local pBA = _b[A]

reg C B D

local pCB = _b[B]

local pCD = _b[D]

reg G D

local pGD = _b[D]

reg F C B

local pFB = _b[B]

local pFC = _b[C]

reg E A B

local pEA = _b[A]

local pEB = _b[B]

reg H A B E F G

local pHA = _b[A]

local pHB = _b[B]

local pHE = _b[E]

local pHF = _b[F]

local pHG = _b[G]

* To estimate the indirect effect of, say, B on H, we just plug our estimates into the equation.

* B direct effect: pHB

* Indirect effects: pCB*pFC*pHF + pFB*pHF + pEB*pHE

* Total effect: pHB + pCB*pFC*pHF + pFB*pHF + pEB*pHE

di "B's estimated indirect effect = `pCB'*`pFC'*`pHF' + `pFB'*`pHF' + `pEB'*`pHE'"

di "B's estimated indirect effect = " `pCB'*`pFC'*`pHF' + `pFB'*`pHF' + `pEB'*`pHE'

* Which turns out to be close to the true value implied by the simulation (.1494 indirect, .3494 total).

di "B's total estimated effect on H is " `pHB' + `pCB'*`pFC'*`pHF' + `pFB'*`pHF' + `pEB'*`pHE'
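
* A rough cross-check, assuming the model is specified exactly as simulated above: because A is the only common cause of B and H, and every other input to H is independent of B given A, the coefficient on B in a regression of H on A and B should approximate B's total effect (about .3494 with the true coefficients).

reg H A B

di "Reduced-form estimate of B's total effect on H = " _b[B]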

* It is possible to use the user-written command pathreg to make things easier.

* Install it by typing the following command: findit pathreg

pathreg (H A B E F G) (G D) (C B D) (B A) (E A B) (F B C)

* This command does not currently calculate all of the indirect and direct effects.

* I am not sure of the best way to calculate the standard errors of the different effect estimates.

* My guess is that, since this is just a series of fast OLS regressions, the easiest thing to do would be to bootstrap the entire process.

* This would require slightly more code, but it would definitely be easy to do from this point.
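
* A minimal sketch of that bootstrap, assuming the data generated above are still in memory. The program name b_effects and the choices of 200 replications and seed 1234 are my own and purely illustrative.

capture program drop b_effects

program define b_effects, rclass
    * Re-estimate every equation that lies on a path from B to H
    quietly reg C B D
    local pCB = _b[B]
    quietly reg F C B
    local pFC = _b[C]
    local pFB = _b[B]
    quietly reg E A B
    local pEB = _b[B]
    quietly reg H A B E F G
    * Sum the coefficient products over the three indirect paths
    local ind = `pCB'*`pFC'*_b[F] + `pFB'*_b[F] + `pEB'*_b[E]
    return scalar indirect = `ind'
    return scalar total = _b[B] + `ind'
end

* Resample observations with replacement and re-run the whole process each time

bootstrap indirect=r(indirect) total=r(total), reps(200) seed(1234): b_effects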

Stata has (relatively new, I think) SEM features:

http://www.stata.com/stata12/structural-equation-modeling/

Yeah I know, I am a version behind the times :D
