* Principal component analysis (PCA) is a method for attempting to
* identify the underlying driving factors, or components, behind the
* observable values in data.
* Imagine that you have demographic data on people: class rank, number of
* sports played, salary, marital status, number of children, etc.
* Now let's imagine that each observed variable is a function of
* underlying latent personal traits: intelligence, athleticism, and
* family orientation.
set seed 101
clear
set obs 1000
* Latent traits
gen inte = rnormal()
gen athl = rnormal()
gen famo = rnormal()
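* Optional check (an addition, not part of the original walkthrough): the
* three latent traits are drawn independently, so their sample correlations
* should be close to zero. This matters later because PCA extracts
* orthogonal components.
corr inte athl famo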
* Observable traits
gen class_rank = 2*inte - .1*athl + 1*famo + rnormal()
gen nsports = -.5*inte + 2*athl + .5*famo + rnormal()
gen salary = 1*inte + .5*athl - 1*famo + rnormal()
gen married = .1*inte + .5*athl + 1.5*famo + rnormal()
gen children = -.5*inte + 0*athl + 2*famo + rnormal()
* Now let us attempt to identify our latent traits
pca class_rank nsports salary married children
screeplot
predict lt1 lt2 lt3
* This generates three variables (lt1, lt2, lt3) that hold the scores on
* the first three principal components, our estimates of the latent traits.
corr lt1 lt2 lt3 inte athl famo
* By correlating the true latent traits with the PCA-generated
* variables we can check how well the PCA is working.
* We can see that the first component identified is famo
* (family orientation), followed by intelligence and then athleticism.
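* A hedged sketch of one way to see how the components relate to the
* observed variables (an addition, not part of the original walkthrough):
* re-run the PCA keeping three components and list the loadings. Keeping
* exactly three is an assumption based on the three simulated traits.
pca class_rank nsports salary married children, components(3)
estat loadings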
* It is important to note that in practice family orientation,
* intelligence, and athleticism can be correlated. If they were, principal
* component analysis would have difficulty identifying them, since it
* relies on extracting orthogonal components.
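* A minimal sketch of that point, assuming a hypothetical athleticism trait
* (athl_c) constructed to correlate with intelligence at about 0.6. All
* names ending in _c, and lc1-lc3, are illustrative and not from the
* original walkthrough.
gen athl_c = .6*inte + sqrt(1 - .6^2)*rnormal()
gen class_rank_c = 2*inte - .1*athl_c + 1*famo + rnormal()
gen nsports_c = -.5*inte + 2*athl_c + .5*famo + rnormal()
gen salary_c = 1*inte + .5*athl_c - 1*famo + rnormal()
gen married_c = .1*inte + .5*athl_c + 1.5*famo + rnormal()
gen children_c = -.5*inte + 0*athl_c + 2*famo + rnormal()
pca class_rank_c nsports_c salary_c married_c children_c
predict lc1 lc2 lc3
* The component scores are uncorrelated with each other by construction,
* even though inte and athl_c are not.
corr lc1 lc2 lc3
* Compare the scores with the (now correlated) latent traits.
corr lc1 lc2 lc3 inte athl_c famo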
Formatted By Econometrics by Simulation
What you are describing is not PCA; it is in fact factor analysis, a similar but distinct method that is appropriate in different settings than PCA.