* Cluster analysis is a class of tools used to group complex data into
* distinct clusters based on observable variation.
* Cluster analysis is closely related to the idea of latent class analysis, in
* which data is grouped into classes based on observable characteristics.
* Generally speaking, cluster analysis falls within the realm of data mining
* and is usually used for some kind of data exploration.
* As with all of my posts, this is exploratory.
* Please feel free to correct any glaring errors.
* Let's imagine we have clusters of high school
* students defined by:
* A. The Nerds/Geeks
* B. The Iconoclasts
* C. The Jocks
* D. The Super-Stars
* E. The Divas
* F. Overlooked (base) (0)
* And we have a number of observable scales
* A. Grades
* B. Friends
* C. Athletics
* D. Performance
* E. Popularity
* Now I will assign each class a modifier on each of the scales, measured
* relative to the overlooked (base) student.
* A positive value means the class sits above the base on that scale, while a
* negative value means it sits below the base.
* p1 through p5 are the proportions of students in each class.
* Cscalar scales all of the class effects.
local Cscalar = 2
* A. The Nerds/Geeks
local p1 = .15
local grade1 = 1
local friends1 = -1
local athletics1 = -1
local performance1 = 0
local popularity1 = -1
* B. The Iconoclasts
local p2 = .1
local grade2 = 0
local friends2 = 1
local athletics2 = -1
local performance2 = 1
local popularity2 = 1
* C. The Jocks
local p3 = .2
local grade3 = -1
local friends3 = 1
local athletics3 = 1.5
local performance3 = -.5
local popularity3 = 1
* D. The Super-Stars
local p4 = .1
local grade4 = 1.5
local friends4 = .5
local athletics4 = 1
local performance4 = 1
local popularity4 = 1
* E. The Divas
local p5 = .1
local grade5 = -.5
local friends5 = -1
local athletics5 = -1
local performance5 = 1
local popularity5 = 1.5
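* A small sanity check, added as an aside rather than part of the original
* setup: whatever share is not claimed by p1 through p5 is left for the base
* class, so it should be positive. The local name pbase is my own.
local pbase = 1 - (`p1' + `p2' + `p3' + `p4' + `p5')
di "Share left for the base class: `pbase'"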
* Let's first generate some data
clear
set obs 1000
set seed 1
gen assign = runiform()
gen Lclass = 0 // Assume student is part of base first
replace Lclass = 5 if assign<`p1'+`p2'+`p3'+`p4'+`p5'
replace Lclass = 4 if assign<`p1'+`p2'+`p3'+`p4'
replace Lclass = 3 if assign<`p1'+`p2'+`p3'
replace Lclass = 2 if assign<`p1'+`p2'
replace Lclass = 1 if assign<`p1'
* Define value labels for Lclass
label define Lclass 0 "base" 1 "nerd" 2 "iconoclast" 3 "jock" 4 "super-star" 5 "diva"
label val Lclass Lclass
tab Lclass
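* As an optional check (my addition), we can compare the share of students
* that actually landed in each class with the share we designed.
* With 1000 observations the two should be close but not identical.
forv i=1/5 {
    qui count if Lclass==`i'
    di "Class `i': designed share `p`i'', realized share " %5.3f r(N)/_N
}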
* Now let's generate our observable data assuming everyone is base
gen grade=rnormal()
gen friends=rnormal()
gen athletics=rnormal()
gen performance=rnormal()
gen popularity=rnormal()
* Now modify each based on the class:
forv i=1/5 {
    foreach v in grade friends athletics performance popularity {
        * This is going to look fishy so I will use the display command
        * to display what is going on in this nested loop.
        di "replace `v' = `v' + ``v'`i''*`Cscalar'"
        qui replace `v' = `v' + ``v'`i''*`Cscalar' if Lclass==`i'
    }
}
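* A quick way to confirm the loop did what we intended (my addition): within
* each class, the mean of each scale should be close to its modifier times
* Cscalar, and close to 0 for the base class.
tabstat grade friends athletics performance popularity, by(Lclass) statistics(mean)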
* So this is what our data might look like, except that our Lclass is unobserved
* and we would like to impute it.
scatter athletics grade
* This is what we see when we look at our data
* But we would like to see this:
twoway (scatter athletics grade if Lclass==0) (scatter athletics grade if Lclass==1) ///
(scatter athletics grade if Lclass==2) (scatter athletics grade if Lclass==3) ///
(scatter athletics grade if Lclass==4) (scatter athletics grade if Lclass==5) , ///
legend(label(1 "Base") label(2 "Nerd") label(3 "Iconoclast") label(4 "Jock") ///
label(5 "Super-Star") label(6 "Diva") rows(2))
* cluster kmeans partitions the data into k clusters, with each cluster being
* defined by the mean value of each variable among its members.
cluster kmeans grade friends athletics performance popularity, k(6)
* This generates the variable _clus_1
* We can do a cross tab to check how well our clustering worked.
tab Lclass _clus_1
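* Raw counts can be hard to compare when clusters differ in size.
* As an optional extra view (my addition), column percentages show what share
* of each cluster comes from each true class.
tab Lclass _clus_1, column nofreq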
* It is looking pretty darn good really.
* With a Cscalar of 2 it is fairly successful at grouping observations
* into distinct clusters.
* If we reduce the Cscalar then it becomes more difficult.
* Another interesting modification is to ask for a smaller or larger
* number of clusters.
cluster kmeans grade friends athletics performance popularity, k(4)
tab Lclass _clus_2
* By doing so we can see that different student classes are grouped together.
* Iconoclasts and divas are grouped together, and super-stars and base
* are grouped together.
cluster kmeans grade friends athletics performance popularity, k(8)
tab Lclass _clus_3
* With too many clusters, some clusters now split our classes,
* probably based on random variation.
* In this simulation I got jocks in clusters 4 and 7.
* Overall this raises one of the inherent difficulties of cluster analysis:
* the method itself does not tell us the appropriate number of clusters.
* The best we can do is look at our different clusters and try to characterize
* them from the values of the observed variables.
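* A partial aid, added here as a hedged aside rather than part of the original
* analysis: Stata's cluster stop command reports the Calinski-Harabasz
* pseudo-F for a stored clustering. Larger values point to more distinct
* clusters, but this is only a heuristic and will not settle the question.
cluster stop _clus_1 // the k(6) solution
cluster stop _clus_2 // the k(4) solution
cluster stop _clus_3 // the k(8) solution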
bysort _clus_3: sum grade friends athletics performance popularity
* I can see that cluster 1 does not diverge much from the mean, which
* suggests it might be the base or average student.
* Cluster 2 has poor athletics, less popularity, and fewer friends but good
* grades, suggesting nerd.
* Cluster 3 has low grades, athletics, and friends but is popular, suggesting
* diva, etc.
* The interesting thing is the difference between clusters 4 and 7: they both
* exhibit generally the same pattern, except that the 7s are substantially
* higher in athletics, suggesting that the clustering separated the very
* athletic jocks from the average ones.
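* To see how a smaller Cscalar makes clustering harder, here is a minimal
* sketch (my addition). It assumes the locals defined at the top are still
* in scope, i.e. the whole file is run in one go. The names newscalar and
* weak are my own. preserve/restore keeps the original data untouched.
preserve
local newscalar = 1
forv i=1/5 {
    foreach v in grade friends athletics performance popularity {
        * Shift each class effect from Cscalar down to newscalar
        qui replace `v' = `v' + ``v'`i''*(`newscalar'-`Cscalar') if Lclass==`i'
    }
}
cluster kmeans grade friends athletics performance popularity, k(6) name(weak)
tab Lclass weak
restore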
* Please don't be offended by this silly category system. In high school
* you can probably guess which category I fell into :P