## Thursday, September 19, 2013

### Cluster Analysis

* Cluster analysis is a class of tools used to group complex data into
* distinct clusters based on observable variation.

* Cluster analysis is closely related to the idea of latent class analysis, in
* which data are grouped into classes based on observable characteristics.

* Generally speaking, cluster analysis falls within the realm of data mining
* and is usually used for some kind of data exploration.

* As with all of my posts, this is exploratory.
* Please feel free to correct any glaring errors.

* Let's imagine we have clusters of high school
* students defined by:

* A. The Nerds/Geeks
* B. The Iconoclasts
* C. The Jocks
* D. The Super-Stars
* E. The Divas
* F. Overlooked (base) (0)

* And we have a number of observable scales:
* A. Grades
* B. Friends
* C. Athletics
* D. Performance
* E. Popularity

* Now I will assign each class a modifier on each of the scales, measured
* relative to the overlooked (base) student.

* A positive modifier means the class scores above the base student on that
* scale, while a negative modifier means it scores below.

* p1 through p5 are the proportions of students in each class.

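* This variable scales the class effects
local Cscalar = 2

* Grade modifiers by class. These values are not given in this post and are
* assumed here: nerds get good grades and divas low grades (matching the
* cluster descriptions further down), and the other classes are left at 0.
local grade1 = 1
local grade2 = 0
local grade3 = 0
local grade4 = 0
local grade5 = -1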

* A. The Nerds/Geeks
local p1 = .15
local friends1 = -1
local athletics1 = -1
local performance1 = 0
local popularity1 = -1

* B. The Iconoclasts
local p2 = .1
local friends2 = 1
local athletics2 = -1
local performance2 = 1
local popularity2 = 1

* C. The Jocks
local p3 = .2
local friends3 = 1
local athletics3 = 1.5
local performance3 = -.5
local popularity3 = 1

* D. The Super-Stars
local p4 = .1
local friends4 = .5
local athletics4 = 1
local performance4 = 1
local popularity4 = 1

* E. The Divas
local p5 = .1
local friends5 = -1
local athletics5 = -1
local performance5 = 1
local popularity5 = 1.5

* Let's first generate some data
clear
set obs 1000

set seed 1

gen assign = runiform()

gen Lclass = 0  // Assume student is part of base first

replace Lclass = 5 if assign<`p1'+`p2'+`p3'+`p4'+`p5'
replace Lclass = 4 if assign<`p1'+`p2'+`p3'+`p4'
replace Lclass = 3 if assign<`p1'+`p2'+`p3'
replace Lclass = 2 if assign<`p1'+`p2'
replace Lclass = 1 if assign<`p1'

* Create a labelbook for Lclass
label define Lclass 0 "base" 1 "nerd" 2 "iconoclast" 3 "jock" 4 "super-star" 5 "diva"
label val Lclass Lclass

tab Lclass

* Now let's generate our observable data assuming everyone is base
gen grade=rnormal()
gen friends=rnormal()
gen athletics=rnormal()
gen performance=rnormal()
gen popularity=rnormal()

* Now modify each based on the class:
forv i=1/5 {
  foreach v in grade friends athletics performance popularity {
    * This is going to look fishy so I will use the display command
    * to display what is going on in this nested loop.
    di "replace `v' = `v' + ``v'`i''*`Cscalar'"
    qui replace `v' = `v' + ``v'`i''*`Cscalar' if Lclass==`i'
  }
}
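
* A quick check on the loop above (a minimal sketch): the mean of each scale
* by class should sit near that class's modifier times Cscalar, and the base
* class should sit near 0.
tabstat grade friends athletics performance popularity, by(Lclass) statistics(mean) format(%5.2f)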

* So this is what our data might look like, except that our Lclass is unobserved
* and we would like to impute it.
* This is what we see when we look at our data
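* (A minimal sketch, assuming the view meant here is a plain scatter of
* athletics against grade with no class information.)
scatter athletics grade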




* But we would like to see this:
twoway (scatter athletics grade if Lclass==0) (scatter athletics grade if Lclass==1)  ///
       (scatter athletics grade if Lclass==2) (scatter athletics grade if Lclass==3)  ///
       (scatter athletics grade if Lclass==4) (scatter athletics grade if Lclass==5), ///
       legend(label(1 "Base") label(2 "Nerd") label(3 "Iconoclast") label(4 "Jock")   ///
       label(5 "Super-Star") label(6 "Diva") rows(2))

* cluster kmeans partitions the data into k clusters, with each cluster
* defined by the mean values of its members on each variable.
cluster kmeans grade friends athletics performance popularity, k(6)

* This generates the variable _clus_1

* We can do a cross tab to check how well our clustering worked.
tab Lclass _clus_1
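
* Row percentages can make the same table easier to read (a small sketch):
* each row shows how one true class is spread across the clusters, so a
* well-recovered class has most of its mass in a single column.
tab Lclass _clus_1, row nofreq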

* It is looking pretty darn good really.

* With a Cscalar of 2 it is fairly successful at grouping observations 
* into distinct clusters.

* If we reduce the Cscalar then it becomes more difficult.

* Another interesting modification is to ask for fewer or more clusters
* than there are classes.

cluster kmeans grade friends athletics performance popularity, k(4)

tab Lclass _clus_2

* By doing so we can see that different student classes are grouped together.

* Iconoclast and divas are grouped together and super-stars and base 
* are grouped together.

cluster kmeans grade friends athletics performance popularity, k(8)

tab Lclass _clus_3

* With too many clusters, some of our classes are now split across clusters,
* probably based on random variation.

* In this simulation I got jocks in clusters 4 and 7.

* Overall this raises one of the inherent difficulties of cluster analysis:
* there is no definitive way to identify the appropriate number of clusters.

* The best we can do is look at our different clusters and try to characterize 
* them from the values of the observed variables.

bysort _clus_3: sum grade friends athletics performance popularity
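
* As a rough complement to eyeballing the summaries, Stata's cluster stop
* command reports the Calinski-Harabasz pseudo-F for a cluster solution.
* Larger values indicate more distinct clustering, but this is only a
* heuristic. The sketch below assumes the three analyses above kept their
* default names _clus_1 (k=6), _clus_2 (k=4), and _clus_3 (k=8).
forv j=1/3 {
  cluster stop _clus_`j', rule(calinski)
}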

* I can see that cluster 1 does not diverge much from the mean, which
* suggests it might be the base or average student. Cluster 2 has poor
* athletics, less popularity, and fewer friends but good grades, suggesting nerd.
* Cluster 3 has low grades, athletics, and friends but is popular, suggesting diva,
* etc. The interesting thing is the difference between clusters 4 and 7:
* they exhibit generally the same pattern, except that 7s are substantially
* higher in athletics, suggesting that the clustering separated athletic jocks
* from average ones.

* Please don't be offended by this silly category system. You can probably
* guess which category I fell into in high school :P
