Thursday, September 19, 2013

Cluster Analysis

* Cluster analysis is a class of tools used to group complex data into
* distinct clusters based on observable variation.

* Cluster analysis is closely related to the idea of latent class analysis, in
* which data are grouped into classes based on observable characteristics.

* Generally speaking, cluster analysis falls within the realm of data mining
* and is usually used for some kind of data exploration.
* As with all of my posts, this is exploratory.
* Please feel free to correct any glaring errors.

* Let's imagine we have clusters of high school students defined by:

* A. The Nerds/Geeks
* B. The Iconoclasts
* C. The Jocks
* D. The Super-Stars
* E. The Divas
* F. Overlooked (base) (0)

* And we have a number of observable scales
* A. Grades
* B. Friends
* C. Athletics
* D. Performance
* E. Popularity

* Now I will assign each class a modifier on each scale, measured relative
* to the overlooked (base) student.

* A positive value puts the class above the base student on that scale, while
* a negative value puts it below.

* p1 through p5 are the proportions of students in each class. They sum to
* .65, so the remaining 35% of students stay in the base class.

* This variable scales the class effects
local Cscalar = 2

* A. The Nerds/Geeks
local p1 = .15
local grade1 = 1
local friends1 = -1
local athletics1 = -1
local performance1 = 0
local popularity1 = -1

* B. The Iconoclasts
local p2 = .1
local grade2 = 0
local friends2 = 1
local athletics2 = -1
local performance2 = 1
local popularity2 = 1

* C. The Jocks
local p3 = .2
local grade3 = -1
local friends3 = 1
local athletics3 = 1.5
local performance3 = -.5
local popularity3 = 1

* D. The Super-Stars
local p4 = .1
local grade4 = 1.5
local friends4 = .5
local athletics4 = 1
local performance4 = 1
local popularity4 = 1

* E. The Divas
local p5 = .1
local grade5 = -.5
local friends5 = -1
local athletics5 = -1
local performance5 = 1
local popularity5 = 1.5

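* Let's first generate some data (clear drops any data already in memory,
* which set obs requires when a larger dataset is loaded)
clear
set obs 1000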

set seed 1

gen assign = runiform()

gen Lclass = 0  // Assume student is part of base first 

replace Lclass = 5 if assign<`p1'+`p2'+`p3'+`p4'+`p5'
replace Lclass = 4 if assign<`p1'+`p2'+`p3'+`p4'
replace Lclass = 3 if assign<`p1'+`p2'+`p3'
replace Lclass = 2 if assign<`p1'+`p2'
replace Lclass = 1 if assign<`p1'

* Create a value label for Lclass
label define Lclass 0 "base" 1 "nerd" 2 "iconoclast" 3 "jock" 4 "super-star" 5 "diva"
label val Lclass Lclass

tab Lclass 
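
* As a rough check on the tabulation above (a sketch, not part of the original
* routine): p1 through p5 sum to .65, so roughly 35% of the 1,000 students
* should be left in the base class, give or take sampling noise.
di "Expected base share: " %4.2f 1-(`p1'+`p2'+`p3'+`p4'+`p5')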

* Now let's generate our observable data assuming everyone is base
gen grade=rnormal()
gen friends=rnormal()
gen athletics=rnormal()
gen performance=rnormal()
gen popularity=rnormal()

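* Now modify each based on the class.
* The nested macro reference below is going to look fishy, so I will use the
* display command to show what is going on in this nested loop. Stata expands
* the inner macros first: with v = grade and i = 1, the reference resolves to
* the local grade1, which holds 1.
forv i=1/5 {
  foreach v in grade friends athletics performance popularity {
    di "replace `v' = `v' + ``v'`i''*`Cscalar' if Lclass==`i'"
    qui replace `v' = `v' + ``v'`i''*`Cscalar' if Lclass==`i'
  }
}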

* So this is what our data might look like, except that our Lclass is
* unobserved and we would like to impute it.
scatter athletics grade
* This is what we see when we look at our data

* But we would like to see this:
twoway (scatter athletics grade if Lclass==0) (scatter athletics grade if Lclass==1)  /// 
  (scatter athletics grade if Lclass==2) (scatter athletics grade if Lclass==3)       /// 
  (scatter athletics grade if Lclass==4) (scatter athletics grade if Lclass==5) ,     /// 
  legend(label(1 "Base") label(2 "Nerd") label(3 "Iconoclast") label(4 "Jock")  /// 
  label(5 "Super-Star") label(6 "Diva") rows(2))
* cluster kmeans partitions the observations into k clusters, with each
* cluster defined by the mean value of each variable.
cluster kmeans grade friends athletics performance popularity, k(6)

* This generates the variable _clus_1
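
* To confirm what was stored, Stata can list the cluster analyses held in
* memory and the details of one of them. A sketch; it assumes the default
* name _clus_1 was assigned because no name() option was given.
cluster dir
cluster list _clus_1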

* We can do a cross tab to check how well our clustering worked.
tab Lclass _clus_1
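
* One rough way to put a single number on that cross tab, as a sketch: give
* each cluster the latent class it contains most often, then compute the share
* of students who belong to their cluster's majority class. The variables
* n_cell, n_max, and hit are made up for this illustration, and ties between
* classes within a cluster would overstate the share slightly. preserve and
* restore leave the data and its sort order as they were.
preserve
bysort _clus_1 Lclass: gen n_cell = _N
bysort _clus_1: egen n_max = max(n_cell)
gen hit = (n_cell == n_max)
sum hit
restore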

* It is looking pretty darn good really.

* With a Cscalar of 2 it is fairly successful at grouping observations 
* into distinct clusters.

* If we reduce the Cscalar then it becomes more difficult.

* Another interesting modification is to ask for fewer or more clusters
* than there are true classes.

cluster kmeans grade friends athletics performance popularity, k(4)

tab Lclass _clus_2

* By doing so we can see that different student classes are grouped together.

* Iconoclasts and divas are grouped together, and super-stars and base
* are grouped together.

cluster kmeans grade friends athletics performance popularity, k(8)

tab Lclass _clus_3

* With too many clusters, some of our classes are now split across
* clusters, probably based on random variation.

* In this simulation I got jocks in clusters 4 and 7.

* Overall, this raises one of the inherent difficulties of cluster analysis:
* there is no definitive way to identify the appropriate number of clusters.
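
* Stata does offer a heuristic aid here, shown as a sketch: cluster stop
* reports the Calinski-Harabasz pseudo-F for a stored cluster analysis, and
* larger values suggest more distinct clusters. It is a rule of thumb, not a
* test, and it assumes the three analyses kept their default names.
cluster stop _clus_1, rule(calinski)
cluster stop _clus_2, rule(calinski)
cluster stop _clus_3, rule(calinski)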

* The best we can do is look at our different clusters and try to characterize 
* them from the values of the observed variables.

bysort _clus_3: sum grade friends athletics performance popularity

* I can see that cluster 1 does not diverge much from the mean, which
* suggests it might be the base or average student. Cluster 2 has poor
* athletics, less popularity, and fewer friends but good grades, suggesting
* nerd. Cluster 3 has low grades, athletics, and friends but is popular,
* suggesting diva, etc. The interesting thing is the difference between
* clusters 4 and 7: they exhibit generally the same pattern, except that 7s
* are substantially higher in athletics, suggesting that the clustering
* separated athletic jocks from average ones.
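
* A more compact way to compare the clusters, as a sketch: tabstat puts the
* cluster means side by side in one table.
tabstat grade friends athletics performance popularity, by(_clus_3) stat(mean)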

* Please don't be offended by this silly category system. You can probably
* guess which category I fell into in high school :P

Formatted By Econometrics by Simulation


  1. Hi! thanks, the article is very interesting. :)
    I would like to comment you guys that the page is completely broken in firefox 24. :(

  2. Your help on this topic was very much appreciated -- including the useful tool of running mlogit with the clusters as the dependent variable to follow in terms of identifying the factors within the construct of the variable which are most defining of each cluster group.

    The syntax for doing this in this example for anyone else interested and reading this post would be:
    mlogit _clus_3 grade friends athletics performance popularity

    Thanks, again, Francis, and keep up the good work! I love the application you simulated for demonstration purposes, by the way :)