## Thursday, September 19, 2013

### Cluster Analysis

* Cluster analysis is class of tools in which you use to group complex data into
distinct clusters based on observable variation.

* Cluster analysis is closely related to the idea of latent class analysis in 
which data is grouped into classes based on observable characteristics.

* Generally speaking, cluster analysis falls within the realm of data mining 
and is usually used for some kind of data exploration.

* As with all of my posts, this is exploratory. 
Please feel free to correct any glaring errors.

* Let's imagine we have clusters of high school 
students defined by:

* A. The Nerds/Geeks
* B. The Iconoclasts
* C. The Jocks
* D. The Super-Stars
* E. The Divas
* F. Overlooked (base) (0)

* And we have a number of observable scales
* B. Friends
* C. Athletics
* D. Performance
* E. Popularity

* Now I will just basically assign a modifier to each of the scales which is 
for each class relative to the overlooked student or the base student.

* Positive means positive base value while negative means negative base value

* p1 through p5 is the proportion of students in this class.

* This variable scales the class effects
local Cscalar = 2

* A. The Nerds/Geeks
local p1 = .15
local friends1 = -1
local athletics1 = -1
local performance1 = 0
local popularity1 = -1

* B. The Iconoclasts
local p2 = .1
local friends2 = 1
local athletics2 = -1
local performance2 = 1
local popularity2 = 1

* C. The Jocks
local p3 = .2
local friends3 = 1
local athletics3 = 1.5
local performance3 = -.5
local popularity3 = 1

* D. The Super-Stars
local p4 = .1
local friends4 = .5
local athletics4 = 1
local performance4 = 1
local popularity4 = 1

* E. The Divas
local p5 = .1
local friends5 = -1
local athletics5 = -1
local performance5 = 1
local popularity5 = 1.5

* Let's first generate some data
clear
set obs 1000

set seed 1

gen assign = runiform()

gen Lclass = 0  // Assume student is part of base first

replace Lclass = 5 if assign<p1'+p2'+p3'+p4'+p5'
replace Lclass = 4 if assign<p1'+p2'+p3'+p4'
replace Lclass = 3 if assign<p1'+p2'+p3'
replace Lclass = 2 if assign<p1'+p2'
replace Lclass = 1 if assign<p1'

* Create a labelbook for Lclass
label define Lclass 0 "base" 1 "nerd" 2 "iconoclast" 3 "jock" 4 "super-star" 5 "diva"
label val Lclass Lclass

tab Lclass

* Now let's generate our observable data assuming everyone is base
gen friends=rnormal()
gen athletics=rnormal()
gen performance=rnormal()
gen popularity=rnormal()

* Now modify each based on the class:
forv i=1/5 {
foreach v in grade friends athletics performance popularity {
* This is going to look fishy so I will use the display command
* to display what is going on in this nested loop.
di "replace v' = v' + v'i''*Cscalar'"
qui replace v' = v' + v'i''*Cscalar' if Lclass==i'
}
}

* So this is what our data might look like except that our Lclass is unobserved 
and we would like to impute it.
* This is what we see when we look at our data




* But we would like to see this:
twoway (scatter athletics grade if Lclass==0) (scatter athletics grade if Lclass==1)  ///
(scatter athletics grade if Lclass==4) (scatter athletics grade if Lclass==5) ,     ///
legend(label(1 "Base") label(2 "Nerd") label(3 "Iconoclast") label(4 "Jock")  ///
label(5 "Super-Star") label(6 "Diva") rows(2))

* Cluster kmeans is define k clusters with each cluster being defined 
* by the mean values in each variable.
cluster kmeans grade friends athletics performance popularity, k(6)

* This generates the variable _clus_1

* We can do a cross tab to check how well our clustering worked.
tab Lclass _clus_1

* It is looking pretty darn good really.

* With a Cscalar of 2 it is fairly successful at grouping observations 
* into distinct clusters.

* If we reduce the Cscalar then it becomes more difficult.

* Another interesting modification could occur if we reduced the groups to a 
* lower or greater number.

cluster kmeans grade friends athletics performance popularity, k(4)

tab Lclass _clus_2

* By doing so we can see that different student classes are grouped together.

* Iconoclast and divas are grouped together and super-stars and base 
* are grouped together.

cluster kmeans grade friends athletics performance popularity, k(8)

tab Lclass _clus_3

* Having too many clusters we now have clusters which split our classes 
* probably based on random variation.

* In this simulation I got jocks in cluster 4 and 7.

* Overall this raises one of the inherent difficulty of cluster analysis, 
* we are incapable of identifying what is the appropriate number of clusters.

* The best we can do is look at our different clusters and try to characterize 
* them from the values of the observed variables.

bysort _clus_3: sum grade friends athletics performance popularity

* I can see that cluster 1 does not diverge much from the mean which 
* suggests it might be the base or average student cluster 2 has poor
* athletics, less popularity, and less friends but good grades suggesting nerd.
* cluster 3 has low grades, athletics, friends but is popular suggesting diva
* etc. the interesting thing is looking at the difference between 4 and 7
* they both exhibit generally the same pattern except 7s are substantially 
* higher athletics suggesting that the clustering identified athletic jocks
* vs average.

* please don't be offended by this silly category system. In high school 
* you can probably guess which category I fell into :P

Formatted By Econometrics by Simulation


## Thursday, September 12, 2013

### Attenuation Bias and Power

* There is some debate as to if attenuation bias which biases coefficient estimates toward zero as a result of measurement error reduces the likelihood of rejecting the null.

* It seems to me that it must, but I may be wrong.

* Using the code from the previous post on classical measurement error we can easy simulate this.

cap program drop simME4
program define simME4
* First argument is number of observations
* Second argument is measurement error in the dependent variable

clear
set obs 1' // The first argument defines how many draws

gen weight = rnormal()^2*2

gen v = rnormal()*2'
gen oweight = weight + v
gen u = rnormal()*10
gen price = 3*weight + u

reg price oweight

test oweight=0

end

simulate p=r(p) b=_b[oweight], rep(2000): simME4 100 0
gen rej = 0
replace rej = 1 if p < .1
sum
* Rejection rate is 100%

simulate p=r(p) b=_b[oweight], rep(2000): simME4 100 20
gen rej = 0
replace rej = 1 if p < .1
sum
* Now are rejection rate is only 24%

* Thus measurement error could place a critical role in failing to reject the null.

Formatted By Econometrics by Simulation

## Wednesday, September 11, 2013

### Classical Measurement Error and Attenuation Bias

* Classical measurement error is when a variable of interest either explanatory or dependent variable has some measurement error independent of its value.

* We can think of this as the noisy scale phenomenon.

* Imagine that you have a remarkably unreliable scale.

* Every time you stand on it it varies by and average of 10 pounds.

* You know that your weight does not vary by 10 pounds and that every time you get on the scale there is a different value returned.

* If you are clever, you may map out the differences in weights and establish a distribution.

* Then you may establish a decide how accurate you need your weight measurement to be (for instance within 1 pound of the true) at a 95% level.

* This line of reasoning will lead you to specific number of times you must weigh yourself on the scale and take the average of those weights in order to be confident of your weight.

* The formula is fairly straightforward.  We know standard deviation of the measurement is 10 pounds.

* We know the standard error of a mean estimate is sd/root(n)

* Thus we need SE(95% CI) = 1/2 = 10/root(n)

* solve for n: root(n) = 20 or n = 400

* So we need to measure ourselves around 400 times to ensure that the average is within 1 pound of the true 95% of the time.

* Let's see this in action.  First I will design a simulation that generates the values.

cap program drop simME
program define simME

clear
set obs 1' // The first argument defines how many draws

* Let's say the true weight is always 210
gen weight = 210 + rnormal()*10

sum weight

end

simME 400
* Chances are the weight is within 1 of 210

* Let's repeate this 2000 times to see how frequently we miss by more than 1
simulate, rep(2000): simME 400

gen miss = 0
replace miss = 1 if mean<209 span="" style="color: #990033;">mean
>211
sum // I get a miss rate of a little less than 5% which is basically perfect!

* Changing our sample size to 200 will affect our confidence in the outcome
simulate, rep(2000): simME 200
gen miss = 0
replace miss = 1 if mean<209 span="" style="color: #990033;">mean
>211sum // Now the miss rate is about 15%

* Before putting moving on to think about how measurement error can affect our estimates, reflect for a moment on the similarities between statistical hypothesis testing and sampling error and controlling for measurement error by repeatedly taking measures.

* In statistical modeling, we have an underlying model for which we would like to understand how well our parameters fit.

* We would like to just look at one or two observations in order to draw conclusions but we assume there is sampling error and in order to minimize that sampling error we draw more observations.

* We assume an underlying distribution of sampling error and use estimates of the parameters of that distribution to estimate how much confidence we can have in our estimated outcome.

* This variance in estimated outcome is really very similar to the idea of measurement error.

* After some thought I think the two proceedures are exactly identical with different names!

* So enough about that! Let's see how measurement error affects our estimates.

* First let's assume we are trying to model weight gain among cattle and we are using our noisy scale to measure the outcome variable.

* Let's say that there is some linear relationship between calories consumed and weight.

* Weight = calories*B + u

* Let's think that there is some measurement error so that our observed weight (Weight') is equal to true weight (Weight) plus the measurement error (v).

* Weight' = Weight + v

* So we need substitute our observed Weight' into our model:

* Weight' = Weight + v = calories*B + u

* Weight = calories*B + u - v

* If measurement error (v) is uncorrelated with Weight and calories then measurement error of the dependent variable which just cause it to have less precision in the estimator since the combined error e=u+v (assuming that v is not negatively correlated with u).

* Let's see this:

cap program drop simME2
program define simME2
* First argument is number of observations
* Second argument is measurement error in the dependent variable

clear
set obs 1' // The first argument defines how many draws

gen calories = rnormal()^2*10

gen u = rnormal()*10

gen v = rnormal()*2'

* Let's say the true weight is always 210
gen weight = 200 + calories + u + v

reg weight calories

end

* First with no measurement error
simulate, rep(2000): simME2 100 0
sum

simulate, rep(2000): simME2 100 10
* We can see that there is no bias introduced by measurement error. Only less precision in estimates (larger standard deviation).
sum

* Thus it does not change the fundamental model that our outcome variable is hard to measure, it only diminishes our ability to detect real effects from the changes.

* Now let's look at a slightly more interesting case, when the explanatory variable has some measurement error.

* Let's think once again we are measuring weight at 6 months and that is a predictor of sale price of cattle at 12 months.

* the model we want to estimate is: price = Weight*3 + u

* Weight is once again measured noisily (Weight').

* price = Weight'*3 + u
* price = (Weight+v)*3 + u
* price = Weight'3 + v*3 + u

* Though this might look like it does not present a problem, it does.

* Now the explanatory variable Weight' is correlated with the error term e=v*3+u since cov(Weight',v) = var(v) > 0

* If we right OLS in terms of covariance terms we get:

* Bhat=cov(Weight',price)/var(Weight')
* Bhat=[cov(Weight, price) + cov(v,price)]/var(Weight')
* Substituting in the true price model:
* Bhat=[cov(Weight, Weight + u) + cov(v,Weight + u)]/var(Weight')
* Bhat=cov(Weight, Weight + u)/var(Weight') + cov(v,Weight + u)/var(Weight')

* We know that cov(v,Weight + u) = 0 so the second term is zero so what is the problem?
* The problem lies in: Bhat=cov(Weight, Weight + u)/var(Weight')=cov(Weight, Weight + u)/(var(Weight)+Var(v))

* Given var(v)>0 : |Bhat| < |B|
* Thus attenuation bias!

* Let's see it in action!

cap program drop simME3
program define simME3
* First argument is number of observations
* Second argument is measurement error in the dependent variable

clear
set obs 1' // The first argument defines how many draws

gen weight = rnormal()^2*10

gen v = rnormal()*2'

* Generate the observed weight
gen oweight = weight + v

gen u = rnormal()*10

* The real pridictor of price is weight not observed weight.
gen price = 3*weight + u

reg price oweight

end

* First with no measurement error and no problems
simulate, rep(2000): simME3 100 0
sum

simulate, rep(2000): simME3 100 10
* We can see that there is no bias introduced by measurement error. Only less precision in estimates (larger standard deviation).
sum

* We can see there is now a strong bias towards zero in our estimates.

Formatted By Econometrics by Simulation

## Sunday, September 8, 2013

### Maximum Likelihood Estimation and the Origin of Life



# Maximum likelihood Estimation (MLE) is a powerful tool in econometrics and statistics which allows for the consistent and asymptotically efficient estimation of parameters given a correct identification (in terms of distribution) of the random variable.

# It is also a proceedure which seems superficially quite complex and intractible, but in reality is very intuitive and easy to use.

# To introduct MLE, lets think about observing a repeated coin flip.

# Let's say we know it is not a fair coin and we would like to figure out what the likehood of heads parameter popping up is.

# Let's define out likelihood function this way:

lk <- function(responses, theta) {
# This likelihood function returns the likelihood of (a priori) observing a
# particular response pattern given the likelihood of seeing a head on the
# coin is theta.

returner <- 0*theta
# Define a vector to hold likelihoods in case theta is a vector

for (i in 1:length(theta)) returner[i] <-
prod(theta[i]^responses) * # The likelihood of seeing the head's pattern
prod((1-theta[i])^(1-responses)) # The likelihood of seeing the tails's pattern

returner
}

# Let's see how our likelihood function is working.
# Let's see we flip the coin once and we observe a heads
# and our initial guess is that the coin is fair.
lk(c(1), .5)
# [1] 0.5
# Our likelihood of this being true is .5 (we know this is right)

# What is we get a tail?
lk(c(0), .5)
# [1] 0.5
# .5 as well.  This seems right.

# Now let's try something more interesting.
# Let's say we get two heads:
lk(c(1,1), .5)
# [1] 0.25
# .5^2=.25

# This is standard probability. But, let's now ask:
# What if the coin was not fair (as we originally guessed)?

# Is there a better parameter set that would lead to our observed outcome?
# Let's say, there is a 75% chance of getting heads.
# What is the likelihood of seeing our 2 heads.
lk(c(1,1), .75)
# [1] 0.5625
# So if the coin was unfair at 75% then our likelihood of observing the
# outcome we did would increase from 25% to 56%.

# How about is the coin were 95% in favor of heads.
lk(c(1,1), .95)
# [1] 0.9025
# Now we are up to 90%.

# You probably see where this is going.
# If we assume we know nothing about the underlying distribution of the coin
# parameter, then the parameter which fits best is 1 (100%).
# Thus if we see only positive outcomes then the most likely guess at the
# underlying distribution of outcomes is that every time you flip the coin it

# We can see this mapped out.
theta.vect <- seq(0,1,.05)

plot(theta.vect, lk(c(1,1), theta.vect), main="HH",
ylab="Probability", xlab="Theta")

# It is worth noting that though there are different probabilities
# for each unfairness parameter. We can only definitively rule out
# one option after flipping the coin twice and getting heads.
# That is that it is impossible for the likelihood of getting a head=0.

# However, all other options are available. How do we choose from these options?

# Well... naturally, we choose the option which maximizes the probability
# of seeing the observed outcome (as the name implies).

# Now let's make things a little more interesting. Let's say the third
# time we flip the coin it ends up tails.

# If we assume it is a fair coin then:
lk(c(1,1,0), .5)
# [1] 0.125

lk(c(1,1,0), .75)
# [1] 0.140625
# More likely than that of a fair coin but not as dramatic an improvement from
# when we saw only two heads.

# Let's see how the graph looks now.
plot(theta.vect, lk(c(1,1,0), theta.vect), main="HHT",
ylab="Probability", xlab="Theta")

# We can see that our graph is finally beginning to look a bit more interesting.
# We can see that the most likely outcome is around 65%.  For those of us
# a little ahead of the game the most likely probability is the success rate
# or 2/3 (66.6%).

# But the importance of the exercise to think about why 66.6% is parameter
# we select as the most likely.
lk(c(1,1,0), 2/3)
# [1] 0.1481481
# Not because it is overwhelmingly the best choice.

# It is only 2.3% (0.148-0.125) more likely to occur than if it were a fair coin.
# So we really are not very confident with our parameter choice at this point.

# However, imagine instead for one moment, if we observed the same ratio but with
# 300 coins.
cpattern <- c(rep(1,200), rep(0,100))

lk(cpattern, 2/3)
# [1] 1.173877e-83

lk(cpattern, 1/2)
# [1] 4.909093e-91

# Now, in terms of percentages the differences are extremely small.
# So small that it is hard to compare.  The plot can be useful:

plot(theta.vect, lk(cpattern, theta.vect),
main="(HHT)^100", ylab="Probability", xlab="Theta")

# Let's see what happens if we increase our number of coins to
# 3000
plot(theta.vect, lk(rep(cpattern,10), theta.vect),
main="(HHT)^1000", ylab="Probability", xlab="Theta")

# In this graph we can see the first major computational problem
# when dealing with likelihoods. They get so small, they are hard
# to manage in raw probabilities.  In this case the digits get
# rounded into 0 so that all R sees is 0.

# Fortunately, the maximum of a function is the same maximum
# (in terms of parameter choices) as a monotonic transformation
# of a function.  Thus we can rescale our probabilities
# before multiplication using logs creating the log likelihood
# function which produces parameters which vary in scale much less
# dramatically.

# MLE functions always maximizes and reports the "log likelihood"
# value rather than the "likelihood".

# However, in this discussion it is worth noting that there is
# a somewhat useful statistic that we can produce to compare
# the likelihoods of the fair coin hypothesis with that
# of the 2/3 biased hypothesis.

# That is the odds ratio of the two outcomes.  How much more
# likely (multiplicatively) is our outcome to be observed
# if the coin is unfair towards 2/3 heads rather than fair?

lk(cpattern, 2/3)/lk(cpattern, 1/2)
# [1] 23912304
# That is to say, the outcome in which 2/3 rds of the time
# we get heads for a coin flip of 300 coins is 23 million
# times more likely to occur if our coin
# is unfair (66.6%) over that of being fair (50%).

# This is a pretty big number and thus very unlikely to occur
# relative to that of a fair coin. Comparing accross all possible
# outcomes, we would find that while this ratio is not always as
# large, for example if we are comparing 2/3s to .6
lk(cpattern, 2/3)/lk(cpattern, .75)
# [1] 183.3873
# But it can still be quite large.  In this case, we are 183 times
# more likely to see the outcome we saw if we chose 2/3s as our parameter
# choice compared with 3/4ths.

# Looking at the raw probability we see
lk(cpattern, 2/3)
# [1] 1.173877e-83
# or 1 out of 8*10^82 outcomes.

# Thus the likelihood of a particular event ever occurring is very small, even
# given the most likely hypothesis (theta=2/3).
# However, compared nearly all other hypothesis such as (theta=1/2 or 3/4)
# the event is much more likely to have occurred.

# And THAT is why creationists are right to say it very unlikely
# in absolute terms that evolution brought about the origin of
# life on earth yet are also completely wrong because compared
# with all other available hypotheses that is the only one
# which is remotely likely (at least from the series of
# outcomes that I have observed) to have occurred makes its odds
# ratio very high in my mind.

Created by Pretty R at inside-R.org

## Thursday, September 5, 2013

### ThinkNum - A new interactive public database and graphing tool!

Thinknum.com is a new free data service similar to Quandl.com in that they organize and list publicly available data sets.  Currently they are using data from 2,000 data sources which sounds good compared with Quandl's 400+ source.  Overall, in terms of data points, I don't know which has more.  But that is not really the point.

What it does have however is an advanced user interface in which users can specify functions which they would like to perform analysis on right off of the ThinkNum website.  ThinkNum maintains a high level of credibility in how these functions are implemented in that they list all of the code that enters each of the functions so that there is no behind the curtain ambiguity.

ThinkNum can easily generate a wide range of charts using data names to identify which data sets to use and functions to define how to modify the data sets.  For example:

Say you were interested in how google stock price is moving with respect to the S&P 500.  You want to compare both the daily rate and the 30 day simple moving average and have them plotted on the same graph.  The following command will do this. sma(^spx,30);^spx;sma(goog*2,30);(goog*2); sma is the "simple moving average" while ^spx specifies the  S&P 500 with goog Google and the 30 is the 30 day moving average.  Note, I have multiplied the Google by 2 so that we can see them more clearly on the same graph since the stock price of Google is about half that of the S&P and price scales are arbitrary generally.

function: sma(^spx,30);^spx;sma(goog*2,30);(goog*2);

They have dozens of other functions. You can browse the function library here: http://www.thinknum.com/browseFunctions

In addition, ThinkNum has released an R package called "Thinknum" with the single function "Thinknum" which allows for time series data to be imported directly into R.

install.packages("ThinkNum")

library("Thinknum")
goog <- Thinknum("goog")
plot(goog, type="l")

In order to pass arguments to R in the same manner as the ThinkNum website I have written a small function which can read multiple calls to Thinknum and returns them as either a wide or tall data.frame

mThinknum <- function(command, tall=F) {

require("Thinknum")

# Break the command into seperate calls
think.list <- strsplit(command, ";")[[1]]

# Look through each element of the think.list
for (i in 1:length(think.list)) {

if (!tall) {
if (i==1) returner <- Thinknum(think.list[1])
if (i>1)  returner <- merge(returner,Thinknum(think.list[i])
, by.x="date_time", by.y="date_time")
}
if (tall) {
tempdat <- Thinknum(think.list[i])
names(tempdat) <- c("date_time", "value")

if (i==1) returner <- data.frame(tempdat, call=think.list[i])
if (i>1) returner <- rbind(returner, data.frame(tempdat, call=think.list[i]))
names(returner) <- c("date_time", "value", "call")
}

cat(paste(rep(" ", max(1,20-nchar(think.list[i]))), collapse=""))
# Insert spaces
cat(paste("Dimensions:", paste(dim(returner), collapse="x"), "\n"))
# Show dimensions
}

# Ensure the return file has appropriate column names
if (!tall) names(returner) <- c("date_time", think.list)

data.frame(returner)
}

test <- mThinknum("^spx;sma(^spx,30);sma(goog*2,30);(goog*2)")

# Let's try plotting with the base package

plot(x=c(min(test$date_time), max(test$date_time)),
y=c(min(test[,2:5]),max(test[,2:5])), type="n",
main="Plot in R")

lines(test$date_time, test[,2], type="l") lines(test$date_time, test[,3], type="l", col="red")
lines(test$date_time, test[,4], type="l", col="blue") lines(test$date_time, test[,5], type="l", col="darkgreen")

# Let's do the same but now let's use ggplot2

# In order to do this let's use the tall option for mThinknum

test2 <- mThinknum("^spx;sma(^spx,180);sma(goog*2,180);(goog*2)", tall=T)
# I have extended the running average to be 180 days rather than 30 because the scale is
# so large.

library("ggplot2")

ggplot(data=test2[test2\$date_time>as.Date("2004-01-01"),],
aes(x=date_time, y=value, group=call)) +
geom_line(aes(colour = call))

Created by Pretty R at inside-R.org

Overall, ThinkNum seems to be a pretty cool tool for supplying data and building analytical graphs.  And in addition to all of this, ThinkNum also provides a platform for doing counter-factual data analysis!  This I will not go into since I have not played around with it much.  However, it appears very interesting!  See the below example on how ThinkNum may be used to calculate expected price of Google.  I don't know anything about finance so this is all way over my head.

http://www.thinknum.com/cashflowmodel/?ticker=goog

## Tuesday, September 3, 2013

### When LAD is more efficient than OLS

* There are many potential cases when LAD (least absolute deviations) is a more e

fficient estimators of the linear model than OLS (ordinary least squares).

* These cases are characterized by data in which tails of the errors tend to be wide.

* With errors distributed normally we know that OLS is the most efficient estimator.

* Let's define a small command to show us the relative estimator efficiencies

clear
set obs 1'  // set the observations equal to the first argument

if "2'"=="norm" gen u = rnormal()*5

* Since Stata does not have a built in LaPlace distribution
* I will simulate one by solving for the probability
* and generating the error term from an inversion of a uniform draw.
if "2'"=="laplace" {
gen u0 = runiform()
gen u1 = rbinomial(1,.5)
gen u = (-1)^u1 * ln(2*u0) * 6
}

gen x = rnormal()

gen y = x*1 + u

reg y x
return scalar OLSb = _b[x]

qreg y x

end

hist u, title("Normal Distribution")

hist u, title("LaPlace Distribution")

* Seems to be working well, let's simulate it 1,000 times.

sum
* We can see that both models produce unbiased estimators though OLS
* has a standard deviation which is about 3/4 the size of LAD