Tuesday, July 31, 2012

3 ways to write a probit command


* Probit is a commonly used command to model binary outcome variables.

* The maximum likelihood syntax in Stata follows a specific structure.
* I will first define three equivalent ml programs.  Then we will maximize each to see how they perform.

cap program drop myprobit1
program define myprobit1
  * Tell Stata that this code is written in version 11
  version 11
  * Define the arguments input in the maximum likelihood function
  * Stata automatically passes the first argument as a temporary variable name
  * that is used to store the log of the likelihood for each observation.
  args ln_likelihood xb
  * The probit directly maximizes the probability of seeing the observed outcome (either y==1 or y==0)
  * If y==1 then we maximize the probability of seeing a positive outcome
  qui replace `ln_likelihood' = ln(  normal(`xb')) if $ML_y1==1
  * If y==0 then we want to maximize the probability of seeing a negative outcome
  qui replace `ln_likelihood' = ln(1-normal(`xb')) if $ML_y1==0
end

cap program drop myprobit2
program define myprobit2
  version 11
  args ln_likelihood xb
  * Because of the symmetry of the normal CDF the following pair of expressions is equivalent.
  qui replace `ln_likelihood' = ln(normal( `xb' )) if $ML_y1==1
  qui replace `ln_likelihood' = ln(normal(-`xb' )) if $ML_y1==0
end

cap program drop myprobit3
program define myprobit3
  version 11
  args ln_likelihood xb
  * This is a somewhat opaque way of writing the same thing as above:
  * the term (-1)^(1-$ML_y1) equals 1 when y==1 and -1 when y==0, flipping the sign of xb for negative outcomes.
  * I do not prefer it to the first two forms except that it may run slightly faster
  * since it is a single command.
  qui replace `ln_likelihood' = ln(normal((-1)^(1-$ML_y1)*`xb'))
end


clear
set obs 1000

gen x1 = rnormal()*.5
gen x2 = rnormal()*.5
gen u = rnormal()*(.5^.5)

gen p=normal(x1+x2+u)

gen y=rbinomial(1,p)

probit y x1 x2
* This is the benchmark

ml model lf myprobit1 (y=x1 x2)
ml maximize
* Basically works the same except that no pseudo R2 is reported.

timer on 1
ml model lf myprobit2 (y=x1 x2)
ml maximize
timer off 1

timer on 2
ml model lf myprobit3 (y=x1 x2)
ml maximize
timer off 2

timer list

di "myprobit2 is " r(t1)/r(t2) " times slower than myprobit3"

* The following is a brief exploration attempting to run two probits at once.
* It shows how ml can optimize two equations jointly
cap program drop mytwoprobit1
program define mytwoprobit1
  version 11
  args ln_likelihood xb1 xb2
  qui replace `ln_likelihood' = ln(normal((-1)^(1-$ML_y1)*`xb1')) ///
                              + ln(normal((-1)^(1-$ML_y2)*`xb2'))
end

* This is equivalent to the previous program except that it uses the binormal distribution
* (with correlation zero) rather than two separate normals.  I tried specifying the correlation
* as a third equation but that did not work.  I will need to play around a bit more to figure
* out how to program a biprobit; a sketch of one possible approach appears at the end of this post.
cap program drop mytwoprobit2
program define mytwoprobit2
  version 11
  args ln_likelihood xb1 xb2
  qui replace `ln_likelihood' = ln(binormal((-1)^(1-$ML_y1)*`xb1', ///
                                            (-1)^(1-$ML_y2)*`xb2', ///
                                            0))

end

clear
set obs 1000

gen x1 = rnormal()*.5
gen x2 = rnormal()*.5

gen p1=normal(x1-x2)
gen p2=normal(-x1+x2)

gen y1=rbinomial(1,p1)
gen y2=rbinomial(1,p2)

ml model lf mytwoprobit1 (y1:y1=x1 x2) (y2:y2=x1 x2)
ml maximize

ml model lf mytwoprobit2 (y1:y1=x1 x2) (y2:y2=x1 x2)
ml maximize
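
* As promised above, here is a rough sketch (untested) of one way the correlation could be
* brought in as a third equation: estimate the inverse hyperbolic tangent of rho as a
* constant-only ml equation so that rho itself stays bounded in (-1,1).
* The program name mybiprobit and this parameterization are my own suggestion, not Stata's biprobit.
cap program drop mybiprobit
program define mybiprobit
  version 11
  args ln_likelihood xb1 xb2 athrho
  tempvar rho
  * rho = tanh(athrho) keeps the correlation between -1 and 1
  qui gen double `rho' = tanh(`athrho')
  * The sign trick (-1)^(1-y) = 1 if y==1 and -1 if y==0, as in the programs above
  qui replace `ln_likelihood' = ln(binormal((-1)^(1-$ML_y1)*`xb1',  ///
                                            (-1)^(1-$ML_y2)*`xb2',  ///
                                            (-1)^(1-$ML_y1)*(-1)^(1-$ML_y2)*`rho'))
end

* The /athrho term adds the constant-only equation for the transformed correlation.
ml model lf mybiprobit (y1: y1 = x1 x2) (y2: y2 = x1 x2) /athrho
ml maximize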

Monday, July 30, 2012

The Joy of Logic - Stata logical operators


* Both Stata and R handle logical operators in a similar fashion

* The two main binary logical operators are & (and) and | (or); ! (not) negates an expression

if (1==2 & 2==2) di "both 1 = 2 and 2 = 2"
* This message is not displayed

if (1==2 | 2==2) di "either 1 = 2 or 2 = 2 (obviously 2==2)"

* The ! (not) operator can easily flip the value of an expression
if !(1==2 | 2==2) di "neither 1 = 2 nor 2 = 2"
  * Is not displayed

* You might be wondering what the values of these logical expressions are

di "True: " 1==1 ", False: " 3==2

* Logical operators also follow an order of operations (precedence).

* Thus

if (1==1|2==3)|(3==3&3==2) di "Will display because either 1==1 or 2==3 (obviously 1==1) or 3==3 and 3==2 (and 3==3)"
if (1==1|2==3|3==3)&(3==2) di "Will not display because either 1==1, 2==3, or 3==3 (we know 1==1, and 3==3) but 3 does not equal 2."

* Order of operations matters.
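
* A minimal illustration of the precedence rule: in Stata, & is evaluated before |.
* The first line displays because it is read as 1==1 | (2==2 & 3==4).
* The second line does not display because the parentheses force (1==1 | 2==2) to be
* evaluated first, and that result is then ANDed with the false statement 3==4.
if 1==1 | 2==2 & 3==4   di "& binds more tightly than |"
if (1==1 | 2==2) & 3==4 di "this message is never shown"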

* Any logical operator combination can be specified two ways.
if (1==1|2==3) di "Straightforward method"
if !(1!=1&2!=3) di "Alternative display (more clunky)"

Sunday, July 29, 2012

If statements and relational operators


* In Stata there are two forms of if statements.

* Let's first generate some data
clear
set obs 1000

forv i=1/4 {
  gen x`i'=rnormal()
}


* 1. There is the if qualifier attached to a command, which is evaluated observation by observation based on some rule.

* For instance:

replace x1 = 10000 if x1>0

sum

* We can see that some values of x1 are now equal to 10000 while all others are less than or equal to zero.

* 2. This is in contrast to the following commands

* This will sort the elements of x2 to put large values first
gsort -x2

if x2>0 replace x2 = 10000

* and

gsort x3

if x3>0 replace x3 = 10000

sum

* We can see that these two commands produced dramatically different results.

* This is because when the "if" statement is listed before a command (the if command), it is only evaluated once.

* And when that expression refers to a variable, only the first element (observation) of that variable is evaluated.

* Thus the first element of x2 was greater than 0, so the command "replace x2 = 10000" was executed (for all elements of x2), while the first element of x3 was less than zero, so the replace command was not executed at all.
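
* A minimal illustration of this point: when a variable appears in a command-level if
* statement, only its first observation is checked, so the two lines below test exactly
* the same condition.
if x4 > 0    di "The first element of x4 is positive"
if x4[1] > 0 di "The first element of x4 is positive (written explicitly)"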

* Both types of if statements are useful, though the second one tends to be used more frequently by programmers.

* Say you want to generate 10 variables with half drawn from a uniform and half from a normal distribution.

forv i=1/10 {
  if mod(`i',2) == 0 gen z`i' = rnormal()
  if mod(`i',2) == 1 gen z`i' = runiform()
}
sum
* We can see that the even numbered zs are normally distributed while the odd are uniformly distributed.
* Note: a summary command is usually not sufficient to diagnose distributions.

* It is very useful to become knowledgeable about relational operators.

* To see the help file on operators look at
help operators

Thursday, July 26, 2012

7 ways to Speed up your do files

* 1. Drop unused data.  When you have the choice between using if statements to exclude unused data or dropping that data, dropping the data is usually faster.


  * I am first going to clear the timer memory from Stata since we are going to use it.
  forv i = 1/20 {
    cap timer clear `i'
  }


  clear
  set obs 100000

  gen z = rbinomial(1,.5)

  forv i = 1/10 {
    gen x`i' = rnormal()
    forv ii = 1/40 {
      gen z`i'`ii' = rnormal()*(1/20)^.5
    di "loop `i' unneccessary variable `ii'"
    }
  }

  gen u = rnormal()*(10)^.5*(1/20)^.5

  gen y = rbinomial(1,normal(x1-x2+x3-x4+x5-x6+x7-x8+x9-x10+u))

  timer on 1
  probit y x* if z==1
  timer off 1

  drop if z!=1

  timer on 2
  probit y x*
  timer off 2

  timer list 1
  timer list 2

  * On my computer the first probit took 0.5350 seconds and the second probit took 0.4430.

* 2. Drop unnecessary variables

  keep x* y

  timer on 3
  probit y x*
  timer off 3

  timer list 3

  * This is the same with temporary variables.

  * Unnecessary temporary variables clog as much active memory as any other variables until they are dropped.

  * Dropping the unused variables decreased the computational time to 0.3180 seconds.

* 3. Remove visual feedback

  timer on 4
  qui probit y x*
  timer off 4

  timer list 4

  * Quieting the display decreases run time to 0.2990.  The gain is modest here because most of probit's time is spent on the underlying computation rather than on displaying output.

  * With other commands the gain from turning off the feedback is much greater.
  timer on 5
  forv v=1/100 {
    sum y
  }
  timer off 5

  timer on 6
  forv v=1/100 {
    qui sum y
  }
  timer off 6

  timer list 5
  timer list 6
  * Displaying the results of the sum command causes Stata to run about 74% slower.

* 4. Remove redundant data writing and reading commands.
  tempfile tempdata

  timer on 7
  forv v=1/100 {
    qui save `tempdata', replace
    qui sum y
    qui use `tempdata', clear
  }
  timer off 7

  timer list 7
  * Reading and writing cause the loop to take about five times as long to complete.

* 5. Choose faster commands that approximate time-consuming ones.

  * For instance, if you are really interested in the average partial effect (APE) from a probit and your computer is getting bogged down, then try using a linear probability model instead, since its APE tends to be very similar to that of a probit.
  timer on 8
    probit y x*
    margins, dydx(x1 x2 x3 x4 x5 x6 x7 x8 x9 x10)
  timer off 8

  timer list 8

  timer on 9
    reg y x*
  timer off 9

  timer list 9

  * While the linear probability model is not as good a fit as the probit (we know how the data were generated, and the pseudo-R2 is larger than the R2),
  * it gets similar results and runs about 560 times faster.

* 6. Minimize unused calculations.

  * Imagine that you have decided to bootstrap the standard errors on your estimators.
  * If that is the case then it is completely unnecessary in OLS to use robust standard errors because it just means more calculations.
  timer on 10
    bs: reg y x*, robust
  timer off 10

  timer on 11
    bs: reg y x*
  timer off 11

  timer list 10
  * On my computer this command takes 9.6 seconds
  timer list 11
  * While second one only takes 7.6

* 7. Minimize your data footprint

  * Compress will reduce your large data elements to smaller data elements when possible.
  compress

  * Sometimes it is helpful to convert your data from string variables to factor variables.
  gen longstring = "This is a really big string that is going to take up a lot of memory.  Because the longer the string you have the more space Stata needs to reserve for that sting within the variable even if all of the other values are small" if _n == 1

  timer on 12
    probit y x*
  timer off 12

  encode longstring, gen(factor_longstring)

  desc

  * Variables x1-x10 take up only 4 bytes a piece or 40 bytes collectively.
  * We can see that the storage type of the string is str244, which means it takes up 244 bytes per observation.
  * The data type of the encoded (factor) version is a long, meaning it takes only 4 bytes.
  * Note: This can be misleading. If you have many unique factor values then they are going to also take up a lot of space.

  drop longstring

  timer on 13
    probit y x*
  timer off 13

  timer list 12
  timer list 13

  * By reducing the size of the longstring to a factor variable we free up memory for Stata to use and speed up the probit regression by about 8%.


Wednesday, July 25, 2012

The Return Command


* In Stata the return command is an essential tool that any aspiring programmer should be aware of.

* Note, I have two threads of comments going on in this code.  Hopefully, it is coherent.
* The comments that are written like this /* HELLO I AM YOUR COMMENT */ refer
* to a method to identify coefficients when the error is multiplicative.

* Within Stata there are two methods by which commands yield feedback to the user.

* One method is the visual display.

* Let's start with a little generated data

 clear
 set obs 10000
 gen x = rnormal()*10
 gen u = exp(cos(rnormal()*10))
 gen y = 30*x*u  /* Imagine we have a multiplicative error and we want to
                       identify the coefficient 30.  */

* None of the above commands gives any kind of return except a visual one (which I sometimes find frustrating).
* However, there are a lot of commands that do return results.
 gen add_error = y-30*x
 corr add_error x  /* We can see that the "additive" error is correlated with the
                      explanatory variable */

* corr is an rclass command; the majority of commands store their returns as rclass.
* to see the returns from corr
 return list
* We can target the returns in the form that Stata lists:
 di "We will have difficulty estimating the 30 coefficient on x"
 di "because the addative error is correlated with x at a level = " r(rho)

reg y x   /* This badly fails at identifying the 30 because the error is
             correlated with the explanatory variable x */
* Most regression/estimator commands are "eclass" commands and store their returns as ereturns.

* To see those returns:
ereturn list

* These returns can be targeted if you would like to store a value as a macro.
global rss_1 = e(rss)
global r2_a  = e(r2_a)

* Matrices can be targeted in a similar fashion
matrix b1 = e(b)
matrix list b1

* Most coefficients are also saved as system variables
global bx = _b[x]

* When you do your next eclass command it will overwrite the stored values from the previous one
gen lny = ln(y) /* We may attempt to isolate the coefficient by taking natural logs.
                   This makes ln(y) = ln(30*x*u) = ln(30) + ln(x) + ln(u)             */
gen lnx = ln(x)

reg lny lnx
 di "The " exp(_b[_cons]) " is closer to 30 than the coefficient in the direct estimation ${bx}"
 /* Note that the only reason this works as well as it does is because u is defined so that
    E(ln(u))=E(ln(exp(rnormal())))=E(rnormal())=0 and because there is not constant in the first equation.
If there were a constant in the first equation then we would have
y=c+bxu -> lny=ln(c+bxu) which does not help us. */
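
* As noted above, the e() results now belong to the most recent eclass command (reg lny lnx).
* A quick check that the earlier results were overwritten while the stored global survives:
di "e(rss) from the log-log regression: " e(rss)
di "rss stored from the first regression: " ${rss_1}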

Tuesday, July 24, 2012

Congressional culture of Corruption

Such an important issue.  This must change. 'The Congressional Culture of Corruption'

Write your own estimator and bootstrap the standard errors


* Let's imagine that you have some new command that you would like to use and you do not know how to solve the math to calculate the standard errors.

* For instance, two stage quantile regression would be extremely challenging.
clear
set obs 1000

* There is an exogenous component to w which is z
gen z = rnormal()

* There is an endogenous component to w which is v
* It is drawn from a Cauchy distribution.
* See a previous post for more information on drawing from non-standard distributions.
gen v = tan(_pi*(runiform()-.5))

gen w = z + v

* There is also a standard additive error in the model
* u1 is a portion of the error that is independent of w
* It is drawn from a Cauchy distribution.
gen u1 = tan(_pi*(runiform()-.5))

gen u = (v+u1*3)*5

* Thus y is generated this way
gen y = -14 + 5*w + u

* In a Cauchy distribution the median is defined but the mean and variance are not.
* Thus it is inappropriate and often infeasible to use OLS or many common estimators.

* Quantile regression however is theoretically justified.

* That said:
reg y w
* Looks like it is working fine.

* But just for the sake of doing the consistent thing:
qreg y w

* We can see that both estimators are heavily biased because of the endogeneity of w.

* We want to estimate a two stage quantile regression
qreg w z

predict w_hat

qreg y w_hat
* It looks pretty good, but we need to estimate the standard errors properly because the ones generated by simply running qreg on the predicted values ignore the variance from the first-stage prediction.

* We will write a short program:

cap program drop twostageqreg
program define twostageqreg

  qreg w z
    * First let's drop w_hat if it is already defined
    cap drop w_hat

    predict w_hat
  qreg y w_hat

end


* Now let's bootstrap
bs: twostageqreg

* Our two stage bootstrap looks like it is working pretty well.
* The standard errors are smaller than in the second stage estimate.
* This is somewhat surprising.

Monday, July 23, 2012

Draw n values from a defined set


# This is a simple function (program in Stata) that takes an input vector of values (d) and randomly draws from it a subset of values of length n.

# We want to program our own function to draw n values from a set of possible draws
pull <- function(n,d) {
  # Create an empty vector to store the output
  v <- c()
  # Loop from 1 to n and draw one observation from d at random on each pass
  for(i in 1:n) v=append(v, d[ceiling(runif(1)*length(d))])
    # Okay, so I know this looks pretty dense.  The for loop allows a single line of code to follow on the same line.  So it is just saying: loop through i from 1 to n, and on each pass append one value to the end of the vector v.  That value is a random draw from the vector d.
  # Return the compiled vector of random draws v
  v
}

# Now let's see how well our little code works.
pull(20,c("Agent Bob","Agent Sue","008","Agent Henrietta"))

# This will draw five numbers randomly between 1 and 10
pull(5,1:10)

# One can do this same kind of thing in Stata.  However, it would be much less elegant.

###########
# After writing this function I realized my original solution was not nearly as elegant as it could have been.
###########


# We want to program our own function to draw n values from a set of possible draws
pull <- function(n,d) d[ceiling(runif(n)*length(d))]

# Now let's see how well our little code works.
pull(20,c("Agent Bob","Agent Sue","008","Agent Henrietta"))

# This will draw five numbers randomly between 1 and 10
pull(5,1:10)


# This alternative function specification uses the preferred vector forms in R's language.
# To get a hint of what it is doing play around with this.
vect1 <- c("a","b","c","d","e")
# To see vect1 just type
vect1

# To get a single element of vect1
vect1[2]

# To get a subset of vect1
vect1[c(1,2,3)]

# To get a subset with repetition
vect1[c(1,2,1,3,1)]

#Now
ceiling(runif(20)*length(vect1))

# This is a nice way of figuring out how code is working in R.  Simply take it apart and see what it does by itself.

Sunday, July 22, 2012

Why I am Switching to R

Dear Reader,

          As I am sure you are aware, the majority of my posts are in Stata.  My primary motivation for choosing to post in Stata has been that this is the program which the majority of the professors I work for have been using (Jeffrey Wooldridge, Andrew Dillon, etc.).  However, I have been and continue to be  a reluctant user of Stata.  Frankly, I do not see a reason to buy a program that costs upwards of $800 (often more) in order to do work that could be done just as well on a program which is free.
              
         So far I have been content to use Stata because I am more fluent in it than in R.  However, Stata has now released a new version (version 12) and I am stuck back in version 11, unwilling to pay $595 to upgrade.  So unless someone decides to upgrade my version for me (not very likely), most of my future posts are going to be in R.  Hope you find them useful!
               
Francis

The Importance of Independence


# Simulation in R
# It is a common assumption in most econometric models that data is drawn randomly from the population.

# However, this is often not the case due to data often being collected for purposes outside of econometric analysis.

# In addition, collecting random data might be infeasible or impossible in some instances.

# How would you survey randomly among potential terrorists for instance?

# There are a number of problems non-random sampling can cause.

# First let's draw 1000 standard normal xs with mean 0 and standard deviation 1.
x <- rnorm(1000,0,1)

# Let's imagine that by chance we selected the 200 x's which are in the "middle"

# First let's sort x

xsort <- sort(x)

# Now let's sample non-randomly from the sorted "middle" of the population

xmiddle_sample <- xsort[400:599]

# Now let's sample alternatively from the random "middle" of the population

x_sample <- x[400:599]

sd(xmiddle_sample)
# We can see the standard deviation is about 1/6 of what it should be (close to 1)

sd(x_sample)
# In the random sub-sample the standard deviation is much closer to 1

mean(xmiddle_sample)
# The mean of the middle sample is close to the true mean.  But this is by chance.

mean(x_sample)
# The mean of the random sample also works well.

# However, having a non-random sample can bias mean estimates as well as estimates of variation.
xlower_sample <- xsort[1:200]

sd(xlower_sample)
mean(xlower_sample)
# The standard deviation is still too small but now the mean is way off.

# So how does all of this translate into OLS?

# It depends to some extent on what elements are non-random.

# In Y = XB + u, if the sample of X is non-randomly sorted and selected from as above then the only problem will be that the variance of X is small making it hard to identify it's effect on Y.

# If however, the u term is the one that is sorted on and selected from, then our estimated standard errors will be too small, causing us to overreject the null.  In addition, it might also bias the constant.

# Finally, if sorting and selecting is based on a combination of X and u (perhaps sorting on Y for instance) then non-random selection will additionally make our estimates of the coefficients biased.

# Let's see how this works.  Let's start with 10,000 observations.

X <- rnorm(10000)
u <- rnorm(10000)*3

Y <- 2+X+u

# The coefficient on X is 1

# Let's bind the data together in a data frame
data.rand <- as.data.frame(cbind(Y,X,u))

summary(lm(Y~X, data=data.rand))

# First let's randomly sample from the data as a benchmark

sample.rand <- data.rand[1:1000,]

summary(lm(Y~X, data=sample.rand))

# We can see a random subsample from the population does not pose us too many problems.

# We lose a little accuracy and the strength of our t-values diminishes.

# Now, let's see what happens when we sort on X then select from x

data.Xsort <- data.rand[order(data.rand$X),]

# We can see that in data.Xsort the X variable is sorted (we will only look at the first 40)

data.Xsort[1:40,]

# Now we select a middle lower 1000 observations

sample.Xsort <- data.Xsort[3001:4000,]

summary(lm(Y~X, data=sample.Xsort))

# We can see that, as predicted, though the sample size is the same as previously, and though the estimate of the coefficient on X is still good, there is not enough variation in X to reject the null.

# Let's see how it works when the data is selected and sorted by u.

data.usort <- data.rand[order(data.rand$u),]

sample.usort <- data.usort[3001:4000,]

summary(lm(Y~X, data=sample.usort))

# As predicted the constant is strongly biased.

# In addition the standard errors are too small.  This would cause us to overreject the null.

# It is not possible to see this with this data as is because all of the variables have an effect on Y.

# But let us add a handful of variables to see if we reject.

# Loops through i=1 through i=20
for(i in 1:20) {
  # Generate the new variable name z1 z2 etc
  newvar <- paste("z",i,sep="")
  # Assign the new variable to the sample.usort
  sample.usort[newvar]<-rnorm(1000)
}

# We can see we now have 20 random new variables
sample.usort[1:5,]

summary(lm(Y~X+
             z1+z2+z3+z4+z5+z6+z7+z8+z9+z10+
             z11+z12+z13+z14+z15+z16+z17+z18+z19+z20
                , data=sample.usort))

# It does not seem to be overrejecting too horribly in general.  But I am sure if we did a repeated Monte Carlo simulation we would find the rejection rate to be greater than .1 at the .1 level, .05 at the .05 level, etc.

# Finally let us look at what happens if the data is sorted by Y (which is a combination of X and u)

data.Ysort <- data.rand[order(data.rand$Y),]

sample.Ysort <- data.Ysort[3001:4000,]
cor(sample.Ysort$X,sample.Ysort$u)
# Now X and u are negatively correlated with each other.  Why?
plot(1:100, sample.Ysort$u[1:100], col = "red")
  points(1:100, sample.Ysort$Y[1:100], col = "blue")
  points(1:100, sample.Ysort$X[1:100], col = "black")
dev.copy(png,'independence.png')
dev.off()

# In order to be in the range 3001-4000, in general either X has to be low but u high, or u low and X high, or u and X both a little low.
sample.Ysort[1:40,]

# If you were to take a different sample of Y then you would find a different correlation between X and u.

# The important point is that by selecting the sample on Y (assuming X has real effects) you are by necessity inducing a correlation between the error and X.
summary(lm(Y~X, data=sample.Ysort))

# Thus choosing your sampling mechanism is critical.

Saturday, July 21, 2012

A note on Temporary Variables in Stata


* It is easy to create temporary variables in Stata that are automatically cleaned from memory as soon as the current do file is completed or program is done.

clear
set obs 10000

* For example
tempvar temp1 temp2 temp3

gen `temp1' = rnormal()
gen `temp2' = rnormal()^2
gen `temp3' = runiform()

sum
* Conveniently these variables are cleaned from memory once no longer useful

* However, if we try to reuse a temporary variable by declaring another with the same local name, the old one is not overwritten; a new temporary variable is created each time.

forv i = 1/10 {
  tempvar temp
  gen `temp' = rnormal()
}

sum

* However, it is no longer possible to refer to the earlier temporary variables through the typical `temp' identifier, since that local now points to the most recently created variable.

sum __000003

* However, referring to a temporary variable directly by its internal name (as above) does work.

* Which is really not very useful, but could potentially be.

* This may not appear to be a problem, but when you start looping through commands, a large number of temporary variables can end up clogging the active memory and things start to slow down.

clear
set obs 10000

* For example:
forv i = 0/10000 {
  tempvar temp
  gen `temp' = rnormal()
  if mod(`i',1000)==0 di "$S_TIME"
}
* For me it takes about 2 seconds for every 1,000 variables created

* Easy fix:
clear
set obs 10000

forv i = 1/10000 {
  tempvar temp
  gen `temp' = rnormal()
  * Add drop
  drop `temp'
  if mod(`i',1000)==0 di "$S_TIME"
}
* Cleaning up as we go reduces the run time for all 10,000 iterations (for me) to only 2 seconds.

* This is all fine, but the only problem is: if we are going through all of the work of creating temporary variables just to drop them later, then why not just create normal variables?

* There is only one reason I can see: we do not want to risk our new variable names overlapping with existing variable names.
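
* A minimal illustration of the name-collision problem that tempvar avoids:
clear
set obs 10
* Suppose the data already contain a variable named temp
gen temp = 1
* Trying to generate another variable with the same name fails (capture the error)
cap gen temp = rnormal()
di "return code from the second gen: " _rc
* A tempvar name is guaranteed not to collide with any existing variable
tempvar temp
gen `temp' = rnormal()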

* However, if this is not a problem then the following commands are clearly that much cleaner:

clear
set obs 10000

forv i = 1/10000 {
  gen temp = rnormal()
  drop temp
  if mod(`i',1000)==0 di "$S_TIME"
}

* In general I do not use temporary variables.

Friday, July 20, 2012

7 commands in Stata and R


* In Stata we should clear the memory and set the number of observations
clear
set obs 1000

* 1. Random draws first let's generate some data (random normal draws)

gen x = rnormal()
gen u = rnormal()

gen y = x + u

* 2. OLS

reg y x

* 3. Huber White

reg y x, robust

* 4. Apply cumulative normal function

gen yprob = normal(y)

* 5. Generate bernoulli draws (a binomial is a bernoulli distribution when n = 1)

gen ybernoulli = rbinomial(1,yprob)

* 6. Use a probit estimation to estimate y

probit ybernoulli x

* 7. Logit

logit ybernoulli x

* We can see that Stata is easy to use and efficient in its commands

####################################
#### Now the same commands in R ####

# It is unnecessary to clear memory in R.  However, every time we generate new data we need to tell R how many observations to create.

# 1. Random draws first let's generate some data (random normal draws)

x = rnorm(1000)
u = rnorm(1000)
  # The 1000 tells R to generate vectors of random variables 1000 long.

y = x + u

# 2. OLS

result=lm(y~x)

# To get details
summary(result)

# 3. Huber White Heteroskedastic Corrected Standard Errors

  # As far as I can tell this is not a default command in R so we need an additional package.
  # This is easy enough to load:

  require(car)

# Now for the adjusted standard errors
hccm(result)

# 4. Apply cumulative normal function

yprob = pnorm(y)

# 5. Generate bernoulli draws (a binomial is a bernoulli distribution when n = 1)

ybernoulli = rbinom(1000,1,yprob)
  # Remember we need to specify 1000 so that R knows how many random draws to make

# 6. Use a probit estimation to estimate y

myprobit = glm(ybernoulli ~ x, family=binomial(link="probit"))

summary(myprobit)

# 7. Logit

mylogit = glm(ybernoulli ~ x, family=binomial(link="logit"))

summary(mylogit)

# We can see that while most commands are very similar to those of Stata the overall setup is more complicated.
# Which is not necessarily a bad thing.  It is generally much easier to program (anything more advanced than one-liners) in R.

Thursday, July 19, 2012

Settlers of Catan - probability of bandit attack


* This post seeks to answer the question, "Given that you have more than 7 cards in your hand, what is the likelihood that nobody (including yourself) will roll a 7 before the next turn, forcing you to discard?"

* First we must calculate the probability of anybody rolling a 7 (a seven is the most common roll possible).

* It is the probability that seven will be rolled in any combination of dice.

* The probability of any particular outcome is P(X=a, Y=b) = P(X=a)*P(Y=b) = 1/6*1/6 = 1/36

* That is P(X+Y=7)=P(X=1,Y=6)+P(X=2,Y=5)+P(X=3,Y=4)+P(X=4,Y=3)+P(X=5,Y=2)+P(X=6,Y=1)
*                 =  1/36    +    1/36  +   1/36   +    1/36  +    1/36  +   1/36    = 6/36 = 1/6

* let's check this

clear
set obs 10000
gen X = ceil(runiform()*6)
gen Y = ceil(runiform()*6)

gen XplusY = X + Y

gen YXis7 = 0
replace YXis7 = 1 if XplusY==7

sum
di 1/6

* Looks pretty close to 1/6 to me.

* So given that 1 out of 6 of the times we roll we will see a 7, what is the probability given z players at the table that you will get all the way around to your turn without rolling a 7?

* If it is just you, then it is easy. P(X1+Y1 != 7)= 1-1/6 = 5/6

* Gets a bit more tricky with more people but in general it is easy to calculate.

* Since the probability of not getting a 7 for any player is independent of not getting a 7 for any other player, the probability is easy to calculate:

* In the two player case: P(X1+Y1 != 7, X2+Y2 != 7) = P(X1+Y1 != 7)*P(X2+Y2 != 7)=5/6*5/6=(5/6)^2=25/36
di 25/36 " = 69%"

* If you have three players
di "P(X1+Y1 != 7, X2+Y2 != 7, X3+Y3 != 7) = P(X1+Y1 != 7)*P(X2+Y2 != 7)*P(X3+Y3 != 7)=5/6*5/6*5/6=(5/6)^3=125/216 =" 125/216 " about 58%"

* If you have four players
di "P(!=7)=(5/6)^4= " (5/6)^4 " about 48%"

* So with 4 players there is better than 50% chance of rolling a 7 before your turn comes back to you.
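
* The same probabilities computed in one short loop rather than case by case:
forv z = 1/6 {
  di "With `z' roll(s) before your next turn, P(no 7) = " %5.3f (5/6)^`z'
}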

* Let's test this through simulation

set seed 1010
clear
set obs 10000

* This will generate 4 sets of variables
forv i=1/4 {
  * X for the first die
  gen X`i' = ceil(runiform()*6)
  * Y for the second die
  gen Y`i' = ceil(runiform()*6)
  * The sum of the dice
  gen X`i'plusY`i' = X`i' + Y`i'
  * An indicator if the sum equals 7
  gen Y`i'X`i'is7 = 0
  replace Y`i'X`i'is7 = 1 if X`i'plusY`i'==7
}

* If it is just you
gen no7_1 = 0
replace no7_1 = 1 if Y1X1is7 == 0

* If it is you and one other
gen no7_2 = 0
replace no7_2 = 1 if Y1X1is7 == 0 & Y2X2is7 == 0

* If it is you and two others
gen no7_3 = 0
replace no7_3 = 1 if Y1X1is7 == 0 & Y2X2is7 == 0 & Y3X3is7 == 0


* If it is you and three others
gen no7_4 = 0
replace no7_4 = 1 if Y1X1is7 == 0 & Y2X2is7 == 0 & Y3X3is7 == 0 & Y4X4is7 == 0

sum
*  We can see that the simulation confirms that with four players there is about a 48% chance that no 7 is rolled before your next turn.

Wednesday, July 18, 2012

Problem with probabilities - Monty Hall's Problem


* There once was a game show hosted by Monty Hall.  At some point in the game show, the contestant had to choose one of three doors.

* Behind two of the doors there was a goat while behind the third one there was a car.

* The contestant would first choose one door.  Then Monty would reveal a goat behind one of the two remaining doors (Monty Hall knew where the car was), and the contestant then had the chance to switch to the remaining unopened door.

* After that, the chosen door was opened and the goat or car behind it was revealed.  If the contestant had chosen the door with the car then the contestant won the car.

* So the question was asked, "What is the ideal strategy for this game?"

* A clever columnist responded by stating that the optimal strategy was to always change your door after one goat is revealed.

* Why is this?

* Simply, if you stick with your original door as a strategy then your probability of being right is 1/3.

* However, if you change your choice after one of the goats is revealed then you have increased your probability of getting the car to 2/3.

* One way to think about this is: if the probability of getting the car from your initial guess is 1/3, then the strategy of keeping your original guess must yield the same 1/3 probability.

* However, if the probability of getting the car by sticking with your original guess is 1/3, then by switching you must be getting the remaining probability of getting the car (i.e. 1-1/3=2/3).

* Don't believe it? I did not at first either!

* Well let's write a simulation to check the expected results of the two strategies

* First let's imagine 100,000 repeated games (a decent sampling for an 'unknown' probability distribution)

clear
set obs 100000

* The car is behind one of the doors: (1,2,3)

gen car = ceil(runiform()*3)
tab car
* Looks good

* The contestant initially guesses one of the doors (stored as initial_guess):

gen initial_guess = ceil(runiform()*3)
tab initial_guess

* This also looks good.

* Now a bit more tricky we must find which door is revealed.

* The revealed door cannot be the door the contestant chose and it cannot be the door that has the car.

* Let's draw a random door and then adjust any invalid draws until the revealed door is neither the contestant's choice nor the car.

gen revealed = ceil(runiform()*3)

* Loop through twice, cycling forward any revealed value that is not legitimate
forv i = 1/2 {
  replace revealed = mod(revealed+1,3)+1 if revealed == initial_guess | revealed == car
}

* This count command will count how many randomly revealed doors are not legitimate.
count if revealed == initial_guess | revealed == car
* The loop successfully worked

tab revealed
* We can see revealed still looks like it is performing well (in terms of equal sampling from each of the doors)

* Now let's formulate three strategies
* 1. Choose a door and stick with it
* 2. Choose a door then always switch
* 3. Choose a door then fair flip a coin, if heads switch.

* For strategy one it is easy
  gen door1_choice = initial_guess

  gen strat1_car = 0
  replace strat1_car = 1 if door1_choice == car

* For strategy two it is a bit more complicated.  First we need to figure out how to identify which alternative door is chosen.
  gen door2_choice = mod(initial_guess+1,3)+1
  * That is, start from the "next" door after the initial choice (cycling through 1, 2, 3).
  replace door2_choice = mod(door2_choice+1,3)+1 if revealed==door2_choice
  * There is the possibility that the new choice is the door that was revealed, which is not allowed.
  * In that case cycling forward once more lands on the only remaining door, since two steps through the cycle cover every door other than the initial choice.
  count if revealed==door2_choice | initial_guess == door2_choice
  * No violations of the strategy

  gen strat2_car = 0
  replace strat2_car = 1 if door2_choice == car

*   Finally, strategy three: choose a door randomly from the two doors remaining.
* To execute these random draws we use a while loop that keeps redrawing until no observation's draw equals the revealed door.

  gen door3_choice = ceil(runiform()*3)

  count if door3_choice == revealed
  while r(N) > 0 {
   replace door3_choice = ceil(runiform()*3) if door3_choice == revealed
    count if door3_choice == revealed
  }

    tab door3_choice

    gen strat3_car = 0
    replace strat3_car = 1 if door3_choice == car

* Before executing this next bit of code, check your intuition.

* What do you expect the probabilities of success to be for each strategy?  Strategies 1 and 2 have been discussed; what about strategy 3?

* Ok, drum roll....

sum strat?_car

* Surprising?  Perhaps not.

* Interesting how different strategies are clearly so much better than other strategies ;)

* Note, by the way, that strategies 1 and 2 have the same standard deviation.  That is because for a Bernoulli random variable X, var(X)=P(1-P), where P is the probability that X=1.

Tuesday, July 17, 2012

The Problem with Probabilities - how many rolls does it take to get a 1!


* So, you roll a fair 6 sided die once.  What is the probability of getting a 1 on the first roll?

* Simple:

di 1/6

* How about the probability of not getting a one:

di 1-1/6

* or

di 5/6

* Ok, easy right.  Now is when things get ugly.

* What happens if you want to know the probability of rolling at least one 1 with two fair dice?

* 1/6 + 1/6? right?  I mean, you roll one die, 1 in 6 times you will get a 1 and you roll the other die and one in six times you will get a 1.  2/6=1/3 Right?

di 1/6 + 1/6

* Wrong!

* Weirdly, it is not obvious with 2 dice why this statement is not true.  However, start adding dice to the equation: 3/6 (three dice), 4/6 (four dice), etc., up to 6/6 = 1 (six dice)... wait a second!  I know that if I roll six dice, there is a possibility that none of them will be a 1.  It might be small but it exists!  Then what happens when we go to 7 dice?  7/6 > 1 and thus not a probability.

* The reason why the intuitive guess does not work is that the intuition applies to the question "how many 1s will be rolled on x dice?"  To see this we find the expected number of 1s:
* E(number of 1s with one die)  = 1/6*1 + 5/6*0 = 1/6
* E(number of 1s with two dice) = 1/6 + 1/6 = 2/6 = 1/3
* The difference between this and the previous question is that for the expected count, rolling two 1s counts as 2, while for the "at least one 1" question both dice showing a 1 still only counts once.

* Thus we realize the utter failure of our intuition.

* The thing about combining independent outcomes this way is that the events must be kept mutually exclusive.  You roll two dice and you only care about the probability that at least one of them is a 1.  Thus you say: the probability that the first die is a 1 (1/6), plus the probability that the second die is a 1 (1/6) and the first die is not a 1 (5/6) -> 1/6*5/6.

* So the total probability that at least one of the two dice is a 1 is:
di 1/6 + 1/6*5/6

* Not quite 33%

* It becomes more exaggerated the more dice you add to the mix

*  P(at least one 1 in three dice) = P(a1=1) + P(a2=1, a1!=1) + P(a3=1, a1!=1, a2!=1)
di 1/6 + 1/6*5/6 + 1/6*(5/6)^2

* Okay, one more:
*  P(at least one 1 in four dice) = P(a1=1) + P(a2=1, a1!=1) + P(a3=1, a1!=1, a2!=1) + P(a4=1, a1!=1, a2!=1, a3!=1)
di 1/6 + 1/6*5/6 + 1/6*(5/6)^2 + 1/6*(5/6)^3

* An easier way to do these things is to take 1 less the probability that the event never occurs
di 1-(5/6)^4
* These expressions are algebraically identical: summing the geometric series above gives exactly 1-(5/6)^4.
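
* A small check that the sequential sum and the complement formula agree for 1 through 4 dice:
local p = 0
forv n = 1/4 {
  local p = `p' + (1/6)*(5/6)^(`n'-1)
  di "`n' dice: sequential sum = " %6.4f `p' "   1-(5/6)^`n' = " %6.4f 1-(5/6)^`n'
}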

* So it is easy to say this stuff but do we really believe it?

* Fortunately we can test our intuition with simulation!

* Lets generate some random six sided draws

clear
set obs 100000

* First the draws
gen draw1 = ceil(runiform()*6)

tab draw1
* That looks about right!

gen draw2 = ceil(runiform()*6)
gen draw3 = ceil(runiform()*6)
gen draw4 = ceil(runiform()*6)

* Now let's generate some indicator variables of different outcomes.

gen draw1eq1 = 0
replace draw1eq1 = 1 if draw1==1

gen draw1or2eq1 = 0
replace draw1or2eq1 = 1  if draw1==1 | draw2==1

gen draw1or3eq1 = 0
replace draw1or3eq1 = 1  if draw1==1 | draw2==1 | draw3==1

gen draw1or4eq1 = 0
replace draw1or4eq1 = 1  if draw1==1 | draw2==1 | draw3==1 | draw4==1

sum draw*eq1
* Thus we can see, contrary to our intuition, that the probability of a 1 being rolled on at least 1 of four dice is closer to 50% than to our intuitive guess of 67% (4/6).

* This could make a big difference in an intense game of D&D!    :)

Monday, July 16, 2012

Recursive programming in Stata


* This is one of the basic programming techniques in most programming languages but as far as I can tell it is largely absent from end user Stata programming.

* Imagine some routine whose definition refers to a smaller version of itself.

* For instance, imagine you would like to calculate a factorial.

* Such as 10! = 10 * 9 * 8 * 7 * 6 * 5 * ... * 2 * 1

* Interestingly, Stata (outside of Mata) does not have a basic factorial function but rather only a log factorial

* Easy enough to invert:

di exp(lnfactorial(10))

* Now, let's see if we can't write our own command to calculate a factorial.

cap program drop myfact
program define myfact, rclass
  * Let's display some feedback to help the user see what is going on
  if "`2'" == "noisy" di "Evaluate `1' - Input `1'"

  * If the number to be evaluated is greater than 1 then this program will call itself.
  if `1' > 1 {
    * myfact calling myfact at a value 1 less than the input value
    myfact `=`1'-1' `2'

    * The recursive call returns a scalar r(current_sum) (really a running product).
    * We multiply it by the current value `1' being evaluated.
    local current_sum = `1'*r(current_sum)
  }

  * This is the base case that stops the recursion:
  * when the input reaches 1, the factorial is simply 1.
  else local current_sum = 1

  * Display some more output
  if "`2'" == "noisy" di "Evaluate `1' - Output `current_sum'"

  * This returns to the top most operation the current sum of the recursive operation.
  return scalar current_sum = `current_sum'
end

myfact 10 noisy
return list

* If we want our program to run without feedback
myfact 10
return list

* Of course this is far from the easiest way to program our own factorial calculator.
* For instance:
cap program drop myfact2
program define myfact2, rclass

  local current_sum = 1
  forv i = 1(1)`1' {
    local current_sum=`current_sum'*`i'
  }

  return scalar current_sum = `current_sum'

end

myfact2 10
return list

* The point is to demonstrate the problem solving technique known as recursive programming.
* There are numerous other applications for this technique.
* Search and sorting algorithms frequently feature recursive algorithms.

* In economics I have used recursive algorithms to solve market equilibrium problems.

* That is when the decision of one actor depends on the choices of other actors simultaneously.

Sunday, July 15, 2012

Use bootstrapped draws for simulating draws - expand method


* Use bootstrapped draws for simulating results - expand method

* This presents an alternative method to resampling results from the last post

webuse mheart0, clear

* First let's generate an observation index
gen obs_id = _n

* And you want to test how well an estimator will work on sampled data from that data set.

* There are obviously many ways to do this.

* One way would be to resample from that data 1,000 draws and then generate a dependent variable and test how well your estimator works.

sum

* First we want to mark the draws but we can see that bmi is missing some information.

* For our purposes we could either drop the observations for which bmi is missing or linearly impute bmi.

* Let's just impute bmi:

reg  bmi age smokes attack female hsgrad marstatus alcohol hightar

predict bmi_fill

replace bmi = bmi_fill if bmi==.

sum bmi
drop bmi_fill


di "Now what we want is approximately 1,000 results (we do not need to be exact)"

di "We have " _N " observations"

local obs_add =1000/_N

di "So we need to add approximately " round(`obs_add') " observations per observation"

* One way to do this is to randomly choose how many duplicates of each observation to create (possibly zero, which drops it).

* The uniform distribution is a natural choice.  However, its expected value is 1/2 so we need to multiply by 2 to ensure that we get the right number of observations.
gen add = round(`obs_add'*runiform()*2)

* Note: alternative distributions might be any non-negative distribution for which you can specify the expected value.
* For example: Poisson.  This distribution will be less likely to drop observations and gives more proportional representation of the initial observations.

tab add
* First let's drop any observations that are slated to be dropped

drop if add == 0

* Now the command expand is very useful because it allows us to easily duplicate observations

expand add

sum

tab obs_id

Saturday, July 14, 2012

Stata access to World Bank data

The World Bank provides easy access to over 5,300 time-series indicators in Stata.  For more information see:

http://thedatamonkey.blogspot.com/2012/03/stata-access-to-world-bank-data.html

bootstrapped draws for simulating observations - merge method

* Suppose you would like your simulation to be as realistic as possible in terms of input data.

* You have some data set: for example:

webuse mheart0, clear

* And you want to test how well an estimator will work on sampled data from that data set.

* There are obviously many ways to do this.

* One way would be to resample from that data 10,000 draws and then generate a dependent variable and test how well your estimator works.

sum

* First we want to mark the draws, but we can see that bmi is missing some information.

* For our purposes we could either drop the observations for which bmi is missing or linearly impute bmi.

* Let's just impute bmi:

reg  bmi age smokes attack female hsgrad marstatus alcohol hightar

predict bmi_fill

replace bmi = bmi_fill if bmi==.

sum bmi
drop bmi_fill

* This might not be the best technique for imputing values for econometric estimation but since all we want is to generate data with reasonable ranges for the explanatory variables, this should work fine.

* First let's mark our observations
gen draw_num = _n

di "There are " _N " observations available to draw from"

* Let us save the maximum number of draws as a global
gl max_draw = _N

tempfile draw
save `draw'

* Now we are ready to start drawing our data.
* Imagine we would like to draw 10,000 observations:

clear
set obs 10000

* Now is the only trick.  We need to assign each observation a draw_num that ranges from 1 to ${max_draw}

gen draw_num = int(runiform()*${max_draw})+1
* This is a little tricky but it should do the job: runiform()*${max_draw} ranges from 0 to just under max_draw, int() reduces it to an integer from 0 to max_draw-1, and adding 1 gives an integer from 1 to max_draw.

sort draw_num
* Necessary for the merge

* Now we just need to merge in our temporary data set:
merge m:1 draw_num using `draw'

* Sum

* Now we can generate some values and estimate how well they work
* for example

gen u = 50*rnormal()

gen y = -5 + attack -3* smokes + .15*age -2*bmi + u

reg y attack smokes age bmi

Friday, July 13, 2012

Simulating Multinomial logit in Stata - Updated


* This post has been corrected based on suggestions from:
* Ali Hashemi Economics & Finance Ashland University and Mateusz (comment below)
* Thanks!



* This code simulates the multinomial logit model pioneered by McFadden (1973, 1983, 1984)
* (http://ideas.repec.org/h/eee/ecochp/2-24.html   - 1984 paper)

* The model allows for the estimation of choices among multiple mutually exclusive alternatives.

* Imagine that you are at an ice cream shop and you are getting a root beer float.  The root beer is given.

* Now you need to select among multiple mutually exclusive and exhaustive outcomes: the ice cream flavor.

* This model is used widely in such things as choosing travel destinations, insurance policies, etc.

* The structure is a little more abstract than we are used to:

* By definition sum{j=0 to J}[P(Y=j)] = 1

* For the non-base outcomes j = 1,...,J:  P(Y=j) = exp(X'Bj)/{1+sum{i=1 to J}[exp(X'Bi)]}

* and for the base outcome

* P(Y=0) = 1/{1+sum{i=1 to J}[exp(X'Bi)]}

* Thus the more easily interpretable odds ratio is P(Y=j)/P(Y=0) = exp(X'Bj),
* so the log-odds ln[P(Y=j)/P(Y=0)] = X'Bj are linear in X

* Note: obviously this is a highly non-linear setup

* So let's start the simulation:

clear
set obs 1000

* First a few explanatory variables

gen fruit_loving = rnormal()

gen sweet_tooth = rnormal()

gen bean_loving = rnormal()

gen cocoa_loving = rnormal()

* Now let us calculate the XB for each of the dependent variables:
* Vanilla is the base outcome, so its coefficients are normalized to zero.
gen XBvanilla = 0

gen XBchocolate = 1 -2 * fruit_loving + -.1 * sweet_tooth + 1.75 * bean_loving + 2.2 * cocoa_loving

gen XBstrawberry = -1 + 2 * fruit_loving + .1 * sweet_tooth - 1.5 * bean_loving - 2.5 * cocoa_loving

* Now let's generate the Exp(XB)

gen expvan = exp(XBvanilla)

gen expchoc = exp(XBchocolate)

gen expstraw = exp(XBstrawberry)

* Now let's generate the denominator:

gen denom = expvan + expchoc + expstraw

* Finally let us generate the probabilities of seeing any particular outcome:
gen pvan = expvan/denom
gen pchoc = expchoc/denom
gen pstraw = expstraw/denom

sum p*
* We can see chocolate is the most likely ice-cream flavor followed by strawberry, followed by vanilla.
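
* A quick check of the log-odds relationship written out above: ln(pchoc/pvan) should be
* exactly linear in the explanatory variables, recovering the coefficients we chose.
gen lnodds_choc = ln(pchoc/pvan)
reg lnodds_choc fruit_loving sweet_tooth bean_loving cocoa_loving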

* Now let's make the draws that we observe.  This is a little tricky.

gen runif = runiform()

gen flavor = 3
replace flavor = 2 if runif < pvan+pchoc
replace flavor = 1 if runif < pvan

label define flavor 1 "Vanilla" 2 "Chocolate" 3 "Strawberry"

label values flavor flavor
tab flavor
* Looking pretty good.

mlogit flavor fruit_loving sweet_tooth bean_loving cocoa_loving, baseoutcome(1)

predict pvan_hat ,equation(#1)
predict pchoc_hat ,equation(#2)
predict pstraw_hat ,equation(#3)

sum p*
* We can see the average predicted outcome probabilities are similar

pwcorr pchoc_hat pchoc, sig
* The correlation between pchoc_hat and pchoc is statistically significant

reg pvan  fruit_loving sweet_tooth bean_loving cocoa_loving
reg pvan_hat  fruit_loving sweet_tooth bean_loving cocoa_loving
* Looks very good
* We can see that how the independent variables affect the predicted variables is extremely similar.
* This is probably the best evidence yet that the mlogit command is doing what we want it to be doing.

* And
reg pchoc  fruit_loving sweet_tooth bean_loving cocoa_loving
reg pchoc_hat  fruit_loving sweet_tooth bean_loving cocoa_loving
* Looks also good

* And
reg pstraw  fruit_loving sweet_tooth bean_loving cocoa_loving
reg pstraw_hat  fruit_loving sweet_tooth bean_loving cocoa_loving

Thursday, July 12, 2012

Remove a subset from a global - Stata


* This program will remove a set of specified values from a global
* I believe this is the set-algebra equivalent of subtracting one set from another
* I figure there is probably a prewritten command for this but I have not been able to find it.

cap program drop takefromglobal
program define takefromglobal
 
  * First define a local called temp to hold the initial values of the global
  local temp ${`1'}
 
  * Empty the global
  global `1'
 
  * Loop through each of the old values
  foreach v of local temp {

    * Start counting at 0
    local i = 0

    * Create a local that indicates whether this value should be skipped (initially assume no skipping)
    local skip = 0

    * Loop through all of the user-specified inputs `0'
    foreach vv in `0' {

      * We will skip the first one since we know it is just the name of the global specified
      local i = `i' + 1

      * If we are past the first input and one of our values to exclude matches the value in the global, skip that value
      if `i' > 1 & "`vv'"=="`v'" local skip = 1
    }

    * If the current value is not to be skipped then add it to the new global list
    if `skip' == 0 global `1' ${`1'} `v'
  }
end

global XYZ a b c d e f g a b c d e

di "$XYZ"

takefromglobal XYZ a b c

di "$XYZ"

Wednesday, July 11, 2012

Survival Rates and Negative Binomial

* This estimates average survival rates (days)
* Given different daily mortality rates (predation/accident etc.)

* Given a constant daily mortality rate and a defined number of failures, the number of days survived follows the negative binomial distribution.

* Negative binomial results are as follows:
* mean = p*r/(1-p), where p is the daily survival probability
* and r is the number of failures:
* r = 1

* mortality rate = .01
* if p=.99 ->
di "E= " .99/.01

* mortality rate = .015
* if p=.985 ->
di "E= " .985/.015


* mortality rate = .02
* if p=.98 ->
di "E= " .98/.02

* mortality rate = .025
* if p=.975 ->
di "E= " .975/.025

* mortality rate = .03
* if p=.97 ->
di "E= " .97/.03

* mortality rate = .035
* if p=.965 ->
di "E= " .965/.035


* Let's see if the simulation gives us the expected results
clear
set obs 12000
gen mort_rate = mod(_n,6)/200+.01
* This will split the data into 6 groups of 2,000, with the groups having mortality rates from .01 to .035

tab mort_rate

* An indicator that the mosquito is alive
gen living=1

* An indicator that the mosquito survived
gen survived=0

* Simulate 500 days, tracking how long each mosquito survives
forv i = 1/500 {
  * Draw a single temporary binomial draw to see if this day the mosquitos die
  qui gen died = rbinomial(1,mort_rate) if living==1
  * If the mosquitos died then change their "living" status
  qui replace living=0 if living==1 & died == 1
  * Replace survived to indicate the current day if they are still alive
  qui replace survived=`i' if living==1

  * Display user feedback showing the current day and the fraction of mosquitoes still alive.
  qui sum living
  di "Round `i' :" r(mean)

  * Drop the temporary variable "died"
  drop died
}

hist survived, by(mort_rate)



bysort mort_rate: sum survived


Tuesday, July 10, 2012

The Delta Method

* Traditionally the delta method was and continues to be used in economics to estimate the standard errors of many econometric estimators. 

* The delta method typically takes an estimator for which a reasonable standard error estimate is known and transforms that estimator by some function to another estimator.

* See more here: http://en.wikipedia.org/wiki/Delta_method & http://en.wikipedia.org/wiki/Taylor_series
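
* In its simplest univariate form the delta method approximation is:
*   Var[ g(theta_hat) ] ~ [ g'(theta) ]^2 * Var( theta_hat )
* which comes from a first-order Taylor expansion of g() around theta.  This is where the
* cprime^2 * _se[_cons]^2 expression used below comes from.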
clear
set obs 1000

gen x = rnormal()*5 + 10
* Let's set up a variable x with a mean of 10 and variance of 25

* c=f(x)=x^3
gen c = x^3

* cprime = f'(x) = 3*x^2
gen cprime = 3*x^2

reg x
gen var_hat_c = cprime^2*_se[_cons]^2

qui sum var_hat_c
di "Thus the delta estimate of the standard error of mean(c) is: " r(mean)^.5
reg c
di "While the 'unobservable' draw's se is: " _se[_cons]

* It seems the delta method is working pretty well!

* An alternative would be to bootstrap the errors
bs, rep(100): reg c

* Both methods work well in this example.

***********************************************************
* Let's try something a little different

clear
set obs 1000

gen x = runiform()*10

* c=f(x)=sin(x)+x
gen c = sin(x)+x

* cprime = f'(x) = cos(x)+1
gen cprime = cos(x)+1

reg x
gen var_hat_c = cprime^2*_se[_cons]^2

qui sum var_hat_c
di "Thus the delta estimate of the standard error of mean(c) is: " r(mean)^.5
reg c
di "While the 'unobservable' draw's se is: " _se[_cons]

bs, rep(100): reg c

* The Delta method still seems to be working well.

* As I understand it, the delta method is:
*   Less machine-time intensive but more algebra intensive.  Taking the derivatives above was easy, but when you have multiple explanatory variables and more complex equations things can get pretty ugly.
*   More susceptible to approximation errors (because it is based on a Taylor series expansion)

Monday, July 9, 2012

Write your own bootstrap command.

* A similar function is also programmed in R.  See: http://www.econometricsbysimulation.com/2012/09/program-your-own-bootstrap-command-r.html

* The bootstrap resampling technique is one of the coolest techniques developed in econometrics/statistics in the last 50 years (that and maximum likelihood).  The bootstrap releases the user from strong assumptions about the underlying distribution of errors.

* As far as I understand it, the bootstrap assumption is as follows: so long as the data you draw are reasonable draws from the underlying distribution, resampling from that sample will give you reasonable estimates of the population's underlying parameters.  Note, this method is related to Bayesian methods.

* B. Efron formalized the technique and terminology of the bootstrap in a 1979 paper.

* First let us define a program that will do our bootstrap
cap program drop bootme
program bootme

********
* First we will set up the data that we will use as a resource to draw from

  * Preserve the current data
  preserve
  syntax anything [, Reps(integer 100) Command(string) Keep(string) Noisily]

  * Tell Stata whether to display what it is doing; the default is to run quietly
  if "`noisily'" == "" local nq quietly
  if "`noisily'" != "" local nq

  * Generate a variable that counts from 0 to _N-1
  `nq' gen bootmatcher = _n-1

  * This is the number of observations to draw from.
  `nq' gl boot_N=_N

  * We will save the data file to draw from first
   tempfile drawfile
  `nq' save `drawfile' , replace

* Draw file set up
********

* Clear the memory
`nq' clear

* Set the number of repetitions to that specified by the user
`nq' set obs `reps'

* Keep tells the bootstrap routine which values from the estimation to keep
foreach v in `keep' {
 * Creates empty variables to hold the coefficients of interest
 local vtemp=strtoname("v`v'")
`nq' gen `vtemp'=.
}

* Create a data file that will list the results of the bootstrap
tempfile returnlist
`nq' save `returnlist' , replace

* Display the bootstraps progress
di as text _newline "Bootstrap replications ({result:`reps'})"
di "----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5"

* Loop through the number of repetitions in the bootstrap
forv i=1/`reps' {
* First clear the data
clear
* Set the number of observations equal to the number to draw
`nq' set obs $boot_N
* Generate a variable that represents random draws.
`nq' gen bootmatcher = int(runiform()*$boot_N)
* Sort the draws from 1 to _N to assist in the merge command
`nq' sort bootmatcher
* Merge in the drawfile
`nq' merge m:1 bootmatcher using `drawfile', keep(3)

* Now run the user specified command
`nq' `command'

* Now load up the return list file
`nq' use `returnlist', clear

* For each value of the list of keep values loop through
foreach v in `keep' {
  * We need to make sure the variables are readable as names
  local vtemp=strtoname("v`v'")
  * Replace each observation with the estimated value
  `nq' replace `vtemp'=`v' if _n==`i'
}
`nq' save `returnlist', replace

* If the repetition number is divisible by 50, display the count and start a new line of the progress bar.
if `i'/50 == int(`i'/50) di ".  " `i'
* Otherwise keep adding .s
else di _continue "."
}
* This summarizes the results
sum
* Restores the data to the previous state
`nq' restore
* Runs the command one last time for display
`command'
end
* End of the Bootme command

clear
set obs 1000

gen x = rnormal()
gen u = rnormal()*10
gen y = x + u

* This is Stata's default bootstrap command
bs _b[x] _se[x] e(F) e(ll), reps(200): reg y x

* Our bootstrap works very similarly, though our standard errors seem to be a little wider in general.  I am not sure why.
bootme 1, r(200) k(_b[x] _se[x] e(F) e(ll)) c(reg y x)
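
* A hedged sketch (not part of the original post): Stata's built-in bsample command
* resamples the data in memory with replacement, so a single bootstrap replication can
* also be written as:
preserve
bsample
reg y x
restore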

Sunday, July 8, 2012

Predict


* Stata's predict command is an extremely useful command for many purposes.

* In this post we will go through how it works and manually program, in long hand, some of the things it does.

* First let's start with OLS

* Imagine the underlying population model Y = g(x1, x2)

* Now imagine an estimator Y = f(x1, x2)

* What most estimation commands do is take the Y and the xs and estimate some variant of f.

* In the linear case Y = b0 + b1x1 + b2x2 + u

* Most estimation commands attempt to estimate b0, b1, and b2.  Which is great!

* But after estimating b0, b1, and b2 we may ask,

* "How does u look? Does it look normal, thus justifying the use of OLS?"

* We may also ask, "How does the estimated y look?  This is often not particularly interesting since it is purely linear, but 'yhat', the predicted y, is often used in post-estimation techniques."

* Let's see how this works.

* First, let's simulate some data:

set seed 10
clear
set obs 1000

gen x1 = rnormal()
gen x2 = rnormal()
gen u  = rnormal()
gen y  = 6*x1 + -4*x2 + 10*u

* Now let's estimate the OLS equation

reg y x1 x2

* If we want to get the fitted values we need only write the following
predict yhat1

* This is equivalent in the OLS case to:
predict yhat2, xb

* We can also manually generate these values by using the estimated coefficients:
gen yhat3 = .1430927 + 5.767773*x1 + -3.798869*x2
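* A seed-independent alternative (a small added sketch): use the coefficients stored by
* reg rather than typing the numbers by hand.  The sum below will pick this up as well.
gen yhat4 = _b[_cons] + _b[x1]*x1 + _b[x2]*x2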

sum yhat?

* Likewise we may be interested in the error uhat

predict uhat1, residual

* We can do it manually:
gen uhat2=y-yhat1

sum uhat?
* There is the slightest difference between uhat1 and uhat2 but this is only the result of rounding error.

* Now that we have uhat, we can map it out to see if it looks like it is behaving well:

hist uhat1, kden
* Unsurprisingly (given that we drew u from a normal distribution), uhat1 (which is an estimate of u) looks normal as well.

Saturday, July 7, 2012

Probit vs Logit


* This is a comparison of how well the logit does relative to the probit when the data is generated from the assumptions underlying the probit.

* First let's generate data that is consistent with the probit assumptions

clear
set seed 101
set obs 1000

* x is the explanatory variable
gen x = rnormal()*(1/2)^(1/2)

* u is the error
gen u = rnormal()*(1/2)^(1/2)

* y is the unobservable structural y
gen y = x + u

* In order to do a probit correctly the underlying error distribution has to be standard normal (which is not a real restriction, just a normalization to remember when generating values).

* This is why rnormal*(1/2)^(1/2) -> var(y)=var(x+u)=(1/2)*1+(1/2)*1=1
sum y
* Pretty close to 1 in the sample

* y_prob is the probability of observing a 1 given the latent index y
gen y_prob = normal(y)

* y_observed contains the actual binary draws
gen y_observed = rbinomial(1,y_prob)

* now let's try estimating probit first
probit y_observed x
  * Save the estimated coefficient to a local macro
  local coef_probit = _b[x]

* let's predict the probabilities
predict y_probit
  label var y_probit "Probit fitted values"

* Now let's estimate the logit
logit y_observed x
* Save the logit to a local macro
local coef_logit = _b[x]

* predict the probabilities from the logit
predict y_logit
  label var y_logit "Logit fitted values"

* We can see that the fitted probabilities from the probit and the logit are almost identical
two (line y_logit x, sort) (line y_probit x, sort)




di "It is a somewhat well known property that probits and logits are in practice almost linearly equivalent."
di "The ratio of probit to logit is: `=`coef_probit'/`coef_logit''"

reg y_probit y_logit
* Check out that R-squared!

* So what does all of this practically mean?

* Feel free to switch between probit and logit whenever you want.  The choice should not generally significantly affect your estimates.

* Note: for mathematical reasons sometimes it is easier using one over the other.

* Finally, if you want to recover the original coefficient on x the best thing to do is to take the average partial effect (APE)

probit y_observed x
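* A hedged aside (assuming Stata 11 or later; not in the original post): the margins
* command reports the average partial effect of x after the probit.
margins, dydx(x)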

di (1/2)^(1/2)

test x==.70710678
* The probit results get fairly close but we reject the null

Friday, July 6, 2012

Logit: Logistic regression on a factor variable


* Logistic regression on a factor variable

* A reader recently contacted me with a request: she wants to run logistic regressions to examine whether a person visited a dentist in the last year.

* The example of the data she sent looked something like the following:

/*                  A12 |      Freq.     Percent        Cum.
------------------------+-----------------------------------
    In the last 4 weeks |      2,028       20.28       20.28
Between 1 and 12 months |      2,036       20.36       40.64
          1-2 years ago |      1,997       19.97       60.61
  More than 2 years ago |      1,963       19.63       80.24
                  Never |      1,976       19.76      100.00
------------------------+-----------------------------------
                  Total |     10,000      100.00          */

* Let us first generate data that looks something like her data.

clear
set obs 10000
gen A12 = int(runiform()*5)+1
  label define dental 1 "In the last 4 weeks" ///
                      2 "Between 1 and 12 months" ///
 3 "1-2 years ago" ///
 4 "More than 2 years ago" ///
 5 "Never"

label values A12  dental

tab A12
* The problem is that the dependent variable is coded as a factor variable but the logistic regression requires a binary variable.

* First we want to figure out which value label is attached to the A12 variable.
desc A12
* But it might not be the case that A12 is a factor variable.  We might find that A12 is actually a string variable.

* Let us generate a string duplicate
decode A12, gen(A12b)

tab A12b
* We can see that the two tab outputs are identical except for the order in which the items are listed.
* Thus we can infer that the original data is in factor form.

* Though just looking at the desc output tells us as well: if the storage type is not string, then the labels must come from a value label, i.e. A12 is a labeled numeric (factor) variable.

* This tells us that A12 has the label dental applied to it.

label list

* Here is a detailed post on how to convert factor variables to dummies:
* http://www.econometricsbysimulation.com/2012/06/convert-factor-variables-dummy-lists.html

* However, it might be a bit of overkill for this problem.  Instead we can manually convert the factor variables as we need to.

* Generate first an empty variable
gen dental_yr1 = 0
  label var dental_yr1 "Went to the dentist in the last year"
replace dental_yr1 = 1 if A12 == 1
  * We know from the label list that A12 == 1 is "In the last 4 weeks"
replace dental_yr1 = 1 if A12 == 2
  * We know from the label list that A12 == 2 is "Between 1 and 12 months"

sum dental_yr1
* Everything is looking good.
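
* A more compact alternative (a small added sketch): inlist() builds the same dummy in
* one line.
gen dental_yr2 = inlist(A12, 1, 2)
  label var dental_yr2 "Went to the dentist in the last year (inlist version)"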

* Now in order to do a logistic regression we need to have some explanatory variables so let's generate some independent ones for now.
gen indepvar1 = rnormal()
gen indepvar2 = rnormal()

* Finally:
logit dental_yr1 indepvar1 indepvar2
* Unsurprisingly, the independent vars are not statistically significant.

Thursday, July 5, 2012

The randomly walking currency market

* The randomly walking currency market

* One of the first simulations I ever wrote was using random walks and seeing how OLS fails when the data generating process does not experience decay.

* A good example of this might be the stock market.  We do not expect that because prices go down today they need go up tomorrow, or vice versa.  Likewise, the currency market (comparing two developed economies) is unlikely to experience any kind of decay toward a previous value.  Either the dollar is up relative to the Euro or it is down relative to the Euro; the best we can do to predict the dollar today is to look at what it was yesterday.

* Let's imagine that we have two random walks observed over 730 days (two years of daily information):

clear
set seed 111
set obs 730

gen t = _n
  label var t "Time (day #)"

gen y = 100 if t==1
  label var y "% value of dollars to t=0"

gen x = 100 if t==1
  label var x "% value of euros to t=0"

* Initially both values are set at the same
replace y = y[_n-1] + rnormal() if _n>1
replace x = x[_n-1] + rnormal() if _n>1
* However now there is no relationship between y and x

two (line y t) (line x t)
* Looking at the two values plotted next to each other we can almost imagine a relationship. Play around with different seeds.  If your mind is anything like mine it will start seeing patterns.


reg y x
* We strongly reject the null that there is no relationship.  This is no accident of this particular draw; with almost any seed you set you will get similar results.

* So is the problem with the random draws?  Are they not truly random?

* Well, yes, they are not truly random, but they are as close to random as we need to concern ourselves with.

* It is simply the nature of the random walk.  Fortunately there are tests for that.

tsset t

dfuller y
dfuller x

* The null in this test is that the process follows a random walk (contains a unit root).  The test fails to reject our null.
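* A hedged aside (not in the original post): once we know both series contain unit
* roots, a standard remedy for the spurious regression is to estimate the relationship
* in first differences.
reg D.y D.x
* Since the differenced series are independent white noise, the spurious relationship
* should disappear here.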

*********************************
* We can do the exact same thing in a panel data setting.

* Imagine you have the quiz scores for 50 students over 52 periods.

clear
set seed 111
set obs 50

gen id = _n
  label var id "Student id"

expand 52

bysort id: gen t = _n
  label var t "Year"

* Student first year is the reference year
gen y = 100 if t==1
  label var y "student quiz score"

* The explanatory variable also follows a random walk
gen x = 100 if t==1
  label var x "Hours spent watching TV"

* Initially both values are set at the same
replace y = y[_n-1] + rnormal() if t>1
replace x = x[_n-1] + rnormal() if t>1
* However now there is no relationship between y and x

reg y x
* In the panel data case we are much less likely to reject the null, given that the random walk for each individual is independent of those of the other individuals.

tsset id t

* For the test of the unit root we will use the aptly named madfuller test (just kidding) written by Christopher F Baum; it can be found at http://ideas.repec.org/c/boc/bocode/s418701.html.
* The test can be installed via ssc install madfuller
madfuller x, lags(1)
madfuller y, lags(1)
* This test uncovers the truth that there is a random walk process at work.

* Note that the biggest limitation of the test is that the number of time periods must exceed the number of ids.  Thus there are 50 students and 52 quizzes.

Wednesday, July 4, 2012

The problem of near multicollinearity

* This simulation looks at the problem that happens when the variable of interest is correlated with other explanatory variables.  By not including the other variables you may bias the results but by including them you may absorb so much of the variation in the explanatory variable that you may be unable to identify the true coefficient of interest.

* Clear the old data from memory
clear

* Set the number of observations to generate to 1000
set obs 1000

set seed 10

* Generate a positive explanatory variable.
gen x = abs(rnormal())*3

* Imagine we are interested in the coefficient on x.

* Now create correlated explanatory variables
gen z1 = x^2 + rnormal()*10
gen z2 = x^1.75 + rnormal()*10
gen z3 = x^.5 + rnormal()*10 + z2/4
gen y  = 4*x + .5*z1 + .8*z2 + z3 + rnormal()*100

reg y x
* The problem with near-multicollinearity is that when you do not include the other correlated explanatory variables, their omission can heavily bias the coefficient on the one that is included.

reg y x z1 z2 z3
* But when you do include them, they can absorb so much of the variation that you have no hope of identifying the true effect of the variable of interest (x).

corr x z1 z2 z3
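
* As a hedged aside (not in the original post): Stata's built-in variance inflation
* factors provide a quick multicollinearity diagnostic after regress.
qui reg y x z1 z2 z3
estat vif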

* Let's use the Farrar-Glauber multicollinearity tests, a user-written Stata command by Emad Abd Elmessih Shehata.

* The ado file can be found at (http://ideas.repec.org/c/boc/bocode/s457417.html).

* However, it can also be installed via the command ssc install fgtest

fgtest y x z1 z2 z3
* All of the variables appear to be multicollinear (unsurprisingly).

* Thus we can see the Farrar-Glauber test is working well.

Tuesday, July 3, 2012

Post estimation counterfactual biprobit draws

* This Stata program offers the ability to generate counterfactual draws after estimating a biprobit.  These draws are unique because they reflect not only the actual data used in the original estimation but also any modifications to that data that the researcher may want to simulate.


* For this purpose I use the estimated probabilities to generate counterfactual data to be used in the simulation.  In a later step, when xb1 and xb2 are generated, the user can change these values, say by adding 1 to one of the independent variables, and see how the bivariate outcomes change as a result.  This goes beyond finding the marginal effect, since the user may also be interested in how the joint likelihood changes, which cannot easily be found by looking only at the marginal coefficients on Y1 and Y2.

* Some of the features of this code are available in the mvprobit command, which inspired this code.
* http://ideas.repec.org/c/boc/bocode/s432601.html written by Lorenzo Cappellari & Stephen P. Jenkins


* This code is written by Francis Smart

* Just to set up let's load a default data set.
webuse school, clear

local binary1 = "private"
local binary2 = "vote"

local indep_var = "logptax loginc years"

* We want the simulated draws to reproduce joint probabilities close to the true probabilities observed in the data.
gen ptarget00 = 0
gen ptarget10 = 0
gen ptarget01 = 0
gen ptarget11 = 0

replace ptarget00 = 1 if `binary1' == 0 & `binary2' == 0
replace ptarget01 = 1 if `binary1' == 0 & `binary2' == 1
replace ptarget10 = 1 if `binary1' == 1 & `binary2' == 0
replace ptarget11 = 1 if `binary1' == 1 & `binary2' == 1

graph pie, over(`binary1' `binary2') title(`binary1' and `binary2' in Data) name(pie1, replace)

sum ptarget*

* First the bivariate probit.  That is we are trying to predict two binary outcomes simultaneously.
biprobit `binary1' `binary2' `indep_var'

* The actual data here is completely arbitrary and the estimated relationships are not meaningful; it is just a default webuse data set.

* Now let's imagine that we would like to estimate a bivariate probit and then simulate random draws that correspond with the bivariate probit.

* First let us predict the xb values
predict xb1, xb equation(#1)
predict xb2, xb equation(#2)

* Change the xb1 and xb2 values in order to see how the joint distribution of draws will change.

* This is the prefix that is going to be in front of all of the generated probabilities
local pref = "phat_"

* This is the estimated correlation between the positive draws.
local rho = e(rho)

local xb1 = "xb1"
local xb2 = "xb2"

* Create a correlation matrix between dependent var1 and dependent var2
tempname A C
mat `A' = I(2)
mat `A'[1,2] = (`rho')
mat `A'[2,1] = (`rho')

* Decompose A into two matrices
mat `C' = cholesky(`A')

* Extract scalars (via tempname) from the second row of the Cholesky decomposition
forval j = 1/2 {
tempname c2`j'
scalar `c2`j'' = `C'[2,`j']
}
* These values are used to adjust the second generated probability
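* A note added for clarity: for a 2x2 correlation matrix [1, rho \ rho, 1] the Cholesky
* factor is [1, 0 \ rho, sqrt(1-rho^2)], so c21 = rho and c22 = sqrt(1 - rho^2).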

* Define the beginning of the equation loop
***************************************
tempvar d11 d01 sp1x sp0x arg111 arg001 z

qui gen double `arg111' = `xb1'
qui gen double `arg001' = `xb1'

qui gen `z' = runiform()

qui gen double `d11' =  invnorm(`z'*normprob( `arg111'))
qui gen double `d01' = -invnorm(`z'*normprob(-`arg001'))

* Generate the probability of seeing a draw of 1 on the first equation
qui gen double `sp1x' = normprob( `arg111')

* Generate the probability of seeing a draw of 0 on the first equation
qui gen double `sp0x' = normprob(-`arg001')

***************************************
tempvar spx1 spx0 arg112 arg002

qui gen double `arg112' = `xb2'-`d11'*`c21'
qui gen double `arg002' = `xb2'-`d01'*`c21'

* Generate the probability of seeing a draw of 1 on the second equation (conditional on the first-equation draw)
qui gen double `spx1' = normprob((`arg112')/`c22')

* Generate the probability of seeing a draw of 0 on the second equation (conditional on the first-equation draw)
qui gen double `spx0' = normprob((-`arg002')/`c22')

***************************************
* Start cross equation calculation

tempvar sp01 sp10 arg102 arg012

qui gen double `arg102' = `xb2'+`d11'*`c21'
qui gen double `arg012' = `xb2'+`d01'*`c21'

* Generate the probability of seeing a draw of 1 on the second equation (using the opposite-sign conditioning draw)
qui gen double `sp01' = normprob((`arg102')/`c22')

* Generate the probability of seeing a draw of 0 on the second equation (using the opposite-sign conditioning draw)
qui gen double `sp10' = normprob((-`arg012')/`c22')

***************************************
* End equation loop

gen `pref'00 =  `sp0x'*`spx0'
gen `pref'01 =  `sp1x'*`sp10'
gen `pref'10 =  `sp0x'*`sp01'
gen `pref'11 =  `sp1x'*`spx1'

gen `pref'_sum = `pref'11 + `pref'00 + `pref'10 + `pref'01

foreach i in "00" "10" "01" "11" {
* Let's make sure all of the generated probabilities add up to one.
  replace `pref'`i' = `pref'`i'/`pref'_sum

  * This is just a temporary place holder for the draws
  gen `pref'draw`i'=0
}

drop `pref'_sum

* Generate cumulative ranges over which each joint outcome will occur
gen outcome00 = `pref'00
gen outcome01 = `pref'01+`pref'00
gen outcome10 = `pref'10+`pref'01+`pref'00
gen outcome11 = `pref'11+`pref'10+`pref'01+`pref'00
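* A note added for clarity: these are cumulative probabilities; comparing a single
* uniform draw against them below implements inverse-CDF sampling over the four joint
* outcomes.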

* Make the draw
gen runit = runiform()

replace `pref'draw00=1 if runit < outcome00
replace `pref'draw10=1 if outcome00 < runit & runit < outcome01
replace `pref'draw01=1 if outcome01 < runit & runit < outcome10
replace `pref'draw11=1 if outcome10 < runit & runit < outcome11

sum *draw??, detail

* Make the draws of the dependent vars
gen `binary1'_draw=0
replace `binary1'_draw=1 if `pref'draw10==1 | `pref'draw11==1

gen `binary2'_draw=0
replace `binary2'_draw=1 if `pref'draw01==1 | `pref'draw11==1

graph pie, over(`binary1'_draw `binary2'_draw) ///
  title(`binary1' and `binary2' in Bivariate Draws) ///
  name(pie2, replace)

graph combine pie1 pie2
* The random bivariate draws are not a perfect replica of the original data



* Compare this to two separate probit estimations
probit `binary1' `indep_var'
predict `binary1'_p
gen  `binary1'_prob_draw = rbinomial(1, `binary1'_p)

probit `binary2' `indep_var'
predict `binary2'_p
gen  `binary2'_prob_draw = rbinomial(1, `binary2'_p)

graph pie, over(`binary1'_prob_draw `binary2'_prob_draw) ///
  title(`binary1' and `binary2' in Two Probits) ///
  name(pie3, replace)

graph combine pie1 pie2 pie3
* However, they are much better than running two separate probits and drawing from them.