Tuesday, July 30, 2013

Diff-n-Diff of estimates from Structural Equation Models (SEM)

Structural equation modeling (SEM) is generally the practice of estimating latent factor(s) from observable measures.  It is similar to factor analysis in that it typically reduces a problem with potentially many dependent variables to a lower dimension which we hope is interpretable.

Let’s imagine a relatively simple model where this might be useful.  We have five responses which are measures of 2 latent traits we are interested in.  For half of our sample we administer an intervention with the hopes that it will increase our latent traits.  We administer a pretest measuring the 5 observables before the treatment and a posttest after measuring the 5 observables after.

Stata Code
set obs 4000

gen id = _n

gen eta1 = rnormal()
gen eta2 = rnormal()

 * Generate 5 irrelevant factors that might affect each of the 
 * different responses on the pretest 
gen f1 = rnormal()
gen f2 = rnormal()
gen f3 = rnormal()
gen f4 = rnormal()
gen f5 = rnormal()

 * Now let's apply the treatment 
expand 2, gen(t)   // double our data 

gen treat=0
replace treat=1 if ((id<=_N/4)&(t==1))

 * Now let's generate our changes in etas 
replace eta1 = eta1 + treat*1 + t*.5
replace eta2 = eta2 + treat*.5 + t*1

 * Finally we generate out pre and post test responses 
gen v1 = f1*.8  + eta1*1  + eta2*.4   // eta1 has more loading on 
gen v2 = f2*1.5 + eta1*1  + eta2*.3   // the first few questions 
gen v3 = f3*2   + eta1*1  + eta2*1  
gen v4 = f4*1   + eta1*.2 + eta2*1   // eta2 has more loading on 
gen v5 = f5*1   +           eta2*1   // the last few questions 

 * END Simulation 
 * Begin Estimation 

sem (L1 -> v1 v2 v3 v4 v5) (L2 -> v1 v2 v3 v4 v5) if t==0
predict L1 L2, latent

sem (L1 -> v1 v2 v3 v4 v5) (L2 -> v1 v2 v3 v4 v5) if t==1
predict L12 L22, latent

replace  L1 = L12 if t==1
replace  L2 = L22 if t==1

 * Now let's see if our latent predicted factors are correlated with our true factors. 
corr eta1 eta2 L1 L2

 * We can see already that we are having problems.   
 * I am no expert on SEM so I don't really know what is going wrong except 
 * that eta1 is reasonably highly correlated with L1 and L2 and 
 * eta2 is less highly correlated with L1 and L2 equally each 
 * individually, which is not what we want. 

 * Well too late to stop now.  Let's do our diff in diff estimation. 
 * In this case we can easily accomplish it by generating one more variable. 

 * Let's do a seemingly unrelated regression form to make a single joint estimator. 

sureg (L1 t id treat) (L2 t id treat)

 * Now we have estimated the effect of the treatment given a control for the 
 * time effect and individual differences.  Can we be sure of our results? 
 * Not quite.  We are treating L1 and L2 like observed varaibles rather than 
 * random variables we estimated.  We need to adjust out standard errors to 
 * take this into account.  The easiest way though computationally intensive is 
 * to use a bootstrap routine. 

 * This is how it is done.  Same as above but we will use temporary variables. 
cap program drop SEMdnd
program define SEMdnd

  tempvar L1 L2 L12 L22
  sem (L1 -> v1 v2 v3 v4 v5) (L2 -> v1 v2 v3 v4 v5) if t==0
  predict `L1' `L2', latent
  sem (L1 -> v1 v2 v3 v4 v5) (L2 -> v1 v2 v3 v4 v5) if t==1
  predict `L12' `L22', latent

  replace  `L1' = `L12' if t==1
  replace  `L2' = `L22' if t==1

  sureg (`L1' t id treat) (`L2' t id treat)

  drop `L1' `L2' `L12' `L22'


SEMdnd   // Looking good 

 * This should do it though I don't hae the machine time available to wait 
 * for it to finish. 
bs , rep(200) cluster(id): SEMdnd 

Formatted By EconometricsbySimulation.com

Wednesday, July 24, 2013

Power Analysis by Simulation: R, RCT, Malaria Example

I have received a number of requests for demonstration code on how to perform a power analysis using simulation in R.  I have already demonstrated howto do this in Stata but lacked the easy to use Stata command “simulate” that I preferred.  However, in a recent post I have written up a command very similar to simulate in R called SimpleSim.

This command takes is capable of taking a vector of parameters and feeding them into a function which returns a vector of results.  The results are turned into a table which can be used to show the rates of type 1 and type 2 error given a particular sample size and underlying effect.

Let's imagine that we are interested in testing the effect that bednet usage has on the rate of infection of malaria.  We are concerned that the decision to purchase and use bednets might endogenous as a result of those more likely to come down with malaria being more prone to seeing bednets as a valuable investment.

So, in response we are going to design an randomized control trial (RCT) in which we either supply fully paid bednet coupons or half discounted coupons each to different thirds of the population leaving the remaining third as a control.  We are concerned that if supply the bednets at no cost then the recipients will not value them so that is why we give coupons that only reduce the cost of the bednets by 1/2.

After distributing the coupons, we would like to use the random distribution of the coupons as an instrument for use of a bednet.  Ultimately we would like to do two levels of analysis.

1. Using instrumental variables (IV) to see how effective bednets are at preventing Malaria in an active population. This is actually an interesting question because often there is imperfect compliance because few individuals spend all of the hours at dusk and dawn under a bednet.  In this case I should use an estimator that takes the form of the first stage being a binary regressor followed by a second stage binary regressor.  However, since linear models seem to be pretty good at approximating average partial effects I am just going to use a IV estimator.
2. Using ordinary least squares (OLS) we would like to see how effective each of the coupon programs are at increasing use of bednets within the homes.

require("AER") # We will use the ivreg function later
# We define a function that both simulates a population and estimates 
# results and returns and returns the results.
MalariaIV <- function(
  nsim = 100,
  npop = 1000,    # Define the sampling population
  treat0 = 1/3,   # Define the proportion which is control
  treat1 = 1/3,   # Define the proportion which is treated with free bednets
  treat2 = 1/3,   # Define the proportion which is treated with 50% cost
  t0comp = .05,   # Bednet useage among control group
  t1comp = .85,   # Bednet useage if recieving free net
  t2comp = .5 ,   # Bednet useage if recieving 50% cost
  malariaRT = .85,# Rate of getting malaria without bednet
  netRT = .45,    # Rate of getting malaria with bednet
  Tdetect = 1,    # Likihood of detecting malaria if present
  Fdetect = 0,    # Likihood of detecting malaria if not present
  alpha = .05     # The alpha level that a p-value must be below
) {
  # Define how the function works
  # First, define a vector to store results
  pvalues <- NULL
  for (i in 1:nsim) { # Repeat the simulation a number of times
    # Generate the population by technology useage
    simdata <- data.frame(
      control=c(rep(1,npop*treat0),rep(0,npop*treat1), rep(0,npop*treat2)),
      treat1 =c(rep(0,npop*treat0),rep(1,npop*treat1), rep(0,npop*treat2)),
      treat2 =c(rep(0,npop*treat0),rep(0,npop*treat1), rep(1,npop*treat2)))
    # Calculate the actual population generated (should be 999 in this case)
    npeople <- nrow(simdata)
    # Generate the rate of bednet usage.
    simdata$bednet <-
      simdata$control*rbinom(npeople,1,t0comp) + # Bednet usage control
      simdata$treat1 *rbinom(npeople,1,t1comp) + # Free
      simdata$treat2 *rbinom(npeople,1,t2comp)   # 50% cost
    # Now let's generate the rate of malaria
    simdata$malaria <-
    # Finally generate the rate of malaria detection as a function
    # of true parasite levels.
    simdata$mdetect <- 
      (simdata$malaria==0)*rbinom(npeople,1,Fdetect)+ # False detection
      (simdata$malaria==1)*rbinom(npeople,1,Tdetect)  # True detection
    # Time to do our simple estimation of the effect of treatment on bednet use
    lmcoef <- summary(lm(bednet~treat1+treat2,data=simdata))$coefficients
    # Now let's try the 2SLS to estimate the effect of bednets on contraction
    # of malaria.
    ivregest <- ivreg(mdetect~bednet | treat1+treat2, data=simdata)
    ivregcoef <- summary(ivregest)$coefficients
    # Save the rejection of the null rates
    pvalues <- rbind(pvalues,c(treat1=lmcoef[2,4]<alpha,
  # Calculate the mean rejection rate for each coefficient.
# Running it once we can see that we get a single set of results
# where we easily reject the null.  However, we want to know
# what happens when bednets are not so effective or malaria is harder
# to detect.  We can modify the parameters fed into the model to test
# these questions manually or we could use the SimpleSim function from
# a previous post.
# This could take a little while to run since there are 256 combinations
# to try and each of them will be run 10 times.
# The above command gives back lots of data but it is not always very easy
# to understand in a matrix form.  It is often easier to just vary one
# paramter at a time.
sample.size <- SimpleSim(fun=MalariaIV, 
# Looking at just sample size
# Save the results to single long data format to be useable by ggplot2
results <- with(sample.size, data.frame(reject = as.numeric(
  id=rep(c("iv","treat2","treat1"), each=length(npop)),
p <- ggplot(results, aes(npop, reject))
# Normally a 80% detection rate is the minimum rejection rate needed 
# to justify a study. 
p + geom_point(aes(colour =id)) + 
  geom_line(aes(group=id)) + 
  geom_hline(yintercept = .8) 

# Thus we want to ensure our study has at least 5000 participants.
# I think there might be some additional considerations for the
# number of participants required to ensure that there is no
# false rejection of the null.  However, I am no expert on Power 
# Analysis so this is what I know.
Formatted by Pretty R at inside-R.org

Monday, July 22, 2013

Loading up your custom toolkits at startup - R

If you are anything like me then you have dozens if not hundreds of personal functions that you have written to accomplish a number of tasks.  Many of the tasks that you would like to do are similar to previous tasks that you have already done.  So how are you going to bridge the functions across your disjoint projects?

There are a four good options I see which might fit different projects:

1. You copy and paste your old code at the beginning of your new code so that when you run the code the previously designed functions are available.  Pros: simple, Cons: makes your code a lot longer than you may prefer.

2. You create a package or the skeleton of the package and load your functions into it as you write them. When you write new code that uses those functions you just load your package. Pros: this creates very clean code, Cons: this can be a little complicated and your code might not be at a level of refinement that would warrant packaging up your code and then potentially having to make revisions on it and updating the package frequently.

3. You could source your code with the "source" function.  Pros: this allows your code to be nearly as clean as loading your code as its own package and it allows you to easily update your source file as frequently you may want.  Cons: Nothing obvious.

4. Set up start-up profiles with R so that you can customize your programming environment easily.  This solution is kind of the point of this post however it is not clear to me that this is the best option.  However, this is what I have done.  I used Quick-R: Customizing Startup article as a reference.

In the .Rprofile file which I found in "C:/users/user_name/Documents":

.First &lt;- function(){
  # Load up functions that I always want to be available.
  # Optionally load up toolkit profiles for each R programming paradigm. 
  print("Profiles Available: Shiny, Sim, Concerto")
  Shiny <<- function() source("C:/Dropbox/R-Startup-Profiles/Caller.R")
  Sim <<- function() source("C:/Dropbox/R-Startup-Profiles/Simulation.R")
  Concerto <<- function() source("C:/Dropbox/R-Startup-Profiles/Concerto.R")

Formatted by Pretty R at inside-R.org

So now every time R startups it will load my always needed functions which are stored in the Base.R file.  Then I will choose which environment I want to program in.  If for instance I want to program a Shiny app then I would just send the command:
and that will load my Shiny profile which is pretty simple:

cat("Loading Shiny!")

Formatted by Pretty R at inside-R.org

Friday, July 19, 2013

Simulation of Blackjack: the odds are not with you

I often want to simulate outcomes varying across a set of 
parameters. In order to accomplish this in an efficient 
manner I have coded up a little function that takes parameter
vectors and produces results. First I will show how to set it
up with some dummy examples and next I will show how it can 
be used to select the optimal blackjack strategy.
SimpleSim <- function(..., fun, pairwise=F) {
  # SimpleSim allows for the calling of a function varying 
  # multiple parameters entered as vectors. In pairwise form 
  # it acts much like apply. In non-paiwise form it makes a 
  # combination of each possible parameter mix
  # in a manner identical to block of nested loops.
  returner <- NULL
  L        <- list(...)
  # Construct a vector that holds the lengths of each object
  vlength <- unlist(lapply(L, length)) 
  npar    <- length(vlength)
  CL      <- lapply(L, "[", 1) # Current list is equal to the first element
 # Pairwise looping
 if (pairwise) {
  # If pairwise is selected than all elements greater than 1 must be equal.
  # Checks if all of the elements of a vector are equal
  if (!(function(x) all(x[1]==x))(vlength[vlength>1])) {
   print(unlist(lapply(L, length)))
   stop("Pairwise: all input vectors must be of equal length", call. =F)
  for (i in 1:max(vlength)) { # Loop through calling the function
   CL[vlength>1]  <- lapply(L, "[", i)[vlength>1] # Current list
   returner <- rbind(returner,c(do.call(fun, CL),pars="", CL))
 } # End Pairwise
 # Non-pairwise looping
 if (!pairwise) {
  ncomb <- prod(vlength) # Calculate the number of combinations
  print(paste(ncomb, "combinations to loop through"))
  comb <- matrix(NA, nrow=prod(vlength), ncol=npar+1)
  comb[,1] <- 1:prod(vlength) # Create an index value
  comb <- as.data.frame(comb) # Converto to data.frame
  colnames(comb) <- c("ID", names(CL))
  for (i in (npar:1)) { # Construct a matrix of parameter combinations
   comb[,i+1] <- L[[i]] # Replace one column with values
   comb<-comb[order(comb[,(i+1)]),] # Reorder rows
  for (i in 1:ncomb) {
   for (ii in 1:npar) CL[ii] <- comb[i,ii+1]
   returner <- rbind(returner,c(do.call(fun, CL),pars="", CL))
 } # End Non-Pairwise
# Let's first define a simple function for demonstration
minmax <- function(...) c(min=min(...),max=max(...))
# Pairwise acts similar to that of a multidimensional apply across columns 
SimpleSim(a=1:20,b=-1:-20,c=21:40, pairwise=T, fun="minmax")
# The first set of columns are those of returns from the function "fun" called.
# The second set divided by "par" are the parameters fed into the function.
SimpleSim(a=1:20,b=-1:-20,c=10, pairwise=T, fun="minmax")
# Non-pairwise creates combinations of parameter sets.
# This form is much more resource demanding.
SimpleSim(a=1:5,b=-1:-5,c=1:2, pairwise=F, fun="minmax")
# Let's try something a little more interesting.
# Let's simulate a game of black jack strategies assuming no card counting is possible.
blackjack <- function(points=18, points.h=NULL, points.ace=NULL, 
                      cards=10, cards.h=NULL, cards.ace=NULL,
                      sims=100, cutoff=10) {
  # This function simulates a blackjack table in which the player
  # has a strategy of standing (not asking for any more cards)
  # once he has either recieved a specific number of points or 
  # a specific number of cards.  This function repeates itself sims # of times.
  # This function allows for up to three different strategies to be played.
  # 1. If the dealer's hole card is less than the cuttoff
  # 2. If the dealer's hole card is greater than or equal to the cuttoff
  # 3. If the dealer's hole card is an ace
  # In order to use 3 level strategies input parameters as .h and .ace
  # It returns # of wins, # of losses, # of pushes (both player and dealer gets 21)
  # and the number of blackjacks.
  # This simulation assumes the number of decks used is large thus
  # the game is like drawing with replacement.
  if (is.null(points.h))   points.h   <- points
  if (is.null(points.ace)) points.ace <- points.h
  if (is.null(cards.h))    cards.h    <- cards
  if (is.null(cards.ace))  cards.ace  <- cards.h
  bdeck <- c(11,2:9,10,10,10,10) # 11 is the ace
  bdresult <- c(ppoints=NULL, pcards=NULL, dpoints=NULL, dcards=NULL)
  for (s in 1:sims) {
   dhand <- sample(bdeck,1) # First draw the deal's revealed card
   phand <- sample(bdeck,2, replace=T)
   # Specify target's based on dealer's card
   if (dhand<cutoff) {
     pcuttoff <- points
     ccuttoff <- cards
   if (dhand>=cutoff) {
     pcuttoff <- points.h
     ccuttoff <- cards.h
   if (dhand==11) {
     pcuttoff <- points.ace
     ccuttoff <- cards.ace
   # player draws until getting above points or card count
   while ((sum(phand)<pcuttoff)&(length(phand)<ccuttoff)){
     phand <- c(phand, sample(bdeck,1))
       # If player goes over then player may change aces to 1s 
       if (sum(phand)>21) phand[phand==11] <- 1
   # Dealer must always hit 17 so hand is predetermined
   while (sum(dhand)<17) {
     dhand <- c(dhand, sample(bdeck,1))
     # If dealer goes over then dearler may change aces to 1s
     if (sum(dhand)>21) dhand[dhand==11] <- 1
   bdresult <- rbind(bdresult, 
        c(ppoints=sum(phand), pcards=length(phand), 
          dpoints=sum(dhand), dcards=length(dhand)))
  # Calculate the times that the player wins, pushes (ties), and loses
  pbj <- (bdresult[,1]==21) & (bdresult[,2]==2)
  dbj <- (bdresult[,3]==21) & (bdresult[,4]==2)
  pwins <- ((bdresult[,1] >  bdresult[,3]) & (bdresult[,1] <  22)) | (pbj & !dbj)
  push  <- (bdresult[,1] == bdresult[,3]) | (pbj & dbj)
  dwins <- !(pwins | push)
  # Specify the return.
blackjack(points=18, sims=4000)
# We can see unsurprisingly, that the player is not doing well.
blackjack(points=18, points.h=19, sims=4000)
# We can see that by adopting a more aggressive strategy for when
# the dealer has a 10 point card or higher, we can do slightly better.
# But overall, the dealer is still winning about 3x more than us.
# We could search through different parameter combinations manually to
# find the best option.  Or we could use our new command SimpleSim!
MCresults <- SimpleSim(fun=blackjack, points=15:21, points.h=18:21, 
                       points.ace=18:21, cutoff=9:10, cards=10, sims=100)
# Let's now order our results from the most promising.
# By the simulation it looks like we have as high as a 50% ratio of loses to wins.
# Which means for every win there are 2 loses.
# However, I don't trust it since we only drew 100 simulations.
# In addition, this is the best random draw from all 224 combinations which each
# have different probabilities.
# Let's do the same simulation but with 2000 draws per.
# This might take a little while.
MCresults <- SimpleSim(fun=blackjack, points=15:21, points.h=18:21, 
                       points.ace=18:21, cutoff=9:10, cards=10, sims=5000)
# Let's now order our results from the most promising.
hist(unlist(MCresults[,1]), main="Across all combinations\nN(Win)/N(Loss)", 
     xlab = "Ratio", ylab = "Frequency")

# The best case scenario 38% win to loss ratio appears around were we started, 
# playing to hit 18 always and doing almost as well when the dealer is high
# (having a 10 or ace) then playing for 19.
# Overall, the odds are not in our favor.  For every win we expect 1/.38 (2.63) loses.
Highlighted by Pretty R at inside-R.org

Thursday, July 18, 2013

R-squared and Adjusted R-squared

* Stata code

* When I first started taking stats there was some discussion between the merits of R2 measures and that of adjusted R2.

* People were concerned that including any additional estimators by definition increased the R2 measure so the need to come up with a measure that did not depend on number of regressors.

* In this small command I generates a dependent variable then generate independent explanatory variables and see what happens to the r2 and adjusted r2 when we increase the number of explanatory variables.
cap program drop R2r
program define R2r
  set obs `2' // The second argument is the number of observations
  tempvar y
  gen `y' = rnormal()  // Generate the dependent variable
  forv i=1(1)`1' {  // Loop from 1 to the number of variables defined as the first argument.
    tempvar v`i'
    gen `v`i'' = rnormal()
  reg `y' `v1'-`v`1'' // Do the estimation.

set seed 1

R2r 2 10000 /// The r-squared is quite small with only two dependent variables
R2r 20 10000 /// The r-squared is much larger

* But we should not take the results of just two simulations lets try this using the simulate command
simulate r2=e(r2) r2a=e(r2_a), rep(200): R2r 2 10000
* The R2 is a little less than .2%
* The adjusted R2 is a little less than 0

simulate r2=e(r2) r2a=e(r2_a), rep(200): R2r 20 10000
* Now the R2 using 20 observations is close to .2%
* The adjusted R2 is very close to zero

simulate r2=e(r2) r2a=e(r2_a), rep(200): R2r 200 10000
* Almost identical results with the 2 squared on average being around 2%

* Using 1000 observations the r2 is more sensitive
simulate r2=e(r2) r2a=e(r2_a), rep(200): R2r 2 1000
* Now the R2 is a little greater than .2%
* The adjusted R2 is little greater than .02%

simulate r2=e(r2) r2a=e(r2_a), rep(200): R2r 20 1000
* Now the R2 using 20 observations is 2%

simulate r2=e(r2) r2a=e(r2_a), rep(200): R2r 200 1000
* Now the average R2 squared is greater tha 20%

* Overall, the take away seems to be, only worry about the R2 when the number of observations are low and the number of regressors are large.

Tuesday, July 9, 2013

A Sudoku Puzzle Solver - attempt 1

I have programmed up a R based Sudoku problem solver for Sudoku puzzles of that only require simple inference.  In these puzzles a solution can be found using only first order inference.  This solver can be found at the end of the code located in the git:


A puzzle of this form can look like this (generated from the R code on the git)
A solution to this problem can be found using the following steps generated from the solver:
[1] "Round 1 -sub out 1 4 with 4"  That is sub out row 1 column 4 with a 4.
[1] "Round 1 -sub out 1 6 with 5"
[1] "Round 1 -sub out 2 7 with 5"
[1] "Round 1 -sub out 2 8 with 3"
[1] "Round 1 -sub out 2 9 with 8"
[1] "Round 1 -sub out 3 5 with 9"
[1] "Round 1 -sub out 3 9 with 2"
[1] "Round 1 -sub out 4 7 with 2"
[1] "Round 1 -sub out 5 5 with 4"
[1] "Round 1 -sub out 5 8 with 6"
[1] "Round 1 -sub out 6 1 with 2"
[1] "Round 1 -sub out 6 6 with 9"
[1] "Round 1 -sub out 7 5 with 5"
[1] "Round 1 -sub out 8 9 with 9"
[1] "Round 1 -sub out 9 1 with 3"
[1] "Round 1 -sub out 9 2 with 2"
[1] "Round 1 -sub out 9 3 with 4"
[1] "Round 1 -sub out 9 6 with 1"
[1] "Round 1 -sub out 9 9 with 7"
[1] "Round 2 -sub out 1 1 with 1"
[1] "Round 2 -sub out 1 2 with 3"
[1] "Round 2 -sub out 1 3 with 2"
[1] "Round 2 -sub out 2 1 with 7"
[1] "Round 2 -sub out 3 1 with 6"
[1] "Round 2 -sub out 3 2 with 5"
[1] "Round 2 -sub out 3 8 with 1"
[1] "Round 2 -sub out 4 2 with 8"
[1] "Round 2 -sub out 4 4 with 6"
[1] "Round 2 -sub out 4 8 with 9"
[1] "Round 2 -sub out 4 9 with 5"
[1] "Round 2 -sub out 5 2 with 7"
[1] "Round 2 -sub out 5 3 with 5"
[1] "Round 2 -sub out 5 4 with 2"
[1] "Round 2 -sub out 5 9 with 3"
[1] "Round 2 -sub out 7 1 with 8"
[1] "Round 2 -sub out 7 2 with 9"
[1] "Round 2 -sub out 7 3 with 6"
[1] "Round 2 -sub out 8 4 with 8"

The results look like this.  You can see that it was not able to solve all of the missing blocks though it is pretty close.
Interestingly, I think this puzzle has two potential solutions.
This solver is not at the level I wanted it to be at.  I can imagine two more levels of inference that it could handle: one in which the solution uses secondary inference information.  Such as knowing that if it chooses a particular value the other block cannot be filled with that value.  This I call a type 2 and finally a guess and check method which I call a type 3.  The guess and check method should be easy to program but have the difficulty that if there are many squares not solved by the previous methods, it will potentially bog down the machine as it has to check a near infinite number of potential solutions.

However, there should be very few puzzles in which guess and check would be required since those puzzles would be even harder for most people to solve than they would for an algorithm.

I am not satisfied with this solution but I wanted to post this because I had worked it out a few weeks ago and my real work is calling so I am not sure when I will get back to this.