Monday, August 19, 2013

Question and Answer: Generating Binary and Discrete Response Data

I was recently contacted by a reader with two very specific questions, and I thought this would be a good topic to respond to publicly. He would like to simulate his data:
I have firm level data and the model is discrete choice with the main explanatory variable also a binary choice:  First question is how can I calibrate the data generation model? 


This is a fundamental question for any kind of econometric model.  How you calibrate your data implies the inherent structure of your data, which in turn implies what method you should use to attempt to recover your parameters.  Some data generating processes do not yet have econometric solutions, yet many do.

In general you can calibrate your data by modifying i. the parameters, ii. the distribution of the explanatory variables, or iii. the distribution of the errors.
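As a rough illustration of those three levers, here is a minimal R sketch. The helper `sim_rate` and all of the coefficient values are hypothetical and chosen only for illustration; it reports how the share of successes moves when the intercept, the spread of an explanatory variable, or the error distribution (normal versus logistic) is changed.

```r
set.seed(1)
Nobs <- 10^4

# Hypothetical helper: simulate a probit outcome and return the share of successes
sim_rate <- function(B0, B1, sdX) {
  X1 <- rnorm(Nobs, sd = sdX)   # ii. distribution of the explanatory variable
  P  <- pnorm(B0 + B1 * X1)     # i. parameters enter through the linear index
  mean(rbinom(Nobs, 1, P))      # one Bernoulli draw per observation
}

base   <- sim_rate(B0 = -0.2, B1 = 0.5, sdX = 1)
shiftB <- sim_rate(B0 =  0.5, B1 = 0.5, sdX = 1)  # larger intercept
wideX  <- sim_rate(B0 = -0.2, B1 = 0.5, sdX = 3)  # more spread in X1

# iii. distribution of the errors: logistic CDF instead of the normal CDF
X1 <- rnorm(Nobs)
logit_rate <- mean(rbinom(Nobs, 1, plogis(-0.2 + 0.5 * X1)))

print(c(base = base, shiftB = shiftB, wideX = wideX, logit = logit_rate))
```

Comparing the printed success rates against the rate you see in your real data is one simple way to decide which lever to adjust.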

In the binary response case the most common models are probit and logit.  To simulate data you generate your underlying linear model and apply the appropriate CDF to it, which gives you the probability of a success for each observation.  Finally, you make a random draw based on those probabilities for each outcome being simulated.

I have numerous code examples demonstrating this:
Stata: (Reverse Engineering a Probit) (Probit vs Logit)
R:
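Nobs <- 10^4
# Explanatory variables, including a constant
X <- cbind(cons=1, X1=rnorm(Nobs), X2=rnorm(Nobs), X3=rnorm(Nobs))
B <- c(B0=-.2, B1=-.1, B2=0, B3=-.2) # True coefficients
P <- pnorm(X%*%B) # Probability of success from the normal CDF
SData <- as.data.frame(cbind(Y=rbinom(Nobs,1,P), X)) # Draw outcomes and combine data
summary(glm(Y ~ X1 + X2 + X3, family = binomial(link = "probit"), data = SData))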
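For the logit case the same recipe applies with the logistic CDF in place of the normal CDF. The following is a minimal sketch under the same illustrative coefficients; the object name `fit` is mine and not from the original examples.

```r
set.seed(2)
Nobs <- 10^4
X <- cbind(cons = 1, X1 = rnorm(Nobs), X2 = rnorm(Nobs), X3 = rnorm(Nobs))
B <- c(B0 = -.2, B1 = -.1, B2 = 0, B3 = -.2)   # true coefficients

P <- plogis(X %*% B)                            # logistic CDF instead of pnorm
SData <- as.data.frame(cbind(Y = rbinom(Nobs, 1, P), X))

# Estimate with a logit link and compare against the true values
fit <- glm(Y ~ X1 + X2 + X3, family = binomial(link = "logit"), data = SData)
print(round(cbind(true = B, estimate = coef(fit)), 3))
```

With this many observations the estimates should land close to the true coefficients, which is a quick check that the data generating process and the estimator match.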

Discrete Data
As for discrete data, it is less clear what the optimal choice is. I prefer multinomial logistic regression, which is basically an extension of the logit model with a few interesting caveats.

Stata: (Simulating Multinomial Logit)
R: (here is an article dealing specifically with using R to create discrete response data)

Nobs <- 10^4
X <- cbind(cons=1, X1=rnorm(Nobs),X2=rnorm(Nobs),X3=rnorm(Nobs))
# Coefficients, each input vector (c) is associated with a different outcome
B <- cbind(0, c(B0=-.2, B1=-.1,B2=0,B3=-.2), c(B0=.3, B1=0,B2=.6,B3=.4))
# Everything is relative to option 1 which is the default
num <- exp(X%*%B) # Numerator
den <- apply(num,1,sum) # Denominator
P <- num * 1/cbind(den,den,den) # Probability
CP <- cbind(P[,1],P[,1]+P[,2]) # Cumulative probabilities
U <- runif(Nobs) # Draw from the uniform distribution
Y <- rep(0,Nobs) ; Y[U>CP[,1]]<-1; Y[U>CP[,2]]<-2 # Calculate outcome

SData <- as.data.frame(cbind(Y, X)) # Combine data
require("nnet")
summary(Mlogit <- multinom(Y ~ X1 + X2 + X3, data = SData))


  1. Thanks Francis. It is really helpful. I have another question. In my current research I have a probit model where one independent variable is binary and endogenous. I looked for instruments and, due to data limitations, they are also binary. I used -ivreg- with instruments. Now I am looking for some help with simulation for the case where I have to generate binary variables. I understand that I cannot use -drawnorm- in this case.
    Thanks again for your response.

    1. Hi Syed, I am glad you found the post helpful.

      I am not sure yet if I fully understand what you need. I think this is what you are saying:
      where R1 and R2 are random uniform draws
      theta is the CDF of the normal distribution
      G is some coefficient on the endogenous variable Y1.
      B is the coefficient on the exogenous variables X
      N is the coefficient on the instrumental variables Z
      M is the coefficient on the exogenous variables X as they predict Y1

      Y2 = 1[theta(X*B + Y1*G)>R1]

      where: Y1 = 1[theta(Z*N + X*M)>R2]

      It is not very hard to generate data looking like this. Is this what you are asking? However, solving this is slightly more complicated.
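A minimal R sketch of the two equations above, taken literally, might look like the following. The coefficient values and the 50/50 binary instrument are assumptions made purely for illustration. Note that because R1 and R2 are drawn independently here, Y1 is not actually endogenous in this sketch; the two draws would need to be correlated for that.

```r
set.seed(3)
Nobs <- 10^4

X <- rnorm(Nobs)             # exogenous variable
Z <- rbinom(Nobs, 1, 0.5)    # binary instrument (assumed 50/50 split)

# Illustrative coefficients (assumptions, not from the original comment)
N <- 1; M <- 0.5             # first stage: Z and X predicting Y1
B <- -0.3; G <- 0.8          # outcome equation: X and Y1 predicting Y2

R1 <- runif(Nobs)            # independent uniform draws, as in the formulas;
R2 <- runif(Nobs)            # independence means Y1 is not endogenous here

Y1 <- as.numeric(pnorm(Z * N + X * M) > R2)    # binary regressor
Y2 <- as.numeric(pnorm(X * B + Y1 * G) > R1)   # binary outcome

print(table(Y1, Z))          # the instrument shifts the share of Y1 = 1
print(mean(Y2))
```

No -drawnorm- equivalent is needed: the binary instrument comes from a Bernoulli draw and the binary variables come from comparing a probability with a uniform draw.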

      Check out this post related to the Biprobit which I think should answer all of your questions: