## Thursday, May 10, 2012

### Reverse Engineering a Probit

* Stata simulation of a Probit

* Image that you are interested in estimating a binary response variable.
* For instance the likelihood of a farmer choosing to go to market with
* some portion of his or her yield rather than consume at the farmers
* personal plot.

* This is a binary response variable (i.e. one taking on a value of 0 or 1).

* There are a number of functional forms that help transform a linear model
* with no true minimum or maximum into a binary model for which the probability
* of an outcome approaches either 0 or 1.

* The most common choices are the probit and logit models though other options
* present themselves as well.

* From a maximum likelihood perspective think of ordinary least squares as
* the model that maximizes the probity of observing an outcome when the error
* is distributed with the probability density function of the normal distribution.

* max_choosing{beta} ln(pdf(u)) where u=Y-XB

* For a probit, what you are instead doing is maximizing the likelihood of
* observing a cumulative density function which takes 'yhat'.  That is:

* max_choosing{beta} ln(CFD(XB)) where Yhat=XB

* So this guides us on how to simulate a probit model accurately.

* Set the random seed
set seed 101
* Clear the memory
clear
* Tell stata to prepare 1000 rows for data entry
set obs 10000

* Generate yield as a chi-squared distribution
gen yield = rnormal()^2

* Generate number of family members. Poisson has the same mean and variance.
* Add +1 to ensure that the minimum household size is 1.
gen household_size = rpoisson(5)+1

* Generate the error
gen u=rnormal()*5

* First we generate the probability of selling at the market.
* In order to do that we need to do find the CDF of the XB+u
* values.  Stata first needs to know the variance and mean of
* XB+u.

**** Note the coefficient on yeild and household size
gen XB_u= 10*yeild - household_size + u

sum XB_u

* Now we normalize.
gen XB_u_norm = (XB_u-r(mean))/r(sd)

* Then we calculate the probabilities using normal CDF.
gen double Py=normal(XB_u_norm)
label var Py "True probability of market sale"

* Finally we need to change this from a probability vector to a response
* vector.
gen y=rbinomial(1,Py)

************************************************************************
* Simulation END

* Now lets see how well the probit does at recovering our initial estimates:

probit y yeild household_size

predict yhat
label var yhat "Estimated probability of market sale"

* Now, wait a second.  These values look way too small!
* But not so.  Take a look at the normalization we did above.

sum XB_u
di "This normalization shrunk the coefficients by a factor of " r(sd)

* So to recover the estimates multiply by r(sd)

di "The estimate for yield is " r(sd)*_b[yeild]
di "The estimate for household size is "  r(sd)*_b[household_size]

* Not exactly on but pretty close.

two (line yhat Py, sort) if _n/10==int(_n/10), title(Probit Probability Fit compared with True)
* The if statement thins out the distribution, telling Stata to show only
* 1 out of 10 observations.

* Later we can compare how well the probit performs relative to a logit and a
* linear heteroskedastically robust OLS model.