## Thursday, June 28, 2012

### Cragg's Double hurdle model used to explain censoring

* Cragg's 1971 lognormal hurdle (LH) model
* (See Wooldridge 2010 page 694)

* With a double hurdle model we assume that two components contribute to a process.

* First, there is the decision to do something, say, go to market and sell produce. Second, there is the decision of how much produce to sell.

* This model is distinctly different from, say, a truncated normal regression, which assumes the y variable is linear in x and keeps y above zero only because the data are truncated, i.e. people cannot sell negative quantities of produce (buy) in the data set.

* Instead, we want to think of the two decisions as distinct and independent (conditional on observables). This might seem unreasonable at first. However, if you are a grower, you will probably decide to sell on the market before you decide how much to sell. That is, if you plant only cabbage, then you are probably not going to settle for eating nothing but cabbage, no matter how your crop turns out.

* Likewise, if you are going to sell, the decision of how much to sell may be uncorrelated with the original decision to sell if, say, you plan to sell on the market at the time of planting and decide how much to sell based on how well your crop does.

* It is also easy to think of examples in which this assumption might fail. For instance, if a bumper crop one year leaves you with more than you can eat, you may go to the market to sell even though you did not plan to. The amount you sell will then probably also be a function of the bumper crop.
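
One consequence of the conditional independence assumption can be written down directly. The sketch below applies to the lognormal hurdle version of the model and uses notation that is an assumption here rather than the post's own: $\gamma$ and $\beta$ for the two coefficient vectors, $\sigma^2$ for the variance of the log-quantity error, and $\Phi$ for the standard normal CDF.

```latex
\begin{aligned}
P(s = 1 \mid x) &= \Phi(x\gamma) \\
\log w \mid x &\sim N(x\beta,\ \sigma^2) \\
y &= s \cdot w \\
E(y \mid x) &= P(s = 1 \mid x)\, E(w \mid x)
             = \Phi(x\gamma)\, \exp\!\left(x\beta + \tfrac{\sigma^2}{2}\right)
\end{aligned}
```

The factorization in the last line is what independence buys: a regressor can move expected sales through the probability of selling, through the quantity sold, or through both.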

```
* But for now let us imagine the two decisions are independent given observables.

* Sell or not:
* s=1[xg + u >0]

* How much to sell:
* w=exp(xB + v)

* The conditional independence assumption can be written:
* corr(s,w|x)=0

clear
set obs 1000
set seed 101

gen x1 = rnormal()
label var x1 "Amount of fertilizer used"
gen x2 = rnormal()
label var x2 "Index indicating availablility of short term credit"
gen x3 = rnormal()
label var x3 "Distance from city center"

gen u = rnormal()
* Error term

* I want the latent index to have a total variance of 1.
* Since the xs and u are independent and normally distributed with
* a variance of 1:

* var(a*x1+b*x2+c*x3+d*u) = a^2*(1) + b^2*(1) + c^2*(1) + d^2*(1)
*          = a^2 + b^2 + c^2 + d^2 = 1
* Let's give all 4 terms the same coefficient k (in absolute value)
*          = k^2 + k^2 + k^2 + k^2 = 1
*          = 4*k^2 = 1 -> k^2 = 1/4
*          -> k = +/- 1/2

* Generate the probability of selling. Use the normal CDF.
gen s_inv = -1/2 + .5*x1 + .5*x2 - .5*x3 + .5*u
sum s_inv
* s_inv has a variance close to 1

gen s_prob = normal(s_inv)

sum s_prob, detail

gen s = rbinomial(1,s_prob)
/*
gen s = 1 if s_prob>.5
replace s = 0 if s_prob<=.5
*/
label var s "Decision to sell on the market"

* This draws a response s for every individual
* It is the decision to sell on the market or
* not.

gen v = rnormal()
* Error term

gen w = 5 + 2*x1 + 3*x2 + 4*x3 + v*10
* Quantity of produce that this farmer would
* have sold if he went to the market.

sum w
* We want to make sure w does not go negative.
* We can ensure this by censoring w at 0.

replace w = max(w,0)
* With v*10 in w (mean 5, sd about 11), roughly a third of the
* draws are negative and get set to 0.

gen y=s*w
label var y "How much produce is sold"
* Quantity of produce actually sold.

* In real data we would not observe s, w, u or v.
* We keep them here only to check the estimators.

* The problem is that what we observe is:
* y=s*w=1[xg + u >0](xB + v)

* The unconditional partial effect of x on sales y is:
* dy/dx = s'w + s*w'

* The effect has two additive components:
* the marginal effect of x on the probability of selling, evaluated at
* the current quantity of sales, plus the marginal effect of x on the
* quantity of sales, weighted by the probability of selling.

* dy/dx takes a unique value for each person. Therefore, in order to
* test how well our estimator is working, we will calculate it per
* person and then take the average.

* This average is the analogue of the average partial effect that we
* will attempt to estimate.

* We know the values for s' from the way s_prob was calculated:

* gen s_prob = normal(-1/2 + .5*x1 + .5*x2 - .5*x3 + .5*u)

* by chain rule. CDF=normal(), PDF=normalden()

* ds/dx1 =  .5*normalden(s_inv)
* ds/dx2 =  .5*normalden(s_inv)
* ds/dx3 = -.5*normalden(s_inv)

* gen w = 5 + 2*x1 + 3*x2 + 4*x3 + v*10

* dw/dx1 = 2
* dw/dx2 = 3
* dw/dx3 = 4

* dy/dx = s'w + s*w'

gen dydx1 =   .5*normalden(s_inv)*w + s*2
gen dydx2 =   .5*normalden(s_inv)*w + s*3
gen dydx3 =  -.5*normalden(s_inv)*w + s*4

sum dydx?

* We can see that the unconditional effects of x1 and x2 are greater
* than that of x3.

* This is because x3 moves the probability of selling in the opposite
* direction from the sale quantity.

**********************************************
* Begin estimation

probit s x?
* Recovers the coefficients pretty well.

reg w x? if s==1
* This, however, does not work so well (w was censored at 0 above).

* The only problem is that we do not observe s.
gen s_hat = 0
replace s_hat = 1 if y>0

probit s_hat x?
reg y x? if s_hat==1
* Now we can see that both estimates are biased

* Many people may assume that the tobit would be the correct model
* because sales cannot be negative.
tobit y x?, ll(0)
* The tobit left-censoring model clearly fails pretty spectacularly at
* recovering the true marginal effects. This is because the tobit does
* not take into account the two-stage nature of the quantity-to-sell
* decision.

* The bias in using tobit is particularly pronounced when looking at
* x3. Although x3 has the largest effect on quantity sold, it decreases
* the likelihood of selling at all, so the coefficient actually turns
* out to be negative.

* You will need to use the user-written command by William Burke:

* http://www.stata-journal.com/article.html?article=st0179
* You should be able to install it using the findit craggit command
craggit s_hat x?, second(y x?)

* This looks pretty good.

```
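
A rough cross-check of the two-tier idea is possible with built-in commands only. The sketch below is not from the original post. It assumes the truncated normal version of Cragg's model with independent tiers, in which case the likelihood separates into a probit on participation and a truncated regression on the positive outcomes. The data-generating values, the error scale of 3, and the variable names `xg`, `mu`, `p0` and `sold` are all illustrative assumptions.

```stata
* Minimal sketch under stated assumptions (not the post's exact simulation).
clear
set obs 1000
set seed 101

gen x1 = rnormal()
gen x2 = rnormal()
gen x3 = rnormal()

* Tier 1: participation, P(s=1|x) = normal(xg)
gen xg = -0.5 + 0.5*x1 + 0.5*x2 - 0.5*x3
gen byte s = (xg + rnormal() > 0)

* Tier 2: quantity drawn from a normal truncated at 0 (inverse-CDF method),
* independently of tier 1 given x. The sd of 3 is an arbitrary choice.
gen mu = 5 + 2*x1 + 3*x2 + 4*x3
gen p0 = normal(-mu/3)
gen w = mu + 3*invnormal(p0 + (1 - p0)*runiform())

* Observed outcome: zero for non-sellers, a positive quantity for sellers
gen y = s*w
gen byte sold = (y > 0)

* Tier 1 estimate: probit on the sell/no-sell indicator
probit sold x1 x2 x3

* Tier 2 estimate: truncated normal regression on the positive outcomes
truncreg y x1 x2 x3 if y > 0, ll(0)
```

If the independence assumption holds, the probit coefficients should land near the tier 1 values used above and the `truncreg` coefficients near the tier 2 values. Running the two parts separately can also help locate which tier is causing trouble when a joint estimator fails to converge.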

1. Hi Francis,
Just found your blog while searching for some examples of Stata code to do double hurdle models related to offering a particular substance abuse treatment modality and quantity supplied. You are now on my weekly list of must reads. Thank you for what you do.
Best,
Mark

2. Just found your blog while searching for the appropriate software to use in running a double hurdle model. Thanks a lot for the blog. Please advise on what software to use on my Windows PC to run a double hurdle model, and what steps I should follow when running the software. Thanks.

1. Dear Tiger, everything you need to answer your question is within this post. As for software, I recommend using Stata for this task. As for the commands, you can generally skip over the simulation and just look at the end of the post:

* Many people may assume that the tobit would be the correct model because sales cannot be negative.
tobit y x?, ll(0)
* The tobit left-censoring model clearly fails pretty spectacularly at recovering the true marginal effects. This is because the tobit does not take into account the two-stage nature of the quantity-to-sell decision.

* The bias in using tobit is particularly pronounced when looking at x3. Although x3 has the largest effect on quantity sold, it decreases the likelihood of selling at all, so the coefficient actually turns out to be negative.

* You will need to use the user-written command by William Burke:

* http://www.stata-journal.com/article.html?article=st0179
* You should be able to install it using the findit craggit command
craggit s_hat x?, second(y x?)

3. This comment has been removed by a blog administrator.

1. Dear Francis, Please may I know why my comment was removed? Hope it was not offensive? Accept my apology if it was.

4. Dear Francis,

Thank you so much for this helpful information. Like the Heckman selection model, does craggit have a way to test the independence assumption? In my results, I saw that the first tier is significant. Does that mean I cannot proceed to the next tier and interpret the results accordingly?

5. Which econometric model can address this objective: to determine the factors that influence participation in community-based ecotourism activities?

6. I've used the craggit download in Stata but find it to be fragile, often not converging.

7. Hi Francis,
Thanks for the blog. Just wondering whether there is any ready-made method to incorporate factor variables, for example, as in a linear regression model:
reg y x i.gender
Secondly, does craggit consider a dependency between the error terms of the probit and the truncated model?

Regards
Sandip, India