Original Code

* It is a little tricky to draw multinomial random variables (joint binomial distribution) in Stata because Stata does not have a canned inverse CDF function for the binomial that inverts P values to x values.

* Fortunately in my last post I programmed an inverseCDF command (http://www.econometricsbysimulation.com/2012/11/inverting-binomial-distribution-in.html).

* I will load in the program for later use:

cap program drop CDFinv

program define CDFinv

* Let's define the arguments of the program.

* N is the total number of trials.

* P is the CDF value that we would like inverted

* p is the probability of one outcome being a success.

* newvar is the name of the new variable to be generated.

args newvar P N p

* The challenging thing about trying to invert a CDF for a count distribution is that the CDF is a probability mass function with discontinuous regions.

* However, the particularly nice thing about the binomial distribution is the the pool of outcomes is well defined. (1,2,3..N)

* Thus we can simply count how many times the "steps" of the CDF fall below the target CDF level.

cap drop `newvar'

qui gen `newvar'=0

label var `newvar' "Inverse CDF of binomial"

forv k=1(1)`N' {

* Add 1 to the variable value every time the CDF(n,k-1,p) is less than the target CDF value P.

qui replace `newvar'=`newvar'+1 if binomial(`N', `k'-1, `p')<(`P')

}

end

* This post also draws on a previous post in which I show how to transform Stata multivariate normal random draws into into any correlated random variable (http://www.econometricsbysimulation.com/2012/09/drawing-jointly-distributed-non-normal.html).

set more off

clear

set obs 10000

* For instance let's draw four variables.

* 1. a chi2 with 5 degrees of freedom

* 2. a poisson k = 5

* 3. a uniform variable with min = -5 and max = 5

* 4. a random f distribution draw with 5 and 5 degrees for numerator and denominator degrees of freedom.

* First we will specify the correlation matrix.

* The only constraint as far as I know is that the covariance matrix has to be PSD.

* This in practicality limits the possible correlations between variables since cross terms tend to cause vialations more likely in the PSD requirement.

matrix c = ( 1, .9, .3, .2 \ ///

.9, 1, .2, .1 \ ///

.3, .2, 1, .5 \ ///

.2, .1, .5, 1 )

* If we do not specify a mean or covariance matrix then the default draws are standard normals which is what we want for simplicity.

drawnorm x1 x2 x3 x4, corr(c)

corr x?

spearman x?

* Now all we need to do is turn our normal draws into uniform draws.

* Note: if x~N(0,1) and THETA is the CDF of the normal then y=CDF(x)~uniform

* So for any new distribution with CDF ALPHA and inverse INVALPHA the variable z=INVALPHA(y) ~ alpha.

gen y1 = normal(x1)

gen y2 = normal(x2)

gen y3 = normal(x3)

gen y4 = normal(x4)

corr y?

spearman y?

sum

* Looking good. All of the y variables are uniformly distributed now while still maintaining most of the correlation and all of the rankcorrelation.

* The key final step is to take the inverse CDF of the uniforms for a series of binomial draws.

* The function CDFinv is programmed so that the:

* 1st argument is the new variable name to be created

* 2nd is the cumulative P value to be inverted

* 3rd Number of total draws

* 4th probability of success on each individual draw

* Let's imagine you would like to simulate quarterly data on verterans.

CDFinv z1 y1 4 .05

label var z1 "Hospitalization from mental illness"

CDFinv z2 y2 4 .15

label var z2 "Hospitalization from accident"

CDFinv z3 y3 4 .2

label var z3 "Seeks treatment for PTSD"

CDFinv z4 y4 1 .2

label var z4 "Unemployed"

corr z?

spearman z?

* Note: Using an inverse CDF function is highly non-linear. Thus the correlations are significantly weakened since only a linear transformation of the initial values would maintain the same correlation values.

* To adjust the final correlations, experiment with initial values until you find outcomes that approximate the level of correlation desired.

* It is worth noting that there is an upper limit to the amount of correlation possible as a result of this transformation.

## No comments:

## Post a Comment