Saturday, November 3, 2012

Draw Multinomial Random Variables

Original Code

* It is a little tricky to draw multinomial random variables (joint binomial distribution) in Stata because Stata does not have a canned inverse CDF function for the binomial that inverts P values to x values.

* Fortunately in my last post I programmed an inverseCDF command (http://www.econometricsbysimulation.com/2012/11/inverting-binomial-distribution-in.html).

* I will load in the program for later use:

cap program drop CDFinv
program define CDFinv

  * Let's define the arguments of the program.
  * N is the total number of trials.
  * P is the CDF value that we would like inverted
  * p is the probability of one outcome being a success.
  * newvar is the name of the new variable to be generated.
  args newvar P N p

  * The challenging thing about trying to invert a CDF for a count distribution is that the CDF is a probability mass function with discontinuous regions.
  * However, the particularly nice thing about the binomial distribution is the the pool of outcomes is well defined. (1,2,3..N)
  * Thus we can simply count how many times the "steps" of the CDF fall below the target CDF level.
  cap drop `newvar'
  qui gen `newvar'=0
    label var `newvar' "Inverse CDF of binomial"

  forv k=1(1)`N' {
    * Add 1 to the variable value every time the CDF(n,k-1,p) is less than the target CDF value P.
qui replace `newvar'=`newvar'+1 if binomial(`N', `k'-1, `p')<(`P')
  }

end

* This post also draws on a previous post in which I show how to transform Stata multivariate normal random draws into into any correlated random variable (http://www.econometricsbysimulation.com/2012/09/drawing-jointly-distributed-non-normal.html).

set more off

clear
set obs 10000

* For instance let's draw four variables.
* 1. a chi2 with 5 degrees of freedom
* 2. a poisson k = 5
* 3. a uniform variable with min = -5 and max = 5
* 4. a random f distribution draw with 5 and 5 degrees for numerator and denominator degrees of freedom.

* First we will specify the correlation matrix.
* The only constraint as far as I know is that the covariance matrix has to be PSD.
* This in practicality limits the possible correlations between variables since cross terms tend to cause vialations more likely in the PSD requirement.

matrix c = (  1, .9, .3,  .2 \ ///
             .9,  1, .2,  .1 \ ///
             .3, .2,  1,  .5 \ ///
             .2, .1, .5,   1 )

* If we do not specify a mean or covariance matrix then the default draws are standard normals which is what we want for simplicity.
drawnorm x1 x2 x3 x4, corr(c)

corr x?
spearman x?

* Now all we need to do is turn our normal draws into uniform draws.
* Note: if x~N(0,1) and THETA is the CDF of the normal then y=CDF(x)~uniform
* So for any new distribution with CDF ALPHA and inverse INVALPHA the variable z=INVALPHA(y) ~ alpha.

gen y1 = normal(x1)
gen y2 = normal(x2)
gen y3 = normal(x3)
gen y4 = normal(x4)

corr y?
spearman y?

sum
* Looking good. All of the y variables are uniformly distributed now while still maintaining most of the correlation and all of the rankcorrelation.

* The key final step is to take the inverse CDF of the uniforms for a series of binomial draws.

* The function CDFinv is programmed so that the:
*          1st argument is the new variable name to be created
*          2nd is the cumulative P value to be inverted
*          3rd Number of total draws
*          4th probability of success on each individual draw

* Let's imagine you would like to simulate quarterly data on verterans.

CDFinv z1 y1 4 .05
  label var z1 "Hospitalization from mental illness"

CDFinv z2 y2 4 .15
  label var z2 "Hospitalization from accident"

CDFinv z3 y3 4 .2
  label var z3 "Seeks treatment for PTSD"

CDFinv z4 y4 1 .2
  label var z4 "Unemployed"

corr z?
spearman z?

* Note: Using an inverse CDF function is highly non-linear.  Thus the correlations are significantly weakened since only a linear transformation of the initial values would maintain the same correlation values.
* To adjust the final correlations, experiment with initial values until you find outcomes that approximate the level of correlation desired.
* It is worth noting that there is an upper limit to the amount of correlation possible as a result of this transformation.

No comments:

Post a Comment