Monday, June 11, 2012

Generate rank correlated variables

* Generate variables with specific rank correlations

* Sets the version of Stata that was used for this simulation
version 11
set seed 11

* We want to generate variables that are not drawn from the normal distribution but have either a positive or negative rank correlation.

* Rank correlation is when once you rank all of the observations of a variable, how correlated those ranking are.

* Correlation coefficients are invariance with respect to any positive linear transformation while rank correlations are more generally invariant to any non-decreasing functional transformation.

* This makes using rank correlations as a measure of correlations between variables very useful.

* Let us see how this works.

  * This sets the number of observations to be generated at 10,000
  set obs 10000
  * This sets the correlation matrix to be generated as C.  Note, it must be symetric.
  matrix C = (1, .5 \ .5, 1)
  * We can see this correlation matrix more clearly with the following command.
  matrix list C

  * This draws from the joint normal distribution x and y.
  drawnorm x y, corr(C)

  * To test how well these variables are correlated (this is the Pearson correlation coefficient).
  corr x y

  * To see the rank correlations do the following:
  spearman x y

  * We can also arrive at this coefficient through use of the Pearson coefficient after a bit of transformations.
  sort x

  * This creates a count variable from 1 to _N (10,000) counting the observations.
  gen rank_x = _n

  sort y

  gen rank_y = _n

  corr rank_x rank_y

* My theory going into this is that if you can draw a random normally distributed data of a particular pearson correlation then the spearman correlation should be similar (if the data is normally distributed).

cap forv i = -.9(.1).9 {
  set obs 10000
  matrix C = (1, `i' \ `i', 1)
  drawnorm x y, corr(C)
  cor x y
    local pearson_rho=string(r(rho), "%9.2f")
  spearman x y
    local spearman_rho=string(r(rho), "%9.2f")

  noi di "rho = " string(`i', "%9.2f") " specified -> pearson = `pearson_rho' & spearman = `spearman_rho'"

* We can see for the normal distribution that the correlations and the rank correlations between variables is pretty close.

* We can transform our random normal variables into uniform distributions.

  set obs 10000
  matrix C = (1, .5 \ .5, 1)
  drawnorm x y, corr(C)

  gen x_unif = normal(x)
  gen y_unif = normal(y)
* Because the cdf function is a monotonic function rank correlations are preserved.

* Therefore:
  spearman x_unif y_unif
* Which is exactly the same as
  spearman x y

  corr x_unif y_unif
  * However is slightly different
  corr x y
  * This is because the normal cdf is a non-linear transformation

  * So now we have two random uniform variables that have a rank correlation close to .5

  * From that we can generate any other variable with a know cdf (that is inversible)

  * This is because r=rnormal()=normal^-1(runiform())=invnormal(runiform())

  * or in r is a draw from distribution f with cdf F.

  * To generate a random draw r from distribution F define r=F^-1(runiform())

  * First let's verify this
  gen a=rnormal()
  gen b=invnormal(runiform())
  sum a b
  * a and b are not exactly the same because they are different draws from the same distribution but we can see that they are pretty darn close.

  * Now that we know we can use any randomly drawn uniform variable to generate a corresponding variable of another distribution let's see it in action.

  spearman x y

  gen x_gamma = invgammap(4,x_unif)
  hist x_gamma
  * Remember we are replacing the p (probability variable with x_uniform
  gen y_f = invF(2, 10000, y_unif)

  spearman x_gamma y_f
  * These two variables though quite different in distributions still have the same rank correlations:

  corr x_gamma y_f
  * The correlations are still pretty close to .5 but not as good as the spearman

  sum x_gamma y_f

  * We can generate count data as well
  gen x_poisson = invpoisson(3,x_unif)
  hist x_poisson

  spearman x_poisson y

  * We can see that the inverse poisson distribution is probably not defined the same as the other inverse functions are.,

  * However, the only problem is that the sign is the opposite of what we expect.

  * The trick then is just to take the inverse of the probability (1-p)

  gen x_poisson2 = invpoisson(3,1-x_unif)

  spearman x_poisson2 y
  * That is looking better.

  corr x_poisson2 y

  * Thus we can see that the Stata command "drawnorm" can be used to draw most any distribution of interest with with correlations and rank correlations that the drawnorm command can take.

  * An industrious programmer might take it upon him or her to write a small program called drawany that allows the specification of any of the previously listed distributions as well as (t distributions, beta, hyperbolic, etc.)


  1. I believe Mathworks has an analytical solution to generating a spearman rank. The general procedure is:
    1) look at the correspondence graph/equation to computer the bivariate normal correlation needed for the spearman rank.
    2) generate the bivariate normal (y1,y2)
    3) if you need a distribution other than the bivariate normal (e.g. uniform), draw N readings of your distribution (z1) and perfectly rank-order correlate it with y1. Do this with y2 as well. As long as rank is preserved, the rank correlations should be the same.

  2. This is a very helpful post. If it is still possible that someone might answer, has anyone successfully used this to generate a beta distributed random variable from one of the correlated normal random variables? (I am clearly not industrious enough!) I have managed to obtain exponential random variables this way but I am struggling to get a beta random variable for a distribution that looks a little like an exponential would for a small mean value (<5).