Thursday, July 26, 2012

7 Ways to Speed Up Your Do Files

* 1. Drop unused data.  When you have the choice between using if statements to exclude unused observations and dropping them outright, dropping the data is usually faster.


  * First, clear Stata's timers, since we are going to use them throughout this file.
  forv i = 1/20 {
    cap timer clear `i'
  }
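
  * A shorter alternative (to the best of my knowledge): calling timer clear
  * with no timer number resets all timers at once.
  timer clear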


  clear
  set obs 100000

  gen z = rbinomial(1,.5)

  forv i = 1/10 {
    gen x`i' = rnormal()
    forv ii = 1/40 {
      gen z`i'`ii' = rnormal()*(1/20)^.5
    di "loop `i' unneccessary variable `ii'"
    }
  }

  gen u = rnormal()*(10)^.5*(1/20)^.5

  gen y = rbinomial(1,normal(x1-x2+x3-x4+x5-x6+x7-x8+x9-x10+u))

  timer on 1
  probit y x* if z==1
  timer off 1

  drop if z!=1

  timer on 2
  probit y x*
  timer off 2

  timer list 1
  timer list 2

  * On my computer the first probit took 0.5350 seconds and the second took 0.4430 seconds.

* 2. Drop unnecessary variables.

  keep x* y

  timer on 3
  probit y x*
  timer off 3

  timer list 3

  * Dropping the unused variables decreased the computation time to 0.3180 seconds.

  * The same applies to temporary variables: unused temporary variables occupy just as much active memory as any other variable until they are dropped, as the sketch below illustrates.
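
  * A minimal sketch (the name xsum is just chosen for illustration): temporary
  * variables vanish on their own when the do-file ends, but dropping them as
  * soon as they are no longer needed frees the memory immediately.
  tempvar xsum
  gen `xsum' = x1 + x2
  qui sum `xsum'
  drop `xsum'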

* 3. Remove visual feedback.

  timer on 4
  qui probit y x*
  timer off 4

  timer list 4

  * Quieting the display decreases the run time to 0.2990 seconds.  The gain is modest here because probit spends most of its time on estimation rather than on displaying output.

  * With other commands the gain from turning off the feedback is much greater.
  timer on 5
  forv v=1/100 {
    sum y
  }
  timer off 5

  timer on 6
  forv v=1/100 {
    qui sum y
  }
  timer off 6

  timer list 5
  timer list 6
  * Displaying the results of the sum command causes Stata to run about 74% slower.
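
  * An equivalent approach (a sketch): wrapping the entire loop in a quietly
  * block silences every command inside it, which saves typing qui on each line.
  quietly {
    forv v = 1/100 {
      sum y
    }
  }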

* 4. Remove redundant data writing and reading commands.
  tempfile tempdata

  timer on 7
  forv v=1/100 {
    qui save `tempdata', replace
    qui sum y
    qui use `tempdata', clear
  }
  timer off 7

  timer list 7
  * The reading and writing cause the loop to take about five times as long to complete.
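
  * A sketch of the faster pattern: when an intermediate dataset really does
  * need to be written, save it once outside the loop rather than on every
  * iteration (timer 14 is just an unused timer number chosen here).
  qui save `tempdata', replace
  timer on 14
  forv v = 1/100 {
    qui sum y
  }
  timer off 14
  timer list 14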

* 5. Choose faster commands that approximate time-consuming ones.

  * For instance, if you are mainly interested in the average partial effects from a probit and your computer is getting bogged down, try a linear probability model instead, since its coefficients tend to be very similar to the probit's average partial effects.
  timer on 8
    probit y x*
    margins, dydx(x1 x2 x3 x4 x5 x6 x7 x8 x9 x10)
  timer off 8

  timer list 8

  timer on 9
    reg y x*
  timer off 9

  timer list 9

  * While the linear probability model is not as good a fit as the probit (we know how the data were generated, and the probit's pseudo-R2 is larger than the LPM's R2),
  * it gets similar results and runs about 560 times faster.  For the LPM the coefficients are themselves the average partial effects, as the quick check below shows.
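
  * A quick check (a sketch): the LPM's average partial effect of x1 can be
  * read directly off the coefficient, with no call to margins needed.
  di "APE of x1 from the LPM: " _b[x1]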

* 6. Minimize unused calculations.

  * Imagine that you have decided to bootstrap the standard errors of your estimators.
  * In that case it is unnecessary to also request robust standard errors in the OLS, because the bootstrap replaces the analytic standard errors anyway and the robust option only adds calculations.
  timer on 10
    bs: reg y x*, robust
  timer off 10

  timer on 11
    bs: reg y x*
  timer off 11

  timer list 10
  * On my computer the first command takes 9.6 seconds,
  timer list 11
  * while the second one takes only 7.6 seconds.

* 7. Minimize your data footprint.

  * The compress command recasts each variable to the smallest storage type that can hold its values.
  compress
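
  * A quick way to see the data's footprint (a sketch): the memory command
  * reports how much memory the data currently occupy.
  memory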

  * Sometimes it is also helpful to convert string variables into encoded (factor) variables.
  gen longstring = "This is a really big string that is going to take up a lot of memory.  Because the longer the string you have the more space Stata needs to reserve for that string within the variable even if all of the other values are small" if _n == 1

  timer on 12
    probit y x*
  timer off 12

  encode longstring, gen(factor_longstring)

  desc

  * Variables x1-x10 take up only 4 bytes apiece, or 40 bytes collectively.
  * We can see that the storage type of the string variable is str244, which means it reserves 244 bytes per observation,
  * while the encoded (factor) version is stored as a long and takes only 4 bytes.
  * Note: this can be misleading; if the variable has many unique values, the value labels themselves will also take up a lot of space.

  drop longstring

  timer on 13
    probit y x*
  timer off 13

  timer list 12
  timer list 13

  * By converting the long string to a factor variable and dropping the original string we free up memory for Stata to use, which speeds up the probit regression by about 8%.

