Saturday, June 30, 2012

The Fed is Watching!

Looks like the Fed (United States Federal Reserve) has finally caught on to what we have been up to. They have added us to their list of sites to watch. These sites provide economically relevant posts ;-)

A summary of their aggregator's role is:

"The Federal Reserve Bank of St. Louis is launching a blog aggregator, to highlight and promote the discussion of economics research. Your blog is part of this effort. This email explains why and how you can help promote the discussion of economic research in the blogosphere."

The aggregator can be found at

Of course they have upwards of 350 economics blogs listed, but we are glad to be doing our own small part to keep the world spinning :)

Search all of the data in a directory for a variable

* This routine looks through all of the data files in a directory for a text match in either the names or the descriptions of the variables.

* Specify where to look.  I have all of the data files from Jeffrey Wooldridge's Introductory Econometrics book in one directory.
local targ_dir = "A:/Wooldridge/"

* What to look for
local lookforwhat "computer" /* I know that one of Jeff's examples uses computers and coupons */

dir "`targ_dir'*.dta"

* Creates a list of data files in the directory and stores them in the `file_list' local
local file_list : dir "`targ_dir'" files "*.dta"

* Saves your current data
preserve


* Loops through all of the files in the file list
foreach v in `file_list' {
  * Tells the user which file is being opened
  di as input "Opening: `v'"

  * Opens the data file
  use "`targ_dir'`v'", clear

  * Looks within the data file for the text match
  lookfor `lookforwhat'

  * Saves the match to `found_result' if a variable is found
  local found_result = r(varlist)

  * If a variable is found then it displays a message though of course the lookfor command also displays useful results
  if length("`found_result'")>1 di _newline "`lookforwhat' found in file `v'" _newline
}

* Restores the user's data to the previous state.
restore
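For readers outside Stata, the same directory-scan idea can be sketched in Python. This is purely illustrative: the CSV layout, the file names, and the `find_variable` helper are hypothetical, not part of the original post.

```python
import csv
import os

def find_variable(directory, term):
    """Return {filename: [matching column names]} for every CSV in directory."""
    matches = {}
    for fname in sorted(os.listdir(directory)):
        if not fname.endswith(".csv"):
            continue
        # Read only the header row of each file, the analogue of lookfor
        with open(os.path.join(directory, fname), newline="") as f:
            header = next(csv.reader(f), [])
        hits = [col for col in header if term.lower() in col.lower()]
        if hits:
            matches[fname] = hits
    return matches
```

As in the Stata version, the scan only touches each file once and reports the files in which the search term appears.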

Friday, June 29, 2012

A note on debugging in Stata

* As far as I know, there is no guide to debugging in Stata.  This post touches on a few topics, though, as with any programming language, you really just need to spend many hours of trial and error to figure out what works, what doesn't, and what it looks like when things are not working.

* As an interpreted language (one that takes commands one by one), Stata can often be very easy to debug.

* For instance:

sysuse auto

* Imagine you want to run an IV regression

ivregress  trunk foreign make

* This command produces an error, and the best thing to do at that point is to consult the documentation for the command's proper syntax.

help ivregress

*  ivregress estimator depvar [varlist1] (varlist2 = varlist_iv) [if] [in] [weight] [, options]

* The way you read this is the first entry is the name of the command

* After the command, the different elements are explained in the command's help pages.  Anything in [] is optional, while anything not in brackets is required.

* The best way for me to find out how to write a Stata command is to look at an example usually listed on the bottom of a command then modify it to fit my purposes.

* However, beyond this one line command debugging things can get much trickier.

* Imagine:

forv i=1/10 {
  forv v=1/10 {
    gen var_`i'_`v' = `i'*`v'
    gen var_`i'_`v'2 = var_`i'_v^2
  }
}

* This code will not work.  An experienced programmer might find it easy to identify what is going wrong with it.

* However, macros often make mistakes very difficult to identify because what is actually happening in the commands is difficult to observe.

* One way to help in this is to add lots of display commands so that you can see what commands are actually being sent to Stata.

forv i=1/10 {
  di "i = `i'"
  forv v=1/10 {
    di "v = `v'"

    di "gen var_`i'_`v' = `i'*`v'"
    gen var_`i'_`v' = `i'*`v'

    di "gen var_`i'_`v'2 =  var_`i'_v^2"
    gen var_`i'_`v'2 =  var_`i'_v^2
  }
}

* We can see that {gen var_`i'_`v'2 =  var_`i'_v^2} is the source of the problem because var_`i'_v^2 should be var_`i'_`v'^2

* Correcting that we still have one more error.


forv i=1/10 {
  di "i = `i'"
  forv v=1/10 {
    di "v = `v'"

    di "gen var_`i'_`v' = `i'*`v'"
    gen var_`i'_`v' = `i'*`v'

    di "gen var_`i'_`v'2 =  var_`i'_`v'^2"
    gen var_`i'_`v'2 =  var_`i'_`v'^2
  }
}
}

* That extra closing bracket is a very typical error.
* Likewise, an unpaired { will cause Stata to give an "unexpected end of file" error.
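The echo-before-execute trick generalizes beyond Stata. Here is a hypothetical Python analogue (the `results` dictionary and the generated names are illustrative): each command is built as a string, displayed, and only then executed, so substitution mistakes in the generated names show up directly in the output.

```python
# Build each "command" as a string, print it, then run it, so that
# macro-style substitution errors become visible before execution.
results = {}
for i in range(1, 4):
    for v in range(1, 4):
        cmd = f"results['var_{i}_{v}'] = {i} * {v}"
        print(cmd)   # the analogue of Stata's di of the expanded line
        exec(cmd)
```

If a name were mistyped (say `var_{i}_v` instead of `var_{i}_{v}`), the printed command would expose the literal, unexpanded `v` immediately.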

Thursday, June 28, 2012

Cragg's Double hurdle model used to explain censoring

* Cragg's 1971 lognormal hurdle (LH) model
* (See Wooldridge 2010 page 694)

* With a double hurdle model we want to think of two components contributing to a process.

* First, there is the decision to do something. Say go to market and sell produce.  Second, there is the decision of how much produce to sell.

* This model is distinctly different from, say, a truncated normal regression, because the truncated normal regression assumes linearity in the y variable and only keeps the y variable above zero because the data are truncated, i.e. people cannot sell negative quantities of produce (buy) in the data set.

* Instead, we want to think of the two decisions as independent (conditional on observables) and distinct. This might seem unreasonable at first.  However, if you are a grower you probably decide to sell on the market ex ante, before deciding how much to sell.  That is, if you plant only cabbage then you are probably not going to settle for eating cabbage no matter how your crop turns out.

* Likewise, if you are going to sell, the decision of how much to sell may be uncorrelated with the original decision to sell if, say, you plan to sell on the market at planting time and decide how much to sell based on how well your crop does.

* It is also easy to think of examples where this assumption might fail.  For instance, if a bumper crop one year leaves you with more than you can eat, you go to the market to sell even though you did not plan on it.  The amount you sell on the market will then probably also be a function of the bumper crop.

* But for now let us imagine the two decisions are independent given observables.

* Sell or not:
* s = 1[x*g + u > 0]

* How much to sell:
* w = x*B + v (censored below at zero)

* The conditional independence assumption can be written:
* corr(s,w|x) = 0

set obs 1000
set seed 101

gen x1 = rnormal()
  label var x1 "Amount of fertilizer used"
gen x2 = rnormal()
  label var x2 "Index indicating availability of short term credit"
gen x3 = rnormal()
  label var x3 "Distance from city center"

gen u = rnormal()
* Error term

* I want the variance to equal 1 in total.
* Since the xs and u are independent and normally distributed with 
* a variance of 1:

* var(a*x1+b*x2+c*x3+d*u) = a^2*(1) + b^2*(1) + c^2*(1) + d^2*(1)
*          = a^2 + b^2 + c^2 + d^2 = 1
* Let's just make the variance of all 4 variables equal to v
*          = v^2 + v^2 + v^2 + v^2 = 1
*          = 4*v^2 = 1 -> v^2 = 1/4
*          -> v = +/- 1/2
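The back-of-the-envelope variance calculation above is easy to verify numerically. A small numpy check (the seed and sample size are arbitrary, chosen just for illustration):

```python
# With four independent standard-normal components each weighted by 1/2,
# the weighted sum should have variance 4 * (1/2)^2 = 1.
import numpy as np

rng = np.random.default_rng(101)
x1, x2, x3, u = rng.standard_normal((4, 200_000))
s_inv = -0.5 + 0.5 * x1 + 0.5 * x2 - 0.5 * x3 + 0.5 * u
print(s_inv.var())  # should be close to 1
```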

* Generate the probability of selling. Use normal density.
gen s_inv = -1/2 + .5*x1 + .5*x2 - .5*x3 + .5*u
sum s_inv
* Note that s_inv has a variance close to 1

gen s_prob = normal(s_inv)

sum s_prob, detail

gen s = rbinomial(1,s_prob)
* (A deterministic alternative would be: gen s = s_prob > .5)
  label var s "Decision to sell on the market"

* This draws a response s for every individual
* It is the decision to sell on the market or
* not.

gen v = rnormal()
* Error term

gen w = 5 + 2*x1 + 3*x2 + 4*x3 + v*10
* Quantity of produce that this farmer would
* have sold if he went to the market.

sum w
* We want to make sure w does not go negative.
* We can ensure this by making sure the minimum of w is 0.

replace w = max(w,0)
* This censors the negative draws at zero.

gen y=s*w
  label var y "How much produce is sold"
* Quantity of produce actually sold.

* In real data we cannot actually observe s, w, or the
* error terms separately.

* The problem is that what we observe is:
* y = s*w = 1[x*g + u > 0](x*B + v)

* The unconditional partial effect of x on sales y is:
* dy/dx = s'w + s*w'

* The effect has two additive components.

* The first is the marginal effect of x on the probability of selling, 
* multiplied by the current quantity of sales; the second is the marginal 
* effect of x on quantity sold, multiplied by the probability of selling.

* Because dy/dx takes a unique value for each person, in order 
* to test how well our estimator is working we will calculate 
* it per person and then take the average.

* This average is the analogue of the average partial effect that 
* we will attempt to estimate.

* We know from the way s_prob was calculated the values for s'

* gen s_prob = normal(-1/2 + .5*x1 + .5*x2 - .5*x3 + .5*u)

* by chain rule. CDF=normal(), PDF=normalden()

* ds/dx1 =  .5*normalden(s_inv)
* ds/dx2 =  .5*normalden(s_inv)
* ds/dx3 = -.5*normalden(s_inv)

* gen w = 5 + 2*x1 + 3*x2 + 4*x3 + 10*v

* dw/dx1 = 2
* dw/dx2 = 3
* dw/dx3 = 4

* dy/dx = s'w + s*w'

gen dydx1 =   .5*normalden(s_inv)*w + s*2
gen dydx2 =   .5*normalden(s_inv)*w + s*3
gen dydx3 =  -.5*normalden(s_inv)*w + s*4

sum dydx?

* We can see that the unconditional effect of x1 and x2 is greater than that of x3.

* This is because x3 moves the probability of selling in the opposite 
* direction of the sale quantity.
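The same average-partial-effect calculation can be sketched in Python as a cross-check. This is an illustrative re-simulation mirroring the coefficients above (the `npdf`/`ncdf` helpers and the seed are assumptions of the sketch, not part of the original Stata code):

```python
import math

import numpy as np

def ncdf(z):
    """Standard normal CDF, vectorized via math.erf."""
    return 0.5 * (1 + np.vectorize(math.erf)(z / math.sqrt(2)))

def npdf(z):
    """Standard normal density."""
    return np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)

rng = np.random.default_rng(7)
n = 200_000
x1, x2, x3, u, v = rng.standard_normal((5, n))

s_inv = -0.5 + 0.5 * x1 + 0.5 * x2 - 0.5 * x3 + 0.5 * u
s = (rng.random(n) < ncdf(s_inv)).astype(float)            # sell or not
w = np.maximum(5 + 2 * x1 + 3 * x2 + 4 * x3 + 10 * v, 0)   # quantity if selling

# dy/dx = (ds/dx)*w + s*(dw/dx), averaged over the sample
ape_x1 = np.mean(0.5 * npdf(s_inv) * w + s * 2)
ape_x2 = np.mean(0.5 * npdf(s_inv) * w + s * 3)
ape_x3 = np.mean(-0.5 * npdf(s_inv) * w + s * 4)
print(ape_x1, ape_x2, ape_x3)
```

The ordering of the averages reproduces the point in the text: the selection effect of x3 works against its quantity effect, so its average partial effect comes out smallest.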

* Begin estimation

probit s x?
* Recovers the coefficients pretty well.

reg w x? if s==1
* However, does not work so well.

* The only problem is that we do not observe s.
gen s_hat = 0
replace s_hat = 1 if y>0

probit s_hat x?
reg y x? if s_hat==1
* Now we can see that both estimates are biased

* Many people may assume that the tobit would be the correct model due 
* to knowing that sales cannot be negative.
tobit y x?, ll(0)
* The tobit left-censoring model clearly fails pretty spectacularly at 
* recovering the true marginal effects. This is because the tobit does 
* not take into account the two-stage nature of the quantity-to-sell 
* decision.

* The bias in using tobit is particularly pronounced for x3: though 
* x3 has the largest effect on quantity sold, it decreases the 
* likelihood of selling at all, so its estimated coefficient actually 
* turns out to be negative.

* You will instead need to use the user-written command by William Burke:
* You should be able to install it using the findit craggit command
craggit s_hat x?, second(y x?)

* This looks pretty good.

Formatted By Econometrics by Simulation

Wednesday, June 27, 2012

Selection Bias in the use of Fertilizer

* This simulation follows the inherent logic behind the following paper:

* Duflo E, Kremer M, Glennerster R. 2008b. Using randomization in development economics research: a toolkit. In Handbook of Development Economics, Vol. 4, ed. T Schultz, J Strauss, chpt. 15. Amsterdam: Elsevier Sci. Ltd. North Holl.

* We might think that the use of fertilizer is a no-brainer, given the results of experiments done in other countries with different soils, water inputs, and climates. But the return to fertilizer might actually be much lower than we expect for many farmers in poor nations, and that is why they are not purchasing and using more of it.

* Let us imagine a sample of 1000 farmers:

set obs 1000

gen farm_id = _n

* Now, let's specify a binary choice production function.

* Where random spatial variation in productivity is:
gen sigma = abs(rnormal())+.2

* The base level of return on any field is:
gen beta = runiform() + .3

* 1. A farmer who does not use fertilizer will get:
*   Y(f=0) = Py*(beta)*sigma

gen Py = runiform()+.5

gen Y0=Py*(beta)*sigma
  label var Y0 "Expected return from no-use"

* 2. A farmer who does use fertilizer will have a higher return to yield but has to pay for fertilizer Pf:
*   Y(f=1) = Py*(beta + alpha)*sigma-Pf

* Each farmer has some unobserved additional return from using fertilizer:
gen alpha = 6*runiform()

* The price of fertilizer is:
gen Pf = runiform()+2

gen Y1 = Py*(beta + alpha)*sigma-Pf
  label var Y1 "Expected return from fertilizer use"

* If fertilizer were free then everybody would always use fertilizer.

* If however it costs something then some people will choose not to use fertilizer.

* Farmers will decide whether to use fertilizer based on expected yields:

* Now let's generate the observed data
gen fert = 0
replace fert = 1 if Y0 < Y1
  label var fert "Farmer used fertilizer"

gen yeild = Y0
replace yeild = Y1 if Y0 < Y1
  label var yeild "Expected Value of Production"

sum fert
* We can see that there is only modest adoption of fertilizer under these parameters.

reg yeild fert
* Even though clearly the use of fertilizer is correlated with higher total yield.

reg fert alpha Pf Py
* We can see that as the unobserved expected gain in yield from fertilizer increases, the use of fertilizer also increases.
* We can also see that as the price of fertilizer goes up, fewer people use it.
* Finally, we can see that as the value of the farm product increases, the relative cost of fertilizer drops and people are more inclined to use it.

* Thus the choice of using or not using fertilizer can be easily explained through a rational choice framework.

* If this is the true underlying scenario then there is no efficiency argument for states, NGOs, or multinationals to intervene through the supply or subsidization of fertilizer.

* However, this might not be the whole story.  Imagine the case where producers might be risk averse.

* So that they value different returns based on their personal preferences rather than their monetary value.

* Imagine the case where, if a farmer were to get a return of less than gamma, that farmer would run the risk of dying or having his family die, so the farmer will be highly averse to that scenario.

gen gamma = .2
  label var gamma "Minimum safety level of yield"

gen starv_penalty = 9
  label var starv_penalty "The penalty that farmers face if part of their expected production is less than gamma"

* So we will say there is a 1/3 chance of production level alpha_observed = alpha with fertilizer, a 1/3 chance of alpha_observed = (1/4)*alpha, and a 1/3 chance of alpha_observed = (7/4)*alpha.

* The expected value of production with fertilizer is still Y1 but now there are two other potential levels of production:

gen Y2= Py*(beta + alpha*(1/4))*sigma-Pf
gen Y3= Py*(beta + alpha*(7/4))*sigma-Pf

* Now let's first check whether any of the Y0-Y3 range of production is going to be less than gamma
forv i=0/3 {
  gen Y`i'_util = Y`i'
* If any of the production levels is less than gamma then that outcome is heavily penalized:
  replace Y`i'_util = Y`i' - starv_penalty if Y`i' < gamma
}

* In this simplified framework there is no variance in the return from not using fertilizer, so Y0_util equals the expected utility from not using fertilizer.

* However, there are variable returns from using fertilizer, so:

gen Y_fert_utility = (1/3)*(Y1_util+Y2_util+Y3_util)

* Now let's generate the choice of fertilizer usage
gen fert2 = 0
replace fert2 = 1 if Y0_util < Y_fert_utility
  label var fert2 "Farmer used fertilizer"

* Now let's see what the yield is if farmers are starvation-averse.
gen yeild2 = Y0 if fert2 == 0
replace yeild2 = Y1 if fert2 == 1
  label var yeild2 "Expected Yield of Farming Choice Under Safety-First Utility"

sum yeild*
sum fert*
* Thus we can see that though using fertilizer might lead to net gains in expected production, farmers might choose not to use it because they find the range of returns from fertilizer use too risky.

reg yeild2 fert2
* So only the farmers who know that even in the worst-case scenario they will have enough to survive end up using fertilizer.
* These farmers are also the ones with the most to gain from using fertilizer, thus it looks extremely effective.

reg fert2 alpha Pf Py
* Interestingly, we can see that under risk aversion the different parameters entering into production are less effective at predicting the choice of using fertilizer or not.

* I think this is caused by the extreme flatness in fertilizer use when there is a substantial risk of crop failure.
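The safety-first decision rule above boils down to a small comparison, which can be sketched in Python. The `utility` and `adopts` helpers and all numbers here are illustrative stand-ins for the simulation's gamma and starvation penalty:

```python
def utility(y, gamma=0.2, starv_penalty=9.0):
    """Return y, heavily penalized if it falls below the survival threshold."""
    return y - starv_penalty if y < gamma else y

def adopts(y0, y1, y2, y3):
    """y0: sure return without fertilizer; y1..y3: equally likely with it."""
    eu_fert = (utility(y1) + utility(y2) + utility(y3)) / 3.0
    return eu_fert > utility(y0)
```

For example, a farmer facing fertilizer outcomes (3, 0.1, 5) against a sure 1 has a higher *expected return* with fertilizer, yet declines it, because the 0.1 outcome falls below the survival threshold and is penalized.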

Tuesday, June 26, 2012

Stata Fuzzy match command

* This command checks whether two strings match.  There is a range of criteria by which a match can occur.  It is a potentially useful command when comparing two variables that might have different word orders or spellings, such as names, but which seem like they may hold the same values.  In the event that you allow some letters to vary, this command creates a variable that keeps track of how many letters failed to match.

* Command Written by Francis Smart
cap program drop fuzzy
program define fuzzy

syntax varlist(min=2 max=2 string)     ///
       [, Gen_match(string) Free_letters(integer 0) ///
  Blanks Words_unordered Letters_unordered    ///
  Missed_letter_count(string) ///
  Reuse_letters Exclude(string) ///
  Case ]

local var1 = "`1'"
local var2 = subinstr("`2'",",","",.)

if "`gen_match'"=="" local gen_match = "matched"
if "`missed_letter_count'"=="" local missed_letter_count = "missed_count"
if "`free_letters'"=="" local free_letters = "0"

di _newline as text "Fuzzy matching (`var1' `var2')"
di as text "Generating match indicator variable (`gen_match')"
di as text "Number of free letters is (`free_letters')"

if "`words_unordered'"!="" {
  di as text "Word order does not matter (up to two words)"
  local blanks = "blanks"
}

if "`exclude'" != "" di "Characters `exclude' ignored"
if "`blanks'" != "" di as text "Blanks dropped"
if "`letters_unordered'" != "" di as text "Letter order does not matter when searching for match"
if "`reuse_letters'" != "" di as text "Letters may be reused when searching for matches"
if "`case'" != "" di "Case does not matter"
if (0 < `free_letters' | "`letters_unordered'" != "") di "Missed letter count variable created will be (`missed_letter_count')"

cap drop `gen_match'
 if _rc==0 noi di _newline "Matched indicator var: <`gen_match'> replaced"

gen `gen_match'=0 if `var1' != "" & `var2' != ""
  label var `gen_match' "Match indicator variable"

* Create a list of temporary variables
tempvar t_var1 t_var2 longest_word word_length

* Generate temporary variables for holding var1 and var2
qui gen `t_var1' = `var1' if length(`var1')>=length(`var2')
qui gen `t_var2' = `var2' if length(`var1')>=length(`var2')

* Whichever is the longest word will be the first variable
qui replace `t_var1' = `var2' if length(`var2')>length(`var1')
qui replace `t_var2' = `var1' if length(`var2')>length(`var1')

* di `longest_word_length'

if "`words_unordered'"=="" local loop_over_words = 1
if "`words_unordered'"!="" local loop_over_words = 2

* Generate a variable to indicate how many unmatched letters are in the variable comparison.
  if (0 < `free_letters' | "`letters_unordered'" != "") {
* Calculate how long the longest word is of the entire set of two strings being compared
    gen `word_length' = max(length(`t_var1'),length(`t_var2'))
    egen `longest_word' = max(`word_length')
    local longest_word_length = `longest_word'[1]
* Drop the variable if it already exists
    cap drop `missed_letter_count'
 if _rc==0 noi di "`missed_letter_count' replaced"

    * Create a variable to store the number of missed letters in the variable matchup.
    gen `missed_letter_count'=0
     label var `missed_letter_count' "Number of letters missed in matchup"
  }

    * Remove any blanks from the variables before trying a match.
  if "`blanks'" != "" {
    replace `t_var1' = subinstr(`t_var1', " " , "", .)
    replace `t_var2' = subinstr(`t_var2', " " , "", .)
  }

  if "`case'" != "" {
    replace `t_var1' = lower(`t_var1')
    replace `t_var2' = lower(`t_var2')
  }

* This will loop either once or twice (once if word order matters, twice if not)
if  "`letters_unordered'" == "" qui forv i=1(1)`loop_over_words' {
  ************    Begin Word Match             ************

  * If words unordered is set then on the second loop reverse the word order.
  if `i'==2 replace `t_var2' =  word(`t_var2',2)+word(`t_var2',1)

  * Remove any excluded characters from the variables before trying a match
  * Loop through the list of user supplied excluded characters.
  foreach v in `exclude' {
    noi di "`v'"
    replace `t_var1' = subinstr(`t_var1', "`v'" , "", .)
    replace `t_var2' = subinstr(`t_var2', "`v'" , "", .)
  }

  replace `gen_match'=1 if `t_var1' ==`t_var2' & `gen_match'==0

  ************    End Word Match               ************

  * If there are free letters (# of letters that are allowed to be different)
  ************    Begin Ordered Letters Match  ************
  if "`letters_unordered'" == "" & 0 < real("`free_letters'") {

  * Start the missed letter count at 0
  replace `missed_letter_count' = 0 if `gen_match'==0

  * Loop through all of the lettered places for a number of loops equal to the longest word in either variable.
  forv v = 1(1)`longest_word_length' {
    * Add 1 to the missed letter count if the `v'th letter of both words does not match up
    replace `missed_letter_count' = `missed_letter_count'+1 ///
          if `gen_match'==0  & substr(`t_var1',`v',1) != substr(`t_var2',`v',1)
  }
  replace `gen_match' = 1 if `missed_letter_count' <= `free_letters'
  }
  ************    End Ordered Letters Match    ************
}

  ************    Begin Unordered Letters Match  ************
  qui if "`letters_unordered'" != "" {
    replace `missed_letter_count' = 0 if `gen_match'==0
    forv v = 1(1)`longest_word_length' {
      gen tempvar_var2_`v' = substr(`t_var2',`v',1)
    }
    forv v = 1(1)`longest_word_length' {
    gen tempvar_match`v' = 0
    * This generates a variable that indicates if letter `v' in var1 is matched with a letter in var2

    gen tempvar_var1_`v' = substr(`t_var1',`v',1)
* This specifies letter `v' position of var1

    gen tempvar_match_place`v' = .
    * This specfies at what place (in terms of var2 letters) var1 letter `v' got matched with var2 letters

    forv vv = 1(1)`longest_word_length' {
      * This checks if any of the unused letters of var1 match var2
      replace tempvar_match`v' = 1 if tempvar_var2_`vv'==tempvar_var1_`v' & tempvar_match_place`v'==.
      replace tempvar_match_place`v' = `vv' if tempvar_var2_`vv'==tempvar_var1_`v' & tempvar_match_place`v'==.
      * If any of them match they are eliminated by replacing them with the value "ZZ" which cannot be equal to any of the individual letters of using_text.
      replace tempvar_var2_`vv'="ZZ" if tempvar_match`v' == 1 & "`reuse_letters'"=="" & tempvar_match_place`v'==`vv'
    }
    * Add a counter to mismatched letters if none of the letters in the using match the current letter
    replace `missed_letter_count' = `missed_letter_count'+1 if `gen_match'==0 & tempvar_match`v'==0
  }
  * Drop the temporary matching variables
  cap drop tempvar*

  replace `gen_match' = 1 if `missed_letter_count' <= `free_letters'
  ************    End Unordered Letters Match  ************
}

 tab `gen_match'
 cap confirm variable `missed_letter_count'
   if !_rc tab `missed_letter_count'

end


set obs 1000

* Without specifying
gen v0= "John Smith"
gen v1="Smith John"

* This simply creates a variable match=1 if v0=v1
fuzzy v0 v1

* This creates a variable match=1 if v0=v1 with or without word order switched.
fuzzy v0 v1, w

gen v2="Smith        John"

* This creates a variable match=1 if v1=v2 ignoring blanks.
fuzzy v1 v2, b

* Note that turning on the word-order option (w) also tells fuzzy that blanks don't matter.

gen v3="   ///     J.,.o,h...n ZZZZZZ"

* Make sure to separate characters to exclude with spaces.
fuzzy v1 v3, b e(. , / Z)

gen obsid = _n

gen v4= v0 + string(obsid)

* This code will tell fuzzy match to check if the strings are similar with up to two letters wild
fuzzy v0 v4, f(2) b

fuzzy v0 v4, f(3) b

* l tells Stata to ignore letter order when searching for a match
gen v5="Jist mhohn"
fuzzy v0 v5, f(0) l b

* This fails because Stata is case sensitive and the s in Jist does not match the S in Smith.
* But you can turn off case sensitivity with the case option (c)
fuzzy v0 v5, f(0) l b c

* Finally, we might want to allow letters to be reused when attempting matching
gen v6="John Smith John"

fuzzy v0 v6 , f(1) r l
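The core of the free-letters criterion can be sketched compactly in Python. This is an illustrative re-implementation of the idea, not the Stata command itself; the `fuzzy_match` helper and its option names are assumptions of the sketch:

```python
def fuzzy_match(a, b, free_letters=0, ignore_blanks=False, ignore_case=False):
    """Match two strings, allowing up to free_letters positions to differ."""
    if ignore_case:
        a, b = a.lower(), b.lower()
    if ignore_blanks:
        a, b = a.replace(" ", ""), b.replace(" ", "")
    # Pad the shorter string so unmatched trailing letters count as misses
    longest = max(len(a), len(b))
    a, b = a.ljust(longest), b.ljust(longest)
    missed = sum(ca != cb for ca, cb in zip(a, b))
    return missed <= free_letters
```

As with the Stata command, a zero free-letter budget demands an exact (position-by-position) match, and each additional free letter forgives one mismatched position.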

Monday, June 25, 2012

R vs Stata Non-linear least squares!

# NLS (non-linear least squares) R simulation followed by parallel Stata simulation

# Assumptions:
# For some T0 as an element of THETA = parameter space
# NLS.1 E(y|x) = m(x,T0)
# For T1 an element of the parameter space THETA
# NLS.2 E{[[m(x,T0)-m(x,T1)]^2}>0 for all T0!=T1

# NLS requires uniform convergence in probability:
# meaning that max over theta of | N^-1 * sum_i q(wi, theta) - E[q(w, theta)] | -> 0

# This is to say that the difference between the sample average of q and E(q) must vanish as N -> infinity for all thetas.

#1. q(theta,wi) = theta*wi^2, so we know that for any theta the difference between it and its expected value can always be bounded.

#2. q(theta,wi) = w/theta might be more difficult.  If theta == 0 then w/theta is undefined and the difference clearly cannot be bounded.

# Let's start with 10000 observations

obs_id = 1:10000


# Now let's generate a variable w with 10000 random normal draws
w = rnorm(10000)

u = rnorm(10000)

theta = 2

y1 = theta*w^2 + u

y2 = w/theta + u

cbind(obs_id,w, y1, y2)[1:50,]
# The first 50 rows look good.
nls(y1~A*w^2, start=list(A=1))

nls(y2~w/B, start=list(B=1))

# Both these estimators work well.

# Let's try something a bit more complicated:

y3 = y1 + y2

nls(y3~A*w^2+w/B, start=list(A=1,B=1))

# R even has difficulty finding the solution to this estimator.

#   Now switching back to STATA
*  (same idea as above)

set seed 12
set obs 10000

gen w = rnormal()

gen u = rnormal()

gen theta = 2

gen y1 = theta*w^2 + u

gen y2 = w/theta + u

nl (y1 = {A=1}*w^2)
* It seems that the algorithm in Stata is not as precise as that used in R

nl (y2 = w/{B=1})
* It seems that the algorithm in Stata is not as precise as that used in R

gen y3 = y1 + y2

nl (y3 = {A=1}*w^2 + w/{B=1})
* However, the algorithm is able to solve this non-linear problem
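One way to see why this particular problem is tractable: conditional on B, the model y = A*w^2 + w/B is linear in A, so A can be profiled out in closed form and B found by a simple search. A numpy-only sketch of that idea (seed, grid, and tolerances are illustrative assumptions, not a production optimizer):

```python
import numpy as np

rng = np.random.default_rng(12)
n = 10_000
w = rng.standard_normal(n)
u = rng.standard_normal(n)
theta = 2.0
y3 = theta * w ** 2 + w / theta + 2 * u   # true A = 2, true B = 2

best = None
for B in np.linspace(0.5, 5.0, 451):      # grid over B in steps of 0.01
    target = y3 - w / B
    A = (w ** 2 @ target) / (w ** 2 @ w ** 2)   # closed-form LS for A given B
    rss = np.sum((y3 - A * w ** 2 - w / B) ** 2)
    if best is None or rss < best[0]:
        best = (rss, A, B)

rss_hat, A_hat, B_hat = best
print(A_hat, B_hat)
```

Because w and w^2 are nearly orthogonal for normal w, the profiled search lands close to the joint least-squares solution.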

Sunday, June 24, 2012

Commenting in Stata

* Commenting in Stata

* There are several common and useful ways to insert comments into Stata documents

*1. The very useful "*".  I don't think I need say more.

*2. You can comment out a section of text by starting it with /* and ending it with */

di "This" /* form of commenting can be quite useful because it */ " can appear within a command!"




di "There "  /*
 */ "is also a " /*
 */ "useful way " /*
 */ "of using comments " /*
 */ "to continue lines in Stata " /*
 */ "though I prefer a slightly " ///
 "different " ///
 "form of comment (///). " ///
 "Which comments out the " ///
 "end of a line."

Saturday, June 23, 2012

Quantile Regression (qreg) is invariant to non-decreasing transformations

* That is, med(f(x)) = f(med(x)) so long as f' >= 0

* LAD is invariant to non-decreasing transformations.
set seed 110

set obs 10000

gen x = rnormal()*8+6
* Because x is symmetric around 6 we know the median is 6

sum x, detail

gen fx = sign(x)*x^2+500
* fx is a non-decreasing function; we can see this by plotting fx against x

line fx x, sort

* Likewise, the median of fx is now easy to find: med(f(x)) = f(med(x))

* med(f(x)) = f(med(x)) = f(6) = sign(6)*6^2+500 = 536
* We can confirm this:

sum fx, detail

* Also, med(g(x)) = g(med(x)) holds for non-increasing g (g' <= 0) by the symmetry of the rank function around 50%
* med(g(x)) = g(med(x)) = g(6) = (-1)*(sign(6)*6^2+500) = -536
gen gx = (-1)*(sign(x)*x^2+500)
sum gx, detail

* Notice that the medians are mirror images of each other despite g' < 0.  However, the quantiles have now reversed order, so quantile(.25) of gx = -quantile(.75) of fx.

* This is because of the mirror nature of the generated data.
two (hist  fx, color(blue)) (hist  gx, color(red)), legend(label(1 "fx") label(2 "gx")) title(Mirror quantiles)
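The invariance claim can also be spot-checked outside Stata. A small numpy sketch mirroring the simulation above (seed and sample size are illustrative):

```python
# For a non-decreasing f, the sample median commutes with f
# (up to sampling noise): med(f(x)) = f(med(x)).
import numpy as np

rng = np.random.default_rng(110)
x = rng.standard_normal(100_000) * 8 + 6
fx = np.sign(x) * x ** 2 + 500    # non-decreasing transformation

med_of_f = np.median(fx)
f_of_med = np.sign(np.median(x)) * np.median(x) ** 2 + 500
print(med_of_f, f_of_med)          # both close to 536
```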

* But what is more interesting to us is how well LAD does at estimating the conditional median.

* First let us specify:

gen u = rnormal()*20

gen y = x*10 + u*10

* The conditional median of y given x is clearly 10*x (since med(u|x) = 0)

qreg y x
* And qreg is pretty good at identifying the conditional coefficient as 10.

* Also, because E(u|x)=med(u|x)=0, OLS also identifies the median.
* Thus the following also provides a good estimate
reg y x

* Now let us transform y so that it has larger tails using f(y)=fy:

gen fy = sign(y)*y^2+500

* Let's see how well LAD (least absolute deviations) works

qreg fy x
* But what does this mean?
* How well is the quantile regression working?

* Remember fy = sign(y)*y^2+500
* If y>0: fy(x) = y(x)^2 + 500 = (x*10 + u*10)^2 + 500
* And the conditional effect of x on fy is

* fy'(x) = 20*(x*10 + u*10)
* med(fy'(x)|x) = fy'(med(y|x)), evaluated at the median of x:
* 20*(med(x)*10 + med(u)*10) =
* 20*(med(x)*10) =
* 20*(6*10) = 1200

* Alternatively:
reg fy x

graph twoway (lfitci fy x) ///
             (scatter fy x)

* This regression does not work very well even though it has a higher R-squared.

Friday, June 22, 2012

Linear Probability Model (LPM) under misclassified dependent variables

* Linear Probability Model (LPM) under misclassified dependent variables

set obs 10000

gen x = rnormal()
  label var x "Explanatory variable"

gen u = rnormal()
  label var u "error: Unexplained variation"

* I want y_prob to be centered at .5 so that the bernoulli draw conditional on x and u is least likely to be censored, which would make the coefficients difficult to interpret.  If y_prob were centered at, say, .2, there would potentially be a large number of probabilities greater than 1 or less than 0, which would naturally attenuate the estimates, since the true change in probability cannot push the probability above 1 or below 0.
gen y_prob = .5 + .075*x + .075*u

gen y_ideal = rbinomial(1,y_prob)

* These are the ideal y values, with no misclassification
reg y_ideal x

* We can see we recover the coefficient on x quite well despite the binary nature of y_ideal

* Now let's imagine that we do not observe y_ideal but rather a noisy measure of it.
* 10% of our observations have misclassified the y values.
gen misclassified = rbinomial(1,.1)

* The following makes the observed y equal to the ideal when misclassification is not present.
gen y_observed = y_ideal if misclassified==0
* When misclassification is present the modular function works quite well at changing the binary values appropriately.
replace y_observed = mod(y_ideal+1,2) if misclassified==1

reg y_observed x
* We can see the attenuation bias typical of measurement error in y: the estimate on x is biased towards zero.  An interesting thing about this particular scenario is that we can say exactly how the attenuation goes: as the misclassification rate approaches .5, we expect all of the coefficients (which are uncorrelated with the misclassification) to approach zero.  In the unlikely case that the misclassification rate exceeds .5, we would expect the signs of the estimated coefficients to flip to the opposite direction.

gen misclassified2 = rbinomial(1,.5)
gen y_observed2 = y_ideal if misclassified2==0
replace y_observed2 = mod(y_ideal+1,2) if misclassified2==1
reg y_observed2 x

gen misclassified3 = rbinomial(1,.85)
gen y_observed3 = y_ideal if misclassified3==0
replace y_observed3 = mod(y_ideal+1,2) if misclassified3==1
reg y_observed3 x
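The pattern across the three misclassification rates has a clean closed form for random flipping: since E[y_obs|x] = p + (1-2p)E[y|x], the LPM slope is scaled by (1-2p), so p = .5 wipes it out and p > .5 flips its sign. A quick numpy check of this, mirroring the simulation above (seed and sample size are illustrative; `np.where(flip, 1 - y, y)` plays the role of the mod(y+1,2) flip):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.standard_normal(n)
y_prob = 0.5 + 0.075 * x + 0.075 * rng.standard_normal(n)
y = (rng.random(n) < np.clip(y_prob, 0, 1)).astype(float)

def lpm_slope(y_obs, x):
    """OLS slope of y_obs on x (the LPM coefficient)."""
    return np.cov(y_obs, x)[0, 1] / np.var(x)

slopes = {}
for p in (0.0, 0.1, 0.5, 0.85):
    flip = rng.random(n) < p            # misclassify with probability p
    y_obs = np.where(flip, 1 - y, y)
    slopes[p] = lpm_slope(y_obs, x)
    print(p, slopes[p])                 # roughly .075 * (1 - 2p)
```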

* The question then is, do we expect the probit/logit models to behave better than the LPM?
* See:

* Dave references the following paper:
* Hausman, J. A., J. Abrevaya & F. M. Scott-Morton, 1998. Misclassification of the dependent variable in a discrete-response setting. Journal of Econometrics, 87, 239-269.

probit y_observed x
margins, dydx(*)
* Looking at the average partial effect it seems to me that the two methods yield nearly identical results.

* dprobit also presents a potential method though I am less comfortable with it than I am with the average partial effect since the APE seems to me to better represent the characteristics of the underlying population.
dprobit y_observed x

* Ultimately though, I find myself more and more in favor of using the LPM if all that you are looking for is the average partial effect.  MLE methods are more flexible in general and allow the effect of x to vary by population characteristics, which can be good if the specification is known, but I cannot help thinking is hazardous if the specification is unknown.  Probit is a more complex procedure and tends to yield very similar results to the LPM.

* Thus in the absence of information I am inclined to use the LPM over probit.  I know that this can violate underlying theory, but if the estimates are just as good (or just as bad) as those from other procedures, then I prefer the method that is easiest to run and write.

Thursday, June 21, 2012

Social network analysis simulation - contagious depression

* Social network analysis simulation - contagious depression

* We would like to simulate a panel data set in which we have students who might be related to each other as friends.

* If one of the persons becomes depressed then that person will increase the likelihood of friends becoming depressed.

*****          *****
* Model Parameters *
*****          *****


* Set the number of students to be simulated
local Num_students = 50

* Set the minimum probability that any two people are friends despite distance
local min_friend_prob = .01

* Set the max friend probability
local max_friend_prob = .75

* Set the distance coefficient. P(j&i being Friends)=max(`max_friend_prob'-`alpha'*distance(i,j),`min_friend_prob')
local alpha = 1

* Initial likelihood of facing depression P(depression)~base_likelihood + beta*#_friends_w_depression
local base_likelihood=.15
local beta = .1

*****       *****
* Generate Data *
*****       *****

set obs `Num_students'

gen stud_id = _n

* Lets imagine there is uniform three dimensional space x, y, and z (-1/2,1/2) in which students are connected to each other.

* This space is unobservable yet for the purposes of this simulation let's give it labels.

gen x = runiform()-.5
  label var x "Punk/Goth/Alternative (conformity scale)"

gen y = runiform()-.5
  label var y "Socio-economic status (wealth scale)"

gen z = runiform()-.5
  label var z "Attractiveness (physical appearance scale)"

* Now let's say the likelihood of any two people being friends is equal to (50% less the Euclidean distance from the other person with a minimum of 1%).

* To do this we will have a recursive loop that loops through all of the students and all of their potential "friendship" matches.

* First let's generate indicator variables to indicate if the student is friends with the other students
forv i=1(1)`Num_students' {
  gen fri_`i' = 0
    label var fri_`i' "This student is friends with student `i'"
}

* Count the number of friends
gen num_friends = 0

gen connection = 0
  label var connection "The connection number of the current friendship"
local num_connections = 0

* Now let's generate friendship levels
forv i=1(1)`Num_students' {
  * This recursively loops all of the students up to this point
  forv j=1(1)`i' {
    * A student cannot be friends with himself
if `i'!=`j' {
 * calculate the euclidean distance between student i and j
 local dist_ij = ((x[`i']-x[`j'])^2+(y[`i']-y[`j'])^2+(z[`i']-z[`j'])^2)^.5
 * Calculate the probability that i and j are friends
 local prob_ij = max(`max_friend_prob'-`alpha'*`dist_ij',`min_friend_prob')
 * Draw a bernoulli result for friendship
 local friends_ij = rbinomial(1,`prob_ij')
  di "`i'&`j' are " string(`dist_ij',"%9.2f") " apart ; friends prob=" string(`prob_ij',"%9.2f") " ; they are friends <`friends_ij'> 1=yes, 0=no"

 * This changes the variable indicators of how many friends students have.
 * This changes the variable indicators of how many friends students have.
 qui if `friends_ij' == 1 {
    local num_connections = `num_connections' + 1
    replace num_friends = num_friends+1 if stud_id==`i' | stud_id==`j'
    replace fri_`i' = 1 if stud_id==`j'
    replace fri_`j' = 1 if stud_id==`i'
 }
    }
  }
}

di "--- `num_connections' connections (friendships) made ---"

tab num_friends

* My prediction is that the farther students get from the centers in any/all of the categories the less friends they will have.
reg num_friends x y z

* This prediction does not show up under this specification.

* That is because the effect is mirrored whether x y and z get larger than zero or less than zero.

* This simple OLS should fail.

* However, if we break the data into two observations positive and negative then I think we will come up with more interesting results.

foreach i in x y z {
  gen `i'_pos = max(0,`i')
    label var `i'_pos "Variable `i''s positive values"
  gen `i'_neg = min(0,`i')
    label var `i'_neg "Variable `i''s negative values"
}

* Now let's see what results this yields
reg num_friends *pos *neg

* This is because in a uniform distribution (like almost all distributions) the closest any observation can get to the expected position of all other observations is the center of the distribution.
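* A quick simulation check of this claim (a sketch; the comparison point .4 is arbitrary):
preserve
clear
set obs 100000
gen draw = runiform() - .5
gen d_center = abs(0  - draw)
gen d_offctr = abs(.4 - draw)
sum d_center d_offctr
* The mean distance from the center (about .25) is smaller than from .4 (about .41)
restore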

*****              *****
* Spread of depression *
*****              *****

* We would like to simulate the spread of depression through our network.

* First let us calculate the original level of depression:
* Initial likelihood of facing depression P(depression)~base_likelihood + beta*#_friends_w_depression

gen depression = rbinomial(1,.15)
  label var depression "Initial students depressed"

two (scatter y x if depression==0) (scatter y x if depression==1) , legend(label(1 "Healthy") label(2 "Depressed"))

gen depression2=0

* Next we will loop through every student who is not yet depressed to see if that student becomes depressed
forv i=1(1)`Num_students' {
  * If the student is already depressed, skip that student
  if depression[`i']==0 {
    forv j=1(1)`Num_students' {
      * This checks if j is friends with i and j is depressed
      qui if fri_`j'[`i'] == 1 & depression[`j']==1  {
        local depression_spread=rbinomial(1,`beta')
        if `depression_spread' == 1 {
          replace depression2=1 if _n==`i'
          noi di "Student `i' becomes depressed as a result of `j''s depression"
        }
      }
    }
  }
}

two (scatter y x if depression==0) (scatter y x if depression==1) ///
    (scatter y x if depression2==1) , legend(label(1 "Healthy") ///
label(2 "Depressed") label(3 "Newly depressed"))

*****                 *****
* Generate network graphs *
*****                 *****

* Can we plot one of those awesome social network graphs in Stata?
 gl two_list

* Person
forv i=1(1)`Num_students' {
  local x_pos = x[`i']
  local y_pos = y[`i']
  qui gen conct_x_`i' = `x_pos' if _n==1
  qui gen conct_y_`i' = `y_pos' if _n==1
  * This creates two variables for every person in the network.
  * It allows in effect a separate network for each person.

  local count_position = 1

  forv j=1(1)`i' {
    * This checks if j is friends with i
    qui if fri_`j'[`i'] == 1  {
      replace conct_x_`i' = x[`j'] if _n==`count_position'+1
      replace conct_y_`i' = y[`j'] if _n==`count_position'+1
      replace conct_x_`i' = `x_pos' if _n==`count_position'+2
      replace conct_y_`i' = `y_pos' if _n==`count_position'+2
      * This draws a connection from person i to person j before returning to person i.
      local count_position = `count_position'+2
      * This moves the position up 2 spaces to make space for the next expansion of the set.
    }
  }
  gl two_list ${two_list} (line conct_y_`i' conct_x_`i', mcolor(gs8))
  * This adds an entry to the list of graphs to be drawn for each person.
}

* Warning: this graph can take a long time to draw.
two ${two_list} (scatter y x if depression==0) (scatter y x if depression==1) ///
    (scatter y x if depression2==1) , legend(off)

* Note, this is not how network data is typically stored and created.

* A typical way of storing network data is as edgelists (see:

* We can create an edge list from our data with the following code:

gen edge1 = .
gen edge2 = .

local count_position = 1

* This will loop through all of the students and add to the edge list one observation for every connection in each direction
forv i=1(1)`Num_students' {
  forv j=1(1)`i' {
    * This checks if j is friends with i
    qui if fri_`j'[`i'] == 1  {
      replace edge1 = `i' if _n==`count_position'
      replace edge2 = `j' if _n==`count_position'
      local count_position = `count_position'+1
      noi di "edge1 edge2"
      * This will make sure we do not run out of observations in the data set to record our edges
      if `count_position'==_N set obs `=`count_position'+1'
    }
  }
}

* netplot is a user-written program that allows social network plots to be easily drawn from edgelist data.
* It works much faster than my network mapping code generated above.
cap netplot edge?
* If you do not have netplot installed the following code should install it.
if _rc!=0 {
  ssc install netplot
  netplot edge*
}

* This option allows network plots to be drawn as a circle.
cap netplot edge*, type(circle)

Wednesday, June 20, 2012

Non-Linear Least Squares (M-estimation)

* Non-Linear Least Squares (M-estimation)

* Sometimes we would like to estimate values that cannot be written in closed form

* For instance imagine Y=exp(XB) and we want to estimate the Bs directly.

* Clear any data from memory
clear

* Set the initial number of observations
set obs 10000

gen x=rnormal()

gen u = rnormal()*10

gen y=exp(3+1*x+u)

* We can directly estimate B0 and B1 with non-linear least squares
nl (y=exp({b0}+x*{b1}))

* Or we can transform y
gen lny=ln(y)

reg lny x

* Notice that the non-linear least squares estimate is pretty bad.

* This is because the additive error r = y - exp(b0+b1*x) does not have mean zero: the random error is enclosed in the exponential, so it enters multiplicatively, and E(exp(u)) > 1 even though E(u) = 0.

* Thus, defining the true error r = y - exp(3+1*x):

gen r = y-exp(3+1*x)

* We can see the source of the bias in b0 in the nl estimation:
sum r
* Which causes positive bias in b0
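* The scale of that mean is no surprise (a sketch of the arithmetic):
* for u ~ N(0, sigma^2), E[exp(u)] = exp(sigma^2/2), not exp(E[u]) = 1.
* With u = rnormal()*10, sigma^2 = 100, so E[exp(u)] = exp(50), an enormous number:
di exp(50)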

gen expx=exp(x)
corr r expx
* Which causes positive bias in b1

* To generate a simulation that rightly favors the nl estimator we might generate a different y

gen y2 = exp(3+1*x) + u*10

nl (y2=exp({b0}+x*{b1}))

gen lny2=ln(y2)

reg lny2 x

* The problem with a simulation like this is that we can estimate B without needing to use non-linear least squares.

* Let us try something a little trickier: y = 5*x*sin(B0+B1*x) + B2*x + u

* An equation that cannot be decomposed into closed form.

set seed 101

set obs 10000

gen x = runiform()*_pi

gen y =5*x*sin(1+2.95*x)+2*x+rnormal()*1

scatter y x

nl (y=5*x*sin({b0}+{b1}*x)+{b2}*x)

* These coefficients (b0 and b1) do not look right, but then again maybe they are, since the sine function is symmetric and periodic.

* Thus two different input vectors could predict the same output.

* Let us check:

predict y_hat

two (scatter y x) (line y_hat x, sort color(pink)), title (Non-linear least squares fits pretty well)

* Looking pretty darn good.

cor y_hat y

* Of course this example is also very contrived.

* In order to estimate B you must know pretty exactly the functional form of m (where y=m(x)+u)

* Let's test how well the estimate performs when there is more noise in the model.

* Try instead:

gen y2 =5*x*sin(1+2.95*x)+2*x+rnormal()*10

nl (y2=5*x*sin({b0}+{b1}*x)+{b2}*x)

predict y2_hat

two (scatter y2 x) (line y2_hat x, sort)

* Now let's see what happens if we do not know the functional form as well:

* Let's say we say we saw:
scatter y x

* and we want to fit a polynomial to it.

gen x2 = x^2
gen x3 = x^3
gen x4 = x^4
gen x5 = x^5
gen x6 = x^6

reg y x x2 x3 x4 x5 x6

predict y_hat2

two (scatter y x) (line y_hat2 x, sort color(orange)), title(OLS with lots of polynomials fits well too)

corr y_hat2 y

* Not too bad a fit at all!

* But does a good fit imply a good forecast?

* Let's imagine that the true distribution of x is uniform on (0, 1.15*_pi) and we only observed the (0, _pi) portion of it.

* If we are using x to predict y in these two scenarios, we will come to two very different conclusions in our predictions.

* Imagine that we have 500 more new observations:

set obs 10500

gen new_obs = 1 if x==.
  * This variable is newly observed.

replace x = runiform()*_pi*1.15 if new_obs == 1

replace y = 5*x*sin(1+2.95*x)+2*x+rnormal()*1 if new_obs == 1

two (scatter y x if new_obs==. ,  msymbol(smcircle)) (scatter y x if new_obs==1 ,  msymbol(smcircle))

* Now let's see how our two models do.
* First we will just resubmit the nl regression using the old data
nl (y=5*x*sin({b0}+{b1}*x)+{b2}*x) if new_obs==.

* Then predict the yhat values
predict y_hat1_new

* Next we will redo the polynomial regression using the old values.
reg y x x2 x3 x4 x5 x6 if new_obs==.

* We need to make sure all of the x2, x3, ... xn polynomials are defined for the new observations.
replace x2 = x^2
replace x3 = x^3
replace x4 = x^4
replace x5 = x^5
replace x6 = x^6

cap predict y_poly_new
* This did not generate y_hat for all of the values.

two (scatter y x if new_obs==. , msymbol(smcircle)) ///
    (scatter y x if new_obs==1 , msymbol(smcircle)) ///
(line y_hat1_new x, sort color(pink) lwidth(.7)) ///
(line y_poly_new x, sort color(black) lwidth(.4)) ///
, legend(label(1 "Original Observations") ///
label(2 "New Observations") ///
label(3 "Non-linear model (correct)") ///
label(4 "Polynomial model")) ///
title(Specification is Everything in Forecasting)

* Despite both models having very similar R-squared values, the correctly specified one is much better at forecasting.

* Of course the difficulty of discovering the underlying structural model is another matter :)

* Thus every so often you hear about a group of financial wizards who are able to "predict" the stock market for a few time periods but then something changes and the models that looked so good to begin with end up generating lots of trash.

Tuesday, June 19, 2012

Batch Convert files from SPSS into Stata format

# Batch Convert files from SPSS into Stata format (using R)
# Written by Francis Smart []

# a. First install R (

# b. Next install the package  foreign  from the Package Menu by selecting "Install Packages" [Packages>Install Package(s)>Select Closest Mirror>foreign].

# Now the following code should work:

# First load the foreign package
library(foreign)

# Set working directory (the directory or super-directory (opposite of sub?) in which sav files can be found)
# In R you must specify directories with either / or \\
# You must replace *my_directory* with the directory where to look for the sav files.
setwd("*my_directory*")

# Create a list of all of the sav files in the directory
dta_list <- dir(pattern = ".sav$", recursive = TRUE, = TRUE)
# The default in this code is to look for files recursively (meaning within sub-directories).
# If you only want to look within the specific directory turn recursive to equal F.

# Loop through each member of the list
for(self in dta_list){

# Display what file will be read
  cat("Reading file ", self, "\n")

# Import the data as a dataframe (a class of objects in R intended for spreadsheet type data)
  data <-, use.value.labels=TRUE, max.value.labels=Inf,

# Save the file substituting the sav extension for the dta extension
  write.dta(data, sub(".sav",".dta",self))
# dta files will be saved to the same directory as the original files
}

# I do not know how to correct the displayed error.  

# If you have any suggestions, please email me.

# Note this loop could easily be modified to accommodate many file formats since the "foreign" package  can handle many different types of file formats.

Monday, June 18, 2012

Demographic- Mortality Rates Simulation

* The following simulation generates mortality rates and typical demographic type data.

* NOTE: The values in this simulation do not reflect true demographic information; they are only meant to sketch how someone might simulate demographic data.
set seed 1010

* Empty the memory
clear

* Set the number of observations to generate
set obs 2000

* Specifies gender
gen female=rbinomial(1,.5)

* Specifies if person smoked
gen smoked=rbinomial(1,.25)

* Specifies how much the person exercised
gen exercise = runiform()*.5

* Specifies personal heterogeneity
gen genetics = abs(rnormal(1,.5))+.5

* Now let us generate initial rates:

* All of our people start at age 1
gen age=1
* None of them are initially deceased
gen deceased=0

gen mortality_age=.
  label var mortality_age "Age at death"

gen id=_n
  label var id "Personal ID"

gen cancer_risk = 0.005

gen death_prob = 0

gen age_risk = .

* Generates birth year
gen brth_year = round(runiform()*150)+1850

* Specifies the year of observation in the simulation
gen year = brth_year + age - 1

* Specifies the starting age
local age=0

* Is an indicator that when positive tells the while loop to end.
gl end=0

* Begin the mortality loop
qui while $end==0 {

  * Age is the age that the person will be in at the end of the current loop
  local age=`age'+1

  * Generates the current year
  replace year = brth_year + age  - 1

  * Generate the probability of death by cancer if the person smoked
  replace cancer_risk = `age'^.5*.001 if smoked == 1 & age > 18 &  age == `age'

  * Generate a general aging mortality factor
  replace age_risk = `age'^1.9-`age'^1.89

  * Create the combined death probability
  replace death_prob = .001*genetics*age_risk + cancer_risk - exercise*.1 if deceased == 0

  replace death_prob = abs(rnormal())/(200*age^2) if age < 10 & deceased == 0 &  age == `age'
  * High mortality rate in the first years of life

  * Men tend to have higher mortality rates at all ages
  replace death_prob = death_prob+.01 if female==0 &  age == `age'

  * The death rate for males increases during World War I and WWII
  replace death_prob = death_prob + .03 if age >17 & female==0 & deceased == 0 & year > 1913 & year < 1919 &  age == `age'
  replace death_prob = death_prob + .05 if age >17 & female==0 & deceased == 0 & year > 1939 & year < 1946 &  age == `age'

  * Make sure there is a minimum death rate of 1 out of 1000
  * replace death_prob = max(death_prob,.001) if  deceased == 0 & age == `age'

  * Save the true average probability of dying for that age group
  sum death_prob if deceased == 0 &  age == `age'
  local death_prob = r(mean)

  * Generate a variable to indicate if this person died at this age
  gen yr_o_death = rbinomial(1, death_prob) if deceased == 0 & age == `age'

  * Replace the person's life living indicator if that person is deceased
  replace deceased = 1 if yr_o_death == 1 & age == `age'
  replace mortality_age = `age' if yr_o_death == 1 & age == `age'

  expand 2 if deceased == 0 & age==`age', gen(new_yr)

  replace age=age+1 if new_yr==1 & age==`age'

  sum deceased if age==`age'
  drop new_yr yr_o_death

  if r(N)==0 gl end=1

  noi di "Age: `age' mortality rate " r(mean) ": survivors : " r(N) " : true mean death prob =  `death_prob'"
}


  * Dummy variables to indicate if the person (male, adult) was alive during the wars
  gen WW1=0
  gen WW2=0
  replace  WW1 = 1 if female==0 & year > 1913 & year < 1919 & age >17
  replace  WW2 = 1 if female==0 & year > 1939 & year < 1946 & age >17

  * Create dummies for if the person was male, lived through the years
  bysort id: egen WW1dum=max(WW1)
  bysort id: egen WW2dum=max(WW2)

* This will graph a histogram of mortality by age
qui sum mortality_age
local bin_max = r(max)
hist mortality_age, kden bin(`bin_max') percent
sum mortality_age

* We can see that the simulation is producing the right signs for the independent variables.
reg mortality_age female exercise smoked WW?

* The following will look at the distribution of ages at a particular time (1945)
local yr=1945
qui sum age if year == `yr'
local bin_max = r(max)
di "hist age if year == `yr', kden bin(`bin_max') percent"
hist age if year == `yr', kden bin(`bin_max') percent title(Population Distribution in `yr')
sum mortality_age if year == `yr'

qui sum mortality_age
local bin_max = r(max)
hist mortality_age, kden bin(`bin_max') percent by(age)
sum mortality_age

Sunday, June 17, 2012

3 Ways of Loading SPSS (sav) files into Stata

1. The easiest and most straightforward way is using the user written package usespss.

  This package, however, only works on 32-bit Windows.

2. The next easiest method, if you have it available, is using Stat/Transfer, which allows for the conversion of many types of files from one format to another:

  A license could be had for as little as $59 if you are a student.

3. The last option takes a few more steps but is free.

  a. First install R

  b. Next install the package Rcmdr from the Package Menu by selecting "Install Packages" [Packages>Install Package(s)>Select Closest Mirror>Rcmdr].

  c. Next open up the Rcmdr package by going to the menu [Packages>Load Package...>Rcmdr]

 d. When Rcmdr loads it should load the R Commander window.  Select from the menu [Data>Import Data>from SPSS data set]

 e. Name your data set Data1 [select 'ok'], now find your sav file.

 f. If there has been no problem up to this point you should now have a data set in memory called Data1.

 g. Now you can export to a dta file with the following command:
write.dta(Data1, "File Directory/File Name.dta")

(In R you need to specify directory dividers as either / or \\ not \.)

Saturday, June 16, 2012

Stata wildcards

* Stata wildcards and shortcuts

* Wildcards are extremely useful.

* They can save a lot of time as well as create coding outcomes that are otherwise impossible or extremely difficult to set up.

* Let us first generate a handful of variables to experiment on:

gl A c3a3aa a1 a2 a3 b2 b3 b4 c3 c4 c5 a3a3

set obs 12

* This will loop through each element of the global A and generate corresponding variables.
foreach v in $A {
  gen `v'=rnormal()
}

* Now if we want to use only some of the variables with wildcards.
* A ? allows for one character to be wild
sum a?

sum ?3

* Sometimes we can use combinations of stars and ? to target specific groups of variables
sum ?3a*

* A star allows for any number of characters to be wild
sum a*

sum *

* We can also refer to variables through variable range specifications:

sum b2-c4

Friday, June 15, 2012

Stata Coding Basics - foreach

* Stata Coding Basics - foreach

* Stata has two types of for commands

* 1. The forvalues command, which is used to loop through sets of numbers. (see previous post)

* 2. The foreach command, which has a different syntax and can loop through just about any other list of interest.

* For instance, let's imagine we have a list of variables we would like to create.

set obs 100

foreach v in varA varB varC varD VarE VarF {
  gen `v' = rt(100)
}

* The v tells stata what the local that the foreach loop will use is called.

* The in after v tells Stata to take a discrete list of objects and loop through them.

* This list can use quotes to join items separated by a space together.

foreach v in "varA varB" "varC varD" "VarE" VarF {
  di "The first value of `v' =" `v'
}
* Notice that there is no difference between "VarE" and VarF
* Also notice the difference in how `v' behaves outside of quotes versus "`v'" within quotes.

* Foreach loops can also use wildcards:

foreach v of varlist var* {
  di "The first value of `v' =" `v'
}
* Notice VarE and VarF are not displayed.

* This is because Stata is case sensitive.

* We can make sure to loop through all of the variables through more general choice of wildcards.

foreach v of varlist ?ar* {
  di "The first value of `v' =" `v'
}

* Of course if we are going general with wildcards then we can do the following:
foreach v of varlist * {
  di "The first value of `v' =" `v'
}

* Note the "of varlist" is important.

* Otherwise Stata does not know what you are referring to with the wildcard.
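* A quick way to see this (a sketch): in a plain "foreach ... in" list, the * is kept as a literal token rather than expanded into variable names.
foreach v in * {
  di "`v'"
}
* This loops exactly once, displaying the literal character *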

* There are several other specifications that can take the place of varlist.

* I will go over a few briefly, but I don't really see their value above what I have already mentioned.

* If we have a global called A then we can loop through the elements of A via:

gl A "" "Time-$S_TIME" "" "Hello," "" "How are you `c(username)'?" "Do you like tuna fish?"

foreach v of global A {
  di "`v'"
}

* However, an equally effective way of doing this would be:
foreach v in $A {
  di "`v'"
}

* This summarizes my thoughts and examples of foreach for now.

* It is an essential and efficient command for many circumstances.

Thursday, June 14, 2012

Werewolf attack: Spatial Multi-agent simulation - basic artificial intelligence

* This is a simulation of a werewolf attack using Stata.

* Initially there are X people on a map and a few werewolves.

* Werewolves head toward whatever unprotected humans are on the map.

* If they encounter a human they kill that human and go toward the next unprotected human.

* Humans do not know about the werewolves until they become within sensory range - specified by the user.

* Once they become within sensory range humans immediately seek the closest shelter.

* Shelters are distributed throughout the map.

* Once a human is in a shelter the human is safe.

* Simulation Parameters Begin


* Set the number of humans
local num_humans=1500

* Human sight range (radius)
local human_vision=65

* Human run speed
local human_speed = 1

* Set number of shelters
local num_shelters = 3

* Specify the initial number of werewolves
local num_werewolves = 8

* Werewolf run speed (if humans are faster than the werewolves then only an unlucky human will ever be caught)
local werewolf_speed = 5

* Specify the dimensions of the world grid
local xrange=400
local yrange=400

* Graph interval every x rounds the program graphs what happens
local graph_int = 20

* Tell Stata whether to graph as the simulation progresses (1=draw, 0=wait until the end)
local draw=1

* Simulation Parameters End

* This tells stata to either draw the graphs as it runs the code or wait till the end to draw the graphs.
if `draw'==0 local nodraw nodraw

set obs `num_humans'

gen hmn_id=_n
  label var hmn_id "Human ID"

* To begin with all types are human
gen ty = 1
  lab var ty "Observation type 1=human, 2=werewolf, 3=shelter"
  label define obj_type 1 "Human" 2 "Werewolf" 3 "Shelter" 4 ///
     "Deceased" 5 "Spooked Humans" 6 "Safe"
  label values ty obj_type
* Then we will add the werewolves

set obs `=`num_humans'+`num_werewolves''

replace ty = 2 if ty==.

* Put the werewolves on the top of the list
gsort -ty
  gen were_id = _n if ty==2
  label var were_id "Werewolf ID"

* Next the shelters
set obs `=`num_humans'+`num_werewolves'+`num_shelters''

replace ty = 3 if ty==.

tab ty
gsort -ty
  gen shelter_id = _n if ty==3
  label var shelter_id "Shelter ID"

* Looks pretty good so far.

* People, shelters, and werewolves are placed uniformly throughout the map.
gen x=runiform()*`xrange'
gen y=runiform()*`yrange'

* Though werewolves will enter the map from the edges
gen xy_edge = rbinomial(1,.5)

replace x=round(x, `xrange') if xy_edge==0 & ty == 2
replace y=round(y, `yrange') if xy_edge==1 & ty == 2
* This makes it so the werewolves start on one of the sides of the map.

* Generate shelter locations clustered toward the center of the map
sum x
replace x=rnormal(r(mean),r(sd)*5/8) if ty == 3
sum y
replace y=rnormal(r(mean),r(sd)*5/8) if ty == 3

two (scatter y x if ty==1 , mcolor(gs12) msymbol(smcircle)) ///
    (scatter y x if ty==3, msymbol(square) color(yellow))   ///
    (scatter y x if ty==2, color(cranberry) )               ///
, legend(label (1 "Humans")                             ///
    label (2 "Werewolves")                                   ///
    label (3 "Shelters")                                    ///
    rows(1))  `nodraw' title(Full Moon - Round 0)
* First we will calculate which direction and where the humans will run.
* This direction is constant since humans do not try to avoid werewolves, they just run for shelter.

* We will loops through shelters and calculate distance for each human.

global shelter_distance_list

qui forv i=1(1)`num_shelters' {
  * The first x number of observations are the werewolves.
  sort shelter_id
  local shelter_x = x[`i']
  local shelter_y = y[`i']
  * shelter_x and shelter_y hold the position of the current shelter

  * Now let's calculate the distance of the spooked humans from the shelters.
  cap drop dist_hm_shltr`i'
  gen dist_hm_shltr`i'=((x-`shelter_x')^2 + (y-`shelter_y')^2)^.5 if ty==1

  global shelter_distance_list $shelter_distance_list , dist_hm_shltr`i'
}


di "$shelter_distance_list"
* We can see that in our list of shelter distances there is an unwanted leading comma.

* For now we will just fill in a meaningless value (99999) below to prevent an error.

* Now we figure out which shelter is the closest for all of the humans collectively
gen closest_shelter = .
gen closest_shelter_x = .
gen closest_shelter_y = .
gen closest_shelter_dist = .

qui forv i=1(1)`num_shelters' {
  replace closest_shelter = `i' if  dist_hm_shltr`i' == min(99999 $shelter_distance_list)
  replace closest_shelter_dist = dist_hm_shltr`i' if  closest_shelter == `i'

  local shelter_x = x[`i']
  local shelter_y = y[`i']

  replace closest_shelter_x = `shelter_x' if closest_shelter == `i'
  replace closest_shelter_y = `shelter_y' if closest_shelter == `i'
}


tab closest_shelter

* Now we will calculate the ideal trajectory for each human given that human is running for shelter.
* We know a, b, and c [see graph]; see the next section for the algebra.
  gen a = y-closest_shelter_y
  gen b = x-closest_shelter_x

  gen a_prime=1/(1+b^2/a^2)^.5

  * b'^2 = 1-a'^2
  * b' = (1-a'^2)^.5
  gen b_prime = (1-(a_prime)^2)^.5

  gen x2 = x - b_prime*sign(b) if ty == 1
  gen y2 = y - a_prime*sign(a) if ty == 1

  replace x2 = x2 - b_prime*sign(b) if ty == 1
  replace y2 = y2 - a_prime*sign(a) if ty == 1

two (scatter y x if ty==1 , mcolor(gs12) msymbol(smcircle)) ///
    (scatter y2 x2 if ty==1, mcolor(sandb) msymbol(smcircle)) ///
    (scatter y x if ty==3, msymbol(square) color(yellow))   ///
, legend(label (1 "Starting Place")                     ///
    label (2 "One step")                                    ///
    label (3 "Shelters")                                    ///
    rows(1)) `nodraw' title(Sample humans running for shelter) ///
legend(color(white) region(fcolor(black)))
drop x2 y2

* Werewolves search for human targets

* I loop through each of the werewolves.

gen x2=.
gen y2=.

* Each werewolf will survey all of the humans and find the one who is the closest.
qui forv i=1(1)`num_werewolves' {
  * The first x number of observations are the werewolves.
  sort were_id
  local were_x = x[`i']
  local were_y = y[`i']
  * were_x and were_y hold the position of the current werewolf looking for its prey

  * Now let's calculate the distance of that werewolf from the humans.
  cap drop dist_hm_wr`i'
  gen dist_hm_wr`i'=((x-`were_x')^2 + (y-`were_y')^2)^.5 if ty==1 | ty==5
  * We take the square root, but for finding targets it is unnecessary since the rank order of the distances stays the same.

  replace ty=5 if dist_hm_wr`i'  <  `human_vision'

  sort dist_hm_wr`i'
  * The top of this list is the target human.

  * If a human is less than `werewolf_speed' distance from the werewolf and is the closest human to that werewolf, then the human is dead.
  replace ty = 4 if dist_hm_wr`i' < `werewolf_speed' &  _n==1

  * Now we need to calculate the attack vector.

  * Ie, what is the linear projection that gets the werewolf from its current location to the human fastest (in terms of x and y movement)?

  * We can use some algebra and a little trig to find the proper vector.

  * We know a, b, c
  local a = `were_y'-y[1]
  local b = `were_x'-x[1]

  * And c' which is equal to the speed of werewolves
  local c_prime=`werewolf_speed'

  * Triangle abc is similar to triangle a'b'c' so the ratios b/a=b'/a'
  * b'=b(a'/a)
  * (1) b'^2=b^2 * a'^2 / a^2
  * (2) Pythagoras (500BC): a'^2 + b'^2 = c'^2
  * (3) insert (1) into (2): a'^2 + b^2 * a'^2 / a^2 = c'^2
  * a'^2(1 + b^2 / a^2) = c'^2
  * a'^2 = c'^2/(1 + b^2 / a^2)
  * a'=c'/(1 + b^2 / a^2)^.5

  local a_prime=`c_prime'/(1 + (`b')^2 / (`a')^2)^.5

  * b'^2 = c'^2-a'^2
  * b' = (c'^2-a'^2)^.5
  local b_prime = ((`c_prime')^2-(`a_prime')^2)^.5
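* The derivation above is easy to sanity-check numerically. Here is a quick
* standalone check in Python (not part of the do-file), using hypothetical
* offsets a = 3, b = 4 and speed c' = 2:

```python
import math

# Hypothetical offsets from the werewolf to its target (legs of triangle abc)
a, b = 3.0, 4.0
# The werewolf's speed, i.e. the hypotenuse c' of the one-step triangle
c_prime = 2.0

a_prime = c_prime / (1 + b**2 / a**2) ** 0.5
b_prime = (c_prime**2 - a_prime**2) ** 0.5

# The step has length c' ...
assert math.isclose(a_prime**2 + b_prime**2, c_prime**2)
# ... and points along the original direction (same leg ratio b/a)
assert math.isclose(b_prime / a_prime, b / a)

print(a_prime, b_prime)  # roughly 1.2 and 1.6
```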

  * It is important to note the direction that the werewolf needs to travel as well.

  * x2,y2 record where the werewolf will be after its move (toward the target);
  * x,y are left in place so this round's graph shows the starting position.
  replace x2 = x - `b_prime'*sign(`b') if were_id == `i'
  replace y2 = y - `a_prime'*sign(`a') if were_id == `i'

  noi di "Werewolf `i' moves"

}
* End of the loop over werewolves.

two (scatter y x if ty==1 , mcolor(gs12) msymbol(smcircle)) ///
    (scatter y x if ty==5 , color(sandb) msymbol(smcircle)) ///
    (scatter y x if ty==2, color(cranberry))                ///
    (scatter y2 x2 if ty==2, color(red))                    ///
    (scatter y x if ty==3, msymbol(square) color(yellow))   ///
    (scatter y x if ty==4, msymbol(smplus) color(gs16))     ///
, legend(label (1 "Unconcerned Humans")                 ///
    label (2 "Spooked Humans")                              ///
    label (3 "Werewolves")                                  ///
    label (4 "Werewolf's 1st Move")                         ///
    label (5 "Shelters")                                    ///
    label (6 "Deceased")                                    ///
    rows(2) color(white) region(fcolor(black)))             ///
  `nodraw' title(Full Moon - Round 0)

* Spooked Humans Seek Out Closest Shelter

* Move spooked humans towards shelters.
replace closest_shelter_dist = closest_shelter_dist-`human_speed' if ty == 5

replace ty=6 if closest_shelter_dist < 0 & ty == 5
  * Once a human gets to a shelter, that human is considered safe.
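* The shelter-countdown logic can be illustrated with a tiny Python sketch
* (again not part of the do-file; the numbers are made up): a spooked human
* 2.5 units from shelter, moving 1 unit per round, is safe after three rounds.

```python
# Hypothetical spooked human: 2.5 units from the nearest shelter, speed 1 per round
dist = 2.5
speed = 1.0

rounds = 0
while dist >= 0:      # still en route to the shelter
    dist -= speed
    rounds += 1

# Once dist goes negative the human has reached the shelter (ty = 6 in the do-file)
print(rounds)  # 3
```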

replace x2 = x - b_prime*sign(b)*`human_speed' if ty == 5
replace y2 = y - a_prime*sign(a)*`human_speed' if ty == 5

two (scatter y x if ty==1 | ty==5, mcolor(gs12) msymbol(smcircle)) ///
    (scatter y2 x2 if ty==5 , color(sandb) msymbol(smcircle)) ///
    (scatter y x if ty==2, color(cranberry))                ///
    (scatter y2 x2 if ty==2, color(red))                    ///
    (scatter y x if ty==3, msymbol(square) color(yellow))   ///
    (scatter y x if ty==4, msymbol(smplus) color(gs16))     ///
, legend(label (1 "Unconcerned Humans")                 ///
    label (2 "Spooked Humans")                              ///
    label (3 "Werewolves")                                  ///
    label (4 "Werewolf's 1st Move")                         ///
    label (5 "Shelters")                                    ///
    label (6 "Deceased")                                    ///
    rows(2) color(white) region(fcolor(black)))             ///
  `nodraw' title(Full Moon - Round 0)

**** LOOKS like things are working properly
* Let's make the position changes to x and y permanent

replace x = x2 if ty == 5 | ty == 2
replace y = y2 if ty == 5 | ty == 2

drop x2 y2

* Let's allow for some summary statistics

gen obs_num = _n

gen unaware_humans = `num_humans' if _n == 1
gen spooked_humans = 0            if _n == 1
gen deceased_humans = 0           if _n == 1
gen safe_human = 0                if _n == 1

* Now we will begin the system loop
global graph_list

local ii=0
* _rc stays 0 until the `cap end_loop' trick below fails, which breaks the loop.
while _rc==0 {

local ii=`ii'+1

noi di "Round `=`ii'+1'"
* Let's record some summary statistics

qui sum ty if ty==1
replace unaware_humans = r(N) if obs_num == `ii'+1

qui sum ty if ty==5
replace spooked_humans = r(N) if obs_num == `ii'+1

qui sum ty if ty==4
replace deceased_humans = r(N) if obs_num == `ii'+1

qui sum ty if ty==6
replace safe_human = r(N) if obs_num == `ii'+1

* First the werewolf moves

* Each werewolf will survey all of the humans and find the one who is the closest.
  sum ty if ty==1 | ty==5
  * `end_loop' is not a real command; capturing it sets _rc!=0, which ends the while loop above.
  if r(N)==0 cap end_loop
  if r(N)>0 qui forv i=1(1)`num_werewolves' {
  * The first `num_werewolves' observations are the werewolves.
   sort were_id
   local were_x = x[`i']
   local were_y = y[`i']
  * were_x and were_y hold the position of the current werewolf looking for its prey

  * Now let's calculate the distance of that werewolf from the humans.
   cap drop dist_hm_wr
   gen dist_hm_wr = ((x-`were_x')^2 + (y-`were_y')^2)^.5 if ty==1 | ty==5
  * Taking the square root is strictly unnecessary for targeting, since the ranking of distances is unchanged.

   replace ty=5 if dist_hm_wr < `human_vision'

   sort dist_hm_wr
  * The top of this list is the target human.

  * If the closest human is within `werewolf_speed' of this werewolf, that human is caught and dies.
   replace ty = 4 if dist_hm_wr < `werewolf_speed' & _n==1

  * Now we need to calculate the attack vector.

  * I.e., what linear step moves the werewolf from its current location toward the human fastest (in terms of x and y movement)?

  * We can use some algebra and a little trig to find the proper vector.

  * We know a, b, c
   local a = `were_y'-y[1]
   local b = `were_x'-x[1]

  * And c' which is equal to the speed of werewolves.
   local c_prime=`werewolf_speed'

  * Triangle abc is similar to triangle a'b'c' so the ratios b/a=b'/a'
  * b'=b(a'/a)
  * (1) b'^2=b^2 * a'^2 / a^2
  * (2) Pythagoras (c. 500 BC): a'^2 + b'^2 = c'^2
  * (3) insert (1) into (2): a'^2 + b^2 * a'^2 / a^2 = c'^2
  * a'^2(1 + b^2 / a^2) = c'^2
  * a'^2 = c'^2/(1 + b^2 / a^2)
  * a'=c'/(1 + b^2 / a^2)^.5

   local a_prime=`c_prime'/(1 + (`b')^2 / (`a')^2)^.5

  * b'^2 = c'^2-a'^2
  * b' = (c'^2-a'^2)^.5
   local b_prime = ((`c_prime')^2-(`a_prime')^2)^.5

   replace x = x - `b_prime'*sign(`b') if were_id == `i'
   replace y = y - `a_prime'*sign(`a') if were_id == `i'

  }
  * End of the loop over werewolves.

 * Then the humans move

 * Move spooked humans towards shelters.
 replace closest_shelter_dist = closest_shelter_dist-`human_speed' if ty == 5

 replace ty=6 if closest_shelter_dist < 0 & ty == 5
  * Once a human gets to a shelter, that human is considered safe.

 replace x = x - b_prime*sign(b)*`human_speed' if ty == 5
 replace y = y - a_prime*sign(a)*`human_speed' if ty == 5

 * Graph the state every `graph_int' rounds.
 if `ii' / `graph_int'==int(`ii' / `graph_int') {

 two (scatter y x if ty==1, mcolor(gs12) msymbol(smcircle)) ///
     (scatter y x if ty==5, color(sandb) msymbol(smcircle)) ///
     (scatter y x if ty==2, color(cranberry))                ///
     (scatter y x if ty==3, msymbol(square) color(yellow))   ///
     (scatter y x if ty==4, msymbol(smplus) color(gs16))     ///
 , legend(label (1 "Unconcerned Humans")                 ///
    label (2 "Spooked Humans")                              ///
    label (3 "Werewolves")                                  ///
    label (4 "Shelters")                                    ///
    label (5 "Deceased")                                    ///
    rows(2) color(white) region(fcolor(black)))             ///
  `nodraw' title(Full Moon - Round `ii')

 two (scatter y x if ty==1, mcolor(gs12) msymbol(smcircle))  ///
     (scatter y x if ty==5, color(sandb) msymbol(smcircle))  ///
     (scatter y x if ty==2, color(cranberry))                ///
     (scatter y x if ty==3, msymbol(square) color(yellow))   ///
     (scatter y x if ty==4, msymbol(smplus) color(gs16))     ///
     , legend(off) nodraw name(round`ii', replace)

    global graph_list $graph_list round`ii'

 }
 * End of the graphing block.

}
* End of the system while loop.


* Now we will end the system loop

sort obs_num

gen round=obs_num if unaware_humans!=.

two (line unaware_humans round, sort)                       ///
    (line spooked_humans round, sort)                       ///
    (line deceased_humans round, sort)                      ///
    (line safe_human round, sort)                           ///
    , legend(label (1 "Unconcerned Humans")                 ///
    label (2 "Spooked Humans")                              ///
    label (3 "Deceased Humans")                             ///
    label (4 "Safe Humans"))                                  

* Different starting specifications yield different results (though the werewolves always win).
sleep 5000

graph combine $graph_list

* (c) Francis Smart 2012.