Wednesday, December 19, 2012

User written functions in R

R Script

# Writing your own functions can make your programs much more efficient, decreasing the lines of code required to accomplish the desired results while reducing the chance of error introduced by repetitive code.

# It is easy to pass whole blocks of arguments from one function to the next with the ... syntax.  This is a catch-all argument that can take any number of arguments.  Thus, if you do not like the default choices of some functions, such as paste's default of using " " to join its pieces, a simple wrapper like the following is often worth including:

# Concise paste function that joins elements automatically together.
p = function(...) paste(..., sep="")
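  # (Note: base R's paste0() provides the same behavior.)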
  x=21
  p("For example x=",x)
  # This function is particularly useful when using the get or assign commands since they take text identifiers of object names.
  a21 = 230
  get(p("a",x))

# Print is a useful function for returning feedback to the user.
# Unfortunately it only prints a single object.  By combining it with my new paste function I can easily make it take multiple arguments.
# Concise paste and print function
pp = function(...) print(p(...))
  # Print displays text.
  for (i in seq(1,17,3)) pp("i is equal to ",i)

# Round to nearest x
# Functions can also take any number of specific arguments.
# These arguments are matched either by the order in which they are placed or by explicit names.
# If an argument is left blank then the default for that argument is used when available or an error message is returned.
round2 = function(num, x=1) x*round(num/x)
  # This rounding function is programmed to behave like Stata's round function, which is more flexible than the built-in round command in R.
  # In R you specify the number of digits to round to.
  # In Stata, however, you specify the value to round to.
  # Thus round(z,.1) in Stata is the same as round(z,1) in R.
  # However, the Stata command is more general, since Stata can for instance round to the nearest .25 while the R command would need to be manipulated to accomplish the same goal.
  # I have therefore written round2, which instead rounds to the nearest x.
  round(1.123, 2)
  round2(1.123, .01)
  # Yield the same result.  Yet round2 also works with values that the built-in round cannot handle directly.
  # Argument order does not matter when names are used; the following produces identical results.
  round2(x=.01,num=1.123)
  round2(123, 20)
  # round2 has a default x of 1, so it can be used quickly to round to the nearest whole number.
  round2(23.1)
  # The original round has the same default.
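  # As noted above, round2 can also round to the nearest .25, something the built-in round cannot do directly:
  round2(1.13, .25)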

# Using modular arithmetic is often quite helpful in generating data.
# The following function reduces a number to its smallest non-negative residue in modular arithmetic.
mod = function(num, modulo) num - floor(num/modulo)*modulo
  # This function is programmed to behave like Stata's mod() function.
  mod(15,12)
  # A modulus of 12 is the system used for clock hours.
  # Thus mod(15,12) is the same as
  mod(3,12)
  # Or even:
  mod(-9,12)
  # There is a built in operator that does the same thing as this function.
  -9 %% 12

# Check for duplicate adjacent rows.  First row of a duplicate set is ignored but subsequent rows are not.
radjdup = function(x) c(FALSE, rowSums(x[2:nrow(x), , drop=FALSE] == x[1:(nrow(x)-1), , drop=FALSE]) == ncol(x))

# Half the data is duplicated.
  example.data = data.frame(id=1:100, x=mod(1:100, 5), y=rep(1:10,each=10))

# The data needs to be sorted for radjdup to work.
  example.order = example.data[order(example.data$x, example.data$y),]

cbind(example.order, duplicate=radjdup(example.order[,2:3]))

# We expect half of the observations to be flagged as duplicates, since the first instance of each duplicated (x, y) pair is not flagged.
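
# For comparison, base R's duplicated() flags repeated rows without requiring sorted data (every row after a pair's first occurrence is flagged, adjacent or not):
  cbind(example.order, duplicate=duplicated(example.order[,2:3]))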

Friday, December 14, 2012

Easy Monte Carlo Sampler Command

R script

# In R I am often unsure about the easiest way to run a flexible Monte Carlo simulation.
# Ideally, I would like to be able to feed a data generating function into a command with an estimator and get back out interesting values like mean estimates and the standard deviation of estimates.
# Stata has a very easy-to-use command for this called "simulate".
# I am sure others have created these types of commands previously, but I do not know of them yet, so I will write my own.

MC_easy = function(dgp, estimator, obs, reps, save_data_index=0) {
  # dgp (Data Generating Proccess) is the name of a function that creates the data that will be passed to the estimator
  # estimator is the name of a function that will process the data and return a vector of results to be analyzed
  # obs is the number of observations to be created in each spawned data set
  # reps is the number of times to loop through the dgp and estimation routine
  # save_data_index is a vector of repetition numbers for which you would like the generated data saved

  # I typically start writing a loop by manually working through the first iteration before constructing the loop, for both debugging and demonstration purposes.

  # This command will create a starting sample data set.
  start_data = get(dgp)(obs)
  # This command runs the estimator on that starting data set.
  MC_results = get(estimator)(start_data)

  # Create an empty list of save data values
  save_data = list()
  # If the first data set generated has been specified as a data set to save then save it as the first entry in the save_data list.
  if (sum(1==save_data_index)>0) save_data[[paste("i",1,sep="")]] = start_data

  for (i in 2:reps) {
    temp_data = get(dgp)(obs)
 
    # If this repetition is in the save_data_index then save this data to the list save_data
    if (sum(i==save_data_index)>0) save_data[[paste("i",i,sep="")]] = temp_data
 
    MC_results=rbind(MC_results, get(estimator)(temp_data))
  }
  # Display the results of the estimations
  MC_results = as.data.frame(MC_results)
  print(rbind(mean=sapply(MC_results, mean), sd=sapply(MC_results, sd), var=sapply(MC_results, sd)^2))

  # If the number of items for which the data was saved is equal to zero then only return the results of the estimation.
  if (sum(save_data_index>0)==0) return(MC_results)

  # If the number of items is greater than zero also return the saved data.
  if (sum(save_data_index>0)>0)  return(list(MC_results,save_data))
}

#### Done with the Monte Carlo easy simulation command.

# Let's see it in action.


# Let's define some sample data generating program
endog_OLS_data = function(nobs) {
  # Let's imagine that we are wondering if OLS is unbiased when we have a causal relationship between an unobserved variable z and the observed variables x2 and x3 and the unobserved error e.
  # x4 is an observed variable that has no relationship with any of the other variables.
  z = rnorm(nobs)
  x1 = rnorm(nobs)
  x2 = rnorm(nobs) + z
  x3 = rnorm(nobs) - z
  x4 = rnorm(nobs)

  e = rnorm(nobs)*2 + z

  y = 3 + x1 + x2 + x3 + 0*x4 + e
  data.frame(y, x1, x2, x3, x4)
}

sample_data = endog_OLS_data(10)
sample_data
# Everything appears to be working well

# Now let's define a sample estimation
OLS_est = function(temp_data) {
  # Summary grabs important values from the lm command and coef grabs the coefficients from the summary command.
  lm_coef = coef(summary(lm(y~x1+x2+x3+x4, data=temp_data)))

  # I want to reduce lm_coef to a single vector with values for each b, each se, each se^2, and each rejection indicator
  lm_b = c(lm_coef[,1])
    names(lm_b) = paste("b",0:(length(lm_b)-1),sep="")
  lm_se = c(lm_coef[,2])
    names(lm_se) = paste("se",0:(length(lm_se)-1),sep="")
  lm_se2 = c(lm_coef[,2])^2
    names(lm_se2) = paste("se2_",0:(length(lm_se2)-1),sep="")

  lm_rej = c(lm_coef[,4])<.1
    names(lm_rej) = paste("rej",0:(length(lm_rej)-1),sep="")
  c(lm_b,lm_se, lm_se2,lm_rej)
}
OLS_est(sample_data)

# From this it is not obvious which estimates, if any, are biased.
# Of course our sample size is only 10 so this is no surprise.
# However, even at such a small sample size we can identify bias if we repeat the estimation many times.

# Let's try our MC_easy command.

MC_est = MC_easy(dgp="endog_OLS_data", estimator="OLS_est", obs=10, reps=500, save_data_index=1:5)
# We can see that the mean estimate of the constant is close to 3, which is correct.
# b1 is close to 1 (hopefully unbiased), while b2 is too large, b3 is too small, and b4 is close to zero, which is its true value.
# What is not observable with 500 repetitions, but becomes observable with 5000, is that the standard error is a downward-biased estimate of the standard deviation of the coefficient estimates.
# This is a well-known fact.  I believe there is even a proof that no unbiased estimator of the OLS standard error exists.
# However, se^2 is an unbiased estimator of the variance of the coefficients.
# The variance of se^2 is large, though, making it necessary to have a large number of repetitions before this becomes apparent.
# The final terms of interest are the rejection rates.  For the coefficients b0-b3 the rejection rates should be above the 10% chance level (since an effect truly exists), which is what they all ended up being, though for b1 and b3 the rejection rate is only slightly above 20%, which indicates that our power, given our sample size and the effect sizes of x1 and x3, is pretty small.
# The final note is that though we do not have much power, we are at least not over-rejecting the null when the null is in fact true.  For b4 the mean rejection rate is close to 10%, which is the correct size (the rejection rate under a true null).

# We can recreate the results table by accessing elements of the MC_est list returned from the function.
rbind(mean=sapply(MC_est[[1]], mean), sd=sapply(MC_est[[1]], sd), var=sapply(MC_est[[1]], sd)^2)

# We may also be curious about the medians and maxes.
summary(MC_est[[1]])

# Or we may like to check if our data was generated correctly
MC_est[[2]]
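
# A compact base-R alternative for the no-saved-data case is to wrap the data generation and estimator in replicate() (a sketch, not part of MC_easy):
MC_alt = as.data.frame(t(replicate(500, OLS_est(endog_OLS_data(10)))))
rbind(mean=sapply(MC_alt, mean), sd=sapply(MC_alt, sd), var=sapply(MC_alt, var))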

Tuesday, December 11, 2012

Musing on the importance of Standard Errors

Standard errors are just as important as coefficients when estimating a relationship because they provide a critical component of inference.  Without standard errors, coefficients are largely uninteresting.

What does it matter that the coefficient from the estimation is say 7.3 if we do not know the standard error?  If the standard error is say 1 then a 7.3 is quite different from 0, if that is what is important.  However, if the standard error is instead 100, then a coefficient of 7.3 is meaningless (likely) random noise.

Standard errors are also informative on their own, even without coefficient estimates.  They provide a point value that gives us insight into how much we can expect our estimates to vary.

Likewise, standard errors are often much more difficult to calculate than coefficients and even more sensitive to correct specification.  Biased standard errors can have the effect of making an outcome look less likely than it actually is (over-reject the null) or more likely than it actually is (under-reject the null).

A good example is failing to cluster standard errors when appropriate.  This might be a problem if, say, you had data from twenty different summer camps.  Evaluating student outcomes without clustering by summer camp implicitly assumes that the outcomes of campers are independent of one another.  Clustering at the camp level, however, allows each camp to have a common shared shock (error) that season.
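
As a minimal sketch in Stata (the variable names y, x, and camp are hypothetical), clustering only requires the vce(cluster) option:

* Treats every camper as independent:
reg y x

* Allows a common shock within each camp; the coefficients are unchanged but the standard errors adjust:
reg y x, vce(cluster camp)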

Monday, December 10, 2012

Checking for differences in estimates by group


Stata do file

* Imagine that you have two groups in your sample and you are wondering if both groups respond to the explanatory variables in the same manner.

clear
set obs 1000

gen group=rbinomial(1,.5)

gen x = rnormal()

gen y = 2 + 3*x + rnormal()*2 if group==0
replace y = 1 + 0*x + rnormal()*2 if group==1

* There are several ways to test whether the two sets of estimates are the same.

* One method is using interaction dummies.

gen x_group = x*group

* The technique is to include an interaction term in the model for each coefficient whose equality across groups you would like to test.

* Group counts as an interaction term between the constant and the group indicator.
* x_group is the interaction between the x variable and the group variable.

reg y group x_group x

* In order to interpret the coefficients in the above regression:

* First set group equal to zero.

* In that case all that is in the regression is: y = 1*b0 + x*b1 which is the estimator for the group 0.

* We can see the constant estimate is close to 2 and the coefficient on x is close to 3 which are our values.

* Now the tricky thing is to interpret the remaining coefficients.

* Because the group coefficient is effectively an interaction with the constant, to estimate the constant for the group=1 set just add _b[group] to _b[_cons].

di "Group 2 constant estimate is " _b[group]+ _b[_cons] " which we can see is close to 1"

* We do a similar procedure with the slope.

di "Group 2 slope estimate is " _b[x]+ _b[x_group] " which we can see is close to 0"

* The nice thing about this formulation is that the coefficients are already included in the estimation.

* Thus we can easily test the hypothesis that both groups are the same.

test x_group=group=0

* If both groups were the same then they would have the same mean (same constants) and respond the same to the treatment x.

* An alternative way of testing each coefficient individually is:

reg y x if group == 0

local b_cong0 = _b[_cons]
local b_xg0 = _b[x]
local se_cong0 = _se[_cons]
local se_xg0 = _se[x]

reg y x if group == 1

* The t-test that they are the same is b[group=0]-b[group=1]=0
* var(b[group=0]-b[group=1])= var(b[group=0]) + var(b[group=1])

local joint_var_cons = `se_cong0'^2 + _se[_cons]^2

* And the t test is (b[group=0]-b[group=1])/(var(b[group=0]) + var(b[group=1]))^.5

di "t_constants = " (`b_cong0'- _b[_cons] )/(`joint_var_cons')^.5

local joint_var_x= `se_xg0'^2 + _se[x]^2

di "t_constants = " (`b_xg0'-_b[x])/`joint_var_x'^.5

* This formulation assumes that the two sets of estimates vary independently.

* If done properly, the t-stats are almost identical between formulations.
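
* An alternative sketch (not in the original post): Stata's suest command combines the stored group-specific fits so that cross-group equality can be tested directly, without assuming the two sets of estimates are independent.  The [g0_mean] equation names follow suest's usual convention after regress; verify them against the suest output.

reg y x if group == 0
estimates store g0

reg y x if group == 1
estimates store g1

suest g0 g1

test [g0_mean]x = [g1_mean]x
test [g0_mean]_cons = [g1_mean]_cons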

* By the way, please fill out my survey posted on the last post.

Saturday, December 8, 2012

Small Anonymous Research Survey

Dear Reader,

I am experimenting with Google Documents in order to conduct a small survey that I will use to attempt to answer some research questions that I have.

Please consider taking this anonymous short survey and sharing it with your friends.

https://docs.google.com/spreadsheet/viewform?formkey=dGZoWFRvbmMwZXVyRFVmc3daMGpoVXc6MQ

Thank you for your time,
Francis Smart

Friday, December 7, 2012

The Delta method to estimate standard errors from a non-linear transformation

* Stata do file

* The Delta method can be used to estimate the standard error of a nonlinear function of the parameters after a regression.

* Imagine you have some parameter G = (3*b0-b1)*b2^2 = 3*b0*b2^2-b1*b2^2

* Where y = b0 + b1*x1 + b2*x2 + u

* The delta method can be used to estimate the standard errors of G.

* The delta method states that var_hat(G) = (dG/db)' var(b) (dG/db)

* dG/db is a gradient vector:

* dG/db = [dG/db0, dG/db1, dG/db2]
* dG/db = [3*b2^2, -b2^2, 2*(3*b0-b1)*b2]

* var_hat(G) = (3*b2^2)^2 * se(b0)^2 + (-b2^2)^2 * se(b1)^2 + (2*(3*b0-b1)*b2)^2 * se(b2)^2, ignoring the covariance terms (see the bracketed note below)

[There is an error in the code because I failed to include a covariance term for the coefficients.  Please see the more recent update on the method.]

clear
set obs 1000

gen x1 = rnormal()
gen x2 = rnormal() * 4

global b0 = 1
global b1 = 1.5
global b2 = .3

local true_G = (3*${b0}-${b1})*${b2}^2

di `true_G'

gen y = ${b0} + ${b1}*x1 + ${b2}*x2 + rnormal()*8

reg y x1 x2

* G = (3*b0-b1)*b2^2 = 3*b0*b2^2  - b1*b2^2
local Ghat = (3*_b[_cons]-_b[x1])*_b[x2]^2

di "Ghat = `Ghat' is our estimate (true = `true_G')"

* Let's see if we can't use the delta method to derive a standard error.
local var_hatG = (3*_b[x2]^2)^2 * _se[_cons]^2 + (-_b[x2]^2)^2 * _se[x1]^2 + (2*(3*_b[_cons]-_b[x1])*_b[x2])^2 * _se[x2]^2

di "Standard error estimate is " `var_hatG'^.5

* Alternatively, let us attempt to bootstrap our standard errors.

cap program drop deltaOLS
program define deltaOLS, rclass
  reg y x1 x2
  return scalar Ghat = (3*_b[_cons]-_b[x1])*_b[x2]^2
end

bs Ghat=r(Ghat), rep(500): deltaOLS
* The bootstrap standard errors are similar to the delta method's standard errors.

cap program drop deltaMonteCarlo
program define deltaMonteCarlo, rclass
  clear
  set obs 1000

  gen x1 = rnormal()
  gen x2 = rnormal() * 4

  gen y = ${b0} + ${b1}*x1 + ${b2}*x2 + rnormal()*8

  reg y x1 x2

  return scalar Ghat = (3*_b[_cons]-_b[x1])*_b[x2]^2
end

simulate Ghat=r(Ghat), reps(500): deltaMonteCarlo
sum
* We can see that our estimates of the standard error via the delta method and the bootstrap are quite close to the Monte Carlo estimate of the standard deviation of Ghat across 500 replications.

Thursday, December 6, 2012

Logistic or Logit or doesn't matter


Stata do file

* I recently attended a talk in which the presenter contrasted logit with logistic regression for a binary outcome variable, saying logistic was similar.

* This confused me.

* Looking at the equations for logit versus logistic, I think it is clear that a logit regression is a special case of a logistic regression.

* However, in case there is still doubt.

clear
set obs 100

gen x = rnormal()
* Some explanatory variable

gen y = rbinomial(1, normal(x))
* Some binary response variable

mlogit y x, baseoutcome(0)

logit y x
* We can see the estimates are identical.

Wednesday, December 5, 2012

Path Analysis

Stata do file

* Path analysis is an interesting statistical method that can be used to identify complex relationships between explanatory variables and an outcome variable.

* As with all statistical methods the modelling framework is essential to derive reasonable results.

* Conveniently, I am only interested in simulating data, so as usual my data will perfectly conform to the model's specifications.

* Imagine the following model.

* [Path diagram: A -> B, A -> E, A -> H; B -> C, B -> E, B -> F, B -> H; C -> F; D -> C, D -> G; E -> H; F -> H; G -> H]

* All of the boxes are observable variables.  The arrows indicate the causal direction of the effects.

* There are two exogenous variables: A and D.   These variables are not influenced by any other variables in the model.

* All other variables are endogenous.

* Each of the arrows represents the direct effect of a one unit change in the originating variable on the variable it points to.

* This framework is convenient because it allows us to identify a "total effect", which combines the direct and indirect effects of variables on the outcome variable.

* The variable of primary interest in explaining is H.

* The variable G has only a direct effect on H (pHG).

* The variable C, by contrast, has only an indirect effect on H (pFC*pHF).

* The indirect effect is a product because C has a pFC effect on F and F has a pHF effect on H; the change in H resulting from a change in C is how much F changes as a result of C times how much that change in F affects H.

* Variables can have both an indirect and a direct effect.

* B for instance has the direct effect: pHB
* Indirect effects: pCB*pFC*pHF + pEB*pHE
* Total effect: pHB + pCB*pFC*pHF + pEB*pHE

* The key feature of this particular example is that all of the arrows are one-directional.

* This makes a great deal of inference possible that otherwise would not be.

* Usually, when trying to explain H with explanatory variables A through G, we cannot say that A causes B and B causes H.

* However, if we do the work to identify reasonable pathways then this type of analysis could be quite interesting.

* Let's generate our data.

clear

* Let's imagine 6000 youth in our sample.

set obs 6000

* Let's first specify our effects

* pEA = .3
* pEB = .13

* pHA = .2
* pHB = .2
* pHE = .3
* pHG = 1.1
* pHF = .2

* pBA = .5

* pCB = .2
* pCD = .1

* pGD = .2

* pFC = .76
* pFB = .4

* For B we can calculate our true effects:

* B Direct: pHB = .2

* Indirect effects: pCB*pFC*pHF + pEB*pHE
* Indirect effects:.2*.76*.2 + .13*.3 = .0694

* Total effect: .0694+.2 = .2694

gen A =                rnormal()
gen B = A*.5 +         rnormal()
gen D =                rnormal()
gen C = B*.2 + D*.1 +  rnormal()
gen E = A*.3 + B*.13 + rnormal()
gen F = B*.4 + C*.76 + rnormal()
gen G = D*.2 +         rnormal()
gen H = E*.3 + A*.2 + B*.2 + F*.2 + G*1.1 + rnormal()

* Simulation done

* In order to generate our different effects we simply run OLS for each endogenous variable.

reg B A
  local pBA = _b[A]

reg C B D
  local pCB = _b[B]
  local pCD = _b[D]


reg G D
  local pGD = _b[D]

reg F C B
  local pFB = _b[B]
  local pFC = _b[C]

reg E A B
  local pEA = _b[A]
  local pEB = _b[B]

reg H A B E F G
  local pHA = _b[A]
  local pHB = _b[B]
  local pHE = _b[E]
  local pHF = _b[F]
  local pHG = _b[G]

* In order to estimate the indirect effect say of B on H.
* We just plug our estimates into the equation.

* B direct effect: pHB
* Indirect effects: pCB*pFC*pHF + pEB*pHE
* Total effect: pHB + pCB*pFC*pHF + pEB*pHE

di "B's estimated indirect effect = `pCB'*`pFC'*`pHF' + `pEB'*`pHE'"
di "B's estimated indirect effect = " `pCB'*`pFC'*`pHF' + `pEB'*`pHE'

* Which turns out to be close to our true value.

di "B's total estimated effect on H is " `pHB' + `pCB'*`pFC'*`pHF' + `pEB'*`pHE'

* It is possible to use the user written command pathreg to make things easier.

* To install it, type: findit pathreg
pathreg (H E B F G) (G D) (C B D) (B A) (E A B) (F B C)

* This command does not currently calculate all of the indirect and direct effects.

* I am not sure of the best way to calculate the standard errors of the different effect estimates.

* My guess is that since this is just a series of fast OLS regressions, the easiest thing to do would be to bootstrap the entire process.

* This would require only slightly more code; a sketch follows below.
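
* A minimal sketch (not in the original post) of bootstrapping B's indirect and total effects on H: wrap the relevant regressions in an rclass program and let the bootstrap command resample the data.

cap program drop pathboot
program define pathboot, rclass
  reg C B D
  local pCB = _b[B]

  reg F C B
  local pFC = _b[C]

  reg E A B
  local pEB = _b[B]

  reg H A B E F G
  * _b[F], _b[E], and _b[B] here are the estimates of pHF, pHE, and pHB.
  return scalar indirect = `pCB'*`pFC'*_b[F] + `pEB'*_b[E]
  return scalar total    = _b[B] + `pCB'*`pFC'*_b[F] + `pEB'*_b[E]
end

bootstrap indirect=r(indirect) total=r(total), reps(500): pathboot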

Tuesday, December 4, 2012

Multivariate Least Squares - Multi-Step Estimator and Correlated Explanatory Variables

Stata do file

* In least squares estimation it is very common to make a statement like "conditioning on x1, x2 has an average effect on y of ..."

* What does this mean?

* One answer is: after removing the variation in x2 that is linearly related to x1, the least squares estimator uses the remaining variation in x2 to estimate its effect on y.

* I am not really sure if this helps.

* Perhaps an example.

* Imagine a continuous explanatory variable x2

clear
set obs 1000

gen x2 = rnormal()

* Now imagine a binary variable x1 correlated with x2.

gen x1 = rbinomial(1, normal(x2))

cor x?

* We can see that the xs have a correlation around .5

* This means in some sense that about 25% (.5^2) of the variation in x2 is explained by x1.

* This has very practical implications for regression analysis using x1 and x2.

* Imagine x1 is whether the individual parents have college degrees and x2 is years of personal college education.

* Because x1 and x2 are correlated, any estimation using them together can only isolate the individual effect of x1 from x2 in the ways that they move separately from each other.

gen y1 = x1 + x2 + rnormal()*2

reg y1 x1 x2, nocon

* We can see that both coefficients are estimated well.  So what do I mean by my earlier statement?

* In order to see that the estimated effect of x2 is based only on the variation remaining after controlling for x1, we can use a two-step estimator that reproduces the previous regression.

* First we regress x2 on x1

reg x2 x1, nocon

* Now we take the residuals from the above regression.

predict x2_residual, resid

reg y1 x2_residual, nocon

* The result of the above regression is that the coefficient on the residuals of x2 is exactly the same as the coefficient on x2 in the multivariate regression.

* Thus we can see that the result of multivariate analysis is based not on the total variation in one x, but on the variation in a single x that is independent of the variation in the other xs.

* All statistical estimation is an inductive, exploratory task based on exploiting variation in observable characteristics.

* However, some variables move together with other variables making it difficult to estimate the effect of individual movements in some variables.

* It is worth noting that there is a different two-step estimation procedure, often used in some applications, which is quite different from the multivariate OLS regression.

reg y1 x1, nocon

predict y1_resid, resid

reg y1_resid x2, nocon

* This estimator only works as well as the OLS multivariate estimator when x1 is independent of x2.  If it is not then this estimator is heavily biased.

* Why?  In the first step it attributes all of the covariation between y and x1 to x1 (even though some of it truly belongs to x2).

* In the second stage it is using what is left over of the variation in y as the explanatory variable.

* In this case, this method vastly underestimates the magnitude of the coefficient on x2 and overestimates the magnitude of the coefficient on x1.
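
* A short additional sketch (not in the original post): the full Frisch-Waugh-Lovell two-step regresses the residualized y1 on the residualized x2.  Using the residuals already created above, this reproduces the multivariate coefficient on x2 (and, up to a degrees-of-freedom adjustment, its standard error).
reg y1_resid x2_residual, nocon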

Thursday, November 29, 2012

Pre and Post Test Data Merge/Append

* Stata Script

* Imagine that you have two sets of data.

* 1 pre-test
* 2 current

* You would like to merge the data together.

* You have two options.

* 1 Create a tall data set (using the append command)
* 2 Create a wide data set (using the merge command)

* Both forms of joined data are common.  Let's see how to do this.

* First let's create our pretest data.

clear
set obs 1000

* We have some unique identifier on each person/student
gen ID = _n

gen score = rnormal()

* Now let's generate say 88 variables (v11 through v99)
forv i=11/99 {
  gen v`i' = runiform()
    label var v`i' "Random uniform variable `i' - pretest"
}

save pretest, replace

* Now we have our pretest data saved

* Let's create our "current" data

* Let's imagine that there is some kind of treatment
gen treatment = rbinomial(1,.3)

* It has a positive effect
replace score = score+.4*treatment

* Now let's generate say 88 variables (v11 through v99)
forv i=11/99 {
  local change = rbinomial(1,.1)
  * There is a 10% chance that one of your other variables will change

  if `change'==1 replace v`i' = v`i'+runiform()
      label var v`i' "Random uniform variable `i' - current"

}

save current, replace

clear

* END DATA SIMULATION
******************************************************************
* Begin file management

* Now we have two data sets with different variables in each.
* Let's start by appending the data together into a tall file.

use pretest, clear

* First let's generate a variable to indicate pretest
gen phase = "Pretest"

* Now, let's append the data together:
append using current

sum
* We can see that now we have 2000 observations as expected.

* Now our data is in a tall format

* We might also be interested in making sure the "treatment" variable is duplicated for every observation.
bysort ID: egen treat = mean(treatment)

* Drop the old treatment variable
drop treatment
rename treat treatment

clear
*************
* Alternatively, we may want to put our data into a wide format.

* First we need to rename variables so that they will not be overwritten.
use pretest, clear

foreach v of varlist * {
  rename `v' `v'_0
}

* We need to make sure only that the merging variable keeps the same name.

rename ID_0 ID

save pretest_rename, replace

* Now we load the current test data.

use current, clear

* We will rename the variables in the current as well
foreach v of varlist * {
  rename `v' `v'_1
}

* Making sure to change ID_1 back to ID
rename ID_1 ID

* Now we merge the data together
merge 1:1 ID using pretest_rename

* Now our data is wide.  Note that the reshape command could be used to change data from wide to tall or tall to wide (see the sketch below).

* It might be useful to order all of the variables from before and after next to each other.

order *, alphabetic
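
* A minimal sketch (not in the original post) of moving between the wide and tall layouts with reshape.  Only a few stubs are listed for brevity; in practice every variable carrying the _0/_1 suffix would be included.
reshape long score_ v11_ v12_, i(ID) j(phase)

* And back to wide:
reshape wide score_ v11_ v12_, i(ID) j(phase)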

Wednesday, November 28, 2012

CAT using Kullback–Leibler vs Fisher Information

R Code

# This post builds on my recent post simulating a computer adaptive test (CAT) that selects items from an infinite item pool using Fisher information as the selection criterion.

http://www.econometricsbysimulation.com/2012/11/computer-adaptive-test-assuming.html

# This post is very similar except that it uses instead the Kullback-Leibler (KL) information criteria to select items.

# More information about the KL can be found at:

http://www.econometricsbysimulation.com/2012/11/selecting-your-first-item-on-computer.html


#########################################################
# Specify the initial conditions:

true.ability = rnorm(1)

# Load three parameter model ICC
PL3 = function(theta,a, b, c) c+(1-c)*exp(a*(theta-b))/(1+exp(a*(theta-b)))

# Load three parameter item information:
PL3.info = function(theta, a, b, c) a^2 *(PL3(theta,a,b,c)-c)^2/(1-c)^2 * (1-PL3(theta,a,b,c))/PL3(theta,a,b,c)

#########################################################
#  Mock Computer Adaptive Test

# First let's specify an initial guess.

est.start = 0

# How much we adjust our estimate of person ability while all answers are either right or wrong.
est.jump = .7

# Number of items on the test. This will be the end condition.
num.items = 50

# Set the other parameters in the 3PL.
a.base = 3
c.base = .1

# We will draw all of the random outcomes in advance so that both methods of selecting items get the "same" draws.
rdraw = runif(num.items)

### Let's generate a vector to hold both ability estimates.  Fisher and KL start at the same estimate.
ability.est = est.start

# Let's first generate an empty data frame to hold the set of items taken.
items = data.frame(a=NA,b=NA,c=NA,response=NA,p=NA, ability.est=NA)

i = 1

# For this first mock test we will not select items from a pool but instead assume the pool is infinite and has an item with an a=a.base, c=c.base, and b equal to whatever the current guess is.

# Let's select our first item - a,b,c, response are scalars that will be reused to simplify coding.
a=a.base
c=c.base

### This is the first divergence from the alternative CAT simulation.

### In order to select an item we must specify first a theta distribution to select the best item for.

### I will use 1000 draws from theta~normal()
theta.dist = rnorm(1000)

### Drawing from the second link posted above the KL information is defined as:
KL = function(theta.true,theta.estimate, a, b, c) {
  # For the true value
  p.true = PL3(theta.true,a,b,c)
  q.true = 1-p.true
 
  # For the estimate
  p.estimate = PL3(theta.estimate,a,b,c)
  q.estimate = 1-p.estimate

  # The following line is the value to be returned to the KL function:
  p.true*log(p.true/p.estimate) + q.true*log(q.true/q.estimate)
  }

### In order to select the best item from the infinite item pool I will use a maximization routine to select b while a and c are held constant.

### I want to define a function to maximize the KL over for a given theta estimate.
KLmax = function(b) mean(KL(theta.true=theta.dist, theta.estimate=ability.est[i], a=a, b=b, c=c))

### Now we want to optimize the KLmax function looking for the choice of b that gives us the highest value.
optim(0,fn=KLmax , method="Brent", lower=-6, upper=6, control=list(fnscale = -1))

b=optim(0,fn=KLmax , method="Brent", lower=-6, upper=6, control=list(fnscale = -1))$par

# Probability of getting the item correct
p=PL3(true.ability, a,b,c)

# Note that all of the uniform draws (rdraw) were made in advance
response =  p > rdraw[i]

# The Item Characteristic Curve (ICC) gives the probability of getting the item correct.
# Thus, if p = .9, any uniform draw below .9 produces a response of TRUE, i.e. correct or 1 (as far as R is concerned TRUE and 1 are the same, as are FALSE and 0)

items[i,] = c(a=a, b=b, c=c, response=response, p=p, ability.est=ability.est[i])

# We have now successfully administered our first item.

# Should we do our first MLE estimation?

# Not quite; unfortunately, MLE in the binary case requires that the student has answered at least one question right and at least one question wrong.

i=1+1

# Instead we will just adjust the ability estimate by the fixed factor (est.jump)
ability.est[i] = ability.est[i-1]-(-1)^(response)*est.jump

# Now we administer the second item:
# We will continue this until we get some heterogeneity in the responses
response.v = items$response
response.ave = sum(response.v)/length(response.v)

while ((response.ave == ceiling(response.ave)) & (num.items >= i) ) {
  # This condition will no longer be true when at least one of the items is answered correctly and one of the items is answered incorrectly.
  ability.est[i] = ability.est[i-1]-(-1)^(response)*est.jump

  a=a.base
  c=c.base
 
  ### Using the KL information criteria
  b=optim(0,fn=KLmax , method="Brent", lower=-6, upper=6, control=list(fnscale = -1))$par
 
  p=PL3(true.ability, a,b,c)

  response =  p > rdraw[i]

  items[i,] = c(a=a, b=b, c=c, response=response, p=p, ability.est=ability.est[i])

  response.v = items$response
  response.ave = sum(response.v)/length(response.v)

  i=i+1
}

items
true.ability

# Now that we have some heterogeneity of responses we can use the MLE estimator
MLE = function(theta) sum(log((items$response==T)*PL3(theta, items$a, items$b, items$c) +
                              (items$response==F)*(1-PL3(theta, items$a, items$b, items$c))))

optim(0,MLE, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))
  # Okay, it seems to be working properly; now we will loop through using the above function.
 
# The only thing we need change is the ability estimate.
while (num.items >= i) {
  ability.est[i] = optim(0,MLE, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))$par

  a=a.base
  c=c.base
 
  ### Using the KL information criteria
  b=optim(0,fn=KLmax , method="Brent", lower=-6, upper=6, control=list(fnscale = -1))$par
 
  p=PL3(true.ability, a,b,c)
 
  response =  p > rdraw[i]

  items[i,] = c(a=a, b=b, c=c, response=response, p=p, ability.est=ability.est[i])

  response.v = items$response
  response.ave = sum(response.v)/length(response.v)

  i=i+1
}

ability.est[i] = optim(0,MLE, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))$par

# Save the paths for comparison with Fisher information
items.KL = items
ability.est.KL = ability.est

#######################################################################
#####
##### Now let's compare this with the Fisher information item selection
#####
#######################################################################


ability.est = est.start
items = data.frame(a=NA,b=NA,c=NA,response=NA,p=NA, ability.est=NA)
i = 1
a=a.base
c=c.base
### Now b just equals the ability estimate
  b=ability.est[i]
p=PL3(true.ability, a,b,c)
response =  p > rdraw[i]
items[i,] = c(a=a, b=b, c=c, response=response, p=p, ability.est=ability.est[i])
i=1+1
ability.est[i] = ability.est[i-1]-(-1)^(response)*est.jump
response.v = items$response
response.ave = sum(response.v)/length(response.v)

while ((response.ave == ceiling(response.ave)) & (num.items >= i) ) {
  ability.est[i] = ability.est[i-1]-(-1)^(response)*est.jump
  a=a.base
  c=c.base
  ### Change how b is selected
    b=ability.est[i]
  p=PL3(true.ability, a,b,c)
  response =  p > rdraw[i]
  items[i,] = c(a=a, b=b, c=c, response=response, p=p, ability.est=ability.est[i])
  response.v = items$response
  response.ave = sum(response.v)/length(response.v)
  i=i+1
}
while (num.items >= i) {
  ability.est[i] = optim(0,MLE, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))$par
  a=a.base
  c=c.base
  ### Change how b is selected
    b=ability.est[i]
  p=PL3(true.ability, a,b,c)
  response =  p > rdraw[i]
  items[i,] = c(a=a, b=b, c=c, response=response, p=p, ability.est=ability.est[i])
  response.v = items$response
  response.ave = sum(response.v)/length(response.v)
  i=i+1
}

# One final ability estimate calculation
ability.est[i] = optim(0,MLE, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))$par

ability.est.Fisher = ability.est
items.Fisher = items

#######################################################################

minmax = function(x) c(min(x),max(x))

plot(0:num.items, ability.est.Fisher, type="l", main="CAT Estimates Ideally Converge on True Ability",
  xlab="Number of Items Administered", ylim=minmax(c(ability.est.Fisher,ability.est.KL)),
  ylab="Estimated Ability", lwd=3, col="blue")
lines(0:num.items, ability.est.KL, lwd=2, col="green")
abline(h=true.ability, col="red", lwd=2)

## Graph I

# I have included two different graphs because I thought both of these were quite peculiar.
# The first graph is strange because in this chance draw the KL looks like it is doing as well or better than the Fisher.
# In almost all other samples that I ran it seemed to do considerably worse.

## Graph II


# Let's see what items both methods are choosing and how they relate to the true ability estimate.
plot(1:num.items, items.Fisher$b, type="l", main="Items are selected using different criteria",
  xlab="Number of Items Administered", ylim=minmax(c(items.KL$b,items.Fisher$b)),
  ylab="Item Difficulty", lwd=3, col="blue")
lines(1:num.items, items.KL$b, lwd=2, col="green")
abline(h=true.ability, col="red", lwd=2)



# Blue is Fisher
# Green is KL


# We can see that the items tend to be closer to zero in difficulty under the KL selection (these are the items shown in the second graph)

# From this simulation, it appears KL is an inferior item selection criterion.

# It is possible that this simulation does not take into consideration all of the features of the KL criterion.  Perhaps if the item pool has variation in the parameters a and c, then KL will perform better.  I will leave that for a later simulation.

Sunday, November 25, 2012

Commodity Demand Estimation - Fish Demand

Stata Do File

* My field exam is coming up and I am therefore giving some thought to commodity demand once again.

* This simulation will follow the Paper "Demand Structure for Fish" Asche, Bjorndal, and Gordan SNF Working Paper No. 37/05

* In this simulation I will attempt to simulate commodity data that fits several empirical model specifications.

* A common initial attempt at estimating demand was to specify:

* ln(q_it) = alpha_i + sum(across j) of Beta_j*ln(pjt) + e_i*ln(X_t)

* The advantage of this formulation is that it allows the elasticities to be easily recovered.

* If consumers of fish have a fixed budget for fish and no strong substitutes then we expect that the price elasticity of fish to be around -1.

* Let's see if we can generate aggregate data that looks like this.

clear
* Let's say we have 20 years of weekly price data ~1000 observations
set obs 1000
gen t=_n

* Let's generate some substitute products
gen pbeef = (rpoisson(10)+1)/10
gen pchicken = (rpoisson(7)+1)/10
  label var pchicken "Price of Chicken"
gen ppork = (rpoisson(12)+1)/14
gen pfish = (rpoisson(15)+1)/12
  label var pfish "Price of Fish"

* Let's generate some fish covariates.

* There was a scare that commercial fishing was killing dolphins.  Let's say that happened at time 500.
gen dolphin = 0
replace dolphin = 1 if t>500

* Let's imagine that the fishing industry is also campaigning to encourage the consumption of fish.
gen fishad = rnbinomial(10,.1)

* Now let's generate the natural logs of the above values
foreach v in pbeef pchicken ppork pfish fishad {
  gen ln`v' = ln(`v')
}

* Generate an error term
gen e=rnormal()/4

* Now let's generate quantity demand for fish!
gen lnqfish = 1 + .3*lnpbeef + .5*lnpchicken + .3*lnppork - 1*lnpfish - .3*dolphin + .1*lnfishad + e

* Done with simulation

* I won't even bother estimating this. The reason is that by construction our data perfectly fits our OLS model.

* Well, just as a reference point, I will estimate the above.
reg lnqfish lnpbeef lnpchicken lnppork lnpfish dolphin lnfishad

* We can see all of our estimates work exceedingly well when the model is perfectly specified.

* However, it might be instructive to look at the relationships between variables.

gen qfish = exp(lnqfish)
label var qfish "Quantity of Fish Purchased"

twoway (scatter pchicken qfish) (lfit pchicken qfish) ///
       (scatter pfish qfish) (lfit pfish qfish) ,     ///
  title(Relationship Between Own Price and Quantity)



* This graph is a little bit busy and might be hard to read.
* Notice that each point in blue is also represented in green.
* This is because while the horizontal axis is the same the vertical axis changes from price of chicken to price of fish.

* Let's imagine that rather than estimating using logs we estimated using quantities and prices.
reg qfish pbeef pchicken ppork pfish dolphin fishad

* It is interesting to note that while the estimates are not as good as before when the functional form is not specified correctly, they still seem to work reasonably well.

* Let's try converting some of them to point elasticities and see how they compare.

foreach v in pbeef pchicken ppork pfish fishad {
  qui sum `v'
  di "Point Elasticity of `v' = `=_b[`v']/r(mean)'"
}

* Some of the point elasticities are better estimates of the constant elasticity than just the coefficients but not consistently so.

* Okay, so this has gotten our feet a little wet with the idea of estimating demand.
* Now, let's shift into demand estimation given a more flexible forms for the demand to take.

*****************************************************
* Houthakker and Taylor's habit formation model.

* ln(q_it) = alpha_i + c_i*ln(q_i,t-1) + sum(across j) of Beta_j*ln(pjt) + e_i*ln(X_t)

* The difference between this model and the last is that now we have a form of temporal dependence.

* Consumption this period depends upon consumption last period.

gen lnqfish2 = 1 + .75*lnpbeef + .5*lnpchicken + .3*lnppork - 1*lnpfish - .3*dolphin + .1*lnfishad + e if t==1
replace lnqfish2 = .75*lnqfish2[_n-1] + .3*lnpbeef + .5*lnpchicken + .3*lnppork - 1*lnpfish - .3*dolphin + .1*lnfishad + e if t>1
label var lnqfish2 "Auto-correlated Fish Quantity Demanded"

* In order to see the time trends
forv i=2/12 {
  gen t`i'=t^(`i'/3)
}

reg lnqfish2 t t? t??

predict lnqfish2_hat

two (line lnqfish2 t) (line lnqfish2_hat t) , title(Fish Quantity Demanded with Temporal Dependency)



* We can see long-term movement in quantity demanded as a result of the dolphin scare, in addition to some short-term temporal dependency as a result of habit formation.
gen lnqfish2_l = lnqfish2[_n-1]

* To verify that we can estimate the model directly:
reg lnqfish2 lnqfish2_l lnpbeef lnpchicken lnppork lnpfish dolphin lnfishad

* As before, when our model is correctly specified we have little difficulty estimating the parameters.

* If we fail to include the once lagged fish quantity:
reg lnqfish2 lnpbeef lnpchicken lnppork lnpfish dolphin lnfishad
* We have less explanatory power, but generally almost all of the estimates are identical.

predict resid, resid

twoway (scatter resid lnqfish2) (lfit  resid lnqfish2)
* There is a clear relationship between the y variable and the residual suggesting autocorrelation.



* Even though there is autocorrelation of the errors this does not substantially affect the estimates of the demand function.

* However, this is somewhat an artifact of the simulation.

* We should think that there might be a causal relationship between amount of fish demanded during the previous period and current period prices if there exists some friction in the fish markets.

* Thus perhaps the higher the quantity demanded in the current period the higher the price is expected to be during the next period because of fish reserves.

* This demands a more complex simulation in which we calculate each period individually.

gen lnpfish2 = ln((rpoisson(15)+1)/12) if t==1
gen lnqfish3 = 1 + .3*lnpbeef + .5*lnpchicken + .3*lnppork - 1*lnpfish2 - .3*dolphin + .1*lnfishad + e if t==1

* For the first period we assume prices are exogenous.

* For each additional period we will make prices a function of quantity demanded.
forv i=2/1000 {
  qui replace lnpfish2 = ln((rpoisson(15)+1)/12) + .7*lnqfish3[`i'-1] if t==`i'
  * Thus the price elasticity is something less than 1/.7= 1.4.  Less because there is the random poisson variation that diminishes the relationship.
  qui replace  lnqfish3 = .75*lnqfish3[_n-1] + .3*lnpbeef + .5*lnpchicken + .3*lnppork - 1*lnpfish2 - .3*dolphin + .1*lnfishad + e if t==`i'
}

reg lnqfish3 lnpbeef lnpchicken lnppork lnpfish2 dolphin lnfishad
* Now failing to take into account the previous quantity of fish demanded causes the price elasticity to be downward biased.

* The reason is that high quantity demanded last period will cause the habit of high quantity demanded this period.

* However, high quantity demanded last period will also cause the price to be higher this period.

* If you do not control for high demand last period then the higher price will look like it is explaining both the higher habitual consumption of fish and the lower consumption of fish due to the higher price.

* Thus, controlling for habit formation is necessary for unbiased estimates.
gen lnqfish3_l = lnqfish3[_n-1]
reg lnqfish3 lnqfish3_l lnpbeef lnpchicken lnppork lnpfish2 dolphin lnfishad

predict res3, residual

scatter res3 t if t<200, title(Autocorrelation of remaining residuals)



gen res3_1 = res3[_n-1]

reg res3_1 res3
* Having controlled for the lagged quantity demanded, the residuals no longer show autocorrelation.

Saturday, November 24, 2012

Moving Averages & Data Cleaning

Original Code


* Imagine you have data on prices for many products.

* For each of the products you record weekly price information.

clear
set obs 200

gen prod_id = _n

* Each product has a unique average price
gen prod_price = rpoisson(5)/7

* You have data on weekly prices for 200 weeks.
expand 200
bysort prod_id: gen t=_n
  label var t "Week"

sort prod_id t

* There is also some seasonal variation
gen seasonal = .2*sin(_pi*t/50)

* As well as a general time trend
gen trend = t*.005

* The first observation is not correlated with anything
gen price = prod_price*2.5 + trend +  rpoisson(10)/10 if t==1
replace price = prod_price*2 + trend + seasonal + .7*price[_n-1] + .3*rpoisson(10)/10 if t==2
replace price = prod_price + trend + seasonal + .5*price[_n-1] + .2*price[_n-2] + .3*rpoisson(10)/10 if t==3
replace price = prod_price + trend + seasonal + .3*price[_n-1] + .2*price[_n-2] + .2*price[_n-3] + .3*rpoisson(10)/10 if t==4
replace price = prod_price + trend + seasonal + .3*price[_n-1] + .175*price[_n-2] + .125*price[_n-3] + .1*price[_n-4] +.3*rpoisson(10)/10 if t>4

* Create a global to store the graph commands
global twograph

forv i = 1/6 {
  global twograph ${twograph} (line price t if prod_id == `i')
}

twoway $twograph, legend(off) title(True price trends for first six products)



* Now let's imagine that the above generated data is the true price information which is fundamentally unobservable.

* Instead you have multiple collections of price data per week, each of which varies by some random additive error.
expand 3

bysort prod_id t: gen prod_obs = _n

gen price_collect = price + rnormal()*.25

* However, about 10% of the price entries that you have were mistakenly entered incorrectly.

gen entry_error = rbinomial(1,.1)
  gen scalar_error = rnormal()+1

gen price_obs = price_collect*(1+entry_error*scalar_error)
  label var price_obs "Recorded Price"

* In addition, 35% of your price data was never collected
gen missing = rbinomial(1,.35)

drop if missing==1

* Create a global to store the graph commands
global twograph

forv i = 1/6 {
  global twograph ${twograph} (line price_obs t if prod_id == `i' & prod_obs==1)
}

twoway $twograph, legend(off) title(Observed price trends for first six products)



keep t price_obs  prod_id entry_error
* I am keeping entry error in the data set as a means of comparison though it would not be directly observed.

**************************************************************************
* DONE with simulation

* The question is:

* Can you now with this messy data recover price data that is similar to the original?

* The first thing that we should exploit is the duplicate recorded data.

scatter price_obs t if prod_id == 1, title(It is easy to see individual deviations)


* It is easy to see individual deviations but we do not want to go through all 200 products to identify individually price outliers.
* We want to come up with a system to identify outliers.

* Let's generate a mean by product and time
bysort prod_id t: egen price_mean = mean(price_obs)

* Let's flag any observation for which the product-week mean is more than 20% above or more than 20% below the recorded price.
gen flag = (price_mean > price_obs*1.2 | price_mean < price_obs*.8)

* Let's see how it is working:
two (scatter price_obs t if prod_id == 1)                           ///
    (scatter price_obs t if prod_id == 1 & flag==1  , msymbol(lgx)) ///
, title(Some of outliers can be identified just looking at the mean) legend(off)




corr flag entry_error
* Our flag is correlated about 45% with the entry errors.  This is good but we can do better.

* I propose that rather than using just the mean that we construct a moving average of prices and see how each entry deviates from the average.
* The only problem is that the moving average command requires xtset and that requires only one entry per time period.
* So we rescale the time variable, using the observation number as if each duplicate were recorded at a different point within the week.

* We need to newly generate prod_obs since we do not know which observation is missing from each product.
bysort prod_id t: gen prod_obs = _n

gen t2 = t*4 + prod_obs

* xtset sets the panel data panel id and time series level.
xtset prod_id t2

* The command we will be using is "tssmooth"

* Specifying ma means moving average, and window() tells Stata how many time periods to include ahead of and behind each observation in the moving average.
* This command can take a little while.
tssmooth ma ma_price_obs=price_obs, window(23 0 23)
* With 4 observations per week, 23 periods is in effect about 5 to 6 weeks ahead and behind
* The 0 tells Stata not to include the observation itself in the average

* The moving average
two (scatter price_obs t if prod_id == 1)                           ///
    (line ma_price_obs t if prod_id == 1)                        ///
    (line price_mean t if prod_id == 1)                        ///
, title(The Moving Average is less susceptible to outliers)



* The moving average is more stable than just the time average.

* Let's try flagging using the moving average
cap drop flag2
gen flag2 = (ma_price_obs > price_obs*1.2 | ma_price_obs < price_obs*.8)

two (scatter price_obs t if prod_id == 1)                           ///
    (scatter price_obs t if prod_id == 1 & flag2==1  , msymbol(lgx)) ///
, title(The Moving Average can also be useful) legend(off)



corr flag2 entry_error

* Drop our flagged data
drop if flag2==1

* Collapse to the weekly level
collapse price_obs, by(prod_id t)
  label var price_obs "Mean price observed"

global twograph

forv i = 1/6 {
  global twograph ${twograph} (scatter price_obs t if prod_id == `i')
}

twoway $twograph, legend(off) title(Observed price trends for first six products)
* The data is looking a lot better but we still clearly have some unwanted outliers.



* We could take advantage of the cross product trends to help identify outliers within product prices.
bysort t: egen ave_price = mean(price_obs)

reg price_obs ave_price if prod_id == 1
predict resid1, residual

reg price_obs ave_price if prod_id == 2
predict resid2, residual

reg price_obs ave_price if prod_id == 3
predict resid3, residual

twoway (line resid1 t if prod_id == 1) ///
       (line price_obs t if prod_id == 1) ///
  (line resid2 t if prod_id == 2) ///
       (line price_obs t if prod_id == 2) ///
  (line resid3 t if prod_id == 3) ///
       (line price_obs t if prod_id == 3) , title(The residuals are clear indicators of outliers) legend(off)



* Finally, let us drop observations with residuals that are greater than 1.5 standard deviations from the mean.

drop resid?


gen flag = 0

qui forv i=1/200 {
  reg price_obs ave_price if prod_id == `i'
  predict resid_temp, residual
  sum resid_temp
  replace flag = (resid_temp-r(mean) > r(sd)*1.5 | resid_temp-r(mean) < -r(sd)*1.5) if prod_id == `i'
  drop resid_temp
}


* Let's see how it is working:
two (scatter price_obs t if prod_id == 2)                           ///
    (scatter price_obs t if prod_id == 2 & flag==1  , msymbol(lgx)) ///
, title(Now just trying remove some final outliers) legend(off)


* Plotting price trends for the first six products before dropping the flagged outliers.
  global twograph

  forv i = 1/6 {
    global twograph ${twograph} (line price_obs t if prod_id == `i')
  }
  twoway $twograph, legend(off) title(Price trends before dropping flagged outliers)


* Finally dropping the outliers
drop if flag

* One final graph
global twograph

forv i = 1/6 {
  global twograph ${twograph} (scatter price_obs t if prod_id == `i')
}

  twoway $twograph, legend(off) title(Observed price trends for first six products)



* Not as clean as our first graph but definitely much improved.

Friday, November 23, 2012

Computer Adaptive Test Assuming an Infinite Item Pool


Original Code

# In order for us to understand what a Computer Adaptive Test (CAT) is, let's first think about how the CAT works.

# A CAT test starts by having some kind of initial assessment of student ability (test taker's ability).

# This is typically at the population mean.

# The test then selects an item that (in the most straightforward case) has the most information at that initial guess.

# If the student answers that question correctly then the program reassesses student ability and finds the next question which has the most information at the new assessment of student ability.

# The computer continues to select items until the termination conditions are met.  These conditions might be anything from a fixed length of time for the test, a fixed number of questions for the test, or more interestingly a sufficient level of precision achieved for student ability.  See flow chart:

# [Flow chart: initial ability estimate -> select most informative item -> administer item -> update ability estimate -> termination conditions met? -> end or repeat]

# In order to assess student ability I will use code from:

http://www.econometricsbysimulation.com/2012/11/estimating-person-characteristics-from.html

#########################################################
# Specify the initial conditions:

true.ability = rnorm(1)

# Load three parameter model ICC
PL3 = function(theta,a, b, c) c+(1-c)*exp(a*(theta-b))/(1+exp(a*(theta-b)))

# Load three parameter item information:
PL3.info = function(theta, a, b, c) a^2 *(PL3(theta,a,b,c)-c)^2/(1-c)^2 * (1-PL3(theta,a,b,c))/PL3(theta,a,b,c)

#########################################################
#  Mock Computer Adaptive Test

# First let's specify an initial guess.

est.start = 0

# How much we adjust our estimate of person ability while all answers are either right or wrong.
est.jump = .7

# Number of items on the test. This will be the end condition.
num.items = 50

# Set the other parameters in the 3PL
a.base = 3
c.base = .1

# Let's generate a vector to hold ability estimates.
ability.est = est.start

# Let's first generate an empty data frame to hold the set of items taken.
items = data.frame(a=NA,b=NA,c=NA,response=NA,p=NA, ability.est=NA)
i = 1

# For this first mock test we will not select items from a pool but instead assume the pool is infinite and has an item with an a=a.base, c=c.base, and b equal to whatever the current guess is.

# Let's select our first item - a,b,c, response are scalars that will be reused to simplify coding.
a=a.base
c=c.base
b=ability.est[i]

# Probability of getting the item correct
p=PL3(true.ability, a,b,c)
response = runif(1) < p
# The Item Characteristic Curve (ICC) gives the probability of getting the item correct.
# Thus, with p = .9, any uniform draw below .9 produces a response of TRUE, i.e. correct or 1
# (as far as R is concerned TRUE and 1 are interchangeable, as are FALSE and 0).
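
# A tiny illustration of that mechanism (illustrative numbers only): with p = .9,
# about 90% of uniform draws fall below p, so about 90% of simulated responses are correct.
mean(runif(10000) < .9)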

items[i,] = c(a=a, b=b, c=c, response=response, p=p, ability.est=ability.est[i])

# We have now successfully administered our first item.

# Should we do our first MLE estimation?

# Not quite; unfortunately, MLE in the binary case requires that the student has answered at least one question correctly and at least one incorrectly.

i=i+1

# Instead we will just adjust the ability estimate by the fixed factor (est.jump)
ability.est[i] = ability.est[i-1]-(-1)^(response)*est.jump
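# ((-1)^response equals -1 for a correct (TRUE) response and +1 for an incorrect one,
#  so a correct answer moves the estimate up by est.jump and an incorrect one moves it down.)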

# Now we administer the second item:
# We will continue this until we get some heterogeneity in the responses
response.v = items$response
response.ave = sum(response.v)/length(response.v)

while ((response.ave == ceiling(response.ave)) & (num.items >= i) ) {
  # This condition will no longer be true once at least one item has been answered correctly and at least one incorrectly.
  ability.est[i] = ability.est[i-1]-(-1)^(response)*est.jump

  a=a.base
  c=c.base
  b=ability.est[i]
  p=PL3(true.ability, a,b,c)

  response = runif(1) < p

  items[i,] = c(a=a, b=b, c=c, response=response, p=p, ability.est=ability.est[i])

  response.v = items$response
  response.ave = sum(response.v)/length(response.v)

  i=i+1
}

items
true.ability

# Now that we have some heterogeneity of responses we can use the MLE estimator
MLE = function(theta) sum(log((items$response==T)*PL3(theta, items$a, items$b, items$c) +
                              (items$response==F)*(1-PL3(theta, items$a, items$b, items$c))))

optim(0,MLE, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))
  # Okay, it seems to be working properly; now we will loop through using the above function.
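# (As an aside, and only as an equivalent alternative for this one-dimensional
#  maximization, base R's optimize() could be used instead of optim with Brent.)
optimize(MLE, interval = c(-6, 6), maximum = TRUE)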
 
# The only thing we need to change is the ability estimate.
while (num.items >= i) {
  ability.est[i] = optim(0,MLE, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))$par

  a=a.base
  c=c.base
  b=ability.est[i]
 
  p=PL3(true.ability, a,b,c)
 
  response = runif(1) < p

  items[i,] = c(a=a, b=b, c=c, response=response, p=p, ability.est=ability.est[i])

  response.v = items$response
  response.ave = sum(response.v)/length(response.v)

  i=i+1
}

items
(ability.est[i] = optim(0,MLE, method="Brent", lower=-6, upper=6, control=list(fnscale = -1)))
true.ability

# We can see that even in this ideal scenario in which you always have appropriately difficult items with high discriminatory power and low guessing, there is a noticeable amount of error.

plot(0:num.items, ability.est, type="l", main="CAT Estimates Ideally Converge on True Ability",
  ylim=c(-3,3), xlab="Number of Items Administered", ylab="Estimated Ability", lwd=2)
abline(h=true.ability, col="red", lwd=2)


Thursday, November 22, 2012

Computing Expected Values

Original Code


# It is often times difficult to solve for the expected value of a variable in closed form.

# However, using computers it can be easy to approximate.

# Imagine (for whatever reason) you would like to calculate the expected value of exp(x) where x is distributed as a standard normal distribution.

# Method 1: Randomly draw many draws of the variable and take the mean.
draws = 100000
rvar = exp(rnorm(draws))
mean(rvar)

# Method 2: Draw from the inverse CDF
draws = 100000
CDF = seq(0.0000001,.9999999,length.out=draws)
rvar = exp(qnorm(CDF))
mean(rvar)

# Both methods are likely to produce very similar results.  Method 2 might be preferred because it is not susceptible to randomness in the draw.  However, Method 1 has the strength of not having to specify an upper and lower limit for the values entering the inverse CDF.
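
# For this particular variable there is also a closed form to compare against: if x is
# standard normal then exp(x) is lognormal, so E[exp(x)] = exp(1/2).
exp(1/2)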

# Sometimes you might be interested in estimating the expected value of a truncated variable.

# Say we are interested in exp(x) where x is still standard normal but only observed between 0 and 2 (min = 0 and max = 2).

# This is easy to approximate as well.

draws = 100000
rvar = rnorm(draws)
rvar.trunc = rvar[rvar > 0 & rvar < 2]
mean(exp(rvar.trunc))
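
# The same truncated expectation can also be approximated with the inverse-CDF approach
# (Method 2) by restricting the CDF grid to the probability range implied by the truncation
# points, and it can be checked against the lognormal partial-expectation formula.
draws = 100000
CDF = seq(pnorm(0), pnorm(2), length.out = draws)
mean(exp(qnorm(CDF)))
# Closed-form check: E[exp(x) | 0 < x < 2] = exp(1/2)*(pnorm(1)-pnorm(-1))/(pnorm(2)-pnorm(0))
exp(1/2) * (pnorm(1) - pnorm(-1)) / (pnorm(2) - pnorm(0))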

Open Source Location For Online Computer Adaptive Test Developers

I just found a very cool online platform, both powerful and flexible, that allows the free development and implementation of online computer adaptive testing.  Too bad their tutorials seem a little outdated.  But they use R to handle the adaptive item selection!  I am very excited.

http://www.psychometrics.cam.ac.uk/page/338/concerto-testing-platform.htm

Wednesday, November 21, 2012

Estimating Person Characteristics from IRT Data - 3PL Model

Original Code

# One of the basic tasks in item response theory is estimating the ability of a test taker from the responses to a series of items (questions).

# Let's draw the same pool of items that we have used on several previous posts:

# First let's imagine we have a pool of 500 3PL items.
set.seed(101)

npool = 500

pool = as.data.frame(cbind(item=1:npool, a=abs(rnorm(npool)*.25+1.1), b=rnorm(npool), c=abs(rnorm(npool)/7.5+.1)))
summary(pool)

# Drawing on code from a previous post we can calculate several useful functions:

# (http://www.econometricsbysimulation.com/2012/11/matching-item-information.html)

# Each item has a item characteristic curve (ICC) of:
PL3 = function(theta,a, b, c) c+(1-c)*exp(a*(theta-b))/(1+exp(a*(theta-b)))

# Let's imagine that we have a single test taker and that test taker has taken the first 15 items from the pool.

items.count = 15
items.taken = pool[1:items.count,]

# And that the person has a latent theta ability of 1.3
theta = 1.3

# Let's calculate the cut points for each of the items.
items.cut = PL3(theta, items.taken$a, items.taken$b, items.taken$c)

# We can see how the cut point works by graphing
plot(0,0, type="n", xlim=c(-3,3),ylim=c(0,1), xlab=~theta,
           ylab="Probability of Correct Response", yaxs = "i", xaxs = "i" , main="Item Characteristic Curves and Ability Level")

for(i in 1:items.count) {
  lines(seq(-3,3,.1),PL3(seq(-3,3,.1), items.taken$a[i], items.taken$b[i], items.taken$c[i]), lwd=2)
  abline(h=items.cut[i], col="blue")
}
abline(v=theta,col="red", lwd=3)



# Now let's draw a uniform random number that we will use to determine whether each item is answered correctly.

rdraw = runif(items.count)

# Finally, we will calculate item responses
item.responses = 0
item.responses = items.cut > rdraw

###############################################
# Done with Simulation - Time for Estimation

# We want to use the information we know about the items (the item parameters and the responses) in order to estimate a best guess at the true ability of the test taker.

# First we must check if the person got all of the items either correct or incorrect.

sum(item.responses)
# If this is either 0 or equal to the number of items then we cannot estimate an interior maximum without additional assumptions.

# We will attempt to recover our theta value using the R command optim.

# First we need to specify the function to optimize over.

MLE = function(theta) sum(log((item.responses==T)*PL3(theta, items.taken$a, items.taken$b, items.taken$c) +
                              (item.responses==F)*(1-PL3(theta, items.taken$a, items.taken$b, items.taken$c))))
# The optimization function takes as its argument the choice variable to be optimized (theta).
# The way the likelihood above works is that each response contributes a probability piecewise.
# If the response is correct, the item contributes PL3(theta, a, b, c), the probability of a correct outcome; if the response is incorrect, it contributes 1 - PL3(theta, a, b, c), the probability of an incorrect outcome.  You then choose the theta that maximizes the sum of the log probabilities.

MLEval = 0
theta.range = seq(-3,3,.1)
for(i in 1:length(theta.range)) MLEval[i] =MLE(theta.range[i])

plot(theta.range, MLEval, type="l", main="Maximum Likelihood Function", xlab= ~theta, ylab="Sum of Log Likelihood")
abline(v=theta, col="blue")
# We can visually see that the maximum of the likelihood will not be exactly at the true value, though it will be close.

optim(0,MLE, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))
abline(v=optim(0,MLE, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))$par, col="red")


# We can see that we can estimate theta reasonably well with just 15 items from a paper test (red line estimate, blue line true).  However, looking at the graph of the ICCs, we can see that for most of the items, the steepest point (where they have the most discriminating power) is at an ability level lower than the test taker's ability.  Thus, this test provides the most information about a person who has a lower ability than a person with theta=1.3.

# We can use R's optim function to find the ideal theta that would maximize the information from this test.
# Item information is:
PL3.info = function(theta, a, b, c) a^2 *(PL3(theta,a,b,c)-c)^2/(1-c)^2 * (1-PL3(theta,a,b,c))/PL3(theta,a,b,c)

# Notice, this is not the best way of defining the test information function since the items are not arguments.
test.info = function(theta) sum(PL3.info(theta, items.taken$a, items.taken$b, items.taken$c))
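# (A small refactoring sketch, purely illustrative: pass the item bank in as an argument
#  so the same function can be reused for any set of items.)
test.info2 = function(theta, items) sum(PL3.info(theta, items$a, items$b, items$c))
test.info2(theta, items.taken)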

# Construct a vector to hold the test information
info = 0
for(i in 1:length(theta.range)) info[i]=test.info(theta.range[i])

plot(theta.range, info, type="l", main="Information Peaks Slightly Above 0", xlab= ~theta, ylab="Information")
abline(v=theta, col="blue")
# But we want to know about the test taker at theta

optim(0,test.info, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))
# The person this test would be best suited to evaluate would have an ability rating of .19
abline(v=optim(0,test.info, method="Brent", lower=-6, upper=6, control=list(fnscale = -1))$par, col="red")


Monday, November 19, 2012

t-tests and F-tests and rejection rates

Original Code

* When I was first learning about t-tests and f-tests I was told that a t-test controls the probability of falsely rejecting the null hypothesis for a single estimator.

* While the f-test controls the probability of falsely rejecting the null that the model as a whole explains nothing significant.

* It was also stated that sometimes the t-tests can fail to reject the null when the f-test does reject it, and that this is primarily the result of correlation among explanatory variables.

* All of these things I believe are well understood.

* However, what I always wanted to know was, if the t-test rejected did that mean that the f-test must also reject?

* This seems intuitively to me to be true.

* That is, if one part of the model seems to be statistically significant, mustn't the whole model be statistically significant?

* Now thinking back on the question, I think the answer must be no.

* Assume we are using a 10% rejection rate.  The reason, I argue, is that if the f-test's assumptions are met then it should falsely reject a true null 10% of the time.

* Likewise, if a t-test's assumptions are met it should falsely reject a true null 10% of the time.

* However, if we are estimating two ts and they are independent then the probability that at least one of them rejects the null at 10% is 1-(1-.10)^2=19%.

* Thus, since the f-test rejects only 10% of the time while at least one t-test rejects about 19% of the time, there must be a range of cases in which one or more t-stats reject but the f-stat fails to reject.

* Let's see if we can see this in action through simulation.

cap program drop ft_test
program define ft_test, rclass

  clear
  set obs 1000

  gen x1=rnormal()
  gen x2=rnormal()
  gen y=rnormal()

  reg y x?

  * Calculate the p-stats for the individual coefficients.
  * We multiply by 2 because the ttail is initially one sided and we are interested in the two sided alternative.
  return scalar pt1 = 2 * ttail(e(df_r), abs(_b[x1]/_se[x1]))
  return scalar pt2 = 2 * ttail(e(df_r), abs(_b[x2]/_se[x2]))

  * We also want to extract the F stat
  return scalar pF  = Ftail(e(df_m),e(df_r),e(F))

end

ft_test
ft_test
ft_test
* Running the simulated regression a few times, I can easily see that the p-values for the t-tests diverge from the p-value for the f-test fairly frequently.

simulate pt1=r(pt1) pt2=r(pt2) pF=r(pF), reps(1000): ft_test

gen rt1 = (pt1<=.1)
gen rt2 = (pt2<=.1)
gen rF  = (pF<=.1)

sum r*
* All of the tests seem to be rejecting at about the right 10% rate.

* It might be the case that we always reject the null for the f-test whenever a t-test rejects, if the t-test rejections are correlated with each other.
pwcorr rt1 rt2, sig

* There does not appear to be correlation between the two t-tests rejections.

* By now we should already know the answer to the question.

* But let's check directly.

gen rtF = 0 if rt1 | rt2
replace rtF = 1 if rF == 1 & rtF == 0
sum rtF

* Thus the probability of rejecting the f-null given that we have rejected at least one of the t-nulls is only a little above 50%.

* It does make sense that the f and t rejections would be correlated.

* That is, when the individual coefficients seem to be explaining some of the variation in the outcome, the model as a whole seems to be working relatively well.

pwcorr rF rt1 rt2, sig

* There is one more thing to check.  How frequently do we reject the null for the F but not for either of the ts?
gen rFt = 0 if rF
replace rFt = 1 if (rt1 | rt2) & rFt == 0

sum rFt
* In this simulation, we only reject the F-stat when at least one of the t-stats rejects.

* We could therefore argue that the F-stat is a more conservative test than the t-stats.

* However, I do not believe this to be entirely the case.

* As mentioned before, I think it is possible, when the explanatory variables are correlated, for the t-stats to fail to reject even though the F-stat does reject.

* Let's see if we can simulate this.

cap program drop ft_test2
program define ft_test2, rclass

  clear
  set obs 1000

  gen x1=rnormal()
  gen x2=rnormal()+x1*3
  * This will cause x1 and x2 to be strongly correlated.
  gen y=rnormal()

  reg y x?

  * Calculate the p-stats for the individual coefficients.
  * We multiply by 2 because the ttail is initially one sided and we are interested in the two sided alternative.
  return scalar pt1 = 2 * ttail(e(df_r), abs(_b[x1]/_se[x1]))
  return scalar pt2 = 2 * ttail(e(df_r), abs(_b[x2]/_se[x2]))

  * We also want to extract the F stat
  return scalar pF  = Ftail(e(df_m),e(df_r),e(F))

end

simulate pt1=r(pt1) pt2=r(pt2) pF=r(pF), reps(1000): ft_test2

* Same analysis as previously
gen rt1 = (pt1<=.1)
gen rt2 = (pt2<=.1)
gen rF  = (pF<=.1)

sum r*
pwcorr rt1 rt2, sig
* The rate of rejection between ts is highly correlated.

gen rtF = 0 if rt1 | rt2
replace rtF = 1 if rF == 1 & rtF == 0
sum rtF
* Under this setup, the f-test rejects the null only about 45% of the time when at least one of the t-tests rejects.

pwcorr rF rt1 rt2, sig
* We can see that the correlation between the f-test rejections and the individual t-test rejections is still quite strong.

* There is one more thing to check.  How frequently do we reject the null for the F but not for either of the ts?
gen rFt = 0 if rF
replace rFt = 1 if (rt1 | rt2) & rFt == 0

sum rFt
* Now we can see the result as discussed previously.  About 25% of the time the f-stat is rejecting the null even though neither t-stat is rejecting the null.

* Thus it may be informative to use a F-stat to check for model fit even when the t-stats do not suggest statistical significance of the individual components.

* The ultimate result of this simulation is to emphasize for me the need to do tests of model fit.

* If I were to look only at the t-tests in this example, I would falsely conclude that the model fits well nearly twice as frequently as if I were to look at the F-stat alone.