* Every so often someone asks about how to bootstrapping time series data.

* The answer is not either easy or clear (at least as far as I know though there are many people mroe expert on the subject than I).

* The problem is that time series data is by its nature linearly dependent with itself (auto-correlated).

* Let's see how this looks:

clear

set obs 300

gen t = _n

gen x = 0

gen y = 0

gen u = rnormal()*7.5

local N=_N

* You must fill in the simulated data sequentially

qui forv i=2(1)`N' {

replace x = rnormal() + .75*x[`i'-1] if t==`i'

replace y = x*1.5 + .75*y[`i'-1] + u if t==`i'

}

* Let's see what it looks like

* I will first standardize the two.

egen ystd = std(y)

egen xstd = std(x)

two (line ystd t) (line xstd t), name(original, replace) legend(off) ///

title("We can see that the two series are moving together")

* In order to estimate the relationship between y properly we must account for the linear depenent structure of the data.

reg y x

* Is obviously very biased.

gen yl = y[t-1]

* Now we can see that our estimate on x is much closer to our true (1.5).

reg y yl x

* It is worth noting that at this point it is completely unnecccesary to included any lagged values of x.

* Though this is only because we are assuming contemporaneous exogeneity between y and x.

* Now let's make the obvious mistake and try to boostrap.

bs: reg y yl x

* Wow, our estimates look great! .... Not quite.

* The problem is we are no longer trully bootstrapping our data because lagged values of y are not lagged values of y in our new distributions.

* Drawing on a previous post:

* http://www.econometricsbysimulation.com/2012/07/write-your-own-bootstrap-command.html

cap program drop bootme

program bootme

$nq preserve

syntax anything [, Reps(integer 100) c1(string) c2(string) c3(string) Keep(string) Noisily]

if "`noisily'" == "" local nq quietly

if "`noisily'" != "" local nq

`nq' gen bootmatcher = _n-1

`nq' gl boot_N=_N

tempfile drawfile

`nq' save `drawfile' , replace

`nq' clear

`nq' set obs `reps'

foreach v in `keep' {

local vtemp=strtoname("v`v'")

`nq' gen `vtemp'=.

}

tempfile returnlist

`nq' save `returnlist' , replace

di as text _newline "Bootstrap replications ({result:`reps'})"

di "----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5"

forv i=1/`reps' {

clear

`nq' set obs $boot_N

`nq' gen bootmatcher = int(runiform()*$boot_N)

`nq' sort bootmatcher

`nq' merge m:1 bootmatcher using `drawfile', keep(3)

* Making a slight modification to the code her

gen tempvarzzz = rnormal()

sort tempvarzzz

`nq' `c1'

`nq' `c2'

`nq' `c3'

`nq' `c4'

`nq' `c5'

`nq' `c6'

drop tempvarzzz

`nq' use `returnlist', clear

foreach v in `keep' {

local vtemp=strtoname("v`v'")

`nq' replace `vtemp'=`v' if _n==`i'

}

`nq' save `returnlist', replace

if `i'/50 == int(`i'/50) di ". " `i'

else di _continue "."

}

sum

`nq' restore

gen tempvarzzz = rnormal()

sort tempvarzzz

`c1'

`c2'

`c3'

`c4'

`c5'

`c6'

drop tempvarzzz

end

* We can see that we get basically the same results.

bootme 1, r(200) k(_b[x]) c1(reg y yl x)

* However, now let's add a line of code that generates y lagged as a function of the boostrapped data.

bootme 1, r(200) k(_b[x]) c1(gen ylboot = y[t-1]) c2(reg y ylboot x)

* In this formulation, unsurprisingly ylboot has almost no explanatory power.

gen tboot = _n

two (line ystd tboot) (line xstd tboot), name(boot1, replace) ///

title("Boostrapped data is not autocorrelated") legend(off)

graph combine boot1 original, rows(2)

* We can see that clearly bootstrapping our results in this way is not going to work.

* One of my classmates is using a method that I think he got from the literature which is a little more complicated.

* First he estimates his model.

reg y yl x

* Then predicts the fitted and residuals

predict yhat

predict uhat, resid

* Then he bootstraps those values and reconstructs the y values from the fitted values by randomly drawing the errors.

* Let's see if we can accomplish this with the previous command.

* First we need to save the fitted values to be grabbed later.

global N=_N

forv i=1(1)$N {

gl yhat`i' = yhat[`i']

gl x`i' = x[`i']

}

* Write a small program to load yhat values back into the memory

cap program drop datasetup

program datasetup

cap gen yhat0 = .

cap gen ynew = .

cap gen ynewlag = .

cap gen xnew = .

forv i=1(1)$N {

qui replace yhat0 = ${yhat`i'} if _n==`i'

qui replace xnew = ${x`i'} if _n==`i'

if `i'>1 qui replace ynew = yhat0+uhat

}

replace ynewlag=ynew[_n-1]

end

datasetup

corr y ynew x xnew

* Initially these values will be identical

bootme 1, r(200) k(_b[x]) c1(datasetup) c2(reg ynew ynewlag xnew)

* As with my classmate's results, the new confidence interval for the x coefficient does not even include the original estimate ~1.2.

* Looking at the data generated in this method

egen ynewstd = std(ynew)

egen xnewstd = std(xnew)

replace t=_n

two (line ynewstd t) (line xnewstd t), name(original, replace) legend(off) ///

title("The new time series")

* Clearly this method is not accomplishing what we would like it to accomplish from a perspective of replicating the estimated coefficients.

* However, interestingly (though not neccessarily believably) the standard errors from the bootstrap are close to the size of the standard errors from the OLS regression above reg y y1 x

* Hope this is helpful. I will send my friend this link to see if he has anything to add.

Francis - check out the block bootstrap and its extensions.

ReplyDelete