Saturday, November 24, 2012

Moving Averages & Data Cleaning

Original Code


* Imagine you have data on prices for many products.

* For each of the products you record weekly price information.

clear
set obs 200

gen prod_id = _n

* Each product has a unique average price
gen prod_price = rpoisson(5)/7

* You have data on weekly prices for 200 weeks.
expand 200
bysort prod_id: gen t=_n
  label var t "Week"

sort prod_id t

* There is also some seasonal variation
gen seasonal = .2*sin(_pi*t/50)

* As well as a general time trend
gen trend = t*.005

* The first week's price does not depend on any lagged prices
gen price = prod_price*2.5 + trend +  rpoisson(10)/10 if t==1
replace price = prod_price*2 + trend + seasonal + .7*price[_n-1] + .3*rpoisson(10)/10 if t==2
replace price = prod_price + trend + seasonal + .5*price[_n-1] + .2*price[_n-2] + .3*rpoisson(10)/10 if t==3
replace price = prod_price + trend + seasonal + .3*price[_n-1] + .2*price[_n-2] + .2*price[_n-3] + .3*rpoisson(10)/10 if t==4
replace price = prod_price + trend + seasonal + .3*price[_n-1] + .175*price[_n-2] + .125*price[_n-3] + .1*price[_n-4] +.3*rpoisson(10)/10 if t>4

* Create a global to store the graph commands
global twograph

forv i = 1/6 {
  global twograph ${twograph} (line price t if prod_id == `i')
}

twoway $twograph, legend(off) title(True price trends for first six products)



* Now let's imagine that the data generated above is the true price information, which is fundamentally unobservable.

* Instead, you have multiple collections of price data per week, each of which varies by some random additive error.
expand 3

bysort prod_id t: gen prod_obs = _n

gen price_collect = price + rnormal()*.25

* However, 10% of the price entries that you have were mistakenly entered incorrectly.

gen entry_error = rbinomial(1,.1)
  gen scalar_error = rnormal()+1

gen price_obs = price_collect*(1+entry_error*scalar_error)
  label var price_obs "Recorded Price"

* In addition, 35% of your price data was never collected
gen missing = rbinomial(1,.35)

drop if missing==1

* Create a global to store the graph commands
global twograph

forv i = 1/6 {
  global twograph ${twograph} (line price_obs t if prod_id == `i' & prod_obs==1)
}

twoway $twograph, legend(off) title(Observed price trends for first six products)



keep t price_obs  prod_id entry_error
* I am keeping entry error in the data set as a means of comparison though it would not be directly observed.

**************************************************************************
* DONE with simulation

* The question is:

* Can you now with this messy data recover price data that is similar to the original?

* The first thing that we should exploit is the duplicate recorded data.

scatter price_obs t if prod_id == 1, title(It is easy to see individual deviations)


* It is easy to see individual deviations, but we do not want to go through all 200 products to identify price outliers individually.
* We want to come up with a system to identify outliers.

* Let's generate a mean by product and time
bysort prod_id t: egen price_mean = mean(price_obs)

* Let's flag any observation where the product-week mean is more than 120% of the observed price or less than 80% of it.
gen flag = (price_mean > price_obs*1.2 | price_mean < price_obs*.8)

* Let's see how it is working:
two (scatter price_obs t if prod_id == 1)                           ///
    (scatter price_obs t if prod_id == 1 & flag==1  , msymbol(lgx)) ///
, title(Some of outliers can be identified just looking at the mean) legend(off)




corr flag entry_error
* Our flag has a correlation of about 0.45 with the entry errors.  This is good but we can do better.
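
* As a quick diagnostic sketch (possible only because entry_error was kept for
* comparison), a cross-tabulation shows the share of true errors caught and the
* share of clean observations falsely flagged:
tab entry_error flag, row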

* I propose that, rather than using just the mean, we construct a moving average of prices and see how each entry deviates from that average.
* The only problem is that the moving average command requires the data to be xtset, and xtset allows only one observation per panel and time period.
* So, I say we rescale the time variable and add the observation number, treating each duplicate collection as if it were recorded at a different time within the week.

* We need to regenerate prod_obs since we do not know which observations are missing for each product.
bysort prod_id t: gen prod_obs = _n

gen t2 = t*4 + prod_obs

* xtset declares the panel identifier and the time variable for panel and time-series commands.
xtset prod_id t2
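
* A small check (sketch): listing a few rows shows how the rescaled time variable
* spaces the duplicate collections within each week.
list prod_id t prod_obs t2 if prod_id == 1 & t <= 2, sepby(t) noobs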

* The command we will be using is "tssmooth"

* Specifying ma tells Stata to compute a moving average, and window tells it how many time periods behind, at, and ahead of each observation to include in the average.
* This command can take a little while.
tssmooth ma ma_price_obs=price_obs, window(23 0 23)
* A window of 23 on the rescaled time variable is in effect about five to six weeks ahead and five to six weeks behind
* The 0 tells Stata not to include the observation itself in the average

* The moving average
two (scatter price_obs t if prod_id == 1)                           ///
    (line ma_price_obs t if prod_id == 1)                        ///
    (line price_mean t if prod_id == 1)                        ///
, title(The Moving Average is less susceptible to outliers)



* The moving average is more stable than just the time average.

* Let's try flagging using the moving average
cap drop flag2
gen flag2 = (ma_price_obs > price_obs*1.2 | ma_price_obs < price_obs*.8)

two (scatter price_obs t if prod_id == 1)                           ///
    (scatter price_obs t if prod_id == 1 & flag2==1  , msymbol(lgx)) ///
, title(The Moving Average can also be useful) legend(off)



corr flag2 entry_error

* Drop our flagged data
drop if flag2==1

* Collapse to the weekly level
collapse price_obs, by(prod_id t)
  label var price_obs "Mean price observed"
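
* A quick sanity check (sketch): after collapsing there should be exactly one
* observation per product-week; isid throws an error if that is not the case.
isid prod_id t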

global twograph

forv i = 1/6 {
  global twograph ${twograph} (scatter price_obs t if prod_id == `i')
}

twoway $twograph, legend(off) title(Observed price trends for first six products)
* The data is looking a lot better but we still clearly have some unwanted outliers.



* We could take advantage of cross-product trends to help identify outliers within each product's prices.
bysort t: egen ave_price = mean(price_obs)

reg price_obs ave_price if prod_id == 1
predict resid1, residual

reg price_obs ave_price if prod_id == 2
predict resid2, residual

reg price_obs ave_price if prod_id == 3
predict resid3, residual

twoway (line resid1 t if prod_id == 1)    ///
       (line price_obs t if prod_id == 1) ///
       (line resid2 t if prod_id == 2)    ///
       (line price_obs t if prod_id == 2) ///
       (line resid3 t if prod_id == 3)    ///
       (line price_obs t if prod_id == 3) ///
  , title(The residuals are clear indicators of outliers) legend(off)



* Finally, let us drop observations with residuals that are greater than 1.5 standard deviations from the mean.

drop resid?


gen flag = 0

qui forv i=1/200 {
  reg price_obs ave_price if prod_id == `i'
  predict resid_temp if prod_id == `i', residual
  sum resid_temp
  replace flag = (resid_temp-r(mean) > r(sd)*1.5 | resid_temp-r(mean) < -r(sd)*1.5) if prod_id == `i'
  drop resid_temp
}
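
* Aside: an alternative sketch using standardized residuals.  After -regress-,
* -predict, rstandard- returns residuals scaled to roughly unit variance, so a
* 1.5 cutoff can be applied directly without calling -sum-.  The variable
* flag_alt is a hypothetical name used only for this illustration.
cap drop flag_alt
gen flag_alt = 0
qui forv i=1/200 {
  reg price_obs ave_price if prod_id == `i'
  predict rstd_temp if e(sample), rstandard
  replace flag_alt = abs(rstd_temp) > 1.5 if prod_id == `i' & rstd_temp < .
  drop rstd_temp
}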


* Let's see how it is working:
two (scatter price_obs t if prod_id == 2)                           ///
    (scatter price_obs t if prod_id == 2 & flag==1  , msymbol(lgx)) ///
, title(Now just trying to remove some final outliers) legend(off)


* Plotting the first six products' prices before dropping the flagged outliers.
  global twograph

  forv i = 1/6 {
    global twograph ${twograph} (line price_obs t if prod_id == `i')
  }

  twoway $twograph, legend(off) title(Prices before dropping the flagged outliers)


* Finally dropping the outliers
drop if flag

* One final graph
global twograph

forv i = 1/6 {
  global twograph ${twograph} (scatter price_obs t if prod_id == `i')
}

  twoway $twograph, legend(off) title(Observed price trends for first six products)



* Not as clean as our first graph but definitely much improved.
