## Saturday, November 24, 2012

### Moving Averages & Data Cleaning

Original Code

* Imagine you have data on prices for many products.

* For each of the products you record weekly price information.

clear
set obs 200

gen prod_id = _n

* Each product has a unique average price
gen prod_price = rpoisson(5)/7

* You have data on weekly prices for 200 weeks.
expand 200
bysort prod_id: gen t=_n
label var t "Week"

sort prod_id t

* There is also some seasonal variation
gen seasonal = .2*sin(_pi*t/50)

* As well as a general time trend
gen trend = t*.005

* The first observation is not correlated with anything
gen price = prod_price*2.5 + trend +  rpoisson(10)/10 if t==1
replace price = prod_price*2 + trend + seasonal + .7*price[_n-1] + .3*rpoisson(10)/10 if t==2
replace price = prod_price + trend + seasonal + .5*price[_n-1] + .2*price[_n-2] + .3*rpoisson(10)/10 if t==3
replace price = prod_price + trend + seasonal + .3*price[_n-1] + .2*price[_n-2] + .2*price[_n-3] + .3*rpoisson(10)/10 if t==4
replace price = prod_price + trend + seasonal + .3*price[_n-1] + .175*price[_n-2] + .125*price[_n-3] + .1*price[_n-4] +.3*rpoisson(10)/10 if t>4
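For readers following along without Stata, the recursion above can be sketched in Python (a rough numpy analogue, not part of the original script; `prod_price` here is a single product's mean price):

```python
import numpy as np

def simulate_price(T=200, prod_price=1.0, seed=0):
    """Sketch of the price recursion above: product mean + trend +
    seasonal component + AR terms on past prices + Poisson noise."""
    rng = np.random.default_rng(seed)
    week = np.arange(1, T + 1)
    trend = 0.005 * week
    seasonal = 0.2 * np.sin(np.pi * week / 50)
    noise = rng.poisson(10, T) / 10
    p = np.empty(T)
    # Start-up periods use progressively more lags, as in the Stata code.
    p[0] = prod_price * 2.5 + trend[0] + noise[0]
    p[1] = prod_price * 2.0 + trend[1] + seasonal[1] + 0.7 * p[0] + 0.3 * noise[1]
    p[2] = prod_price + trend[2] + seasonal[2] + 0.5 * p[1] + 0.2 * p[0] + 0.3 * noise[2]
    p[3] = prod_price + trend[3] + seasonal[3] + 0.3 * p[2] + 0.2 * p[1] + 0.2 * p[0] + 0.3 * noise[3]
    for t in range(4, T):
        p[t] = (prod_price + trend[t] + seasonal[t]
                + 0.3 * p[t - 1] + 0.175 * p[t - 2] + 0.125 * p[t - 3]
                + 0.1 * p[t - 4] + 0.3 * noise[t])
    return p
```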

* Create a global to store the graph commands
global twograph

forv i = 1/6 {
global twograph ${twograph} (line price t if prod_id == `i')
}

twoway $twograph, legend(off) title(True price trends for first six products)

* Now let's imagine that the above generated data is the true price information which is fundamentally unobservable.

* Instead, you have multiple collections of price data per week, each of which varies by some random additive error.
expand 3

bysort prod_id t: gen prod_obs = _n

gen price_collect = price + rnormal()*.25

* However, 10% of the price entries that you have were mistakenly entered incorrectly.

gen entry_error = rbinomial(1,.1)
gen scalar_error = rnormal()+1

gen price_obs = price_collect*(1+entry_error*scalar_error)
label var price_obs "Recorded Price"

gen missing = rbinomial(1,.35)

drop if missing==1

* Create a global to store the graph commands
global twograph

forv i = 1/6 {
global twograph ${twograph} (line price_obs t if prod_id == `i' & prod_obs==1)
}

twoway $twograph, legend(off) title(Observed price trends for first six products)

keep t price_obs  prod_id entry_error
* I am keeping entry error in the data set as a means of comparison though it would not be directly observed.

**************************************************************************
* DONE with simulation

* The question is:

* Can you now with this messy data recover price data that is similar to the original?

* The first thing that we should exploit is the duplicate recorded data.

scatter price_obs t if prod_id == 1, title(It is easy to see individual deviations)

* It is easy to see individual deviations, but we do not want to go through all 200 products to identify price outliers one by one.
* We want to come up with a system to identify outliers.

* Let's generate a mean by product and time
bysort prod_id t: egen price_mean = mean(price_obs)

* Let's flag any observation for which the group mean is more than 120% or less than 80% of the recorded price.
gen flag = (price_mean > price_obs*1.2 | price_mean < price_obs*.8)
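The same rule can be sketched in Python (an illustrative numpy version, not part of the original script):

```python
import numpy as np

def flag_vs_mean(price_obs, price_mean, band=0.2):
    """Flag records where the group mean is more than `band` (20%)
    above or below the recorded price, mirroring the Stata line."""
    price_obs = np.asarray(price_obs, dtype=float)
    price_mean = np.asarray(price_mean, dtype=float)
    return (price_mean > price_obs * (1 + band)) | \
           (price_mean < price_obs * (1 - band))
```

For example, a price recorded as 1.0 against a group mean of 1.3 is flagged, while one against a mean of 1.05 is not.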

* Let's see how it is working:
two (scatter price_obs t if prod_id == 1)                           ///
(scatter price_obs t if prod_id == 1 & flag==1  , msymbol(lgx)) ///
, title(Some of outliers can be identified just looking at the mean) legend(off)

corr flag entry_error
* Our flag is correlated about 45% with the entry errors.  This is good but we can do better.

* I propose that rather than using just the mean, we construct a moving average of prices and see how each entry deviates from it.
* The only problem is that the moving-average command requires xtset, which allows only one entry per time period.
* So I suggest we rescale the time variable and fold the observation number into it, as if each duplicate were recorded at a different point within the week.

* We need to regenerate prod_obs since we do not know which observations are missing from each product.
bysort prod_id t: gen prod_obs = _n

gen t2 = t*4 + prod_obs
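A quick check (in Python, purely illustrative) that this mapping gives every record a unique pseudo-time, which is what xtset requires: each week holds at most three duplicate observations, so `4*t + prod_obs` never collides.

```python
# Each week t holds up to 3 duplicates (prod_obs = 1..3), so the
# pseudo-time 4*t + prod_obs is unique across all records.
def pseudo_time(t, prod_obs):
    return 4 * t + prod_obs

codes = [pseudo_time(t, o) for t in range(1, 201) for o in range(1, 4)]
assert len(codes) == len(set(codes))  # no collisions
```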

* xtset sets the panel data panel id and time series level.
xtset prod_id t2

* The command we will be using is "tssmooth"

* It is coded such that specifying ma means moving average, and window tells Stata how many time periods to count behind and how many ahead in the moving average.
* This command can take a little while.
tssmooth ma ma_price_obs=price_obs, window(23 0 23)
* A window of 23 is in effect roughly 5 weeks behind and 5 weeks ahead (each week spans 4 units of t2)
* The 0 tells Stata not to include the current observation itself in that average
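In other words, each smoothed value is the average of up to 23 pseudo-time lags and 23 leads, skipping the point itself. A rough numpy analogue (ignoring panel structure and gaps, so only a sketch):

```python
import numpy as np

def centered_ma_excluding_self(x, k=23):
    """Average up to k values before and k after each point,
    excluding the point itself; windows shrink at the edges."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    out = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - k), min(n, i + k + 1)
        window = np.concatenate([x[lo:i], x[i + 1:hi]])
        out[i] = window.mean()
    return out
```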

* The moving average
two (scatter price_obs t if prod_id == 1)                           ///
(line ma_price_obs t if prod_id == 1)                        ///
(line price_mean t if prod_id == 1)                        ///
, title(The Moving Average is less susceptible to outliers)

* The moving average is more stable than just the time average.

* Let's try flagging using the moving average
cap drop flag2
gen flag2 = (ma_price_obs > price_obs*1.2 | ma_price_obs < price_obs*.8)

two (scatter price_obs t if prod_id == 1)                           ///
(scatter price_obs t if prod_id == 1 & flag2==1  , msymbol(lgx)) ///
, title(The Moving Average can also be useful) legend(off)

corr flag2 entry_error

* Drop our flagged data
drop if flag2==1

* Collapse to the weekly level
collapse price_obs, by(prod_id t)
label var price_obs "Mean price observed"

global twograph

forv i = 1/6 {
global twograph ${twograph} (scatter price_obs t if prod_id == `i')
}

twoway $twograph, legend(off) title(Observed price trends for first six products)
* The data is looking a lot better but we still clearly have some unwanted outliers.

* We could take advantage of the cross-product trends to help identify outliers within product prices.
bysort t: egen ave_price = mean(price_obs)

reg price_obs ave_price if prod_id == 1
predict resid1, residual

reg price_obs ave_price if prod_id == 2
predict resid2, residual

reg price_obs ave_price if prod_id == 3
predict resid3, residual

twoway (line resid1 t if prod_id == 1) ///
(line price_obs t if prod_id == 1) ///
(line resid2 t if prod_id == 2) ///
(line price_obs t if prod_id == 2) ///
(line resid3 t if prod_id == 3) ///
(line price_obs t if prod_id == 3) , title(The residuals are clear indicators of outliers) legend(off)

* Finally, let us drop observations with residuals that are greater than 1.5 standard deviations from the mean.

drop resid?

gen flag = 0

qui forv i=1/200 {
reg price_obs ave_price if prod_id == `i'
predict resid_temp, residual
sum resid_temp
replace flag = ((resid_temp-r(mean) > r(sd)*1.5) | (resid_temp-r(mean) < -r(sd)*1.5)) if prod_id == `i'
drop resid_temp
}
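The loop's logic can be sketched in Python for one product (an illustrative OLS with numpy, not the original code): regress the product's price on the cross-product average, then flag residuals more than 1.5 standard deviations from their mean.

```python
import numpy as np

def flag_residual_outliers(y, x, nsd=1.5):
    """OLS of y on x with a constant; flag points whose residual
    lies more than nsd standard deviations from the mean residual."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return np.abs(resid - resid.mean()) > nsd * resid.std()
```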

* Let's see how it is working:
two (scatter price_obs t if prod_id == 2)                           ///
(scatter price_obs t if prod_id == 2 & flag==1  , msymbol(lgx)) ///
, title(Now just trying to remove some final outliers) legend(off)

* Plotting the first six products' prices with the flagged outliers still included.
global twograph

forv i = 1/6 {
global twograph ${twograph} (line price_obs t if prod_id == `i')
}

twoway $twograph, legend(off) title(Price trends before dropping the final outliers)

* Finally dropping the outliers
drop if flag

* One final graph
global twograph

forv i = 1/6 {
global twograph ${twograph} (scatter price_obs t if prod_id == `i')
}

twoway $twograph, legend(off) title(Observed price trends for first six products)

* Not as clean as our first graph but definitely much improved.