* Imagine you have data on prices for many products.
* For each of the products you record weekly price information.
clear
set obs 200
gen prod_id = _n
* Each product has a unique average price
gen prod_price = rpoisson(5)/7
* You have data on weekly prices for 200 weeks.
expand 200
bysort prod_id: gen t=_n
label var t "Week"
sort prod_id t
* There is also some seasonal variation
gen seasonal = .2*sin(_pi*t/50)
* As well as a general time trend
gen trend = t*.005
* The first observation is not correlated with anything
gen price = prod_price*2.5 + trend + rpoisson(10)/10 if t==1
replace price = prod_price*2 + trend + seasonal + .7*price[_n-1] + .3*rpoisson(10)/10 if t==2
replace price = prod_price + trend + seasonal + .5*price[_n-1] + .2*price[_n-2] + .3*rpoisson(10)/10 if t==3
replace price = prod_price + trend + seasonal + .3*price[_n-1] + .2*price[_n-2] + .2*price[_n-3] + .3*rpoisson(10)/10 if t==4
replace price = prod_price + trend + seasonal + .3*price[_n-1] + .175*price[_n-2] + .125*price[_n-3] + .1*price[_n-4] +.3*rpoisson(10)/10 if t>4
* Create a globabl to store
global twograph
forv i = 1/6 {
global twograph ${twograph} (line price t if prod_id == `i')
}
twoway $twograph, legend(off) title(True price trends for first six products)
* Now let's imagine that the above generated data is the true price information which is fundamentally unobservable.
* Instead you have multiple collections of data per week on prices which each vary by some random addative error.
expand 3
bysort prod_id t: gen prod_obs = _n
gen price_collect = price + rnormal()*.25
* However the price information that you have has some entries that 10% have been mistakenly entered wrong.
gen entry_error = rbinomial(1,.1)
gen scalar_error = rnormal()+1
gen price_obs = price_collect*(1+entry_error*scalar_error)
label var price_obs "Recorded Price"
* In addition, 35% of your price data was never collected
gen missing = rbinomial(1,.35)
drop if missing==1
* Create a globabl to store
global twograph
forv i = 1/6 {
global twograph ${twograph} (line price_obs t if prod_id == `i' & prod_obs==1)
}
twoway $twograph, legend(off) title(Observed price trends for first six products)
keep t price_obs prod_id entry_error
* I am keeping entry error in the data set as a means of comparison though it would not be directly observed.
**************************************************************************
* DONE with simulation
* The question is:
* Can you now with this messy data recover price data that is similar to the original?
* The first thing that we should exploit is the duplicate recorded data.
scatter price_obs t if prod_id == 1, title(It is easy to see individual deviations)
* It is easy to see individual deviations but we do not want to go through all 200 products to identify individually price outliers.
* We want to come up with a system to identify outliers.
* Let's generate a mean by product and time
bysort prod_id t: egen price_mean = mean(price_obs)
* Let's flag any observation that is 120% greater than the mean or 80% less than the mean.
gen flag = (price_mean > price_obs*1.2 | price_mean < price_obs*.8)
* Let's see how it is working:
two (scatter price_obs t if prod_id == 1) ///
(scatter price_obs t if prod_id == 1 & flag==1 , msymbol(lgx)) ///
, title(Some of outliers can be identified just looking at the mean) legend(off)
corr flag entry_error
* Our flag is correlated about 45% with the entry errors. This is good but we can do better.
* I propose that rather than using just the mean that we construct a moving average of prices and see how each entry deviates from the average.
* The only problem is that the moving average command requires xtset and that requires only one entry per time period.
* So, I say we rescale the time variable and add in as if recorded at a different time of the week the observation number.
* We need to newly generate prod_obs since we do not know which observation is missing from each product.
bysort prod_id t: gen prod_obs = _n
gen t2 = t*4 + prod_obs
* xtset sets the panel data panel id and time series level.
xtset prod_id t2
* The command we will be using is "tssmooth"
* It is coded such that by specifying ma it means moving average and window tells Stata how many time periods to count ahead and how many behind in the moving aerage.
* This command can take a little while.
tssmooth ma ma_price_obs=price_obs, window(23 0 23)
* 23 is in effect 5 weeks ahead and 5 weeks behind
* The 0 tells stata not to include inself in that average
* The moving average
two (scatter price_obs t if prod_id == 1) ///
(line ma_price_obs t if prod_id == 1) ///
(line price_mean t if prod_id == 1) ///
, title(The Moving Average is less succeptable to outliers)
* The moving average is more stable than just the time average.
* Let's try flagging using the moving average
cap drop flag2
gen flag2 = (ma_price_obs > price_obs*1.2 | ma_price_obs < price_obs*.8)
two (scatter price_obs t if prod_id == 1) ///
(scatter price_obs t if prod_id == 1 & flag2==1 , msymbol(lgx)) ///
, title(The Moving Average can also be useful) legend(off)
corr flag2 entry_error
* Drop our flagged data
drop if flag2==1
* Collapse to the weekly level
collapse price_obs, by(prod_id t)
label var price_obs "Mean price observed"
global twograph
forv i = 1/6 {
global twograph ${twograph} (scatter price_obs t if prod_id == `i')
}
twoway $twograph, legend(off) title(Observed price trends for first six products)
* The data is looking a lot better but we still clearly have some unwanted outliers.
* We could take advantage of the cross product trends to help identify outliers within product prices.
bysort t: egen ave_price = mean(price_obs)
reg price_obs ave_price if prod_id == 1
predict resid1, residual
reg price_obs ave_price if prod_id == 2
predict resid2, residual
reg price_obs ave_price if prod_id == 3
predict resid3, residual
twoway (line resid1 t if prod_id == 1) ///
(line price_obs t if prod_id == 1) ///
(line resid2 t if prod_id == 2) ///
(line price_obs t if prod_id == 2) ///
(line resid3 t if prod_id == 3) ///
(line price_obs t if prod_id == 3) , title(The residuals are clear indicators of outliers) legend(off)
* Finally, let us drop observations with residuals that are greater than 1.5 standard deviations from the mean.
drop resid?
gen flag = 0
qui forv i=1/200 {
reg price_obs ave_price if prod_id == `i'
predict resid_temp, residual
sum resid_temp
replace flag = ((resid_temp-r(mean)>r(sd)*1.5 | resid_temp-r(mean)
}
* Let's see how it is working:
two (scatter price_obs t if prod_id == 2) ///
(scatter price_obs t if prod_id == 2 & flag==1 , msymbol(lgx)) ///
, title(Now just trying remove some final outliers) legend(off)
* Plotting product 1 pricing relative to outliers.
global twograph
forv i = 1/6 {
global twograph ${twograph} (line price_obs t if prod_id == `i')
}
* Finally dropping the outliers
drop if flag
* One final graph
global twograph
forv i = 1/6 {
global twograph ${twograph} (scatter price_obs t if prod_id == `i')
}
twoway $twograph, legend(off) title(Observed price trends for first six products)
* Not as clean as our first graph but definitely much improved.
No comments:
Post a Comment