* Imagine you have data on prices for many products.

* For each of the products you record weekly price information.

clear

set obs 200

gen prod_id = _n

* Each product has a unique average price

gen prod_price = rpoisson(5)/7

* You have data on weekly prices for 200 weeks.

expand 200

bysort prod_id: gen t=_n

label var t "Week"

sort prod_id t

* There is also some seasonal variation

gen seasonal = .2*sin(_pi*t/50)

* As well as a general time trend

gen trend = t*.005

* The first observation is not correlated with anything

gen price = prod_price*2.5 + trend + rpoisson(10)/10 if t==1

replace price = prod_price*2 + trend + seasonal + .7*price[_n-1] + .3*rpoisson(10)/10 if t==2

replace price = prod_price + trend + seasonal + .5*price[_n-1] + .2*price[_n-2] + .3*rpoisson(10)/10 if t==3

replace price = prod_price + trend + seasonal + .3*price[_n-1] + .2*price[_n-2] + .2*price[_n-3] + .3*rpoisson(10)/10 if t==4

replace price = prod_price + trend + seasonal + .3*price[_n-1] + .175*price[_n-2] + .125*price[_n-3] + .1*price[_n-4] +.3*rpoisson(10)/10 if t>4

* Create a globabl to store

global twograph

forv i = 1/6 {

global twograph ${twograph} (line price t if prod_id == `i')

}

twoway $twograph, legend(off) title(True price trends for first six products)

* Now let's imagine that the above generated data is the true price information which is fundamentally unobservable.

* Instead you have multiple collections of data per week on prices which each vary by some random addative error.

expand 3

bysort prod_id t: gen prod_obs = _n

gen price_collect = price + rnormal()*.25

* However the price information that you have has some entries that 10% have been mistakenly entered wrong.

gen entry_error = rbinomial(1,.1)

gen scalar_error = rnormal()+1

gen price_obs = price_collect*(1+entry_error*scalar_error)

label var price_obs "Recorded Price"

* In addition, 35% of your price data was never collected

gen missing = rbinomial(1,.35)

drop if missing==1

* Create a globabl to store

global twograph

forv i = 1/6 {

global twograph ${twograph} (line price_obs t if prod_id == `i' & prod_obs==1)

}

twoway $twograph, legend(off) title(Observed price trends for first six products)

keep t price_obs prod_id entry_error

* I am keeping entry error in the data set as a means of comparison though it would not be directly observed.

**************************************************************************

* DONE with simulation

* The question is:

* Can you now with this messy data recover price data that is similar to the original?

* The first thing that we should exploit is the duplicate recorded data.

scatter price_obs t if prod_id == 1, title(It is easy to see individual deviations)

* It is easy to see individual deviations but we do not want to go through all 200 products to identify individually price outliers.

* We want to come up with a system to identify outliers.

* Let's generate a mean by product and time

bysort prod_id t: egen price_mean = mean(price_obs)

* Let's flag any observation that is 120% greater than the mean or 80% less than the mean.

gen flag = (price_mean > price_obs*1.2 | price_mean < price_obs*.8)

* Let's see how it is working:

two (scatter price_obs t if prod_id == 1) ///

(scatter price_obs t if prod_id == 1 & flag==1 , msymbol(lgx)) ///

, title(Some of outliers can be identified just looking at the mean) legend(off)

corr flag entry_error

* Our flag is correlated about 45% with the entry errors. This is good but we can do better.

* I propose that rather than using just the mean that we construct a moving average of prices and see how each entry deviates from the average.

* The only problem is that the moving average command requires xtset and that requires only one entry per time period.

* So, I say we rescale the time variable and add in as if recorded at a different time of the week the observation number.

* We need to newly generate prod_obs since we do not know which observation is missing from each product.

bysort prod_id t: gen prod_obs = _n

gen t2 = t*4 + prod_obs

* xtset sets the panel data panel id and time series level.

xtset prod_id t2

* The command we will be using is "tssmooth"

* It is coded such that by specifying ma it means moving average and window tells Stata how many time periods to count ahead and how many behind in the moving aerage.

* This command can take a little while.

tssmooth ma ma_price_obs=price_obs, window(23 0 23)

* 23 is in effect 5 weeks ahead and 5 weeks behind

* The 0 tells stata not to include inself in that average

* The moving average

two (scatter price_obs t if prod_id == 1) ///

(line ma_price_obs t if prod_id == 1) ///

(line price_mean t if prod_id == 1) ///

, title(The Moving Average is less succeptable to outliers)

* The moving average is more stable than just the time average.

* Let's try flagging using the moving average

cap drop flag2

gen flag2 = (ma_price_obs > price_obs*1.2 | ma_price_obs < price_obs*.8)

two (scatter price_obs t if prod_id == 1) ///

(scatter price_obs t if prod_id == 1 & flag2==1 , msymbol(lgx)) ///

, title(The Moving Average can also be useful) legend(off)

corr flag2 entry_error

* Drop our flagged data

drop if flag2==1

* Collapse to the weekly level

collapse price_obs, by(prod_id t)

label var price_obs "Mean price observed"

global twograph

forv i = 1/6 {

global twograph ${twograph} (scatter price_obs t if prod_id == `i')

}

twoway $twograph, legend(off) title(Observed price trends for first six products)

* The data is looking a lot better but we still clearly have some unwanted outliers.

* We could take advantage of the cross product trends to help identify outliers within product prices.

bysort t: egen ave_price = mean(price_obs)

reg price_obs ave_price if prod_id == 1

predict resid1, residual

reg price_obs ave_price if prod_id == 2

predict resid2, residual

reg price_obs ave_price if prod_id == 3

predict resid3, residual

twoway (line resid1 t if prod_id == 1) ///

(line price_obs t if prod_id == 1) ///

(line resid2 t if prod_id == 2) ///

(line price_obs t if prod_id == 2) ///

(line resid3 t if prod_id == 3) ///

(line price_obs t if prod_id == 3) , title(The residuals are clear indicators of outliers) legend(off)

* Finally, let us drop observations with residuals that are greater than 1.5 standard deviations from the mean.

drop resid?

gen flag = 0

qui forv i=1/200 {

reg price_obs ave_price if prod_id == `i'

predict resid_temp, residual

sum resid_temp

replace flag = ((resid_temp-r(mean)>r(sd)*1.5 | resid_temp-r(mean)

}

* Let's see how it is working:

two (scatter price_obs t if prod_id == 2) ///

(scatter price_obs t if prod_id == 2 & flag==1 , msymbol(lgx)) ///

, title(Now just trying remove some final outliers) legend(off)

* Plotting product 1 pricing relative to outliers.

global twograph

forv i = 1/6 {

global twograph ${twograph} (line price_obs t if prod_id == `i')

}

* Finally dropping the outliers

drop if flag

* One final graph

global twograph

forv i = 1/6 {

global twograph ${twograph} (scatter price_obs t if prod_id == `i')

}

twoway $twograph, legend(off) title(Observed price trends for first six products)

* Not as clean as our first graph but definitely much improved.

## No comments:

## Post a Comment