Thursday, April 14, 2016

Calculating Average Consumption From One Week of Purchases

A number of large surveys have attempted to quantify household consumption from a limited period of observation. This task can be fairly complex, as directly observing who is consuming what is fraught with difficulty. Rather than use this expensive method, some researchers have substituted more easily observed purchase patterns, inferring that in general households are going to consume what they purchase.

In order to aid in this analysis researchers collect data on both what is purchased and over what period of time it is to be consumed, for instance today (1) or over the next week (7).

Yet purchase patterns can be difficult to work with. Typically household purchases do not map perfectly to household consumption. For one, households can consume stocks from previous weeks. Likewise, households can purchase food to be held in stock for future weeks.

To adjust for the missing consumption, we want to account both for the food items that will not be entirely consumed during the week of observation
(1)
$$ C_{current.purchases} = C_{purchase} \frac{days.remaining.in.observation.period}{days.expected.to.consume}$$

as well as the food items that were purchased the previous week and consumed this week. We can calculate the probability of observing an individual purchase in the following way:
(2)
$$ P_{observing.purchase} = \frac{observation.period}{days.expected.to.consume}$$

If this probability comes out greater than 1, it need only be set to 1: in that case the purchase is consumed within the observation period, so it is likely to appear one or more times in our data.

Now we can combine (1) and (2) by dividing the current purchases by the likelihood of observing those purchases.

(3)
$$ E(C_{current.purchases}) = C_{purchase} \frac{days.remaining.in.observation.period}{days.expected.to.consume}/\frac{observation.period}{days.expected.to.consume}$$
$$=C_{purchase} \frac{days.remaining.in.observation.period}{observation.period}$$

Equation (3) applies when the probability is less than 1; otherwise we can use equation (1).
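As a rough illustration of how equations (1) through (3) fit together, here is a minimal Python sketch. The function and argument names are my own and are not taken from the original simulation. It also caps the fraction in equation (1) at 1 so that a purchase never contributes more than its own quantity; that cap is an assumption added here, not something stated in the equations above.

```python
def observation_probability(observation_period, days_to_consume):
    """Equation (2), with the probability capped at 1."""
    return min(1.0, observation_period / days_to_consume)


def expected_current_consumption(quantity, days_remaining, days_to_consume,
                                 observation_period=7):
    """Equation (1) divided by the capped probability from equation (2).

    When the probability is below 1 this reduces to equation (3);
    when it equals 1 it is just equation (1).
    """
    # Assumption: cap the consumed share at 1 (not in the original equations).
    share_in_window = min(1.0, days_remaining / days_to_consume)
    in_window = quantity * share_in_window
    return in_window / observation_probability(observation_period, days_to_consume)


if __name__ == "__main__":
    # A 30-day purchase made on the first day of a 7-day window.
    print(expected_current_consumption(4, days_remaining=7, days_to_consume=30))
    # A 3-day purchase made with 2 days left in the window.
    print(expected_current_consumption(6, days_remaining=2, days_to_consume=3))
```

The result is a total for the observation period, so dividing it by the number of days observed gives a daily figure.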

Finally in order to calculate average consumption we take the daily average for our estimated expected consumption levels? Right?

Not even close. This only begins to capture the problem as we have multiple purchases often on different days consumed in different patterns throughout the week.


To get closer to the appropriate level of estimated consumption, we need both to infer the missing consumption and to spread the observed consumption across the days it covers. That way, when we look at daily averages, good A purchased on day 1 with an expected consumption period of 1 week is also counted on day 7 alongside good B purchased that day.

To explore how to estimate consumption when only a limited period of time is observed, I have written a simulation testing four methods of estimation. The true consumption level for any individual is 1 unit per day. If multiple goods are consumed, then that 1 unit of consumption is spread across all goods so that only one unit is consumed each day.

Using only one good we get the following results. M1 is just taking the mean consumption if we divide quantity of goods purchased by number of days expected to consume. M2 is adjusting consumption by the inverse of the likelihood of observing that consumption. M3 is spreading consumption across all of the days of the week observed. M4 is both adjusting by likelihood of observations and spreading consumption across days of the week observed.
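To make the difference between M1 and M2 concrete for the one-good case, here is a small Python sketch. It is not the R simulation behind the tables, and it rests on my own assumptions: the household buys `spread` units every `spread` days, the 7-day window opens at a random point in that cycle, and M1 counts the reported daily rate once if any purchase is seen and 0 otherwise. All names are hypothetical.

```python
import random


def simulate_single_good(spread, period=7, n_sims=20000, seed=0):
    """Return (mean M1, mean M2) for one good bought every `spread` days."""
    rng = random.Random(seed)
    quantity = float(spread)               # one unit per day of true consumption
    p_observe = min(1.0, period / spread)  # equation (2), capped at 1
    m1_total = 0.0
    m2_total = 0.0
    for _ in range(n_sims):
        # Days since the last purchase when the observation window opens.
        offset = rng.randrange(spread)
        # A purchase falls on window day t when (offset + t) is a multiple of spread.
        observed = any((offset + t) % spread == 0 for t in range(period))
        m1 = quantity / spread if observed else 0.0  # M1: reported daily rate
        m1_total += m1
        m2_total += m1 / p_observe                   # M2: M1 over the probability
    return m1_total / n_sims, m2_total / n_sims


if __name__ == "__main__":
    for spread in (1, 7, 10, 20):
        m1, m2 = simulate_single_good(spread)
        print(f"spread={spread:2d}  M1={m1:.2f}  M2={m2:.2f}")
```

Under these assumptions M1 averages roughly 7/spread once the spread exceeds the week, while M2 stays near 1.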

Table 1: Sim # is the simulation number, # Items is the number of different food items purchased, and C Spread is the number of days over which consumption of that item is spread. All values are simulated 250 times.

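| Sim # | # Items | C Spread | M1 | M2 | M3 | M4 |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1.00 | 1.00 | 1.00 | 1.00 |
| 2 | 1 | 2 | 1.00 | 1.00 | 1.00 | 1.00 |
| 3 | 1 | 3 | 1.00 | 1.00 | 1.00 | 1.00 |
| 4 | 1 | 5 | 1.00 | 1.00 | 1.00 | 1.00 |
| 5 | 1 | 6 | 1.00 | 1.00 | 1.00 | 1.00 |
| 6 | 1 | 7 | 1.00 | 1.00 | 1.00 | 1.00 |
| 7 | 1 | 8 | 0.88 | 1.01 | 0.88 | 1.01 |
| 8 | 1 | 9 | 0.74 | 0.95 | 0.74 | 0.95 |
| 9 | 1 | 10 | 0.70 | 1.00 | 0.70 | 1.00 |
| 10 | 1 | 15 | 0.42 | 0.89 | 0.42 | 0.89 |
| 11 | 1 | 20 | 0.35 | 1.01 | 0.35 | 1.01 |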

Notice that with only 1 item consumed, M1 and M3 are equivalent, as are M2 and M4. All four methods recover the true value of 1 while the consumption spread is no longer than the one-week observation period. Once consumption is spread over more than a week, M2 and M4 provide much better estimates than M1 and M3.

Things get much more difficult when we include other goods in our calculation.

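Table 2: Equivalent to Table 1 except that multiple items are now being purchased at different periods (the number of items is given as # Items). Here C Spread refers only to the first item. The remaining items are drawn randomly from the possible consumption spreads, with much greater weight applied to lower consumption spreads.

| Sim # | # Items | C Spread | M1 | M2 | M3 | M4 |
|---|---|---|---|---|---|---|
| 12 | 2 | 1 | 0.72 | 0.72 | 0.90 | 0.91 |
| 13 | 2 | 2 | 0.71 | 0.71 | 0.90 | 0.91 |
| 14 | 2 | 3 | 0.71 | 0.71 | 0.91 | 0.92 |
| 15 | 2 | 5 | 0.64 | 0.65 | 0.87 | 0.89 |
| 16 | 2 | 6 | 0.66 | 0.67 | 0.83 | 0.84 |
| 17 | 2 | 7 | 0.60 | 0.60 | 0.80 | 0.81 |
| 18 | 2 | 8 | 0.59 | 0.62 | 0.78 | 0.83 |
| 19 | 2 | 9 | 0.59 | 0.65 | 0.74 | 0.83 |
| 20 | 2 | 10 | 0.58 | 0.65 | 0.70 | 0.81 |
| 21 | 2 | 15 | 0.58 | 0.71 | 0.64 | 0.83 |
| 22 | 2 | 20 | 0.54 | 0.69 | 0.59 | 0.83 |
| 23 | 3 | 1 | 0.62 | 0.62 | 0.86 | 0.88 |
| 24 | 3 | 2 | 0.61 | 0.62 | 0.88 | 0.89 |
| 25 | 3 | 3 | 0.58 | 0.59 | 0.87 | 0.88 |
| 26 | 3 | 5 | 0.53 | 0.53 | 0.83 | 0.84 |
| 27 | 3 | 6 | 0.55 | 0.55 | 0.81 | 0.82 |
| 28 | 3 | 7 | 0.50 | 0.51 | 0.78 | 0.79 |
| 29 | 3 | 8 | 0.52 | 0.53 | 0.78 | 0.82 |
| 30 | 3 | 9 | 0.52 | 0.54 | 0.75 | 0.81 |
| 31 | 3 | 10 | 0.50 | 0.54 | 0.72 | 0.80 |
| 32 | 3 | 15 | 0.49 | 0.55 | 0.68 | 0.82 |
| 33 | 3 | 20 | 0.47 | 0.53 | 0.63 | 0.76 |
| 34 | 4 | 1 | 0.58 | 0.58 | 0.84 | 0.85 |
| 35 | 4 | 2 | 0.55 | 0.55 | 0.85 | 0.86 |
| 36 | 4 | 3 | 0.52 | 0.52 | 0.84 | 0.85 |
| 37 | 4 | 5 | 0.48 | 0.48 | 0.81 | 0.82 |
| 38 | 4 | 6 | 0.49 | 0.49 | 0.80 | 0.82 |
| 39 | 4 | 7 | 0.46 | 0.46 | 0.77 | 0.78 |
| 40 | 4 | 8 | 0.46 | 0.48 | 0.76 | 0.79 |
| 41 | 4 | 9 | 0.48 | 0.49 | 0.76 | 0.81 |
| 42 | 4 | 10 | 0.48 | 0.50 | 0.73 | 0.79 |
| 43 | 4 | 15 | 0.45 | 0.49 | 0.70 | 0.79 |
| 44 | 4 | 20 | 0.44 | 0.47 | 0.68 | 0.78 |
| 45 | 5 | 1 | 0.56 | 0.57 | 0.85 | 0.86 |
| 46 | 5 | 2 | 0.52 | 0.52 | 0.84 | 0.85 |
| 47 | 5 | 3 | 0.49 | 0.49 | 0.83 | 0.84 |
| 48 | 5 | 5 | 0.45 | 0.45 | 0.79 | 0.80 |
| 49 | 5 | 6 | 0.45 | 0.46 | 0.78 | 0.79 |
| 50 | 5 | 7 | 0.45 | 0.45 | 0.78 | 0.79 |
| 51 | 5 | 8 | 0.44 | 0.45 | 0.76 | 0.78 |
| 52 | 5 | 9 | 0.44 | 0.45 | 0.76 | 0.79 |
| 53 | 5 | 10 | 0.44 | 0.45 | 0.74 | 0.79 |
| 54 | 5 | 15 | 0.44 | 0.46 | 0.71 | 0.77 |
| 55 | 5 | 20 | 0.42 | 0.45 | 0.69 | 0.76 |
| 56 | 6 | 1 | 0.52 | 0.53 | 0.83 | 0.85 |
| 57 | 6 | 2 | 0.49 | 0.49 | 0.83 | 0.85 |
| 58 | 6 | 3 | 0.47 | 0.48 | 0.82 | 0.83 |
| 59 | 6 | 5 | 0.44 | 0.45 | 0.79 | 0.81 |
| 60 | 6 | 6 | 0.44 | 0.45 | 0.80 | 0.81 |
| 61 | 6 | 7 | 0.43 | 0.44 | 0.78 | 0.79 |
| 62 | 6 | 8 | 0.45 | 0.46 | 0.76 | 0.79 |
| 63 | 6 | 9 | 0.44 | 0.45 | 0.77 | 0.80 |
| 64 | 6 | 10 | 0.42 | 0.44 | 0.73 | 0.78 |
| 65 | 6 | 15 | 0.43 | 0.45 | 0.74 | 0.82 |
| 66 | 6 | 20 | 0.41 | 0.43 | 0.70 | 0.76 |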

When multiple items are consumed simultaneously, spreading consumption out across all days observed becomes increasingly important. This is because daily consumption needs to be calculated as the sum of the goods consumed on each day, averaged across the number of days observed. Thus, while M2 does very well in Table 1, in Table 2 M3 and M4 do much better than either M1 or M2, and M4 does slightly better than any other method at approximating total consumption.
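The idea of summing what is consumed on each day and then averaging over the days observed can be sketched as follows. This is only one reading of "spreading", written in Python with hypothetical names, and it is not claimed to reproduce the M3 or M4 columns of the tables. Each purchase is assumed to be consumed at an even daily rate starting on the day it is bought, and only days inside the observation window are counted.

```python
def spread_daily_average(purchases, period=7):
    """Spread each purchase over the observed days it covers.

    `purchases` is a list of (day, quantity, spread) tuples, with `day`
    counted from 1. Returns (average daily consumption, per-day totals).
    """
    daily = [0.0] * period
    for day, quantity, spread in purchases:
        rate = quantity / spread                 # even daily rate (assumption)
        last_day = min(day + spread - 1, period)  # stop at the end of the window
        for d in range(day, last_day + 1):
            daily[d - 1] += rate
    return sum(daily) / period, daily


if __name__ == "__main__":
    # Good A: 3.5 units bought on day 1, to be consumed over 7 days.
    # Good B: 0.5 units bought on day 7, to be consumed that day.
    average, daily = spread_daily_average([(1, 3.5, 7), (7, 0.5, 1)])
    print(daily)
    print(average)
```

In this example good A contributes to every day of the week, so on day 7 it is counted together with good B rather than being attributed only to the day it was purchased.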

Figure 1: Estimator performance given different item consumption spreads. The values shown are estimator values averaged over simulations with between 1 and 6 items consumed, with only the first item fixed at that particular spread value. M refers to methods 1 through 4 described above.

It is worth noting that all of the methods underestimate total consumption, though M4 does the best at adjusting for the missing data problem.

There are some things to consider when estimating consumption data in this way. One important point is that if purchases tend to be of goods consumed over a long period of time, then any method other than directly dividing by the expected consumption period is going to give some pretty lumpy values.

For instance, imagine someone buys four liters of oil which they expect to consume over the next 30 days. Sure, on average, in order to account for the oil not observed for the many other similar people who bought their oil in previous periods, you may want to divide the oil not by the thirty days (4L/30days) but by the probability-adjusted value from equation (3).

Thus you get (4L/7days). A purchase like this is observed with probability 7/30, or roughly one week in four, so averaging this person with three similar people who did not happen to purchase oil approximates the population consumption level (4L/7days*1/4 = 1/7 liter per day, against a reported 4L/30days = 2/15). Thus on average, for the population estimate, you are pretty close. However, for that one guy in your data, you now have one person who looks like they are consuming 4/7 of a liter of oil per day.

When screening your data for outliers this oil consumption positively pops out of the page at you. So you figure it is some kind of recording error and replace it using population estimates.

But the problem here is entirely created by the method used to infer consumption levels. If instead you had taken the consumer at his/her reported level and said that average consumption for that individual is 4L/30days or 2/15 liters per day then you would never need to substitute out this particular outlier because it would not exist in the data in the first place.
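The oil arithmetic can be checked with exact fractions. This is just a sketch in Python with variable names of my own choosing. It assumes the purchase is made at the start of the 7-day window and that the share of similar people observed buying oil equals the probability from equation (2).

```python
from fractions import Fraction

quantity = Fraction(4)   # liters purchased
days_to_consume = 30     # reported consumption period
period = 7               # observation window in days

# Taking the consumer at their word: 4L over 30 days.
reported_rate = quantity / days_to_consume

# Probability of observing such a purchase in a given week, equation (2).
p_observe = min(Fraction(1), Fraction(period, days_to_consume))

# Probability-adjusted daily rate assigned to the one observed purchaser.
adjusted_rate = reported_rate / p_observe

# Average over purchasers and non-purchasers (assumed share: p_observe).
population_mean = adjusted_rate * p_observe

print(reported_rate, p_observe, adjusted_rate, population_mean)
```

The population average comes back to the reported daily rate, while the individual purchaser is assigned a rate more than four times higher, which is the outlier described above.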

If you would like to review the R simulation used to generate these results you can find it here.