Friday, October 9, 2015

Flip a fair coin 4x. Probability of H following H is 40%???

A recent working paper has come out arguing for the existence of the Hot Hand (in basketball), a concept psychologists had dismissed decades ago. The hot hand is the idea that a player has a higher likelihood of making the next basket if the last three baskets were made successfully. (In NBA Jam, that is when your hands catch on fire.)

The first paragraph of the paper reads, "Jack takes a coin from his pocket and decides that he will flip it 4 times in a row, writing down the outcome of each flip on a scrap of paper. After he is done flipping, he will look at the flips that immediately followed an outcome of heads, and compute the relative frequency of heads on those flips. Because the coin is fair, Jack of course expects this empirical probability of heads to be equal to the true probability of flipping a heads: 0.5. Shockingly, Jack is wrong. If he were to sample one million fair coins and flip each coin 4 times, observing the conditional relative frequency for each coin, on average the relative frequency would be approximately 0.4."

In other words $$P(H_t|H_{t-1}) \ne .5$$ even though $$P(H)=.5$$

If you are anything like me, you will say, "WTF, that can't be true!"

Before getting any further into the logic of the paper, let us do some simulations.

First off, could this be the result of looking only at the flips that come after heads?

That is, perhaps by selecting only the flips that follow a heads we are somehow reducing the number of heads available to observe.

But, no, this does not make sense!

# x is a binary random variable drawn from 10^6 draws with a .5 probability of success.

x <- rbinom(10^6, 1, .5)

mean(x[-length(x)][x[-1]==1])

# Pretty much as close to 50% as you can get

# To make this feel more concrete for me I ended up flipping a coin 252 times 
# with the following results (1=heads)

myx <- c(1,1,0,1,1,1,1,1,0,1,1,1,0,1,0,1,1,1,1,1,1,1,0,0,0,1,0,1,
         1,1,1,0,1,0,1,1,0,1,1,1,1,0,1,0,1,0,0,0,1,0,1,1,0,1,1,0,
         0,1,1,0,0,0,0,1,1,0,1,0,0,1,0,1,0,1,0,1,0,1,0,0,1,1,0,0,
         1,0,0,1,0,0,1,0,0,0,0,0,0,1,0,1,1,0,1,0,1,1,1,0,0,0,1,0,
         1,0,1,1,0,1,1,1,0,0,0,1,1,1,0,0,0,1,0,1,0,1,0,1,1,1,0,1,
         0,1,1,1,1,0,0,1,0,0,0,0,1,0,1,1,1,0,1,1,1,0,1,0,1,0,0,1,
         0,0,1,1,0,1,1,0,0,1,0,1,1,0,1,0,1,0,0,0,0,1,0,0,1,0,0,0,
         0,1,0,0,0,0,0,1,0,0,1,1,0,1,0,1,1,0,1,1,1,1,0,1,0,0,1,0,
         0,1,0,0,1,1,1,1,0,1,1,1,0,0,0,0,0,1,1,1,1,0,1,1,1,0,1,0)

# We can see we drew 131 heads, or 51.9%
mean(myx); sum(myx)

# This is very close to 50%. What is the likelihood of heads coming
# up this many times on a fair coin?
plot(100:152,pbinom(100:152, 252, .5), type='l', lwd=3, 
     main="CDF of binomial(252,.5)", 
     ylab="Probability",
     xlab="# of Heads")
 
abline(v=sum(myx), lwd=3, lty=2) 
  
# The likelihood of getting more than 131 heads, given the coin is fair, is
# about 24.4%
1-pbinom(sum(myx), 252, .5)
 
# I am therefore fairly confident that the coin is fair.
 
# However, it could be somewhat unfair and still achieve this outcome 
# without straining the bounds of probability. 
 
# But this is not the point.
 
# Now let's look at the claim. The probability of observing a heads 
# after tossing a heads is argued to be .4.
 
# So let's imagine a million fair-coin sequences of length 4. Call this set Y.
 
# This set will be composed of 10^6 values, where each value is the
# relative frequency of heads among the flips of a sequence
# that follow a heads.
 
# I write the function nhead for this purpose
 
nhead <- function(flips=4, N=10^6, prev=1)  {
  Y <- rep(NA,N)
  for (i in 1:N) {
    # Draw four binomial draws
    x <- rbinom(flips,1,.5)
    # Now keep only draws after any heads
    Y[i] <- mean(x[-length(x)][x[-1]==prev]) 
  }
  Y
}
 
# Now let's take the average across the coins
mean(nhead(4,10^6), na.rm=TRUE)
 
# Damn, sure enough! The average proportion of heads following a heads is 40.8%
 
# Let's see if the empirical results agree. We can think of the
# 252 coin flips above as 63 four-flip trials.
myset <- matrix(myx, 4)
 
myY <- Y <- rep(NA,63)
 
# Now let's calculate, within each set, the relative frequency of heads following a heads
for (i in 1:ncol(myset)) myY[i] <- mean(myset[-4,i][myset[-1,i]==1])
 
# Averaging across sets
mean(myY, na.rm=TRUE)
 
# We get 34.2%, which could be interpreted as an unluckily low deviation
# from 50%, except that it is surprisingly close to 40%.
 
# That is 19.5 heads 
sum(myY, na.rm = TRUE)
 
# However the total number of trials that returned a result is 57
sum(!is.na(myY))
 
# If we believed that the true probability of heads was .5, then the
# likelihood of seeing 19 or fewer heads is less than 1%.
pbinom(sum(myY, na.rm = TRUE), sum(!is.na(myY)), .5)
 
# This seems compelling and surprising evidence before even looking at 
# the arguments presented in the paper.
 
# Looking at the paper, Table 1 is quite convincing. It is organized 
# by # of Heads.
#    Heads            Sequence        P(H|H)
#1   0                TTTT            -
#2   1                TTTH            -
#3   1                TTHT            0
#4   1                THTT            0
#5   1                HTTT            0
#6   2                TTHH            1
#7   2                THTH            0
#8   2                THHT            1/2
#9   2                HTTH            0
#10  2                HTHT            0
#11  2                HHTT            1/2
#12  3                THHH            1
#13  3                HTHH            1/2
#14  3                HHTH            1/2
#15  3                HHHT            2/3
#16  4                HHHH            1
 
# To come up with the P(H|H) we take the average of the feasible outcomes:
(0+0+0+1+0+1/2+0+0+1/2+1+1/2+1/2+2/3+1)/14
 
# Sure enough! 40.47
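 
# As a sanity check on that arithmetic, here is a quick brute-force
# enumeration of all 16 sequences (a sketch of my own; the object names
# seqs and phh are arbitrary). It reproduces the 40.47%.

# Enumerate all 16 equally likely four-flip sequences (1 = heads)
seqs <- expand.grid(f1 = 0:1, f2 = 0:1, f3 = 0:1, f4 = 0:1)

# Within each sequence, the relative frequency of heads on the flips
# that immediately follow a heads (NaN when flips 1-3 contain no heads)
phh <- apply(seqs, 1, function(x) mean(x[2:4][x[1:3] == 1]))

# Average over the 14 scorable sequences: about 0.4047
mean(phh, na.rm = TRUE)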
 
# So what is driving this? Selection bias certainly! But from where?
 
# Is it the two values that cannot be computed because no heads are generated?
 
# Even if we specify those as heads this does not fix the probabilities.
(0+0+0+1+0+1/2+0+0+1/2+1+1/2+1/2+2/3+1+1+1)/16
# 47.9
 
# So it must be a feature of short series, since we know that
# when we sample one long series of 10^6 flips we are spot on (near) 1/2.
 
# Let's see what we can discover
 
I <- 60
 
Y <- rep(NA, I-1)
 
for (i in 2:I) Y[i-1] <- mean(nhead(i,10^5), na.rm=TRUE)
 
plot(2:I, Y, cex=1, lwd=2, 
     main="P(H_t|H_t-1)",
     xlab="# of coins")
abline(h=.5, lwd=3) 
 
  
# This graph can be quite disorienting. 

It shows that only in the case of flipping two coins is the probability of observing a head after the first head equal to .5 (well, .5 with measurement error).

There is still a significant divergence from .5 even in the case of 60 coins being flipped!

This is so non-intuitive as to be downright disorienting.

Does this mean that if someone is flipping four coins and one of them pops up heads, you can bet them $10 that the next result is a tail and feel confident the odds are in your favor?

We can see the answer by looking at our probability table, with the sequences grouped by their opening flips.

The answer is: not quite. I have added columns showing the likelihood of seeing a heads after the first, second, and third heads of each sequence (H|1H, H|2H, H|3H).

Heads        Sequence        P(H|H)     H|1H   H|2H   H|3H
  0              TTTT           -
  1              TTTH           -
  2              TTHH           1        1
  1              TTHT           0        0
 

  2              THTH           0        0
  1              THTT           0        0
  3              THHH           1        1      1
  2              THHT           1/2      1      0

  1              HTTT           0        0
  2              HTTH           0        0
  2              HTHT           0        0      0
  3              HTHH           1/2      0      1
  3              HHTH           1/2      1      0
  2              HHTT           1/2      1      0
  3              HHHT           2/3      1      1      0
  4              HHHH           1        1      1      1 


Thus we can see that, no matter the sequence, the likelihood that the flip following the first head is a head is .5. That is, the H|1H column contains an equal number of heads and tails.

How about the flip after the second head (H|2H)? Looking over the table we can see the same thing, and likewise for the third head (H|3H).

From this we can see that in part our intuition is correct. Knowing the previous outcome in the sequence does not provide information on future outcomes. THANK THE GODS!
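
To see the same point in code, here is a small sketch (the object names are mine) that pools the flips following a heads across all 16 equally likely sequences, instead of averaging within-sequence proportions. Pooled this way, exactly half of the flips that follow a heads are themselves heads.

# All 16 equally likely four-flip sequences (1 = heads)
seqs <- as.matrix(expand.grid(f1 = 0:1, f2 = 0:1, f3 = 0:1, f4 = 0:1))

# Collect every flip that immediately follows a heads, across all sequences
followers <- unlist(apply(seqs, 1, function(x) x[2:4][x[1:3] == 1]))

# 24 follower flips in total, 12 of them heads
length(followers); mean(followers)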

To get some insight into what is happening, you have to think about how we are scoring each sequence. After each head, the next flip is scored either 0 or 1; a head at the very end of a sequence is not scored at all. For two flips this works out fine: TH (-), TT (-), HT (0), HH (1) -> P(H|H)=1/2.

If it is not the end of a sequence let's see what happens.
THT (0), THH (1), TTT (-), TTH (-), {this part looks okay so far}

HTT (0), HTH (0), HHT ((1+0)/2=1/2), HHH (1).
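
Averaging the six scorable three-flip sequences already shows the dip below one half: $$\frac{0+1+0+0+\frac{1}{2}+1}{6}=\frac{5}{12}\approx .417$$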

Now we can begin to see what is happening with the coin flip.

The previously unscorable series TH now becomes scorable. No problem.

The unscorable series TT remains unscorable.

And the zero scored series HT remains zero scoring!

To review, we now have a previous sequence TH which was unscorable and now has a probability of heads of .5 (confirming our expectations). We had a sequence TT which was discarded and remains discarded. And we had a sequence HT which was scored 0 and remains scored 0.

So all of the action must be happening in the HH part of the series!

And now we can see what is happening. The problem is the averaging across heads within a series. Flipping HH means that the next flip must be evaluated; with HT, the next flip does not matter, since the sequence will be scored 0 no matter what.

However, with HH, we know that the next flip can be either H or T, giving HHT (scored (1+0)/2=.5) and HHH (scored (1+1)/2=1), so $$mean(H|HH)=\frac{.5+1}{2}=.75$$. Thus effectively all other scores in the table remain the same, except that what was previously P(H|HH)=1 is now P(H|HH)=.75.

But why? Well, there is a selection issue going on. In the case of HH we are asking what the next flip is and averaging our previous positive result together with it. But in the case of, say, HT we are not asking what the next result is and averaging. Thus the isolated negative outcomes are kept as they are while some positive outcomes get downgraded, resulting in an overall selection effect which reduces the observed likelihood of seeing a head after a previous head.

Now that we can see why this is happening, why does it ever go away? The truth is that it never goes away. The deviation from .5 just gets very small as the sequences get long, because the selective scoring (of the last value scored) affects a smaller and smaller portion of each series.
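
If you want exact numbers rather than simulation noise, a short enumeration sketch (exact_phh is just a name I made up) computes the expected within-sequence frequency for any modest sequence length:

# Exact expected within-sequence relative frequency of H following H,
# computed by enumerating all 2^n equally likely sequences
exact_phh <- function(n) {
  seqs <- as.matrix(expand.grid(rep(list(0:1), n)))
  scores <- apply(seqs, 1, function(x) mean(x[-1][x[-n] == 1]))
  mean(scores, na.rm = TRUE)
}

# n = 2 gives 1/2, n = 3 gives 5/12, n = 4 gives 17/42 (about .405);
# longer sequences creep back toward .5, as in the plot above
sapply(c(2, 3, 4, 8, 12), exact_phh)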

I am tempted to say something about the larger implications for time-series analysis. However, I don't really know how well this sampling issue is known to time series people. I do not recall hearing of it in my meager studies of the matter, but frankly that could mean very little.

4 comments:

  1. Your last table shows the 16 possible outcomes (all of which presumably have equal probability).

    Of the 16, 8 have a sequence that includes “HH.” So there's a 50% chance of observing a “hot hand” in coin flipping, if by “hot hand” you mean a pair of heads in a row. The test seems not very stringent, roughly the same test as whether a basketballer shoots successful 3-pointers back to back. Is that a hot hand?

    But if there is a hot hand, there must be a similar period of cold hands, the likelihood of finding a “TT” sequence. Nobody will brag about cold hands, or pay much attention to them, but they are…uh, the flip side of the hot hands argument. Not surprisingly, there are 8 sequences that have a “TT” in them. The chance of a narrowly-defined cold hand is the same 50% as the chance of the narrowly-defined hot hand.

    If a hot hand can't exist without there being a corresponding cold hand, it's notable that only 2 of the sequences have neither HH nor TT (THTH and HTHT). The weakness of looking for only a single pair of back-to-back lined-up results is shown by the fact that it's hard to NOT find a temperature effect by it.

    /ss WaltFrench. Anon is the only ID I can get to work; grr!

  2. The problem here is the averaging. If you look at HHHH vs TTHH for example, both of them have P(H|H) = 1, but there are 3 successes in HHHH and only 1 in TTHH. Instead of averaging across samples, you could add 1 to a global counter if a head followed a head, and 0 otherwise. If you do it this way, you'll see 0.5.

  3. In calculating his 40.7% the author has calculated the probability of the event within different sets of observations and then takes the average of these probabilities. This is effectively taking the average of averages. Bad Math!!

    The correct way to calculate the likelihood of p(HH|H) is to count out the number of instances of H, then see how many of them are followed by another H.

    Let’s examine the 16 permutations of Heads and tails taken 4 at a time again. There are (16*3) / 2 or 24 Heads in the first 3 positions of the 16 sets of coin flips. Of those 24 Events of Heads that appear, 12 are followed by another H, or 50% probability of H following H.

    If the original article was trying to point out the issue of edge conditions then good for them. The calculated probability of a H following an H, counting H’s that are at the end of a set as a failure is 37.5%. This is easily calculated in the following way:

    1. 25% of The H’s that are observed in the strings of 4 observations occur at the end of the strings. These observations have a 0% chance of being followed by an H.
    2. 75% of the H’s do not end a string and are followed by an H 50% of the time.
    3. 25% * 0% + 75% * 50% = 37.5%.

    This edge effect attenuates as the strings get longer, because the edge condition H is a lower percentage of the total number of H’s observed. If the strings of observations were 100 observations long then the likelihood of an H followed by an H would be 49.5%: (1% * 0 + 99%* 50% = 49.5%)

    The message of the article should be that you need to take care when you count initial observations. To calculate the probability of an event occurring, you should only count those observations that occur at a time when they have the opportunity to show the tested-for pattern.

    To bring this home, let’s say we were looking for the pattern of 4 H’s in a row. We expect this to happen 1 time in 8. But it happens only once in our 16 patterns and the count of H’s is 32. 75% of the H’s have a 0% chance of developing into 4 H’s in a row due to the edge condition. If you do not take the edge condition into account, you might think that the likelihood of an H being followed by 3 more H’s is 1/32. (25% * 1/8 + 75% * 0%)

    -- Richard Moore (also as Anonymous because it was the only way to get it entered)

  4. The paper is not clear at all (on a first, quick read of the first 3 pages) about whether the authors show that people incorrectly compute probabilities or whether the authors believe that the incorrect method they derive is correct. A mess of a paper. Very good blog post. And great comment by Richard Moore.
