Friday, October 9, 2015

Flip a fair coin 4x. Probability of H following H is 40%???

A recent working paper argues for the existence of the Hot Hand (in basketball), a concept psychologists had dismissed decades ago. The hot hand is the idea that a player has a higher likelihood of scoring the next basket if the last three shots were successful. (In NBA Jam, that is when your hands caught on fire.)

The first paragraph of the paper reads, "Jack takes a coin from his pocket and decides that he will flip it 4 times in a row, writing down the outcome of each flip on a scrap of paper. After he is done flipping, he will look at the flips that immediately followed an outcome of heads, and compute the relative frequency of heads on those flips. Because the coin is fair, Jack of course expects this empirical probability of heads to be equal to the true probability of flipping a heads: 0.5. Shockingly, Jack is wrong. If he were to sample one million fair coins and flip each coin 4 times, observing the conditional relative frequency for each coin, on average the relative frequency would be approximately 0.4."

In other words $$P(H_t|H_{t-1}) \ne .5$$ even though $$P(H)=.5$$

If you are anything like me, you will say, "WTF, that can't be true!"

Before getting any further into the logic of the paper, let us do some simulations.

First off, could this be the result of looking only after heads values?

That is, perhaps by selecting only the flips that come after heads we are reducing the number of heads available.

But, no, this does not make sense!

# x is a binary random variable: one million draws with a .5 probability 
# of success.
x <- rbinom(10^6,1,.5)
mean(x)
# Pretty much as close to 50% as you can get.
# So I ended up flipping the coin 252 times with the following results (1=heads)
myx <- c(1,1,0,1,1,1,1,1,0,1,1,1,0,1,0,1,1,1,1,1,1,1,0,0,0,1,0,1,
# We can see we drew 131 heads or 51.9% 
# This is very close to 50%. What is the likelihood of heads coming 
# up this many times on a fair coin?
plot(100:152,pbinom(100:152, 252, .5), type='l', lwd=3, 
     main="CDF of binomial(252,.5)", 
     xlab="# of Heads")
abline(v=sum(myx), lwd=3, lty=2) 
# The likelihood of getting more than the observed 131 heads, given that the 
# coin is fair, is about 24.4%
1-pbinom(sum(myx), 252, .5)
# I am fairly confident therefore that the coin is fair. 
# However, it could be somewhat unfair and still achieve this outcome 
# without straining the bounds of probability. 
# But this is not the point.
# Now let's look at the claim. The probability of observing a heads 
# after tossing a heads is argued to be .4.
# So let's imagine a million fair coin sequences of length 4: call it set Y.
# Each element of Y is, for one sequence, the relative frequency of heads 
# among the flips that immediately follow a heads.
# I write the function nhead for this purpose.
nhead <- function(flips=4, N=10^6, prev=1)  {
  Y <- rep(NA,N)
  for (i in 1:N) {
    # Draw 'flips' fair coin tosses
    x <- rbinom(flips,1,.5)
    # Keep only the flips that immediately follow a heads and record the
    # share of heads among them (NA if no flip follows a heads)
    Y[i] <- mean(x[-1][x[-length(x)]==prev])
  }
  Y
}
# Now let's take the average across the coins
mean(nhead(4,10^6), na.rm=TRUE)
# Damn, sure enough! The average relative frequency of heads is 40.8%
# Let's see if the empirical results agree. We can think of the 
# 252 previous coin flips as 63 four-flip trials.
myset <- matrix(myx, 4)
myY <- rep(NA,63)
# Now calculate the relative frequency of heads after a heads in each set
for (i in 1:ncol(myset)) myY[i] <- mean(myset[-1,i][myset[-4,i]==1])
# Pooling across sets
mean(myY, na.rm=TRUE)
# We get 34.2%, which could be interpreted as an unluckily low deviation 
# from 50%, except that it is surprisingly close to 40%.
# The scores sum to 19.5 
sum(myY, na.rm = TRUE)
# However, the total number of trials that returned a result is 57.
# If we believed that the true expected frequency of heads was .5, then the 
# likelihood of a total of 19 or fewer appearing over 57 trials is less than 1%.
pbinom(sum(myY, na.rm = TRUE), sum(!is.na(myY)), .5)
# This seems compelling and surprising evidence before even looking at 
# the arguments presented in the paper.
# Looking at the paper, Table 1 is quite convincing. It is organized 
# by # of Heads.
#    Heads            Sequence        P(H|H)
#1   0                TTTT            -
#2   1                TTTH            -
#3   1                TTHT            0
#4   1                THTT            0
#5   1                HTTT            0
#6   2                TTHH            1
#7   2                THTH            0
#8   2                THHT            1/2
#9   2                HTTH            0
#10  2                HTHT            0
#11  2                HHTT            1/2
#12  3                THHH            1
#13  3                HTHH            1/2
#14  3                HHTH            1/2
#15  3                HHHT            2/3
#16  4                HHHH            1
# To come up with the P(H|H) we take the average of the feasible outcomes:
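# (A quick check of that average, using the P(H|H) column from the table above)
mean(c(0, 0, 0, 1, 0, 1/2, 0, 0, 1/2, 1, 1/2, 1/2, 2/3, 1))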
# Sure enough! 40.47
# So what is driving this? Selection bias certainly! But from where?
# Is it the two sequences that cannot be scored because no flip follows a heads?
# Even if we score those two sequences as heads, this does not restore .5:
# 48.9
# So it must be a feature of short sequences, since we know that in one very 
# long sequence of flips the conditional frequency comes out (near) 1/2.
# Let's see what we can discover
I <- 60
Y <- rep(NA, I-1)
for (i in 2:I) Y[i-1] <- mean(nhead(i,10^5), na.rm=TRUE)
plot(2:I, Y, cex=1, lwd=2, 
     xlab="# of coins")
abline(h=.5, lwd=3) 
# This graph can be quite disorienting. 

It shows that only in the case of flipping two coins is the probability of observing a head after the first head equal to .5 (well, .5 up to simulation error).

There is still a noticeable divergence from .5 even in the case of 60 coins being flipped!

This is so non-intuitive as to be downright disorienting.

Does this mean that if someone is flipping four coins and one of them pops up heads, you can bet them $10 that the next result is a tails and feel confident the odds are in your favor?

The answer is: not quite. We can see this by looking at our probability table, with the sequences arranged by where the first head appears. I have added columns for the outcome of the flip immediately following the first, second, and third head in each sequence.

Heads    Sequence    P(H|H)    H after 1st H    H after 2nd H    H after 3rd H
  0        TTTT        -
  1        TTTH        -
  2        TTHH        1             1
  1        TTHT        0             0

  2        THTH        0             0
  1        THTT        0             0
  3        THHH        1             1                1
  2        THHT        1/2           1                0

  1        HTTT        0             0
  2        HTTH        0             0
  2        HTHT        0             0                0
  3        HTHH        1/2           0                1
  3        HHTH        1/2           1                0
  2        HHTT        1/2           1                0
  3        HHHT        2/3           1                1                0
  4        HHHH        1             1                1                1

Thus we can see that, no matter the sequence, the likelihood that the flip following the first head is a head is .5: down that column there are an equal number of heads and tails.

How about the second head? Looking over the table we can see the same thing, and likewise for the third head.

From this we can see that in part our intuition is correct.

Knowing the previous outcome sequence does not provide information on future outcomes. THANK THE GODS!
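As a quick sanity check of that claim, here is a small simulation (this helper function is my own addition, not part of the paper): it scores only the single flip that immediately follows the first head of each four-flip sequence.

# Score only the flip that immediately follows the FIRST head of each sequence
first_head_next <- function(N = 10^5, flips = 4) {
  out <- rep(NA, N)
  for (i in 1:N) {
    x <- rbinom(flips, 1, .5)
    fh <- which(x == 1)[1]                 # position of the first head (NA if none)
    if (!is.na(fh) && fh < flips) out[i] <- x[fh + 1]
  }
  mean(out, na.rm = TRUE)
}
# Comes out very close to .5
first_head_next()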

To get some insight into what is happening, you have to think about how we are scoring each sequence. After each flip of a head we score either a 0 or a 1. If this is the end of a sequence then we stop there and things work fine: TH (-), TT (-), HT (0), HH (1) -> P(H|H) = 1/2.

If it is not the end of a sequence let's see what happens.
THT (0), THH (1), TTT (-), TTH (-), {this part looks okay so far}

HTT (0), HTH (0), HHT (1/2), HHH (1).
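Since each three-flip sequence is equally likely, averaging the six scorable ones gives $$\frac{0 + 1 + 0 + 0 + \tfrac{1}{2} + 1}{6} = \frac{5}{12} \approx .42,$$ already below one half.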

Now we can see clearly what is happening with the coin flip!
The previously unscorable series TH now becomes scorable. No problem.

The unscorable series TT remains unscorable.

And the zero scored series HT remains zero scoring!

So, to review: we now have a previous sequence TH which was unscorable and now has a probability of heads of .5 (confirming our expectations). We had a sequence TT which was discarded and remains discarded. And we had a sequence HT which was scored 0 and remains scored 0.

So all of the action must be happening in the HH part of the series!

And we can now think about what is happening. The problem is in the averaging across heads within a series. By flipping HH, the next flip must be evaluated and averaged in. With HT the next flip does not matter; the sequence will be scored 0 no matter what.

However, with HH, the next flip can be either H or T, which gives HHT (scored (1+0)/2 = .5) and HHH (scored (1+1)/2 = 1), so $$\text{mean}(P(H|HH)) = .75$$. Thus effectively all other probabilities in the sequence remain the same, except that what was previously P(H|HH) = 1 is now, on average, .75.

But why? There is a selection issue going on. In the case of HH we are asking what the next flip is and averaging our previous positive score together with it. In the case of, say, HT, we are not asking what the next result is and averaging. Thus the isolated negative outcomes keep their full weight while some positive outcomes are downgraded, producing an overall selection effect which reduces the measured likelihood of seeing a head after a previous head.

Now that we can see why this is happening, why does it ever go away? The truth is that it never fully goes away. The deviation from .5 just gets very small as the sequence length gets large, because the selective scoring (of the last value scored) affects a smaller and smaller portion of the series.

I am tempted to say something about the larger implications for time series analysis. However, I don't really know how well this sampling issue is known to time series people. I do not recall hearing of it in my meager studies of the matter, but frankly that could mean very little.

Friday, March 13, 2015

Open Data - Stack Exchange

As a fan of Stack Exchange I am very excited about the new exchange being made available. This exchange is called 'Open Data' and allows users to post questions about where to find publicly available data on a topic; other users then respond with one or more answers.

Open Data - Stack Exchange

Friday, February 6, 2015

Calculating the likelihood of item overlap with Base R Assessment

I recently received an email from someone who had taken my Base R Assessment. The test taker reported that when he took the test a second time, a large portion of the items (around 50%) were duplicates of items from his first attempt.

I began wondering what the likelihood of this occurring would be.

The test selects between 10-20 items each time the test is taken.

These items are selected randomly.

There are around 80 items available in total.

First let us do some back of the envelope calculations.

1. Let's say the average number of test items is 15. Thus the first test covers, on average, 15/80 = 18.75% of the item pool.
2. If the same person were to take the test again, then we would expect around 18.75% * 15 = 2.8 of the items to be duplicates (a quick R check follows below).
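A quick back-of-the-envelope check of that arithmetic in R (the variable names are just for illustration):

avg_items <- 15                  # assumed average test length
pool_size <- 80                  # approximate size of the item pool
p_seen    <- avg_items / pool_size
p_seen                           # 0.1875, share of the pool seen on test one
p_seen * avg_items               # about 2.8 expected duplicate items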

We might want to know instead what the chance is of, say, 50% of the items administered in the second test being duplicates.

In order to do this calculation we would need to come up with a list of all of the combinations of items which result in 50% overlap.

For instance, let us say that we have a ten item exam. The items that overlap are 1:5 and the items that do not are 6:10.

The probability of this outcome occurring is {first the overlap}*{those that do not overlap} : 1/80 * 1/79 * 1/78 * 1/77 * 1/76 * 74/75 * 73/74 * 72/73 * 71/72 * 70/71

Which results in a really small number. Then we would calculate the probability of the first 4 matching and last items matching: 1/80 * 1/79 * 1/78 * 1/77 * 75/76 * 74/75 * 73/74 * 72/73 * 71/72 * 1/71

Next the first three and last two, and so on, for every possible combination. This process becomes rather tedious and time-consuming pretty quickly.
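As an aside, this enumeration is exactly what the hypergeometric distribution packages up for us. A minimal sketch with base R's dhyper/phyper, assuming for simplicity that both tests contain exactly 10 items drawn from the 80-item pool:

# Probability that exactly 5 of the 10 second-test items were also among the
# 10 first-test items
dhyper(5, m = 10, n = 70, k = 10)
# Probability of an overlap of 5 or more
1 - phyper(4, m = 10, n = 70, k = 10)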

However, if you are willing to accept a little approximation error, then a much easier approach is to put together a little simulation.

Let us simulate instead the process x number of times, say 100,000.

Nsim    <- 10^5
overlap <- rep(NA,Nsim)
testMin <- 10
testAdd <- 10
for (i in 1:Nsim) {
  # Each test has between 10 and 20 items
  testL1 <- testMin + sample(0:testAdd, 1)
  testL2 <- testMin + sample(0:testAdd, 1)
  # Items are drawn at random without replacement from the 80-item pool
  first  <- sample(80, testL1)
  second <- sample(80, testL2)
  # Share of second-test items that also appeared on the first test
  overlap[i] <- mean(second %in% first)
}
# Generate a density curve of overlap. The mode is around .2, which is just 
# about where we expected the average to be.
plot(density(overlap))
mean(overlap)
# [1] 0.1877303
# Now we can use a built in function 'ecdf' (empirical cdf) to process our 
# frequencies.
OverCDF <- ecdf(overlap)
# The ecdf function is pretty great! It can turn observations into probability data.
# We can plot our cdf curve. It is a step function because there is only a finite 
# number of possible overlap percentages possible.
plot(OverCDF, verticals = TRUE, do.points = FALSE) 
# We can also use it to calculate the likelihood of a certain level of overlap 
# or less.
# Say, what is the likelihood of 25% or less of the items being repeated?
OverCDF(.25)
# [1] 0.76224 This is a pretty high value.
# What about 18.75% or less of the items being repeated?
OverCDF(.1875)
# [1] 0.53936 Which is pretty close to one half, as we might expect since 
# 18.75% is the average overlap.
# Now we can ask the question: what is the likelihood that the user had 50% or 
# more overlap between exams?
mean(overlap >= .5)
# [1] 0.00212 This is a really small number, and looking at it alone we might 
# think this person must have been exaggerating.
# However, a lot of people have taken the exam (around 1,250 independent tries).
# Let's say each person attempted it twice, which gives us 625 repeat attempts. 
# What is the probability that during one of these attempts someone received 
# 50% or more of the same items?
# [1] 0.7482862 Which gives us a surprisingly large number.
# However, saying everybody took the test twice is pretty unlikely. Let's 
# instead say 100 people took the test twice.
# [1] 0.1912173 Which is a very respectable probability. These numbers indicate 
# that if 100 people took the test twice, there is about a 20% chance that at 
# least one person would end up seeing 50% of the same items on the second test.
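# (One way to get numbers like these, assuming repeat attempts are independent
#  and reusing the simulated probability of a 50%-or-more overlap from above;
#  exact values will vary from simulation to simulation.)
p50 <- mean(overlap >= .5)
1 - (1 - p50)^625   # chance of at least one such event in 625 repeat attempts
1 - (1 - p50)^100   # chance of at least one such event in 100 repeat attempts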
# Conversely, we might want to ask the likelihood that someone taking the test 
# twice would not see any of the same items.
OverCDF(0)
# Giving about 5%.

Tuesday, February 3, 2015

Base R Assessment!

Test your skills with this R-powered assessment of base R knowledge!

Built using Concerto, the R-powered adaptive testing platform, this assessment provides a short but powerful tool for evaluating your base R understanding relative to that of your peers.

Currently the assessment has over seventy items (questions) in the pool, while each individual sees fewer than twenty of these, selected randomly. So each test is unique.

Those who score well enough will be given the chance to contribute their own items to challenge other users.

Wednesday, November 12, 2014

Convergence of a Series

Let us explore using simulation some of the concepts of basic asymptotic theory as presented in Wooldridge 2012, Chapter 3.

Definition: A sequence of nonrandom numbers $$\{a_N : N = 1, 2, \dots\}$$ converges to $$a$$ if, for all $$\epsilon > 0$$, there exists $$N_\epsilon$$ such that if $$N > N_\epsilon$$, then
$$|a_N - a| < \epsilon$$

Paraphrase: A sequence converges to a point a if, for any positive number epsilon you choose (any epsilon greater than zero), there is some point in the series after which the difference between the series and a is always smaller than epsilon.
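For a concrete worked instance (my own example, not Wooldridge's): take $$a_N = 1/N$$, which converges to 0. Given any $$\epsilon > 0$$, choose $$N_\epsilon = \lceil 1/\epsilon \rceil$$; then for every $$N > N_\epsilon$$ we have $$|a_N - 0| = 1/N < \epsilon.$$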

Let us try to make this more concrete.

Imagine the series $$a_N = N^2$$:

Does this series converge on a value a?

Let us pick an epsilon, say 100.
## R ---------------------------
f = function(x) x^2
N = 1:30
plot(N, f(N), type='l', col='lightblue', lwd=4)
## -----------------------------
We can see that no a exists for which our series eventually stays within the positive bound epsilon=100 of a.

This lack of convergence might have been obvious. Let us try a different series.
## R ---------------------------
g = function(x) 10*x^(-99/100) * sin(x) + pi
N = 1:100
plot(N, g(N), type='l', lwd=2)
## -----------------------------

The series g demonstrates a common convergence pattern: you can always pick an epsilon, and there will be a point after which the difference between g(N) and the convergence point is less than epsilon. In this case the series converges to pi. Let us play around with a few epsilons to see whether our hypothesis that the series converges seems to hold true.

## R ---------------------------
e1 = 5
abline(h=pi+e1, lwd=2, col='darkred')
abline(h=pi-e1, lwd=2, col='darkred')
## -----------------------------
We can see that at N<10 there are some points which are not bounded by |g(N)-pi|<e1. However, by the time we get to N>20, all points are easily bounded (at least those we have plotted).

How about a harder epsilon?
## R ---------------------------
e2 = .5
abline(h=pi+e2, lwd=2, col='darkblue')
abline(h=pi-e2, lwd=2, col='darkblue')
## -----------------------------

Around N=20 the series still falls outside these bounds, but by the time we have reached N=40 it no longer does.

## R ---------------------------
e3 = .1
abline(h=pi+e3, lwd=2, col=rgb(.3,.6,1,.5))
abline(h=pi-e3, lwd=2, col=rgb(.3,.6,1,.5))
## -----------------------------
We can see that with e3=.1 the series has not yet settled permanently within our epsilon bounds anywhere on the plotted range. However, plotting the series over a longer range of N will give us a value of N after which g stays bounded.

## R ---------------------------
N = 100:200
plot(N, g(N), type='l', lwd=2)
e3 = .1
abline(h=pi+e3, lwd=2, col=rgb(.3,.6,1,.5))
abline(h=pi-e3, lwd=2, col=rgb(.3,.6,1,.5))
## -----------------------------

We can see that the graph clearly shows that at our small value of epsilon (.1) we are bounded around pi as early as N=110. For all other even smaller epsilons there will always be an N for which
this is true (even if our computers do not have sufficient accuracy to show this to be the case).

However, since we are working with an infinite series, there is always an infinite portion of the graph that we can never plot. We therefore cannot use graphs to prove that a series converges: in theory a series can always break the bounds just beyond the area of inspection. The switch function
$$f(N) = 1[N > 100]$$
is an extreme example: the series appears to converge to 0 for N up to 100, but for N > 100 it equals 1, and so it in fact converges to 1.
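A quick plot of that switch function (this snippet is my own illustration, written in the same style as the code above):
## R ---------------------------
s = function(x) as.numeric(x > 100)
N = 1:200
plot(N, s(N), type='s', lwd=2)
## -----------------------------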

Other series converge without oscillating:

Some more challenging series which also converge include:

Monday, November 3, 2014

Make your own hotly criticised circle graph!!!

Make your own hotly criticized circle graph! The recent critical post on R-bloggers got me wanting to make my own bubble graphs despite the popular criticism of the graph. After some attempts I have produced something that looks similar but needs some more work.

I would like to say briefly that the R-bloggers post, while interesting and reasonable, seemed to miss the point. The graphic was not supposed to convey massive amounts of information but simply to make obvious the scope and degree of how the current Ebola epidemic compares with other epidemics.

There are many good reasons to critique this graph from an epidemiological perspective (forthcoming), yet from a graphing perspective I think this format is extremely efficient at conveying memorable information. While the graphs assembled by PirateGrunt were more detailed, they were unfortunately also entirely forgettable.

Here are a couple of examples. The first uses meaningless simulated data to demonstrate how large differences in scale may appear. The second uses actual HIV prevalence data to give a graphical representation of how such rates have changed over the years. You can find the code as well as the data on github.

Table 1: Demonstrates how scale can be communicated by size of circle

Table 2: Demonstrates the relative stability of AIDS in terms of the number of people ill. The top number is the year while the lower number is millions of cases.