I recently received an email from someone who had taken my Base R Assessment. In this email, the test taker reported that a large portion of the items (around 50%) were duplicates when he took the test a second time.

I began wondering what the likelihood of this occurring would be.

The test selects between 10 and 20 items each time it is taken.

These items are selected randomly.

There are around 80 items in total available.

First, let us do some back-of-the-envelope calculations.

1. Let's say the average number of test items is 15. Thus the first test covers on average 15/80 = 18.75% of the item pool.

2. If the same person were to take the test again, then we would expect around 18.75% * 15 = 2.8 items to be duplicates.
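The two steps above can be written out as a few lines of R. This is only a sketch of the arithmetic; the names `pool_size` and `test_len` are illustrative, and the pool size of 80 is the approximate figure quoted above.

```r
pool_size <- 80   # approximate number of items in the pool
test_len  <- 15   # assumed average test length

# Share of the pool seen on the first test.
share_seen <- test_len / pool_size

# Expected number of repeated items on a second test of the same length.
expected_duplicates <- test_len * share_seen

share_seen           # 0.1875
expected_duplicates  # 2.8125
```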

We might instead want to know the chance of, say, 50% of the items administered in the second test being duplicates.

In order to do this calculation we would basically need to come up with a list of all of the combinations of items which result in 50% overlap.

For instance, let us say that both tests have ten items. Suppose the items in positions 1:5 of the second test repeat items from the first test and the items in positions 6:10 do not.

The probability of this outcome occurring is {first the overlap}*{those that do not overlap}: 10/80 * 9/79 * 8/78 * 7/77 * 6/76 * 70/75 * 69/74 * 68/73 * 67/72 * 66/71

Which results in a really small number. Then we would calculate the probability of the first four items and the last item matching: 10/80 * 9/79 * 8/78 * 7/77 * 70/76 * 69/75 * 68/74 * 67/73 * 66/72 * 6/71

Next the first three and last two and so on. You keep doing this with every possible combination. This process becomes rather tedious and time consuming pretty quickly.
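For fixed test lengths, the enumeration described above does not have to be done by hand: the number of repeated items follows a hypergeometric distribution, which base R exposes as `dhyper` and `phyper`. The sketch below assumes a pool of 80 items and two ten item tests; the variable names are illustrative. It also checks that multiplying the probability of one ordering by the number of orderings gives the same answer.

```r
pool_size <- 80  # items in the pool
len1 <- 10       # items on the first test
len2 <- 10       # items on the second test

# dhyper(x, m, n, k): probability of exactly x "marked" items when k items are
# drawn without replacement from m marked and n unmarked items.
p_exactly_5 <- dhyper(5, m = len1, n = pool_size - len1, k = len2)

# Probability of one particular ordering (five repeats, then five new items),
# multiplied by the number of ways to place the five repeats among ten draws.
p_one_ordering <- prod(c(10:6, 70:66) / 80:71)
p_by_orderings <- p_one_ordering * choose(len2, 5)

# The same value by counting sets of items instead of orderings.
p_by_counting <- choose(len1, 5) * choose(pool_size - len1, 5) /
  choose(pool_size, len2)

# Probability of five or more repeats, i.e. P(X > 4).
p_5_or_more <- phyper(4, m = len1, n = pool_size - len1, k = len2,
                      lower.tail = FALSE)

all.equal(p_exactly_5, p_by_orderings)
all.equal(p_exactly_5, p_by_counting)
```

This only covers one pair of test lengths. Because each test length is itself random, the simulation remains the quicker route to an overall answer.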

However, if you are willing to accept a little approximation error, then a much easier approach is to put together a little simulation.

Let us instead simulate the process a large number of times, say 100,000.

```r
Nsim <- 10^5

overlap <- rep(NA, Nsim)

testMin <- 10
testAdd <- 10

for (i in 1:Nsim) {
  # Each test has between 10 and 20 items.
  testL1 <- testMin + sample(0:testAdd, 1)
  testL2 <- testMin + sample(0:testAdd, 1)
  # Draw the items for each test from the pool of 80.
  first  <- sample(80, testL1)
  second <- sample(80, testL2)
  # Share of the second test's items that also appeared on the first test.
  overlap[i] <- mean(second %in% first)
}

# Generate a density curve of overlap. The mode is around .2, which is just around
# where we expected the average to be.
plot(density(overlap))

mean(overlap)
# [1] 0.1877303

# Now we can use a built-in function 'ecdf' (empirical cdf) to process our
# frequencies.
OverCDF <- ecdf(overlap)
# The ecdf function is pretty great! It can turn observations into probability data.

# We can plot our cdf curve. It is a step function because there is only a finite
# number of possible overlap percentages.
plot(OverCDF, verticals = TRUE, do.points = FALSE)

# We can also use it to calculate the likelihood of a certain level of overlap or
# less.

# Say, what is the likelihood of 25% or less of the items being repeated?
OverCDF(.25)
# [1] 0.76224 This is a pretty high value.

# What about 18.75% or less of the items being repeated?
OverCDF(.1875)
# [1] 0.53936 Which is pretty close to

# Now we can ask the question, what is the likelihood that the user had 50% or
# more overlap between exams? (ecdf gives P(X <= x), so strictly speaking this
# is the chance of more than 50% overlap.)
1-OverCDF(.5)
# [1] 0.00212 This is a really small number and looking at it alone we might think
# this person must have been exaggerating.

# However, a lot of people have taken the exam (around 1,250 independent tries).
# Let's say each person attempts twice, which gives us around 625 pairs of
# attempts (the calculation below uses 650). What is the probability that during
# one of these events someone received 50% or more of the same items?
1-(OverCDF(.5))^650
# [1] 0.7482862 Which gives us a surprisingly large number.

# However, saying everybody took the test twice is pretty unlikely. Let's instead
# say 100 people took the test twice.
1-(OverCDF(.5))^100
# [1] 0.1912173 Which are very respectable odds. These numbers indicate that if
# 100 people took the test twice there is a 20% chance that at least one person
# would end up seeing on the second test 50% of the same items.

# Conversely, we might want to ask the likelihood that someone would not see any
# of the same items.
OverCDF(0)
# Giving about 5%.
```
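As a cross-check on the simulation, the same quantities can be computed without sampling by averaging hypergeometric probabilities over every pair of test lengths. This is a sketch under the simulation's own assumptions (a pool of exactly 80 items, each length from 10 to 20 equally likely, independent draws). The helper names `tail_prob` and `at_least_one` are made up for this example. The `strict` argument separates "more than 50%" (what `1-OverCDF(.5)` measures) from "50% or more".

```r
pool_size <- 80
lengths   <- 10:20  # each test length assumed equally likely

# All 121 equally likely (first length, second length) pairs.
grid <- expand.grid(len1 = lengths, len2 = lengths)

# Probability that the share of repeated items on the second test is above
# (strict = TRUE) or at least (strict = FALSE) the threshold.
tail_prob <- function(len1, len2, threshold, strict = TRUE) {
  x <- 0:len2
  p <- dhyper(x, m = len1, n = pool_size - len1, k = len2)
  hit <- if (strict) x / len2 > threshold else x / len2 >= threshold
  sum(p[hit])
}

p_more_than_half <- mean(mapply(tail_prob, grid$len1, grid$len2,
                                MoreArgs = list(threshold = 0.5, strict = TRUE)))
p_half_or_more   <- mean(mapply(tail_prob, grid$len1, grid$len2,
                                MoreArgs = list(threshold = 0.5, strict = FALSE)))

# Expected share of repeats: a hypergeometric count has mean len2 * len1 / pool_size,
# so the expected share is len1 / pool_size, averaged over the first test's length.
expected_share <- mean(grid$len1 / pool_size)

# Chance that at least one of n independent retakes reaches the threshold.
at_least_one <- function(p, n) 1 - (1 - p)^n

p_more_than_half
p_half_or_more
expected_share
at_least_one(p_half_or_more, 100)
```

The value of `expected_share` is exactly 0.1875, which the simulated `mean(overlap)` should approach as `Nsim` grows.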