The failure of mainstream research to consistently reproduce results has led many to look for faults in current methodologies.
One potential fault that has been identified is that conventional significance thresholds are too lenient: a standard rejection level of .05 or .01 is not strict enough.
Statistician Valen Johnson recently published an article in the Proceedings of the National Academy of Sciences which recommends a more appropriate rejection standard of .005 or .001.
In this post I will attempt to examine how such a change could affect the required sample size for studies.
Before starting an experimental study, a power analysis is often done. When relatively simple methods are intended, it can use a simple closed-form expression. More complex methods often require simulations to do a power analysis.
Usually the logic of a power analysis (as far as I know) goes something like this. Let's say the possible effect size is tau and we know (from previous work) that the conditional distribution of outcomes has a standard deviation of SD. How many people would we need in order to reject the null at our intended level?
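To give a rough idea of what a simulation-based power analysis can look like, here is a minimal sketch for a one-sided z-test with a known standard deviation. The helper name sim.power, its arguments, and the example values (n=30, effect=0.5, SD=1) are all made up for illustration; they are not from any package.

```r
# Hypothetical helper: estimate the power of a one-sided z-test by simulation.
# Assumes normal outcomes with known standard deviation SD.
sim.power <- function(n, effect, SD, alpha, reps = 10000) {
  crit <- qnorm(1 - alpha)                    # upper critical value
  rejections <- replicate(reps, {
    x <- rnorm(n, mean = effect, sd = SD)     # data drawn under the alternative
    z <- mean(x) / (SD / sqrt(n))             # z statistic against a null mean of 0
    z > crit                                  # TRUE when the null is rejected
  })
  mean(rejections)                            # share of simulated studies that reject
}

set.seed(1)
est <- sim.power(n = 30, effect = 0.5, SD = 1, alpha = 0.05)

# Closed-form power for the same setup, for comparison
exact <- 1 - pnorm(qnorm(1 - 0.05) - 0.5 * sqrt(30) / 1)
c(simulated = est, closed.form = exact)
```

In a simple case like this the two numbers should agree closely, which is why the closed form is preferred whenever it is available.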
With Pi being power, tau being effect size, and alpha being the significance level of a one-sided test. Assuming lots of normality, and writing CDFNORMAL for the standard normal CDF and Z(alpha) for the upper critical value CDFNORMAL^(-1)(1 - alpha):
Pi(tau) = 1 - CDFNORMAL(Z(alpha)-tau*N^.5/SD)
We want to solve for N:
# CDFNORMAL^(-1)(1 - Pi) = Z(alpha)-tau*N^.5/SD
# tau*N^.5/SD = Z(alpha) - CDFNORMAL^(-1)(1 - Pi)
# N = ((SD/tau)*(Z(alpha) - CDFNORMAL^(-1)(1 - Pi)))^2
# qnorm is the inverse normal CDF; qnorm(1 - alpha) is the upper critical value Z(alpha)
samp.power <- function(SD, tau, pi, alpha)
  ceiling(((SD/tau)*(qnorm(1 - alpha) - qnorm(1 - pi)))^2)
# added the ceiling function to round up.
Let's see it in action!
Let's say we have an outcome which we know has a SD=2 and we hope our effect will have at least a size of tau=1. Following standard practices we require a detection rate of 80% for our power analysis. Let's see what happens when we vary our alpha rate!
samp.power(SD=2, tau=1, pi=.8, alpha=.05)
samp.power(SD=2, tau=1, pi=.8, alpha=.01)
samp.power(SD=2, tau=1, pi=.8, alpha=.005)
samp.power(SD=2, tau=1, pi=.8, alpha=.001)
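The four calls above can also be collected into one small table. This is only a sketch: n.required is a hypothetical stand-in name, and it assumes a one-sided test under the normal approximation with both quantiles taken from qnorm.

```r
# Hypothetical helper: required N for a one-sided z-test (normal approximation)
n.required <- function(SD, tau, pi, alpha) {
  ceiling(((SD / tau) * (qnorm(1 - alpha) - qnorm(1 - pi)))^2)
}

alphas <- c(.05, .01, .005, .001)
n <- sapply(alphas, function(a) n.required(SD = 2, tau = 1, pi = .8, alpha = a))

# Required N at each alpha, and how it compares with the .05 baseline
data.frame(alpha = alphas, N = n, ratio = n / n[1])
```

Under these assumptions N comes out as 25, 41, 47 and 62, so the last row has a ratio of about 2.5.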
We can see that tightening our rejection standard from .05 to .001 multiplies the required sample size by roughly 2.5 (with both quantiles computed by qnorm, N goes from 25 to 62), which is really not that bad. Looking at a smaller effect size changes nothing except that N scales by the square of the shrink factor s, where tau'=tau/s.
# Looking at our equation for N
# N = ((SD/tau)*K)^2, where K depends only on pi and alpha, not on tau
# N = (1/tau * H)^2
# where H = SD*K
# Thus substituting in tau' = tau/s
# N = (1/(tau/s) * H)^2 = s^2 * (H/tau)^2
# So let's say s=10 then N(tau')=N(tau)*100
samp.power(SD=2, tau=1/10, pi=.8, alpha=.001)
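The s^2 scaling can be checked numerically. The sketch below uses a hypothetical unrounded helper, n.exact, so that the ceiling does not blur the ratio. It assumes the same one-sided normal approximation as above.

```r
# Hypothetical helper: required N before rounding up
n.exact <- function(SD, tau, pi, alpha) {
  ((SD / tau) * (qnorm(1 - alpha) - qnorm(1 - pi)))^2
}

s <- 10
base  <- n.exact(SD = 2, tau = 1,     pi = .8, alpha = .001)
small <- n.exact(SD = 2, tau = 1 / s, pi = .8, alpha = .001)

small / base   # equals s^2 = 100 up to floating point error
```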
This is not to make light of the new proposed rejection level and its larger sample size. By increasing the sample size by a factor of roughly 2.5, the cost of a study might easily double.
However, for a researcher, being able to say that the chance of the outcome occurring randomly has gone from 1 out of 20 to 1 out of 1000 might easily be worth the additional cost.
The nice thing about this new approach is that it would still allow for weaker rejections even when the effect size is smaller than expected or when there is more noise in the sample than expected.
Well, that is what I have to say in support of the idea. I also have some reservations. If the cost of the stricter rejection level is really a doubling of the cost of the study, then why not do two studies? Assuming the two studies are independent and the null is true, the probability that both reject at the 5% level (1 out of 20) is 1 out of 20^2, or 1 out of 400.
.25% (1/400) is not as good as .1% but it is still pretty strong.
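The two-study arithmetic is short enough to write out. This is a sketch that assumes the two studies are fully independent. The joint power line at the end is my own addition under that same assumption, with each study designed for 80% power.

```r
alpha <- 0.05

# Probability that two independent studies both falsely reject a true null
p.both <- alpha^2
p.both        # 0.0025, i.e. .25%
1 / p.both    # 1 out of 400

# Compare with the single-study .1% standard
p.both / 0.001   # 2.5 times as likely as a .1% false rejection

# Assumption: each study has 80% power, so both detect a real effect with
# probability 0.8^2
power.both <- 0.8^2
power.both    # 0.64
```

So requiring two rejections buys most of the protection against false positives, at the price of lower joint power unless each study is made larger.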
It makes sense, in a world where we do not know what really has an effect, to have more small studies which we follow up with larger studies when we do find an effect. In addition, there might be factors unique to individual studies which are for whatever reason unobservable and nonreproducible, driving the results.
Say the researchers introduced bias without intending to. Scaling up the project might do nothing to remove or control that bias. However, two different studies run by different research groups are less likely to reintroduce the same bias.
Overall, I think it may be useful to introduce higher standards into social science research, especially for non-experimental data in which numerous researchers are looking at the data with different hypotheses. If enough researchers look at the data from enough angles, it is improbable that there will not be at least a few who reject at a 5% level. Imagine that you have 20 research teams each picking 50 different angles: that is 1000 different draws.
Assuming they are independent and every null is true, testing at a 5% level would lead researchers to falsely reject 50 null hypotheses on average. That is a lot of false rejections. Choosing a level of .1%, however, would lead to only one false rejection on average across all the teams. This is a pretty appealing change for a conservative statistician. However, once again there will be many times in which we fail to reject the null when our hypotheses are in fact true but our sample or effect size is insufficiently large. Which would we rather have?
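The counts above follow from simple expected-value arithmetic, sketched here under the assumption that all 1000 tests are independent and every null is true. The variable names are only illustrative.

```r
draws <- 1000   # total number of independent tests across all teams

# Expected number of false rejections at each level
expected.05  <- draws * 0.05    # 50
expected.001 <- draws * 0.001   # 1

# Probability of at least one false rejection somewhere among the draws
any.05  <- 1 - (1 - 0.05)^draws    # essentially 1
any.001 <- 1 - (1 - 0.001)^draws   # about 0.63

c(expected.05 = expected.05, expected.001 = expected.001,
  any.05 = any.05, any.001 = any.001)
```

Even at the .1% level, a false rejection somewhere among the 1000 draws is more likely than not, although it is one instead of fifty.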