The failure of mainstream research to consistently reproduce results has led many to look for faults in current methodologies.
One potential fault that has been identified is that the standard significance levels are too lenient: a rejection threshold of either .05 or .01 is simply not demanding enough.
Statistician
Valen Johnson recently published an article in the Proceedings of the National Academy of Sciences which
recommends a more appropriate standard of rejection of .005 or .001.
In this post I will attempt to examine how such a change
could affect the required sample size for studies.
Before initiating an experimental study, a power
analysis is often done. When possible it uses
a relatively simple closed-form calculation, which works
when the intended analysis is itself relatively simple. More complex methods often require
simulation to do a power analysis, as sketched below.
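For instance, a minimal simulation-based power check for the simple one-sample design used later in this post might look something like the following (the sim.power name and the choice of a one-sided, one-sample t-test are purely illustrative):

# Simulation-based power: draw many fake samples under the assumed effect,
# run the intended test on each, and record how often the null is rejected.
sim.power <- function(SD, tau, alpha, N, reps=10000) {
  rejections <- replicate(reps, {
    y <- rnorm(N, mean=tau, sd=SD)                    # simulate one study
    t.test(y, alternative="greater")$p.value < alpha  # one-sided test of mean > 0
  })
  mean(rejections)  # estimated power
}
# For example, sim.power(SD=2, tau=1, alpha=.05, N=25)
# should return something close to the 80% power targeted below.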
Usually the logic of a power analysis (as far as I know)
goes something like this. Let's say the
anticipated effect size is tau, and we know from previous work that the conditional distribution of
outcomes has a standard deviation of SD. How many people
would be needed to reject the null at our intended level?
Referencing Wikipedia, with Pi being power, tau being the effect size, alpha being the
rejection level, Phi being the standard normal CDF, and z(1-alpha) = Phi^(-1)(1-alpha) being the critical value, and assuming lots of
normality:
Pi(tau) = 1 - Phi(z(1-alpha) - tau*N^.5/SD)
We want to solve for N:
# Phi^(-1)(1 - Pi) = z(1-alpha) - tau*N^.5/SD
# tau*N^.5/SD = z(1-alpha) - Phi^(-1)(1 - Pi) = z(1-alpha) + Phi^(-1)(Pi)
# N = ((SD/tau)*(z(1-alpha) + Phi^(-1)(Pi)))^2
In R, z(1-alpha) = qnorm(1-alpha) = -qnorm(alpha) and Phi^(-1)(Pi) = qnorm(pi), so:

samp.power <- function(SD, tau, pi, alpha)
  ceiling(((SD/tau)*(qnorm(pi) - qnorm(alpha)))^2)
# ceiling() rounds up to a whole number of subjects.
Let's see it in action!
Let's say we have an outcome which we know has SD=2 and we
hope our effect will be at least tau=1. Following standard
practice we require a detection rate (power) of 80% for our power analysis. Let's see what happens when we vary our alpha
level!
samp.power(SD=2, tau=1, pi=.8, alpha=.05)
# 25
samp.power(SD=2, tau=1, pi=.8, alpha=.01)
# 41
samp.power(SD=2, tau=1, pi=.8, alpha=.005)
# 47
samp.power(SD=2, tau=1, pi=.8, alpha=.001)
# 62
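As a rough cross-check, base R's power.t.test answers the same question using the t distribution rather than the normal approximation, so it should suggest sample sizes slightly above the figures here:

# Built-in calculator for the same one-sided, one-sample design;
# the t distribution makes it ask for a few more subjects.
power.t.test(delta=1, sd=2, sig.level=.05, power=.8,
             type="one.sample", alternative="one.sided")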
We can see that tightening our rejection standard from .05
to .001 increases our required sample size by roughly a factor of 2.5, which is
really not that bad. Looking at a
smaller effect size does not change this conclusion; shrinking the effect to tau' = tau/s simply scales the required sample size by a factor of s^2.
# Looking at our equation for N
# N = ((SD/tau)*(qnorm(pi) - qnorm(alpha)))^2
# N = (1/tau * H)^2
# where H = SD*(qnorm(pi) - qnorm(alpha))
# Thus substituting in tau' = tau/s
# N = (s/tau * H)^2 = s^2 * (H/tau)^2
# So let's say s=10, then N(tau') = N(tau)*100 (up to rounding)
samp.power(SD=2, tau=1/10, pi=.8, alpha=.001)
# 6184
This is not to make light of the proposed rejection rate and
its larger sample sizes: increasing
the sample size by roughly a factor of 2.5 could easily
double the cost of a study.
However, being able to say that the chance of an outcome occurring
randomly drops from 1 out of 20 to 1 out of 1,000 might easily be worth the
additional cost.
The nice thing about this new approach would be that it
would still allow for weaker rejections even when the effect size is
smaller than expected or when there is more noise in the sample than expected.
Well, that is what I have to say in support of the
idea. I also have some
reservations. If the stricter
rejection rate really does double the cost of a study, then why not run two
studies instead? Assuming the outcomes of the
studies are independent and identically distributed, the chance that both falsely reject the null at the 5% (1 out of 20)
level is 1 out of 20^2, or 1 out of 400.
.25% (1/400) is not as good as .1%, but it is still pretty
strong.
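To put rough numbers on that trade-off, here is a back-of-the-envelope comparison using the samp.power function and the same SD=2, tau=1, 80% power design as above:

# Chance that two independent studies both falsely reject at the 5% level
.05^2
# 0.0025
# Total subjects for two studies at .05 versus one study at .001
2 * samp.power(SD=2, tau=1, pi=.8, alpha=.05)
# 50
samp.power(SD=2, tau=1, pi=.8, alpha=.001)
# 62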
It makes sense, in a world where we do not know what really
has an effect, to run more, smaller studies and follow them up with larger
studies when we do find an effect. In
addition, there might be factors unique to individual studies, for
whatever reason unobservable and nonreproducible, driving the results.
Say, the researchers introduced bias without intending to. Scaling up the project might have no effect
on removing or controlling that bias.
However, having two different studies run by different research groups
is less likely to reintroduce the same bias.
Overall, I think it may be useful to introduce higher
standards into social science research, especially for non-experimental data
that numerous researchers are examining with different
hypotheses. If there are enough researchers
looking at the data from enough angles, it is improbable that there
will not be at least a few who reject at the 5% level. Imagine that you have 20 research teams each
testing 50 different angles; that is 1,000 different draws.
Assuming the tests are independent, rejecting at
the 5% level would lead researchers to falsely reject 50 null hypotheses on
average. That is a lot of false rejections. Choosing a level of .1%, however, would lead
to only about one false rejection on average. This is a pretty appealing change for a
conservative statistician. However,
once again, there will be many times in which we fail to reject the null even though
our hypotheses are in fact true, because our data or effect sizes are insufficiently
large. Which would we rather have?
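A quick simulation of this scenario makes the point concrete: draw 1,000 independent p-values under true nulls and count how many clear each threshold (the exact counts will bounce around their expectations):

# Under true nulls, p-values are uniform on (0,1)
set.seed(1)
p <- runif(1000)
sum(p < .05)   # expect about 50 false rejections
sum(p < .001)  # expect about 1 false rejection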