--- title: "Introduction to Justifying Alpha Levels" author: "Daniel Lakens & Aaron R. Caldwell" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to Justifying Alpha Levels} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) library(Superpower) ``` # Introduction We have recently, August 2020, added 5 new functions to `Superpower` to expand the package's capabilities. In this vignette, we will introduce these new functions and how they can help with study planning. These new functions include: `power.ftest`, `optimal_alpha`, `ANOVA_compromise`, `alpha_standardized`, and `p_standardized` The overall goal is to provide ways for researchers to justify their alpha level when designing studies, and provide greater flexibility. # How are error rates minimized? Within the new `optimal_alpha` function, there are two options within the `error` argument: `minimal` or `balanced`. First, we have an equation for how a weighted combined error rate is calculated when `error = minimal`. `(costT1T2 * x + priorH1H0 * y)/(priorH1H0 + costT1T2)` $WCER = \frac{\frac{cost_\alpha}{cost_\beta} \cdot \alpha + \frac{Pr(H_1)}{Pr(H_0)}\cdot\beta}{\frac{Pr(H_1)}{Pr(H_0)}+\frac{cost_\alpha}{cost_\beta}}$ Second, we have an equation for how a weighted combined error rate is calculated when `error = balanced`. `max((costT1T2 * x - priorH1H0 * y)/(priorH1H0 + costT1T2), (priorH1H0 * y - costT1T2 * x)/(priorH1H0 + costT1T2))` $WCER = max(\frac{\frac{cost_\alpha}{cost_\beta} \cdot \alpha - \frac{Pr(H_1)}{Pr(H_0)}\cdot\beta}{\frac{Pr(H_1)}{Pr(H_0)}+\frac{cost_\alpha}{cost_\beta}} | \frac{ \frac{Pr(H_1)}{Pr(H_0)}\cdot\beta - \frac{cost_\alpha}{cost_\beta} \cdot \alpha }{\frac{Pr(H_1)}{Pr(H_0)}+\frac{cost_\alpha}{cost_\beta}})$ # Minimizing Error Rates Assume we plan to perform an independent samples *t*-test, where our smallest effect size of interest is d = 0.5, and we are planning to collect 64 participants in each condition. We would normally calculate power as follows: `power.t.test(delta = .5, sd = 1, n = 64, sig.level = 0.05, type = 'two.sample', alternative = 'two.sided')$power` This analysis tells us that we have 80% power with a 5% alpha level for our smallest effect size of interest, d = 0.5, when we collect 64 participants in each condition. If we design 2000 studies like this, the number of Type 1 and Type 2 errors we make depend on how often the null hypothesis is true, and how often the alternative hypothesis is true. Let's assume both are equally likely for now. This means that in 1000 studies the null hypothesis is true, and we will make 50 Type 1 errors. In 1000 studies the alternative hypothesis is true, and we will make 100-80 = 20% Type 2 errors, so in 200 studies we will not observe a significant result even if there is a true effect. Combining Type 1 and Type 2 errors, in the long run, we should expect 250 of our 2000 studies to yield an error. The goal in Neyman-Pearson hypothesis testing is to control the number of errors we make, as we perform hypothesis tests. Researchers often rely on convention when setting error rates, and there is no special reason to set the Type 1 error rate at 5% and the Type 2 error rate at 20%, and there might be better choices when designing studies. For example, when collecting 64 participants per condition, it is also not optimally efficient to use these norms. ```{r fig.height=5, fig.width=6} res <- optimal_alpha(power_function = "power.t.test(delta = .5, sd = 1, n=64, sig.level = x, type='two.sample', alternative='two.sided')$power") print(res) ``` If a researcher is interested in effects of d = 0.5, and plans to collect 64 participants in each condition, setting the Type 1 error rate to `r 100*round(res$alpha,2)`% will increase the power to `r 100*round(1-res$beta,2)`%. If we would perform 2000 studies designed with these error rates, we would observe 100 Type 1 errors and 120 Type 2 errors. The combined error rate across 2000 studies is 220 instead of 250. In other words, by choosing a more optimal alpha level, we can design lines of research more efficiently, because we are less likely to make errors in our statistical inferences. ## Balancing Error Rates You can choose to minimize the combined error rates, but you can also decide that it makes most sense to balance the error rates. For example, you might think a Type 1 error is just as problematic as a Type 2 error, and therefore, you want to design a study that has balanced error rates for a smallest effect size of interest (e.g., a 5% Type 1 error rate and a 95% Type 2 error rate). The `optimal_alpha` function can either minimize errors, or balance them, by specifying an additional argument in the function. The default is to minimize error rates, but by adding `error = "balance"` an alpha level is calculated so that the Type 1 error rate equals the Type 2 error rate. ```{r fig.height=5, fig.width=6} res2 <- optimal_alpha(power_function = "power.t.test(delta = .5, sd = 1, n=64, sig.level = x, type='two.sample', alternative='two.sided')$power", error = "balance") print(res2) ``` Repeating our earlier example, the alpha level is `r 100*round(res2$alpha,2)`%, and the power is `r 100*round(1-res2$beta,2)`% (or the Type 2 error rate is `r 100*round(res2$beta,2)`%). Choosing to balance error rates is only slightly less efficient (`r 100*round(res2$alpha+res2$beta,4)`%) compared to minimizing error rates (`r 100*round(res$alpha+res$beta,4)`%). Power analysis is always a messy business due to the uncertainty in the true effect size, and I would not worry about the rather trivial difference between minimal error rates and balanced error rates, and the latter seems slightly more intuitive to explain, which might give this approach some practical benefits when we try to teach or explain the idea to others. # Relative costs and prior probabilities So far we have assumed a Type 1 error and Type 2 error are equally problematic. But you might believe Cohen (1988) was right, and Type 1 errors are exactly 4 times as bad as Type 2 errors. Or you might think they are twice as problematic, or 10 times as problematic. However you weigh them, as explained by Mudge et al., 2012, and Ulrich & Miller, 2019, you should incorporate those weights into your decisions. The function has another optional argument, `costT1T2`, that allows you to specify the relative cost of Type1:Type2 errors. By default this is set to 1, but you can set it to 4 (or any other value) such that Type 1 errors are 4 times as costly as Type 2 errors. This will change the weight of Type 1 errors compared to Type 2 errors, and thus also the choice of the best alpha level. ```{r fig.height=5, fig.width=6} res3 <- optimal_alpha(power_function = "power.t.test(delta = .5, sd = 1, n=100, sig.level = x, type='two.sample', alternative='two.sided')$power", error = "minimal", costT1T2 = 4) print(res3) ``` Now, the alpha level that minimized the *weighted* Type 1 and Type 2 error rates is `r 100*round(res3$alpha,2)`%. Similarly, you can take into account prior probabilities that either the null is true (and you will observe a Type 1 error), or that the alternative hypothesis is true (and you will observe a Type 2 error). By incorporating these expectations, you can minimize or balance error rates in the long run (assuming your priors are correct). Priors can be specified using the `priorH1H0` argument, which by default is 1 (H1 and H0 are equally likely). Setting it to 4 means you think the alternative hypothesis (and hence, Type 2 errors) are 4 times more likely than that the null hypothesis is true (and hence, Type 1 errors). ```{r fig.height=5, fig.width=6} res4 <- optimal_alpha(power_function = "power.t.test(delta = .5, sd = 1, n=100, sig.level = x, type='two.sample', alternative='two.sided')$power", error = "minimal", priorH1H0 = 2) print(res4) ``` If you think H1 is four times more likely to be true than H0, you need to worry less about Type 1 errors, and now the alpha that minimizes the weighted error rates is `r 100*round(res4$alpha,2)`%. It is always difficult to decide upon priors (unless you are Omniscient Jones) but even if you ignore them, you are making the decision that H1 and H0 are equally plausible. # Extending to the Analysis of Variance Now that the basics of finding an optimal alpha are out of the way we can move on to how these methods can be used for an ANOVA, and *F*-tests in general. This is accomplished by the replacing the `power.t.test` function with the `power.ftest` function. As you can see below it operates very similar to the `power.t.test`; the main differences are that `power.ftest` requires arguments for the degrees of freedom rather than sample size. Further, rather than allow for input of the power as an argument, `power_ftest` allows the user to input the beta_level. ``` power.ftest( num_df = NULL, den_df = NULL, cohen_f = NULL, alpha_level = Superpower_options("alpha_level"), beta_level = NULL, liberal_lambda = Superpower_options("liberal_lambda") ) ``` The use of the function on its own is simple, but not very intuitive for power analysis on its own. For example, let us suppose we have known degrees of freedom of 1 and 128 respectively, and an effect size of interest of `cohen_f = .22`. Now we simple want to find what the power would be given an `alpha_level` of .07. ```{r} power.ftest( num_df = 1, den_df = 128, cohen_f = .22, alpha_level = .045, beta_level = NULL, liberal_lambda = FALSE ) ``` The process is simple for using power.ftest; now we can see where it is actually useful. ## An Example with ANOVA_exact2 Let us use an example from another vignette. Imagine you plan to perform a study in which participants interact with an artificial voice assistant that sounds either like a human or like a robot, and who sounds either cheerful or sad. In the example below, we can setup a 2*2 mixed design in `ANOVA_design` (first factor, voice, is manipulated between participants, the second factor, emotion, is manipulated within participants). The sample size is fixed at 40 (meaning the researchers are unable to collect more than this amount) in each between subject condition (so 80 participants in total), the assumed population standard deviation is 1.03, the correlation for the within factors is 0.8, and the means are 1.03, 1.21, 0.98, 1.01. We will start by using the default `alpha_level`. The outcome of interest in this case is the interaction effect; this means we want this particular ANOVA-level effect to be detected if it were to exist (more so than other possible outcomes). ```{r, fig.width=5, fig.height=4} design_result <- ANOVA_design(design = "2b*2w", n = 40, mu = c(1.03, 1.41, 0.98, 1.01), sd = 1.03, r = 0.8, labelnames = c("voice", "human", "robot", "emotion", "cheerful", "sad"), plot = TRUE) ``` We can then pass this design onto to the `ANOVA_exact2` function to estimate the effect sizes for the this particular design. ```{r} exact2_res = ANOVA_exact2(design_result, verbose = FALSE) knitr::kable(exact2_res$main_results) ``` As we can see here we can tell that we are tad under-powered to detect the effect of interest. Now, let us use `power.ftest` to find the alpha that will get us our desired beta (in this case let us assume .2). ```{r} power.ftest(num_df = exact2_res$anova_table$num_df[3], den_df = exact2_res$anova_table$den_df[3], cohen_f = exact2_res$main_results$cohen_f[3], alpha_level = NULL, beta_level = .2) ``` So, we can see that the required alpha in this case is approximately .12. However, what would be the optimal alpha? Now, let us use this same approach in the `optimal_alpha` function. ```{r fig.height=5, fig.width=6} res5 = optimal_alpha("power.ftest(num_df = exact2_res$anova_table$num_df[3], den_df = exact2_res$anova_table$den_df[3], cohen_f = exact2_res$main_results$cohen_f[3], alpha_level = x)$power/100", error = "minimal") print(res5) ``` Interestingly the values for alpha and beta are fairly close to the calculations we made with `power.ftest` alone. ## ANOVA_compromise Finally, we can extend everything we've done to a convenience function `ANOVA_compromise`. This way we can simply pass on the design from `ANOVA_design` and the optimal alpha can be determined for *every* comparison we select in a study. ```{r ANOVA_compromise} comp_res = ANOVA_compromise(design_result, emm = TRUE) ``` You'll notice it takes a lot longer to run `ANOVA_compromise` than `ANOVA_exact` or `ANOVA_exact2`, but overall the options (such as `emm = TRUE`) are the same. Now, we can go through the results one-by-one. ```{r} #ANOVA results knitr::kable(comp_res$aov_comp) #MANOVA results knitr::kable(comp_res$manova_comp) #emmeans results knitr::kable(comp_res$emmeans_comp) ``` In addition, we can extract the optimal alpha plots for any of the comparisons outlined above. These can be found in `aov_plotlist`, `manova_plotlist`, and `emmeans_plotlist`. ```{r comp_plots, fig.height=5, fig.width=6} comp_res$aov_plotlist ```