Tuesday, 26 December 2017

Using simulations to understand p-values

Intuitive explanations of statistical concepts for novices #4

The p-value is widely used but widely misunderstood. I'll demonstrate this in the context of intervention studies. The key question is how confident can we be that an apparently beneficial effect of treatment reflects a change due to the intervention, rather than arising just through the play of chance. The p-value gives one way of deciding that. There are other approaches, including those based on Bayesian statistics, which are preferred by many statisticians. But I will focus here on the traditional null hypothesis significance testing (NHST) approach, which dominates statistical reporting in many areas of science, and which uses p-values.

As illustrated in my previous blogpost, where our measures include random noise, the distorting effects of chance mean that we can never be certain whether or not a particular pattern of data reflects a real difference between groups. However, we can compute the probability that the data came from a sample where there was no effect of intervention.

There are two ways to do this. One way is by simulation. If you repeatedly run the kind of simulation described in my previous blogpost, specifying no mean difference between groups, each time taking a new sample, for each result you can compute a standardized effect size. Cohen's d is the mean difference between groups expressed in standard deviation units, which can be computed by subtracting the group A mean from the group B mean, and dividing by the pooled standard deviation (i.e. the square root of the average of the variances for the two groups). You then see how often the simulated data give an effect size at least as large as the one observed in your experiment.
Histograms of effecct sizes obtained by repeatedly sampling from population where there is no difference between groups*
Figure 1 shows the distribution of effect sizes for two different studies: the first has 10 participants per group, and the second has 80 per group. For each study, 10,000 simulations were run; on each run, a fresh sample was taken from the population, and the standardized effect size, d, computed for that run. The peak of each distribution is at zero: we expect this, as we are simulating the case of no real difference between groups – the null hypothesis. But note that, though the shape of the distribution is the same for both studies, the scale on the x-axis covers a broader range for the study with 10 per group than the study with 80 per group. This relates to the phenomenon shown in Figure 5 of the previous blogpost, whereby estimates of group means jump around much more when there is a small sample.

The dotted red lines show the cutoff points that identify the top 5%, 1% and 0.1% of the effect sizes. Suppose we ran a study with 10 people and it gave a standardized effect size of 0.3. We can see from the figure that a value in this range is fairly common when there is no real effect: around 25% of the simulations gave an effect size of at least 0.3. However, if our study had 80 people per group, then the simulation tells us this is an improbable result to get if there really is no effect of intervention: only 2.7% of simulations yield an effect size as big as this.

The p-value is the probability of obtaining a result at least as extreme as the one that is observed, if there really is no difference between groups. So for the study with N = 80, p = .027. Conventionally, a level of p < .05 has been regarded as 'statistically significant', but this is entirely arbitrary. There is an inevitable trade-off between false positives (type I errors) and false negatives (type II errors). If it is very important to avoid false positives, and you do not mind sometimes missing a true effect, then a stringent p-value is desirable. If, however, you do not want to miss any finding of potential interest, even if it turns out to be a false positive, then you could adopt a more lenient criterion.

The comparison between the two sample sizes in Figure 1 should make it clear that statistical significance is not the same thing as practical significance. Statistical significance simply tells us how improbable a given result would be if there was no true effect. The larger the sample size, the smaller the effect size that would be detected at a threshold such as p < .05. Small samples are generally a bad thing, because they only allow us to reliably detect very large effects. But very large samples have the opposite problem: they allow us to detect as 'significant' effect that are so small as to be trivial. The key point that the researcher who is conducting an intervention study should start by considering how big an effect would be of practical interest, given the cost of implementing the intervention. For instance, you may decide that staff training and time spent on a vocabulary intervention would only be justified if it boosted children's vocabulary by at least 10 words. If you knew how variable children scores were on the outcome measure, the sample size could then be determined so that the study has a good chance of detecting that effect while minimising false positives. I will say more about how to do that in a future post.

I've demonstrated p-values using simulations in the hope that this will give some insight into how they are derived and what they mean. In practice, we would not normally derive p-values this way, as there are much simpler ways to do this, using statistical formulae. Provided that data are fairly normally distributed, we can use statistical approaches such as ANOVA, t-tests and linear regression to compute probabilities of observed results (see this blogpost). Simulations can, however, be useful in two situations. First, if you don't really understand how a statistic works, you can try running an analysis with simulated data. You can either simulate the null hypothesis by creating data from two groups that do not differ, or you can add a real effect of a given size to one group. Because you know exactly what effect size was used to create the simulated data, you can get a sense of whether particular statistics are sensitive to detect real effects, and how these might vary with sample size.

The second use of simulations is for situations where the assumptions of statistical tests are not met – for instance, if data are not normally distributed, or if you are using a complex design that incorporates multiple interacting variables. If you can simulate a population of data that has the properties of your real data, you can then repeatedly sample from this and compute the probability of obtaining your observed result to get a direct estimate of a p-value, just as was done above.

The key point to grasp about a p-value is that it tells you how likely your observed evidence is, if the null hypothesis is true. The most widely used p-value is .05: if the p-value in your study is less than .05, then the chance of your observed data arising when the intervention had no effect is 1 in 20. You may decide on that basis that it's worth implementing the intervention, or at least investing in the costs of doing further research on it.

The most common mistake is to think that the p-value tells you how likely the null hypothesis is given the evidence. But that is something else. The probability of A (observed data) given B (null hypothesis) not the same as the probability of B (null hypothesis) given A (observed data). As I have argued in another blogpost, the probability that if you are a man you are a criminal is not high, but if you are a criminal, the probability that you are a man is much higher. This may seem fiendishly complicated, but a concrete example can help.

Suppose Bridget Jones has discovered three weight loss pills: if taken for a month, pill A is totally ineffective placebo, pill B leads to a modest weight loss of 2 lbs, and pill C leads to an average loss of 7 lb. We do studies with three groups of 20 people; in each group, half are given A, B or C and the remainder are untreated controls. We discover that after a month, one of the treated groups has an average weight loss of 3 lb, whereas their control group has lost no weight at all. We don't know which pill this group received. If we run a statistical test, we find the p-value is .45. This means we cannot reject the null hypothesis of no effect – which is what we'd expect if this group had been given the placebo pill, A. But the result is also compatible with the participants having received pills B or C. This is demonstrate in Figure 2 which shows the probability density function for each scenario - in effect, the outline of the histogram. The red dotted line corresponds to our obtained result, and it is clear it is highly probable regardless of which pill was used. In short, this result doesn't tell us how likely the null hypothesis is – only that the null hypothesis is compatible with the evidence that we have.
Probability density function for weight loss pills A, B and C, with red line showing observed result

Many statisticians and researchers have argued we should stop using p-values, or at least adopt more stringent levels of p. My view is that p-values can play a useful role in contexts such as the one I have simulated here, where you want to decide whether an intervention is worth adopting, provided you understand what they tell you. It is crucial to appreciate how dependent a p-value is on sample size, and to recognise that the information it provides is limited to telling you whether an observed difference could just be due to chance. In a later post I'll go on to discuss the most serious negative consequence of misunderstanding of p-values: the generation of false positive findings by the use of p-hacking.

*The R script to generate Figures 1 and 2 can be found here.

Thursday, 21 December 2017

Using simulations to understand the importance of sample size

Intuitive explanations of statistical concepts for novices #3

I'll be focusing here on the kinds of stats needed if you conduct an intervention study. Suppose we measured the number of words children could define on a 20-word vocabulary task. Words were selected so that at the start of training, none of the children knew any of them. At the end of 3 months of training, every child in the vocabulary training group (B) knew four words, whereas those in a control group (A) knew three words. If we had 10 children per group, the plot of final scores would look like Figure 1 panel 1.
Figure 1. Fictional data to demonstrate concept of random error (noise)

In practice, intervention data never look like this. There is always unexplained variation in intervention outcomes, and real results look more like panel 2 or panel 3. That is, in each group, some children learn more than average and some less than average. Such fluctuations can reflect numerous sources of uncontrolled variation: for instance, random error will be large if we use unreliable measures, or there may be individual differences in responsiveness to intervention in the people included in the study, as well as things that can fluctuate from day to day or even moment to moment, such as people's mood, health, tiredness and so on.

The task for the researcher is to detect a signal – the effect of intervention – from noise – the random fluctuations. It is important to carefully select our measures and our participants to minimise noise, but we will never eliminate it entirely.

There are two key concepts behind all the statistics we do: (a) data will contain random noise, and (b) when we do a study we are sampling from a larger population. We can make these ideas more concrete through simulation.

The first step is to generate a large quantity of random numbers. Random numbers can be easily generated using the free software package R: if you have this installed, you can follow this demo by typing in the commands shown in italic at your console. R has a command, rnorm, that generates normally distributed random numbers. For instance:


will generate 10 z-scores, i.e. random numbers with mean of 0 and standard deviation of 1.You get new random numbers each time you submit the command, (unless you explicitly set something known as the random number seed to be the same each time). Now let's use R to generate 100,000 random numbers, and plot the output in a histogram. Figure 2 can be generated with the commands:

myz = rnorm(100000,0,1) 

Figure 2: Distribution of z-scores simulated with rnorm

This shows that numbers close to zero are most common, and the further we get from zero in either direction, the lower the frequency of the number. The bell-shaped curve is a normal distribution, which we get because we opted to generate random numbers following a normal distribution using rnorm. (Other distributions of random number are also possible; you can see some options here).

So you might be wondering what we do with this list of numbers. Well, we can simulate experimental data based on this population of numbers by specifying two things:
1. The sample size
2. The effect size – i.e., Cohen's d, the mean difference between groups in standard deviation (SD) units.

Suppose we want two groups, A and B, each with a sample size of 10, where group B has scores that are on average 1 SD larger than group A. First we select 20 values at random from myz:

mydata = sample(myz, 20) 

Next we create a variable corresponding to group, which is created by just making a variable, mygroup, that combines ten repeats of 'A' with ten repeats of 'B'.

mygroup = c(rep('A', 10), rep('B', 10))  

Next we add the effect size, 1, to the last 10 numbers, i.e. those for group B

mydata[11:20] = mydata[11:20] + 1 

Now we can plot the individual points clustered by group. First install and activate the beeswarm package to make a nice plot format:


Then you can make the plot with the command:

beeswarm(mydata ~ mygroup) 

The resulting plot will look something like one of the graphs in Figure 3. It won't be exactly the same as any of them because your random sample will be different from the ones we have generated. In fact, this is one point of this exercise: to show you how numbers will vary from one occasion to another when you sample from a population.

If you just repeatedly run these lines, you will see how things vary just by chance:

mydata = sample(myz, 20) 
mydata[11:20] = mydata[11:20] + 1 
beeswarm(mydata ~ mygroup) 

Figure 3: Nine runs of simulated data from 2 groups: A comes from population with mean score of 0 and B from population with mean score of 1
Note how in Figure 3, the difference between groups A and B is far more marked in runs 7 and 9 than in runs 4 and 6, even though each dataset was generated by the same script. This is what is meant by the 'play of chance' affecting experimental data.

Now let's look at Figure 4, which gives output from another nine runs of a simulation. This time, some runs were set so that there was a true effect of intervention (by adding .6 to values for group B) and some were set with no difference between groups. Can you tell which simulations were based on a real effect?

Figure 4: Some of these runs were generated with effect size of .6, others had no difference between A and B
The answer is that runs 1, 2, 4, 8 and 9 came from runs where there was a real effect of .6 (which, by the standard of most intervention studies is a large effect). You may have identified some of these runs correctly, but you may also to have falsely selected run 3 as showing an effect. This would be a false positive, where we wrongly conclude there is an intervention effect when the apparent superiority of the intervention group is just down to chance. This type of error is known as a type I error. Run 2 looks like a false negative – we are likely to conclude there is no effect of intervention, when in fact there was one. This is a type II error. One way to remember this distinction is that a type I error is when you think you've won (1) but you haven't.

The importance of sample size 
Figures 3 and 4 demonstrate that, when inspecting data from intervention trials, you can't just rely on the data in front of your eyes. Sometimes, they will suggest a real effect when the data are really random (type I error) and sometimes they will fail to reveal a difference when the intervention is really effective (type II error). These anomalies arise because data incorporates random noise which can generate spurious effects or mask real effects. This masking is particularly problematic when samples are small.

Figure 5 shows two sets of data: the top panel and the bottom panel were derived by the same simulation, the only difference being the sample size: 10 per group in the top panels, and 80 per group in the bottom panels. In both cases, the simulation specified that group B scores were drawn from a population that had higher scores than group A, with an effect size of 0.6.  The bold line shows the group average. The figure shows that the larger the sample, the closer the results from the sample will agree with the population from which it was drawn.
Figure 5: Five runs of simulation where true effect size = .6

When samples are small, estimates of the means will jump around much more than when samples are large. Note, in particular, that with the small sample size, on the third run, the mean difference between A and B is overestimated by about 50%, whereas in the fourth run, the mean for B is very close to that for A.

In the population from which these samples are taken the mean difference between A and B is 0.6, but if we just take a sample from this population, by chance we may select atypical cases, and these will have a much larger impact on the observed mean when the sample is small.

 In my next post, I will show how we can build on these basic simulations to get an intuitive understanding of p-values.

P.S. Scripts for generating the figures in this post can be found here.