Medical Statistics in Clinical Research – Mean, Median, Standard Deviation, p-value, Chi Squared test

T. Dhasaratharaman*

Statistician, Kauvery Hospitals, India

*Correspondence: Tel: +91 90037 84310 Email: dhasa.cst@kauveryhospital.com

Mean

The mean is also known as the arithmetic mean, or the average. It is used very commonly in papers, so it is important to understand how it is calculated. It is one of the simplest statistical concepts to grasp.

It is used when the spread of the data is fairly similar on each side of the mid-point, for example when the data are “normally distributed”.

The “normal distribution” is referred to a lot in statistics. It’s the symmetrical, bell-shaped distribution of data (Fig. 1).


Fig. 1. The normal distribution. The centre line shows the mean of the data.

How is it calculated?

The mean is the sum of all the values, divided by the number of values.

Example

Five women in a study on lipid-lowering agents are aged 52, 55, 56, 58 and 59 years.

Add these ages together: 52 + 55 + 56 + 58 + 59 = 280

Now divide by the number of women: 280/5 = 56

So, the mean age is 56 years.
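For readers who like to verify such calculations in software, here is a minimal sketch in Python using the five ages above (any statistics package will give the same result):

ages = [52, 55, 56, 58, 59]
mean_age = sum(ages) / len(ages)   # 280 divided by 5
print(mean_age)                    # 56.0 years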

Median

The median is also known as the mid-point. It is used in many research papers.

It is used to represent the average when the data are not symmetrical, for instance when the distribution is “skewed” rather than normal (Fig. 2).


Fig. 2. A skewed distribution. The dotted line shows the median.

How is it calculated?

It is the point which has half the values above, and half below.

Example 1:

Consider the same data used for the mean calculation. With five women aged 52, 55, 56, 58 and 59 years, the median age is 56, the same as the mean: half the women are older, half are younger.

However, if a sixth woman aged 92 years joins the group, there are two “middle” ages, 56 and 58. The median is halfway between these, i.e., 57 years. This gives a better idea of the mid-point of these skewed data than the mean, which is now 62.
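The same logic can be checked in Python; the built-in statistics module handles both the odd and even cases. A minimal sketch using the ages from the two examples above:

import statistics

five_ages = [52, 55, 56, 58, 59]
six_ages = [52, 55, 56, 58, 59, 92]

print(statistics.median(five_ages))  # 56, the middle value
print(statistics.median(six_ages))   # 57.0, halfway between 56 and 58
print(statistics.mean(six_ages))     # 62.0, pulled upwards by the age of 92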

Example 2:

A dietician measured the energy intake over 24 hours of 50 patients on a variety of wards. One ward had two patients who were “nil by mouth”. The median was 12.2 megajoules, interquartile range (IQR) 9.9 to 13.6. The lowest intake was 0, the highest was 16.7. This distribution is represented by the box and whisker plot below.


Box and whisker plot of energy intake of 50 patients over 24 hours. The ends of the whiskers represent the maximum and minimum values, excluding extreme results like those of the two “nil by mouth” patients.
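The median and interquartile range for data like these can be computed directly; a sketch using NumPy with made-up intake values, since the individual measurements of the 50 patients are not given here:

import numpy as np

# hypothetical 24-hour energy intakes in megajoules (illustrative only, not the study data)
intake = np.array([0, 0, 8.9, 9.9, 10.5, 11.8, 12.2, 12.6, 13.1, 13.6, 15.0, 16.7])

print(np.median(intake))                 # the median intake
print(np.percentile(intake, [25, 75]))   # the interquartile range (Q1 to Q3)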

Standard Deviation

This is a very important concept. The standard deviation (SD) is used for data which are “normally distributed”, and it provides information on how much the data vary around their mean.

Interpretation

SD indicates how much a set of values is spread around the average.

A range of one SD above and below the mean (abbreviated to ± 1 SD) includes 68.2% of the values.

±2 SD includes 95.4% of the data.

±3 SD includes 99.7%.

Example 1:

Let us say that a group of patients enrolling for a trial had a normal distribution for weight. The mean weight of the patients was 80 kg. For this group, the SD was calculated to be 5 kg.

1 SD below the average is 80 - 5 = 75 kg.

1 SD above the average is 80 + 5 = 85 kg.

±1 SD will include 68.2% of the subjects, so 68.2% of patients will weigh between 75 and 85 kg.

95.4% will weigh between 70 and 90 kg (±2 SD).

99.7% of patients will weigh between 65 and 95 kg (±3 SD).

These data correspond to the graph in Fig. 3.


Fig. 3. Graph showing normal distribution of weights of patients enrolling in a trial with mean 80 kg, SD 5 kg.
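One way to see where these percentages come from is to simulate a large normally distributed sample with mean 80 kg and SD 5 kg and count how many values fall within each band; a minimal sketch in Python:

import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(loc=80, scale=5, size=100_000)  # simulated weights: mean 80 kg, SD 5 kg

for k in (1, 2, 3):
    within = np.mean(np.abs(weights - 80) <= k * 5) * 100
    print(f"within ±{k} SD: {within:.1f}%")  # approximately 68.2%, 95.4% and 99.7%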

If we have two sets of data with the same mean but different SDs, then the data set with the larger SD has a wider spread than the data set with the smaller SD.

For example, if another group of patients enrolling for the trial has the same mean weight of 80 kg but an SD of only 3 kg, ±1 SD will include 68.2% of the subjects, so 68.2% of patients will weigh between 77 and 83 kg (Fig. 4).



Fig. 4. Graph showing normal distribution of weights of patients enrolling in a trial with mean 80 kg, SD 3 kg.

Example 2:

SD should only be used when the data have a normal distribution. However, means and SDs are often wrongly used for data which are not normally distributed.

A simple check for a normal distribution is to see whether the value 2 SDs below the mean is still within the possible range for the variable. For example, if we have length-of-hospital-stay data with a mean stay of 10 days and an SD of 8 days, then:

Mean – (2 x SD) = 10 – (2 x 8) = 10 – 16 = -6 days

This is clearly an impossible value for length of stay, so the data cannot be normally distributed. The mean and SD are therefore not appropriate measures to use.
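This rule of thumb is easy to write down as a check; the function below is a sketch (the name and the assumption that the lowest possible value is zero are purely illustrative):

def plausibly_normal(mean, sd, lowest_possible=0.0):
    # Rough check: is the value 2 SDs below the mean still a possible value?
    return mean - 2 * sd >= lowest_possible

print(plausibly_normal(10, 8))   # False: 10 - 16 = -6 days is impossible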

p-value

The p (probability) value is used when we wish to see how likely it is that a hypothesis is true. The hypothesis is usually that there is no difference between two treatments, known as the “null hypothesis”.

Interpretation

The p-value gives the probability of any observed difference having happened by chance.

p = 0.5 means that the probability of the difference having happened by chance is 0.5 in 1, or 50:50.

p = 0.05 means that the probability of the difference having happened by chance is 0.05 in 1, i.e., 1 in 20.

p = 0.05 is the figure frequently quoted as the threshold for “statistical significance”, i.e., the result is considered unlikely to have happened by chance and therefore important. However, this threshold is arbitrary.

If we look at 20 studies, even if none of the treatments works, one of the studies is likely to have a p-value of 0.05 or less and so appear significant!

The lower the p-value, the less likely it is that the difference happened by chance and so the higher the significance of the finding.

p = 0.01 is often considered to be “highly significant”. It means that the difference will only have happened by chance 1 in 100 times. This is unlikely, but still possible.

p = 0.001 means the difference would have happened by chance only 1 in 1000 times, which is even less likely, but still just possible. It is usually considered to be “very highly significant”.

Example 1:

Out of 50 new babies on average 25 will be girls, sometimes more, sometimes less.

Say there is a new fertility treatment and we want to know whether it affects the chance of having a boy or a girl. Therefore, we set up a null hypothesis – that the treatment does not alter the chance of having a girl. Out of the first 50 babies resulting from the treatment, 15 are girls. We then need to know the probability that this just happened by chance, i.e., did this happen by chance or has the treatment had an effect on the sex of the babies?

The p-value gives the probability of seeing a result like this if the null hypothesis were true.

The p-value in this example is 0.007. Do not worry about how it was calculated; concentrate on what it means. It means the result would only have happened by chance about 1 in 140 times (0.007 in 1) if the treatment did not actually affect the sex of the baby. This is highly unlikely, so we can reject our null hypothesis and conclude that the treatment probably does alter the chance of having a girl.
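Although the article does not state how the 0.007 was obtained, a figure of this size can be reproduced with an exact binomial test: under the null hypothesis each baby is a girl with probability 0.5, and we ask how surprising 15 girls out of 50 would be. A sketch using SciPy (version 1.7 or later provides binomtest):

from scipy.stats import binomtest

result = binomtest(k=15, n=50, p=0.5, alternative='two-sided')
print(result.pvalue)   # roughly 0.007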

Example 2:

Patients with minor illnesses were randomized to see either Dr. XXXX or Dr. YYYY. Dr. XXXX ended up seeing 200 patients in the study, whereas Dr. YYYY saw 176 patients (Table 1).

Table 1. Consultation outcomes for patients with minor illnesses seen by two doctors

Outcome Dr. XXXX (n = 200) Dr. YYYY (n = 176) p-value
Patients satisfied with consultation (%) 186 (93) 168 (95) 0.4
Mean (SD) consultation length (min) 16 (3.1) 6 (2.8) < 0.001
Patients getting a prescription (%) 58 (29) 76 (43) 0.3
Mean (SD) number of days off work 3.5 (1.3) 3.6 (1.3) 0.8
Patients needing a follow-up appointment (%) 46 (23) 72 (41) 0.05
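Table 1 does not say which tests produced each p-value. For the consultation-length row, however, an unpaired t-test calculated from the published summary statistics gives a p-value well below 0.001, consistent with the table; a sketch (assuming a two-sample t-test was the method used):

from scipy.stats import ttest_ind_from_stats

# mean (SD) consultation length: 16 (3.1) min for 200 patients vs 6 (2.8) min for 176 patients
result = ttest_ind_from_stats(mean1=16, std1=3.1, nobs1=200,
                              mean2=6, std2=2.8, nobs2=176)
print(result.pvalue)   # far below 0.001, matching "< 0.001" in Table 1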

Caution

The “null hypothesis” is a concept that underlies this and other statistical tests.

The test method assumes (hypothesizes) that there is no (null) difference between the groups. The result of the test either supports or rejects that hypothesis.

The null hypothesis is generally the opposite of what we are actually interested in finding out. If we are interested in whether there is a difference between two treatments, then the null hypothesis would be that there is no difference, and we would try to disprove this.

Try not to confuse statistical significance with clinical relevance. If a study is too small, the results are unlikely to be statistically significant even if the intervention actually works. Conversely, a large study may find a statistically significant difference that is too small to have any clinical relevance.

Chi-squared test (χ2)

Usually written as χ2 (for the test) or Χ2 (for its value); “Chi” is pronounced like “sky” without the “s”.
It is a measure of the difference between actual and expected frequencies.

Interpretation

The “expected frequencies” are those we would see if there were no difference between the sets of results (the null hypothesis). If the actual frequencies exactly matched the expected ones, the Χ2 value would be zero.

The larger the actual difference between the sets of results, the greater the Χ2 value. However, it is difficult to interpret the Χ2 value by itself, as it also depends on the number of categories being compared.

Example


A group of patients with bronchopneumonia were treated with either amoxicillin or erythromycin. The results are shown in Table 2.

Table 2. Improvement at 5 days by type of antibiotic given

Outcome Amoxicillin (%) Erythromycin (%) Total (%)
Improvement at 5 days 144 (60%) 160 (67%) 304 (63%)
No improvement at 5 days 96 (40%) 80 (33%) 176 (37%)
Total 240 (100%) 240 (100%) 480 (100%)

First, look at the table to get an idea of the differences between the effects of the two treatments. Remember, do not worry about the Χ2 value itself, but see whether it is significant. In this case p = 0.13, so the difference between the treatments is not statistically significant.
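The Χ2 value and its p-value can be reproduced from the four cell counts in Table 2; a sketch using SciPy (correction=False gives Pearson's χ2 without Yates' correction, which matches the p of about 0.13 quoted here):

from scipy.stats import chi2_contingency

observed = [[144, 160],   # improvement at 5 days: amoxicillin, erythromycin
            [96, 80]]     # no improvement at 5 days

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p)   # Χ2 of about 2.3 on 1 degree of freedom, p of about 0.13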

Caution

Instead of the χ2 test, “Fisher’s exact test” is sometimes used. Fisher’s test is preferable where the numbers are small, as it always gives an exact p-value.

The χ2 test is simpler for statisticians to calculate but gives only an approximate p-value and is inappropriate for small samples. Statisticians may apply “Yates’ continuity correction” or other adjustments to the χ2 test to improve the accuracy of the p-value.
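For the same 2 x 2 table, Fisher's exact test is also available in SciPy; a minimal sketch:

from scipy.stats import fisher_exact

odds_ratio, p = fisher_exact([[144, 160], [96, 80]])
print(p)   # the exact p-value for Table 2, close to the χ2 result for a sample this large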