9 Epidemiology Module - Everything you wanted to know about statistics

Alison Grant 30/8/18

Summarising a variable: Is it continuous or categorical?

9.1 Continuous Data

Are continuous data normally distributed or skewed? If the mean and the median are about the same, the distribution is roughly symmetric, which is consistent with a normal distribution. Plotting the data (for example, as a histogram) will also show whether they are normally distributed.

Data can be positively skewed (a long tail of high values) or negatively skewed (a long tail of low values). The mean is strongly influenced by outliers, so if your data are skewed, use the median rather than the mean.

If the data are normally distributed, you usually give the mean and the standard deviation.

If our data are skewed, we’d usually use the median and the interquartile range (the 25th to the 75th percentile). The 75th percentile will have 75% of the data less than it, and 25% greater than it.

This is important because normally distributed and skewed data need different statistical tests.
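As a minimal sketch of the two kinds of summary, the snippet below uses Python's standard `statistics` module. The function name `summarise` and the length-of-stay values are made up for illustration, and the quartiles use the module's default method, which may differ slightly from other software.

```python
import statistics


def summarise(values):
    """Return mean/SD and median/quartiles for a list of numbers."""
    # quantiles(n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "median": median,
        "q1": q1,  # 25th percentile
        "q3": q3,  # 75th percentile
    }


# Hypothetical, positively skewed data: one large outlier pulls the mean up.
lengths_of_stay = [1, 2, 2, 3, 3, 4, 5, 40]
summary = summarise(lengths_of_stay)
print(summary)
```

Here the mean (7.5) sits well above the median (3), which is the pattern expected for positively skewed data, so the median and interquartile range would be the more sensible summary.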

9.2 Categorical Data

Are your data ordered or not ordered?

9.3 Summarising data

Normal continuous data: Mean, SD
Skewed continuous data: Median, IQR
Categorical data: Proportions/%s by category

9.4 Comparing between groups

Continuous normal groups: Student’s t-test. This uses the actual values of the variables.

Continuous skewed groups: Wilcoxon rank sum test (also known as the Mann-Whitney U test). This is a non-parametric test: it uses the rank (the order) of a value rather than the actual value itself.

The reason you don’t always use a non-parametric test is that you want to keep as much information as possible, and using the actual values is better than using ranks when the data allow it.
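To illustrate the difference between using values and using ranks, here is a small sketch that computes the rank sum for one group, the quantity the Wilcoxon rank sum test is built on. The helper `rank_sum` and the data are hypothetical, and this is only the rank step, not the full test.

```python
def rank_sum(group_a, group_b):
    """Sum of the ranks of group_a within the pooled data (ties get midranks)."""
    pooled = sorted(group_a + group_b)

    def midrank(value):
        first = pooled.index(value) + 1  # 1-based rank of the first occurrence
        count = pooled.count(value)      # number of tied values
        return first + (count - 1) / 2   # average rank across the ties

    return sum(midrank(v) for v in group_a)


# Hypothetical data: the two versions of group A differ only in the outlier.
group_b = [1, 2, 4, 6]
group_a = [3, 5, 8, 40]
group_a_extreme = [3, 5, 8, 400]

print(rank_sum(group_a, group_b), rank_sum(group_a_extreme, group_b))
print(sum(group_a) / 4, sum(group_a_extreme) / 4)
```

The rank sum is the same (23) whether the largest value is 40 or 400, while the mean changes from 14 to 104. That robustness is why ranks suit skewed data, and the information thrown away is why values are preferred when the data are normal.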

Categorical (not ordered): Chi-squared test (or Fisher’s exact test if any expected cell count is less than 5)
Categorical (ordered): Chi-squared test, or chi-squared test for linear trend
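For a two-by-two table the chi-squared test can be sketched by hand, as below. This is an illustration rather than a reference implementation: the function name `chi_squared_2x2` and the counts are invented, no continuity correction is applied, and the "less than 5" check is taken here to refer to expected cell counts.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table.

    Table layout (rows are groups, columns are outcome categories):
        a  b
        c  d
    Returns (statistic, p_value, smallest_expected_count).
    """
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    observed = [[a, b], [c, d]]
    statistic = 0.0
    smallest_expected = float("inf")
    for i in range(2):
        for j in range(2):
            # Expected count if group and outcome were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            smallest_expected = min(smallest_expected, expected)
            statistic += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the upper tail of the chi-squared
    # distribution equals erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value, smallest_expected


# Hypothetical counts: 30/100 with the outcome in one group, 10/100 in the other.
statistic, p_value, smallest_expected = chi_squared_2x2(30, 70, 10, 90)
print(statistic, p_value, smallest_expected)
if smallest_expected < 5:
    print("Small expected counts: consider Fisher's exact test instead")
```

In practice you would normally use a statistics package for this, and note that some packages apply a continuity correction to two-by-two tables by default, which gives a slightly different answer.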

9.5 Paired data

Each of the tests above has an equivalent for paired data:

t-test: paired t-test
Wilcoxon rank sum test: Wilcoxon signed rank test
Chi-squared test: McNemar’s test

9.6 Ratios

9.6.1 Risk Ratios

We would generally put the exposure of interest in the rows and the outcome variable in the columns (so the title of the exposure is at the side and the title of the outcome is at the top).

A risk ratio compares the risk of the outcome between two groups: the exposed and the unexposed.

The denominator is the row total for that exposure group.

                     Disease Positive   Disease Negative
Exposure Positive    a                  b
Exposure Negative    c                  d

Risk Ratio = (a / (a + b)) / (c / (c + d))

This is your risk of disease if exposure positive over your risk of disease if exposure negative.
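The calculation can be sketched in a few lines of Python, using the same a, b, c, d cell labels as the table. The function name `risk_ratio` and the counts are hypothetical.

```python
def risk_ratio(a, b, c, d):
    """Risk ratio from a 2x2 table with exposure in rows, disease in columns."""
    risk_exposed = a / (a + b)    # denominator is the exposed row total
    risk_unexposed = c / (c + d)  # denominator is the unexposed row total
    return risk_exposed / risk_unexposed


# Hypothetical counts: 30 of 100 exposed and 10 of 100 unexposed get the disease.
rr = risk_ratio(30, 70, 10, 90)
print(round(rr, 2))
```

With these counts the risks are 30% and 10%, so the exposed group has about three times the risk of the unexposed group.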

9.6.2 Odds Ratios

The denominator is the number of people in that exposure group without the outcome you’re interested in.

                     Disease Positive   Disease Negative
Exposure Positive    a                  b
Exposure Negative    c                  d

Odds Ratio = (a/b) / (c/d)

Your odds of disease if exposure positive (disease vs no disease in patients with exposure) over your odds of disease if exposure negative

We tend to interpret odds ratios as if they were risk ratios, but they are not the same thing. When an outcome is rare the two are similar, but when an outcome is common, odds become quite different from risks. The odds ratio is generally further away from one than the risk ratio.
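A short sketch makes the gap between the two measures concrete. The function names and both sets of counts are invented for illustration; each table has 100 exposed and 100 unexposed people and a risk ratio of 2.

```python
def risk_ratio(a, b, c, d):
    """Risk in the exposed over risk in the unexposed."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Odds of disease in the exposed over odds of disease in the unexposed."""
    return (a / b) / (c / d)


# Hypothetical rare outcome: 2% of exposed vs 1% of unexposed have the disease.
rare = (2, 98, 1, 99)
# Hypothetical common outcome: 60% of exposed vs 30% of unexposed.
common = (60, 40, 30, 70)

for label, table in [("rare", rare), ("common", common)]:
    rr = risk_ratio(*table)
    odds = odds_ratio(*table)
    print(label, "risk ratio:", round(rr, 2), "odds ratio:", round(odds, 2))
```

For the rare outcome the odds ratio (about 2.02) is almost the same as the risk ratio (2), but for the common outcome the odds ratio (about 3.5) is much further from one than the risk ratio (2).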

9.7 P value

If there were truly no difference between the groups, what is the probability of seeing a difference at least as large as the one observed by chance alone?

P values don’t tell you the strength of the association between a variable and an outcome. That’s what a risk ratio or an odds ratio is for.

9.7.1 Confidence intervals

Confidence intervals tell you more than p values: they tell you about the plausible range of results. Both confidence intervals and p values are strongly influenced by sample size.

If presenting the data, give confidence intervals rather than just p values (especially rather than just > or < 0.05)

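If the confidence interval for a ratio (a risk ratio or an odds ratio) includes one, the data are consistent with no difference between the two groups, so there is no statistically significant evidence of a difference.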
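As a rough sketch of how sample size affects a confidence interval, the code below computes an approximate 95% interval for a risk ratio. The notes do not say which method to use, so the log-scale normal approximation, the 1.96 multiplier, the function name and the counts are all assumptions made for illustration.

```python
import math


def risk_ratio_ci(a, b, c, d, z=1.96):
    """Risk ratio with an approximate 95% CI (normal approximation on the log scale).

    Table layout: a, b = exposed with/without disease;
                  c, d = unexposed with/without disease.
    """
    rr = (a / (a + b)) / (c / (c + d))
    # Standard error of log(risk ratio).
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical studies with the same risk ratio (3) but different sizes.
large_study = risk_ratio_ci(30, 70, 10, 90)  # 100 people per group
small_study = risk_ratio_ci(3, 7, 1, 9)      # 10 people per group

for label, (rr, lower, upper) in [("large", large_study), ("small", small_study)]:
    includes_one = lower <= 1 <= upper
    print(label, round(rr, 2), round(lower, 2), round(upper, 2), includes_one)
```

Both studies estimate the same risk ratio, but only the small study has an interval that includes one. The small study is not evidence that there is no effect: it is simply too small to tell.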

9.8 Sample Size

Sample size is incredibly important: work it out at the start, when you are designing the study.