9 Epidemiology Module - Everything you wanted to know about statistics
Alison Grant 30/8/18
Summarising a variable: Is it continuous or categorical?
9.1 Continuous Data
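Are continuous data normally distributed or skewed? If the mean and the median are about the same, the data are roughly symmetric, which is consistent with a normal distribution. Plotting the data (e.g. as a histogram) will also show whether they are normally distributed.
Data can be positively skewed (long tail to the right) or negatively skewed (long tail to the left). The mean is strongly influenced by outliers, so if your data are skewed, use the median rather than the mean.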
If the data are normally distributed, you usually give the mean and the standard deviation.
If the data are skewed, we’d usually use the median and the interquartile range (the 25th to the 75th percentile). The 75th percentile has 75% of the data below it and 25% above it.
This matters because normally distributed and skewed data need different statistical tests.
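As a rough illustration of the two kinds of summary, the sketch below uses Python's standard `statistics` module on a small made-up sample. The `stays` data are hypothetical and not from the course, and the quartiles depend on the interpolation method the module uses by default.

```python
import statistics

# Hypothetical right-skewed sample (e.g. length of hospital stay in days).
stays = [1, 2, 2, 3, 3, 4, 5, 6, 9, 45]

# Summary suited to normally distributed data.
mean = statistics.mean(stays)
sd = statistics.stdev(stays)  # sample standard deviation

# Summary suited to skewed data.
median = statistics.median(stays)
# quantiles(n=4) returns the three quartile cut points:
# the 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(stays, n=4)

print(f"mean = {mean}, SD = {sd:.1f}")
print(f"median = {median}, IQR = {q1} to {q3}")
```

In this sample the single value of 45 pulls the mean (8) well above the median (3.5), which is why the median and IQR describe skewed data better.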
9.2 Categorical Data
Are your data ordered or not ordered?
9.3 Summarising data
Normal continuous data: Mean, SD
Skewed continuous data: Median, IQR
Categorical data: Proportions/%s by category
9.4 Comparing between groups
Continuous normal groups: Student’s t-test. This uses the actual values of the variables.
Continuous skewed groups: Wilcoxon rank sum test (also known as the Mann-Whitney U test). This is a non-parametric test: it uses the rank (the order) of a value rather than the actual value itself.
The reason you don’t always use a non-parametric test is that you want to keep as much information as possible, and using the actual values is better than using ranks when the data allow it.
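To make the difference between values and ranks concrete, here is a minimal sketch that ranks two made-up groups together, which is the first step of a rank sum test. The helper `average_ranks` and the data are illustrative assumptions. The sketch stops at the rank sums and does not compute a p value.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend `end` over the run of values tied with the one at `pos`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


# Hypothetical data: group_b contains one extreme outlier.
group_a = [3, 5, 4, 6]
group_b = [7, 8, 9, 400]

ranks = average_ranks(group_a + group_b)
rank_sum_a = sum(ranks[:len(group_a)])
rank_sum_b = sum(ranks[len(group_a):])
print(rank_sum_a, rank_sum_b)
```

Replacing 400 with 10 would leave every rank unchanged. That is why rank-based tests are robust to outliers, and also why they discard information about how far apart the values are.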
Categorical (not ordered): Chi-squared test (or Fisher’s exact test if any expected cell count is less than 5)
Categorical (ordered): Chi-squared test, or chi-squared test for linear trend
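The chi-squared test compares the observed count in each cell with the count expected if the two variables were unrelated. Below is a minimal sketch for a 2x2 table, assuming cells labelled a, b (first row) and c, d (second row), no continuity correction, and no zero row or column totals. The function name `chi_squared_2x2` and the counts are made up for illustration. In practice you would normally use a statistics package.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table.

    Returns (statistic, p_value, expected_counts).
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    # Expected count in each cell = row total * column total / grand total.
    expected = tuple(
        tuple(row_totals[i] * col_totals[j] / n for j in range(2))
        for i in range(2)
    )
    statistic = sum(
        (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2)
        for j in range(2)
    )
    # A 2x2 table has 1 degree of freedom, where the upper tail
    # probability of the chi-squared distribution is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value, expected


# Hypothetical counts.
stat, p, expected = chi_squared_2x2(30, 70, 15, 85)
# If any expected count is below 5, Fisher's exact test is the usual choice.
small_cells = any(e < 5 for row in expected for e in row)
print(f"chi-squared = {stat:.2f}, p = {p:.3f}, small expected cells: {small_cells}")
```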
9.5 Paired data
Each of the tests above has a paired equivalent:
t-test: paired t-test
Wilcoxon rank sum: Wilcoxon signed rank test
Chi-squared: McNemar’s test
9.6 Ratios
9.6.1 Risk Ratios
We would generally put the exposure of interest in the rows and the outcome variable in the columns (so the exposure label goes down the side and the outcome label goes across the top).
This is to compare the outcome between two exposure groups.
The denominator is the row total for each exposure group (a+b for exposure positive, c+d for exposure negative).
- | Disease positive | Disease negative |
---|---|---|
Exposure positive | a | b |
Exposure negative | c | d |
Risk Ratio = (a / (a+b)) / (c / (c+d))
This is your risk of disease if exposure positive over your risk of disease if exposure negative.
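The calculation can be sketched in a few lines, using the a, b, c, d cell labels from the table. The function name `risk_ratio` and the counts are made up for illustration.

```python
def risk_ratio(a, b, c, d):
    """Risk ratio from a 2x2 table with exposure in rows, outcome in columns."""
    risk_exposed = a / (a + b)    # diseased / all exposure-positive people
    risk_unexposed = c / (c + d)  # diseased / all exposure-negative people
    return risk_exposed / risk_unexposed


# Hypothetical counts: 30 of 100 exposed and 15 of 100 unexposed have disease.
rr = risk_ratio(30, 70, 15, 85)
print(f"risk ratio = {rr:.2f}")
```

With these counts the risks are 30% and 15%, so the risk ratio comes out at about 2: disease is twice as likely in the exposed group.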
9.6.2 Odds Ratios
The denominator is the number of people without the thing you’re interested in (b or d), rather than the row total.
- | Disease positive | Disease negative |
---|---|---|
Exposure positive | a | b |
Exposure negative | c | d |
Odds Ratio = (a/b) / (c/d)
Your odds of disease if exposure positive (disease vs no disease in patients with exposure) over your odds of disease if exposure negative
We tend to interpret odds as if they were risks, but they are not the same thing. When an outcome is rare the two are close, but when an outcome is common, odds become quite different from risks. The odds ratio is generally further away from one than the risk ratio.
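A small numerical sketch shows how the two measures drift apart as the outcome becomes more common. Both tables below are made up, and both have a risk ratio of about 2. The function names are illustrative.

```python
def risk_ratio(a, b, c, d):
    """(a / (a+b)) / (c / (c+d)): the denominators are the row totals."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """(a/b) / (c/d): the denominators are the people without the disease."""
    return (a / b) / (c / d)


# Hypothetical tables, given as (a, b, c, d).
rare = (2, 98, 1, 99)      # risks of 2% vs 1%
common = (60, 40, 30, 70)  # risks of 60% vs 30%

for label, table in (("rare", rare), ("common", common)):
    print(f"{label}: RR = {risk_ratio(*table):.2f}, OR = {odds_ratio(*table):.2f}")
```

For the rare outcome the odds ratio (about 2.02) is very close to the risk ratio. For the common outcome it is 3.5, much further from one.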
9.7 P value
What is the probability of seeing a difference at least as large as the one observed if chance alone were at work (that is, if there were really no difference)?
P values don’t tell you the strength of the association between a variable and an outcome. That’s what a risk ratio or an odds ratio is for.
9.7.1 Confidence intervals
Confidence intervals tell you more than p values: they tell you about the plausible range of results. Both confidence intervals and p values are strongly influenced by sample size.
If presenting the data, give confidence intervals rather than just p values (especially rather than just > or < 0.05)
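If the confidence interval for a ratio (risk ratio or odds ratio) crosses one, the data are compatible with there being no difference between the two groups, so there is no good evidence of a difference.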
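The effect of sample size on a confidence interval can be sketched with a risk ratio. The notes do not say how the interval is calculated. The example below assumes a common large-sample approximation that works on the log scale, with 1.96 as the multiplier for a 95% interval. The function name `risk_ratio_ci` and both tables are made up, and all four cells are assumed to be non-zero.

```python
import math


def risk_ratio_ci(a, b, c, d, z=1.96):
    """Risk ratio with an approximate 95% CI (log method; an assumption here)."""
    rr = (a / (a + b)) / (c / (c + d))
    # Approximate standard error of log(RR).
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Same risks (30% vs 15%) in a small and a large hypothetical study.
small = risk_ratio_ci(6, 14, 3, 17)
large = risk_ratio_ci(300, 700, 150, 850)

for label, (rr, lower, upper) in (("small", small), ("large", large)):
    crosses_one = lower < 1 < upper
    print(f"{label}: RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}, "
          f"crosses one: {crosses_one}")
```

Both studies give the same risk ratio, but only the small one has an interval that crosses one. The estimate is the same; what changes with sample size is how precisely it is known.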
9.8 Sample Size
Sample size is incredibly important: work it out at the start, as part of the study design.