report.code.Rmd

---
title: "Survival Models (MATH6143) Coursework"
author: "Student ID: 34273638"
geometry: margin = 2cm
output:
  pdf_document: default
header-includes:
  - \usepackage{wrapfig}
  - \usepackage{lipsum}
  - \usepackage{setspace}
  - \usepackage{titlesec}
  - \titlespacing{\title}{0pt}{\parskip}{-\parskip}
---
```{r,echo=F,message=F}
#Setting up enviroment
setwd("~/Desktop/Survival Models/CW/SM-CW")
lymp.data <- read.csv("lymphoma.csv",header=T)
duck.data <- read.table("duck.txt",header = T)
mortality.data <- read.csv("mortality.csv",header=T)
library(survival) #basic functions for survival modelling
library(survminer) # package to produce much nicer visualisations
library(tidyverse)
library(kableExtra)
library(readxl)
```
# Question 1:
\vspace{-5truemm}
#### Part a:
Producing Kaplan Meier estimates of the 38 patients with lymphocytic non-Hodgkins lymphoma who have been classified into 2 groups, symptomatic (n=10) and asymptomatic (n=28) gives the below plot.
```{r,include=F}
#Data explortation
lymp.data%>%
  group_by(symp)%>%
  summarise(n=n())

#Computing Kaplan Meier estimate of survivor function 
lymp.km <- survfit(Surv(time,event=cens) ~ symp, data=lymp.data)

summary(lymp.km)

#using log rank test
survdiff(Surv(time,event=cens) ~ symp, data=lymp.data)
```

```{r,echo=F,fig.align='center',fig.width=10,fig.height=5}
#Producing KM plot
plot(lymp.km,
     col=c("red","blue"),
     xlab = "Time (weeks)",
     ylab = "Survival probability",
     main = "Kaplan Meier survival estimates of patients \n with lymphocytic non-Hodgkins lymphoma")
legend("topright", legend = c("Asymptomatic", "Symptomatic"),
col = c("red", "blue"), lty = c(1, 1), cex = 0.8)
```
Viewing the plot, the asymptomatic group has a higher estimated survival probability for all time compared to the symptomatic group (except when both survival probabilities are 1), thus those in the asymptomatic group have a longer survival time than those in the symptomatic group on average. Furthermore, performing a log-rank test we attain a Chi-Squared statistic of 8.7 with 1 degree of freedom which corresponds to a significant p-value of **0.003**. Thus, we reject the null hypothesis of no difference in survival between the groups. Therefore, we have evidence that there is a statistically significant difference of survival between the groups, backing up the analysis of the Kaplan Meier plot.

#### Part b:
```{r,include=F}
#Finding CI (95%)
lymp.95 <- survfit(Surv(time,event=cens) ~ symp, data=lymp.data,conf.int=0.95)
summary(lymp.95)
#Using assumption that there are 52 weeks in a year, 6 year survival happens at week 312
#Reading off 95% confidence interval estimate at 312 weeks gives thew following 95% confidence intervals for the groups

#Asymptomatic (t>=301)
c(0.309, 0.690)
#Symptomatic (t>=281)
c(0.0156, 0.642)
```
For the asymptomatic and symptomatic groups the 95% confidence interval of the Kaplan Meier estimates of 6 year survival (taken as survival to week 312) is [**0.309**, **0.690**] and [**0.0156**, **0.642**] respectively. 

#### Part c:
```{r,include=F}
summary(lymp.km)
#For Asymptomatic time is some point greater than 301 weeks (study ended)
#For Symptomatic time is 276 weeks (drops from 0.3 to 0.2)
```
For the symptomatic group the estimated time beyond which the survival probability is below 30% is 276 weeks.

\newpage
#### Part d:
```{r,include=F}
#Part d
lymp.na <- survfit(Surv(time,event=cens) ~ symp, data=lymp.data,type = "fleming")


#Estimates are very similar, Nelson-Aalen estimates have survival portability shifted up (increased) slightly compared to Kaplan Meier
# estimates and this shift (increase in survival probability) is greater for the symptomatic group
summary(lymp.km)
summary(lymp.na)
#Survival estimates similar as d_{j}`/r_{j} small for all time for asymptomatic group (thus due to taylor approx they are approx equal)
#Survival estimates similar for symptomatic group too except at the end (due to small number at risk left) thus d_{j}`/r_{j} is not small
#so approximation is bad which is why the N-A estimates are more different for this group 

```

```{r,echo=F,fig.align='center'}
# par(mfrow = c(1,2))
# plot(lymp.na,
#      col=c("red","blue"),
#      xlab = "Time (weeks)",
#      ylab = "Survival probability",
#      main = "Nelson-Aalen survival estimates of patients \n with lymphocytic non-Hodgkins lymphoma",
#      cex.main=0.8)
# legend("topright", legend = c("Asymptomatic", "Symptomatic"),
# col = c("red", "blue"), lty = c(1, 1), cex = 0.5)
# plot(lymp.km,
#      col=c("red","blue"),
#      xlab = "Time (weeks)",
#      ylab = "Survival probability",
#      main = "Kaplan Meier survival estimates of patients \n with lymphocytic non-Hodgkins lymphoma",
#      cex.main=0.8)
# legend("topright", legend = c("Asymptomatic", "Symptomatic"),
# col = c("red", "blue"), lty = c(1, 1), cex = 0.5)

plot(lymp.na,
     col=c("red","blue"),
     xlab = "Time (weeks)",
     ylab = "Survival probability",
     main = "Nelson-Aalen survival estimates of patients \n with lymphocytic non-Hodgkins lymphoma",
     cex.main=0.8)
legend("topright", legend = c("Asymptomatic", "Symptomatic"),
col = c("red", "blue"), lty = c(1, 1), cex = 0.5)
```
Viewing the Nelson-Aalen survival estimates via plots they appear very similar to the Kaplan Meier estimates for the asymptomatic group, for the symptomatic group the Nelson-Aalen estimates diverge slightly in the positive direction from the Kaplan Meier estimates as time increases (survival probability is larger for Nelson-Aalen estimates later in the study compared to the Kaplan Meier estimates). In more detail, the survival estimates are very similar for the asymptomatic group as the value of $\frac{d_{j}}{r_{j}}$ (where $d_{j}$ is number of events at observed event time $t_{j}$ and $r_{j}$ is number at risk at $t_{j}$) is small for all observed event times $t_{j}$ with the largest value of $\frac{1}{13}$ at event time 301 weeks. Thus by Taylor expansion the survival estimates are approximately equal ($\hat{S}_{NA}(t) \approx \hat{S}_{KM}(t)$) for the asymptomatic group. This is what also causes the larger differences in estimates for the symptomatic group as the sample size is smaller for this subgroup causing the value of $\frac{d_{j}}{r_{j}}$ to be larger at high event times, with the largest value of 0.5 at event time 281 weeks. Thus, this larger value of $\frac{d_{j}}{r_{j}}$  at later event times causes the larger difference in survival estimates observed due to the approximation from Taylor expansion being less reliable.


#### Part e:
```{r, include=F}
lymp.95.na <- survfit(Surv(time,event=cens) ~ symp, data=lymp.data,conf.int=0.95,type="fleming")
summary(lymp.95.na)
#95% CI of probability of 6-year (312 week) survival for each group

#Asymptomatic (t>=301)
c( 0.319, 0.697)
#Symptomatic (t>=281)
c(0.0340, 0.621)
```
For the asymptomatic and symptomatic groups the 95% confidence interval for the Nelson-Aalen estimates of 6 year survival are [**0.319**, **0.697**] and [**0.0340**, **0.621**] respectively. 

# Question 2:
#### Part a:
Fitting Weibull regression models to the duck survival time data I used AIC as my model selection criteria and started with a model including all possible covariates and then proceeded to remove variables that led to a model with a lower AIC. Starting with a hierarchical model with the three way interaction between age, weight and length (and thus all lower order terms), applying my model selection criteria, my final model was the null model with an AIC of **209.97** compared to an AIC of **216.08** for the initial model. This final model implies that all the covariates (age, weight and length) have no effect on the survival of the ducks. Moreover, creating a model with the covarites age, weight and length, all three covariates were insignificant (had a p-value greater than 0.05). Thus, using the data provided I conclude that the survival of the ducks does not depend on any of the explanatory variables age, weight or length.

```{r, include=F}
#Using AIC (information criteria for model selection we start with full model and perform backward selection)
duck.wb.all <- survreg(Surv(Time,event=Observed) ~ Age*Weight*Length, data = duck.data)
final.model.wb.1 <- step(duck.wb.all,data=duck.data,direction = "backward",trace = T)
AIC(final.model.wb.1)
#216.0846
#Note that the backward selection gets 'stuck' as all 2 way interactions but not 3 way has higher AIC (even though lower order models have lower AIC)

#`unsticking` model selection
duck.wb.2int <- survreg(Surv(Time,event=Observed) ~ Age*Weight+Weight*Length+Age*Length, data = duck.data)
final.model.wb.2 <- step(duck.wb.2int,data=duck.data,direction = "backward",trace = T)
AIC(final.model.wb.2)
#209.9668 - has lower AIC
final.model.wb.2$coefficients #final model is null model (only includes intercept)
final.wb <- final.model.wb.2

duck.wb.simple <- survreg(Surv(Time,event=Observed) ~ Age+Weight+Length, data = duck.data)
AIC(duck.wb.simple)
#Note that no interaction model has AIC of 213.5725
summary(duck.wb.simple)
```
#### Part b:
To check if a Weibull model may be reasonable I plotted $log(\hat{H}(t))$ against $log(t)$ and checked that $log(\hat{H}(t))$ is linear with $log(t)$ with the gradient equal to $\alpha$ the shape parameter of the estimated Weibull model. In the below graph the red line has intercept -5.5 and gradient 1.21 (the estimated shape parameter is $\hat{\alpha}=1.211$ in my chosen final Weibull model from part a). The line is a good approximation of the points thus a Weibull distribution is a reasonable model.

```{r,echo=F,fig.align='center',fig.width=10,fig.height=5}
#calculating NA estimates
duck.na <- survfit(Surv(Time,event=Observed) ~ 1, data=duck.data,type = "fleming")
alpha <- 1/final.wb$scale
beta_0 <- exp(-final.wb$coefficients[1])

wb.check.plot.data <- data.frame("time"=duck.na$time,"cumhaz"=duck.na$cumhaz)
ggplot(aes(y=log(cumhaz),x=log(time)),data=wb.check.plot.data)+
  geom_point()+
  geom_abline(intercept = -5.5, slope = 1.21,col="red")+
  ylab("log of cumulative hazard")+
  xlab("log of time (time in days)")+
  ggtitle("Checking Weibull assumption with line y=1.21x-5.5")+
  geom_text(x=1.8, y=-2.8, label="y=1.21x-5.5",cex=5, col="red")
```


```{r, include=F}
duck.age.km <- survfit(Surv(Time,event=Observed) ~ Age, data=duck.data)
plot(duck.age.km, col = c("red", "blue"),
     xlab = "Time (days)",
     ylab = "Survival probability",
     main = "Kaplan Meier survival estimates of ducks split by age")
legend("topright", legend = c("Age: 0", "Age: 1"),
col = c("red", "blue"), lty = c(1, 1), cex = 0.8)
#Furthermore looking at the 2 Age groups for Kaplan Meier as they cross implies no
#significant difference in survival between 2 groups - as the insignificant p values
#was saying for Weibull regression models
```

#### Part c:
Using the same model selection criteria of AIC using the Cox proportional hazards model starting with a hierarchical model with a three way interaction between age, weight and length (and thus all lower order terms) my final model is the null model with a AIC of **128.00** compared to the initial models AIC of **135.44**. Thus, using my final model none of the covariates age, weight and length affect the survival of the ducks. Furthermore, fitting a model with the 3 covariates age, weight and length, none are significant at $\alpha=0.05$. Therefore, from the cox proportional hazard models I conclude that the survival of the ducks does not depend on their age, weight or length using the provided data.
```{r, include=F}
#Fitting cox models
duck.cox.all <- coxph(Surv(Time,event=Observed)~Age*Weight*Length,data=duck.data)
AIC(duck.cox.all)
duck.cox.final.1 <- step(duck.cox.all,data=duck.data,direction = "backward",trace = T)
AIC(duck.cox.final.1)

duck.cox.int2 <- coxph(Surv(Time,event=Observed) ~ Age*Weight+Weight*Length+Age*Length, data = duck.data)
duck.cox.final.2 <- step(duck.cox.int2,data=duck.data,direction = "backward",trace = T)
AIC(duck.cox.final.2)
final.cox <- duck.cox.final.2

summary(coxph(Surv(Time,event=Observed) ~ Age+Weight+Length, data = duck.data))
```

#### Part d:
Plotting the estimated survivor function for a one year old duck with a weight of 1000g and a length of 250cm using my preferred models for part a and c gives the corresponding plot below. I note that these are the same survivor functions as those for any other duck with a different weight, length or age (within the values of the dataset used to produce my model, as we may not be able to extrapolate) as both models don't include the covariates age, weight or length.
```{r,echo=F,fig.align='center',fig.width=10,fig.height=5}
plot(survfit(final.cox),conf.int = F,col="red",
     xlab="Time (days)",
     ylab="Survival probabilty",
     main = "Estimated survival of ducks using \n Weibull and Cox regression")
curve(expr = exp(-(beta_0*x)^alpha), from = 0, to = 63, add = TRUE, col="blue")
legend("topright", legend = c("Cox", "Weibull"),
col = c("red", "blue"), lty = c(1, 1), cex = 0.8)
```
# Question 3:
#### Part a:
```{r,include=F}
## Question 3 
mortality.data <- read.csv("mortality.csv",header=T)
#Part a - calculating crude central mortality rates
mortality.data<- mortality.data%>%
  mutate(cc.mort.rate.male=Male.deaths/Male.exposed,
         cc.mort.rate.female=Female.deaths/Female.exposed)%>%
  rename("Age"="Age..x.")

plot.mort.data.female <- mortality.data%>%
  dplyr::select(Age,cc.mort.rate.female)%>%
  mutate(gender="Female")%>%
  rename("cc.mort.rate"="cc.mort.rate.female")

plot.mort.data.male <- mortality.data%>%
  dplyr::select(Age,cc.mort.rate.male)%>%
  mutate(gender="Male")%>%
  rename("cc.mort.rate"="cc.mort.rate.male")

plot.mort.data <- rbind(plot.mort.data.female,plot.mort.data.male)%>%
  mutate(log.cc.mort.rate=log(cc.mort.rate))
```

Calculating the crude central mortality rates $m_{x}$ for male and female pensioners using $\hat{m_{x}}=\frac{D_{x}}{E_{x}^{C}}$ (where $D_{x}$ is the number of deaths at age $x$ and $E_{x}^{C}$ is the central exposed risk at age $x$) and plotting it on the log scale gives the below graph. In it we can see that $log(m_{x})$ is higher for males than for females at all ages with the exception of ages 63, 95, 96, 102. Thus, overall this graph suggests that $log(m_{x})$ and therefore $m_{x}$, the crude central mortality rate is higher for male pensioners than female pensioners.
```{r,echo=F,fig.align='center',fig.width=10,fig.height=4}
#Plotting
ggplot(aes(x=Age,y=log.cc.mort.rate,colour=gender),data=plot.mort.data)+
  geom_point()+
  xlab("Age (years)")+
  ylab("Estimated log crude central mortality rate")+
  ggtitle("Log crude central mortality rates for members of large pension scheme")
```

#### Part b:
Calculating $q_{x}$ for males and female using both the assumption of constant force of mortality and uniform distribution of deaths within each year of age gives the corresponding life table (using $l_{60}=100000$) below.

```{r,include=F}
#Part b - calculating q_{x} 

#(i) Under the assumption of a constant force of mortality 
mortality.data <- mortality.data %>%
  mutate(qx.const.male = 1-exp(-cc.mort.rate.male),
         qx.const.female = 1-exp(-cc.mort.rate.female),
         qx.unif.male = cc.mort.rate.male/(1+0.5*cc.mort.rate.male),
         qx.unif.female = cc.mort.rate.female/(1+0.5*cc.mort.rate.female))


#under uniform dist of death for male
lx.unif.male <- lx.unif.female <- NULL
lx.unif.male[1] <- lx.unif.female[1] <- 100000
for (i in 1:42){
lx.unif.male[i+1] <- lx.unif.male[i]*(1-mortality.data$qx.unif.male[i])
lx.unif.female[i+1] <- lx.unif.female[i]*(1-mortality.data$qx.unif.female[i])
}

#Under constant force of mortality
lx.const.male <- lx.const.female <- NULL
lx.const.male[1] <- lx.const.female[1] <- 100000
for (i in 1:42){
lx.const.male[i+1] <- lx.const.male[i]*(1-mortality.data$qx.const.male[i])
lx.const.female[i+1] <- lx.const.female[i]*(1-mortality.data$qx.const.female[i])
}

#Creating table of results
table.qx <- data.frame("Age"=seq(from=60,to=102,by=1),
                        "lx.unif.male"=lx.unif.male,
                        "lx.const.male"=lx.const.male,
                        "lx.unif.female"=lx.unif.female,
                        "lx.const.female"=lx.const.female)%>%
  filter(Age %in% c(60,65,70,75,80,85,90,95,100))

```

```{r,include=F}
#creating lifetable not had to screenshot (then include as image) as the format wasn't working when knitting directly 
knitr::kable(table.qx,digits=2,caption = "Life table",col.names = c("Age","$l_{x}$ unif","$l_{x}$ const","$l_{x}$ unif","$l_{x}$ const"))%>%
  row_spec(0,bold=T)%>%
  add_header_above(c(" "=1,"Male"=2,"Female"=2))%>%
  footnote(number = c("unif means the values have been calculated using a uniform distribution of deaths within each year assumption",
                      "const means the values have been calculated using constant force of mortality within each year assumption"))%>%
  kable_styling(position = "center",latex_options = "HOLD_position",full_width=F)
```
```{r,echo=F,fig.align='center',out.width="300px"}
knitr::include_graphics("lifetable.png")
```

#### Part c:

```{r, include=F}
#curate life expectancies
sum(lx.unif.male[2:43])/100000
sum(lx.unif.female[2:43])/100000

#complete (using assumption of uniform death)
sum(lx.unif.male[2:43])/100000 +0.5
sum(lx.unif.female[2:43])/100000 +0.5
```

To calculate the curtate life expectancy $e_{x}$, I used the formula $e_{x} = \frac{1}{l_{x}}\sum_{k=1}^{\infty}l_{x+k}$, then using the assumption of a uniform distribution of deaths within each year the complete life expectancy, $\mathring{e_{x}}$ is $\mathring{e_{x}}=e_{x}+0.5$. Thus, using the calculated life table results above for the uniform distribution assumption the complete and curtate life expectancies for males aged 60 are **22.1** and **21.6** years respectively and for females aged 60 are **25.4** and  **24.9** years respectively (to 1 d.p.).

#### Part d:
```{r,include=F}
#Part d
# comparing to whole population
ELT17 <- read_excel("ELT17.xls")

ELT17.male <- ELT17[,1:4] %>% 
  filter(Age %in% c(0,10,20,30,40,50,60,70,80,90,100))%>%
  rename("lx"="Males","qx"="...3","ex"="...4")%>%
  mutate(lx=as.numeric(lx),
         qx=as.numeric(qx),
         ex=as.numeric(ex),
         Age=as.numeric(Age))

ELT17.female <- ELT17[,c(1,5,6,7)] %>% 
  filter(Age %in% c(0,10,20,30,40,50,60,70,80,90,100))%>%
  rename("lx"="Females","qx"="...6","ex"="...7")%>%
  mutate(lx=as.numeric(lx),
         qx=as.numeric(qx),
         ex=as.numeric(ex),
         Age=as.numeric(Age))

#Formal test (using uniform dist of deaths to calculate mxs )

#For males
ELT17.male <- ELT17.male%>%
  mutate(mxs = qx/(1-0.5*qx))

mort.male.comp <- mortality.data%>%
  filter(Age %in% c(60,70,80,90,100))%>%
  dplyr::select(Male.exposed,Male.deaths,Age)

test.data.male <- inner_join(mort.male.comp,ELT17.male,by="Age")%>%
  mutate(test.stat = ((Male.deaths-Male.exposed*mxs)^2)/Male.exposed*mxs,
         checking.reliable = Male.exposed*mxs)

X2.male <- sum(test.data.male$test.stat)
#calculating p-value
1-pchisq(X2.male,df=5)
#Not significant (p- value is 1 (3 d.p.)) therefore fail to reject null thus for males mx=mxs 
#(death rates the same in this insured population compared to whole population of England and Wales)

#For females
ELT17.female <- ELT17.female%>%
  mutate(mxs = qx/(1-0.5*qx))

mort.female.comp <- mortality.data%>%
  filter(Age %in% c(60,70,80,90,100))%>%
  dplyr::select(Female.exposed,Female.deaths,Age)

test.data.female <- inner_join(mort.female.comp,ELT17.female,by="Age")%>%
  mutate(test.stat = ((Female.deaths-Female.exposed*mxs)^2)/Female.exposed*mxs,
         checking.reliable = Female.exposed*mxs)

X2.female <- sum(test.data.female$test.stat)
#calculating p-value
1-pchisq(X2.female,df=5)
#Not significant (p-value is 0.99) therefore fail to reject null thus for females mx=mxs 
#(death rates the same in this insured population compared to whole population of England and Wales)
```

To compare the death rates in the insured population to the whole population of England and Wales I used the chi-squared test. For both males and females all values of $E_{x}^{C}m_{x}^{S}$ were greater than 1 and for males 1 value (20%) was less than 5 ($E_{100}^{C}m_{100}^{S}=1.55$), thus for males the chi-squared test conclusion was reliable. For females 2 values (40%) were less than 5 ($E_{60}^{C}m_{60}^{S}=3.90$ and $E_{100}^{C}m_{100}^{S}=3.22$), thus the conclusions from the chi-squared test were less reliable for females. For males the chi-squared test statistic was 0.042 corresponding to a non-significant p-value of **1** (3 d.p.) and for females the chi-squared test statistic was 0.533 corresponding to a non-significant p-value of **0.991** (3 d.p.). Therefore, for both males and females there is no evidence the death rates are different between the insured population and the whole population of England and Wales using the chi-squared test. I note that even though the results are less reliable for females, due to the very high p-value (and thus very strong evidence) it is still very unlikely that deaths rates for females are different between the insured and whole population.

#### Part e:
```{r,include=F}
#Part e - use a Gompertz log-linear model to produce a set of graduated central mortality rates from the crude mortality data

#From crude mortality data fitting gompertz regression model
grad.gompertz <- glm(Male.deaths ~ Age+offset(log(Male.exposed)),family = poisson(link=log),data=mortality.data)
summary(grad.gompertz)

#calculating gompertz estimated mx
gomp.mx <- data.frame("Age"=seq(from=60,to=102,by=1),
                      "mx"=exp(grad.gompertz$coefficients[1]+seq(from=60,to=102,by=1)*grad.gompertz$coefficients[2]),
                      "type"=rep("Gompertz",43))
#sorting data to create plot
comp.mort.plot.data <- mortality.data%>%
  dplyr::select(Age,cc.mort.rate.male)%>%
  rename("mx"="cc.mort.rate.male")%>%
  mutate("type"="Crude")

comp.mort.plot.data <- rbind(gomp.mx,comp.mort.plot.data)%>%
  mutate(log.mx=log(mx),
         shape2=c(rep(16,43),rep(17,43)))
```
Using a Gompertz log-linear model to produce a set of graduated central mortality rates from the crude mortality data and plotting it on the log scale gives the below graph. It shows a good fit of the graduated rates to the crude rates as the graduated rates are consistently close to the crude rates on the log scale.
```{r,echo=F,fig.align='center',fig.width=10,fig.height=3.5}
ggplot(aes(x=Age,y=log.mx,colour=type),data=comp.mort.plot.data)+
  geom_point(shape=comp.mort.plot.data$shape2)+
  geom_line()+
  xlab("Age (years)")+
  ylab(expression("log m"[x]))+
  ggtitle("Comparing crude and graduated central mortality rates for the male population")+
  guides(color = guide_legend(
    override.aes=list(shape = c(17,16))))
```