Survival analysis: Celiac patients with Colorectal Cancer (CRC)
Pojection description
In this project we investigate if patients with celiac disease are at a higher risk of future colorectal cancer (CRC) compared to controls. The start of CRC is defined as the day of the first ICD code starting by C18, C19, or C20. End of follow-up of the study is 2019-12-31. The design is a matched cohort study, where each celiac patient was matched to maximum 2 non-celiac persons on age and sex at start of follow-up.
#The data can be found in 3 files
celiac.txt contains 7 variables:
Id.number: The number identifying the celiac patient
Start.of.flwp: Start of follow-up
Gender: 1= Male, 2= Female
Age: The person’s age
Birthday: The person’s birthday
Date.of.Death: The day the person died
Date.of.Migration: The day the person migrated
nonceliac.txt contains 4 variables:
Id.number: The number identifying the non-celiac person
Celiac.number: The number identifying the celiac patient the person is matched to
Date.of.Death: The day the person died
Date.of.Migration: The day the person migrated.
ICD.txt contains 3 variables
Id.number: The number identifying the person
ICD.codes: The list of ICD codes assigned at a specific date.
Date.of.ICD.codes: The date the ICD codes were assigned
Project Tasks:
Estimate the following measures for both Celiac and Nonceliac patients
* Number of persons * Number of person years * Number of events * Incidence rate per 1000 person-year(PY) (95% CI) * Unadjusted HR** (95% CI) * Adjusted HR (95% CI) * Unadjusted HR (95% CI) * Adjusted HR (95% CI) * Present the results in a table
Show the code
library(survival) #library(survminer)library(simstudy) #the package for non-PH modelslibrary(ggplot2)library(dplyr)library(tidyverse)library(viridis)library(ggrepel)library(ggthemes)library(lubridate) # to handle the dates datalibrary(anytime) #handles the dates data in any formatlibrary(fmsb) # for 95% CI of incidence rate
1. Upload the data files
Show the code
celiac <-read.csv("~/Desktop/Celiac-main/celiac.txt")nonceliac <-read.csv("~/Desktop/Celiac-main/nonceliac.txt")ICD <-read.csv("~/Desktop/Celiac-main/ICD.txt")# peek into the datahead(celiac)
id.number Start.of.flwp Gender Age Birthday Date.of.Death Date.of.Migration
1 424 20081106 2 72 1936-05-15 NA NA
2 881 20070613 1 21 1986-02-15 NA NA
3 1739 20040609 2 12 1991-12-15 NA NA
4 2462 20000306 2 19 1980-09-15 NA NA
5 2652 20040712 2 46 1957-10-15 NA NA
6 2785 20031001 1 17 1986-01-15 NA NA
Show the code
head(nonceliac)
id.number Celiac.number Date.of.Death Date.of.Migration
1 7659518 424 NA NA
2 1038521 424 NA NA
3 9265189 881 NA NA
4 10658482 881 NA NA
5 8460019 1739 NA NA
6 10980993 1739 NA NA
# Find out the CRC events from ICD file both for celiac and nonceliac ICD_CRC_events <- ICD %>%filter(grepl('C18', ICD.codes))%>%rbind( ICD %>%filter(grepl('C19', ICD.codes)) )%>%rbind( ICD %>%filter(grepl('C20', ICD.codes)) )%>%mutate(Date.of.ICD.codes=as.Date(Date.of.ICD.codes))head(ICD_CRC_events)
We follow the directions given in the project description
Show the code
#Find first CRC events both for celiac and nonceliac groups celiac_CRC <- celiac%>%filter(id.number %in%unique(ICD_CRC_events$id.number) ) %>%merge(ICD_CRC_events%>%filter(id.number %in%unique(celiac$id.number))%>%group_by(id.number)%>%summarise(Date.of.CRC.onset =min(Date.of.ICD.codes) ),by.x =c("id.number"), by.y =c("id.number"),all.x =TRUE)%>%filter(Start.of.flwp <= Date.of.CRC.onset)%>%# exclude the CRC cases before the follow-upmutate(Date.of.flwp.end =do.call(pmin, c(select(., c(6:8)), na.rm =TRUE)) )%>%# Case-wise End date of follow-up conditioned on Death/Migration/CRC case#filter(Start.of.flwp>Date.of.flwp.end)%>%mutate(person.time.years =as.numeric(Date.of.CRC.onset-Start.of.flwp)/365.25)%>%# calculate person-years timemutate(CRC.status="Yes") #mutate the variable for the CRC eventsceliac_nonCRC <- celiac %>%filter(id.number %in%setdiff(unique(celiac$id.number),unique(ICD_CRC_events$id.number) ))%>%mutate(Date.of.CRC.onset=as.Date(NA))%>%mutate(Date.of.flwp.end =as.Date("2019-12-31") )%>%#filter(Start.of.flwp <= Date.of.CRC.onset)%>%mutate(Date.of.flwp.end =do.call(pmin, c(select(., c(6:9)), na.rm =TRUE)) )%>%#filter(Start.of.flwp>Date.of.flwp.end)%>%mutate(person.time.years =as.numeric(Date.of.flwp.end-Start.of.flwp)/365.25)%>%mutate(CRC.status="No") #mutate the variable for the CRC eventsceliac_final_df <-rbind(celiac_CRC, celiac_nonCRC)%>%filter(Start.of.flwp<Date.of.flwp.end)%>%# filter out the events with negative follow-up timemutate(CRC.status=as.factor(CRC.status))%>%mutate(exposure.status="1") #mutate the variable for the exposure status (celiac~1, nonceliac~0)dim(celiac_CRC)
[1] 113 11
Show the code
dim(celiac_nonCRC)
[1] 12199 11
Show the code
dim(celiac_final_df)
[1] 12310 12
5. Calculate the required estimates for Celiac patients
# OR using the following linesSE_celiac =1.96*sqrt((total_cases_celiac*1000)/person_time_celiac^2 )CIL_95_celiac_IR <- Incidence_rate_celiac-SE_celiacCIU_95_celiac_IR <- Incidence_rate_celiac+SE_celiac
6. Construct the data frame for nonceliac (nonexposed) group based on the instructions
events,
person years time/1000
Incidence rate
Show the code
nonceliac_CRC <-nonceliac %>%merge(celiac[,c(1:5)], by.x ="Celiac.number" , by.y ="id.number" ,all.x =TRUE )%>%filter(id.number %in%unique(ICD_CRC_events$id.number) ) %>%merge( ICD_CRC_events%>%filter(id.number %in%unique(nonceliac$id.number)) %>%group_by(id.number)%>%summarise(Date.of.CRC.onset =min(Date.of.ICD.codes)),by.x =c("id.number"), by.y =c("id.number"),all.y =TRUE )%>%select(c(1,5:8,3,4,9))%>%filter(Start.of.flwp <= Date.of.CRC.onset)%>%mutate(Date.of.flwp.end =do.call(pmin, c(select(., c(6:8)), na.rm =TRUE)) )%>%filter(Start.of.flwp<Date.of.flwp.end)%>%mutate(person.time.years =as.numeric(Date.of.CRC.onset-Start.of.flwp)/365.25)%>%mutate(CRC.status="Yes")nonceliac_nonCRC <- nonceliac%>%#match the nonceliac patients with celiac with celiac numbermerge(celiac[,c(1:5)], by.x ="Celiac.number" , by.y ="id.number" ,all.x =TRUE )%>%#filter the nonceliacs found in the ICD datafilter(id.number %in%setdiff(unique(nonceliac$id.number),unique(ICD_CRC_events$id.number) ))%>%#mutate 'Date.of.CRC.onset= NA' variablemutate(Date.of.CRC.onset=as.Date(NA))%>%#mutate Followup end date variable with fixed date(2019-12-31)mutate(Date.of.flwp.end =as.Date("2019-12-31") )%>%select(c(2,5:8,3:4,9,10))%>%#mutate(Date.of.flwp.end =do.call(pmin, c(select(., c(6:9)), na.rm =TRUE)) )%>%filter(Start.of.flwp<Date.of.flwp.end)%>%mutate(person.time.years =as.numeric(Date.of.flwp.end-Start.of.flwp)/365.25)%>%mutate(CRC.status="No")nonceliac_final_df <-rbind(nonceliac_CRC, nonceliac_nonCRC)%>%filter(Start.of.flwp<Date.of.flwp.end)%>%mutate(CRC.status=as.factor(CRC.status))%>%mutate(exposure.status="0")
7. Calculate the required estimates (events,person years time /1000 and Incidence rate) for nonceliac persons
Call:
survival::coxph(formula = Surv(tstart, flwp.time, event) ~ exposure.status *
strata(tgroup) + Age + Gender, data = df_celiac_nonceliac_stratfied)
n= 106558, number of events= 307
coef exp(coef) se(coef) z
exposure.status1 2.664553 14.361534 0.532301 5.006
Age 0.072778 1.075492 0.003593 20.254
Gender2 -0.153608 0.857608 0.114911 -1.337
exposure.status1:strata(tgroup)tgroup=2 -2.480087 0.083736 0.569740 -4.353
exposure.status1:strata(tgroup)tgroup=3 -3.037078 0.047975 0.560359 -5.420
Pr(>|z|)
exposure.status1 5.57e-07 ***
Age < 2e-16 ***
Gender2 0.181
exposure.status1:strata(tgroup)tgroup=2 1.34e-05 ***
exposure.status1:strata(tgroup)tgroup=3 5.96e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95
exposure.status1 14.36153 0.06963 5.05947
Age 1.07549 0.92981 1.06794
Gender2 0.85761 1.16603 0.68466
exposure.status1:strata(tgroup)tgroup=2 0.08374 11.94230 0.02741
exposure.status1:strata(tgroup)tgroup=3 0.04797 20.84425 0.01600
upper .95
exposure.status1 40.7658
Age 1.0831
Gender2 1.0742
exposure.status1:strata(tgroup)tgroup=2 0.2558
exposure.status1:strata(tgroup)tgroup=3 0.1439
Concordance= 0.896 (se = 0.006 )
Likelihood ratio test= 715.7 on 5 df, p=<2e-16
Wald test = 445.5 on 5 df, p=<2e-16
Score (logrank) test = 714.5 on 5 df, p=<2e-16
Show the code
cox.zph(HR_CRC_AgeGender_adjusted_stratified)
chisq df p
exposure.status 0.623 1 0.43
Age 2.464 1 0.12
Gender 0.453 1 0.50
exposure.status:strata(tgroup) 0.570 2 0.75
GLOBAL 8.436 5 0.13
Show the code
#ggcoxzph(HR_CRC_AgeGender_adjusted_stratified)
Conclusion/ Interpretations
Overall, we observed an increased risk of CRC among celiac patients compared to the nonceliac individuals which was statistically non-significant. The unadjusted and adjusted (for age and sex) hazards were approximately 15% and 16% higher among celiac patients compared to the nonceliac individuals (Table-1). However, the proportional-hazard (PH) assumptions for the cox regression model were not fulfilled (p<0.05) when tested with Schoenfeld test. To tackle this, we stratified the follow-up duration into three intervals. Subsequently, the CRC hazards in exposed group increased significantly (p<0.05) up to 15-times (Table-1) compared to the unexposed group for the first time-interval. In contrast, the CRC hazard in celiac patients was significantly lower (p <0.05) compared to the unexposed for other time-intervals. Similar findings were observed after adjusting for age and sex (Table-1). The results should be interpreted carefully due to large uncertainties in the estimates.