Description: Celiac patients with Colorectal Cancer (CRC)
In this project we investigate if patients with celiac disease are at a higher risk of future colorectal cancer (CRC) compared to controls. The start of CRC is defined as the day of the first ICD (International Classification of Diseases) code starting by C18, C19, or C20. End of follow-up of the study is 2019-12-31. The study design is a matched cohort study , where each celiac patient was matched to maximum 2 non-celiac persons on age and gender at start of follow-up.
Definitions
The International Classification of Diseases (ICD) codes used to identify colorectal cancer and celiac disease in patients are as follows:
ICD-10 Codes:
C18: Malignant neoplasm of colon
C19: Malignant neoplasm of rectosigmoid junction
C20: Malignant neoplasm of rectum
C21: Malignant neoplasm of anus and anal canal
Data
The data comes in the following three files:
celiac.txt
nonceliac.txt
ICD.txt
Variables
The file named celiac.txt consists of even variables as follows:
Id.number: The number identifying the celiac patient
Start.of.flwp: Start of follow-up
Gender: 1= Male, 2= Female
Age: The person’s age
Birthday: The person’s birthday
Date.of.Death: The day the person died
Date.of.Migration: The day the person migrated
The file named nonceliac.txt contains the following four variables:
Id.number: The number identifying the non-celiac person
Celiac.number: The number identifying the celiac patient the person is matched to
Date.of.Death: The day the person died
Date.of.Migration: The day the person migrated.
The file named ICD.txt contains the following three variables:
Id.number: The number identifying the person
ICD.codes: The list of ICD codes assigned at a specific date.
Date.of.ICD.codes: The date the ICD codes were assigned
Tasks
Estimate the following measures both for cases, i.e. celiac patients and controls, i.e., non-celiac patients
Number of persons per group (N)
Number of person years per group (PY)
Number of events per group (Events)
Incidence rate per 1000 person-year 95% confidence intervals
TheHazard ratios (HR) with 95% confidence intervals
a. without adjusting for Age and Gender variables
b. adjusting for Age and Gender variables
Consolidate the above results into a table
Data wrangling and cleaning
Show the code
library(survival) #library(survminer)library(simstudy) #the package for non-PH modelslibrary(ggplot2)library(dplyr)library(tidyverse)library(viridis)library(ggrepel)library(ggthemes)library(lubridate) # to handle the dates datalibrary(anytime) #handles the dates data in any formatlibrary(fmsb) # for 95% CI of incidence ratelibrary(gt)
Upload the data files
Show the code
celiac <-read.csv("~/Desktop/Celiac-main/celiac.txt")nonceliac <-read.csv("~/Desktop/Celiac-main/nonceliac.txt")ICD <-read.csv("~/Desktop/Celiac-main/ICD.txt")# peek into the dataglimpse(celiac)
# Find out the CRC events from ICD file both for celiac and nonceliac ICD_CRC_events <- ICD %>%filter(grepl('C18', ICD.codes))%>%rbind( ICD %>%filter(grepl('C19', ICD.codes)) )%>%rbind( ICD %>%filter(grepl('C20', ICD.codes)) )%>%mutate(Date.of.ICD.codes=as.Date(Date.of.ICD.codes))head(ICD_CRC_events)
4. Construct the data frame for celiac patients by following the instructions given in the project description
Show the code
#Find first CRC events both for celiac and nonceliac groups celiac_CRC <- celiac%>%filter(id.number %in%unique(ICD_CRC_events$id.number) ) %>%merge(ICD_CRC_events%>%filter(id.number %in%unique(celiac$id.number))%>%group_by(id.number)%>%summarise(Date.of.CRC.onset =min(Date.of.ICD.codes) ),by.x =c("id.number"), by.y =c("id.number"),all.x =TRUE)%>%filter(Start.of.flwp <= Date.of.CRC.onset)%>%# exclude the CRC cases before the follow-upmutate(Date.of.flwp.end =do.call(pmin, c(select(., c(6:8)), na.rm =TRUE)) )%>%mutate(person.time.years =as.numeric(Date.of.CRC.onset-Start.of.flwp)/365.25)%>%# calculate person-years timemutate(CRC.status="Yes") #mutate the variable for the CRC eventsceliac_nonCRC <- celiac %>%filter(id.number %in%setdiff(unique(celiac$id.number),unique(ICD_CRC_events$id.number) ))%>%mutate(Date.of.CRC.onset=as.Date(NA))%>%mutate(Date.of.flwp.end =as.Date("2019-12-31") )%>%#filter(Start.of.flwp <= Date.of.CRC.onset)%>%mutate(Date.of.flwp.end =do.call(pmin, c(select(., c(6:9)), na.rm =TRUE)) )%>%#filter(Start.of.flwp>Date.of.flwp.end)%>%mutate(person.time.years =as.numeric(Date.of.flwp.end-Start.of.flwp)/365.25)%>%mutate(CRC.status="No") #mutate the variable for the CRC eventsceliac_final_df <-rbind(celiac_CRC, celiac_nonCRC)%>%filter(Start.of.flwp<Date.of.flwp.end)%>%# filter out the events with negative follow-up timemutate(CRC.status=as.factor(CRC.status))%>%mutate(exposure.status="1") #mutate the variable for the exposure status (celiac~1, nonceliac~0)dim(celiac_CRC)
[1] 113 11
Show the code
dim(celiac_nonCRC)
[1] 12199 11
Show the code
dim(celiac_final_df)
[1] 12310 12
Modelling and analysis
5.Calculate the required estimates ( events, person years time/1000 and incidence rates ) for celiac patients
# OR using the following linesSE_celiac =1.96*sqrt((total_cases_celiac*1000)/person_time_celiac^2 )CIL_95_celiac_IR <- Incidence_rate_celiac-SE_celiacCIU_95_celiac_IR <- Incidence_rate_celiac+SE_celiac
6. Construct the data frame for non-celiac patients based on the instructions
Show the code
nonceliac_CRC <-nonceliac %>%merge(celiac[,c(1:5)], by.x ="Celiac.number" , by.y ="id.number",all.x =TRUE )%>%filter(id.number %in%unique(ICD_CRC_events$id.number) ) %>%merge( ICD_CRC_events%>%filter(id.number %in%unique(nonceliac$id.number)) %>%group_by(id.number)%>%summarise(Date.of.CRC.onset =min(Date.of.ICD.codes)),by.x =c("id.number"), by.y =c("id.number"),all.y =TRUE )%>%select(c(1,5:8,3,4,9))%>%filter(Start.of.flwp <= Date.of.CRC.onset)%>%mutate(Date.of.flwp.end =do.call(pmin, c(select(., c(6:8)), na.rm =TRUE)) )%>%filter(Start.of.flwp<Date.of.flwp.end)%>%mutate(person.time.years =as.numeric(Date.of.CRC.onset-Start.of.flwp)/365.25)%>%mutate(CRC.status="Yes")nonceliac_nonCRC <- nonceliac%>%#match the nonceliac patients with celiac with celiac numbermerge(celiac[,c(1:5)], by.x ="Celiac.number" , by.y ="id.number",all.x =TRUE )%>%#filter the nonceliacs found in the ICD datafilter(id.number %in%setdiff(unique(nonceliac$id.number),unique(ICD_CRC_events$id.number) ) )%>%#mutate 'Date.of.CRC.onset= NA' variablemutate(Date.of.CRC.onset=as.Date(NA))%>%#mutate Followup end date variable with fixed date(2019-12-31)mutate(Date.of.flwp.end =as.Date("2019-12-31") )%>%select(c(2,5:8,3:4,9,10))%>%mutate(Date.of.flwp.end =do.call(pmin, c(select(., c(6:9)), na.rm =TRUE)) )%>%filter( Start.of.flwp < Date.of.flwp.end)%>%mutate( person.time.years =as.numeric(Date.of.flwp.end-Start.of.flwp)/365.25)%>%mutate(CRC.status="No")nonceliac_final_df <-rbind(nonceliac_CRC, nonceliac_nonCRC)%>%filter(Start.of.flwp<Date.of.flwp.end)%>%mutate(CRC.status=as.factor(CRC.status))%>%mutate(exposure.status="0")
7. Calculate the required estimates (events,person years time /1000 and incidence rate) for nonceliac controls
chisq df p
exposure.status 21.949 1 2.8e-06
Age 2.814 1 0.093
Gender 0.489 1 0.484
GLOBAL 24.748 3 1.7e-05
Show the code
#(PH_assump_CRC_AgeGender_Adjusted)
Stratify the followup duration time
Show the code
#Divide the followup duration into 3 intervals [0,200), [200, 1650), [1650, 7302)#Here, we use 'surveSplit' function from 'simstudy' packagedf_celiac_nonceliac_stratfied <-survSplit(Surv(flwp.time, as.numeric(CRC.status)) ~., data= df_celiac_nonceliac, cut=c(200,1650),episode="tgroup")head(df_celiac_nonceliac_stratfied)
Call:
survival::coxph(formula = Surv(tstart, flwp.time, event) ~ exposure.status *
strata(tgroup) + Age + Gender, data = df_celiac_nonceliac_stratfied)
n= 106558, number of events= 307
coef exp(coef) se(coef) z
exposure.status1 2.664553 14.361534 0.532301 5.006
Age 0.072778 1.075492 0.003593 20.254
Gender2 -0.153608 0.857608 0.114911 -1.337
exposure.status1:strata(tgroup)tgroup=2 -2.480087 0.083736 0.569740 -4.353
exposure.status1:strata(tgroup)tgroup=3 -3.037078 0.047975 0.560359 -5.420
Pr(>|z|)
exposure.status1 5.57e-07 ***
Age < 2e-16 ***
Gender2 0.181
exposure.status1:strata(tgroup)tgroup=2 1.34e-05 ***
exposure.status1:strata(tgroup)tgroup=3 5.96e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95
exposure.status1 14.36153 0.06963 5.05947
Age 1.07549 0.92981 1.06794
Gender2 0.85761 1.16603 0.68466
exposure.status1:strata(tgroup)tgroup=2 0.08374 11.94230 0.02741
exposure.status1:strata(tgroup)tgroup=3 0.04797 20.84425 0.01600
upper .95
exposure.status1 40.7658
Age 1.0831
Gender2 1.0742
exposure.status1:strata(tgroup)tgroup=2 0.2558
exposure.status1:strata(tgroup)tgroup=3 0.1439
Concordance= 0.896 (se = 0.006 )
Likelihood ratio test= 715.7 on 5 df, p=<2e-16
Wald test = 445.5 on 5 df, p=<2e-16
Score (logrank) test = 714.5 on 5 df, p=<2e-16
Show the code
cox.zph(HR_CRC_AgeGender_adjusted_stratified)
chisq df p
exposure.status 0.623 1 0.43
Age 2.464 1 0.12
Gender 0.453 1 0.50
exposure.status:strata(tgroup) 0.570 2 0.75
GLOBAL 8.436 5 0.13
Show the code
#ggcoxzph(HR_CRC_AgeGender_adjusted_stratified)
Table
Now, construct the final table of the results
Table 1 combines the completed tasks of the project.
Show the code
# Data preparationsummary_table <-tibble(Group =c("Celiac Patients", "Non-Celiac Controls"),`Number of Persons`=c(nrow(celiac_final_df), nrow(nonceliac_final_df)),`Number of Person-Years`=c(person_time_celiac, person_time_nonceliac),`Number of Events`=c(total_cases_celiac, total_cases_nonceliac),`Incidence Rate per 1000 PY (95% CI)`=c(paste0(round(Incidence_rate_celiac, 2), " (", round(CIL_95_celiac_IR, 2), " - ", round(CIU_95_celiac_IR, 2), ")"),paste0(round(Incidence_rate_nonceliac, 2), " (", round(CIL_95_nonceliac_IR, 2), " - ", round(CIL_95_nonceliac_IR, 2), ")") ),`Unadjusted HR (95% CI)`=c(paste0(round(summary(HR_CRC_Unadjusted)$coefficients[1, "exp(coef)"], 2), " (", round(summary(HR_CRC_Unadjusted)$conf.int[1, "lower .95"], 2), " - ", round(summary(HR_CRC_Unadjusted)$conf.int[1, "upper .95"], 2), ")"),"" ),`Adjusted HR (95% CI)`=c(paste0(round(summary(HR_CRC_AgeGender_adjusted)$coefficients[1, "exp(coef)"], 2), " (", round(summary(HR_CRC_AgeGender_adjusted)$conf.int[1, "lower .95"], 2), " - ", round(summary(HR_CRC_AgeGender_adjusted)$conf.int[1, "upper .95"], 2), ")"),"" ))# Create the table using gtgt_table <- summary_table %>%gt() %>%tab_header(title ="Summary of Colorectal Cancer Risk among Celiac and Non-Celiac Patients",subtitle ="Based on Survival Analysis" ) %>%cols_label(`Number of Persons`="N",`Number of Person-Years`="PY",`Number of Events`="Events",`Incidence Rate per 1000 PY (95% CI)`="Incidence Rate / 1000 PY (95% CI)",`Unadjusted HR (95% CI)`="Unadjusted HR (95% CI)",`Adjusted HR (95% CI)`="Adjusted HR (95% CI)" ) %>%fmt_number(columns =c(`Number of Person-Years`, `Number of Events`, `Unadjusted HR (95% CI)`, `Adjusted HR (95% CI)`),decimals =2 ) %>%cols_align(align ="center",columns =everything() )# Display the tablegt_table
Table 1: Results
Summary of Colorectal Cancer Risk among Celiac and Non-Celiac Patients
Based on Survival Analysis
Group
N
PY
Events
Incidence Rate / 1000 PY (95% CI)
Unadjusted HR (95% CI)
Adjusted HR (95% CI)
Celiac Patients
12310
137,267.82
113.00
0.82 (0.82 - 0.83)
1.16 (0.92 - 1.46)
1.15 (0.91 - 1.44)
Non-Celiac Controls
24358
272,313.71
194.00
0.71 (0.72 - 0.72)
Interpretations and conclusions
Overall, we observed an increased risk of CRC among celiac patients compared to the nonceliac individuals which was statistically non-significant. The unadjusted and adjusted (for age and sex) hazards were approximately 15% and 16% higher among celiac patients compared to the non-celiac individuals (Table 1). However, the proportional-hazard (PH) assumptions for the cox regression model were not fulfilled (p<0.05) when tested with Schoenfeld test. To tackle this, we stratified the follow-up duration into three intervals. Subsequently, the CRC hazards in exposed group increased significantly (p<0.05) up to 15-times (Table 1) compared to the unexposed group for the first time-interval. In contrast, the CRC hazard in celiac patients was significantly lower (p <0.05) compared to the unexposed for other time-intervals. Similar findings were observed after adjusting for age and sex (Table 1). The results should be interpreted carefully due to large uncertainties in the estimates.