Estimate Counterfactual Survival or Failure Probabilities for Levels of a Continuous Variable
curve_cont.Rd
This function can be utilized to estimate counterfactual survival curves or cumulative incidence functions (CIF) for specific values of a continuous covariate.
Usage
curve_cont(data, variable, model, horizon,
times, group=NULL, cause=1, cif=FALSE,
contrast="none", reference="km", ref_value=NULL,
event_time=NULL, event_status=NULL,
conf_int=FALSE, conf_level=0.95,
n_boot=300, n_cores=1,
na.action=options()$na.action,
return_boot=FALSE, ...)
Arguments
- data
A
data.frame
containing all required variables.- variable
A single character string specifying the continuous variable of interest, for which the survival curves should be estimated. This variable has to be contained in the
data.frame
that is supplied to thedata
argument.- model
A model describing the time-to-event process (such as an
coxph
model). Needs to includevariable
as an independent variable. It also has to have an associatedpredictRisk
method. See?predictRisk
for more details.- horizon
A numeric vector containing a range of values of
variable
for which the survival curves should be calculated.- times
A numeric vector containing points in time at which the survival probabilities should be calculated.
- group
An optional single character string specifying a factor variable in
data
. When used, the regression standardization is performed conditional on this factor variable, meaning that one estimate is returned for each level of the factor variable. See details for a better description. Set toNULL
(default) to use no grouping variable.- cause
The cause of interest. In standard survival data with only one event type, this should be kept at 1. For data with multiple failure types, this argument should be specified. In addition, the
cif
argument should be set toTRUE
in those cases.- cif
Whether to calculate the cumulative incidence (CIF) instead of the survival probability. If multiple failure types are present, the survival probability cannot be estimated in an unbiased way. In those cases, this argument should always be set to
TRUE
.- contrast
Defines what kind of estimate should be returned. Can be either
"none"
(default),"diff"
or"ratio"
. When"none"
is used, it simply returns the counterfactual survival probabilities (or CIF ifcif=TRUE
) without any further calculations. If"diff"
or"ratio"
is used instead, the difference or ratio to some reference value will be returned. See argument below.- reference
Defines what kind of reference value to use when estimating causal contrasts. Only used if
contrast!="none"
, ignored otherwise. Can be either"km"
(using standard Kaplan-Meier estimates as reference) or"value"
(using the g-computation estimates of a specific value of thevariable
as reference). To specify which value to use when usingreference="value"
, theref_value
argument should be used. See details for more information.- ref_value
A single number corresponding to the reference value used when estimating causal contrasts using
reference="value"
. See details.- event_time
A single character string specifying the time until the occurrence of the event of interest or
NULL
(default). Only used when estimating causal contrasts withreference="km"
, ignored otherwise. If specified, this variable has to be contained in thedata.frame
that is supplied to thedata
argument.- event_status
A single character string specifying the status of the event of interest or
NULL
(default). Only used when estimating causal contrasts withreference="km"
, ignored otherwise. If specified, this variable has to be contained in thedata.frame
that is supplied to thedata
argument.- conf_int
Whether to calculate point-wise confidence intervals or not. If
TRUE
,n_boot
bootstrap samples are drawn, the g-computation step is performed on each bootstrap sample and the confidence intervals are calculated using the percentile method from these results. Can get very slow if the dataset is large or there are many values inhorizon
ortimes
. Usingn_cores
can speed up the process a lot.- conf_level
A number specifying the confidence level of the bootstrap confidence intervals. Ignored if
conf_int=FALSE
.- n_boot
A single integer specifying how many bootstrap repetitions should be performed. Ignored if
conf_int=FALSE
.- n_cores
The number of processor cores to use when performing the calculations. If
n_cores=1
(default), single threaded processing is used. Ifn_cores > 1
the foreach package and the doParallel package are used to run the calculations onn_cores
in parallel. This might speed up the runtime considerably when it is initially slow.- na.action
How missing values should be handled. Can be one of:
na.fail
,na.omit
,na.pass
,na.exclude
or a user-defined custom function. Also accepts strings of the function names. See?na.action
for more details. By default it uses the na.action which is set in the global options by the respective user.- return_boot
Either
TRUE
orFALSE
(default). IfTRUE
(andconf_int=TRUE
) this function will return the individual bootstrap estimates instead of the usual estimates. This may be useful to perform tests or calculate other type of bootstrap confidence intervals manually.- ...
Further arguments passed to
predictRisk
.
Details
This function is used internally in all plot functions included in this R-package and generally does not need to be called directly by the user. It can however be used to get specific values or as a basis to create custom plots not included in this package. Below we give a small introduction to what this function does. A more detailed description of the underlying methodology can be found in our article on this subject (Denz & Timmesfeld 2022).
Target Estimand (default)
By default (contrast="none"
) this function tries to estimate the survival probability at times
that would have been observed if every individual in the sample had received a value of horizon
in the variable
. Let \(Z\) be the continuous variable we are interested in. Let \(T\) be the time until the occurrence of the event of interest. Under the potential outcome framework, there is an uncountably infinite amount of potential survival times \(T^{(Z=z)}\), one for each possible value of \(Z\). The target estimand is then defined as:
$$S_{z}(t) = E(I(T^{(Z=z)} > t))$$
If we additionally consider a categorical group
variable \(D\), the target estimand is similarly defined as:
$$S_{zd}(t) = E(I(T^{(Z=z, D=d)} > t))$$
where \(T^{(Z=z, D=d)}\) is the survival time that would have been observed if the individual had received both \(Z = z\) and \(D = d\).
Target Estimand (using contrasts)
If contrasts are used (contrast!="none"
), target estimands based on \(S_{z}(t)\) or \(S_{zd}(t)\) are used instead. When using contrast="diff"
and reference="km"
, the target estimand is simply the difference between the observed survival probability in the entire sample (denoted by \(S(t)\), estimated using a Kaplan-Meier estimator) and the counterfactual survival probability:
$$\Delta_{KM}(t, z) = S(t) - S_z(t)$$
If contrast="ratio"
is used instead, the substraction sign is simply replace by a division. If group
was specified, \(S(t)\) is replaced by \(S_d(t)\) (a stratified Kaplan-Meier estimator) and \(S_z(t)\) is replaced by \(S_{zd}(t)\). Instead of using a Kaplan-Meier estimator as reference
, one may also use a specific \(S_z(t)\) as reference using reference="value"
and setting ref_value
to the \(Z\) that should be used. All of these causal contrasts and their implications are described in detail in the appendix of our article on this topic (Denz & Timmesfeld 2022).
Estimation Methodology
G-Computation, also known as the Corrected Group Prognosis method, Direct-Standardization or G-Formula, is used internally to estimate the counterfactual survival probability or CIF for values of a continuous variable. This is done by setting the variable
to a specific value for all rows in data
first. Afterwards, the model
is used to predict the survival probability of each individual at all times
, given the value of variable
and their other observed covariates (which are included in the model as independent variables). These estimates are then averaged for each time point. This procedure is repeated for every value in horizon
. If a group
is supplied, these calculations are repeated for every possible value in group
, with the estimated individual survival probabilities also being conditional on that value.
To obtain valid estimates of the target estimand using this function, the fundamental causal identifiability assumptions have to be met. Those are described in detail in our article on this topic (Denz & Timmesfeld 2022). If those assumptions are not met, the estimates may only be used to showcase simple associations. They cannot be endowed with a counterfactual interpretation in this case.
Supported Models
This function relies on the predictRisk
function from the riskRegression package to create the covariate and time specific estimates of the probabilities. All models with an associated predictRisk
method may be used in this function. This includes a variety of models, such as the Cox proportional hazards regression model and the aalen additive hazards model. If the model that should be used has no predictRisk
method, the user either needs to write their own predictRisk
method or contact the maintainers of the riskRegression package.
Bootstrap Confidence Intervals
By using conf_int=TRUE
, bootstrap confidence intervals may be estimated. This will draw n_boot
samples of size nrow(data)
with replacement from data
in the first step. Afterwards, the model
is fit to each of those bootstrap samples and the curve_cont
function is recursively called to obtain the estimates of interests for each sample. Using the percentile approach, the bootstrap confidence intervals are calculated. This requires that the model
object contains a call
parameter, because the update()
function is used internally.
Computational Complexity
If many values are included in times
, horizon
or both and/or data
has a lot of rows, this function may become very slow. Since bootstrapping relies on sampling with replacement from the original data
and repeating the entire procedure n_boot
times, this also dramatically increases the runtime. Parallel processing may be used to speed up the computations. This can be done by simply setting n_cores
to values higher than 1.
Missing Data
Currently, this function does not support advanced handling of missing data. The data.frame
supplied to data
should contain no missing data in the relevant columns. To achieve this, the na.action
argument can be set to "na.omit"
, which will remove all rows that do contain missing data.
Competing Events
By supplying a model
that directly takes into account competing events and using the cause
argument, the user may also use the functionality offered in this package to create plots in this setting. Internally, the predictRisk
method will then be used to estimate the conditional cause-specific cumulative incidence function, which will then be used to carry out the g-computation step explained above. However the underlying target estimand of this procedure is dependent on which kind of model was supplied. It is therefore not possible to define it here concisely. Future research is necessary to clarify this point. This feature should only be used with great caution.
Value
Returns a data.frame
containing the columns time
(the point in time), est
(the estimated survival probability or CIF) and cont
(the specific value of variable
used). If group
was used, it includes the additional group
column, specifying the level of the grouping variable. If conf_int=TRUE
was used, it additionally includes the columns se
(bootstrap standard error of the estimate), ci_lower
(lower limit of the bootstrap confidence interval) and ci_upper
(upper limit of the bootstrap confidence interval).
References
Robin Denz, Nina Timmesfeld (2023). "Visualizing the (Causal) Effect of a Continuous Variable on a Time-To-Event Outcome". In: Epidemiology 34.5
Brice Ozenne, Anne Lyngholm Sorensen, Thomas Scheike, Christian Torp-Pedersen and Thomas Alexander Gerds. riskRegression: Predicting the Risk of an Event using Cox Regression Models. The R Journal (2017) 9:2, pages 440-460.
I-Ming Chang, Rebecca Gelman, and Marcello Pagano. Corrected Group Prognostic Curves and Summary Statistics. Journal of Chronic Diseases (1982) 35, pages 669-674
James Robins. A New Approach to Causal Inference in Mortality Studies with a Sustained Exposure Period: Application to Control of the Healthy Worker Survivor Effect. Mathematical Modelling (1986) 7, pages 1393-1512.
Examples
library(contsurvplot)
library(riskRegression)
#> riskRegression version 2023.09.08
library(survival)
# using data from the survival package
data(nafld, package="survival")
# take a random sample to keep example fast
set.seed(42)
nafld1 <- nafld1[sample(nrow(nafld1), 150), ]
# fit cox-model with age
model <- coxph(Surv(futime, status) ~ age, data=nafld1, x=TRUE)
# estimate survival probability at some points in time, for
# a range of age values
plotdata <- curve_cont(data=nafld1,
variable="age",
model=model,
horizon=c(50, 60, 70, 80),
times=c(1000, 2000, 3000, 4000))
# estimate cumulative incidences instead
plotdata <- curve_cont(data=nafld1,
variable="age",
model=model,
horizon=c(50, 60, 70, 80),
times=c(1000, 2000, 3000, 4000),
cif=TRUE)