Introduction
The simDAG package can be used to generate all kinds of data, as shown
in the many examples and other vignettes of this package. In this
vignette we will further illustrate the capabilities of the package by
giving very short and simplified examples of how to use simDAG to
generate multiple different kinds of data. The vignette can be seen as
a sort of “cookbook”, in the sense that it includes the building blocks
for many different possible data generation processes (DGP), which
users can expand on for their specific needs. Note that the examples
shown are not meant to be realistic; they are only meant to show the
general structure of the required code.

This vignette assumes that the reader is already somewhat familiar with
the simDAG syntax. More detailed explanations of how to simulate data
using a time-fixed DAG and DAGs including time-dependent variables are
given in other vignettes and the documentation of the functions.
Simulating Randomized Controlled Trials
First, we will give some examples of how data for randomized controlled
trials (RCT) could be generated. In a classic RCT, the treatment of
interest (below named Treatment) is randomly assigned to the
individuals of the study, making it a “root node” in the terminology of
DAGs.
Two Treatment Groups
Below is an example for an RCT with two treatment groups, a binary
outcome and two baseline covariates (Age and Sex):
dag <- empty_dag() +
node("Age", type="rnorm", mean=55, sd=5) +
node("Sex", type="rbernoulli", p=0.5) +
node("Treatment", type="rbernoulli", p=0.5) +
node("Outcome", type="binomial",
formula= ~ -12 + Age*0.2 + Sex*1.1 + Treatment*-0.5)
data <- sim_from_dag(dag, n_sim=1000)
head(data)
#> Age Sex Treatment Outcome
#> <num> <lgcl> <lgcl> <lgcl>
#> 1: 47.99978 FALSE FALSE FALSE
#> 2: 56.27659 TRUE TRUE FALSE
#> 3: 42.81368 FALSE FALSE FALSE
#> 4: 54.97214 FALSE TRUE FALSE
#> 5: 58.10776 TRUE FALSE FALSE
#> 6: 60.74206 FALSE TRUE FALSE
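As a quick sanity check, the logistic regression model corresponding to
this DGP can be fit to the simulated data; the estimated coefficients
should then be close to the true values used above (a minimal sketch
using base R):

# fit the analysis model corresponding to the DGP; the estimates should
# be close to the true coefficients (-12, 0.2, 1.1, -0.5)
mod <- glm(Outcome ~ Age + Sex + Treatment, data=data, family="binomial")
summary(mod)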
Three or More Treatment Groups
The example from earlier can easily be made a little more complex, by
including three treatment groups instead of just two. This can be
achieved by using the rcategorical() function as node type instead of
the rbernoulli() function.
dag <- empty_dag() +
node("Age", type="rnorm", mean=55, sd=5) +
node("Sex", type="rbernoulli", p=0.5, output="numeric") +
node("Treatment", type="rcategorical", probs=c(0.33333, 0.33333, 0.33333),
labels=c("Placebo", "Med1", "Med2"), output="factor") +
node("Outcome", type="binomial",
formula= ~ -12 + Age*0.2 + Sex*1.1 + TreatmentMed1*-0.5 +
TreatmentMed2*-1)
data <- sim_from_dag(dag, n_sim=1000)
head(data)
#> Age Sex Treatment Outcome
#> <num> <num> <fctr> <lgcl>
#> 1: 56.92997 0 Placebo TRUE
#> 2: 50.80579 1 Med2 FALSE
#> 3: 52.34986 1 Med1 FALSE
#> 4: 43.53409 1 Med2 FALSE
#> 5: 64.27632 1 Med1 FALSE
#> 6: 51.87643 1 Med2 FALSE
By setting probs=c(0.33333, 0.33333, 0.33333), each treatment group is
chosen with a probability of 0.33333, meaning that the resulting groups
should be of roughly equal size in expectation.
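With n_sim=1000, this can be checked directly on the simulated data:

# the three treatment groups should each contain roughly 333 individuals
table(data$Treatment)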
Multiple Outcome Measurements
The previous two examples assumed that there was a single fixed time at
which the binary outcome was measured. Below we switch things up a bit
by using a continuous outcome that is measured at 5 different points in
time after baseline. The syntax is nearly the same as before, with the
major difference being that we now use a node_td() call to define the
outcome, because it is time-dependent. Additionally, because of this
inclusion of a time-dependent node, we have to use the
sim_discrete_time() function instead of the sim_from_dag() function.
Finally, the sim2data() function has to be called to obtain the desired
output in the long format.
dag <- empty_dag() +
node("Age", type="rnorm", mean=55, sd=5) +
node("Sex", type="rbernoulli", p=0.5, output="numeric") +
node("Treatment", type="rbernoulli", p=0.5) +
node_td("Outcome", type="gaussian",
formula= ~ -12 + Age*0.2 + Sex*1.1 + Treatment*-0.5, error=1)
sim <- sim_discrete_time(dag, n_sim=1000, max_t=5, save_states="all")
data <- sim2data(sim, to="long")
head(data)
#> Key: <.id, .time>
#> .id .time Age Sex Treatment Outcome
#> <int> <int> <num> <num> <lgcl> <num>
#> 1: 1 1 47.29087 1 FALSE -0.6673257
#> 2: 1 2 47.29087 1 FALSE -2.6859386
#> 3: 1 3 47.29087 1 FALSE -2.3041754
#> 4: 1 4 47.29087 1 FALSE -1.4885914
#> 5: 1 5 47.29087 1 FALSE -1.9022557
#> 6: 2 1 58.86501 0 TRUE -3.7253747
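Because the resulting data is in the long format (one row per
individual and point in time), it can be summarised directly using
data.table syntax (assuming the data.table package is attached):

# mean outcome per point in time across all individuals
data[, .(mean_outcome = mean(Outcome)), by=".time"]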
Non-Compliance to Treatment Assignment
Previously we assumed that the treatment was assigned at baseline and never changed throughout the study. In real RCTs, individuals are often assigned to treatment strategies instead (for example: one pill per week). Some individuals might not adhere to their assigned treatment strategy, for example by “switching back” to not taking the pill.
The following code shows one possible way to simulate such data. In the
shown DGP, individuals are assigned to a treatment strategy at
baseline. Here, Treatment = FALSE refers to the control condition in
which the individual should not take any pills. In the intervention
group (Treatment = TRUE), however, the individual is assigned to take a
pill every week. On average, 5% of these individuals stop taking their
pills with each passing week. Once they stop, they never start taking
them again. The continuous outcome at each observation time depends on
how many pills each individual took in total. None of the individuals
in the control group start taking pills. For simplicity, the simulation
is run without any further covariates for 5 weeks.
# function to calculate the probability of taking the pill at t,
# given the current treatment status of the person
prob_treat <- function(data) {
fifelse(!data$Treatment_event, 0, 0.95)
}
dag <- empty_dag() +
node("Treatment_event", type="rbernoulli", p=0.5) +
node_td("Treatment", type="time_to_event", parents=c("Treatment_event"),
prob_fun=prob_treat, event_count=TRUE, event_duration=1) +
node_td("Outcome", type="gaussian", formula= ~ -2 +
Treatment_event_count*-0.3, error=2)
sim <- sim_discrete_time(dag, n_sim=1000, max_t=5, save_states="all")
data <- sim2data(sim, to="long")
head(data)
#> Key: <.id, .time>
#> .id .time Treatment Outcome Treatment_event_count
#> <int> <int> <lgcl> <num> <num>
#> 1: 1 1 TRUE -2.988481 1
#> 2: 1 2 TRUE -2.545962 2
#> 3: 1 3 TRUE -3.029647 3
#> 4: 1 4 TRUE -2.593921 4
#> 5: 1 5 TRUE 2.327432 5
#> 6: 2 1 TRUE -3.427912 1
What this DAG essentially means is that first, the Treatment_event
column is generated, which includes the assigned treatment at baseline.
We called it Treatment_event here instead of just Treatment, because
the value in this column should be updated with each iteration and
nodes of type "time_to_event" always split the node into two columns:
status and time. By setting event_count=TRUE in the node_td() call for
the Treatment node, a count of the number of pills taken up to the
current time is directly calculated at each point in time (column
Treatment_event_count), which can then be used directly to generate the
Outcome node.
This simulation could be made more realistic (or just more complex), by for example adding any of the following things:
- making the probability of switching dependent on the outcome at previous points in time
- making the probability of switching dependent on other variables (see the sketch below)
- allowing patients to switch back to taking the pill after discontinuation
- allowing patients in the control group to switch to the treatment
Most of these additions are left as an exercise to the user.
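As a rough sketch of the second point, the probability of stopping
could be made dependent on a baseline covariate by including it in the
parents of the time-dependent node and using it inside the prob_fun.
The Age node and all coefficients below are made up purely for
illustration:

# probability of taking the pill at t, where the probability of
# stopping additionally increases with age (hypothetical values)
prob_treat_age <- function(data) {
  p_stop <- plogis(-4 + data$Age*0.03)
  fifelse(!data$Treatment_event, 0, 1 - p_stop)
}

dag <- empty_dag() +
  node("Age", type="rnorm", mean=55, sd=5) +
  node("Treatment_event", type="rbernoulli", p=0.5) +
  node_td("Treatment", type="time_to_event",
          parents=c("Treatment_event", "Age"),
          prob_fun=prob_treat_age, event_count=TRUE, event_duration=1) +
  node_td("Outcome", type="gaussian", formula= ~ -2 +
            Treatment_event_count*-0.3, error=2)

sim <- sim_discrete_time(dag, n_sim=1000, max_t=5, save_states="all")
data <- sim2data(sim, to="long")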
With Cluster Randomization
Instead of randomly assigning individuals to the treatment, many trials
actually randomly assign the treatment at the level of clinics or other
forms of clusters, which is fittingly called cluster randomization.
This may be implemented in simDAG using the following syntax:
dag <- empty_dag() +
node("Clinic", type="rcategorical", probs=rep(0.02, 50)) +
node("Treatment", type="identity", formula= ~ Clinic >= 25) +
node("Outcome", type="poisson", formula= ~ -1 + Treatment*4 + (1|Clinic),
var_corr=0.5)
data <- sim_from_dag(dag, n_sim=1000)
#> Loading required namespace: simr
head(data)
#> Clinic Treatment Outcome
#> <int> <lgcl> <int>
#> 1: 5 FALSE 2
#> 2: 9 FALSE 0
#> 3: 19 FALSE 0
#> 4: 17 FALSE 0
#> 5: 11 FALSE 4
#> 6: 3 FALSE 0
In this DGP, each individual is randomly assigned to one of 50 Clinics with equal probability. All patients in clinics numbered 0-24 are assigned to the control group, while the other patients are assigned to the treatment group. The outcome is then generated using a Poisson regression with a random intercept based on the clinic.
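One possible way to analyze such data would be a mixed Poisson
regression with a random intercept per clinic, for example using the
lme4 package (which simr, loaded above, builds on):

# fit a random intercept Poisson model to the cluster-randomized data;
# Clinic is converted to a factor to be used as a grouping variable
library(lme4)
mod <- glmer(Outcome ~ Treatment + (1 | factor(Clinic)), data=data,
             family=poisson())
summary(mod)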
Simulating Observational Studies
In all previous examples, the variable of interest was a treatment that
was randomly assigned and did not depend on other variables (excluding
the Clinic for the cluster randomization example). In observational
studies, the variable of interest is usually not randomly assigned.
Below are some examples for generating such observational study data.
Cross-Sectional Data
In the following example, the treatment probability is dependent on both a categorical and a continuous variable, which both also cause the outcome:
dag <- empty_dag() +
node("cat", type="rcategorical", probs=c(0.4, 0.2, 0.2),
labels=LETTERS[1:3]) +
node("cont", type="rbeta", shape1=0.2, shape2=1.2) +
node("treatment", type="binomial",
formula= ~ -0.2 + catB*0.3 + catC*1 + cont*0.2) +
node("outcome", type="gaussian",
formula= ~ 3 + catB*1.1 + catC*0.2 + cont*-0.1,
error=1)
data <- sim_from_dag(dag, n_sim=100)
head(data)
#> cat cont treatment outcome
#> <char> <num> <lgcl> <num>
#> 1: A 1.921125e-04 FALSE 1.248619
#> 2: C 3.624581e-02 TRUE 2.636620
#> 3: B 8.407934e-05 FALSE 4.334688
#> 4: A 2.480741e-05 FALSE 2.074354
#> 5: A 2.095210e-07 TRUE 2.713159
#> 6: A 6.110730e-01 FALSE 1.385816
If the goal was to estimate the causal effect of the treatment on the
outcome, we would need to adjust for both cat and cont here, using for
example a linear regression model (same as in the DGP) or something
like inverse probability of treatment weighting.
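A minimal sketch of such an adjusted analysis, simply fitting the same
linear model that was used to generate the outcome:

# adjusted outcome regression; the estimated coefficients should be
# close to the true values used in the DGP
mod <- lm(outcome ~ treatment + cat + cont, data=data)
summary(mod)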
Miscellaneous Simulations
Simulating Multi-Level Data
The simDAG package directly supports the inclusion of arbitrary mixed
model syntax in the formula interface of nodes of type "gaussian",
"binomial" and "poisson", which makes it relatively straightforward to
simulate multi-level data as well. In the following example, we
consider students that are nested in different schools. The outcome is
a continuous score of some kind.
dag <- empty_dag() +
node("school", type="rcategorical", probs=rep(0.1, 10),
labels=LETTERS[1:10]) +
node("female", type="rbernoulli", p=0.5) +
node("age", type="rnorm", mean=12, sd=3) +
node("score", type="gaussian",
formula= ~ -2 + female*3 + age*0.1 + (1|school),
var_corr=0.5, error=1)
data <- sim_from_dag(dag, n_sim=10)
head(data)
#> school female age score
#> <char> <lgcl> <num> <num>
#> 1: H TRUE 10.832768 3.459773
#> 2: D FALSE 10.103622 -1.372296
#> 3: H TRUE 10.841744 1.842507
#> 4: A FALSE 14.161571 -0.514471
#> 5: H FALSE 7.989865 -1.226325
#> 6: C TRUE 14.900755 2.019298
In this example, there is a single random effect for school, with a
standard deviation of 0.5. This example could be expanded to include a
random slope for age per school, by exchanging (1|school) with
(age|school), although we would then also need to adjust the var_corr
argument accordingly. More examples are given in the formula vignette.
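A possible sketch of such a random slope extension is shown below. Note
that the exact specification of var_corr for random slopes is described
in the formula vignette; here it is assumed that a 2x2 covariance
matrix for the random intercept and random slope can be supplied, and
the values used are chosen purely for illustration:

# hypothetical extension with a random intercept and a random slope for
# age per school; the covariance matrix is made up for illustration
dag <- empty_dag() +
  node("school", type="rcategorical", probs=rep(0.1, 10),
       labels=LETTERS[1:10]) +
  node("female", type="rbernoulli", p=0.5) +
  node("age", type="rnorm", mean=12, sd=3) +
  node("score", type="gaussian",
       formula= ~ -2 + female*3 + age*0.1 + (age|school),
       var_corr=matrix(c(0.5, 0.05, 0.05, 0.1), ncol=2), error=1)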
Simulating Mixture Distributions
Instead of using random effects and random slopes, another possibility
to simulate a sort of “mixed” variable is to directly model it using
different regression models for different individuals. This can be done
using the "mixture" node type. Consider the following example:
dag <- empty_dag() +
node("strata", type="rbernoulli", p=0.5) +
node(c("var1", "var2"), type="rnorm", mean=0, sd=1) +
node("Y", type="mixture", parents=c("strata", "var1", "var2"),
distr=list(
"strata==0", node(".", type="gaussian",
formula= ~ -2 + var1*2 + var2*-0.5, error=1),
"strata==1", node(".", type="gaussian",
formula= ~ 5 + var1*-1 + var2*2.3, error=1.5)
))
data <- sim_from_dag(dag, n_sim=10)
head(data)
#> strata var1 var2 Y
#> <lgcl> <num> <num> <num>
#> 1: TRUE -1.2703000 -1.86080057 1.757752
#> 2: TRUE 0.2556095 0.67908621 7.516805
#> 3: FALSE -0.4941493 0.26821288 -2.263518
#> 4: TRUE 2.0075698 1.18606863 7.014331
#> 5: TRUE -0.8370690 -0.14223463 7.302422
#> 6: TRUE -1.0869544 0.02476878 6.237979
Using the distr argument, we can easily define to which simulated
individuals the corresponding node definitions should be applied. In
the example above, we defined different linear regression models for
individuals with strata==0 and for individuals with strata==1. It would
also be possible to use entirely different node types.
Simulating “Outliers”
One possibility to introduce outliers to a variable is to define the variable that should contain them as a kind of mixture distribution made up of two parts, similar to how zero-inflated models work. First, values are generated from the general distribution. Then, if a value exceeds some threshold, a different value is sampled for it. A very simple example:
dag <- empty_dag() +
node(c("A", "B", "C"), type="rnorm") +
node("Y", type="mixture", parents=c("A", "B", "C"),
distr=list(
"TRUE", node(".", type="gaussian", formula= ~ -2 + A*0.1 + B*1 + C*-2,
error=1),
"Y > 3", node(".", type="rnorm", mean=10000, sd=500)
))
data <- sim_from_dag(dag, n_sim=10000)
Here, we first simulate 3 standard normal variables A, B and C, which
will be used as predictor variables for our desired outcome Y. For Y
itself, we use the "mixture" node type, which allows us to define the
node as a mix of multiple node types, based on some conditions, which
are processed one by one. Because we set the first condition to TRUE,
the linear model next to it gets applied to all individuals in the
first run, generating values from the specified linear model for all
individuals. In the next run we condition on Y itself, instructing the
function to only simulate new values for individuals with Y > 3. These
values are drawn from a normal distribution with a very high mean.
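The result can be inspected directly, for example:

# the distribution of Y is clearly bimodal: most values come from the
# linear model, while values that exceeded 3 in the first run were
# re-drawn from the N(10000, 500) distribution
summary(data$Y)
mean(data$Y > 3)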
Simulating Missing Values
Real data usually includes at least some missing values in key variables. Below is a very simple example of how users could include missing values in their data:
dag <- empty_dag() +
node("A_real", type="rnorm", mean=10, sd=3) +
node("A_missing", type="rbernoulli", p=0.5) +
node("A_observed", type="identity",
formula= ~ fifelse(A_missing, NA, A_real))
data <- sim_from_dag(dag, n_sim=10)
head(data)
#> A_real A_missing A_observed
#> <num> <lgcl> <num>
#> 1: 7.881682 TRUE NA
#> 2: 8.773459 FALSE 8.773459
#> 3: 14.785682 FALSE 14.785682
#> 4: 7.617797 FALSE 7.617797
#> 5: 11.036126 TRUE NA
#> 6: 7.330183 FALSE 7.330183
In this DAG, the real values of node A are generated first. Afterwards,
an indicator of whether the corresponding value is observed is drawn
randomly for each individual in node A_missing. The actually observed
value, denoted A_observed, is then generated by simply using either the
value of A_real or NA if A_missing==TRUE. The missingness shown here
would correspond to missing completely at random (MCAR) in
the categorisation by Rubin (1976). Although this might seem a little
cumbersome at first, it does allow quite a lot of flexibility in the
specification of different missingness mechanisms. For example, to
simulate missing at random (MAR) patterns, we could use the
following code instead:
dag <- empty_dag() +
node("A_real", type="rnorm", mean=0, sd=1) +
node("B_real", type="rbernoulli", p=0.5) +
node("A_missing", type="rbernoulli", p=0.1) +
node("B_missing", type="binomial", formula= ~ -5 + A_real*0.1) +
node("A_observed", type="identity",
formula= ~ fifelse(A_missing, NA, A_real)) +
node("B_observed", type="identity",
formula= ~ fifelse(B_missing, NA, B_real))
data <- sim_from_dag(dag, n_sim=10)
head(data)
#> A_real B_real A_missing B_missing A_observed B_observed
#> <num> <lgcl> <lgcl> <lgcl> <num> <lgcl>
#> 1: 2.5611064 FALSE FALSE FALSE 2.5611064 FALSE
#> 2: -1.0780466 FALSE TRUE FALSE NA FALSE
#> 3: -1.4581759 FALSE FALSE FALSE -1.4581759 FALSE
#> 4: -0.6715334 FALSE FALSE FALSE -0.6715334 FALSE
#> 5: 0.1095760 FALSE FALSE FALSE 0.1095760 FALSE
#> 6: 1.5605237 TRUE FALSE FALSE 1.5605237 TRUE
Here, the missingness in A_observed is again MCAR, because it is
independent of everything. The missingness in B_observed, however, is
MAR because the probability of missingness is dependent on the actually
observed value of A_real.
Simulating Measurement Error
Measurement error refers to situations in which variables of interest
are not measured perfectly. For example, the disease of interest may
only be detected in 90% of all patients with the disease and may
falsely be detected in 1% of all patients without the disease. The same
strategy shown for missing values could be used to simulate such data
using simDAG:
probs <- list(`TRUE`=0.9, `FALSE`=0.01)
dag <- empty_dag() +
node("Disease_real", type="rbernoulli", p=0.5) +
node("Disease_observed", type="conditional_prob", parents="Disease_real",
probs=probs)
data <- sim_from_dag(dag, n_sim=10)
head(data)
#> Disease_real Disease_observed
#> <lgcl> <lgcl>
#> 1: FALSE FALSE
#> 2: TRUE TRUE
#> 3: FALSE FALSE
#> 4: TRUE TRUE
#> 5: FALSE FALSE
#> 6: FALSE FALSE
In this example, the disease is present in 50% of all individuals. By
using a node with type="conditional_prob" we can easily draw new values
for the observed disease status by specifying the probs argument
correctly. We could similarly extend this example to make the
probability of misclassification dependent on another variable:
# first TRUE / FALSE refers to Sex = TRUE / FALSE
# second TRUE / FALSE refers to Disease = TRUE / FALSE
probs <- list(TRUE.TRUE=0.9, TRUE.FALSE=0.01,
FALSE.TRUE=0.8, FALSE.FALSE=0.05)
dag <- empty_dag() +
node("Sex", type="rbernoulli", p=0.5) +
node("Disease_real", type="rbernoulli", p=0.5) +
node("Disease_observed", type="conditional_prob",
parents=c("Sex", "Disease_real"), probs=probs)
data <- sim_from_dag(dag, n_sim=1000)
head(data)
#> Sex Disease_real Disease_observed
#> <lgcl> <lgcl> <lgcl>
#> 1: FALSE FALSE FALSE
#> 2: TRUE TRUE TRUE
#> 3: FALSE FALSE FALSE
#> 4: TRUE TRUE TRUE
#> 5: TRUE TRUE TRUE
#> 6: TRUE TRUE TRUE
In this extended example, Sex is equally distributed among the
population (with TRUE = “female” and FALSE = “male”). The probability
of being diagnosed with the disease (Disease_observed) if the disease
is actually present is 0.9 for females and only 0.8 for males.
Similarly, the probability of being falsely diagnosed if the disease is
not present is 0.01 for females and 0.05 for males.
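With n_sim=1000, these misclassification probabilities can be checked
empirically, for example using data.table syntax (assuming data.table
is attached):

# empirical probability of being diagnosed among the truly diseased,
# by sex; the values should be close to 0.9 (female) and 0.8 (male)
data[Disease_real == TRUE, mean(Disease_observed), by="Sex"]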