Simulate Data from a Given DAG and Node Information
sim_from_DAG.Rd
This function can be used to generate data from a given DAG. It additionally requires information on node distributions, beta coefficients and, depending on the node type, more parameters such as intercepts.
Arguments
- dag
A DAG object created using the empty_dag function with nodes added to it using the + syntax. See details.
- n_sim
A single number specifying how many observations should be generated.
- sort_dag
Whether to topologically sort the DAG before starting the simulation or not. If the nodes in dag were already added in a topologically sorted manner, this argument can be kept at FALSE to save some computation time. This usually won't save too much time though, because it internally uses the topological_sort function from the Rfast package, which is very fast.
- check_inputs
Whether to perform plausibility checks for the user input or not. Is set to TRUE by default, but can be set to FALSE in order to speed things up when using this function in a simulation study or something similar.
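As a rough illustration of how these arguments fit together, the sketch below calls the function repeatedly, as one might in a small simulation study. The DAG reuses the node definitions from the Examples section; the number of replications and the summary statistic (mean bmi) are arbitrary choices made for this sketch.

```r
library(simDAG)

set.seed(1234)

# nodes are added parents-first, so the DAG is already topologically sorted
dag <- empty_dag() +
  node("age", type="rnorm", mean=50, sd=4) +
  node("sex", type="rbernoulli", p=0.5) +
  node("bmi", type="gaussian", parents=c("sex", "age"),
       betas=c(1.1, 0.4), intercept=12, error=2)

# 20 replications (arbitrary); sorting and input checks are skipped
# because the same, already ordered DAG is reused every time
mean_bmi <- vapply(seq_len(20), function(i) {
  dat <- sim_from_dag(dag=dag, n_sim=500, sort_dag=FALSE,
                      check_inputs=FALSE)
  mean(dat$bmi)
}, numeric(1))
```

Running the call once with check_inputs=TRUE before starting such a loop is a cheap way to catch specification mistakes early.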
Details
How it Works:
First, n_sim i.i.d. samples from the root nodes are drawn. Children of these nodes are then generated one by one according to specified relationships and causal coefficients. For example, let's suppose there are two root nodes, age and sex. Those are generated from a normal distribution and a Bernoulli distribution respectively. Afterward, the child node height is generated using both of these variables as parents according to a linear regression with defined coefficients, intercept and sigma (random error). This works because every DAG has at least one topological ordering, which is a linear ordering of vertices such that for every directed edge \(u \to v\), vertex \(u\) comes before \(v\) in the ordering. By using sort_dag=TRUE it is ensured that the nodes are processed in such an ordering.
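To make the notion of a topological ordering concrete, here is a minimal base R sketch of Kahn's algorithm applied to the age, sex and height example. The helper topo_order and the parent-list representation are hypothetical and only serve this illustration; they are not how the package stores or sorts a DAG (the package relies on Rfast for that).

```r
# Hypothetical helper, not part of simDAG: Kahn's algorithm on a named
# list that maps each node to the character vector of its parents.
topo_order <- function(parents) {
  nodes <- names(parents)
  # number of not yet processed parents per node
  n_open <- vapply(parents, length, integer(1))
  ordering <- character(0)
  ready <- nodes[n_open == 0]  # root nodes have no parents

  while (length(ready) > 0) {
    current <- ready[1]
    ready <- ready[-1]
    ordering <- c(ordering, current)

    # every child of `current` now has one fewer open parent
    for (node in nodes) {
      if (current %in% parents[[node]]) {
        n_open[node] <- n_open[node] - 1L
        if (n_open[node] == 0) ready <- c(ready, node)
      }
    }
  }

  if (length(ordering) != length(nodes)) {
    stop("The graph contains a cycle, so it is not a DAG.")
  }
  ordering
}

# the child node is deliberately listed first
parents <- list(height = c("age", "sex"),
                age = character(0),
                sex = character(0))

ordering <- topo_order(parents)
print(ordering)
```

Both root nodes end up before height in the result, which is exactly the property needed to generate each node only after all of its parents exist.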
This procedure is simple in theory, but can get very complex when manually coded. This function offers a simplified workflow by only requiring the user to define the dag object with appropriate information (see documentation of the node function). A sample of size n_sim is then generated from the specified DAG.
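For comparison, a hand-coded version of the age, sex and height example could look roughly like the base R sketch below. The intercept, coefficients and sigma are made-up values chosen only for illustration. With three nodes this is still manageable, but every additional node and parent relationship has to be written out and ordered by hand.

```r
set.seed(42)
n_sim <- 1000

# root nodes: drawn independently of all other nodes
age <- rnorm(n_sim, mean = 50, sd = 4)
sex <- rbinom(n_sim, size = 1, prob = 0.5)  # Bernoulli draws coded 0/1

# child node: linear regression on its parents plus random error
# (all parameter values below are assumptions for this sketch)
intercept <- 150
beta_age <- 0.1
beta_sex <- 10
sigma <- 5

height <- intercept + beta_age * age + beta_sex * sex +
  rnorm(n_sim, mean = 0, sd = sigma)

manual_dat <- data.frame(age = age, sex = sex, height = height)
```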
Specifying the DAG:
Concrete details on how to specify the needed dag object are given in the documentation page of the node function and in the vignettes of this package.
Can this function create longitudinal data?
Yes and no. It theoretically can, but only if the user-specified dag directly specifies a node for each desired point in time. Using the sim_discrete_time function is better in some cases. A brief discussion about this topic can be found in the vignettes of this package.
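A minimal sketch of the "one node per point in time" approach is shown below. The node names (bmi_t1 to bmi_t3), the number of time points and all parameter values are assumptions made for this illustration; only the node types already used in the Examples section are relied upon.

```r
library(simDAG)

set.seed(2024)

# one node per point in time; each measurement depends on the previous one
dag_long <- empty_dag() +
  node("bmi_t1", type="rnorm", mean=25, sd=3) +
  node("bmi_t2", type="gaussian", parents="bmi_t1",
       betas=0.9, intercept=2.5, error=1) +
  node("bmi_t3", type="gaussian", parents="bmi_t2",
       betas=0.9, intercept=2.5, error=1)

# wide format: one row per individual, one column per point in time
long_dat <- sim_from_dag(dag=dag_long, n_sim=200)
```

Every additional point in time needs its own node definition, which is why this approach becomes unwieldy for long follow-up periods.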
If time-dependent nodes were added to the dag using node_td calls, this function may not be used. Only the sim_discrete_time function will work in that case.
Value
Returns a single data.table including the simulated data with (at least) one column per node specified in dag and n_sim rows.
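The structure of the returned object can be checked directly, as in the short sketch below. The two-node DAG and its parameter values are made up for this illustration.

```r
library(simDAG)

set.seed(99)

dag_small <- empty_dag() +
  node("age", type="rnorm", mean=50, sd=4) +
  node("bmi", type="gaussian", parents="age",
       betas=0.4, intercept=12, error=2)

out <- sim_from_dag(dag=dag_small, n_sim=250)

# the result is a data.table with n_sim rows and a column for each node
is_dt <- data.table::is.data.table(out)
n_rows <- nrow(out)
has_nodes <- all(c("age", "bmi") %in% colnames(out))
```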
Examples
library(simDAG)
set.seed(345345)
dag <- empty_dag() +
  node("age", type="rnorm", mean=50, sd=4) +
  node("sex", type="rbernoulli", p=0.5) +
  node("bmi", type="gaussian", parents=c("sex", "age"),
       betas=c(1.1, 0.4), intercept=12, error=2)
sim_dat <- sim_from_dag(dag=dag, n_sim=1000)
# More examples for each directly supported node type as well as for custom
# nodes can be found in the documentation page of the respective node function