Specifying Custom Node Types in a DAG

Introduction

In this vignette, we explain in detail how to define custom functions that can be used in the type argument of node() or node_td() calls. Although simDAG includes a large number of node types that can be used in this argument directly, it also allows the user to pass any function to this argument, as long as that function meets a few requirements (described below). This is an advanced feature that most users probably don’t need for standard simulation studies. We strongly recommend reading the documentation and the other vignettes first, because this vignette assumes that the reader is already familiar with the simDAG syntax and general features.

The support for custom functions in type allows users to create root nodes, child nodes or time-dependent nodes that are not directly implemented in this package. By doing so, users may create data with any functional dependence they can think of. The requirements for each node type are listed below, and each section includes some simple examples. If you think that your custom node type might be useful to others, please contact the maintainer of this package via the supplied e-mail address or GitHub, and we might add it to this package.

library(simDAG)

set.seed(1234)

Root Nodes

Requirements

Any function that generates a vector of size n with n==nrow(data), or a data.frame() with as many rows as the current data, can be used as a root node. The only requirement is:

1.) The function should have an argument called n which controls how many samples to generate.

Some examples that are already implemented in R outside of this package are stats::rnorm(), stats::rgamma() and stats::rbeta(). The function may take any number of further arguments, which will be passed through the three-dot (...) syntax. Note that whenever the supplied function produces a data.frame() (or similar object), the user has to ensure that the included columns are named properly.
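
As a minimal sketch of the data.frame() case, the hypothetical function root_two_cols() below (not part of simDAG) has the required n argument and returns two explicitly named columns. It is called directly here, outside of any DAG, only to show the shape of the output; the argument rho is an assumption made for this illustration.

```r
# Hypothetical root node function (illustration only, not part of simDAG).
# It has the required argument n and returns a data.frame with n rows
# and explicitly named columns.
root_two_cols <- function(n, rho=0.5) {
  x <- rnorm(n)
  # y is built from x plus independent noise, so both columns are correlated
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)
  data.frame(x=x, y=y)
}

set.seed(42)
out <- root_two_cols(n=5)
```

Calling the function by hand like this is a quick way to confirm that it returns n rows and sensible column names before using it in a node() call.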

Examples

Functions from other packages that already fulfill this requirement can be used like this:

dag <- empty_dag() +
  node("A", type="rgamma", shape=0.1, rate=2) +
  node("B", type="rbeta", shape1=2, shape2=0.3)

Of course users may also define an appropriate root node function themselves. The code below defines a function that takes the sum of a normally distributed random number and a uniformly distributed random number for each simulated individual:

custom_root <- function(n, min=0, max=1, mean=0, sd=1) {
  out <- runif(n, min=min, max=max) + rnorm(n, mean=mean, sd=sd)
  return(out)
}

# the function may be supplied as a string
dag <- empty_dag() +
  node("A", type="custom_root", min=0, max=10, mean=5, sd=2)

# equivalently, the function can also be supplied directly
# This is the recommended way!
dag <- empty_dag() +
  node("A", type=custom_root, min=0, max=10, mean=5, sd=2)

data <- sim_from_dag(dag=dag, n_sim=100)
head(data)
#>            A
#>        <num>
#> 1:  2.524972
#> 2: 10.058842
#> 3:  8.874968
#> 4:  9.203870
#> 5: 13.284535
#> 6: 12.529218

Child Nodes

Requirements

Again, almost any function may be used to generate a child node. Only four things are required for this to work properly:

1.) Its name should start with node_ (if you want to use a string to define it in type).
2.) It should contain an argument called data (contains the already generated data).
3.) It should contain an argument called parents (contains a vector of the child node's parents).
4.) It should return either a vector of length n_sim or a data.frame() (or similar object) with any number of columns and n_sim rows.

The function may include any number of additional arguments specified by the user.
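
The requirements above can be checked by hand before a function is used in a DAG. The sketch below defines a hypothetical child node function, node_parents_product() (not part of simDAG), and calls it directly on a small data.frame. Inside a simulation the data argument would be supplied by the package, so the toy data here is only a stand-in.

```r
# Hypothetical child node function (illustration only, not part of simDAG).
# It takes data and parents, plus one user-defined argument (scale), and
# returns one value per row of data.
node_parents_product <- function(data, parents, scale=1) {
  out <- rep(scale, nrow(data))
  for (p in parents) {
    # [[ ]] extracts a single column by name
    out <- out * data[[p]]
  }
  out
}

# toy stand-in for the already generated data
toy <- data.frame(a=c(1, 2, 3), b=c(4, 5, 6))
res <- node_parents_product(data=toy, parents=c("a", "b"), scale=2)
```

The result has one entry per row of toy, which is what requirement 4 asks for.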

Examples

Below we define a custom child node type that is basically just a Gaussian node with some (badly done) truncation, which limits the range of the resulting variable to lie between left and right.

node_gaussian_trunc <- function(data, parents, betas, intercept, error,
                                left, right) {
  out <- node_gaussian(data=data, parents=parents, betas=betas,
                       intercept=intercept, error=error)
  out <- ifelse(out <= left, left,
                ifelse(out >= right, right, out))
  return(out)
}

Please note that this is a terrible form of truncation in most cases, because it artificially distorts the resulting normal distribution at the left and right values. It is only meant as an illustration. Here is another example of a custom child node function, which simply returns the sum of its parents:

parents_sum <- function(data, parents, betas=NULL) {
  out <- rowSums(data[, parents, with=FALSE])
  return(out)
}

We can use both of these functions in a DAG like this:

dag <- empty_dag() +
  node("age", type="rnorm", mean=50, sd=4) +
  node("sex", type="rbernoulli", p=0.5) +
  node("custom_1", type="gaussian_trunc", parents=c("sex", "age"),
       betas=c(1.1, 0.4), intercept=-2, error=2, left=10, right=25) +
  node("custom_2", type=parents_sum, parents=c("age", "custom_1"))

data <- sim_from_dag(dag=dag, n_sim=100)
head(data)
#>         age    sex custom_1 custom_2
#>       <num> <lgcl>    <num>    <num>
#> 1: 48.49105   TRUE 17.33651 65.82756
#> 2: 50.39048   TRUE 17.34963 67.74011
#> 3: 56.55498   TRUE 21.36313 77.91811
#> 4: 46.49763  FALSE 18.61867 65.11630
#> 5: 50.48704   TRUE 19.34207 69.82911
#> 6: 55.44852   TRUE 19.98135 75.42988
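
For comparison with the clipping used in node_gaussian_trunc(), the sketch below draws from a truncated normal distribution through the inverse CDF, which avoids the point masses at left and right. This is a different technique from the one shown above, and the function name node_gaussian_trunc_icdf() is hypothetical. It computes the linear predictor by hand, assuming that error is the standard deviation of a normally distributed error term, and it is called directly on toy data instead of inside a DAG.

```r
# Hypothetical child node function (illustration only, not part of simDAG).
# Draws from a normal distribution truncated to [left, right] using the
# inverse CDF method instead of clipping values at the bounds.
node_gaussian_trunc_icdf <- function(data, parents, betas, intercept, error,
                                     left, right) {
  # linear predictor: intercept + sum of betas * parents
  mu <- rep(intercept, nrow(data))
  for (i in seq_along(parents)) {
    mu <- mu + betas[i] * as.numeric(data[[parents[i]]])
  }
  # probabilities of the bounds under each row's own normal distribution
  p_lo <- pnorm(left, mean=mu, sd=error)
  p_hi <- pnorm(right, mean=mu, sd=error)
  # uniform draw between those probabilities, mapped back with qnorm()
  u <- runif(nrow(data), min=p_lo, max=p_hi)
  qnorm(u, mean=mu, sd=error)
}

set.seed(1)
toy <- data.frame(age=rnorm(50, mean=50, sd=4),
                  sex=rbinom(50, size=1, prob=0.5) == 1)
out <- node_gaussian_trunc_icdf(toy, parents=c("sex", "age"),
                                betas=c(1.1, 0.4), intercept=-2, error=2,
                                left=10, right=25)
```

This sketch can become numerically unstable when the interval lies far in the tail of the distribution, so it should be treated as an illustration only.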

Time-Dependent Nodes

Requirements

By time-dependent nodes we mean nodes that are created using the node_td() function. In general, this works in essentially the same way as for simple root nodes or child nodes. The requirements are:

1.) Its name should start with node_ (if you want to use a string to define it in type).
2.) It should contain an argument called data (contains the already generated data).
3.) If it is a child node, it should contain an argument called parents (contains a vector of the child node's parents). This is not necessary for nodes that are independently generated.
4.) It should return either a vector of length n_sim or a data.frame() (or similar object) with any number of columns and n_sim rows.

Again, any number of additional arguments is allowed and will be passed through the three-dot syntax. Additionally, there are two built-in arguments that users may specify in custom time-dependent nodes, which are then used internally. First, users may add an argument called sim_time to the function. If it is included in the function definition, the current time of the simulation will be passed to the function on every call made to it. Second, the argument past_states may be added. If so, a list containing all previous states of the simulation (as saved using the save_states argument of the sim_discrete_time() function) will be passed to it internally, giving the user access to the data generated at previous points in time.
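
The idea of passing sim_time and past_states only to functions that declare them can be sketched in base R with formals(). The helper call_node_sketch() below is hypothetical and is not the code simDAG uses internally; it only illustrates one way such a mechanism could work.

```r
# Hypothetical dispatcher (illustration only, NOT simDAG's internal code).
# It passes sim_time and past_states only if fun declares those arguments.
call_node_sketch <- function(fun, data, sim_time, past_states, ...) {
  args <- list(data=data, ...)
  declared <- names(formals(fun))
  if ("sim_time" %in% declared) {
    args <- c(args, list(sim_time=sim_time))
  }
  if ("past_states" %in% declared) {
    args <- c(args, list(past_states=past_states))
  }
  do.call(fun, args)
}

# one function that declares sim_time and one that does not
node_uses_time <- function(data, sim_time) {
  rep(sim_time^2, times=nrow(data))
}
node_ignores_time <- function(data, value=1) {
  rep(value, times=nrow(data))
}

toy <- data.frame(id=1:3)
with_time <- call_node_sketch(node_uses_time, data=toy, sim_time=4,
                              past_states=list())
without_time <- call_node_sketch(node_ignores_time, data=toy, sim_time=4,
                                 past_states=list(), value=7)
```

Both calls succeed even though only the first function has a sim_time argument, which mirrors the behavior described above from the user's point of view.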

Examples

Time-Dependent Root Nodes

An example of a custom time-dependent root node is given below:

node_custom_root_td <- function(data, n, mean=0, sd=1) {
  return(rnorm(n=n, mean=mean, sd=sd))
}

This function simply draws a new value from a normal distribution at each point in time of the simulation. A DAG using this node type could look like this:

n_sim <- 100

dag <- empty_dag() +
  node_td(name="Something", type=node_custom_root_td, n=n_sim, mean=10, sd=5)

Time-Dependent Child Nodes

Below is an example of a function that can be used to define a custom time-dependent child node:

node_custom_child <- function(data, parents) {
  out <- numeric(nrow(data))
  out[data$other_event] <- rnorm(n=sum(data$other_event), mean=10, sd=3)
  out[!data$other_event] <- rnorm(n=sum(!data$other_event), mean=5, sd=10)
  return(out)
}

dag <- empty_dag() +
  node_td("other", type="time_to_event", prob_fun=0.1) +
  node_td("whatever", type="custom_child", parents="other_event")

This function takes a random draw from a normal distribution whose parameters depend on whether the other_event column, which belongs to a previously updated time-dependent node called other, is currently TRUE or FALSE.

Using the sim_time Argument

Below we give an example of how the sim_time argument may be used. The following function simply returns the square of the current simulation time as output:

node_square_sim_time <- function(data, sim_time, n_sim) {
  # rep() has no argument called n; use times to get a vector of length n_sim
  return(rep(sim_time^2, times=n_sim))
}

dag <- empty_dag() +
  node_td("unclear", type=node_square_sim_time, n_sim=100)

Note that we did not (and should not!) actually define the sim_time argument in the node_td() definition of the node, because it will be passed internally, just like data is. As long as sim_time is a named argument of the function the user is passing, it will be handled automatically. In real simulation studies this feature may be used to create time-scale dependent risks or effects for some time-dependent events of interest.
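
As a minimal sketch of a time-scale dependent risk, the hypothetical function node_time_risk() below (not part of simDAG) lets the event probability grow linearly with sim_time. The arguments base_p and slope are assumptions made for this illustration. The function is called by hand with two different time points here, whereas in a real simulation sim_time would be supplied internally.

```r
# Hypothetical time-dependent node function (illustration only).
# The event probability increases linearly with the simulation time and
# is capped at 1.
node_time_risk <- function(data, sim_time, base_p=0.01, slope=0.02) {
  p <- min(1, base_p + slope * sim_time)
  # one Bernoulli draw per row, returned as a logical vector
  rbinom(n=nrow(data), size=1, prob=p) == 1
}

set.seed(3)
toy <- data.frame(id=1:1000)
early <- node_time_risk(toy, sim_time=1)   # probability 0.03
late <- node_time_risk(toy, sim_time=40)   # probability 0.81
```

Comparing mean(early) with mean(late) shows that events become more frequent at later simulation times.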

Using the past_states Argument

As stated earlier, another special kind of argument is the past_states argument, which allows users direct access to past states of the simulation. Below is an example of how this might be used:

node_prev_state <- function(data, past_states, sim_time) {
  if (sim_time < 3) {
    return(rnorm(n=nrow(data)))
  } else {
    return(past_states[[sim_time-2]]$A + rnorm(n=nrow(data)))
  }
}

dag <- empty_dag() +
  node_td("A", type=node_prev_state, parents="A")

This function simply returns the value used two simulation time steps ago plus a normally distributed random value. To make this happen, we actually use both the sim_time argument and the past_states argument. Note that, again, we do not (and cannot!) define these arguments in the node_td() definition of the node. They are simply used internally.

A crucial thing to make the previous code work in an actual simulation is the save_states argument of the sim_discrete_time() function. This argument controls which states should be saved internally. If users want to use previous states, these need to be saved, so the argument should in almost all cases be set to save_states="all", as shown below:

sim <- sim_discrete_time(dag, n_sim=100, max_t=10, save_states="all")
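
To make the role of the saved states more concrete, the sketch below imitates a discrete-time loop in base R. It is a simplified stand-in and not how sim_discrete_time() is implemented. It assumes that past_states[[t]] holds the data as it was at the end of time step t. The function node_prev_state_sketch() is a hypothetical variant of node_prev_state() with an extra noise_sd argument, set to 0 here so that the two-step lag is directly visible.

```r
# Hypothetical variant of node_prev_state() with an adjustable noise level.
node_prev_state_sketch <- function(data, past_states, sim_time, noise_sd=1) {
  if (sim_time < 3) {
    return(rnorm(n=nrow(data)))
  }
  # value from two time steps ago plus (optional) random noise
  past_states[[sim_time - 2]]$A + rnorm(n=nrow(data), sd=noise_sd)
}

set.seed(10)
data <- data.frame(id=1:4)
past_states <- list()

# simplified stand-in for the simulation loop: update A, then save the state
for (t in 1:5) {
  data$A <- node_prev_state_sketch(data, past_states, sim_time=t, noise_sd=0)
  past_states[[t]] <- data
}
```

With noise_sd=0, the values of A at time 3 equal those at time 1, and the values at time 5 equal those at time 3. If the states were not saved, the lookup past_states[[sim_time - 2]] would have nothing to return.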

Using the Formula Interface

Users may also use the enhanced formula interface directly with custom child nodes and custom time-dependent nodes. This is described in detail in the vignette on specifying formulas (see vignette(topic="v_using_formulas", package="simDAG")).

Some General Comments

Using custom functions as node types is an advanced technique to obtain specialized simulated data. It is sadly impossible to cover all use cases here, but we would like to give some general recommendations nonetheless:

When using custom nodes, pass the function to type directly instead of using a string. This can avoid scoping issues, depending on which environment the simulation is performed in.
Keep it simple, if you can. Particularly in time-dependent simulations, the computational complexity of the node function matters a lot.
Consider if node_identity() might be used instead. In many cases, it is a lot easier to just use a node of type identity instead of defining a new function.
The structural equations printed for custom nodes may be uninformative.
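
The scoping issue mentioned in the first recommendation can be illustrated in base R. The helper resolve_in_global() below is hypothetical and does not reflect how simDAG resolves strings; it only mimics a name lookup that cannot see functions defined in a local scope, in contrast to passing the function object itself.

```r
# Hypothetical lookup helper (illustration only, NOT simDAG's mechanism).
# It searches for a function by name starting from the global environment
# and returns NULL if nothing is found.
resolve_in_global <- function(name) {
  tryCatch(get(name, envir=globalenv(), mode="function"),
           error=function(e) NULL)
}

run_simulation_sketch <- function() {
  # custom function defined inside another function (a local scope)
  local_root_fun <- function(n) rnorm(n)

  list(
    by_name=resolve_in_global("local_root_fun"),  # lookup by string
    by_value=local_root_fun                       # function passed directly
  )
}

res <- run_simulation_sketch()
```

Here res$by_name is NULL because the name is not visible from the global environment, while res$by_value is the function itself and can be called regardless of where it was defined.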

Robin Denz
