These functions should be used in conjunction with the empty_dag function to create DAG objects, which can then be used to simulate data using the sim_from_dag function or the sim_discrete_time function.

Usage

node(name, type, parents=NULL, formula=NULL, ...)

node_td(name, type, parents=NULL, formula=NULL, ...)

Arguments

name

A character vector with at least one entry specifying the name of the node. If a character vector containing multiple different names is supplied, one separate node will be created for each name. These nodes are completely independent, but have the exact same node definition as supplied by the user. If only a single character string is provided, only one node is generated.

type

A single character string specifying the type of the node. Different node types are allowed depending on whether the node is a root node, a child node or a time-dependent node. See details. Alternatively, a suitable function may be passed directly to this argument.

parents

A character vector of names, specifying the parents of the node or NULL (default). If NULL, the node is treated as a root node. For convenience it is also allowed to set parents="" to indicate that the node is a root node.

formula

An optional formula object to describe how the node should be generated or NULL (default). If supplied, it should start with ~, having nothing else on the left-hand side. The right-hand side should define the entire structural equation, including the betas and intercepts. It may contain any valid formula syntax, such as ~ -2 + A*3 + B*4 or ~ -2 + A*3 + B*4 + I(A^2)*0.3 + A:B*1.1, allowing arbitrary non-linear effects, arbitrary interactions and multiple coefficients for categorical variables. If this argument is defined, there is no need to define the betas and intercept arguments. The parents argument should still be specified whenever a categorical variable is used in the formula. This argument is currently only supported for nodes of type "binomial", "gaussian", "poisson", "negative_binomial" and "cox". It is also supported for nodes of type "identity", but slightly different input is expected in that case. See examples and the associated vignette for an in-depth explanation.

...

Further named arguments needed to specify the node. Those can be parameters of distribution functions such as the p argument in the rbernoulli function for root nodes or arbitrary named arguments such as the betas argument of the node_gaussian function.

Details

To generate data using the sim_from_dag function or the sim_discrete_time function, it is required to create a DAG object first. This object needs to contain information about the causal structure of the data (e.g. which variable causes which variable) and the specific structural equations for each variable (information about causal coefficients, type of distribution etc.). In this package, the node and/or node_td function is used in conjunction with the empty_dag function to create this object.

This works by first initializing an empty DAG using the empty_dag function and then adding multiple calls to the node and/or node_td functions to it using a simple +, where each call to node and/or node_td adds information about a single node that should be generated. Multiple examples are given below.

In each call to node or node_td the user needs to indicate what the node should be called (name), which function should be used to generate the node (type), and whether the node has any parents and, if so, which ones (parents). Any additional arguments needed to call the data-generating function of this node later are passed through the three-dot syntax (...).
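The workflow described above can be sketched as follows. This is a minimal illustration, not a recommended model: the variable names and all numeric values (means, betas, intercept, error) are made up for the example.

```r
library(simDAG)

set.seed(42)

# start from an empty DAG and add one node() call per variable using +
dag <- empty_dag() +
  # root nodes: no parents, extra arguments go to rnorm() / rbernoulli()
  node("age", type="rnorm", mean=50, sd=4) +
  node("sex", type="rbernoulli", p=0.5) +
  # child node: linear regression on its parents (illustrative coefficients)
  node("bmi", type="gaussian", parents=c("age", "sex"),
       betas=c(1.1, 0.4), intercept=12, error=2)

# draw 100 observations from the DAG
sim_dat <- sim_from_dag(dag, n_sim=100)
head(sim_dat)
```

Each node is generated only after its parents, so "bmi" can use the simulated values of "age" and "sex".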

node vs. node_td:

By calling node you are indicating that this node is a time-fixed variable which should only be generated once. By using node_td you are indicating that it is a time-dependent node, which will be updated at each step in time when using a discrete-time simulation.

node_td should only be used if you are planning to perform a discrete-time simulation with the sim_discrete_time function. DAG objects including time-dependent nodes may not be used in the sim_from_dag function.
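As a rough sketch of the distinction, the DAG below mixes one time-fixed node with one time-dependent node and is therefore simulated with sim_discrete_time. The probability, event duration, sample size and number of time steps are arbitrary values chosen for illustration.

```r
library(simDAG)

set.seed(1)

dag <- empty_dag() +
  # time-fixed: generated once at the start of the simulation
  node("sex", type="rbernoulli", p=0.5) +
  # time-dependent: re-evaluated at every step in time
  node_td("sickness", type="time_to_event", prob_fun=0.01,
          event_duration=5)

# discrete-time simulation over 20 time steps for 50 individuals
sim <- sim_discrete_time(dag, n_sim=50, max_t=20)

# state of the simulation after the last time step
head(sim$data)
```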

Implemented Root Node Types:

Any function can be used to generate root nodes. The only requirement is that the function has at least one named argument called n which controls the length of the resulting vector. For example, the user could specify a node of type "rnorm" to create a normally distributed node with no parents. The argument n will be set internally, but any additional arguments can be specified using the ... syntax. In the type="rnorm" example, the user could set the mean and standard deviation using node(name="example", type="rnorm", mean=10, sd=5).

For convenience, this package additionally includes three custom root-node functions:

  • "rbernoulli": Draws randomly from a Bernoulli distribution.

  • "rcategorical": Draws randomly from any discrete probability distribution.

  • "rconstant": Used to set a variable to a constant value.
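The sketch below combines the three convenience functions with a user-written root-node function. The function r_uniform_age and its arguments lower and upper are hypothetical names invented for this example; the only requirement taken from the text above is the argument named n.

```r
library(simDAG)

set.seed(3)

# hypothetical custom root-node function: n is set internally,
# lower and upper are supplied through ... in the node() call
r_uniform_age <- function(n, lower, upper) {
  round(stats::runif(n, min=lower, max=upper))
}

dag <- empty_dag() +
  node("age", type=r_uniform_age, lower=18, upper=65) +
  node("sex", type="rbernoulli", p=0.5) +
  node("region", type="rcategorical", probs=c(0.2, 0.3, 0.5)) +
  node("study", type="rconstant", constant=1)

dat <- sim_from_dag(dag, n_sim=20)
head(dat)
```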

Implemented Child Node Types:

Currently, the following node types are implemented directly for convenience:

  • "gaussian": A node based on linear regression.

  • "binomial": A node based on logistic regression.

  • "conditional_prob": A node based on conditional probabilities.

  • "conditional_distr": A node based on conditional draws from different distributions.

  • "multinomial": A node based on multinomial regression.

  • "poisson": A node based on Poisson regression.

  • "negative_binomial": A node based on negative binomial regression.

  • "cox": A node based on Cox regression.

  • "identity": A node that is just some R expression of other nodes.

For custom child node types, see below.

Implemented Time-Dependent Node Types:

Currently, the following node types are implemented directly for convenience to use in node_td calls:

  • "time_to_event": A node based on repeatedly checking whether an event occurs at each point in time.

  • "competing_events": A node based on repeatedly checking whether one of multiple mutually exclusive events occurs at each point in time.

However, the user may also use any of the child node types in a node_td call directly. For custom time-dependent node types, see below.

Custom Node Types

It is very simple to write a new custom node function to be used instead, allowing the user to use any type of data-generation mechanism for any type of node (root / child / time-dependent). All that is required of this function is that it has the named arguments data (the sample as generated so far) and, if it is a child node, parents (a character vector specifying the parents), and that it outputs either a vector containing n_sim entries or a data.frame with n_sim rows and an arbitrary number of columns. More information about this can be found on the node_custom documentation page.
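A minimal sketch of a custom child node might look like this. The function node_parent_sum and its noise_sd argument are hypothetical and exist only for this illustration; the node_custom documentation page remains the authoritative description of the requirements.

```r
library(simDAG)

set.seed(5)

# hypothetical custom child node: sum of all parents plus gaussian noise
# data    = the sample generated so far
# parents = character vector with the names of the parent nodes
node_parent_sum <- function(data, parents, noise_sd) {
  parent_vals <- as.data.frame(data)[, parents, drop=FALSE]
  # returns one value per simulated row
  rowSums(parent_vals) + stats::rnorm(nrow(parent_vals), sd=noise_sd)
}

dag <- empty_dag() +
  node("A", type="rnorm", mean=0, sd=1) +
  node("B", type="rnorm", mean=10, sd=2) +
  # noise_sd is passed on to node_parent_sum() through ...
  node("C", type=node_parent_sum, parents=c("A", "B"), noise_sd=0.1)

dat <- sim_from_dag(dag, n_sim=30)
head(dat)
```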

Using child nodes as parents for other nodes:

If the data generated by a child node is categorical (such as when using node_multinomial), it can still be used as a parent of other nodes for most standard node types without issues. All the user has to do is use the formula argument to supply an enhanced formula, instead of defining the parents and betas arguments directly. This works well for all node types that directly support formula input. For other node types, users may need to write custom functions to make this work. See the associated vignette: vignette(topic="v_using_formulas", package="simDAG") for more information on how to correctly use formulas.
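The following sketch shows the general idea with a categorical parent. For brevity the categorical variable is a root node created with rcategorical rather than a node_multinomial child, and all coefficients are invented. It assumes that the formula term for a factor level is written as the variable name followed by the level label (here sexfemale), in the same way sexTRUE is used for logical variables; consult the vignette for the exact rules.

```r
library(simDAG)

set.seed(7)

dag <- empty_dag() +
  node("age", type="rnorm", mean=50, sd=4) +
  # categorical variable stored as a factor with labelled levels
  node("sex", type="rcategorical", probs=c(0.5, 0.5),
       labels=c("male", "female"), output="factor") +
  # the coefficient for the non-reference level is named in the formula;
  # parents is given as well because a categorical variable is involved
  node("bmi", type="gaussian", parents=c("age", "sex"),
       formula= ~ 12 + age*1.1 + sexfemale*0.4, error=2)

dat <- sim_from_dag(dag, n_sim=100)
head(dat)
```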

Cyclic causal structures:

The name DAG (directed acyclic graph) implies that cycles are not allowed. This means that if you start from any node and only follow the arrows in the direction they are pointing, there should be no way to get back to your original node. This is necessary both theoretically and for practical reasons if we are dealing with static DAGs created using the node function. If the user attempts to generate data from a static cyclic graph using the sim_from_dag function, an error will be produced.

However, in the realm of discrete-time simulations, cyclic causal structures are perfectly reasonable. A variable \(A\) at \(t = 1\) may influence a variable \(B\) at \(t = 2\), which in turn may influence variable \(A\) at \(t = 3\) again. Therefore, when using the node_td function to simulate time-dependent data using the sim_discrete_time function, cyclic structures are allowed to be present and no error will be produced.

Note

Contrary to the R standard, this function does NOT support partial matching of argument names. This means that supplying nam="age" will not be recognized as name="age". Instead, it will be added as an additional node argument that is passed to the respective data-generating function when using sim_from_dag.

Value

Returns a DAG.node object which can be added to a DAG object directly.
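A short sketch of what this means in practice: the node can be stored in a variable first and added to a DAG later. The variable name age_node is arbitrary.

```r
library(simDAG)

# a node definition on its own, not yet part of any DAG
age_node <- node("age", type="rnorm", mean=30, sd=4)
class(age_node)

# it can be added to a DAG object afterwards
dag <- empty_dag() + age_node
```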

Author

Robin Denz

Examples

library(simDAG)

# creating a DAG with a single root node
dag <- empty_dag() +
  node("age", type="rnorm", mean=30, sd=4)

# creating a DAG with multiple root nodes
# (passing the functions directly to 'type' works too)
dag <- empty_dag() +
  node("sex", type=rbernoulli, p=0.5) +
  node("income", type=rnorm, mean=2700, sd=500)

# creating a DAG with multiple root nodes + multiple names in one node
dag <- empty_dag() +
  node("sex", type="rbernoulli", p=0.5) +
  node(c("income_1", "income_2"), type="rnorm", mean=2700, sd=500)

# also using child nodes
dag <- empty_dag() +
  node("sex", type="rbernoulli", p=0.5) +
  node("income", type="rnorm", mean=2700, sd=500) +
  node("sickness", type="binomial", parents=c("sex", "income"),
       betas=c(1.2, -0.3), intercept=-15) +
  node("death", type="binomial", parents=c("sex", "income", "sickness"),
       betas=c(0.1, -0.4, 0.8), intercept=-20)

# creating the same DAG as above, but using the enhanced formula interface
dag <- empty_dag() +
  node("sex", type="rbernoulli", p=0.5) +
  node("income", type="rnorm", mean=2700, sd=500) +
  node("sickness", type="binomial",
       formula= ~ -15 + sexTRUE*1.2 + income*-0.3) +
  node("death", type="binomial",
       formula= ~ -20 + sexTRUE*0.1 + income*-0.4 + sickness*0.8)

# using time-dependent nodes
# NOTE: to simulate data from this DAG, the sim_discrete_time() function needs
#       to be used due to "sickness" being a time-dependent node
dag <- empty_dag() +
  node("sex", type="rbernoulli", p=0.5) +
  node("income", type="rnorm", mean=2700, sd=500) +
  node_td("sickness", type="binomial", parents=c("sex", "income"),
          betas=c(0.1, -0.4), intercept=-50)

# we could also use a DAG with only time-varying variables
dag <- empty_dag() +
  node_td("vaccine", type="time_to_event", prob_fun=0.001, event_duration=21) +
  node_td("covid", type="time_to_event", prob_fun=0.01, event_duration=15,
          immunity_duration=100)