Create a node object to grow a DAG step-by-step
These functions should be used in conjunction with the empty_dag function to create DAG objects, which can then be used to simulate data using the sim_from_dag function or the sim_discrete_time function.
Usage
node(name, type, parents=NULL, formula=NULL, ...)
node_td(name, type, parents=NULL, formula=NULL, ...)
Arguments
- name
A character vector with at least one entry specifying the name of the node. If a character vector containing multiple different names is supplied, one separate node will be created for each name. These nodes are completely independent, but have the exact same node definition as supplied by the user. If only a single character string is provided, only one node is generated.
- type
A single character string specifying the type of the node. Which node types are allowed depends on whether the node is a root node, a child node or a time-dependent node. See details. Alternatively, a suitable function may be passed directly to this argument.
- parents
A character vector of names, specifying the parents of the node, or NULL (default). If NULL, the node is treated as a root node. For convenience, it is also allowed to set parents="" to indicate that the node is a root node.
- formula
An optional formula object to describe how the node should be generated, or NULL (default). If supplied, it should start with ~, having nothing else on the left hand side. The right hand side should define the entire structural equation, including the betas and intercepts. It may contain any valid formula syntax, such as ~ -2 + A*3 + B*4 or ~ -2 + A*3 + B*4 + I(A^2)*0.3 + A:B*1.1, allowing arbitrary non-linear effects, arbitrary interactions and multiple coefficients for categorical variables. If this argument is defined, there is no need to define the betas and intercept arguments. The parents argument should still be specified whenever a categorical variable is used in the formula. This argument is currently only supported for nodes of type "binomial", "gaussian", "poisson", "negative_binomial" and "cox". It is also supported for nodes of type "identity", but slightly different input is expected in that case. See examples and the associated vignette for an in-depth explanation.
- ...
Further named arguments needed to specify the node. These can be parameters of distribution functions, such as the p argument in the rbernoulli function for root nodes, or arbitrary named arguments, such as the betas argument of the node_gaussian function.
Details
To generate data using the sim_from_dag function or the sim_discrete_time function, it is required to create a DAG object first. This object needs to contain information about the causal structure of the data (e.g. which variable causes which variable) and the specific structural equations for each variable (information about causal coefficients, type of distribution etc.). In this package, the node and/or node_td function is used in conjunction with the empty_dag function to create this object.
This works by first initializing an empty DAG using the empty_dag function and then adding multiple calls to the node and/or node_td functions to it using a simple +, where each call to node and/or node_td adds information about a single node that should be generated. Multiple examples are given below.
In each call to node or node_td the user needs to indicate what the node should be called (name), which function should be used to generate the node (type), whether the node has any parents and if so which (parents), and any additional arguments needed to call the data-generating function of this node later, which are passed through the three-dot syntax (...).
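As a minimal sketch of the workflow described above, the following builds a small DAG and simulates from it. The variable names and coefficient values are purely illustrative assumptions, not recommendations.

```r
library(simDAG)

set.seed(42)

# start with an empty DAG and add one node() call per variable using "+"
dag <- empty_dag() +
  node("age", type="rnorm", mean=50, sd=4) +          # root node (no parents)
  node("sex", type="rbernoulli", p=0.5) +             # root node (no parents)
  node("bmi", type="gaussian", parents=c("age", "sex"),
       betas=c(1.1, 0.4), intercept=12, error=2)      # child node

# generate 100 observations from the fully specified DAG
sim_dat <- sim_from_dag(dag, n_sim=100)
head(sim_dat)
```

Here name and type are given for every node, parents only for the child node, and all remaining arguments (mean, sd, p, betas, intercept, error) reach the data-generating functions through the three-dot syntax.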
node vs. node_td:
By calling node you are indicating that this node is a time-fixed variable which should only be generated once. By using node_td you are indicating that it is a time-dependent node, which will be updated at each step in time when using a discrete-time simulation.
node_td should only be used if you are planning to perform a discrete-time simulation with the sim_discrete_time function. DAG objects including time-dependent nodes may not be used in the sim_from_dag function.
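The distinction can be illustrated with a short sketch that mixes one time-fixed and one time-dependent node. The node names, the event probability and the number of time steps are illustrative assumptions.

```r
library(simDAG)

set.seed(1)

dag <- empty_dag() +
  # time-fixed: generated once at the start of the simulation
  node("sex", type="rbernoulli", p=0.5) +
  # time-dependent: updated at every step of the discrete-time simulation
  node_td("sickness", type="time_to_event", prob_fun=0.01, event_duration=5)

# because the DAG contains a node_td() call, sim_discrete_time() is needed
sim <- sim_discrete_time(dag, n_sim=50, max_t=20)

# the state of the simulated individuals is stored in the data element
head(sim$data)
```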
Implemented Root Node Types:
Any function can be used to generate root nodes. The only requirement is that the function has at least one named argument called n which controls the length of the resulting vector. For example, the user could specify a node of type "rnorm" to create a normally distributed node with no parents. The argument n will be set internally, but any additional arguments can be specified using the ... syntax. In the type="rnorm" example, the user could set the mean and standard deviation using node(name="example", type="rnorm", mean=10, sd=5).
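The same pattern should carry over to other functions that take an argument n. The sketch below uses runif and rpois from the stats package; the node names and parameter values are illustrative assumptions.

```r
library(simDAG)

set.seed(123)

# both functions have a named argument n, which is set internally;
# the remaining arguments are passed on through the ... syntax
dag <- empty_dag() +
  node("dose", type="runif", min=0, max=10) +
  node("visits", type="rpois", lambda=3)

sim_dat <- sim_from_dag(dag, n_sim=50)
head(sim_dat)
```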
For convenience, this package additionally includes three custom root-node functions:
"rbernoulli": Draws randomly from a Bernoulli distribution.
"rcategorical": Draws randomly from any discrete probability density function.
"rconstant": Used to set a variable to a constant value.
Implemented Child Node Types:
Currently, the following node types are implemented directly for convenience:
"gaussian": A node based on linear regression.
"binomial": A node based on logistic regression.
"conditional_prob": A node based on conditional probabilities.
"conditional_distr": A node based on conditional draws from different distributions.
"multinomial": A node based on multinomial regression.
"poisson": A node based on Poisson regression.
"negative_binomial": A node based on negative binomial regression.
"cox": A node based on Cox regression.
"identity": A node that is just some R expression of other nodes.
For custom child node types, see below.
Implemented Time-Dependent Node Types:
Currently, the following node types are implemented directly for convenience to use in node_td calls:
"time_to_event": A node based on repeatedly checking whether an event occurs at each point in time.
"competing_events": A node based on repeatedly checking whether one of multiple mutually exclusive events occurs at each point in time.
However, the user may also use any of the child node types in a node_td call directly. For custom time-dependent node types, see below.
Custom Node Types
It is very simple to write a new custom node_function to be used instead, allowing the user to use any type of data-generation mechanism for any type of node (root / child / time-dependent). All that is required of this function is that it has the named arguments data (the sample as generated so far) and, if it is a child node, parents (a character vector specifying the parents), and that it outputs either a vector containing n_sim entries or a data.frame with n_sim rows and an arbitrary number of columns. More information about this can be found on the node_custom documentation page.
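A minimal sketch of such a custom child node is shown below. The function node_scaled_sum and its scale argument are hypothetical and only serve as an illustration; the function is passed directly to the type argument, as described under Arguments.

```r
library(simDAG)

set.seed(2024)

# hypothetical custom child node: sums up all parents and rescales the result
# - data: the sample as generated so far
# - parents: character vector with the names of the parent nodes
# - scale: illustrative extra argument, supplied through ... in node()
node_scaled_sum <- function(data, parents, scale=1) {
  total <- 0
  for (p in parents) {
    total <- total + data[[p]]
  }
  total * scale  # one value per row of data, i.e. n_sim entries
}

dag <- empty_dag() +
  node("A", type="rnorm", mean=0, sd=1) +
  node("B", type="rnorm", mean=5, sd=2) +
  node("C", type=node_scaled_sum, parents=c("A", "B"), scale=2)

sim_dat <- sim_from_dag(dag, n_sim=20)
head(sim_dat)
```

Columns are accessed with data[[p]] so that the function does not depend on whether data is a data.frame or a data.table.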
Using child nodes as parents for other nodes:
If the data generated by a child node is categorical (such as when using node_multinomial), it can still be used as a parent of other nodes for most standard node types without issues. All the user has to do is to use the formula argument to supply an enhanced formula, instead of defining the parents and betas arguments directly. This works well for all node types that directly support formula input. For other node types, users may need to write custom functions to make this work. See the associated vignette: vignette(topic="v_using_formulas", package="simDAG") for more information on how to correctly use formulas.
Cyclic causal structures:
The name DAG (directed acyclic graph) implies that cycles are not allowed. This means that if you start from any node and only follow the arrows in the direction they are pointing, there should be no way to get back to your original node. This is necessary both theoretically and for practical reasons if we are dealing with static DAGs created using the node function. If the user attempts to generate data from a static cyclic graph using the sim_from_dag function, an error will be produced.
However, in the realm of discrete-time simulations, cyclic causal structures are perfectly reasonable. A variable \(A\) at \(t = 1\) may influence a variable \(B\) at \(t = 2\), which in turn may influence variable \(A\) at \(t = 3\) again. Therefore, when using the node_td function to simulate time-dependent data using the sim_discrete_time function, cyclic structures are allowed to be present and no error will be produced.
Note
Contrary to the R standard, this function does NOT support partial matching of argument names. This means that supplying nam="age" will not be recognized as name="age" and will instead be added as an additional node argument used in the respective data-generating function call when using sim_from_dag.
Examples
library(simDAG)

# creating a DAG with a single root node
dag <- empty_dag() +
  node("age", type="rnorm", mean=30, sd=4)

# creating a DAG with multiple root nodes
# (passing the functions directly to 'type' works too)
dag <- empty_dag() +
  node("sex", type=rbernoulli, p=0.5) +
  node("income", type=rnorm, mean=2700, sd=500)

# creating a DAG with multiple root nodes + multiple names in one node
dag <- empty_dag() +
  node("sex", type="rbernoulli", p=0.5) +
  node(c("income_1", "income_2"), type="rnorm", mean=2700, sd=500)

# also using child nodes
dag <- empty_dag() +
  node("sex", type="rbernoulli", p=0.5) +
  node("income", type="rnorm", mean=2700, sd=500) +
  node("sickness", type="binomial", parents=c("sex", "income"),
       betas=c(1.2, -0.3), intercept=-15) +
  node("death", type="binomial", parents=c("sex", "income", "sickness"),
       betas=c(0.1, -0.4, 0.8), intercept=-20)

# creating the same DAG as above, but using the enhanced formula interface
dag <- empty_dag() +
  node("sex", type="rbernoulli", p=0.5) +
  node("income", type="rnorm", mean=2700, sd=500) +
  node("sickness", type="binomial",
       formula= ~ -15 + sexTRUE*1.2 + income*-0.3) +
  node("death", type="binomial",
       formula= ~ -20 + sexTRUE*0.1 + income*-0.4 + sickness*0.8)

# using time-dependent nodes
# NOTE: to simulate data from this DAG, the sim_discrete_time() function needs
#       to be used due to "sickness" being a time-dependent node
dag <- empty_dag() +
  node("sex", type="rbernoulli", p=0.5) +
  node("income", type="rnorm", mean=2700, sd=500) +
  node_td("sickness", type="binomial", parents=c("sex", "income"),
          betas=c(0.1, -0.4), intercept=-50)

# we could also use a DAG with only time-varying variables
dag <- empty_dag() +
  node_td("vaccine", type="time_to_event", prob_fun=0.001, event_duration=21) +
  node_td("covid", type="time_to_event", prob_fun=0.01, event_duration=15,
          immunity_duration=100)