
Simulate a Node Using a Mixture of Node Definitions
node_mixture.Rd
This node type allows users to apply different nodes to different subsets of the already generated data, making it possible to generate data for arbitrary mixture distributions. It is similar to node_conditional_distr and node_conditional_prob, with the main difference being that the former two only allow univariate distributions conditional on categorical variables, while this function allows any kind of node definition and condition. This makes it possible, for example, to generate data for a variable from different regression models for different subsets of simulated individuals.
Arguments
- data
A data.table (or something that can be coerced to a data.table) containing all columns specified by parents.
- parents
A character vector specifying the names of the parents that this particular child node has. This vector should include all nodes that are used in the conditions and the node calls specified in distr.
- name
A single character string specifying the name of the node.
- distr
An unnamed list that specifies both the conditions and the node definitions. It should be specified similarly to the arguments of the fcase function, as pairs of conditions (coded as strings) and node definitions. This means that a condition comes first, for example "A==0", followed by a call to node, and so on. Arbitrary numbers of those pairs are allowed, with no restrictions on what can be specified in the node calls. The name argument has to be specified in all node calls, but it does not matter which value is used, as it will be ignored in further processing. Currently, only time-fixed nodes defined using the node function are supported, not time-dependent nodes defined using the node_td function. See examples.
- default
A single value of some kind, used as a default value for those individuals not covered by any of the conditions defined in distr. Defaults to NA.
Details
Internally, the data is generated by extracting only the relevant part of the already generated data, as defined by the condition, and using the node function to generate the new response part. This generation is done in the order in which distr was specified, meaning that the first condition is processed first and so on. There are no safeguards to guarantee that the conditions do not overlap. For example, users are free to set the first condition to something like A > 10 and the next one to A > 11, in which case the value for every individual with A > 11 is generated twice (first with the first specification, then with the second). In this case, only the last generated value is retained.
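The overwrite behaviour for overlapping conditions can be sketched in base R. This is only an illustration of the logic described above, not the package's actual implementation: apply_mixture is a hypothetical helper, and the two generator functions are stand-ins for real node definitions.

```r
# Hypothetical helper (not part of simDAG): applies condition/generator pairs
# in the order given; later matches overwrite earlier ones.
apply_mixture <- function(data, conds, gens, default = NA) {
  out <- rep(default, nrow(data))
  for (i in seq_along(conds)) {
    # evaluate the condition string using the columns of data
    idx <- eval(parse(text = conds[[i]]), envir = data)
    idx <- !is.na(idx) & idx
    # generate values only for the rows matching this condition
    out[idx] <- gens[[i]](data[idx, , drop = FALSE])
  }
  out
}

data <- data.frame(A = c(9, 10.5, 12, 15))

res <- apply_mixture(
  data,
  conds = c("A > 10", "A > 11"),
  gens = list(
    function(d) rep(1, nrow(d)),  # stand-in for the first node definition
    function(d) rep(2, nrow(d))   # stand-in for the second node definition
  )
)
res
#> [1] NA  1  2  2
```

The row with A = 9 matches no condition and keeps the default, A = 10.5 only matches the first condition, and the rows with A > 11 end up with the value from the second specification.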
Note that it is also possible to use the mixture node itself inside the conditions or node calls in distr, because it is directly added to the data before the first condition is applied (by setting everyone to the default value). See examples.
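The following base R sketch mimics this sequence by hand, assuming a single parent A and a numeric default. The coefficients and sample size are arbitrary choices for illustration, and the code does not call simDAG itself.

```r
set.seed(42)
data <- data.frame(A = rnorm(1000))

# step 0: the mixture column exists before any condition is evaluated,
# filled with the default value
data$Y <- NA_real_

# step 1: a condition of "TRUE" covers every row
idx <- rep(TRUE, nrow(data))
data$Y[idx] <- -2 + 2 * data$A[idx] + rnorm(sum(idx))

# step 2: the condition "Y > 2" can now refer to the values from step 1
idx <- data$Y > 2
n_outliers <- sum(idx)
data$Y[idx] <- rnorm(n_outliers, mean = 10000, sd = 500)

summary(data$Y)
```

Because the second condition is evaluated after the first one has filled Y, only rows whose first-stage value exceeded 2 are replaced by the outlier distribution.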
Additionally, because the output of each of the parts of the mixture distributions is forced into one vector, they might be coerced from one class to another, depending on the input to distr and the order used. This also needs to be taken care of by the user.
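The coercion follows standard R rules for assigning into an atomic vector. A minimal base R illustration, independent of simDAG, with made-up values standing in for the output of two mixture parts:

```r
out <- rep(NA, 4)             # default of NA gives a logical vector
out[c(1, 2)] <- c(0.5, 1.5)   # numeric part: the vector becomes numeric
class(out)
#> [1] "numeric"

out[c(3, 4)] <- c("a", "b")   # character part: the whole vector becomes character
class(out)
#> [1] "character"
out
#> [1] "0.5" "1.5" "a"   "b"
```

After the second assignment, the values generated by the first part are stored as strings, so mixing parts of different classes usually calls for an explicit conversion afterwards.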
Value
Returns a vector of length nrow(data). The class of the vector is determined by what is specified in distr.
Examples
library(simDAG)
set.seed(1234)
## different linear regression models per level of a different covariate
# here, A is the group that is used for the conditioning, B is a predictor
# and Y is the mixture distributed outcome
dag <- empty_dag() +
  node("A", type="rbernoulli") +
  node("B", type="rnorm") +
  node("Y", type="mixture", parents="A",
       distr=list(
         "A==0", node(".", type="gaussian", formula= ~ -2 + B*2, error=1),
         "A==1", node(".", type="gaussian", formula= ~ 3 + B*5, error=1)
       ))
data <- sim_from_dag(dag, n_sim=100)
head(data)
#>         A          B          Y
#>    <lgcl>      <num>      <num>
#> 1:  FALSE -1.8060313 -5.9893002
#> 2:   TRUE -0.5820759  0.8500827
#> 3:   TRUE -1.1088896 -0.7019845
#> 4:   TRUE -1.0149620 -0.9624472
#> 5:   TRUE -0.1623095  2.2211163
#> 6:   TRUE  0.5630558  4.7008301
# also works with multiple conditions
dag <- empty_dag() +
  node(c("A", "C"), type="rbernoulli") +
  node("B", type="rnorm") +
  node("Y", type="mixture", parents=c("A", "C"),
       distr=list(
         "A==0 & C==1", node(".", type="gaussian", formula= ~ -2 + B*2, error=1),
         "A==1", node(".", type="gaussian", formula= ~ 3 + B*5, error=1)
       ))
data <- sim_from_dag(dag, n_sim=100)
head(data)
#>         A      C          B         Y
#>    <lgcl> <lgcl>      <num>     <num>
#> 1:   TRUE  FALSE -2.3160362 -7.691539
#> 2:   TRUE  FALSE  0.5624718  6.085333
#> 3:   TRUE  FALSE -0.7837751 -1.485517
#> 4:  FALSE   TRUE -0.2260540 -2.089441
#> 5:   TRUE  FALSE -1.5871030 -4.975722
#> 6:   TRUE  FALSE  0.5475242  6.285233
# using the mixture node itself in the condition
# see cookbook vignette, section on outliers for more info
dag <- empty_dag() +
  node(c("A", "B", "C"), type="rnorm") +
  node("Y", type="mixture", parents=c("A", "B", "C"),
       distr=list(
         "TRUE", node(".", type="gaussian", formula= ~ -2 + A*0.1 + B*1 + C*-2,
                      error=1),
         "Y > 2", node(".", type="rnorm", mean=10000, sd=500)
       ))
data <- sim_from_dag(dag, n_sim=100)