
Generate Data from a Mixture of Node Definitions
This node type allows users to apply different nodes to different subsets of the already generated data, making it possible to generate data for arbitrary mixture distributions. It is similar to node_conditional_distr and node_conditional_prob. The main difference is that those two functions only allow univariate distributions conditional on categorical variables, while this function allows any kind of node definition and condition. This makes it possible, for example, to generate data for a variable from different regression models for different subsets of simulated individuals.
Arguments
- data
A data.table (or something that can be coerced to a data.table) containing all columns specified by parents.
- parents
A character vector specifying the names of the parents that this particular child node has. This vector should include all nodes that are used in the conditions and the node calls specified in distr.
- name
A single character string specifying the name of the node.
- distr
An unnamed list that specifies both the conditions and the node definitions. It should be specified in a similar way as the fcase function, in pairs of conditions (coded as strings) and node definitions. This means that a condition comes first, for example "A==0", followed by a call to node, and so on. Any number of such pairs is allowed, with no restrictions on what can be specified in the node calls. The name argument has to be specified in all node calls, but its value does not matter because it is ignored in further processing. Currently, only time-fixed nodes defined using the node function are supported, not time-dependent nodes defined using the node_td function. See examples.
- default
A single value used as the default for those individuals not covered by any of the conditions defined in distr. Defaults to NA.
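The pairwise layout of distr can be pictured with a small base R sketch. It is only an illustration of the "condition, definition, condition, definition" ordering and is not taken from the package internals. The placeholder strings stand in for real node calls, and the variable names conditions and definitions are hypothetical.

```r
# Illustration only: placeholder strings stand in for real node() calls.
distr <- list(
  "A==0", "definition used where A==0",
  "A==1", "definition used where A==1"
)

# an unnamed list of pairs must have an even number of elements
stopifnot(length(distr) %% 2 == 0)

# odd positions hold the conditions, even positions the definitions
conditions  <- unlist(distr[seq(1, length(distr), by = 2)])
definitions <- distr[seq(2, length(distr), by = 2)]

conditions
```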
Details
Internally, the data is generated by extracting only the part of the already generated data that satisfies the condition and using the node function to generate the corresponding part of the response. This is done in the order in which the conditions were specified in distr, meaning that the first condition is processed first, and so on. There are no safeguards to guarantee that the conditions do not overlap. For example, users are free to set the first condition to something like A > 10 and the next one to A > 11, in which case the value for every individual with A > 11 is generated twice (first with the first specification, then with the second). In this case, only the last generated value is retained.
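The consequence of overlapping conditions can be sketched in base R. This is a simplified illustration of the ordering described above, not the package's actual implementation. The generator functions gen_first and gen_second are hypothetical stand-ins for node definitions and return constants so that the overwrite is easy to see.

```r
# Illustration only, not simDAG's internal code.
data <- data.frame(A = c(9, 10.5, 11.5, 12))

# hypothetical stand-ins for two node definitions
gen_first  <- function(n) rep(1, n)   # paired with "A > 10"
gen_second <- function(n) rep(2, n)   # paired with "A > 11"

conditions <- c("A > 10", "A > 11")
generators <- list(gen_first, gen_second)

# everyone starts with the default value
Y <- rep(NA_real_, nrow(data))

# conditions are applied in order; later matches overwrite earlier ones
for (i in seq_along(conditions)) {
  rows <- eval(parse(text = conditions[i]), envir = data)
  Y[rows] <- generators[[i]](sum(rows))
}

Y
```

In this sketch the individuals with A > 11 receive a value from gen_first and are then overwritten by gen_second, while the individual with A = 9 keeps the default.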
Note that it is also possible to use the mixture node itself inside the conditions or node calls in distr, because it is directly added to the data before the first condition is applied (by setting everyone to the default value). See examples.
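Why a condition may refer to the mixture node itself can be seen in the following base R sketch. It assumes, as described above, that the new column is first filled with the default and then updated condition by condition. The steps list and its gen functions are hypothetical, and treating an NA condition as "not matched" is an assumption of this sketch.

```r
# Illustration only, not simDAG's internal code.
data <- data.frame(A = c(-1, 0.5, 3))

# the new column exists before the first condition is evaluated
data$Y <- NA_real_

# hypothetical condition / generator pairs
steps <- list(
  list(cond = "TRUE",  gen = function(d) 2 * d$A),
  list(cond = "Y > 2", gen = function(d) rep(10000, nrow(d)))
)

for (s in steps) {
  # evaluate the condition against the data, including the current Y
  rows <- rep_len(eval(parse(text = s$cond), envir = data), nrow(data))
  rows[is.na(rows)] <- FALSE   # sketch assumption: NA means "not matched"
  data$Y[rows] <- s$gen(data[rows, , drop = FALSE])
}

data$Y
```

The second condition sees the values written by the first one, so only the row whose first-pass value exceeds 2 is replaced.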
Additionally, because the output of each of the parts of the mixture distributions is forced into one vector, they might be coerced from one class to another, depending on the input to distr and the order used. This also needs to be taken care of by the user.
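The coercion issue is a general property of R vectors and can be reproduced without the package. The following minimal sketch shows how the class of one vector changes as parts of different classes are written into it.

```r
# A vector holds a single class, so mixed parts get coerced.
Y <- rep(NA, 4)            # default NA is logical
class_before <- class(Y)   # "logical"

Y[1:2] <- c(0.5, 1.5)      # numeric part: vector becomes numeric
class_numeric <- class(Y)  # "numeric"

Y[3:4] <- c("a", "b")      # character part: everything becomes character
class_after <- class(Y)    # "character"

Y
```

After the last assignment, the numbers written earlier are stored as strings, which is why the order and the classes of the parts in distr matter.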
Value
Returns a vector of length nrow(data). The class of the vector is determined by what is specified in distr.
Examples
library(simDAG)
set.seed(1234)
## different linear regression models per level of a different covariate
# here, A is the group that is used for the conditioning, B is a predictor
# and Y is the mixture distributed outcome
dag <- empty_dag() +
  node("A", type="rbernoulli") +
  node("B", type="rnorm") +
  node("Y", type="mixture", parents="A",
       distr=list(
         "A==0", node(".", type="gaussian", formula= ~ -2 + B*2, error=1),
         "A==1", node(".", type="gaussian", formula= ~ 3 + B*5, error=1)
       ))
data <- sim_from_dag(dag, n_sim=100)
head(data)
#>         A          B          Y
#>    <lgcl>      <num>      <num>
#> 1:  FALSE -1.8060313 -5.9893002
#> 2:   TRUE -0.5820759  0.8500827
#> 3:   TRUE -1.1088896 -0.7019845
#> 4:   TRUE -1.0149620 -0.9624472
#> 5:   TRUE -0.1623095  2.2211163
#> 6:   TRUE  0.5630558  4.7008301
# also works with multiple conditions
dag <- empty_dag() +
  node(c("A", "C"), type="rbernoulli") +
  node("B", type="rnorm") +
  node("Y", type="mixture", parents=c("A", "C"),
       distr=list(
         "A==0 & C==1", node(".", type="gaussian", formula= ~ -2 + B*2, error=1),
         "A==1", node(".", type="gaussian", formula= ~ 3 + B*5, error=1)
       ))
data <- sim_from_dag(dag, n_sim=100)
head(data)
#>         A      C          B         Y
#>    <lgcl> <lgcl>      <num>     <num>
#> 1:   TRUE  FALSE -2.3160362 -7.691539
#> 2:   TRUE  FALSE  0.5624718  6.085333
#> 3:   TRUE  FALSE -0.7837751 -1.485517
#> 4:  FALSE   TRUE -0.2260540 -2.089441
#> 5:   TRUE  FALSE -1.5871030 -4.975722
#> 6:   TRUE  FALSE  0.5475242  6.285233
# using the mixture node itself in the condition
# see cookbook vignette, section on outliers for more info
dag <- empty_dag() +
  node(c("A", "B", "C"), type="rnorm") +
  node("Y", type="mixture", parents=c("A", "B", "C"),
       distr=list(
         "TRUE", node(".", type="gaussian", formula= ~ -2 + A*0.1 + B*1 + C*-2,
                      error=1),
         "Y > 2", node(".", type="rnorm", mean=10000, sd=500)
       ))
data <- sim_from_dag(dag, n_sim=100)