Generate Data from a Mixture of Node Definitions

This node type applies different node definitions to different subsets of the already generated data, which makes it possible to generate data from arbitrary mixture distributions. It is similar to node_conditional_distr and node_conditional_prob. The main difference is that those two functions only allow univariate distributions conditional on categorical variables, while this function allows any kind of node definition and condition. For example, it can be used to generate a variable from different regression models for different subsets of simulated individuals.

Usage

node_mixture(data, parents, name, distr, default=NA)

Arguments

data: A data.table (or something that can be coerced to a data.table) containing all columns specified by parents.
parents: A character vector specifying the names of the parents that this particular child node has. This vector should include all nodes that are used in the conditions and the node calls specified in distr.
name: A single character string specifying the name of the node.
distr: An unnamed list that specifies both the conditions and the node definitions. It should be specified as alternating pairs of conditions (coded as strings) and node definitions, similar to the arguments of the fcase function. This means that a condition comes first, for example "A==0", followed by a call to node, then the next condition, and so on. Any number of such pairs is allowed, with no restrictions on what can be specified in the node calls. The name argument has to be specified in all node calls, but its value does not matter because it is ignored in further processing. Currently, only time-fixed nodes defined using the node function are supported, not time-dependent nodes defined using the node_td function. See examples.
default: A single value used for those individuals that are not covered by any of the conditions defined in distr. Defaults to NA.

Author

Robin Denz

Details

Internally, the data is generated by extracting only the part of the already generated data that fulfills a condition and using the corresponding node definition to generate the new values for that part. This is done in the order in which the pairs were specified in distr, meaning that data for the first condition is generated first, then data for the second condition, and so on. There are no safeguards to guarantee that the conditions do not overlap. For example, users are free to set the first condition to something like A > 10 and the next one to A > 11, in which case the value for every individual with A > 11 is generated twice (first with the first specification, then with the second). In this case, only the last generated value is retained.
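
The effect of overlapping conditions can be illustrated with a minimal base R sketch. This is not the package's internal code. It only mimics the described behavior under the assumption that each condition string is evaluated on the data and the matching rows are overwritten in order. The objects dat, conds and values are hypothetical, and constants stand in for the node definitions.

```r
# hypothetical data with a single parent A
dat <- data.frame(A = c(9, 10.5, 11.5, 12, 8, 10.2, 13, 11))

# start with the default value for everyone
Y <- rep(NA_real_, nrow(dat))

# two overlapping conditions; the constants 1 and 2 stand in for
# the values that the two node definitions would generate
conds <- c("A > 10", "A > 11")
values <- c(1, 2)

# apply the conditions in the order in which they were specified
for (i in seq_along(conds)) {
  idx <- eval(parse(text = conds[i]), envir = dat)
  Y[idx] <- values[i]
}

# rows with A > 11 were overwritten by the second condition,
# rows matching no condition keep the default
Y
```

Individuals with A between 10 and 11 keep the value from the first condition, individuals with A > 11 end up with the value from the second one, and everyone else keeps the default.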

Note that it is also possible to use the mixture node itself inside the conditions or node calls in distr, because it is directly added to the data before the first condition is applied (by setting everyone to the default value). See examples.
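
Why the node can refer to itself is easiest to see in a small base R sketch of the described mechanism. Again, this is an illustration under assumptions and not the package's implementation: the vector Y is created with the default value first, so it already exists when a later condition such as Y > 0 is evaluated. All object names and coefficients are made up for this sketch.

```r
set.seed(42)
n <- 1000
A <- rnorm(n)

# the mixture node exists from the start, filled with the default
Y <- rep(NA_real_, n)

# first pair: the condition "TRUE" covers everyone
idx <- rep(TRUE, n)
Y[idx] <- -2 + 0.5 * A[idx] + rnorm(sum(idx))

# second pair: the condition refers to the values of Y generated above
idx <- Y > 0
Y[idx] <- rnorm(sum(idx), mean = 10000, sd = 500)

summary(Y)
```

Because the first condition covers all rows, no default values remain when the second condition is evaluated. If some rows were still NA at that point, the comparison would return NA for them, which a real application would have to handle.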

Additionally, because the output of all parts of the mixture distribution is forced into one vector, values might be coerced from one class to another, depending on the input to distr and the order used. It is up to the user to make sure that the node definitions return compatible classes.
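
The coercion follows the usual rules for atomic vectors in R, which can be shown without the package. In the following sketch, the index ranges are hypothetical stand-ins for two conditions, one whose node definition returns numbers and one whose node definition returns character strings.

```r
# the default NA is a logical vector
Y <- rep(NA, 4)
class(Y)

# assigning numeric values to a subset turns the vector into numeric
Y[1:2] <- c(0.5, 1.2)
class_after_numeric <- class(Y)
class_after_numeric

# assigning character values afterwards turns everything into character,
# including the numbers that were generated earlier
Y[3:4] <- c("a", "b")
class(Y)
Y
```

After the last assignment, the previously numeric values are stored as strings, so any later numeric operation on the node would need an explicit conversion.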

Value

Returns a vector of length nrow(data). The class of the vector is determined by what is specified in distr.

Examples

library(simDAG)

set.seed(1234)

## different linear regression models per level of a different covariate
# here, A is the group that is used for the conditioning, B is a predictor
# and Y is the mixture distributed outcome
dag <- empty_dag() +
  node("A", type="rbernoulli") +
  node("B", type="rnorm") +
  node("Y", type="mixture", parents="A",
       distr=list(
         "A==0", node(".", type="gaussian", formula= ~ -2 + B*2, error=1),
         "A==1", node(".", type="gaussian", formula= ~ 3 + B*5, error=1)
       ))
data <- sim_from_dag(dag, n_sim=100)
head(data)
#>         A          B          Y
#>    <lgcl>      <num>      <num>
#> 1:  FALSE -1.8060313 -5.9893002
#> 2:   TRUE -0.5820759  0.8500827
#> 3:   TRUE -1.1088896 -0.7019845
#> 4:   TRUE -1.0149620 -0.9624472
#> 5:   TRUE -0.1623095  2.2211163
#> 6:   TRUE  0.5630558  4.7008301

# also works with multiple conditions
dag <- empty_dag() +
  node(c("A", "C"), type="rbernoulli") +
  node("B", type="rnorm") +
  node("Y", type="mixture", parents=c("A", "C"),
    distr=list(
      "A==0 & C==1", node(".", type="gaussian", formula= ~ -2 + B*2, error=1),
      "A==1", node(".", type="gaussian", formula= ~ 3 + B*5, error=1)
    ))
data <- sim_from_dag(dag, n_sim=100)
head(data)
#>         A      C          B         Y
#>    <lgcl> <lgcl>      <num>     <num>
#> 1:   TRUE  FALSE -2.3160362 -7.691539
#> 2:   TRUE  FALSE  0.5624718  6.085333
#> 3:   TRUE  FALSE -0.7837751 -1.485517
#> 4:  FALSE   TRUE -0.2260540 -2.089441
#> 5:   TRUE  FALSE -1.5871030 -4.975722
#> 6:   TRUE  FALSE  0.5475242  6.285233

# using the mixture node itself in the condition
# see cookbook vignette, section on outliers for more info
dag <- empty_dag() +
  node(c("A", "B", "C"), type="rnorm") +
  node("Y", type="mixture", parents=c("A", "B", "C"),
       distr=list(
         "TRUE", node(".", type="gaussian", formula= ~ -2 + A*0.1 + B*1 + C*-2,
                      error=1),
         "Y > 2", node(".", type="rnorm", mean=10000, sd=500)
       ))
data <- sim_from_dag(dag, n_sim=100)