Simulate a Node Using Linear Regression
node_gaussian.Rd
The node is generated from its parents using linear regression: the covariate-specific mean is predicted for each observation, and the node value is then sampled from a normal distribution with that mean and a specified standard deviation.
Arguments
- data
A data.table (or something that can be coerced to a data.table) containing all columns specified by parents.
- parents
A character vector specifying the names of the parents that this particular child node has. If non-linear combinations or interaction effects should be included, the user may specify the formula argument instead.
- formula
An optional formula object to describe how the node should be generated, or NULL (default). If supplied, it should start with ~, having nothing else on the left-hand side. The right-hand side may contain any valid formula syntax, such as A + B or A + B + I(A^2), allowing non-linear effects. If this argument is defined, there is no need to define the parents argument. For example, using parents=c("A", "B") is equivalent to using formula= ~ A + B.
- betas
A numeric vector with the same length as parents, specifying the causal beta coefficients used to generate the node.
- intercept
A single number specifying the intercept that should be used when generating the node.
- error
A single number specifying the standard deviation of the error term (sigma) that should be used when generating the node.
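To make the relationship between parents and formula more concrete, the sketch below uses base R's model.matrix() to show which columns a formula expands into. This only illustrates standard formula semantics; whether simDAG evaluates the formula in exactly this way internally is an assumption, and the toy columns A and B are hypothetical.

```r
# Toy data with two hypothetical parent columns A and B.
dat <- data.frame(A = c(1, 2, 3), B = c(0.5, 1.5, 2.5))

# parents = c("A", "B") corresponds to the formula ~ A + B.
X_linear <- model.matrix(~ A + B, data = dat)

# Adding I(A^2) introduces one extra column for the squared term.
X_quad <- model.matrix(~ A + B + I(A^2), data = dat)

colnames(X_linear)
colnames(X_quad)
```

Under this reading, each non-intercept column would need its own coefficient, so a formula with a squared term presumably calls for one more beta than the purely linear version.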
Details
Using the general linear regression equation, the value expected under the model is computed for every observation in the dataset generated thus far. Stopping here would give a perfect fit for the node, which is unrealistic. Instead, an error term is drawn for each observation from a normal distribution with mean zero and standard deviation error. This error term is then added to the predicted mean.
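The two steps described above (predict the mean, then add normal noise) can be sketched in base R as follows. The helper sim_gaussian_node() is hypothetical and is not part of simDAG; it assumes a plain data.frame with numeric parent columns and is meant only as an illustration of the mechanism.

```r
# Hypothetical helper, not part of simDAG: generate a gaussian child node
# from a data.frame with numeric parent columns.
sim_gaussian_node <- function(data, parents, betas, intercept, error) {
  stopifnot(length(parents) == length(betas))
  # Step 1: predicted (covariate-specific) mean for every row.
  X <- as.matrix(data[, parents, drop = FALSE])
  mu <- intercept + drop(X %*% betas)
  # Step 2: one normal draw per observation, mean zero and sd `error`.
  mu + rnorm(nrow(data), mean = 0, sd = error)
}

set.seed(1)
dat <- data.frame(A = rnorm(5), B = rnorm(5))

# With error = 2 the node scatters around the predicted mean.
y_noisy <- sim_gaussian_node(dat, c("A", "B"), betas = c(0.2, 1.3),
                             intercept = -15, error = 2)

# With error = 0 the node equals the predicted mean (the "perfect fit").
y_exact <- sim_gaussian_node(dat, c("A", "B"), betas = c(0.2, 1.3),
                             intercept = -15, error = 0)
```

Setting error to 0 reproduces the unrealistic perfect fit mentioned above, which is a convenient way to check the deterministic part separately.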
Formal Description:
Formally, the data generation can be described as:
$$Y \sim \texttt{intercept} + \texttt{parents}_1 \cdot \texttt{betas}_1 + ... + \texttt{parents}_n \cdot \texttt{betas}_n+ N(0, \texttt{error}),$$
where \(N(0, \texttt{error})\) denotes the normal distribution with mean 0 and a standard deviation of error, and \(n\) is the number of parents (length(parents)).
For example, given intercept=-15, parents=c("A", "B"), betas=c(0.2, 1.3) and error=2, the data generation process is defined as:
$$Y \sim -15 + A \cdot 0.2 + B \cdot 1.3 + N(0, 2).$$
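As a rough check of this worked example, the sketch below simulates the equation directly in base R and refits it with lm(). The distributions of A and B are assumptions chosen only for illustration, since the text does not specify them.

```r
set.seed(42)
n <- 10000

# Hypothetical parent distributions, chosen only for illustration.
A <- rnorm(n, mean = 10, sd = 3)
B <- rnorm(n, mean = 5, sd = 2)

# Y = -15 + A * 0.2 + B * 1.3 + N(0, 2)
Y <- -15 + A * 0.2 + B * 1.3 + rnorm(n, mean = 0, sd = 2)

# Refit the generating model to the simulated data.
fit <- lm(Y ~ A + B)
coef(fit)    # estimates should lie near -15, 0.2 and 1.3
sigma(fit)   # residual standard deviation should lie near 2
```

With a sample this large, the fitted coefficients and residual standard deviation should land close to the values used for generation, though the exact numbers depend on the seed.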
Examples
library(simDAG)
set.seed(12455432)
# define a DAG
dag <- empty_dag() +
  node("age", type="rnorm", mean=50, sd=4) +
  node("sex", type="rbernoulli", p=0.5) +
  node("bmi", type="gaussian", parents=c("sex", "age"),
       betas=c(1.1, 0.4), intercept=12, error=2)

# define the same DAG, but with a pretty formula for the child node
dag <- empty_dag() +
  node("age", type="rnorm", mean=50, sd=4) +
  node("sex", type="rbernoulli", p=0.5) +
  node("bmi", type="gaussian", error=2,
       formula= ~ 12 + sexTRUE*1.1 + age*0.4)

# simulate 100 observations from the DAG
sim_dat <- sim_from_dag(dag=dag, n_sim=100)
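The term sexTRUE in the formula above may look unusual. A plausible reading, assumed here rather than confirmed by the text, is that it follows base R's convention of naming the dummy column of a logical variable "<name>TRUE". The sketch below shows that convention with model.matrix() on hypothetical stand-in data and computes the expected bmi before noise for two example rows.

```r
# Hypothetical stand-in data; in the example above these columns would
# come from sim_from_dag().
dat <- data.frame(age = c(50, 50), sex = c(TRUE, FALSE))

# Base R names the dummy column of a logical variable "<name>TRUE".
X <- model.matrix(~ sex + age, data = dat)
colnames(X)

# Expected bmi (before noise) with intercept 12, sexTRUE 1.1 and age 0.4.
betas <- c("(Intercept)" = 12, sexTRUE = 1.1, age = 0.4)
mu <- drop(X %*% betas[colnames(X)])
mu
```

For a 50-year-old, the expected value is 12 + 1.1 + 0.4 * 50 = 33.1 when sex is TRUE and 12 + 0.4 * 50 = 32 when sex is FALSE; the simulated bmi then scatters around these means with standard deviation 2.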