Simulate a Node Using Logistic Regression
node_binomial.Rd
Data from the parents is used to generate the node using logistic regression: the covariate-specific probability of 1 is predicted and the node is then sampled from a Bernoulli distribution accordingly.
Usage
node_binomial(data, parents, formula=NULL, betas, intercept,
              return_prob=FALSE, output="logical", labels=NULL)
Arguments
- data
A data.table (or something that can be coerced to a data.table) containing all columns specified by parents.
- parents
A character vector specifying the names of the parents of this particular child node. If non-linear combinations or interaction effects should be included, the user may specify the formula argument instead.
- formula
An optional formula object describing how the node should be generated, or NULL (default). If supplied, it should start with ~ and have nothing on the left-hand side. The right-hand side may contain any valid formula syntax, such as A + B or A + B + I(A^2), allowing non-linear effects. If this argument is defined, there is no need to define the parents argument. For example, using parents=c("A", "B") is equivalent to using formula= ~ A + B.
- betas
A numeric vector with length equal to the length of parents, specifying the causal beta coefficients used to generate the node.
- intercept
A single number specifying the intercept that should be used when generating the node.
- return_prob
Either TRUE or FALSE (default). If TRUE, the calculated probability is returned instead of the results of Bernoulli trials.
- output
A single character string, must be either "logical" (default), "numeric", "character" or "factor". If output="character" or output="factor", the labels (or levels in case of a factor) can be set using the labels argument.
- labels
A character vector of length 2 or NULL (default). If NULL, the resulting vector is returned as is. If a character vector is supplied and output="character" or output="factor" is used, all TRUE values are replaced by the first entry of this vector and all FALSE values are replaced by the second entry of this vector. The output will then be a character variable or factor variable, depending on the output argument. This argument is ignored if output is set to "numeric" or "logical".
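To make the relationship between parents, formula and betas more concrete, the following base R sketch expands a formula into design-matrix columns using stats::model.matrix, so that one coefficient in betas corresponds to one column. Whether node_binomial builds its design matrix in exactly this way is an assumption; the data and the names toy_data, design and n_betas_needed are hypothetical.

```r
# Hypothetical toy data with two parent columns A and B
toy_data <- data.frame(A = c(1, 2, 3), B = c(0, 1, 0))

# Expand the formula ~ A + B + I(A^2) into a design matrix
design <- model.matrix(~ A + B + I(A^2), data = toy_data)

# Drop the automatically added "(Intercept)" column, since the
# intercept is supplied separately through the intercept argument
design <- design[, colnames(design) != "(Intercept)", drop = FALSE]

# One beta coefficient is needed per remaining column
print(colnames(design))
n_betas_needed <- ncol(design)
```

With parents=c("A", "B") only the first two columns would be present, which is why betas then needs exactly two entries.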
Details
Using the general form of a logistic regression model, the observation-specific event probability is calculated for every observation in the dataset. Using the rbernoulli function, this probability is then used to draw one Bernoulli sample for each observation in the dataset. If only the probability should be returned, return_prob should be set to TRUE.
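The two steps described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's internal code: stats::rbinom with size = 1 stands in for the rbernoulli function, and the parent data and coefficient values are hypothetical.

```r
set.seed(42)

# Hypothetical parent data
n <- 1000
A <- rnorm(n, mean = 0, sd = 1)
B <- rbinom(n, size = 1, prob = 0.5)

# Hypothetical causal coefficients
intercept <- -1
betas <- c(0.5, 1)

# Step 1: linear predictor on the log-odds scale
lin_pred <- intercept + A * betas[1] + B * betas[2]

# Step 1 (continued): observation-specific event probability,
# obtained with the logistic function (inverse logit)
prob <- plogis(lin_pred)

# Step 2: one Bernoulli trial per observation, stored as logical
Y <- rbinom(n, size = 1, prob = prob) == 1
```

Returning prob instead of Y corresponds to what is described for return_prob=TRUE.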
Formal Description:
Formally, the data generation can be described as:
$$Y \sim Bernoulli(logit^{-1}(\texttt{intercept} + \texttt{parents}_1 \cdot \texttt{betas}_1 + ... + \texttt{parents}_n \cdot \texttt{betas}_n)),$$
where \(Bernoulli(p)\) denotes one Bernoulli trial with success probability \(p\), \(n\) is the number of parents (length(parents)) and \(logit^{-1}(x)\) is the inverse of the logit function \(logit(p) = ln(\frac{p}{1-p})\), defined as:
$$logit^{-1}(x) = \frac{1}{1 + e^{-x}}.$$
For example, given intercept=-15, parents=c("A", "B") and betas=c(0.2, 1.3) the data generation process is defined as:
$$Y \sim Bernoulli(logit^{-1}(-15 + A \cdot 0.2 + B \cdot 1.3)).$$
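As a worked illustration of this example, the following base R sketch evaluates the success probability for a single observation. The parent values A = 50 and B = 1 are hypothetical, and the linear predictor is mapped to a probability with the logistic function stats::plogis.

```r
intercept <- -15
betas <- c(0.2, 1.3)

# Hypothetical parent values for a single observation
A <- 50
B <- 1

# Linear predictor: -15 + 50 * 0.2 + 1 * 1.3 = -3.7
lin_pred <- intercept + A * betas[1] + B * betas[2]

# Success probability: 1 / (1 + exp(3.7)), roughly 0.024
prob <- plogis(lin_pred)
print(prob)
```

Under these assumed values an event would be drawn for roughly 2.4 percent of such observations.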
Output Format:
By default this function returns a logical vector containing only TRUE and FALSE entries, where TRUE corresponds to an event and FALSE to no event. This may be changed by using the output and labels arguments. Both of these arguments are ignored if return_prob is set to TRUE.
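The conversion rules described for output and labels can be sketched with a small helper. The function format_binary below is hypothetical and written only for illustration; it is not part of the package, and the factor level order it produces is an assumption.

```r
# Hypothetical helper mimicking the described output and labels arguments
format_binary <- function(x, output = "logical", labels = NULL) {
  if (output == "logical") {
    return(x)
  }
  if (output == "numeric") {
    return(as.numeric(x))
  }
  # "character" or "factor": TRUE -> first label, FALSE -> second label
  out <- as.character(x)
  if (!is.null(labels)) {
    out <- ifelse(x, labels[1], labels[2])
  }
  if (output == "factor") {
    out <- factor(out)
  }
  out
}

draws <- c(TRUE, FALSE, TRUE)
as_numeric <- format_binary(draws, output = "numeric")
as_char <- format_binary(draws, output = "character",
                         labels = c("smoker", "non-smoker"))
as_fac <- format_binary(draws, output = "factor",
                        labels = c("smoker", "non-smoker"))
```

Here as_numeric holds 1, 0, 1, while as_char holds "smoker", "non-smoker", "smoker".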
Examples
library(simDAG)

set.seed(5425)

# define needed DAG
dag <- empty_dag() +
  node("age", type="rnorm", mean=50, sd=4) +
  node("sex", type="rbernoulli", p=0.5) +
  node("smoking", type="binomial", parents=c("age", "sex"),
       betas=c(1.1, 0.4), intercept=-2)

# define the same DAG, but using a pretty formula
dag <- empty_dag() +
  node("age", type="rnorm", mean=50, sd=4) +
  node("sex", type="rbernoulli", p=0.5) +
  node("smoking", type="binomial",
       formula= ~ -2 + age*1.1 + sexTRUE*0.4)

# simulate data from it
sim_dat <- sim_from_dag(dag=dag, n_sim=100)

# returning only the estimated probability instead
dag <- empty_dag() +
  node("age", type="rnorm", mean=50, sd=4) +
  node("sex", type="rbernoulli", p=0.5) +
  node("smoking", type="binomial", parents=c("age", "sex"),
       betas=c(1.1, 0.4), intercept=-2, return_prob=TRUE)

sim_dat <- sim_from_dag(dag=dag, n_sim=100)