Simulation Models

library(fdaoutlier)

The following are simulation models included in the fdaoutlier package. Some of these models were curated from research work related to functional depths and outlier detection for functional data. This document presents the model equations as well as their corresponding functions and parameters in fdaoutlier. The parameters of the fdaoutlier functions have been set to reasonable default values for ease of use.

Model 1

This is a typical magnitude model in which outliers are shifted from the ‘normal’ non-outlying observations. The main model is of the form:

$X_i(t) = \mu t + e_i(t),$ and the contamination model is of the form:

$X_i(t) = \mu t + qk_i + e_i(t)$ where:

• $$t\in [0,1]$$,
• $$e_i(t)$$ is a Gaussian process with zero mean and covariance function of the form: $\gamma(s,t) = \alpha\exp\{-\beta|t-s|^\nu\},$
• $$k_i \in \{-1, 1\}$$ (usually with $$P(k_i = -1) = P(k_i=1) = 0.5$$),
• and $$q$$ is a constant controlling how far the outliers are from the mean function of the data, usually, $$q = 6$$ or $$q = 8$$.
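The two equations above can be sketched in base R without the package. This is a minimal illustration and not the implementation used by fdaoutlier: the helper name exp_cov, the Cholesky-based sampler, and the values chosen for mu, q, alpha, beta, and nu are assumptions made for the example.

```r
# Covariance function gamma(s, t) = alpha * exp(-beta * |t - s|^nu) on a grid
# (exp_cov is a hypothetical helper, not a package function)
exp_cov <- function(grid, alpha = 1, beta = 1, nu = 1) {
  alpha * exp(-beta * abs(outer(grid, grid, "-"))^nu)
}

set.seed(1)
n <- 100; p <- 50; n_out <- 10      # curves, grid points, outliers
mu <- 4; q <- 8                     # illustrative values (assumptions)
tt <- seq(0, 1, length.out = p)

# Zero-mean Gaussian process paths: rows of Z %*% R have covariance t(R) %*% R
R <- chol(exp_cov(tt))
e <- matrix(rnorm(n * p), n, p) %*% R

# Main model: X_i(t) = mu * t + e_i(t)
X_main <- sweep(e, 2, mu * tt, "+")

# Contamination model: add q * k_i, with k_i in {-1, 1}, to the chosen curves
out_idx <- sort(sample(n, n_out))
k <- sample(c(-1, 1), n_out, replace = TRUE)
X <- X_main
X[out_idx, ] <- X_main[out_idx, ] + q * k
```

Each outlying curve is the corresponding main-model curve moved up or down by q over the whole domain, which is what makes this a magnitude model.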

This model can be accessed with the simulation_model1() function in fdaoutlier.

library(fdaoutlier)
dtss <- simulation_model1(n = 100, p = 50, outlier_rate = .1,
                          seed = 50, plot = FALSE)

The returned object is a list containing a matrix of the data and a vector of the indices of the true outliers:

dim(dtss$data)
#> [1] 100 50
dtss$true_outliers
#>  [1] 11 14 20 43 53 70 79 81 83 96

The simulated data can be tuned through the following additional parameters of simulation_model1():

• mu: the coefficient $$\mu$$ in the main and contamination models controlling the mean function.
• q: the shift parameter $$q$$ in the contamination model which controls how far the outliers are from the mean function.
• kprob: the probability that $$k_i = 1$$, i.e., $$P(k_i=1)$$ in the contamination model.
• cov_alpha: the coefficient $$\alpha$$ in the covariance function.
• cov_beta: the coefficient $$\beta$$ in the covariance function.
• cov_nu: the coefficient $$\nu$$ in the covariance function.

Additional plotting parameters allow for modifying the plot title (plot_title), the font size of the title (title_cex), the display of the legend (show_legend), the y-axis label (ylabel), and the x-axis label (xlabel).

Model 2

This model generates non-persistent magnitude outliers, i.e., the outliers are magnitude outliers for only a portion of the domain of the functional data. The main model is of the form: $X_i(t) = \mu t + e_i(t),$ with contamination model of the form: $X_i(t) = \mu t + qk_iI_{T_i \le t\le T_i+l } + e_i(t)$ where:

• $$t\in [0,1]$$,
• $$e_i(t)$$ is a Gaussian process with zero mean and covariance function of the form: $\gamma(s,t) = \alpha\exp\{-\beta|t-s|^\nu\},$
• $$k_i \in \{-1, 1\}$$ with $$P(k_i = -1) = P(k_i=1) = 0.5$$,
• $$q$$ is a constant controlling how far the outliers are from the mass of the data,
• $$I$$ is an indicator function,
• $$T_i$$ is a uniform random variable on an interval $$[a, b] \subset [0,1]$$,
• and $$l$$ is a constant specifying for how much of the domain the outliers are away from the mean function.
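The role of the indicator term can be illustrated with a short base R sketch. It computes only the deterministic shift $$qk_iI_{T_i \le t\le T_i+l }$$ for one outlying curve; the values of q, l, a, and b are illustrative assumptions, and the Gaussian process noise is left out.

```r
set.seed(2)
p <- 50
tt <- seq(0, 1, length.out = p)
q <- 8; l <- 0.05; a <- 0.1; b <- 0.9   # illustrative values (assumptions)

T_i <- runif(1, min = a, max = b)       # start of the outlying stretch
k_i <- sample(c(-1, 1), size = 1)       # direction of the shift

# Indicator I_{T_i <= t <= T_i + l} evaluated on the grid
in_window <- tt >= T_i & tt <= T_i + l
shift <- q * k_i * in_window

range(tt[in_window])                    # part of the domain that is shifted
```

Outside the window the shift is zero, so the curve follows the main model everywhere except on a stretch of length l.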

A call to simulation_model2() generates data from this model:

dtss <- simulation_model2(n = 100, p = 50, outlier_rate = .1,
                          seed = 50, plot = FALSE)

Additional parameters of simulation_model2() to which arguments can be passed are:

• mu: the coefficient $$\mu$$ in the main and contamination models controlling the mean function.
• q: the shift parameter $$q$$ in the contamination model which controls how far the outliers are from the mean function.
• kprob: the probability that $$k_i = 1$$, i.e., $$P(k_i=1)$$ in the contamination model.
• a, b: values specifying the interval $$[a,b]$$ from which $$T_i$$ is drawn in the contamination model.
• l: the value of $$l$$ in the contamination model.
• cov_alpha: the coefficient $$\alpha$$ in the covariance function.
• cov_beta: the coefficient $$\beta$$ in the covariance function.
• cov_nu: the coefficient $$\nu$$ in the covariance function.

The additional plotting parameters listed for simulation_model1() also apply.

Model 3

This model generates outliers that are magnitude outliers from a random point $$T_i$$ to the end of the domain. The main model is of the form: $X_i(t) = \mu t + e_i(t),$ with contamination model of the form: $X_i(t) = \mu t + qk_iI_{T_i \le t } + e_i(t)$ where:

• $$t\in [0,1]$$,
• $$e_i(t)$$ is a Gaussian process with zero mean and covariance function of the form: $\gamma(s,t) = \alpha\exp\{-\beta|t-s|^\nu\},$
• $$k_i \in \{-1, 1\}$$ with $$P(k_i = -1) = P(k_i=1) = 0.5$$,
• $$q$$ is a constant controlling how far the outliers are from the mass of the data,
• $$I$$ is an indicator function,
• and $$T_i$$ is a uniform random variable on an interval $$[a, b] \subset [0,1]$$.

A call to simulation_model3() generates data from this model:

dtss <- simulation_model3(n = 100, p = 50, outlier_rate = .1,
                          seed = 50, plot = FALSE)

Additional parameters of simulation_model3() to which arguments can be passed are:

• mu: the coefficient $$\mu$$ in the main and contamination models controlling the mean function.
• q: the shift parameter $$q$$ in the contamination model which controls how far the outliers are from the mean function.
• kprob: the probability that $$k_i = 1$$, i.e., $$P(k_i=1)$$ in the contamination model.
• a, b: values specifying the interval $$[a,b]$$ from which $$T_i$$ is drawn in the contamination model.
• cov_alpha: the coefficient $$\alpha$$ in the covariance function.
• cov_beta: the coefficient $$\beta$$ in the covariance function.
• cov_nu: the coefficient $$\nu$$ in the covariance function.

The additional plotting parameters listed for simulation_model1() also apply.

Model 4

This model generates outliers defined on the reversed interval of the main model. The main model is of the form: $X_i(t) = \mu t(1 - t)^m + e_i(t),$ with contamination model of the form: $X_i(t) = \mu(1 - t)t^m + e_i(t)$ where:

• $$t\in [0,1]$$,
• $$e_i(t)$$ is a Gaussian process with zero mean and covariance function of the form: $\gamma(s,t) = \alpha\exp\{-\beta|t-s|^\nu\},$
• and $$m$$ is a constant.
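The "reversed interval" relationship between the two mean functions can be checked numerically. The base R sketch below evaluates both means on a symmetric grid; the values of mu and m are illustrative assumptions, and the noise term is omitted.

```r
mu <- 30; m <- 3 / 2                    # illustrative values (assumptions)
tt <- seq(0, 1, length.out = 51)

mean_main <- mu * tt * (1 - tt)^m       # mean of the main model
mean_out  <- mu * (1 - tt) * tt^m       # mean of the contamination model

# The outlier mean is the main mean read from right to left
all.equal(mean_out, rev(mean_main))

# Peaks sit at 1 / (1 + m) for the main mean and m / (1 + m) for the outlier mean
tt[which.max(mean_main)]
tt[which.max(mean_out)]
```

Because the two means share the same range of values and differ only in where they peak, the outliers stand out by shape and not by magnitude.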

A call to simulation_model4() generates data from this model:

dtss <- simulation_model4(n = 100, p = 50, outlier_rate = .1,
                          seed = 50, plot = FALSE)

Additional parameters of simulation_model4() to which arguments can be passed are:

• mu: the coefficient $$\mu$$ in the main and contamination models controlling the mean function.
• m: the constant $$m$$ in the main and contamination models.
• cov_alpha: the coefficient $$\alpha$$ in the covariance function.
• cov_beta: the coefficient $$\beta$$ in the covariance function.
• cov_nu: the coefficient $$\nu$$ in the covariance function.

The additional plotting parameters listed for simulation_model1() also apply.

Model 5

This model generates shape outliers with a different covariance structure from that of the main model. The main model is of the form: $X_i(t) = \mu t + e_i(t),$ with contamination model of the form: $X_i(t) = \mu t + \tilde{e}_i(t),$ where:

• $$t\in [0,1]$$,
• and $$e_i(t)$$ and $$\tilde{e}_i(t)$$ are Gaussian processes with zero mean and covariance functions of the form: $\gamma(s,t) = \alpha\exp\{-\beta|t-s|^\nu\},$ each with its own values of $$\alpha$$, $$\beta$$ and $$\nu$$.
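To see how a change in covariance parameters alone changes the shape of a curve, the following base R sketch draws one curve from each model using the same standard normal draws. The helper exp_cov and both sets of parameter values are assumptions chosen for illustration; they are not taken from the package.

```r
# Covariance function gamma(s, t) = alpha * exp(-beta * |t - s|^nu) on a grid
# (exp_cov is a hypothetical helper, not a package function)
exp_cov <- function(grid, alpha = 1, beta = 1, nu = 1) {
  alpha * exp(-beta * abs(outer(grid, grid, "-"))^nu)
}

set.seed(5)
p <- 50
mu <- 4                                 # illustrative value (assumption)
tt <- seq(0, 1, length.out = p)

# Two covariance structures (parameter values are illustrative assumptions)
S_main <- exp_cov(tt, alpha = 1, beta = 1, nu = 1)
S_out  <- exp_cov(tt, alpha = 5, beta = 2, nu = 0.5)

# One curve from each model, built from the same standard normal draws
z <- rnorm(p)
x_main <- mu * tt + drop(z %*% chol(S_main))
x_out  <- mu * tt + drop(z %*% chol(S_out))

# Rougher paths have larger squared increments between neighbouring grid points
c(main = mean(diff(x_main)^2), outlier = mean(diff(x_out)^2))
```

Both curves share the mean $$\mu t$$, so any visible difference between them comes from the covariance structure.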

A call to simulation_model5() generates data from this model:

dtss <- simulation_model5(n = 100, p = 50, outlier_rate = .1,
                          seed = 50, plot = FALSE)

Additional parameters of simulation_model5() to which arguments can be passed are:

• mu: the coefficient $$\mu$$ in the main and contamination models controlling the mean function.
• cov_alpha: the coefficient $$\alpha$$ in the covariance function of $$e_i(t)$$.
• cov_beta: the coefficient $$\beta$$ in the covariance function of $$e_i(t)$$.
• cov_nu: the coefficient $$\nu$$ in the covariance function of $$e_i(t)$$.
• cov_alpha2: the coefficient $$\alpha$$ in the covariance function of $$\tilde{e}_i(t)$$.
• cov_beta2: the coefficient $$\beta$$ in the covariance function of $$\tilde{e}_i(t)$$.
• cov_nu2: the coefficient $$\nu$$ in the covariance function of $$\tilde{e}_i(t)$$.

The additional plotting parameters listed for simulation_model1() also apply.

Model 6

This model generates shape outliers that have a different shape for a portion of the domain. The main model is of the form: $X_i(t) = \mu t + e_i(t),$ with contamination model of the form: $X_i(t) = \mu t + (-1)^u\cdot q + (-1)^{(1-u)}\left(\frac{1}{\sqrt{r\pi}}\right)\exp{(-z(t-v)^w)} + e_i(t)$ where:

• $$t\in [0,1]$$,
• $$e_i(t)$$ is a Gaussian process with zero mean and covariance function of the form: $\gamma(s,t) = \alpha\exp\{-\beta|t-s|^\nu\},$
• $$u$$ follows a Bernoulli distribution with probability $$P(u = 1) = 0.5$$,
• $$q$$, $$r$$, $$z$$ and $$w$$ are constants,
• and $$v$$ follows a uniform distribution on an interval $$[a, b]$$.
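The extra term in the contamination model combines a constant shift with a bump centred at $$v$$. A minimal base R sketch of that term is given below; the function name extra_term and the default values for q, r, z, and w are assumptions for illustration, and w is taken to be an even integer so that $$(t-v)^w$$ is defined for $$t < v$$.

```r
# Deterministic extra term of the contamination model for one curve
# (extra_term is a hypothetical helper; defaults are illustrative assumptions)
extra_term <- function(tt, u, v, q = 1.8, r = 0.02, z = 90, w = 2) {
  (-1)^u * q + (-1)^(1 - u) * (1 / sqrt(r * pi)) * exp(-z * (tt - v)^w)
}

set.seed(6)
tt <- seq(0, 1, length.out = 50)
u <- rbinom(1, size = 1, prob = 0.5)    # sign switch
v <- runif(1, min = 0.25, max = 0.75)   # centre of the bump

term <- extra_term(tt, u, v)
range(term)
```

With u = 1 the curve is shifted down by q and pushed up near v; with u = 0 the signs flip, so the two cases are mirror images of each other around the main mean.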

A call to simulation_model6() generates data from this model:

dtss <- simulation_model6(n = 100, p = 50, outlier_rate = .1,
                          seed = 50, plot = FALSE)

Additional parameters of simulation_model6() to which arguments can be passed are:

• mu: the coefficient $$\mu$$ in the main and contamination models controlling the mean function.
• q: the constant term $$q$$ in the contamination model.
• kprob: the probability $$P(u = 1)$$.
• a, b: values specifying the interval from which $$v$$ in the contamination model is drawn.
• pi_coeff: the constant $$r$$ in the contamination model.
• exp_pow: the constant $$w$$ in the contamination model.
• exp_coeff: the constant $$z$$ in the contamination model.
• cov_alpha: the coefficient $$\alpha$$ in the covariance function.
• cov_beta: the coefficient $$\beta$$ in the covariance function.
• cov_nu: the coefficient $$\nu$$ in the covariance function.

The additional plotting parameters listed for simulation_model1() also apply.

Model 7

This model generates pure shape outliers that are periodic. The main model is of the form: $X_i(t) = \mu t + e_i(t),$ with contamination model of the form: $X_i(t) = \mu t + k\sin(r\pi(t + \theta)) + e_i(t),$ where:

• $$t\in [0,1]$$,
• $$e_i(t)$$ is a Gaussian process with zero mean and covariance function of the form: $\gamma(s,t) = \alpha\exp\{-\beta|t-s|^\nu\},$
• $$\theta$$ is uniformly distributed on an interval $$[a, b]$$,
• and $$k$$ and $$r$$ are constants.
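The periodic term can be visualised from the mean functions alone. The base R sketch below compares the two means for one outlying curve; the values of mu, k, r, and the interval for $$\theta$$ are illustrative assumptions, and the noise term is omitted.

```r
set.seed(7)
tt <- seq(0, 1, length.out = 50)
mu <- 4; k <- 2; r <- 4                 # illustrative values (assumptions)
theta <- runif(1, min = 0.25, max = 0.75)

mean_main <- mu * tt
mean_out  <- mu * tt + k * sin(r * pi * (tt + theta))

# The oscillation never moves the outlier mean further than k from the main mean
max(abs(mean_out - mean_main))
```

Since the deviation is bounded by k, an outlying curve stays close to the bulk of the data in magnitude and differs mainly in its oscillating shape.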

A call to simulation_model7() generates data from this model:

dtss <- simulation_model7(n = 100, p = 50, outlier_rate = .1,
                          seed = 50, plot = FALSE)

Additional parameters of simulation_model7() to which arguments can be passed are:

• mu: the coefficient $$\mu$$ in the main and contamination models controlling the mean function.
• cov_alpha: the coefficient $$\alpha$$ in the covariance function of $$e_i(t)$$.
• cov_beta: the coefficient $$\beta$$ in the covariance function of $$e_i(t)$$.
• cov_nu: the coefficient $$\nu$$ in the covariance function of $$e_i(t)$$.
• sin_coeff: the coefficient $$k$$ in the contamination model.
• pi_coeff: the coefficient $$r$$ in the contamination model.
• a, b: values specifying the interval from which $$\theta$$ is to be drawn.

The additional plotting parameters listed for simulation_model1() also apply.

Model 8

This model generates pure shape outliers that are periodic. The main model is of the form: $X_i(t) = k\sin(r\pi t) + e_i(t),$ with contamination model of the form: $X_i(t) = k\sin(r\pi t + v) + e_i(t),$ where:

• $$t\in [0,1]$$,
• $$e_i(t)$$ is a Gaussian process with zero mean and covariance function of the form: $\gamma(s,t) = \alpha\exp\{-\beta|t-s|^\nu\},$
• and $$k$$, $$r$$ and $$v$$ are constants.

A call to simulation_model8() generates data from this model:

dtss <- simulation_model8(n = 100, p = 50, outlier_rate = .1,
                          seed = 50, plot = FALSE)

Additional parameters of simulation_model8() to which arguments can be passed are:

• cov_alpha: the coefficient $$\alpha$$ in the covariance function of $$e_i(t)$$.
• cov_beta: the coefficient $$\beta$$ in the covariance function of $$e_i(t)$$.
• cov_nu: the coefficient $$\nu$$ in the covariance function of $$e_i(t)$$.
• sin_coeff: the coefficient $$k$$ in the main and contamination models.
• pi_coeff: the coefficient $$r$$ in the main and contamination models.
• constant: the value of the constant $$v$$ in the contamination model.

The additional plotting parameters listed for simulation_model1() also apply.

Model 9

This model generates periodic functions with outliers of a different amplitude. The main model is of the form: $X_i(t) = a_{1i}\sin \pi + a_{2i}\cos\pi + e_i(t),$ with contamination model of the form: $X_i(t) = (b_{1i}\sin\pi + b_{2i}\cos\pi)(1-u_i) + (c_{1i}\sin\pi + c_{2i}\cos\pi)u_i + e_i(t),$ where:

• $$t\in [0,1]$$,
• $$\pi \in [0, 2\pi]$$,
• $$a_{1i}$$ and $$a_{2i}$$ follow a uniform distribution on an interval $$[a_1, a_2]$$,
• $$b_{1i}$$ and $$b_{2i}$$ follow a uniform distribution on an interval $$[b_1, b_2]$$,
• $$c_{1i}$$ and $$c_{2i}$$ follow a uniform distribution on an interval $$[c_1, c_2]$$,
• $$u_i$$ follows a Bernoulli distribution,
• and $$e_i(t)$$ is a Gaussian process with zero mean and covariance function of the form: $\gamma(s,t) = \alpha\exp\{-\beta|t-s|^\nu\}.$
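The amplitude difference between the main and contamination models can be sketched in base R. This sketch rests on a loud assumption: the argument of the sine and cosine is taken to sweep $$[0, 2\pi]$$ across the evaluation grid, which is one reading of the notation above and may not match the package exactly. The helper periodic_means and the interval values are likewise assumptions, and the noise term is omitted.

```r
set.seed(9)
n <- 100; p <- 50; n_out <- 10
arg <- seq(0, 2 * pi, length.out = p)   # ASSUMED argument of sin and cos

# Amplitude intervals (illustrative assumptions)
ai <- c(3, 8); bi <- c(1.5, 2.5); ci <- c(9, 10.5)

# n mean curves coef1 * sin(arg) + coef2 * cos(arg), one curve per row
# (periodic_means is a hypothetical helper, not a package function)
periodic_means <- function(n, interval) {
  coef1 <- runif(n, interval[1], interval[2])
  coef2 <- runif(n, interval[1], interval[2])
  outer(coef1, sin(arg)) + outer(coef2, cos(arg))
}

means_main <- periodic_means(n - n_out, ai)

# u_i picks the low-amplitude (b) or high-amplitude (c) version for each outlier
u <- rbinom(n_out, size = 1, prob = 0.5)
means_out <- (1 - u) * periodic_means(n_out, bi) + u * periodic_means(n_out, ci)

# Largest absolute value reached by each outlier mean
apply(abs(means_out), 1, max)
```

With these intervals, the amplitude $$\sqrt{a_{1i}^2 + a_{2i}^2}$$ of a main-model curve lies between $$3\sqrt{2}$$ and $$8\sqrt{2}$$, while the outliers fall below or above that range depending on $$u_i$$.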

A call to simulation_model9() generates data from this model:

dtss <- simulation_model9(n = 100, p = 50, outlier_rate = .1,
                          seed = 50, plot = FALSE)

Additional parameters of simulation_model9() to which arguments can be passed are:

• kprob: the probability $$P(u_i = 1)$$.
• ai: a vector of 2 values containing $$a_{1}$$ and $$a_{2}$$ indicating the interval from which $$a_{1i}$$ and $$a_{2i}$$ are drawn in the main model.
• bi: a vector of 2 values containing $$b_{1}$$ and $$b_{2}$$ indicating the interval from which $$b_{1i}$$ and $$b_{2i}$$ are drawn in the contamination model.
• ci: a vector of 2 values containing $$c_{1}$$ and $$c_{2}$$ indicating the interval from which $$c_{1i}$$ and $$c_{2i}$$ are drawn in the contamination model.
• cov_alpha: the coefficient $$\alpha$$ in the covariance function of $$e_i(t)$$.
• cov_beta: the coefficient $$\beta$$ in the covariance function of $$e_i(t)$$.
• cov_nu: the coefficient $$\nu$$ in the covariance function of $$e_i(t)$$.

The additional plotting parameters listed for simulation_model1() also apply.