Custom functions with bestNormalize

This vignette will go over the steps required to implement a custom user-defined function within the bestNormalize framework.

There are 3 steps.

1. Create transformation function

2. Create predict method for transformation function (that can be applied to new data)

3. Pass through new function and predict method to bestNormalize

S3 methods

Here, we start by defining a new function that we’ll call cuberoot_x, which will take an argument a (as does the sqrt_x function) which will try to add a constant if it sees any negative numbers in x. It will also take the argument standardize which will center and scale the transformed data so that it’s centered at 0 with SD = 1.

## Define user-function
cuberoot_x <- function(x, a = NULL, standardize = TRUE, ...) {
stopifnot(is.numeric(x))

min_a <- max(0, -(min(x, na.rm = TRUE)))
if(!length(a))
a <- min_a
if(a < min_a) {
warning("Setting a <  max(0, -(min(x))) can lead to transformation issues",
"Standardize set to FALSE")
standardize <- FALSE
}

x.t <- (x + a)^(1/3)
mu <- mean(x.t, na.rm = TRUE)
sigma <- sd(x.t, na.rm = TRUE)
if (standardize) x.t <- (x.t - mu) / sigma

# Get in-sample normality statistic results
ptest <- nortest::pearson.test(x.t)

val <- list(
x.t = x.t,
x = x,
mean = mu,
sd = sigma,
a = a,
n = length(x.t) - sum(is.na(x)),
norm_stat = unname(ptest$statistic / ptest$df),
standardize = standardize
)

# Assign class
class(val) <- c('cuberoot_x', class(val))
val
}

Note that we assigned a class to the object this returns of the same name; this is necessary for successful implementation within bestNormalize. We’ll also need an associated predict method that is used to apply the transformation to newly observed data. `

predict.cuberoot_x <- function(object, newdata = NULL, inverse = FALSE, ...) {

# If no data supplied and not inverse
if (is.null(newdata) & !inverse)
newdata <- object$x # If no data supplied and inverse if (is.null(newdata) & inverse) newdata <- object$x.t

# Actually performing transformations

# Perform inverse transformation as estimated
if (inverse) {

# Reverse-standardize
if (object$standardize) newdata <- newdata * object$sd + object$mean # Reverse-cube-root (cube) newdata <- newdata^3 - object$a

# Otherwise, perform transformation as estimated
} else if (!inverse) {
# Take cube root
newdata <- (newdata + object$a)^(1/3) # Standardize to mean 0, sd 1 if (object$standardize)
newdata <- (newdata - object$mean) / object$sd
}

# Return transformed data
unname(newdata)
}

Optional: print method

This will be printed when bestNormalize selects your custom method or when you print an object returned by your new custom function.

print.cuberoot_x <- function(x, ...) {
cat(ifelse(x$standardize, "Standardized", "Non-Standardized"), 'cuberoot(x + a) Transformation with', x$n, 'nonmissing obs.:\n',
'Relevant statistics:\n',
'- a =', x$a, '\n', '- mean (before standardization) =', x$mean, '\n',
'- sd (before standardization) =', x$sd, '\n') } Note: if you can find a similar transformation in the source code, it’s easy to model your code after it. For instance, for cuberoot_x and predict.cuberoot_x, I used sqrt_x.R as a template file. Implementing with bestNormalize # Store custom functions into list custom_transform <- list( cuberoot_x = cuberoot_x, predict.cuberoot_x = predict.cuberoot_x, print.cuberoot_x = print.cuberoot_x ) set.seed(123129) x <- rgamma(100, 1, 1) (b <- bestNormalize(x = x, new_transforms = custom_transform, standardize = FALSE)) ## Best Normalizing transformation with 100 Observations ## Estimated Normality Statistics (Pearson P / df, lower => more normal): ## - arcsinh(x): 1.2347 ## - Box-Cox: 1.0267 ## - cuberoot_x: 0.9787 ## - Exp(x): 4.7947 ## - Log_b(x+a): 1.3547 ## - No transform: 2.0027 ## - orderNorm (ORQ): 1.1627 ## - sqrt(x + a): 1.0907 ## - Yeo-Johnson: 1.0987 ## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats ## ## Based off these, bestNormalize chose: ## Non-Standardized cuberoot(x + a) Transformation with 100 nonmissing obs.: ## Relevant statistics: ## - a = 0 ## - mean (before standardization) = 0.9588261 ## - sd (before standardization) = 0.3298665 Evidently, the cube-rooting was the best normalizing transformation! Sanity check Is this code actually performing the cube-rooting? all.equal(x^(1/3), b$chosen_transform$x.t) ##  TRUE all.equal(x^(1/3), predict(b)) ##  TRUE It does indeed. Using custom normalization statistics The bestNormalize package can estimate any univariate statistic using its CV framework. A user-defined function can be passed in through the norm_stat_fn argument, and this function will then be applied in lieu of the Pearson test statistic divided by its degree of freedom. The user-defined function must take an argument x, which indicates the data on which a user wants to evaluate the statistic. Here is an example using Lilliefors (Kolmogorov-Smirnov) normality test statistic: bestNormalize(x, norm_stat_fn = function(x) nortest::lillie.test(x)$stat)
## Best Normalizing transformation with 100 Observations
##  Estimated Normality Statistics (using custom normalization statistic)
##  - arcsinh(x): 0.1958
##  - Box-Cox: 0.1785
##  - Center+scale: 0.2219
##  - Exp(x): 0.3299
##  - Log_b(x+a): 0.1959
##  - orderNorm (ORQ): 0.186
##  - sqrt(x + a): 0.1829
##  - Yeo-Johnson: 0.1872
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##
## Based off these, bestNormalize chose:
## Standardized Box Cox Transformation with 100 nonmissing obs.:
##  Estimated statistics:
##  - lambda = 0.3281193
##  - mean (before standardization) = -0.1263882
##  - sd (before standardization) = 0.9913552

Here is an example using Lillifors (Kolmogorov-Smirnov) normality test’s p-value:

(dont_do_this <- bestNormalize(x, norm_stat_fn = function(x) nortest::lillie.test(x)$p)) ## Best Normalizing transformation with 100 Observations ## Estimated Normality Statistics (using custom normalization statistic) ## - arcsinh(x): 0.4327 ## - Box-Cox: 0.4831 ## - Center+scale: 0.2958 ## - Exp(x): 0.0675 ## - Log_b(x+a): 0.3589 ## - orderNorm (ORQ): 0.4492 ## - sqrt(x + a): 0.4899 ## - Yeo-Johnson: 0.4531 ## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats ## ## Based off these, bestNormalize chose: ## Standardized exp(x) Transformation with 100 nonmissing obs.: ## Relevant statistics: ## - mean (before standardization) = 6.885396 ## - sd (before standardization) = 13.66084 Note: bestNormalize will attempt to minimize this statistic by default, which is definitely not what you want to do when calculating the p-value. This is seen in the example above, as the WORST normalization transformation is chosen. In this case, a user is advised to either manually select the best one: best_transform <- names(which.max(dont_do_this$norm_stats))
(do_this <- dont_do_this$other_transforms[[best_transform]]) ## Standardized sqrt(x + a) Transformation with 100 nonmissing obs.: ## Relevant statistics: ## - a = 0 ## - mean (before standardization) = 0.9811849 ## - sd (before standardization) = 0.4779252 Or, the user can reverse their defined statistic (in this case by subtracting it from 1): (do_this <- bestNormalize(x, norm_stat_fn = function(x) 1-nortest::lillie.test(x)$p))
## Best Normalizing transformation with 100 Observations
##  Estimated Normality Statistics (using custom normalization statistic)
##  - arcsinh(x): 0.5166
##  - Box-Cox: 0.4191
##  - Center+scale: 0.6521
##  - Exp(x): 0.9601
##  - Log_b(x+a): 0.5338
##  - orderNorm (ORQ): 0.4646
##  - sqrt(x + a): 0.4475
##  - Yeo-Johnson: 0.4773
## Estimation method: Out-of-sample via CV with 10 folds and 5 repeats
##
## Based off these, bestNormalize chose:
## Standardized Box Cox Transformation with 100 nonmissing obs.:
##  Estimated statistics:
##  - lambda = 0.3281193
##  - mean (before standardization) = -0.1263882
##  - sd (before standardization) = 0.9913552