A central question in any imputation effort is whether the imputed values are any good.

Though several metrics for evaluating imputations exist, a common one is the mean absolute difference (MAD score) between the original and imputed distributions of a variable. This gives a very practical look at how close the imputations were to the original data.

MAD scores are computed at the variable level by taking the mean absolute difference between the original distribution of cases and the imputed version of those same cases. Practically, MAD shows aggregate error when imputing an individual variable. A lower percentage means the imputed distribution stays closer to the original, and a higher percentage means greater overall difference between the original and imputed versions of that variable.

For example, suppose you had a nominal variable with three possible
values: A, B, and C. Considering only complete cases, the distribution
across categories was A = 40%, B = 30%, and C = 30%. After imputing
this variable, you observed the distribution A = 43%, B = 28%, and
C = 29%. The mean absolute difference is
\((|40-43| + |30-28| + |30-29|) / 3 = 2\), or an average difference of
2 percentage points between the original and imputed versions of the
same variable. The logic scales up to high-dimensional data spaces with
an identical interpretation, which makes it an intuitive and helpful
evaluative metric for imputation tasks. *Note*: the bigger the data
space, the slower the computation.
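The arithmetic in this example can be checked with a few lines of base R. This is only a sketch of the calculation described above; the object names `original_pct`, `imputed_pct`, and `mad_value` are made up for illustration and are not part of any package.

```r
# Category percentages before imputation (complete cases only)
original_pct <- c(A = 40, B = 30, C = 30)

# Category percentages after imputation
imputed_pct <- c(A = 43, B = 28, C = 29)

# Mean of the absolute category-wise differences
mad_value <- mean(abs(original_pct - imputed_pct))
mad_value
#> [1] 2
```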

Let’s see this in action in the following section via the `mad()` function from the latest release of `hdImpute`.

First, load the library along with the `tidyverse` library for some additional helpers in setting up the sample data space.
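A minimal sketch of that setup step, assuming both packages are already installed. Note that base R ships its own `stats::mad()` (median absolute deviation), so attaching `hdImpute` should mask it; calling `hdImpute::mad()` explicitly avoids any ambiguity.

```r
# Assumes hdImpute and tidyverse are installed
library(hdImpute)   # hdImpute(), mad()
library(tidyverse)  # %>%, as_tibble(), ggplot()
```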

Next, set up the data and introduce missingness completely at random (MCAR) via the `prodNA()` function from the `missForest` package. Take a look at the synthetic data with missingness introduced.

```
d <- data.frame(X1 = c(1:6),
                X2 = c(rep("A", 3),
                       rep("B", 3)),
                X3 = c(3:8),
                X4 = c(5:10),
                X5 = c(rep("A", 3),
                       rep("B", 3)),
                X6 = c(6,3,9,4,4,6))

set.seed(1234)

data <- missForest::prodNA(d, noNA = 0.30) %>%
  as_tibble()

data
#> # A tibble: 6 × 6
#>      X1 X2       X3    X4 X5       X6
#>   <int> <chr> <int> <int> <chr> <dbl>
#> 1     1 <NA>      3     5 A         6
#> 2    NA A         4     6 A         3
#> 3     3 <NA>      5     7 A         9
#> 4    NA B        NA    NA <NA>      4
#> 5    NA B         7     9 B        NA
#> 6    NA B         8    10 B         6
```

*Note*: This is a tiny sample set, but hopefully the usage is
clear enough.
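To make the MCAR idea concrete, here is a base R sketch that blanks out a fixed share of cells chosen uniformly at random, regardless of their values. It is an illustration under stated assumptions, not the `missForest` implementation; the names `toy`, `mask`, and `with_na` are hypothetical.

```r
set.seed(1234)

# Small mixed-type data frame, complete to begin with
toy <- data.frame(X1 = 1:6,
                  X2 = rep(c("A", "B"), each = 3),
                  X3 = 3:8)

n_cells   <- nrow(toy) * ncol(toy)     # 18 cells in total
n_missing <- floor(0.30 * n_cells)     # 30% of the cells, rounded down

# Pick cells without replacement; every cell is equally likely (MCAR)
mask <- matrix(FALSE, nrow = nrow(toy), ncol = ncol(toy))
mask[sample(n_cells, n_missing)] <- TRUE

# Set the selected cells to NA
with_na <- toy
with_na[mask] <- NA

sum(is.na(with_na))
#> [1] 5
```

Because the cells are chosen without looking at the data, the missingness carries no information about the values themselves, which is what MCAR means.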

First, impute this simple data set via `hdImpute()`:

```
imputed <- hdImpute(data = data, batch = 2)
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: X1
#> Variables used to impute: X1
#> iter 1: .
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: X3, X2
#> Variables used to impute: X3, X2
#> iter 1: ..
#> iter 2: ..
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: X4, X5
#> Variables used to impute: X4, X5
#> iter 1: ..
#> iter 2: ..
#>
#> Missing value imputation by random forests
#>
#> Variables to impute: X6
#> Variables used to impute: X6
#> iter 1: .
```

Now, we have an imputed version of the original data space with no more missingness.

```
imputed
#> # A tibble: 6 × 6
#>      X1 X2       X3    X4 X5       X6
#>   <int> <chr> <int> <int> <chr> <dbl>
#> 1     1 B         3     5 A         6
#> 2     1 A         4     6 A         3
#> 3     3 B         5     7 A         9
#> 4     1 B         5     7 A         4
#> 5     1 B         7     9 B         9
#> 6     3 B         8    10 B         6
```

But how well does this capture the original distribution of the
data (pre-imputation)? Let’s find out by computing MAD scores for each
variable via `mad()`:

```
mad(original = data,
    imputed = imputed,
    round = 1)
#> # A tibble: 6 × 2
#>   var     mad
#>   <chr> <dbl>
#> 1 X1     16.7
#> 2 X2      8.3
#> 3 X3      5.3
#> 4 X4      5.3
#> 5 X5      6.7
#> 6 X6      6.7
```

We can see we did best on `X3` and `X4`, each with a MAD score of 5.3%, and worst on `X1`, with a score of 16.7%. Importantly, precisely what
defines “best” or “worst” MAD is entirely project-dependent. Users
should interpret results with care.
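To see where these numbers come from, the per-variable calculation can be sketched in base R. This follows the definition given earlier (percentage distribution of observed values versus the distribution after imputation) and is not the package's own code; `mad_sketch()`, `orig_df`, and `imp_df` are hypothetical names, and the two data frames are typed in by hand from the tibbles printed above.

```r
# Data with missingness and its imputed version, copied from the printed tibbles
orig_df <- data.frame(X1 = c(1L, NA, 3L, NA, NA, NA),
                      X2 = c(NA, "A", NA, "B", "B", "B"),
                      X3 = c(3L, 4L, 5L, NA, 7L, 8L),
                      X4 = c(5L, 6L, 7L, NA, 9L, 10L),
                      X5 = c("A", "A", "A", NA, "B", "B"),
                      X6 = c(6, 3, 9, 4, NA, 6))
imp_df <- data.frame(X1 = c(1L, 1L, 3L, 1L, 1L, 3L),
                     X2 = c("B", "A", "B", "B", "B", "B"),
                     X3 = c(3L, 4L, 5L, 5L, 7L, 8L),
                     X4 = c(5L, 6L, 7L, 7L, 9L, 10L),
                     X5 = c("A", "A", "A", "A", "B", "B"),
                     X6 = c(6, 3, 9, 4, 9, 6))

# Hypothetical helper: mean absolute difference in value percentages per variable
mad_sketch <- function(original, imputed, digits = 1) {
  scores <- vapply(names(original), function(v) {
    o <- as.character(original[[v]])
    i <- as.character(imputed[[v]])
    # Every distinct value seen in either version, ignoring NA
    lv <- union(o, i)
    lv <- lv[!is.na(lv)]
    # Percentage distributions; table() drops NA by default
    p_orig <- as.numeric(prop.table(table(factor(o, levels = lv)))) * 100
    p_imp  <- as.numeric(prop.table(table(factor(i, levels = lv)))) * 100
    mean(abs(p_orig - p_imp))
  }, numeric(1))
  data.frame(var = names(original), mad = round(unname(scores), digits))
}

sketch_scores <- mad_sketch(orig_df, imp_df)
sketch_scores
```

For `X1`, for instance, the observed values are split 50/50 between 1 and 3, while the imputed column is split roughly 66.7/33.3, giving (16.7 + 16.7) / 2 = 16.7.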

By default, the function returns a tibble. This can easily be stored in an object for later use:
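A sketch of that step: the commented line shows the assumed form of the assignment, reusing the arguments from the call above. So that the block runs on its own, it then builds a plain data frame stand-in holding the same scores printed earlier (a data frame rather than a tibble, which is enough for the plots that follow).

```r
# Assumed form of the assignment (needs `data` and `imputed` from above):
# mad_scores <- hdImpute::mad(original = data, imputed = imputed, round = 1)

# Stand-in with the scores printed above, for running the plotting code in isolation
mad_scores <- data.frame(var = paste0("X", 1:6),
                         mad = c(16.7, 8.3, 5.3, 5.3, 6.7, 6.7))

# Variable with the largest MAD score
mad_scores$var[which.max(mad_scores$mad)]
#> [1] "X1"
```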

Now, with our `mad_scores` as a tidy tibble, we can
continue working with it to, e.g., visualize the distribution of error
across this full data space with only a few lines of code
(*remember*: lower MAD is better, meaning smaller average
differences between the distribution of the imputed data and that of
the original data).

First, a histogram:

```
mad_scores %>%
  ggplot(aes(x = mad)) +
  geom_histogram(fill = "dark green") +
  labs(x = "MAD Scores (%)", y = "Count of Variables", title = "Distribution of MAD Scores") +
  theme_minimal() +
  theme(legend.position = "none")
```

Or a boxplot:
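One way to draw it, as a sketch that mirrors the styling of the histogram above. The scores are typed in from the `mad()` output so the block is self-contained; the object name `boxplot_mad` and the specific theme tweaks are illustrative choices, not necessarily the original plot.

```r
library(ggplot2)

# Scores copied from the mad() output shown earlier, so this block runs on its own
mad_scores <- data.frame(var = paste0("X", 1:6),
                         mad = c(16.7, 8.3, 5.3, 5.3, 6.7, 6.7))

# A single box summarizing the MAD scores across all variables
boxplot_mad <- ggplot(mad_scores, aes(y = mad)) +
  geom_boxplot(fill = "dark green") +
  labs(y = "MAD Scores (%)", title = "Distribution of MAD Scores") +
  theme_minimal() +
  theme(axis.text.x = element_blank())  # the x axis carries no information here

boxplot_mad
```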

This software is being actively developed, with many more features to come. Wide engagement and collaboration are welcome! Here’s a sampling of how to contribute:

- Submit an issue reporting a bug, requesting a feature enhancement, etc.
- Suggest changes directly via a pull request
- Reach out directly with ideas if you’re uneasy with public interaction

Thanks for using the tool. I hope it’s useful.