clusterMI
is an R package to perform clustering with missing values. Missing values are addressed by multiple imputation. The package offers various multiple imputation methods dedicated to clustered individuals (Audigier, Niang, and Resche-Rigon (2021)). In addition, it allows pooling results both in terms of partition and instability (Audigier and Niang (2022)). Among applications, these functionalities can be used to choose a number of clusters with missing values.
The wine
data set (Asuncion and Newman (2007)) is the result of a chemical analysis of 177 Italian wine samples from three different cultivars. Each wine is described by 13 continuous variables. This data set will be used to illustrate the clusterMI
package. For achieving this goal, missing values will be added, and the cultivar variable will be omitted.
require(stargazer)
set.seed(123456)
data(wine)
stargazer(wine, type = "text")
#>
#> ============================================
#> Statistic N Mean St. Dev. Min Max
#> --------------------------------------------
#> cult 178 1.900 0.780 1 3
#> alco 178 13.000 0.810 11.000 15.000
#> malic 178 2.300 1.100 0.740 5.800
#> ash 178 2.400 0.270 1.400 3.200
#> alca 178 19.000 3.300 11.000 30.000
#> mg 178 100.000 14.000 70 162
#> phe 178 2.300 0.630 0.980 3.900
#> fla 178 2.000 1.000 0.340 5.100
#> nfla 178 0.360 0.120 0.130 0.660
#> pro 178 1.600 0.570 0.410 3.600
#> col 178 5.100 2.300 1.300 13.000
#> hue 178 0.960 0.230 0.480 1.700
#> ratio 178 2.600 0.710 1.300 4.000
#> prol 178 747.000 315.000 278 1,680
#> --------------------------------------------
Missing values are artificially added according to a missing completely at random mechanism so that each value of the data set is missing with a probability of 1/3 (independently to the values themselves).
ref <- wine$cult # "True" partition
nb.clust <- 3 # Number of clusters
wine.na <- wine
wine.na$cult <- NULL # Remove the reference partition
wine.na <- prodna(wine.na, pct = 1/3)
The clusterMI
package offers various multiple imputation methods dedicated to clustered individuals. They can be divided into two categories: joint modelling (JM) imputation and fully conditional specification (FCS). The first assumes a joint distribution for all variables and imputation is performed using this model. The second proceeds variable per variable in a sequential manner by regression. FCS methods are more time consuming, but also more flexible, allowing a better fit of the imputation model. Multiple imputation is performed with the imputedata
function. We start by presenting JM approaches in Section (2.1), while FCS approaches will be presented in Section (2.2).
The package proposes two JM methods: JM-GL and JM-DP. Both are based on a multivariate gaussian mixture model.
JM-GL
is implemented in the mix
package (Joseph L. Schafer. (2022)). Initially, this method is dedicated to the imputation of mixed data, but it can be used by considering the partition variable as fully incomplete categorical variable. The method assumes constant variance in each cluster (for continuous data).
JM-DP
is a joint modelling method implemented in the R packages DPImputeCont
(Kim (2020)), NPBayesImputeCat
(Wang et al. (2022)), MixedDataImpute
(Murray and Reiter (2015)) for continuous, categorical or mixed data respectively. Such a method has the advantage to automatically determine a number of clusters (but this number needs to by bounded by nb.clust
). Furthermore, it allows various covariance matrices according to the clusters (for continuous data).
JM-GL is the default imputation method used in imputedata
. To perform multiple imputation with this default method, we proceed as follows:
m <- 20 # Number of imputed data sets
res.imp.JM <- imputedata(data.na = wine.na,
nb.clust = nb.clust,
m = m)
and we specify method = "JM-DP"
for imputation using the other JM method:
Both imputation methods consist in a data-augmentation algorithm alternating data imputation and drawing from a posterior distribution. The m
imputed datasets are obtained by keeping one imputed dataset every L
iterations. The first Lstart
iterations consists in a burn-in period, which is required to reach convergence to the posterior distribution (expected from incomplete data). The L
iterations between successive draws guarantee an independance between imputed values.
Lstart
and L
can be checked by graphical investigations. For achieving this goal, we track the between inertia for each variable over successive iterations of the data-augmentation algorithm (available in the res.conv
output from the imputedata
function). In practice, we run imputation for a large number of imputed datasets m
by keeping all intermediate imputed datatsets, i.e. at each iteration (L = 1
) from the first (Lstart = 1
):
res.imp.JM.conv <- imputedata(data.na = wine.na,
method = "JM-GL",
nb.clust = nb.clust,
m = 800,
Lstart = 1, # number of iterations for the burn-in period
L = 1 # number of iterations between each draw
)
and then plot the successive between inertia values for the four first variables (as an example):
res.conv <- res.imp.JM.conv$res.conv
res.conv.ts <- ts(t(res.conv)) # conversion as time-series object
plot(res.conv.ts[, 1:4]) # diagnostic from the 4 first variables
Here, convergence seems to be reached at 400 iterations, meaning Lstart = 400
is a suitable choice.
Next, the number of iterations between successive draws can be checked by visualising the autocorrelograms. Here, an autocorrelogram represents the correlation between the vector of the successive between inertia and its shifted version for several lags. We seek to find a lag L
sufficiently large to avoid correlation between the vectors of between inertia. Such graphics can be obtained as follows:
Lstart <- 400
# extraction of summaries after Lstart iterations for the 4 first variables
res.conv.ts <- res.conv.ts[Lstart:nrow(res.conv.ts), 1:4]
apply(res.conv.ts, 2, acf)
Following such graphics L = 20
(the default value) seems sufficient. Thus, the imputation step can be rerun as follows:
Lstart <- 400
L <- 20
res.imp.JM <- imputedata(data.na = wine.na,
nb.clust = nb.clust,
Lstart = Lstart,
L = L,
m = m)
Note that the imputation also requires a pre-specified number of clusters (nb.clust
). Here it is tuned to 3 (corresponding to the number of varieties). We explain in Section 4.3 how it can be tuned. Furthermore, the number of imputed data sets is tuned to m = 20
which is generally enough (this choice will be discussed in Section 4.2).
Fully conditional specification methods consist in a variable per variable imputation. The two fully conditional imputation methods proposed are FCS-homo
and FCS-hetero
(Audigier, Niang, and Resche-Rigon (2021)). They essentially differ by the assumption about the covariance in each cluster (constant or not respectively).
To perform multiple imputation, we proceed as follows:
maxit <- 20 # Number of iterations for FCS imputation, should be larger in practice
res.imp.FCS <- imputedata(data.na = wine.na,
method = "FCS-homo",
nb.clust = nb.clust,
maxit = maxit,
m = m)
With FCS methods, the imputedata
function alternates cluster analysis and imputation given the partition of individuals. When the cluster analysis is performed, the imputedata
function calls the mice
function from the mice R package (van Buuren and Groothuis-Oudshoorn (2011)). The mice
package proposes various methods for imputation. By default, imputedata
uses the default method used in mice
(predictive mean matching for continuous data), but others can be specified by tuning the method.mice
argument. For instance, for imputation under the normal model, use
imputedata(data.na = wine.na,
method = "FCS-homo",
nb.clust = nb.clust,
maxit = maxit,
m = m,
method.mice = "norm")
FCS-hetero
allows imputation of continuous variables according to linear mixed models (various methods are available in the micemd
R package Audigier and Resche-Rigon (2023)). Furthermore, contrary to FCS-homo, FCS-hetero updates the partition without assuming constant variance in each cluster.
FCS imputation consists in imputing each variable sequentially several times. Many iterations can be required (maxit
argument). For checking convergence, the within and between inertia of each imputed variable can be plotted at each iteration, as proposed by the choosemaxit
function
Note that by default, only the five first imputed data sets are plotted (corresponding to the number of curves plotted for each variable). The plotm
argument can be tuned to modify which curves should be drawn.
In this case, the number of iterations could be potentially increased. For achieving this goal, the imputation should be rerun by increasing the maxit
argument as follows:
Fully conditional imputation methods are quickly limited when the number of variables is large since imputation models become overfit. To address this issue, we can use penalised regression as proposed in mice
by specifying method.mice = lasso.norm
for instance. Another way consists in specifying conditional imputation models by tuning the predictmat
argument. This argument is a binary matrix where each row indicates which explanatory variables (in column) should be used for imputation.
varselbest
procedureTo tune this matrix in an automatic way, the varselbest
function proposes to perform variable selection following Bar-Hen and Audigier (2022). Briefly, varselbest
performs variable selection on random subsets of variables and then, combines them to recover which explanatory variables are related to the response. More precisely, the outlines of the algorithm are as follows: let consider a random subset of sizeblock
among p variables. Then, any selection variable scheme can be applied (lasso, stepwise and knockoff are proposed by tuning the method.select
argument). By resampling (B
times) a sample of size sizeblock
among the p variables, we may count how many times a variable is considered as significantly related to the response and how many times it is not. We need to define a threshold (r
) to conclude if a given variable is significantly related to the response (by default, r
= 0.3). The main advantage of this function is that it handles both missing values and high-dimensional data.
By default, the varselbest
function performs variable selection by knockoff (Barber and Candès (2015)) based on B
= 200 bootstrap subsets. The threshold r
is tuned at 0.3 allowing to omit only variables very poorly predictive. The choices of B
and r
are discussed in next sections.
Since the method is time consuming, the function allows parallel computing by tuning the nnodes
argument. In the next example, the imputation model for the variable alco
is obtained using the algorithm previously described.
nnodes <- 2
# Number of CPU cores used for parallel computation.
# Use parallel::detectCores() to choose an appropriate number
# variable selection to impute the "alco" variable
B <- 50 # number of bootstrap subsets, should be increased in practice
res.varsel <- varselbest(res.imputedata = res.imp.FCS, B = B, listvar = "alco",
nnodes = nnodes, graph = FALSE)
res.varsel$predictormatrix["alco", ]
#> alco malic ash alca mg phe fla nfla pro col hue ratio prol
#> 0 0 1 1 1 0 1 0 1 1 1 1 1
The function suggests considering the variables ash, alca, mg, fla, pro, col, hue, ratio, prol to impute the alco
variable. Then, imputation can be rerun by specifying the predictmat
argument returned by the varselbest
function
# multiple imputation with the new model
res.imp.select <- imputedata(data.na = wine.na,
method = "FCS-homo",
nb.clust = nb.clust,
maxit = maxit,
m = m,
predictmat = res.varsel$predictormatrix)
Note that for specifying all conditional imputation models you should use
The number of iterations B
should be large so that the proportion of times a variable is selected becomes stable. The chooseB
function plots the proportion according to the number of iterations.
B=50
iterations seems not enough.
To tune the threshold r
, the vector of proportions can be graphically investigated.
# check the variable importance
round(res.varsel$proportion["alco",], 2)
#> alco malic ash alca mg phe fla nfla pro col hue ratio prol
#> 0.00 0.26 0.36 0.45 0.33 0.10 0.33 0.14 0.45 0.95 0.71 0.60 0.36
barplot(sort(res.varsel$proportion["alco",], decreasing=TRUE),
ylab = "proportion",
main = "alco",
ylim = c(0, 1),
las = 2,
cex.names = .5)
r <- 0.2 # a new threshold value (r = 0.3 by default)
abline(h = r, col = 2, lty = 2)
Then, the predictor matrix can be easily updated at hand as follows
predictormatrix <- res.varsel$predictormatrix
predictormatrix[res.varsel$proportion>r] <- 1
predictormatrix[res.varsel$proportion<=r] <- 0
predictormatrix["alco", ]
#> alco malic ash alca mg phe fla nfla pro col hue ratio prol
#> 0 1 1 1 1 0 1 0 1 1 1 1 1
By decreasing the threshold, more variables are used in the imputation model.
A more automatic way to tune r
consists in using the function chooser
. chooser
computes an optimal threshold using K-fold cross-validation. The call to chooser
is highly time consuming, but the optimal value of r
can be follows during the process through graphical outputs. By this way, the user can stop the process early. This can be achieved as follows:
After multiple imputation, cluster analysis and partition pooling can be done through the clusterMI
function (Audigier and Niang (2022)). Next, kmeans clustering is applied on the imputed data sets as an example.
kmeans is the clustering method used by default.
The clusterMI
function returns a consensus partition (part
object) as well as a instability measure (instability
object). The instability
object gathers the instability of each contributory partition (U
), their average (Ubar
), the between instability (B
) and the total instability (T
).
table(part, ref) #compare the partition with the reference partition
#> ref
#> part 1 2 3
#> 1 0 5 48
#> 2 2 61 0
#> 3 57 5 0
res.pool.kmeans$instability # look at instabilitiy measures
#> $U
#> [1] 0.0181 0.0095 0.0225 0.0182 0.0255 0.0191 0.0332 0.0271 0.0182 0.0194
#> [11] 0.0241 0.0170 0.0383 0.0180 0.0179 0.0230 0.0349 0.0310 0.0191 0.0280
#>
#> $Ubar
#> [1] 0.023
#>
#> $B
#> [1] 0.072
#>
#> $Tot
#> [1] 0.095
Among other clustering methods, clusterMI
allows cluster analysis by k-medoids (method.clustering = "pam"
), clustering large applications (method.clustering = "clara"
), hierarchical clustering (method.clustering = "hclust"
), fuzzy c-means (method.clustering = "cmeans"
), or model-based method (method.clustering = "mixture"
)
The user can also use custom clustering methods. For instance, to use reduced k-means, as implemented in the R package clustrd
, analysis and pooling can be achieved as follows:
library(clustrd)
res.ana.rkm <- lapply(res.imp.JM$res.imp,
FUN = cluspca,
nclus = nb.clust,
ndim = 2,
method= "RKM")
# extract the set of partitions (as list)
res.ana.rkm <- lapply(res.ana.rkm, "[[", "cluster")
# pooling by NMF
res.pool.rkm <- fastnmf(res.ana.rkm, nb.clust = nb.clust)$clust
Note that in this case, the instability is not computed.
To check if the imputation model correctly fit the data, a classical way is to perform overimputation (Blackwell, Honaker, and King (2015)), as proposed by the overimpute
function. Overimputation consists in imputing observed values several times (100 or more) and to compare the observed values with their imputed values.
Overimputation is a time-consuming process. To limit the time required for achieving overimputation the user can:
nnodes
argumentplotinds
argumentplotvars
argumentIn the next example, we use parallel computing and perform overimputation on the first variable (alco
) for 20 individuals (at random) only.
# Multiple imputation is rerun with more imputed data sets (m = 100)
res.imp.over <- imputedata(data.na = wine.na,
nb.clust = nb.clust,
m = 100,
Lstart = Lstart,
L = L,
verbose = FALSE)
# selection of 20 complete individuals on variable "alco"
plotinds <- sample(which(!is.na(wine.na[, "alco"])),
size = 20)
res.over <- overimpute(res.imp.over,
nnodes = nnodes,
plotvars = "alco",
plotinds = plotinds)
The graphic represents the observed values (x-axis) versus the 90% prediction interval (y-axis) for these values. Various colours are used according to the proportion of observed value used to build the interval (which depends on the missing data pattern on each individual). If the conditional model fits the data well, then 90% of intervals cutting the first bisector (indicating the observed values are gathered in the interval) are expected.
Here, the imputation model for the alco
variable does not fit the observed data very well since the coverage is 25%. The fit could be improved by investigating FCS imputation methods.
m
)The number of imputed data sets (m
) should be sufficiently large to improve the partition accuracy. The choosem
function can be used to check if this number is suitable. This function computes the consensus partition by considering only the first imputed data sets. By this way, a sequence of m
consensus partitions is obtained. Then, the rand index between successive partitions is computed and reported in a graph. The rand index measures proximity between partitions. If the rand index between the last consensus partitions of the sequence reaches its maximum values (1), then the number of imputed data sets does not modify the consensus partition and this number can be considered as sufficiently large.
Here, the rand index is equal to 1 at m = 20
imputed data sets, meaning the consensus partition remains unchanged between m = 19
and m = 20
, and so probably beyond. Consequently, m = 20
was a suitable choice.
In practice, the number of clusters is generally unknown. A way to tune this number consists in inspecting the instability according to the number of clusters (Fang and Wang (2012)). The more stable partition could be retained. The choosenbclust
function browses a grid of values for the number of clusters and for each one imputes the data and computes the instability.
On the wine data set, the number of clusters suggested is clearly 3.
After building a partition with missing values, a description of each cluster can be performed variable per variable as follows:
require(reshape2)
require(ggplot2)
dat.m = melt(data.frame(wine.na, part = as.factor(part)), id.var=c("part"))
ggplot(dat.m, aes(part, value, col = part)) +
facet_wrap(variable~., scales = "free_y") +
geom_boxplot(width = 0.7)
or by investigating the relationships for each pair of variables
or from a multivariate point of view by using the principal component analysis, as proposed in the R packages FactoMineR
(Lê, Josse, and Husson (2008)) and missMDA
(Josse and Husson (2016))
library(FactoMineR)
library(missMDA)
# merge the partition variable with the incomplete data set
data.pca <- cbind.data.frame(class = factor(part, levels = seq(nb.clust)),
wine.na)
# perform PCA with missing values by specifying where is the partition variable
res.imputepca <- imputePCA(data.pca, quali.sup = 1)
res.pca <- PCA(res.imputepca$completeObs, quali.sup = 1, graph = FALSE)
plot(res.pca, habillage = 1)
Finally, the consensus partition can be analysed by computing external clustering comparison indices as proposed in the clusterCrit
R package (Desgraupes (2023)) as follows:
library(clusterCrit)
res.crit <- extCriteria(part, ref, crit = "all")
round(unlist(res.crit), 2)
#> czekanowski_dice folkes_mallows hubert jaccard
#> 0.87 0.87 0.80 0.77
#> kulczynski mcnemar phi precision
#> 0.87 -2.73 0.00 0.86
#> rand recall rogers_tanimoto russel_rao
#> 0.91 0.88 0.84 0.29
#> sokal_sneath1 sokal_sneath2
#> 0.62 0.95