## Introduction

Association measures can be local or global (Van de Cruys 2011). Local association measures quantify the association between specific values of random variables. In the case of a contingency table, they yield one value for each cell. An example is chi-squared residuals that are computed when constructing a chi-squared test. On the other hand, global association measures yield a single value used to summarize the association for all values taken by random variables. An example is the chi-squared statistic, the sum of squared residuals (Sheskin 2007).

Most often, we are only concerned with the global association and overlook local association. For example, analysis of chi-squared residuals is uncommon practice when compared to the chi-square independence test. Nonetheless, a significant global association can hide a non-significant local association, and a non-significant global association can hide a significant local association (Anselin 1995). Accordingly, analysis of association should not limit itself to the global perspective. Indeed, the association between two variables can depend on their values. For example, in threshold mechanisms, variables are only associated with each other when one takes values above a certain critical level. In this case, local association measures allow pinpointing values for which variables are associated. Moreover, the existence of an association between two variables may depend on the value of a third variable. For example, the effect of a drug will depend on the patient’s sensibility to the drug. The local association between drug intake and recovery will not be the same for patients that are sensitive than for those that are resistant to the drug.

The rest of the vignette is organized as follows. We first give the reader the necessary intuition and mathematical background about global and local associations. This leads to the description of chi-square residuals, Lewontin’s $$D$$ (Lewontin 1964), Ducher’s $$Z$$ (Ducher et al. 1994), and pointwise mutual information (Van de Cruys 2011). We also introduce a multivariate and normalized measure of local association. Subsequently, we illustrate the usage of local association measures using the zebu R package. The vignette ends with a discussion about future development and research.

## Background on Association and Independence

Throughout the vignette, we will suppose that all random variables are discrete and write them in capital letters, such as $$A$$ and $$B$$. Lower letters, such as $$a$$ and $$b$$, will denote possible values taken by these random variables (i.e. events).

One way to think about a statistical association is as events co-occurring. For example, if event $$a$$ always occurs with event $$b$$, then these events are said to be associated. An intuitive measure of association could be the joint probability: $$p(a, b)$$, the long-term frequency of events showing up together. However, this measure fails if $$a$$ or $$b$$ is a rare event. Indeed, joint probabilities are always as small as its individual events are rare: $$p(a, b) \leq \min p(a), p(b)$$. As a consequence, it is necessary to compare observed probabilities $$p(a, b)$$ to expected probabilities in which the variables are considered independent. The expected probability, if events are independent, is the factor of marginalized probabilities of events: $$p(a) \, p(b)$$. Independence is then defined by the following mathematical relation, $$p(a, b) = p(a) \, p(b)$$, and local association measures are defined to be equal to zero.

Independence implies that knowing one or more variables does not give us any information about the others. This is what we are not interested in. It is, however, possible to define two cases where the former equality does not hold: co-occurrence and mutual exclusivity. Co-occurrence is defined as events showing up more often than expected: $$p(a, b) > p(a) \, p(b)$$ and local association measures are positive. Mutual exclusivity is defined as events showing up less often than expected: $$p(a, b) < p(a) \, p(b)$$ and local association measures are negative.

Statistical independence is, however, not the only manner to construct an association measure. Other possibilities are based on the proportion of explained variance such as Pearson’s r. These former measures are parametric and suppose linear or at least monotone relationships between variables. Although intuitive and convenient, this assumption is not always justified. Measures based on statistical independence provide a non-parametric alternative that can detect non-linear relationships.

## Local Association Measures

#### Derivation of Bivariate Forms

For two random variables, $$A$$ and $$B$$, we can estimate the local association for each combination of events $$A = a$$ and $$B = b$$. This is accomplished by comparing the observed from the expected probability of events $$a$$ and $$b$$. If these probabilities are equal, then events $$a$$ and $$b$$ are independent. If not, these events are associated; the sign of the measure indicates the orientation of the relationship, and the absolute value indicates its strength.

There are different measures to compare observed and expected probabilities, for example, by using subtraction and division. Hereunder, we define the difference or Lewontin’s $$D$$ (Lewontin 1964) and the pointwise mutual information $$pmi$$ (Van de Cruys 2011). To simplify notation, and to show similarities between local association measures, we define $$h(a) = - \log p(a)$$ as the self-information of $$a$$.

\begin{aligned} D(a, b) & = p(a, b) - p(a) \, p(b) \\ pmi(a, b) & = \log \frac{p(a, b)} {p(a) p(b)} = - (h(a, b) - h(a) - h(b)) \end{aligned}

The bounds of these two measures depend on the frequency of events, which makes it difficult to compare local association values for different combinations of events. For this reason, it is desirable to express association relative to the frequency of events. One common way to do this is using chi-squared residuals $$r_{\chi}$$ as follows where $$N$$ is the sample size.

$r_{\chi}(a,b) = \sqrt{N} \; \frac{p(a, b) - p(a) \, p(b)}{\sqrt{p(a) \, p(b)}}$

We may wish to normalize local association so that values are between -1 and 1 included. This can be done by using dividing the non-normalized values by their minimal or maximal values. To identify the theoretical minimal and maximal values of $$D$$, we will find the bounds of the observed bivariate probability $$p(a, b)$$ as a function of the marginal probabilities $$p(a)$$ and $$p(b)$$.

Using the inclusion-exclusion principle, we know that:

$p(a, b) = p(a \cap b) = p(a) + p(b) - p(a \cup b)$

The intersection probability $$p(a, b)$$ will be maximized when the union probability $$p(a \cup b)$$ is equal to zero and minimized when the union probability will be equal to one.

$p(a) + p(b) - 1 \le p(a, b) \le p(a) + p(b)$

Given the intersection probability can not be smaller than zero and can not be larger than the smallest marginal probability, we have:

$\max[0, \, p(a) + p(b) - 1] \le p(a, b) \le min[p(a), \, p(b)]$

Using this result, we can divide $$D$$ by its theoretical minimal or maximal value which leads to Lewontin’s $$D'$$ (Lewontin 1964). However, this removes the sign of the association. To preserve the sign, in the case where Lewontin’s $$D$$ is negative, we divide it by the negative theoretical minimal value. We call this measure Ducher’s $$Z$$. It should however be noted that the original definition of Ducher’s $$Z$$ did not probably consider the lower bound of the intersection probability as we do here (Ducher et al. 1994).

$Z(a, b) = \begin{cases} \frac{ p(a, b) - p(a) \, p(b) }{ \min[p(a), \, p(b)] - p(a) \, p(b) } & D(a, b) > 0 \\ \\ \frac{ p(a, b) - p(a) \, p(b) }{p(a) p(b) - \max[0, \, p(a) + p(b) - 1]} & D(a, b) < 0 \\ \\ 0 & D(a, b) = 0 \end{cases}$

Normalization of case of $$pmi$$ is more subtle because $$pmi(a, b)$$ tends to $$\infty$$ when $$p(a, b)$$ tends to 0. Nonetheless, dividing $$pmi(a, b)$$ by $$- h(a, b)$$ solves this problem by making $$npmi(a, b)$$ tend to -1 when $$p(a, b)$$ tends to 0 and equal to 1 when $$p(a, b) = \min[p(a), p(b)]$$ (Bouma 2009).

$npmi(a, b) = \frac{ pmi(a, b) }{- h(a, b) } = \frac{ h(a, b) - h(a) - h(b) }{ h(a, b) }$

The zebu package includes a function called lassie allowing estimation of Lewontin’s $$D$$, Ducher’s $$Z$$, $$pmi$$, $$npmi$$, and $$r_{\chi}$$.

#### Global Association

Global association measures yield a single value used to summarize the association for all values taken by the random variables. For example, mutual information is computed as the sum for all events of their observed probability times their pointwise mutual information. Most global association measures in zebu are defined likewise.

\begin{aligned} GD(A, B) &= \sum_{a, b} p(a, b) D(a, b) \\ GZ(A, B) &= \sum_{a, b} p(a, b) Z(a, b) \\ MI(A, B) &= \sum_{a, b} p(a, b) pmi(a, b) \\ NMI(A, B) &= \sum_{a, b} p(a, b) npmi(a, b) \\ \end{aligned}

The global association measure related to chi-squared residuals is the chi-squared $$\chi^2$$. It is defined as the sum of its squared residuals.

$\chi^2 = \sum_{a, b} r_{\chi}(a,b)^2$

#### Permutation Test

Distinguishing the strength of association from its statistical significance is important. Indeed, a strong association can be non-significant (e.g. some physical law with small sample size) and a weak association can be significant (e.g. epidemiological risk factor with big sample size). Significance can be accessed using p-values estimated using the theoretical null distribution or by resampling techniques (Sheskin 2007). Because the theoretical null distribution of local association measures is unknown, the zebu package resorts to estimating p-values by a permutation test. This can be undertaken using the permtest function of the package.

The null hypothesis $$H_0$$ being tested is that the association measure $$L$$ is equal to 0, that is, there is no association. The observed association is $$L_{obs}$$ and the permuted associations are denoted by the set $$L_{perm}$$. Moreover, we write $$\#(\ldots)$$ as the number of times and $$|\ldots|$$ as the absolute value. The two-sided p-value can then be estimated as follows.

$p = \frac{\#(|L_{obs}| < |L_{perm}|)}{\#(L_{perm})}$

For local association measures, this results in conducting a series of statistical tests. It is thus advised to apply multiple testing corrections, such as the one advocated by Benjamini-Hochberg.

#### Derivation of Multivariate Forms

Multivariate association measures may help identify complex association relationships that cannot be detected only with bivariate association measures. For example, in the XOR gate, the output of the gate is not associated with any of the two inputs individually (Jakulin and Bratko 2003). The association is only revealed when the two inputs and the output are taken together.

To derive multivariate forms of these local association measures, we assume that events are mutually independent. This means that for $$M$$ random variables $$X_1, \ldots, X_M$$, independence is defined by: $$p(x_1, \ldots, x_M) = \prod_{i=1}^{M} p(x_i)$$. We can thus define the following measures.

\begin{aligned} D(x_1, \ldots, x_M) & = p(x_1, \ldots, x_M) - \prod_{i=1}^{M} p(x_i) \\ pmi(x_1, \ldots, x_M) & = - [h(x_1, \ldots, x_M) - \sum_{i=1}^{M} h(x_i) ] \end{aligned}

By dividing $$D(x_1, \ldots, x_M)$$ by the expected probability we obtain multivariate chi-squared residuals

$r_{\chi}(x_1, \ldots, x_M) = \sqrt{N} \; \frac{ p(x_1, \ldots, x_M) - \prod_{i=1}^{M} p(x_i) }{ \sqrt{\prod_{i=1}^{M} p(x_i)} }$

To obtain a multivariate measure of Ducher’s $$Z$$, we need to find the bounds of the observed probability $$p(x_1, \ldots, x_M)$$. We know that the upper bound will be the minimal marginal probability. Additionally, we find a formula to express the lower bound (proof at the end of the section). This leads to the following bounds.

$\max[0, -M - 1 + \sum_{i=1}^M p(x_i)] \le p(x_1, \ldots, x_M) \le \min[x_1, \ldots, x_M].$

We thus propose the following multivariate form of Ducher’s $$Z$$.

$Z(x_1, \ldots, x_M) = \begin{cases} \frac{ p(x_1, \ldots, x_M) - \prod_{i=1}^{M} p(x_i) }{ \min[p(x_1), \ldots, p(x_M)] - \prod_{i=1}^{M} p(x_i) } & D(x_1, \ldots, x_M) > 0 \\ \\ \frac{ p(x_1, \ldots, x_M) - \prod_{i=1}^{M} p(x_i) }{\prod_{i=1}^{M} p(x_i)- \max[0, - M - 1 + \sum_{i=1}^M p(x_i)]} & D(x_1, \ldots, x_M) < 0 \\ \\ 0 & D(x_1, \ldots, x_M) = 0 \end{cases}$

For pointwise mutual information, the normalization technique suggested by Bouma (2009) is not bounded by 1 for more than two variables. To solve this, we suggest the following normalization scheme which we call $$npmi_2$$.

$npmi_2(x_1, \ldots, x_M) = \begin{cases} \frac{ h(x_1, \ldots, x_M) - \sum_{i=1}^{M} h(x_i) }{ \min[h(x_1), \ldots, h(x_M)] - \sum_{i=1}^{M} h(x_i) } & pmi(x_1, \ldots, x_M) > 0 \\ \\ \frac{ h(x_1, \ldots, x_M) - \sum_{i=1}^{M} h(x_i) }{h(x_1, \ldots, x_M)} & pmi(x_1, \ldots, x_M) < 0 \\ \\ 0 & pmi(x_1, \ldots, x_M) = 0 \end{cases}$

#### Proof of lower bound formula

Using induction and the inclusion-exclusion principle we give a formula for the lower bound of the observed intersection probability of $$M$$ events.

$\min[ p(x_1, \ldots, x_M) ] = \max[0, -M - 1 + \sum_{i=1}^M p(x_i)]$

We first show that this is true for the base case $$M=2$$. In this case, $$p(x_1) + p(x_2) -1 \le p(x_1, x_2)$$. We proved this using the inclusion-exclusion principle in the section where we derive a bivariate form of Ducher’s Z.

We now show that the induction step is true: let’s assume that this formula is true for $$M$$ variables. For $$M+1$$ variables, the inclusion-exclusion principle tells us that:

\begin{align} p(x_1, \ldots, x_{M+1}) &= p(\{\cap_{i=1}^M x_i \} \cap x_{M+1}) \\ &= p(\cap_{i=1}^M x_i) + p(x_{M+1}) - p(\{\cap_{i=1}^M x_M \} \cup x_{M+1}) \\ \end{align}

We assumed that the lower bound $$p(\cap_{i=1}^M x_i)$$ is $$-M - 1 + \sum_{i=1}^M p(x_i)$$ and we know that the upper bound of $$p(\{\cap_{i=1}^M x_M \} \cup x_{M+1})$$ is one. Replacing these values in the last line leads to

$\min[ p(x_1, \ldots, x_{M+1}) ] = -(M+1) - 1 + \sum_{i=1}^{M+1} p(x_i)$

However, given that probabilities can not be lower than zero, we can write the following equation which completes the proof.

$\min[ p(x_1, \ldots, x_{M+1}) ] = \max[0, -(M+1) - 1 + \sum_{i=1}^{M+1} p(x_i)]$

## User’s Guide - An Example with Simulated Data

Once R is installed, the first step is to install the zebu package. You can install the released version from CRAN

install.packages("zebu")

We can then load the zebu R package.

library(zebu) 

### Simulating the Dataset

To illustrate local association measures and the zebu package, we’ll simulate a small culinary dataset.

Each row corresponds to a client of a restaurant. We record the choices made by the client. There are three choices for each of the plates: starters, main dish, and dessert.

The clients take the starter with equal probability. The choice of the following dish depends only on the previous dish. The clients tend to avoid a dish with an ingredient that had in the previous dish. We define these probabilities hereunder.

starter_prob <- c("Tomato Mozzarella Salad" = 1/3, "Rice Tuna Salad" = 1/3, "Lentil Salad" = 1/3)
starter_prob
Tomato Mozzarella Salad         Rice Tuna Salad            Lentil Salad
0.3333333               0.3333333               0.3333333 
main_given_starter_prob <- matrix(c(5/11, 1/11, 5/11,
5/11, 5/11, 1/10,
1/11, 5/11, 5/11),
3, 3, byrow = TRUE)
rownames(main_given_starter_prob) <- names(starter_prob)
colnames(main_given_starter_prob) <- c("Sausage and Lentil Stew", "Pizza Margherita", "Pilaf Rice")
main_given_starter_prob
                        Sausage and Lentil Stew Pizza Margherita Pilaf Rice
Tomato Mozzarella Salad              0.45454545       0.09090909  0.4545455
Rice Tuna Salad                      0.45454545       0.45454545  0.1000000
Lentil Salad                         0.09090909       0.45454545  0.4545455
dessert_given_main <- matrix(c(2/6, 2/6, 2/6,
7/12, 1/12, 2/6,
1/12, 7/12, 2/6),
3, 3, byrow = TRUE)
rownames(dessert_given_main) <- colnames(main_given_starter_prob)
colnames(dessert_given_main) <- c("Rice Pudding", "Apple Pie", "Fruit Salad")
dessert_given_main
                        Rice Pudding  Apple Pie Fruit Salad
Sausage and Lentil Stew   0.33333333 0.33333333   0.3333333
Pizza Margherita          0.58333333 0.08333333   0.3333333
Pilaf Rice                0.08333333 0.58333333   0.3333333

We now simulate a dataset of 1000 clients given these probabilities.

set.seed(0)
sample_size <- 1000
df <- t(sapply(seq_len(sample_size), function(i) {

starter <- sample(names(starter_prob), size = 1, prob = starter_prob)
main <- sample(colnames(main_given_starter_prob), size = 1, prob = main_given_starter_prob[starter, ])
dessert <- sample(colnames(dessert_given_main), size = 1, prob = dessert_given_main[main, ])

c(Starter = starter, Main = main, Dessert = dessert)
}))
df <- as.data.frame(df)
head(df)
                  Starter                    Main      Dessert
2            Lentil Salad              Pilaf Rice    Apple Pie
4            Lentil Salad        Pizza Margherita Rice Pudding
6 Tomato Mozzarella Salad              Pilaf Rice  Fruit Salad

We look at the contingency table.

table(df)
, , Dessert = Apple Pie

Main
Starter                   Pilaf Rice Pizza Margherita Sausage and Lentil Stew
Rice Tuna Salad                 14               14                      44
Tomato Mozzarella Salad         81                2                      47

, , Dessert = Fruit Salad

Main
Starter                   Pilaf Rice Pizza Margherita Sausage and Lentil Stew
Rice Tuna Salad                  7               49                      47
Tomato Mozzarella Salad         39               11                      58

, , Dessert = Rice Pudding

Main
Starter                   Pilaf Rice Pizza Margherita Sausage and Lentil Stew
Rice Tuna Salad                  3              101                      52
Tomato Mozzarella Salad         13               12                      59

### Bivariate Association

The local (and global) association between these variables can be estimated using the lassie function. This function takes at least one argument: a data.frame. Columns are selected using the select arguments (column names or numbers). Variables are assumed to be categorical; continuous variables have to be specified using the continuous argument and the number of discretization bins with the breaks argument (as in the cut function). The local association measure that we use here is Ducher’s Z as specified by setting the measure argument equal to "z". We’ll look at the relationship between the main dish and the dessert.

las <- lassie(df, select = c("Main", "Dessert"), measure = "z")

The permtest function accesses the significance of local (and global) association using a permutation test. The number of iterations is specified by nb and the adjustment method of p-values for multiple comparisons by p_adjust (as in the p.adjust function).

las <- permtest(las,
nb = 5000,
p_adjust = "BH")

The lassie and permtest functions return a lassie S3 object, as well as permtest for permtest. lassie objects can be visualized using the plot and print methods. The plot function returns a heatmap with the local association and p-values displayed between parenthesis.

print(las)
Measure: Ducher's Z
Global: 0.0912667026105455 (p-value: <1/5000)
Dessert
Main                        Apple Pie  Fruit Salad Rice Pudding
Pilaf Rice               0.38531235  0.006639046 -0.749858716
Pizza Margherita        -0.69399394 -0.062255796  0.367744192
Sausage and Lentil Stew -0.04383642  0.027310138 -0.008436162
plot(las) Results can be saved in CSV format using write.lassie. To access the documentation of these functions, please type help("print.lassie"), help("plot.lassie") and help(write.lassie) in the R console.

In this example, we can see that the global association between the two variables is statistically significant. However, we see that there is only a local association between these two variables only for certain combinations. More specifically, knowing that a client took the lentils and sausages does not convey information about what dessert s/he will take, and likewise for the fruit salad. A significant global association can hide a non-significant local association.

### Multivariate Association

The number of variables that can be handled in the zebu package is not limited. We can use the lassie function to access the association between the three simulated variables.

In this case, we obtain a multidimensional local association array. Because of this, results cannot be plotted as a tile plot; the plot method is not available. The print method allows visualizing results by melting the array into a data.frame, by default sorted by decreasing local association.

las2 <- lassie(df, measure = "z")
las2 <- permtest(las2, nb = 5000)
print(las2, what_sort = "local_p", decreasing = FALSE)
Measure: Ducher's Z
Global: -0.00796166078557106 (p-value: 1)
Starter                    Main      Dessert       local
4             Lentil Salad        Pizza Margherita    Apple Pie -0.54570753
5          Rice Tuna Salad        Pizza Margherita    Apple Pie -0.60779228
6  Tomato Mozzarella Salad        Pizza Margherita    Apple Pie -0.94240428
7             Lentil Salad Sausage and Lentil Stew    Apple Pie -0.65556067
19            Lentil Salad              Pilaf Rice Rice Pudding -0.70631293
20         Rice Tuna Salad              Pilaf Rice Rice Pudding -0.91603179
21 Tomato Mozzarella Salad              Pilaf Rice Rice Pudding -0.62596772
24 Tomato Mozzarella Salad        Pizza Margherita Rice Pudding -0.70420369
25            Lentil Salad Sausage and Lentil Stew Rice Pudding -0.65981754
2          Rice Tuna Salad              Pilaf Rice    Apple Pie -0.54220571
23         Rice Tuna Salad        Pizza Margherita Rice Pudding  0.20497105
22            Lentil Salad        Pizza Margherita Rice Pudding  0.16908965
1             Lentil Salad              Pilaf Rice    Apple Pie  0.16835345
3  Tomato Mozzarella Salad              Pilaf Rice    Apple Pie  0.19034384
27 Tomato Mozzarella Salad Sausage and Lentil Stew Rice Pudding  0.06433270
8          Rice Tuna Salad Sausage and Lentil Stew    Apple Pie  0.02929618
9  Tomato Mozzarella Salad Sausage and Lentil Stew    Apple Pie  0.04371432
26         Rice Tuna Salad Sausage and Lentil Stew Rice Pudding  0.03439883
obs        exp              local_p
4  0.017 0.03742083              <1/5000
5  0.014 0.03569537              <1/5000
6  0.002 0.03472480              <1/5000
7  0.013 0.03774250              <1/5000
11 0.007 0.03265977              <1/5000
15 0.011 0.03708474              <1/5000
16 0.017 0.04030752              <1/5000
19 0.011 0.03745483              <1/5000
20 0.003 0.03572781              <1/5000
21 0.013 0.03475636              <1/5000
24 0.012 0.04056846              <1/5000
25 0.015 0.04409398              <1/5000
2  0.014 0.03058142 0.000415384615384615
23 0.101 0.04170236    0.123042857142857
22 0.095 0.04371818               0.1854
1  0.077 0.03205968    0.210282352941176
3  0.081 0.02974990    0.210282352941176
10 0.054 0.03423849    0.436547368421053
18 0.058 0.03740352    0.436547368421053
27 0.059 0.04091718              0.45009
8  0.044 0.03600221               0.4788
9  0.047 0.03502330               0.4788
12 0.039 0.03177174               0.4788
13 0.048 0.03996399               0.4788
14 0.049 0.03812127               0.4788
17 0.047 0.03844896               0.4788
26 0.052 0.04206083               0.4788

In this case, we see that there is no significant global association. However, we see that for certain combinations of variables, there is a significant local association. A non-significant global association can hide a significant local association.

## Future Research and Development

Local association measures are issued from empirical research. Although these have proven their interest in diverse applications, theoretical studies of their mathematical properties are sparse. A more theoretical approach to these measures could be of interest. For example, by determining the theoretical null distribution of these measures. Also, we have assumed mutual exclusivity of events for the multivariate association measures. This assumption may be too stringent for certain variables and usage of other independence models such as conditional independence may prove to be worthwhile.

In zebu, discretization is a necessary step for studying continuous variables. We have restrained ourselves to simple discretization methods: equal-width and user-defined. Other discretization algorithms exist (Dash, Paramguru, and Dash 2011) and may be more adapted for the computation of association measures. Moreover, kernel methods could also be used to handle continuous variables better. Secondly, estimation of probabilities is done from the frequentist maximum-likelihood procedure which requires sufficiently large datasets. Unfortunately, in fields such as health sciences, datasets are sparse. Bayesian estimation methods are more robust to small sample sizes by not relying on asymptomatic assumptions and by allowing integration of prior knowledge (Wilkinson 2007). Such an implementation may also prove to be of interest.