# Introducing discretefit

#### October 20, 2021

The package discretefit implements Monte Carlo simulations for goodness-of-fit (GOF) tests for discrete distributions. This includes tests based on the root-mean-square statistic, the Chi-squared statistic, the log-likelihood-ratio ($$G^2$$) statistic, the Freeman-Tukey (Hellinger-distance) statistic, the Kolmogorov-Smirnov statistic, and the Cramer-von Mises statistic.

Simulations are written in C++ (utilizing Rcpp) and are considerably faster than the simulated Chi-squared GOF test in the R stats package.

## Usage

The GOF tests in discretefit function on a vector of counts, x, and a vector of probabilities, p. In the below example, x represents a vector of counts for five categories, and p represents a vector of probabilities for each corresponding category. The GOF tests provides p-values for the null hypothesis that x is a random sample of the discrete distribution defined by p.

library(discretefit)
library(bench)

x <- c(42, 0, 13, 2, 109)
p <- c(0.2, 0.05, 0.1, 0.05, 0.6)

chisq_gof(x, p)
#>
#>  Simulated Chi-squared goodness-of-fit test
#>
#> data:  x
#> Chi-squared = 17.082, p-value = 0.0017
rms_gof(x, p)
#>
#>  Simulated root-mean-square goodness-of-fit test
#>
#> data:  x
#> RMS = 1.731, p-value = 0.0397
g_gof(x, p)
#>
#>  Simulated log-likelihood-ratio goodness-of-fit test
#>
#> data:  x
#> G2 = 27.362, p-value = 9.999e-05
ft_gof(x, p)
#>
#>  Simulated Freeman-Tukey goodness-of-fit test
#>
#> data:  x
#> FT = 45.599, p-value = 9.999e-05
ks_gof(x, p)
#>
#>  Simulated Kolmogorov-Smirnov goodness-of-fit test
#>
#> data:  x
#> KS = 0.056627, p-value = 0.2385
cvm_gof(x, p)
#>
#>  Simulated Cramer-von Mises goodness-of-fit test
#>
#> data:  x
#> W2 = 0.12578, p-value = 0.1833

## Root-mean-square statistic

In a surprising number of cases, a simulated GOF test based on the root-mean-square statistic outperforms the Chi-squared test and other tests in the Cressie-Read power divergence family. This has been demonstrated by Perkins, Tygert, and Ward (2011). They provide the following toy example.

Take a discrete distribution with 50 bins (or categories). The probability for the first and second bin is 0.25. The probability for each of the remaining 48 bins is 0.5 / 48 (~0.0104).

Now take the observed counts of 15 for the first bin, 5 for the second bin, and zero for each of the remaining 48 bins. It’s obvious that these observations are very unlikely to occur for a random sample from the above distribution. However, the Chi-squared, $$G^2$$, and Freeman-Tukey tests fail to reject the null hypothesis.

x <- c(15, 5, rep(0, 48))
p <- c(0.25, 0.25, rep(1/(2 * 50 -4), 48))

chisq_gof(x, p)
#>
#>  Simulated Chi-squared goodness-of-fit test
#>
#> data:  x
#> Chi-squared = 30, p-value = 0.9707
g_gof(x, p)
#>
#>  Simulated log-likelihood-ratio goodness-of-fit test
#>
#> data:  x
#> G2 = 32.958, p-value = 0.6641
ft_gof(x, p)
#>
#>  Simulated Freeman-Tukey goodness-of-fit test
#>
#> data:  x
#> FT = 50.718, p-value = 0.1334

By contrast, the root-mean-square test convincingly rejects the null hypothesis.

rms_gof(x, p)
#>
#>  Simulated root-mean-square goodness-of-fit test
#>
#> data:  x
#> RMS = 5.1042, p-value = 9.999e-05

For additional examples, also see Perkins, Tygert, and Ward (2011) and Ward and Carroll (2014).

## Speed

The simulated Chi-squared GOF test in discretefit produces identical p-values to the simulated Chi-squared GOF test in the stats package that is part of base R.

set.seed(499)
chisq_gof(x, p, reps = 2000)$p.value #> [1] 0.9685157 set.seed(499) chisq.test(x, p = p, simulate.p.value = TRUE)$p.value
#> [1] 0.9685157

However, because Monte Carlo simulations in discretefit are implemented in C++, chisq_gof is much faster than chisq.test, especially when a large number of simulations are required.

bench::system_time(
chisq_gof(x, p, reps = 20000)
)
#> process    real
#>   203ms   201ms

bench::system_time(
chisq.test(x, p = p, simulate.p.value = TRUE, B = 20000)
)
#> process    real
#>   1.61s   1.61s

Additionally, the simulated GOF tests in base R is vectorized, so for large vectors attempting a large number of simulations may not be possible because of memory constraints. Since the functions in discretefit are not vectorized, memory use is minimized.

## Computation

The cvm_gof function utilizes the formula for the Cramer-von Mises statistic for discrete distributions originally introduced by Choulakian, Lockhart, and Stephens (1994). Let N represent the number of observations, let k represent the number of bins, and let $$p_{i}$$ represent the probability of an observation falling in the $$i^{th}$$ bin. Furthermore, let $$S_{i}$$ correspond to the empirical cumulative distribution function for the observed data and $$T_{i}$$ correspond to the cumulative distribution function for the theoretical distribution. Then where $$Z_{i}$$ = $$S_{i}$$ - $$T_{i}$$, the Cramer-von Mises statistic, represented by $$W^2$$, is defined as follows:

$W^2 = N^{-1} \sum_{i = 1}^{k} Z_{i}^2p_{i}$ For an alternative formula see Lockhart, Spinelli, and Stephens (2007).

All p-values calculated by the functions in discretefit follow the formula for simulated p-values proposed by Dwass (1957). For the below equation, let m represent the number of simulations and let B represent the number of simulations where the test statistic for the simulated data is greater than or equal to the test statistic for the observed data.

${p}_{u} = P(B <= b) = \frac{b + 1}{m + 1}$

This is the equation used to calculate simulated p-values in the chisq.test function in the stats package but some implementations of simulated p-values, for example the dgof package, use the unbiased estimator, $$\frac{B}{m}$$. For an explanation of why the biased estimator yields a test of the correct size and the unbiased estimator does not, see Phipson and Smyth (2011).

## Alternatives

Several other packages implement GOF tests for discrete distributions.

As noted above, the stats package in base R implements a simulated Chi-squared GOF test.

I’m not aware of an R package that implements a simulated $$G^2$$ GOF test but the packages RVAideMemoire and DescTools implement GOF tests that utilize approximations based on the Chi-squared distribution.

The dgof package (Anderson and Emerson 2011) implements simulated Kolmogorov-Smirnov GOF tests and simulated Cramer-von Mises GOF tests . The cvmdisc package also implements a simulated Cramer-von Mises GOF test.

The KSgeneral package (Dimitrova, Kaishev, and Tan, 2020) implements exact Kolmogorov-Smirnov GOF tests and fast, simulated GOF tests utilizing the algorithm introduced by Wood and Altavela (1978) which depends on asymptotic properties.

I’m not aware of another R package that implements a root-mean-square GOF test.

## References

Arnold, Taylor B. and John W. Emerson. “Nonparametric goodness-of-fit tests for discrete null distributions.” R Journal. https://doi.org/10.32614/rj-2011-016

Choulakian, V., Richard. A. Lockhart, and Michael A. Stephens. “Cramer–von Mises statistics for discrete distributions.” Canadian Journal of Statistics, 1994. https://doi.org/10.2307/3315828

Dimitrova, Dimitrina S., Vladimir K. Kaishev, and Senren Tan. “Computing the Kolmogorov-Smirnov distribution when the underlying CDF is purely discrete, mixed, or continuous.” Journal of Statistical Software, 2020. https://doi.org/10.18637/jss.v095.i10

Dwass, Meyer. “Modified randomization tests for nonparametric hypotheses.” Annuls of Mathematical Statistics, 1957. https://doi.org/10.1214/aoms/1177707045

Eddelbuettel, Dirk and Romain Francois. “Rcpp: Seamless R and C++ Integration.” Journal of Statistical Software, 2011. https://www.jstatsoft.org/article/view/v040i08

Lockhart, Richard A., John J. Spinelli and Michael A. Stephens. “Cramer–von Mises statistics for discrete distributions with unknown parameters,” Canadian Journal of Statistics, 2007. https://doi.org/10.1002/cjs.5550350111

Perkins, William, Mark Tygert, and Rachel Ward. “Computing the confidence levels for a root-mean-square test of goodness-of-fit.” Applied Mathematics and Computation, 2011. https://doi.org/10.1016/j.amc.2011.03.124

Phipson, Belinda, and Gordon K. Smyth. “Permutation p-values should never be zero: calculating exact p-values when permutations are randomly drawn.” Statistical Applications in Genetics and Molecular Biology, 2010. https://doi.org/10.2202/1544-6115.1585

Ward, Rachel and Raymond J. Carroll. “Testing Hardy–Weinberg equilibrium with a simple root-mean-square statistic.” Biostatistics, 2014. https://doi.org/10.1093/biostatistics/kxt028

Wood, Constance L., and Michele M. Altavela. “Large-Sample Results for Kolmogorov–Smirnov Statistics for Discrete Distributions.” Biometrika, 1978. https://doi.org/10.1093/biomet/65.1.235