The package `discretefit`

implements Monte Carlo simulations for goodness-of-fit (GOF) tests for discrete distributions. This includes tests based on the root-mean-square statistic, the Chi-squared statistic, the log-likelihood-ratio (\(G^2\)) statistic, the Freeman-Tukey (Hellinger-distance) statistic, the Kolmogorov-Smirnov statistic, and the Cramer-von Mises statistic.

Simulations are written in C++ (utilizing `Rcpp`

) and are considerably faster than the simulated Chi-squared GOF test in the R `stats`

package.

The GOF tests in `discretefit`

function on a vector of counts, x, and a vector of probabilities, p. In the below example, x represents a vector of counts for five categories, and p represents a vector of probabilities for each corresponding category. The GOF tests provides p-values for the null hypothesis that x is a random sample of the discrete distribution defined by p.

```
library(discretefit)
library(bench)
<- c(42, 0, 13, 2, 109)
x <- c(0.2, 0.05, 0.1, 0.05, 0.6)
p
chisq_gof(x, p)
#>
#> Simulated Chi-squared goodness-of-fit test
#>
#> data: x
#> Chi-squared = 17.082, p-value = 0.0017
rms_gof(x, p)
#>
#> Simulated root-mean-square goodness-of-fit test
#>
#> data: x
#> RMS = 1.731, p-value = 0.0397
g_gof(x, p)
#>
#> Simulated log-likelihood-ratio goodness-of-fit test
#>
#> data: x
#> G2 = 27.362, p-value = 9.999e-05
ft_gof(x, p)
#>
#> Simulated Freeman-Tukey goodness-of-fit test
#>
#> data: x
#> FT = 45.599, p-value = 9.999e-05
ks_gof(x, p)
#>
#> Simulated Kolmogorov-Smirnov goodness-of-fit test
#>
#> data: x
#> KS = 0.056627, p-value = 0.2385
cvm_gof(x, p)
#>
#> Simulated Cramer-von Mises goodness-of-fit test
#>
#> data: x
#> W2 = 0.12578, p-value = 0.1833
```

In a surprising number of cases, a simulated GOF test based on the root-mean-square statistic outperforms the Chi-squared test and other tests in the Cressie-Read power divergence family. This has been demonstrated by Perkins, Tygert, and Ward (2011). They provide the following toy example.

Take a discrete distribution with 50 bins (or categories). The probability for the first and second bin is 0.25. The probability for each of the remaining 48 bins is 0.5 / 48 (~0.0104).

Now take the observed counts of 15 for the first bin, 5 for the second bin, and zero for each of the remaining 48 bins. It’s obvious that these observations are very unlikely to occur for a random sample from the above distribution. However, the Chi-squared, \(G^2\), and Freeman-Tukey tests fail to reject the null hypothesis.

```
<- c(15, 5, rep(0, 48))
x <- c(0.25, 0.25, rep(1/(2 * 50 -4), 48))
p
chisq_gof(x, p)
#>
#> Simulated Chi-squared goodness-of-fit test
#>
#> data: x
#> Chi-squared = 30, p-value = 0.9707
g_gof(x, p)
#>
#> Simulated log-likelihood-ratio goodness-of-fit test
#>
#> data: x
#> G2 = 32.958, p-value = 0.6641
ft_gof(x, p)
#>
#> Simulated Freeman-Tukey goodness-of-fit test
#>
#> data: x
#> FT = 50.718, p-value = 0.1334
```

By contrast, the root-mean-square test convincingly rejects the null hypothesis.

```
rms_gof(x, p)
#>
#> Simulated root-mean-square goodness-of-fit test
#>
#> data: x
#> RMS = 5.1042, p-value = 9.999e-05
```

For additional examples, also see Perkins, Tygert, and Ward (2011) and Ward and Carroll (2014).

The simulated Chi-squared GOF test in `discretefit`

produces identical p-values to the simulated Chi-squared GOF test in the `stats`

package that is part of base R.

```
set.seed(499)
chisq_gof(x, p, reps = 2000)$p.value
#> [1] 0.9685157
set.seed(499)
chisq.test(x, p = p, simulate.p.value = TRUE)$p.value
#> [1] 0.9685157
```

However, because Monte Carlo simulations in `discretefit`

are implemented in C++, `chisq_gof`

is much faster than `chisq.test`

, especially when a large number of simulations are required.

```
::system_time(
benchchisq_gof(x, p, reps = 20000)
)#> process real
#> 203ms 201ms
::system_time(
benchchisq.test(x, p = p, simulate.p.value = TRUE, B = 20000)
)#> process real
#> 1.61s 1.61s
```

Additionally, the simulated GOF tests in base R is vectorized, so for large vectors attempting a large number of simulations may not be possible because of memory constraints. Since the functions in `discretefit`

are not vectorized, memory use is minimized.

The `cvm_gof`

function utilizes the formula for the Cramer-von Mises statistic for discrete distributions originally introduced by Choulakian, Lockhart, and Stephens (1994). Let N represent the number of observations, let k represent the number of bins, and let \(p_{i}\) represent the probability of an observation falling in the \(i^{th}\) bin. Furthermore, let \(S_{i}\) correspond to the empirical cumulative distribution function for the observed data and \(T_{i}\) correspond to the cumulative distribution function for the theoretical distribution. Then where \(Z_{i}\) = \(S_{i}\) - \(T_{i}\), the Cramer-von Mises statistic, represented by \(W^2\), is defined as follows:

\[ W^2 = N^{-1} \sum_{i = 1}^{k} Z_{i}^2p_{i} \] For an alternative formula see Lockhart, Spinelli, and Stephens (2007).

All p-values calculated by the functions in `discretefit`

follow the formula for simulated p-values proposed by Dwass (1957). For the below equation, let m represent the number of simulations and let B represent the number of simulations where the test statistic for the simulated data is greater than or equal to the test statistic for the observed data.

\[{p}_{u} = P(B <= b) = \frac{b + 1}{m + 1}\]

This is the equation used to calculate simulated p-values in the `chisq.test`

function in the `stats`

package but some implementations of simulated p-values, for example the `dgof`

package, use the unbiased estimator, \(\frac{B}{m}\). For an explanation of why the biased estimator yields a test of the correct size and the unbiased estimator does not, see Phipson and Smyth (2011).

Several other packages implement GOF tests for discrete distributions.

As noted above, the `stats`

package in base R implements a simulated Chi-squared GOF test.

I’m not aware of an R package that implements a simulated \(G^2\) GOF test but the packages `RVAideMemoire`

and `DescTools`

implement GOF tests that utilize approximations based on the Chi-squared distribution.

The `dgof`

package (Anderson and Emerson 2011) implements simulated Kolmogorov-Smirnov GOF tests and simulated Cramer-von Mises GOF tests . The `cvmdisc`

package also implements a simulated Cramer-von Mises GOF test.

The `KSgeneral`

package (Dimitrova, Kaishev, and Tan, 2020) implements exact Kolmogorov-Smirnov GOF tests and fast, simulated GOF tests utilizing the algorithm introduced by Wood and Altavela (1978) which depends on asymptotic properties.

I’m not aware of another R package that implements a root-mean-square GOF test.

Arnold, Taylor B. and John W. Emerson. “Nonparametric goodness-of-fit tests for discrete null distributions.” R Journal. https://doi.org/10.32614/rj-2011-016

Choulakian, V., Richard. A. Lockhart, and Michael A. Stephens. “Cramer–von Mises statistics for discrete distributions.” Canadian Journal of Statistics, 1994. https://doi.org/10.2307/3315828

Dimitrova, Dimitrina S., Vladimir K. Kaishev, and Senren Tan. “Computing the Kolmogorov-Smirnov distribution when the underlying CDF is purely discrete, mixed, or continuous.” Journal of Statistical Software, 2020. https://doi.org/10.18637/jss.v095.i10

Dwass, Meyer. “Modified randomization tests for nonparametric hypotheses.” Annuls of Mathematical Statistics, 1957. https://doi.org/10.1214/aoms/1177707045

Eddelbuettel, Dirk and Romain Francois. “Rcpp: Seamless R and C++ Integration.” Journal of Statistical Software, 2011. https://www.jstatsoft.org/article/view/v040i08

Lockhart, Richard A., John J. Spinelli and Michael A. Stephens. “Cramer–von Mises statistics for discrete distributions with unknown parameters,” Canadian Journal of Statistics, 2007. https://doi.org/10.1002/cjs.5550350111

Perkins, William, Mark Tygert, and Rachel Ward. “Computing the confidence levels for a root-mean-square test of goodness-of-fit.” Applied Mathematics and Computation, 2011. https://doi.org/10.1016/j.amc.2011.03.124

Phipson, Belinda, and Gordon K. Smyth. “Permutation p-values should never be zero: calculating exact p-values when permutations are randomly drawn.” Statistical Applications in Genetics and Molecular Biology, 2010. https://doi.org/10.2202/1544-6115.1585

Ward, Rachel and Raymond J. Carroll. “Testing Hardy–Weinberg equilibrium with a simple root-mean-square statistic.” Biostatistics, 2014. https://doi.org/10.1093/biostatistics/kxt028

Wood, Constance L., and Michele M. Altavela. “Large-Sample Results for Kolmogorov–Smirnov Statistics for Discrete Distributions.” Biometrika, 1978. https://doi.org/10.1093/biomet/65.1.235