# Inspecting the errorlocate Mixed Integer Program

library(errorlocate)
#> Loading required package: validate

## Intro

Errorlocate uses the linear, categorical and conditional rules from a rules set formulated with R package validate, to create a Mixed Integer Problem.

For most users the details of the translation are not relevant and hidden in locate_errors. Often the number of errors found and the processing time are much more relevant parameters.

In a few cases, you may run into a problems with your error localization problem:

1. The processing time of (some of the records) of locate_errors is high.
2. locate_errors missed an obvious error.
3. locate_errors indicates that it did not find a valid solution (for some records) .

Problem a. can be addressed by using the parallel argument of locate_errors (and replace_errors). Problem b can be due to that error_locate ignores non-linear rules, and therefore is not able to deduce the errors, because it only takes linear, categorical and conditional rules into a account.

There may also be problems with your rules set. Problems set may be mitigated by using the validatetools package that can detect conflicting and redundant rules and has methods to simplify your rule set.

If you want to dive deep into the mixed integer problem that is created by error_locate you can use the inspect_mip function.

## A bit of theory

In the following sections an example is given of how linear, categorical and conditional rules are written as Mixed Integer Problems. First let’s see how these rules in validator can be formally defined.

## Formal description

### Rule $$r_i(\mathbf{x})$$

Each translatable rule $$r_i(\mathbf{x})$$ can be written as a disjunction of atomic clauses $$C_i^j(x)$$: it is a function $$r_i$$ that operates on (some of) the values of record $$\mathbf{x} = (x_1, \ldots, x_n)$$ and is TRUE (valid) or FALSE (not valid)

$r_i(\mathbf{x}) = \bigvee_j C_i^j(\mathbf{x})$

with each atomic clause:

$C_i^j(\mathbf{x}) = \left\{ \begin{array}{l} \mathbf{a}^T\mathbf{x} \leq b \\ \mathbf{a}^T\mathbf{x} = b \\ x_j \in F_{ij} \textrm{with } F_{ij} \subseteq D_j \\ x_j \not\in F_{ij} \textrm{with } F_{ij} \subseteq D_j \\ \end{array} \right.$

Each linear, categorical or conditional rule $$r_i$$ can be written in this form.

### Example 1

rules <- validator(example_1 = if (income > 0) age >= 16)
rules$exprs() #>$example_1
#> !(income > 0) | (age >= 16)
#> attr(,"reference")
#> example_1
#>         1

So the rule if (income > 0) age >= 16 can be written as (income <= 0 OR age >=16)

### Example 2

rules <- validator(example_2 = if (has_house == "yes") income >= 1000)
rules$exprs() #>$example_2
#> !(has_house == "yes") | (income >= 1000)
#> attr(,"reference")
#> example_2
#>         1

So the rule if (has_house == "yes") income >= 1000) can be written as (has_house != "yes" OR age >=1000)

## Rule system:

The rules form a system $$R(\mathbf{x})$$:

$R(\mathbf{x}) = \bigwedge_i r_i$ which means that all rules $$r_i$$ must be valid. If $$R(\mathbf{x})$$ is true for record $$\mathbf{x}$$, then the record is valid, otherwise one (or more) of the rules is violated.

## Mixed Integer Programming to FH

Each rule set $$R(\mathbf{x})$$ can be translated into a mip problem and solved.

$\begin{array}{r} \textrm{Minimize } f(\mathbf{x}) = 0; \\ \textrm{s.t. }\mathbf{Rx} \leq \mathbf{d} \\ \end{array}$ - $$f(\mathbf{x})$$ is the (weighted) number of changed variable: $$\delta_i \in {0,1}$$

$f(\mathbf{x}) = \sum_{i=1}^N w_i \delta_i$

• $$\mathbf{R}$$ contains rules:

• $$\mathbf{R}_H(\mathbf{x}) \leq \mathbf{d}_H$$ that were specified with validate/validator

• $$\mathbf{R}_0(\mathbf{x}, \mathbf{\delta}) \leq \mathbf{d}_0$$ : soft constraints that try fix the current record of $$\mathbf{x}$$ to the observed values.

## Enter inspect_mip:

### Linear rules

Most users will use the function locate_errors to find errors. The function inspect_mip works exactly same, except that it operates on just one record in stead of a whole data.frame. The result of inspect_mip is a mip object, that is not yet executed and can be inspected.

rules <- validator( r1 = age >= 18
, r2 = income >= 0
)
data <- data.frame(age = c(12, 35), income = c(2000, -1000))
data
age income
12 2000
35 -1000

So we detect two errors in the dataset:

summary(confront(data, rules))
name items passes fails nNA error warning expression
r1 2 1 1 0 FALSE FALSE (age - 18) >= -1e-08
r2 2 1 1 0 FALSE FALSE (income - 0) >= -1e-08
locate_errors(data, rules)$errors #> age income #> [1,] TRUE FALSE #> [2,] FALSE TRUE Lets inspect the first record mip <- inspect_mip(data, rules) #> Warning: Taking record 1, ignoring rest of the records... The mip object contains the mip problem before it is executed. We can inspect the lp problem, prior to solving it with lpSolveApi with # inspect the lp problem (prior to solving it with lpsolveAPI) lp <- mip$to_lp()
print(lp)
  Model name: errorlocate
age          income      .delta_age   .delta_income
Minimize                0               0  1.346459353576  1.167139223078
r1                     -1               0               0               0  <=    -18
r2                      0              -1               0               0  <=      0
age_ub                  1               0          -1e+07               0  <=     12
income_ub               0               1               0          -1e+07  <=   2000
age_lb                 -1               0          -1e+07               0  <=    -12
income_lb               0              -1               0          -1e+07  <=  -2000
Kind                  Std             Std             Std             Std
Type                 Real            Real             Int             Int
Upper                 Inf             Inf               1               1
Lower                -Inf            -Inf               0               0

Validator rules r1 and r2 are encoded in two lines of the model. The values of the current record are encoded as soft constraints in age_ub, age_lb, income_lb and income_ub. These constraints try to fix the values of age at 12 and income at 2000, but can be violated, setting .delta_age or .delta_income to 1.

For large problems the lp problem can be written to disk for inspection

mip$write_lp("my_problem.lp") Once we execute the mip project, the lp solver is executed on the problem: res <- mip$execute()

Extra arguments are passed through to lpSolveAPI. The result object contains several properties:

names(res)
#> [1] "s"        "solution" "values"   "lp"       "adapt"    "errors"

res$solution indicates of a solution was found res$solution
#> [1] TRUE

res$s indicates the lpSolveAPI status, what kind of solution was found. res$s
#> [1] 0

res$errors indicates which fields/values are deemed erroneous: res$errors
#>    age income
#>   TRUE  FALSE

res$values contains the values for the valid solution that has been found by the lpsolver: res$values
#>           age        income    .delta_age .delta_income
#>            18          2000             1             0

Note that the solver has found that setting age from 12 to 18 gives a valid solution. .delta_age = 1 indicates that age contained an error.

The result object res also contains an lp object after optimization. This object can be further investigated using lpSolveAPI functions.

# lp problem after solving
res$lp  Model name: errorlocate age income .delta_age .delta_income Minimize 0 0 1.346459353576 1.167139223078 age_ub 1 0 -1e+07 0 <= 12 income_ub 0 1 0 -1e+07 <= 2000 income_lb 0 -1 0 -1e+07 <= -2000 Kind Std Std Std Std Type Real Real Int Int Upper Inf Inf 1 1 Lower 18 0 0 0 Note that the lp problem has been simplified. For example the single variable constraints,the lp problem/object after solving shows that the solver has optimized some of the rules: it has moved rule r1 and r2 into the Lower boundary conditions. It also removed age_lb because that was superfluous with respect to the boundary conditions. ### Categorical rule In categorical rules, each category is coded in a separate column/mip variable: e.g. if we have a working variable, with two categories (“job,” “retired”), the mip problem is encoded as follows: rules <- validator( r1 = working %in% c("job","retired") ) data <- data.frame(working="?") working ? mip <- inspect_mip(data, rules) mip$to_lp()
  Model name: errorlocate
working:?      working:job  working:retired   .delta_working
Minimize                0                0                0    1.47541783657
r1                      0                1                1                0  =  1
working                 1                0                0                1  =  1
Kind                  SOS              SOS              SOS              Std
Type                  Int              Int              Int              Int
Upper                   1                1                1                1
Lower                   0                0                0                0

Row r1 indicates that either working:job or working:retired must be true. The Kind row (SOS) indicates that these variables share the same switch, only one of them can be set.

mip$execute()$values
#>       working:?     working:job working:retired  .delta_working
#>               0               1               0               1

#### Multiple categories:

With categorical variables it is also possible to specify if-then rules. These are encoded as one mip rule:

rules <- validator( r1 = if (voted == TRUE) adult == TRUE)
data <- data.frame(voted = TRUE, adult = FALSE)
TRUE FALSE
mip <- inspect_mip(data, rules)
mip$to_lp()  Model name: errorlocate adult voted .delta_adult .delta_voted Minimize 0 0 1.075791650452 1.446975832222 r1 -1 1 0 0 <= 0 voted 0 1 0 1 = 1 adult -1 0 1 0 = 0 Kind Std Std Std Std Type Int Int Int Int Upper 1 1 1 1 Lower 0 0 0 0 mip$execute()$values #> adult voted .delta_adult .delta_voted #> 1 1 1 0 ### Conditional rule rules <- validator( r1 = if (income > 0) age >= 16) data <- data.frame(age = 12, income = 2000) age income 12 2000 mip <- inspect_mip(data, rules) errorlocate encodes this rule into multiple rules (as noted in the theoretical section above), so rule r1 is chopped into 1 rule + 2 sub rules: r1: if (income > 0) age >= 16: • r1._lin1: if (r1._lin1 == FALSE) income <= 0 • r1._lin2: if (r1._lin2 == FALSE) age >= 16 • r1: r1._lin1 == FALSE | r1._lin2 == FALSE This can be seen with: mip$mip_rules()
#> [[1]]
#> r1: r1._lin1 + r1._lin2 <= 1
#> [[2]]
#> r1._lin1: income - 1e+07*r1._lin1 <= 0
#> [[3]]
#> r1._lin2: -age - 1e+07*r1._lin2 <= -16
#> [[4]]
#> income_ub: income - 1e+07*.delta_income <= 2000
#> [[5]]
#> age_ub: age - 1e+07*.delta_age <= 12
#> [[6]]
#> income_lb: -income - 1e+07*.delta_income <= -2000
#> [[7]]
#> age_lb: -age - 1e+07*.delta_age <= -12

The resulting lp model is:

mip$to_lp()  Model name: errorlocate age income .delta_age .delta_income r1._lin1 r1._lin2 Minimize 0 0 1.037647796562 1.298981007072 0 0 r1 0 0 0 0 1 1 <= 1 r1._lin1 0 1 0 0 -1e+07 0 <= 0 r1._lin2 -1 0 0 0 0 -1e+07 <= -16 income_ub 0 1 0 -1e+07 0 0 <= 2000 age_ub 1 0 -1e+07 0 0 0 <= 12 income_lb 0 -1 0 -1e+07 0 0 <= -2000 age_lb -1 0 -1e+07 0 0 0 <= -12 Kind Std Std Std Std Std Std Type Real Real Int Int Int Int Upper Inf Inf 1 1 1 1 Lower -Inf -Inf 0 0 0 0 mip$execute()$values #> age income .delta_age .delta_income r1._lin1 #> 16 2000 1 0 1 #> r1._lin2 #> 0 ### Alltogether: This works together with categorical, linear and conditional rules. rules <- validator( r1 = working %in% c("no_job", "job","retired") , r2 = if (age < 12) working == "no_job" , r3 = if (working == "retired") age > 50 , r4 = age >=0 ) data <- data.frame(age = 12, working="retired") mip <- inspect_mip(data, rules) mip$execute()$errors #> working age #> FALSE TRUE ### Weights The weights for each variable are normally set to 1, and errorlocate adds some random remainder to the weights: so the solutions are unique and reproducible (using set.seed). set.seed(42) rules <- validator( r1 = if (voted == TRUE) adult == TRUE) data <- data.frame(voted = TRUE, adult = FALSE) mip <- inspect_mip(data, rules, weight = c(voted = 3, adult=1)) $objective contains the generated weights:

mip$objective #> .delta_voted .delta_adult #> 3.457403 1.468538 These are assigned to the delta variables in the objective function of the mip. mip$to_lp()
  Model name: errorlocate
Lower                  0               0               0               0