# Unit Screening

Unit screening filters units based on data availability rules. Just like with indicators (columns), when a unit (row) has very few data points available, it may make sense to remove it. This avoids drawing conclusions about units with very few data points. It also tends to increase the percentage data availability of each indicator, because the removed rows are the ones with the most missing values.

The COINr function Screen() is a generic function with methods for data frames, coins and purses. It is a building function in that it creates a new data set in .$Data as its output.

# Data frames

We begin with data frames. Let’s take a subset of the inbuilt example data for demonstration. I cherry-pick some rows and columns which have some missing values.

library(COINr)

# example data
iData <- ASEM_iData[40:51, c("uCode", "Research", "Pat", "CultServ", "CultGood")]

iData
#>    uCode Research   Pat CultServ CultGood
#> 40   KOR    20437 249.8  1.79800       NA
#> 41   LAO      175    NA       NA       NA
#> 42   MYS     8080  64.2  1.15292    7.555
#> 43   MNG      293   0.3  0.00266    0.046
#> 44   MMR      299    NA  0.08905       NA
#> 45   NZL     7731  46.5  0.34615    1.213
#> 46   PAK     7122   7.2  0.03553    1.256
#> 47   PHL     1361  11.3  0.29555    3.185
#> 48   RUS    16182 141.5  1.44633    8.379
#> 49   SGP    11411 270.5  0.92780   14.507
#> 50   THA     5317  53.6  0.08969    6.661
#> 51   VNM     3618    NA       NA       NA

The data has four indicators, plus an identifier column “uCode”. Looking at each unit, the data availability is variable. We have 12 units in total.

Now let’s use Screen() to screen out some of these units. Specifically, we will remove any units that have less than 75% data availability, which means keeping only units with non-NA values for at least 3 of the 4 indicators:

l_scr <- Screen(iData, unit_screen = "byNA", dat_thresh = 0.75)

The output of Screen() is a list:

str(l_scr, max.level = 1)
#> List of 3
#>  $ ScreenedData:'data.frame':    9 obs. of  5 variables:
#>  $ DataSummary :'data.frame':    12 obs. of  10 variables:
#>  $ RemovedUnits: chr [1:3] "LAO" "MMR" "VNM"

We can see already that the “RemovedUnits” entry tells us that three units were removed based on our specifications. We now have our new screened data set:

l_scr$ScreenedData
#>    uCode Research   Pat CultServ CultGood
#> 40   KOR    20437 249.8  1.79800       NA
#> 42   MYS     8080  64.2  1.15292    7.555
#> 43   MNG      293   0.3  0.00266    0.046
#> 45   NZL     7731  46.5  0.34615    1.213
#> 46   PAK     7122   7.2  0.03553    1.256
#> 47   PHL     1361  11.3  0.29555    3.185
#> 48   RUS    16182 141.5  1.44633    8.379
#> 49   SGP    11411 270.5  0.92780   14.507
#> 50   THA     5317  53.6  0.08969    6.661

And we have a summary of data availability and some other things:

head(l_scr$DataSummary)
#>    uCode N_missing N_zero N_miss_or_zero Dat_Avail Non_Zero LowData LowNonZero
#> 40   KOR         1      0              1      0.75        1   FALSE      FALSE
#> 41   LAO         3      0              3      0.25        1    TRUE      FALSE
#> 42   MYS         0      0              0      1.00        1   FALSE      FALSE
#> 43   MNG         0      0              0      1.00        1   FALSE      FALSE
#> 44   MMR         2      0              2      0.50        1    TRUE      FALSE
#> 45   NZL         0      0              0      1.00        1   FALSE      FALSE
#>    LowDatOrZeroFlag Included
#> 40            FALSE     TRUE
#> 41             TRUE    FALSE
#> 42            FALSE     TRUE
#> 43            FALSE     TRUE
#> 44             TRUE    FALSE
#> 45            FALSE     TRUE

This table is in fact generated by get_data_avail() - some more details can be found in the Analysis vignette.
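To make the Dat_Avail and LowData columns concrete, here is a minimal base-R sketch that recomputes them by hand for the twelve example rows. It is only an illustration of the rule, not COINr's internal implementation; the variable names (`ind_cols`, `dat_avail`, `low_data`) are made up for this example, and the data values are typed in from the printed table above.

```r
# Example rows copied from the printed iData table (no COINr needed here)
iData <- data.frame(
  uCode    = c("KOR", "LAO", "MYS", "MNG", "MMR", "NZL",
               "PAK", "PHL", "RUS", "SGP", "THA", "VNM"),
  Research = c(20437, 175, 8080, 293, 299, 7731,
               7122, 1361, 16182, 11411, 5317, 3618),
  Pat      = c(249.8, NA, 64.2, 0.3, NA, 46.5,
               7.2, 11.3, 141.5, 270.5, 53.6, NA),
  CultServ = c(1.79800, NA, 1.15292, 0.00266, 0.08905, 0.34615,
               0.03553, 0.29555, 1.44633, 0.92780, 0.08969, NA),
  CultGood = c(NA, NA, 7.555, 0.046, NA, 1.213,
               1.256, 3.185, 8.379, 14.507, 6.661, NA),
  stringsAsFactors = FALSE
)

# Indicator columns are everything except the identifier
ind_cols <- setdiff(names(iData), "uCode")

# Fraction of non-missing indicator values per unit (row)
dat_avail <- rowMeans(!is.na(iData[ind_cols]))

# A unit is flagged when its availability is strictly below the threshold
low_data <- dat_avail < 0.75

removed  <- iData$uCode[low_data]
screened <- iData[!low_data, ]

# Per-indicator availability before and after removing the flagged units
avail_before <- colMeans(!is.na(iData[ind_cols]))
avail_after  <- colMeans(!is.na(screened[ind_cols]))
```

Here `removed` contains "LAO", "MMR" and "VNM", matching the RemovedUnits entry above. KOR sits exactly at 0.75 and is kept because the comparison is strict. Comparing `avail_before` with `avail_after` shows the effect on indicators: Pat, for instance, goes from 9 of 12 values present to 9 of 9.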

Other than data availability, units can also be screened based on the presence of zeros, or on both; this is specified by the unit_screen argument. Use the Force argument[1] to override the screening rules for specified units if required (either to force inclusion or force exclusion).
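As a hedged sketch of overriding the rules, the example below forces LAO to stay in the data set even though it falls below the threshold. It assumes, based on my reading of the Screen() documentation, that the override is passed as a data frame with columns `uCode` and `Include` (logical), and that `id_col` must be given for the data frame method when an override is supplied. Check `?Screen` for your installed version before relying on this; `force_df` and `l_forced` are names invented for the example.

```r
library(COINr)

# Same subset of the example data as above
iData <- ASEM_iData[40:51, c("uCode", "Research", "Pat", "CultServ", "CultGood")]

# Assumed format: one row per unit to override;
# Include = TRUE forces inclusion, FALSE forces exclusion
force_df <- data.frame(uCode = "LAO", Include = TRUE)

# Screen at 75% data availability, but keep LAO regardless
l_forced <- Screen(iData, id_col = "uCode", unit_screen = "byNA",
                   dat_thresh = 0.75, Force = force_df)

# Units removed after applying both the threshold and the override
l_forced$RemovedUnits
```

If the assumptions hold, LAO should no longer appear among the removed units, while the other low-availability units are still screened out.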

# Coins

Screening on coins is very similar to data frames, because the coin method extracts the relevant data set, passes it to the data frame method, and then puts the output back as a new data set. This means the arguments are almost the same. The only differences are that you specify which data set to screen, the name to give the new data set, and whether to output a coin or a list.

We’ll build the example coin, then screen the raw data set with a threshold of 85% data availability, and give the new data set a name other than “Screened” (the default):

# build example coin
coin <- build_example_coin(up_to = "new_coin", quietly = TRUE)

# screen units from raw dset
coin <- Screen(coin, dset = "Raw", unit_screen = "byNA", dat_thresh = 0.85, write_to = "Filtered_85pc")
#> Written data set to .$Data$Filtered_85pc

# some details about the coin by calling its print method
coin
#> --------------
#> A coin with...
#> --------------
#> Input:
#>   Units: 51 (AUS, AUT, BEL, ...)
#>   Indicators: 49 (Goods, Services, FDI, ...)
#>   Denominators: 4 (Area, Energy, GDP, ...)
#>   Groups: 4 (GDP_group, GDPpc_group, Pop_group, ...)
#>
#> Structure:
#>   Level 1 Indicator: 49 indicators (FDI, ForPort, Goods, ...)
#>   Level 2 Pillar: 8 groups (ConEcFin, Instit, P2P, ...)
#>   Level 3 Sub-index: 2 groups (Conn, Sust)
#>   Level 4 Index: 1 groups (Index)
#>
#> Data sets:
#>   Raw (51 units)
#>   Filtered_85pc (48 units)

The printed summary shows that the new data set only has 48 units, compared to the raw data set with 51. We can find which units were filtered because this is stored in the coin’s “Analysis” sub-list:

coin$Analysis$Filtered_85pc$RemovedUnits
#> [1] "BRN" "LAO" "MMR"

The Analysis sub-list also contains the data availability table that is output by Screen().

As with the data frame method, we can also choose to screen units by presence of zeros, or a combination of zeros and missing values.

# Purses

For completeness we also demonstrate the purse method. Like most purse methods, this simply applies the coin method to each coin in the purse, without any special features. Here, we perform the same example as in the coin section, but on a purse of coins:

# build example purse
purse <- build_example_purse(up_to = "new_coin", quietly = TRUE)

# screen units in all coins to 85% data availability
purse <- Screen(purse, dset = "Raw", unit_screen = "byNA", dat_thresh = 0.85, write_to = "Filtered_85pc")
#> Written data set to .$Data$Filtered_85pc
#> Written data set to .$Data$Filtered_85pc
#> Written data set to .$Data$Filtered_85pc
#> Written data set to .$Data$Filtered_85pc
#> Written data set to .$Data$Filtered_85pc
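To see which units were removed from each coin in the purse, one option is to loop over the coins and pull out the same Analysis entry used in the coin section. This sketch assumes that a purse is a data frame with a `coin` list column and a `Time` column, which is my understanding of the purse class but is worth confirming with `str(purse, max.level = 1)`. The name `removed_by_coin` is invented for the example.

```r
library(COINr)

# Build and screen the example purse, as in the purse section
purse <- build_example_purse(up_to = "new_coin", quietly = TRUE)
purse <- Screen(purse, dset = "Raw", unit_screen = "byNA",
                dat_thresh = 0.85, write_to = "Filtered_85pc")

# Assumption: coins are stored in the list column purse$coin
removed_by_coin <- lapply(purse$coin, function(coin) {
  coin$Analysis$Filtered_85pc$RemovedUnits
})

# Assumption: purse$Time labels each coin (one entry per row)
names(removed_by_coin) <- purse$Time

removed_by_coin
```

The result is a list with one element per coin, each holding the unit codes screened out of that coin.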

1. Luke. Sorry.