Exasol + R UDF: Multiple Imputation with mice

Felix Source

I have performance issues running my multiple imputation on my local machine, and I would like to use Exasol's massively parallel execution to get the results faster. My goal is to impute the missing values.

I'm struggling with moving my R code into an Exasol UDF.

What I did:

- I imported the dataset into a table (only integers, FYI).
- I installed the mice and parallel packages via EXAoperation.

My single-core R code (Data2 is my dataset, which is now available as a table in Exasol):

library(mice)
imp.Data2 = complete(mice(Data2, m=5, method=c("polyreg","polyreg","norm","polyreg",rep("norm",471)),maxit=15,seed =10,printFlag=T))

For testing purposes I changed m to 1 and the number of iterations (maxit) to 1.

My R UDF is as follows, adapted from the documentation example:

-- A simple SET-EMITS function, computing simple statistics on a column
CREATE R SET SCRIPT r_stats(group_id INT, input_number DOUBLE)
EMITS (group_id INT, mean DOUBLE, stddev DOUBLE) AS
run <- function(ctx) {
# fetch all records from this group into a single vector
ctx$next_row(NA)
ctx$emit(ctx$group_id[1], mean(ctx$input_number), sd(ctx$input_number))
}
/


CREATE OR REPLACE R SET SCRIPT dev.imputation(id integer, BPS_KRITERIUM1_T0 integer )
EMITS (id integer, BPS_KRITERIUM1_T0 integer) AS

library(mice)
run <- function(ctx) {
ctx$next_row(NA)

ctx$emit(ctx$id[1], mice(ctx$BPS_KRITERIUM1_T0, m=1, method=c('norm'),maxit=1,seed =10,printFlag=T))
}
/

Using this script with

SELECT dev.IMPUTATION(id, BPS_KRITERIUM1_T0)
FROM dev.TEST_DATA

returns the error message:

[22002] VM error: Error in mice(ctx$BPS_KRITERIUM1_T0, m = 1, method = c("norm"), maxit = 1, : 
Data should be a matrix or data frame
(Session: 1571096526740134920)

After building a data frame from the input inside the function, like this:

CREATE OR REPLACE R SET SCRIPT dev.imputation(id integer, BPS_KRITERIUM1_T0 integer )
EMITS (id integer, BPS_KRITERIUM1_T0 integer) AS

library(mice)
run <- function(ctx) {
ctx$next_row(NA)

df <- c()
df <- data.frame (id = ctx$id, BPS_KRITERIUM1_T0 = ctx$BPS_KRITERIUM1_T0)
ctx$emit(ctx$id[1], mice(df, m=1, method=c('norm'),maxit=1,seed =10,printFlag=T))
}

/

The error message is:

[22002] VM error: Error in a(b) : (list) object cannot be coerced to type 'double'
(Session: 1571096526740134920)

I guess my problem is that within this function I can only access the data from the current row, but the algorithm needs the values from all rows in that column to calculate the missing values. I'm new to R and to UDFs in Exasol, but from my perspective I need to know how to store all values from the given column in an array and then pass this array to the mice function as the basis for each row.

Example data:

ID BPS_KRITERIUM1_T0
1 1
2 1
3 0
4 1
5 0
6 0
7 2
8 1
9 2
10 1
11 1
12 2
13 3
14 1
15 
16 3
17 3
18 0
19 1
20

Do you have any examples of similar approaches? I don't even know if that's the correct way to start :/ Let me know if you need additional information. Any help is much appreciated.

Thanks in advance and kind regards
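
For reference, here is a minimal sketch in plain R, outside Exasol, of the idea described above: collect the whole column into a data frame, run mice on it, and pull the imputed values back out. It assumes the mice package is installed and uses the example data from this post. The names imp and completed are made up for illustration. It also shows where the second error probably comes from: mice() returns a "mids" object (a list), so it cannot be emitted directly as a number; complete() has to be called on it first.

```r
library(mice)

# Example data from the question; the two empty cells (ids 15 and 20) become NA.
df <- data.frame(
  id = 1:20,
  BPS_KRITERIUM1_T0 = c(1, 1, 0, 1, 0, 0, 2, 1, 2, 1,
                        1, 2, 3, 1, NA, 3, 3, 0, 1, NA)
)

# mice() expects a matrix or data frame with at least two columns.
# One method per column: "" means "do not impute" (id has no missing values).
# The result is a "mids" object (a list), not a numeric vector.
imp <- mice(df, m = 1, method = c("", "norm"), maxit = 1, seed = 10,
            printFlag = FALSE)

# complete() extracts the first imputed data set as an ordinary data frame.
completed <- complete(imp, 1)

# "norm" draws real-valued imputations, so round before storing them as integers.
completed$BPS_KRITERIUM1_T0 <- as.integer(round(completed$BPS_KRITERIUM1_T0))
```

Inside a SET script, ctx$next_row(NA) should already load all rows of the group, so ctx$id and ctx$BPS_KRITERIUM1_T0 would be full vectors rather than single values. If emit accepts vectors as I understand it, the last step would be something like ctx$emit(completed$id, completed$BPS_KRITERIUM1_T0). I have not verified this on a cluster.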

Tags: r, scripting, udf, imputation, r-mice
