caret train binary glm fails on parallel cluster via doParallel

Triamus

There are already many questions on this topic, but none gives a satisfying answer to my problem. I intend to use caret::train() together with the doParallel library on a Windows machine. The documentation (The caret package: 9 Parallel Processing) says that train() runs in parallel by default if it finds a registered cluster (although its example uses the doMC library).

When I set up a cluster with doParallel and follow the example calculation in its documentation (Getting Started with doParallel and foreach), everything works fine. When I unregister the cluster and run caret::train(), everything works fine. But when I create a new cluster and run caret::train(), it fails with Error in serialize(data, node$con) : error writing to connection. The logs are included below. I don't understand why caret::train() works in non-parallel mode but fails in parallel mode, even though the cluster seems to be set up correctly.

libraries

library(caret)
library(microbenchmark)
library(doParallel)

session info

sessionInfo()

R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] doParallel_1.0.10      iterators_1.0.8        foreach_1.4.3          microbenchmark_1.4-2.1
[5] caret_6.0-76           ggplot2_2.2.1          lattice_0.20-35       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11       compiler_3.4.1     nloptr_1.0.4       plyr_1.8.4         tools_3.4.1       
 [6] lme4_1.1-13        tibble_1.3.3       nlme_3.1-131       gtable_0.2.0       mgcv_1.8-17       
[11] rlang_0.1.1        Matrix_1.2-10      SparseM_1.77       mvtnorm_1.0-6      stringr_1.2.0     
[16] hms_0.3            MatrixModels_0.4-1 stats4_3.4.1       grid_3.4.1         nnet_7.3-12       
[21] R6_2.2.2           survival_2.41-3    multcomp_1.4-6     TH.data_1.0-8      minqa_1.2.4       
[26] readr_1.1.1        reshape2_1.4.2     car_2.1-5          magrittr_1.5       scales_0.4.1      
[31] codetools_0.2-15   ModelMetrics_1.1.0 MASS_7.3-47        splines_3.4.1      pbkrtest_0.4-7    
[36] colorspace_1.3-2   quantreg_5.33      sandwich_2.4-0     stringi_1.1.5      lazyeval_0.2.0    
[41] munsell_0.4.3      zoo_1.8-0

running example from doParallel documentation (no errors)

cores_2_use <- floor(0.8 * detectCores())
cl <- makeCluster(cores_2_use, outfile = "parallel_log1.txt")
registerDoParallel(cl)

x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 100
temp <- microbenchmark(
  r <- foreach(icount(trials), .combine=cbind) %dopar% {
    ind <- sample(100, 100, replace=TRUE)
    result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
    coefficients(result1)}
  )

parallel::stopCluster(cl)
foreach::registerDoSEQ()

mock-up data

x1 = rnorm(100)           # some continuous variables 
x2 = rnorm(100)
z = 1 + 2 * x1 + 3 * x2        # linear combination with a bias
pr = 1 / (1 + exp(-z))         # pass through an inv-logit function
y = rbinom(100, 1, pr)      # bernoulli response variable
df = data.frame(y = as.factor(ifelse(y == 0, "no", "yes")), x1 = x1, x2 = x2)

running caret::train() non-parallel (no error)

# train control function
ctrl <- 
  trainControl(
    method = "repeatedcv", 
    number = 10,
    repeats = 5,
    classProbs = TRUE,
    summaryFunction = twoClassSummary)

# train function
microbenchmark(
  glm_nopar =
    train(y ~ .,
          data = df,
          method = "glm",
          family = "binomial",
          metric = "ROC",
          trControl = ctrl),
  times = 5)

#Unit: milliseconds
 #expr      min       lq     mean   median       uq      max neval
 #glm_nopar 691.9643 805.1762 977.1054 895.9903 1018.112 1474.284     5

running caret::train() parallel (error)

cores_2_use <- floor(0.8 * detectCores())
cl <- makeCluster(cores_2_use, outfile = "parallel_log2.txt")
registerDoParallel(cl)

microbenchmark(
  glm_par =
    train(y ~ .,
          data = df,
          method = "glm",
          family = "binomial",
          metric = "ROC",
          trControl = ctrl),
  times = 5)

#Error in serialize(data, node$con) : error writing to connection

EDIT (trying without parallel::makeCluster() call)

Following the Linux setup (see further down), I also tried without the parallel::makeCluster() call, as shown here, but this results in the same error.

cores_2_use <- floor(0.8 * detectCores())
registerDoParallel(cores_2_use)
...

output parallel_log1.txt

starting worker pid=3880 on localhost:11442 at 16:00:52.764
starting worker pid=3388 on localhost:11442 at 16:00:53.405
starting worker pid=9920 on localhost:11442 at 16:00:53.789
starting worker pid=4248 on localhost:11442 at 16:00:54.229
starting worker pid=3548 on localhost:11442 at 16:00:54.572
starting worker pid=5704 on localhost:11442 at 16:00:54.932
starting worker pid=7740 on localhost:11442 at 16:00:55.291
starting worker pid=2164 on localhost:11442 at 16:00:55.653
starting worker pid=7428 on localhost:11442 at 16:00:56.011
starting worker pid=6116 on localhost:11442 at 16:00:56.372
starting worker pid=1632 on localhost:11442 at 16:00:56.731
starting worker pid=9160 on localhost:11442 at 16:00:57.092
starting worker pid=2956 on localhost:11442 at 16:00:57.435
starting worker pid=7060 on localhost:11442 at 16:00:57.811
starting worker pid=7344 on localhost:11442 at 16:00:58.170
starting worker pid=6688 on localhost:11442 at 16:00:58.561
starting worker pid=9308 on localhost:11442 at 16:00:58.920
starting worker pid=9260 on localhost:11442 at 16:00:59.281
starting worker pid=6212 on localhost:11442 at 16:00:59.641

output parallel_log2.txt

starting worker pid=17640 on localhost:11074 at 15:12:21.118
starting worker pid=7776 on localhost:11074 at 15:12:21.494
starting worker pid=15128 on localhost:11074 at 15:12:21.961
starting worker pid=13724 on localhost:11074 at 15:12:22.345
starting worker pid=17384 on localhost:11074 at 15:12:22.714
starting worker pid=8472 on localhost:11074 at 15:12:23.228
starting worker pid=8392 on localhost:11074 at 15:12:23.597
starting worker pid=17412 on localhost:11074 at 15:12:23.979
starting worker pid=15996 on localhost:11074 at 15:12:24.364
starting worker pid=16772 on localhost:11074 at 15:12:24.743
starting worker pid=18268 on localhost:11074 at 15:12:25.120
starting worker pid=13504 on localhost:11074 at 15:12:25.500
starting worker pid=5156 on localhost:11074 at 15:12:25.899
starting worker pid=13544 on localhost:11074 at 15:12:26.275
starting worker pid=1764 on localhost:11074 at 15:12:26.647
starting worker pid=8076 on localhost:11074 at 15:12:27.028
starting worker pid=13716 on localhost:11074 at 15:12:27.414
starting worker pid=14596 on localhost:11074 at 15:12:27.791
starting worker pid=15664 on localhost:11074 at 15:12:28.170
Loading required package: caret
Loading required package: lattice
Loading required package: ggplot2
loaded caret and set parent environment
starting worker pid=3932 on localhost:11442 at 16:01:44.384
starting worker pid=6848 on localhost:11442 at 16:01:44.731
starting worker pid=5400 on localhost:11442 at 16:01:45.098
starting worker pid=9832 on localhost:11442 at 16:01:45.475
starting worker pid=8448 on localhost:11442 at 16:01:45.928
starting worker pid=1284 on localhost:11442 at 16:01:46.289
starting worker pid=9892 on localhost:11442 at 16:01:46.632
starting worker pid=8312 on localhost:11442 at 16:01:46.991
starting worker pid=3696 on localhost:11442 at 16:01:47.349
starting worker pid=9108 on localhost:11442 at 16:01:47.708
starting worker pid=8548 on localhost:11442 at 16:01:48.083
starting worker pid=7288 on localhost:11442 at 16:01:48.442
starting worker pid=6872 on localhost:11442 at 16:01:48.801
starting worker pid=3760 on localhost:11442 at 16:01:49.145
starting worker pid=3468 on localhost:11442 at 16:01:49.503
starting worker pid=2500 on localhost:11442 at 16:01:49.862
starting worker pid=7200 on localhost:11442 at 16:01:50.205
starting worker pid=7820 on localhost:11442 at 16:01:50.564
starting worker pid=8852 on localhost:11442 at 16:01:50.923
Error in unserialize(node$con) : 
  ReadItem: unknown type 0, perhaps written by later version of R
Calls: <Anonymous> ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
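
The unserialize message in the worker log above hints that a worker received bytes it could not decode. One cheap check, sketched here under the assumption that a small local cluster can be started, is to ask every worker for its R version and compare it with the master session. This is only a diagnostic idea, not a confirmed cause of the error; the names worker_versions, master_version and all_match are illustrative and not part of the original post, and only base parallel functions are used.

```r
library(parallel)

# Assumption: two local workers are enough for a quick sanity check.
cl <- makeCluster(2)

# Ask each worker which R build it is running.
worker_versions <- unlist(clusterEvalQ(cl, R.version.string))

# The R build of the master session, for comparison.
master_version <- R.version.string

# TRUE when every worker runs the same R build as the master.
all_match <- all(worker_versions == master_version)

stopCluster(cl)
```

If all_match is FALSE, the workers are started from a different R installation than the master, which would be worth ruling out before looking at caret itself.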

EDIT (trying on Ubuntu)

libraries

library(caret)
library(microbenchmark)
library(doMC)

sessionInfo()

R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] doMC_1.3.4             iterators_1.0.8        foreach_1.4.3         
[4] microbenchmark_1.4-2.1 caret_6.0-77           ggplot2_2.2.1         
[7] lattice_0.20-35       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11       ddalpha_1.2.1      compiler_3.4.1     DEoptimR_1.0-8    
 [5] gower_0.1.2        plyr_1.8.4         bindr_0.1          class_7.3-14      
 [9] tools_3.4.1        rpart_4.1-11       ipred_0.9-6        lubridate_1.6.0   
[13] tibble_1.3.3       nlme_3.1-131       gtable_0.2.0       pkgconfig_2.0.1   
[17] rlang_0.1.1        Matrix_1.2-11      RcppRoll_0.2.2     prodlim_1.6.1     
[21] bindrcpp_0.2       withr_2.0.0        stringr_1.2.0      dplyr_0.7.1       
[25] recipes_0.1.0      stats4_3.4.1       nnet_7.3-12        CVST_0.2-1        
[29] grid_3.4.1         robustbase_0.92-7  glue_1.1.1         R6_2.2.2          
[33] survival_2.41-3    lava_1.5           purrr_0.2.2.2      reshape2_1.4.2    
[37] kernlab_0.9-25     magrittr_1.5       DRR_0.0.2          splines_3.4.1     
[41] scales_0.4.1       codetools_0.2-15   ModelMetrics_1.1.0 MASS_7.3-47       
[45] assertthat_0.2.0   dimRed_0.1.0       timeDate_3012.100  colorspace_1.3-2  
[49] stringi_1.1.5      lazyeval_0.2.0     munsell_0.4.3  

example from Getting Started with doMC and foreach

Works as expected.

example caret non-parallel

microbenchmark(
  glm_nopar =
    train(y ~ .,
          data = df,
          method = "glm",
          family = "binomial",
          metric = "ROC",
          trControl = ctrl),
  times = 5)

#Unit: seconds
#     expr      min       lq     mean   median       uq      max neval
#glm_nopar 1.093237 1.098342 1.481444 1.102867 2.001443 2.111333     5

caret parallel with the same setup as on Windows (gives error)

cores_2_use <- floor(0.8 * parallel::detectCores())
cl <- parallel::makeCluster(cores_2_use, outfile = "parallel_log2_linux.txt")
registerDoMC(cl)

microbenchmark(
  glm_par =
    train(y ~ .,
          data = df,
          method = "glm",
          family = "binomial",
          metric = "ROC",
          trControl = ctrl),
  times = 5)

# Error in getOper(ctrl$allowParallel && getDoParWorkers() > 1) : (list) object cannot be coerced to type 'double'

parallel_log2_linux.txt

starting worker pid=6343 on localhost:11836 at 16:05:17.781
starting worker pid=6353 on localhost:11836 at 16:05:18.025
starting worker pid=6362 on localhost:11836 at 16:05:18.266
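
The error above is consistent with registerDoMC() expecting a number of cores rather than a cluster object: a cluster created by parallel::makeCluster() is a list of node objects, and comparing such a list with a number fails. A minimal sketch using only base parallel follows. The variable names are illustrative, and the idea that caret's getDoParWorkers() > 1 check ends up receiving the cluster object is an assumption based on the error message, not something verified in the doMC source.

```r
library(parallel)

# Assumption: a small local cluster, just for illustration.
cl <- makeCluster(2)

cluster_is_list <- is.list(cl)   # a cluster is a list of worker nodes
cluster_class   <- class(cl)     # class vector of the cluster object

# Comparing the cluster object with a number, as in `workers > 1`,
# raises a coercion error instead of returning TRUE or FALSE.
cmp_error <- tryCatch(
  {
    cl > 1
    NA_character_
  },
  error = function(e) conditionMessage(e)
)

stopCluster(cl)
```

Here cmp_error holds the text of the coercion error, which suggests passing a core count (as in the next example) rather than a cluster object to registerDoMC().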

caret parallel without parallel::makeCluster() call (no error)

Unclear how to define log output in this setup.

cores_2_use <- floor(0.8 * parallel::detectCores())
registerDoMC(cores_2_use)

microbenchmark(
  glm_par =
    train(y ~ .,
          data = df,
          method = "glm",
          family = "binomial",
          metric = "ROC",
          trControl = ctrl),
  times = 5)

#Unit: milliseconds
#    expr      min       lq     mean   median       uq      max neval
# glm_par 991.8075 997.4397 1013.686 998.8241 1004.381 1075.978     5
Tags: r, caret, doparallel

Answers

answered 10 months ago CPak #1

Looks like because you're on Windows, you're screwed

The doMC package acts as an interface between foreach and the multicore functionality of the parallel package, originally written by Simon Urbanek and incorporated into parallel for R 2.14.0. The multicore functionality currently only works with operating systems that support the fork system call (which means that Windows isn't supported).

Caret uses doMC. See caret/parallel-processing.html

library(doMC)
registerDoMC(cores = 5)
model <- train(y ~ ., data = training, method = "rf")

Note OP has edited his original post. OP was running on Windows to begin with.

Edit - Too long for a single comment

doParallel does not rescue caret parallelization (but I could be wrong; please let me know with more downvotes and comments).

1) Please try this yourself on Windows. It defaulted to sequential when I tried with doParallel. (I would like to know if it works on someone else's Windows machine.)

It makes sense that it defaulted to sequential, because

2) caret uses doMC. See here:

caret leverages one of the parallel processing frameworks in R to do just this. The foreach package allows R code to be run either sequentially or in parallel using several different technologies, such as the multicore or Rmpi packages (see Schmidberger et al, 2009 for summaries and descriptions of the available options). There are several R packages that work with foreach to implement these techniques, such as doMC (for multicore) or doMPI (for Rmpi).

3) doParallel simply combines doMC and doSNOW. See here.

The doParallel package is a merger of doSNOW and doMC, much as parallel is a merger of snow and multicore.

Note the author of the accepted answer in the link is Steve Weston, one of the authors of the doParallel package.

4) doMC forks processes, which is not supported on Windows (Windows only supports SNOW and SOCK processes). See here, again from Steve Weston:

The multicore functionality currently only works with operating systems that support the fork system call (which means that Windows isn’t supported)

answered 10 months ago Hong Ooi #2

You have to use the foreach backend that corresponds to your cluster type. If you're creating a cluster with parallel::makeCluster, you register it with doParallel::registerDoParallel.

cl <- parallel::makeCluster(cores_2_use, outfile = "parallel_log2_linux.txt")
library(doParallel)
registerDoParallel(cl)
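
A slightly fuller sketch of this pairing, assuming a small two-worker cluster (the worker count, the variable names and the toy foreach loop are illustrative, not taken from the question). It registers the socket cluster with doParallel, checks what foreach sees, runs a trivial parallel loop, and cleans up afterwards:

```r
library(doParallel)  # also attaches foreach, iterators and parallel

# Assumption: two workers are enough for a smoke test.
cl <- parallel::makeCluster(2)
registerDoParallel(cl)

# What foreach (and therefore any package built on it) currently sees.
backend <- getDoParName()
workers <- getDoParWorkers()

# Trivial parallel loop to confirm that the workers respond.
res <- foreach(i = 1:4, .combine = c) %dopar% i^2

# Clean up: stop the workers and fall back to sequential execution.
parallel::stopCluster(cl)
registerDoSEQ()
```

If workers is larger than 1 at this point, caret::train() with the default allowParallel = TRUE in trainControl() should pick up the backend through foreach.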

answered 10 months ago Triamus #3

I tried on a different Windows 10 machine with fewer cores but the same code setup. However, I used the development version of caret from GitHub (installed via devtools::install_github('topepo/caret/pkg/caret')) together with R 3.4.1, and the issue could not be reproduced. The parallel cluster ran without issues with the code below. Unfortunately, I don't have access to the original Windows 7 workstation to check whether the issue persists with the caret dev version and/or a newer R version.

library(doParallel)
cores_2_use <- floor(0.8 * detectCores())
cl <- makeCluster(cores_2_use, outfile = "parallel_log.txt")
registerDoParallel(cl)

glm_par <-
  microbenchmark(
    glm_par =
      train(default ~ .,
            data = benchmark_train_data,
            method = "glm",
            family = "binomial",
            metric = "ROC",
            trControl = ctrl),
    times = 5)

glm_par

#Unit: seconds
#    expr      min       lq     mean   median       uq      max neval
# glm_par 13.14082 13.25298 16.77678 13.64924 13.78132 30.05955     5

EDIT (non-parallel benchmark)

This is the same code running on one core (as opposed to six cores in the parallel run above). I would have expected an even bigger speed-up from the parallel setup.

#Unit: seconds
#      expr      min       lq     mean   median       uq      max neval
# glm_nopar 25.44122 25.52031 25.64818 25.53692 25.56496 26.17751     5
