Parallelizing SQL queries in R

Andrew Elliott Source

I have six SQL queries that I script though R that each take a very long time (~30 minutes each). Once each query returns I then manipulate the data for some standard reports.

What I'd like to do is use my multicore machine to run these SQL requests in parallel from R.

I'm on a Windows machine with a Oracle DB. I was following a blog post to use doSNOW and foreach to try and split these requests and this is the best thing I can find on stackoverflow.

I've been able to get the process to work for the non-parallel %do% version of foreach but not the %dopar%. With %dopar% it just returns an empty set. Below is code that sets up tables and runs the queries so you can see what happens. Apologies in advance if there's too much basic code.

I've looked at some of the other R packages but didn't see an obvious solution. Also if you have a better way to manage this kind of process I'd be interested to hear it - just keep in mind I'm an analyst not a computer scientist. Thanks!

#Creating a cluster
library(doSNOW)
cl <- makeCluster(c("localhost","localhost"), type = "SOCK")
registerDoSNOW(cl)

#Connecting to database through RODBC
ch=odbcConnect("",pwd = "xxxxx", believeNRows=FALSE)
#Test connection
odbcGetInfo(ch)

#Creating database tables for example purposes
qryA1 <- "create table temptable(test int)" 
qryA2 <- "insert into temptable(test) values((1))" 
qryA3 <- "select * from temptable" 
qryA4 <- "drop table temptable" 
qryB1 <- "create table temptable2(test int)" 
qryB2 <- "insert into temptable2(test) values((2))" 
qryB3 <- "select * from temptable2" 
qryB4 <- "drop table temptable2"  

sqlQuery(ch, qryA1) 
sqlQuery(ch, qryA2) 
doesItWork <- sqlQuery(ch, qryA3) 
doesItWork
sqlQuery(ch, qryB1) 
sqlQuery(ch, qryB2) 
doesItWork <- sqlQuery(ch, qryB3) 
doesItWork

result = c()
output = c()
databases <- list('temptable','temptable2')


#Non-parallel version of foreach
system.time(
foreach(i = 1:2)%do%{
result<-sqlQuery(ch,paste('SELECT * FROM ',databases[i]))   
output[i] = result
}
) 

output

#Parallel version of foreach

outputPar = c()

system.time(
foreach(i = 1:2)%dopar%{
#Connecting to database through RODBC
ch=odbcConnect(dsn ,pwd = "xxxxxx", believeNRows=FALSE)
#Test connection
odbcGetInfo(ch)
result<-sqlQuery(ch,paste('SELECT * FROM ',databases[i]))   
outputPar[i] = result
}
) 

outputPar

sqlQuery(ch, qryA4)
sqlQuery(ch, qryB4) 
sqlrparallel-processingmulticore

Answers

answered 6 years ago BenBarnes #1

When you make the assignment outputPar[i] = result inside the serial foreach loop, this is OK (but not really the intended use of foreach). When you make this assignment in the parallel loop, it is not OK. See http://tolstoy.newcastle.edu.au/R/e10/help/10/04/3237.html for a similar question answered by David Smith at Revolution.

As a solution,

system.time(
  outputPar <- foreach(i = 1:2, .packages="RODBC")%dopar%{
#Connecting to database through RODBC
  ch=odbcConnect(dsn ,pwd = "xxxxxx", believeNRows=FALSE)
#Test connection
  odbcGetInfo(ch)
  result<-sqlQuery(ch,paste('SELECT * FROM ',databases[i]))   
  result
}
)

comments powered by Disqus