Dockerized remote R futures + file transfer

The future R package makes it easy to delegate expensive computations to a separate R process, including parallel computing on cloud infrastructure. Best of all (and fundamentally), your local R process doesn’t get stuck waiting for the remote evaluation; you can happily chug along through your R script and deal with the fruits of remote-executed code when they are good and ripe. In this post I’ll briefly outline my current future-enabled workflow, and then detail how I shoehorned in the ability to transfer files produced in the cloud back to my local machine.
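
If you haven’t seen the package in action, here is a minimal, purely local illustration of that non-blocking behaviour (a multisession plan stands in for the remote cluster used in the rest of this post):

library(future)
plan(multisession)

# Returns immediately; the work happens in a background R session
slow_sum %<-% {
  Sys.sleep(10)       # stand-in for an expensive computation
  sum(rnorm(1e7))
}

# ... carry on with other work here ...

slow_sum  # touching the value waits only if it isn't ready yet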

Cloud-computing future Workflow

My personal choice for cloud-computing resources has been Google Compute Engine, which I access using the googleComputeEngineR package (in addition to gcloud and the cloud console). I like it because it has simple pricing and super-cheap preemptible instances, and also because, well, it’s the one I chose and I’m sticking to it. Plenty of people swear by Digital Ocean, and of course half the internet runs on AWS.

Because I often work with R packages that rely on external libraries (gdal, netcdf, stan, etc.), and getting the installation right on a new machine is often tedious and time-consuming, I opt where possible to set up remote environments as docker containers. In addition to making things “just work” more readily, this setup has a couple of added advantages. First, it means you have clearly documented, reproducible (and modular!) setups that you can modify and transfer to whatever hardware you have available. Second, it leverages other people’s dockerfiles and published images, so I don’t need to figure out the hard way which libraries, apt repositories, and compiler flags to use. And lastly, it’s a fine way to learn Docker if you want to see what all the fuss is about (as I did).

My workflow for a future + GCE + docker setup is roughly as follows:

  1. Select an R- and parallel-ready docker image like this one or create your own, possibly by adding on to the one in the link using a FROM statement.
  2. Spin up a cluster of container-optimized instances using gce_vm_cluster(), passing the selected docker image via the docker_image argument. Even though this is a “cluster”, it need not be multiple instances; I typically use a “cluster” of size 1.
  3. Turn this into a parallel-ready cluster with as.cluster(), then call plan() with the remote strategy, supplying the cluster as the workers argument (a code sketch of steps 2 through 4 follows this list).
  4. Go nuts running futures in the cloud!
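
Concretely, steps 2 through 4 look roughly like the sketch below. It assumes googleComputeEngineR is already authenticated against your project and zone; the image, prefix, and cluster size are placeholders, and if plan(remote, ...) gives you trouble, plan(cluster, ...) with the same workers is the closely related alternative.

library(googleComputeEngineR)
library(future)

# Step 2: spin up a (size-1) "cluster" of container-optimized VMs
# running the chosen docker image
vms <- gce_vm_cluster(vm_prefix = "r-future-",
                      cluster_size = 1,
                      docker_image = "rocker/r-parallel")

# Step 3: make it a parallel-ready cluster and point future at it
plan(remote, workers = as.cluster(vms))

# Step 4: go nuts
test %<-% {
  Sys.info()[["nodename"]]  # evaluated in the remote container
}
test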

This is a version of the recipe documented in this vignette from the googleComputeEngineR package. Check out that link for more examples.

The file-transfer conundrum

This workflow works great for most situations, but occasionally I run into a problem for which it is not well equipped. The latest of these has been the following situation. Suppose you have a computationally expensive piece of R code that produces not an R object (which would automatically be sent back to your local R session when ready), but a file written to disk. For example:

f1 <- function(x, filename) {
  # pretend this is computationally expensive R code:
  towrite <- sum(x)
  
  # Pretend this is difficult to return as an R object
  write.csv(towrite, file = filename)
  return(TRUE)
}

Here, the value of the return statement gets back to your local R session via the future protocol, but the file written stays in the docker container (which stops running once the future is resolved). Getting the file back seems like it should be easy to do, but alas, it is not. Remote futures are communicated via ssh, and come with no file transfer capabilities. How to transfer the file back to my local machine? Here are some things I tried:

  1. Open an scp port when creating the docker container and somehow get the container to persist long enough to transfer the file(s) directly to my local machine using scp. This idea failed pretty much immediately and on several fronts. Just to name a few: the docker run call is buried deep in the future call stack; port mapping is not allowed because of host networking; and just how would I get the container to persist anyway?
  2. Push the file to a Box folder using the boxr package, then download it from my local R session. This only got as far as authentication on the remote server, which failed because httr’s oauth apparently requires browser verification. However, I am no expert on this and it’s possible I just didn’t find the workaround.
  3. Similar to 2, but using Google Cloud Storage instead of Box. I attempted this using the googleCloudStorageR package, but gave up because it requires a JSON file for authentication and I couldn’t think of a way to securely get that into the container without an ugly readLines()/writeLines() hack.
  4. Similar to 2 and 3, but using S3 from AWS. For this I used the aws.s3 package, which authenticates using an access key and a secret, both passed as strings. These are easy to get from local to remote, and voilà! It worked.

Here, then, is my solution to transfer files from the remote container to my local machine. It assumes you already have the cluster in place as described above, and have installed and set up aws.s3 as described in the package readme.

library(aws.s3)
library(future)  # already attached if you set up the plan() as described above

# authentication credentials
awskey <- Sys.getenv("AWS_ACCESS_KEY_ID")
awssecret <- Sys.getenv("AWS_SECRET_ACCESS_KEY")

# Bucket to use for transferring 
bucketname <- "myfavoritebuckettouse"
putbucket <- get_bucket(bucket = bucketname,
                        key = awskey,
                        secret = awssecret)

# Data in need of computationally expensive processing
data2run <- rnorm(100)
outfile <- "f1output.csv" # file name where results will be written

# Make sure aws.s3 is installed in the remote container.
installed %<-% {install.packages("aws.s3")}

# Run the code remotely using future
f1_output %<-% {
  # generate the file on disk
  f1(data2run, filename = outfile)
  # then push it to the S3 bucket directly from the remote side
  put_object(outfile, bucket = putbucket, key = awskey, secret = awssecret)
}
resolved(futureOf(f1_output)) # Check (without blocking) whether it's ready to transfer
f1_output # Accessing the value blocks until the remote run and upload have finished

# Retrieve the object to local storage
save_object(outfile, bucket = bucketname)
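
From there it’s an ordinary local file. For example (assuming your local credentials come from the same environment variables read above), you can load the results and tidy up the transfer bucket:

# Read the transferred results as usual
results <- read.csv(outfile)

# Optionally, remove the copy sitting in the S3 bucket
delete_object(outfile, bucket = bucketname)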

That’s it! In a future post I’ll show the results of the project that led me down this rabbit hole: remote-animated (and remote-rendered!) .gif/.mp4 visualizations using gganimate.
