Spark: Skip Missing S3 Files

Teliov

Is it possible to configure Spark (version 2.3.1) to skip missing S3 files? Right now it throws an org.apache.hadoop.mapred.InvalidInputException when any of the input paths does not exist.

In the latest version of Spark there is a configuration option that makes this easy. How can I achieve the same in older versions that do not have this option yet? This is how I'm reading the inputs, where csvFiles is an array of S3-hosted CSV file paths:

val filesRdd = sparkContext.textFile(csvFiles.mkString(","))
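One possible workaround, sketched here under the assumption that the Hadoop FileSystem API is reachable from the driver (it ships with Spark) and that the paths use a filesystem scheme the cluster is configured for (e.g. s3a://). The helper name keepExisting is hypothetical; the idea is simply to drop non-existent paths before handing the list to textFile:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: keep only the paths that actually exist.
// Runs on the driver, so it costs one existence check per path.
def keepExisting(paths: Seq[String], hadoopConf: Configuration): Seq[String] =
  paths.filter { p =>
    // Resolve the FileSystem for each URI (s3a://bucket/key), then probe it.
    val fs = FileSystem.get(new URI(p), hadoopConf)
    fs.exists(new Path(p))
  }

// Usage sketch: filter first, then build the RDD as before.
val existing = keepExisting(csvFiles, sparkContext.hadoopConfiguration)
val filesRdd = sparkContext.textFile(existing.mkString(","))
```

For large path lists the per-path existence checks add S3 round trips, so batching by bucket or listing a common prefix may be cheaper, but the filter-then-read pattern avoids the InvalidInputException without touching Spark configuration.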
