Spark submit (2.3) on kubernetes cluster from Python

amza Source

So now that k8s is integrated directly with Spark in 2.3, my spark-submit from the console executes correctly on a Kubernetes master without any Spark master pods running; Spark handles all the k8s details:

spark-submit \
  --deploy-mode cluster \
  --class com.app.myApp \
  --master k8s://https://myCluster.com \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.app.name=myApp \
  --conf spark.executor.instances=10 \
  --conf spark.kubernetes.container.image=myImage \
  local:///myJar.jar

What I am trying to do is run a spark-submit via AWS Lambda against my k8s cluster. Previously I submitted the job via the Spark master REST API directly (without Kubernetes):

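import json
import requests

# `parameters` is the submission payload (a dict) for the standalone master's REST API
request = requests.Request(
    'POST',
    "http://<master-ip>:6066/v1/submissions/create",
    data=json.dumps(parameters))
prepared = request.prepare()
session = requests.Session()
response = session.send(prepared)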

And it worked. Now I want to integrate Kubernetes and do it similarly, where I submit an API request to my Kubernetes cluster from Python and have Spark handle all the k8s details, ideally something like:

request = requests.Request(
    'POST',
    "k8s://https://myK8scluster.com:443",
    data=json.dumps(parameters))

Is it possible in the Spark 2.3/Kubernetes integration?

python apache-spark kubernetes aws-lambda

Answers

answered 6 months ago Anton Kostenko #1

I'm afraid that is impossible with Spark 2.3 if you are using the native Kubernetes support.

Based on the description in the deployment instructions, the submission process consists of several steps:

  1. Spark creates a Spark driver running within a Kubernetes pod.
  2. The driver creates executors which are also running within Kubernetes pods and connects to them, and executes application code.
  3. When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.

So, in fact, there is nowhere to submit a job until you start the submission process, which launches the first Spark pod (the driver) for you. And after the application completes, everything is terminated.

Because running a fat container on AWS Lambda is not the best solution, and because there is no straightforward way to run arbitrary commands in the container itself (it is possible, but only with a hack; here is a blueprint for executing Bash inside an AWS Lambda), the simplest way is to write a small custom service that runs on a machine outside of AWS Lambda and provides a REST interface between your application and the spark-submit utility. I don't see any other way to do it without pain.
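To make the idea of such a service a bit more concrete, here is a minimal sketch using only the Python standard library. It is an illustration under assumptions, not a production implementation: the JSON field names (mainClass, appResource, sparkProperties, appArgs), the default master URL (copied from the question), and the port are all hypothetical choices, and it assumes spark-submit is on the PATH of the machine running the service. It also has no authentication, which a real deployment would need.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumed default, copied from the spark-submit command in the question.
K8S_MASTER = "k8s://https://myCluster.com"


def build_spark_submit_command(params):
    """Translate a JSON-style dict into a spark-submit argument list."""
    cmd = [
        "spark-submit",
        "--deploy-mode", "cluster",
        "--master", params.get("master", K8S_MASTER),
        "--class", params["mainClass"],
    ]
    # Each Spark property becomes one "--conf key=value" pair.
    for key, value in sorted(params.get("sparkProperties", {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(params["appResource"])
    cmd += [str(arg) for arg in params.get("appArgs", [])]
    return cmd


class SubmitHandler(BaseHTTPRequestHandler):
    """Accepts a POST with a JSON body and starts spark-submit for it."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            params = json.loads(self.rfile.read(length))
            cmd = build_spark_submit_command(params)
        except (ValueError, KeyError, AttributeError) as exc:
            self._reply(400, {"error": str(exc)})
            return
        try:
            # Do not wait for completion; spark-submit creates the driver pod.
            process = subprocess.Popen(cmd)
        except OSError as exc:
            self._reply(500, {"error": str(exc)})
            return
        self._reply(202, {"pid": process.pid, "command": cmd})

    def _reply(self, status, body):
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Run the service until interrupted (port is an arbitrary choice)."""
    HTTPServer(("", port), SubmitHandler).serve_forever()
```

Calling serve() on a machine that can reach the cluster starts the service; the Lambda function could then POST its JSON payload to it with requests, much like the old call to the standalone master's REST endpoint. The command is passed to subprocess as a list rather than a shell string, so values from the request are not interpreted by a shell.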
