Using spark-submit to submit a Spark application in a Kerberos environment
spark-submit is a shell script that lets you deploy a Spark application for execution, kill a running application, or request the status of applications. In short, it is the general deployment tool for Spark applications.
How Spark runs an application
- The driver creates a SparkContext.
- The driver requests resources from the cluster manager.
- The cluster manager finds worker nodes, then responds to the Spark driver.
- The Spark driver connects to the worker nodes directly and communicates with them (for example, to send tasks).
So we can see that several components are critical in this submission process: the Spark driver, the cluster manager, and the worker nodes. Spark can use different cluster managers (the pseudo cluster manager, the standalone cluster manager, YARN, Mesos, and Kubernetes) to request resources and schedule tasks for applications. Spark also provides two deployment modes (client and cluster) that determine where the driver, which creates the SparkContext, runs.
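Putting these pieces together, the general form of a spark-submit invocation looks like this (the angle-bracketed names are placeholders to fill in):

```bash
# General form of a spark-submit invocation; <...> marks placeholders
spark-submit \
  --master <master-url> \
  --deploy-mode <client-or-cluster> \
  --class <main-class> \
  <application-jar> \
  [application-arguments...]
```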
Cluster Manager types
The cluster manager, which acquires resources on the cluster, is a critical component of any distributed system.
Pseudo cluster manager
In this non-distributed, single-JVM deployment mode, Spark creates all of the execution components (driver, executor, and master) in the same JVM.
This kind of cluster manager makes it very convenient and easy to test your Spark application.

Run Spark locally with two worker threads:

```bash
--master local[2]
```

Run Spark locally with as many worker threads as there are logical cores on your machine:

```bash
--master local[*]
```
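As a quick way to try local mode interactively, you can also pass the same master URL to spark-shell (assuming Spark's bin directory is on your PATH):

```bash
# Start an interactive Spark shell backed by two local worker threads
spark-shell --master local[2]
```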
Standalone
It is a simple cluster manager provided with Spark. We can use it as a development and testing tool during a project.

```bash
--master spark://hostname:port
```
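If you want to experiment with a standalone cluster, a minimal sketch looks like the following (assuming $SPARK_HOME points at your Spark installation; the hostname and the default master port 7077 are placeholders, and the worker-launch script name varies across Spark versions):

```bash
# Start a standalone master on this machine (port 7077 by default)
$SPARK_HOME/sbin/start-master.sh

# Start a worker process and register it with the master
$SPARK_HOME/sbin/start-slave.sh spark://hostname:7077
```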
Hadoop Yarn
The resource manager in Hadoop 2. It's an excellent choice in an Ambari environment.

```bash
--master yarn
```
Apache Mesos
A general cluster manager that can also run Hadoop MapReduce and service applications.
Kubernetes (K8S)
An open-source system for automating deployment, scaling, and management of containerized applications.
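For reference, the master URL forms for these two managers look like the following (the hostnames and ports are placeholders):

```bash
# Mesos master URL (Mesos masters listen on port 5050 by default)
--master mesos://hostname:5050

# Kubernetes master URL (points at the Kubernetes API server)
--master k8s://https://hostname:443
```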
Deployment mode types
Client
The Spark driver runs on the machine where the job is submitted, outside of the cluster.

```bash
--deploy-mode client
```
Cluster
By contrast with client deployment mode, the Spark driver runs inside the cluster.

```bash
--deploy-mode cluster
```
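To make the difference concrete, here is a sketch of the same hypothetical application submitted in each mode (the class name com.example.Demo and the JAR demo.jar are placeholders):

```bash
# Client mode: the driver runs in this shell session
spark-submit --master yarn --deploy-mode client --class com.example.Demo demo.jar

# Cluster mode: the driver runs inside the cluster;
# spark-submit only reports the application's status
spark-submit --master yarn --deploy-mode cluster --class com.example.Demo demo.jar
```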
Steps for using the spark-submit command to launch a Spark application in Ambari with Kerberos authentication
Get TGT
Before using this tool, we need to get a TGT for the Spark service. Use the kinit command to get a ticket:

```bash
kinit -kt /etc/security/keytabs/spark.headless.keytab spark-bigdata@CONNEXT.COM
```
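You can confirm that the ticket was obtained with klist, which lists the tickets cached for the current user:

```bash
# Show the cached Kerberos tickets; the spark principal should appear
klist
```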
Submit application
Pseudo cluster manager
Notices:
- The asterisk can be replaced by a specific number indicating how many worker threads to use; the asterisk itself means acquiring as many worker threads as there are logical cores on your machine.
- Specify the package path and your application's class name so the cluster manager knows the entry point of your Spark application.
```bash
/usr/hdp/current/spark-client/bin/spark-submit \
  --master local[*] \
  --class com.johnny.demo.SparkDemo \
  /root/spark-demo-1.0.0.jar
```

Yarn-cluster
Notices:
- Specify the entry point of your Spark application, just as in the pseudo-cluster-manager mode.
- Your application JAR and any other files must be uploaded to your cluster (for example, HDFS), not left on the local file system.
```bash
/usr/hdp/current/spark-client/bin/spark-submit \
  --class com.johnny.demo.SparkDemo \
  --master yarn \
  --deploy-mode cluster \
  --files /usr/hdp/current/spark-client/conf/hive-site.xml \
  --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar \
  hdfs://bigdata.com:8020/johnny/oozie/workflow/shell-action/demo1/spark-demo-1.0.0.jar
```
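In cluster mode the driver runs inside YARN, so you track the application with YARN's own tooling; for example (the application ID below is a placeholder for the one spark-submit prints at launch):

```bash
# List YARN applications to find the application ID
yarn application -list

# Fetch the aggregated logs of a finished application (placeholder ID)
yarn logs -applicationId application_1234567890123_0001
```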