Problem
Databricks spark-submit jobs appear to “hang”: for Java and Scala jobs, after the user class’s main method completes (regardless of success or failure); for Python jobs, after the Python script exits. The cluster does not auto-terminate in this case.
Cause
For Java, Scala, and Python programs, Spark does not automatically call System.exit.
For context, the Java Virtual Machine initiates the shutdown sequence in response to one of three events.
- When the number of live non-daemon threads drops to zero for the first time.
- When the Runtime.exit or System.exit method is called for the first time.
- When an external event occurs, such as an interrupt or a signal received from the operating system.
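To illustrate the first condition, here is a minimal, self-contained Scala sketch (not Spark or Databricks code; the object and thread names are hypothetical). A single live non-daemon thread keeps the JVM running after main returns, which is why an explicit System.exit call is needed to end the process.
object JvmShutdownDemo {
  def main(args: Array[String]): Unit = {
    // New threads are non-daemon by default; one live non-daemon thread
    // prevents the JVM shutdown sequence from starting when main returns.
    val worker = new Thread(() => while (true) Thread.sleep(60000))
    worker.setDaemon(false) // explicit for clarity; false is already the default
    worker.start()

    println("main has finished, but the process keeps running")
    // Uncommenting the next line would trigger the shutdown sequence (the second condition):
    // System.exit(0)
  }
}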
Solution
Add a System.exit call to your application so it explicitly shuts down the Java Virtual Machine, using exit code 0 on success.
Examples
Python
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext  # Or otherwise obtain a handle to the SparkContext

runTheRestOfTheUserCode()  # Placeholder for your application logic

# Fall through to exit with code 0 on success. On failure, the uncaught exception never
# reaches this line, so PythonRunner reports a non-zero exit code instead.
sc._gateway.jvm.System.exit(0)

Scala
def main(args: Array[String]): Unit = {
  try {
    runTheRestOfTheUserCode()
  } catch {
    case t: Throwable =>
      try {
        // Log the throwable or error here
      } finally {
        System.exit(1)
      }
  }
  System.exit(0)
}

Long-term fix
You can also track Spark’s long-term fix at SPARK-48547.