Apache Spark has a configurable metrics system that supports a number of sinks, including CSV files.
This article shows you how to configure a Databricks cluster to use a CSV sink and persist the metrics to a DBFS location.
Create an init script
All of the configuration is done in an init script.
The init script does the following three things:
- Configures the cluster to generate CSV metrics on both the driver and the worker.
- Writes the CSV metrics to a temporary, local folder.
- Uploads the CSV metrics from the temporary, local folder to the chosen DBFS location.
Customize the sample code and then run it in a notebook to create the init script in DBFS.
Sample code to create an init script:
%python
dbutils.fs.put("/<init-path>/metrics.sh","""
#!/bin/bash
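# Create the local folder where the CSV sink writes metrics.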
mkdir /tmp/csv
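# Configure the CSV sink for the cluster manager (master/worker) metrics.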
sudo bash -c "cat <<EOF >> /databricks/spark/dbconf/log4j/master-worker/metrics.properties
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
spark.metrics.staticSources.enabled true
spark.metrics.executorMetricsSource.enabled true
spark.executor.processTreeMetrics.enabled true
spark.sql.streaming.metricsEnabled true
master.source.jvm.class org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class org.apache.spark.metrics.source.JvmSource
*.sink.csv.period 5
*.sink.csv.unit seconds
*.sink.csv.directory /tmp/csv/
worker.sink.csv.period 5
worker.sink.csv.unit seconds
EOF"
sudo bash -c "cat <<EOF >> /databricks/spark/conf/metrics.properties
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
spark.metrics.staticSources.enabled true
spark.metrics.executorMetricsSource.enabled true
spark.executor.processTreeMetrics.enabled true
spark.sql.streaming.metricsEnabled true
driver.source.jvm.class org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class org.apache.spark.metrics.source.JvmSource
*.sink.csv.period 5
*.sink.csv.unit seconds
*.sink.csv.directory /tmp/csv/
worker.sink.csv.period 5
worker.sink.csv.unit seconds
EOF"
cat <<'EOF' >> /tmp/asynccode.sh
#!/bin/bash
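# Derive the cluster ID from the hostname and record this node's IP address.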
DB_CLUSTER_ID=$(echo $HOSTNAME | awk -F '-' '{print$1"-"$2"-"$3}')
MYIP=$(hostname -I)
if [[ ! -d /dbfs/<metrics-path>/${DB_CLUSTER_ID}/metrics-${MYIP} ]] ; then
sudo mkdir -p /dbfs/<metrics-path>/${DB_CLUSTER_ID}/metrics-${MYIP}
fi
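# Copy any locally written CSV metrics to DBFS every five seconds.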
while true; do
if [ -d "/tmp/csv" ]; then
sudo cp -r /tmp/csv/* /dbfs/<metrics-path>/$DB_CLUSTER_ID/metrics-$MYIP
fi
sleep 5
done
EOF
chmod a+x /tmp/asynccode.sh
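# Run the copy helper in the background so it keeps running after the init script finishes.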
/tmp/asynccode.sh & disown
""", True)Replace <init-path> with the DBFS location you want to use to save the init script.
Replace <metrics-path> with the DBFS location you want to use to save the CSV metrics.
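Before moving on, you can optionally confirm that the init script was written to DBFS. The following notebook cell is a minimal check that assumes the same <init-path> placeholder used above.
%python
# Optional check: confirm the init script exists at the configured DBFS location
# and preview its contents.
display(dbutils.fs.ls("/<init-path>/"))
print(dbutils.fs.head("/<init-path>/metrics.sh"))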
Cluster-scoped init script
Once you have created the init script, you must configure it as a cluster-scoped init script on your cluster.
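You can attach the script in the cluster configuration UI (Advanced Options > Init Scripts) or through the Clusters API. The following is a minimal sketch of an API call, assuming a DBFS-backed init script at the <init-path> location used above; <databricks-instance>, <token>, <cluster-id>, and the remaining cluster fields are placeholders that must match your workspace and the cluster's existing configuration.
%python
import requests

# Minimal sketch: attach the init script to an existing cluster through the Clusters API.
# clusters/edit requires the cluster's current settings (spark_version, node_type_id,
# and so on) in addition to the init_scripts entry shown here.
resp = requests.post(
    "https://<databricks-instance>/api/2.0/clusters/edit",
    headers={"Authorization": "Bearer <token>"},
    json={
        "cluster_id": "<cluster-id>",
        "spark_version": "<spark-version>",
        "node_type_id": "<node-type-id>",
        "num_workers": 2,
        "init_scripts": [
            {"dbfs": {"destination": "dbfs:/<init-path>/metrics.sh"}}
        ],
    },
)
resp.raise_for_status()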
Verify that CSV metrics are correctly written
Restart your cluster and run a sample job.
Check the DBFS location that you configured for CSV metrics and verify that they were correctly written.
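For example, the following notebook cell runs a small aggregation to generate metrics and then lists the metrics location. It is a minimal sketch that assumes the <metrics-path> placeholder used in the init script; you should see one folder per cluster ID, each containing metrics-<ip> subfolders of CSV files.
%python
# Run a small sample job so the driver and executors emit metrics.
spark.range(0, 10000000).selectExpr("id % 100 AS key").groupBy("key").count().show()

# List the metrics location; expect one folder per cluster ID with metrics-<ip> subfolders.
display(dbutils.fs.ls("/<metrics-path>/"))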