How to collect tcp_dumps on a standard (formerly shared) cluster
Important
If you’re working on a dedicated (formerly single-user) cluster, follow the instructions in the Use tcpdump to create pcap files KB article instead.
When working on a standard (formerly shared) access mode cluster, direct access to the Databricks File System (DBFS) fails, so the init script can’t copy the pcap files to a DBFS path directly. This article provides adjusted instructions to accommodate this difference.
Create a volume and the tcp_dumps init script
- Create a volume and provide the path to store the init script.
- Run the following sample init script in a notebook to collect tcp_dumps. The script uses curl to call the DBFS PUT API on the workspace and upload the pcap files. You need to pass the token and the workspace host in the API path in the script. Your DAPI token should be an existing, valid personal access token (PAT) with permission to use the DBFS PUT API (see the verification example after the script).
If you want to filter the tcp_dumps by host and port, uncomment the TCPDUMP_FILTER line in the script and add the required host and port. You can also filter on the host or port separately, depending on your requirements.
dbutils.fs.put("/Volumes/<path-to-init-script>/tcp_dumps.sh", """
#!/bin/bash
set -euxo pipefail
MYIP=$(echo $HOSTNAME)
TMP_DIR="/local_disk0/tmp/tcpdump"
[[ ! -d ${TMP_DIR} ]] && mkdir -p ${TMP_DIR}
TCPDUMP_WRITER="-w ${TMP_DIR}/trace_%Y%m%d_%H%M%S_${DB_CLUSTER_ID}_${MYIP}.pcap -W 1000 -G 900 -Z root -U -s256"
TCPDUMP_PARAMS="-nvv -K"
TCPDUMP_FILTER="" ## no filter by default (avoids an unbound variable error under set -u); add a host/port filter here based on the requirement, for example:
#TCPDUMP_FILTER="host xxxxxxxxx.dfs.core.windows.net and port 443"
sudo tcpdump $(echo "${TCPDUMP_WRITER}") $(echo "${TCPDUMP_PARAMS}") $(echo "${TCPDUMP_FILTER}") &
echo "Started tcpdump $(echo "${TCPDUMP_WRITER}") $(echo "${TCPDUMP_PARAMS}") $(echo "${TCPDUMP_FILTER}")"
cat > /tmp/copy_stats.sh << 'EOF'
#!/bin/bash
TMP_DIR=$1
DB_CLUSTER_ID=$2
COPY_INTERVAL_IN_SEC=45
MYIP=$(echo $HOSTNAME)
echo "Starting copy script at `date`"
DEST_DIR="/Volumes/main/default/jar/"
#mkdir -p ${DEST_DIR}
sleep_duration=45
log_file="/tmp/copy_stats.log"
touch $log_file
declare -gA file_sizes
## logic to copy files by checking previous size. Uses associative array to persist rotated files size.
while true; do
sleep ${COPY_INTERVAL_IN_SEC}
#ls -ltr ${DEST_DIR} > $log_file
for file in $(find "$TMP_DIR" -type f -mmin -3 ); do
current_size=$(stat -c "%s" "$file")
file_name=$(basename "$file")
last_size=${file_sizes["$file_name"]}
if [ "$current_size" != "$last_size" ]; then
echo "Copying $file with current size: $current_size and last size: $last_size at `date`" | tee -a $log_file
DBFS_PATH="dbfs:/FileStore/tcpdumpfolder/${DB_CLUSTER_ID}/trace_$(date +"%Y-%m-%d--%H-%M-%S")_${DB_CLUSTER_ID}_${MYIP}.pcap"
curl -vvv -F contents=@$file -F path="$DBFS_PATH" -H "Authorization: Bearer <your-dapi-token>" https://<your-databricks-workspace-url>/api/2.0/dbfs/put 2>&1 | tee -a $log_file
#cp --verbose "$file" "$DEST_DIR" | tee -a $log_file
echo "done Copying $file with current size: $current_size at `date`" | tee -a $log_file
file_sizes[$file_name]=$current_size
else
echo "Skip Copying $file with current size: $current_size and last size: $last_size at `date`" | tee -a $log_file
fi
done
done
EOF
chmod a+x /tmp/copy_stats.sh
/tmp/copy_stats.sh $TMP_DIR $DB_CLUSTER_ID & disown
""", True)
Note the volume path to the init script. You will need it when configuring your standard access mode cluster.
Add the init script to the allowlist
Follow the instructions to add the init script to the allowlist in the Allowlist libraries and init scripts on compute with standard access mode (formerly shared access mode) (AWS | Azure | GCP) documentation.
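If you prefer to script this step rather than use Catalog Explorer, the Unity Catalog artifact allowlists REST API can add the volume path for you. The following call is a sketch that assumes you are a metastore admin (or hold the MANAGE ALLOWLIST privilege); the token, workspace URL, and volume path are placeholders.
# Sketch: allowlist the volume directory that holds tcp_dumps.sh for init scripts.
# PREFIX_MATCH allowlists everything under the given volume path.
curl -X PUT \
  -H "Authorization: Bearer <your-dapi-token>" \
  -H "Content-Type: application/json" \
  https://<your-databricks-workspace-url>/api/2.1/unity-catalog/artifact-allowlists/INIT_SCRIPT \
  -d '{"artifact_matchers": [{"artifact": "/Volumes/<path-to-init-script>", "match_type": "PREFIX_MATCH"}]}'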
Configure the init script
- Follow the instructions to configure a cluster-scoped init script in the Cluster-scoped init scripts (AWS | Azure | GCP) documentation.
- Specify the volume path to the init script. Use the same path that you used in the preceding script (/Volumes/<path-to-init-script>/tcp_dumps.sh). The same path appears in the verification example after this list.
- After configuring the init script, restart the cluster.
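For reference, you can confirm the configuration with the Databricks CLI. This is a sketch that assumes a configured CLI; depending on your CLI version the cluster ID may need to be passed as --cluster-id instead, and the volume destination shown is a placeholder matching the path used earlier.
# Inspect the cluster spec to confirm the init script entry.
databricks clusters get <cluster-id>
# Expected init_scripts fragment in the returned cluster JSON:
#   "init_scripts": [
#     { "volumes": { "destination": "/Volumes/<path-to-init-script>/tcp_dumps.sh" } }
#   ]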
Locate the pcap files
Once the cluster has started, it automatically starts creating pcap files containing the recorded network information. Locate the pcap files in the dbfs:/FileStore/tcpdumpfolder/<cluster-id> folder, where <cluster-id> is the ID of the cluster running the init script.
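For example, with the Databricks CLI configured against your workspace, you can list the files the script has uploaded so far; replace <cluster-id> with your cluster ID.
# List the pcap files uploaded for the cluster.
databricks fs ls dbfs:/FileStore/tcpdumpfolder/<cluster-id>/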
Download the pcap files
Download the pcap files from the DBFS path to your local machine for analysis. There are multiple ways to do this; one option is the Databricks CLI. For more information, review the What is the Databricks CLI? (AWS | Azure | GCP) documentation.
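For example, with the Databricks CLI, a recursive copy pulls down the whole cluster folder; the local target directory below is only an example.
# Download all pcap files for the cluster to a local folder for analysis.
databricks fs cp --recursive dbfs:/FileStore/tcpdumpfolder/<cluster-id>/ ./tcpdump-pcaps/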