Model lineage not showing source Delta tables in the graph for Databricks Runtime 15.3 or above

Load the data with MLflow's mlflow.data.load_delta and log it as a run input.

Written by G Yashwanth Kiran

Last published at: January 28th, 2025

Problem

When you train and register a model using Delta tables in Unity Catalog (UC), the lineage graph appears but does not show the source Delta tables used to train the model.

 

Cause

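Table-to-model lineage is supported in MLflow 2.11.0 and above, which is included in Databricks Runtime 15.3 and above. The source tables appear in the graph only when they are loaded through MLflow and logged as inputs to the run, as described in the solution.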

 

Solution

If you're using Databricks Runtime 15.3 or above, first load the UC Delta tables with mlflow.data.load_delta so that they can appear in the lineage graph.

 

import mlflow
# Replace the placeholders with your three-level UC table names.
train_spark = mlflow.data.load_delta(table_name="<catalog.schema.training-table-name>")
test_spark = mlflow.data.load_delta(table_name="<catalog.schema.test-data-table>")

 

Then, convert the Spark DataFrames to pandas DataFrames so the model can take them as inputs. Create X_train, X_test, y_train, and y_test using the following code.

 

X_train = train_spark.df.toPandas().drop(["<column-to-be-predicted>"], axis=1)
X_test = test_spark.df.toPandas().drop(["<column-to-be-predicted>"], axis=1)
y_train = train_spark.df.select("<column-to-be-predicted>").toPandas()
y_test = test_spark.df.select("<column-to-be-predicted>").toPandas()

 

Finally, inside the MLflow run, log the datasets you loaded with load_delta as inputs.

 

with mlflow.start_run(run_name='untuned_random_forest'):
    ...
    model.fit(X_train, y_train)
    mlflow.log_input(train_spark, "training")
    mlflow.log_input(test_spark, "test")
    ...
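The three steps above can be combined into one function. The following is a minimal sketch, not the article's exact code: the function name train_with_lineage, its parameters, the choice of scikit-learn's RandomForestClassifier, and the test_accuracy metric are all assumptions made for illustration. Logging and registering the model is unchanged from your existing workflow and is omitted here.

```python
def train_with_lineage(train_table, test_table, label_col):
    """Train a model and log both Delta tables as MLflow run inputs.

    Hypothetical helper: the name, parameters, and model type are assumptions.
    """
    # Imports live inside the function so the definition loads even where
    # MLflow or scikit-learn is not installed.
    import mlflow
    from sklearn.ensemble import RandomForestClassifier  # assumed model type

    # load_delta returns an MLflow dataset wrapping a Spark DataFrame (.df).
    train_ds = mlflow.data.load_delta(table_name=train_table)
    test_ds = mlflow.data.load_delta(table_name=test_table)

    # Convert to pandas and split features from the label column.
    train_pdf = train_ds.df.toPandas()
    test_pdf = test_ds.df.toPandas()
    X_train, y_train = train_pdf.drop(columns=[label_col]), train_pdf[label_col]
    X_test, y_test = test_pdf.drop(columns=[label_col]), test_pdf[label_col]

    with mlflow.start_run(run_name="untuned_random_forest"):
        model = RandomForestClassifier()
        model.fit(X_train, y_train)
        # Log the MLflow dataset objects, not the pandas copies.
        mlflow.log_input(train_ds, context="training")
        mlflow.log_input(test_ds, context="test")
        mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
    return model
```

In a notebook you would call it with your own names, for example train_with_lineage("<catalog.schema.training-table-name>", "<catalog.schema.test-data-table>", "<column-to-be-predicted>"). Selecting the label with train_pdf[label_col] yields a one-dimensional pandas Series, which is the target shape scikit-learn estimators expect.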

 

If you do not want to use Databricks Runtime 15.3 or above, first install MLflow version 2.11.0 or above manually, then follow the steps in the previous part of the solution.
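After a manual install, it can help to confirm which MLflow version the environment actually resolves. The following is a small sketch using only the Python standard library. The helper names parse_version and meets_minimum are hypothetical, and the check only compares version numbers against the 2.11.0 minimum stated above. It does not verify that lineage is being captured.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_MLFLOW = (2, 11, 0)  # minimum MLflow version named in this article


def parse_version(text):
    """Return the leading numeric (major, minor, patch) parts as a tuple.

    Pre-release or local suffixes such as "rc1" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"Unrecognized version string: {text!r}")
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(text):
    """True if the version string is at least MIN_MLFLOW."""
    return parse_version(text) >= MIN_MLFLOW


try:
    installed = version("mlflow")  # reads package metadata without importing MLflow
    print(f"MLflow {installed} meets minimum: {meets_minimum(installed)}")
except PackageNotFoundError:
    print("MLflow is not installed in this environment")
```

If the check reports a version below the minimum, the manually installed package is probably not the one the notebook's Python process is using.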