How to Effectively Visualize and Communicate Uncertainty with Spaghetti Plots using Altair

Visualizing Uncertainty with Spaghetti Plots in Altair

Dylan | Jan 25, 2021

Data visualization is one of the most powerful communication tools in a data scientist's toolkit. When done correctly, complicated mathematical relationships can be represented intuitively in a visual form. Unfortunately, when visualizing predictions, a major challenge is finding an effective way to communicate uncertainty. Attempts to convey confidence intervals or other measures of uncertainty risk being misinterpreted by our audience. Consider the following forecast of the path of an approaching hurricane.

A data visualization forecasting the possible paths that a hurricane may take by shading the entire possible region space blue (Source: NPR)

While the blue region represents any possible predicted position of the hurricane, it's easy to understand how some viewers may misinterpret this encoding of area to represent the hurricane's size. Additionally, instead of correctly interpreting the black line as the most likely path according to the model, viewers may infer that it is in fact the deterministic path of the hurricane. When probabilistic encodings are mistaken for deterministic ones, it is known as a Deterministic Construal Error. Unfortunately, these types of errors are very common but can sometimes be avoided if we keep an eye out for them.

Spaghetti Plots

One powerful graph that can help our audiences correctly interpret the uncertainty in our predictions is the spaghetti plot. The spaghetti plot works by drawing many samples from our predictive model's space of possible outcomes and plotting each sample individually as a line. Collectively, these samples convey many possible forecasts, resembling… you guessed it, delicious spaghetti! Sticking to the hurricane prediction example above, let's explore how spaghetti plots can help reduce deterministic construal errors.

A data visualization forecasting the possible paths that a hurricane may take using a spaghetti plot (Source: CNN)

This plot eliminates much of the confusion about the significance of the encodings in the visualization. It's clear that the hurricane may take one of several paths, and in general the audience can correctly reason that a single spaghetto passing through the middle of all the spaghetti is a more likely path than one that diverges strongly from the rest of the group.
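To make the sampling idea concrete, here is a minimal sketch that uses simulated data rather than a real hurricane model. The random-walk model, its drift and noise values, and the function name simulate_tracks are all assumptions made for illustration. Each row of the resulting array would be drawn as one strand of spaghetti.

```python
import numpy as np

def simulate_tracks(n_tracks=50, n_steps=24, seed=0):
    """Draw sample paths from a hypothetical random-walk forecast model.

    Returns an array of shape (n_tracks, n_steps + 1, 2) holding one
    (x, y) track per sample, all starting at the origin.
    """
    rng = np.random.default_rng(seed)
    drift = np.array([1.0, 0.5])  # assumed average movement per time step
    steps = drift + rng.normal(scale=0.4, size=(n_tracks, n_steps, 2))
    start = np.zeros((n_tracks, 1, 2))
    return np.concatenate([start, np.cumsum(steps, axis=1)], axis=1)

tracks = simulate_tracks()

# The pointwise median gives a rough "middle of the spaghetti" track.
median_track = np.median(tracks, axis=0)

# Distance of each track's end point from the median end point:
# small values are strands in the dense middle, large values are outliers.
end_spread = np.linalg.norm(tracks[:, -1] - median_track[-1], axis=1)
```

In this toy model the strands fan out over time, so the plot would look tight near the starting point and increasingly loose further along the forecast.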

Now that we understand what spaghetti plots are, let's see how to plot them using the Python visualization library Altair.

Creating Spaghetti Plots with Altair

Let's begin by preparing our data. We'll be using the Boston Housing Dataset available through sklearn.datasets. Let's focus on the relationship between LSTAT (the percentage of the population classed as lower status, defined as the average of the proportion of adults without some high school education and the proportion of male workers classified as laborers) and housing price. We'll first draw a simple scatter plot to get a feel for the relationship.

import numpy as np
import pandas as pd
import altair as alt
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import needs an older scikit-learn version.
from sklearn.datasets import load_boston

boston = load_boston()

# keep only the LSTAT feature and pair it with the target price
housing_X = pd.DataFrame(boston.data, columns=boston.feature_names)[['LSTAT']]
housing_y = pd.Series(boston.target, name='price')
housing_df = pd.concat([housing_y, housing_X], axis=1)

# show LSTAT versus price
alt.Chart(housing_df).mark_point().encode(
    x='LSTAT',
    y='price'
)

A simple plot to visualize the relationship between LSTAT and price. Appears to follow a negative nonlinear relationship.

Great, the relationship between these two variables appears to be negative (as LSTAT increases, price decreases) and nonlinear (following a curve). Let's make a simple polynomial model with a degree of 2 using the sklearn library. First, we'll create an evenly spaced vector (np.linspace) between the minimum and maximum LSTAT values so each predicted line will produce a smooth visualization. Next, we'll use bootstrapping: we resample our dataset with replacement, keeping the same number of rows, and fit a polynomial model to each resample. Each bootstrap sample will result in one spaghetto, and for this example we'll generate 50 of them. Finally, we'll plot the original scatter plot overlaid with each of our bootstrapped polynomial model predictions to create our final spaghetti plot.

from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures

# evenly spaced LSTAT values at which every fitted model is evaluated
poly_pred_grid = pd.DataFrame({
    "x": np.linspace(housing_X['LSTAT'].min(), housing_X['LSTAT'].max(), num=101)
})
poly_degree = 2

def get_housing_points_chart():
    # scatter plot of the original data
    return alt.Chart(housing_df).mark_point().encode(
        x='LSTAT',
        y='price'
    )

def get_one_bootstrap_housing_fit():
    # resample the full dataset with replacement (one bootstrap sample)
    resampled_df = housing_df.sample(frac=1.0, replace=True)

    X = resampled_df[['LSTAT']]
    y = resampled_df['price']

    # expand LSTAT into polynomial terms, then fit ordinary least squares
    polynomial_features = PolynomialFeatures(degree=poly_degree)
    X_poly = polynomial_features.fit_transform(X)
    poly_reg = linear_model.LinearRegression()
    poly_reg.fit(X_poly, y)

    return poly_reg

def get_housing_fit_chart(poly_reg, opacity):
    # predict over the grid and draw the result as one faint line
    polynomial_features = PolynomialFeatures(degree=poly_degree)
    pred_df = pd.DataFrame({
        'x': poly_pred_grid['x'],
        'y': poly_reg.predict(polynomial_features.fit_transform(poly_pred_grid))
    })

    return alt.Chart(pred_df).mark_line(
        opacity=opacity,
        color='red'
    ).encode(
        x='x',
        y='y'
    )

# one chart per bootstrap fit, layered on top of the scatter plot
charts = [get_housing_fit_chart(get_one_bootstrap_housing_fit(), opacity=0.1) for _ in range(50)]
get_housing_points_chart() + alt.layer(*charts)

Our final spaghetti plot showing 50 bootstrapped polynomial fits of degree 2 for the given data set.

Again, a viewer can correctly assume that a line down the middle of all of these spaghetti strands is the most likely predicted outcome for a polynomial model with a degree of 2. Notice that because the data at the tails of our dataset are more sparse than the data in the center, the model's predictions at the tails remain much less certain than the thick cluster of overlapping lines passing through the center.
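The visual impression that the strands fan out at the tails can also be checked numerically. The sketch below measures the standard deviation of the bootstrapped predictions at each grid point. It is an illustration only: it uses np.polyfit in place of the sklearn model, synthetic right-skewed data in place of the Boston dataset, and a hypothetical helper named bootstrap_spread.

```python
import numpy as np

def bootstrap_spread(x, y, degree=2, n_boot=200, num=101, seed=0):
    """Return (grid, spread): the pointwise standard deviation of
    bootstrapped polynomial predictions along an evenly spaced grid."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    grid = np.linspace(x.min(), x.max(), num=num)
    preds = np.empty((n_boot, num))
    for i in range(n_boot):
        # resample indices with replacement, same size as the data
        idx = rng.integers(0, len(x), size=len(x))
        coeffs = np.polyfit(x[idx], y[idx], deg=degree)
        preds[i] = np.polyval(coeffs, grid)
    return grid, preds.std(axis=0)

# Synthetic data with most points at low x and a sparse right tail (an assumption)
rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=5.0, size=300) + 2
y = 40 - 1.8 * x + 0.03 * x**2 + rng.normal(scale=4, size=300)

grid, spread = bootstrap_spread(x, y)

# Compare the spread near the bulk of the data with the spread at the sparse tail
center = np.argmin(np.abs(grid - np.median(x)))
print(f"spread near the median x: {spread[center]:.2f}")
print(f"spread at the largest x:  {spread[-1]:.2f}")
```

A larger value at the tail than near the median corresponds to the looser bundle of strands seen at the edges of the spaghetti plot.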