See the solution

# best RF
rf_model = rf_best
# best GB
gb_model = gb_model_final

Training a model is only half the work. Before drawing any conclusions or deploying to production, you need to rigorously assess its quality: how large are the errors, are they evenly distributed, which features drive the predictions, and how does the model compare to alternatives? This section covers the full evaluation process for ensemble models, applied to the Random Forest and Gradient Boosting models trained in the previous exercises.
Before evaluating the trained models, let's review the metrics and plots used to measure the quality of predictions.
Regression metrics assess the quality of a model’s predictions by comparing predicted and observed values. Each metric offers a different perspective, depending on whether the goal is:
- to penalize extreme errors;
- to obtain an interpretable measure in the target’s unit;
- to quantify the share of variance explained by the model.

The right metric depends on your objective and your data. From an organizational point of view, this means the business team, with its in-depth knowledge of the field, is by far the best placed to decide which metrics are relevant. The decision shouldn’t be the sole responsibility of the data science team.
Mean Absolute Error (MAE) : \[\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|\] MAE gives equal weight to all deviations, regardless of their magnitude and sign. In the presence of extreme values in the target (as is common in real-estate prices), and in comparison with the following metrics, MAE may be a more representative measure of typical prediction error because it is less affected by extreme errors. Moreover, MAE is easy to interpret: in the prediction of price per square meter in euros, an MAE of 500 means the model’s predictions deviate from actual prices by 500 €/m² on average.
Mean Squared Error (MSE) : \[\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\] MSE is the most commonly used evaluation metric. Because the errors are squared, large deviations contribute disproportionately: an error of 200 weighs 4 times more than an error of 100. This makes MSE sensitive to outliers in the test set. MSE’s interpretability is limited, because its unit is the square of the target’s unit.
Root Mean Squared Error (RMSE) : \[\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\] RMSE is simply the square root of the MSE, which restores the original unit of the target variable. RMSE is the most widely reported metric for regression and is directly comparable across models evaluated on the same test set. For example, in the prediction of price per square meter in euros, an RMSE of 500 means the model’s predictions deviate from actual prices by 500 €/m² on average — weighted towards larger errors.
The key trade-off between RMSE and MAE: RMSE penalizes large errors more heavily, which is appropriate when big mistakes are disproportionately costly, while MAE weights all errors equally and is therefore more robust to outliers.
Note: you can compare different models based on their MAE, MSE and RMSE, but only if the metrics are computed on exactly the same test set.
Mean Absolute Percentage Error (MAPE) : \[\text{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100\] MAPE expresses errors as a percentage of the actual value, making it scale-free and immediately interpretable for non-technical stakeholders. However, MAPE is undefined when \(y_i = 0\) and penalizes errors asymmetrically: overpredictions can incur a penalty above 100%, while underpredictions are capped at 100%, so minimizing MAPE tends to favour models that underpredict. For real-estate prices, where the target is always strictly positive, MAPE is a natural complement to RMSE. For example, a MAPE of 12% means the model is off by 12% of the actual price on average.
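The asymmetry is easy to check on hypothetical values: predicting double the actual value and predicting half of it are multiplicative errors of the same size, yet MAPE penalizes them differently. Note that scikit-learn returns a fraction, not a percentage, hence the × 100 used in the solutions below.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

y_true = np.array([100.0])

# Overprediction by a factor of 2 vs underprediction by a factor of 2:
mape_over = mean_absolute_percentage_error(y_true, np.array([200.0]))  # 1.0 -> 100 %
mape_under = mean_absolute_percentage_error(y_true, np.array([50.0]))  # 0.5 ->  50 %

print(mape_over, mape_under)
```

An underprediction can never exceed 100% (a prediction of 0 gives exactly 100%), while an overprediction is unbounded.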
Coefficient of Determination (R²) : \[R^2 = 1 - \frac{\text{SS}_\text{res}}{\text{SS}_\text{tot}} = 1 - \frac{\displaystyle\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\displaystyle\sum_{i=1}^{n}(y_i - \bar{y})^2}\] where \(\bar{y}\) is the mean of the actual values.
R² measures the proportion of variance in the target that is explained by the model. It is scale-free: an R² of 1 means perfect predictions, an R² of 0 means the model does no better than always predicting the mean, and negative values mean it does worse.
Unlike RMSE and MAE, R² allows comparison across different datasets and targets. Its main limitation is that it can be inflated by adding features, even irrelevant ones.
Scalar metrics alone are insufficient: two models can share the same RMSE while exhibiting very different error patterns. Diagnostic plots reveal the structure of residuals and provide insights that numbers cannot. They also show whether the model performs well where you need it to: if your aim is good results in the center of the distribution, you can live with worse results on extreme values (or the opposite).
Residuals distribution : it is a histogram of the residuals \(e_i = y_i - \hat{y}_i\). For a well-behaved regression model, this distribution should be:
- centered on zero (no systematic bias);
- roughly symmetric;
- free of heavy tails or secondary modes.
For this project, the target was log-transformed before training to reduce right-skewness. The residual distribution lets you verify that this transformation was effective in stabilizing the error pattern.
QQ plot (quantile-quantile) : it compares empirical quantiles of \(y_\text{test}\) and \(\hat{y}\) by plotting them against each other. If the model’s predicted distribution perfectly matches the actual distribution, all points fall on the diagonal.
Deviations from the diagonal reveal distributional mismatches: points below the diagonal in the upper tail mean high values are underpredicted, points above it mean they are overpredicted, and an overall flattening of the curve signals a compressed predicted distribution (regression to the mean).
The QQ plot is complementary to the residuals histogram: it focuses on the overall distributional alignment rather than individual errors.
Target distribution : it is the plot of the sorted values of a series against their percentile rank, producing a cumulative distribution curve. Comparing the curves for \(y_\text{test}\) and \(\hat{y}\) side by side reveals systematic differences in location, spread, or shape:
- a vertical shift between the curves indicates a constant bias;
- a flatter predicted curve indicates compressed predictions (regression to the mean);
- diverging tails indicate that extreme values are poorly reproduced.
Permutation feature importance (for RF) : with scikit-learn’s permutation_importance, it measures the contribution of each feature to model performance. The principle: for each feature, the values are randomly shuffled across observations, breaking the relationship between that feature and the target, and the resulting drop in model performance (measured by the chosen metric) is recorded. This is repeated 5 times and the results are averaged for stability.
Permutation importance has several advantages over the default impurity-based importance available in tree models:
- it can be computed on held-out data, so it reflects generalization rather than training-set fit;
- it is not biased towards high-cardinality or continuous features;
- it is expressed in terms of the actual evaluation metric, making it directly interpretable.
Features with near-zero or negative permutation importance can usually be removed with little or no loss in predictive performance. A high importance score confirms that a feature carries genuine predictive signal.
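As a sketch of how that pruning might look (the feature names and scores below are entirely hypothetical, not results from the project):

```python
import pandas as pd

# Hypothetical permutation-importance scores (mean performance drop per feature):
importances = pd.Series({
    "surface": 0.42,
    "location": 0.31,
    "rooms": 0.08,
    "floor": 0.002,
    "noise_col": -0.001,   # shuffling it slightly *improved* the score
})

# Keep only features whose importance is clearly positive:
threshold = 0.001
selected = importances[importances > threshold].index.tolist()
print(selected)  # ['surface', 'location', 'rooms', 'floor']
```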
In this exercise, you will generate predictions from the best RF and GB models and compute all four evaluation metrics on the test set.
Import root_mean_squared_error, mean_absolute_error and r2_score from sklearn.metrics. You can see the documentation of sklearn.metrics here.
You can reuse the print_metrics function written in the last section and apply it to the GB model.
from sklearn.metrics import root_mean_squared_error, mean_absolute_error, r2_score

def print_metrics(model, split, X=X_train, y=y_train):
    """
    Print metrics for trained model
    """
    y_pred = model.predict(X)
    rmse = root_mean_squared_error(y, y_pred)
    mae = mean_absolute_error(y, y_pred)
    r2 = r2_score(y, y_pred)
    print(f"{split} — RMSE: {rmse:.2f} | MAE: {mae:.2f} | R²: {r2:.4f}")

models = [("RF", rf_model), ("GB", gb_model)]
for name, model in models:
    print_metrics(model, name, X_test, y_test)

Which model would you choose, and why?
Use mean_absolute_percentage_error from sklearn.metrics. Make sure y_test contains no zero values before dividing. You can see the documentation here.
# Available from scikit-learn >= 0.24
from sklearn.metrics import mean_absolute_percentage_error

# Predictions on the test set for both models
y_pred_RF = rf_model.predict(X_test)
y_pred_GB = gb_model.predict(X_test)

mape_pct_rf = mean_absolute_percentage_error(y_test, y_pred_RF) * 100
mape_pct_gb = mean_absolute_percentage_error(y_test, y_pred_GB) * 100
print(f"MAPE RF: {mape_pct_rf:.2f} %")
print(f"MAPE GB: {mape_pct_gb:.2f} %")

In this exercise, you will produce four diagnostic plots: a residuals distribution, a QQ-plot comparing actual and predicted quantiles, the target distribution and a feature importance plot. For each plot, take a moment to interpret what you observe.
Write a residuals_distribution(residuals, rmse) function to plot the histogram of the residuals. It takes residuals (a pandas Series) and rmse (a float), along with ax, label and color parameters for overlay plotting. Use the matplotlib.pyplot module to plot the distribution. You can read the documentation here.
import pandas as pd
import matplotlib.pyplot as plt

def residuals_distribution(residuals: pd.Series, rmse: float, ax=None, label=None, color=None):
    if ax is None:
        fig, ax = plt.subplots()
    ax.hist(residuals, bins=100, edgecolor="none", alpha=0.5, label=label or f"RMSE = {rmse:.3f}", color=color)
    ax.axvline(0, color="red", linestyle="--")
    ax.set_xlabel("Residual")
    ax.set_ylabel("Frequency")
    ax.set_title("Residuals distribution")
    ax.legend()
    return ax

Pass the residuals series and the RMSE value computed in Exercise 8; the RMSE appears in the legend label. Call plt.show() to display the figure.
Write a QQplot(y_test, y_pred) function which computes 1000 quantile points for both series and plots them against each other. The parameters are y_test and y_pred, pandas Series of observed and predicted targets, along with the same overlay-plotting parameters as in question 1. The function should return the plot.

import numpy as np

def QQplot(y_test: pd.Series, y_pred: pd.Series, ax=None, label=None, color=None):
    """
    Actual quantiles vs predicted quantiles
    """
    quantiles = np.linspace(0, 100, 1000)
    q_real = np.percentile(y_test, quantiles)
    q_predict = np.percentile(y_pred, quantiles)
    if ax is None:
        fig, ax = plt.subplots()
    ax.scatter(q_real, q_predict, alpha=0.5, s=5, label=label or "Quantiles", color=color)
    ax.plot(
        [q_real[0], q_real[-1]],
        [q_real[0], q_real[-1]],
        "r--", linewidth=1.5
    )
    ax.set_xlabel("Actual quantiles")
    ax.set_ylabel("Predicted quantiles")
    ax.set_title("QQ-plot: actual vs predicted")
    ax.legend()
    return ax

Points well below the diagonal in the upper tail indicate that the model underestimates high prices.
How does the gradient boosting model compare visually with the random forest?
Write a target_distribution() function to plot the distribution of the parameter y, a pandas Series. Call it twice: once for y_test and once for y_pred. Display both figures. Compare the shapes: a compressed predicted distribution indicates regression to the mean, while a systematic vertical shift indicates a constant bias.
fig_actual = target_distribution(y_test)
plt.title("Target distribution — actual values")
plt.show()

fig_pred = target_distribution(y_pred_RF)
plt.title("Target distribution — predicted values with RF model")
plt.show()

fig_pred = target_distribution(y_pred_GB)
plt.title("Target distribution — predicted values with GB model")
plt.show()

If you want to draw both distributions on a single figure, you can adapt the function:
def plot_combined_distribution(y_test: pd.Series, y_pred: pd.Series, ax=None, label=None, color=None, show_actual=True):
    """
    Plots the target distributions of actual and predicted values on the same graph.
    """
    if ax is None:
        fig, ax = plt.subplots()
    if show_actual:
        y_sorted_actual = np.sort(y_test)
        axe_actual = np.linspace(0, 100, len(y_sorted_actual))
        ax.plot(axe_actual, y_sorted_actual, label="Actual Values", color="black")
    y_sorted_pred = np.sort(y_pred)
    axe_pred = np.linspace(0, 100, len(y_sorted_pred))
    ax.plot(axe_pred, y_sorted_pred, label=label or "Predicted Values", color=color)
    ax.set_xlabel("Percentile")
    ax.set_ylabel("Price")
    ax.set_title("Target distribution — actual vs predicted values")
    ax.legend()
    return ax
fig, ax = plt.subplots()
plot_combined_distribution(y_test, y_pred_RF, ax=ax, label="Random Forest", color="steelblue", show_actual=True)
plot_combined_distribution(y_test, y_pred_GB, ax=ax, label="Gradient Boosting", color="darkorange", show_actual=False)
plt.show()

Write a calculate_importance(X_test, y_test, RANDOM_STATE, rf_best, SCORING) function which returns a sorted Series of importance scores with feature names as index. The parameters are the test set (X_test and y_test), the RF model rf_best trained in the last section, and the scoring metric used to measure the performance drop.

Note: Permutation importance is computationally expensive on a large test set. The calculate_importance() function automatically subsamples up to 100,000 observations before computing importance scores. This subsampling should not materially change the results.
Use the permutation_importance function from scikit-learn. See here for the documentation.
from sklearn.inspection import permutation_importance

def calculate_importance(X_test, y_test, RANDOM_STATE, final_rf, SCORING):
    X_test_sample = X_test.sample(n=min(100_000, len(X_test)), random_state=RANDOM_STATE)
    y_test_sample = y_test.loc[X_test_sample.index]
    perm = permutation_importance(
        final_rf, X_test_sample, y_test_sample,
        n_repeats=5,
        scoring=SCORING,
        n_jobs=-1,
        random_state=RANDOM_STATE
    )
    importances = (
        pd.Series(perm.importances_mean, index=X_test.columns)
        .sort_values(ascending=False)
    )
    return importances

Write an importance_plot(importances) function to plot the bar chart of feature importances.

def importance_plot(importances):
"""
Permutation importance plot
"""
fig, ax = plt.subplots(figsize=(8, 6))
importances.head(20).plot.barh(ax=ax)
ax.invert_yaxis()
ax.set_title("Permutation importance (top 20)")
ax.set_xlabel("Mean increase in RMSE")
plt.tight_layout()
plt.savefig("importances.png", dpi=150)
return figCalculate the importance with the train set and the final RF model. The SCORING argument should match the metric used during training (e.g. "r2").
Congrats 🎉! You now have a complete evaluation pipeline: scalar metrics, residual diagnostics, distributional analysis, feature importance and a head-to-head model comparison. These steps form the foundation of any rigorous model validation workflow.