In the first part of this project, we prepared the data to train the model. We reflected on which target variable to use, how to transform features and how to use scikit-learn tools for pipeline construction. Now, we are going to train a Random Forest model, covering both the theoretical background and its implementation.
The Random Forest (RF) is a powerful and widely used ensemble method for classification and regression tasks. It combines the simplicity of decision trees and the sampling of observations and variables with the power of aggregation to improve predictive performance and reduce the risk of overfitting.
An Insee working document on ensemble methods is available here. If you are unfamiliar with ensemble methods, we recommend reading this document before proceeding.
1 Concepts of Random Forests
RF extend bagging by introducing an additional level of randomness: at each node, the splitting rule is determined using only a randomly selected subset of features. This further reduces correlation between trees, thereby lowering the variance of the aggregated model’s predictions.
In short, RF relies on a few key elements: building many regression or classification trees, each trained on a bootstrap sample - a random sample drawn with replacement from the original dataset, typically of the same size - with a randomized subset of candidate features at each split, and aggregating their predictions.
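These ingredients can be illustrated with a minimal NumPy sketch of the bootstrap step (a toy illustration, not part of the exercises):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # a toy "dataset" of 10 observations

# Bootstrap sample: same size as the original, drawn with replacement,
# so some observations appear several times and others not at all
boot = rng.choice(data, size=len(data), replace=True)

# Observations never drawn are "out-of-bag" for this tree
oob = np.setdiff1d(data, boot)
print("bootstrap:", boot)
print("out-of-bag:", oob)
```

On average, roughly a third of the observations end up out-of-bag for each tree; these are the observations used later to compute the OOB error.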
You need to select several hyperparameters for training (we cover the most important ones here):
Number of trees: this hyperparameter must be tuned - too few trees lead to higher prediction error, while too many trees increase computational time;
Number of randomly sampled candidate features: use sqrt or log2 of the total number of features; these choices generally work well;
Data sample rate: useful to speed up training on large datasets, otherwise keep it close to 1.0;
Minimum number of observations per leaf (terminal node): setting a higher value can reduce training time, generally without any performance loss.
For more information, you can read the paragraph on RF hyperparameters here.
2 Exercise 4: Train your first Random Forest model
Using the RandomForestRegressor class and its scikit-learn documentation page, define a basic model with the following hyperparameters:
n_estimators=50
max_features="sqrt"
min_samples_leaf=10
Train the model on the non-transformed target (it should take about 5 minutes to train).
As this is only a first step, we chose to simplify the exercise. First, we will drop the trans_date and prop_type features. Second, we will work with the non-transformed target (i.e. the price per square meter) and not use the transformed target (i.e. the log of the price per square meter) defined at the end of the pre-processing.
We will come back to this later on and chain it with the preprocessing steps.
Note: All parameters of RandomForestRegressor have a default value - you only need to explicitly pass the ones you wish to override.
Hint
Remember : in the Preprocessing section, we created the training sets X_train and y_train.
See the solution
from sklearn.ensemble import RandomForestRegressor

# Create a RandomForestRegressor instance with the selected hyperparameters
rf = RandomForestRegressor(
    n_estimators=50,
    max_features="sqrt",
    min_samples_leaf=10,
    oob_score=True,  # for calculating the total OOB error of the RF
)

# Define train and test sets
X = df.drop(columns=["price_sqm", "trans_date", "prop_type"])
y = df["price_sqm"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

# Train the model
rf.fit(X_train, y_train)
Print the Out-of-Bag error (the concept of OOB error will be covered in the next exercise).
Important
Scikit-learn distinguishes parameters from fitted attributes. Here, to access the fitted attributes produced by the training phase (listed in the documentation), look for the oob_score_ attribute of the random forest.
See the solution
rf.oob_score_
Now, calculate the error of the model on the test set - that is, how far off the predictions are from the actual values.
Note: This is just an illustration of the evaluation process. More details on metrics will be covered in the dedicated section.
Hint
You can use the mean_squared_error function from the sklearn.metrics module.
Remember to use the test set for the evaluation, not the training set.
See the solution
from sklearn.metrics import mean_squared_error

# Predictions on the test set
y_pred = rf.predict(X_test)

# Print the error
print(mean_squared_error(y_test, y_pred))
Congrats 🥳! You have trained your first RF model. Now, try to find the best hyperparameters to minimize the final predicted error of the model.
3 Exercise 5: Tuning a random forest's hyperparameters
Now that you understand what a random forest is, we will tune its hyperparameters for training. In this exercise, your aim is to train the best model for the prediction of price per square meter.
3.1 Tuning the number of trees
One easy way to determine the optimal number of trees is to use the Out-of-Bag (OOB) error - an error estimate computed on the observations that were not sampled during bootstrap and therefore not used to train each individual tree. By tracking how this error evolves as trees are added, one can plot its convergence as a function of the number of trees.
With scikit-learn's default implementation, producing this plot is not straightforward: the training process is optimised for speed, and trees are grown in parallel, meaning intermediate OOB estimates are not natively exposed. We therefore need to write a custom function to compute and plot this convergence curve. You can see the documentation page on OOB error here.
We will implement a function to plot the OOB error as a function of the number of trees in a RF model. To do so, you will need to:
subsample a fraction of the dataset to reduce computation time (the dataset is large, with more than 1 million rows);
train multiple RF models with varying numbers of trees;
plot the OOB error for each RF model to visualize the convergence.
Note
For convenience and pedagogical reasons, we are still working on a subset of the features and with the non-transformed target. In the real world, we would do this step with the pipeline defined at the end of the pre-processing part.
Subsample a fraction (0.1) of the training set to reduce computation time and define X_sub and y_sub. To preserve the characteristics of the target, stratify the sampling according to the target's distribution using pandas' qcut function.
See the solution
# Sample the train dataset using pandas' index
y_train_df = pd.DataFrame(y_train)
y_train_df["quantile"] = pd.qcut(
    y_train_df["price_sqm"], q=100, labels=False
)  # discretize the target along its quantiles
y_sub = y_train_df.groupby("quantile").sample(
    frac=0.1, random_state=RANDOM_STATE
)  # sample within each quantile
y_sub = y_sub["price_sqm"]  # convert back to a pandas Series
X_sub = X_train.filter(items=y_sub.index, axis=0)  # align X_train on the sampled index
Train multiple RF models with varying numbers of trees, using warm_start=True to avoid retraining all trees from scratch at each iteration. Store the OOB error computed at each step in a list.
Hint
As a hint, you can go to the documentation page on OOB error here.
If needed, here is the code to calculate and store the OOB score.
oob_scores = []
warnings.filterwarnings(
    "ignore", message="Some inputs do not have OOB scores"
)  # remove the warning raised when some observations have no OOB prediction

for n in range(min_estimators, max_estimators, 10):
    rf.set_params(n_estimators=n)
    rf.fit(X_sub, y_sub)
    if metric == "r2":
        oob_scores.append((n, 1 - rf.oob_score_))
    elif metric == "neg_root_mean_squared_error":
        mse = np.mean((y_sub - rf.oob_prediction_) ** 2)
        oob_scores.append((n, np.sqrt(mse)))
    else:
        mae = np.mean(np.abs(y_sub - rf.oob_prediction_))
        oob_scores.append((n, mae))

warnings.resetwarnings()
See the solution
import warnings

import numpy as np

metric = "r2"
min_estimators = 5
max_estimators = 150

rf = RandomForestRegressor(
    oob_score=True,  # needed to expose oob_score_ and oob_prediction_ after fitting
    warm_start=True,
    **rf_params,
)

oob_scores = []
warnings.filterwarnings(
    "ignore", message="Some inputs do not have OOB scores"
)  # remove the warning raised when some observations have no OOB prediction

for n in range(min_estimators, max_estimators, 20):
    rf.set_params(n_estimators=n)
    rf.fit(X_sub, y_sub)
    if metric == "r2":
        oob_scores.append((n, 1 - rf.oob_score_))
    elif metric == "neg_root_mean_squared_error":
        mse = np.mean((y_sub - rf.oob_prediction_) ** 2)
        oob_scores.append((n, np.sqrt(mse)))
    else:
        mae = np.mean(np.abs(y_sub - rf.oob_prediction_))
        oob_scores.append((n, mae))

warnings.resetwarnings()
(Optional) Using the Out-of-Bag (OOB) error, try to plot the convergence of the OOB error as a function of the number of trees.
Hint
You need to:
subsample the inputs X and y, with a parameter subsample;
train the RF model from 15 to 150 trees (with a step of 5) with RandomForestRegressor and the parameter warm_start=True;
select which metric to use for the OOB error estimate;
plot the convergence of the OOB error as a function of the number of trees.
See the solution
import matplotlib.pyplot as plt

rf_params = {
    "max_depth": 8,
    "max_features": "sqrt",
    "min_samples_split": 5,
    "min_samples_leaf": 10,
    "random_state": RANDOM_STATE,
}

def rf_error_oob_plot(X_train, y_train, subsample=0.1, min_estimators=15,
                      max_estimators=150, metric="r2", **rf_params):
    """
    Plot the OOB error convergence as a function of the number of trees.

    Args:
        X_train: features
        y_train: target
        subsample: sampling rate for X_train
        min_estimators: minimum number of trees
        max_estimators: maximum number of trees
        metric: 'r2', 'neg_root_mean_squared_error' or 'mae'
    """
    # --- Stratified sampling of the training set ---
    y_train_df = pd.DataFrame(y_train)
    y_train_df["quantile"] = pd.qcut(
        y_train_df["price_sqm"], q=100, labels=False
    )  # discretize the target along its quantiles
    y_sub = y_train_df.groupby("quantile").sample(
        frac=subsample, random_state=RANDOM_STATE
    )  # sample within each quantile
    y_sub = y_sub["price_sqm"]  # convert back to a pandas Series
    X_sub = X_train.filter(items=y_sub.index, axis=0)  # align X_train on the sampled index

    # --- Training with warm start ---
    rf = RandomForestRegressor(
        oob_score=True,
        warm_start=True,
        **rf_params,
    )

    oob_scores = []
    warnings.filterwarnings("ignore", message="Some inputs do not have OOB scores")
    for n in range(min_estimators, max_estimators, 5):
        rf.set_params(n_estimators=n)
        rf.fit(X_sub, y_sub)
        if metric == "r2":
            oob_scores.append((n, 1 - rf.oob_score_))
        elif metric == "neg_root_mean_squared_error":
            mse = np.mean((y_sub - rf.oob_prediction_) ** 2)
            oob_scores.append((n, np.sqrt(mse)))
        else:
            mae = np.mean(np.abs(y_sub - rf.oob_prediction_))
            oob_scores.append((n, mae))
    warnings.resetwarnings()

    # Generate the "OOB error" vs. "n_estimators" plot
    xs, ys = zip(*oob_scores)
    fig, ax = plt.subplots()
    ax.plot(xs, ys)
    ax.set_xlim(min_estimators, max_estimators)
    ax.set_xlabel("n_trees")
    ax.set_ylabel(f"OOB error ({metric})")
    plt.close(fig)
    return fig
You can now select the number of trees (n_estimators) using the OOB error plot. Which number do you choose?
3.2 Tuning the other hyperparameters with cross-validation
Cross-validation is a key step in model training: it evaluates the model's ability to generalize to unseen data by splitting the dataset into multiple folds (typically 5) and iteratively using each fold as a validation set while the remaining folds are used for training. It also allows you to compare different sets of hyperparameters and select the configuration that yields the best predictive performance.
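The mechanics can be sketched on a synthetic dataset (a minimal illustration; the variable names are ours, not part of the exercise):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Toy regression problem standing in for the housing data
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
rf_demo = RandomForestRegressor(n_estimators=20, random_state=0)

# 5-fold cross-validation: each fold serves once as the validation set
scores = cross_val_score(rf_demo, X_demo, y_demo, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```

Comparing the mean score across candidate hyperparameter settings is exactly what GridSearchCV automates below.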
Note: In the Preprocessing section, you saw how to build a Pipeline. In this exercise, we will use a Pipeline again. The first exercise didn't use it to let you practice the scikit-learn tools documented here.
For convenience, we dropped the features prop_type and trans_date and trained our model on absolute prices rather than on the transformed ones. Add them back and define the new training and testing datasets X_train, X_test, y_train, y_test.
See the solution
# Split features / target
X = df.drop(columns="price_sqm")
y = df["price_sqm"]

# Split train / test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)
Create a dictionary with the 3 hyperparameters to test (n_estimators, max_features and min_samples_leaf) for training.
Hint
Pay attention to hyperparameter names: when using Pipeline and TransformedTargetRegressor objects, these names are more complex than plain RandomForestRegressor parameter names. In the previous section, we used two objects for training:
TransformedTargetRegressor to apply the log transformation on the target - the corresponding parameter prefix is regressor;
Pipeline to apply all preprocessing transformations and fit the model - you need to refer to the name of the random forest step (it was 'RF'). For example, for the hyperparameter n_estimators, the key in the parameter dictionary must be regressor__RF__n_estimators.
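Following this naming convention, a possible grid looks like the sketch below (the candidate values are illustrative; adapt them to your computational budget):

```python
# Keys use the regressor__RF__ prefix imposed by the
# TransformedTargetRegressor / Pipeline nesting described above
param_grid = {
    "regressor__RF__n_estimators": [50, 100],
    "regressor__RF__max_features": ["sqrt", "log2"],
    "regressor__RF__min_samples_leaf": [5, 10],
}
```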
Using the Grid Search documentation, write the code to set up and run cross-validated hyperparameter tuning for the RF model. Note that training the model should take around 10 minutes.
See the solution
from sklearn.model_selection import GridSearchCV

# Grid search
grid_search = GridSearchCV(
    estimator=model,  # the TransformedTargetRegressor created in the preprocessing part
    param_grid=param_grid,
    cv=4,  # number of folds
    scoring="r2",  # 'r2', 'neg_root_mean_squared_error' or 'neg_mean_absolute_error'
    n_jobs=-1,
    verbose=1,
)

# Train
grid_search.fit(X_train, y_train)
From the fitted grid_search object, retrieve the best hyperparameters found for the model.
See the solution
print(grid_search.best_params_)
Important
Scikit-learn distinguishes original parameters from **fitted attributes with a trailing _**. The _ suffix is a consistent scikit-learn convention: any attribute that only exists after fitting ends with _ (e.g., coef_, n_features_in_, classes_). It signals “this was learned from data.”
grid_search.myfittedparam_
From the fitted grid_search object, retrieve the best model. As it has only been trained on a fraction of the training set, retrain it on the full training set.
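This step can be sketched as follows (on a small synthetic problem so the snippet is self-contained; in the exercise you would reuse your own grid_search, X_train and y_train):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_regression(n_samples=200, n_features=5, random_state=0)

grid_demo = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 20]},
    cv=3,
)
grid_demo.fit(X_demo, y_demo)

# best_estimator_ holds the model with the best hyperparameters found
rf_best = grid_demo.best_estimator_

# Retrain on the full training set (needed when the search ran on a subsample;
# with the default refit=True, GridSearchCV already refits on the data passed to fit)
rf_best.fit(X_demo, y_demo)
```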
Congrats! You have optimized your RF model. Now, you need to evaluate its accuracy — that is, how close the model's predictions are to the actual values.
4 Exercise 6: Model evaluation
Now that your model has been optimized, it is time to rigorously evaluate its performance on the test set — data the model has never seen during training. This step is essential to assess how well your model generalizes to new observations.
Several metrics are commonly used to evaluate regression models:
RMSE (Root Mean Squared Error): measures the average magnitude of prediction errors, penalizing large errors more heavily;
MAE (Mean Absolute Error): measures the average absolute difference between predictions and actual values, more robust to outliers;
R² (Coefficient of Determination): measures the proportion of variance in the target explained by the model (1 = perfect fit, 0 = no better than the mean).
Compute predictions on the test set using the best model retrieved from the previous exercise.
See the solution
y_pred = rf_best.predict(X_test)
Calculate the three evaluation metrics on the test set.
Hint
You can use mean_squared_error, mean_absolute_error, and r2_score from sklearn.metrics. To obtain RMSE from MSE, apply np.sqrt().
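Putting the three metrics together looks like the sketch below (with toy values; in the exercise, replace y_true and y_hat with y_test and y_pred):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values for illustration
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 390.0])

rmse = np.sqrt(mean_squared_error(y_true, y_hat))  # penalizes large errors
mae = mean_absolute_error(y_true, y_hat)           # robust to outliers
r2 = r2_score(y_true, y_hat)                       # share of variance explained
print(rmse, mae, r2)
```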
(Optional) Plot predicted values against actual values to visually assess model quality. A well-calibrated model should have points closely aligned along the diagonal.
Hint
Use matplotlib.pyplot to create a scatter plot of y_test against y_pred. Add a reference diagonal line (representing perfect predictions) using ax.plot([min, max], [min, max]).
See the solution
fig, ax = plt.subplots(figsize=(7, 7))
ax.scatter(y_test, y_pred, alpha=0.3, s=5, label="Predictions")
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
ax.plot(lims, lims, "r--", linewidth=1.5, label="Perfect prediction")
ax.set_xlabel("Actual values")
ax.set_ylabel("Predicted values")
ax.set_title("Predicted vs. Actual values on the test set")
ax.legend()
plt.tight_layout()
plt.show()
Congrats 🎉! You have successfully trained, optimized, and evaluated a Random Forest model for price per square meter prediction.