5 Decision Tree Hyperparameters to Enhance your Tree Algorithms
Learn some of the most common hyperparameters you can tweak to boost your tree-based algorithms' performance
Decision Trees are really cool algorithms that set the stage for more advanced algorithms such as Random Forests, LightGBM or XGBoost. During your Data Science journey, Decision Trees are probably the first non-linear algorithm that you will learn, as they are fairly explainable and straightforward to understand. If you are still struggling with tree-based algorithms, check out my article on classification trees; hopefully it will be a helpful read.
In the context of studying Data Science and Machine Learning, Decision Trees are probably the first algorithm you will learn where hyperparameters are essential to performance. Hyperparameters are arguably more important for tree-based algorithms than for other models, such as regression-based ones. At the very least, the set of hyperparameters you can tweak in Decision Trees is more vast and heterogeneous than in most algorithms.
With great power comes great responsibility: Decision Trees can find extremely interesting non-linear relationships between features and target, but they are also extremely prone to high variance, commonly called overfitting.
And how can you control that overfitting tendency? With hyperparameters!
Hyperparameters are parameters that are not inferred by the model during training; the data scientist has to calibrate them manually. As we’ve discussed, studying this algorithm will also give you a good grasp of the importance of hyperparameters in the training process and how changing them impacts the performance and stability of your machine learning solution.
In this post, we are going to check some common hyperparameters we can tweak when fitting a Decision Tree and what their impact is on the performance of your models. Let’s start!
Maximum Depth
The first hyperparameter we will dive into is “maximum depth”. This hyperparameter sets the maximum level a tree can “descend” during the training process. For instance, in the sklearn implementation of the Classification Tree, max_depth is set to None by default.
What’s the impact of not having a maximum depth during the training process? Your tree can “theoretically” go down until all nodes are pure, assuming your tree is not constrained by any other hyperparameter. The main problem with this approach is that you may end up making decisions based on only a single example from your training set. If you let your tree go down too far, you are splitting your space recursively until you “potentially” isolate each example. This leads to overfitting: your algorithm will be extremely good on the training sample but will fail to generalize to the real world, which is the ultimate goal of a machine learning algorithm.
By trying different depth levels for your tree you will strike a balance between the generalization and fitting power of your algorithm. Here is a graphical example of what happens when you try different maximum depth values:
To recap:
A maximum depth that is too high may lead to overfitting, or high variance.
A maximum depth that is too low may lead to underfitting, or high bias.
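The effect above is easy to see in code. Here is a minimal sketch using sklearn’s DecisionTreeClassifier on a synthetic dataset (the dataset and the specific depth values are illustrative, not from any real project):

```python
# Comparing different max_depth values on a synthetic classification task.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for depth in [2, 5, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    # The unconstrained tree (max_depth=None) typically scores near-perfectly
    # on the training set while its test score lags behind: overfitting.
    print(f"max_depth={depth}: "
          f"train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}, "
          f"actual depth={tree.get_depth()}")
```

Watching the gap between train and test scores widen as the depth grows is the quickest way to spot high variance.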
Minimum Samples Split
As we’ve seen, if we don’t set a maximum depth for our tree, we end up incentivizing it to look for pure nodes: nodes that contain only a single class (classification trees) or a single continuous value (regression trees). This naturally makes our tree base its decisions on fewer examples, as pure nodes sit at the end of many splits and are often supported by only a handful of training rows.
Are there any other hyperparameters that help us avoid using sparse data to make decisions? Yes, there are! One of them is the “minimum samples split” hyperparameter, which lets you control how many samples a node must contain to be eligible for a split.
Setting a high number for this hyperparameter will cause your tree to base its generalizations on a higher number of examples, as a node is only split if it contains more samples than the number you’ve set. This avoids creating two potentially “low sample” children spawning from the node.
An extreme example of a high value for this hyperparameter: if you use the training set size as the “minimum samples split”, your tree never splits at all and your outcome is simply a prediction over the whole population (the majority class or the mean of the target). Of course, this approach would be a bit meaningless.
A low minimum samples split will cause your decision tree to overfit, as it will make decisions based on fewer examples; this has the same effect as choosing a high maximum depth.
To recap:
A “minimum samples split” that is too low may lead to overfitting, or high variance.
A “minimum samples split” that is too high may lead to underfitting, or high bias.
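As a quick sketch of this behavior (the sample sizes and split values below are arbitrary, chosen only for illustration), raising min_samples_split in sklearn visibly shrinks the tree:

```python
# Higher min_samples_split -> fewer splits -> a smaller, more general tree.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

results = {}
for mss in [2, 50, 200]:  # sklearn's default is 2
    tree = DecisionTreeClassifier(min_samples_split=mss, random_state=0).fit(X, y)
    results[mss] = tree.get_n_leaves()
    print(f"min_samples_split={mss}: "
          f"leaves={tree.get_n_leaves()}, depth={tree.get_depth()}")
```

With min_samples_split=200 on 500 samples, only the very first levels of the tree qualify for splitting, so the tree stays shallow by construction.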
Minimum Samples Leaf
Similar to the hyperparameter we saw above, minimum samples leaf is a hyperparameter that controls the number of examples a terminal leaf node can have. A leaf node is any terminal node of your tree, the kind that will be used to classify new points.
This is pretty similar to the hyperparameter presented before, with the only difference being the stage at which you control the sample size. With minimum samples split, you control the number of examples in a node before splitting it. With minimum samples leaf, you control the number of examples a node must contain after the split has “potentially” happened.
By raising the minimum samples leaf you are also preventing overfitting, stopping the tree from carving out leaves that are supported by only a handful of examples.
To recap:
A “minimum samples leaf” that is too low may lead to overfitting, or high variance.
A “minimum samples leaf” that is too high may lead to underfitting, or high bias.
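You can verify the guarantee this hyperparameter gives you directly: every leaf of the fitted tree contains at least the number of samples you set. A small sketch, with illustrative values:

```python
# min_samples_leaf guarantees a floor on the size of every terminal node.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

tree = DecisionTreeClassifier(min_samples_leaf=25, random_state=0).fit(X, y)

# tree.apply(X) returns, for each sample, the id of the leaf it lands in;
# counting occurrences gives the size of every leaf.
leaf_ids = tree.apply(X)
counts = np.bincount(leaf_ids)
leaf_sizes = counts[counts > 0]
print("smallest leaf:", leaf_sizes.min())  # never below 25
```

This is the difference versus min_samples_split: that one constrains the parent before splitting, while min_samples_leaf constrains the children that result.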
Maximum Features
A hyperparameter that acts not on the sample size but on the features is “maximum features”. By default, every time a Decision Tree searches for the candidates for the next split, it exhaustively looks across every feature.
This has two main disadvantages:
you prevent your tree from dealing with randomness in any way. For example, if your tree could use only a randomly selected subset of features each time it performs a split, you would be introducing a random process into training, something that can benefit your performance.
if you have a high number of dimensions, your tree will take a long time to train.
With max features, you add a bit of randomness to your training process by choosing a random subset of features for each split. This is especially practical when you have a lot of features.
In the sklearn implementation you will find different ways to set the number of features considered at each split: you can define it as an integer, as a fraction of the total number of features (50% of all features, for instance), or via presets such as “sqrt” and “log2”.
To recap:
A high “maximum features” value approximates the exhaustive search of a Decision Tree with no parameter set.
A low “maximum features” value selects fewer random features to consider when choosing the next split of the tree.
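The fraction-based option from the sklearn implementation can be checked on a fitted tree via its max_features_ attribute, which holds the inferred number of features per split. A small sketch, with illustrative numbers:

```python
# Passing a float to max_features means "this fraction of all features".
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

tree = DecisionTreeClassifier(max_features=0.5, random_state=0).fit(X, y)

# With 20 features and max_features=0.5, each split considers 10 of them,
# drawn at random; the fitted value is exposed as max_features_.
print("features considered per split:", tree.max_features_)
```

This per-split random subsampling of features is the same idea Random Forests push further, which is one reason Decision Trees are such a good stepping stone to them.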
Minimum Impurity Decrease
Another way to tweak your tree and be stricter about which splits are “acceptable” is to control the minimum decrease in impurity (the value you want to minimize with each split).
By default, you accept any split that reduces the overall impurity of your tree. With this hyperparameter set, you only apply a split if it decreases the impurity by at least a certain amount.
This hyperparameter acts directly upon the cost function of the decision tree: even if a split reduces your impurity and “theoretically” improves your classification or regression, it may not be applied because it does not meet the threshold you have defined. In the sklearn implementation, this value is set to 0 by default, meaning any split is valid as long as it reduces the impurity, even if only by 0.00001.
To recap:
A low “minimum impurity decrease” will make your tree perform splits that are “theoretically” valid but may be meaningless;
A high “minimum impurity decrease” will make your training process too strict when choosing the next split, leading to high bias.
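Comparing the default against a stricter threshold makes the effect visible: the stricter tree simply refuses the low-value splits and ends up smaller. A sketch with an illustrative threshold:

```python
# min_impurity_decrease filters out splits whose impurity gain is too small.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Default (0.0): any impurity-reducing split is accepted, however tiny.
loose = DecisionTreeClassifier(min_impurity_decrease=0.0, random_state=0).fit(X, y)
# Stricter: a split must reduce the weighted impurity by at least 0.01.
strict = DecisionTreeClassifier(min_impurity_decrease=0.01, random_state=0).fit(X, y)

print("leaves (default):", loose.get_n_leaves())
print("leaves (strict): ", strict.get_n_leaves())
```

A good value for this threshold is problem-dependent, so it is usually tuned with cross-validation rather than set by hand.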
Thank you for reading! These are 5 hyperparameters that I normally tweak when I develop decision trees. Learning decision trees was essential in my DS and ML studies; it was the algorithm that helped me grasp the huge impact hyperparameters can have on an algorithm’s performance and how they can be key to the success or failure of a project.
I’ve set up a course on learning Data Science from Scratch on Udemy — the course is structured for beginners, contains more than 100 exercises and I would love to have you around!