
R has many machine learning libraries you can use to train models. From caret to stand-alone libraries such as randomForest, rpart or glm, R provides a wide array of options when you want to perform data science tasks.
A curious library that you may have never heard of is h2o. An in-memory platform for distributed and scalable machine learning, h2o can run on powerful clusters whenever you need extra computing power. Another interesting bit about it is that it isn’t exclusive to R — you can also use the same API from Python, for instance.
Additionally, h2o is a very rich and diverse library. It contains so many features (ranging from training models to automl capabilities) that it’s easy to get a bit lost when using it, particularly due to the high number of methods and functions the package offers. That diversity of features is the main reason why I’ve written this blog post — I want to help you navigate the h2o interface!
In this blog post, we’ll cover some examples (with code) of h2o, namely:
train a couple of machine learning models;
do some hyperparameter tuning;
perform an automl routine;
take a glimpse at the explainability module.
Let’s start!
Loading the Data
For this blog post, we’ll use the London Bike Sharing Dataset — this dataset contains information about bike demand for the London Bike Sharing Programme. If you want to know more about it, visit the Kaggle link to check a description of the columns and how this data was generated.
In essence, this is a supervised learning problem, where we’ll want to predict the count of new bike rides per hour, based on a couple of features describing the day and the weather during a specific hour.
We can load the dataset using R’s read.csv:
london_bike <- read.csv('./london_merged.csv')
After loading the data, let’s just do a small check on our data types. If we call the str command on london_bike, we’ll see that:
'data.frame': 17414 obs. of 10 variables:
$ timestamp : chr "2015-01-04 00:00:00" "2015-01-04 01:00:00" "2015-01-04 02:00:00" "2015-01-04 03:00:00" ...
$ cnt : int 182 138 134 72 47 46 51 75 131 301 ...
$ t1 : num 3 3 2.5 2 2 2 1 1 1.5 2 ...
$ t2 : num 2 2.5 2.5 2 0 2 -1 -1 -1 -0.5 ...
$ hum : num 93 93 96.5 100 93 93 100 100 96.5 100 ...
$ wind_speed : num 6 5 0 0 6.5 4 7 7 8 9 ...
$ weather_code: num 3 1 1 1 1 1 4 4 4 3 ...
$ is_holiday : num 0 0 0 0 0 0 0 0 0 0 ...
$ is_weekend : num 1 1 1 1 1 1 1 1 1 1 ...
$ season : num 3 3 3 3 3 3 3 3 3 3 ...
Our dataframe only contains numeric columns (apart from timestamp, which we will not use) — as we want the algorithms to treat weather_code and season as categorical variables (at least in the first experiment), let’s convert them into factors:
london_bike$weather_code <- as.factor(london_bike$weather_code)
london_bike$season <- as.factor(london_bike$season)
Having the dataset loaded into R, let’s split our data into two by creating a train and test frame.
Train-Test Split
h2o has a convenient function to perform train-test splits. To get h2o running, we’ll need to load and initialize the library:
library(h2o)
h2o.init()
When we type h2o.init(), we are setting up a local h2o cluster. By default, h2o will use all available CPUs, but you can specify a specific number of CPUs to initialize using the nthreads argument.
This is one of the major differences from other machine learning libraries in R — to use h2o, we always need to start an h2o cluster. The advantage is that if you have an h2o instance running on a server, you can connect to that machine and use its computing resources without changing the code too much (you only need to point your init to another machine).
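As a small sketch of both options (the host name below is just a placeholder, not a real server):
# Start a local cluster with a fixed number of threads
h2o.init(nthreads = 4)
# Or connect to a remote h2o instance (hypothetical host, default port)
h2o.init(ip = "my-h2o-server.example.com", port = 54321)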
Using h2o.splitFrame, we can conveniently create a random train-test split of our data, but before that, we need to convert our dataframe into a special object that h2o can recognize:
london_bike.h2o <- as.h2o(london_bike)
First learning: h2o can’t deal with normal R dataframes, only with a special type of object called H2OFrame, so the step of converting dataframes using as.h2o is mandatory.
We are now ready to do our train-test split with our new london_bike.h2o object:
london_bike_split <- h2o.splitFrame(data = london_bike.h2o, ratios = 0.8, seed = 1234)
training_data <- london_bike_split[[1]]
test_data <- london_bike_split[[2]]
Using h2o.splitFrame, we can immediately divide our dataset into two different h2o frames — ratios defines the percentage we want to allocate to our training data. In the function above, we are using 80% of the dataset for training purposes, leaving 20% as a holdout set.
With our data split between test and train in h2o format, we’re ready to train our first h2o model!
Training a Model
Due to its simplicity, the first model we are going to train will be a linear regression. This model can be trained with the h2o.glm function, and first, we need to define the target and feature variables:
predictors <- c("t1", "t2", "hum", "wind_speed", "weather_code",
                "is_holiday", "is_weekend", "season")
response <- "cnt"
As the column cnt contains the number of bikes used in each hour, that’s the column we will use as our response/target.
To train the model, we can do the following:
london_bike_model <- h2o.glm(x = predictors,
y = response,
training_frame = training_data)
And... that’s it! With just three arguments, we can train our model:
x defines the names of the columns we will be using as features.
y defines the target column.
in the training_frame argument, we pass the training dataset.
We have our model ready! Let’s compare our predictions against the real values on the test set — we can conveniently use the h2o.predict function to get the predictions from our model:
test_predict <- h2o.predict(object = london_bike_model,
newdata = test_data)
object receives the model we want to apply to our data.
newdata receives the data where we will apply the model.
And then, we can cbind our predictions with the cnt column from the test set:
predictions_x_real <- cbind(
as.data.frame(test_data$cnt),
as.data.frame(test_predict)
)
We can quickly compare our predictions against the target with a plot:
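A minimal sketch of such a comparison plot, assuming you have ggplot2 installed (h2o.predict names its output column predict):
library(ggplot2)
# Predicted vs. real bike counts; points on the red diagonal are perfect predictions
ggplot(predictions_x_real, aes(x = cnt, y = predict)) +
  geom_point(alpha = 0.3) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  labs(x = "Real bike count", y = "Predicted bike count")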
Clearly, our model is overshooting our target a bit — let’s apply some regularization inside h2o.glm using the alpha parameter:
london_bike_model_regularized <- h2o.glm(x = predictors,
y = response,
training_frame = training_data,
alpha = 1)
In h2o.glm, alpha = 1 represents Lasso Regression. It doesn’t seem that our model improved much, and we probably need to do some more feature engineering or try other arguments of the linear regression (although it’s unlikely that this will improve our model by a lot).
In the library’s documentation, you’ll find a ton of parameters to tweak in this generalized linear model function. An important takeaway from this first training process is that h2o’s training implementations contain a lot of tweakable parameters we can experiment with and test.
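As a sketch of a couple of those extra knobs (the values below are illustrative assumptions, not tuned choices):
# Elastic net: alpha mixes the L1 and L2 penalties, lambda_search tries several strengths
london_bike_model_enet <- h2o.glm(x = predictors,
                                  y = response,
                                  training_frame = training_data,
                                  alpha = 0.5,
                                  lambda_search = TRUE)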
When evaluating our model, we’ve only done some visual “testing” by plotting our predictions against the real target. Of course, if we want to do a more scientific evaluation, we have all the famous regression and classification metrics available in the h2o framework!
Let’s see that, next.
Evaluating our Models
One of the cool things we can do with h2o is pass a test set directly into our model and use it to extract validation metrics. For instance, if we want to retrieve metrics from a test set, we can plug it directly into the validation_frame argument of any h2o trained model:
london_bike_model <- h2o.glm(x = predictors,
y = response,
training_frame = training_data,
validation_frame = test_data)
Passing this new validation_frame argument will give us the ability to extract metrics for both data frames really quickly — for instance, let’s get the root mean squared error of our model for both training and test (here called valid):
h2o.rmse(london_bike_model, train=TRUE, valid=TRUE)
   train    valid
936.5602 927.4826
Given that the mean of our target variable is 1143 bikes per hour, our model is not performing that well. We also see little evidence of overfitting, as the training and test errors are similar.
How can we change the metric we want to take a look at? We just tweak the h2o function! For instance, if we want to look at the r-squared, we just use h2o.r2:
h2o.r2(london_bike_model, train=TRUE, valid=TRUE)
    train     valid
0.2606183 0.2456557
Super simple! The upside is that you don’t have to worry about all the details on how to implement these metrics by yourself.
In this example, I am following a regression problem but, of course, you also have classification models and metrics available. For all the metrics available in h2o, check the following link.
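These metric helpers all share the same interface. For example, a quick sketch with the mean absolute error for the same trained model:
# Mean absolute error on both the training and validation frames
h2o.mae(london_bike_model, train = TRUE, valid = TRUE)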
It’s a bit expected that our linear regression isn’t performing that well — we haven’t done any feature engineering and we are probably violating too many linear regression assumptions. But if we can train simple linear regressions, we can probably train other types of models in h2o, right? That’s right! Let’s see that in the next section.
More Model Examples
As you may have guessed, if we change the h2o function associated with the training process, we will fit other types of models. Let’s train a random forest by calling h2o.randomForest:
london_bike_rf <- h2o.randomForest(x = predictors,
y = response,
ntrees = 25,
max_depth = 5,
training_frame = training_data,
validation_frame = test_data)
I’m setting two hyperparameters for my random forest in the function call:
ntrees sets the number of trees in the forest.
max_depth sets the maximum depth of each tree.
If you need, you can find all the tweakable parameters by calling help(h2o.randomForest) in the R console.
The cool thing is that we can now use what we’ve learned about model metrics on this new model, just by switching the model we feed into the metric function — for instance, to get the rmse of this model, we switch the first argument to london_bike_rf:
h2o.rmse(london_bike_rf, train=TRUE, valid=TRUE)
   train    valid
909.1772 900.5366
And to obtain the r2:
h2o.r2(london_bike_rf, train=TRUE, valid=TRUE)
train valid
0.3032222 0.2888506
Notice that our code practically didn’t change. The only thing that was modified was the model we fed into the first argument. This makes these metric functions highly adaptable to new models, as long as they are trained inside the h2o framework.
If you follow this link, you will find other models you can train with the library. From that list, let’s fit a neural network using h2o.deeplearning:
nn_model <- h2o.deeplearning(x = predictors,
y = response,
hidden = c(6,6,4,7),
epochs = 1000,
train_samples_per_iteration = -1,
reproducible = TRUE,
activation = "Rectifier",
seed = 23123,
training_frame = training_data,
validation_frame = test_data)
hidden is a very important argument in the h2o.deeplearning function. It takes a vector representing the number of hidden layers and neurons we will use in our neural network. In our case, we are using c(6,6,4,7): 4 hidden layers with 6, 6, 4 and 7 nodes, respectively.
Will our h2o.r2 function work just the same with neural networks? Let’s test:
h2o.r2(nn_model, train=TRUE, valid=TRUE)
train valid
0.3453560 0.3206021
It works!
Bottom line: most functions in the h2o framework carry over to other models. This is a very good feature, as it makes it easy to switch between models with less code, avoiding overly complex or error-prone development.
Another cool feature of h2o is that we can do hyperparameter tuning really smoothly — let’s see how, next.
Hyperparameter Tuning
Performing hyperparameter search is also super simple in h2o — you only need to know:
The model where you want to perform your search.
The names of the parameters available for that model.
The values you want to test for each parameter.
Remember that, in the random forest we’ve trained above, we’ve set ntrees and max_depth manually.
In the grid example, we’ll do a search on both parameters plus min_rows. We can do that by using the h2o.grid function:
# Grid Search
rf_params <- list(ntrees = c(2, 5, 10, 15),
                  max_depth = c(3, 5, 9),
                  min_rows = c(5, 10, 100))

# Train and validate a grid of randomForests
rf_grid <- h2o.grid("randomForest",
x = predictors,
y = response,
grid_id = "rf_grid",
training_frame = training_data,
validation_frame = test_data,
seed = 1,
hyper_params = rf_params)
We start by declaring rf_params, which contains the lists of values we will use in our grid search, and then we pass that grid into the hyper_params argument of h2o.grid. What h2o will do is train and evaluate every single combination of the hyperparameters available.
Seeing the results of our search is super easy:
h2o.getGrid(grid_id = "rf_grid",
sort_by = "r2",
decreasing = TRUE)
The h2o.getGrid function gives us a summary of the best hyperparameters according to a specific metric. In this case, we chose r2, but other metrics such as RMSE or MSE also work. Let’s look at the top 5 results of our grid search:
Hyper-Parameter Search Summary: ordered by decreasing r2
max_depth min_rows ntrees model_ids r2
1 9.00000 5.00000 15.00000 rf_grid_model_30 0.33030
2 9.00000 5.00000 10.00000 rf_grid_model_21 0.32924
3 9.00000 5.00000 5.00000 rf_grid_model_12 0.32573
4 9.00000 10.00000 15.00000 rf_grid_model_33 0.32244
5 9.00000 10.00000 10.00000 rf_grid_model_24 0.31996
Our best model was the one with a max_depth of 9, a min_rows of 5 and 15 ntrees — this model achieved an r2 of 0.3303.
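If you want to keep working with the winning model, a small sketch of how to retrieve it from the grid, using the grid’s model_ids slot:
# Fetch the sorted grid and grab the best model by its id
rf_grid_results <- h2o.getGrid(grid_id = "rf_grid", sort_by = "r2", decreasing = TRUE)
best_rf <- h2o.getModel(rf_grid_results@model_ids[[1]])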
The cool part? You can expand this grid to any hyperparameter available in ?h2o.randomForest or to any model available in the documentation of h2o, opening up an endless amount of possibilities with the same function.
AutoML Features
If you need a quick and raw way to look at how different models perform on your dataset, h2o also has an interesting automl routine:
aml <- h2o.automl(x = predictors,
y = response,
training_frame = training_data,
validation_frame = test_data,
max_models = 15,
seed = 1)
The max_models argument specifies the maximum number of models to be tested in a specific automl run. Keep in mind that the automl routine can take a while to run, depending on your resources.
We can access the top models of our routine by checking aml@leaderboard:
aml@leaderboard
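By default, printing the leaderboard only shows the first few rows. A quick sketch to print it in full:
# Print every row of the AutoML leaderboard
print(aml@leaderboard, n = nrow(aml@leaderboard))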
From the leaderboard above, we can see that a Stacked Ensemble model was the winner (at least in terms of rmse). We can also get more information on the best model by calling:
h2o.get_best_model(aml)

Model Details:
==============
H2ORegressionModel: stackedensemble
Model ID: StackedEnsemble_AllModels_1_AutoML_1_20221027_95503
Number of Base Models: 15

Base Models (count by algorithm type):
deeplearning  drf  gbm  glm
           4    2    8    1
The result of h2o.get_best_model(aml) gives more information about the model that achieved the best score in our automl routine. From the snippet above, we know that our ensemble aggregates the results of:
4 deep learning models;
2 random forests;
8 gradient boosting models;
1 generalized linear model;
Depending on your use case, the automl routine can be a quick-and-dirty way to understand how ensembles and single models behave on your data, giving you hints on where to go next. For example, it can help you decide whether more complex models are the way to go or whether you will probably need to acquire more training data / features.
Explainability
Finally, let’s take a look at some of h2o’s explainability modules. In this example, we’ll use the random forest model we’ve trained above:
london_bike_rf <- h2o.randomForest(x = predictors,
y = response,
ntrees = 25,
max_depth = 5,
training_frame = training_data,
validation_frame = test_data)
Just like in most machine learning libraries, we can grab the variable importance plot directly from the h2o interface:
h2o.varimp_plot(london_bike_rf)
From the importance plot, we notice that humidity (hum) and the temperatures (t1, t2) are the most important variables for our trained random forest. Calling h2o.varimp_plot immediately shows the importance plot for a specific model, without the need to configure anything else.
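If you prefer the raw numbers over the plot, a quick sketch using the matching table function:
# Variable importances as a table instead of a plot
h2o.varimp(london_bike_rf)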
Single-variable importances are not the only explainability models available in h2o — we can also check shap values quickly:
h2o.shap_summary_plot(london_bike_rf, test_data)
Voilà! From the shap_summary_plot, we understand the direction of the relationship between our features and the target. For instance:
higher temperatures explain more bike usage.
lower humidity also explains more bike usage.
In the case above, we’ve asked for a general shap explanation of our test_data, but we can also check individual row explainability using h2o — for instance, let’s see the 4th row of our test set:
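One quick way to inspect that row is converting it back to a regular R dataframe:
# Look at the 4th row of the test set
as.data.frame(test_data[4, ])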
Basically, this was a very cold day in January and this factor probably impacted bike usage in London, affecting the prediction of our model — let’s pass this row to a shap explainer:
h2o.shap_explain_row_plot(london_bike_rf, test_data, row_index = 4)
Interesting! Several features are pushing the prediction of bike usage down. For example:
A high value of humidity (93) contributes negatively to the prediction.
The fact that this day was a weekend means less commuting, which also pushes our prediction down.
Let’s see a summer weekday row:
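The call is analogous to the one above; the row index below is purely hypothetical, so pick any summer weekday row from your own split:
# Explain another row of the test set (hypothetical index of a summer weekday)
h2o.shap_explain_row_plot(london_bike_rf, test_data, row_index = 3000)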
In this case, we have the opposite: the temperature and humidity variables are contributing positively to a high value of bike usage (the prediction is ~2617).
Also, by analyzing the rows, you’ve probably noticed that the hour will be a very important feature — can you add that feature and train more models inside the h2o framework? Try it for yourself and add some h2o practice to your repertoire!
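As a starting hint, a minimal sketch of how that feature could be built, assuming timestamp parses cleanly with as.POSIXct:
# Extract the hour from the timestamp as a categorical feature
london_bike$hour <- as.factor(format(as.POSIXct(london_bike$timestamp), "%H"))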
Thank you for taking the time to read this post! I hope this guide helped you get started with h2o and that you can now test whether this library fits your data science pipelines. Although other libraries, such as caret, are more famous, h2o should not be discarded as a contender for performing machine learning tasks. The main advantages of h2o are:
interesting automl and explainability features;
the ability to run h2o tasks on remote servers through a local environment;
a vast array of different models to use and apply;
easy plug-and-play functions to perform hyperparameter search or model evaluation.
Feel free to check other guides I’ve written on R libraries, such as caret, ggplot2 or dplyr.
If you would like to drop by my R courses, feel free to join here (R Programming for Absolute Beginners) or here (Data Science Bootcamp). My R courses are suitable for beginners/mid-level developers and I would love to have you around!

The dataset used in this post is under the Open Government License terms and conditions, available at https://www.kaggle.com/hmavrodiev/london-bike-sharing-dataset