
R has many machine learning libraries you can use to train models. From caret to stand-alone libraries such as randomForest, rpart or glm, R provides a wide array of options when you want to perform data science tasks.
A curious library that you may have never heard of is h2o. An in-memory platform for distributed and scalable machine learning, h2o can run on powerful clusters whenever you need extra computing power. Another interesting bit about it is that it isn’t exclusive to R — you can also use the same API from Python, for instance.
Additionally, h2o is a very rich and diverse library. It contains so many features (ranging from training models to automl capabilities) that it’s easy to get a bit lost when using it, particularly due to the high number of methods and functions the package offers. That diversity of features is the main reason why I’ve written this blog post — I want to help you navigate the h2o interface!
In this blog post, we’ll cover some examples (with code) of h2o, namely:
train a couple of machine learning models;
do some hyperparameter tuning;
perform an automl routine;
take a glimpse at the explainability module.
Let’s start!
Loading the Data
For this blog post, we’ll use the London Bike Sharing Dataset — this dataset contains information about bike demand for the London Bike Sharing Programme. If you want to know more about it, visit the Kaggle link to check a description of the columns and how this data was generated.
In essence, this is a supervised learning problem, where we’ll want to predict the count of new bike rides per hour, based on a couple of features describing the day and the weather during a specific hour.
We can load the dataset using R’s read.csv:
london_bike <- read.csv('./london_merged.csv')
After loading the data, let’s just do a small check on our data types. If we call the str command on london_bike, we’ll see that:
'data.frame': 17414 obs. of 10 variables:
$ timestamp : chr "2015-01-04 00:00:00" "2015-01-04 01:00:00" "2015-01-04 02:00:00" "2015-01-04 03:00:00" ...
$ cnt : int 182 138 134 72 47 46 51 75 131 301 ...
$ t1 : num 3 3 2.5 2 2 2 1 1 1.5 2 ...
$ t2 : num 2 2.5 2.5 2 0 2 -1 -1 -1 -0.5 ...
$ hum : num 93 93 96.5 100 93 93 100 100 96.5 100 ...
$ wind_speed : num 6 5 0 0 6.5 4 7 7 8 9 ...
$ weather_code: num 3 1 1 1 1 1 4 4 4 3 ...
$ is_holiday : num 0 0 0 0 0 0 0 0 0 0 ...
$ is_weekend : num 1 1 1 1 1 1 1 1 1 1 ...
$ season : num 3 3 3 3 3 3 3 3 3 3 ...
Our dataframe only contains numeric columns (apart from timestamp, which we will not use) — as we want the algorithms to treat weather_code and season as categorical variables (at least in the first experiment), let’s convert them into factors:
london_bike$weather_code <- as.factor(london_bike$weather_code)
london_bike$season <- as.factor(london_bike$season)
Having the dataset loaded into R, let’s split our data into two by creating a train and test frame.
Train-Test Split
h2o has a convenient function to perform train-test splits. To get h2o running, we’ll need to load and initialize the library:
library(h2o)
h2o.init()
When we type h2o.init(), we are setting up a local h2o cluster. By default, h2o will use all available CPUs, but you can specify a specific number of CPUs to initialize using the nthreads argument.
This is one of the major differences from other machine learning libraries in R — to use h2o, we always need to start an h2o cluster. The advantage is that if you have an h2o instance running on a server, you can connect to that machine and use its computing resources without changing the code too much (you only need to point your init to another machine).
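As a small sketch of both options (the host name below is just a placeholder, not a real server):
# Start a local cluster with a fixed number of threads
h2o.init(nthreads = 4)
# Or connect to a remote h2o instance (hypothetical host, default port)
h2o.init(ip = "my-h2o-server.example.com", port = 54321)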
Using h2o.splitFrame, we can conveniently create a random train-test split of our data, but before that, we need to convert our dataframe into a special object that h2o can recognize:
london_bike.h2o <- as.h2o(london_bike)
First learning: h2o can’t deal with normal R dataframes, only with a special type of object called H2OFrame, so the step of converting dataframes using as.h2o is mandatory.
We are now ready to do our train-test split with our new london_bike.h2o object:
london_bike_split <- h2o.splitFrame(data = london_bike.h2o, ratios = 0.8, seed = 1234)
training_data <- london_bike_split[[1]]
test_data <- london_bike_split[[2]]
Using h2o.splitFrame, we can immediately divide our dataset into two different h2o frames — ratios defines the percentage we want to allocate to our training data. In the function above, we are using 80% of the dataset for training purposes, leaving 20% as a holdout set.
With our data split between test and train in h2o format, we’re ready to train our first h2o model!
Training a Model
Due to its simplicity, the first model we are going to train will be a linear regression. This model can be trained with the h2o.glm function, and first, we need to define the target and feature variables:
predictors <- c("t1", "t2", "hum", "wind_speed", "weather_code",
                "is_holiday", "is_weekend", "season")
response <- "cnt"
As the column cnt contains the number of bikes used in each hour, that’s the column we will use as our response/target.
To train the model, we can do the following:
london_bike_model <- h2o.glm(x = predictors,
y = response,
training_frame = training_data)
And... that’s it! With just three arguments, we can train our model:
x defines the names of the columns we will be using as features.
y defines the target column.
in the training_frame argument, we pass the training dataset.
We have our model ready! Let’s compare our predictions against the real values on the test set — we can conveniently use the h2o.predict function to get the predictions from our model:
test_predict <- h2o.predict(object = london_bike_model,
newdata = test_data)
object receives the model we want to apply to our data.
newdata receives the data where we will apply the model.
And then, we can cbind our predictions with the cnt column from the test set:
predictions_x_real <- cbind(
as.data.frame(test_data$cnt),
as.data.frame(test_predict)
)
We can quickly compare our predictions against the target with a plot:
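A minimal sketch of such a comparison plot, assuming you have ggplot2 installed (h2o.predict names its output column predict):
library(ggplot2)
# Predicted vs. real bike counts; points on the red diagonal are perfect predictions
ggplot(predictions_x_real, aes(x = cnt, y = predict)) +
  geom_point(alpha = 0.3) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  labs(x = "Real bike count", y = "Predicted bike count")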
Clearly, our model is overshooting our target a bit — let’s apply some regularization inside h2o.glm using the alpha parameter:
london_bike_model_regularized <- h2o.glm(x = predictors,
y = response,
training_frame = training_data,
alpha = 1)
In h2o.glm, alpha = 1 represents Lasso Regression. It doesn’t seem that our model improved much, and we probably need to do some more feature engineering or try other arguments of the linear regression (although it’s unlikely that this will improve our model by a lot).
In the library’s documentation, you’ll find a ton of parameters to tweak in this generalized linear model function. An important takeaway from this first training process is that h2o’s training implementations contain a lot of tweakable parameters we can experiment with and test.
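As a sketch of a couple of those extra knobs (the values below are illustrative assumptions, not tuned choices):
# Elastic net: alpha mixes the L1 and L2 penalties, lambda_search tries several strengths
london_bike_model_enet <- h2o.glm(x = predictors,
                                  y = response,
                                  training_frame = training_data,
                                  alpha = 0.5,
                                  lambda_search = TRUE)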
When evaluating our model, we’ve only done some visual “testing” by plotting our predictions against the real target. Of course, if we want to do a more scientific evaluation, we have all the famous regression and classification metrics available in the h2o framework!
Let’s see that, next.
Evaluating our Models
One of the cool things we can do with h2o is pass a test set directly into our model and use it to extract validation metrics. For instance, if we want to retrieve metrics from a test set, we can plug it directly into the validation_frame argument of any h2o trained model:
london_bike_model <- h2o.glm(x = predictors,
y = response,
training_frame = training_data,
validation_frame = test_data)
Passing this new validation_frame argument will give us the ability to extract metrics for both data frames really quickly — for instance, let’s get the root mean squared error of our model for both training and test (here called valid):
h2o.rmse(london_bike_model, train=TRUE, valid=TRUE)
   train    valid
936.5602 927.4826
Given that the mean of our target variable is 1143 bikes per hour, our model is not performing that well. We also see little evidence of overfitting, as the training and test errors are similar.
How can we change the metric we want to take a look at? We just tweak the h2o function! For instance, if we want to look at the r-squared, we just use h2o.r2:
h2o.r2(london_bike_model, train=TRUE, valid=TRUE)
    train     valid
0.2606183 0.2456557
Super simple! The upside is that you don’t have to worry about all the details on how to implement these metrics by yourself.
In this example, I am following a regression problem but, of course, you also have classification models and metrics available. For all the metrics available in h2o, check the following link.
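These metric helpers all share the same interface. For example, a quick sketch with the mean absolute error for the same trained model:
# Mean absolute error on both the training and validation frames
h2o.mae(london_bike_model, train = TRUE, valid = TRUE)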
It’s a bit expected that our linear regression isn’t performing that well — we haven’t done any feature engineering and we are probably violating too many linear regression assumptions. But if we can train simple linear regressions, we can probably train other types of models in h2o, right? That’s right! Let’s see that in the next section.
More Model Examples
As you may have guessed, if we change the h2o function associated with the training process, we will fit other types of models. Let’s train a random forest by calling h2o.randomForest:
london_bike_rf <- h2o.randomForest(x = predictors,
y = response,
ntrees = 25,
max_depth = 5,
training_frame = training_data,
validation_frame = test_data)
I’m setting two hyperparameters for my random forest in the function call:
ntrees sets the number of trees in the forest.
max_depth sets the maximum depth of each tree.
If you need, you can find all the tweakable parameters by calling help(h2o.randomForest) in the R console.
The cool thing is that we can now use what we’ve learned about model metrics on this new model, just by switching the model we feed into the metric function — for instance, to get the rmse of this model, we switch the first argument to london_bike_rf:
h2o.rmse(london_bike_rf, train=TRUE, valid=TRUE)
   train    valid
909.1772 900.5366
And to obtain the r2:
h2o.r2(london_bike_rf, train=TRUE, valid=TRUE)
train valid
0.3032222 0.2888506
Notice that our code practically didn’t change. The only thing that was modified was the model we fed into the first argument. This makes these metric functions highly adaptable to new models, as long as they are trained inside the h2o framework.
If you follow this link, you will find other models you can train with the library. From that list, let’s fit a neural network using h2o.deeplearning:
nn_model <- h2o.deeplearning(x = predictors,
y = response,
hidden = c(6,6,4,7),
epochs = 1000,
train_samples_per_iteration = -1,
reproducible = TRUE,
activation = "Rectifier",
seed = 23123,
training_frame = training_data,
validation_frame = test_data)
hidden is a very important argument in the h2o.deeplearning function. It takes a vector representing the number of hidden layers and neurons we will use in our neural network. In our case, we are using c(6,6,4,7): 4 hidden layers with 6, 6, 4 and 7 nodes, respectively.
Will our h2o.r2 function work just the same with neural networks? Let’s test:
h2o.r2(nn_model, train=TRUE, valid=TRUE)
train valid
0.3453560 0.3206021
It works!
Bottom line: most functions in the h2o framework carry over to other models. This is a very good feature, as it makes it easy to switch between models with less code, avoiding overly complex or error-prone development.
Another cool feature of h2o is that we can do hyperparameter tuning really smoothly — let’s see how, next.
Hyperparameter Tuning
Performing hyperparameter search is also super simple in h2o — you only need to know:
The model where you want to perform your search.
The names of the parameters available for that model.
The values you want to test for each parameter.
Remember that, in the random forest we’ve trained above, we’ve set ntrees and max_depth manually.
In the grid example, we’ll do a search on both parameters plus min_rows. We can do that by using the h2o.grid function:
# Grid Search
rf_params <- list(ntrees = c(2, 5, 10, 15),
                  max_depth = c(3, 5, 9),
                  min_rows = c(5, 10, 100))

# Train and validate a grid of randomForests
rf_grid <- h2o.grid("randomForest",
x = predictors,
y = response,
grid_id = "rf_grid",
training_frame = training_data,
validation_frame = test_data,
seed = 1,
hyper_params = rf_params)
We start by declaring rf_params, which contains the lists of values we will use in our grid search, and then we pass that grid into the hyper_params argument of h2o.grid. What h2o will do is train and evaluate every single combination of the hyperparameters available.
Seeing the results of our search is super easy:
h2o.getGrid(grid_id = "rf_grid",
sort_by = "r2",
decreasing = TRUE)
The h2o.getGrid function gives us a summary of the best hyperparameters according to a specific metric. In this case, we chose r2, but other metrics such as RMSE or MSE also work. Let’s look at the top 5 results of our grid search:
Hyper-Parameter Search Summary: ordered by decreasing r2
max_depth min_rows ntrees model_ids r2
1 9.00000 5.00000 15.00000 rf_grid_model_30 0.33030
2 9.00000 5.00000 10.00000 rf_grid_model_21 0.32924
3 9.00000 5.00000 5.00000 rf_grid_model_12 0.32573
4 9.00000 10.00000 15.00000 rf_grid_model_33 0.32244
5 9.00000 10.00000 10.00000 rf_grid_model_24 0.31996
Our best model was the one with a max_depth of 9, a min_rows of 5 and 15 ntrees — this model achieved an r2 of 0.3303.
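If you want to keep working with the winning model, a small sketch of how to retrieve it from the grid, using the grid’s model_ids slot:
# Fetch the sorted grid and grab the best model by its id
rf_grid_results <- h2o.getGrid(grid_id = "rf_grid", sort_by = "r2", decreasing = TRUE)
best_rf <- h2o.getModel(rf_grid_results@model_ids[[1]])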
The cool part? You can expand this grid to any hyperparameter available in ?h2o.randomForest or to any model available in the documentation of h2o, opening up an endless amount of possibilities with the same function.
AutoML Features
If you need a quick and raw way to look at how different models perform on your dataset, h2o also has an interesting automl routine:
aml <- h2o.automl(x = predictors,
y = response,
training_frame = training_data,
validation_frame = test_data,
max_models = 15,
seed = 1)
The max_models argument specifies the maximum number of models to be tested in a specific automl run. Keep in mind that the automl routine can take a while to run, depending on your resources.
We can access the top models of our routine by checking aml@leaderboard:
aml@leaderboard
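By default, printing the leaderboard only shows the first few rows. A quick sketch to print it in full:
# Print every row of the AutoML leaderboard
print(aml@leaderboard, n = nrow(aml@leaderboard))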
From the leaderboard above, we can see that a Stacked Ensemble model was the winner (at least in terms of rmse). We can also get more information on the best model by calling:
h2o.get_best_model(aml)

Model Details:
==============
H2ORegressionModel: stackedensemble
Model ID: StackedEnsemble_AllModels_1_AutoML_1_20221027_95503
Number of Base Models: 15

Base Models (count by algorithm type):
deeplearning  drf  gbm  glm
           4    2    8    1
The result of h2o.get_best_model(aml) gives more information about the model that achieved the best score in our automl routine. From the snippet above, we know that our ensemble aggregates the results of:
4 deep learning models;
2 random forests;
8 gradient boosting models;
1 generalized linear model;
Depending on your use case, the automl routine can be a quick-and-dirty way to understand how ensembles and single models behave on your data, giving you hints on where to go next. For example, it can help you decide whether more complex models are the way to go or whether you will probably need to acquire more training data / features.
Explainability
Finally, let’s take a look at some of h2o’s explainability modules. In this example, we’ll use the random forest model we’ve trained above:
london_bike_rf <- h2o.randomForest(x = predictors,
y = response,
ntrees = 25,
max_depth = 5,
training_frame = training_data,
validation_frame = test_data)
Just like in most machine learning libraries, we can grab the variable importance plot directly from the h2o interface:
h2o.varimp_plot(london_bike_rf)
From the importance plot, we notice that humidity (hum) and the temperatures (t1, t2) are the most important variables for our trained random forest. Calling h2o.varimp_plot immediately shows the importance plot for a specific model, without the need to configure anything else.
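If you prefer the raw numbers over the plot, a quick sketch using the matching table function:
# Variable importances as a table instead of a plot
h2o.varimp(london_bike_rf)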
Single-variable importances are not the only explainability models available in h2o — we can also check shap values quickly:
h2o.shap_summary_plot(london_bike_rf, test_data)
Voilà! From the shap_summary_plot, we understand the direction of the relationship between our features and the target. For instance:
higher temperatures explain more bike usage.
lower humidity also explains more bike usage.
In the case above, we’ve asked for a general shap explanation of our test_data, but we can also check individual row explainability using h2o — for instance, let’s see the 4th row of our test set:
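One quick way to inspect that row is converting it back to a regular R dataframe:
# Look at the 4th row of the test set
as.data.frame(test_data[4, ])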
Basically, this was a very cold day in January and this factor probably impacted bike usage in London, affecting the prediction of our model — let’s pass this row to a shap explainer:
h2o.shap_explain_row_plot(london_bike_rf, test_data, row_index = 4)
Interesting! Several features are pushing the prediction of bike usage down. For example:
A high value of humidity (93) contributes negatively to the prediction.
The fact that this day was a weekend means less commuting, which also pushes our prediction down.
Let’s see a summer weekday row:
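The call is analogous to the one above; the row index below is purely hypothetical, so pick any summer weekday row from your own split:
# Explain another row of the test set (hypothetical index of a summer weekday)
h2o.shap_explain_row_plot(london_bike_rf, test_data, row_index = 3000)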
In this case, we have the opposite: the temperature and humidity variables are contributing positively to a high value of bike usage (the prediction is ~2617).
Also, by analyzing the rows, you’ve probably noticed that the hour will be a very important feature — can you add that feature and train more models inside the h2o framework? Try it for yourself and add some h2o practice to your repertoire!
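As a starting hint, a minimal sketch of how that feature could be built, assuming timestamp parses cleanly with as.POSIXct:
# Extract the hour from the timestamp as a categorical feature
london_bike$hour <- as.factor(format(as.POSIXct(london_bike$timestamp), "%H"))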
Thank you for taking the time to read this post! I hope this guide helped you get started with h2o and that you can now test whether this library fits your data science pipelines. Although other libraries, such as caret, are more famous, h2o should not be discarded as a contender for performing machine learning tasks. The main advantages of h2o are:
interesting automl and explainability features;
the ability to run h2o tasks on remote servers through a local environment;
a vast array of different models to use and apply;
easy plug-and-play functions to perform hyperparameter search or model evaluation.
Feel free to check other guides I’ve written on R libraries, such as caret, ggplot2 or dplyr.
If you would like to drop by my R courses, feel free to join here (R Programming for Absolute Beginners) or here (Data Science Bootcamp). My R courses are suitable for beginners/mid-level developers and I would love to have you around!

The dataset used in this post is under the Open Government License terms and conditions, available at https://www.kaggle.com/hmavrodiev/london-bike-sharing-dataset