Hello,
This is a follow-up to my first blog post, where I looked at the Bergen bicycle data.
Today, I will create a ML-model to predict the duration of each bike ride. I will use the brand new collection of modelling packages called “tidymodels”, which is currently being developed by Max Kuhn and Davis Vaughan at Rstudio.
The goal is to create a model which can predict the duration of a bike ride from station X to station Y.
Filtering the data
I will remove trips to service stations as this is not representative of a normal bike ride. Moreover, trips over 3600 seconds will be removed - it is not legal to rent the bikes any longer than this, so these observations can certainly be classified as outliers. In addition, I remove rides that start and end up in the same place, as it doesn’t make sense to predict the duration of such rides.
df <- df %>%
filter(!(end_station_name %in% c("workshop", "UIP"))) %>%
filter(duration < 3600) %>%
filter(start_station_id != end_station_id)
Adding variables
Let’s start by adding some variables. Remember, this data set is very limited in terms of features, so we have to be a bit creative here.
First, we use the latitude and longitude to compute the difference between two stations. I will do this using the geosphere package.
df <- df %>%
mutate(distance = round(geosphere::distCosine(
cbind(start_station_longitude, start_station_latitude),
cbind(end_station_longitude, end_station_latitude)
)))
Let’s do a quick check on the density distribution to see if this seems reasonable.
ggplot(df, aes(x = distance)) +
geom_density(fill = "seagreen4", color = "black", alpha = 0.7, size = 0.9) +
geom_vline(xintercept = mean(df$distance, na.rm = T), linetype = "dashed", color = "red", size = 0.9)
We see that no trip is longer than 5 kilometers and the average length is 1 kilometer, which seems plausible.
Let’s also add weather-data to the mix. I will use a data set that simply contains the date and the millimeters of rain.
rain_data <- read_csv("rain.csv")
df <- df %>%
mutate(date = as_date(started_at)) %>%
left_join(rain_data, by = "date")
Again, I always recommend to use visualization to check if the data is reasonable, so if we plot the date and millimeters of rain against each other we would expect to see a peak around October.
ggplot(rain_data, aes(y = rain, x = date)) +
geom_point() +
geom_smooth()
Let’s look at the relationship between the duration and these two new variables we have added, rain and distance.
ggplot(df, aes(x = distance, y = duration)) +
geom_jitter(aes(color = rain), alpha = 0.5) +
geom_smooth()
As expected, we see that the average duration increases with the distance between the start and end station. Moreover, it appears that the days with a lot of rain (light blue) are over represented above the trend line - implying that these extra rainy days, the rides take longer than what one would expect based on the distance.
We also see that there are some longer rides in the bottom here with a duration of essentially 0 - I will remove these, as they appear to be data errors.
df <- df %>%
mutate(speed = distance / duration)
summary(df$speed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.02182 1.76499 2.35419 2.30129 2.88175 54.62903
df <- df %>%
filter(speed < 10)
Splitting the data and creating the recipe
To split the data in training and test, we can use the package rsample.
split <- initial_split(df)
train <- training(split)
test <- testing(split)
Next, we create a “recipe” using the excellent recipes-package. A recipe is a set of operations which can be trained on the training set and then easily applied to the test-set, avoiding contamination from training to test.
In a recipe you can do operations like pre-processing the data (e.g. impute missing or scaling the variables), but you can also add variables.
# Create the recipe that specifies which operations we want to do.
rec <- recipe(duration ~ started_at + rain + distance + end_station_id + start_station_id, data = df) %>%
step_date(started_at) %>%
step_mutate(hour_started = hour(started_at)) %>%
step_holiday(started_at) %>%
step_num2factor(start_station_id, end_station_id) %>%
step_meanimpute(rain)
# Train the recipe on the training set.
prep_rec <- prep(rec, training = train)
# Bake the data (i.e. apply the recipe and get the final datasets)
mod_train <- bake(prep_rec, new_data = train)
mod_test <- bake(prep_rec, new_data = test)
str(mod_train)
## Classes 'tbl_df', 'tbl' and 'data.frame': 73236 obs. of 13 variables:
## $ started_at : POSIXct, format: "2018-06-29 10:45:12" "2018-06-29 11:01:02" ...
## $ duration : int 1197 490 506 518 636 1754 537 1777 258 637 ...
## $ start_station_id : Factor w/ 33 levels "116","117","131",..: 21 27 21 4 8 24 32 1 27 21 ...
## $ end_station_id : Factor w/ 33 levels "116","117","131",..: 32 8 31 24 32 27 19 19 32 27 ...
## $ distance : num 330 1405 655 1228 898 ...
## $ rain : num 8.74 8.74 8.74 8.74 8.74 ...
## $ started_at_dow : Factor w/ 7 levels "Sun","Mon","Tue",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ started_at_month : Factor w/ 12 levels "Jan","Feb","Mar",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ started_at_year : num 2018 2018 2018 2018 2018 ...
## $ hour_started : int 10 11 11 11 12 12 12 12 12 12 ...
## $ started_at_LaborDay : num 0 0 0 0 0 0 0 0 0 0 ...
## $ started_at_NewYearsDay : num 0 0 0 0 0 0 0 0 0 0 ...
## $ started_at_ChristmasDay: num 0 0 0 0 0 0 0 0 0 0 ...
I added a few variables here:
- step_date automatically added day, month and year
- I used step_mutate to get the hour started, as well as an indicator for whether the ride started and ended up at the same place.
- step_holiday adds indicator for holidays (not terribly useful as the Norwegian holidays are seemingly not integrated)
Creating the model
Now we are ready to create a model to predict duration. I will use the parsnip-package with an xgboost-specification for this.
model <- boost_tree(mode = "regression", learn_rate = 0.3, tree_depth = 12, min_n = 1, sample_size = 0.8) %>%
set_engine("xgboost") %>%
fit(duration ~
start_station_id
+ end_station_id
+ hour_started
+ distance
+ rain
+ started_at_dow
+ started_at_month,
data = mod_train)
Now let’s evaluate our model using RMSE and the yardstick-package.
mod_test$predicted_duration <- predict(model, new_data = mod_test, type = "numeric")$.pred
rmse <- mod_test %>%
yardstick::rmse(duration, predicted_duration) %>%
print()
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 398.
Note that in yardstick, it is also possible to specify several metrics at once, to get other metrics such as R-squared (rsq), the concordance correlation coefficient (ccc) or mean average percentage error (mape).
multi_metric <- metric_set(rsq, ccc, mape)
mm <- mod_test %>%
multi_metric(duration, predicted_duration) %>%
print()
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rsq standard 0.331
## 2 ccc standard 0.507
## 3 mape standard 41.8
The model we created has a RMSE of 398, which seems a bit high, and the R-squared is only 0.33. I’m sure this could be improved even further with some carefully tuned hyperparameters, which I might cover in a later blog post.
At this point, the drawbacks of tidymodels is becoming clear:
- There is no well documented method to tune hyperparameters (the “dials”-package appears unfinished)
- There doesn’t seem to be any method to get variable importance.
- A lot of documentation is not finished, so it is quite difficult to figure out simple tasks such as how to change number of iterations for xgboost.
Let’s try a random forest model instead, this time using the native “ranger”-model instead of the parsnip-wrapper. The main advantage of random forest compared to e.g. xgboost is that it requires almost no tuning of hyperparameters and works “out of the box”.
library(ranger)
ranger_model <- ranger(duration ~
start_station_id
+ end_station_id
+ hour_started
+ distance
+ rain
+ started_at_dow
+ started_at_month,
data = mod_train,
mtry = 3,
importance = "impurity",
num.trees = 1000)
mod_test$predicted_duration_rf <- predict(ranger_model, data = mod_test)$prediction
rmse <- mod_test %>%
yardstick::rmse(duration, predicted_duration_rf) %>%
print()
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 378.
Here we see that we reach a lower RMSE with ranger - which is not really surprising, given the lack of tuning.
Let’s look at the relationship between our prediction and the actual duration.
ggplot(mod_test, aes(x = duration, y = predicted_duration_rf)) +
geom_jitter(alpha = 0.7) +
geom_abline(linetype = "dashed", color = "red") +
xlim(c(0, 3000)) +
ylim(c(0, 3000))
Ideally, the points here would be evenly distributed around the red line, which represents a perfect prediction. We see that the model clearly struggles at predicting both the really short rides, as well as the long ones.
However, I’m also quite certain that this a problem that cannot be predicted with a high degree of precision - after all, we have no way of knowing what people are actually doing with the bikes, and we have no variables that might predict whether they will stop at the grocery store mid-ride or just ride as fast as possible.
Either way, let’s look at which variables were the most important in explaining the duration of a bike ride:
importance(ranger_model) %>%
enframe() %>%
ggplot(aes(x = fct_reorder(name, value), y = value)) +
geom_bar(stat = "identity", fill = "seagreen4", color = "black") +
coord_flip() +
labs(x = "Variable")
It is perhaps not surprising to see that the distance is the dominant factor to explain duration. Hour started is also quite far up there (probably due to traffic).
Note that the end and start station are still interesting even after controlling for distance, because there certainly will be different conditions between different stations.
Wrapping up
We have made two simple models using xgboost and ranger, predicting the duration of bike rides with a root mean square error of less than 400 seconds.
Throughout the analysis, we have used a set of different tidymodels packages, namely:
- Rsample for splitting the data
- Recipes for preprocessing the data and creating new variables
- Parsnip for defining and training the model
- Yardstick for evaluating the models
In my opinion, the “tidymodels” collection of packages is extremely promising, but some of the packages appear unfinished (such as parsnip, but no wonder - it is in version 0.0.1 after all!). I expect that this will be the go-to package for modelling in R in a year or so, but for now, it seems a bit immature.
As for predicting the duration of bike rides, there is no doubt that one could achieve lower RMSE than what I accomplished with these quick models. For instance, one could add:
- More detailed weather data, particularly with regards to wind.
- Traffic data (from cars)
- Trend data, e.g. “average duration of a bike ride from x to y last hour”. This specification would be able to quickly pick up changes in riding conditions, for instance due to construction work.
Finally, I would like to thank Markus Mortensen (PwC) for providing inputs to this blog post.