Today I will discuss a few common errors in statistical modelling and machine learning, and how to avoid them. Admittedly, I have made some of these mistakes myself, and others I have simply observed (more than once).
1. The time-travelling model
If we were actually capable of time-travelling, we wouldn’t have to spend so much time creating predictive models! (and we would have to find another job)
However, I have often observed people creating models that time-travel by mistake. By this, I mean that the model is supposed to use information available at time T to predict what happens at time T + 1, but due to errors in how the data set was constructed, it ends up being trained on data that already includes information that was not yet known at time T.
Now, this may sound like an issue that shouldn’t happen too often, but it does. And not only is it common, it is also absolutely catastrophic and may end up ruining your entire model. Say that your model is built as:
\(Y(T+1) = \beta_1X_1(T) + \beta_2X_2(T)\)
but it turns out \(X_1\) is actually not known at time T, so you end up feeding the model data based on \(X_1(T+1)\). The model will then overestimate the effect of \(\beta_1\), since the leaked variable looks far more predictive than it can ever be in production, and, depending on how the variables are correlated, over- or underestimate the effect of \(\beta_2\). Once the model is used for real predictions, with only \(X_1(T)\) available, its performance will be much worse than the training results suggested.
So, in conclusion, my advice is to always go through the available variables and consider whether each of them is actually known at prediction time (and reflect on when “prediction time” actually will be for your model - will it be used to forecast 1 month in advance? 1 day?).
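To make this concrete, here is a minimal sketch of how one might build explicitly lagged features with dplyr, so that the model only ever sees values that were known at prediction time. The data frame monthly_data, its columns and the one-month horizon are hypothetical and purely for illustration:
library(dplyr)
# monthly_data is a hypothetical data frame with columns: month, y, x1, x2
training_data <- monthly_data %>%
  arrange(month) %>%
  mutate(
    x1_lag = lag(x1),  # the value that was already known one period earlier
    x2_lag = lag(x2)
  ) %>%
  filter(!is.na(x1_lag), !is.na(x2_lag))
# The regression now only uses information available at time T
lagged_model <- lm(y ~ x1_lag + x2_lag, data = training_data)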
“Time traveling is just too dangerous.”
— Dr. Emmett Brown, Back to the Future
2. Interpreting invalid p-values
Another unfortunate mistake I have observed several times is the practice of interpreting p-values that originate from an LM/GLM model whose assumptions are severely violated.
At its worst, I have seen a model where every observation was repeated 10 times (for whatever reason). This was then fed into a logistic regression model, and its creator proudly announced that he had created a model where all the variables he tested came out as highly statistically significant. This unlikely scenario caught my attention, and the error was spotted.
Now, let’s have a look at the mistake he made. In a logistic regression, each observation is assumed to be the outcome of an independent Bernoulli trial. Obviously, repeating the same observation 10 times severely violates this independence assumption.
Many practitioners do not care about violating these assumptions, since it does not necessarily hurt a model’s predictive power. That may be true, but a model with violated assumptions cannot be interpreted in the standard statistical way; in particular, the p-values cannot be trusted in this scenario.
Let’s have a look at a simulated example to see what happened. First, we create some simulated data and fit a regression. We expect the p-values to be high, as the y variable is generated completely independently of the predictors.
library(tidyverse)  # attaches tibble, dplyr and ggplot2, which are used throughout

set.seed(1337)

# Generate the data: a random binary outcome and three unrelated predictors
n <- 100000
df <- tibble(
  y  = rbinom(n, size = 1, prob = 0.5),
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n)
)

# Create a model
model <- glm(y ~ ., data = df, family = binomial)
summary(model)
##
## Call:
## glm(formula = y ~ ., family = binomial, data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.196 -1.175 -1.161 1.180 1.201
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.006073 0.006325 -0.960 0.337
## x1 -0.006526 0.006310 -1.034 0.301
## x2 -0.007356 0.006310 -1.166 0.244
## x3 -0.006816 0.006318 -1.079 0.281
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138629 on 99999 degrees of freedom
## Residual deviance: 138625 on 99996 degrees of freedom
## AIC: 138633
##
## Number of Fisher Scoring iterations: 3
As we can see, none of the variables turned out to be significant.
Now let’s make the same mistake this data analyst made, and replicate each observation 10 times.
# Replicate every observation 10 times (the mistake)
df_rep <- df %>%
  slice(rep(row_number(), 10))

# Create a model on the replicated data
model_rep <- glm(y ~ ., data = df_rep, family = binomial)
summary(model_rep)
##
## Call:
## glm(formula = y ~ ., family = binomial, data = df_rep)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.196 -1.175 -1.161 1.180 1.201
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.006073 0.002000 -3.036 0.002395 **
## x1 -0.006526 0.001995 -3.271 0.001073 **
## x2 -0.007356 0.001995 -3.686 0.000228 ***
## x3 -0.006816 0.001998 -3.411 0.000646 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1386285 on 999999 degrees of freedom
## Residual deviance: 1386249 on 999996 degrees of freedom
## AIC: 1386257
##
## Number of Fisher Scoring iterations: 3
Ouch! The model now claims that all the variables are highly significant. Do note, however, that the coefficient estimates are not altered at all, so as long as the p-values are not used for anything, we are fine. Unfortunately, this data analyst wanted to use the p-values for variable selection and to inform business decisions.
Note: This happens because the standard errors of the estimates are underestimated. With every observation replicated 10 times, the model behaves as if it had 10 times as much independent evidence, so each standard error shrinks by a factor of roughly \(\sqrt{10} \approx 3.16\) (compare the two summaries above), and every small, random pattern in either direction appears far more robust than it really is.
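We can verify this directly by comparing the standard errors from the two fits above; their ratio is almost exactly \(\sqrt{10}\):
# Ratio between the standard errors of the original fit and the replicated fit
se_single <- coef(summary(model))[, "Std. Error"]
se_rep    <- coef(summary(model_rep))[, "Std. Error"]
round(se_single / se_rep, 2)  # all four values are approximately 3.16
sqrt(10)                      # 3.162278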
Now this is a bit of an odd case, but there are lots of other issues which might lead to p-values that should not be trusted, for example:
- Multicollinearity (a quick check is sketched after this list)
- Endogeneity
- “p-hacking”
- Auto-correlation
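For the first item on the list, a common quick diagnostic is the variance inflation factor. The sketch below assumes the car package is installed and reuses the simulated df from above, with an extra column that is almost a copy of x1:
library(car)  # for vif(); assumed to be installed
# Add a near-duplicate of x1 to induce multicollinearity
df_collinear <- df %>%
  mutate(x4 = x1 + rnorm(n, sd = 0.01))
collinear_model <- glm(y ~ ., data = df_collinear, family = binomial)
vif(collinear_model)  # values far above ~5-10 for x1 and x4 signal trouble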
In conclusion, be careful around these p-values!
3. Underestimating the power of visualization
Another common mistake, which I certainly have done myself, is to underestimate how important visualization is.
While we all would like to believe that the only thing we need to create good models is to see the data flash by our eyes in binary code, Matrix-style, this is not really how our cognitive abilities function.
Visualizing your data may help you:
- Detect errors in your data sets, e.g. by spotting unnatural spikes in the histogram of a variable (see the sketch below)
- Find transformations that make the data more suitable for modelling, e.g. a log-transformation of a skewed variable
- Discover new interaction variables or other hand-crafted features
Hence, you should always visualize your data before and during the modelling phase. If you wait until it’s time to document your model, you might be in for a surprise. Hadley Wickham’s “tidy workflow” chart is an excellent guideline here, see: https://r4ds.had.co.nz/explore-intro.html.
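As a minimal illustration of the first bullet above, a fine-grained histogram makes unnatural spikes (a default value, a capped measurement, a unit mix-up) easy to spot. This sketch reuses the simulated df from section 2 and ggplot2 from the tidyverse loaded earlier:
# A histogram with many bins reveals artificial spikes in a variable
ggplot(df, aes(x = x1)) +
  geom_histogram(bins = 100)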
4. Spending too much time choosing which ML-algorithm to use
I have seen a few cases where a data scientist has tested an excessive number of algorithms in order to solve a fairly simple problem.
In my experience, and the results from various ML-competitions on Kaggle back this up, you can follow some very simple rules for algorithm selection:
- If your data is structured and reasonably large, use xgboost or similar (possibly an ensemble); a minimal sketch follows below.
- If you have very little data or need a completely transparent model, use glm/glmnet/lm.
- If your data is unstructured, the choice becomes considerably harder, but use a suitable variant of deep learning (CNN/LSTM/GAN, etc.)
(disclaimer: this is of course very simplified)
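To show how little code the first rule typically requires, here is a minimal sketch with the xgboost R package on the simulated df from section 2. The package is assumed to be installed, and the parameters are placeholders rather than tuned recommendations:
library(xgboost)  # assumed to be installed
# Build the training matrix from the numeric predictors and the binary label
dtrain <- xgb.DMatrix(data = as.matrix(df[, c("x1", "x2", "x3")]), label = df$y)
# Fit a small gradient-boosted model; tune the parameters properly in real use
bst <- xgb.train(
  params = list(objective = "binary:logistic", max_depth = 3, eta = 0.1),
  data = dtrain,
  nrounds = 50
)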
Even though most books on machine learning spend a lot of time on older algorithms such as KNN and SVM, these algorithms are a bit outdated and rarely turn out to be the best. Hence, spending time testing these algorithms is most often a waste, in my opinion. If anyone can prove me wrong, I would love to hear an example, though!
Either way, in most real-world problems, proper feature engineering and data preparation are more important than model selection. It doesn’t really matter which algorithm you choose if you feed it horrible data.