In the last installment of this blog series, we discussed objectives and accuracy in machine learning. And we described two crucial tests for the utility of a machine learning model: The model must be sufficiently accurate and we must be able to deploy the model so that it can produce actionable outputs from the available data. We then introduced a real-world scenario — predicting train failures up to 36 hours in advance of their occurrence using sensor data — to illustrate the application of those tests.
But how did we decide which of the multitude of machine learning algorithms to use to train our model in the first place? To answer this question, we need to revisit the main classes of machine learning algorithms.
As we explained in the second installment of this blog, machine learning algorithms mainly fall into two categories: supervised learning algorithms and unsupervised learning algorithms (for the purposes of simplicity, we will ignore additional categories like semi-supervised learning and reinforcement learning). There are many algorithms available in each of these categories that can be used for either prediction/classification (in the case of supervised learning) or clustering/segmentation (in the case of unsupervised learning).
With supervised learning, labeled data from the past is used to train a model that can then be used to predict future, similar events. If the label is a continuous variable (e.g., the revenue of a certain product or the number of products sold), algorithms like regression, regression trees, random forests or neural networks can be used. If the label is a categorical variable (e.g., true or false), techniques like logistic regression, naïve Bayes classifiers, decision trees or the k-nearest neighbour algorithm are useful.
Unsupervised learning, on the other hand, operates on unlabeled data. Typically, we use unsupervised methods to identify structures and patterns in data that we didn’t know existed before — a process that is often termed “discovery analytics”. If the input data is numerical, the most common set of techniques is cluster analysis. If we are looking at categorical inputs, algorithms like association or affinity analysis can be used, for example, to discover which products are frequently bought in combination with one another in the course of different shopping missions.
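To make the shopping example a little more concrete, here is a minimal sketch of the simplest form of affinity analysis: counting how often pairs of products appear in the same basket. The function name `frequent_pairs`, the `min_support` parameter and the toy baskets are all our own illustrative assumptions. Production association-rule algorithms do considerably more than this.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return {pair: support} for item pairs that appear together in at
    least `min_support` (a fraction between 0 and 1) of the baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicate items and gives every pair
        # a single canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    n_baskets = len(baskets)
    return {
        pair: count / n_baskets
        for pair, count in pair_counts.items()
        if count / n_baskets >= min_support
    }


# Toy baskets, invented purely for illustration.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]
pairs = frequent_pairs(baskets, min_support=0.5)
print(pairs)
```

With these toy baskets, only bread and butter are bought together often enough to clear the 50% support threshold.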
Choose your machine learning algorithm
But how do we decide which of these algorithms is most useful for a given problem? The answer is to start with the problem, or rather with the business question that we want to answer through the application of machine learning. What is it that we want to achieve? And how will we measure the success, or otherwise, of our analysis? Accuracy may be (and often is) one of the important success criteria, but several others often matter just as much. Are the modeling results stable over time (“robustness”)? How long does it take to build and test the model (“speed”)? Can the model handle growing data volumes (“scalability”)? Is the model using as few parameters as possible (“simplicity”)? How easy are the model results/patterns to understand (“interpretability”)? And the list goes on.
Considering these criteria — and understanding which of them are most important in a particular situation — can go a long way towards helping you select the “best” algorithm with which to go after a particular problem.
For example, if we compare a decision tree with an artificial neural network (ANN) for a particular domain using these criteria, we might identify the trade-offs illustrated schematically below.
The ANN may give us better predictive accuracy compared with the decision tree, but the decision tree scores better for simplicity and interpretability. Depending on the business problem and the context, then, we may decide to use a tree — even though the predictions likely won’t be as accurate — if presenting our findings and helping a business stakeholder to understand the relationships found in the data are key considerations for that particular analysis.
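One way to make this kind of trade-off explicit is to score each candidate against the criteria and weight the criteria according to what matters for the project. The sketch below is only an illustration: the scores are hypothetical numbers on a 0 to 1 scale that we made up for the example, not measurements, and the names `weighted_score` and `best_candidate` are our own.

```python
CRITERIA = ["accuracy", "robustness", "speed", "scalability",
            "simplicity", "interpretability"]


def weighted_score(scores, weights):
    """Weighted average of per-criterion scores (each on a 0-1 scale)."""
    total_weight = sum(weights[c] for c in CRITERIA)
    return sum(scores[c] * weights[c] for c in CRITERIA) / total_weight


def best_candidate(candidates, weights):
    """Name of the candidate with the highest weighted score."""
    return max(candidates,
               key=lambda name: weighted_score(candidates[name], weights))


# Hypothetical scores, invented purely to illustrate the trade-off.
candidates = {
    "decision_tree": {"accuracy": 0.7, "robustness": 0.6, "speed": 0.9,
                      "scalability": 0.7, "simplicity": 0.9,
                      "interpretability": 0.9},
    "ann": {"accuracy": 0.9, "robustness": 0.7, "speed": 0.4,
            "scalability": 0.7, "simplicity": 0.3,
            "interpretability": 0.2},
}

# Two hypothetical projects: one where accuracy dominates, and one where
# explaining the findings to a stakeholder matters most.
accuracy_first = {**dict.fromkeys(CRITERIA, 1.0), "accuracy": 10.0}
insight_first = {**dict.fromkeys(CRITERIA, 1.0), "interpretability": 3.0}

print(best_candidate(candidates, accuracy_first))
print(best_candidate(candidates, insight_first))
```

With these made-up scores, the accuracy-dominated weighting favours the ANN and the interpretability-led weighting favours the tree. The point is not the particular numbers but that the “best” algorithm changes with the weights.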
The last two success criteria, simplicity and interpretability, are especially interesting to consider, as they are often in tension with one of the subjects of our previous blog — accuracy. And this finally brings us to the title of this blog: Occam’s razor and machine learning.
William of Ockham, a Franciscan friar who studied logic in the 14th century, gave his name to the principle sometimes called lex parsimoniae, which is Latin for “the law of parsimony”. William of Ockham supposedly wrote it in Latin as: “Entia non sunt multiplicanda praeter necessitatem”, which roughly translates as “More things should not be used than are necessary”.
Applied in the context of machine learning, this means that if two algorithms have broadly similar performance for the criteria identified as the most important for a particular project — accuracy and stability, say — we should always prefer the “simpler” one.
But what does “simpler” mean in this context? We think that simpler should generally be taken to mean the algorithm that is least complex to deploy (because, for example, it uses fewer variables that have required less feature engineering to create) and that is easiest to interpret.
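Expressed as a selection rule, the idea might look like the sketch below. It assumes that “broadly similar performance” means hold-out accuracy within a chosen tolerance of the best candidate, and that the number of input variables is a reasonable proxy for complexity. Both are simplifying assumptions of ours, as are the class and function names and the example figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float    # hold-out accuracy, between 0 and 1
    n_variables: int   # assumed proxy for model complexity


def occam_choice(candidates, tolerance=0.01):
    """Among candidates whose accuracy is within `tolerance` of the best,
    return the one that uses the fewest variables."""
    best_accuracy = max(c.accuracy for c in candidates)
    close_enough = [c for c in candidates
                    if best_accuracy - c.accuracy <= tolerance]
    # Fewest variables first; break ties in favour of higher accuracy.
    return min(close_enough, key=lambda c: (c.n_variables, -c.accuracy))


# Hypothetical candidates, invented for illustration.
models = [
    Candidate("ann", accuracy=0.920, n_variables=40),
    Candidate("tree_12_vars", accuracy=0.915, n_variables=12),
    Candidate("tree_3_vars", accuracy=0.850, n_variables=3),
]
print(occam_choice(models, tolerance=0.01).name)
```

Here the 12-variable tree is chosen: it is within one percentage point of the ANN, whereas the 3-variable tree gives up too much accuracy to count as “broadly similar”. Setting the tolerance to zero would select the ANN.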
Let’s revisit our example of using machine learning to predict train failures 36 hours in advance of their occurrence. Recall that accuracy, which we measured by calculating type one and type two errors on the hold-out data set, was an important success criterion for the project. But of equal importance in this particular case was gaining an improved understanding of the root causes of train failures so that component and system design could be improved and so that we could establish if failures were the result of specific operating conditions that could be avoided in future. Had accuracy alone been the most important criterion, we might have used an ANN or a deep learning approach. In fact, to ensure that we could address the second criterion, we started with a relatively simple decision tree algorithm. In short, we traded some model accuracy for results that were easy for our client to understand and to interpret.
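For readers who want to see what measuring type one and type two errors on a hold-out set involves, here is a minimal sketch. It assumes the usual convention that a “positive” is a predicted failure, so a type one error is a false alarm and a type two error is a missed failure. The function name and the toy labels are ours and are not taken from the project.

```python
def error_rates(actual, predicted):
    """Type one and type two error rates for boolean labels,
    where True means 'failure'.

    type_one: share of non-failures wrongly flagged (false alarms).
    type_two: share of real failures that were missed.
    """
    pairs = list(zip(actual, predicted))
    false_positives = sum(1 for a, p in pairs if not a and p)
    false_negatives = sum(1 for a, p in pairs if a and not p)
    negatives = sum(1 for a in actual if not a)
    positives = sum(1 for a in actual if a)
    return {
        "type_one": false_positives / negatives if negatives else 0.0,
        "type_two": false_negatives / positives if positives else 0.0,
    }


# Toy hold-out labels, invented for illustration (True = failure).
actual = [True, True, False, False, False, False, True, False]
predicted = [True, False, False, True, False, False, True, False]
rates = error_rates(actual, predicted)
print(rates)
```

In this toy hold-out set, one of the five non-failures is wrongly flagged and one of the three real failures is missed.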
Occam’s razor guided us not only in the choice of algorithm for this project, but also in how we trained it: we built several trees using different numbers of (transformed) sensor readings and settled on the model that met our accuracy threshold whilst using the fewest variables.
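A search of this kind can be sketched as a simple loop. The version below assumes the candidate variables have already been ranked, and that an `evaluate` function trains a model on a given subset and returns its hold-out accuracy. Both are assumptions made for the illustration, and the variable names and accuracy figures are hypothetical stand-ins rather than results from the project.

```python
def smallest_sufficient_model(ranked_variables, evaluate, threshold):
    """Try the top 1, 2, 3, ... variables in turn and return the first
    (subset, accuracy) whose accuracy meets `threshold`, or None if no
    subset is accurate enough."""
    for k in range(1, len(ranked_variables) + 1):
        subset = ranked_variables[:k]
        accuracy = evaluate(subset)
        if accuracy >= threshold:
            return subset, accuracy
    return None


# Hypothetical sensor variables, listed in an assumed order of importance.
variables = ["axle_temp", "brake_pressure", "vibration_rms", "door_cycles"]

# Stand-in for "train a tree on this subset and score it on the hold-out
# set". The accuracies are invented for illustration.
hypothetical_accuracy = {1: 0.71, 2: 0.80, 3: 0.88, 4: 0.89}


def evaluate(subset):
    return hypothetical_accuracy[len(subset)]


result = smallest_sufficient_model(variables, evaluate, threshold=0.85)
print(result)
```

With these invented figures the search stops at three variables: the fourth adds a little accuracy, but the threshold has already been met. In a real project, `evaluate` would wrap the actual training and hold-out scoring, and trying only nested, ranked subsets is just one of several reasonable search strategies.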
Note that these kinds of trade-offs typically only make sense when the accuracy of the simpler model is at least in the same range as that of the more complex one. In a different scenario — one in which half a percentage point in accuracy might translate into millions of dollars of additional revenue or cost savings, say — you might happily opt for the model that is harder to interpret, or that requires more time to develop and deploy.
We’ve said it before, and we’ll no doubt say it again on this blog: You should always start any machine learning project with a relentless focus on the business question that you want to answer and by formulating the key success criteria for the analysis. Assuming all other key criteria are (roughly) equal, then apply Occam’s razor and choose the model that is simplest to interpret, to explain, to deploy and to maintain.
In other words, prefer the simplest model that is sufficiently accurate, but ensure that you know the problem space well enough to know what “sufficiently accurate” means in practice. As Einstein, perhaps Occam’s greatest disciple, is often quoted as saying: “Everything should be made as simple as possible, but not simpler”.