Support vector machines (SVMs) are some of the most important and widely used machine learning algorithms available today. They're used for supervised classification problems. As a reminder, supervised learning refers to algorithms that learn from training data comprising both input vectors and expected outputs. Classification describes problems in which the algorithm must separate data points into predicted categories.

How Do SVMs Work?

To better understand SVMs, let's take a look at an example. In this example, we want to predict the type of vehicle that a given person owns, based on their income and the distance between their home and place of work. For simplicity, we will include only two vehicle options: a bicycle and a car. Figure 1 shows the distribution of data collected by analyzing individuals, and as the graph makes clear, higher-income individuals who live farther away from their workplace are more likely to own a car than a bike. Like other classification algorithms, such as logistic regression or random forest, SVMs attempt to create a mathematical rule capable of dividing the data into the two categories. Figure 2 shows how this rule is represented by the dashed line. However, as we see in Figure 3, there are multiple ways to separate the data. To resolve this ambiguity, SVMs find the best separating line by maximizing the orthogonal distance between the nearest points of each category and the boundary defined by the dashed line. These distances, called margins (M), are shown in Figure 4. The data points used to define the margin (for SVMs, these are the points of each category closest to the boundary) are called support vectors. Once the ideal function is determined, the algorithm is able to predict whether someone will travel by car or by bicycle.
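The example above can be sketched in a few lines of scikit-learn. The data below is synthetic, invented purely to mirror the income/commute-distance scenario; the feature values and labels are not from the article.

```python
# Hedged sketch: a linear SVM separating car vs. bicycle owners on
# made-up (income, commute distance) data.
import numpy as np
from sklearn.svm import SVC

# Features: [income in $1000s, commute distance in km] (illustrative values)
X = np.array([
    [25, 2], [30, 3], [28, 1], [35, 4],       # bicycle owners
    [80, 15], [95, 20], [70, 18], [110, 25],  # car owners
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # 0 = bicycle, 1 = car

clf = SVC(kernel="linear")
clf.fit(X, y)

# The support vectors are the training points closest to the boundary;
# they alone determine the maximum-margin line.
print(clf.support_vectors_)
print(clf.predict([[40, 5], [90, 22]]))  # expected: [0 1]
```

Because the maximum-margin boundary depends only on the support vectors, adding more points far from the boundary would not change the learned rule.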

So, why are support vector machine algorithms so important?

In the car/bicycle example above, we could have used another model, such as logistic regression, and found almost the same result. But this doesn't mean that SVMs aren't valuable. There are several machine learning problems for which simple models like logistic regression simply do not work, but support vector machines do.

To dive deeper, let's make a few changes to the previous example. Suppose we no longer have the data for distance to the office, but instead have age information for every data point. Figure 5 shows the new scatter plot with age and income as variables. It is clear that people with lower age and lower income tend to use bicycles, but as age increases, they usually use a car regardless of their income. In this case, a linear separation (as in the previous example) no longer makes sense for predicting vehicle type. This is where SVMs come in, outperforming algorithms like logistic regression that cannot model nonlinear boundaries.

The mathematical trick here consists of transforming the original feature space (what is represented in Figure 5), which requires a nonlinear separation, into a feature space that can be separated by a linear function, as in Figure 1. To achieve this, SVM algorithms leverage the concept of distance to find the best margins and the new separation in the new feature space, as Figure 6 shows.

For data that can be linearly separated, we use the familiar concept of Euclidean distance (the linear kernel) to measure the margin M:

$$K(x_i, x_j) = x_i \cdot x_j$$

However, by extending the interpretation of distance and using other metrics, we can define nonlinear distance as follows, using the polynomial kernel, with $d$ representing the degree of the kernel:

$$K(x_i, x_j) = (1 + x_i \cdot x_j)^d$$

Another metric used to define nonlinear distance is the Gaussian kernel, which uses the parameter $\sigma$, associated with the variance of the kernel:

$$K(x_i, x_j) = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right)$$

Finally, what happens when we use these new interpretations of distance to measure the separation between the boundary and the support vectors? We again have the power to accurately predict which vehicle a person is going to own, as Figure 7 shows.

Support vector machines are very powerful algorithms applicable to a wide range of classification problems. Some common uses are transactional fraud detection, measuring the risk of a login session, and pattern recognition.
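The three kernels can be written out directly as functions of two input vectors. This is a minimal sketch: the example vectors and the hyperparameter values (degree d, width sigma) are arbitrary choices for illustration, not values from the article.

```python
# The linear, polynomial, and Gaussian kernels, implemented directly.
import numpy as np

def linear_kernel(x, z):
    # Ordinary dot product: the "distance" notion behind a linear boundary.
    return np.dot(x, z)

def polynomial_kernel(x, z, d=3):
    # (1 + x.z)^d, where d is the degree of the kernel.
    return (1 + np.dot(x, z)) ** d

def gaussian_kernel(x, z, sigma=1.0):
    # exp(-||x - z||^2 / (2 sigma^2)); sigma controls the kernel's width.
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([2.0, 1.0])
print(linear_kernel(x, z))           # 4.0
print(polynomial_kernel(x, z, d=2))  # (1 + 4)^2 = 25.0
print(gaussian_kernel(x, z))         # exp(-1) ≈ 0.368
```

Each kernel returns a similarity score between two points; swapping the kernel is what lets the same SVM machinery draw linear or nonlinear boundaries.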

Summary

• Great performance on nonlinearly separable data, compared with linear models like logistic regression
• Once a kernel is chosen, SVMs are relatively easy to tune, with fewer hyperparameters than complex models like random forests or neural networks
• They scale well with large datasets
• Regularization is easy to include in the optimization process
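The last point above is visible in practice: in scikit-learn's SVC, the C parameter controls regularization strength (smaller C means stronger regularization and a wider, more tolerant margin). The dataset and the C values below are arbitrary, chosen only to illustrate the effect.

```python
# Hedged sketch: varying the regularization parameter C of an SVM.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

for C in (0.01, 1.0, 100.0):
    model = SVC(kernel="rbf", C=C).fit(X, y)
    # Stronger regularization (small C) typically keeps more support vectors.
    print(f"C={C}: support vectors = {model.n_support_.sum()}, "
          f"train accuracy = {model.score(X, y):.2f}")
```

Tuning a kernel SVM often comes down to just C plus one kernel parameter, which is what makes the "easy to tune" point in the summary hold up.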