Logistic regression, decision trees and support vector machines
Classification is one of the major problems we encounter in commercial business across a wide range of industries. In this article, we pick three of the main techniques from the several that are available and discuss them: Logistic Regression, Decision Trees and Support Vector Machines (SVM).
All of the algorithms listed above are used to solve classification problems (SVM and decision trees are also used for regression, but that is outside the scope of our discussion). I have seen people ask many times which one they should choose for their problem. The classic answer, and also the most correct one, is "it depends." That answer does not satisfy the person asking, and it really is frustrating. So I decided to talk about what exactly it depends on.
This explanation is based on a very simplified 2-D problem, but it should be enough to help readers understand data that is harder to picture in higher dimensions.
I'll start with the most important question: what on earth are we trying to do in a classification problem? Obviously, we are trying to classify. (Is that a serious question? Really?) Let me rephrase it. To classify, we try to find a decision boundary, a line or a curve (it doesn't have to be straight), that separates the two classes in the feature space.
"Feature space" sounds like a fancy term, and it easily confuses newcomers. Let me explain it with an example. I have a sample with three variables: x1, x2 and target. Target takes two values, 0 and 1, depending on the values of the predictor variables x1 and x2. Let me plot this data on the coordinate axes.
This is the feature space, with the observations distributed in it. Because we have only two predictor variables (features), the feature space is two-dimensional. You will notice that the two classes are marked with points of different colors. I would like our algorithm to compute a line or curve that separates the classes.
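To make the setup concrete, here is a minimal sketch of a sample like the one described: two predictors and a 0/1 target. The function name `make_data`, the square range [-1, 1] and the circle x1^2 + x2^2 < 0.5 are assumptions made for illustration; the article's own data is not available.

```python
import random

def make_data(n=300, seed=0):
    """Hypothetical sample: target is 1 inside the circle x1^2 + x2^2 < 0.5, else 0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        target = 1 if x1 * x1 + x2 * x2 < 0.5 else 0
        rows.append((x1, x2, target))
    return rows

rows = make_data()
inside = sum(target for _, _, target in rows)

# Each row is one observation: a point (x1, x2) in the feature space plus its class.
for x1, x2, target in rows[:5]:
    print(f"x1={x1:+.3f}  x2={x2:+.3f}  target={target}")
print("points with target 1:", inside, "of", len(rows))
```

Plotting x1 against x2 and coloring each point by target would give a picture like the one described here: one class clustered in the middle, the other surrounding it.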
By visual inspection, the ideal decision boundary (the separating curve) is a circle. The shape of the actual decision boundary is where logistic regression, decision trees and SVM differ.
Let's start with logistic regression. Many people misunderstand the shape of its decision boundary. The misunderstanding arises because, most of the time logistic regression is mentioned, people are shown the famous "S"-shaped curve.
The blue curve above is not a decision boundary. It is a transformation of the response in the binary logistic regression model. The decision boundary of logistic regression is always a straight line (or a plane, or a hyperplane in higher dimensions). The best way to convince you of this is to show the logistic regression equation that everyone is familiar with:
log(p / (1 - p)) = F
For simplicity, we assume that F is a linear combination of all the predictor variables.
The above equation can also be written as:
p = 1 / (1 + e^(-F))
When you use the model to predict, you apply a cutoff to the predicted probability: a probability above the cutoff is predicted as 1, otherwise 0. If we call the cutoff c, the decision process becomes:
Y = 1 if p > c, otherwise 0. This finally gives the decision boundary as F > constant.
F > constant is a linear decision boundary. The result of logistic regression on our sample data looks like this.
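The step from the cutoff on p to "F > constant" can be written out explicitly. This is a sketch that assumes the standard logistic form and two predictors; the coefficient names \beta_0, \beta_1, \beta_2 are introduced here for illustration.

```latex
\begin{gather*}
p = \frac{1}{1 + e^{-F}}, \qquad F = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \\
p > c
\;\Longleftrightarrow\; 1 + e^{-F} < \frac{1}{c}
\;\Longleftrightarrow\; e^{-F} < \frac{1 - c}{c}
\;\Longleftrightarrow\; F > \log\frac{c}{1 - c}, \qquad 0 < c < 1
\end{gather*}
```

So the constant is \log\frac{c}{1-c}, and with c = 0.5 the rule reduces to F > 0. The boundary \beta_0 + \beta_1 x_1 + \beta_2 x_2 = \log\frac{c}{1-c} is a straight line in the (x_1, x_2) plane whatever cutoff is chosen.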
You will find that the result is not good. No matter what you do, the decision boundary produced by logistic regression is always linear, so it cannot produce the circular boundary that is needed here. Logistic regression is therefore suited to classification problems that are close to linearly separable. (You can transform the variables to make the classes linearly separable, but we won't discuss that case here.)
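The poor fit can be reproduced with a small sketch. It assumes a synthetic sample with a circular class boundary (the hypothetical `make_data` below) and fits the model with plain batch gradient descent, which is a stand-in for whatever solver a real library would use.

```python
import math
import random

def make_data(n=300, seed=0):
    """Hypothetical sample: target is 1 inside the circle x1^2 + x2^2 < 0.5, else 0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        rows.append((x1, x2, 1 if x1 * x1 + x2 * x2 < 0.5 else 0))
    return rows

def sigmoid(f):
    """Numerically stable logistic function p = 1 / (1 + e^-f)."""
    if f >= 0:
        return 1.0 / (1.0 + math.exp(-f))
    e = math.exp(f)
    return e / (1.0 + e)

def fit_logistic(rows, lr=1.0, epochs=500):
    """Batch gradient descent on the log-loss, with F = b0 + b1*x1 + b2*x2."""
    b0 = b1 = b2 = 0.0
    n = len(rows)
    for _ in range(epochs):
        g0 = g1 = g2 = 0.0
        for x1, x2, y in rows:
            err = sigmoid(b0 + b1 * x1 + b2 * x2) - y
            g0 += err
            g1 += err * x1
            g2 += err * x2
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
        b2 -= lr * g2 / n
    return b0, b1, b2

rows = make_data()
b0, b1, b2 = fit_logistic(rows)

# A cutoff of c = 0.5 on p is the same as the linear rule F > 0.
predictions = [1 if b0 + b1 * x1 + b2 * x2 > 0 else 0 for x1, x2, _ in rows]
accuracy = sum(p == y for p, (_, _, y) in zip(predictions, rows)) / len(rows)
print("training accuracy:", round(accuracy, 3))
```

Because any straight line leaves a large part of one class on the wrong side, the accuracy should stay close to what you would get by always predicting the larger class.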
Next, let's see how a decision tree handles this kind of problem. As we all know, a decision tree is built from a hierarchy of rules. Take our data as an example.
If you think about it carefully, decision rules of the form x2 < const or x2 > const, and x1 < const or x1 > const, simply split the feature space with straight lines parallel to the axes, as shown in the figure below.
We can make the tree more complex by increasing its size, using more and more partitions to approximate the circular boundary.
Ha ha! It is getting closer to a circle, which is very good. If you keep increasing the size of the tree, you will notice that the decision boundary keeps approximating the circular region with axis-parallel lines. So, if the boundary is nonlinear and can be approximated by repeatedly cutting the feature space into rectangles, a decision tree is a better choice than logistic regression.
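The effect of growing the tree can be sketched with a tiny hand-written tree. This is a minimal illustration, not a production implementation: the data generator `make_data`, the Gini impurity criterion and the depth values are all assumptions chosen for the example.

```python
import random

def make_data(n=300, seed=0):
    """Hypothetical sample: target is 1 inside the circle x1^2 + x2^2 < 0.5, else 0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        rows.append((x1, x2, 1 if x1 * x1 + x2 * x2 < 0.5 else 0))
    return rows

def gini(rows):
    """Gini impurity of the target column."""
    if not rows:
        return 0.0
    p = sum(r[2] for r in rows) / len(rows)
    return 2 * p * (1 - p)

def best_split(rows):
    """Try every rule 'x_j < const' and keep the one with the lowest weighted impurity."""
    best = None
    for j in (0, 1):
        values = sorted({r[j] for r in rows})
        for lo, hi in zip(values, values[1:]):
            const = (lo + hi) / 2
            left = [r for r in rows if r[j] < const]
            right = [r for r in rows if r[j] >= const]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, j, const, left, right)
    return best

def build(rows, depth):
    """Grow the tree recursively; a leaf is just the majority class (0 or 1)."""
    majority = 1 if 2 * sum(r[2] for r in rows) >= len(rows) else 0
    if depth == 0 or gini(rows) == 0.0:
        return majority
    split = best_split(rows)
    if split is None:
        return majority
    _, j, const, left, right = split
    return (j, const, build(left, depth - 1), build(right, depth - 1))

def predict(tree, x):
    """Follow the axis-parallel rules down to a leaf."""
    while isinstance(tree, tuple):
        j, const, left, right = tree
        tree = left if x[j] < const else right
    return tree

def accuracy(tree, rows):
    return sum(predict(tree, r) == r[2] for r in rows) / len(rows)

rows = make_data()
scores = {depth: accuracy(build(rows, depth), rows) for depth in (1, 2, 4, 6)}
for depth, score in scores.items():
    print("depth", depth, "training accuracy", round(score, 3))
```

Every rule in the tree compares a single feature with a constant, so each leaf is a rectangle in the feature space. A deeper tree simply has more rectangles with which to trace the circle.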
Next, let's look at the result of SVM. SVM maps your feature space into a kernel space in which the classes become linearly separable. A simpler explanation is that SVM adds an extra dimension to the feature space in such a way that the classes become linearly separable. When this decision boundary is mapped back to the original feature space, it becomes a nonlinear decision boundary. The chart below explains this more clearly than I can.
As you can see, once the samples have been given an extra dimension in some way, we can split the data with a plane (a linear classifier). When that plane is mapped back to the original two-dimensional feature space, you get a circular decision boundary.
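The "extra dimension" idea can be sketched by hand. This is not an SVM solver: the lifting map z = x1^2 + x2^2 is picked by hand for this hypothetical circular sample, whereas a real SVM gets a comparable effect implicitly through its kernel.

```python
import math
import random

def make_data(n=300, seed=0):
    """Hypothetical sample: target is 1 inside the circle x1^2 + x2^2 < 0.5, else 0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        rows.append((x1, x2, 1 if x1 * x1 + x2 * x2 < 0.5 else 0))
    return rows

rows = make_data()

# Lift every point into 3-D by adding z = x1^2 + x2^2 (hand-picked, not a learned kernel).
lifted = [(x1, x2, x1 * x1 + x2 * x2, y) for x1, x2, y in rows]

# In the lifted space a horizontal plane z = t separates the classes.
# Put t midway between the closest points of the two classes along z.
z_inside = max(z for _, _, z, y in lifted if y == 1)
z_outside = min(z for _, _, z, y in lifted if y == 0)
t = (z_inside + z_outside) / 2

predictions = [1 if z < t else 0 for _, _, z, _ in lifted]
accuracy = sum(p == y for p, (_, _, _, y) in zip(predictions, lifted)) / len(lifted)

# Mapped back to (x1, x2), the plane z = t is the circle x1^2 + x2^2 = t.
radius = math.sqrt(t)
print("training accuracy:", accuracy)
print("radius of the boundary in the original space:", round(radius, 3))
```

The classifier is linear in (x1, x2, z), yet its boundary in the original two features is a circle, which is the behavior described above.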
Look how well SVM does on our data set.
Note: the decision boundary is not a perfect circle, but it is very close to one (probably a polygon). We drew it as a circle to keep things simple.
The differences should now be clear, but one question remains: when dealing with multidimensional data, which algorithm should you choose, and when? This question matters because once the data has more than three dimensions, there is no easy way to visualize it.