Logistic Regression, Decision Trees, and Support Vector Machines
Classification is one of the major problems we encounter across a wide range of industries in commercial business. In this article, we pick three of the main techniques out of the many available and discuss them: Logistic Regression, Decision Trees, and Support Vector Machines (SVM).
All of the algorithms listed above are used to solve classification problems (SVM and decision trees are also used for regression, but that is outside the scope of this discussion). I have seen people ask many times which method they should choose for their problem. The classic, and also the most correct, answer is "it depends." That answer rarely satisfies the person asking, and it can be genuinely frustrating. So I decided to talk about what exactly it depends on.
This explanation is based on a very simplified two-dimensional problem, but it should be enough to help readers understand the harder case of higher-dimensional data.
I'll start with the most important question: what exactly are we doing in a classification problem? Obviously, we are classifying. (Is that a serious question? Really?) Let me rephrase. To classify, we try to find a decision boundary, a line or a curve (it doesn't have to be straight), that separates the two classes in the feature space.
The term "feature space" sounds impressive and easily confuses newcomers. Let me explain it with an example. I have a sample with three variables: x1, x2, and target. Target takes two values, 0 and 1, depending on the values of the predictor variables x1 and x2. I will plot the data on a pair of axes.
This is the feature space, with the observations distributed in it. Because we have only two predictor variables (features), the feature space is two-dimensional. You will notice that the two classes of sample points are marked with different colors. I would like our algorithm to compute a line or curve that separates the classes.
By visual inspection, the ideal decision boundary (separating curve) is a circle. The differences between logistic regression, decision trees, and SVM come down to the shape of the decision boundary that each one actually produces.
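To make the feature space concrete, here is a minimal sketch that generates a sample like the one described: two predictors, x1 and x2, and a 0/1 target whose classes are separated by a circle. The data behind the original plot is not given, so the value ranges, the radius, the seed, and the function name make_circle_data are all assumptions chosen for illustration.

```python
import random


def make_circle_data(n=200, radius=1.0, seed=0):
    """Return n rows of (x1, x2, target).

    Hypothetical stand-in for the article's sample: target is 1 when the
    point lies inside a circle of the given radius around the origin,
    and 0 otherwise.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.uniform(-2.0, 2.0)
        x2 = rng.uniform(-2.0, 2.0)
        target = 1 if x1 * x1 + x2 * x2 < radius * radius else 0
        rows.append((x1, x2, target))
    return rows


data = make_circle_data()
n_pos = sum(t for _, _, t in data)
print(f"{len(data)} points: {n_pos} in class 1, {len(data) - n_pos} in class 0")
```

Each row is one observation, and the pair (x1, x2) is its position in the two-dimensional feature space.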
Let's start with logistic regression. Many people misunderstand the decision boundary of logistic regression. The misunderstanding arises because, whenever logistic regression is mentioned, people are usually shown the famous "S"-shaped curve.
The blue curve above is not a decision boundary. It is a transformation of the binary response that logistic regression models. The decision boundary of logistic regression is always a straight line (or a plane, or a hyperplane in higher dimensions). The best way to convince you of this is to look at the logistic regression equation everyone is familiar with, where p is the predicted probability that the target is 1:
p = 1 / (1 + e^(-F))
For simplicity, let us assume that F is a linear combination of all the predictor variables, for example F = b0 + b1*x1 + b2*x2.
The above equation can also be written as:
log(p / (1 - p)) = F
When predicting, you apply a cutoff to the predicted probability: a probability above the cutoff is predicted as 1, otherwise as 0. If the cutoff is denoted by c, the decision process becomes:
Y = 1 if p > c, otherwise 0.
Because p increases with F, the condition p > c is equivalent to F > log(c / (1 - c)), so the decision boundary ultimately takes the form F > constant.
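The step from a cutoff on the probability to a cutoff on F can be checked numerically. The sketch below assumes the standard logistic relationship between p and F; the cutoff value 0.3 and the grid of F values are arbitrary choices for illustration.

```python
import math


def sigmoid(f):
    """Map a score F to a probability p = 1 / (1 + e^(-F))."""
    return 1.0 / (1.0 + math.exp(-f))


def logit(p):
    """Inverse of the sigmoid: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))


c = 0.3                 # cutoff on the probability (arbitrary example value)
threshold = logit(c)    # the equivalent cutoff on the score F

# Thresholding p at c and thresholding F at logit(c) give the same decision.
scores = [i * 0.5 for i in range(-10, 11)]
agree = all((sigmoid(f) > c) == (f > threshold) for f in scores)
print(f"cutoff on p: {c}, cutoff on F: {threshold:.4f}, decisions agree: {agree}")
```

Since the cutoff on F is just a number, the set of points with F equal to that number is a straight line whenever F is linear in x1 and x2.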
Because F is a linear combination of the predictors, the condition F > constant describes a linear decision boundary. Applied to our sample data, logistic regression gives a result like this.
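As a rough illustration of that result, the sketch below fits a logistic regression with plain batch gradient descent on a synthetic sample whose classes are separated by a circle. The data generator, the learning rate, the number of epochs, and the names make_circle_data and fit_logistic are all assumptions; the original data and fitting procedure are not given.

```python
import math
import random


def sigmoid(f):
    """Numerically stable logistic function."""
    if f >= 0:
        return 1.0 / (1.0 + math.exp(-f))
    e = math.exp(f)
    return e / (1.0 + e)


def make_circle_data(n=200, radius=1.0, seed=0):
    """Hypothetical sample: target is 1 inside a circle, 0 outside."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.uniform(-2.0, 2.0), rng.uniform(-2.0, 2.0)
        rows.append((x1, x2, 1 if x1 * x1 + x2 * x2 < radius * radius else 0))
    return rows


def fit_logistic(rows, lr=0.1, epochs=500):
    """Fit F = b0 + b1*x1 + b2*x2 by gradient descent on the log loss."""
    b0 = b1 = b2 = 0.0
    n = len(rows)
    for _ in range(epochs):
        g0 = g1 = g2 = 0.0
        for x1, x2, t in rows:
            err = sigmoid(b0 + b1 * x1 + b2 * x2) - t  # gradient of log loss w.r.t. F
            g0 += err
            g1 += err * x1
            g2 += err * x2
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
        b2 -= lr * g2 / n
    return b0, b1, b2


rows = make_circle_data()
b0, b1, b2 = fit_logistic(rows)


def predict(x1, x2, c=0.5):
    """Predict 1 when the probability exceeds the cutoff c, else 0."""
    return 1 if sigmoid(b0 + b1 * x1 + b2 * x2) > c else 0


accuracy = sum(predict(x1, x2) == t for x1, x2, t in rows) / len(rows)
print(f"F = {b0:.3f} + {b1:.3f}*x1 + {b2:.3f}*x2, training accuracy = {accuracy:.2f}")
```

Whatever coefficients the fit produces, the boundary is the straight line where F equals the cutoff's constant. A straight line cannot wrap around a circular class, so on data like this the accuracy tends to stay close to what always predicting the larger class would give.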