Asked By : user2237160
Answered By : Kyle Jones
How do you apply the weight vector to the sample to get a classification out of it for 3 classes (not just two).
If you have $k$ possible classes then you must train $k$ logistic classifiers on your training data, with each classifier producing as output the probability that a given sample is a member of the class that classifier is trained to recognize. Once training is complete, to classify a new sample you let all the trained classifiers try to recognize it and accept the judgment of the classifier that produces the highest probability value. This is known as one-vs-all (or one-vs-rest) classification and is a standard way to use binary classifiers to do multiclass classification.

The weights are applied the same way in classifying new data as they are when you're training the classifier. You have a weight vector $\theta$ containing coefficients that, when combined with a sample's feature vector $x$, yield a hypothesis value. For linear regression, $\theta^T x$ (expanded: $\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots$) is the hypothesis value. For logistic regression, $\theta^T x$ is used as the exponent in the logistic function to produce the hypothesis value: \begin{equation} h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}} \end{equation} Here $e$ is the usual natural logarithm base, $\approx 2.71828$. $h_\theta(x)$ is the probability that the sample $x$ is a member of the class the classifier recognizes. Compute $h_\theta(x)$ for each classifier and accept the verdict of the classifier that produces the largest probability value.
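As a rough sketch of the one-vs-rest prediction step in Python (the function names `hypothesis` and `predict_class` are my own, not from the answer; $x$ is assumed to carry a leading 1 for the intercept term $\theta_0$):

```python
import math

def sigmoid(z):
    # logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = sigmoid(theta^T x); theta[0] is the intercept,
    # so x is assumed to start with a constant 1
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

def predict_class(thetas, x):
    # one-vs-rest: run every trained classifier over the sample and
    # return the index of the class whose classifier reports the
    # highest probability
    probs = [hypothesis(theta, x) for theta in thetas]
    return max(range(len(probs)), key=lambda c: probs[c])
```

Here `thetas` would hold one trained weight vector per class, so `predict_class` returns the index of the winning classifier.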
How exactly do you find the weight vector for logistic regression (using either gradient descent or Newton's method, whichever is easier)?
Gradient descent works the same way for logistic regression as it does for linear regression. You’re still trying to minimize the cost function by iteratively nudging the weights to better values using partial derivatives of the cost function. The hypothesis function is different for logistic regression, but the way it is used in gradient descent is the same. Once you’ve written code to do gradient descent for linear regression, you should be able to just plug in a different hypothesis function and have it work for logistic regression. Pseudocode for one iteration of gradient descent:
newtheta := theta
learning_rate := 0.01
for k := 1 to n
    sum := 0
    for i := 1 to m
        sum := sum + (hypothesis(x[i], theta) - y[i]) * x[i][k]
    end
    nudge := sum * learning_rate
    newtheta[k] := newtheta[k] - nudge
end
theta := newtheta
x is a matrix containing your training data, one sample per row. y is a vector containing the correct classification prediction for each sample, 1 if the sample is in the class, 0 otherwise. m is the number of samples. n is the number of features. The idea is that you would repeat this process until you reach some minimum and acceptable cost (error) over the training set.
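The pseudocode above can be turned into a minimal runnable Python version. This is my own sketch, not the answerer's code: it follows the same simultaneous-update loop, and it divides the summed gradient by $m$ (a common convention that the pseudocode omits; the factor can equally be absorbed into the learning rate). Each row of `x` is assumed to start with a 1 for the intercept weight:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(x_row, theta):
    # logistic hypothesis h_theta(x) = sigmoid(theta^T x)
    return sigmoid(sum(t * xi for t, xi in zip(theta, x_row)))

def gradient_descent_step(x, y, theta, learning_rate=0.01):
    # one iteration of batch gradient descent for logistic regression;
    # theta is updated simultaneously, so all partial derivatives are
    # computed against the old theta before any weight changes
    m = len(x)          # number of training samples
    n = len(theta)      # number of weights (intercept + features)
    new_theta = theta[:]
    for k in range(n):
        total = sum((hypothesis(x[i], theta) - y[i]) * x[i][k]
                    for i in range(m))
        new_theta[k] = theta[k] - learning_rate * total / m
    return new_theta
```

Calling `gradient_descent_step` in a loop until the cost stops improving reproduces the "repeat until acceptable error" process described above.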
Question Source : http://cs.stackexchange.com/questions/31940