Asked By : user2237160
Answered By : Kyle Jones
How do you apply the weight vector to the sample to get a classification out of it for 3 classes (not just two).
If you have $k$ possible classes then you must train $k$ logistic classifiers on your training data, with each classifier producing as output the probability that a given sample is a member of the class that classifier is trained to recognize. Once training is complete, to classify a new sample you let all the trained classifiers try to recognize it and accept the judgment of the classifier that produces the highest probability value. This is known as one-vs-all (or one-vs-rest) classification and is a standard way to use binary classifiers to do multiclass classification.

The weights are applied the same way in classifying new data as they are when you're training the classifier. You have a weight vector $\theta$ containing coefficients that, when combined with a sample's feature vector $x$, yield a hypothesis value. For linear regression, $\theta^T x$ (expanded: $\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots$) is the hypothesis value. For logistic regression, $\theta^T x$ is used as the exponent in the logistic function to produce the hypothesis value: \begin{equation} h_\theta(x) = \dfrac{1}{1 + e^{-\theta^T x}} \end{equation} Here $e$ is the usual natural logarithm base, $\approx 2.71828$. $h_\theta(x)$ is the probability that the sample $x$ is a member of the class the classifier recognizes. Compute $h_\theta(x)$ for each classifier and accept the verdict of the classifier that produces the largest probability value.
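As a rough sketch of the one-vs-rest prediction step in Python (the function names `hypothesis` and `predict_class` are my own, not from the answer; $x$ is assumed to carry a leading 1 for the intercept term $\theta_0$):

```python
import math

def sigmoid(z):
    # logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = sigmoid(theta^T x); theta[0] is the intercept,
    # so x is assumed to start with a constant 1
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

def predict_class(thetas, x):
    # one-vs-rest: run every trained classifier over the sample and
    # return the index of the class whose classifier reports the
    # highest probability
    probs = [hypothesis(theta, x) for theta in thetas]
    return max(range(len(probs)), key=lambda c: probs[c])
```

Here `thetas` would hold one trained weight vector per class, so `predict_class` returns the index of the winning classifier.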
How exactly do you find the weight vector for logistic regression (using either gradient descent or Newton's method, whichever is easier)?
Gradient descent works the same way for logistic regression as it does for linear regression. You’re still trying to minimize the cost function by iteratively nudging the weights to better values using partial derivatives of the cost function. The hypothesis function is different for logistic regression, but the way it is used in gradient descent is the same. Once you’ve written code to do gradient descent for linear regression, you should be able to just plug in a different hypothesis function and have it work for logistic regression. Pseudocode for one iteration of gradient descent:
newtheta := theta
learning_rate := 0.01
for k := 1 to n
    sum := 0
    for i := 1 to m
        sum := sum + (hypothesis(x[i], theta) - y[i]) * x[i][k]
    end
    nudge := sum * learning_rate
    newtheta[k] := newtheta[k] - nudge
end
theta := newtheta
x is a matrix containing your training data, one sample per row. y is a vector containing the correct classification prediction for each sample, 1 if the sample is in the class, 0 otherwise. m is the number of samples. n is the number of features. The idea is that you would repeat this process until you reach some minimum and acceptable cost (error) over the training set.
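The pseudocode above can be turned into a minimal runnable Python version. This is my own sketch, not the answerer's code: it follows the same simultaneous-update loop, and it divides the summed gradient by $m$ (a common convention that the pseudocode omits; the factor can equally be absorbed into the learning rate). Each row of `x` is assumed to start with a 1 for the intercept weight:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(x_row, theta):
    # logistic hypothesis h_theta(x) = sigmoid(theta^T x)
    return sigmoid(sum(t * xi for t, xi in zip(theta, x_row)))

def gradient_descent_step(x, y, theta, learning_rate=0.01):
    # one iteration of batch gradient descent for logistic regression;
    # theta is updated simultaneously, so all partial derivatives are
    # computed against the old theta before any weight changes
    m = len(x)          # number of training samples
    n = len(theta)      # number of weights (intercept + features)
    new_theta = theta[:]
    for k in range(n):
        total = sum((hypothesis(x[i], theta) - y[i]) * x[i][k]
                    for i in range(m))
        new_theta[k] = theta[k] - learning_rate * total / m
    return new_theta
```

Calling `gradient_descent_step` in a loop until the cost stops improving reproduces the "repeat until acceptable error" process described above.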
Question Source : http://cs.stackexchange.com/questions/31940