CMSC 678 Fall 2022 - Homework 2

Due at the start of class on Thursday October 6


Question 1 (30 points)

Starting with a weight vector that is all 0's, show the Perceptron algorithm running on the following dataset:

Instance    x         y
1           (1, 2)     1
2           (1, 1)    -1
3           (2, 2)     1
4           (2, 1)    -1

Run through the instances in order from 1 to 4, and for each update give the following information:

Note that the weight vector after the update becomes the weight vector before the update for the next instance. Be sure to include a weight w0, or bias term. You can pretend that each instance has a 1 prepended to it if that makes it easier conceptually.
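The update loop described above can be sketched in code. This is my own illustration of one common Perceptron variant (update on a mistake or a zero activation by adding y times the instance), not necessarily the exact variant covered in lecture, so check it against the lecture notes before relying on it. A 1 is prepended to each instance so that w[0] plays the role of the bias term w0.

```python
# Dataset from Question 1, with a 1 prepended to each instance for the bias.
data = [((1, 1, 2), 1),   # instance 1
        ((1, 1, 1), -1),  # instance 2
        ((1, 2, 2), 1),   # instance 3
        ((1, 2, 1), -1)]  # instance 4

w = [0, 0, 0]  # weight vector starts at all zeros; w[0] is the bias w0

for epoch in range(10):           # repeat passes until a pass makes no mistakes
    mistakes = 0
    for x, y in data:             # run through instances in order 1 to 4
        activation = sum(wi * xi for wi, xi in zip(w, x))
        if y * activation <= 0:   # mistake (or zero activation): update
            w = [wi + y * xi for wi, xi in zip(w, x)]
            mistakes += 1
    if mistakes == 0:             # converged: every instance classified correctly
        break

print(w)
```

Note that the question asks for the weight vector before and after each individual update, so the code above is only a way to check your hand-computed trace, not a substitute for it.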

Question 2 (40 points)

Your goal is to implement a multi-class Perceptron and apply it to the MNIST digits dataset. You should get the data and labels. Each row in the data file is 784 integers that represent the grayscale values of a 28x28 handwritten digit. Each row in the labels file is a single digit in the range 0-9 and is in a 1-to-1 correspondence with rows in the data file.

Randomly split the data into half for training and half for testing. Then implement a multi-class Perceptron as follows:
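The specific steps are given in the assignment; as a rough orientation, one standard multi-class Perceptron variant keeps one weight vector per class, predicts with an argmax, and on a mistake adds the instance to the true class's weights and subtracts it from the predicted class's weights. The sketch below is my own illustration of that variant on synthetic stand-in data (the real assignment reads 784-pixel rows from the data file); treat it as a hedged sketch, not the required implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the MNIST arrays: random "pixel" rows and random labels,
# just so the update rule itself is runnable. Replace with the real files.
n, d, k = 200, 784, 10
X = rng.integers(0, 256, size=(n, d)).astype(float)
y = rng.integers(0, k, size=n)

# Random half-and-half train/test split, as the assignment specifies.
perm = rng.permutation(n)
train, test = perm[: n // 2], perm[n // 2:]

Xb = np.hstack([X, np.ones((n, 1))])  # constant-1 feature gives each class a bias
W = np.zeros((k, d + 1))              # one weight vector per class, all zeros

for epoch in range(5):
    for i in train:
        pred = int(np.argmax(W @ Xb[i]))
        if pred != y[i]:              # mistake: pull true class up, push prediction down
            W[y[i]] += Xb[i]
            W[pred] -= Xb[i]

acc = float(np.mean([int(np.argmax(W @ Xb[i])) == y[i] for i in test]))
print(f"test accuracy: {acc:.3f}")
```

On the random labels used here the accuracy will be near chance; on real MNIST rows this variant typically does much better.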

Turn in:

Question 3 (10 points)

Prove that for a logistic regression model the log odds of class 1 is a linear function. That is, prove that log[p1(x;w)/p0(x;w)] = wx. Your proof should be mathematical, not in prose.
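For reference, the model this question assumes is the standard binary logistic regression parameterization (verify the exact form against the lecture notes):

```latex
p_1(x; w) = \frac{1}{1 + e^{-w \cdot x}},
\qquad
p_0(x; w) = 1 - p_1(x; w) = \frac{e^{-w \cdot x}}{1 + e^{-w \cdot x}}
```

The proof should start from these definitions and simplify the ratio inside the log.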

Question 4 (20 points)

Show that for a linearly separable dataset, the maximum likelihood solution for the logistic regression model is obtained by finding a weight vector w whose decision boundary wx = 0 separates the classes and then taking the magnitude of w to infinity. Make your argument as mathematical as possible.

Ridge regression is a method in which the standard loss for logistic regression is modified by adding a term proportional to the squared L2 norm (i.e., the sum of the squares of the individual weights) of the weight vector. Derive an expression for the overall gradient for ridge regression that is analogous to the gradient for normal logistic regression as shown on slide number 52 in the lecture notes. Explain how ridge regression helps overcome the problem illustrated in the first part of this question.
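The phenomenon from the first part of this question can be seen numerically. The sketch below is my own illustration (not part of the assignment, and not the derivation it asks for): gradient descent on a tiny linearly separable 1-D dataset, with and without an L2 penalty. Without the penalty the weight keeps growing the longer you train; with it, the weight settles at a finite value.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 1-D linearly separable data: x < 0 is class 0, x > 0 is class 1.
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def fit(lam, steps=20000, lr=0.1):
    """Gradient descent on logistic loss plus lam * w**2 (L2 penalty)."""
    w = 0.0
    for _ in range(steps):
        grad = np.sum((sigmoid(w * X) - y) * X) + 2 * lam * w
        w -= lr * grad
    return w

w_plain = fit(lam=0.0)   # keeps growing (roughly logarithmically) with more steps
w_ridge = fit(lam=0.1)   # converges to a finite weight

print(w_plain, w_ridge)
```

Running longer makes w_plain larger still, while w_ridge stays put; that contrast is exactly what your written explanation should account for.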