CMSC 478/678 Spring 2015 - Homework 1

Due at the start of class on February 17

Part 1: In this part of the assignment you will gain familiarity with WEKA, the Waikato Environment for Knowledge Analysis. WEKA is widely used in the machine learning and data mining communities because, among other things, it provides both a nice user interface to a number of standard algorithms and a Java API.

First, you must download WEKA from the following URL: http://www.cs.waikato.ac.nz/ml/weka/. The "Getting Started" section of that page has links for information on system requirements, how to download the software, and documentation. WEKA is written in Java and should run on any platform with Java 1.5 or higher.

Read about the Adult Census Income dataset, and get it in the form of an ARFF file. Then do the following:

Part 2: In this part of the homework you will implement k-means clustering and experiment with different ways of initializing the cluster centroids.

The MNIST dataset is a well-studied collection of handwritten digits. It is often used to test multi-class classification algorithms, where there is one class for each of the 10 digits (0 - 9). In this homework, you will use it for unsupervised clustering.

I've made two files available for you:

Implement the k-means clustering algorithm. You will only use your algorithm for this dataset, so you can hard-wire in the number of instances and the size of each instance. The goal is not to write a generic version of the algorithm (though you can if you wish); the goal is to understand how it works on real data. You will need to try different values of k, so k must be a parameter.
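As a starting point, the basic loop of k-means (assign each instance to its nearest centroid, then move each centroid to the mean of its members) can be sketched as follows. This is a minimal illustration, not a reference solution. The names kmeans and squared_distance, the max_iters and seed parameters, the choice of random instances as initial centroids, and the handling of empty clusters are all assumptions made for the sketch. It uses plain Python lists, so it will be slow on the full dataset.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iters=100, seed=0):
    """Cluster `points` (a list of equal-length numeric sequences) into k groups.

    Returns (centroids, assignments), where assignments[i] is the index of
    the centroid closest to points[i].
    """
    rng = random.Random(seed)
    # Assumed initialization: pick k distinct instances at random.
    # Other initialization schemes can be swapped in here.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None

    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]

    return centroids, assignments
```

For example, calling kmeans(points, k=2) on a small list of 2-D tuples returns two centroids and one cluster index per point. To experiment with initialization, only the line that builds the initial centroids needs to change.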

After completing the implementation (and testing for correctness, of course), do the following: