CMSC 478/678 Spring 2015 - Homework 1
Due at the start of class on February 17
Part 1:
In this part of the assignment you will gain familiarity with WEKA, the
Waikato Environment for Knowledge Analysis. WEKA is widely used in the
machine learning and data mining communities because, among other
things, it provides both a nice user interface to a number of standard
algorithms and a Java API.
First, you must download WEKA from the following URL:
http://www.cs.waikato.ac.nz/ml/weka/. The "Getting Started"
section of that page has links for information on system requirements,
how to download the software, and documentation. WEKA is written in
Java and should run on any platform with Java 1.5 or higher.
Read about
the Adult
Census Income dataset, and get it in the form of
an ARFF file. Then do the following:
- Build a decision tree (J48 classifier) with the default
parameters and report the (stratified cross-validation) accuracy.
- Now turn off pruning and report the accuracy. Inspect the output
of the algorithm. Has it overfit? How can you tell?
- Build a decision stump (a decision tree with a single split; you
can find it in the tree section of algorithms in WEKA) and
report the accuracy. Inspect the output of the algorithm. Has it
underfit? How can you tell?
Part 2:
In this part of the homework you will implement k-means clustering
and experiment with different ways of initializing the cluster
centroids.
The MNIST dataset is a
well-studied collection of handwritten digits. It is often used to
test multi-class classification algorithms, where there is one class
for each of the 10 digits (0 - 9). In this homework, you will use it for
unsupervised clustering.
I've made two files available for you:
- The raw MNIST data, which is a text file
containing 10,000 rows. Each row contains 28 * 28 = 784 integers in
the range 0 to 255.
Each integer is the pixel value from a 28 x 28 image of a
handwritten digit. Every row corresponds to a vector in the dataset
that is to be clustered.
- The labels for the raw data are in a file with
10,000 rows. The first row contains the correct digit label for the
first row in the raw data. The second row is the label for the
second instance, and so on.
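If you happen to work in Python with NumPy, loading the two files might look like the sketch below. The filenames are placeholders (use whatever names the provided files actually have), and it assumes the values are whitespace-separated; pass delimiter="," to np.loadtxt if the files turn out to be comma-separated.

```python
# Sketch of loading the two MNIST text files described above, using NumPy.
# The filenames passed in are placeholders; np.loadtxt's default delimiter
# is whitespace, so pass delimiter="," instead for comma-separated files.
import numpy as np

def load_mnist(data_path, labels_path):
    # Each row of the data file holds 28 * 28 = 784 pixel values (0-255);
    # row i of the label file holds the true digit for row i of the data.
    data = np.loadtxt(data_path, dtype=float)    # shape (10000, 784)
    labels = np.loadtxt(labels_path, dtype=int)  # shape (10000,)
    return data, labels
```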
Implement the k-means clustering algorithm. You will only use your
algorithm for this dataset, so you can hard-wire in the number of
instances and the size of each instance. The goal is not to write a
generic version of the algorithm (though you can if you wish). The
goal is to understand how it works on real data. You will need to try
different values of k, so that must be a parameter.
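As a point of reference (not a required design), the core loop can be sketched in Python with NumPy as below. Here convergence is declared when the cluster assignments stop changing, which is one common criterion; the function works for vectors of any dimension, so the 784-dimensional MNIST rows are just one input.

```python
# A minimal k-means (Lloyd's algorithm) sketch in NumPy. X is an (n, d)
# array of instances; init_centroids is a (k, d) array of starting points.
import numpy as np

def kmeans(X, init_centroids, max_iter=100):
    """Returns (assignments, final centroids, iterations run)."""
    centroids = init_centroids.astype(float).copy()
    k = centroids.shape[0]
    assign = np.full(X.shape[0], -1)
    for it in range(1, max_iter + 1):
        # Assignment step: each instance goes to its nearest centroid
        # under Euclidean distance.
        dists = np.stack([np.linalg.norm(X - c, axis=1) for c in centroids],
                         axis=1)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):
            return assign, centroids, it  # assignments stable: converged
        assign = new_assign
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:  # leave an empty cluster's centroid alone
                centroids[j] = members.mean(axis=0)
    return assign, centroids, max_iter
```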
After completing the implementation (and testing for correctness, of
course), do the following:
- Randomly sample k = 10 instances, use them as the initial cluster
centroids, and run the algorithm to convergence. For each cluster,
find the most common digit in that cluster and count the number of
instances in the cluster that are different from the most common
one. Sum that count over all of the clusters.
- Repeat the above step 10 times in total and report the average
number of iterations to convergence and the average number of
instances that are in the wrong cluster.
- Run the algorithm with k = 5. Look at the clusters and see if
there are digits that tend to get grouped together. Which digits are
they, and why do you think they end up in the same cluster?
- Finally, run the algorithm 10 times again with k = 10 and report
the same information as above (iterations to convergence and number
of wrongly clustered instances). But this time do not choose random
instances for the cluster centroids. Randomly choose an instance
that represents each of the digits and use them as the centroids.
That is, one of the centroids will be a randomly chosen 0, another
will be a randomly chosen 1, and so on.
Do you observe any difference in the performance statistics? Why or
why not?
- Turn in a hard copy of your code.
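The error metric used in the steps above (within each cluster, find the most common true digit and count the members whose label differs, summed over all clusters) can be sketched as follows, again assuming Python with NumPy:

```python
# Sketch of the wrong-cluster count: for each cluster, instances whose
# true label differs from the cluster's most common digit are errors.
import numpy as np
from collections import Counter

def clustering_errors(assign, labels, k):
    total = 0
    for j in range(k):
        cluster_labels = labels[assign == j]
        if len(cluster_labels) == 0:
            continue  # an empty cluster contributes no errors
        majority_count = Counter(cluster_labels.tolist()).most_common(1)[0][1]
        total += len(cluster_labels) - majority_count
    return total
```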