Homework

Homework 6

Assigned: December 4th (Tuesday)
Due: December 16th (Sunday) by 11:59pm

In this final homework you'll write a couple of simple Spark programs to compute statistics on a graph. The first thing to do is install Spark. For Linux systems, do the following:

Go to the Spark downloads page and get the latest version. I got Spark 2.4.0 pre-built for Hadoop 2.7. For what we're doing, you don't actually need to have Hadoop installed, so don't worry about that.
gunzip and untar the file that you downloaded.
For me, that resulted in a directory named spark-2.4.0-bin-hadoop2.7, which I moved to ~/spark.
You'll also need to have Java installed on your machine.

To run a python program, you just give the name of the file and any arguments to the submit-spark script. For example, suppose you've got the script below in a file named wordcount.py in your home directory.

from pyspark import SparkContext
import sys

sc = SparkContext("local", "app")

text_file = sc.textFile(sys.argv[1])
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile(sys.argv[2])

To count the words in a file named book.txt and put the results in a directory named output, execute the following command:

~/spark/bin/submit-spark ~/wordcount.py book.txt output

Here's what you need to do:

Get graph.tsv, which contains a file in which each line is a triple of the form: SOURCE_NODE <TAB> DESTINATION_NODE <TAB> WEIGHT. Note that source and destination nodes are integers, as are weights.
Write one python program per item in the list below that uses Spark to compute the desired information. Each program should accept two command line parameters: a path to the graph.tsv file and a path to an output directory. Use Spark's saveAsTextFile() to save the final RDD to the specified output directory. Note that, by default, if the target output directory already exists when you attempt to save to it, you'll get an error. One solution is to remove the target directory between runs.
- For each node, compute the outdegree (number of outgoing edges) and output the (node, count) pairs in sorted order by node. The code should be in a single file named outdegree.py.
- For each node, compute the sum of weights of incoming edges and output the (node, weight_sum) pairs in order sorted by node. The code should be in a single file named weight.py.
- For each node X, find a list of all other nodes Y such that there is an (X, Y) edge in the graph and a (Y, X) edge in the graph, and output the (X, [Y1, Y2, ..., Yn]) pairs in order sorted by X. Hint: I solved this by building two RDDs, one in which edge source nodes are keys and destination nodes are values, and one in which edge destination nodes are keys and source nodes are values. The code should be in a single file named pairs.py.

Put all 3 programs in a tarball and submit it via slack to the TA. A goal in all cases is to load the data in and keep it within Spark for as long as possible. That is, resist the temptation to, for example, do a .collect() and do the rest of the work in pure python. My code for the 3rd program stayed in Spark RDDs the entire time.

Homework 5

Assigned: November 20th (Tuesday)
Due: December 3rd (Monday) by 11:59pm

Do the following problems from chapter 10 of the Data Mining book. You can find a link to a PDF of that chapter in the syllabus.

10.2 (k-means)
10.10 (BIRCH). Note that you'll have to read about the OPTICS algorithm in the text.
10.12 (density-based clustering)
10.18 (constraints and clustering) For this problem I'm just looking for you to propose some ideas. In addition to what's asked for in the book, say something about how your proposed modifications impact the computational complexity of the clustering algorithm.

Write up your answers and submit them to the TA via slack.

Homework 4

Assigned: November 6th (Tuesday)
Due: November 20th (Tuesday) by 11:59pm

In this homework you will get experience writing and running MapReduce jobs on Hadoop. Here are your tasks:

Install Hadoop. I suggest using the most recent version and installing binaries. You can do either a local install or a psuedo-distributed install. A local install will be easier to do and maintain.
Read and work through the MapReduce tutorial for a working example of the word count problem. I would use that as a template and morph it into the code that you need to write.
Write java code to perform k-means clustering using MapReduce. The program must take 4 inputs in this order: k, n, full path to input directory, full path to output directory. k is the number of clusters and n is the dimensionality of the data points.
- When your program is run it should process all of the data files in the input directory which will contain one line per input point with each value delimited by a single space and terminated by a newline. There will be no blank lines in the input files. For example, if the input points are 4-dimensional, a file may look like this:
```
0.1 12.2 3.4 1.1
0.78 -14.2 800.9 12.0
```
  An excellent test file is the MNIST data (without class labels) that you used for the previous homework. In fact, that will be one of our test cases.
- The initial centroids should be
```
 <i, ..., i> 
```
  for values of i from 1 to k. For example, if the data points are 4-dimensional and k = 3, the initial centroids would be:
```
<1, 1, 1, 1>
<2, 2, 2, 2>
<3, 3, 3, 3>
```
  This ensures that everyone (assuming a correct implemetation) will get the same output for the same input.
- The output of the reducers should be the cluster centroids.
- It is perfectly OK for the mappers and reduces to open and read from files to load side information.

Note that you'll implement one round of k-means in MapReduce and will thus need to wrap it in a little logic that iterates until cluster centroids don't change on two successive iterations.

Turn in your code as a tar file. The main routine must be contained in a file named KMeans.java and all files must extract to the current working directory when you untar them.

Homework 3

Assigned: October 24th (Wednesday)
Due: November 6th (Tuesday) by 11:59pm to the TA via slack.

In this homework you will implement logistic regression, and then add a term to the objective function to prevent overfitting and evaluate it's impact.

The MNIST dataset is a well-studied collection of handwritten digits. It is often used to test multi-class classification algorithms, where there is one class for each of the 10 digits (0 - 9).

I've made two files available for you:

The raw MNIST data, which is a text file containing 10,000 rows. Each row contains 28 * 28 = 784 integers in the range 0 to 255. Each integer is the pixel value from a 28 x 28 image of a handwritten digit. Every row corresponds to a vector in the dataset that is to be clustered.
The labels for the raw data are in a file with 10,000 rows. The first row contains the correct digit label for the first row in the raw data. The second row is the label for the second instance, and so on.

Implement 2-class logistic regression using gradient descent as outlined in the lecture notes. You can do either batch or stochastic versions of the algorithm. You will only use your algorithm for this dataset, so you can hard-wire in the number of instances and the size of each instance. The goal is not to write a generic version of the algorithm (though you can if you wish). The goal is to understand how it works on real data.

Use your algorithm to learn a classifier that determines whether an input image is an 8 or one of the other digits and record the classification accuracy on the training set (the full dataset I provided). Note that you'll have to come up with some stopping criterion, which could be to simply run for a fixed number of iterations and then quit.

After training is complete, create a 28x28 image of the learned weights. The largest weight (most positive) should map to black, the smallest weight (most negative) to white, and the other weights should linearly interpolate between those extremes.

Recall from class that one form of regularization (a way to prevent overfitting) is to encourage the weight vector to be sparse. That is commonly done in logistic regression by adding a term to the objective function that is the sum of the squares of the individual weights. We'd like to minimize that while maximizing the likelihood of the data.

Modify your code from above to include this term. Note that you'll have to consider the derivative of this term when computing gradients, and the formulation in class is maximizing likelihood whereas we want to minimize the sume of squared weights. Introduce a parameter, lambda, that is a constant used to multiply the sum of squares of weights. If lambda = 0, then you are running logistic regression without regularization. If you increase lambda, then you are putting more emphasis on small weights.

Generate a plot of training set accuracy as a function of lambda for several values of lambda. Briefly explain what is happening.

Submit the following in a Jupyter notebook:

Your code
The training set accuracy on the 8-vs-others classification problem
The weight image along with a brief paragraph of what the image tells you about what was learned.
The curve of training set accuracy as a function of lambda as described above, with an explanation.

Homework 2

Assigned: October 2nd (Tuesday)
Due: October 16th (Tuesday) by 11:59pm

In this homework you'll gain experience installing and using SQL databases. I want you to actually install the software yourself, even if you have access to a machine on which it already runs. To document what you did while installing, use the Unix script command. To do that, when you start the installatiom process in a shell type script logfile, and when you're done type exit. Then you'll have a file named logfile that shows what you did while installing and testing the installation. If you're doing this on some other OS, like Windows, you can supply a brief writeup of how you did the install (i.e., what steps were involved).

Here are your tasks.

Install MySQL: Click here to go to the "Installing and Upgrading MySQL" page. Then click on the link for your operating system. I installed in an Ubuntu VM and it went like this:

I clicked on "2.5 Installing MySQL on Linux"
I clicked on "2.5.1 Installing MySQL on Linux Using the MySQL Yum Repository", but it didn't look like there was a Yum repo for Ubuntu, so I went back.
So I clicked on "2.5.3 Installing MySQL on Linux Using the MySQL APT Repository" and it listed Ubuntu.
I was directed to "A Quick Guide to Using the MySQL APT Repository", so I went there.
"Steps for a Fresh Installation of MySQL" looked promising, so I clicked on that
I wound up on a page where I was able to download the APT repo, and it looked like I needed an Oracle Web account. But if you scroll down you'll see a link that says "No thanks, just start my download", so I did that.
The result was a .deb file, and the instructions gave me a dpkg command to run, which I did.
That popped up a window that asked what I wanted to install, so I just went with the defaults and chose "Ok".
I kept entering commands as suggested by the installation process, which downloaded and installed a bunch of stuff.
Then I was asked for a root password for MySQL.
To check the install, run sudo service mysql status. You should see something about MySQL being active and/or running.

Create a MySQL user for yourself:

Run mysql --user root --password at the command line, enter the root password for MySQL that you created, and run the following queries
CREATE USER 'username'@'localhost' IDENTIFIED BY 'password';
GRANT ALL PRIVILEGES ON *.* TO 'username'@'localhost' WITH GRANT OPTION;
QUIT;
Then run mysql -p and enter the password you just specified.
Try SHOW DATABASES; and you should see a few system DBs listed.

Install the Retailer sample database: Go here for instructions on getting the sample database. The result is a .zip file that, when extracted, gives you a .sql file named mysqlsampledatabase.sql, which is nothing more than a series of SQL commands/queries. To run it, do this: mysql -u username < mysqlsampledatabase.sql. As usual, you'll have to enter your password

To check that everything worked, do SHOW DATABASES in mysql and if you see one named classicmodels, then all is well. The web page that contains the link for the sample database has information on the tables and their fields.

Write each of the following queries: For each query, turn in the query and the result of running it on the Retail database that you just created.

Count the number of employees whose last name or first name starts with the letter 'P'.
For how many letters of the alphabet are there more than one employee whose last name starts with that letter? Hint: The substr function will be useful here in a GROUP BY clause.
How many orders have not yet shipped?
How many orders where shipped less than 2 days before they were required?
For each distinct product line, what is the total dollar value of orders placed?
For the first three customers in alphabetal order by name, what is the name of every product they have ordered?

Install the python connector for MySQL: In this part of the homework you'll get experience running queries from python code and write a simple program to extract the structure of a MySQL database.

Go here to download the python connector.
I chose "4.2 Installing Connector/Python from a Binary Distribution" and then looked under "Installing Connector/Python on Linux Using a Debian Package" since I'm on Ubuntu.
I wound up on the Download Connector/Python page where I chose my operating system and downloaded the .deb file, then I resumed following the directions, which involved running a dpkg command on the file I just downloaded.
Then you can test the installation by running python and trying import mysql.connector. If that does not produce an error, you're good to go.

Your task is to write a python program that takes a single command line argument, which is the name of a database, and prints the names of all of the tables in that database along with the number of rows in each table. Read through the documentation on the python connector here to see how to create a connection, issue a query, and walk over the results. For this exercise you'll submit your python code, which should all be in one file, along with the output of running your program on the sample database you installed earlier. Hint: The SHOW TABLES query will be useful here.

Put all elements of the homework into a single file and submit it via Slack to the TA by the due date/time.

Homework 1

Assigned: September 16th (Sunday)
Due: September 28th (Friday) by 11:59pm

In this homework you'll gain experience with Open Baltimore data, Jupyter notebooks, and pandas. Jupyter auto-saves notebooks with some regularity, but I also tend to "Save and Checkpoint" periodically on the File menu because you can always revert to a checkpoint.

You will submit your homework as a notebook by uploading it as a file into the Slack channel for the TA (Prathusha Naidu) by 11:59pm the day the assignment is due. To do that, click on the + icon beside "Direct Messages" and start typing her name. As some point you'll see her name in a list of users below where you are typing. Click on her name. Once you're in a chat with Prathusha, click the big + next to the space where you enter a message, click on "Upload File" and then choose the notebook you want to submit. The system will then allow you to add a message, which you should make "Homework 1 submission for NAME".

To add comments in your notebook, which you're asked to do to explain your thinking in a few places, you'll use markdown syntax in the cell. Look at the Basics tab on the main markdown page and it will tell you everything you need to know. Type your comments in a notebook cell and then either do "Cell" - "Cell Type" - "Markdown", or type CTRL-M M in the cell.

Choose any dataset from the Open Baltimore collection except for variations of the Victim Based Crime Data that I explored in my DataExploration notebook. Choose a dataset that allows you to perform the following tasks:

Load the data into a Jupyter notebook. Explain briefly (using markdown) what the Open Baltimore website says about the dataset. Do a head(50) and tail(50) on the data frame after loading the data. Explain any observations you can make about the dataset and its quality from just that output.
Explore the data to understand what's in each of the columns. If the dataset has a very large number of columns (more than 10) you can choose a smaller subset of columns with which to work, but justify why you selected those columns. For each of the columns, but no more than 5 total columns:
- Describe what the column contains (e.g., the time at which a crime was committed, or the last sale price of a house) in prose
- Determine whether the column contains missing data, make a decision about how to handle them, and implement that decision
- Do the same for outliers or other unusual values. Determine if they exist and, if so, implement an approach to dealing with them
- Explain anything else interesting or unusual about the data in the column that you observed
Note that your code for both discovering missing values and outliers, as well as dealing with them, must be included in the notebook. Add explanatory markdown as needed.
Create scatter plots of pairs of variables that you think might be related, and for two such plots do the following:
- Explain why you think the two variables might be related
- Show the scatter plot
- Explain what the plot says, if anything, about the relationship between the variables. The explanation should be semantic. That is, don't say "x gets bigger when y gets bigger", say, for example, "it looks like crime increases later in the week, presumably because people are out later in the week and on the weekends".
Note that you do not need to plot all pairs of variables. In fact, you can do just two pairs as long as they show interesting relationships.
Pick one variable to be a dependent variable, and two others to be predictor variables. These choices should be based on your exploration above. Generate a 2-D or 3-D plot that shows whether the predictor variables actually convey information about the value of the dependent variable. Explain clearly why you think they do or do not by referring to the plot.

CMSC 691 — Introduction to Data Science — Fall 2018

Homework