Exams


This is the page I created the last time I taught this course. Rather than remove it until we get closer to the exam, I thought I'd leave it here so you can get an early idea of what might be on the exam. Note that the content may change as we move through the semester, and if it does I'll update this page. Again, this old version is here solely to help you think about the exam early in the semester.
We've covered a number of topics this semester, and the materials for them have come from a number of sources. Below I've gathered them all in one location, with a summary of the things that I think are most important in each topic.

In preparing this review, it became clear that we covered a lot of ground (because data science is such a big field), which means that we went deep on a few things but hit lots of topics at a fairly high level. Therefore, my goal with the exam is to ensure that you know concepts as opposed to details, so it will be a combination of short-answer, true/false, and multiple-choice questions. There will be no single, say, 20-point question, and thus no single significant point of failure.

Introduction to Pandas and DataFrames, CSV, JSON, and minimal visualization capabilities

More visualization. Data loading, cleaning, summarization, and outlier detection

SQL, NoSQL, key/value stores, connecting to a database from Python

Building models, trees for classification, scikit-learn

Logistic regression, support vector machines

Evaluation, cross-validation, overfitting, practical concerns

Clustering, dimensionality reduction, practical concerns

Cloud computing, scaling up, Amazon EC2

MapReduce and Hadoop

Spark