Exams
This is the page that I created for this course the last time I taught it. Rather than remove it until we get closer to the exam, I thought I'd leave it here just so you can get an idea early on about what might be on the exam. Note that this can change as we move through the semester, and if it does I'll update this page. Again, this old version is here solely to help you think about the exam early in the semester.
We've covered a number of topics this semester, and the materials for them have come from a number of sources. Below I've gathered them all in one location, with a summary of the things that I think are most important in each topic.
In preparing this review, it became clear that we covered a lot of ground (because data science is such a big field), which means that we went deep on a few things but hit lots of topics at a fairly high level. Therefore, my goal with the exam is to ensure that you know concepts as opposed to details, so it will be a combination of short answer, true/false, and multiple choice questions. There will be no single, say, 20 point question, and thus no single significant point of failure.
Introduction to Pandas and dataframes, CSV, json, and minimal visualization capabilities
- The key topics here are the features of dataframes related to indexing and selecting data.
- We spent the most time on Different Choices for Indexing, Attribute Access, Slicing Ranges, Selection By Label, Selection By Position, and Boolean indexing among the topics in the dataframe documentation.
- Know the basics of CSV and JSON representations (e.g., be able to convert from one to the other)
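To make that concrete, here's a minimal sketch of the kinds of selection we practiced, plus a CSV/JSON round trip. The little dataframe and the file names are made up; any small frame will do.

```python
import pandas as pd

# A tiny, made-up frame just for illustration.
df = pd.DataFrame(
    {"name": ["ann", "bob", "carla"], "age": [23, 31, 27]},
    index=["a", "b", "c"],
)

df.loc["b"]          # selection by label (the row whose index label is "b")
df.iloc[0:2]         # selection by position (the first two rows)
df[df["age"] > 25]   # boolean indexing (rows where age > 25)
df.age               # attribute access to the "age" column

# CSV <-> JSON round trip.
df.to_csv("people.csv", index=False)
df2 = pd.read_csv("people.csv")
df2.to_json("people.json", orient="records")
```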
More visualization. Data loading, cleaning, summarization, and outlier detection
- The primary source here was this book chapter
- We covered 3.2.1 Missing Values, 3.2.2 Noisy Data, 3.4.4 Attribute Subset Selection, 3.4.6 Histograms, and 3.5.2 Data Transformation by Normalization
- We discussed some of the other material, but those sections are what we spent the most time on in class
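If you want a quick refresher on what those sections look like in code, here's a small pandas sketch (the numbers are made up) of filling a missing value, normalizing, and drawing a histogram.

```python
import pandas as pd

# A made-up column with one missing value.
s = pd.Series([2.0, 4.0, None, 8.0, 100.0], name="x")

s_filled = s.fillna(s.mean())                       # fill the missing value with the mean
z = (s_filled - s_filled.mean()) / s_filled.std()   # z-score normalization
mm = (s_filled - s_filled.min()) / (s_filled.max() - s_filled.min())  # min-max normalization to [0, 1]

s_filled.hist(bins=5)   # a quick histogram (needs matplotlib installed)
```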
SQL, NoSQL, key/value stores, connecting to a database from Python
- For SQL we used slides and made it through slide 45.
- The important concepts here are tables, simple queries, LIKE, DISTINCT, (inner) joins, aggregation, grouping, nested queries
- I could see asking you to write a query or explain what a query does/returns; there's a small sketch of this after the list below
- We spent much less time on NoSQL, but did get through slide 34 of this presentation.
- The important parts are sharding (what is it?), how SQL and NoSQL differ, and the benefits and drawbacks of NoSQL. Also know that MongoDB is a NoSQL database.
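Since connecting to a database from Python was part of this unit, here's a rough sketch of the sort of query I have in mind, using Python's built-in sqlite3 module. The tables and data are invented for the example.

```python
import sqlite3

# An in-memory database with two made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE grades (student_id INTEGER, course TEXT, grade REAL);
    INSERT INTO students VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO grades VALUES (1, 'cs101', 3.7), (1, 'cs102', 3.9), (2, 'cs101', 3.1);
""")

# An inner join with grouping and aggregation: each student's average grade.
query = """
    SELECT s.name, AVG(g.grade) AS gpa
    FROM students s JOIN grades g ON s.id = g.student_id
    GROUP BY s.name
    ORDER BY gpa DESC;
"""
for name, gpa in conn.execute(query):
    print(name, gpa)
```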
Building models, trees for classification, scikit-learn
- We had a general discussion of building models, and dove into the contents of this book chapter.
- The key concepts are the core algorithm for learning trees, and the use of entropy and information gain to choose attributes for splitting. That's contained in sections 3.1 - 3.4 of the book chapter.
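Here's a small sketch of the two quantities the tree-learning algorithm uses to pick a splitting attribute. The toy data is made up, not taken from the book chapter.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: -sum over classes of p * log2(p)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Parent entropy minus the weighted entropy of the children
    produced by splitting on attribute attr."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    children = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - children

# Toy data: splitting on "outlook" separates the classes perfectly, so the gain is 1.0.
rows = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, "outlook"))
```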
Logistic regression, support vector machines
- We used these slides for logistic regression. The LR content starts on slide 44.
- You should be able to write down the logistic function, say how/why it is used in LR, and know that LR learns a linear separator.
- The earlier material in that presentation about weight vectors, gradient ascent/descent, and batch/stochastic algorithms is relevant.
- You don't need to know mathematical details (other than the form of the logistic function), and you certainly don't need to know how to derive things.
- The SVM lecture was all on the board. You should know that SVMs maximize the margin, be able to explain what that means and why it is a good thing, know that SVMs can be made non-linear by the kernel trick, be able to say (in general terms) what the kernel trick is (it projects into high-dimensional spaces implicitly), and know how it relates to polynomial regression (which projects explicitly).
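If it helps, here's a minimal scikit-learn sketch tying these together: the logistic function itself, a logistic regression (a linear separator), and an SVM made non-linear via an RBF kernel. The six data points are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def logistic(z):
    """The logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A made-up 2-D dataset with two well-separated classes.
X = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]])
y = np.array([0, 0, 0, 1, 1, 1])

lr = LogisticRegression().fit(X, y)   # learns a linear separator; predicts via the logistic function
svm = SVC(kernel="rbf").fit(X, y)     # kernel trick: non-linear boundary without explicit projection

print(lr.predict([[2, 2]]), svm.predict([[2, 2]]))
```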
Evaluation, cross-validation, overfitting, practical concerns
- Know what overfitting is, why it's bad, what cross-validation is, and how it combats overfitting.
- We covered through slide 30 of this presentation on evaluation.
- The key concept here is bootstrapping: resampling the data with replacement (in a computer program, rather than relying on distributional theory) to compute confidence intervals on some statistic of a data sample.
- I won't ask questions on distributional theory.
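To be clear about what I mean by bootstrapping, here's a short sketch of a percentile bootstrap for the mean of a sample. The sample values and the helper name are made up for illustration.

```python
import numpy as np

def bootstrap_ci(data, stat=np.mean, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample with replacement, recompute the statistic
    each time, and report the middle (1 - alpha) fraction of the results."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    stats = [stat(rng.choice(data, size=len(data), replace=True)) for _ in range(n_boot)]
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# A made-up sample: a 95% confidence interval on its mean.
sample = [4.1, 5.0, 3.8, 6.2, 5.5, 4.9, 5.1, 4.4]
print(bootstrap_ci(sample))
```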
Clustering, dimensionality reduction, practical concerns
- We spent the most time on clustering
- Key concepts are: understanding what clustering is, the difference between partitioning and hierarchical methods, k-means, agglomerative clustering, distance measures, BIRCH (basics), and DBSCAN (basics)
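Here's a quick scikit-learn sketch contrasting a partitioning method (k-means) with a hierarchical one (agglomerative clustering); the points are made up and chosen so the two clusters are obvious.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

# Two obvious, made-up blobs of points.
X = np.array([[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_, km.cluster_centers_)   # partitioning: each point is assigned to one of k clusters

agg = AgglomerativeClustering(n_clusters=2, linkage="average").fit(X)
print(agg.labels_)                       # hierarchical: clusters are merged bottom-up
```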
Cloud computing, scaling up, Amazon EC2
- The content for this section was the EC2 Master Class SlideShare.
- You can see the author of that slide deck present it here.
- I'd like you to be able to explain what EC2 is and why it is useful, understand the concept of instances and the difference between instance storage and EBS, say what a Virtual Private Cloud is, and explain how auto-scaling works, how it connects to monitoring and metrics, and why it is useful.
MapReduce and Hadoop
- We covered through section 3.2 of Jimmy Lin's MapReduce book.
- You should know what mappers and reducers do, how key/value pairs flow through the system, in-Mapper combining, and why MapReduce is needed and powerful.
- I won't ask about partitioners or combiners or the distributed file system.
- We talked at length about the word counting example and different ways of implementing it, along with their advantages and disadvantages
- I might ask you to sketch MapReduce code for a simple problem, or at least say how you'd approach it with MapReduce.
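To show the kind of sketch I mean, here's word count written in MapReduce style in Python. This is not runnable Hadoop code; the little run() driver only simulates the shuffle so you can see how key/value pairs flow from mapper to reducer.

```python
def mapper(doc_id, text):
    # Emit a (word, 1) pair for every word in the document.
    # (The in-mapper combining variant would instead accumulate counts in a
    # local dictionary and emit (word, count) pairs once at the end.)
    for word in text.split():
        yield (word, 1)

def reducer(word, counts):
    # The framework guarantees that counts holds every value emitted for this word.
    yield (word, sum(counts))

def run(docs):
    # Stand-in for Hadoop's shuffle: group the mapper output by key.
    shuffled = {}
    for doc_id, text in docs.items():
        for key, value in mapper(doc_id, text):
            shuffled.setdefault(key, []).append(value)
    return dict(kv for key, values in shuffled.items() for kv in reducer(key, values))

print(run({"d1": "to be or not to be", "d2": "to do"}))
# {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'do': 1}
```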
Spark
- We covered three sets of slides: installing, history, and essentials. I'll focus on the essentials.
- Key concepts are RDDs, transformations, actions, and lazy execution
- Know what the transformations (slides 14 and 15) and actions (slides 21 and 22) do, and how to read/write a simple spark program like the word count example.
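As a reference point, here's a minimal PySpark word count. The input path is a placeholder; the point to notice is that the chained calls only build up transformations on an RDD, and nothing actually runs until the action at the end.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile("input.txt")                 # RDD of lines (lazy)
            .flatMap(lambda line: line.split())    # transformation: split lines into words
            .map(lambda word: (word, 1))           # transformation: (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))      # transformation: sum the counts per word

print(counts.take(5))                              # action: this is what triggers the computation
```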