Projects
The project is meant to give you experience with real data in the context of an unstructured exploration. There are only two hard requirements:
- You must integrate information across two "tables" or "datasets". For example, loading a dataset that has information about each crime that was committed in Baltimore and then creating descriptive statistics or building a model that predicts one column from the others is not enough. You need to leverage data from another dataset, such as one that describes demographics by region that allows you to understand the connection between crime and, say, income level.
- You must apply multiple ideas that we've learned this semester as you explore the data. Again, just closing your eyes and building a logistic regression model to predict Y from X is not enough. You should demonstrate understanding of the data, clean it, make conjectures about what the relationships might be, generate descriptive statistics and visualizations, build models, etc. You don't have to do all of those things. I'm looking for a certain level of complexity. For example, you could just build a model but do it on a very, very large dataset. Or you could use a small dataset but demonstrate some really interesting insight that was not at all obvious going into it by data exploration and visualization.
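As a rough illustration of the first requirement, here is a minimal pandas sketch of joining an incident-level table to a region-level table. The data, the column names (`neighborhood`, `median_income`, `population`), and the choice of neighborhood as the join key are all made up for this example; your own datasets will need their own key and their own cleaning before a join like this works.

```python
import pandas as pd

# Hypothetical toy data; every column name and value here is invented.
crimes = pd.DataFrame({
    "neighborhood": ["A", "A", "B", "C", "C", "C"],
    "weapon": ["knife", "none", "none", "firearm", "none", "knife"],
})
demographics = pd.DataFrame({
    "neighborhood": ["A", "B", "C", "D"],
    "median_income": [42000, 61000, 35000, 78000],
    "population": [2000, 1000, 1500, 3000],
})

# Collapse the incident-level table to one row per neighborhood.
counts = (
    crimes.groupby("neighborhood")
    .size()
    .rename("n_crimes")
    .reset_index()
)

# Left join from demographics so regions with no recorded incidents are kept.
# validate= raises an error if the key is unexpectedly duplicated on either side.
merged = demographics.merge(
    counts, on="neighborhood", how="left", validate="one_to_one", indicator=True
)
merged["n_crimes"] = merged["n_crimes"].fillna(0).astype(int)
merged["crimes_per_1000"] = 1000 * merged["n_crimes"] / merged["population"]

# Rows that found no match in the crime table are worth inspecting by hand.
unmatched = merged.loc[merged["_merge"] == "left_only", "neighborhood"].tolist()

print(merged[["neighborhood", "median_income", "n_crimes", "crimes_per_1000"]])
print("No incidents matched for:", unmatched)
```

Checking which rows failed to match is often where the real cleaning work shows up, for example region names spelled differently in the two sources.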
If you need help finding a dataset, talk to me. But there are many floating around on the internet these days. See the note on the main class page about using Kaggle or other competition datasets. One way around that issue is to combine data from a couple of competitions. In years past I've required students to use Open Baltimore data. That's not a requirement this year but it is one source to consider.
You can do the project in teams of 1, 2, or 3. I expect more work from teams of 2, and even more from teams of 3, so consider that as you form teams and come up with ideas.
Note that there is an item on the syllabus for a 1-page project description. That is an informal writeup of who is doing the project, what data you will be using, and what you intend to do. It will not be graded but I will use it to provide feedback as to whether your project is reasonable. If you don't turn it in I will harass you mercilessly until you do.
The final report should be single spaced, 12 point font, with 1 inch margins. Other than that, any style is fine. My expectation is that the report will contain at least the following:
- Title, list of team participants (1 point)
- Brief overview of the datasets used and what you accomplished (sort of like an abstract) (4 points)
- Detail on the datasets used (5 points)
- Discussion of what you learned about the datasets by exploring them, with evidence in the form of visualizations or descriptive statistics (25 points)
- Description of what you attempted to do with the datasets to draw connections between them. Also, say why you thought what you did was a good thing to do. (20 points)
- Results. Include negative results as well, so that I know what you tried. (20 points)
- If you build models, describe the type of model used, how it was trained, and how it was evaluated, and report its performance (for example, accuracy).
- Perhaps the most important part is the insights gained from the data. Why is what you were trying to do valuable, and what did you learn? Why would someone else be interested in knowing what you learned? How might they use that knowledge? (20 points)
- A brief discussion of what you would do next if you had the time and inclination to keep working in the same direction. (5 points)
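For the modeling item in the list above, one possible shape for "how it was trained and how it was evaluated" is sketched below with scikit-learn. This is only an assumption about what a reasonable writeup might include, not a required recipe: the data is synthetic (generated by `make_classification` as a stand-in for a cleaned, merged feature table), and logistic regression, the 75/25 split, and the majority-class baseline are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for real features X and labels y.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)

# Hold out 25% of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A baseline that always predicts the most common training label.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Report accuracy on the held-out rows only.
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
model_acc = accuracy_score(y_test, model.predict(X_test))
print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"model accuracy:    {model_acc:.3f}")
```

Reporting the baseline next to the model makes an accuracy number interpretable: 90% accuracy means little if always guessing the majority class already gets 88%.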
In terms of length, given that there will be visualizations, tables, histograms, etc. in the document, a paper shorter than 10 pages will be met with some skepticism (see the discussion above about complexity). A paper longer than 30 pages may test the limits of the reader's attention span. Note that teams with more than one person should have tried more things. That does not mean the paper for a team of two needs to be twice as long as one for an individual project. Rather, I'd expect the larger teams to explore more of the data, try more things, and present more insights.
In the end, the best projects will be ones that learn something interesting from the data. That is, if you tell me that the data says that most crime occurs after midnight on the weekends, I won't be surprised or find that particularly insightful. But if you tell me that crime patterns by weapon seem to move geographically with a weekly cycle (for example), I'd think that was pretty interesting.
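A first look at a pattern like the weekly-cycle example above might start from a simple cross-tabulation. The sketch below uses invented incident records with hypothetical `timestamp`, `district`, and `weapon` columns; it only counts incidents by day of week and district for one weapon type, which is a starting point for exploration and not evidence of a cycle by itself.

```python
import pandas as pd

# Hypothetical incident records; columns and values are invented.
incidents = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 23:10", "2024-01-02 01:30", "2024-01-06 02:15",
        "2024-01-06 22:40", "2024-01-07 00:05", "2024-01-08 13:00",
    ]),
    "district": ["North", "North", "South", "South", "South", "North"],
    "weapon": ["knife", "firearm", "firearm", "firearm", "knife", "firearm"],
})

# Derive the day of the week from the timestamp.
incidents["day"] = incidents["timestamp"].dt.day_name()

# Restrict to one weapon type, then count incidents per (day, district).
firearm = incidents[incidents["weapon"] == "firearm"]
by_day_district = pd.crosstab(firearm["day"], firearm["district"])
print(by_day_district)
```

With real data, a table like this (or a heatmap of it) for each weapon type is one way to see whether the geographic distribution shifts over the week.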
The preferred format for the final project is a single Jupyter notebook, but a Word document or PDF is also fine.