Projects
The project is meant to give you experience with real data in the context of an unstructured exploration. There are only two hard requirements:
- You must integrate information across two "tables" or "datasets". For example, loading a dataset that has information about each crime committed in Baltimore and then creating descriptive statistics, or building a model that predicts one column from the others, is not enough. You need to leverage data from another dataset, such as one that describes demographics by region, which would allow you to understand the connection between crime and, say, income level.
- You must apply multiple ideas that we've learned this semester as you explore the data. Again, just closing your eyes and building a logistic regression model to predict Y from X is not enough. You should demonstrate understanding of the data, clean it, make conjectures about what the relationships might be, generate descriptive statistics and visualizations, build models, etc. You don't have to do all of those things; what I'm looking for is a certain level of complexity. For example, you could just build a model, but do it on a very, very large dataset. Or you could use a small dataset but, through data exploration and visualization, demonstrate some really interesting insight that was not at all obvious going in.
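To make the integration requirement concrete, here is a minimal sketch of what "leveraging data from another dataset" might look like in pandas. The column names and values are invented stand-ins for a crime dataset and a demographics dataset; your actual join keys and fields will differ.

```python
import pandas as pd

# Hypothetical miniature stand-ins for two datasets: crime incidents and
# neighborhood demographics. All column names and values are invented.
crimes = pd.DataFrame({
    "incident_id": [1, 2, 3, 4],
    "neighborhood": ["Fells Point", "Hampden", "Fells Point", "Canton"],
    "weapon": ["KNIFE", "NONE", "FIREARM", "NONE"],
})
demographics = pd.DataFrame({
    "neighborhood": ["Fells Point", "Hampden", "Canton"],
    "median_income": [72000, 61000, 85000],
})

# Join on the shared key so each incident carries the income level of the
# neighborhood where it occurred.
merged = crimes.merge(demographics, on="neighborhood", how="left")

# Cross-dataset questions now become simple aggregations, e.g. incident
# counts grouped by income level.
counts_by_income = merged.groupby("median_income")["incident_id"].count()
print(counts_by_income)
```

The point is that the interesting question (how does crime vary with income?) only becomes answerable after the two datasets share a row-level connection.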
If you cannot find anything compelling to do in the Open Baltimore data, you may use other datasets, but you must get written approval (in an email) from the instructor to use them in your project. In particular, I'm not overly fond of Kaggle datasets, as they tend to be overused, with too much analysis and code already lying around on the internet.
The final report should be single spaced, 12 point font, with 1 inch margins. Other than that, any style is fine. My expectation is that the report will contain at least the following:
- Title, list of team participants (1 point)
- Brief overview of the datasets used and what you accomplished (sort of like an abstract) (4 points)
- Detail on the datasets used (5 points)
- Discussion of what you learned about the datasets by exploring them, with evidence in the form of visualizations or descriptive statistics (25 points)
- Description of what you attempted to do with the datasets to draw connections between them. Also, say why you thought what you did was a good thing to do. (20 points)
- Results. It is important to include negative results as well, so that I know what you tried. (20 points)
- If you build models, describe the type of model used, how it was trained, and how it was evaluated, along with performance (accuracy) information
- Perhaps the most important part is the insights gained from the data. Why is what you were trying to do valuable, and what did you learn? Why would someone else be interested in knowing what you learned? How might they use that knowledge? (20 points)
- A brief discussion of what you would do next if you had the time and inclination to keep working in the same direction. (5 points)
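For the modeling bullet above, a held-out evaluation is the standard way to produce the training and accuracy details the report asks for. The sketch below uses scikit-learn with synthetic stand-in data; in your report the features and target would come from your actual merged datasets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in data: three numeric features and a binary target.
# In a real project these would be columns from your merged datasets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold out a test set so the reported accuracy reflects generalization,
# not memorization of the training rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {acc:.2f}")
```

Reporting the model type (logistic regression here), the split used for training, and the held-out metric covers the three items the bullet asks for.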
In terms of length, given that there will be visualizations, tables, histograms, etc. in the document, a paper shorter than 10 pages will be met with some skepticism (see the discussion above about complexity). A paper longer than 30 pages may test the limits of the reader's attention span. Note that teams with more than one person should have tried more things. That does not mean the paper should be twice as long for a team of two as for an individual project. Rather, I'd expect the larger teams to explore more of the data, try more things, and present more insights.
In the end, the best projects will be ones that learn something interesting from the data. That is, if you tell me that the data says that most crime occurs after midnight on the weekends, I won't be surprised or find that particularly insightful. But if you tell me that crime patterns by weapon seem to move geographically with a weekly cycle (for example), I'd think that was pretty interesting.