Course Homepage
Recent Announcements
- Dec 1 - Due to the length of homework 6 we will only have 7 homeworks. Everyone will get full credit for the missing 8th homework so the points allotted to each homework will remain unchanged.
- Dec 1 - Homework 7 is out.
- Nov 10 - Homework 6 is out.
- Oct 29 - Homework 5 is out.
- Oct 20 - Homework 4 is out. Note that the due date is on a Thursday rather than a Tuesday.
- Sep 29 - The mongo notebook is here, and the decision tree notebook is here. The decision tree slides are posted below in the syllabus.
- Sep 17 - The mysql connector notebook is here.
- Sep 14 - Homework 2 is posted on the Homework tab.
- Sep 8 - I posted a link to the data preprocessing slides in the schedule below. Here are links for the basic data frames, deeper into data frames, JSON, and simple data visualization notebooks.
- Aug 31 - Homework 1 is out
- Aug 27 - Here are links to the data exploration notebook and the dataset that it loads.
- Aug 26 - Join the class Slack workspace ASAP. The link expires soon.
Course Description
Data science is a field that involves data manipulation, analysis, and presentation, all at scale. It's typical for an organization to have a few terabytes of data maintained for different purposes by different business units stored in different formats, and for someone to have an idea about how the data might bring significant additional value. Data scientists are the bridge between the idea and the data and help extract latent value, often uncovering novel insights and novel beneficial ways to use the data in the process.
The goal of this class is to give students hands-on experience with all phases of the data science process using real data and modern tools. Topics that will be covered include data formats, loading, and cleaning; data storage in relational and non-relational stores; data analysis using supervised and unsupervised learning, and sound evaluation methods; data visualization; and scaling up with cloud computing, MapReduce, Hadoop, and Spark.
Tools
The core concepts of data science are programming-language independent, but Python has a powerful set of open source tools for doing data science at scale that we will leverage, as do many organizations, both large and small. Specifically, we'll use Anaconda, which bundles "over 100 of the most popular Python, R and Scala packages for data science" and provides easy access to hundreds more through the conda package manager.
The elements of Anaconda that are most relevant to the tripartite structure of this course are (1) pandas (the Python Data Analysis Library), which provides ways to load data into a dataframe for easy manipulation and analysis, (2) scikit-learn, which is a set of "simple and efficient tools for data mining and data analysis", and (3) matplotlib, which is "a python 2D plotting library [that] produces publication quality figures in a variety of hardcopy formats and interactive environments".
Two other tools that will figure prominently are Spark, an extremely powerful framework for data manipulation in cluster computing environments, and Jupyter Notebook, a "web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text". We'll use the former to explore the power of cloud computing with Amazon's EC2, and the latter to interactively explore data and present results.
Please note that you will get your hands dirty in this class!
You will be required to install software, read and use online
documentation, solve problems by googling for answers, read posts
on Stack Overflow, and so on.
Data science is a broad and rapidly changing field, so one of the most
valuable skills you can cultivate is the ability to dive in and solve
problems, either your own or the client's. You will by no means be on
your own, with support from me, the TA, and your classmates. But the
first thing I will ask when you come to me with a question is "what
have you already tried?", and the list of things you've tried must
have length ≥ k where k is at least 2.
Grading
This will be an unusual semester given that we're doing the class entirely online. To ensure that everyone stays engaged and gets frequent feedback on their mastery of the content, we're going to do a number of short homeworks rather than a project (as I've done in the past, with fewer homeworks). Grades will be based on the following:
- 8 homeworks, 9 points each
- Midterm exam, 14 points
- Final exam, 14 points
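As a quick check of the arithmetic, the point breakdown above adds up to 100. The sketch below is only an illustration: the function name `total_points` and the idea of passing each score as a fraction between 0 and 1 are assumptions made for this example, not part of the official grading procedure.

```python
# Point values taken from the list above.
HOMEWORK_COUNT = 8
HOMEWORK_POINTS = 9
MIDTERM_POINTS = 14
FINAL_POINTS = 14


def total_points(homework_fractions, midterm_fraction, final_fraction):
    """Return course points earned out of 100.

    Each argument is a fraction in [0, 1] of the credit earned on that item
    (a hypothetical input format chosen for this sketch).
    """
    if len(homework_fractions) != HOMEWORK_COUNT:
        raise ValueError(f"expected {HOMEWORK_COUNT} homework scores")
    homework_total = sum(f * HOMEWORK_POINTS for f in homework_fractions)
    return (homework_total
            + midterm_fraction * MIDTERM_POINTS
            + final_fraction * FINAL_POINTS)


# Full credit everywhere: 8 * 9 + 14 + 14 points.
max_points = total_points([1.0] * HOMEWORK_COUNT, 1.0, 1.0)
```

Homework accounts for 72 of the 100 points, so the two exams together are worth 28.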
Late policy: Homeworks are due at the start of class on the day they are due. A penalty of 10% will be imposed on anything turned in after the start of class but turned in within 24 hours. A penalty of an additional 10% will be imposed every 24 hours after that. So the penalty is 10% for one day late, 20% for two, 30% for three, and so on. Due to the complexity of online classes, you will have 5 late days that you can use for homeworks, and only homeworks. You can use those days any time and do not need to ask me if you can use them. I will assume that any assignment turned in late (by one or more days) will use one or more late days. It is up to you to track how many late days you have used.
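The penalty schedule can be sketched in a few lines of Python. Two things here are assumptions, not statements of the policy: that each late day applied cancels one day of penalty, and that the penalty is capped at 100%. The function name `late_penalty_percent` is also just a label for this example.

```python
import math


def late_penalty_percent(hours_late, late_days_applied=0):
    """Return the penalty as a percentage of the homework's points.

    The penalty is 10% for each started 24-hour period after the deadline.
    ASSUMPTION: each late day applied removes one day of penalty.
    ASSUMPTION: the penalty never exceeds 100%.
    """
    if hours_late <= 0:
        return 0  # turned in on time
    days_late = math.ceil(hours_late / 24)
    penalized_days = max(0, days_late - late_days_applied)
    return min(100, 10 * penalized_days)


late_penalty_percent(1)      # within 24 hours: one day late
late_penalty_percent(30)     # between 24 and 48 hours: two days late
late_penalty_percent(50, 2)  # three days late, two late days applied
```

Under these assumptions, the three calls give 10, 20, and 10 percent respectively.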
Academic Integrity
By enrolling in this course, each student assumes the responsibilities of an active participant in UMBC’s scholarly community in which everyone’s academic work and behavior are held to the highest standards of honesty. Cheating, fabrication, plagiarism, and helping others to commit these acts are all forms of academic dishonesty, and they are wrong. Academic misconduct could result in disciplinary action that may include, but is not limited to, suspension or dismissal. To read the full Student Academic Conduct Policy, consult UMBC policies, or the Faculty Handbook (Section 14.3). For graduate courses, see the Graduate School website.
Important University Information
Again, given the complexity of the situation around virtual teaching and Covid-19, the university has a number of addenda that can be found here. In general, reach out to me sooner rather than later if you need help or have questions about anything related to this course, online learning, impacts of being sick, etc.
Schedule
Week | Topics | Notes |
---|---|---|
01 | Course overview, introduction to data science, setting up your environment (Anaconda and Jupyter Notebook) | Class Thursday only (start of semester) |
02 | Introduction to Pandas and dataframes, CSV, json, and minimal visualization capabilities | Tuesday, Homework 1 out |
03 | More visualization. Data loading, cleaning, summarization, and outlier detection | Reading, Slides |
04 | SQL, NoSQL, key/value stores, connecting to a database from Python | NoSQL reading NoSQL slides SQL slides Tuesday, Homework 1 due Homework 2 out |
05 | Building models, trees for classification, scikit-learn | Decision tree reading and slides |
06 | Trees for regression, linear regression | Regression slides Tuesday, Homework 2 due Homework 3 out |
07 | Logistic regression, support vector machines | Logistic regression slides Logistic regression reading SVM slides SVM readings |
08 | Mid-term exam on Thursday, October 15th (covers weeks 1 - 7) Evaluation, cross-validation, overfitting, practical concerns | Slides on statistical tests Tuesday, Homework 3 due Homework 4 out |
09 | Clustering, dimensionality reduction, practical concerns | Clustering reading |
10 | Data visualization | Slides Tuesday, Homework 4 due Homework 5 out |
11 | Cloud computing, scaling up, Amazon EC2 | |
12 | MapReduce and Hadoop | Map-Reduce reading Tuesday, Homework 5 due Homework 6 out |
13 | Spark (the MapReduce killer) | Slides |
14 | Spark - part 2 | Class Tuesday only (Thanksgiving) Tuesday, Homework 6 due Homework 7 out |
15 | Topics that spilled over from prior weeks (e.g., EC2, a little Spark) | Bias/variance slides |
16 | Exam review | Tuesday, Homework 7 due Homework 8 out |
Dec 10 | Final exam 10:30am - 12:30pm | |