UMBC CS 201, Fall 08

CMSC 201
Programming Project Three

Web Data Mining

Out: Thursday 10/23/08
Due: Before 11:59 p.m., Sunday 11/9/08

The design document for this project, design3.txt,
is due: Before 11:59 p.m., Sunday 11/2/08

The Objective

The objective of this assignment is to give you practice with project and function design. It will also give you an opportunity to work with reading information from a file, using structures, an array of structures, passing by reference, string comparison, sorting, and some formatted printing.

The Background

There are many companies that will pay people to find out which websites are popular so that they may better select where to place their advertisements. We at CMSC201 Inc. are interested in where our students spend most of their time while using the Internet so that we may make our website more appealing to the student population.

I have installed a data-gathering program on a set of computers which have been distributed throughout the campus. This program monitors and logs every website any student visits while logged into one of these computers. Yes, all the places you have visited. The logs generated have been gathered and combined into one extremely large data file, named History.dat, which we will be working with for this project.

Data mining is a field in computer science where researchers look at creating new ways of analyzing large amounts of data. Their goal is to summarize any relationships that exist, and/or find relationships that are not as easily visible. This is one field that is currently growing in the UMBC Computer Science Department.

The Task

Design and code a project that will allow you to read in information from a data file, store it in arrays of structures so that the file can be closed, and then use the information in the arrays to answer some statistical questions. We are interested in knowing:

We will also need to produce a chart of data mined from this file summarizing our areas of interest and calculations. The charts should present the data nicely formatted, with columns of values aligned by their decimal points. Your program will also need to print out lists of websites sorted by their total number of hits. See the sample output.
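The sorting and column alignment described above can be sketched as follows. This is only a minimal sketch under stated assumptions, not the required design: the SITE structure, the MAX_ADDR limit, the function names SortByHits and FormatRow, and the column widths are all hypothetical, and the sample output remains the authority on what the chart must look like. The sketch uses the standard library's qsort(); a hand-written sort would work equally well.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_ADDR 64   /* assumed maximum length of a web address */

/* Hypothetical record: one web address and its total number of hits. */
typedef struct site
{
    char address[MAX_ADDR];
    int  hits;
} SITE;

/* Comparison function for qsort(): larger hit counts come first. */
static int CompareHits(const void *a, const void *b)
{
    const SITE *left  = (const SITE *) a;
    const SITE *right = (const SITE *) b;

    return (right->hits > left->hits) - (right->hits < left->hits);
}

/* Sorts the array in place so that the most visited site is first. */
void SortByHits(SITE sites[], int count)
{
    qsort(sites, (size_t) count, sizeof(SITE), CompareHits);
}

/* Writes one chart row into buf: the address left-justified in 20
 * columns, the hits right-justified in 6, and the percentage of all
 * hits in 6 columns with 2 decimal places. A fixed width and precision
 * is what lines the decimal points up from row to row.
 * Returns the value returned by snprintf(). */
int FormatRow(char *buf, size_t size, const SITE *site, int totalHits)
{
    double percent = 0.0;

    if (totalHits > 0)
    {
        percent = 100.0 * site->hits / totalHits;
    }

    return snprintf(buf, size, "%-20s %6d %6.2f",
                    site->address, site->hits, percent);
}
```

A caller would sort the array once with SortByHits(), then loop over it, calling FormatRow() for each element and printing the resulting string. The same format string could be passed directly to printf() instead of building the row in a buffer first.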

The user will need to enter the following information:

The data file contains the following information per line:

An example web address would be "www.umbc.edu" where:

You can view the History.dat file to examine its contents and see its format.
Do NOT copy and paste the file from this webpage. Doing so will corrupt the file.
You should copy this file into your account by using the following command:

cp /afs/umbc.edu/users/b/o/bogar/pub/History.dat .
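Reading the data file into an array of structures, and then closing the file, might look something like the sketch below. It makes a strong assumption purely for illustration: that each line begins with a web address as its first whitespace-delimited token, and that every line is one visit. Examine History.dat for the actual format and adapt accordingly. The SITE structure, the MAX_ADDR limit, and the function names RecordVisit and LoadSites are all hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define MAX_ADDR 64   /* assumed maximum length of a web address */

/* Hypothetical record: one web address and its total number of hits. */
typedef struct site
{
    char address[MAX_ADDR];
    int  hits;
} SITE;

/* Records one visit to address. The number of sites in use is passed
 * by reference so that it can be increased when a new site is added.
 * Returns 1 on success, or 0 if the site is new and the array is full. */
int RecordVisit(SITE sites[], int *count, int max, const char *address)
{
    int i;

    /* String comparison: has this address been seen before? */
    for (i = 0; i < *count; i++)
    {
        if (strcmp(sites[i].address, address) == 0)
        {
            sites[i].hits++;
            return 1;
        }
    }

    if (*count >= max)
    {
        return 0;
    }

    strncpy(sites[*count].address, address, MAX_ADDR - 1);
    sites[*count].address[MAX_ADDR - 1] = '\0';
    sites[*count].hits = 1;
    (*count)++;
    return 1;
}

/* Reads the file one line at a time, treating the first token of each
 * line as a web address (an assumption; lines are also assumed to fit
 * in the 256-character buffer). The file is closed before returning.
 * Returns the number of distinct sites stored, or -1 if the file
 * could not be opened. */
int LoadSites(const char *filename, SITE sites[], int max)
{
    FILE *ifp = fopen(filename, "r");
    char  line[256];
    char  address[MAX_ADDR];
    int   count = 0;

    if (ifp == NULL)
    {
        return -1;
    }

    while (fgets(line, sizeof line, ifp) != NULL)
    {
        if (sscanf(line, "%63s", address) == 1)
        {
            RecordVisit(sites, &count, max, address);
        }
    }

    fclose(ifp);
    return count;
}
```

A call such as LoadSites("History.dat", sites, MAX_SITES), where MAX_SITES is whatever array size you choose, would fill the array. Because the file is closed inside LoadSites(), all later questions are answered from the array alone.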

The Specifications

EXTRA CREDIT

Sample Run

The sample run when using History.dat can be seen here: Sample Output

Submitting the Program

You are to use separate compilation for this project, so you will be submitting five files.
Your C source code file that contains main() MUST be called proj3.c. You should also have files called dataMine.c, dataMine.h, util.c, and util.h.
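As a reminder of how separate compilation is usually laid out, here is a minimal sketch of one header and source file pair. The function Percent is a hypothetical helper chosen only for illustration; which functions belong in util.c and which in dataMine.c is part of your design. Both pieces are shown in one block so the sketch compiles on its own, with comments marking where each file would begin.

```c
/* ---- util.h (sketch) ----
 * The guard macro keeps the contents from being processed twice if
 * the header is included more than once in the same source file. */
#ifndef UTIL_H
#define UTIL_H

/* Hypothetical helper: returns part as a percentage of total,
 * or 0.0 when total is not positive. */
double Percent(int part, int total);

#endif

/* ---- util.c (sketch) ----
 * In the real file this section would begin with: #include "util.h" */
double Percent(int part, int total)
{
    if (total <= 0)
    {
        return 0.0;
    }

    return 100.0 * part / total;
}
```

With this layout, proj3.c includes util.h to see the prototype, each .c file is compiled to its own object file (for example with gcc -c util.c), and the object files are then linked together into one executable.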

To submit your project, type the following at the Unix prompt. Note that the project name starts with uppercase 'P'.

submit cs201 Proj3 proj3.c dataMine.c dataMine.h util.c util.h

To verify that your project was submitted, you can execute the following command at the Unix prompt. It will show all files that you submitted in a format similar to the Unix 'ls' command.

submitls cs201 Proj3