UMBC CS 201, Fall 08
CMSC 201 Programming Project Three
 
Web Data Mining
 Out: Thursday 10/23/08
 Due: Before 11:59 p.m., Sunday 11/9/08
 The design document for this project, design3.txt, is due: Before 11:59 p.m., Sunday 11/2/08
The Objective 
           
The objective of this assignment is to give you practice with project and
function design.  It will also give you an opportunity to work with reading
information from a file, using structures, an array of structures, passing by 
reference, string comparison, sorting, and some formatted printing.
The Background 
There are many companies that will pay people to find out which
websites are popular so that they may better select where to place their 
advertisements.  We at CMSC201 Inc. are interested in where our 
students spend most of their time while using the Internet so that we may 
make our website more appealing to the student population.
I have installed a data gathering program on a set of computers which have been
distributed throughout the campus.  This program monitors and logs every
website any student visits while logged into one of these computers.  Yes,
all the places you have visited.  The logs generated have been gathered and 
combined into one extremely large data file named History.dat, which we will 
be working with for this project.
Data mining is a field in computer science where researchers look at creating
new ways of analyzing large amounts of data. Their goal is to summarize any
relationships that exist, and/or find relationships that are not as easily 
visible.  This is one field that is currently growing in the UMBC Computer 
Science Department.
The Task
Design and code a project that will allow you to read in information from a
data file, store it in arrays of structures so that the file can be closed,
and then use the information in the arrays to answer some statistical 
questions.
We are interested in knowing: 
-  the total number of unique websites found in the file
-  the total number of websites with each domain name ending.
   We have chosen to limit this to: .net, .gov, .edu, and .com
-  total number of websites that were visited more than once
-  total time logged
-  the statistical breakdown by type of websites visited
-  the best content to add to our website ( based on number of hits )
and will need to produce a chart of data mined from this file summarizing our 
areas of interest and calculations.  The charts should present the data nicely
formatted with columns of values aligned by their decimal points.  Your program
will also need to print out lists of websites sorted by their total number of 
hits.  See the sample output.
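For the decimal-point alignment mentioned above, printf-style width and precision specifiers are usually enough. The following is a minimal sketch only; the helper name FormatRow and the column widths (12 and 8) are assumptions chosen for illustration and are not required by the assignment.

```c
#include <stdio.h>

/* Hypothetical helper: formats one chart row into buf.
 * The label is left-justified in 12 columns and the value is
 * right-justified in 8 columns with 2 decimal places, so the decimal
 * points line up from one row to the next.
 * Returns whatever snprintf() returns. */
int FormatRow(char buf[], size_t size, const char label[], double percent)
{
    return snprintf(buf, size, "%-12s%8.2f%%", label, percent);
}
```

The same format string can be passed directly to printf() if you do not need the row stored in a buffer.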
The user will need to enter the following information:
-  the name of the data file to be used (a string of size 30 or less)
The data file contains the following information per line:
-  web address
-  site type (EDUCATION, FUN, NEWS, and SEARCH are the only types of sites)
-  logged time in seconds
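As a rough illustration of reading these three fields, the sketch below parses one line of text with sscanf(). It ASSUMES the fields are separated by whitespace, which you should confirm by looking at History.dat. The names ParseLine and MAX_URL_SIZE are hypothetical and not part of the assignment.

```c
#include <stdio.h>

#define MAX_URL_SIZE       60   /* hypothetical: room for a whole web address */
#define MAX_SITE_TYPE_SIZE 10   /* from the assignment */

/* Hypothetical helper: splits one line into its three fields.
 * Assumption: the fields are whitespace-separated, in the order
 * web address, site type, seconds.
 * Returns 1 if all three fields were read, 0 otherwise. */
int ParseLine(const char line[], char url[], char siteType[], int *pSeconds)
{
    /* The widths (59 and 9) leave room for the terminating '\0'. */
    return sscanf(line, "%59s %9s %d", url, siteType, pSeconds) == 3;
}
```

When reading straight from an open file, fscanf() accepts the same format string.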
An example web address would be "www.umbc.edu" where :
-  www - is the access code
-  umbc - is the domain name
-  edu - is the domain name ending
You can view the History.dat file to examine its
contents and see its format.
  
Do NOT copy and paste the file from this webpage.  Doing so will corrupt the file.
You should copy this file into your account by using the following command:
      cp /afs/umbc.edu/users/b/o/bogar/pub/History.dat .
The Specifications
   - You must use the following structure definitions and typedefs:
         typedef struct address
         {
            char  accessCode[MAX_ACCESS_CODE_SIZE];
            char  domainName[MAX_DOMAIN_SIZE];
            char  domainEnding[MAX_DOMAIN_ENDING_SIZE];
         }ADDRESS;
   
   
         typedef struct website
         {
            ADDRESS webAddress;
            char  siteType[MAX_SITE_TYPE_SIZE];
            int   hits;
            int   seconds;
         }WEBSITE;
   
where MAX_ACCESS_CODE_SIZE is defined to be 15, MAX_DOMAIN_SIZE is 30, 
MAX_DOMAIN_ENDING_SIZE is defined to be 4, and MAX_SITE_TYPE_SIZE is 10.
 ADDRESS & these 4 #defines HAVE ALREADY BEEN WRITTEN AND EXIST IN proj3helper.h
 You'll need to define WEBSITE in your dataMine.h file.
 
-  You will store the information that you get from the file into arrays,
where each element is a website. Using 4 arrays will make your task much 
simpler.  These arrays will contain websites and should be capable of 
holding up to 100 websites each.  You MUST have arrays named: netSites,
govSites, eduSites, and comSites that contain websites of corresponding
domain name endings.
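One possible way to route a website to the right array is to compare its domainEnding with strcmp(). This is only a sketch: the function name SelectArray is hypothetical, and it assumes the endings are stored in lowercase without the leading dot.

```c
#include <string.h>

#define MAX_ACCESS_CODE_SIZE   15
#define MAX_DOMAIN_SIZE        30
#define MAX_DOMAIN_ENDING_SIZE 4
#define MAX_SITE_TYPE_SIZE     10

typedef struct address
{
    char  accessCode[MAX_ACCESS_CODE_SIZE];
    char  domainName[MAX_DOMAIN_SIZE];
    char  domainEnding[MAX_DOMAIN_ENDING_SIZE];
} ADDRESS;

typedef struct website
{
    ADDRESS webAddress;
    char  siteType[MAX_SITE_TYPE_SIZE];
    int   hits;
    int   seconds;
} WEBSITE;

/* Hypothetical helper: returns the array that matches the ending,
 * or NULL if the ending is not one of the four we track.
 * Assumption: endings are lowercase, e.g. "edu". */
WEBSITE *SelectArray(const char ending[], WEBSITE netSites[],
                     WEBSITE govSites[], WEBSITE eduSites[],
                     WEBSITE comSites[])
{
    if (strcmp(ending, "net") == 0) return netSites;
    if (strcmp(ending, "gov") == 0) return govSites;
    if (strcmp(ending, "edu") == 0) return eduSites;
    if (strcmp(ending, "com") == 0) return comSites;
    return NULL;
}
```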
 
-  You must ask the user to enter the name of the data file to be used and
you must allow that filename to be as many as 30 characters long.  Any data
file we use for testing will have exactly the same format as History.dat.
 
-  You must have a function named ComputeTotals() and it must have the
following prototype:
     void ComputeTotals(WEBSITE array[], int *pTotalUniqueSites,
                        int *pTotalNonUniqueSites, int *pTotalTimeLogged,
                        int *pTotalFunSites, int *pTotalSearchSites,
                        int *pTotalEducationSites, int *pTotalNewsSites);
where :
 pTotalUniqueSites and pTotalTimeLogged are pointers to the variables 
totalUniqueSites and totalLoggedTime, respectively.
 pTotalNonUniqueSites is a pointer to a variable which counts the total number 
of websites with more than one hit.
 The pointers pTotalFunSites, pTotalNewsSites, pTotalSearchSites, and 
pTotalEducationSites all point to variables which hold the number of sites 
based on the content types of websites.
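The sketch below shows one way the body of ComputeTotals() might look. It is not the required implementation and rests on several assumptions: unused slots hold a placeholder website with hits == -1, the caller sets every total to 0 before the first call, the function is called once per array so the totals accumulate, and each type total counts sites rather than hits. MAX_SITES is a hypothetical name for the array capacity.

```c
#include <string.h>

#define MAX_ACCESS_CODE_SIZE   15
#define MAX_DOMAIN_SIZE        30
#define MAX_DOMAIN_ENDING_SIZE 4
#define MAX_SITE_TYPE_SIZE     10
#define MAX_SITES              100   /* hypothetical name for the capacity */

typedef struct address
{
    char  accessCode[MAX_ACCESS_CODE_SIZE];
    char  domainName[MAX_DOMAIN_SIZE];
    char  domainEnding[MAX_DOMAIN_ENDING_SIZE];
} ADDRESS;

typedef struct website
{
    ADDRESS webAddress;
    char  siteType[MAX_SITE_TYPE_SIZE];
    int   hits;
    int   seconds;
} WEBSITE;

void ComputeTotals(WEBSITE array[], int *pTotalUniqueSites,
                   int *pTotalNonUniqueSites, int *pTotalTimeLogged,
                   int *pTotalFunSites, int *pTotalSearchSites,
                   int *pTotalEducationSites, int *pTotalNewsSites)
{
    int i;

    /* Assumption: the first slot with hits == -1 marks the end of the data. */
    for (i = 0; i < MAX_SITES && array[i].hits != -1; i++)
    {
        (*pTotalUniqueSites)++;              /* each element is one unique site */
        if (array[i].hits > 1)
        {
            (*pTotalNonUniqueSites)++;       /* visited more than once */
        }
        *pTotalTimeLogged += array[i].seconds;

        if (strcmp(array[i].siteType, "FUN") == 0)
            (*pTotalFunSites)++;
        else if (strcmp(array[i].siteType, "SEARCH") == 0)
            (*pTotalSearchSites)++;
        else if (strcmp(array[i].siteType, "EDUCATION") == 0)
            (*pTotalEducationSites)++;
        else if (strcmp(array[i].siteType, "NEWS") == 0)
            (*pTotalNewsSites)++;
    }
}
```

Because the totals are passed by reference, calling the function once for each of the four arrays adds every array's contribution to the same variables in the caller.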
 
-  You must have a function named ArrayLength() and it must have
the following prototype:
     int ArrayLength(WEBSITE array[]);
which takes an array of websites, which might not be full, and returns the
total number of websites in the array.
 HINT: Filling the arrays initially with a constant DUMMY website 
as a place holder is necessary to determine how many websites have been
entered into the array.
 A DUMMY website might have the form:
 accessCode = "WWW"
 domainName = "DUMMY"
 domainEnding = "COM"
 hits = -1
 seconds = -1
 siteType = ""
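To make the hint concrete, here is a minimal sketch of filling an array with the DUMMY website and then counting the real entries. FillWithDummy and MAX_SITES are hypothetical names, and using hits == -1 as the test for a placeholder is an assumption; you may prefer to compare other members.

```c
#define MAX_ACCESS_CODE_SIZE   15
#define MAX_DOMAIN_SIZE        30
#define MAX_DOMAIN_ENDING_SIZE 4
#define MAX_SITE_TYPE_SIZE     10
#define MAX_SITES              100   /* hypothetical name for the capacity */

typedef struct address
{
    char  accessCode[MAX_ACCESS_CODE_SIZE];
    char  domainName[MAX_DOMAIN_SIZE];
    char  domainEnding[MAX_DOMAIN_ENDING_SIZE];
} ADDRESS;

typedef struct website
{
    ADDRESS webAddress;
    char  siteType[MAX_SITE_TYPE_SIZE];
    int   hits;
    int   seconds;
} WEBSITE;

/* Hypothetical helper: puts the DUMMY website in every slot. */
void FillWithDummy(WEBSITE array[])
{
    WEBSITE dummy = { { "WWW", "DUMMY", "COM" }, "", -1, -1 };
    int i;

    for (i = 0; i < MAX_SITES; i++)
    {
        array[i] = dummy;    /* structures can be copied by assignment */
    }
}

/* Counts the websites stored before the first DUMMY placeholder.
 * The bound on count keeps the loop inside a completely full array. */
int ArrayLength(WEBSITE array[])
{
    int count = 0;

    while (count < MAX_SITES && array[count].hits != -1)
    {
        count++;
    }
    return count;
}
```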
 
 
-  You must have a function named IsPresent() and it must have the
following prototype:
     int IsPresent(WEBSITE array[], WEBSITE newSite);
which takes an array of websites and an individual website, checks for a 
duplicate, and returns one of two possible values.  One possible value is 
the index to the duplicate.  If no duplicate is found, the value is the 
index of the first available open location in the array.
 HINT: In order for a website to be considered a duplicate, it must
share the same accessCode, domainName, and domainEnding.  You may even want
to make this a function.
 In the event that there is a duplicate website, the following must be done:
    -  increase the hits of the first copy by 1
    -  add the time spent on the duplicate to the original's logged time.
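The sketch below illustrates one reading of this specification; it is not the required solution. It assumes unused slots are marked with hits == -1 and that a newly stored website starts with 1 hit. SameAddress, RecordVisit, and MAX_SITES are hypothetical names. Note that the caller has to inspect the returned slot to tell a duplicate from an open location.

```c
#include <string.h>

#define MAX_ACCESS_CODE_SIZE   15
#define MAX_DOMAIN_SIZE        30
#define MAX_DOMAIN_ENDING_SIZE 4
#define MAX_SITE_TYPE_SIZE     10
#define MAX_SITES              100   /* hypothetical name for the capacity */

typedef struct address
{
    char  accessCode[MAX_ACCESS_CODE_SIZE];
    char  domainName[MAX_DOMAIN_SIZE];
    char  domainEnding[MAX_DOMAIN_ENDING_SIZE];
} ADDRESS;

typedef struct website
{
    ADDRESS webAddress;
    char  siteType[MAX_SITE_TYPE_SIZE];
    int   hits;
    int   seconds;
} WEBSITE;

/* Hypothetical helper: 1 if all three parts of the address match. */
int SameAddress(ADDRESS a, ADDRESS b)
{
    return strcmp(a.accessCode, b.accessCode) == 0 &&
           strcmp(a.domainName, b.domainName) == 0 &&
           strcmp(a.domainEnding, b.domainEnding) == 0;
}

/* Returns the index of the duplicate, or the index of the first open
 * slot if there is none (MAX_SITES when the array is already full). */
int IsPresent(WEBSITE array[], WEBSITE newSite)
{
    int i;

    for (i = 0; i < MAX_SITES && array[i].hits != -1; i++)
    {
        if (SameAddress(array[i].webAddress, newSite.webAddress))
        {
            return i;
        }
    }
    return i;
}

/* Hypothetical helper: stores a new site or updates the original.
 * Returns 0 if the array is full and the site could not be stored. */
int RecordVisit(WEBSITE array[], WEBSITE newSite)
{
    int index = IsPresent(array, newSite);

    if (index == MAX_SITES)
    {
        return 0;
    }
    if (array[index].hits == -1)          /* open slot: first visit */
    {
        array[index] = newSite;
        array[index].hits = 1;
    }
    else                                  /* duplicate: update the original */
    {
        array[index].hits++;
        array[index].seconds += newSite.seconds;
    }
    return 1;
}
```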
	    
 
 
-  Converting a string into an ADDRESS is a difficult task, so we have
provided some code which can be obtained by executing this command:
     cp /afs/umbc.edu/users/r/b/rberge1/pub/proj3/proj3helper.* .
In this file there is a function:
     ADDRESS CreateAddress(char webaddress[]);
This function breaks a web address down and fills the members of the ADDRESS
structure based on a '.' token.
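To give a feel for what splitting on '.' involves, here is a stand-in sketch. It is NOT the code in proj3helper.c, which may work differently; the name SplitAddress is hypothetical and the sketch assumes an address with exactly three dot-separated parts. In your project you should call the provided CreateAddress().

```c
#include <stdio.h>

#define MAX_ACCESS_CODE_SIZE   15
#define MAX_DOMAIN_SIZE        30
#define MAX_DOMAIN_ENDING_SIZE 4

typedef struct address
{
    char  accessCode[MAX_ACCESS_CODE_SIZE];
    char  domainName[MAX_DOMAIN_SIZE];
    char  domainEnding[MAX_DOMAIN_ENDING_SIZE];
} ADDRESS;

/* Hypothetical stand-in for CreateAddress().
 * %14[^.] reads up to 14 characters that are not '.', which leaves room
 * for the '\0' in accessCode; 29 and 3 do the same for the other members.
 * If the text does not have three parts, all members are left empty. */
ADDRESS SplitAddress(const char webaddress[])
{
    ADDRESS result = { "", "", "" };

    if (sscanf(webaddress, "%14[^.].%29[^.].%3s", result.accessCode,
               result.domainName, result.domainEnding) != 3)
    {
        result.accessCode[0]   = '\0';
        result.domainName[0]   = '\0';
        result.domainEnding[0] = '\0';
    }
    return result;
}
```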
 
-  YOU MUST Sort each array. Sorting the arrays must result in the 
arrays being in descending order based on the number of hits.
 HINT: You may use code that has been provided for you in the lecture 
notes or that you have written in lab.  However, you must modify it so that it 
handles an array of websites and sorts the array in descending order.  Also, 
please make sure you account for the array not being entirely full when it is 
sorted.
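As one illustration, a selection sort can be adapted as sketched below. The function name SortByHits and its count parameter are assumptions; count is the number of real websites in the array, so any placeholder elements past that point are never touched. Selection sort is used here only as an example and may not be the sort shown in lecture.

```c
#define MAX_ACCESS_CODE_SIZE   15
#define MAX_DOMAIN_SIZE        30
#define MAX_DOMAIN_ENDING_SIZE 4
#define MAX_SITE_TYPE_SIZE     10

typedef struct address
{
    char  accessCode[MAX_ACCESS_CODE_SIZE];
    char  domainName[MAX_DOMAIN_SIZE];
    char  domainEnding[MAX_DOMAIN_ENDING_SIZE];
} ADDRESS;

typedef struct website
{
    ADDRESS webAddress;
    char  siteType[MAX_SITE_TYPE_SIZE];
    int   hits;
    int   seconds;
} WEBSITE;

/* Hypothetical helper: selection sort, largest number of hits first.
 * Only the first count elements are examined or moved. */
void SortByHits(WEBSITE array[], int count)
{
    int i, j, largest;
    WEBSITE temp;

    for (i = 0; i < count - 1; i++)
    {
        /* find the element with the most hits in array[i..count-1] */
        largest = i;
        for (j = i + 1; j < count; j++)
        {
            if (array[j].hits > array[largest].hits)
            {
                largest = j;
            }
        }

        /* swap whole structures so every member moves together */
        if (largest != i)
        {
            temp = array[i];
            array[i] = array[largest];
            array[largest] = temp;
        }
    }
}
```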
 
-  Printing Statistics: will involve calculating the percentage of hits 
for each site type (FUN, EDUCATION, SEARCH & NEWS).  You can find this value 
by dividing the total number of hits for each type by the total number of 
unique websites.
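A common pitfall with this calculation is integer division, which would throw away the fractional part. The sketch below, with the hypothetical name Percentage, multiplies by 100.0 first so the arithmetic is done in double; returning 0.0 for an empty data set is an assumption about how you want to handle that case.

```c
/* Hypothetical helper: percentage of typeTotal out of totalUniqueSites.
 * Multiplying by 100.0 first forces floating-point division. */
double Percentage(int typeTotal, int totalUniqueSites)
{
    if (totalUniqueSites == 0)
    {
        return 0.0;    /* assumption: avoid dividing by zero on empty data */
    }
    return 100.0 * typeTotal / totalUniqueSites;
}
```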
 
-  Forming the Suggestion: will involve comparing the total number of hits 
for each site type.  Print out the suggested content type to be added based 
on these comparisons.
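One way to make these comparisons is sketched below. It ASSUMES that the suggestion is the site type with the largest total and that ties go to whichever type is checked first; check the sample output to see what is actually expected. SuggestContent is a hypothetical name.

```c
/* Hypothetical helper: returns the name of the type with the largest total.
 * Ties are resolved in the order FUN, SEARCH, EDUCATION, NEWS because a
 * later type replaces the current best only when it is strictly larger. */
const char *SuggestContent(int fun, int search, int education, int news)
{
    const char *best = "FUN";
    int bestTotal = fun;

    if (search > bestTotal)
    {
        best = "SEARCH";
        bestTotal = search;
    }
    if (education > bestTotal)
    {
        best = "EDUCATION";
        bestTotal = education;
    }
    if (news > bestTotal)
    {
        best = "NEWS";
    }
    return best;
}
```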
 
-  If you would like to have a function called OpenFile(), then it should 
return a FILE* to the calling function. You are NOT required to have a 
function called OpenFile(). You may open the file in main() and then pass the 
FILE* to the function that reads from the file if you prefer. Don't forget to 
close the file as soon as you have finished reading from it.
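If you do choose to write OpenFile(), it might look something like the sketch below. The parameter list and the error message are assumptions, since the assignment only states that the function returns a FILE*.

```c
#include <stdio.h>

/* Opens the named file for reading.
 * Returns the FILE* on success, or NULL (after printing a message)
 * if the file could not be opened. */
FILE *OpenFile(const char filename[])
{
    FILE *ifp = fopen(filename, "r");

    if (ifp == NULL)
    {
        fprintf(stderr, "Could not open file %s\n", filename);
    }
    return ifp;
}
```

The calling function should check for NULL before reading and should call fclose() on the returned pointer once all of the data has been stored in the arrays.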
 
-  Do NOT use dynamic memory allocation for the arrays of websites, just 
declare them (in main). If the file contains more websites than can be 
stored in any one of the arrays, each of which has size 100, then you should 
print an error message to the user and exit the program.
EXTRA CREDIT
-  The file has the time spent on each website in seconds.  For extra credit,
convert this value to HH:MM:SS any time it is printed to the screen.  This will
be worth 5 points.
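The conversion itself needs only integer division and the remainder operator. Here is a minimal sketch; the name FormatTime and the choice to write into a caller-supplied buffer are assumptions, and it expects a non-negative number of seconds.

```c
#include <stdio.h>

/* Hypothetical helper: writes totalSeconds into buf as HH:MM:SS.
 * Each field is zero-padded to two digits; hours may use more than
 * two digits for very large totals.
 * Returns whatever snprintf() returns. */
int FormatTime(int totalSeconds, char buf[], size_t size)
{
    int hours   = totalSeconds / 3600;          /* 3600 seconds per hour */
    int minutes = (totalSeconds % 3600) / 60;   /* leftover whole minutes */
    int seconds = totalSeconds % 60;            /* leftover seconds */

    return snprintf(buf, size, "%02d:%02d:%02d", hours, minutes, seconds);
}
```

For example, 3725 seconds is 1 hour, 2 minutes, and 5 seconds, so the buffer would hold 01:02:05.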
Sample Run
The sample run when using History.dat can be seen here: Sample Output
Submitting the Program
You are to use separate compilation for this project, so you will be 
submitting five files.
Your C source code file that contains main() MUST be called proj3.c. 
You should also have files called dataMine.c, dataMine.h, util.c,
and util.h.
To submit your project, type the following at the Unix prompt. Note that the 
project name starts with uppercase 'P'.
submit cs201 Proj3 proj3.c dataMine.c dataMine.h util.c util.h
To verify that your project was submitted, you can execute the following 
command at the Unix prompt. It will show all files that you submitted in a 
format similar to the Unix 'ls' command.
submitls cs201 Proj3