UMBC CS 201, Fall 08
CMSC 201
Programming Project Three
Web Data Mining
Out: Thursday 10/23/08
Due: Before 11:59 p.m., Sunday 11/9/08
The design document for this project, design3.txt, is due: Before 11:59 p.m., Sunday 11/2/08
The Objective
The objective of this assignment is to give you practice with project and
function design. It will also give you an opportunity to work with reading
information from a file, using structures, an array of structures, passing by
reference, string comparison, sorting, and some formatted printing.
The Background
There are many companies that will pay people to find out which
websites are popular so that they may better select where to place their
advertisements. We at CMSC201 Inc. are interested in where our
students spend most of their time while using the Internet so that we may
make our website more appealing to the student population.
I have installed a data gathering program on a set of computers which have been
distributed throughout the campus. This program monitors and logs every
website any student visits while logged into one of these computers. Yes,
all the places you have visited. The logs generated have been gathered and
formed into one extremely large data file we will be working with for this
project named History.dat.
Data mining is a field in computer science where researchers look at creating
new ways of analyzing large amounts of data. Their goal is to summarize any
relationships that exist, and/or find relationships that are not as easily
visible. This is one field that is currently growing in the UMBC Computer
Science Department.
The Task
Design and code a project that will allow you to read in information from a
data file, store it in arrays of structures so that the file can be closed,
and then use the information in the arrays to answer some statistical
questions.
We are interested in knowing:
- the total number of unique websites found in the file
- the total number of each of the domain name endings (.net, .gov, etc.).
We have chosen to limit this to: .net, .gov, .edu, and .com
- total number of websites that were visited more than once
- total time logged
- the statistical breakdown by type of websites visited
- the best content to add to our website ( based on number of hits )
Your program will need to produce a chart of data mined from this file that
summarizes our areas of interest and calculations. The charts should present
the data nicely formatted, with columns of values aligned by their decimal
points. Your program will also need to print out lists of websites sorted by
their total number of hits. See the sample output.
The user will need to enter the following information:
- the name of the data file to be used (a string of size 30 or less)
The data file contains the following information per line:
- web address
- site type (EDUCATION, FUN, NEWS, and SEARCH are the only types of sites)
- logged time in seconds
An example web address would be "www.umbc.edu", where:
- www - is the access code
- umbc - is the domain name
- edu - is the domain name ending
You can view the History.dat file to examine its
contents and see its format.
Do NOT copy and paste the file from this webpage. Doing so will corrupt the file.
You should copy this file into your account by using the following command:
cp /afs/umbc.edu/users/b/o/bogar/pub/History.dat .
The Specifications
- You must use the following structure definitions and typedefs:
typedef struct address
{
    char accessCode[MAX_ACCESS_CODE_SIZE];
    char domainName[MAX_DOMAIN_SIZE];
    char domainEnding[MAX_DOMAIN_ENDING_SIZE];
}ADDRESS;

typedef struct website
{
    ADDRESS webAddress;
    char siteType[MAX_SITE_TYPE_SIZE];
    int hits;
    int seconds;
}WEBSITE;
where MAX_ACCESS_CODE_SIZE is defined to be 15, MAX_DOMAIN_SIZE is 30,
MAX_DOMAIN_ENDING_SIZE is defined to be 4, and MAX_SITE_TYPE_SIZE is 10.
ADDRESS & these 4 #defines HAVE ALREADY BEEN WRITTEN AND EXIST IN proj3helper.h
You'll need to define WEBSITE in your dataMine.h file.
- You will store the information that you get from the file into arrays,
where each element is a website. Using 4 arrays will make your task much
simpler. These arrays will contain websites and should be capable of
maintaining up to 100 websites each. You MUST have arrays named: netSites,
govSites, eduSites, and comSites that contain websites of corresponding
domain name endings.
- You must ask the user to enter the name of the data file to be used and
you must allow that filename to be as much as 30 characters long. Any data
file we use for testing will have exactly the same format as History.dat.
- You must have a function named ComputeTotals() and it must have the
following prototype:
void ComputeTotals(WEBSITE array[], int *pTotalUniqueSites,
                   int *pTotalNonUniqueSites, int *pTotalTimeLogged,
                   int *pTotalFunSites, int *pTotalSearchSites,
                   int *pTotalEducationSites, int *pTotalNewsSites);
where:
pTotalUniqueSites and pTotalTimeLogged are pointers to the variables
totalUniqueSites and totalLoggedTime, respectively.
pTotalNonUniqueSites is a pointer to a variable which counts the total number
of websites with more than one hit.
The pointers pTotalFunSites, pTotalNewsSites, pTotalSearchSites, and
pTotalEducationSites all point to variables which hold the number of sites
based on the content types of websites.
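As a minimal sketch of how the pass-by-reference parameters could be used, the function below walks one array and updates the caller's totals through the pointers. It makes several assumptions that are not part of the specification: the capacity constant is called MAX_SITES, unused slots are marked with hits == -1 (the DUMMY idea), the totals are added to rather than reset so the function can be called once per array, and each unique site is counted once toward its content type. Check the sample output before relying on that last choice.

```c
#include <string.h>

/* In the project, ADDRESS and these four sizes come from proj3helper.h. */
#define MAX_ACCESS_CODE_SIZE   15
#define MAX_DOMAIN_SIZE        30
#define MAX_DOMAIN_ENDING_SIZE 4
#define MAX_SITE_TYPE_SIZE     10
#define MAX_SITES              100   /* assumed name for the array capacity */

typedef struct address
{
    char accessCode[MAX_ACCESS_CODE_SIZE];
    char domainName[MAX_DOMAIN_SIZE];
    char domainEnding[MAX_DOMAIN_ENDING_SIZE];
}ADDRESS;

typedef struct website
{
    ADDRESS webAddress;
    char siteType[MAX_SITE_TYPE_SIZE];
    int hits;
    int seconds;
}WEBSITE;

/* Counts real entries; assumes unused slots hold the marker hits == -1. */
int ArrayLength(WEBSITE array[])
{
    int count = 0;

    while (count < MAX_SITES && array[count].hits != -1)
    {
        count++;
    }
    return count;
}

/* Adds this array's contribution to the totals owned by the caller. */
void ComputeTotals(WEBSITE array[], int *pTotalUniqueSites,
                   int *pTotalNonUniqueSites, int *pTotalTimeLogged,
                   int *pTotalFunSites, int *pTotalSearchSites,
                   int *pTotalEducationSites, int *pTotalNewsSites)
{
    int i;
    int length = ArrayLength(array);

    for (i = 0; i < length; i++)
    {
        (*pTotalUniqueSites)++;          /* every stored entry is unique */
        if (array[i].hits > 1)
        {
            (*pTotalNonUniqueSites)++;   /* visited more than once */
        }
        *pTotalTimeLogged += array[i].seconds;

        /* Assumption: one count per unique site of each type. */
        if (strcmp(array[i].siteType, "FUN") == 0)
        {
            (*pTotalFunSites)++;
        }
        else if (strcmp(array[i].siteType, "SEARCH") == 0)
        {
            (*pTotalSearchSites)++;
        }
        else if (strcmp(array[i].siteType, "EDUCATION") == 0)
        {
            (*pTotalEducationSites)++;
        }
        else if (strcmp(array[i].siteType, "NEWS") == 0)
        {
            (*pTotalNewsSites)++;
        }
    }
}
```

With this approach the caller sets all seven totals to 0 once, then calls ComputeTotals() for each of the four arrays, passing the same addresses every time.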
- You must have a function named ArrayLength() and it must have
the following prototype:
int ArrayLength(WEBSITE array[]);
which takes an array of websites, which might not be full, and returns the
total number of websites in the array.
HINT: Filling the arrays initially with a constant DUMMY website
as a place holder is necessary to determine how many websites have been
entered into the array.
A DUMMY website might have the form:
accessCode = "WWW"
domainName = "DUMMY"
domainEnding = "COM"
hits = -1
seconds = -1
type = ""
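The DUMMY hint can be sketched as follows. This is only one possible approach: InitArray() and the constant MAX_SITES are hypothetical names that do not appear in the specification, and the sketch assumes that a hits value of -1 is what marks a slot as unused.

```c
#include <string.h>

/* In the project, ADDRESS and these four sizes come from proj3helper.h. */
#define MAX_ACCESS_CODE_SIZE   15
#define MAX_DOMAIN_SIZE        30
#define MAX_DOMAIN_ENDING_SIZE 4
#define MAX_SITE_TYPE_SIZE     10
#define MAX_SITES              100   /* assumed name for the array capacity */

typedef struct address
{
    char accessCode[MAX_ACCESS_CODE_SIZE];
    char domainName[MAX_DOMAIN_SIZE];
    char domainEnding[MAX_DOMAIN_ENDING_SIZE];
}ADDRESS;

typedef struct website
{
    ADDRESS webAddress;
    char siteType[MAX_SITE_TYPE_SIZE];
    int hits;
    int seconds;
}WEBSITE;

/* Hypothetical helper: fills every slot with the DUMMY placeholder. */
void InitArray(WEBSITE array[])
{
    int i;

    for (i = 0; i < MAX_SITES; i++)
    {
        strcpy(array[i].webAddress.accessCode, "WWW");
        strcpy(array[i].webAddress.domainName, "DUMMY");
        strcpy(array[i].webAddress.domainEnding, "COM");
        array[i].siteType[0] = '\0';
        array[i].hits = -1;
        array[i].seconds = -1;
    }
}

/* Counts entries up to the first DUMMY slot or the end of the array. */
int ArrayLength(WEBSITE array[])
{
    int count = 0;

    while (count < MAX_SITES && array[count].hits != -1)
    {
        count++;
    }
    return count;
}
```

Because real websites always have at least one hit, testing hits alone is enough here; comparing the domainName against "DUMMY" would work as well.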
- You must have a function named IsPresent() and it must have the
following prototype:
int IsPresent(WEBSITE array[], WEBSITE newSite);
which takes an array of websites and an individual website, checks for a
duplicate, and returns one of two possible values. One possible value is
the index to the duplicate. If no duplicate is found, the value is the
index of the first available open location in the array.
HINT: In order for a website to be considered a duplicate, it must
share the same accessCode, domainName, and domainEnding. You may even want
to make this a function.
In the event that there is a duplicate website, the following must be done:
- increase the hits of the first copy by 1
- add the time spent on the duplicate to the original's
logged time.
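One way the duplicate check and the update rules above might fit together is sketched here. SameAddress() and AddSite() are hypothetical helper names, MAX_SITES is an assumed constant, and unused slots are assumed to carry hits == -1. Returning MAX_SITES when the array is full and holds no duplicate is also an assumption of this sketch; the specification does not say what IsPresent() returns in that case.

```c
#include <string.h>

/* In the project, ADDRESS and these four sizes come from proj3helper.h. */
#define MAX_ACCESS_CODE_SIZE   15
#define MAX_DOMAIN_SIZE        30
#define MAX_DOMAIN_ENDING_SIZE 4
#define MAX_SITE_TYPE_SIZE     10
#define MAX_SITES              100   /* assumed name for the array capacity */

typedef struct address
{
    char accessCode[MAX_ACCESS_CODE_SIZE];
    char domainName[MAX_DOMAIN_SIZE];
    char domainEnding[MAX_DOMAIN_ENDING_SIZE];
}ADDRESS;

typedef struct website
{
    ADDRESS webAddress;
    char siteType[MAX_SITE_TYPE_SIZE];
    int hits;
    int seconds;
}WEBSITE;

/* Hypothetical helper: returns 1 when all three address parts match. */
int SameAddress(ADDRESS a, ADDRESS b)
{
    return strcmp(a.accessCode, b.accessCode) == 0 &&
           strcmp(a.domainName, b.domainName) == 0 &&
           strcmp(a.domainEnding, b.domainEnding) == 0;
}

/* Returns the index of the duplicate, or of the first open slot. */
int IsPresent(WEBSITE array[], WEBSITE newSite)
{
    int i;

    for (i = 0; i < MAX_SITES && array[i].hits != -1; i++)
    {
        if (SameAddress(array[i].webAddress, newSite.webAddress))
        {
            return i;
        }
    }
    return i;   /* first open slot, or MAX_SITES when the array is full */
}

/* Hypothetical helper: stores or merges newSite. Returns 0 if full. */
int AddSite(WEBSITE array[], WEBSITE newSite)
{
    int index = IsPresent(array, newSite);

    if (index >= MAX_SITES)
    {
        return 0;
    }
    if (array[index].hits == -1)
    {
        array[index] = newSite;      /* open slot: first visit */
        array[index].hits = 1;
    }
    else
    {
        array[index].hits++;         /* duplicate: merge into original */
        array[index].seconds += newSite.seconds;
    }
    return 1;
}
```

The caller can tell the two return values of IsPresent() apart by looking at the slot itself: an open slot still holds the DUMMY marker.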
- Converting a string into an ADDRESS is a difficult task, so we have
provided some code which can be obtained by executing this command:
cp /afs/umbc.edu/users/r/b/rberge1/pub/proj3/proj3helper.* .
In this file there is a function:
ADDRESS CreateAddress(char webaddress[]);
This function breaks a web address down and fills the members of the ADDRESS
structure based on a '.' token.
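To give a feel for what splitting on a '.' token involves, here is an illustrative sketch. It is NOT the code in proj3helper.c, which may work differently; SplitAddress(), CopyField(), and MAX_ADDRESS_SIZE are hypothetical names used only for this example. You should still call the provided CreateAddress() in your project.

```c
#include <string.h>

/* In the project, ADDRESS and these sizes come from proj3helper.h. */
#define MAX_ACCESS_CODE_SIZE   15
#define MAX_DOMAIN_SIZE        30
#define MAX_DOMAIN_ENDING_SIZE 4
#define MAX_ADDRESS_SIZE       64   /* assumed size for the working copy */

typedef struct address
{
    char accessCode[MAX_ACCESS_CODE_SIZE];
    char domainName[MAX_DOMAIN_SIZE];
    char domainEnding[MAX_DOMAIN_ENDING_SIZE];
}ADDRESS;

/* Copies at most size - 1 characters and always null-terminates. */
static void CopyField(char dest[], const char *src, size_t size)
{
    strncpy(dest, src, size - 1);
    dest[size - 1] = '\0';
}

/* Hypothetical stand-in that shows the idea behind CreateAddress(). */
ADDRESS SplitAddress(const char webaddress[])
{
    ADDRESS result = {"", "", ""};
    char copy[MAX_ADDRESS_SIZE];
    char *token;

    /* strtok() modifies its argument, so work on a local copy. */
    CopyField(copy, webaddress, sizeof(copy));

    token = strtok(copy, ".");
    if (token != NULL)
    {
        CopyField(result.accessCode, token, sizeof(result.accessCode));
    }
    token = strtok(NULL, ".");
    if (token != NULL)
    {
        CopyField(result.domainName, token, sizeof(result.domainName));
    }
    token = strtok(NULL, ".");
    if (token != NULL)
    {
        CopyField(result.domainEnding, token, sizeof(result.domainEnding));
    }
    return result;
}
```

For "www.umbc.edu" this fills accessCode with "www", domainName with "umbc", and domainEnding with "edu".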
- YOU MUST Sort each array. Sorting the arrays must result in the
arrays being in descending order based on the number of hits.
HINT: You may use code that has been provided for you in the lecture
notes or that you have written in lab. However, you must modify it to handle
an array of websites and to sort the array in descending order. Also,
please make sure you account for the array not being entirely full when it is
to be sorted.
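The sketch below adapts a selection sort to an array of websites, in descending order by hits. It is only one option, since any sort from lecture or lab can be modified the same way. SortByHits() is a hypothetical name, and the count parameter is an assumption of this sketch: passing in the number of real entries (for example, the value returned by ArrayLength()) keeps the unused slots out of the sort.

```c
/* In the project, ADDRESS and these four sizes come from proj3helper.h. */
#define MAX_ACCESS_CODE_SIZE   15
#define MAX_DOMAIN_SIZE        30
#define MAX_DOMAIN_ENDING_SIZE 4
#define MAX_SITE_TYPE_SIZE     10

typedef struct address
{
    char accessCode[MAX_ACCESS_CODE_SIZE];
    char domainName[MAX_DOMAIN_SIZE];
    char domainEnding[MAX_DOMAIN_ENDING_SIZE];
}ADDRESS;

typedef struct website
{
    ADDRESS webAddress;
    char siteType[MAX_SITE_TYPE_SIZE];
    int hits;
    int seconds;
}WEBSITE;

/* Selection sort on the first count entries, most hits first. */
void SortByHits(WEBSITE array[], int count)
{
    int i, j, maxIndex;
    WEBSITE temp;

    for (i = 0; i < count - 1; i++)
    {
        /* Find the entry with the most hits in array[i..count-1]. */
        maxIndex = i;
        for (j = i + 1; j < count; j++)
        {
            if (array[j].hits > array[maxIndex].hits)
            {
                maxIndex = j;
            }
        }

        /* Swap whole structures so each site keeps its own data. */
        if (maxIndex != i)
        {
            temp = array[i];
            array[i] = array[maxIndex];
            array[maxIndex] = temp;
        }
    }
}
```

Note that entire WEBSITE structures are swapped, not just the hits members; swapping only hits would attach the counts to the wrong addresses.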
- Printing Statistics: will involve calculating the percentage of hits
for each site type (FUN, EDUCATION, SEARCH & NEWS). You can find this value
by dividing the total number of hits for each type by the total number of
unique websites.
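The division described above could be sketched like this. Percentage() and PrintPercentRow() are hypothetical names, and the field widths in the format string are assumptions chosen only to show how a fixed width and precision line up the decimal points; match the sample output for the real layout.

```c
#include <stdio.h>

/* Returns part / whole as a percentage, or 0.0 when whole is 0. */
double Percentage(int part, int whole)
{
    if (whole == 0)
    {
        return 0.0;   /* avoid dividing by zero on an empty file */
    }
    /* 100.0 is a double, so the division is not integer division. */
    return 100.0 * part / whole;
}

/* Prints one row: left-justified label, then a right-justified value. */
void PrintPercentRow(const char label[], int part, int whole)
{
    printf("%-12s %7.2f%%\n", label, Percentage(part, whole));
}
```

Because every row uses the same %7.2f conversion, the decimal points fall in the same column no matter how many digits precede them.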
- Forming the Suggestion: will involve comparing the total number of hits
for each site type. Print out the suggested content type to be added based
on these comparisons.
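A possible shape for the comparison is shown below. SuggestContent() is a hypothetical name, and the rule for ties (the type checked first wins) is an assumption of this sketch, since the specification does not say how ties should be handled.

```c
/* Returns the name of the site type with the largest total. */
const char *SuggestContent(int fun, int search, int education, int news)
{
    const char *best = "FUN";
    int most = fun;

    /* Strict > means an earlier type keeps the lead on a tie. */
    if (search > most)
    {
        most = search;
        best = "SEARCH";
    }
    if (education > most)
    {
        most = education;
        best = "EDUCATION";
    }
    if (news > most)
    {
        most = news;
        best = "NEWS";
    }
    return best;
}
```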
- If you would like to have a function called OpenFile(), then it should
return a FILE* to the calling function. You are NOT required to have a
function called OpenFile(). You may open the file in main() and then pass the
FILE* to the function that reads from the file if you prefer. Don't forget to
close the file as soon as you have finished reading from it.
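If you do choose to write OpenFile(), a minimal sketch might look like the following. The parameter list and the decision to return NULL on failure (leaving it to the caller to decide what to do) are assumptions; the specification only requires that the function return a FILE*.

```c
#include <stdio.h>

/* Opens filename for reading. Returns NULL if it cannot be opened. */
FILE *OpenFile(const char filename[])
{
    FILE *ifp = fopen(filename, "r");

    if (ifp == NULL)
    {
        fprintf(stderr, "Error: could not open %s\n", filename);
    }
    return ifp;
}
```

When reading the filename from the user, remember that a 30-character name needs a char array of size 31 to leave room for the null terminator; a conversion such as %30s keeps scanf() from writing past the end of it.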
- Do NOT use dynamic memory allocation for the arrays of websites, just
declare them (in main). If there are more websites in the file than can be
stored in any one of the arrays, each of which has size 100, then you should
print an error message to the user and exit the program.
EXTRA CREDIT
- The file has the time spent on each website in seconds. For extra credit
convert this value to HH:MM:SS any time it is printed to the screen. This will
be worth 5 points.
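The conversion comes down to integer division and the modulus operator. The sketch below uses hypothetical names, SplitSeconds() and PrintTime(), and assumes the number of seconds is never negative.

```c
#include <stdio.h>

/* Breaks a number of seconds into hours, minutes, and seconds. */
void SplitSeconds(int totalSeconds, int *pHours, int *pMinutes,
                  int *pSeconds)
{
    *pHours   = totalSeconds / 3600;          /* 3600 seconds per hour */
    *pMinutes = (totalSeconds % 3600) / 60;   /* leftover whole minutes */
    *pSeconds = totalSeconds % 60;            /* leftover seconds */
}

/* Prints totalSeconds as HH:MM:SS with leading zeros. */
void PrintTime(int totalSeconds)
{
    int hours, minutes, seconds;

    SplitSeconds(totalSeconds, &hours, &minutes, &seconds);
    printf("%02d:%02d:%02d", hours, minutes, seconds);
}
```

For example, 3725 seconds is 1 hour, 2 minutes, and 5 seconds, so PrintTime(3725) prints 01:02:05.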
Sample Run
The sample run when using History.dat can be seen here: Sample Output
Submitting the Program
You are to use separate compilation for this project, so you will be
submitting five files.
Your C source code file that contains main() MUST be called proj3.c.
You should also have files called dataMine.c, dataMine.h, util.c,
and util.h.
To submit your project, type the following at the Unix prompt. Note that the
project name starts with uppercase 'P'.
submit cs201 Proj3 proj3.c dataMine.c dataMine.h util.c util.h
To verify that your project was submitted, you can execute the following
command at the Unix prompt. It will show all files that you submitted in a
format similar to the Unix 'ls' command.
submitls cs201 Proj3