CMSC-341 Spring 2006
Project 5
Assigned |
Wednesday May 3, 2006 |
Due |
Tuesday, May 16th, 11:59pm |
Updates |
Monday, May 8th
So that we all get the same results, the use of Horner's rule
for hashing strings (as found in the text and in the class notes) is
required.
|
Background
Despite the fact that Red-Black trees offer guaranteed O(lgN) performance
per operation, there are many applications where the O(1) performance of
hash tables is essential. One such application is query processing by web
search engines like Google and Yahoo. In this project you will use hash
tables to implement a simple, yet efficient, query engine.
Description
Web search engines have response times of approximately one second for
arbitrary queries because they index the entire web. They do this by
maintaining a list of the web pages for every word. Then,
to figure out what pages contain all of the words in a multiword query you
intersect the lists. For example, if the query "project four" is
issued, the index is consulted to find out that "project" occurs in, say,
pages P1, P4, P5, and P9; the index is consulted again to find out that
"four" occurs in, say, pages P4, P7, P9, and P12; and finally the
intersection of these two lists of pages is computed yielding P4 and P9 as
the only ones that contain both "project" and "four".
You will implement a system like the one described above using a hash
table with quadratic probing to store the index.
Here are your tasks:
- Write Proj5.cpp. It will validate command line arguments, read commands
from a command file, print the results of executing commands, and print
statistics about your hash table performance. Your
executable must be named Proj5.
- Copy QuadProbing.h and HashAux.cpp from Mr. Frey's public directory.
Make any necessary changes to these files for this project.
These classes must remain generic templates.
- Design and implement any necessary classes required, using good
Object-Oriented Design/Programming principles.
- Create and submit a makefile for this project. The command
make Proj5 should compile and link all necessary files.
- Answer the questions posed in 341-Spring06-proj5_questions.txt
found in Mr. Frey's public directory for this project.
The Command Line
Project 5 will be invoked with a command line that consists of two
arguments in this order
- The initial hash table size (an integer). If this integer is not
prime, your program should round the table size up to the nearest prime.
- The name of a command file
Your program should validate the command line, ensure that the specified file can be opened, then read and
process each of the commands in the file.
The Command File
The command file format follows.
Note that there can be multiple LOAD and QUERY commands in a single command
file in any order.
- LOAD filename -- Open the named file and insert all of the words it
contains into the index. If the file cannot be opened, output an error
message and continue processing commands.
- QUERY N word1 word2 ... wordN -- Following the word QUERY will be a
positive integer, N, that specifies the number of query terms. This integer is
provided to make it easy to extract the query terms. Following N is a list
of words (the query terms) separated by white space. Read these words and
print a list of the files that contain all of them. If there are no
files that contain all the words, print NONE.
Project Notes, Restrictions, and Miscellaneous Requirements
- Mr. Frey's public directory for this project is
(/afs/umbc.edu/users/d/e/dennis/pub/CMSC341/Proj5).
- As stated above, your project must use quadratic probing.
In addition, when an attempted insertion of a word fails and when
your project terminates
- Print the following statistics
- Hash Table Size
- Number of unique words inserted
- The load factor, lambda (2 decimal places)
- Number of clusters
- Max cluster size
- Average cluster size (2 decimal places)
- Max probes required to successfully insert any word
(output the word and file in which it was contained; not counting the
word that failed).
- Average probes per successful insert (not counting the word that failed; 2 decimal places)
- resize the hash table (twice as big rounded up to the next prime),
rehash all previous words, and then reinsert the word which failed.
Note that probe and cluster data should be cleared after printing,
before rehashing, and be re-accumulated during rehashing.
- For this project, we define an insertion "failure" as a second attempt
to insert a word into the initial probe location. By definition, the initial
probe is always attempted at index h(K, 0).
If a later probe ( h(K, i) with i > 0 ) attempts to
insert at the initial index, then the insertion has failed.
- A cluster is defined as 2 or more consecutive cells which contain
a word. If a cluster "wraps around" the bottom of the table
to the top of the table, count these as two clusters.
- No STL containers may be used for this project. Only those provided
by the author may be used.
Project Output
As usual, this project output is provide for formatting purposes
only. It is not the result of running any executable and so
may be inconsistent with itself.
Your project should echo each command from the command file
and print the result as shown here.
LOAD bob.txt
LOADED bob.txt
LOAD myfile.dat
Error - unable to open myfile.dat
QUERY 3 project five mary
project, five, and mary all appear in files
bob.txt
readme.txt
QUERY 3 bob mary sue
bob, mary, and sue all appear in files
NONE
When an insert fails, your project should print a message followed
by the required statistics.
Inserting "napkin" from file "manners.dat" failed.
Hash Table Stats
----------------
HashTable Size: 12345
Unique Words Inserted: 4567
Load Factor: 6.66
Max Cluster Size: 42
Number of Clusters: 42
Average Cluster Size: 42.42
Max Probes: 22 for "lawn" in file "YardWorkMadeEasy.doc"
Average Probes: 3.66
Rehashing.....
When your project terminates, print the same "Hash Table Stats" as above.
Files To Be Submitted
You should submit all files needed to build your program and the file
containing your answers to the questions.
Submit Tools
There are a number of tools available to you to check on your submittal.
It is your responsibility to ensure that the submittal is correct and will
result in a successful
compilation of your project. Do not wait till the last minute to submit
your files. Give yourself enough time to check that your submittal is correct.
If you don't submit a project correctly, you will not get credit
for it. Why throw away all that hard work you did to write the project?
Check your
submittals. Make sure they work. Do this before the due date.
Documentation for the submit program is on the web at http://www.gl.umbc.edu/submit/.
One of the tools provided by the submit program is
submitls. It lists the names of the files you have submitted.
Additionally, there are two programs for use only by CMSC-341 students
(not part of the UCS submit program). They are in the directory
/afs/umbc.edu/users/o/a/oates/pub/CMSC341/ and are
named submitmake and submitrun. You can
use these programs to
make or run your submitted projects.
The syntax is similar to that for submit:
submitmake <class> <project>
Example: submitmake cs341 Proj5
This makes the project, and shows you the report from the make utility.
It cleans up the directory after making the project (removes .o and ii_files),
but leaves the
executable in place.
submitrun <class> <project> [command-line args]
Example: submitrun cs341 Proj5 checkers checkfile.dat
This runs the project, assuming there is an executable (i.e. submitmake
was run successfully).
Grading and Academic Integrity
Your project will be tested using a variety of command lines, some of which
will be invalid.
Your project will also be tested using a variety of command files which
will test various conditions which your code should handle.
Project grading is described in the Project
Policy handout.
Your answers to 341-Spring06-proj5_questions.txt
are worth 10% of your project grade.
Cheating in any form will not be tolerated. Please re-read the Project
Policy handout for further details on honesty in doing projects for
this course.
Remember, the due date is firm. Submittals made after midnight of the
due date will not be accepted. Do not submit any files after that time.