CMSC-341 Fall 2003
Project 4
Assigned |
5 Nov 2003 |
Due |
24 Nov 2003 |
Background
Despite the fact that Red-Black trees offer guaranteed O(lgN) performance
per operation, there are many applications where the O(1) performance of
hash tables is essential. One such application is query processing by web
search engines like Google and Yahoo. In this project you will use hash
tables to implement a simple, yet efficient, query engine.
Description
Web search engines have response times on the order of a second for
arbitrary queries because they index the entire web. They do this by
maintaining for each word a list of the pages that contain the word. Then,
to figure out what pages contain all of the words in a multiword query you
just intersect the lists. For example, if the query "project four" is
issued, the index is consulted to find out that "project" occurs in, say,
pages P1, P4, P5, and P9; the index is consulted again to find out that
"four" occurs in, say, pages P4, P7, P9, and P12; and finally the
intersection of these two lists of pages is computed yielding P4 and P9 as
the only ones that contain both "project" and "four".
You will implement a system just like the one described above using a hash
table to store the index.
Here are your tasks:
- Obtain these files:
- List.C
- List.H
- ListItr.H
- ListNode.H
- SeparateChaining.C
- SeparateChaining.H
- dsexceptions.H
from this location: /afs/umbc.edu/users/o/a/oates/pub/CMSC341/Proj4/. The
file SeparateChaining.C contains Weiss' implementation of a hash table with
separate chaining collision resolution.
- Create a class named Occurrence that stores information about a word
and the pages in which it occurs.
- Write Proj4.C. It will validate command line arguments, read commands
from a command file, and print the results of executing commands.
-
Answer the questions posed in 341-Fall03-proj4_questions.txt. Copy the
file
/afs/umbc.edu/users/o/a/oates/pub/CMSC341/Proj4/341-Fall03-p4_questions.txt
to your own directory and edit it to provide your answers to the questions.
Don't forget to submit the edited file; it's 10% of this project's
grade.
Below is a detailed discussion of how to implement this project.
The Command Line
Project 4 will be invoked with a command line that consists of one
argument - the name of a command file. Your program should validate the
command line, ensure that the specified file can be opened, then read and
process each of the commands in the file.
The Command File
The command file format is as follows:
- LOAD filename -- Open the named file and insert all of the words it
contains into the index.
- QUERY N word1 word2 ... wordN -- Following the word QUERY will be an
integer, N, that specifies the number of query terms. This integer is
provided to make it easy to extract the query terms. Following N is a list
of words (the query terms) delimited by white space. Read these words and
print a list of the files that contain all of them.
There can be multiple LOAD and QUERY commands in a single command file in
any order.
Implementation Details
Your index will be implemented using a hash table. For each word that
occurs in a file that's loaded with a LOAD command, you must record the
fact that the word occurred in the file. You will do that by storing
objects of type Occurrence in the hash table. That is, your program will
declare a hash table that contains Occurrences:
HashTable < Occurrence > hash(ITEM_NOT_FOUND);
Objects of type Occurrence have two private data members:
- string word - The word that occurred.
- List < string > files - A list of the names of the files in which
the word occurred.
You will need to write the following member methods for this class:
- Occurrence() - A zero argument constructor.
- Occurrence(string _word) - A constructor that specifies the word.
- string getWord() const - A reader for the word data member.
- List < string > getFiles() const - A reader for the files data
member.
- void addFile(const string file) - Adds a name to the file list if the
name does not already occur in the list.
- int operator==(const Occurrence & rhs) const - An equality operator
that returns true if the words are the same.
- int operator!=(const Occurrence & rhs) const - An inequality operator
that returns true if the words are different.
You will also need to write the following non-member method:
- int hash(Occurrence o, int tableSize) - This method takes an
Occurrence and a table size and computes a hash value. This method will be
called by Weiss' hash table code, and should simply call his method
hash(string k, int tableSize) passing the Occurrence's word as the key.
That is, we will hash Occurrences on words.
To process a LOAD command you simply open the file and for each word
contained in the file either (1) insert a new Occurrence if the word is not
in the hash table or (2) add the file to an existing Occurrence if the word
is in the hash table. Note that method addFile must modify an Occurrence,
so you might need to modify the hash table method find to allow this.
To process a QUERY command, you read each of the query terms, look for
them in the hash table, and get the lists of files in which they occur.
The intersection of these lists is a list of files in which all of the
query terms occur. You will probably want to write the following List
method:
- List < Object > intersect(List < Object > const & rhs) const
- Given lists L1 and L2, L1.intersect(L2) returns a list containing only
those elements that are in both lists.
Files To Be Submitted
You should submit all files needed to build your program and the file
containing your answers to the questions.
Submit Tools
There are a number of tools available to you to check on your submittal.
It is your responsibility to ensure that the submittal is correct and will
result in a successful
compilation of your project. Do not wait till the last minute to submit
your files. Give yourself enough time to check that your submittal is correct.
If you don't submit a project correctly, you will not get credit
for it. Why throw away all that hard work you did to write the project?
Check your
submittals. Make sure they work. Do this before the due date.
Documentation for the submit program is on the web at http://www.gl.umbc.edu/submit/.
One of the tools provided by the submit program is
submitls. It lists the names of the files you have submitted.
Additionally, there are two programs for use only by CMSC-341 students
(not part of the UCS submit program). They are in the directory
/afs/umbc.edu/users/o/a/oates/pub/CMSC341/ and are
named submitmake and submitrun. You can
use these programs to
make or run your submitted projects.
The syntax is similar to that for submit:
submitmake <class> <project>
Example: submitmake cs341 Proj4
This makes the project, and shows you the report from the make utility.
It cleans up the directory after making the project (removes .o and ii_files),
but leaves the
executable in place.
submitrun <class> <project> [command-line args]
Example: submitrun cs341Proj4 checkers checkfile.dat
This runs the project, assuming there is an executable (i.e. submitmake
was run successfully).
Grading and Academic Integrity
Your project will be tested using a variety of command lines, some of which
will be invalid.
Your project will also be tested using a variety of command files which
will test various conditions which your code should handle.
Project grading is described in the Project
Policy handout.
Your answers to 341-Fall03-proj4_questions.txt
are worth 10% of your project grade.
Cheating in any form will not be tolerated. Please re-read the Project
Policy handout for further details on honesty in doing projects for
this course.
Remember, the due date is firm. Submittals made after midnight of the
due date will not be accepted. Do not submit any files after that time.