CMSC 341 - Fall 2001

Project 3

Assigned	10 October 2001
Due	24 October 2001
Modifications	see notes and clarifications page

Background

Millions of people use search engines like the ones available at www.google.com and www.yahoo.com to find web pages containing useful information. In this project you will implement the rudiments of a search engine using binary search trees and priority queues.

Take some time for design before you start writing code. This project is conceptually straightforward and can be divided into several independent pieces. Identifying those pieces, determining their interfaces, and figuring out how to connect them all together in a single main routine will save lots of time in the long run. Make sure you understand the implementations of binary search trees and priority queues (i.e. binary heaps) given in the text before proceeding.

As with the other projects, there is a file containing questions that you must answer.

Description

Given a file containing text, such as a web page, it is often useful to be able to determine efficiently (1) whether the file contains each element of a set of query terms and (2) which words are most frequent in the file. The former operation lies at the heart of virtually all search engines, and the latter operation often yields useful information concerning what the document is about. Given a text file, you will use a binary search tree to store all of the unique words contained in the file so that determining whether it also contains a query term is an efficient operation. A word is defined to be any string of characters delimited by white space. In addition to storing unique words, you must store an integer with each word indicating the number of times that it occurred in the document. You should convert all strings to either lower or upper case so that, for example, "Cat" and "cat" both count as occurrences of the same word. However, DO NOT remove punctuation. Consider the following text:

"I can't believe I wrote so much code!"

There are seven unique words - "I", "CAN'T", "BELIEVE", "WROTE", "SO", "MUCH", "CODE!". The word "I" occurred twice and all of the others occurred once. Note that the exclamation point at the end of the sentence is counted as part of the final word, just as the apostrophe is counted as part of the second word.

Once a file has been loaded into a binary tree, your program must be able to answer queries about the file. A query is just a set of words, and the response to a query is the sum over all of the words in the query of the number of times they occur. For example, given the above text, the correct response to the query "HELLO" is 0 because the word "HELLO" occurs 0 times in the text. The correct response to the query "HELLO I" is 2 because "I" occurs twice, and the correct response to the query "I SO CODE" is 4 because "I" occurs twice and both "SO" and "CODE" occur once.

Finally, given a binary tree containing unique words and their occurrence counts, you must be able to determine efficiently the M most frequent words in the file from which the tree was constructed. This is to be accomplished with a priority queue. Although the binary search tree was keyed on strings representing words, the priority queue will be keyed on integers representing occurrence counts.

The Command Line

Your program will take a single command line argument which is the name of a command file. You must check to ensure that the command line is valid both in terms of the number of arguments and that the argument specifies the name of an existing file. The main loop of your program will read and process commands from this file.

The Command File

The command file will contain commands of the following form, each on a separate line:

LOAD filename
QUERY word [word ...]
FREQUENT number

Please echo the commands as they are read from the file. The command file may contain blank lines. You can assume that each line that is not blank will contain a single syntactically valid command.

The LOAD command specifies the name of a text file. After ensuring that the file can be opened, read its contents one word at a time and construct a binary tree as described above. That is, if the word is not in the tree it should be added with an occurrence count of one. If it is in the tree, it's occurrence count should be incremented. Note that it is possible to have multiple LOAD commands in a single command file. All other commands, i.e. QUERY and FREQUENT, apply to the most recently LOADed file. For all LOAD commands after the first be sure to call the destructor for the binary search tree constructed for the previous LOAD. That is, don't leak memory.

Following the word QUERY will be a list of one or more strings delimited by white space. Your program must count the number of times each of these strings occurs in the file that was last LOADed and output the sum of these counts.

The FREQUENT command asks you to print a list (in any order) of the MOST frequent words, along with their occurrence counts, in the file. The number of words to print is specified as an argument of the command. Again, this must be accomplished with a priority queue. Given a document containing N unique words, your algorithm for finding the M most frequent words must run in O(N + M log(M)) time in the average case. Although the binary search tree was keyed on strings representing unique words, the priority queue will be keyed on integers representing occurrence counts. Given the example text file above, the command "FREQUENT 3" might result in the following output:

WROTE (1)
I (2)
CODE (1)

Note that the output is not sorted in any special order, though it can be if you like. Also, only 3 word/count pairs were printed despite the fact that there are additional words with a frequency of one. You should stop printing words when the limit specified for the FREQUENT command is reached, even if there are additional words that have the same frequency as the last word printed. (This is meant to make your life easier.)

To find the M most frequent words in a file containing N unique words in O(N + M log(M)) time you will use a BinarySearchTreeItr iterator (described below) to walk over the nodes in the binary search tree built from the file. While walking over the tree add nodes to a min-heap in such a way that you never have more than M nodes in the heap. You can assume that M <= N.

Classes

You can use the author's string class to store words read from files. This class has an operator>> defined so it can be used to read words from files as well. You will need string.C and string.H which can be found in Mr. Frey's public directory:

/afs/umbc.edu/users/d/e/dennis/pub/CMSC341/

You must use the author's BinarySearchTree and BinaryHeap classes. You'll need to get BinarySearchTree.C and BinarySearchTree.H for the former, and BinaryHeap.C and BinaryHeap.H for the latter. Again, these files can be found in Mr. Frey's public directory. You will also need the file dsexceptions.h which contains a few exception classes for the BinaryHeap class.

A few modifications were made to the author's BinarySearchTree class. First, the member functions find, findMin and findMax were modified to return references to the found object rather than a constant reference. This allows you to find a node keyed on a word and, if it is found, update the occurrence count. Second, an iterator class named BinarySearchTreeItr was added. It operates just like iterators that you have used with linked lists. In particular, it supports the following operations:

first() - used to create an iterator for an existing tree that points to the first node in the tree (in some order)
isPastEnd() - returns true when the iterator has stepped through all of the nodes in the tree
advance() - steps the iterator to the next node
retrieve() - returns a reference (not constant) to the data stored in the node to which the iterator currently points

The code for this iterator is in BinarySearchTree.H and BinarySearchTree.C.

Finally, you will need to write two classes, both of which hold a string and an integer. One of these classes must have comparison operators defined that compare the string member. For this class the string member MUST BE CONST. This class will be used with the binary search tree. The second class must have comparison operators defined that compare the integer member. For this class the integer member MUST BE CONST. This class will be used with the priority queue.

All classes must support the "big four":

default constructor
destructor
copy constructor
assignment operator

If the BinarySearchTree and BinaryHeap classes do not provide these operations, you must write them. You must also provide them for any and all classes that you design and implement, even if the default behavior provided by the compiler would be acceptable.

Tasks

Here are your tasks:

Read and understand the author's implementation of binary search trees and binary heaps, as well as the iterator defined for binary search trees.
Design and implement a class to hold string/integer pairs to be stored in a binary search tree.
Design and implement a class to hold string/integer pairs to be stored in a binary heap.
Write code to read the words in a file and store them in a binary search tree keyed on the word, updating occurrence counts for those words that occur more than once in the file.
Write code that takes a binary tree containing word/frequency pairs and uses a priority queue to identify the M most frequent words in the file from which the tree was built.
Use the data structures and routines mentioned above to write a main loop that processes commands found in the command file.
Answer the questions posed in 341-Fall01-proj3_questions.txt. Copy the file
```
/afs/umbc.edu/users/d/e/dennis/pub/CMSC341/Proj3/341-Fall01-p3_questions.txt
```
to your own directory and edit it to provide your answers to the questions. Don't forget to submit the edited file; it is 10% of this project's grade.

Project Submission

DO submit these files

Since you are submitting your own makefile, you are free to name your files whatever you like, with the following restrictions:

All class definitions must appear in .H files. Header files may not contain any code unless that code was put there by the author.
All class implementations appear in .C files.
The name of your executable MUST BE Proj3 (upper-case 'P'). The submitrun program and the scripts that we use to grade your files assume that you follow this convention for naming your executable.

DO NOT submit these files

Do not submit dsexceptions.h, string.C, string.H or any of the binary search tree or binary heap files. These files should be included in your project through your makefile. You should not need to modify them at all.

Submit Tools

There are a number of tools available to you to check on your submittal. It is your responsibility to ensure that the submittal is correct and will result in a successful compilation of your project. Do not wait till the last minute to submit your files. Give yourself enough time to check that your submittal is correct.

If you don't submit a project correctly, you will not get credit for it. Why throw away all that hard work you did to write the project? Check your submittals. Make sure they work. Do this before the due date.

Documentation for the submit program is on the web at http://www.gl.umbc.edu/submit/. One of the tools provided by the submit program is submitls. It lists the names of the files you have submitted.

Additionally, there are two programs for use only by CMSC-341 students (not part of the UCS submit program). They are in the directory
/afs/umbc.edu/users/d/e/dennis/pub/CMSC341/ and are named submitmake and submitrun. You can use these programs to make or run your submitted projects.

The syntax is similar to that for submit:

submitmake <class> <project>

Example: submitmake cs341 Proj3

This makes the project, and shows you the report from the make utility. It cleans up the directory after making the project (removes .o files), but leaves the executable in place.

submitrun <class> <project> [command-line args]

Example: submitrun cs341 Proj3 test.dat

This runs the project, assuming there is an executable (i.e. submitmake was run successfully).

Grading and Academic Integrity

Your project will be tested using a variety of command lines, some of which will be invalid.
Your project will also be tested using a variety of command files which will test various conditions which your code should handle.

Project grading is described in the Project Policy handout.

Your answers to 341-Fall01-proj3_questions.txt are worth 10% of your project grade.

Cheating in any form will not be tolerated. Please re-read the Project Policy handout for further details on honesty in doing projects for this course.

Remember, the due date is firm. Submittals made after midnight of the due date will not be accepted. Do not submit any files after that time.

Last modified on Wednesday October 10, 2001 by Tim Oates
email: oates@cs.umbc.edu