Assigned | 10 October 2001 | |
Due | 24 October 2001 | |
Modifications | see notes and clarifications page |
Take some time for design before you start writing code. This project is conceptually straightforward and can be divided into several independent pieces. Identifying those pieces, determining their interfaces, and figuring out how to connect them all together in a single main routine will save lots of time in the long run. Make sure you understand the implementations of binary search trees and priority queues (i.e. binary heaps) given in the text before proceeding.
As with the other projects, there is a file containing questions that you must answer.
"I can't believe I wrote so much code!"
There are seven unique words - "I", "CAN'T", "BELIEVE", "WROTE", "SO", "MUCH", "CODE!". The word "I" occurred twice and all of the others occurred once. Note that the exclamation point at the end of the sentence is counted as part of the final word, just as the apostrophe is counted as part of the second word.
Once a file has been loaded into a binary tree, your program must be able to answer queries about the file. A query is just a set of words, and the response to a query is the sum over all of the words in the query of the number of times they occur. For example, given the above text, the correct response to the query "HELLO" is 0 because the word "HELLO" occurs 0 times in the text. The correct response to the query "HELLO I" is 2 because "I" occurs twice, and the correct response to the query "I SO CODE" is 4 because "I" occurs twice and both "SO" and "CODE" occur once.
Finally, given a binary tree containing unique words and their occurrence counts, you must be able to determine efficiently the M most frequent words in the file from which the tree was constructed. This is to be accomplished with a priority queue. Although the binary search tree was keyed on strings representing words, the priority queue will be keyed on integers representing occurrence counts.
The LOAD command specifies the name of a text file. After ensuring that the file can be opened, read its contents one word at a time and construct a binary tree as described above. That is, if the word is not in the tree it should be added with an occurrence count of one. If it is in the tree, it's occurrence count should be incremented. Note that it is possible to have multiple LOAD commands in a single command file. All other commands, i.e. QUERY and FREQUENT, apply to the most recently LOADed file. For all LOAD commands after the first be sure to call the destructor for the binary search tree constructed for the previous LOAD. That is, don't leak memory.
Following the word QUERY will be a list of one or more strings delimited by white space. Your program must count the number of times each of these strings occurs in the file that was last LOADed and output the sum of these counts.
The FREQUENT command asks you to print a list (in any order) of the MOST frequent words, along with their occurrence counts, in the file. The number of words to print is specified as an argument of the command. Again, this must be accomplished with a priority queue. Given a document containing N unique words, your algorithm for finding the M most frequent words must run in O(N + M log(M)) time in the average case. Although the binary search tree was keyed on strings representing unique words, the priority queue will be keyed on integers representing occurrence counts. Given the example text file above, the command "FREQUENT 3" might result in the following output:
To find the M most frequent words in a file containing N unique words in O(N + M log(M)) time you will use a BinarySearchTreeItr iterator (described below) to walk over the nodes in the binary search tree built from the file. While walking over the tree add nodes to a min-heap in such a way that you never have more than M nodes in the heap. You can assume that M <= N.
/afs/umbc.edu/users/d/e/dennis/pub/CMSC341/
You must use the author's BinarySearchTree and BinaryHeap classes. You'll need to get BinarySearchTree.C and BinarySearchTree.H for the former, and BinaryHeap.C and BinaryHeap.H for the latter. Again, these files can be found in Mr. Frey's public directory. You will also need the file dsexceptions.h which contains a few exception classes for the BinaryHeap class.
A few modifications were made to the author's BinarySearchTree class. First, the member functions find, findMin and findMax were modified to return references to the found object rather than a constant reference. This allows you to find a node keyed on a word and, if it is found, update the occurrence count. Second, an iterator class named BinarySearchTreeItr was added. It operates just like iterators that you have used with linked lists. In particular, it supports the following operations:
Finally, you will need to write two classes, both of which hold a string and an integer. One of these classes must have comparison operators defined that compare the string member. For this class the string member MUST BE CONST. This class will be used with the binary search tree. The second class must have comparison operators defined that compare the integer member. For this class the integer member MUST BE CONST. This class will be used with the priority queue.
All classes must support the "big four":
/afs/umbc.edu/users/d/e/dennis/pub/CMSC341/Proj3/341-Fall01-p3_questions.txtto your own directory and edit it to provide your answers to the questions. Don't forget to submit the edited file; it is 10% of this project's grade.
If you don't submit a project correctly, you will not get credit for it. Why throw away all that hard work you did to write the project? Check your submittals. Make sure they work. Do this before the due date.
Documentation for the submit program is on the web at http://www.gl.umbc.edu/submit/. One of the tools provided by the submit program is submitls. It lists the names of the files you have submitted.
Additionally, there are two programs for use only by CMSC-341 students
(not part of the UCS submit program). They are in the directory
/afs/umbc.edu/users/d/e/dennis/pub/CMSC341/ and are
named submitmake and submitrun. You can
use these programs to make or run your submitted projects.
The syntax is similar to that for submit:
submitmake <class> <project>
Example: submitmake cs341 Proj3
This makes the project, and shows you the report from the make utility. It cleans up the directory after making the project (removes .o files), but leaves the executable in place.
submitrun <class> <project> [command-line args]
Example: submitrun cs341 Proj3 test.dat
This runs the project, assuming there is an executable (i.e. submitmake
was run successfully).
Project grading is described in the Project Policy handout.
Your answers to 341-Fall01-proj3_questions.txt are worth 10% of your project grade.
Cheating in any form will not be tolerated. Please re-read the Project Policy handout for further details on honesty in doing projects for this course.
Remember, the due date is firm. Submittals made after midnight of the
due date will not be accepted. Do not submit any files after that time.