CMSC-341 Fall 2003

Project 3

Assigned	15 Oct 2003
Due	28 Oct 2003

Background

This project will give you experience working with trees in the context of a spelling checker application. Rather than using a standard binary tree, you will implement a tree class using the "First Child Next Sibling" (or FCNS) representation. Your program will populate an FCNS tree with the words in a dictionary, and use the tree to check the spelling of the words found in a text file.

Description

In a standard binary tree, nodes contain a data element, a pointer to the node's left child, and a pointer to the node's right child. In an FCNS tree, each node contains a data element, a pointer to the node's first child, and a pointer to the node's next sibling. This makes it possible for nodes to have an arbitrary number of children. You will use an FCNS tree to store a dictionary of words. Rather than storing one complete word per node, the FCNS tree will store one character per node so that any path from the root to a node in the tree spells out a word or the prefix of a word. This data structure is called a Trie (for reTRIEval). Here are your tasks:

If you want to use the author's BST code as a starting point, you can. If you want to write the tree code from scratch, that's OK as well. In the former case, obtain the files BinarySearchTree.H, BinarySearchTree.C, and dsexceptions.H from the following location:

/afs/umbc.edu/users/o/a/oates/pub/CMSC341/Proj3/

Implement an FCNS tree.

Implement a Trie using the FCNS tree.

Write Proj3.C. It will validate command line arguments, read a dictionary from a file, and check the spelling of words in a second file.

Answer the questions posed in 341-Fall03-proj3_questions.txt. Copy the file

/afs/umbc.edu/users/o/a/oates/pub/CMSC341/Proj3/341-Fall03-p3_questions.txt

to your own directory and edit it to provide your answers to the questions. Don't forget to submit the edited file; it's 10% of this project's grade.

Definition of the ADT

You will implement an FCNS tree class that can hold any kind of object. You will then implement a Trie class that has an FCNS tree as a private data member. These two classes cannot be friends. As will become clear in a moment, the Trie must store two things at each tree node, a value and a flag that tells whether the node is at the end of a word. You will therefore need to define a class used by the Trie that contains the object stored in the trie node and the flag. It is these objects that will be inserted into the FCNS tree by the Trie methods.

There are two concepts that must be grasped before this project can be completed. The first is the nature of the FCNS representation of a tree. The second is the structure of a Trie. In an FCNS tree, each node contains a pointer to its first child and a pointer to its next sibling. Given the binary search tree below using the standard left/right child pointers:

It would be represented as an FCNS tree as follows:

Note the a node can have any number of children. For example, in the tree below, node 10 has children 5, 20, 30, and 40:

             10
            /  
           /    
          5-----20-----30-----40
         / 
        /   
       3-----7

A Trie is typically used to store strings, which are sequences of characters. Your implementation will be templated to store sequences of OBJECTs. Tries differ from BSTs in the way the strings are stored. A BST stores one string per node, whereas a Trie stores one character per node. An example might clarify things. Suppose you want to store the strings "bag", "ban", "bat", "bagel", "an", and "ant" in a Trie. It would look like this using an FCNS tree:

                           ROOT
                          /    
                         /      
                        A-----B
                       /     /
                      /     /
                    *N     A
                    /     /
                   /     /
                 *T    *G-----*N-----*T
                       /
                      / 
                     E
                    /
                   /
                 *L

Every path from the root of this tree to a node corresponds to a word or a prefix of a word. For example

A->N->T

corresponds to the end of a word, but

B->A->G->E

does not. It corresponds to a prefix of a word, i.e. "bagel". Nodes that correspond to ends of words are marked with a *. Note that the root does not contain any data, and it may have as many children as there are letters in the alphabet. Likewise, every other node may have as many children as there are letters in the alphabet. Your Trie must use an FCNS representation and it must support (at least) the following two public methods:

insert(OBJECT *item, int n) - This method takes an array of OBJECTs and the number of OBJECTs in the array, and inserts them into the Trie. The following code fragment demonstrates the use of this method:
```
  Trie < char > trie;
  char *word = "test";
  trie.insert(word, 4);
```
In the code above, the Trie contains characters in its nodes, and after the insert method returns the Trie will contain 5 nodes, one for the root and one for each character in the word "test". A Trie that contains sequences of integers rather than characters might be used like this:
```
  Trie < int > trie;
  int nums[] = {1, 2, 3, 4};
  trie.insert(nums, 4);
```
find(OBJECT *item, int n) - This method takes the same arguments as insert, but should return some indication (such as a boolean) as to whether the sequence was found in the Trie. For example, given the code fragment above that inserted "test" into the Trie, find("test",4) should return TRUE, whereas find("foo", 3) and find("tes", 3) should both return FALSE. In the latter case, "tes" exists in the Trie but that sequence is not the end of a word.

In the code above, if you store objects of type char in the FCNS tree maintained by the Trie, there is no way to know if any given path from the root ends a word. Therefore, you will need to define a class that stores both an OBJECT (i.e. the type specified in the Trie template parameter) and an end of word marker. Note that instances of this class must be comparable based on the OBJECTs they store. For example, you might have a class that looks partly like this:

 
class Wrapper < OBJECT > {
private: 
  OBJECT x; 
  boolean end_of_word; 
}

When a sequence is inserted into the Trie, it will declare a sequence of Wrapper objects for which the end_of_word marker will be false for all but the last one.

The return types and other aspects of these methods, such as whether they and/or their arguments should be const, is left up to you. As usual, you can create any private data members or methods you need. You can write any public methods for which there is a good justification for why they are not private. When trying to determine if a method should be public, ask yourself whether it is reasonable for a programmer who is using your code as a package to call that method. Does it reveal or make accessible too much of the internals of the implementation? Fell free to submit a README file with your project to justify any public methods that you think need it. Any public methods in the author's BST code that you don't need do not need to be implemented.

The Command Line

Project 3 will be invoked with a command line that consists of two arguments. The first argument specifies the name of a dictionary file. That file will contain words separated by white space. Each word should be inserted into the Trie. There may be zero, one, or more words per line. The second argument will be the name of a file that contains some text that you are supposed to spell check. It will have the same format as the dictionary file - zero, one, or more words per line with multiple words separated by white space. Read each word from the file and check to see if it exists in the Trie. If it does, move on to the next word. If not, print the word. That is, the output of your program will be a list, one per line, of words in the file that are not in the trie (i.e. that are mis-spelled).

Search in the Trie should be case insensitive. That is, if the text file contains "What" and "what", and the word "what" is in the dictionary, then both forms should be recognized as being spelled correctly. There will be no characters in the files other than upper and lower case letters and white space.

Note that you must check command line arguments to ensure that they are valid, e.g. that the command file can be opened, and print an appropriate message if an error is found.

The Command File

There is no command file for this project. Also, there is no sample output. It should be clear what is required from the description above.

Files To Be Submitted

You should submit all files needed to build your program and the file containing your answers to the questions.

Submit Tools

There are a number of tools available to you to check on your submittal. It is your responsibility to ensure that the submittal is correct and will result in a successful
compilation of your project. Do not wait till the last minute to submit your files. Give yourself enough time to check that your submittal is correct.

If you don't submit a project correctly, you will not get credit for it. Why throw away all that hard work you did to write the project? Check your
submittals. Make sure they work. Do this before the due date.

Documentation for the submit program is on the web at http://www.gl.umbc.edu/submit/. One of the tools provided by the submit program is
submitls. It lists the names of the files you have submitted.

Additionally, there are two programs for use only by CMSC-341 students (not part of the UCS submit program). They are in the directory
/afs/umbc.edu/users/o/a/oates/pub/CMSC341/ and are named submitmake and submitrun. You can use these programs to
make or run your submitted projects.

The syntax is similar to that for submit:

submitmake <class> <project>

Example: submitmake cs341 Proj3

This makes the project, and shows you the report from the make utility. It cleans up the directory after making the project (removes .o and ii_files), but leaves the
executable in place.

submitrun <class> <project> [command-line args]

Example: submitrun cs341Proj3 checkers checkfile.dat

This runs the project, assuming there is an executable (i.e. submitmake was run successfully).

Grading and Academic Integrity

Your project will be tested using a variety of command lines, some of which will be invalid.
Your project will also be tested using a variety of command files which will test various conditions which your code should handle.

Project grading is described in the Project Policy handout.

Your answers to 341-Fall03-proj3_questions.txt are worth 10% of your project grade.

Cheating in any form will not be tolerated. Please re-read the Project Policy handout for further details on honesty in doing projects for this course.

Remember, the due date is firm. Submittals made after midnight of the due date will not be accepted. Do not submit any files after that time.