CMSC 341 Fall 2008
Project 3

Assigned	Wednesday, Oct 15, 2008
Due	Thursday, Oct 30, 2008 at 11:59 PM
Updates

Background

There are many computational tasks where it is important to know how frequently sequences of words or characters occur in text. For example, in speech recognition (automatically turning spoken utterances into text) the sentence "sodium nitrate is a preservative" can be difficult because "nitrate" sounds a lot like "night rate". But if you have a large corpus of text and know that "night" almost never follows "sodium", you can rule out "night rate" and go with "nitrate". This example may seem silly - who would ever confuse "nitrate" with "night rate"? Sadly, this is a very real problem with all modern speech recognition systems.

Another application near and dear to the hearts of Computer Science teachers and students is figuring out whether students have copied code from each other. For example, if you see two project submissions that both have this code, you can be pretty sure they are cheating:

  for (int abc123 = 0; abc123 < 100 - 10; abc123 = abc123 + 2 - 1) 
    System.out.println(abc123);

Why? Because those two are probably the only projects that contain the strings "abc123" and "< 100 - 10" and "+ 2 - 1".

In this project, you will implement a program that checks for cheating using a data structure called a trie.

Tries

A trie (pronounced "try") is a tree useful for representing sets of strings. Each node in a trie represents a string whose length is equal to the depth of the node. You will build depth-limited tries for text files, and compare the tries to compute the similarity of the text files. Here's how you will do that.

Consider the following input text:

  'an ant sat and ate'

Suppose the depth bound is 3. To build the trie you'll take a window of width 3 and pass it over the input text file, yielding the following substrings:

  'an '
  'n a'
  ' an'
  'ant'
  'nt '
  't s'
  ' sa'
  'sat'
  'at '
  't a'
  ' an'
  'and'
  'nd '
  'd a'
  ' at'
  'ate'
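
The windowing step above can be sketched in a few lines of Java. This is only an illustration: the class name WindowDemo and the method windows are made up for this example and are not part of the required design.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowDemo {
    // Hypothetical helper: returns every substring of length `width`,
    // in the order the window passes over the text.
    static List<String> windows(String text, int width) {
        List<String> result = new ArrayList<String>();
        for (int start = 0; start + width <= text.length(); start++) {
            result.add(text.substring(start, start + width));
        }
        return result;
    }

    public static void main(String[] args) {
        // The 18-character example text yields 16 windows of width 3.
        for (String w : windows("an ant sat and ate", 3)) {
            System.out.println("'" + w + "'");
        }
    }
}
```

Note that the characters are used exactly as read, including spaces.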

You then build a tree such that there is a path from the root for each of these substrings, and keep counts for each node of the number of substrings that have passed through it. The final trie looks like this:

Each trie node contains a character and a count. There are 16 substrings of length 3, so the root was visited 16 times. The root node is the only node for which the character data member is ignored. Of the 16 substrings, 5 started with the letter 'a', so the (leftmost) child of the root labeled 'a' has a count of 5. The 5 substrings that started with 'a' are:

  'an '
  'ant'
  'at '
  'and'
  'ate'

The initial 'a' is followed by either an 'n' (in 'an ', 'ant', and 'and') or a 't' (in 'at ' and 'ate'), so that node has two children labeled 'n' and 't'. Note that substrings that share a common prefix (as 'and' and 'ant' share the two-character prefix 'an') share tree structure.
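
To make the counting concrete, here is a minimal sketch of a counting trie, under some assumptions of my own: the class name CountingTrieSketch and the helpers chars and build are hypothetical, and children are kept in a map keyed by the data item rather than storing the item in the node itself. It is one possible shape, not the required one, and it omits the iterator.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CountingTrieSketch<T> {
    // One node per distinct prefix; count = number of sequences that passed through.
    private class Node {
        int count = 0;
        Map<T, Node> children = new LinkedHashMap<T, Node>();
    }

    private final Node root = new Node();

    // Follows (and extends) the path for the sequence, incrementing each node visited.
    public void addSequence(List<T> sequence) {
        Node node = root;
        node.count++;
        for (T item : sequence) {
            Node child = node.children.get(item);
            if (child == null) {
                child = new Node();
                node.children.put(item, child);
            }
            child.count++;
            node = child;
        }
    }

    // Returns the count at the end of the path, or 0 if the path does not exist.
    public int getCount(List<T> sequence) {
        Node node = root;
        for (T item : sequence) {
            node = node.children.get(item);
            if (node == null) return 0;
        }
        return node.count;
    }

    // Hypothetical helper: turns a String into a list of Characters.
    static List<Character> chars(String s) {
        List<Character> list = new ArrayList<Character>();
        for (char c : s.toCharArray()) list.add(c);
        return list;
    }

    // Hypothetical helper: adds every window of width `depth` to a new trie.
    static CountingTrieSketch<Character> build(String text, int depth) {
        CountingTrieSketch<Character> trie = new CountingTrieSketch<Character>();
        for (int i = 0; i + depth <= text.length(); i++) {
            trie.addSequence(chars(text.substring(i, i + depth)));
        }
        return trie;
    }

    public static void main(String[] args) {
        CountingTrieSketch<Character> trie = build("an ant sat and ate", 3);
        System.out.println(trie.getCount(chars("")));   // root: 16
        System.out.println(trie.getCount(chars("a")));  // 5
        System.out.println(trie.getCount(chars("an"))); // 3
        System.out.println(trie.getCount(chars("at"))); // 2
    }
}
```

The counts printed for the root, 'a', 'an', and 'at' match the numbers worked out above.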

Input and Output

Your program will be run with four command line arguments. The first will be a positive integer, D, which is the maximum depth of the trie. The second will be a positive integer, N, which is the number of substrings to print. The third and fourth will be names of files containing text. Build one trie for each file. Do not modify the characters you read in any way. Do not remove white space or convert characters to upper or lower case. Then, for the N most frequent strings in the first trie at depth D, print the string, the number of times it occurs in the first text file, and the number of times it occurs in the second text file. This latter number must be obtained by searching the second trie.

Suppose your program is run with this command line:

  java Proj3 3 2 file1.txt file2.txt

Suppose the 2 most frequent 3-letter sequences in file1.txt are "and" and "the", and that they occurred 57 and 45 times, respectively. Furthermore, suppose those sequences occurred in file2.txt 100 and 94 times, respectively. Your output might look like this:

Sequence     # in file1.txt     # in file2.txt
--------     --------------     --------------
and          57                 100
the          45                 94

The list of strings should be sorted in descending order according to their frequency in the first text file. If there are strings that have the same frequency as the Nth string in the list, print all of them along with their frequencies in both files. In the example above, if the string "was" occurred 45 times in file1.txt, but the next most frequent string occurred 39 times, the output should look like this (assuming "was" occurred 52 times in file2.txt):

Sequence     # in file1.txt     # in file2.txt
--------     --------------     --------------
and          57                 100
the          45                 94
was          45                 52
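
The tie rule can be sketched as follows. This is an illustration only: it assumes the depth-D strings and their file1 counts have already been collected into a map (how you collect them from your trie is up to you), it assumes N is at least 1, and the names TopNWithTies and topN are made up. The string "for" with count 39 is a hypothetical stand-in for "the next most frequent string".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TopNWithTies {
    // Returns the n most frequent entries, plus any entries tied with the nth one.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        // Sort by count, largest first.
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue().compareTo(a.getValue());
            }
        });
        if (entries.size() <= n) return entries;
        // Extend the cut past position n while the count equals the nth count.
        int cutoff = entries.get(n - 1).getValue();
        int end = n;
        while (end < entries.size() && entries.get(end).getValue() == cutoff) {
            end++;
        }
        return new ArrayList<Map.Entry<String, Integer>>(entries.subList(0, end));
    }

    public static void main(String[] args) {
        Map<String, Integer> file1 = new LinkedHashMap<String, Integer>();
        file1.put("and", 57);
        file1.put("the", 45);
        file1.put("was", 45);
        file1.put("for", 39); // hypothetical next most frequent string
        // With N = 2, "was" is included because it ties with "the".
        for (Map.Entry<String, Integer> e : topN(file1, 2)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

In the real program the second column would come from the first trie and the third column from a search of the second trie.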

Implementation

You have lots of latitude in how you implement this project. A few things are required, which are listed below. After that are some hints and suggestions that you are free to use or ignore.

Requirements

Your Trie class must be generic. That is, rather than storing data items of type char in each node, it must store data items of any type where the type is specified as a compile-time parameter. For example:
```
  public class Trie< T > { ... }
```
Note that it must be possible to test objects of type T for equality (e.g., via the equals method).
The Trie class must implement the Iterable interface, which means that the class must implement the iterator() method, which returns an iterator over the trie.
You must implement a TrieNode class that is a private inner class of the Trie class. Note that TrieNode will need to be generic as well to store data items of the type specified for the enclosing trie.

Because your Trie class is generic and can store information about the frequency of sequences of arbitrary objects, you will need Trie methods that take vectors or arrays or lists (or whatever seems appropriate to you) of such objects. For example:

  Trie< Character > trie = new Trie< Character >();
  String text = new String("an ant sat and ate");
  Vector< Character > sequence = new Vector< Character >();

  for (int i = 0; i < 4; i++)
    sequence.add(text.charAt(i));

  trie.addSequence(sequence);

Hints

You can store counts with TrieNode objects, along with the data item, and increment the count when a new sequence is added to the trie that passes through the node.
You will need some data structure inside the TrieNode, such as a vector or list, to store the children of that node.
You do not have to implement the remove method of the Iterator interface. Rather, your code can look like this:
```
  public void remove() {
    throw new UnsupportedOperationException();
  }
```
Your iterator will probably need some sort of internal data structure (e.g., a stack or queue or list) like the example tree iterators in the lecture notes.
Your iterator can be a public inner class of the Trie class so that it has access to the internals of both tries and trie nodes. It just needs to implement the Iterator interface. It can implement methods other than those in the interface. For example, I found it useful to have a count() method that returns the count associated with the node to which the iterator is pointing.
You might find it useful to implement a Trie method such as public int getCount(Vector< T > sequence) that returns the count associated with a particular sequence.
In general, implement whatever methods you see fit as long as there is a clear need for them and they adhere to good OOP principles.
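
As an illustration of the iterator hints, here is a minimal sketch of a stack-based pre-order iterator over a generic tree of shared prefixes. Everything in it is an assumption made for the example: the names IterableTreeSketch, PreorderIterator, and demo are hypothetical, counts are left out to keep it short, and pre-order is just one possible traversal order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class IterableTreeSketch<T> implements Iterable<T> {
    private class Node {
        final T data;
        final List<Node> children = new ArrayList<Node>();
        Node(T data) { this.data = data; }
    }

    private final Node root = new Node(null); // the root's data is ignored

    // Adds the path for the sequence, reusing nodes for shared prefixes.
    public void addSequence(List<T> sequence) {
        Node node = root;
        for (T item : sequence) {
            Node next = null;
            for (Node child : node.children) {
                if (child.data.equals(item)) { next = child; break; }
            }
            if (next == null) {
                next = new Node(item);
                node.children.add(next);
            }
            node = next;
        }
    }

    public Iterator<T> iterator() {
        return new PreorderIterator();
    }

    // Visits every non-root node in pre-order, using an explicit stack.
    public class PreorderIterator implements Iterator<T> {
        private final Deque<Node> stack = new ArrayDeque<Node>();

        PreorderIterator() { pushChildren(root); }

        // Push in reverse so the leftmost child is popped first.
        private void pushChildren(Node node) {
            for (int i = node.children.size() - 1; i >= 0; i--) {
                stack.push(node.children.get(i));
            }
        }

        public boolean hasNext() { return !stack.isEmpty(); }

        public T next() {
            if (stack.isEmpty()) throw new NoSuchElementException();
            Node node = stack.pop();
            pushChildren(node);
            return node.data;
        }

        public void remove() { throw new UnsupportedOperationException(); }
    }

    // Hypothetical demo: adds "ant", "and", "ate" and joins the visited items.
    static String demo() {
        IterableTreeSketch<Character> tree = new IterableTreeSketch<Character>();
        for (String w : new String[] {"ant", "and", "ate"}) {
            List<Character> seq = new ArrayList<Character>();
            for (char c : w.toCharArray()) seq.add(c);
            tree.addSequence(seq);
        }
        StringBuilder sb = new StringBuilder();
        for (Character c : tree) sb.append(c);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "antdte"
    }
}
```

Because the iterator is an inner class, it can reach the private Node class directly; a count() method like the one mentioned above could be added by remembering the node most recently returned by next().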

Questions

Write brief (one or two paragraphs should suffice) answers to the following questions and submit them through CVS in a file named questions.txt. Your answers will be worth 10 points (out of 100) on your project grade.

Given the N most frequent sequences in one trie, you have to find the counts of those sequences in another trie. What is the complexity of that operation (in big-O terms) for your implementation as a function of D (the length of the sequences), N, and S, the number of distinct objects of the type stored in the trie (e.g., letters in the alphabet), and any other relevant features of the problem? Support your answer.
For project 2 this year, every submission probably contained the 4-character sequence "Maze" many times. But that does not mean the students cheated. What you really want to find are strings shared by a pair of submissions that are rare in all of the other submissions. Explain how you could use tries to process a set of text files (rather than just a pair) and find such strings. There is no one right answer here.

Submission

You must use CVS to check out your module in the Proj3 repository and to check in your code. That must include an ant build.xml file and javadoc. The grading scripts will issue commands like the following, so be sure that your build.xml supports them:

  ant
  ant doc
  ant -Dargs="3 2 file1.txt file2.txt" run

See the projects page for more information on all of these topics.

If you don't submit a project correctly, you will not get credit for it. Why throw away all that hard work you did to write the project? Check your submittals. Make sure they work. Do this before the due date.

Project grading is described in the Project Policy handout.
Cheating in any form will not be tolerated. Please re-read the Project Policy handout for further details on honesty in doing projects for this course.

Remember, the due date is firm. Submittals made after midnight of the due date will not be accepted. Do not submit any files after that time.