Project 3
Assigned 8 March 1999
Due 2 April 1999
Background
Section 7.6 of the Heileman text describes an application that produces a rough
index for a text document. In this project, you will build that application
using the list classes from the last project. In addition, you will make
some enhancements to the indexing application:
- Notice that in the sample output shown on pp. 214-215 of the text, some
of the page listings contains duplicate entries. A page number should
only appear once in the page list for a given word.
- In the implementation shown in the text, the only words that will appear in
LeaveOut
are those that occur too often.
Modify the Index()
function so that it accepts an input file
containing a list of words. The words in this file should be added to
LeaveOut
prior to processing the input document. The name of
this file will be passed as a command line argument.
- Modify the
Index()
function so that it writes Master
to a disk file rather than to the screen. The name of the output file will
be passed as a command line argument.
- For simplicity, the implementation in the text considers only the first five
characters in a word. You are to remove this limitation, so that all unique
words are indexed.
- The text imposes a 40-character limit on the length of words. You are to
remove this restriction, so that words of any length can be indexed.
Description
Start with the classes Bnode
,
BinaryTree
, and BST
from the textbook web page and make the following modifications:
- The classes should conform to the course requirements on style, naming conventions,
and documentation.
- The
Bnode
class should be in files Bnode.H
and Bnode.C
.
The BinaryTree
class should be in files
BinaryTree.H
and BinaryTree.C
. The
BST
class should be in files BST.H
and
BST.C
.
- Eliminate the use of
typedef
's.
- Only accessors and mutators should be
inline
.
- No implementation in the header file, with the exception of
inline
functions.
- There are two minor problems in the
Predecessor()
function:
- There is a variable called
found
. It should be changed
to success
.
- There is an extraneous semicolon at the end of the second line of
the function body.
- Remove the restriction that the elements manipulated by BST objects are objects
with integer key values. This eliminates the need for a
Key()
function
and requires changing some of the function declarations:
T& Search (const T& elm, bool& success);
bool Delete (const T& elm);
T& Predecessor (const T& elm, bool& success);
T& Successor (const T& elm, bool& success);
These are the only functions whose declarations will change, but
other functions may also require changes to their implementation.
In general, be sure that all comparisons are done on the tree elements
directly rather than through an integer key.
- Currently, the code for printing a binary tree displays the elements
inside of a set of parentheses, with commas separating the elements.
Change the appropriate functions so that the elements are printed
one per line, no commas and no parentheses. For example, a tree
with three integer elements would display as:
17
23
36
instead of
( 17, 23, 36, )
- Test all changes to these classes before proceeding.
Create files WordRecord.H
and WordRecord.C
for
the WordRecord
class shown on page 211. Adapt the
WordRecord
class as follows:
- Use
ArrayList
(from the last project) rather than
DynList
.
- Eliminate the fixed-size character array used to store the word,
and replace it with a
char
pointer (to dynamically
allocated memory). This means you should also add a destructor,
copy constructor, and overloaded assignment operator.
- Eliminate the
Key()
and MakeKey()
functions and the associated data member.
- Add overloadings for the operators
==
, !=
,
<
, and >
. All comparisons should be
case insensitive.
- Modify the
UpdatePages()
function to eliminate duplicate
entries from the page list for a word.
- Data members can be added to the
private
section of the
class if needed.
In a file called Proj3_aux.C
, write the Index()
and ReadWord()
functions.
Modify the Index()
function so that it has four parameters as follows:
istream& input_file
- input stream associated with the original text
document
istream& skip_file
- input stream associated with the file
containing list of words to ignore
ostream& output_file
- output stream associated with the
output file
int threshold
- this parameter replaces the value that is
currently a hard-coded constant
When writing the functions in Proj3_aux.C
, you may use the code
on pages 212-213 as a guideline, but you do not have to duplicate the textbook's
approach. Make improvements and corrections as needed.
In a file called Proj3.C
write a short main()
that gets
three file names and an integer from the command line, specified in the following
order:
- The input document (plain ASCII text file with page breaks specified by
the word PgBk; see p. 214 for an example)
- The file containing a list of words to ignore (assume one word per line)
- The output file
- The threshold value
The command line arguments should be processed with appropriate error-handling,
including informative error messages if the wrong number and/or type of values.
Next, attempt to open the three files, again with appropriate error-handling.
If successful, call the Index()
function.
The format of the output
file should be one word per line, with each word followed by its list of
page numbers on the same line. Words and page numbers should be separated
by one or more spaces. See Sample Output
below.
Summary of Tasks
Here are your tasks:
- Make sure your
ArrayList
class works and that it has an
<<
operator defined.
- Revise and test the
Bnode
, BinaryTree
, and
BST
classes. Don't forget to add documentation.
- Adapt the
WordRecord
class shown on page 211.
- Write the
Index()
and other auxilliary functions in
Proj3_aux.C
.
- In
Proj3.C
, write a main()
that processes the
command line arguments and calls Index()
.
- Test, test, test! Be sure that you have successfully implemented all of the
enhancements described in the Background
section above.
- The name of your executable should be
Proj3
.
Project grading is described on the
Project Policy page.
Please re-read the
Project Policy page for details on honesty in doing
projects for this course.
Here is a very simple example to show the expected format of the output.
You are responsible for developing your own test cases and input files.
All projects will be tested with the same data files, but the files
will not be disclosed in advance.
Command line: Proj3 input.txt skip.txt output.txt 5
Contents of input.txt file:
The concept of abstract data types
pervades much of the theory of data
structures, and also forms the central
concept of the class in object-oriented
programming.
PgBk
The binary search tree data structure
is a binary tree in which the vertices
must obey a specific order.
PgBk
The hash table is a data structure that
is used to implement the Dictionary ADT.
Contents of skip.txt file:
a
and
is
to
Contents of output.txt file:
abstract 1
ADT 3
also 1
binary 2
central 1
class 1
concept 1
data 1 2 3
Dictionary 3
forms 1
hash 3
implement 3
in 1 2
much 1
must 2
obey 2
object-oriented 1
of 1
order 2
pervades 1
programming 1
search 2
specific 2
structure 2 3
structures 1
table 3
that 3
theory 1
tree 2
types 1
used 3
vertices 2
which 2
Notice that the word "the" does not appear in the output file because
its number of occurrences exceeds the threshold value specified on the
command line.