CMSC 201
Programming Project Four
Concordance
Out: Monday 11/13/00
Due: Before Midnight, Sunday 11/26/00
The design document for this project, design4.txt, is due: Before Midnight, Sunday 11/19/00
The Objective
The objective of this assignment is to give you practice with project and
function design. It will also give you an opportunity to work with reading
information from a file, sorting an array of structures, passing structures
by reference, manipulating strings, dealing with command line arguments and
some formatted printing.
The Background
Analyzing text is one of the primary uses of computers. Text is analyzed
to make searching faster, and for statistical analysis. This project will
give you the opportunity to analyze some text and report your findings.
A concordance is an alphabetical list of the words in a passage of text,
together with the number of times that each word occurs
in the text. Very often a list of the line numbers on which each
word appears is also provided, but that is not required
for this project.
The Task
Design and code a project that will allow you to read in the information
from a text file, create a concordance
and report on various statistics about the words in the text.
To make your program easier, the text file will be entirely
in lower case and there will be no punctuation marks in the file.
Your program will provide the following information from your concordance.
See the sample output for a suggested format.
- The contents of the text file to be analyzed
- An alphabetical list of all words in the text, together with the
number of times each occurs
- An alphabetical list of the words which occur most frequently and the number
of times they occur
- An alphabetical list of the longest word(s), their length and the number
of times each occurs
- An alphabetical list of the shortest word(s), their length and the number
of times each occurs
- The average word length, reported with one decimal place of precision
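Most of the statistics listed above reduce to a single pass over the concordance. The following is only a minimal sketch of two of them, not a required design: the WORD structure and the function names MaxCount and AverageLength are assumptions made for illustration and do not come from the assignment.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LEN 12   /* longest word allowed by the specifications */

/* Hypothetical layout for one concordance entry (an assumption). */
typedef struct
{
    char text[MAX_LEN + 1];  /* the word plus its terminating '\0' */
    int  count;              /* number of times the word occurs */
} WORD;

/* Returns the highest occurrence count among the n entries,
   or 0 if the concordance is empty. */
int MaxCount(const WORD words[], int n)
{
    int i;
    int max = 0;

    for (i = 0; i < n; i++)
    {
        if (words[i].count > max)
        {
            max = words[i].count;
        }
    }
    return max;
}

/* Returns the average word length over every occurrence in the text,
   so a word that occurs 5 times contributes its length 5 times.
   Returns 0.0 for an empty concordance to avoid dividing by zero. */
double AverageLength(const WORD words[], int n)
{
    int  i;
    long totalChars = 0;
    long totalWords = 0;

    for (i = 0; i < n; i++)
    {
        totalChars += (long) strlen(words[i].text) * words[i].count;
        totalWords += words[i].count;
    }
    if (totalWords == 0)
    {
        return 0.0;
    }
    return (double) totalChars / totalWords;
}
```

The result can be printed with one decimal place using the "%.1f" conversion, for example printf("Average word length is %.1f characters\n", AverageLength(words, n));. Weighting each word by its count is a choice made here because it agrees with the sample run for test.dat, where 92 characters over 26 words gives 3.5.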
Several test data files are available for you. You can view these
files and examine their content.
You should copy one or more of these files into your account by
using the following commands:
(don't forget that there is a dot (.) at the end of the command)
cp /afs/umbc.edu/users/s/b/sbogar1/pub/cs201/P4Data/preamble.dat .
cp /afs/umbc.edu/users/s/b/sbogar1/pub/cs201/P4Data/declaration.dat .
cp /afs/umbc.edu/users/s/b/sbogar1/pub/cs201/P4Data/mary.dat .
cp /afs/umbc.edu/users/s/b/sbogar1/pub/cs201/P4Data/rose.dat .
cp /afs/umbc.edu/users/s/b/sbogar1/pub/cs201/P4Data/test.dat .
cp /afs/umbc.edu/users/s/b/sbogar1/pub/cs201/P4Data/imagine.dat .
You should of course make your own test files as well.
The Specifications
- Your program must take the name of the text file as its one and only command
line argument.
- Your program may make the following assumptions:
  - There will be no more than 100 different words in the text.
  - The longest word in the text will be no more than 12 characters.
- Your program must sort the words in the concordance. You may use whatever
sorting technique you want.
The code for selection sort is available from lecture
10, but it must be modified to sort strings instead of integers.
- Your program must use a structure to hold the information about each
word in the concordance.
- Your program must contain at least one function which uses a pointer to
a structure as a parameter.
- All alphabetic listings must be displayed 5 entries per line.
- All errors must be reported to stderr.
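Several of these specifications fit together naturally. The sketch below is one possible arrangement, written independently of the lecture 10 code: a structure for each word, a selection sort that compares strings with strcmp, a swap function that takes pointers to structures, and a listing printed 5 entries per line. The names WORD, SwapWords, SortWords and PrintWords, and the column widths, are assumptions for illustration only.

```c
#include <stdio.h>
#include <string.h>

#define MAX_WORDS 100  /* at most 100 different words */
#define MAX_LEN   12   /* longest word is at most 12 characters */
#define PER_LINE  5    /* entries per line in every listing */

/* Hypothetical layout for one concordance entry (an assumption). */
typedef struct
{
    char text[MAX_LEN + 1];  /* the word plus its terminating '\0' */
    int  count;              /* number of times the word occurs */
} WORD;

/* Exchanges two entries; both parameters are pointers to structures. */
void SwapWords(WORD *a, WORD *b)
{
    WORD temp = *a;

    *a = *b;
    *b = temp;
}

/* Selection sort: on each pass, find the alphabetically smallest
   remaining word and move it to the front of the unsorted part.
   strcmp replaces the < comparison used when sorting integers. */
void SortWords(WORD words[], int n)
{
    int i, j, min;

    for (i = 0; i < n - 1; i++)
    {
        min = i;
        for (j = i + 1; j < n; j++)
        {
            if (strcmp(words[j].text, words[min].text) < 0)
            {
                min = j;
            }
        }
        if (min != i)
        {
            SwapWords(&words[i], &words[min]);
        }
    }
}

/* Prints each word and its count, starting a new line after every
   PER_LINE entries and after the final entry. */
void PrintWords(const WORD words[], int n)
{
    int i;

    for (i = 0; i < n; i++)
    {
        printf("%-*s %3d  ", MAX_LEN, words[i].text, words[i].count);
        if ((i + 1) % PER_LINE == 0 || i == n - 1)
        {
            printf("\n");
        }
    }
}
```

Because the structures are swapped whole, each count stays attached to its word during the sort. An array declared as WORD words[MAX_WORDS] is large enough under the stated assumptions.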
Sample Run
Although your output need not look exactly like the sample output below,
all information detailed in the specification above must be present. Your
program must also print a short greeting. Don't be concerned
if your output scrolls off the top of the screen. It will be
very difficult to keep all output on a single screen.
There are two approaches to this problem:
- Use the Unix script command to capture your output in a file
(named typescript) in order to examine it.
For more information on the script command, see the Unix
man pages.
- Redirect the output of your program into a file using Unix
redirection as discussed in class.
irix1[1]% a.out
Usage: a.out <filename>
irix1[2]% a.out ppp.dat
can't open ppp.dat
irix1[3]% a.out test.dat
The original text:
this is the test file for project four
this is the test
this is only the test
this is not real because it is the test
The concordance contains 13 words, listed below alphabetically
because   1   file      1   for       1   four      1   is        5
it        1   not       1   only      1   project   1   real      1
test      4   the       4   this      4
The most frequent word(s) occurred 5 times:
is
The longest word(s) had length 7 :
because   1   project   1
The shortest word(s) had length 2 :
is        5   it        1
Average word length is 3.5 characters
irix1[4]%
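The first two commands in the sample run show the program rejecting a missing argument and an unreadable file. One way this might be handled is sketched below; the helper names OpenTextFile and ReadWord are hypothetical, and the message wording simply copies the sample run. Both messages go to stderr, as the specifications require.

```c
#include <stdio.h>

#define MAX_LEN 12   /* longest word allowed by the specifications */

/* Hypothetical helper: checks the command line and opens the text file.
   Returns the open file, or NULL after reporting the problem to stderr. */
FILE *OpenTextFile(int argc, char *argv[])
{
    FILE *ifp;

    /* Exactly one argument is expected after the program name. */
    if (argc != 2)
    {
        fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
        return NULL;
    }

    ifp = fopen(argv[1], "r");
    if (ifp == NULL)
    {
        fprintf(stderr, "can't open %s\n", argv[1]);
        return NULL;
    }
    return ifp;
}

/* Hypothetical helper: reads the next whitespace-delimited word.
   The width 12 in the format matches MAX_LEN, so the buffer of
   MAX_LEN + 1 characters cannot overflow.
   Returns 1 if a word was read and 0 at end of file. */
int ReadWord(FILE *ifp, char word[MAX_LEN + 1])
{
    return fscanf(ifp, "%12s", word) == 1;
}
```

A main() built on these helpers would exit with a nonzero status when OpenTextFile returns NULL, and would otherwise call ReadWord in a loop until it returns 0, then close the file with fclose.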
Submitting the Program
You are to use separate compilation for this project, so you will be submitting
a minimum of three files.
Your C source code file that contains main() MUST be called proj4.c.
I would expect that you would also have files called
concordance.c and concordance.h, but you may choose to
have additional .c and .h files.
To submit your project, type the following at the Unix prompt. Note
that the project name starts with uppercase 'P'.
submit cs201 Proj4 proj4.c concordance.c concordance.h (and possibly
other files, separated by spaces)
To verify that your project was submitted, you can execute the following
command at the Unix prompt. It will show all files that you submitted in
a format similar to the Unix 'ls' command.
submitls cs201 Proj4
Monday, 30-Oct-2000 14:53:34 EST