UMBC CMSC201, Computer Science I, Fall '97

Project 5: Document Statistics

Due date: Friday, December 12, 1997

Statistics about documents are used for many purposes by computer scientists. A tally of specific word occurrences within documents can be useful for determining the similarity of documents. This is helpful for document retrieval and is used by modern search engines for the Internet. Analysis of words used within a document can even determine authorship. The number of occurrences of letters can be useful, and has been well studied in the area of cryptology. The percentages of individual letters that occur within a document are language specific. This can help determine the language of an encoded message.

This project will give you practice with strings, file-handling, malloc, arrays, pointers and sorting.

Description of the Program

There will be several text files made available to you for this project. I suggest starting with sample.txt, since I have provided you sample output from my program that is an analysis of sample.txt. Instructions are given below for copying it into your account.

I have written proj5.c for you. You will need to copy it into your account. Instructions are given below. You must use it without modification. You must write charfreq.h and charfreq.c. Here is proj5.c:

/****************************************************************************\
* Filename:      proj5.c                                                     *
* Author:        Sue Bogar                                                   *
* Date written:  11/28/98                                                    *
* Description:   This program reads in a string, determines its length and   *
*                prints the string, its length, the number of whitespace     *
*                characters, punctuation marks and digits found in the       *
*                string, and the number of words in the string. The number   *
*                of occurrences of each letter is found, using an array of   *
*                counters and a report is generated that shows the letters   *
*                and the number of times they occurred in descending order   *
*                by occurrences.                                             *
\****************************************************************************/

#include <malloc.h>
#include "charfreq.h"

#define SIZE 26

main ()
{
   char *string, letters [SIZE];
   int i, length, freq [SIZE];
   int word = 0, space = 0, punct = 0, digit = 0;

   /* Get the string from a file */
   length = ReadString (&string);

   /* Process the string for character types and words */
   CountCharTypes (string, length, &space, &punct, &digit);
   word = WordCount (string);
   PrintCharReport (string, length, space, punct, digit, word);

   /* Find frequencies of letters in the string */
   InitArrays (letters, freq, SIZE);
   CountLetters (string, length, freq, SIZE);

   /* Release memory since the string is no longer needed */
   free(string);

   SortByFrequency (letters, freq, SIZE);
   PrintFreqReport (letters, freq, SIZE);
}

charfreq.c should contain the 8 functions that are called from main() and may contain other functions as well.

Notice that I am using two arrays. The array called letters should hold the letters a, b, c, etc. The array called freq will hold zeros at first, but will eventually hold the number of times the associated character occurred in the string.

In the function ReadString(), you are to read the entire contents of a file specified by the user into the string whose address is passed to the function, and return the number of characters read.
The function CountCharTypes() is to make use of isdigit() and other macros in ctype.h to count the number of digits, white space characters, and punctuation marks found in the string.
The function WordCount() is to return the number of words in the string.
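
To illustrate the file-reading step that ReadString() needs, here is a minimal sketch, offered as one possible approach rather than a required design. The helper name ReadFileIntoString() and its filename parameter are hypothetical; ReadString() would still have to prompt the user for the file name and pass it along. Growing the buffer with realloc() and returning -1 on failure are also assumptions of this sketch.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not part of the assignment's required interface).
 * Reads the whole file into a malloc'd, '\0'-terminated buffer, stores the
 * buffer's address in *string, and returns the number of characters read.
 * Returns -1 if the file cannot be opened or memory runs out (assumption). */
int ReadFileIntoString(const char *filename, char **string)
{
    FILE *fp = fopen(filename, "r");
    size_t capacity = 256, length = 0;
    char *buffer, *bigger;
    int ch;

    if (fp == NULL) {
        return -1;
    }
    buffer = malloc(capacity);
    if (buffer == NULL) {
        fclose(fp);
        return -1;
    }

    while ((ch = fgetc(fp)) != EOF) {
        /* Always keep one byte free for the terminating '\0' */
        if (length + 1 >= capacity) {
            capacity *= 2;
            bigger = realloc(buffer, capacity);
            if (bigger == NULL) {
                free(buffer);
                fclose(fp);
                return -1;
            }
            buffer = bigger;
        }
        buffer[length++] = (char) ch;
    }
    buffer[length] = '\0';
    fclose(fp);

    *string = buffer;   /* the caller is responsible for free() */
    return (int) length;
}
```

Because the buffer comes from malloc(), the free(string) call in proj5.c would release it correctly.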

The function PrintCharReport() should produce output similar to the following example:

The string is :
This is just a little sample file.  I need to find out
whether my code will handle newlines and multiple sentences
properly.  There are 4 sentences and 38 words in this file.
The integers count as words too.

It consists of 208 characters in all.
There are 40 space(s), 4 punctuation mark(s), 3 digit(s), 
and 38 word(s).
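
The counts in a report like the one above could be gathered along the following lines. This is only a sketch under stated assumptions: the parameter lists are inferred from the calls in proj5.c, the const qualifiers are my addition, and a "word" is assumed to be any run of non-whitespace characters (so "file." and "38" each count as one word). The counters are incremented rather than reset because main() initializes them to zero.

```c
#include <ctype.h>

/* Adds to *space, *punct and *digit the number of whitespace characters,
 * punctuation marks and digits found in the first `length` characters. */
void CountCharTypes(const char *string, int length,
                    int *space, int *punct, int *digit)
{
    int i;

    for (i = 0; i < length; i++) {
        /* ctype.h routines expect an unsigned char value */
        unsigned char ch = (unsigned char) string[i];

        if (isspace(ch)) {
            (*space)++;
        } else if (ispunct(ch)) {
            (*punct)++;
        } else if (isdigit(ch)) {
            (*digit)++;
        }
    }
}

/* Returns the number of whitespace-separated words in the string
 * (assumed definition of "word"). */
int WordCount(const char *string)
{
    int count = 0;
    int inWord = 0;   /* 1 while scanning inside a word */

    for (; *string != '\0'; string++) {
        if (isspace((unsigned char) *string)) {
            inWord = 0;
        } else if (!inWord) {
            inWord = 1;   /* first character of a new word */
            count++;
        }
    }
    return count;
}
```

Run on the 208 characters of sample.txt, these two functions give 40 spaces (newlines included), 4 punctuation marks, 3 digits and 38 words, matching the report shown above.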

InitArrays() should initialize the array, letters, to hold the characters 'a' through 'z' and the array, freq, to all zeros.
After CountLetters() has executed, letters[0] should hold the character 'a', and freq[0] should hold the number of times the character 'a' or 'A' occurred in the string.
Next we want to sort these two arrays, so that if 'e' or 'E' was the most frequently occurring letter, then 'e' should be in letters[0], and the number of times it occurred should be in freq[0]. The alphabetic characters should be in the array letters[] in descending order by their frequency. You should write SortByFrequency() using sort.c found on pages 443-444 of the Roberts' text as an example. sort.c uses selection sort. Using selection sort is required for this project.
PrintFreqReport() simply prints out the contents of the two sorted arrays.
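
The three array functions described above might look something like this. It is a sketch of the general selection sort technique applied to two parallel arrays, not a copy of sort.c from the text. The parameter lists are inferred from the calls in proj5.c, and the code assumes the letters 'a' through 'z' have consecutive character codes, as they do in ASCII.

```c
#include <ctype.h>

/* letters[i] gets 'a' + i and freq[i] gets 0 (assumes contiguous letters) */
void InitArrays(char letters[], int freq[], int size)
{
    int i;

    for (i = 0; i < size; i++) {
        letters[i] = (char) ('a' + i);
        freq[i] = 0;
    }
}

/* Counts each letter, treating upper and lower case as the same letter */
void CountLetters(const char *string, int length, int freq[], int size)
{
    int i, index;

    for (i = 0; i < length; i++) {
        unsigned char ch = (unsigned char) string[i];

        if (isalpha(ch)) {
            index = tolower(ch) - 'a';
            if (index >= 0 && index < size) {
                freq[index]++;
            }
        }
    }
}

/* Selection sort into descending order by frequency. Every swap is made
 * in both arrays so each letter stays paired with its own count. */
void SortByFrequency(char letters[], int freq[], int size)
{
    int lh, rh, i, tempFreq;
    char tempLetter;

    for (lh = 0; lh < size - 1; lh++) {
        /* Find the largest remaining frequency */
        rh = lh;
        for (i = lh + 1; i < size; i++) {
            if (freq[i] > freq[rh]) {
                rh = i;
            }
        }
        /* Move it to position lh, keeping the arrays in step */
        tempFreq = freq[lh];
        freq[lh] = freq[rh];
        freq[rh] = tempFreq;

        tempLetter = letters[lh];
        letters[lh] = letters[rh];
        letters[rh] = tempLetter;
    }
}
```

Selection sort is not stable, so letters with equal counts can come out in an order other than alphabetical.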

The final output for the whole program should look similar to this:

retriever[102] a.out
Enter the name of the text file to be examined: sample.txt

The string is :
This is just a little sample file.  I need to find out
whether my code will handle newlines and multiple sentences
properly.  There are 4 sentences and 38 words in this file.
The integers count as words too.

It consists of 208 characters in all.
There are 40 space(s), 4 punctuation mark(s), 3 digit(s), 
and 38 word(s).

The letters in descending order by their occurrences 
are shown below:

e occurred   26 times
t occurred   16 times
n occurred   14 times
s occurred   14 times
i occurred   13 times
l occurred   12 times
o occurred    9 times
r occurred    8 times
d occurred    8 times
h occurred    7 times
a occurred    7 times
w occurred    5 times
c occurred    4 times
p occurred    4 times
u occurred    4 times
m occurred    3 times
f occurred    3 times
y occurred    2 times
g occurred    1 times
j occurred    1 times
k occurred    0 times
v occurred    0 times
q occurred    0 times
x occurred    0 times
b occurred    0 times
z occurred    0 times

retriever[103]

More details

You are not allowed to use any of the Roberts' libraries for this project.
You will be graded on your design and on the efficiency of the program, as well as the correctness of the program, documentation and style.

Copying the files

The sample document, sample.txt, and the source file, proj5.c, are both in my 201 directory. You should copy these files into your own directory. The executable and the data file need to be in the same directory. Here's how to copy the files:
Change to the directory where you will write your code and keep the executable, then type the following commands at the Unix prompt.

     cp ~sbogar1/201/sample.txt sample.txt
     cp ~sbogar1/201/proj5.c proj5.c

I intend to provide other text files, in several languages, so that you can see the differences in occurrences of letters. After you have your project running properly on the sample.txt file, check out the other test files that I'll provide and compare the occurrences of letters in different languages.

What to Turn In

You must use separate compilation for this project. You must be able to compile the proj5.c file that I have provided for you with a file called charfreq.c that you have written. You must also write charfreq.h. The files charfreq.c and charfreq.h contain the functions related to the frequency of characters and the prototypes for those functions, respectively. You may, of course, have other .c and .h files, as you see fit.
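
As a starting point for separate compilation, charfreq.h could be laid out roughly as follows. The prototypes are inferred from the calls in proj5.c; the void return types, the const qualifiers, and the guard name CHARFREQ_H are assumptions of this sketch rather than requirements.

```c
/* charfreq.h: prototypes for the functions defined in charfreq.c */
#ifndef CHARFREQ_H
#define CHARFREQ_H

/* Reading the file and reporting on character types */
int  ReadString(char **string);
void CountCharTypes(const char *string, int length,
                    int *space, int *punct, int *digit);
int  WordCount(const char *string);
void PrintCharReport(const char *string, int length,
                     int space, int punct, int digit, int word);

/* Letter frequencies, held in two parallel arrays */
void InitArrays(char letters[], int freq[], int size);
void CountLetters(const char *string, int length, int freq[], int size);
void SortByFrequency(char letters[], int freq[], int size);
void PrintFreqReport(char letters[], int freq[], int size);

#endif
```

The include guard keeps the prototypes from being declared twice if the header is included by more than one of your files. With this header in place, a command such as cc proj5.c charfreq.c would typically compile both source files and link them into a.out.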

Submit as follows:

submit cs201 proj5 proj5.c charfreq.c charfreq.h

The order in which the files are listed doesn't matter. However, you must make sure that all files necessary to compile your project are listed.