UMBC CMSC201, Computer Science I, Spring '98
Project 3: Document Statistics
Due date: Tuesday, April 14, 1998
Statistics about documents are used for many purposes by computer scientists.
A tally of specific word occurrences within documents can be useful for
determining the similarity of documents. This is helpful for document retrieval
and is used by modern search engines for the Internet. Analysis of words
used within a document can even determine authorship. The number of
occurrences of each letter can also be useful, and has been well studied in
the area of cryptology. The percentages of individual letters that occur
within a document are language specific, which can help determine the
language of an encoded message.
This project will give you practice with strings, file-handling, malloc,
arrays, pointers and sorting.
Description of the Program
Several text files are available to you for this project: sample.txt,
English.txt, French.txt, Spanish.txt and German.txt. I suggest starting with
sample.txt, since I have provided sample output from my program's analysis
of that file. Instructions for copying the files into your account are given
below.
I have written docstat.c for you. You will need to copy it into your account.
Instructions are given below. You must use it without modification.
You must write charfreq.h and charfreq.c. Here is docstat.c:
/****************************************************************************\
* Filename: docstat.c *
* Author: Sue Bogar *
* Date written: 7/24/95 *
* Modified: 3/24/98 For 201S98 Project 3 *
* Description: This program reads in a string, determines its length and *
* prints the string, its length, the number of whitespace *
* characters, punctuation marks and digits found in the *
* string, and the number of words in the string. The number *
* of occurrences of each letter is found, using an array of *
* counters and a report is generated that shows the letters *
* and the number of times they occurred in descending order *
* by occurrences. *
* *
* This program is to be run separately against 4 text files, *
* having the same content, but written in different languages.*
* The user can inspect the output produced from each of the *
* four runs to see differences in character occurrences from *
* each of the four languages. *
\****************************************************************************/
#include <stdlib.h>
#include "charfreq.h"
#define SIZE 26
main ()
{
    char *string, letters [SIZE];
    int i, length, freq [SIZE];
    int word = 0, space = 0, punct = 0, digit = 0;

    string = ReadString (&length);
    CountCharTypes (string, length, &space, &punct, &digit);
    word = WordCount (string);
    PrintCharReport (string, length, space, punct, digit, word);
    InitArrays (letters, freq, SIZE);
    CountLetters (string, length, freq, SIZE);
    free(string);
    SortByFrequency (letters, freq, SIZE);
    PrintFreqReport (letters, freq, SIZE);
}
charfreq.c should contain the 8 functions that are called from main() and may
contain other functions as well.
Notice that I am using two arrays. The array called letters should hold the
letters a, b, c, etc. The array called freq will hold zeros at first, but will
eventually hold the number of times the associated character occurred in the
string.
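The pairing of the two arrays can be sketched as follows. This is only a
minimal sketch, not the required solution. The parameter lists are inferred
from the calls in main(), the void return types are an assumption, and the
code assumes a character set in which 'a' through 'z' are contiguous (as in
ASCII).

```c
#include <ctype.h>

/* Fill letters[] with 'a', 'b', ... and set every counter in freq[] to 0. */
void InitArrays(char letters[], int freq[], int size)
{
    int i;

    for (i = 0; i < size; i++) {
        letters[i] = (char)('a' + i);
        freq[i] = 0;
    }
}

/* For each alphabetic character, bump the counter that lines up with it.
 * Upper and lower case share a counter, so 'A' and 'a' both add to freq[0]. */
void CountLetters(const char *string, int length, int freq[], int size)
{
    int i;

    for (i = 0; i < length; i++) {
        unsigned char ch = (unsigned char)string[i];

        if (isalpha(ch)) {
            int index = tolower(ch) - 'a';

            /* Ignore letters outside a..z (possible in some locales). */
            if (index >= 0 && index < size)
                freq[index]++;
        }
    }
}
```

The index of a counter in freq is what ties it to its letter: freq[i] always
describes letters[i], which is why the two arrays must later be rearranged
together.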
- In the function ReadString(), you are to read the entire contents of
a file specified by the user into a string. You must malloc the space
to hold this string and return the address of the string to main().
This function must also modify the variable, length, so that it will
contain the number of characters read (the length of the string).
- The function CountCharTypes() is to make use of isdigit() and other
macros in ctype.h to count the number of digits, white space
characters, and punctuation marks found in the string.
- The function WordCount() is to return the number of words in the
string.
- The function PrintCharReport() should produce output similar to the
following example:
The string is :
This is just a little sample file. I need to find out
whether my code will handle newlines and multiple sentences
properly. There are 4 sentences and 38 words in this file.
The integers count as words too.
It consists of 208 characters in all.
There are 40 space(s), 4 punctuation mark(s), 3 digit(s),
and 38 word(s).
- InitArrays() should initialize the array, letters, to hold the
characters 'a' through 'z' and the array, freq, to all zeros.
- After CountLetters() has executed, letters[0] should hold the
character 'a', and freq[0] should hold the number of times the
character 'a' or 'A' occurred in the string.
- Next we want to sort these two arrays, so that if 'e' or 'E' was
the most frequently occurring letter, then 'e' should be in letters[0],
and the number of times it occurred should be in freq[0]. The
alphabetic characters should be in the array letters[] in descending
order by their frequency. You should write SortByFrequency() by
modifying one of the sorting functions that you wrote for project 2.
Please use the more efficient one of the two sorting algorithms.
- PrintFreqReport() simply prints out the contents of the two sorted
arrays.
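The first three functions described above might be sketched like this. It is
a hedged illustration under several assumptions: ReadFileContents() is a
hypothetical helper that the assignment does not name, the 255-character
limit on the file name is arbitrary, a "word" is taken to be any run of
non-whitespace characters, and the void return type of CountCharTypes() is
inferred from the call in main().

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read the whole named file into malloc'd memory.
 * Returns NULL if the file cannot be opened or memory runs out. */
char *ReadFileContents(const char *filename, int *length)
{
    FILE *fp = fopen(filename, "r");
    char *buffer;
    int ch, count = 0, i = 0;

    if (fp == NULL)
        return NULL;
    while (getc(fp) != EOF)               /* first pass: count characters */
        count++;
    buffer = malloc((size_t)count + 1);   /* one extra byte for the '\0' */
    if (buffer != NULL) {
        rewind(fp);
        while (i < count && (ch = getc(fp)) != EOF)   /* second pass: copy */
            buffer[i++] = (char)ch;
        buffer[i] = '\0';
        *length = i;
    }
    fclose(fp);
    return buffer;
}

/* Prompt for a file name and return its contents; exits on failure,
 * because main() uses the result without checking it. */
char *ReadString(int *length)
{
    char filename[256];
    char *string;

    printf("Enter the name of the text file to be examined: ");
    if (scanf("%255s", filename) != 1)
        exit(EXIT_FAILURE);
    string = ReadFileContents(filename, length);
    if (string == NULL) {
        fprintf(stderr, "Could not read %s\n", filename);
        exit(EXIT_FAILURE);
    }
    return string;
}

/* Classify each character with the ctype.h macros. */
void CountCharTypes(const char *string, int length,
                    int *space, int *punct, int *digit)
{
    int i;

    *space = *punct = *digit = 0;
    for (i = 0; i < length; i++) {
        unsigned char ch = (unsigned char)string[i];

        if (isspace(ch))
            (*space)++;
        else if (ispunct(ch))
            (*punct)++;
        else if (isdigit(ch))
            (*digit)++;
    }
}

/* Count runs of non-whitespace characters, so "38" counts as a word. */
int WordCount(const char *string)
{
    int count = 0, inWord = 0;

    for (; *string != '\0'; string++) {
        if (isspace((unsigned char)*string)) {
            inWord = 0;
        } else if (!inWord) {
            inWord = 1;
            count++;
        }
    }
    return count;
}
```

Reading the file twice keeps the allocation to a single malloc() call of
exactly the right size, at the cost of a second pass over the file.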
The final output for the whole program should look similar to this:
retriever[102] a.out
Enter the name of the text file to be examined: sample.txt
The string is :
This is just a little sample file. I need to find out
whether my code will handle newlines and multiple sentences
properly. There are 4 sentences and 38 words in this file.
The integers count as words too.
It consists of 208 characters in all.
There are 40 space(s), 4 punctuation mark(s), 3 digit(s),
and 38 word(s).
The letters in descending order by their occurrences
are shown below:
e occurred 26 times
t occurred 16 times
n occurred 14 times
s occurred 14 times
i occurred 13 times
l occurred 12 times
o occurred 9 times
r occurred 8 times
d occurred 8 times
h occurred 7 times
a occurred 7 times
w occurred 5 times
c occurred 4 times
p occurred 4 times
u occurred 4 times
m occurred 3 times
f occurred 3 times
y occurred 2 times
g occurred 1 times
j occurred 1 times
k occurred 0 times
v occurred 0 times
q occurred 0 times
x occurred 0 times
b occurred 0 times
z occurred 0 times
retriever[103]
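The frequency report above depends on moving each letter together with its
count. The sketch below shows one way to do that. It uses an insertion sort
purely as a stand-in, since the sorting functions from project 2 are not
shown here, and the void return types are an assumption. Because this sort
leaves letters with equal counts in alphabetical order, the order among ties
may differ from the sample run.

```c
#include <stdio.h>

/* Sort both arrays into descending order by count. Whenever a count
 * moves, the letter at the same index moves with it. */
void SortByFrequency(char letters[], int freq[], int size)
{
    int i, j;

    for (i = 1; i < size; i++) {
        char letter = letters[i];
        int count = freq[i];

        /* Shift smaller counts (and their letters) one slot to the right. */
        for (j = i - 1; j >= 0 && freq[j] < count; j--) {
            letters[j + 1] = letters[j];
            freq[j + 1] = freq[j];
        }
        letters[j + 1] = letter;
        freq[j + 1] = count;
    }
}

/* Print one line per letter, in the order the arrays are already in. */
void PrintFreqReport(const char letters[], const int freq[], int size)
{
    int i;

    printf("The letters in descending order by their occurrences\n");
    printf("are shown below:\n");
    for (i = 0; i < size; i++)
        printf("%c occurred %d times\n", letters[i], freq[i]);
}
```

Sorting only freq[] would lose track of which letter each count belongs to,
so every swap or shift has to be applied to both arrays.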
More details
- You will be graded on your design and on the efficiency of the program,
as well as the correctness of the program, documentation and style.
Copying the files
The documents to use for this project are called sample.txt, English.txt,
French.txt, Spanish.txt and German.txt. These files along with the source
file, docstat.c, are found in my 201 directory. You should copy these
files into your own directory. The executable and the data files need to be
in the same directory. Here's how to copy the files:
Change to the directory where you will write your code and keep the
executable, then type the following commands at the Unix prompt:
cp ~sbogar1/201/sample.txt .
cp ~sbogar1/201/docstat.c .
After you have your project running properly on the sample.txt file, check
out the other test files and compare the occurrences of letters in different
languages.
What to Turn In
You must use separate compilation for this project. You must be able
to compile the docstat.c file that I have provided for you with a file
called charfreq.c that you have written. You must also write charfreq.h.
charfreq.c and charfreq.h contain the functions related to the frequency of
characters and the prototypes for those functions, respectively. You may,
of course, have other .c and .h files, as you see fit.
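As a rough guide, a charfreq.h consistent with the calls in docstat.c might
look like the sketch below. The parameter lists are inferred from how main()
calls each function. The void return types, the const qualifiers and the
include-guard name CHARFREQ_H are assumptions, not requirements stated in
this handout.

```c
/* charfreq.h: prototypes for the functions defined in charfreq.c.
 * Sketch only; return types other than those main() uses are guesses. */
#ifndef CHARFREQ_H
#define CHARFREQ_H

char *ReadString(int *length);
void CountCharTypes(const char *string, int length,
                    int *space, int *punct, int *digit);
int WordCount(const char *string);
void PrintCharReport(const char *string, int length,
                     int space, int punct, int digit, int word);
void InitArrays(char letters[], int freq[], int size);
void CountLetters(const char *string, int length, int freq[], int size);
void SortByFrequency(char letters[], int freq[], int size);
void PrintFreqReport(const char letters[], const int freq[], int size);

#endif /* CHARFREQ_H */
```

The include guard lets the header be included from both docstat.c and
charfreq.c, and from any extra .c files, without duplicate declarations
causing trouble.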
Submit as follows:
submit cs201 proj3 charfreq.c charfreq.h
Please note: You do not have to submit docstat.c, because your charfreq.c
and charfreq.h files must compile with my docstat.c.