Computer scientists use statistics about documents for many purposes. A tally of specific word occurrences within documents can be useful for determining how similar the documents are. This is helpful for document retrieval and is used by modern Internet search engines. Analysis of the words used within a document can even help determine authorship. The number of occurrences of each letter can also be useful, and has been well studied in the area of cryptology. The percentages of individual letters that occur within a document are language specific, which can help determine the language of an encoded message.
This project will give you practice with strings, file-handling, malloc, arrays, pointers and sorting.
I have written proj5.c for you. You will need to copy it into your account. Instructions are given below. You must use it without modification. You must write charfreq.h and charfreq.c. Here is proj5.c:
Notice that I am using two arrays. The array called letters should hold the letters a, b, c, etc. The array called freq will hold zeros at first, but will eventually hold the number of times the associated character occurred in the string.
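One possible way to set up and fill the two arrays is sketched below. The function names init_tables and count_letters and the constant NUM_LETTERS are hypothetical; the names you actually need are whatever proj5.c calls. The sketch also assumes that uppercase and lowercase letters are counted together and that the letters a through z are contiguous in the character set (true for ASCII).

```c
#include <ctype.h>

#define NUM_LETTERS 26   /* assumed: one slot per letter a..z */

/* Fill letters[] with 'a'..'z' and set every count in freq[] to zero. */
void init_tables(char letters[], int freq[])
{
    int i;

    for (i = 0; i < NUM_LETTERS; i++) {
        letters[i] = (char)('a' + i);
        freq[i] = 0;
    }
}

/* Add one to freq[] for each letter found in str.
 * Assumption: case is ignored, so 'T' and 't' share a slot.
 * Non-letters (spaces, digits, punctuation) are skipped. */
void count_letters(const char *str, int freq[])
{
    for (; *str != '\0'; str++) {
        int c = tolower((unsigned char)*str);

        if (c >= 'a' && c <= 'z') {
            freq[c - 'a']++;
        }
    }
}
```

Because freq[i] always describes letters[i], the index c - 'a' picks the right counter without searching the letters array.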
The string is : This is just a little sample file. I need to find out whether my code will handle newlines and multiple sentences properly. There are 4 sentences and 38 words in this file. The integers count as words too. It consists of 208 characters in all. There are 40 space(s), 4 punctuation mark(s), 3 digit(s), and 38 word(s).
retriever[102] a.out
Enter the name of the text file to be examined: sample.txt
The string is : This is just a little sample file. I need to find out whether my code will handle newlines and multiple sentences properly. There are 4 sentences and 38 words in this file. The integers count as words too. It consists of 208 characters in all. There are 40 space(s), 4 punctuation mark(s), 3 digit(s), and 38 word(s).
The letters in descending order by their occurrences are shown below:
e occurred 26 times
t occurred 16 times
n occurred 14 times
s occurred 14 times
i occurred 13 times
l occurred 12 times
o occurred 9 times
r occurred 8 times
d occurred 8 times
h occurred 7 times
a occurred 7 times
w occurred 5 times
c occurred 4 times
p occurred 4 times
u occurred 4 times
m occurred 3 times
f occurred 3 times
y occurred 2 times
g occurred 1 times
j occurred 1 times
k occurred 0 times
v occurred 0 times
q occurred 0 times
x occurred 0 times
b occurred 0 times
z occurred 0 times
retriever[103]
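The descending order in the sample output means the two arrays have to be sorted together: whenever two counts trade places, the matching letters must trade places too. A minimal sketch using selection sort follows. The function name sort_by_freq is hypothetical, and selection sort is only one choice of algorithm. Letters with equal counts may come out in a different order than in the sample output above.

```c
/* Sort the parallel arrays so that freq[] is in descending order.
 * Every swap is applied to both arrays, so letters[i] stays
 * paired with freq[i]. */
void sort_by_freq(char letters[], int freq[], int n)
{
    int i, j;

    for (i = 0; i < n - 1; i++) {
        int max = i;   /* index of the largest count seen so far */

        for (j = i + 1; j < n; j++) {
            if (freq[j] > freq[max]) {
                max = j;
            }
        }

        if (max != i) {
            int  tmp_freq   = freq[i];
            char tmp_letter = letters[i];

            freq[i]    = freq[max];
            letters[i] = letters[max];

            freq[max]    = tmp_freq;
            letters[max] = tmp_letter;
        }
    }
}
```

Swapping only the counts would break the pairing and report the wrong letter next to each count.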
More details
The document to use for this project, sample.txt, and the source file, proj5.c, are both found in my 201 directory. You should copy these files into your own directory. The executable and the data file need to be in the same directory. Here's how to copy the files:
Change to the directory where you will write your code and keep the executable, then type the following commands at the Unix prompt.
cp ~sbogar1/201/sample.txt sample.txt
cp ~sbogar1/201/proj5.c proj5.c
I intend to provide other text files, in several languages, so that you can see the differences in occurrences of letters. After you have your project running properly on the sample.txt file, check out the other test files that I'll provide and compare the occurrences of letters in different languages.
You must use separate compilation for this project. You must be able to compile the proj5.c file that I have provided for you with a file called charfreq.c that you have written. You must also write charfreq.h. The files charfreq.c and charfreq.h contain the functions related to the frequency of characters and the prototypes for those functions, respectively. You may, of course, have other .c and .h files, as you see fit.
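As an illustration of the header's role in separate compilation, a charfreq.h might look roughly like the sketch below. Every name in it (NUM_LETTERS and the three functions) is hypothetical; the real prototypes have to match the calls that proj5.c makes. The include guard is standard practice and keeps the header from being processed twice in one source file.

```c
/* charfreq.h: hypothetical sketch only.
 * The names and signatures here are assumptions; use the ones
 * that proj5.c actually calls. */
#ifndef CHARFREQ_H
#define CHARFREQ_H

#define NUM_LETTERS 26   /* assumed: one slot per letter a..z */

/* Prototypes for the functions defined in charfreq.c */
void init_tables(char letters[], int freq[]);
void count_letters(const char *str, int freq[]);
void sort_by_freq(char letters[], int freq[], int n);

#endif /* CHARFREQ_H */
```

Both charfreq.c and any file that calls these functions would then contain the line #include "charfreq.h". Assuming gcc is the compiler, each source file can be compiled on its own with gcc -c proj5.c and gcc -c charfreq.c, and the resulting object files linked with gcc proj5.o charfreq.o, which produces a.out by default.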
Submit as follows:
submit cs201 proj5 proj5.c charfreq.c charfreq.h
The order in which the files are listed doesn't matter. However, you must
make sure that all files necessary to compile your project are listed.