Correction, April 20
In the section "Statistics to be collected", the modulo operator "%" should be division "/".

April 20
The due date of Project 4 is extended to

Corrections, April 27
1. The formula for computing the average word frequency (# of distinct words in HT1 / N_words) is incorrect. It should be 1/(# of distinct words in HT1).
2. The formula for the theoretical estimate of the average number of probes in the project questions is incorrect. It should be (1/lambda)*ln(1/(1 - lambda)), where ln is the natural logarithm.
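The corrected estimate in item 2 can be evaluated directly; a minimal sketch (the function name is mine, not part of the project code):

```cpp
#include <cmath>

// Average number of probes predicted by the corrected formula:
// (1/lambda) * ln(1/(1 - lambda)), where lambda is the load factor.
double expectedProbes(double lambda) {
    return (1.0 / lambda) * std::log(1.0 / (1.0 - lambda));
}
```

For example, at the rehash threshold lambda = 0.8 this gives about 2.01 probes on average.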
The purpose of this project is to give you some exposure to Hash Tables (HT) through a simple application that determines the frequencies of words appearing in a collection of text paragraphs. The time performance of the HT will also be explored.
Word frequency analysis has many
applications. For example, it can provide guidance for learning English
(learn the most
frequent words first), and help to determine if two text documents were
written by the same person (compare the word frequency distributions of
the two documents). In this project, you are asked to compute the frequencies of words appearing in a collection of text paragraphs using hash tables. These paragraphs, read from a text file (referred to as the input file in this description), will first undergo pre-processing in which punctuation and other special symbols are removed. Since some words, such as articles, prepositions, and pronouns, occur frequently but are of little interest in many applications, they will be excluded from the word frequency calculation in this project. The list of words to be excluded is given in another text file, referred to as the exclusion file. After pre-processing, each word will first be checked against the exclusion file. Words that are not excluded will be inserted into a HT; if a word is already in the table, its occurrence count will be incremented by 1. A HT is chosen as the data structure because its find and insert operations take O(1) time as long as the load factor is low. When the load factor of the current HT becomes too high, a new HT of approximately double the size will be created, and all words already processed will be re-hashed into the new table together with their occurrence counts. This operation is called rehash in the author's code. The process continues until all words in the input file have been processed. You will also collect some statistics related to time performance.
Below are the specifics.
Pre-processing
A text paragraph, such as a news story from a wire service or text from the web, often contains punctuation (e.g., , . ? ! "), special symbols (e.g., @ $ % -), and numerals (0, ..., 9). The pre-processing step removes all such symbols from the text read from the input file. Specifically, all symbols other than spaces or the letters {A, ..., Z, a, ..., z} shall be removed. In addition, all uppercase letters are converted to lowercase. After this step, a text paragraph becomes a collection of sequences of lowercase characters "a" through "z", separated by space(s). We call such a sequence of characters a word.
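The pre-processing step above can be sketched as follows (a minimal illustration; the function name is my own, not the required interface):

```cpp
#include <cctype>
#include <string>

// Pre-processing as described above: every character other than a letter
// becomes a space, and letters are converted to lowercase, so that words
// are left as runs of 'a'..'z' separated by spaces.
std::string preprocess(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (unsigned char c : text) {
        if (std::isalpha(c))
            out.push_back(static_cast<char>(std::tolower(c)));
        else
            out.push_back(' ');
    }
    return out;
}
```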
Exclusion file
An example exclusion file can be found here.
Word frequency
The frequency of a word in the collection of texts is computed as the ratio of the number of occurrences of that word in the collection to the total number of word occurrences in the collection. Here the collection of texts should be understood as the input file after pre-processing and with all excluded words removed.
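The frequency definition above amounts to a simple ratio; a minimal illustration (using a std::map stand-in for the counts, not the required HT code):

```cpp
#include <map>
#include <string>

// frequency(word) = occurrences(word) / total occurrences of all
// non-excluded words, per the definition above.
double frequency(const std::map<std::string, int>& counts,
                 const std::string& word) {
    long total = 0;
    for (const auto& p : counts) total += p.second;
    auto it = counts.find(word);
    if (it == counts.end() || total == 0) return 0.0;
    return static_cast<double>(it->second) / total;
}
```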
Hash tables
Since every word read and pre-processed will be checked against the list of words in the exclusion file before being inserted into the HT, it is more efficient to store those excluded words in a hash table as well. Therefore, you will build two hash tables: one, HT0, to hold the words from the exclusion file, and another, HT1, for the rest of the words. Both HT0 and HT1 are open-addressing tables, using the division method for hashing and quadratic probing for resolving collisions, as specified below:
h'(K) = K mod M
h(K, I) = (h'(K) + I^2) mod M
Here K is the (numeric) key of a word, I = 0, 1, 2, ... is the probe number, and M is the table size. HT0 will have a fixed size M0. Since fewer than 50 words will be included in the exclusion file, we choose M0 to be the smallest prime greater than 100 (i.e., 101). HT1 will start with table size M1 = M0. You need to rehash HT1 to double its size when
1) the load factor lambda becomes greater than or equal to 0.8; or
2) the quadratic probing cannot find an empty cell.
Note: Quadratic probing is guaranteed to find an empty slot if 1) the table size is prime and 2) lambda is less than 0.5. Since in this project lambda is allowed to be as large as 0.8, it is possible, although not very likely, that an empty slot will not be found. This condition can be detected when the current probe falls into a slot that has been probed earlier while attempting to insert K (i.e., (h'(K) + I^2) mod M = (h'(K) + J^2) mod M where I != J).
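The scheme above can be sketched as follows. This is a minimal illustration under my own assumptions (the string-to-key function and the class layout are mine, not the author's code): division hashing, quadratic probing, and detection of a probe cycle, which the caller would treat as a rehash trigger alongside the lambda >= 0.8 condition.

```cpp
#include <string>
#include <vector>

// Convert a word to a numeric key K (a simple polynomial hash; my assumption).
unsigned long wordKey(const std::string& w) {
    unsigned long k = 0;
    for (char c : w) k = k * 37 + static_cast<unsigned char>(c);
    return k;
}

bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Smallest prime >= n: gives the initial size 101, and the new size when
// the table is (approximately) doubled during a rehash.
int nextPrime(int n) { while (!isPrime(n)) ++n; return n; }

struct HashTable {
    std::vector<std::string> slots;  // empty string marks an empty cell
    std::vector<int> counts;         // occurrence count per slot
    int used = 0;                    // number of occupied slots
    explicit HashTable(int m) : slots(m), counts(m, 0) {}

    // Insert a word, or increment its count if already present.  Returns
    // false when quadratic probing revisits a slot (a probe cycle), in
    // which case the caller must rehash into a larger table.
    bool insert(const std::string& w) {
        int M = static_cast<int>(slots.size());
        unsigned long h0 = wordKey(w) % M;
        std::vector<bool> probed(M, false);
        for (long i = 0; ; ++i) {
            int pos = static_cast<int>((h0 + i * i) % M);
            if (probed[pos]) return false;  // cycle detected: probing failed
            probed[pos] = true;
            if (slots[pos].empty()) {
                slots[pos] = w; counts[pos] = 1; ++used; return true;
            }
            if (slots[pos] == w) { ++counts[pos]; return true; }
        }
    }

    double loadFactor() const { return double(used) / slots.size(); }
};
```

A driver would call insert, and on a false return (or loadFactor() >= 0.8) build a new table of size nextPrime(2 * M) and re-insert every stored word with its count.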
Statistics to be collected
The following statistics are to be collected and reported:
The project questions can be found at:
/afs/umbc.edu/users/y/p/ypeng/pub/CMSC341/Proj4/341-Spring05-p4_questions.txt
These questions are worth 10% of the project grade.
Your project will be tested using a variety of command lines, some of which will be invalid. Your project will also be tested using a variety of command files that test various conditions your code should handle.
Project grading is described in the Project
Policy handout.
Cheating in any form will not be tolerated. Please re-read the Project
Policy handout for further details on honesty in doing projects for
this
course.
Remember, the due date is firm. Submittals made after