.
CMSC 341 Fall 2008 - Project 5
CMSC 341 Fall 2008
Project 5
Assigned
| Monday, Nov 25, 2008
|
Due
| 11:59pm, Tuesday Dec 9, 2008
|
Updates
|
|
Objectives
- To use a priority queue in a real application
- To learn about Huffman Coding
- To implement a non-standard binary tree
Background
In today's world, many files are transmitted across the internet including text
files, image files, video files, and music files.
To send them efficiently and to store them efficiently requires that we
compress those files before we send them and then uncompress them when they
are received. Tools such as Winzip and gzip do this for us. Music players and
video players know how to uncompress their respective files. We are going
to investigate one method used for compressing text files.
Many of the files we deal with on a daily basis are text files
which contain ASCII characters. Each ASCII character is represented
by one byte (8 bits). When we wish to archive many old text files
or to send a text file across the internet, we prefer to "compress"
the text file. Compressing a text file means encoding the ASCII characters
in the file so that not all characters require 8 bits of storage. As a result,
the size of the compressed file is smaller than the original text file.
Of course, if we compress a text file, we must also be able to uncompress the
file to retrieve the original file.
Description
In this project you will be using a priority queue and a binary tree of your
design to implement a file compression/uncompression
algorithm called "Huffman Coding". Huffman coding is named for David Huffman
who developed this algorithm as a PhD student at MIT in 1952.
Before you begin, you should read and understand the
explanation of Huffman Coding.
Your program will read a text file and compress it using your implementation of the Huffman coding algorithm found in the explanation.
The compressed text will be written to a file. That compressed file
will be then be read back by your program and uncompressed.
The uncompressed text will then be written to a third file.
The uncompressed text file should of course match the original text file
character for charcter.
Your project will accept the three command line arguments listed below, in the order listed:
- The name of the text file to be compressed.
- The name of the compressed file (created from the compression of the text file).
If this file already exists, it should be deleted.
- The name of the uncompressed text file (created from uncompressing the compressed file).
If this file already exists, it should be deleted.
Project Notes, Hints, and Requirements
- As the Huffman coding document suggests, one way to select the trees
with the smallest weights is to use a Priority Queue.
For this purpose you can use the
java.util.PriorityQueue class.
- The compressed file is a stream of bits. You can use the
java.io.FileInputStream
and java.io.FileOutputStream
classes to read and write byte streams, but you'll need to add some logic
on top of them to handle reading and writing bits.
- The Huffman coding tree is based on a binary tree.
You may design and implement your own "Huffman Tree" class from scratch,
but the Binary Node and Binary Tree classes available from the textbook's author might make a good starting point.
You may change the author's BinaryNode/BinaryTree to be specific to this project.
A "Huffman Tree" does not have to be generic. Also note that in a Huffman Tree internal nodes and external nodes (leaves)
are different (yet your Huffman Tree can have just one kind of node).
- You may assume that the text file does not contain the "^" (carat - above the '6' on your
keyboard) character, so this character may be
used as the "pseudo-eof" character.
- Note that two passes of the text file are required. The first pass counts the frequency of each character (which are then used to build the Huffman Tree and create encoding table). The second pass is made to do the actual encoding.
- Since we will be compressing and uncompressing the file within the same program, you need
not write a file header in the compressed file.
- The Unix command od can be used to display a file in octal, hex, etc.
Check the man pages for more details.
- The Unix command diff can be used to compare the contents of text
files.
- Your program must provide the following required output. See the
sample output below.
- Echo the names of the files from the command line
- The table of characters and their frequency counts (the number of times they appear
in the text file), sorted alphabetically (i.e. by ASCII character values)
- The total number of characters in the text file.
- The table of characters and their bit encoding sorted alphabetically
(i.e. by ASCII character values)
Sample Output
This sample output is provided only as a format example.
It is not produced by any program and the data in the tables may not be consistent
nor even valid.
linux3[7]%Proj4 file1.txt file2.huff file3.txt
Compressing "file1.txt" into "file2.huff"
Uncompressing "file2.huff" into "file3.txt"
Character Frequencies
---------------------
Space 3
4 1
; 2
A 5
C 10
D 3
a 7
c 1
m 1
n 2
Total Characters: 35
Character Encodings
-------------------
Space 10
4 11
; 0100
A 0101
C 0110
D 0111
a 000
c 001
m 1000
n 1001
Submission
You must use CVS to check out your module in the Proj5 repository and
to check in your code. That must include an ant build.xml file and
javadoc. The grading scripts will issue commands like the following, so be
sure that your build.xml supports them:
ant
ant doc
ant -Dargs="arguments go here" run
See
the projects
page for more information on all of these topics.
If you don't submit a project correctly, you will not get credit for it.
Why throw away all that hard work you did to write the project? Check your
submittals. Make sure they work. Do this before the due date.
Project grading is described in the
Project
Policy handout.
Cheating in any form will not be tolerated. Please re-read the Project
Policy handout for further details on honesty in doing projects for this
course.
Remember, the due date is firm. Submittals made after midnight of the due
date will not be accepted. Do not submit any files after that time.