CMSC 201
Programming Project Five

XHTML Checker

Out: Monday 11/20/06
Due Date: Sunday 12/10/06, before midnight

The design document for this project, design5.txt, is due: Before Midnight, Sunday 12/3/06

The Objective

The purpose of this assignment is to give you practice with linked lists, queues, stacks and guarding header files. It will also act as an introduction to HTML, the markup language of the web. You'll be getting some more experience using command line arguments, and writing error messages to stderr. As always, you should continue to practice using top-down design and good implementation techniques like incremental programming.

The Background

Hypertext Markup Language (HTML) is a notation used to describe how to display the contents of web pages and to focus on how the page looks. Extensible Markup Language (XML) is a notation used to describe data on web pages and to focus on what the data mean. The newest HTML standard is XHTML, which combines both HTML and XML. Web browsers read HTML or XHTML to determine how web pages should be displayed. Other software can be used to read the same webpages and interpret the information contained there, if XML tags that program recognizes have been used in the document. HTML tags are enclosed in angle brackets (<>). In XHTML, tags generally appear in start-tag, end-tag combinations.

A start tag has the form: <name attributes>. The matching end tag contains that name preceeded by a "/". For example, a paragraph of text might be formatted as follows:

<p>This is a paragraph. </p>

XHTML also allows self-closing tags of the form: <name attributes />.

In a proper XHTML file, the tags will occur in properly nested pairs. Each start tag is matched by a corresponding end tag, and one structure may be embedded inside another, but they cannot overlap. This is like the matching of curly braces in a C program.

For example:
<p>...<ol>...</ol>...</p> is OK, but
<p>...<ol>...</p>...</ol> is not.
A self-closing tag acts as a self-contained start-end pair.

The Task

You should become familiar with both HTML and XHTML first. You can use the w3school.com website for both the HTML and XHTML tutorials, or search for other tutorials that you like better. This would be appropriate to do over the long Thanksgiving break. The lecture will cover stacks and queues when classes begin again on Monday/Tuesday, 11/27,11/28.

Your task is to write a program that checks html files, (web pages) to determine whether the embedded XHTML tags are balanced. Balanced tags are necessary for the file to be a valid xhtml file as explained above.

Your program will read from the file and print an analysis of the file to the screen. The sequence of tags in the file will be echoed to the output; any other text in the file is to be ignored. If there is a tag balancing error or the program reaches the end of the file while in the middle of a tag, the program must quit and print an error message. If the end of file is reached without any errors, print a message to that effect.

Program Requirements

Hint

You must use both a stack and a queue to solve this problem. To divide up the effort of this problem, I suggest that you consider the program as consisting of two phases.
  1. The first phase takes care of reading the file and finding all of the tags from the file. The first phase will print out the tags as it finds them and also save them into a queue of tags. If the input file ends in the middle of a tag, the program will report the error and end.
  2. The second phase of the program takes the queue of tags generated in phase one and analyzes the sequence of tags to make sure that they are properly balanced. This phase will make use of both a queue and a stack.

More Details

This section added 11/24/06

Copying the file

The data file used to create the sample output for this project is called test1.html. You may get a copy of this file from my pub directory.

Sample Run

Please note that the sample output only shows what is expected from your program for the test1.html file. This is not a test of the program at all, but just a sample for your clarification.

Submitting the Program

To submit your program, type the following command at the Unix prompt

submit cs201 Proj5 proj5.c followed by all the .c and .h files necessary for compilation

I would expect this list of files to be long. Probably:
proj5.c xhtml.c xhtml.h llist.c llist.h queue.c queue.h stack.c stack.h

To verify that your project was submitted, you can execute the following command at the Unix prompt. It will show all files that you submitted in a format similar to the Unix 'ls' command.

submitls cs201 Proj5