A Pearl of a Perl Programming Problem

<!--#set var="PAGE_TITLE" value="Homework 8" -->
<!--#include virtual="../../header.shtml"-->

<h1><!--#echo var="PAGE_TITLE" --></h1>
 
<p>
Out: 12/6, due by 11:59pm 12/12 (Tues)
</p>



<h1>A Pearl of a Perl Programming Problem </h1>

<p>
The solution to the following problem should be put into a file
called <code>hw8.pl</code>.
</p>

<p>
One of Perl's major advantages is that it combines a powerful
procedural language with equally powerful regular expression
functions.  This allows us to parse complex file formats like
XML documents.
</p>
<p>
For the homework, we will be trying to parse a very stripped-down version of XML
that we are calling "Baby-XML".
The general format for a Baby-XML document is as follows:
</p>
<p>
The first line is an XML header line of the following form:
</p>
<p>
<pre>
    &lt;?xml version="1.0"?&gt;
</pre>
This is followed by a series of sequential and/or nested
open-tag/close-tag pairs.  An open tag has the form:
<pre>
    &lt;some_tag_name&gt;
</pre>
It consists of an open angle bracket ('&lt;'), followed by a string
of alphanumeric characters and underscores ('_'), finishing with a
close angle bracket ('&gt;').  In our Baby-XML, no whitespace is
allowed inside the tag.
</p>
<p>
A matching close tag is identical to the open tag, except
that the tag name is prefixed with a forward slash character ('/');
for example, the following is an open tag and its matching close tag:
<pre>
    &lt;some_tag_name&gt;
    &lt;/some_tag_name&gt;
</pre>
</p>
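<p>
The open-tag and close-tag formats described above can be sketched as two regular expressions. This is only an illustration: the helper name <code>classify_tag</code> is hypothetical, and it assumes the string holds nothing but a single tag.
</p>

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the assignment): classify one tag string.
# Returns ('OPEN', name), ('CLOSE', name), or an empty list if it is not a tag.
sub classify_tag {
    my ($text) = @_;
    # Close tag: '<', '/', one or more word characters, '>'.
    return ('CLOSE', $1) if $text =~ m/^<\/(\w+)>$/;
    # Open tag: the same, without the '/'.
    return ('OPEN', $1) if $text =~ m/^<(\w+)>$/;
    return ();
}

for my $candidate ('<some_tag_name>', '</some_tag_name>', '<bad tag>') {
    my ($kind, $name) = classify_tag($candidate);
    if (defined $kind) {
        print "$candidate: $kind tag, name $name\n";
    } else {
        print "$candidate: not a Baby-XML tag\n";
    }
}
```

<p>
The third candidate is rejected because '\w' does not match a space, which mirrors the rule that no whitespace may appear inside a tag.
</p>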
<p>
In a properly formed XML file, open-/close-tag pairs can be sequential:
<pre>
    &lt;tag1&gt;
    &lt;/tag1&gt;
    &lt;tag2&gt;
    &lt;/tag2&gt;
</pre>
... or nested:
<pre>
    &lt;tag1&gt;
      &lt;tag2&gt;
      &lt;/tag2&gt;
    &lt;/tag1&gt;
</pre>

However, the following is not allowed, because the two different
tag types are neither sequential, nor properly nested:
<pre>
    &lt;tag1&gt;
    &lt;tag2&gt;
    &lt;/tag1&gt;
    &lt;/tag2&gt;
</pre>

Now, XML is not about just tags: the purpose of tags is to help identify
the content, which is the stuff between the open-tag and close-tag.
However, for this homework, we will be completely ignoring that stuff,
just focusing on parsing and interpreting the tags correctly.
</p>
<p>
Another important detail about XML is that other than inside a tag itself
(i.e., between the '&lt;' and '&gt;'), the user is free to insert content, spaces
and newlines wherever they please.  The following has several different
styles intermixed, and it is perfectly legal:
<pre>
    &lt;tag1&gt;
      &lt;tag2&gt;
        &lt;tag3a&gt;foobar&lt;/tag3a&gt;
                      &lt;tag3b&gt;
    More random content&lt;/tag3b&gt; Content that would be part of tag2
    &lt;/tag2&gt;&lt;/tag1&gt;
</pre>
However, this would be considered very sloppy XML formatting! Also, we
are not going to be using this fully-general format for the homework.
In our Baby-XML, we will insist on a line having just a single open tag,
or a single close tag, or one open-tag/close-tag pair.  So, the last
line (containing the two close tags <code>&lt;/tag2&gt;&lt;/tag1&gt;</code>)
would never occur in our Baby-XML. Other than that, though, the above <i>would</i>
be legal in both full XML and Baby-XML: tags
can be intermingled with actual content text anywhere. Some lines will
not have any tags.
</p>
<p>
<h2>Requirements and Specifications</h2>
This is what your program should do:
<ol>
<li>It should read all of its input from STDIN; i.e., you do not
need to open any files.  If you want to have your program read from
a file stored on disk, just use Unix I/O redirection.
</li>
<li>It should read the first line, and check that it matches the XML
header format described earlier.  If it does not, your program should
output the error message "Not a Baby-XML file" and exit
(by calling the "die" function, described in the lecture notes).
</li>
<li>It should then go into a loop reading lines from STDIN.
</li>
<li>For each line, it should use the power of Perl's regular expression-matching
to scan for an open tag, a close tag,
or an open-tag/close-tag pair together on a line.
(Hint: the order in which you do these checks will matter.)
If it finds any, it should do the following:
<ul>
<li>If the next match is an open tag, it should add that tag to a stack
of currently open tags.  It should print out the current line number,
followed by ": OPEN ", followed by the tag name (without the angle brackets).
This line should be indented, with the amount of indentation based upon
the level of nesting; each extra level of nesting should be indented by
an additional two spaces--like formatting nested blocks of code in C++ or Python.
(See the sample output below for an example.)
</li>
<li>If the match is a close tag, it should pop the topmost element off
the stack of open tags, and compare that to the close tag, to make sure it
matches (don't forget to exclude the close tag's '/' prefix when doing
the comparison).  If they match, print out a line similar to the line
output for the open tag, but with "CLOSE" instead of "OPEN" in the message.
<p>
If the close tag doesn't match the current top open tag on the stack,
use "die" to print an error message that includes
the line number, the expected closing tag, and the actual non-matching
closing tag, and exit.
</p>
</li>
<li>If the match is an open-tag/close-tag pair, it should handle
it as if it had seen an open followed by a close, as above, except you
obviously don't have to push-then-pop it. You should print out both an OPEN
and a CLOSE message as above, though. Again, if the close tag
doesn't match the open tag, handle the error as above.
</li>
</ul>
<li>At the end of the input, check to make sure the open tags stack is empty,
i.e., that every open tag has been closed.  If not, again, use "die" to
print out the list of still-open tags, then exit.
</ol>
You will also need to keep track of what line
of the file you are on, to output this with the messages;
don't forget to include the XML header line in the count.
</p>
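<p>
The indentation rule in the requirements above (two extra spaces per level of nesting) can be sketched with Perl's string repetition operator 'x'. The helper name <code>format_event</code> and its arguments are hypothetical; how you compute the nesting depth is up to your own design.
</p>

```perl
use strict;
use warnings;

# Hypothetical helper: build one output line for a tag event.
# $depth is the number of tags that enclose this one (0 = top level).
sub format_event {
    my ($line_no, $kind, $name, $depth) = @_;
    # The 'x' operator repeats a string: two spaces per nesting level.
    return ('  ' x $depth) . "$line_no: $kind $name";
}

print format_event(2, 'OPEN',  'student', 0), "\n";
print format_event(3, 'OPEN',  'name',    1), "\n";
print format_event(3, 'CLOSE', 'name',    1), "\n";
```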
<p>
That's all there is to it.  It should be relatively straightforward--
give it a shot!
</p>
<p>
<h2>Additional Perl Skills You Will Need</h2>
You will need a couple of slightly more advanced regular expression skills to
do this assignment.
When you do regular expression matching, the pattern can contain
parenthesized subexpressions.  The corresponding matching parts can be
accessed afterwards as $1, $2, $3, etc.  So, if you did the following
(recall that '\w' matches any alphanumeric character or underscore):
<pre>
    # Single quotes: the '@' characters are kept literally, not interpolated.
    $a = 'Here @123@ I am';
    if ($a =~ m/@(#?)(\w+)@/) {
      print "Found a match: prefix '" . $1 . "', string '" . $2 . "'\n";
    }

    $a = 'Here @#abc@ I am';
    if ($a =~ m/@(#?)(\w+)@/) {
      print "Found a match: prefix '" . $1 . "', string '" . $2 . "'\n";
    }
</pre>
it would print out:
<pre>
    Found a match: prefix '', string '123'
    Found a match: prefix '#', string 'abc'
</pre>
The first parenthesized part--"<code>(#?)</code>"--matches an optional '#',
and the second parenthesized part--"<code>(\w+)</code>"--matches the
alphanumeric part.  Since the bracketing '@'s are part of the
regular expression pattern, but not inside the parentheses, they are
<strong>not</strong> part of the selections $1 or $2.  Also, note that
the first line of output has an empty string for the prefix ($1),
because the optional '#' was not seen.
</p>
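<p>
As a complement to $1 and $2, a match evaluated in list context returns its captured parts directly, which lets you give them names right away. The sketch below reuses the pattern from the example above; the helper name <code>split_marked</code> is hypothetical.
</p>

```perl
use strict;
use warnings;

# Hypothetical helper: return the captured prefix and string, or an empty list.
sub split_marked {
    my ($text) = @_;
    # In list context, a successful match returns its captures ($1, $2, ...).
    my ($prefix, $string) = $text =~ m/@(#?)(\w+)@/;
    return defined $string ? ($prefix, $string) : ();
}

# Single-quoted strings, so '@' is never taken as the start of an array name.
for my $text ('Here @123@ I am', 'Here @#abc@ I am', 'No markers here') {
    if (my ($prefix, $string) = split_marked($text)) {
        print "prefix '${prefix}', string '${string}'\n";
    } else {
        print "no match in: $text\n";
    }
}
```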
<p>


<h2>Sample Run</h2>
With the following input (available as a link here:
<a href="hw8data.xml">hw8data.xml</a>):
<blockquote>
<pre>
&lt;?xml version="1.0"?&gt;
&lt;student&gt;
  &lt;name&gt;John&lt;/name&gt;
  &lt;age&gt;18&lt;/age&gt;
  &lt;address&gt;123 Elm St.&lt;/address&gt;
  &lt;mother&gt;
    &lt;name&gt;Jane&lt;/name&gt;
    &lt;age&gt;48&lt;/age&gt;
  &lt;/mother&gt;
&lt;/student&gt;
</pre>
</blockquote>
Your program should output:
<blockquote>
<pre>
2: OPEN student
  3: OPEN name
  3: CLOSE name
  4: OPEN age
  4: CLOSE age
  5: OPEN address
  5: CLOSE address
  6: OPEN mother
    7: OPEN name
    7: CLOSE name
    8: OPEN age
    8: CLOSE age
  9: CLOSE mother
10: CLOSE student
</pre>
</blockquote>
Whereas, the input:
<blockquote>
<pre>
&lt;?xml version="1.0"?&gt;
&lt;student&gt;
  &lt;name&gt;John
&lt;/student&gt;
</pre>
</blockquote>
would produce the output:
<blockquote>
<pre>
2: OPEN student
  3: OPEN name
4: close tag "student" doesn't match current open tag "name".
</pre>
</blockquote>
And the input:
<blockquote>
<pre>
&lt;?xml version="1.0"?&gt;
&lt;student&gt;
  &lt;name&gt;John
</pre>
</blockquote>
will output:
<blockquote>
<pre>
2: OPEN student
  3: OPEN name
End of file with still-open tags: student, name.
</pre>
</blockquote>

<h2>Hints</h2>
<h4>Hint 1:</h4>
Don't forget that metacharacters (like '.', '+', '*', and '?', among others)
have special meaning inside regular expressions, and must be quoted with
a preceding '\' to indicate you want the literal character.
For example, the regular expression command "<code>m/.*?/</code>" would search
for 0 or more repeats (the '*') of any character (the '.'), matching as few
characters as possible (a '?' right after a quantifier makes it non-greedy).
If you wanted to match the exact 3-character
literal substring "<code>.*?</code>", you would have to write your
regular expression  as: "<code>m/\.\*\?/</code>".
You will need this trick when you are testing for the XML header line,
which contains some of these metacharacters.
</p>
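<p>
The quoting described in this hint can be tried out with a short sketch. It also shows Perl's built-in \Q...\E, which adds the backslashes for you; the helper name <code>contains_literal</code> is hypothetical.
</p>

```perl
use strict;
use warnings;

# Hypothetical helper: does $text contain $literal as an exact substring?
# \Q...\E quotes every metacharacter in between, like adding the '\' by hand.
sub contains_literal {
    my ($text, $literal) = @_;
    return $text =~ m/\Q$literal\E/ ? 1 : 0;
}

# Hand-escaped pattern from the hint: only the literal ".*?" matches.
print "hand-escaped: ", ('a.*?b' =~ m/\.\*\?/ ? "match" : "no match"), "\n";
print "hand-escaped: ", ('abc'   =~ m/\.\*\?/ ? "match" : "no match"), "\n";

# Same checks through the helper.
print "helper: ", contains_literal('a.*?b', '.*?'), "\n";
print "helper: ", contains_literal('abc',   '.*?'), "\n";
```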
<p>
Also note that searching for a '/' inside a regular expression needs care.
The '/' is not a metacharacter, but because it is usually used as the begin
and end indicator for your regular expression, an unquoted '/' ends the
pattern early.
You can again use the quote character ('\') to solve this problem;
e.g.: "<code>m/mydir\/myfile/</code>" will match the string
"<code>mydir/myfile</code>".
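<p>
As an alternative to quoting the '/', Perl lets you pick other delimiters for a match, such as <code>m{...}</code>. The sketch below shows both forms side by side; the helper name <code>split_path</code> is hypothetical.
</p>

```perl
use strict;
use warnings;

# Hypothetical helper: split "dir/file" into its two parts.
# With m{...} as delimiters, the '/' inside the pattern needs no backslash.
sub split_path {
    my ($path) = @_;
    return $path =~ m{^(\w+)/(\w+)$} ? ($1, $2) : ();
}

my ($dir, $file) = split_path('mydir/myfile');
print "directory '" . $dir . "', file '" . $file . "'\n";

# The escaped form from the hint matches the same string.
print "escaped form also matches\n" if 'mydir/myfile' =~ m/mydir\/myfile/;
```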
<h4>Hint 2:</h4>
Beware of using regular expressions that might greedily match multiple tags
on the same line as one giant tag.  See the last slide in the regular expressions
section in the Perl lecture notes for strategies.
</p>
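<p>
To see the greedy-matching problem from this hint in action, the sketch below runs three patterns over the same line. These are general Perl techniques and are not necessarily the ones given in the lecture notes; the helper name <code>first_capture</code> is hypothetical.
</p>

```perl
use strict;
use warnings;

# Hypothetical helper: return what the first capture group of $pattern
# grabs from $line (undef if there is no match).
sub first_capture {
    my ($line, $pattern) = @_;
    my ($captured) = $line =~ $pattern;
    return $captured;
}

my $line = '<name>John</name>';

# Greedy: '.+' takes as much as it can, up to the LAST '>' on the line.
print "greedy:     ", first_capture($line, qr/<(.+)>/),  "\n";  # name>John</name
# Non-greedy: '.+?' stops at the first '>' that completes the match.
print "non-greedy: ", first_capture($line, qr/<(.+?)>/), "\n";  # name
# Narrow character class: '\w+' cannot run past a '>' at all.
print "word chars: ", first_capture($line, qr/<(\w+)>/), "\n";  # name
```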


<h4>Hint 3:</h4>
Perl already has stack operation functions, as mentioned in the lecture
notes, so take advantage of those.
</p>
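<p>
A minimal sketch of the built-in stack operations this hint refers to, using an ordinary array. The variable names are only for illustration.
</p>

```perl
use strict;
use warnings;

my @open_tags;                       # an ordinary array used as a stack

push @open_tags, 'student';          # push adds to the end (the "top")
push @open_tags, 'name';
print "depth: ", scalar(@open_tags), "\n";             # depth: 2
print "top:   $open_tags[-1]\n";                       # top:   name

my $top = pop @open_tags;            # pop removes and returns the top
print "popped: $top\n";                                # popped: name
print "still open: ", join(', ', @open_tags), "\n";    # still open: student
```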


<h2>What to hand in</h2>
Submit the file hw8.pl to the "hw8" submit directory on GL.
</p>


<!--#include virtual="../../footer.shtml"-->