Homework 8 (Optional)

Extra Perl

out:5/14, due:5/19 (Sun.) at 11:59pm

(Note that this assignment is for people who have a previous homework that they received a low score for, and wish to replace. It is not an extra credit assignment that will add to your grade--it's grade will replace part of your lowest-scoring assignment, up to a maximum of ~4%, 50% of the average value of the regular assignments.)


One of Perl's major advantages is that it combines a powerful procedural language with equally powerful regular expression functions. This allows us to parse complex file formats like XML documents.

We will be using a simplified XML format. It is a form that is fully XML-compliant, but is restricted to make your parsing job easier. (In other words, all the XML files your program will be asked to handle will be legal XML, but we will not expect you to be able to handle all real XML). We will call the restricted version for the homework "Baby-XML".

The general format for a Baby-XML document is as follows:
The first line is an XML header line of the following form:

    <?xml version="1.0"?>
This is followed by a series of sequential and/or nested open-tag/ close-tag pairs. An open tag has the form:
    <some_tag_name>
It consists of an open angle bracket ('<'), followed by a string of alphanumeric characters and underscores ('_'), finishing with a close angle bracket ('>'). In our Baby-XML, no whitespaces are allowed inside the tag.

A matching close tag is almost identical to the open tag, except that the tag name is prefixed with a forward slash character ('/'); for example, the following is an open tag its matching close tag:

    <some_tag_name>
    </some_tag_name>

In a properly formed XML file, open-/close-tag pairs can be sequential:

    <tag1>
    </tag1>
    <tag2>
    </tag2>
... or nested:
    <tag1>
      <tag2>
      </tag2>
    <tag1>
However, the following is not allowed, because the two different tag types are neither sequential, nor properly nested:
    <tag1>
    <tag2>
    </tag1>
    </tag2>
Now, XML is not about just tags: the purpose of tags is to help identify the content, which is the stuff between the open-tag and close-tag. However, for this homework, we will be completely ignoring that stuff, just focusing on parsing and interpreting the tags correctly.

Another important detail about XML is that other than inside a tag itself (i.e., between the '<' and '>'), the user is free to insert spaces and newlines wherever they please. The following has several different styles intermixed, and it is perfectly legal:

    <tag1>
      <tag2>
        <tag3a>foobar</tag3a>
                      <tag3b>
    More random content</tag3b>
    </tag2></tag1>
However, this would be consider very sloppy XML formatting!

Requirements and Specifications

This is what your program should do:
  1. It should read all of its input from STDIN; i.e., you do not need to open any files. If you want to have your program read from a file stored on disk, just use Unix I/O redirection.
  2. It should read the first line, and check that it fits the XML header format described earlier. If it does not, your program should output the error message "Not a Baby-XML file" and exit (by calling the "die" function, described in the lecture notes).
  3. It should then go into a loop reading lines from STDIN.
  4. For each line, it should scan for all open or close tags (there might be several per line), and do the following for each:
    • If the next match is an open tag, it should add that tag to a stack of currently open tags. It should print out the current line number, followed by ": OPEN ", followed by the tag name (without the angle brackets). This line should be indented, with the amount of indentation based upon the level of nesting; each extra level of nesting should be indented by an additional two spaces--like formatting nested blocks of code in Java. (See the sample output below for an example.)
    • If the next match is a close tag, it should pop the topmost element off the stack of open tags, and compare that to the close tag, to make sure it matches (don't forget to exclude the close tag's '/' prefix when doing the comparison). If they match, print out a line similar to the line output for the open tag, but with "CLOSE" instead of "OPEN" in the message.

      If the close tag didn't match the top tag from the stack, use "die" to print an error message that includes the line number, the expected closing tag, and the actual non-matching closing tag, and exit.

  5. At the end of the input, check to make sure the open tags stack is empty, i.e., that every open tag has been closed. If not, again, use "die" to print out the list of still-open tags, then exit.
It should be obvious that you will also need to keep track of what line of the file you are on, to output this with the messages; don't forget to include the XML header line in the count.

That's all there is to it. It should be relatively straightforward-- give it a shot!

Additional Perl Skills You Will Need

You will need a couple of slightly more advanced regular expression skills to do this assignment. Remember that XML is free-form with respect to line breaks, so a single line can contain multiple open and close tags. So, you will need to know how to sequentially scan a line for multiple appearances of a given pattern. You do this by using the 'g' (for "global") modifier to a regular expression command. The 'g' says to find every instance of a pattern in a string. Perl does this by maintaining the index of where the most recent match was found, and continuing on from there. So, if you tried the following:
    $a = "foo fee 123 hello there f fum";
    while ($a =~ m/f[a-z]+/g) {
      print "Found another match";
    }
it would print out the "Found another match" message 3 times.

This is pretty powerful, but of limited use because Perl is only telling you that it found a matching substring, and not what that matching substring was. For that, you need the following trick. When you do regular expression matching, the pattern can contain parenthesized subexpressions. The corresponding matching parts can be accessed afterwards as $1, $2, $3, etc.. So, if you did the following (recall that '\w' matches all alphanumeric characters):

    $a = "Here @123@ I @#abc@ am @@";
    while ($a =~ m/@(#?)(\w+)@/g) {
      print "Found a match: prefix " . $1 . ", string " . $1 . "\n";
    }
it would print out:
    Found a match: prefix , string 123
    Found a match: prefix #, string abc
The first parenthesized part--"(#?)"--matches an optional '#', and the second parenthesized part--"(\w+)"--matches the alphanumeric part. Since the bracketing '@'s are part of the regular expression pattern, but not inside the parentheses, they are not part of the selections $1 or $2. Also, note that the first line of output has an empty string for the prefix ($1), because the optional '#' was not seen.

Sample Run

With the following input (available as a link here: hw8data.xml):
<?xml version="1.0"?>
<student>
  <name>John</name>
  <age>18</age>
  <address>123 Elm St.</address>
  <mother>
    <name>Jane</name>
    <age>48</age>
  </mother>
</student>
Your program should output:
2: OPEN student
  3: OPEN name
  3: CLOSE name
  4: OPEN age
  4: CLOSE age
  5: OPEN address
  5: CLOSE address
  6: OPEN mother
    7: OPEN name
    7: CLOSE name
    8: OPEN age
    8: CLOSE age
  9: CLOSE mother
10: CLOSE student
Whereas, the input:
<?xml version="1.0"?>
<student>
  <name>John
</student>
would produce the output:
2: OPEN student
  3: OPEN name
4: close tag "student" doesn't match current open tag "name".
And the input:
<?xml version="1.0"?>
<student>
  <name>John
will output:
2: OPEN student
  3: OPEN name
End of file with still-open tags: student, name.

Hints

Hint 1:

It would probably be easiest if you handled searching for tags by creating a single regular expression that matches both open and close tags, which uses a selection pattern (the parentheses explained earlier) to pull out just the part inside the angle brackets, but not including the brackets themselves. Then, in nested code, you can further test to distinguish between open and close tags. by using another regular expression match to test whether the selected substring starts with a '/' or not.

Hint 2:

Don't forget that metacharacters (like '.', '+', '*', and '?', among others) have special meaning inside regular expressions, and must be quoted with a preceeding '\' to indicate you want the literal character. For example, the regular expression command "m/.*?/" would search for an optional occurrence (the '?') of 0 or more repeats (the '*') of any character (the '.'). If you wanted to match the exact 3-character literal substring ".*?", you would have to write your regular expression as: "m/\.\*\?/". You will need this trick when you are testing for the XML header line, which contains some of these metacharacters.

Also note that if you want to search for a '/' inside a regular expression, it will cause problems. The '/' is not a metacharacter, but because that character is usually used as the begin and end indicator for your regular expression, it will cause problems. You can again use the quote character ('\') to solve this problem, too; e.g.: "m/mydir\/myfile/" will match the string "mydir/myfile".

Hint 3:

You will need a single regular expression that matches both opening and and closing tags, but that does not greedily match multiple tags on the same line all at once. See the last slide in the regular expressions section in the Perl lecture notes for strategies.

Hint 4:

Perl already has stack operation functions, as mentioned in the lecture notes, so take advantage of those.

What to submit

Name your file "hw8.pl", and submit it to the "hw8" project.