Out: 12/6, due by 11:59pm 12/12 (Tues)
The solution to the following problem should be put into a file
called hw8.pl
.
One of Perl's major advantages is that it combines a powerful procedural language with equally powerful regular expression functions. This allows us to parse complex file formats like XML documents.
For the homework, We will be trying to parse a very stripped-down version of XML that we are calling "Baby-XML". The general format for a Baby-XML document is as follows:
The first line is an XML header line of the following form:
<?xml version="1.0"?>This is followed by a series of sequential and/or nested open-tag/ close-tag pairs. An open tag has the form:
<some_tag_name>It consists of an open angle bracket ('<'), followed by a string of alphanumeric characters and underscores ('_'), finishing with a close angle bracket ('>'). In our Baby-XML, no whitespaces are allowed inside the tag.
A matching close tag is identical to the open tag, except that the tag name is prefixed with a forward slash character ('/'); for example, the following is an open tag and its matching close tag:
<some_tag_name> </some_tag_name>
In a properly formed XML file, open-/close-tag pairs can be sequential:
<tag1> </tag1> <tag2> </tag2>... or nested:
<tag1> <tag2> </tag2> <tag1>However, the following is not allowed, because the two different tag types are neither sequential, nor properly nested:
<tag1> <tag2> </tag1> </tag2>Now, XML is not about just tags: the purpose of tags is to help identify the content, which is the stuff between the open-tag and close-tag. However, for this homework, we will be completely ignoring that stuff, just focusing on parsing and interpreting the tags correctly.
Another important detail about XML is that other than inside a tag itself (i.e., between the '<' and '>'), the user is free to insert content, spaces and newlines wherever they please. The following has several different styles intermixed, and it is perfectly legal:
<tag1> <tag2> <tag3a>foobar</tag3a> <tag3b> More random content</tag3b> Content that would be part of tag2 </tag2></tag1>However, this would be consider very sloppy XML formatting! Also, we are not going to be using this fully-general format for the homework. In our Baby-XML, we will insist on a line having just a single open tag, or a single close tag, or one open-tag/close-tag pair. So, the last line (containing the two close tags
</tag2></tag1>
)
would never occur in our Baby-XML. Other than that, though, the above would
be legal in both full XML and Baby-XML: tags
can be intermingled with actual content text anywhere. Some lines will
not have any tags.
If the close tag doesn't match the curent top open tag on the stack, use "die" to print an error message that includes the line number, the expected closing tag, and the actual non-matching closing tag, and exit.
That's all there is to it. It should be relatively straightforward-- give it a shot!
$a = "Here @123@ I am"; if ($a =~ m/@(#?)(\w+)@/) { print "Found a match: prefix '" . $1 . "', string '" . $2 . "'\n"; $a = "Here @#abc@ I am"; if ($a =~ m/@(#?)(\w+)@/) { print "Found a match: prefix " . $1 . ", string " . $2 . "\n"; print "Found a match: prefix '" . $1 . "', string '" . $2 . "'\n"; }it would print out:
Found a match: prefix '', string '123' Found a match: prefix '#', string 'abc'The first parenthesized part--"
(#?)
"--matches an optional '#',
and the second parenthesized part--"(\w+)
"--matches the
alphanumeric part. Since the bracketing '@'s are part of the
regular expression pattern, but not inside the parentheses, they are
not part of the selections $1 or $2. Also, note that
the first line of output has an empty string for the prefix ($1),
because the optional '#' was not seen.
Your program should output:<?xml version="1.0"?> <student> <name>John</name> <age>18</age> <address>123 Elm St.</address> <mother> <name>Jane</name> <age>48</age> </mother> </student>
Whereas, the input:2: OPEN student 3: OPEN name 3: CLOSE name 4: OPEN age 4: CLOSE age 5: OPEN address 5: CLOSE address 6: OPEN mother 7: OPEN name 7: CLOSE name 8: OPEN age 8: CLOSE age 9: CLOSE mother 10: CLOSE student
would produce the output:<?xml version="1.0"?> <student> <name>John </student>
And the input:2: OPEN student 3: OPEN name 4: close tag "student" doesn't match current open tag "name".
will output:<?xml version="1.0"?> <student> <name>John
2: OPEN student 3: OPEN name End of file with still-open tags: student, name.
m/.*?/
" would search
for an optional occurrence (the '?') of 0 or more repeats (the '*') of
any character (the '.'). If you wanted to match the exact 3-character
literal substring ".*?
", you would have to write your
regular expression as: "m/\.\*\?/
".
You will need this trick when you are testing for the XML header line,
which contains some of these metacharacters.
Also note that if you want to search for a '/' inside a regular
expression, it will cause problems. The '/' is not a metacharacter,
but because that character is usually used as the begin and end
indicator for your regular expression, it will cause problems.
You can again use the quote character ('\') to solve this problem, too;
e.g.: "m/mydir\/myfile/
" will match the string
"mydir/myfile
".