CMSC 201
Programming Project Four
Name That Language
Out: Sunday 4/14/02
Due Date: Sunday 4/28/02, before midnight
The design document for this project, design4.txt, is due before midnight,
Sunday 4/21/02
The Objective
The purpose of this assignment is to give you practice with strings and chars,
string and char library functions, sorting, using arrays, allocating memory
dynamically, reading files, and using command-line arguments.
The Background
Statistics about documents are used for many purposes by computer scientists.
A tally of specific word occurrences within documents can be useful for
determining the similarity of documents. This is helpful for document retrieval
and is used by modern search engines for the Internet. Analysis of words
used within a document can even determine authorship. The number of
occurrences of letters can be useful, and has been well studied in the area of
cryptology. The percentages of individual letters that occur within a
document are language specific. This fact can even help determine the
language of an encoded message.
There is a famous island monastery/fortress, known as Mont Saint Michel, located
on the English Channel in Normandy, France. As a popular tourist site,
it attracts people from all over the world who of course speak many different
languages. To handle this diverse population of visitors, pamphlets detailing
the history of Mont Saint Michel are published in English, French, Spanish,
Italian and German. The English version of the pamphlet is provided as the
sample run.
The Task
The good news is that Ms Bogar was fortunate enough to have the chance to
visit Mont St. Michel and brought back several pamphlets, each in a different
language. The bad news is that Ms Bogar can't tell which language is which.
Your job is to help Ms Bogar identify which pamphlet is in which language.
You will be given four files which contain the text from the pamphlet
available at Mont Saint Michel. Your task is to write a program that takes
a command-line argument which is the name of the file to be read, reads the
text, analyzes it and determines the language in which the text was written --
Italian, German, Spanish or French. Each file will be a different language
and each file is entirely in that same language. You will determine which
language the file is written in based on how frequently certain letters
appear in the text. The following table lists the five most frequently used
letters in the French, German, Italian and Spanish versions of the document,
from most frequent to least frequent.
    | French | German | Italian | Spanish |
1st | e      | e      | a       | e       |
2nd | a      | n      | e       | a       |
3rd | s      | i      | i       | o       |
4th | i      | r      | l       | i       |
5th | r      | t      | o       | l       |
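One way to use the table above is to store each column as a five-letter string and compare it with the five most frequent letters your program found. The sketch below assumes an exact match against one column; the function name GuessLanguage and the "unknown" result are hypothetical, not part of the assignment.

```c
#include <string.h>

/* Top-five letters per language, copied column by column from the table. */
static const char *LANGUAGES[] = { "French", "German", "Italian", "Spanish" };
static const char *TOP_FIVE[]  = { "easir",  "enirt",  "aeilo",   "eaoil"   };

/* Hypothetical helper: topFive holds the file's five most frequent
 * letters in lowercase, most frequent first. Returns the name of the
 * language whose column matches exactly, or "unknown" if none does. */
const char *GuessLanguage(const char *topFive)
{
    size_t i;

    for (i = 0; i < sizeof(TOP_FIVE) / sizeof(TOP_FIVE[0]); i++) {
        if (strncmp(topFive, TOP_FIVE[i], 5) == 0) {
            return LANGUAGES[i];
        }
    }
    return "unknown";
}
```

For example, GuessLanguage("enirt") returns "German". The English sample run's top five letters (e, t, a, o, i) match none of the four columns.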
Program Requirements
- Your program must take the name of the text file to analyze as a
command-line argument. The four files that contain the pamphlet text are
located on the GL system in the directory
/afs/umbc.edu/users/s/b/sbogar1/pub
The files are named pamphlet1.txt, pamphlet2.txt, pamphlet3.txt
and pamphlet4.txt.
- Your program must output the following information. See the sample run.
  - the total number of characters in the file
  - the number of whitespace characters in the file
  - the number of digits in the file
  - the number of punctuation characters in the file
  - the number of words in the file
  - the frequency of each letter, showing the letter and the
    number of times it occurred in descending order by number of
    occurrences
- Your program must include a function with prototype
char* ReadString (char* filename, int* lengthPtr);
that reads the entire text file into dynamically allocated memory,
returns a pointer to the text as its return value, and passes the
length back to the caller through the pointer parameter lengthPtr.
- You must use separate compilation for this project.
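The ReadString requirement above could be met along the following lines. This is only a sketch: it counts the characters in one pass, allocates, then rewinds and reads again. Returning NULL when the file cannot be opened or memory cannot be allocated is an assumed convention, not something the assignment specifies.

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads the whole file into a malloc'd, NUL-terminated string.
 * Returns NULL if the file cannot be opened or memory runs out
 * (an assumed error convention). The caller must free() the result. */
char *ReadString(char *filename, int *lengthPtr)
{
    FILE *ifp;
    char *text;
    int ch;
    int count = 0;
    int i = 0;

    ifp = fopen(filename, "r");
    if (ifp == NULL) {
        return NULL;
    }

    /* First pass: count the characters so we know how much to allocate. */
    while (fgetc(ifp) != EOF) {
        count++;
    }

    /* One extra byte for the terminating '\0'. */
    text = (char *) malloc((count + 1) * sizeof(char));
    if (text == NULL) {
        fclose(ifp);
        return NULL;
    }

    /* Second pass: go back to the start and copy the characters. */
    rewind(ifp);
    while (i < count && (ch = fgetc(ifp)) != EOF) {
        text[i] = (char) ch;
        i++;
    }
    text[i] = '\0';

    fclose(ifp);
    *lengthPtr = i;    /* length does not include the '\0' */
    return text;
}
```

A caller would typically pass argv[1] as the filename and check the return value for NULL before using the text.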
Notes:
- special characters used in some languages (e.g. the umlaut in German,
or the accent in French) are not present in the files.
- a "word" is a sequence of one or more non-whitespace characters surrounded
by whitespace
- "whitespace" is any character for which the function isspace()
returns true. Note that this is not limited to spaces, but also includes
tabs, newlines, carriage returns, vertical tabs and form feeds.
- a "punctuation mark" is any character for which the function ispunct()
returns true.
- DO NOT distinguish between uppercase and lowercase letters.
An uppercase 'A' should be counted the same as a lowercase 'a'.
- You MUST use the ctype library functions such as isspace(),
isdigit(), and others to classify characters. See the Unix man pages
for more information on these functions.
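The definitions in these notes can be applied in a single pass over the text. The sketch below is one possible arrangement; the TALLY structure and the function name CountChars are hypothetical. A word is counted each time a non-whitespace character follows whitespace (or starts the text).

```c
#include <ctype.h>

/* Hypothetical container for the tallies the program must report. */
typedef struct {
    int spaces;
    int digits;
    int puncts;
    int words;
    int letters[26];   /* letters[0] counts 'a' and 'A', letters[25] 'z' and 'Z' */
} TALLY;

/* Classifies each of the length characters in text using ctype functions. */
void CountChars(const char *text, int length, TALLY *t)
{
    int i;
    int inWord = 0;    /* 1 while inside a run of non-whitespace characters */

    t->spaces = t->digits = t->puncts = t->words = 0;
    for (i = 0; i < 26; i++) {
        t->letters[i] = 0;
    }

    for (i = 0; i < length; i++) {
        /* ctype functions expect an unsigned char value (or EOF) */
        unsigned char ch = (unsigned char) text[i];

        if (isspace(ch)) {
            t->spaces++;
            inWord = 0;
            continue;
        }
        if (!inWord) {             /* first character of a new word */
            t->words++;
            inWord = 1;
        }
        if (isdigit(ch)) {
            t->digits++;
        } else if (ispunct(ch)) {
            t->puncts++;
        } else if (isalpha(ch)) {
            int index = tolower(ch) - 'a';   /* 'A' and 'a' share a slot */
            if (index >= 0 && index < 26) {
                t->letters[index]++;
            }
        }
    }
}
```

The total number of characters is simply the length returned through ReadString's lengthPtr, so it is not recomputed here.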
Program Hints
- You will be reading the entire contents of the file into a single string.
You must dynamically allocate enough space to hold that string, but you don't
know how big to make it.
  - You could open the file and read one character at a time until you hit EOF,
  keeping a count of the characters as you go. You can get back to the beginning
  of the file by using rewind().
- Create a small text file of your own. Carefully count
the frequency of all types of characters on which you are supposed to
report. Test your program on the small file first. Your program
output should match the numbers you counted.
- Use 2 arrays to count and sort the frequency of each letter in the file:
one array for the counts and another for the letters themselves. When sorting
the array of counts, be sure to rearrange the array of letters so that the
counts and their corresponding letters still have the same index. Feel free
to use the sorting code from the lecture notes on the web (with required
modification to handle both arrays).
- Develop your program incrementally.
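The two-array hint above might look like the sketch below. It uses a selection sort as a stand-in, since the sorting code in the lecture notes may differ, and the function name SortByCount is hypothetical. The key point is that every swap in the counts array is repeated in the letters array.

```c
/* Sorts counts[] into descending order and applies every swap to
 * letters[] as well, so counts[i] always belongs to letters[i]. */
void SortByCount(int counts[], char letters[], int size)
{
    int i, j, largest, tempCount;
    char tempLetter;

    for (i = 0; i < size - 1; i++) {
        /* find the index of the largest remaining count */
        largest = i;
        for (j = i + 1; j < size; j++) {
            if (counts[j] > counts[largest]) {
                largest = j;
            }
        }

        /* swap the counts ... */
        tempCount = counts[i];
        counts[i] = counts[largest];
        counts[largest] = tempCount;

        /* ... and swap the letters at the same two positions */
        tempLetter = letters[i];
        letters[i] = letters[largest];
        letters[largest] = tempLetter;
    }
}
```

With counts {3, 9, 1, 7} and letters {'a', 'b', 'c', 'd'}, the arrays end up as {9, 7, 3, 1} and {'b', 'd', 'a', 'c'}.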
Sample Run
This sample run used the English pamphlet as the input data.
linux1[20] a.out english.txt
The string is :
Mont Saint Michel
History- The Wonder of the West
Mont Saint Michel is one of the medieval West's major legacies of its sacred
history. Dedicated to Saint Michael in 708 following some miraculous
visitations, in 966 it was entrusted by the Duke of Normandy to the Benedictine
monks who made the island one of the most important places of pilgrimage in the
Christian world, by building on the legend of the founding bishop, Aubert. The
monks set about a superhuman construction program with work continuing without
interruption from the year 1000 to the beginning of the 16th century.
Thus the visitor will gain a comprehensive picture of medieval architecture as
he explores its many buildings, squeezed onto the tip of the rock. Mont Saint
Michel was also an impregnable fortress. Its heroic resistance to the English
during the Hundred Years War earned it a symbolic place in the national psyche.
The ramparts enclosing the village and the abbey fortifications bear witness to
this powerful role. After the conversion of the abbey into a prison, which
remained from the revolution until 1863, the monastery, designated an historic
monument in 1874, underwent major restoration work. These works enable
visitors to enjoy once again the splendor of a building that men in the Middle
Ages saw as the image of Holy Jerusalem on earth.
-----
Follow the Guide
After entering the Guard Room, the fortified entrance to the abbey, the visitor
climbs the Ceremonial staircase, which is the formal entrance to the abbey
church. The path then passes between the church, on the right, and the abbey
lodgings, on the left, linked by hanging passages. These rooms, constructed
from the late 14th to the early 16th centuries, were the official residence of
the abbots.
The west terrace is formed from the primitive square in front of the abbey
church and the first bays of the nave destroyed in the 18th century after a
fire. The neoclassical facade was rebuilt in 1780. From here there is a
panoramic view over the bay, from le Grouin point to Champeaux point. To the
west is Mont Dol and to the north, the small island of Tombelaine. The terrace
also offers an excellent view of the neo-Gothic spire of the bell tower
constructed in 1897 and the embossed copper and gold leaf statue of the
archangel. The abbey church is built on the tip of the rock, on a platform
consisting of four crypts which surround it and support the four arms of the
cross. The elevation of the nave, typical of the Norman Romanesque style, is
on three levels- arcades, triforium galleries and clerestory. The framework
of the nave was clad in paneled barrel vaulting, as were most Romanesque
churches in Normandy. The Romanesque chancel, which fell down in 1421, was
rebuilt between 1446 and 1521 in the flamboyant Gothic style.
The visit continues to the north of the church, with the Gothic monastery
known as the "Merveille", the Wonder, because of the outstanding nature of the
building. It was constructed after the fire of 1204, which devastated the
abbey. The cloister looks out over the sea, to the north, and gives access to
the refectory, the kitchen, church, dormitory, chartulary and various
staircases leading to the lower levels. Around the garden, restored in 1965,
the design of the cloister colonnades, where the height of the small columns
is that of the human body, created an intimate setting in which the monks could
meditate.
The decorations on the quoins, sculpted in Caen stone, which can be carved into
more elaborate designs than the granite of the buildings, was originally
painted. Today only traces of plant material can be distinguished.
In the vast refectory the monks took their meals in silence, while the reader
read to them from the pulpit in the south wall. Narrow windows are set into
the side walls of this room, invisible from the entrance but which allow a
stream of light to pass through.
Access to the lower floor is via a staircase where the monks work room is to be
found, later known as the "Salle des Chevaliers", the Knight's Hall, together
with the Guest Hall where distinguished guests were received. On the ground
floor poor pilgrims were fed and lodged in the almshouse, a vast hall divided
into two naves by a row of columns. The nearby cellarium, an immense cool and
in shadow-filled storeroom, is divided into sections by two rows of square
pillars to ensure that provisions were stored in a logical fashion. A large
model dated 1701, a copy of an original made in 1690 and conserved in the
Relief-Map museum in Paris, is on display in the cellarium. It shows Mont
Saint Michel as it was before the revolution. There is also a life size
maquette of Saint Michel de Fremiet who crowns the church spire. The way out
is through the gardens on the north side whose peaceful paths meander, facing
the immensity of the bay, beneath the steep walls of the "Merveille".
Clerestory- a series of windows in the upper part of a church building, but
clear of the roof, admitting light to the central area of a built space.
Chartulary- Where the abbey records were kept.
Quoin- The roughly triangular shaped area between the tops of two arches.
The three levels of the monastery reflect, from top to bottom, the structure of
medieval society- clergy, nobility, and third estate, and the hierarchy of
nourishment- spiritual, intellectual and material.
-----
Architecture- Medieval architecture
The abbey of Mont Saint Michel offers a complete overview of medieval
architecture. Pre-Romanesque architecture is represented by the church of
Notre-Dame-sous-Terre, 10th century, where traditional Romanesque features can
still be seen - very thick walls constructed of small rubblestone and Norman
arches clad in flat brick. The 11th century offers Romanesque volumes at their
fullest in the crypts of the transept and the south side of the nave of the
church. The masonry facings are meticulously laid out in a regular pattern
with fine jointing.
The 12th century sought a lighter style of construction and used the pointed
arch in the lower north side of the nave. In the Ambulatory, the architects
conceived vaults rising over a skeleton of diagonal arches. This innovation
was to lead to the birth of the Gothic style. The new process permitted the
massive and thick Romanesque vaultings to be replaced by a delicate vaulted
structure supported by arches. Since the weight was thus distributed over the
pillars, larger and larger openings could be made in the walls. The first
floor of the "Merveille", the Wonder, dating from the 13th century,
demonstrates the mastery of this system of construction.
The Flamboyant style 15th century chancel expresses the culmination of Gothic
architecture. Since the vaulting rests on fine pillars, supported on the
outside by majestic flying buttresses, the sanctuary could be transformed into
a space bathed in light.
Norman arch- A semi-circular arch, a revival of the Roman style (hence
Romanesque).
Ambulatory- In French "Promenoir" - where the monks and laymen could walk.
Flamboyant- Refers to the late Gothic period (in France from the late 14th
century) which favored decorative curves and reverse curves resembling flames.
-----
Saint Michael
Saint Michel- Saint Michel of the summits
The mount, dedicated to Saint Michael in 708, was, with Mount Gargan in
Southern Italy, one of the principal places of worship consecrated to the
archangel in the West. Devotion to Saint Michael had a very special
significance in medieval religious life. The archangel Michael had three
tasks - he weighed souls in order to separate them into the elect and the
damned, he lead them to heaven protecting them against lurking demons and
lastly, he guarded the gates of Paradise. Thus peaks close to heaven, such
as Saint-Michel-de-l'Aiguilhe in Puy and Saint-Michel-de-Cuxa in the Pyrenees,
were often consecrated to him, and high chapels above the entrances to a number
of important churches were dedicated to him, like Tournus, Vezelay, and
Saint-Benoit-sur-Loire.
In the 15th century, worship of the archangel acquired a new importance with
the creation of the Order of Saint Michael. The 19th century rediscovered the
Middle Ages, as the Fremiet statue, erected on the top of the spire in 1897,
bears witness.
It consists of 8399 characters in all.
There are 1526 space(s), 214 punctuation mark(s), 85 digit(s), and 1375 word(s)
The letters in descending order by their occurrences are shown below:
e occurred 871 times
t occurred 663 times
a occurred 480 times
o occurred 473 times
i occurred 457 times
r occurred 435 times
n occurred 432 times
h occurred 418 times
s occurred 405 times
l occurred 285 times
c occurred 261 times
d occurred 213 times
u occurred 190 times
m occurred 180 times
f occurred 161 times
g occurred 112 times
w occurred 108 times
p occurred 101 times
y occurred 97 times
b occurred 96 times
v occurred 76 times
k occurred 31 times
q occurred 15 times
j occurred 6 times
x occurred 5 times
z occurred 3 times
The file, english.txt, is written in English.
Submitting the Program
To submit your program, type the following command at the Unix prompt,
followed by the names of the .c and .h files necessary for compilation:
submit cs201 Proj4
To verify that your project was submitted, you can execute the
following command at the Unix prompt. It will show all files that
you submitted in a format similar to the Unix 'ls' command.
submitls cs201 Proj4