Computing with Strings

Sue Evans & Travis Mayberry

Adapted from the CS1 Course at Swarthmore College by Lisa Meeden


Learning Outcomes

To understand the string data type and how strings are represented in memory
To become familiar with the way strings can be manipulated, by using built-in functions and the string library
To understand how indexing works on Python lists and strings
To understand how characters are stored
To be aware of the ASCII chart and how to convert a character to its ordinal value and back
To be aware that formatted output can be used to produce nice-looking reports and tables
To be able to write programs that process textual information

String Type

Just as range(x) returns a sequence of numbers, a string is a sequence of characters, so a for loop can step through it one character at a time

>>> for ch in 'hello':
...     print ch
...
h
e
l
l
o
>>>

Unlike some other languages, Python lets you write a string using either single or double quotes

>>> myString = "Goodbye"
>>> myString
'Goodbye'

This allows us to easily have quoted text within a string.

>>> fact = '"The Cat in the Hat" was written by Dr. Seuss.'
>>> print fact
"The Cat in the Hat" was written by Dr. Seuss.
>>>
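A small sketch of the point above: the two quote styles build the same string, and picking the other style lets a quote character appear inside the text. The variable names are made up for illustration.

```python
# Both quote styles produce the same string value.
single = 'Goodbye'
double = "Goodbye"
same = (single == double)

# Double quotes on the outside allow an apostrophe inside.
contraction = "isn't"

# Single quotes on the outside allow double quotes inside.
title = '"The Cat in the Hat"'
```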

String Representation

Slicing

String Operators

Operation	Python Operator
Concatenation	+
Repetition	*
Indexing	[ ]
Slicing	[ : ]

Examples using these operators:

>>> 'snow' + 'ball'
'snowball'  
>>> 'hello' * 3
'hellohellohello'

Oops. Let's see if we can do better ...

>>> 'hello ' * 3 + '!'
'hello hello hello !'
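That version still leaves a space before the exclamation point. One possible fix, offered as a sketch rather than the only answer, repeats 'hello ' only twice and concatenates the last word without a trailing space. It uses just the two operators shown so far, and the name greeting is made up for the example.

```python
# Repeat the word-plus-space twice, then add the final word and the '!'.
greeting = 'hello ' * 2 + 'hello' + '!'
print(greeting)
```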

Notice in the next example that indexing begins with 0

>>> 'hello'[4]
'o'

Also notice that a slice includes the character at the first index and every character after it up to, but not including, the character at the second index.

>>> 'hello'[1:4]
'ell'

Python also allows the use of negative indexes.
An index of -1 refers to the last character of the string, the same character found at the largest index.

>>> 'sample'[-1]
'e'
>>> len('sample')
6
>>> 'sample'[5]
'e'
>>> 'sample'[-3]
'p'
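The relationship between positive and negative indexes, and the length of a slice, can be checked with a short sketch. The variable names are invented for illustration.

```python
word = 'sample'
last = len(word) - 1                 # largest valid index, which is 5

# A negative index -k names the same character as index len(word) - k.
same_last = (word[-1] == word[last])
same_third = (word[-3] == word[len(word) - 3])

# A slice [start:stop] includes start and excludes stop,
# so it holds stop - start characters.
piece = word[1:4]
piece_length = len(piece)
```

Here piece is 'amp' (the characters at indexes 1, 2 and 3), so piece_length is 3.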

String Library

To use the string library, you have to import it:
import string

Function	Purpose
string.capitalize(s)	Returns s with only the first letter capitalized
string.capwords(s)	Returns s with the first letter of each word capitalized
string.count(s, sub)	Returns the number of times sub occurs in s
string.find(s, sub)	Finds the first occurrence of sub in s and returns its position or -1 if it is not found
string.join(list)	Concatenates a list of strings together to make one string, with a space between items by default
string.upper(s)	Returns a copy of s in all capital letters
string.lower(s)	Returns a copy of s in all lower case letters
string.split(s, c)	Returns a list of strings made by splitting s on each occurrence of c
string.strip(s)	Removes all whitespace from the beginning and end of s

String Library Examples

lower makes the string all lower case

upper makes the string all upper case

>>> import string
>>> string.lower("Hello World")
'hello world'
>>> string.upper("Hello World")
'HELLO WORLD'

count counts the number of times sub occurs in s
```
>>> string.count("isn't this it", "is")
2
```

strip removes all whitespace from the beginning and end of a string.

>>> string.strip("  Hello World \n")
'Hello World'
>>> string.strip("   \n \n \n      \t   \n  ")
''

split breaks a string into a list of strings. When no separator is given, the string is split wherever whitespace occurs. For example:
```
>>> string.split("Hello World from U M B C")
['Hello', 'World', 'from', 'U', 'M', 'B', 'C']
```

The same operation can also be written as a method call on the string itself:

>>> "Hello World from U M B C".split()
['Hello', 'World', 'from', 'U', 'M', 'B', 'C']
>>>
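The method-call syntax works for the other operations in the table too. The sketch below is written that way so that it should run under both Python 2 and Python 3 (an assumption about your interpreter). The phrase and variable names are made up, and capwords is still taken from the string module.

```python
import string

phrase = "the cat in the hat"

how_many = phrase.count("the")        # number of times "the" occurs
where_cat = phrase.find("cat")        # position of the first "cat"
where_dog = phrase.find("dog")        # -1 because "dog" is not found
first_cap = phrase.capitalize()       # only the first letter capitalized
all_caps = string.capwords(phrase)    # first letter of each word capitalized

# As a method, join is called on the separator placed between the items.
letters = ["U", "M", "B", "C"]
spaced = " ".join(letters)
```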

How are characters stored?

As binary, of course. Everything is stored in binary!

The technique used for characters is that each character is assigned a number (an integer), and that number is stored in binary.

Recall that an integer is 4 bytes big. With 8 bits per byte, that means an integer needs 32 bits of memory, which allows us to count up to 2147483647. But do we really need all that space to represent a character?

If you think about the English language and everything that's necessary to communicate, there are very few individual characters:

Type of character	Number
Upper-case characters	26
Lower-case characters	26
Digits	10
Arithmetic Operators	9
Punctuation Marks	24
Total	95

This means that in order to assign a number to each character, we only need to count to 95.
How many bits does that take?

Value	Bits needed
1	1
2	2
4	3
8	4
16	5
32	6
64	7
128	8

Counting to 95 takes 7 bits, which fits comfortably in one byte (8 bits). So by storing characters in just one byte, the amount of memory needed for each character is only 1/4 of the amount of space needed by a 4-byte int. This is a tremendous amount of saved space for storing text.
So every ASCII character is only one byte big.
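The bit counts in the table can be reproduced by repeatedly halving a number until nothing is left. This is only a sketch: bits_needed is a hypothetical helper, not a built-in, and it uses a while loop.

```python
def bits_needed(value):
    """Count how many binary digits it takes to write value."""
    bits = 0
    while value > 0:
        value = value // 2     # drop the lowest binary digit
        bits = bits + 1
    return bits

bits_for_95 = bits_needed(95)      # enough for every printable character
bits_for_128 = bits_needed(128)    # matches the last row of the table
```

With these definitions bits_for_95 is 7 and bits_for_128 is 8, so one byte has room to spare for the 95 characters counted above.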

Trivia Question: How big is a nibble?

Half of a byte: 4 bits.

ASCII

American Standard Code for Information Interchange or ASCII was derived from telegraphic codes. Work began on ASCII in 1960 with the first version published in 1963. There was a major revision in 1967. The current version became available in 1986.

There are 128 characters defined in ASCII. Of these, 33 are nonprinting control characters that control communication devices and printers, such as line feed and form feed: 32 of them (values 0 - 31) are at the beginning of the chart and one, delete (127), is at the end. Many of the control characters are obsolete. The values 32 - 126 define the 95 printable characters: digits, letters, punctuation, symbols and the space character.

Just as integers can be converted into floats, a character can be converted into its ASCII value with ord(), and an ASCII value back into its character with chr(). Digits have ASCII values from 48 - 57, while uppercase letters are 65 - 90 and lowercase letters are 97 - 122.

>>> chr(104)
'h'
>>> ord('h')
104
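Because the uppercase and lowercase letters sit in the same order in the chart, ord() and chr() can be combined to move between them. The sketch below assumes its argument is a single uppercase letter from A to Z. The function to_lower is a hypothetical helper written for this example.

```python
def to_lower(letter):
    """Convert one uppercase ASCII letter to lowercase."""
    offset = ord('a') - ord('A')       # 97 - 65 = 32
    return chr(ord(letter) + offset)

lower_h = to_lower('H')

# The same idea turns a digit character into the number it stands for.
seven = ord('7') - ord('0')
```

Here to_lower('H') gives 'h', since 72 + 32 is 104, the value shown in the session above.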

Let's print out an ASCII chart for just the printable characters.
How would we do that?

>>> for num in range(32, 127):
...     print num, '=', chr(num)
...
32 =
33 = !
34 = "
35 = #
36 = $
37 = %
38 = &
39 = '
40 = (
41 = )
42 = *
43 = +
44 = ,
45 = -
46 = .
47 = /
48 = 0
49 = 1
50 = 2
51 = 3
52 = 4
53 = 5
54 = 6
55 = 7
56 = 8
57 = 9
58 = :
59 = ;
60 = <
61 = =
62 = >
63 = ?
64 = @
65 = A
66 = B
67 = C
68 = D
69 = E
70 = F
71 = G
72 = H
73 = I
74 = J
75 = K
76 = L
77 = M
78 = N
79 = O
80 = P
81 = Q
82 = R
83 = S
84 = T
85 = U
86 = V
87 = W
88 = X
89 = Y
90 = Z
91 = [
92 = \
93 = ]
94 = ^
95 = _
96 = `
97 = a
98 = b
99 = c
100 = d
101 = e
102 = f
103 = g
104 = h
105 = i
106 = j
107 = k
108 = l
109 = m
110 = n
111 = o
112 = p
113 = q
114 = r
115 = s
116 = t
117 = u
118 = v
119 = w
120 = x
121 = y
122 = z
123 = {
124 = |
125 = }
126 = ~
>>>

As you've seen, printing just one character per line makes the chart extremely long and narrow, so let's print 6 characters across instead. Since we haven't covered if yet, you can't use it in your solution.
How can we do that?

>>> for num in range(32, 127, 6):
...     print num, '=', chr(num),
...     print num + 1, '=', chr(num + 1),
...     print num + 2, '=', chr(num + 2),
...     print num + 3, '=', chr(num + 3),
...     print num + 4, '=', chr(num + 4),
...     print num + 5, '=', chr(num + 5)
...
32 =   33 = ! 34 = " 35 = # 36 = $ 37 = %
38 = & 39 = ' 40 = ( 41 = ) 42 = * 43 = +
44 = , 45 = - 46 = . 47 = / 48 = 0 49 = 1
50 = 2 51 = 3 52 = 4 53 = 5 54 = 6 55 = 7
56 = 8 57 = 9 58 = : 59 = ; 60 = < 61 = =
62 = > 63 = ? 64 = @ 65 = A 66 = B 67 = C
68 = D 69 = E 70 = F 71 = G 72 = H 73 = I
74 = J 75 = K 76 = L 77 = M 78 = N 79 = O
80 = P 81 = Q 82 = R 83 = S 84 = T 85 = U
86 = V 87 = W 88 = X 89 = Y 90 = Z 91 = [
92 = \ 93 = ] 94 = ^ 95 = _ 96 = ` 97 = a
98 = b 99 = c 100 = d 101 = e 102 = f 103 = g
104 = h 105 = i 106 = j 107 = k 108 = l 109 = m
110 = n 111 = o 112 = p 113 = q 114 = r 115 = s
116 = t 117 = u 118 = v 119 = w 120 = x 121 = y
122 = z 123 = { 124 = | 125 = } 126 = ~ 127 =
>>>

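Wow, that's one ugly table! It doesn't even have columns. Notice too that the last row runs on to 127, the nonprinting delete character, because 95 printable characters do not divide evenly into rows of 6.
Let's look at some print formatting: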

>>> for num in range(32, 127, 6):
...     print "%5d = %c" % (num, chr(num)),
...     print "%5d = %c" % (num + 1, chr(num + 1)),
...     print "%5d = %c" % (num + 2, chr(num + 2)),
...     print "%5d = %c" % (num + 3, chr(num + 3)),
...     print "%5d = %c" % (num + 4, chr(num + 4)),
...     print "%5d = %c" % (num + 5, chr(num + 5))
...
   32 =      33 = !    34 = "    35 = #    36 = $    37 = %
   38 = &    39 = '    40 = (    41 = )    42 = *    43 = +
   44 = ,    45 = -    46 = .    47 = /    48 = 0    49 = 1
   50 = 2    51 = 3    52 = 4    53 = 5    54 = 6    55 = 7
   56 = 8    57 = 9    58 = :    59 = ;    60 = <    61 = =
   62 = >    63 = ?    64 = @    65 = A    66 = B    67 = C
   68 = D    69 = E    70 = F    71 = G    72 = H    73 = I
   74 = J    75 = K    76 = L    77 = M    78 = N    79 = O
   80 = P    81 = Q    82 = R    83 = S    84 = T    85 = U
   86 = V    87 = W    88 = X    89 = Y    90 = Z    91 = [
   92 = \    93 = ]    94 = ^    95 = _    96 = `    97 = a
   98 = b    99 = c   100 = d   101 = e   102 = f   103 = g
  104 = h   105 = i   106 = j   107 = k   108 = l   109 = m
  110 = n   111 = o   112 = p   113 = q   114 = r   115 = s
  116 = t   117 = u   118 = v   119 = w   120 = x   121 = y
  122 = z   123 = {   124 = |   125 = }   126 = ~   127 =
>>>
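The table above still ends with 127, which is not a printable character. Since 95 is 5 times 19, one way around that without using if is to print 5 characters across instead of 6. The sketch below is an alternative, not the slide's solution: ascii_rows is a hypothetical helper that uses a nested loop in place of the repeated print lines, and it calls print with parentheses so that it should run under both Python 2 and Python 3.

```python
def ascii_rows(columns=5):
    """Build the printable ASCII chart as a list of text rows."""
    rows = []
    for start in range(32, 127, columns):
        row = ''
        for num in range(start, start + columns):
            # %5d right-aligns the number in 5 spaces; %c shows the character.
            row = row + "%5d = %c" % (num, chr(num))
        rows.append(row)
    return rows

for row in ascii_rows():
    print(row)
```

With the default of 5 columns this gives 19 full rows, the last one ending at 126.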

Console Input

Recall that the function input() can be used to get values from the user.
The downside to input() is that it evaluates whatever is typed in as if it were Python code.

What would you expect the following to do?

x = input() 
#User types: hello world
print x

Output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
    hello world
             ^
SyntaxError: unexpected EOF while parsing

This happens because input() treats the typed text as a Python expression, just as if it had been written where input() appears in the code, and hello world is not a valid expression. The user could surround the text with quotes to make this work, but you should never leave technical issues for the user to handle. There is a better way to write this code.

Raw Input

The function raw_input() can be used to read in a string value from the console.
```
x = raw_input() 
#User types: hello world
print x
```
Output:
```
hello world
```
The function input() works when the user enters an integer or a float, but it fails on plain text typed without quotes.
The function raw_input() returns everything the user enters as a string, so using it avoids those input errors. However, if what is entered is to be used in a calculation, you have to convert it into the desired type first.

>>> value = raw_input("Enter a positive integer: ")
Enter a positive integer: 5
>>> value * 2
'55'
>>> value
'5'
>>> value = int(value)
>>> value
5
>>> value * 2
10
>>>
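The same conversion step applies when the user types several numbers on one line: split the string into pieces, then convert each piece before doing arithmetic. The sketch below uses a fixed string in place of real keyboard input so that it should run unchanged under Python 2 or Python 3. The function total_of_line is a hypothetical helper.

```python
def total_of_line(line):
    """Add up the whole numbers in a line of text such as '3 4 5'."""
    total = 0
    for piece in line.split():        # split on whitespace into strings
        total = total + int(piece)    # convert each string before adding
    return total

# In a Python 2 program the line could come from the user instead:
#     line = raw_input("Enter some integers: ")
line = "3 4 5"
print(total_of_line(line))
```

Without the call to int(), the + would fail, because total is a number and each piece is still a string.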