UMBC CMSC 461 Spring '99
An index for a file works like the card catalog for a book in a library. Cards in the catalog are stored in order, with portions of the catalog ordered by author's name, book title, or subject. Items in a database are catalogued with indices based on keys. When a table is defined it has a primary key; however, it can have additional keys defined.
Typical databases are too large to search sequentially for specific records, so more sophisticated indexing techniques are employed. The two basic kinds of indices are ordered indices, which keep search-key values in sorted order, and hash indices, which distribute search-key values across buckets using a hash function.
Different techniques are evaluated on the basis of several opposing factors, including access time, insertion time, deletion time, and space overhead.
With indexing, there is now a new meaning for the over-used word "key": a search key is the attribute, or set of attributes, used to look up records in a file. It need not be the same as the table's primary key.
Records in an indexed file are stored in some type of order (possibly unordered). If the file containing the records is ordered sequentially, the index whose search key specifies the sequential order of the file is the primary index. Primary indices are also called clustering indices. Indices whose search key specifies an order different from the sequential order are called secondary indices, or nonclustering indices.
If we assume that all files are ordered sequentially on some search key (primary index), then such files are called index-sequential files. (The Index-Sequential Access Method, or ISAM, is one of the oldest index schemes used in database systems.) They are used for applications that require both sequential processing of the entire file and random access to individual records. An example for a bank account file would be:
RECORD NR (not in table) | LOCATION (primary key) | ACCOUNT NR | BALANCE | NEXT RECORD
1                        | Brighton               | A-217      | 750     | 2
2                        | Downtown               | A-101      | 500     | 3
3                        | Downtown               | A-110      | 600     | 4
4                        | Mianus                 | A-215      | 700     | 5
5                        | Perryridge             | A-102      | 900     | 6
6                        | Perryridge             | A-201      | 700     | 7
7                        | Perryridge             | A-218      | 700     | 8
8                        | Redwood                | A-222      | 700     | 9
9                        | Round Hill             | A-305      | 350     | NULL
It could be even more complex if the primary key were the combination of the location and the account number. The advantage of this method is that when processing is done sequentially, this is the fastest access method. However, it is expensive to maintain. If a record is to be added to the table, the appropriate slot must be located, a new slot must be added to the table, and all later records must be moved down one slot. Then the new record is inserted into the vacated slot, and all of the index columns must be updated.

The file can be stored on disk as contiguous or non-contiguous data. Keeping the file contiguous offers better DBMS performance, but this is probably in conflict with the operating system's allocation method. (Some DBMS systems allocate the maximum space that a file can have when the file is created, to reduce the overhead caused when records are added to the file.) Some DBMSs manage their disk space directly, outside the operating system, so that some of these problems are reduced.
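As a minimal sketch (not the course's code), the insertion cost can be modeled with a Python list ordered by the primary key: every record past the insertion slot must shift down one position. The record values are hypothetical.

```python
import bisect

# Hypothetical records ordered by the primary key (location).
records = [("Brighton", "A-217", 750),
           ("Downtown", "A-101", 500),
           ("Mianus", "A-215", 700),
           ("Perryridge", "A-102", 900)]

def insert_record(records, record):
    """Keep sequential order: find the slot, then shift every
    later record down one position (an O(n) move)."""
    keys = [r[0] for r in records]
    slot = bisect.bisect_right(keys, record[0])  # insert after equal keys
    records.insert(slot, record)                 # shifts all later records
    return slot
```

Each such insertion also forces the pointer columns to be updated, which is one reason updates on these files are usually batched.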
Deletions are performed in the reverse sequence: the appropriate slot must be located and all records below that slot must be moved up. Again, all of the index columns must be updated.
Updates on this kind of file are normally batched together to minimize system overhead, because the cost of updating one record is almost exactly the same as updating one hundred records. This organization is not particularly suited to interactive updates.
This can be improved by adding another data structure, an index (especially for random access):
Value      | Start
Brighton   | 1
Downtown   | 2
Mianus     | 4
Perryridge | 5
Redwood    | 8
Round Hill | 9
Obviously, it is generally faster to locate a record with a dense index than with a sparse index, but the maintenance overhead is greater for a dense index, and it requires more disk space. This is a tradeoff that the system designer must make. The major advantage of the sparse index is that it requires reading fewer blocks, which can be expensive in terms of time, because searching a block once it is in memory is so much faster! The obvious answer might not be the best answer.
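To make the sparse-index lookup concrete, here is a hedged Python sketch: binary-search the index for the last entry whose key is not greater than the search key, read that one block, and scan it in memory. Block contents and keys are hypothetical.

```python
import bisect

# Hypothetical data file: blocks of (key, balance) records in key order.
blocks = [[("Brighton", 750), ("Downtown", 500)],
          [("Mianus", 700), ("Perryridge", 900)],
          [("Redwood", 700), ("Round Hill", 350)]]

# Sparse index: one (first key in block, block number) entry per block.
sparse_index = [(blk[0][0], i) for i, blk in enumerate(blocks)]

def lookup(key):
    """One 'disk read' of the chosen block, then a fast in-memory scan."""
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1  # last entry with key <= search key
    if i < 0:
        return []
    return [rec for rec in blocks[i] if rec[0] == key]
```

A dense index would skip the final scan but would need one entry per record instead of one per block.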
The above system appears to work well, but it does not scale up well. If the index is small enough to keep in memory, this is a really good method; but when the index becomes too large, searching the index becomes just like searching the file: it takes too long!
Suppose we have a file with 100,000 records with 10 records stored per block. If we have one index record for each block, the index has 10,000 entries. Since the index entries are smaller, there might be 100 entries per block in 100 blocks that are stored sequentially on the disk. Doing a sequential search would involve up to 100 reads (average would be n/2, or 50 reads).
Binary search could be used. The cost for b blocks is log2(b) block reads; in this example that is seven reads. If each read takes 30 milliseconds of access time, we have 210 milliseconds, which by today's standards is long! (That example ignores a number of other time-consuming activities.) A binary search is not possible if the file was built using overflow blocks.
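The arithmetic in the example can be checked directly; the 30 ms access time is the figure assumed above.

```python
import math

records = 100_000
recs_per_block = 10
entries_per_index_block = 100
access_ms = 30                                           # assumed access time

index_entries = records // recs_per_block                # 10,000 entries
index_blocks = index_entries // entries_per_index_block  # 100 index blocks

avg_sequential_reads = index_blocks // 2                 # n/2 = 50 reads
binary_reads = math.ceil(math.log2(index_blocks))        # 7 reads
total_ms = binary_reads * access_ms                      # 210 ms
```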
What if we use a sparse index on the index file, one small enough to keep in memory? That new index is an outer index and the original index is an inner index. The system performs a binary search on the outer index and then needs only one read to fetch the correct block of the inner index. There can be any number of inner levels, creating a multilevel index system.
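A hedged sketch of the two-level scheme, with made-up account keys: the outer index stays in memory, so its binary search is essentially free, and reaching the inner index costs exactly one block read.

```python
import bisect

# Hypothetical inner index, split into disk blocks of (key, data_block) entries.
inner_index_blocks = [[("A-100", 0), ("A-150", 1)],
                      [("A-200", 2), ("A-250", 3)],
                      [("A-300", 4), ("A-350", 5)]]

# Outer index, kept in memory: first key of each inner index block.
outer_index = [blk[0][0] for blk in inner_index_blocks]

def find_data_block(key):
    """Binary-search the in-memory outer index, then make a single
    'disk read' of one inner index block to find the data block."""
    i = max(bisect.bisect_right(outer_index, key) - 1, 0)
    inner = inner_index_blocks[i]          # the one disk read
    keys = [k for k, _ in inner]
    j = max(bisect.bisect_right(keys, key) - 1, 0)
    return inner[j][1]
```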
Multilevel indices are closely related to tree structures, such as the binary trees used for in-memory indexing.
Algorithms for updating single-level indices handle insertion and deletion; both begin with a lookup on the record's search-key value and then adjust the dense or sparse index entries accordingly.
Insertion and deletion algorithms for multilevel indices are a simple extension of this scheme.
A secondary index on a candidate key looks just like a dense primary index, except that the record pointers are not stored in the records themselves and must be stored somewhere else. Additionally, a secondary index might not be based on a candidate key, in which case it is not sufficient to point to the first occurrence: there must be pointers to every record. The "dense" index therefore points to a structure that has an entry for each occurrence of that value. Obviously, the maintenance of secondary indices is more costly.
The main disadvantage of the index-sequential file organization is that performance degrades as the file grows, both for index lookups and for sequential scans through the data. A secondary disadvantage is the overhead of maintaining the file in sequential order when inserting large amounts of data.
The B+-tree index structure is widely used to maintain efficiency despite insertions and deletions of data. It is a multilevel index, where each node contains some pointers and search-key values. It is assumed that you have already developed a familiarity with trees in your other courses.
One disadvantage of sequential file organization is that we must access an index structure to locate data, or must use binary searches. This results in more I/O operations, which are expensive in terms of time. File organizations based on the technique of hashing allow us to avoid accessing an index structure while still locating records directly.
In a hash file organization, we obtain the address of the disk block (actually a bucket that can contain one or more records) containing the desired record by computing a function on the search-key value of the record.
If K is the set of all search-key values, B is the set of all bucket addresses, and h is the hash function, we can compute the address of the bucket for a record with search key Ki as h(Ki). Assuming there is space in the bucket, we can simply insert the record. We locate the record with search key Ki using h(Ki). Deletion is done the same way.
However, if it turns out that two records have the same hash value, h(K5) = h(K7), then we do a sequential search of the bucket for the desired record.
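The insert/lookup/delete operations can be sketched in Python; the hash function and bucket count here are illustrative assumptions, not a prescribed choice.

```python
NUM_BUCKETS = 8

def h(key):
    """Hypothetical hash function on the search-key value."""
    return sum(ord(c) for c in key) % NUM_BUCKETS

buckets = [[] for _ in range(NUM_BUCKETS)]

def insert(key, record):
    buckets[h(key)].append((key, record))   # assumes the bucket has space

def lookup(key):
    # Different keys may hash to the same bucket, so scan it sequentially.
    return [rec for k, rec in buckets[h(key)] if k == key]

def delete(key):
    b = buckets[h(key)]
    b[:] = [(k, r) for k, r in b if k != key]
```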
The worst possible hash function maps all search-key values to the same bucket. The ideal hash functions will spread the records uniformly across all the buckets, so that all buckets have the same number of records.
At design time, we do not necessarily know what search-key values will be stored in the file. The typical hash function performs a computation on the internal binary representation of the characters in the search key. It is critical that the hash function be as close to the ideal as possible, so that records can be located in a timely fashion.
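A sketch of such a function, computing on the bytes of the search key; the multiplier, bucket count, and sample keys are arbitrary illustration choices.

```python
from collections import Counter

NUM_BUCKETS = 10

def h(key):
    """Hash computed on the internal binary representation
    of the characters in the search key."""
    value = 0
    for byte in key.encode("utf-8"):
        value = (value * 31 + byte) % NUM_BUCKETS
    return value

# Check how uniformly a sample of hypothetical account keys spreads.
keys = [f"A-{n:03d}" for n in range(100)]
counts = Counter(h(k) for k in keys)
```

A designer would test a candidate function like this against realistic key distributions before committing to it.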
Trying to insert a record into a full bucket causes an overflow, which can result from either insufficient buckets or skew.
The number of buckets required is based on the maximum number of records stored divided by the number of records that will fit into a bucket. Rarely do we get an exact fit, so the required number of buckets is increased by some fudge factor, typically 20%. This wastes twenty percent of the space, but reduces the probability of the overflow problem.
The skew can be caused when either multiple records have the same key or the chosen hash function results in nonuniform distribution of search keys.
Overflow can be handled by adding additional buckets chained to the single bucket that overflowed (called closed hashing) or by simply using some other bucket (open hashing).
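Closed hashing (overflow chaining) can be sketched as follows; the bucket count and capacity are kept tiny on purpose to force an overflow.

```python
NUM_BUCKETS = 4
BUCKET_CAPACITY = 2

class Bucket:
    def __init__(self):
        self.records = []
        self.overflow = None        # chained overflow bucket

buckets = [Bucket() for _ in range(NUM_BUCKETS)]

def h(key):
    return sum(ord(c) for c in key) % NUM_BUCKETS

def insert(key, record):
    """Closed hashing: when the home bucket is full, chain a new
    overflow bucket onto it instead of using a different home bucket."""
    b = buckets[h(key)]
    while len(b.records) >= BUCKET_CAPACITY:
        if b.overflow is None:
            b.overflow = Bucket()
        b = b.overflow
    b.records.append((key, record))

def lookup(key):
    out, b = [], buckets[h(key)]
    while b is not None:            # follow the whole overflow chain
        out += [r for k, r in b.records if k == key]
        b = b.overflow
    return out
```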
An important drawback of hashing is that the hash function must be chosen when we implement the system and cannot easily be changed afterwards.
In addition to using hashing for file organization, the indices themselves can be hashed.
Each scheme has advantages in certain situations. The DBMS implementor could leave the decision to the database designer and provide several methods; normally, the implementor provides only a very limited number of schemes.
Typically, ordered indexing is used unless it is known in advance that range queries will be infrequent, in which case hashing is used. Hash organizations are particularly useful for temporary files created during query processing, if lookups on a key value are required and no range queries will be performed.
Creating an index:
CREATE INDEX <index-name> ON <relation-name> (<attribute-list>)
Example:
CREATE INDEX branch-index ON branch (branch-name)
Deleting an index:
DROP INDEX <index-name>