UMBC CMSC 461 Spring '99
An index for a file works like the card catalog for a book in a library. Cards in the catalog are stored in order, with portions of the catalog ordered by author's name, book title, or subject. Items in a database are catalogued with indices based on keys. When a table is defined it has a primary key; however, it can have additional keys defined.
Typical databases are too large to search sequentially for specific records, so more sophisticated indexing techniques are employed. The two basic kinds of indices are ordered indices, which keep search-key values in sorted order, and hash indices, which distribute search-key values across buckets using a hash function.
Different techniques are evaluated on the basis of several opposing factors, including access time, insertion time, deletion time, and space overhead.
With indexing, there is now a new meaning for the over-used word "key": a search key is the attribute, or set of attributes, used to look up records in a file. It need not be the same as the table's primary key.
Records in an indexed file are stored in some type of order (possibly unordered). If the file containing the records is ordered sequentially, the index whose search key specifies the sequential order of the file is the primary index. Primary indices are also called clustering indices. Indices whose search key specifies an order different from the sequential order are called secondary indices, or nonclustering indices.
If we assume that all files are ordered sequentially on some search key (primary index), then such files are called index-sequential files. (The Index-Sequential Access Method, or ISAM, is one of the oldest index schemes used in database systems.) They are used for applications that require both sequential processing of the entire file and random access to individual records. An example for a bank account file would be:
RECORD NR (not in table) | LOCATION (primary key) | ACCOUNT NR | BALANCE | NEXT RECORD
1                        | Brighton               | A-217      | 750     | 2
2                        | Downtown               | A-101      | 500     | 3
3                        | Downtown               | A-110      | 600     | 4
4                        | Mianus                 | A-215      | 700     | 5
5                        | Perryridge             | A-102      | 900     | 6
6                        | Perryridge             | A-201      | 700     | 7
7                        | Perryridge             | A-218      | 700     | 8
8                        | Redwood                | A-222      | 700     | 9
9                        | Round Hill             | A-305      | 350     | NULL
It could be even more complex if the primary key were the combination of the location and the account number. The advantage of this method is that when processing is done sequentially, this is the fastest access method. However, it is expensive to maintain. If a record is to be added to the table, the appropriate slot must be located, a new slot must be added to the table, and all later records must be moved down one slot. Then the new record is inserted into the vacated slot, and all of the index columns must be updated.

The file can be stored on disk as contiguous or non-contiguous data. Keeping the file contiguous offers better DBMS performance, but this is probably in conflict with the operating system's allocation method. (Some DBMS systems allocate the maximum space that a file can have when the file is created, to reduce the overhead caused when records are added to the file.) Some DBMSs manage their disk space directly, outside the operating system, so that some of these problems are reduced.
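As a minimal sketch (not the course's code), the insertion cost can be modeled with a Python list ordered by the primary key: every record past the insertion slot must shift down one position. The record values are hypothetical.

```python
import bisect

# Hypothetical records ordered by the primary key (location).
records = [("Brighton", "A-217", 750),
           ("Downtown", "A-101", 500),
           ("Mianus", "A-215", 700),
           ("Perryridge", "A-102", 900)]

def insert_record(records, record):
    """Keep sequential order: find the slot, then shift every
    later record down one position (an O(n) move)."""
    keys = [r[0] for r in records]
    slot = bisect.bisect_right(keys, record[0])  # insert after equal keys
    records.insert(slot, record)                 # shifts all later records
    return slot
```

Each such insertion also forces the pointer columns to be updated, which is one reason updates on these files are usually batched.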
Deletions are performed in the reverse sequence: the appropriate slot must be located and all records below that slot must be moved up. Again, all of the index columns must be updated.
Updates on this kind of file are normally batched together to minimize system overhead, because the cost of updating one record is almost exactly the same as updating one hundred records. This organization is not particularly suited to interactive updates.
This can be improved by adding another data structure, an index (especially for random access):
Value      | Start
Brighton   | 1
Downtown   | 2
Mianus     | 4
Perryridge | 5
Redwood    | 8
Round Hill | 9
Obviously, it is generally faster to locate a record with a dense index than with a sparse index, but the maintenance overhead is greater for a dense index, and it requires more disk space. This is a tradeoff that the system designer must make. The major advantage of the sparse index is that it requires reading fewer blocks, which can be expensive in terms of time, because searching a block once it is in memory is so much faster! The obvious answer might not be the best answer.
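To make the sparse-index lookup concrete, here is a hedged Python sketch: binary-search the index for the last entry whose key is not greater than the search key, read that one block, and scan it in memory. Block contents and keys are hypothetical.

```python
import bisect

# Hypothetical data file: blocks of (key, balance) records in key order.
blocks = [[("Brighton", 750), ("Downtown", 500)],
          [("Mianus", 700), ("Perryridge", 900)],
          [("Redwood", 700), ("Round Hill", 350)]]

# Sparse index: one (first key in block, block number) entry per block.
sparse_index = [(blk[0][0], i) for i, blk in enumerate(blocks)]

def lookup(key):
    """One 'disk read' of the chosen block, then a fast in-memory scan."""
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1  # last entry with key <= search key
    if i < 0:
        return []
    return [rec for rec in blocks[i] if rec[0] == key]
```

A dense index would skip the final scan but would need one entry per record instead of one per block.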
The above system appears to work well, but it does not scale up well. If the index is small enough to keep in memory, this is a really good method; but when the index becomes too large, searching the index becomes just like searching the file: it takes too long!
Suppose we have a file with 100,000 records with 10 records stored per block. If we have one index record for each block, the index has 10,000 entries. Since the index entries are smaller, there might be 100 entries per block in 100 blocks that are stored sequentially on the disk. Doing a sequential search would involve up to 100 reads (average would be n/2, or 50 reads).
Binary search could be used. The cost for b blocks is log2(b) block reads; in this example that is seven reads. If each read takes 30 milliseconds of access time, we have 210 milliseconds, which by today's standards is long! (That example ignores a number of other time-consuming activities.) A binary search is not possible if the file was built using overflow blocks.
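The arithmetic in the example can be checked directly; the 30 ms access time is the figure assumed above.

```python
import math

records = 100_000
recs_per_block = 10
entries_per_index_block = 100
access_ms = 30                                           # assumed access time

index_entries = records // recs_per_block                # 10,000 entries
index_blocks = index_entries // entries_per_index_block  # 100 index blocks

avg_sequential_reads = index_blocks // 2                 # n/2 = 50 reads
binary_reads = math.ceil(math.log2(index_blocks))        # 7 reads
total_ms = binary_reads * access_ms                      # 210 ms
```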
What if we use a sparse index on the index file, one small enough to keep in memory? That new index is an outer index and the original index is an inner index. The system performs a binary search on the outer index and then needs only one read to fetch the correct block of the inner index. There can be any number of inner levels, creating a multilevel index system.
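A hedged sketch of the two-level scheme, with made-up account keys: the outer index stays in memory, so its binary search is essentially free, and reaching the inner index costs exactly one block read.

```python
import bisect

# Hypothetical inner index, split into disk blocks of (key, data_block) entries.
inner_index_blocks = [[("A-100", 0), ("A-150", 1)],
                      [("A-200", 2), ("A-250", 3)],
                      [("A-300", 4), ("A-350", 5)]]

# Outer index, kept in memory: first key of each inner index block.
outer_index = [blk[0][0] for blk in inner_index_blocks]

def find_data_block(key):
    """Binary-search the in-memory outer index, then make a single
    'disk read' of one inner index block to find the data block."""
    i = max(bisect.bisect_right(outer_index, key) - 1, 0)
    inner = inner_index_blocks[i]          # the one disk read
    keys = [k for k, _ in inner]
    j = max(bisect.bisect_right(keys, key) - 1, 0)
    return inner[j][1]
```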
Multilevel indices are closely related to tree structures, such as the binary trees used for in-memory indexing.
Algorithms for updating single-level indices handle insertion and deletion; both begin with a lookup on the record's search-key value and then adjust the dense or sparse index entries accordingly.
Insertion and deletion algorithms for multilevel indices are a simple extension of this scheme.
A secondary index on a candidate key looks just like a dense primary index, except that the record pointers are not stored in the records themselves and must be stored somewhere else. Additionally, a secondary index might not be based on a candidate key, in which case it is not sufficient to point to the first occurrence: there must be pointers to every record. The "dense" index therefore points to a structure that has an entry for each occurrence of that value. Obviously, the maintenance of secondary indices is more costly.
The main disadvantage of the index-sequential file organization is that performance degrades as the file grows, both for index lookups and for sequential scans through the data. A secondary disadvantage is the overhead of maintaining the file in sequential order when inserting large amounts of data.
The B+-tree index structure is widely used to maintain efficiency despite insertions and deletions of data. It is a multilevel index, where each node contains some pointers and search-key values. It is assumed that you have already developed a familiarity with trees in your other courses.
One disadvantage of sequential file organization is that we must access an index structure to locate data, or must use binary searches. This results in more I/O operations, which are expensive in terms of time. File organizations based on the technique of hashing allow us to avoid accessing an index structure while still locating records directly.
In a hash file organization, we obtain the address of the disk block (actually a bucket that can contain one or more records) containing the desired record by computing a function on the search-key value of the record.
If K is the set of all search-key values, B is the set of all bucket addresses, and h is the hash function, we can compute the address of the bucket for a record with search key Ki as h(Ki). Assuming there is space in the bucket, we can simply insert the record. We locate the record with search key Ki using h(Ki). Deletion is done the same way.
However, if it turns out that two records have the same hash value, h(K5) = h(K7), then we do a sequential search of the bucket for the desired record.
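The insert/lookup/delete operations can be sketched in Python; the hash function and bucket count here are illustrative assumptions, not a prescribed choice.

```python
NUM_BUCKETS = 8

def h(key):
    """Hypothetical hash function on the search-key value."""
    return sum(ord(c) for c in key) % NUM_BUCKETS

buckets = [[] for _ in range(NUM_BUCKETS)]

def insert(key, record):
    buckets[h(key)].append((key, record))   # assumes the bucket has space

def lookup(key):
    # Different keys may hash to the same bucket, so scan it sequentially.
    return [rec for k, rec in buckets[h(key)] if k == key]

def delete(key):
    b = buckets[h(key)]
    b[:] = [(k, r) for k, r in b if k != key]
```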
The worst possible hash function maps all search-key values to the same bucket. The ideal hash functions will spread the records uniformly across all the buckets, so that all buckets have the same number of records.
At design time, we do not necessarily know what search-key values will be stored in the file. The typical hash function performs a computation on the internal binary representation of the characters in the search key. It is critical that the hash function be as close to the ideal as possible, so that records can be located in a timely fashion.
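A sketch of such a function, computing on the bytes of the search key; the multiplier, bucket count, and sample keys are arbitrary illustration choices.

```python
from collections import Counter

NUM_BUCKETS = 10

def h(key):
    """Hash computed on the internal binary representation
    of the characters in the search key."""
    value = 0
    for byte in key.encode("utf-8"):
        value = (value * 31 + byte) % NUM_BUCKETS
    return value

# Check how uniformly a sample of hypothetical account keys spreads.
keys = [f"A-{n:03d}" for n in range(100)]
counts = Counter(h(k) for k in keys)
```

A designer would test a candidate function like this against realistic key distributions before committing to it.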
Trying to insert a record into a full bucket causes an overflow, which can result from either insufficient buckets or skew.
The number of buckets required is based on the maximum number of records stored divided by the number of records that will fit into a bucket. Rarely do we get an exact fit, so the required number of buckets is increased by some fudge factor, typically 20%. This wastes twenty percent of the space, but reduces the probability of the overflow problem.
The skew can be caused when either multiple records have the same key or the chosen hash function results in nonuniform distribution of search keys.
Overflow can be handled by adding additional buckets chained to the single bucket that overflowed (called closed hashing) or by simply using some other bucket (open hashing).
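Closed hashing (overflow chaining) can be sketched as follows; the bucket count and capacity are kept tiny on purpose to force an overflow.

```python
NUM_BUCKETS = 4
BUCKET_CAPACITY = 2

class Bucket:
    def __init__(self):
        self.records = []
        self.overflow = None        # chained overflow bucket

buckets = [Bucket() for _ in range(NUM_BUCKETS)]

def h(key):
    return sum(ord(c) for c in key) % NUM_BUCKETS

def insert(key, record):
    """Closed hashing: when the home bucket is full, chain a new
    overflow bucket onto it instead of using a different home bucket."""
    b = buckets[h(key)]
    while len(b.records) >= BUCKET_CAPACITY:
        if b.overflow is None:
            b.overflow = Bucket()
        b = b.overflow
    b.records.append((key, record))

def lookup(key):
    out, b = [], buckets[h(key)]
    while b is not None:            # follow the whole overflow chain
        out += [r for k, r in b.records if k == key]
        b = b.overflow
    return out
```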
An important drawback of hashing is that the hash function must be chosen when we implement the system and cannot easily be changed afterwards.
In addition to using hashing for file organization, the indices themselves can be hashed.
Each scheme has advantages in certain situations. The DBMS implementor could leave the decision to the database designer and provide several methods; normally, the implementor provides only a very limited number of schemes.
Typically, ordered indexing is used unless it is known in advance that range queries will be infrequent, in which case hashing is used. Hash organizations are particularly useful for temporary files created during query processing, if lookups on a key value are required and no range queries will be performed.
Creating an index:
CREATE INDEX <index-name> ON <relation-name> (<attribute-list>)
Example:
CREATE INDEX branch-index ON branch (branch-name)
Deleting an index:
DROP INDEX <index-name>