Database Management Systems - Chapter 3: Indexing Structures for Files - Võ Thị Ngọc Châu
3.1. Types of Single-level Ordered Indexes
3.2. Multilevel Indexes
3.3. Dynamic Multilevel Indexes Using BTrees and B+-Trees
3.4. Indexes on Multiple Keys
3.5. Other File Indexes
3.6. Indexes in Today‘s DBMSs
3.2. Multilevel Indexes
3.3. Dynamic Multilevel Indexes Using BTrees and B+-Trees
3.4. Indexes on Multiple Keys
3.5. Other File Indexes
3.6. Indexes in Today‘s DBMSs
Bạn đang xem 20 trang mẫu của tài liệu "Database Management Systems - Chapter 3: Indexing Structures for Files - Võ Thị Ngọc Châu", để tải tài liệu gốc về máy hãy click vào nút Download ở trên.
File đính kèm:
- database_management_systems_chapter_3_indexing_structures_fo.pdf
Nội dung text: Database Management Systems - Chapter 3: Indexing Structures for Files - Võ Thị Ngọc Châu
- 3.1. Types of Single-level Ordered Indexes A single-level ordered index is an access structure defined on a field of a file (or multiple fields of a file). This index is a file including many entries. Each entry is . Field values: able to be ordered. Pointer(s): record pointers or block pointers to the data file. The field is called an indexing field. The index file is ordered with the field values. Binary search is applied on the index file with the conditions =, >, <, ≥, ≤, between on the indexing field. 8
- 3.1. Types of Single-level Ordered Indexes The index file usually occupies considerably less disk blocks than the data file because the number of the entries in the index file is much smaller. A binary search on the index file yields a pointer to the file record. Indexes are characterized as dense or sparse. A dense index has an index entry for every field value (and hence every record) in the data file. A sparse (or non-dense ) index has index entries for only some field values. The previous example is a non-dense index. 10
- 3.1. Types of Single-level Ordered Indexes Primary indexes An ordered file whose records are of fixed length with two fields: The first field is of the same data type as the ordering key field—called the primary key—of the data file. The second field is a pointer to a disk block. There is one index entry (or index record) in the index file for each block in the data file. Each index entry has the value of the primary key field for the first record in a block and a pointer to that block as its two field values: . The first record in each block of the data file is called the anchor record of the block, or simply the block anchor. 12
- 3.1. Types of Single-level Ordered Indexes Primary indexes A primary index is a nondense (sparse) index. Why? The index file for a primary index occupies a much smaller space than does the data file. Why? Given the value K of its primary key field, a binary search is used on the index file to find the appropriate index entry i, and then retrieve the data file block whose address is P(i). A major problem with a primary index is insertion and deletion of records. Why and how to solve it? 14
- 3.1. Types of Single-level Ordered Indexes Clustering indexes Also defined on an ordered data file with an ordering non-key field. Each index entry for each distinct value of the field The index entry points to the first data block that contains records with that field value. It is another example of non-dense index. Record insertion and deletion cause problems. Why? To alleviate the problem of insertion, it is common to reserve a whole block (or a cluster of contiguous blocks) for each value of the clustering field. All records with that value are placed in the block (or block cluster). 16
- Clustering indexes - Non-dense - Block anchor A clustering index on the Dept_number ordering nonkey field of an EMPLOYEE file: a separate block cluster for each group of records that share the same value for the clustering field Figure 17.3, pp. 608, [1] 18
- 3.1. Types of Single-level Ordered Indexes Secondary indexes A secondary index provides a secondary means of accessing a file for which some primary access already exists. a secondary index may be defined on a field which is a candidate key and has a unique value in every record, or a nonkey with duplicate values. The index is an ordered file with two fields. The first field is of the same data type as some nonordering field of the data file that is an indexing field. The second field is either a block pointer or a record pointer. A secondary index is a dense index. Why? 20
- Secondary indexes Example 3: given the following data file with the non-ordering key field SSN EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ) record size R=150 bytes block size B=512 bytes number of records r=30,000 records blocking factor bfr= B div R= B/R = 512 div 150= 3 records/block number of file blocks b = r/bfr = 30,000/3 = 10,000 blocks For a secondary index on the SSN field, assume the field size VSSN=9 bytes, assume the block pointer size P=6 bytes, the record pointer size PR=7 bytes. index entry size Ri=VSSN+ P=9+6=15 bytes index blocking factor bfri= B div Ri= 512 div 15= 34 entries/block number of index entries ri = number of file records r = 30,000 entries number of index blocks bi= ri/ bfri = 30,000/34 = 883 blocks binary search on the index needs log2bi = log2883 = 10 block accesses one extra block access to retrieve the record from the data file The total search cost via the index is: 10 + 1 = 11 block accesses Because the data file is not ordered according to the values of SSN, an average linear search is done directly on the data file with the cost: b/2 = 10,000/2 = 5,000 block accesses 22
- Index file Field value Record pointer Secondary 1 1 indexes 1 2 A secondary 2 3 index (with 3 record pointers) 3 3 on a nonkey field 3 implemented 4 4 using option (1) 5 5 5 5 6 6 6 6 6 8 8 8 24
- Secondary indexes A secondary index (with record pointers) on a nonkey field implemented using one level of indirection in option (3) so that index entries are of fixed length and have unique field values. Figure 17.5, pp. 612, [1] 26
- Secondary indexes using implementation with option (3) Example 4: given the file with the non-ordering non-key field DEPT_NUMBER EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, , DEPT_NUMBER ) record size R=150 bytes; block size B=512 bytes number of records r=30,000 records It is assumed that there are 125 distinct values of the DEPT_NUMBER field and even distribution across DEPT_NUMBER values. To retrieve the first record from the data file, given a DEPT_NUMBER value: binary search on the index needs log2bi = log23 = 2 block accesses one extra block access to have access to the indirection level one extra block access to retrieve the first record from the data file The total search cost via the index is: 2 + 1 + 1 = 4 block accesses To retrieve all the records with the indirection level and even distribution, given a DEPT_NUMBER value: binary search on the index needs log2bi = log23 = 2 block accesses number of block accesses to the blocks in the indirection level: 4 block accesses, number of block accesses to the data file: 30,000/125 = 240 block accesses The total cost to retrieve all the records = 2 + 4 + 240 = 246 block accesses 28
- 3.1. Types of Single-level Ordered Indexes Table 17.2, pp. 613, [1] Properties of Index Types 30
- 3.2. Multilevel Indexes A binary search is applied to the single- level ordered index file to locate pointers to a disk block or to a record (or records) in the file having a specific index field value. A binary search requires approximately (log2bi) block accesses for an index with bi blocks because each step reduces the part of the index file that we continue to search by a factor of 2. For much faster search, a multilevel index reduces the part of the index that we continue to search by bfri, the blocking factor for the index, called the fan-out fo. 32
- Multilevel Indexes A two-level primary index resembling ISAM (Indexed Sequential Access Method) organization Figure 17.6, pp. 615, [1] 34
- 3.2. Multilevel Indexes SSN 1 5 SSN 2 1 . 2 . . . . SSN . . . . . . . . . . . . SSN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Top-level index Second-level index First-level index . (primary) (primary) (primary, clustering, . secondary) . (Multilevel) Index file Data file 36
- 3.2. Multilevel Indexes SSN 1 5 SSN 2 1 . 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . First-level index . (primary, clustering, . secondary) . Data file 38
- 3.2. Multilevel Indexes SSN 1 5 SSN 2 1 . 2 . . . . SSN . . . . . . . . . . . . SSN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Top-level index Second-level index First-level index . (primary) (primary) (primary, clustering, . secondary) . Data file 40
- 3.2. Multilevel Indexes (4) SSN 1 5 Search SSN (3) 2 1 . 2 . . (1) . . (2) SSN . . . . . . . . . . . . SSN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Top-level index Second-level index First-level index . (primary) (primary) (primary, clustering, . secondary) . (Multilevel) Index file Data file 42
- A node in a search tree with pointers to subtrees below it. q ≤ p where p is the tree order. Figure 17.8, pp. 618, [1] A search tree of order p = 3 Figure 17.9, pp. 619, [1] 44
- 3.3. Dynamic Multilevel Indexes Using B-Trees and B+-Trees B-Tree and B+-Tree Each node is stored in a disk block. Each node is kept between 50 and 100 percent full. All the leaf nodes are at the same level. In B-tree, pointers to the data blocks are stored in both internal nodes and leaf nodes of the B-tree structure. B+-tree is a variation of B-tree. Pointers to the data blocks are stored only in leaf nodes. The leaf nodes are usually linked to each other. 46
- (a). A node in a B-tree with q − 1 search values. (b). A B-tree of order p = 3. B-tree structures The values were inserted in the order 8, 5, 1, 7, Figure 17.10, pp. 621, [1] 3, 12, 9, 6. 48
- (a) Internal node of a B+-tree with q−1 search values. (b) Leaf node of a B+-tree with q−1 search values and q−1 data pointers. The nodes of a B+-tree Figure 17.11, pp. 623, [1] 50
- 3.3. Dynamic Multilevel Indexes Using B-Trees and B+-Trees The structure of the leaf nodes of a B+-tree of order pleaf: 1. Each leaf node is of the form , , , , Pnext> where q ≤ pleaf, each Pri is a data pointer, and Pnext points to the next leaf node. 2. Within each leaf node, K1 ≤ K2 , Kq−1, q ≤ pleaf. 3. Each Pri is a data pointer that points to the record whose search field value is Ki or to a file block containing the record (or to a block of record pointers that point to records whose search field value is Ki if the search field is not a key). 4. Each leaf node has at least ⎡pleaf/2⎤ values. 5. All leaf nodes are at the same level. 52
- 3.3. Dynamic Multilevel Indexes Using B-Trees and B+-Trees In the following examples, the capacity index of B-tree and that of B+-tree are examined with the same number of levels. Given the parameters: Disk block size B = 512 bytes Search field size V = 9 bytes Block pointer P = 6 bytes Record (data) pointer Pr = 7 bytes Each node in both B-tree and B+-tree is 69% full. 54
- 3.3. Dynamic Multilevel Indexes Using B-Trees and B+-Trees Consider a B-tree: order p and 69% full nodes On the average, each node has: p*0.69 = 23*0.69 = 15.87 ≈ 16 16 pointers and 15 search values The average fan-out fo = 16. The capacity of a three-level B-tree is: Root: 1 node 15 entries 16 pointers Level 1: 16 nodes 240 entries 256 pointers Level 2: 256 nodes 3,840 entries 4,096 pointers Level 3: 4,096 nodes 61,440 entries The total number of index entries = 15 + 240 + 3,840 + 61,440 = 65,535 entries 56
- 3.3. Dynamic Multilevel Indexes Using B-Trees and B+-Trees Consider a B+-tree: orders p, pleaf, and also 69% full nodes Each internal node has: 34*0.69 ≈ 23 pointers. Each leaf node has: 31*0.69 ≈ 21 data pointers. Fan-out fo = 23 and 21 for internal and leaf nodes The capacity of a three-level B+-tree: Root: 1 node 22 entries 23 pointers Level 1: 23 nodes 506 entries 529 pointers Level 2: 529 nodes 11,638 entries 12,167 pointers Leaf level: 12,167 nodes 255,507 data pointers The total number of index entries = 255,507 entries 58
- 3.3. Dynamic Multilevel Indexes Using B-Trees and B+-Trees Search for a Record with Search Key Field Value K, using B+-tree (Algorithm 17.2, pp. 625-626, [1]) 60
- 3.3. Dynamic Multilevel Indexes Using B-Trees and B+-Trees Search using B-trees and B+-trees Search conditions on indexing attributes =, , ≤, ≥, between, MINIMUM value, MAXIMUM value Search results Zero, one, or many data records Search cost B-trees . From 1 to (1 + the number of tree levels) + data accesses B+-trees . 1 (root level) + the number of tree levels + data accesses Logically ordering for a data file 62
- 3.3. Dynamic Multilevel Indexes Using B-Trees and B+-Trees Insert a Record with Search Key Field Value K in a B+-Tree of Orders p, pleaf Search a leaf node n in which (K, Pr) is inserted If the leaf node n is not full, then insert (K, Pr). Otherwise, the node overflows and must be split. The first j = ⎡(pleaf + 1)/2⎤ entries in the original node are kept there, and the remaining entries are moved to a new leaf node. The jth search value is replicated in the parent internal node, and an extra pointer to the new node is created in the parent. These must be inserted in the parent node in their correct sequence. If the parent internal node is full, the new value will cause it to overflow also, so it must be split. 64
- 3.3. Dynamic Multilevel Indexes Using B-Trees and B+-Trees A B+-tree with orders: p = 4 and pleaf = 3, half full For simplicity, data pointers are not included in the leaf nodes. Insertion: 47, 58 66
- 3.3. Dynamic Multilevel Indexes Using B-Trees and B+-Trees K=58 37|58 j = ⎣(p + 1)/2⎦ 48 60|71 48|58|60|71 46|47|48 56|58|59|60 56|58 59|60 j = ⎡(pleaf + 1)/2⎤ A B+-tree with orders: p = 4 and pleaf = 3, half full For simplicity, data pointers are not included in the leaf nodes. Insertion: 47, 58 68
- 3.3. Dynamic Multilevel Indexes Using B-Trees and B+-Trees Delete a Record with Search Key Field Value K in a B+-Tree of Orders p, pleaf Search a leaf node n from which (K, Pr) is removed If found, it is always removed from the leaf level. If it happens to occur in an internal node, it must also be removed from there. The value to its left in the leaf node must replace it in the internal node because that value is now the rightmost entry in the subtree. Deletion may cause underflow by reducing the number of entries in the leaf node to below the minimum required. 70
- 3.3. Dynamic Multilevel Indexes Using B-Trees and B+-Trees 37 58 15 21 48 60 71 10 15 16 18 21 23 37 46 47 48 56 58 59 60 65 71 74 78 92 A B+-tree with orders: p = 4 and pleaf = 3, half full For simplicity, data pointers are not included in the leaf nodes. Underflow: number of tree pointers < ⎡p/2⎤ = 2 number of data pointers < ⎡pleaf/2⎤=2 Deletion: 48, 58 72
- 3.3. Dynamic Multilevel Indexes Using B-Trees and B+-Trees 37 58 Delete: 58 15 21 47 60 71 10 15 16 18 21 23 37 46 47 56 58 59 60 65 71 74 78 92 K=58 56 15 21 37 60 71 10 15 16 18 21 23 37 46 47 56 59 60 65 71 74 78 92 74
- An example of deletion in a B+-tree with p = 3 and pleaf = 2. Original deletion: 5, 12, 9 Update this B+-tree with more deletions: 6, 1 76
- 3.5. Other File Indexes Hash indexes The hash index is a secondary structure to access the file by using hashing on a search key other than the one used for the primary data file organization. Bitmap indexes A bitmap index is built on one particular value of a field (the column in a table) with respect to all the rows (records) and is an array of bits. Function-based indexes In Oracle, an index such that the value that results from applying a function (expression) on a field or some fields becomes the key to the index 78
- Hash-based indexing a hashing function: the sum of the digits of Emp_id modulo 10 Figure 17.15, pp. 634, [1] 80
- 3.5. Other File Indexes Bitmap indexes – Adapted examples in Oracle - E25789-01 customers Table cust_id cust_last_name cust_marital_status cust_gender 1 Kessel M 2 Koch F 3 Emmerson M 4 Hardy M 5 Gowen M 6 Charles single F 7 Ingram single F Sample bitmaps Value Row 1 Row 2 Row 3 Row 4 Row 5 Row 6 Row 7 M 1 0 1 1 1 0 0 F 0 1 0 0 0 1 1 single 0 0 0 0 0 1 1 divorced 0 0 0 0 0 0 0 82
- 3.5. Other File Indexes Function-based indexes The use of any function on a column prevents the index defined on that column from being used. Indexes are only used with some specific search conditions on indexed columns. In Oracle, a function-based index is an index such that the value that results from applying some function (expression) on a field or a collection of fields becomes the key to the index. A function-based index can be either a B-tree or a bitmap index. 84
- 3.5. Other File Indexes Common statements for index creation - UNIQUE is used to guarantee that no two rows of a table have duplicate values in the key column or column. - CLUSTER is used when the index to be created should also sort the data file records on the indexing attribute. - Specifying CLUSTER on a key (unique) attribute would create some variation of a primary index, whereas specifying CLUSTER on a nonkey (nonunique) attribute would create some variation of a clustering index. 86
- 3.6. Indexes in Today‘s DBMSs Internal structure of a B-tree index in Oracle 19c 88
- Summary Indexes: additional access structures Created on one or many fields of a data file, called indexing fields Ordering key field => Primary indexes Ordering non-key field => Clustering indexes Non-ordering field => Secondary indexes Also stored in index files on disk Single-level vs. Multilevel indexes Dynamic multilevel index structures: B-tree, B+-tree Support certain search conditions =, >, >=, <, <=, and ―between‖ on indexing fields 90
- Check for understandings 3.1. What are indexes? Give at least three examples. 3.2. What are primary, secondary, and clustering indexes? Give at least one example for each. 3.3. Compare primary, secondary, and clustering indexes with each other. Which are dense and which are not? Explain the characteristics in their corresponding data file that make them dense or sparse. 92
- Check for understandings 3.7. Given a data file of 15 SSN Name Dept employee records in 8 blocks 2 F 2 5 AA 5 with bfr = 2 records/block: 8 U 4 These records are in ascending 10 C 2 order of SSN values. 11 L 2 15 EG 1 SSN is a primary key. 18 B 4 Name is a candidate key. 19 FL 5 Dept is a foreign key. 21 N 5 23 Y 5 Build a B-tree index on each 28 D 1 field: SSN, Name, Dept, 34 O 2 35 NM 3 given order p = 3. What is 36 T 5 each index called? 40 P 7 94
- Check for understandings 3.9. Given a data file of 15 SSN Name Dept employee records in 8 blocks 2 F 2 5 AA 5 with bfr = 2 records/block: 8 U 4 These records are in ascending 10 C 2 order of SSN values. 11 L 2 15 EG 1 SSN is a primary key. 18 B 4 Name is a candidate key. 19 FL 5 Dept is a foreign key. 21 N 5 23 Y 5 How are those trees updated 28 D 1 if departments 2 and 7 are 34 O 2 35 NM 3 merged to be 10? 36 T 5 40 P 7 96
- Check for understandings 3.11. Given a data file of 15 SSN Name Dept employee records in 8 blocks 2 F 2 5 AA 5 with bfr = 2 records/block: 8 U 4 These records are in ascending 10 C 2 order of SSN values. 11 L 2 15 EG 1 SSN is a primary key. 18 B 4 Name is a candidate key. 19 FL 5 Dept is a foreign key. 21 N 5 23 Y 5 How are those trees updated 28 D 1 if the records of employees 34 O 2 35 NM 3 18 and 11 are removed? 36 T 5 40 P 7 98