Database Systems - Lec 10: Data Storage, Indexing Structures for Files - Nguyen Thanh Tung
§Data Storage
•Disk Storage Devices
•Files of Records
•Operations on Files
•Unordered Files
•Ordered Files
•Hashed Files
•RAID Technology
§Indexing Structures for Files
•Types of Single-level Ordered Indexes
•Multilevel Indexes
•Dynamic Multilevel Indexes Using B-Trees and B+-Trees
•Indexes on Multiple Keys
§
Bạn đang xem 20 trang mẫu của tài liệu "Database Systems - Lec 10: Data Storage, Indexing Structures for Files - Nguyen Thanh Tung", để tải tài liệu gốc về máy hãy click vào nút Download ở trên.
File đính kèm:
- database_systems_lec10_data_storage_indexing_structures_for.ppt
Nội dung text: Database Systems - Lec 10: Data Storage, Indexing Structures for Files - Nguyen Thanh Tung
- Data Storage, Indexing Structures for Files 2
- Disk Storage Devices ▪ Preferred secondary storage device for high storage capacity and low cost. ▪ Data stored as magnetized areas on magnetic disk surfaces. ▪ A disk pack contains several magnetic disks connected to a rotating spindle. ▪ Disks are divided into concentric circular tracks on each disk surface. • Track capacities vary typically from 4 to 50 Kbytes or more 4
- Disk Storage Devices (contd.) 6
- Disk Storage Devices (contd.) 8
- Blocking ▪ Blocking: • Refers to storing a number of records in one block on the disk. ▪ Blocking factor (bfr) refers to the number of records per block. ▪ There may be empty space in a block if an integral number of records do not fit in one block. ▪ Spanned Records: • Refers to records that exceed the size of one or more blocks and hence span a number of blocks. 10
- Files of Records (contd.) ▪ File records can be unspanned or spanned • Unspanned: no record can span two blocks • Spanned: a record can be stored in more than one block ▪ The physical disk blocks that are allocated to hold the records of a file can be contiguous, linked, or indexed. ▪ In a file of fixed-length records, all records have the same format. Usually, unspanned blocking is used with such files. ▪ Files of variable-length records require additional information to be stored in each record, such as separator characters and field types. • Usually spanned blocking is used with such files. 12
- Unordered Files ▪ Also called a heap or a pile file. ▪ New records are inserted at the end of the file. ▪ A linear search through the file records is necessary to search for a record. • This requires reading and searching half the file blocks on the average, and is hence quite expensive. ▪ Record insertion is quite efficient. ▪ Reading the records in order of a particular field requires sorting the file records. 14
- Ordered Files (contd.) 16
- Hashed Files ▪ Hashing for disk files is called External Hashing ▪ The file blocks are divided into M equal-sized buckets, numbered bucket0, bucket1, , bucketM-1. • Typically, a bucket corresponds to one (or a fixed number of) disk block. ▪ One of the file fields is designated to be the hash key of the file. ▪ The record with hash key value K is stored in bucket i, where i=h(K), and h is the hashing function. ▪ Search is very efficient on the hash key. ▪ Collisions occur when a new record hashes to a bucket that is already full. • An overflow file is kept for storing such records. • Overflow records that hash to each bucket can be linked together. 18
- Hashed Files (contd.) ▪ To reduce overflow records, a hash file is typically kept 70-80% full. ▪ The hash function h should distribute the records uniformly among the buckets • Otherwise, search time will be increased because many overflow records will exist. ▪ Main disadvantages of static external hashing: • Fixed number of buckets M is a problem if the number of records in the file grows or shrinks. • Ordered access on the hash key is quite inefficient (requires sorting the records). 20
- Parallelizing Disk Access using RAID Technology. ▪ Secondary storage technology must take steps to keep up in performance and reliability with processor technology. ▪ A major advance in secondary storage technology is represented by the development of RAID, which originally stood for Redundant Arrays of Inexpensive Disks. ▪ The main goal of RAID is to even out the widely different rates of performance improvement of disks against those in memory and microprocessors. 22
- Use of RAID Technology (contd.) 24
- Storage Area Networks (contd.) ▪ Advantages of SANs are: • Flexible many-to-many connectivity among servers and storage devices using fiber channel hubs and switches. • Up to 10km separation between a server and a storage system using appropriate fiber optic cables. • Better isolation capabilities allowing non-disruptive addition of new peripherals and servers. ▪ SANs face the problem of combining storage options from multiple vendors and dealing with evolving standards of storage management software and hardware. 26
- Indexes as Access Paths ▪ A single-level index is an auxiliary file that makes it more efficient to search for a record in the data file. ▪ The index is usually specified on one field of the file (although it could be specified on several fields) ▪ One form of an index is a file of entries , which is ordered by field value ▪ The index is called an access path on the field. 28
- Indexes as Access Paths (contd.) ▪ Example: Given the following data file: EMPLOYEE(NAME,SSN, ADDRESS,JOB,SAL, ) ▪ Suppose that: • record size R=150 bytes block size B=512 bytes r=30000 records ▪ Then, we get: • blocking factor Bfr= B div R= 512 div 150= 3 records/block • number of file blocks b= (r/Bfr)= (30000/3)= 10000 blocks 30
- Types of Single-Level Indexes ▪ Primary Index • Defined on an ordered data file • The data file is ordered on a key field • Includes one index entry for each block in the data file; the index entry has the key field value for the first record in the block, which is called the block anchor • A similar scheme can use the last record in a block. • A primary index is a nondense (sparse) index, since it includes an entry for each disk block of the data file and the keys of its anchor record rather than for every search value. 32
- Types of Single-Level Indexes ▪ Clustering Index • Defined on an ordered data file • The data file is ordered on a non-key field unlike primary index, which requires that the ordering field of the data file have a distinct value for each record. • Includes one index entry for each distinct value of the field; the index entry points to the first data block that contains records with that field value. • It is another example of nondense index where Insertion and Deletion is relatively straightforward with a clustering index. 34
- Another Clustering Index Example 36
- Example of a Dense Secondary Index 38
- Properties of Index Types 40
- A Two-level Primary Index 42
- A Node in a Search Tree with Pointers to Subtrees below It ▪ FIGURE 14.8 44
- Dynamic Multilevel Indexes Using B-Trees and B+-Trees ▪ Most multi-level indexes use B-tree or B+-tree data structures because of the insertion and deletion problem • This leaves space in each tree node (disk block) to allow for new index entries ▪ These data structures are variations of search trees that allow efficient insertion and deletion of new search values. ▪ In B-Tree and B+-Tree data structures, each node corresponds to a disk block ▪ Each node is kept between half-full and completely full 46
- Difference between B-tree and B+-tree ▪ In a B-tree, pointers to data records exist at all levels of the tree ▪ In a B+-tree, all pointers to data records exists at the leaf-level nodes ▪ A B+-tree can have less levels (or higher capacity of search values) than the corresponding B-tree 48
- The Nodes of a B+-tree ▪ FIGURE 14.11 The nodes of a B+-tree • (a) Internal node of a B+-tree with q –1 search values. • (b) Leaf node of a B+-tree with q – 1 search values and q – 1 data pointers. 50
- Q&A 52