Database Management Systems - Chapter 4: Query Processing and Optimization - Võ Thị Ngọc Châu

4.1. Introduction to Query Processing
 4.2. Translating SQL Queries into Relational Algebra
 4.3. Algorithms for External Sorting
 4.4. Algorithms for SELECT and JOIN Operations
 4.5. Algorithms for PROJECT and SET Operations
 4.6. Implementing Aggregate Operations and Outer
Joins
 4.7. Combining Operations using Pipelining
 4.8. Using Heuristics in Query Optimization
 4.9. Using Selectivity and Cost Estimates in Query
Optimization
 4.10. Overview of Query Optimization in Oracle
 4.11. Semantic Query Optimization 
pdf 140 trang xuanthi 2520
Bạn đang xem 20 trang mẫu của tài liệu "Database Management Systems - Chapter 4: Query Processing and Optimization - Võ Thị Ngọc Châu", để tải tài liệu gốc về máy hãy click vào nút Download ở trên.

File đính kèm:

  • pdfdatabase_management_systems_chapter_4_query_processing_and_o.pdf

Nội dung text: Database Management Systems - Chapter 4: Query Processing and Optimization - Võ Thị Ngọc Châu

  1. 4.5. Algorithms for PROJECT and SET Operations Algorithms for SET operations: Set operations : UNION, INTERSECTION, SET DIFFERENCE and CARTESIAN PRODUCT. CARTESIAN PRODUCT of relations R and S includes all possible combinations of records from R and S. The attributes of the result include all attributes of R and S. Cost analysis of CARTESIAN PRODUCT If R has n records and j attributes and S has m records and k attributes, the result relation will have n*m records and j+k attributes. CARTESIAN PRODUCT operation is very expensive and should be avoided if possible. 60
  2. Union: T  R  S sort the tuples in R and S using the same unique sort attributes; set i  1, j  1; while (i ≤ n) and (j ≤ m) do { if R(i) > S(j) then { output S(j) to T; set j  j+1 } elseif R(i) < S(j) then { output R(i) to T; set i  i+1 } else set j j+1 /* R(i) = S(j), so we skip one of the duplicate tuples */ } if (i ≤ n) then add tuples R(i) to R(n) to T; if (j ≤ m) then add tuples S(j) to S(m) to T; 62
  3. Difference T  R S sort the tuples in R and S using the same unique sort attributes; set i  1, j  1; while (i ≤ n) and (j ≤ m) do { if R(i) > S(j) then set j  j+1 elseif R(i) < S(j) then { output R(i) to T; /* R(i) has no matching S(j), so output R(i) */ set i  i+1 } else set i  i+1, j j+1 } if (i ≤ n) then add tuples R(i) to R(n) to T; 64
  4. 4.6. Implementing Aggregate Operations and Outer Joins Implementing Aggregate Operations: SUM, COUNT and AVG  1. For a dense index (each record has one index entry): apply the associated computation to the values in the index.  2. For a non-dense index: actual number of records associated with each index entry must be accounted for With GROUP BY: the aggregate operator must be applied separately to each group of tuples.  1. Use sorting or hashing on the group attributes to partition the file into the appropriate groups;  2. Compute the aggregate function for the tuples in each group. What if we have Clustering index on the grouping attributes? 66
  5. 4.6. Implementing Aggregate Operations and Outer Joins Implementing Outer Join: Modifying Join Algorithms: Nested Loop or Sort-Merge joins can be modified to implement outer join. e.g., for left outer join, use the left relation as outer relation and construct result from every tuple in the left relation. If there is a match, the concatenated tuple is saved in the result. However, if an outer tuple does not match, then the tuple is still included in the result but is padded with a null value(s). 68
  6. 4.7. Combining Operations using Pipelining Motivation  A query is mapped into a sequence of operations. Each execution of an operation produces a temporary result.  Generating and saving temporary files on disk is time consuming and expensive. Alternative:  Avoid constructing temporary results as much as possible  Pipeline the data through multiple operations - pass the result of a previous operator to the next without waiting to complete the previous operation Pipelining (stream-based processing) 70
  7. Introduction to Query Optimization Query optimization: the process of choosing a suitable execution strategy for processing a query. Goal: to arrive at the most efficient and cost-effective plan using the available information about the schema and the content of relations involved, and to do so in a reasonable amount of time  The chosen execution plan may not always be the most optimal plan possible!!! Heuristic rule-based vs. cost-based optimization 72
  8. 4.8. Using Heuristics in Query Optimization Example: For every project located in ‗Stafford‘, retrieve the project number, the controlling department number and the department manager‘s last name, address and birthdate. SQL query: SELECT P.NUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE FROM PROJECT AS P,DEPARTMENT AS D, EMPLOYEE AS E WHERE P.DNUM=D.DNUMBER AND D.MGRSSN=E.SSN AND P.PLOCATION='STAFFORD'; Relational algebra: PNUMBER, DNUM, LNAME, ADDRESS, BDATE(((σPLOCATION='STAFFORD'(PROJECT))  DNUM=DNUMBER (DEPARTMENT))  MGRSSN=SSN (EMPLOYEE)) 74
  9. 4.8. Using Heuristics in Query Optimization LESS EXPENSIVE??? NO Cartesian Product!!! An initial query tree Figure 19.1.a, [1], pp. 693 76
  10. 4.8. Using Heuristics in Query Optimization Example: Find the last names of employees born after 1957 who work on a project named ‗Aquarius‘ SELECT LNAME FROM EMPLOYEE, WORKS_ON,PROJECT WHERE PNAME = 'AQUARIUS' AND PNMUBER=PNO AND ESSN=SSN AND BDATE > '1957-12-31'; 78
  11. SELECT LNAME FROM EMPLOYEE, WORKS_ON,PROJECT WHERE PNAME = 'AQUARIUS' AND PNMUBER=PNO AND ESSN=SSN AND BDATE > '1957-12-31'; A previous query tree Apply the more restrictive SELECT operation first swap branches reduce the size of each temporary result 80
  12. SELECT LNAME FROM EMPLOYEE, WORKS_ON,PROJECT WHERE PNAME = 'AQUARIUS' AND PNMUBER=PNO AND ESSN=SSN AND BDATE > '1957-12-31'; A previous query tree Move PROJECT down the query tree Reduce the size of each intermediate result 82
  13. 4.8. Using Heuristics in Query Optimization General Transformation Rules for Relational Algebra Operations: 5. Commutativity of  (and ): The operation is commutative as the operation: R  S = S  R; R S = S R 6. Commuting σ with  (or ): If all the attributes in the selection condition c involve only the attributes of one of the relations being joined - say, R- the two operations can be commuted as follows : σc( R  S ) = (σc(R))  S Alternatively, if the selection condition c can be written as (c1 and c2), where condition c1 involves only the attributes of R and condition c2 involves only the attributes of S, the operations commute as follows: σc( R  S ) = (σc1(R))  (σc2(S)) 84
  14. 4.8. Using Heuristics in Query Optimization General Transformation Rules for Relational Algebra Operations: 8. Commutativity of set operations: The set operations  and ∩ are commutative but – is not. 9. Associativity of  , x,  , and ∩: These four operations are individually associative; that is, if q stands for any one of these four operations (throughout the expression), we have ( R q S ) q T = R q ( S q T ) 10. Commuting s with set operations: The s operation commutes with  , ∩ , and –. If q stands for any one of these three operations, we have sc ( R q S ) = (sc (R)) q (sc (S)) 86
  15. 4.8. Using Heuristics in Query Optimization General Transformation Rules for Relational Algebra Operations: 13. Pushing σ in conjunction with set difference: σc(R − S) = σc(R) – σc(S) However, σ may be applied to only one relation: σc(R – S) = σc(R) – S 14. Pushing σ to only one argument in ∩: If in the condition σc all attributes are from relation R, then: σc(R ∩ S) = σc(R) ∩ S 15. Some trivial transformations: If S is empty, then R ∪ S = R. If the condition c in σ is true for the entire R, then σ (R) = R. c c 88
  16. 4.8. Using Heuristics in Query Optimization A Heuristic Algebraic Optimization Algorithm: 5. Using rules 3, 4, 7, and 11 concerning the cascading of project and the commuting of project with other operations, break down and move lists of projection attributes down the tree as far as possible by creating new project operations as needed. Only those attributes needed in the query result and in subsequent operations in the query tree should be kept after each PROJECT operation. 6. Identify subtrees that represent groups of operations that can be executed by a single algorithm. 90
  17. 4.8. Using Heuristics in Query Optimization Summary of Heuristics for Algebraic Optimization: The main heuristic is to apply first the operations that reduce the size of intermediate results.  performing as early as possible SELECT operations to reduce the number of tuples and PROJECT operations to reduce the number of attributes—by moving SELECT and PROJECT operations as far down the tree as possible The SELECT and JOIN operations that are most restrictive—that is, result in relations with the fewest tuples or with the smallest absolute size—should be executed before other similar operations. Reordering the leaf nodes of the tree among themselves while avoiding Cartesian products, and adjusting the rest of the tree appropriately 92
  18. 4.8. Using Heuristics in Query Optimization How about nested subquery optimization? Case 1: a query block inside an outer query block Case 2: correlated nested queries 94
  19. 4.8. Using Heuristics in Query Optimization How about nested subquery optimization? SELECT Fname, Lname, Salary FROM EMPLOYEE E, DEPARTMENT D WHERE E.Dno = D.Dnumber AND D.Zipcode = 30332; 96
  20. 4.8. Using Heuristics in Query Optimization Query Execution Plans An execution plan for a relational algebra query consists of a combination of the relational algebra query tree and information about the access methods to be used for each relation as well as the methods to be used in computing the relational operators stored in the tree. Materialized evaluation: The result of an operation is stored as a temporary relation. Pipelined evaluation: as the result of an operator is produced, it is forwarded to the next operator in sequence. 98
  21. 4.9. Using Selectivity and Cost Estimates in Query Optimization Cost-based query optimization:  For a given subexpression in the query, there may be multiple equivalence rules that apply.  It is necessary to resort to some quantitative measure for evaluation of alternatives. By using the space and time requirements and reducing them to some common metric called cost, it is possible to devise some methodology for optimization.  Appropriate search strategies can be designed by keeping the cheapest alternatives and pruning the costlier alternatives.  The scope of query optimization is generally a query block. Various table and index access paths, join permutations (orders), join methods, group-by methods, etc. provide the alternatives from which the query optimizer must choose.  In a global query optimization, the scope of optimization is multiple query blocks. 100
  22. 4.9. Using Selectivity and Cost Estimates in Query Optimization Catalog Information Used in Cost Functions Information about the size of a file  number of records (tuples) (r)  record size (R)  number of blocks (b)  blocking factor (bfr) Information about indexes and indexing attributes of a file  Number of levels (x) of each multilevel index  Number of first-level index blocks (bI1)  Number of distinct values (d) of an attribute  Selectivity (sl) of an attribute (the fraction of records satisfying an equality condition on the attribute)  Selection cardinality (s) of an attribute (s = sl * r) 102
  23. 4.9. Using Selectivity and Cost Estimates in Query Optimization Cost Functions for SELECT: S4. Using an ordering index to retrieve multiple records For the comparison condition (>, >=, , =, or <=), CS6b= x + bI1/2 + r/2 (Half the file records are assumed to satisfy the condition.) 104
  24. Examples for SELECT EMPLOYEE: rE = 10,000 , bE = 2000 , bfrE = 5 Access paths:  1. A clustering index on SALARY, with levels xSALARY = 3 and average selection cardinality SSALARY = 20.  2. A secondary index on the key attribute SSN, with xSSN = 4 (SSSN = 1).  3. A secondary index on the nonkey attribute DNO, with xDNO= 2 and first-level index blocks bI1DNO= 4. There are dDNO = 125 distinct values for DNO, so the selection cardinality of DNO is SDNO = rE/dDNO = 80.  4. A secondary index on SEX, with xSEX = 1. There are dSEX = 2 values for the sex attribute, so the average selection cardinality is SSEX = rE/dSEX = 5000. 106
  25. Examples for SELECT OP3: σDNO=5 (EMPLOYEE)  CS1a = 2000  CS6a = xDNO + sDNO + 1 = 2 + 80 + 1 = 83 OP4: σDNO=5 AND SALARY>30000 AND SEX='F' (EMPLOYEE)  CS6a-DNO = 83  CS4-SALARY = xSALARY + b/2 = 3 + 2000/2 = 1003  CS6a-SEX = xSEX + sSEX + 1 = 1 + 5000 + 1 = 5002 => chose DNO=5 first and check the other conditions 108
  26. Examples for SELECT OP5: σESSN='123456789' AND PNO=10(WORKS_ON)  CS1a = bWORKS_ON Using a primary composite index on (ESSN, PNO)  CS8 = CS3a = x + 1 => choose the access path with a smaller cost: linear search or access via a primary composite index on (ESSN, PNO) 110
  27. 4.9. Using Selectivity and Cost Estimates in Query Optimization Cost Functions for JOIN: Join selectivity (js) js = | (R  C S) | / | R x S | = | (R  C S) | / (|R| * |S |) If condition C does not exist, js = 1. If no tuples from the relations satisfy condition C, js = 0. Usually, 0 <= js <= 1 Size of the result file after join operation | (R  C S) | = js * |R| * |S | Note: R  A=B S If A is a key of R, js ≤ 1/|R|. If B is a key of S, js ≤ 1/|S|. 112
  28. 4.9. Using Selectivity and Cost Estimates in Query Optimization Cost Functions for JOIN: J2. Single-loop join For a secondary index, CJ2a = bR + |R| * (xB+ sB + 1) + (js* |R|* |S|)/bfrRS For a clustering index, CJ2b = bR + |R| * (xB+ sB/bfrB ) + (js* |R|* |S|)/bfrRS For a primary index, CJ2c = bR + |R| * (xB+ 1) + (js* |R|* |S|)/bfrRS If a hash key exists for one of the two join attributes — B of S CJ2d = bR + |R| * h + (js* |R|* |S|)/bfrRS h: the average number of block accesses to retrieve a record, given its hash key value, h>=1 114
  29. Examples for JOIN Suppose that we have the EMPLOYEE file described in the previous example The DEPARTMENT file: rD = 125 and bD = 13 , xDNUMBER = 1, secondary index on MGRSSN of DEPARTMENT, sMGRSSN = 1, xMGRSSN = 2, jsOP8 = (1/|DEPARTMENT| ) = 1/125 , bfrED = 4 OP8: EMPLOYEE  DNO=DNUMBER DEPARTMENT OP9: DEPARTMENT  MGRSSN=SSN EMPLOYEE 116
  30. Assume that the DEPARTMENT file of rD = 125 and bD = 13 , xDNUMBER = 1, secondary index on MGRSSN of DEPARTMENT, s = 1, x = 2, Example MGRSSN MGRSSN jsOP9 = (1/IEMPLOYEEI ) = 1/rE = 1/10,000 , bfrED = 4 A secondary index on the key attribute SSN of EMPLOYEE, with xSSN = 4 (SSSN = 1). OP9: DEPARTMENT  MGRSSN=SSNEMPLOYEE  Method J1 (Nested loop) with Employee as outer: CJ1 = bE + bE * bD + (jsOP7 * rE * rD)/bfrED = 2000 + 2000 * 13 + ((1/10,000) * 10,000 * 125)/4 = 28,031.25 = 28,032  Method J1 (Nested loop) with Department as outer: CJ1 = bD + bE * bD + (jsOP7 * rE * rD)/bfrED = 13 + 13 * 2000 + ((1/10,000) * 10,000 * 125)/4 = 26,044.25 = 26,045  Method J2 (Single loop) with EMPLOYEE as outer loop: CJ2c = bE + rE * (xMGRSSN + sMGRSSN + 1) + (jsOP7 * rE * rD)/bfrED = 2000 + 10,000 * (2 + 1 + 1) + ((1/10,000) * 10,000 * 125)/4 = 42,031.25 = 42,032  Method J2 (Single loop) with DEPARTMENT as outer loop: CJ2a = bD + rD * (xSSN+ sSSN + 1) + (jsOP7 * rE * rD)/bfrED = 13 + 125 * (4 + 1 + 1) + ((1/10,000) * 10,000 * 125)/4 = 794.25 = 795 Chosen!!! 118
  31. 4.10. Overview of Query Optimization in DBMSs Displaying the System‘s Query Execution Plan 120
  32. Execution Plan in MS SQL Server 122
  33. 4.10. Overview of Query Optimization in Oracle Query optimizer  determines which execution plan is most efficient by considering several sources of information, including query conditions, available access paths, statistics gathered for the Oracle® Database Concepts system, and hints 11g Release 2 (11.2) Part Number E25789-01, 7 SQL 124
  34. 4.10. Overview of Query Optimization in Oracle Optimizer hints /*+ FIRST_ROWS(25) */  Suppose that an interactive application runs a query that returns 50 rows. User A wants the optimizer to generate a plan that gets the first 25 records as quickly as possible so that the user is not forced to wait. 126
  35. 4.11. Semantic Query Optimization A different approach to query optimization, used in combination with the techniques discussed previously, uses constraints specified on the database schema - such as unique attributes and other more complex constraints—to modify one query into another query that is more efficient to execute. With the inclusion of active rules and additional metadata in database systems, semantic query optimization techniques are being gradually incorporated into DBMSs. 128
  36. Summary Query processing includes several typical steps as follows:  Scanning, parsing, validating  Query optimizing  Query code generating  Runtime database processing Query optimization is a process of choosing a suitable execution plan for processing a query.  A reasonably efficient or the best available plan for executing the query  Heuristic optimizer vs Cost-based optimizer 130
  37. Summary Cost-based optimizer  Estimating cost for different execution plans and choosing the plan that minimizes estimated cost  Different database systems consider different cost components for a cost function.  The scope of query optimization is generally a query block. Various table and index access paths, join permutations (orders), join methods, group-by methods, and so on provide the alternatives from which the query optimizer must choose.  Catalog information used in cost functions Selectivity, cardinality, , other statistical information 132
  38. Your turn for query optimization 3. Suppose that: Supplier has rS = 500 records, bS = 100 blocks, bfrS = 5 records/block, one primary index B+-tree on Supp# with xSupp#=2 and one secondary index B+-tree on City with xCity = 2, dCity = 50 distinct values for City. Project has rP = 1,000 records, bP = 200 blocks, bfrP = 5 records/block, one primary index B+-tree on Proj# with xProj# = 2 and another secondary index B+-tree on Budget with xBudget=2 and first-level index blocks bI1Budget = 5. Order has rO = 20,000 records, bO = 5,000 blocks, bfrO = 4 records/block, a secondary index B+-tree on Supp# with xSupp# = 3 and another secondary index B+-tree on Proj# with xProj# = 3 and first-level index blocks bI1Proj# = 10. Blocking factor for join results bfrPO = 2, bfrSO = 2. What access paths should be for Budget>10000000(Project), City=‗New York City‘(Supplier), and for Project  Project.Proj#= Order.Proj# Order? 134
  39. Check for Understandings 4.1. List and describe typical steps when a query is processed. 4.2. Differentiate a query tree from a query graph. 4.3. Why does a SQL query need to be translated into relational algebra expressions? 4.4. Describe external sorting and calculate its cost. List some applications of sorting in query processing. 136
  40. Check for Understandings 4.11. Given queries as follows, for each query, write its corresponding SQL statement, draw its query tree, and then explain its processing to obtain the result. 4.11.1. Retrieve the last name and salary of each employee who works in department 10 and has a salary higher than 30,000. 4.11.2. Retrieve the last name and department number of each employee who works in the department where the minimum salary of the employees is higher than 30,000. 4.11.3. Retrieve the department name and department number of each department where more than 10 employees work with a salary higher than 30,000. 138
  41. Check for Understandings 4.17. Draw query trees step by step to obtain a final optimized query tree using heuristic optimization for each query in 4.11. 4.18. Using the characteristics of the EMPLOYEE and DEPARTMENT data files as described in the previous slides, describe an optimized execution plan based on a decision of the cost-based optimizer for each query in 4.11. 140