https://alexanderanokhin.com/2012/07/18/dont-forget-about-column-projection/
Note: this post is not about some exact statement, but about importance of column projection which should not be ignored, especially in cases as operations requiring workareas, data access optimization, Exadata offloading and others.
Let’s consider merge of two simple tables.
merge into t1 using t2 on (t1.id = t2.id) when matched then update set n = 1; |
The tables:
SQL> select * from v$version where rownum = 1; BANNER -------------------------------------------------------------------------------- Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production SQL > create table t1 compress 2 as 3 select rownum as id, 4 0 as n, 5 lpad('*', 4000, '*') as pad 6 from dual 7 connect by level <= 1000000; Table created. SQL > create table t2 compress 2 as 3 select 1000000 + rownum as id, 4 1 as n, 5 lpad('*', 4000, '*') as pad 6 from dual 7 connect by level <= 1100000; Table created. SQL > exec dbms_stats.gather_table_stats(user, 'T1', estimate_percent => 100, degree => 4); PL/SQL procedure successfully completed. SQL > exec dbms_stats.gather_table_stats(user, 'T2', estimate_percent => 100, degree => 4); PL/SQL procedure successfully completed. SQL > alter session set statistics_level=all; Session altered. |
And a simple statement:
SQL > merge into t1 2 using t2 on (t1.id = t2.id) 3 when matched 4 then update set n = 1; 0 rows merged. Elapsed: 00:08:41.93 |
Let’s look at the execution plan with rowsource statistics:
SQL > select * from table(dbms_xplan.display_cursor(null, null, 'allstats last')); ------------------------------------- SQL_ID b5cp092vum9nw, child number 0 ------------------------------------- merge into t1 using t2 on (t1.id = t2.id) when matched then update set n = 1 Plan hash value: 3423882595 ---------------------------------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | Writes | OMem | 1Mem | Used-Mem | Used-Tmp| ---------------------------------------------------------------------------------------------------------------------------------------------- | 0 | MERGE STATEMENT | | 1 | | 0 |00:08:41.66 | 52742 | 608K| 556K| | | | | | 1 | MERGE | T1 | 1 | | 0 |00:08:41.66 | 52742 | 608K| 556K| | | | | | 2 | VIEW | | 1 | | 0 |00:08:41.66 | 52742 | 608K| 556K| | | | | |* 3 | HASH JOIN | | 1 | 1 | 0 |00:08:41.66 | 52742 | 608K| 556K| 2047M| 56M| 30M (1)| 4346K| | 4 | TABLE ACCESS FULL| T1 | 1 | 1000K| 1000K|00:00:05.96 | 22174 | 22163 | 0 | | | | | | 5 | TABLE ACCESS FULL| T2 | 1 | 1100K| 1100K|00:00:05.58 | 30568 | 30556 | 0 | | | | | ---------------------------------------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 3 - access("T1"."ID"="T2"."ID") |
So, our statement is executing more than 8 minutes. All this time has been consumed on the step 3 – HASH JOIN. Notice amount of used temp space 4346 Mb (this is old oddity that value in Used-Temp should be multiplied by 1000) as a result high amount of I/O operations on temp space (columns Reads, Writes – 556K of physical reads and 556K physical writes via direct path read temp and direct path write temp) due to size of workarea (hash_area_size) was not enough (Used-Mem = 30Mb).
Note: obviously in your case it can be absolutely another time and amount of I/O, in particular it depends on hash area size, in my case it was about 30MbHow can we improve the performance of the statement and especially hash join?
Should we increase hash area size?May be, but before let’s take look at the same execution plan with column projection:
SQL> select * from table(dbms_xplan.display_cursor('b5cp092vum9nw', null, 'allstats last +projection')); ------------------------------------- SQL_ID b5cp092vum9nw, child number 0 ------------------------------------- merge into t1 using t2 on (t1.id = t2.id) when matched then update set n = 1 Plan hash value: 3423882595 ---------------------------------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | Writes | OMem | 1Mem | Used-Mem | Used-Tmp| ---------------------------------------------------------------------------------------------------------------------------------------------- | 0 | MERGE STATEMENT | | 1 | | 0 |00:08:41.66 | 52742 | 608K| 556K| | | | | | 1 | MERGE | T1 | 1 | | 0 |00:08:41.66 | 52742 | 608K| 556K| | | | | | 2 | VIEW | | 1 | | 0 |00:08:41.66 | 52742 | 608K| 556K| | | | | |* 3 | HASH JOIN | | 1 | 1 | 0 |00:08:41.66 | 52742 | 608K| 556K| 2047M| 56M| 30M (1)| 4346K| | 4 | TABLE ACCESS FULL| T1 | 1 | 1000K| 1000K|00:00:05.96 | 22174 | 22163 | 0 | | | | | | 5 | TABLE ACCESS FULL| T2 | 1 | 1100K| 1100K|00:00:05.58 | 30568 | 30556 | 0 | | | | | ---------------------------------------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 3 - access("T1"."ID"="T2"."ID") Column Projection Information (identified by operation id): ----------------------------------------------------------- 1 - SYSDEF[4], SYSDEF[32720], SYSDEF[1], SYSDEF[96], SYSDEF[32720] 3 - (#keys=1) "T1"."ID"[NUMBER,22], "T2"."ID"[NUMBER,22], "T1".ROWID[ROWID,10], "T1"."PAD"[VARCHAR2,4000], "T1"."N"[NUMBER,22], "T2"."PAD"[VARCHAR2,4000], "T2"."N"[NUMBER,22] 4 - "T1".ROWID[ROWID,10], "T1"."ID"[NUMBER,22], "T1"."N"[NUMBER,22], "T1"."PAD"[VARCHAR2,4000] 5 - "T2"."ID"[NUMBER,22], "T2"."N"[NUMBER,22], "T2"."PAD"[VARCHAR2,4000] |
Pay attention on column projection of the steps 3, 4, 5. It contains ALL columns from the tables T1 and T2 although the query uses only some of them. The part of the execution plan below of the step 2 is equivalent to
select t1.rowid, t1.id, t1.n, t1.pad, t2.id, t2.n, t2.pad from t1, t2 where t1.id = t2.id |
It looks like a bug or “not implemented yet”. In our case it means that all columns from the table T1 (especially the fat column PAD) will be retrieved and stored in the workarea.
This can dramatically affect the performance:
For operations using workareas it means increased amount of dataset. If required amount of available memory is not enough then it leads to additional I/O in temp space.
In our particular case all columns from the table T1 are retrieved and put into the workarea. If the condition in the statement would be something liket1.id between t2.id and t2.id + 100
then Hash Join would be impossible, because it can be based on “equality” condition. In that case we can expect Merge Join and then both datasests must be sorted.
1234567891011121314151617181920212223242526272829SQL> explain plan for
2 merge into t1
3 using t2
4 on (t1.id between t2.id and t2.id + 100)
5 when matched then
6 update set n = 1;
Explained
SQL> @plan
---------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
---------------------------------------------------------------------------------------
| 0 | MERGE STATEMENT | | 1 | 2 | | 1766K (1)| 05:53:21 |
| 1 | MERGE | T1 | | | | | |
| 2 | VIEW | | | | | | |
| 3 | MERGE JOIN | | 1 | 8018 | | 1766K (1)| 05:53:21 |
| 4 | SORT JOIN | | 1000K| 3822M| 7812M| 840K (1)| 02:48:06 |
| 5 | TABLE ACCESS FULL | T1 | 1000K| 3822M| | 4928 (1)| 00:01:00 |
|* 6 | FILTER | | | | | | |
|* 7 | SORT JOIN | | 1100K| 4206M| 8593M| 926K (1)| 03:05:16 |
| 8 | TABLE ACCESS FULL| T2 | 1100K| 4206M| | 6779 (1)| 00:01:22 |
---------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
6 - filter("T1"."ID"<="T2"."ID"+100)
7 - access(INTERNAL_FUNCTION("T1"."ID")>=INTERNAL_FUNCTION("T2"."ID"))
filter(INTERNAL_FUNCTION("T1"."ID")>=INTERNAL_FUNCTION("T2"."ID"))
It affects data access optimization. If there would be indexes containing all columns using in a query as t1(id, n) or t2(id,n), then Index Fast Full Scan will be impossible here because the indexes do not contain required redundant columns (as PAD).
If there would be a condition doing Index Range Scan appropriate access path then TABLE ACCESS BY INDEX ROWID would exist to get rest of unnecessary columns.If a row is chained then it leads to additional logical (and maybe physical) I/O to read redundant columns
If the query would be run on Exadata and one or both Full Table Scan would be offloading, then there would no Column Projection optimization, returned amount of data from Storage Cells to PGA would be redundant and contains unnecessary columns from a table.
etc
It is what can happen if we ignore Column Projection.
Our particular case can be fixed easy
SQL > merge into (select id, n from t1) t1 2 using (select id, n from t2) t2 on (t1.id = t2.id) 3 when matched 4 then update set n = 1; 0 rows merged. Elapsed: 00:00:20.09 |
upd note: (select id, n from t2) is redundant here. It could be (select id from t2).
The execution plan with rowsurce statistics and column projection:
---------------------------------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads | Writes | OMem | 1Mem | Used-Mem | Used-Tmp| ---------------------------------------------------------------------------------------------------------------------------------------------- | 0 | MERGE STATEMENT | | 1 | | 0 |00:00:19.83 | 52742 | 35651 | 2032 | | | | | | 1 | MERGE | T1 | 1 | | 0 |00:00:19.83 | 52742 | 35651 | 2032 | | | | | | 2 | VIEW | | 1 | | 0 |00:00:19.83 | 52742 | 35651 | 2032 | | | | | |* 3 | HASH JOIN | | 1 | 1 | 0 |00:00:19.83 | 52742 | 35651 | 2032 | 36M| 6589K| 34M (1)| 16384 | | 4 | TABLE ACCESS FULL| T2 | 1 | 1100K| 1100K|00:00:11.77 | 30568 | 28841 | 0 | | | | | | 5 | TABLE ACCESS FULL| T1 | 1 | 1000K| 1000K|00:00:04.44 | 22174 | 4778 | 0 | | | | | ---------------------------------------------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 3 - access("ID"="ID") Column Projection Information (identified by operation id): ----------------------------------------------------------- 1 - SYSDEF[4], SYSDEF[32720], SYSDEF[1], SYSDEF[96], SYSDEF[32720] 3 - (#keys=1) "ID"[NUMBER,22], "ID"[NUMBER,22], "N"[NUMBER,22], "T1".ROWID[ROWID,10], "N"[NUMBER,22] 4 - "ID"[NUMBER,22], "N"[NUMBER,22] 5 - "T1".ROWID[ROWID,10], "ID"[NUMBER,22], "N"[NUMBER,22] |
Now column projection does not contain unnecessary columns, amount of required memory and as a result amount of temp space (Used-Tmp ~ 16Mb) and I/O (2032 physical reads and 2032 physical writes via direct path read temp and direct path write temp) was significantly reduced, and the statement was executed about 20 seconds instead of 8 minutes.
Notice, that in this case Oracle takes the table T2 as build input. It is because RowID from the table T1 is required to perform MERGE.
The part of the execution plan below of the step 2 now looks like select t1.id, t1.n, t1.rowid, t2.id, t2.n from t1, t2 where t1.id = t2.id |
It estimates that amount of resultset
1100K rows * ("ID"[NUMBER,22], "N"[NUMBER,22]) |
less than
1000K rows * ("T1".ROWID[ROWID,10], "ID"[NUMBER,22], "N"[NUMBER,22]) |
and that resultset requires less amount of workarea.
Thus it is reasonable to use T2 as build input here.