
Understanding Parallel Execution

Understanding Parallel Execution – Part 2

by Randolf Geist

Maximizing the performance of the Parallel Execution feature in Oracle Database Enterprise Edition

November 2012

Read Part 1
Read Part 2

In order to get the most out of Parallel Execution, the following prerequisites need to be met:

  • Efficient Parallel Execution plan
  • Sufficient resources available at execution time
  • No significant skew of work distribution across Parallel Slaves

Efficient Parallel Execution plans were discussed in Part 1. In Part 2 we'll look at the remaining two points in detail.

Sufficient Resources Available At Execution Time

There is a lot to say about this, but the following is important to understand:

CPU Usage and I/O Capabilities

In general, Parallel Execution only makes sense if the underlying system architecture or the specifics of the SQL statement itself allow saturating more than a single CPU. This is particularly relevant if the majority of the data needs to be read from disk. Often the efficiency of Parallel Execution is severely limited by the fact that the I/O subsystem cannot deliver the data fast enough, so that employing Parallel Execution simply leads to more processes waiting for data from disk rather than to a faster execution. This can be different if the data can actually be served from the Buffer Cache, which by default is not the case with Parallel Execution: only segments smaller than the number of blocks defined by the _small_table_threshold parameter will be read into the Buffer Cache when using Parallel Execution. An exception to that rule is the newly introduced In-Memory Parallel Execution feature, which attempts to combine Parallel Execution with Buffer Cache usage, in particular when using multiple instances in a RAC cluster configuration, by caching different parts of the segment on different nodes.
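For reference, the current value of that threshold can be checked with a query like the following sketch against the X$ tables (which normally requires a connection as SYS):

select
        i.ksppinm  as parameter_name
      , v.ksppstvl as parameter_value
from
        x$ksppi  i
      , x$ksppcv v
where
        i.indx = v.indx
and     i.ksppinm = '_small_table_threshold';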

I've come across systems where the disk throughput of a SQL statement execution could be scaled up by using Parallel Execution although the serial execution wasn't CPU constrained - at least not according to the available instrumentation - but mainly waiting for I/O. I usually wasn't able to determine the exact reason for the behaviour due to time constraints and access restrictions on such systems, but I want to point out that it makes sense to test this individually for your system architecture. It could simply be related to the fact that Parallel Execution usually employs direct path reads, which can make use of asynchronous I/O if available and configured properly.

Furthermore it is important to point out that there are cases where Parallel Execution can increase performance even when the disks don't provide what would normally be considered sufficient throughput. This can be the case if complex expressions, for example ones based on regular expressions, or user-defined PL/SQL functions are in use that saturate a single CPU even at a moderate data flow. Other cases include the use of compression, for example inserting data into a segment that is marked as compressed.
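As an illustration (the table and column names are made up), a scan like the following can keep one CPU per Parallel Slave busy evaluating the regular expression rather than waiting for I/O, so even a moderate DOP can reduce the elapsed time:

select  /*+ parallel(t 4) */
        count(*)
from    log_data t
where   regexp_like(t.message, '^(ERR|WARN)-[0-9]{4}:');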

Degree Of Parallelism (DOP) Considerations

As outlined in Part 1, the optimizer generates an execution plan based on certain assumptions. In particular, the optimizer is pretty optimistic regarding the performance gain achieved by performing an operation in parallel. Although recent releases include an artificial performance reduction factor of 10% in the cost calculation, the optimizer still assumes pretty much linear scaling: for example, specifying a parallel degree of 8 for a full table scan ends up with a cost that is close to 12.5% of the serial cost. Except on very high-end hardware configurations it is, however, very unlikely to achieve such almost linear scaling with higher degrees of parallelism, so care should be taken when determining a sensible DOP.
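As a rough sketch of the arithmetic (the exact formula is an assumption based on the commonly described cost model): the I/O cost of a parallel full table scan is approximately the serial I/O cost divided by (DOP * 0.9), so for a DOP of 8 this is the serial cost divided by 7.2 - close to, but slightly more than, one eighth of the serial cost.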

Default DOP

Note that the default DOP is in most cases way too high, since on most platforms it is based on the assumption of using a DOP of two times the number of available CPUs for a single statement execution. For non-trivial statements employing the Producer / Consumer model this means you'll end up with four times the number of CPUs in Parallel Slaves. If that number of Parallel Slaves really starts consuming CPU (it would be sufficient for each slave to consume only 25% CPU on average), you end up with 100% CPU usage of your whole configured system - according to the CPU_COUNT that Oracle is told by the operating system (since Oracle Database 11.2 you can set CPU_COUNT to a lower value for so-called "Instance Caging"). If your system for example reports 32 CPUs to Oracle, then the default DOP is determined by PARALLEL_THREADS_PER_CPU * CPU_COUNT * INSTANCES (for RAC). PARALLEL_THREADS_PER_CPU defaults to 2 on most platforms, so the default DOP per instance will be 64 in such a case. Note that you can set PARALLEL_THREADS_PER_CPU to a lower value, for example 1.
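For a single instance the resulting default DOP can be derived directly from the corresponding parameters, for example (a simple sketch; multiply by the number of instances for RAC):

select
        to_number(c.value) * to_number(t.value) as default_dop
from
        v$parameter c
      , v$parameter t
where
        c.name = 'cpu_count'
and     t.name = 'parallel_threads_per_cpu';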

This means that per DFO there could be up to two Parallel Slave Sets active, each using 64 Parallel Slaves per RAC instance, resulting in 128 processes per instance for this single SQL execution. It should be verified that such a high DOP really speeds up processing as expected - you might find out that a much lower DOP actually provides better performance, or at least similar performance with much lighter resource consumption.

This means that you need to configure your database objects with a sensible DOP or use statement hints instead of going with the default. Other options include introducing an artificial DOP limit via the Resource Manager and/or relying on the new Auto DOP feature (available since version 11.2).
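For example (the object names are illustrative), the DOP can be defined at the object level or per statement via a hint:

-- object level
alter table sales parallel 4;

-- statement level
select  /*+ parallel(s 4) */
        count(*)
from    sales s;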

DOP Downgrade At Execution Time - Workload Management

The other point to consider in this regard is that at execution time there might be various reasons why the statement runs with a lower DOP than assumed by the optimizer.

This is particularly relevant when dealing with multiple, concurrent Parallel Executions.

In the past you had limited choices:

  • Use the corresponding degree at execution time, which at a certain level of concurrent Parallel Execution would either run out of Parallel Slave processes, leading to downgrades of the execution (in the worst case a downgrade to serial), or overwhelm the system with too many Parallel Slave processes, potentially leading to CPU / OS scheduling starvation. The first situation is unsatisfying for the affected query since the actual execution time might be much longer than expected; the second one can be annoying since the system might become unresponsive for all users.
  • Use the PARALLEL_ADAPTIVE_MULTI_USER setting to have the execution engine automatically downgrade the DOP in case of concurrent Parallel Execution. This avoids the second scenario above but still allows the first one, which is particularly unsatisfying if many concurrent executions are ongoing when an execution starts but complete soon afterwards: the system then has sufficient resources available for a higher DOP, yet the execution keeps running with the downgraded DOP.

Since Oracle Database 11.2 there is the new Parallel Statement Queuing feature that attempts to address this problem by queuing concurrent Parallel Executions if insufficient Parallel Slaves are available (configurable via PARALLEL_SERVERS_TARGET). As soon as the required number of Parallel Slaves becomes available, the statement is executed with the intended DOP rather than being downgraded. This should lead to a much better usage of the available system resources.

Note that this feature is by default only available together with the new Auto DOP feature that automatically determines the Degree Of Parallelism.
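A minimal sketch of the relevant initialization parameters (the values are just examples):

-- enables Auto DOP and, with it, Parallel Statement Queuing
alter system set parallel_degree_policy = auto;

-- queuing starts once this many Parallel Slaves are already busy
alter system set parallel_servers_target = 64;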

As already mentioned, there are other reasons why an execution might not use the degree determined by the optimizer; for example, restrictions imposed by an active Resource Manager plan could lead to such a downgrade. Note that from 11.2 on the optimizer can work together with the Resource Manager to avoid such a situation, but not all scenarios are covered yet.

Other possibilities include concurrent DDL operations that invalidate cursors while the Parallel Query Coordinator acquires Parallel Slaves. Since some of the Parallel Slaves might be forced to re-optimize the invalidated cursor, the so-called "signature" of the cursor used by a Parallel Slave may no longer match the signature of the cursor used by the Parallel Query Coordinator, and in such cases the Parallel Slaves cannot be acquired. Oracle usually tries to cope with that situation by silently restarting the operation, but sometimes you might end up with either error ORA-12842 or a serial execution of the statement, depending on the version and exact circumstances.

DOP Downgrade

So the upshot of this point is that the runtime engine needs to follow the execution plan generated by the optimizer, and the optimizer based its decisions on a particular DOP.

If the DOP at execution time differs significantly from the DOP at optimization time, it is not unlikely that the resulting runtime might be much longer than expected, possibly even longer than a corresponding serial execution.

This can be caused by various factors:

  • Obviously there is not as much CPU power available as originally assumed when running at a lower DOP
  • Oracle won't be able to use a similar amount of PGA memory for SQL workareas, because only a particular amount of memory is allocated per Parallel Slave. This is particularly relevant for the BUFFERED operations discussed in Part 1: when executed with a much lower DOP, those BUFFER operations might start spilling to disk, leading to significantly increased runtimes and additional I/O operations
  • A completely different execution plan could have been favored by the optimizer if the optimization was based on the lower DOP that actually gets used at execution time

It is therefore crucial to determine the actual DOP used at execution time in order to understand whether a suddenly long-running query might have been caused by such downgrades. The actual DOP and the number of Parallel Slaves involved are easy to find out while a statement executes, even without the Real-Time SQL Monitoring feature, since you can simply query the available V$PX* views: for example, (G)V$PX_SESSION contains the QC* columns to identify the Query Coordinator, along with the SERVER_GROUP, REQ_DEGREE and DEGREE columns that allow answering that question easily.
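For example, a simple query like the following sketch shows the requested versus actual degree per Query Coordinator and slave set while the statement is running:

select
        qcinst_id
      , qcsid
      , server_group
      , server_set
      , req_degree
      , degree
      , count(*) as slave_count
from
        gv$px_session
where
        req_degree is not null
group by
        qcinst_id
      , qcsid
      , server_group
      , server_set
      , req_degree
      , degree
order by
        qcinst_id
      , qcsid
      , server_group
      , server_set;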

But if you need to do a post-execution analysis these views won't be of any help. Even with a Tuning Pack license the corresponding Real-Time SQL Monitoring report might already have been aged out of the Shared Pool and therefore no longer be available. And as already mentioned, for more complex Parallel Execution plans that involve multiple DFOs, Real-Time SQL Monitoring only reports the highest DOP used; it doesn't tell you easily which DOP was used for each DFO, although by looking at the various details provided (the "Executions" column in the "Plan Statistics" tab, or the number of Parallel Slaves per "Parallel Group" in the "Parallel" tab) you might be able to work it out.

We will therefore look at different approaches for doing such a post-execution analysis based on the Active Session History information available with a Diagnostic Pack license (a Tuning Pack license is not required for that purpose).

No Significant Skew Of Data Distribution Across Parallel Slaves

The last point is also important: even if the statement is able to run with the expected DOP at runtime, a Parallel Execution will only scale as expected if the work is distributed as evenly as possible among the Parallel Slaves involved.

However, there are scenarios where this distribution can be skewed, leading to unbalanced (and therefore lowered) usage of the involved Parallel Slaves, typically leaving many or even most of them idle. Another possible scenario is "temporal" skew, where the data arrives at the Parallel Slaves in such a way that only a few of them are active at any point in time, but in the end they have all done a similar amount of work and complete at around the same time. I won't cover this scenario here because in my experience it is far less common than unbalanced data distribution among the Parallel Slaves.

The most straightforward way to determine the distribution of data across the Parallel Slaves is the view V$PQ_TQSTAT. There are various problems with that view, however:

  1. It is only populated in the private memory of the coordinator session that executed the Parallel Execution, so you cannot access the relevant contents from any other session
  2. It is only available right after the execution
  3. It fails to produce sensible output for more complex Parallel Execution plans. In particular, execution plans that use Parallel DML / DDL and/or multiple DFOs will often lead to incomplete results that are not meaningful for determining skew
  4. It is only populated if the statement ran successfully to completion. If the statement fails (for example, it runs out of TEMP space) or gets cancelled, no information will be available.

So for simple Parallel Execution plans the view can be helpful during development and allows a pretty straightforward determination of whether the data distribution among the Parallel Slaves was skewed or not.
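The output below was generated with a simple query like this one, executed in the same session right after the statement completed:

select
        dfo_number
      , tq_id
      , server_type
      , num_rows
      , bytes
      , open_time
      , avg_latency
      , waits
      , timeouts
      , process
from
        v$pq_tqstat
order by
        dfo_number
      , tq_id
      , server_type desc
      , process;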

Here is an example output of a skewed Parallel Execution:

DFO_NUMBER      TQ_ID SERVER_TYP   NUM_ROWS      BYTES  OPEN_TIME AVG_LATENCY      WAITS   TIMEOUTS PROCESS   
---------- ---------- ---------- ---------- ---------- ---------- ----------- ---------- ---------- ----------
         1          0 Producer       500807   54396692          0           0         20          6 P008
         1          0 Producer       499193   54259853          0           0         10          2 P009
         1          0 Consumer       500038   54332349          0           0       2876       1284 P010
         1          0 Consumer       499962   54324196          0           0       2861       1230 P011
         1          1 Producer       499826   57267724          0           0         20          5 P008
         1          1 Producer       500174   57357253          0           0          9          1 P009
         1          1 Consumer       499933   57304789          0           0       2467       1208 P010
         1          1 Consumer       500067   57320188          0           0       2481       1069 P011
         1          2 Producer       462069   52963939          0           0         19          7 P010
         1          2 Producer       537931   61681312          0           0         22          9 P011
         1          2 Consumer       500212   57346574          0           0         53         13 P008
         1          2 Consumer       499788   57298677          0           0        346        160 P009
         1          3 Producer       500038   57337041          0           0          0          0 P010
         1          3 Producer       499962   57328456          0           0          0          0 P011
         1          3 Consumer            0         48          0           0       2586       1200 P008
         1          3 Consumer      1000000  114665449          0           0       2611       1261 P009
         1          4 Producer            0         24          0           0          0          0 P008
         1          4 Producer      1000000  116668401          0           0          0          0 P009
         1          4 Consumer      1000000  116668425          0           0       1018        435 QC


So, as you can see, for the so-called "Table Queue" TQ_ID = 3 the data was not evenly distributed. You can find the corresponding operation in the execution plan by looking at the TQ column.

There are various reasons for uneven work distribution among Parallel Slaves, but they have one thing in common: they are always related to the data, or rather to its distribution. Although some of the patterns could be recognized by the optimizer, I believe at present this information is not used as part of the execution plan generation.

Depending on the PX ITERATOR used (block or partition), it might simply be that the PARTITION granule was favoured by the optimizer but the data is not evenly distributed across the partitions.

Another common reason is that the HASH distribution method leads to an uneven distribution of data, simply because the key used to build the hash isn't uniformly distributed or has a very low number of distinct values. This can happen even with originally evenly distributed key data in the case of massive outer joins.
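A quick way to check for such a situation is to look at the top frequencies of the distribution key, for example (the table and column names are illustrative):

select  *
from    (
          select
                  join_key
                , count(*) as cnt
          from
                  big_table
          group by
                  join_key
          order by
                  cnt desc
        )
where   rownum <= 10;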

And regardless of how the data is distributed, you might simply be unlucky in that only some of the Parallel Slaves produce data to be processed further: all subsequent plan steps that are executed by that slave set without re-distribution will be affected, so sometimes it is desirable to re-distribute the data just to achieve a better work distribution.

Since the straightforward skew determination using the V$PQ_TQSTAT view is often not available, determining the skew and the reason for it is a non-trivial task, in particular if you need to do a post execution analysis (for example, why did statement XYZ take that long last night?).

If a Diagnostic Pack license is available, the Active Session History information can be used for shedding some light on that distribution issue and trying to answer some other interesting questions.

Post Execution Analysis using Active Session History And Real-Time SQL Monitoring

If you have a Diagnostic Pack license and are already using Oracle Database 11g, the new columns SQL_PLAN_OPERATION, SQL_PLAN_LINE_ID and SQL_PLAN_OPTIONS, along with SQL_EXEC_ID and SQL_EXEC_START, in the Active Session History views allow a thorough post-execution analysis of a single SQL statement execution:

  • You can determine how many Parallel Slaves were involved in the execution, even for more complex execution plans that make use of multiple DFOs, using the PROGRAM column of the ASH data. The following query provides that information for the latest execution available in the ASH data for the given SQL_ID / CHILD_NUMBER combination:


define plan_table = "v$sql_plan"
define inst_id = "inst_id"
define global_ash = "gv$active_session_history"
define second_id = "child_number"
define dm = "YYYY-MM-DD HH24:MI:SS"
define si = "&1"
define cn = &2

column last_exec_start new_value ls
column last_exec_id    new_value li

select
        to_char(max(sql_exec_start), '&dm') as last_exec_start
      , max(sql_exec_id) keep (dense_rank last order by sql_exec_start nulls first) as last_exec_id
from
        &global_ash
where
        sql_id = '&si'
and     sql_child_number = &cn
group by
        1
;

with set_count
as
(
select
        dfo
      , max(set_count) as set_count
from
        (
          select
                  cast(substr(p.object_node, 2, length(p.object_node) - 4) as varchar2(6)) as dfo
                , case when p.operation = 'PX RECEIVE' then 2 else 1 end                   as set_count
          from
                  &plan_table p
          where
                  p.sql_id = '&si'
          and     p.&second_id = &cn
          and     p.object_node is not null
        )
group by
        dfo
)
select
        &inst_id                                       as inst_id
      , pr.dfo
      , count(process)                                 as sample_count
      , count(distinct process)                        as slave_count
      , sc.set_count                                   as set_count
      , ceil(count(distinct process) / sc.set_count)   as assumed_degree
from    (
          select
                  ash.&inst_id
                , regexp_replace(ash.program, '^.*\((P[[:alnum:]]{3})\)$', '\1', 1, 1, 'c') as process
                , cast(substr(p.object_node, 2, length(p.object_node) - 4) as varchar2(6))  as dfo
          from
                  &global_ash ash
                , &plan_table p
          where
                  ash.sql_id = '&si'
          and     ash.sql_exec_start = to_date('&ls', '&dm')
          and     ash.sql_exec_id = &li
          and     regexp_like(ash.program, '^.*\((P[[:alnum:]]{3})\)$')
          and     p.sql_id = '&si'
          and     p.&second_id = &cn
          and     p.id = ash.sql_plan_line_id
          and     p.object_node like ':Q%'
        ) pr
      , set_count sc
where
        sc.dfo = pr.dfo
group by
        &inst_id
      , pr.dfo
      , sc.set_count
order by
        &inst_id
      , pr.dfo
;

undefine plan_table
undefine inst_id
undefine global_ash
undefine second_id
undefine dm
undefine si
undefine cn
undefine ls
undefine li
undefine 1
undefine 2

column last_exec_start clear
column last_exec_id    clear

Of course this assumes that a sufficient number of samples of the Parallel Slaves is available; otherwise the information will be misleading, potentially underestimating the degree.

As already outlined above, the DOP per DFO is not that easy to extract from the Real-Time SQL Monitoring output - it only provides information about the maximum DOP detected, in the "General Information" section.

If I run this query against the ASH data for the sample execution plan described in Part 1, executed on a single node and containing two DFOs that are not active at the same time, the following output is generated:

INST_ID DFO    SAMPLE_COUNT SLAVE_COUNT  SET_COUNT ASSUMED_DEGREE
---------- ------ ------------ ----------- ---------- --------------
         1 Q1               59           2          1              2
         1 Q2              119           8          2              4


The ASH data provides a strong indication that the two DFOs in fact used a different Parallel Degree (something that otherwise can only be confirmed while the statement is still running, by looking at (G)V$PX_SESSION).

  • Using the same information you can also tell whether there was a skew in the work distribution among the Parallel Slaves. You can even determine which operations of the execution plan are affected by the skew, something that is not (yet) easily possible using Real-Time SQL Monitoring. I've written a utility SQL*Plus script, XPLAN_ASH, that injects that information into the DBMS_XPLAN.DISPLAY_CURSOR/AWR output and clearly shows which operations had an unbalanced distribution of work among the Parallel Slaves (for more information, read the corresponding blog post):

Sample Output
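The underlying idea is an aggregation like the following sketch (this is not the XPLAN_ASH script itself, just an illustration of the kind of information it aggregates; a Diagnostic Pack license is required, and the substitution variables stand for the SQL_ID, SQL_EXEC_START and SQL_EXEC_ID of the execution of interest):

select
        ash.sql_plan_line_id
      , regexp_replace(ash.program, '^.*\((P[[:alnum:]]{3})\)$', '\1', 1, 1, 'c') as process
      , count(*)                                                                  as sample_count
from
        gv$active_session_history ash
where
        ash.sql_id = '&sql_id'
and     ash.sql_exec_start = to_date('&sql_exec_start', 'YYYY-MM-DD HH24:MI:SS')
and     ash.sql_exec_id = &sql_exec_id
and     regexp_like(ash.program, '^.*\((P[[:alnum:]]{3})\)$')
group by
        ash.sql_plan_line_id
      , regexp_replace(ash.program, '^.*\((P[[:alnum:]]{3})\)$', '\1', 1, 1, 'c')
order by
        ash.sql_plan_line_id
      , process;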

If you have a Tuning Pack license you can use Real-Time SQL Monitoring to determine whether there is a skew. The most useful tabs in the Active report are Activity, Parallel and Metrics; however, Real-Time SQL Monitoring doesn't tell you which execution plan operations are affected by the skew.

Let's have a look at each of them for an execution that suffers from such a skew. The Activity tab provides the following information (Figure 7).

Figure 7: Real Time Activity

Here it can be seen that the expected Parallel Degree was reached only for a short period at the beginning of the execution (32 active sessions on CPU in this case); after that there is a long tail where only a single CPU is active.

The Parallel tab reveals the following (Figure 8).

Figure 8: Real Time Parallel Tab

Here we can see the confirmation of the problem: a single Parallel Slave had to do much more work than the remaining ones.

In this case the same is confirmed by the Metrics tab, since it is a CPU-bound operation and we expect it to make use of 32 CPUs (Figure 9).

Figure 9: Real Time Metrics

We can even use the new Advanced Analytics option and generate a nice graph using R based on the ASH data that shows the activity of the different Parallel Slaves along the different execution plan operation IDs (Figure 10).

Figure 10: Advanced Analytics

Again it is clearly visible at which plan operation the work was not distributed as expected.

If you have identified such a skewed execution, then depending on which operations of the execution plan are affected you can attempt to improve the situation by choosing different Parallel Execution distribution methods.

As outlined in Part 1, the PQ_DISTRIBUTE hint can be used to influence the distribution of joins and load operations. However, you cannot yet directly influence the distribution of other, non-join operations like aggregations, sorts or analytic functions.

A quite common scenario is skew caused by a HASH re-distribution. With outer joins this is possible even if the underlying data itself is not skewed: if the data gets redistributed on join keys that become NULL due to the outer join, all rows that have NULL values will be directed to the same Parallel Slave. So if you end up with a massive outer join and re-distribute the data based on the massive number of NULLs generated, the distribution will become skewed.

In this case using a BROADCAST distribution can improve performance significantly; however, you need to be aware of the effects of the BROADCAST distribution as already outlined in Part 1.
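A sketch of such a hint (the table and column names are made up): the LEADING hint fixes the join order so that the PQ_DISTRIBUTE hint, which references the second row source of the join, is deterministic, and the NONE, BROADCAST combination broadcasts that small row source while leaving the large, NULL-heavy row source in place:

-- BIG_FACT.DIM_KEY contains many NULLs preserved by the outer join;
-- broadcasting the small dimension avoids hash-distributing BIG_FACT on that key
select  /*+ parallel(8)
            leading(f l)
            pq_distribute(l none broadcast) */
        f.txn_id
      , f.amount
      , l.description
from    big_fact  f
      , small_dim l
where   l.dim_key (+) = f.dim_key;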

Summary

The Parallel Execution option of the Oracle Database Enterprise Edition is a fascinating piece of technology. This article shed some light on the basic building blocks of this feature and hopefully offered some helpful insights into how Parallel Execution works, why it might not behave as expected, and what you can do about it.

About the Author

Randolf Geist, a freelance database consultant, has been working with Oracle software since 2000, focusing primarily on performance-related issues. A regular speaker at major user conferences around the world, Randolf is also an instructor for Oracle University. He is an Oracle ACE Director and a proud member of the Oak Table Network.