
Understanding Parallel Execution - Part 1

by Randolf Geist

Maximizing the performance of the Parallel Execution feature in Oracle Database Enterprise Edition

November 2012

Read Part 1
Read Part 2

Oracle Database Enterprise Edition includes the powerful Parallel Execution feature that allows spreading the processing of a single SQL statement across multiple threads. The feature is fully integrated into the SQL optimizer as well as the execution runtime engine, and it automatically distributes the work across the so-called Parallel Slaves.

However, like any other feature, it doesn't necessarily provide the expected performance improvement out of the box in every case; hence it is important to understand the building blocks and their implications. This applies particularly to powerful platforms like Exadata: although you can get blistering performance when using Parallel Execution in such environments, it isn't guaranteed. The following is therefore not specific to Exadata, but it is very applicable there.

In order to get the most out of Parallel Execution, the following prerequisites need to be met:

  • Efficient Parallel Execution plan
  • Sufficient resources available at execution time
  • No significant skew of work distribution across Parallel Slaves

In the first part of this two-part series we'll look at the first point in detail.

Efficient Parallel Execution Plan

Quite often Parallel Execution is treated like a "Silver Bullet": the serial execution of a SQL statement is slow, so let's try to run it in parallel to see if it's going to be faster. But if the serial execution plan is already inefficient (for example due to a wrong join order, or wrong join or access methods), it is very likely that the Parallel Execution plan will be inefficient, too. Therefore it is very important to check the basics first in order to understand what an efficient execution plan is supposed to look like. Once this is achieved, it can be evaluated whether Parallel Execution is applicable to the query and the system.
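
A quick way to do that check is to compare the optimizer's estimates with the actual row counts of the serial execution. Here is a sketch; the tables T1 and T2 joined on ID are assumptions mirroring the examples used throughout this article:

-- Run the statement once with row source statistics enabled ...
select /*+ gather_plan_statistics */ count(*)
from   t1, t2
where  t1.id = t2.id;

-- ... then compare the E-Rows and A-Rows columns of the last executed statement
select * from table(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));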

So on top of the important aspects that apply to an efficient serial execution plan (above all, the optimizer's picture of the data volume and distribution needs to correspond to reality), and that apply to Parallel Execution just as much, there are some specifics to Parallel Execution plans:

Parallel Execution Forced to Serial

There are some cases where Oracle in recent releases detects during the optimization phase that it can't use Parallel Execution, but has already started to generate a Parallel Execution plan. One common cause is the usage of user-defined PL/SQL functions that are not parallel enabled, but I've seen this happening under other circumstances as well. It is easy to spot, though, if you look closely: the execution plan contains one or more PX COORDINATOR FORCED SERIAL operations:

------------------------------------------------------------------------------
| Id  | Operation                    | Name     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |        |      |            |
|   1 |  PX COORDINATOR FORCED SERIAL|          |        |      |            |
|   2 |   PX SEND QC (RANDOM)        | :TQ10003 |  Q1,03 | P->S | QC (RAND)  |
|   3 |    HASH UNIQUE               |          |  Q1,03 | PCWP |            |
|   4 |     PX RECEIVE               |          |  Q1,03 | PCWP |            |
|   5 |      PX SEND HASH            | :TQ10002 |  Q1,02 | P->P | HASH       |
|*  6 |       HASH JOIN BUFFERED     |          |  Q1,02 | PCWP |            |
|   7 |        PX RECEIVE            |          |  Q1,02 | PCWP |            |
|   8 |         PX SEND HASH         | :TQ10000 |  Q1,00 | P->P | HASH       |
|   9 |          PX BLOCK ITERATOR   |          |  Q1,00 | PCWC |            |
|  10 |           TABLE ACCESS FULL  | T2       |  Q1,00 | PCWP |            |
|  11 |        PX RECEIVE            |          |  Q1,02 | PCWP |            |
|  12 |         PX SEND HASH         | :TQ10001 |  Q1,01 | P->P | HASH       |
|  13 |          PX BLOCK ITERATOR   |          |  Q1,01 | PCWC |            |
|  14 |           TABLE ACCESS FULL  | T2       |  Q1,01 | PCWP |            |
------------------------------------------------------------------------------

So whenever you see a PX COORDINATOR FORCED SERIAL operation you need to be very careful: although the plan looks like a Parallel Execution, Oracle will effectively run it serially at execution time.
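
As an illustration, here is a minimal sketch of the PL/SQL function case mentioned above. The table and function names are made up, and whether the PX COORDINATOR FORCED SERIAL operation actually shows up depends on the version and the exact circumstances:

-- A standalone function without the PARALLEL_ENABLE clause
create or replace function f_not_parallel (p_id in number)
return number
as
begin
  return p_id;
end f_not_parallel;
/

-- Referencing it in a parallel query can force the plan to run serially
select /*+ parallel(t1 8) */ count(*)
from   t1
where  f_not_parallel(t1.id) = 1;

-- Display the plan of the last executed statement
select * from table(dbms_xplan.display_cursor);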

The main problem here is that the costing is still based on the cost reduction implied by parallel execution, so you can easily end up with a plan that would have been rejected had it been costed for serial execution. For example, a parallel full table scan might look cheaper to the cost-based optimizer than a corresponding index-based access. However, since the plan is going to be executed serially, it is not unlikely that the index-based access plan would be more efficient at execution time.

On the positive side, although the SQL execution engine has to follow the execution plan generated by the optimizer, the potential overhead of the additional blocking operations (in the example above this is the HASH JOIN BUFFERED operation) won't be incurred at runtime when executing serially; the runtime engine is clever enough to avoid these unnecessary blocking operations.

I'll explain the additional blocking operations in a moment.

Note that you can end up with exactly the same problems whenever your parallel execution plans get downgraded at execution time, in particular when downgrading them to serial execution. I'll cover that point later in more detail.

Additional Blocking Operations

In a serial execution plan there is exactly one process working on the execution, and in principle it recursively calls the different functions that correspond to the different operations in the execution plan. Tanel Poder has a nice visualization of this on his blog.

In case of a Parallel Execution, things look quite different though. Oracle uses a so-called Consumer / Producer model for most non-trivial Parallel Execution plans, so at most two sets of Parallel Slaves will be working at the same time on different, related operations of one so-called "Data Flow Operation" (DFO; these will be explained later in detail).

Due to this Consumer / Producer model Oracle has to deal with the situation where both Parallel Slave Sets are busy (one producing, the other consuming data) but, according to the execution plan, the data has to be consumed by the next related parent operation. If that next operation is supposed to be executed by a separate Parallel Slave Set (you can tell this from the TQ column of the DBMS_XPLAN output), then there is effectively no slave set / process left that could consume the data. Hence Oracle sometimes needs to revert to sync points (or blocking operations that otherwise wouldn't be blocking) where the data produced needs to be "parked" until one of the slave sets is available to pick it up.

In recent releases of Oracle you can spot these blocking operations quite easily in the execution plan. Either they show up as separate operations (BUFFER SORT – not to be confused with the regular BUFFER SORT operations that are also present in the serial version of the execution plan) or one of the existing operations is turned into a BUFFERED operation, like a HASH JOIN BUFFERED.

There is not much you can do about these – Oracle adds them automatically to a Parallel Execution plan whenever its generic implementation deems them necessary.

However, there are three important points to consider:

First, the generic implementation sometimes adds these blocking operations although it is not entirely obvious why. Here is a simple example where it is obvious to the human eye that buffering the data before returning it to the client is superfluous, yet Oracle still turns the HASH JOIN into a HASH JOIN BUFFERED:

-------------------------------------------------------------------------
| Id  | Operation               | Name     |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |        |      |            |
|   1 |  PX COORDINATOR         |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)   | :TQ10002 |  Q1,02 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED   |          |  Q1,02 | PCWP |            |
|   4 |     PX RECEIVE          |          |  Q1,02 | PCWP |            |
|   5 |      PX SEND HASH       | :TQ10000 |  Q1,00 | P->P | HASH       |
|   6 |       PX BLOCK ITERATOR |          |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| T1       |  Q1,00 | PCWP |            |
|   8 |     PX RECEIVE          |          |  Q1,02 | PCWP |            |
|   9 |      PX SEND HASH       | :TQ10001 |  Q1,01 | P->P | HASH       |
|  10 |       PX BLOCK ITERATOR |          |  Q1,01 | PCWC |            |
|  11 |        TABLE ACCESS FULL| T2       |  Q1,01 | PCWP |            |
-------------------------------------------------------------------------

Look closely at the operation with ID = 3: it is the last operation before the data is sent to the Coordinator process in order to return the result set to the client. Why would you want to BUFFER that whole result set in memory, potentially writing it to TEMP and re-reading it, before returning it to the client?

It looks like the generic implementation always generates the final step of a Parallel Execution plan under the assumption that there is potentially another Parallel Slave Set active that needs to consume the data via re-distribution. This is a pity, as it quite often implies unnecessary blocking operations, as shown above.

Note that you are able to influence this by changing the distribution method of Parallel joins. I'll come back to that point later.

Second, if you know how to interpret serial execution plans regarding the order of execution, you might be surprised by the fact that the Consumer / Producer model, along with the buffering, can lead to scenarios where the usual rules do not apply. Take a look at the following serial execution plan. According to the rules the execution will effectively start at operation ID 2 (the HASH JOIN at operation ID 1 will build a hash table based on the contents of table T3), since it is the first leaf of the execution plan tree and none of the existing exceptions to this rule apply.

------------------------------------
| Id  | Operation           | Name |
------------------------------------
|   0 | SELECT STATEMENT    |      |
|*  1 |  HASH JOIN          |      |
|   2 |   TABLE ACCESS FULL | T3   |
|*  3 |   HASH JOIN         |      |
|   4 |    TABLE ACCESS FULL| T1   |
|   5 |    TABLE ACCESS FULL| T1   |
------------------------------------

But if you look at the following corresponding Parallel Execution plan, execution no longer starts at operation ID 7 (which corresponds to operation ID 2 of the serial plan), but at operation ID 14 (which corresponds to operation ID 4 of the serial counterpart), simply because Oracle cannot have more than two Parallel Slave Sets active per Data Flow Operation; operation 7 can only be started after one of the other two table scans is completed. You can easily see this confirmed by using the Real-Time SQL Monitoring feature of the Tuning Pack option for Enterprise Edition and looking at the Active column when comparing the serial to the parallel execution (using an execution that takes long enough to clearly spot the different start and active times).

----------------------------------------------------------------------------
| Id  | Operation                  | Name     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |          |        |      |            |
|   1 |  PX COORDINATOR            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)      | :TQ10004 |  Q1,04 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED      |          |  Q1,04 | PCWP |            |
|   4 |     PX RECEIVE             |          |  Q1,04 | PCWP |            |
|   5 |      PX SEND HASH          | :TQ10002 |  Q1,02 | P->P | HASH       |
|   6 |       PX BLOCK ITERATOR    |          |  Q1,02 | PCWC |            |
|   7 |        TABLE ACCESS FULL   | T3       |  Q1,02 | PCWP |            |
|   8 |     PX RECEIVE             |          |  Q1,04 | PCWP |            |
|   9 |      PX SEND HASH          | :TQ10003 |  Q1,03 | P->P | HASH       |
|* 10 |       HASH JOIN BUFFERED   |          |  Q1,03 | PCWP |            |
|  11 |        PX RECEIVE          |          |  Q1,03 | PCWP |            |
|  12 |         PX SEND HASH       | :TQ10000 |  Q1,00 | P->P | HASH       |
|  13 |          PX BLOCK ITERATOR |          |  Q1,00 | PCWC |            |
|  14 |           TABLE ACCESS FULL| T1       |  Q1,00 | PCWP |            |
|  15 |        PX RECEIVE          |          |  Q1,03 | PCWP |            |
|  16 |         PX SEND HASH       | :TQ10001 |  Q1,01 | P->P | HASH       |
|  17 |          PX BLOCK ITERATOR |          |  Q1,01 | PCWC |            |
|  18 |           TABLE ACCESS FULL| T1       |  Q1,01 | PCWP |            |
----------------------------------------------------------------------------

Last and most importantly: since Oracle needs to buffer or "park" the data produced, this can have a significant impact on the overall performance of the execution. If the amount of data to buffer is large, it cannot be held in memory and therefore has to be written to temporary disk space, only to be re-read and sent to the Parallel Slave Set that is supposed to consume the data. And even if it can be held in memory, the additional PGA memory required to hold the data can be significant.

So, in the worst case, a Parallel Execution plan can spend a significant amount of time writing data to and reading it from disk for no apparent reason. Don't confuse this with SQL work areas that spill to disk, like a HASH JOIN that needs to process a large first row source to build the hash table.
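
While such a statement is running you can get an idea of the work area memory and TEMP usage of the buffering operations, for example with a sketch like this (column names as of 11.2):

-- Work areas currently in use by the slaves of a Query Coordinator session,
-- including any spill to TEMP (TEMPSEG_SIZE is NULL if nothing has spilled)
select qcsid, sid, operation_type, operation_id,
       round(actual_mem_used / 1024 / 1024) as mem_mb,
       round(tempseg_size / 1024 / 1024)    as temp_mb
from   v$sql_workarea_active
order  by qcsid, operation_id, sid;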

Parallel Distribution Methods

If you have identified that a Parallel Execution plan spends a significant amount of time in such BUFFER operations, writing data to and reading it from temporary disk space, then you can influence this by changing the parallel distribution method of the join operations. The join distribution method controls how Oracle combines operations that will be executed by the same or by different Parallel Slave Sets (join plus data access), potentially avoiding the otherwise necessary data re-distribution. Since the join and one or more of the data access operations are then processed by the same Parallel Slave Set, it might, depending on the execution plan, no longer be necessary to block an operation and park the data.

In principle there are three common ways in which Oracle can operate a parallel HASH JOIN (the most commonly used join method for Parallel Execution); a hint sketch requesting the first two explicitly follows after Figure 6:

• The HASH JOIN itself is performed by a separate Parallel Slave Set and both join row sources are HASH distributed based on the join keys. In this case both join row sources need to be re-distributed based on the join keys, and this is performed by a separate slave set (unless a join row source is already distributed on the same criteria by its child operations, which means several tables are joined on the same key). Since this operation requires two active Parallel Slave Sets (one performing the join, one performing the child operation), it usually requires a blocking operation (a BUFFERED join / BUFFER SORT), depending on the remaining execution plan. Note that if there is a blocking parent operation, like a HASH GROUP BY or SORT AGGREGATE, that can be performed by the same Parallel Slave Set, there is no need for an artificial blocking operation.

Both phases of the join operation (the build phase of the hash table and the probe phase using the second row source) typically resemble Figures 1 and 2.

Figure 1: Hash Distribution Build Phase

Probe phase:

Figure 2: Hash Distribution Probe Phase

• The HASH JOIN is performed together with one of the join row sources. Since the data is now not re-distributed on the join key, but each Parallel Slave simply processes the data it obtains from the child operation (usually a Parallel Full Table Scan), the other join row source has to be broadcast to all Parallel Slaves performing the join. Note that this means the broadcast data effectively gets duplicated as many times as there are target Parallel Slaves, which corresponds to the Degree Of Parallelism (DOP) of that operation / DFO (Data Flow Operation) at execution time. So a large join row source combined with a high DOP leads to a huge amount of data being distributed between the two Parallel Slave Sets. If this is the first row source of a HASH JOIN, used to build the hash table, it also means that every slave has to build a hash table covering all rows of the first row source, with corresponding consequences for the memory requirements of that operation.

Usually this operation is most efficient when the first row source is relatively small compared to the second and is therefore broadcast to the Parallel Slave Set performing the join. Since the build phase of a HASH JOIN is blocking anyway, it doesn't matter that potentially two Parallel Slave Sets are active at this stage; for small join row sources this BROADCAST could even be performed serially by the Coordinator process. Once the hash table is built, only one Parallel Slave Set is active, performing the join together with the scan of the second join row source, and hence another operation could be performed by the second Parallel Slave Set, avoiding any blocking operation for this particular join. Note that it is also possible to broadcast the second row source, but this means that again potentially two Parallel Slave Sets are active while the HASH JOIN in principle is non-blocking (it can pass the results of the join to a parent operation before completion), so an artificial blocking operation might again be required. If you're unlucky and this broadcast data gets "parked" (usually represented by an additional BUFFER SORT operation), you'll end up with a potentially massive amount of data that needs to be parked, which is usually a bad idea.

The Build Phase of such an operation typically looks like this:

Figure 3: Broadcast Distribution Build Phase

The probe phase can then be performed by a single slave set: it performs both the data acquisition, like a Parallel Full Table Scan, and the join operation (Figure 4).

Figure 4: Broadcast Distribution Probe Phase

• If the two join row sources are equi-partitioned on the join key (this requires licensing the Partitioning option), a so-called full "Partition-Wise Join" can be performed that doesn't require any re-distribution of the data, so the whole operation can be performed by the same Parallel Slave Set (join plus acquiring the join row sources). Note that variations of this are possible: in a so-called "Partial Partition-Wise Join" one row source is distributed according to the partitions of the other row source. Since one of the two row sources still needs to be distributed, this is similar to the broadcast variation in terms of blocking, but it doesn't require duplicating the data like the broadcast operation and therefore should be more efficient for larger row sources than the broadcast method.

The full Partition-Wise Join is therefore the most efficient way to join two row sources in Oracle, in particular when it is performed in parallel (it can also be performed serially, whereas the "Partial" variation requires Parallel Execution for the data re-distribution of the other row source), and it is even more relevant to Parallel Execution across RAC nodes, where no data needs to be re-distributed among the instances.

What may not be obvious in the images below is the fact that the full Partition-Wise Join operates at the partition level: it joins one partition of the first row source to the corresponding partition of the second row source, and when that join is completed it can go ahead with the next partition pair. So basically each Parallel Slave runs the build phase for the partition to be processed and then runs the probe phase reading the corresponding partition from the other row source. This means that the parallel Partition-Wise Join only needs as much memory for the hash tables at any time as there are partitions concurrently processed (which corresponds to the parallel degree of that operation), no matter how many partitions there are to process in total, and therefore independent of the total size of the build row source.

This is a key difference from all other parallel join operations (including the Partial Partition-Wise Join), where the Parallel Slaves need to build hash tables that in total hold the complete build row source, which means that the total size of the hash tables depends on the size of the build row source. Note that the same difference also applies to a serial Partition-Wise Join.

The Build Phase of such an operation is illustrated in Figure 5.

Figure 5: Partition Distribution Build Phase

In principle, the Probe Phase looks the same, and since only a single Slave Set is active, another join operation could be performed by the other Slave Set (Figure 6).

Figure 6: Partition Distribution Probe Phase
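
For reference, here is a sketch of how the first two distribution methods described above could be requested explicitly via hints. The tables T1 and T2 joined on ID are assumptions; normally the optimizer picks the distribution itself based on its estimates:

-- Request a HASH / HASH distribution of both join row sources
-- (corresponds to Figures 1 and 2)
select /*+ leading(t1) use_hash(t2)
           pq_distribute(t2 hash hash)
           parallel(t1 4) parallel(t2 4) */
       count(*)
from   t1, t2
where  t1.id = t2.id;

-- Broadcast the (smaller) build row source T1 to the slave set that scans T2
-- and performs the join (corresponds to Figures 3 and 4)
select /*+ leading(t1) use_hash(t2)
           pq_distribute(t2 broadcast none)
           parallel(t1 4) parallel(t2 4) */
       count(*)
from   t1, t2
where  t1.id = t2.id;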

Other Join Methods

Notice that in principle the same rules apply to a MERGE JOIN, but for a MERGE JOIN the HASH distribution and the Partition-Wise Join are only supported for equi-joins, whereas for non-equi-joins only the BROADCAST distribution is supported. Furthermore it isn't that relevant which of the two row sources comes first or second, since both row sources need to be in sorted order before they can be joined, which usually means that a separate SORT operation is required on both row sources. This makes the MERGE JOIN only attractive if at least one of the row sources is already sorted according to the join keys, for example by accessing it via an index. Technically, though, Oracle only supports accessing the first row source of the MERGE JOIN without a separate sort operation.

The NESTED LOOP JOIN in principle only supports a broadcast distribution of the driving row source, but it usually operates without any distribution if the second row source uses an index- or cluster-based access path, which is the predominant use of a NESTED LOOP JOIN.

Parallel Distribution Examples

Here is the same execution plan as above, but this time using a different distribution method of the joins:

--------------------------------------------------------------------------
| Id  | Operation                | Name     |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |        |      |            |
|   1 |  PX COORDINATOR          |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)    | :TQ10002 |  Q1,02 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN             |          |  Q1,02 | PCWP |            |
|   4 |     PX RECEIVE           |          |  Q1,02 | PCWP |            |
|   5 |      PX SEND BROADCAST   | :TQ10000 |  Q1,00 | P->P | BROADCAST  |
|   6 |       PX BLOCK ITERATOR  |          |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL | T3       |  Q1,00 | PCWP |            |
|*  8 |     HASH JOIN            |          |  Q1,02 | PCWP |            |
|   9 |      PX RECEIVE          |          |  Q1,02 | PCWP |            |
|  10 |       PX SEND BROADCAST  | :TQ10001 |  Q1,01 | P->P | BROADCAST  |
|  11 |        PX BLOCK ITERATOR |          |  Q1,01 | PCWC |            |
|  12 |         TABLE ACCESS FULL| T1       |  Q1,01 | PCWP |            |
|  13 |      PX BLOCK ITERATOR   |          |  Q1,02 | PCWC |            |
|  14 |       TABLE ACCESS FULL  | T1       |  Q1,02 | PCWP |            |
--------------------------------------------------------------------------

Notice the difference:

  1. The BUFFERED operations are no longer there; both HASH JOINs now operate in a non-blocking manner, so there is no potential threat from "parking" the data.

    The downside is of course that now both T3 and one instance of T1 will be duplicated as many times as there are slaves to receive the broadcasted data, potentially leading to increased CPU, memory and RAC interconnect usage.
  2. Both HASH JOINs and the full table scan of the other instance of T1 will be performed by the same Parallel Slave Set, as can be seen from the TQ column (all denoted as Q1,02).

-----------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |       |       |        |      |            |
|   1 |  PX COORDINATOR         |          |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)   | :TQ10000 |       |       |  Q1,00 | P->S | QC (RAND)  |
|   3 |    PX PARTITION HASH ALL|          |     1 |     4 |  Q1,00 | PCWC |            |
|*  4 |     HASH JOIN           |          |       |       |  Q1,00 | PCWP |            |
|   5 |      TABLE ACCESS FULL  | T3       |     1 |     4 |  Q1,00 | PCWP |            |
|*  6 |      HASH JOIN          |          |       |       |  Q1,00 | PCWP |            |
|   7 |       TABLE ACCESS FULL | T1       |     1 |     4 |  Q1,00 | PCWP |            |
|   8 |       TABLE ACCESS FULL | T1       |     1 |     4 |  Q1,00 | PCWP |            |
-----------------------------------------------------------------------------------------

The plan above shows the same SQL statement again, this time using no distribution at all, because all tables are hash partitioned on the join key. Each Parallel Slave works on one partition at a time and performs the hash join exactly between the matching hash partitions.

Notice that the whole execution plan only uses a single Parallel Slave Set (Q1,00) and there is a single PX PARTITION operator as parent of all the joins. Furthermore no PX SEND / PX RECEIVE operations are required, as no data re-distribution is necessary. This is the fastest way to perform a join in Oracle, but only if there is no skew in the data distribution between the partitions, so that all Parallel Slaves work on approximately the same amount of data. Otherwise the duration of the operation will be determined by the Parallel Slave that has to perform most of the work, which can dramatically influence the efficiency of the operation.
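
For reference, a simplified setup like the following sketch would allow such a full Partition-Wise Join (column names and data volumes are made up): both tables are hash partitioned on the join key with the same number of partitions, four here, matching the Pstart / Pstop values 1 to 4 shown above.

-- Both row sources hash partitioned on the join key
-- (requires the Partitioning option)
create table t1
partition by hash (id) partitions 4
as
select rownum as id, rpad('x', 100) as filler
from   dual
connect by level <= 100000;

create table t3
partition by hash (id) partitions 4
as
select rownum as id, rpad('x', 100) as filler
from   dual
connect by level <= 100000;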

The Cost-Based Optimizer (CBO) evaluates the different distribution methods and usually favors the BROADCAST distribution for smaller amounts of data (where "smaller" is relative to the size of the other row source joined). Nevertheless there are cases where the optimizer might over- or underestimate the amount of data and / or favor the HASH distribution where a BROADCAST distribution might lead to improved performance (or vice versa). This is particularly relevant when the CBO under- or overestimates the cardinality.

If you want to make use of the "Partition-Wise Join" you have to be careful with the DOP: if the DOP doesn't correspond to the number of partitions, in particular if the DOP is much higher than the number of partitions, then the Cost-Based Optimizer might start to favor a join method that involves re-distribution, which might turn out to be less efficient at execution time. Ideally, for a "Partition-Wise Join" you should have either the same number of partitions as the DOP or many more partitions than the DOP.

The PQ_DISTRIBUTE hint allows you to influence the distribution. The best way to find out how to specify the hint is to use the OUTLINE formatting option of the DBMS_XPLAN functions and look for the corresponding PQ_DISTRIBUTE hint specified there.

For the BROADCAST distribution you need to be careful which of the two row sources you want to distribute, since for inner joins both could be subject to re-distribution - there seem to be restrictions that apply to outer joins, though.

In principle the syntax for the PQ_DISTRIBUTE hint could look like this:

PQ_DISTRIBUTE([@qb_name] T[@qb_name] BROADCAST NONE)

or

PQ_DISTRIBUTE([@qb_name] T[@qb_name] NONE BROADCAST)

The rule seems to be that the first distribution method specified applies to the "other" row source (so in our case not "T", but the other one), whereas the second distribution method applies to the one mentioned in the hint ("T" in our case here). Note that this rule in principle also applies to the Partial Partition-Wise Join (the PARTITION NONE / NONE PARTITION distribution methods).

So if you want to re-distribute T by broadcast, you would have to use the second form, above.

But you have to be careful when specifying the hints:

  1. Check the OUTLINE option of DBMS_XPLAN for the hints generated by the optimizer. The actual form of the hint to use depends on which of the two row sources is mentioned in the OUTLINE for this particular join; it could be "T" or the other row source in our case.
  2. Be aware of the "SWAP_JOIN_INPUTS" capability of a HASH JOIN operation: a hash join can swap the two row sources, since the optimizer attempts to use the "smaller" one for building the hash table.

So you need to carefully check which alias name to use in the hint, and whether that alias shows up as a first or second row source of the join. Then you can decide how to specify the hint to achieve the desired re-distribution.
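
A sketch of that cross-check, displaying the plan and the outline of the last statement executed in the session:

-- Look for the PQ_DISTRIBUTE and SWAP_JOIN_INPUTS hints in the
-- "Outline Data" section of the output
select * from table(dbms_xplan.display_cursor(format => 'TYPICAL +OUTLINE'));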

Distribution Of Load Operations

Since Oracle 11.2 the PQ_DISTRIBUTE hint can also be used to influence the data distribution for Parallel Load operations (Parallel Insert or Create Table As Select (CTAS)). This is particularly useful if you want to maximize compression for data loads by pre-sorting the data by the columns with the most repeating values (for example by looking at the NUM_DISTINCT attribute of the columns). If you do so, then by default Oracle will perform a RANGE distribution of the data based on the sort keys for the final sort operation. But since the sort keys will be very similar for most of the data, this will very likely result in a skewed RANGE distribution where most of the data is routed to the same Parallel Slave for sorting, leaving the remaining Parallel Slaves almost idle.

By using the PQ_DISTRIBUTE hint for the DML / DDL operation performing the load, the RANGE distribution on the ORDER BY columns, and therefore the skew, can be avoided (PQ_DISTRIBUTE(<table> NONE); see the official documentation for more information). Note that this doesn't seem to work in 11.2.0.1, and even in 11.2.0.2 / 11.2.0.3 there are bugs (12683462) that can lead to wrong data and / or ORA-600 errors when the PQ_DISTRIBUTE hint is used and a join operation is involved that uses a RANGE distribution; as a side effect of using the hint, any final join HASH distribution turns into a join RANGE distribution. Be aware that the patch for bug 12683462 doesn't seem to fix the problem, and as of the time of writing this article the patch for another bug on top of 12683462 is still being tested, so at present influencing the distribution of loads via the PQ_DISTRIBUTE hint is only recommended when no join is involved in the query part.
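
Here is a sketch of such a load without a join in the query part (table and column names are made up; parallel DML needs to be enabled for a parallel insert):

alter session enable parallel dml;

-- NONE distribution for the load avoids the skewed RANGE distribution
-- on the (low-cardinality) sort keys
insert /*+ append parallel(t_target 8) pq_distribute(t_target none) */
into   t_target
select /*+ parallel(s 8) */ *
from   t_source s
order  by low_distinct_col1, low_distinct_col2;

commit;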

If you need to influence the distribution of loads and joins are involved, a reasonable workaround seems to be to materialize the final result set before the ORDER BY takes place, using an appropriate WITH clause and a MATERIALIZE hint, albeit at the cost of an otherwise unnecessary additional step that will very likely cause read and write I/O to the temporary tablespace.
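
A sketch of that workaround, again with made-up names: the join result is materialized first, so the query part of the load itself no longer contains a join and the NONE distribution can be used safely.

-- Materialize the join result, then sort and load it
-- (assumes parallel DML is enabled as above)
insert /*+ append parallel(t_target 8) pq_distribute(t_target none) */
into   t_target
with joined as (
  select /*+ materialize parallel(a 8) parallel(b 8) */
         a.id, a.col1, b.col2
  from   t_source_a a, t_source_b b
  where  a.id = b.id
)
select id, col1, col2
from   joined
order  by col1, col2;

commit;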

Plans Including Multiple Data Flow Operations (DFOs)

The question is often asked: "How parallel?" For example, what actual parallel degree was used for the execution of that statement last night? The question is in fact not that simple to answer, since it lacks some precision. First of all there is a so-called Degree Of Parallelism (DOP) that is determined by the optimizer when generating the execution plan. A recent overview of how this DOP is determined can be found on the Data Warehouse Insider blog, including the latest Automatic Degree Of Parallelism (Auto DOP) feature introduced in Oracle Database 11.2.

Then there is the problem that the optimizer may generate multiple so-called "Data Flow Operations" (DFOs) per execution plan, and that each DFO has its own, possibly different, DOP.

At execution time for each DFO there can be up to two Parallel Slave Sets active at the same time, each Parallel Slave Set consisting of DOP number of Parallel Slaves.

So for a single SQL statement executed in parallel, there is potentially no single answer to the question "How parallel?" because the optimizer might de-compose it into several DFOs, each having its own DOP.

At execution time each DFO can have either one or two Parallel Slave Sets active concurrently, so again it is not that simple to answer how many Parallel Slaves the execution uses; however, the number of Parallel Slave Sets required can be told by looking at the execution plan.

To make things a little more difficult, even determining that multiple DFOs are generated by the optimizer doesn't tell you whether these DFOs will be active at the same time. In those cases where multiple DFOs are active at the same time, you will end up with more than two Parallel Slave Sets active, so the execution might consume more Parallel Slaves, and hence more resources (in particular CPU), than expected.

Finally, the actual DOP per DFO at execution time can be totally different from the DOP determined by the optimizer when generating the execution plan. A so-called "Downgrade," where a lower DOP gets used, is possible for various reasons.

For your entertainment, here is a silly and in most cases completely inefficient, but interesting, execution plan demonstrating two DFO trees that are active at the same time and hence require more than two Parallel Slave Sets to be active:

---------------------------------------------------------------------------------------
| Id  | Operation                | Name     | Cost (%CPU)|    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT         |          |  2302M  (1)|        |      |            |
|*  1 |  FILTER                  |          |            |        |      |            |
|   2 |   PX COORDINATOR         |          |            |        |      |            |
|   3 |    PX SEND QC (RANDOM)   | :TQ20002 |  5129   (1)|  Q2,02 | P->S | QC (RAND)  |
|*  4 |     HASH JOIN BUFFERED   |          |  5129   (1)|  Q2,02 | PCWP |            |
|   5 |      PX RECEIVE          |          |  1164   (1)|  Q2,02 | PCWP |            |
|   6 |       PX SEND HASH       | :TQ20000 |  1164   (1)|  Q2,00 | P->P | HASH       |
|   7 |        PX BLOCK ITERATOR |          |  1164   (1)|  Q2,00 | PCWC |            |
|   8 |         TABLE ACCESS FULL| T1       |  1164   (1)|  Q2,00 | PCWP |            |
|   9 |      PX RECEIVE          |          |  1163   (1)|  Q2,02 | PCWP |            |
|  10 |       PX SEND HASH       | :TQ20001 |  1163   (1)|  Q2,01 | P->P | HASH       |
|  11 |        PX BLOCK ITERATOR |          |  1163   (1)|  Q2,01 | PCWC |            |
|  12 |         TABLE ACCESS FULL| T1       |  1163   (1)|  Q2,01 | PCWP |            |
|  13 |   PX COORDINATOR         |          |            |        |      |            |
|  14 |    PX SEND QC (RANDOM)   | :TQ10000 |  2328   (1)|  Q1,00 | P->S | QC (RAND)  |
|  15 |     PX BLOCK ITERATOR    |          |  2328   (1)|  Q1,00 | PCWC |            |
|* 16 |      TABLE ACCESS FULL   | T1       |  2328   (1)|  Q1,00 | PCWP |            |
---------------------------------------------------------------------------------------

As already mentioned, the interesting point here is that a separate DOP is possible for each DFO, so the filter subquery above will be executed with a different parallel degree than the DFO of the main query.

In the above case this is also easy to tell from the EXPLAIN PLAN output, because the same table is used all over the place, and in the FILTER subquery part the full table scan is shown with twice the cost of the one in the main query.

While the statement is executing you can see this in (G)V$PX_SESSION, for example, where for each DFO, represented by a separate SERVER_GROUP (1 and 2 in this case), a different REQ_DEGREE will be shown along with the corresponding number of Parallel Slaves.
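
A sketch of such a check from another session while the statement is running (replace :qc_sid with the SID of the Query Coordinator session):

-- One row per Parallel Slave; DEGREE / REQ_DEGREE can differ per SERVER_GROUP
select server_group, server_set, degree, req_degree, count(*) as slaves
from   gv$px_session
where  qcsid = :qc_sid
and    server_group is not null
group  by server_group, server_set, degree, req_degree
order  by server_group, server_set;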

Let's have a look at another example where multiple DFOs get used but they are not active at the same time:

---------------------------------------------------------------------------------------------
| Id  | Operation                  | Name                      |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |                           |        |      |            |
|   1 |  TEMP TABLE TRANSFORMATION |                           |        |      |            |
|   2 |   PX COORDINATOR           |                           |        |      |            |
|   3 |    PX SEND QC (RANDOM)     | :TQ10000                  |  Q1,00 | P->S | QC (RAND)  |
|   4 |     LOAD AS SELECT         | SYS_TEMP_0FD9D6602_9709B5 |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR     |                           |  Q1,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL    | T1                        |  Q1,00 | PCWP |            |
|   7 |   PX COORDINATOR           |                           |        |      |            |
|   8 |    PX SEND QC (RANDOM)     | :TQ20002                  |  Q2,02 | P->S | QC (RAND)  |
|*  9 |     HASH JOIN BUFFERED     |                           |  Q2,02 | PCWP |            |
|  10 |      PX RECEIVE            |                           |  Q2,02 | PCWP |            |
|  11 |       PX SEND HASH         | :TQ20000                  |  Q2,00 | P->P | HASH       |
|  12 |        VIEW                |                           |  Q2,00 | PCWP |            |
|  13 |         PX BLOCK ITERATOR  |                           |  Q2,00 | PCWC |            |
|  14 |          TABLE ACCESS FULL | SYS_TEMP_0FD9D6602_9709B5 |  Q2,00 | PCWP |            |
|  15 |      PX RECEIVE            |                           |  Q2,02 | PCWP |            |
|  16 |       PX SEND HASH         | :TQ20001                  |  Q2,01 | P->P | HASH       |
|  17 |        PX BLOCK ITERATOR   |                           |  Q2,01 | PCWC |            |
|  18 |         TABLE ACCESS FULL  | T1                        |  Q2,01 | PCWP |            |
---------------------------------------------------------------------------------------------

So again this plan shows multiple DFOs (not visible in the plan output above, but again with different DOPs). When it is executed, however, it can be seen that these DFOs are executed one after the other, and therefore the different Parallel Slave Sets of the corresponding Parallel Server Groups will very likely attempt to re-use the same Parallel Slave processes.

The actual DOP at execution time is something that Real-Time SQL Monitoring will show (along with an indication whether a so-called "downgrade" happened). In case of multiple DFOs in a single query, Real-Time SQL Monitoring shows the highest parallel degree detected among the DFOs as the DOP; since it only shows a single DOP per query, it lacks a bit of precision in cases of multiple DFOs.
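
A sketch of how to get that report (requires the Tuning Pack license; replace :sql_id with the SQL_ID of the monitored statement):

select dbms_sqltune.report_sql_monitor(
         sql_id => :sql_id,
         type   => 'TEXT') as report
from   dual;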

Furthermore there is the possibility that the Parallel Coordinator process is unable to get the desired number of Parallel Slaves at execution time for various reasons. It could simply be a deliberate limitation imposed via an active Resource Manager plan, or it could be a limitation imposed via the PARALLEL_ADAPTIVE_MULTI_USER, PARALLEL_MAX_SERVERS or PROCESSES instance parameters. This and other important aspects will be discussed in more detail in Part 2.

About the Author

Randolf Geist, a freelance database consultant, has been working with Oracle software since 2000, focusing primarily on performance-related issues. A regular speaker at major user conferences around the world, Randolf is also an instructor for Oracle University. Randolf is an Oracle ACE Director and a proud member of the Oak Table Network.