Diagnosing DB2 hot page issues

Abstract
________________________________________
Certain types of queries perform worse under increasing user load, due to hash bucket group latch contention caused by hot pages.

Symptom
________________________________________
db2pd -latches shows excessive waiters on
SQLB_HASH_BUCKET_GROUP_HEADER_groupLatch (“grouplatch”).

db2pd -stacks shows excessive EDU waits in latch conflict routines
(getConflictComplex) arising from bufferpool routines (sqlbfix, sqlbufix).

Cause
________________________________________
There are numerous causes to this issue, all of which involve concurrent access
by multiple agents to a single page — a “hot page”. In all cases, the high
rate of concurrent access to hot pages will cause the corresponding hash bucket
group latch to become contended and can cause increased CPU usage and degraded
performance.

Some typical scenarios are listed below:

a) hot index root page

This scenario is caused by concurrent queries performing index scans (IXSCAN)
on the “inner” leg of a nested loop join (NLJOIN). Because the “inner” leg is
accessed multiple times (once per row on the “outer” leg), coupled with the
concurrent execution of the query, the index root page is heavily accessed and
becomes hot.

b) hot index leaf page

This scenario occurs when there is a high rate of concurrent insert, delete
and/or update queries that are collocated around a specific range of index key
values. Because of the frequent operations against a small set of key values,
the index leaf page holding these index keys can become hot.

c) hot data page – insert workload

This scenario occurs when there is a high rate of concurrent inserts on a
table. Insert processing requires free space to be found in the table, either
on existing pages within the table or on new pages created in the table.

i) table NOT in append mode

Information regarding free space in tables is stored in Free Space Control
Records (FSCR) pages. These pages are modified whenever an insert occurs
(consuming free space) or a delete occurs (creating free space). If there is a
sufficient degree of concurrent insert and/or delete activity, the FSCR pages
will become hot. Furthermore, if there is minimal free space within the table,
the majority of inserts will be directed to the “last” page of the table which
will become hot as a result.

ii) table in append mode

When a table is in append mode FSCR records are not used. Instead, groups of
pages are added to the end of the table and agents are distributed across these
pages when performing insert activity. If the concurrent insert rate is
sufficiently high, to the point where the new set of pages are completely
populated before the background task of allocating the next set of pages has
completed, agents may serialize on page 0 of the table object until the next
cache is ready, causing the page to become hot.

d) hot data page – select or update workload

This scenario occurs when there is a high rate of concurrent read or modify
access to a table. If the same rows are being continually read and/or accessed
in a concurrent fashion, the corresponding pages will become hot.

Diagnosis
________________________________________
db2pd -latches can be used to collect information about latches held and the
number of waiters on those latches.

If a large number of holders exist on an instance of grouplatch, and this
condition worsens as user load is increased, then there is evidence that this
is a problem.

Example of db2pd -latches output:

Latches:
Address Holder Waiter Filename LOC LatchType
0x0780000039CE8980 385208 1900704 sqlbslat.h 711
SQLO_LT_SQLB_HASH_BUCKET_GROUP_HEADER__groupLatch
0x0780000039CE8980 385208 1966278 sqlbslat.h 711
SQLO_LT_SQLB_HASH_BUCKET_GROUP_HEADER__groupLatch
0x0780000039CE8D80 1548396 1773670 sqlbufix.C 367
SQLO_LT_SQLB_HASH_BUCKET_GROUP_HEADER__groupLatch
0x0780000039CE8D80 1548396 1765476 sqlbufix.C 367
SQLO_LT_SQLB_HASH_BUCKET_GROUP_HEADER__groupLatch
0x0780000039CE8D80 1548396 1523746 sqlbufix.C 367
SQLO_LT_SQLB_HASH_BUCKET_GROUP_HEADER__groupLatch
0x0780000039CE8D80 1548396 1015838 sqlbufix.C 367
SQLO_LT_SQLB_HASH_BUCKET_GROUP_HEADER__groupLatch
0x0780000039CE8D80 1548396 1618134 sqlbufix.C 367
SQLO_LT_SQLB_HASH_BUCKET_GROUP_HEADER__groupLatch
….
(This list should be much longer in order to genuinely indicate the presence of
this problem.)

In this example, we can see that there are a lot of waiters on the latch with
address 0x0780000039CE8D80 (the “grouplatch”). Even though the address of the
latch does not change, the “Holder” value changes during the course of the
db2pd -latches execution. This is evidence that the sessions are all moving
forward, but rather slowly, due to latch contention.

db2pd -applications and db2pd -dynamic output can be used to identify
frequently-executed queries that are accumulating a lot of CPU time and/or
latch wait time. The MON_GET_PKG_CACHE_STMT table function, in particular,
columns TOTAL_CPU_TIME and TOTAL_EXTENDED_LATCH_WAIT_TIME, are also useful here.

Classify the queries with large rates of accumulated CPU or latch wait time
into one of the following typical use cases.

a) hot root index page – select workload
b) hot leaf index page – insert/update/delete workload
c) hot data page – insert workload
d) hot data page – select/update workload

Resolution
________________________________________
In order to resolve the problem, a way must be found to minimize access to the
hash bucket group latch, generally by reducing concurrent access to the hot
page.

a) hot index root page

In this case, there are four user-facing options.

1) Add or remove indexes.

Creating additional secondary indexes on the table may cause some queries to
use these new indexes. This will reduce the amount of concurrent access to the
hot index root page.

If there is only one index, and primary key constraints are not required,
dropping the index may be an option. This will force the query optimizer to
use a table scan, which although may be inefficient, may provide better
performance overall. This option should only be considered for very small
tables.

2) Partition the hot index.

If the underlying table and index can be partitioned, then concurrent access to
the original index will be distributed over the smaller index partitions. If
the data accesses are not heavily skewed, and the number of partitions is
sufficient, then none of the root pages in any of the index partitions will be
hot. Note that in order for an index to be partitioned, it must include the
table’s partitioning key.

3) Force a different query plan

By using optimizer guidelines, a different query plan can be forced. For
example, a hash join could be used instead of a nested-loop join.

4) Change the application.

By changing the application logic to execute queries less frequently, which may
entail client-side caching of some data, the rate of concurrent access to the
hot index root page can be reduced.

b) hot index leaf page

If an index is seeing a lot of key manipulation activity
(insert/update/delete), page splits may occur. These are expensive operations
(both in CPU and locking) and can cause btree structures to become skewed and
deeper than desired. Resolving these problems often requires length REORG and
RUNSTAT operations.

By using a larger page size for index objects, the number of keys/page can be
increased. This will provide a denser btree structure, and reduce the
potential for splits and skew.

c) hot data page – insert

i) table NOT in append mode

If a table is subject to a large amount of INSERT activity, consider enabling
the table for append mode. In addition to spreading insert activity over a
number of pages at the “end” of the table, CPU consumption will be reduced
since FSCR records are no longer searched or maintained. Note that because of
this, any DELETE or decomposed UPDATE operation will leave free space within
the table that will not be reused; a REORG is required to reclaim this space.

ii) table in append mode

If the append cache is being fully populated before the next append cache is
available, adjustments to the append cache size are needed. This requires
measuring the time to fill the cache (inserts per second) and the time to
allocate the cache (allocating N pages on disk), and determining how the
allocation times change as the append cache size is decreased or increased.
Once a size that meets the time criteria is found, the append cache size can be
adjusted either by altering the extent size of the tablespace, or by using DB2
registry variables to manipulate the append cache algorithm.

d) hot data page – select/update

In general, the application logic should be revisited to determine if the rows
can be accessed fewer times, or even cached on the application side.

If multiple rows on a page are being accessed, consider using PCTFREE to reduce
the number of rows per page. This will distribute the rows across more pages,
which will reduce the potential for hot pages.

It may also be appropriate to reduce the page size of the table object, as this
will achieve similar effects as increasing PCTFREE.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s