Note: This is a beta release of Red Hat Bugzilla 5.0. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Also email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback here.
Bug 1516180 - db2 resource agent fails to promote Slave when Master has crashed/failed abruptly
Summary: db2 resource agent fails to promote Slave when Master has crashed/failed abru...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.4
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: rc
: ---
Assignee: Oyvind Albrigtsen
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1521019
TreeView+ depends on / blocked
 
Reported: 2017-11-22 08:36 UTC by Ondrej Faměra
Modified: 2018-04-10 12:10 UTC (History)
10 users (show)

Fixed In Version: resource-agents-3.9.5-115.el7
Doc Type: If docs needed, set a value
Doc Text:
Previously, the DB2 resource agent failed to promote a Slave resource to Master. As a consequence, automatic failover of the Master resource was not possible. With this update, additional keywords for disconnected peer have been added to the agent. As a result, the DB2 agent is able to detect the resource state correctly, and the described problem no longer occurs.
Clone Of:
: 1521019 (view as bug list)
Environment:
Last Closed: 2018-04-10 12:09:28 UTC
Target Upstream Version:


Attachments (Terms of Use)
proposed fix (deleted)
2017-11-22 08:36 UTC, Ondrej Faměra
no flags Details | Diff
extra debugging seen in outputs from description (deleted)
2017-11-22 08:42 UTC, Ondrej Faměra
no flags Details | Diff


Links
System ID Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3250061 None None None 2017-11-27 14:46:31 UTC
Red Hat Product Errata RHBA-2018:0757 None None None 2018-04-10 12:10:40 UTC

Description Ondrej Faměra 2017-11-22 08:36:21 UTC
Created attachment 1357294 [details]
proposed fix

=== Description of problem:
DB2 resource agent fails with exit code 1 during promotion when the PRIMARY (Master)
was killed or when we cannot communicate with it. This is basically preventing automatic
failover of Master resource when the node running Master was fenced. Slave will
indefinatelly stay as Slave and manual intervention is required.

=== Version-Release number of selected component (if applicable):
resource-agents-3.9.5-105.el7_4.2

=== How reproducible:
Always

=== Steps to Reproduce:
1. Configure DB2 HADR in pacemaker cluster - https://access.redhat.com/articles/3172731
2. Let the DB2_HADR Master/Slave resource properly start
3. Fence node running Master DB2_HADR resource

=== Actual results:
Slave will fail the promote and restart the DB2 resource.

  Nov 21 18:16:54 [22674] fastvm-rhel-7-4-161       lrmd:     info: log_execute:  executing - rsc:DB2_HADR action:promote call_id:93
  Nov 21 18:16:54  db2(DB2_HADR)[11786]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/PEER/CONNECTED and will be promoted
  Nov 21 18:16:54  db2(DB2_HADR)[11786]:    INFO: running    db2 takeover hadr on db sample
  Nov 21 18:16:59 [22672] fastvm-rhel-7-4-161        cib:     info: cib_process_ping:     Reporting our current digest to fastvm-rhel-7-4-161: 9f793423678d5df46c9e3c76dec92687 for 0.10.124 (0x55adf6758eb0 0)
  Nov 21 18:17:07  db2(DB2_HADR)[11786]:    INFO: output:  SQL1770N  Takeover HADR cannot complete. Reason code = "7".
  Nov 21 18:17:08  db2(DB2_HADR)[11786]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/DISCONNECTED_PEER/DISCONNECTED and will be promoted
  Nov 21 18:17:08 [22674] fastvm-rhel-7-4-161       lrmd:     info: log_finished: finished - rsc:DB2_HADR action:promote call_id:93 pid:11786 exit-code:1 exec-time:13565ms queue-time:0ms

After restart of DB2 resource we won't be able to promote it until manual intervention and we will be receiving following warnings

  db2(DB2_HADR)[xxx]:    WARNING: DB2 database db2inst1(0)/sample in status STANDBY/REMOTE_CATCHUP_PENDING/DISCONNECTED can never be promoted

=== Expected results:
The Slave will be promoted to Master if the issue is detected within HADR_PEER_WINDOW and promotion timeout. Below is example from edited resource agent that is expected.
Note: additional debugging messages were added so we see what is happening.

  Nov 21 18:25:58 [22674] fastvm-rhel-7-4-161       lrmd:     info: log_execute:  executing - rsc:DB2_HADR action:promote call_id:177
  Nov 21 18:25:58  db2(DB2_HADR)[28023]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/PEER/CONNECTED and will be promoted
  Nov 21 18:25:58  db2(DB2_HADR)[28023]:    INFO: running    db2 takeover hadr on db sample
  Nov 21 18:26:03 [22672] fastvm-rhel-7-4-161        cib:     info: cib_process_ping:     Reporting our current digest to fastvm-rhel-7-4-161: 92352f43210d7ef1a10d25678d628cbd for 0.10.256 (0x55adf6863980 0)
  Nov 21 18:26:10  db2(DB2_HADR)[28023]:    INFO: output:  SQL1770N  Takeover HADR cannot complete. Reason code = "7".
  Nov 21 18:26:11  db2(DB2_HADR)[28023]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/DISCONNECTED_PEER/DISCONNECTED and will be promoted
  Nov 21 18:26:11  db2(DB2_HADR)[28023]:    INFO: running    db2 takeover hadr on db sample by force peer window only
  Nov 21 18:26:14  db2(DB2_HADR)[28023]:    INFO: SUCCESS    db2 takeover hadr on db sample by force peer window only
  Nov 21 18:26:16  db2(DB2_HADR)[28023]:    INFO: DB20000I  The ARCHIVE LOG command completed successfully.
  Nov 21 18:26:16 [22674] fastvm-rhel-7-4-161       lrmd:     info: log_finished: finished - rsc:DB2_HADR action:promote call_id:177 pid:28023 exit-code:0 exec-time:18310ms queue-time:0ms


=== Additional info:
Attached to this BZ is proposed fix and changes that were done during testing with additional debugging.
This was tested so far only on DB 10.5 and 2-node pacemaker cluster latest RHEL 7.4.
The graceful promotion works (=when primary is up and running and we issue 'pcs resource move DB2_HADR --master'). Manual takeover with DB2 only utilities also works in the described scenario.

Comment 2 Ondrej Faměra 2017-11-22 08:42:24 UTC
Created attachment 1357296 [details]
extra debugging seen in outputs from description

Comment 4 Oyvind Albrigtsen 2017-11-22 11:30:33 UTC
https://github.com/ClusterLabs/resource-agents/pull/1060

Comment 6 Ondrej Faměra 2017-11-22 13:20:43 UTC
== Additional information:
DB2 11.1 on same RHEL 7.4 experience the same issue.
Note that original package at least shows now the first error message that it failed with.
Still the code relies on the second for loop to get the takeover with peer window and that works
with patched package.

=== original package - promote fails
  Nov 22 13:56:41 [1310] fastvm-rhel-7-4-164       lrmd:     info: log_execute:   executing - rsc:DB2_HADR action:promote call_id:26
  Nov 22 13:56:45  db2(DB2_HADR)[4776]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/PEER/CONNECTED and will be promoted
  Nov 22 13:56:46 [1308] fastvm-rhel-7-4-164        cib:     info: cib_process_ping:      Reporting our current digest to fastvm-rhel-7-4-164: b94a9db433150f5a8109e2fe6bc33eba for 0.10.90 (0x5574388c19f0 0)
  Nov 22 13:56:49  db2(DB2_HADR)[4776]:    ERROR: DB2 database db2inst1(0)/sample promote failed: SQL1770N  Takeover HADR cannot complete. Reason code = "1".
  Nov 22 13:56:49 [1310] fastvm-rhel-7-4-164       lrmd:     info: log_finished:  finished - rsc:DB2_HADR action:promote call_id:26 pid:4776 exit-code:1 exec-time:7622ms queue-time:0ms
  Nov 22 13:56:49 [1314] fastvm-rhel-7-4-164       crmd:   notice: process_lrm_event:     Result of promote operation for DB2_HADR on fastvm-rhel-7-4-164: 1 (unknown error) | call=26 key=DB2_HADR_promote_0 confirmed=true cib-update=47

=== updated package with patch proposed to upstream - promote works
  Nov 22 14:03:11 [1310] fastvm-rhel-7-4-164       lrmd:     info: log_execute:   executing - rsc:DB2_HADR action:promote call_id:68
  Nov 22 14:03:11  db2(DB2_HADR)[14174]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/PEER/CONNECTED and will be promoted
  Nov 22 14:03:15 [1308] fastvm-rhel-7-4-164        cib:     info: cib_process_ping:      Reporting our current digest to fastvm-rhel-7-4-164: 193e41f6eded9a5c61cdcc23d4c8f925 for 0.10.151 (0x55743851a0d0 0)
  Nov 22 14:03:28  db2(DB2_HADR)[14174]:    INFO: DB2 database db2inst1(0)/sample has HADR status STANDBY/DISCONNECTED_PEER/DISCONNECTED and will be promoted
  Nov 22 14:03:40  db2(DB2_HADR)[14174]:    INFO: DB20000I  The ARCHIVE LOG command completed successfully.
  Nov 22 14:03:40 [1310] fastvm-rhel-7-4-164       lrmd:     info: log_finished:  finished - rsc:DB2_HADR action:promote call_id:68 pid:14174 exit-code:0 exec-time:29627ms queue-time:0ms

Comment 10 Monika Muzikovska 2018-01-12 13:04:58 UTC
Marking Verified as SanityOnly in version resource-agents-3.9.5-119.el7.

Comment 13 errata-xmlrpc 2018-04-10 12:09:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0757


Note You need to log in before you can comment on or make changes to this bug.