Bug 1056995 - SMB: Multiple mount points with different VIPs hang when one of the nodes is rebooted in the CTDB cluster.
Summary: SMB: Multiple mount points with different VIPs hang when one of the nodes is rebooted in the CTDB cluster.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: samba
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.0.0
Assignee: Jose A. Rivera
QA Contact: surabhi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-01-23 10:09 UTC by surabhi
Modified: 2015-05-13 16:58 UTC (History)
CC List: 10 users

Fixed In Version: ctdb2.5-2.5.3-2.el6rhs
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-09-22 19:32:06 UTC




Links
System ID: Red Hat Product Errata RHEA-2014:1278
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Storage Server 3.0 bug fix and enhancement update
Last Updated: 2014-09-22 23:26:55 UTC

Description surabhi 2014-01-23 10:09:51 UTC
Description of problem:
==========================

In a CTDB environment, when a distributed-replicate (dis-rep) volume is mounted on four different clients using four virtual IPs (VIPs) and the node corresponding to one of the VIPs is rebooted, all of the mount points hang.

** The mount points do not hang forever (they come back once the failover is done), but the issue here is that all of the mount points hang, not just the affected one.

When the master node is rebooted, all mount points may hang because a master election takes place and every node passes through the unhealthy state before returning to OK. But rebooting a node that is not the master should not affect mount points whose VIPs correspond to other nodes.
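
For reference, the node-to-VIP mapping in such a setup comes from CTDB's nodes and public_addresses files. A minimal sketch of what they would look like here, built from the addresses and the em1 interface shown in the ctdb outputs below; the /24 netmask is an assumption and the exact values are illustrative:

# /etc/ctdb/nodes -- one internode address per cluster node
10.16.157.0
10.16.157.3
10.16.157.6
10.16.157.9

# /etc/ctdb/public_addresses -- the four VIPs the clients mount, normally one hosted per node
10.16.157.2/24 em1
10.16.157.5/24 em1
10.16.157.8/24 em1
10.16.157.12/24 em1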

Version-Release number of selected component (if applicable):
[root@gqac001 ~]#
ctdb-1.0.114.6-1.el6rhs.x86_64
samba-glusterfs-3.6.9-167.9.el6rhs.x86_64
glusterfs-3.4.0.57rhs-1.el6rhs.x86_64


How reproducible:
Tried twice

Steps to Reproduce:
1. Perform the CTDB setup and create a dis-rep volume.
2. Mount the volume on four different clients with the 4 VIPs, one corresponding to each physical node (a mount sketch follows the expected output below).
3. Reboot one of the nodes that is not the master node.
4. Do ls on all 4 mount points.
5. Check ctdb status and ctdb ip.

-----> ctdb status should show one node as disconnected/unhealthy/inactive :

[root@gqac001 /]# ctdb status
Number of nodes:4
pnn:0 10.16.157.0      OK (THIS NODE)
pnn:1 10.16.157.3      OK
pnn:2 10.16.157.6      OK
pnn:3 10.16.157.9      DISCONNECTED|UNHEALTHY|INACTIVE
Generation:2015376953
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:2

-------> ctdb ip will show that failover to another node has happened:

[root@gqac001 /]# ctdb ip
Public IPs on node 0
10.16.157.2 node[0] active[em1] available[em1] configured[em1]
10.16.157.5 node[2] active[] available[em1] configured[em1]
10.16.157.8 node[1] active[] available[em1] configured[em1]
10.16.157.12 node[0] active[em1] available[em1] configured[em1]
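
For reference, step 2 above assumes CIFS mounts of roughly the following form, one VIP per client; the share name, username, and mount point are hypothetical, not taken from this setup:

# On client 1 (repeat on clients 2-4 with //10.16.157.5, //10.16.157.8, //10.16.157.12)
mount -t cifs //10.16.157.2/gluster-vol /mnt/smb -o username=smbuser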

Actual results:
============================
All the mount points hang even though no election took place, the other nodes are in the OK state, and the IP failover has completed.

Expected results:
============================
In this case, all mount points should not hang. Only the mount point whose corresponding physical node was rebooted should hang, and it should come back once the failover is done.

Additional info:

Comment 2 surabhi 2014-05-05 11:31:07 UTC
Tried the same test with following version:
ctdb2.5-2.5.2-1.el6rhs.x86_64

In a 4-node CTDB cluster, when a node is rebooted/powered off, the other nodes go to UNHEALTHY status and the mount points for all the nodes with different VIPs hang and are not accessible. The ctdb log reports the following errors:

2014/05/05 07:17:02.902695 [ 2863]: 01.reclock: ERROR: 5 consecutive failures for 01.reclock, marking node unhealthy
2014/05/05 07:17:12.921575 [ 2863]: 01.reclock: ERROR: 6 consecutive failures for 01.reclock, marking node unhealthy
2014/05/05 07:17:14.727849 [ 2863]: 10.interface: Killing TCP connection 10.16.159.180:47508 10.16.157.12:445
2014/05/05 07:17:14.744531 [ 2863]: 10.interface: Killed 1 TCP connections to released IP 10.16.157.12
2014/05/05 07:17:27.940751 [ 2863]: 01.reclock: ERROR: 7 consecutive failures for 01.reclock, marking node unhealthy
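
The 01.reclock errors above come from CTDB's periodic check of the recovery lock file. A minimal sketch of the relevant configuration, assuming the lock file lives on a Gluster volume shared by all nodes (the paths are illustrative):

# /etc/sysconfig/ctdb (paths are illustrative)
# The recovery lock must sit on storage reachable from every node; if it becomes
# unreachable, 01.reclock failures accumulate and the node is marked UNHEALTHY.
CTDB_RECOVERY_LOCK=/gluster/lock/reclock
CTDB_NODES=/etc/ctdb/nodes
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses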

Actual results:
All the mount points hang until the node comes back up, and the other nodes go to the UNHEALTHY state.

Expected results:
The nodes other than the one that was rebooted/powered off should not go to the UNHEALTHY state, and all mount points other than the one whose corresponding node was rebooted should remain accessible.

Comment 3 Jose A. Rivera 2014-05-05 12:48:27 UTC
A request for some clarification.

 * When you power off a given node, _all other_ nodes go UNHEALTHY?
 * Is this indefinitely, or do they eventually recover?
 * Is this the case even when no clients are accessing samba shares?

Comment 4 surabhi 2014-05-06 10:29:49 UTC
(In reply to Jose A. Rivera from comment #3)
> A request for some clarification.
> 
>  * When you power off a given node, _all other_ nodes go UNHEALTHY?
->  Yes. Sometimes all nodes go UNHEALTHY and sometimes only one or two of them.

>  * Is this indefinitely, or do they eventually recover?
They recover within a few seconds. But why do they go to the UNHEALTHY state?

>  * Is this the case even when no clients are accessing samba shares?
No. If we try this test without any clients accessing the Samba shares, just with the CTDB setup (to see failover), the nodes don't go to the UNHEALTHY state.


So in detail:
=========================
Case 1: 
**************
In a 4-node CTDB cluster where no Samba share is mounted on a Linux/Windows client:

-> Reboot/Power off node 1 
-> Check ctdb status: CTDB shows OK for all nodes except the one powered off.

# ctdb status
Number of nodes:4
pnn:0 10.16.157.0      OK (THIS NODE)
pnn:1 10.16.157.3      DISCONNECTED|UNHEALTHY|INACTIVE
pnn:2 10.16.157.6      OK
pnn:3 10.16.157.9      OK
Generation:1780597823

-> Power on the node: all nodes come back to the OK state.

# ctdb status
Number of nodes:4
pnn:0 10.16.157.0      OK (THIS NODE)
pnn:1 10.16.157.3      OK
pnn:2 10.16.157.6      OK
pnn:3 10.16.157.9      OK

Case 2: 
**************
In a 4-node CTDB cluster where the Samba share is mounted on 4 different clients using 4 different VIPs:

-> Reboot/Power off node 1 
-> Check ctdb status: initially, CTDB shows OK for all nodes except the one powered off.
-> Check ctdb status again: CTDB shows UNHEALTHY for two other nodes.
# ctdb status
Number of nodes:4
pnn:0 10.16.157.0      OK (THIS NODE)
pnn:1 10.16.157.3      DISCONNECTED|UNHEALTHY|INACTIVE
pnn:2 10.16.157.6      UNHEALTHY
pnn:3 10.16.157.9      UNHEALTHY

-> Try to access/ls on the mount points: all the mount points are hung.
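
One way to confirm the hang without blocking the shell is to run the ls checks in the background and see which ones report back; a minimal sketch, assuming the four shares are also mounted on a single test client at hypothetical paths:

# Launch one check per mount point in the background; a hung CIFS mount simply never reports
for m in /mnt/smb-vip1 /mnt/smb-vip2 /mnt/smb-vip3 /mnt/smb-vip4; do
    ( ls "$m" > /dev/null 2>&1 && echo "$m: responsive" ) &
done
sleep 10
echo "mount points not reported as responsive above are still blocked"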

Comment 5 surabhi 2014-05-13 10:28:11 UTC
Upgraded the CTDB package to version ctdb2.5-2.5.3-2.el6rhs and executed the following:

Mount the volume on four different clients with 4 VIPs.
Reboot/power off one of the nodes.
Check ctdb status and ctdb ip (a sketch for watching the failover follows these steps).
Do ls on all the mount points.
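
To watch the failover while the node is down, the cluster state can be polled from one of the surviving nodes; a minimal sketch:

# Poll the cluster view every 2 seconds from a surviving node during the reboot
watch -n 2 'ctdb status; echo; ctdb ip'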

Actual results: 
On reboot/power-off, the other nodes don't go to the UNHEALTHY state.
The mount points are accessible once the failover is done.

The issue is not seen with the latest build. Marking the BZ as verified.

Comment 7 errata-xmlrpc 2014-09-22 19:32:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html

