Bug 1513736 - Client unable to see or mount NFS-Ganesha export from 12 x (4 + 2) distributed-dispersed volume [NEEDINFO]
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: nfs-ganesha
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Kaleb KEITHLEY
QA Contact: Manisha Saini
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-15 21:34 UTC by Dustin Black
Modified: 2018-02-15 05:20 UTC
CC: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-15 05:20:28 UTC
asoman: needinfo+
jthottan: needinfo? (dblack)


Attachments (Terms of Use)
Output from 1460514 test program (deleted), 2017-11-15 21:37 UTC, Dustin Black
Output from bug1283983 test program (deleted), 2017-11-15 21:37 UTC, Dustin Black

Description Dustin Black 2017-11-15 21:34:43 UTC
Description of problem:
After creating a 12 x (4 + 2) distributed-dispersed volume and configuring NFS-Ganesha with HA, a RHEL NFS client is unable to see the volume exported from the Gluster nodes.


Version-Release number of selected component (if applicable):
RHGS 3.3.0 from ISO


How reproducible:
Consistently reproducible with clean builds in a hardware lab of 6 nodes, each with 12 HDDs.


Steps to Reproduce:
1. Create 12 x (4 + 2) distributed-dispersed volume across 6 nodes with 12 HDDs each, one LVM stack and brick per HDD
2. Configure NFS-Ganesha for HA per documentation
3. Open firewall services and ports per documentation
4. Attempt 'showmount -e' or 'mount -t nfs -o vers=3' command from a RHEL client

Actual results:
'showmount -e' reports no exports from the VIP, and the mount command fails with an 'access denied' error.


Expected results:
'showmount -e' displays the gluster volume as exported, and the mount command succeeds.


Additional info:

Testing our automated deployment, I have a 12 x (4 + 2) volume -- 6 nodes each with 12 single-disk bricks. Additionally there is an lvmcache layer attached to each brick's thin pool. We are using a 2:3 ratio for NFS-Ganesha nodes, so there are 4 nodes configured to host VIPs and share the volumes.

Everything seems properly configured for NFS-Ganesha to start correctly with HA. VIPs are up, pcs status looks good, NFS-Ganesha service is running, configs show they are exporting the Gluster volume...

However, attempting to mount from a client I get:

# mount -t nfs -o vers=3 192.168.1.201:/gluster1 /mnt
mount.nfs: access denied by server while mounting 192.168.1.201:/gluster1

What I immediately find on the server side is a repeating set of W and E messages in the ganesha-gfapi.log file. Here's a snippet of some lines I think are relevant; I'll share more as it's useful:

[2017-11-10 19:04:18.288694] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7efe9e943242] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7efe9e7088ae] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7efe9e7089be] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7efe9e70a130] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7efe9e70abe0] ))))) 0-gluster1-client-45: forced unwinding frame type(GlusterFS Handshake) op(SET_LK_VER(4)) called at 2017-11-10 18:54:57.701296 (xid=0x5)
[2017-11-10 19:04:18.288728] W [MSGID: 114032] [client-handshake.c:190:client_set_lk_version_cbk] 0-gluster1-client-45: received RPC status error [Transport endpoint is not connected]
[2017-11-10 19:04:18.289215] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7efe9e943242] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7efe9e7088ae] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7efe9e7089be] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7efe9e70a130] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7efe9e70abe0] ))))) 0-gluster1-client-45: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2017-11-10 18:54:57.701307 (xid=0x6)
[2017-11-10 19:04:18.289237] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-gluster1-client-45: socket disconnected
[2017-11-10 19:04:18.289286] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-gluster1-client-49: disconnected from gluster1-client-49. Client process will keep trying to connect to glusterd until brick's port is available

As far as I can quickly see, the log lines are all variations on this theme, mostly pointing to different bricks.

Initially, 'showmount -e' does show the exported volume, but after some time it returns an empty list of exports.

The output of 'gluster vol status' seems fine with regard to the ports all up and listening. Firewall looks to be correct with the right services enabled and the gluster brick port range visible in the iptables output. I don't see any selinux denials.

The ganesha.log file gives some more interesting messages:

10/11/2017 18:49:33 : epoch 86200000 : n1.example.com : ganesha.nfsd-94846[dbus_heartbeat] glusterfs_create_export :FSAL :EVENT :Volume gluster1 exported at : '/'
10/11/2017 19:06:18 : epoch 86200000 : n1.example.com : ganesha.nfsd-94846[dbus_heartbeat] posix2fsal_type :FSAL :WARN :Unknown object type: 0
10/11/2017 19:06:18 : epoch 86200000 : n1.example.com : ganesha.nfsd-94846[dbus_heartbeat] posix2fsal_type :FSAL :WARN :Unknown object type: 0
10/11/2017 19:06:18 : epoch 86200000 : n1.example.com : ganesha.nfsd-94846[dbus_heartbeat] mdcache_new_entry :INODE :MAJ :unknown type 4294967295 provided
10/11/2017 19:06:18 : epoch 86200000 : n1.example.com : ganesha.nfsd-94846[dbus_heartbeat] init_export_root :EXPORT :CRIT :Lookup failed on path, ExportId=2 Path=/gluster1 FSAL_ERROR=(Invalid object type,0)
10/11/2017 19:06:43 : epoch 86200000 : n1.example.com : ganesha.nfsd-94846[dbus_heartbeat] gsh_export_addexport :EXPORT :CRIT :0 export entries in /var/run/gluster/shared_storage/nfs-ganesha/exports/export.gluster1.conf added because (invalid param value) errors
10/11/2017 19:06:43 : epoch 86200000 : n1.example.com : ganesha.nfsd-94846[dbus_heartbeat] dbus_message_entrypoint :DBUS :MAJ :Method (AddExport) on (org.ganesha.nfsd.exportmgr) failed: name = (org.freedesktop.DBus.Error.InvalidFileContent), message = (0 export entries in /var/run/gluster/shared_storage/nfs-ganesha/exports/export.gluster1.conf added because (invalid param value) errors. Details:
Config File (/var/run/gluster/shared_storage/nfs-ganesha/exports/export.gluster1.conf:4): 1 validation errors in block EXPORT
Config File (/var/run/gluster/shared_storage/nfs-ganesha/exports/export.gluster1.conf:4): Errors found in configuration block EXPORT

So first it seems to export the gluster1 volume, and then it complains about it. Looking at the config file that was generated for the volume, I can't see what it would be complaining about.

EXPORT{
      Export_Id = 2;
      Path = "/gluster1";
      FSAL {
           name = GLUSTER;
           hostname="localhost";
          volume="gluster1";
           }
      Access_type = RW;
      Disable_ACL = true;
      Squash="No_root_squash";
      Pseudo="/gluster1";
      Protocols = "3", "4" ;
      Transports = "UDP","TCP";
      SecType = "sys";
     }
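
One quick sanity check on a block like this is to flatten it into the key/value pairs a parser would see, which makes stray characters or unterminated values jump out. Below is a toy Python sketch of that idea; it is not the config grammar Ganesha actually uses, so it is purely illustrative:

```python
import re

# Hypothetical copy of the export block shown above.
export_conf = """
EXPORT {
    Export_Id = 2;
    Path = "/gluster1";
    FSAL {
        name = GLUSTER;
        hostname = "localhost";
        volume = "gluster1";
    }
    Access_type = RW;
    Protocols = "3", "4";
}
"""

# Toy parser: collect every 'key = value;' pair, ignoring block braces.
pairs = dict(re.findall(r'(\w+)\s*=\s*([^;]+);', export_conf))
print(pairs["Export_Id"])   # prints: 2
print(pairs["Path"])        # prints: "/gluster1" (quotes retained as written)
```

Because the toy parser keeps quoting exactly as written, mismatched or missing quotes would show up directly in the extracted values.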

Comment 2 Dustin Black 2017-11-15 21:36:36 UTC
Offline from Soumya:

> Maybe the attributes of the filesystem returned were not valid. Is the fuse-mount of the same volume successful? To narrow down whether it's indeed an issue with only ganesha, could you please try these sample gfapi 'C' programs [1] [2]
> [1] https://github.com/gluster/glusterfs/blob/master/tests/basic/gfapi/bug1283983.c
> [2] https://github.com/gluster/glusterfs/blob/master/tests/bugs/gfapi/bug-1447266/1460514.c


Yes, the volume mounts successfully via gluster native fuse.

I ran both of the gfapi sample programs for a while and have attached the log files. They both seem to result in critical timeout errors in communicating with the bricks.

Comment 3 Dustin Black 2017-11-15 21:37:17 UTC
Created attachment 1353008 [details]
Output from 1460514 test program

Comment 4 Dustin Black 2017-11-15 21:37:45 UTC
Created attachment 1353009 [details]
Output from bug1283983 test program

Comment 5 Soumya Koduri 2017-11-16 08:12:41 UTC
As per comment#2, even simple gfapi programs failed to execute against this particular volume, so the issue lies in the gluster stack. In the logs, we can see many disconnects between the client and the brick servers, which may have left too few disperse subvolumes up.
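
To make the disperse arithmetic concrete: in a 12 x (4 + 2) volume, each of the 12 subvolumes has 6 bricks and needs at least 4 of them connected; once any one subvolume loses 3 or more bricks it stops serving I/O. A small sketch of that availability rule (plain arithmetic, not gluster code):

```python
# Availability arithmetic for a distributed-dispersed (k+m) volume.
# Each disperse subvolume needs at least k of its k+m bricks connected.

def subvolumes_up(k, m, bricks_down_per_subvol):
    """Count subvolumes still serving I/O, given per-subvolume brick failures."""
    need = k            # data bricks required (here 4)
    total = k + m       # bricks per subvolume (here 6)
    return sum(1 for down in bricks_down_per_subvol if total - down >= need)

# 12 x (4 + 2): each subvolume tolerates up to m = 2 lost bricks.
failures = [0] * 12
failures[0] = 2         # subvolume 0 loses 2 bricks: still up
failures[1] = 3         # subvolume 1 loses 3 bricks: down
print(subvolumes_up(4, 2, failures))   # prints: 11
```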

Request Pranith/Du to provide comments.

@Dustin,
Just to rule out network issues, are there any AVCs or firewalld warnings reported?

Comment 9 Dustin Black 2017-11-16 21:21:41 UTC
(In reply to Soumya Koduri from comment #5)
> As per comment#2, even simple gfapi programs failed to execute against this
> particular volume, so the issue lies in the gluster stack. In the logs, we
> can see many disconnects between the client and the brick servers, which
> may have left too few disperse subvolumes up.
> 
> Request Pranith/Du to provide comments.
> 
> @Dustin,
> Just to rule out network issues, are there any AVCs or firewalld warnings
> reported?

The firewalld service was running on the nodes, and all ports and services seemed to be appropriately configured for Gluster, based on the documentation. I never saw any AVC denials or firewalld failures. My lab cycles were limited, and I didn't get a chance to test again with the firewall disabled; I'll need to build out another reproducer lab.

Comment 15 Dustin Black 2017-11-17 20:04:54 UTC
I believe I have found the trigger for the problem. As part of my usual deployment, I set a default gateway on the servers, even though the subnet that I am on isn't actually connected to a router -- therefore I am setting an invalid or unavailable default gateway on the nodes.

My nodes and clients are all on the same subnet, so this shouldn't matter at all as there is no reason for any of the client-server or server-server communication to need to be routed, but with this "bad" gateway defined the client mount will always fail as described here. If I delete the default gateway entries from all of the servers, the clients can mount via NFS with no problem.
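
For what it's worth, the same-subnet reasoning can be checked mechanically with Python's ipaddress module. The /24 prefix and the client address below are assumptions, chosen to match the 192.168.1.201 VIP used in the mount test:

```python
import ipaddress

# When source and destination match a connected-network route, the kernel
# never falls back to the default route, so a bogus gateway should be inert.
subnet = ipaddress.ip_network("192.168.1.0/24")      # assumed prefix length
server_vip = ipaddress.ip_address("192.168.1.201")   # VIP from the mount test
client = ipaddress.ip_address("192.168.1.50")        # hypothetical client

same_subnet = server_vip in subnet and client in subnet
print(same_subnet)   # prints: True
```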

Try adding a bogus default gateway to the servers and see if you can reproduce the behavior.

