Bug 1509654 - SpmStart fails when a storage domain's file system is read-only
Summary: SpmStart fails when a storage domain's file system is read-only
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.20.3
Hardware: x86_64
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ovirt-4.2.2
: 4.20.22
Assignee: Fred Rolland
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-05 13:09 UTC by Elad
Modified: 2018-04-18 12:26 UTC (History)
8 users

Fixed In Version: vdsm v4.20.22
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-18 12:26:01 UTC
oVirt Team: Storage
rule-engine: ovirt-4.2+
ylavi: ovirt-4.3+
ylavi: testing_plan_complete?


Attachments
logs from engine and hypervisor (deleted) (2017-11-05 13:09 UTC, Elad)
logs2 (deleted) (2018-01-29 09:31 UTC, Elad)


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 87865 master MERGED storage: update SD role in domain state callback 2018-03-12 13:03:23 UTC
oVirt gerrit 88834 ovirt-4.2 MERGED storage: update SD role in domain state callback 2018-03-14 09:28:59 UTC

Description Elad 2017-11-05 13:09:36 UTC
Created attachment 1348127 [details]
logs from engine and hypervisor

Description of problem:
A Gluster domain whose file system became read-only causes the whole storage pool to go down, regardless of whether the Gluster domain was the master or not.


Version-Release number of selected component (if applicable):

ovirt-engine-4.2.0-0.0.master.20171101152059.git8b85834.el7.centos.noarch

vdsm-jsonrpc-4.20.6-11.git582b910.el7.centos.noarch
vdsm-yajsonrpc-4.20.6-11.git582b910.el7.centos.noarch
vdsm-client-4.20.6-11.git582b910.el7.centos.noarch
vdsm-http-4.20.6-11.git582b910.el7.centos.noarch
vdsm-hook-vfio-mdev-4.20.6-11.git582b910.el7.centos.noarch
vdsm-4.20.6-11.git582b910.el7.centos.x86_64
vdsm-api-4.20.6-11.git582b910.el7.centos.noarch
vdsm-hook-vhostmd-4.20.6-11.git582b910.el7.centos.noarch
vdsm-hook-vmfex-dev-4.20.6-11.git582b910.el7.centos.noarch
vdsm-hook-openstacknet-4.20.6-11.git582b910.el7.centos.noarch
vdsm-python-4.20.6-11.git582b910.el7.centos.noarch
vdsm-hook-fcoe-4.20.6-11.git582b910.el7.centos.noarch
vdsm-hook-ethtool-options-4.20.6-11.git582b910.el7.centos.noarch
glusterfs-client-xlators-3.12.1-2.el7.x86_64
glusterfs-rdma-3.12.1-2.el7.x86_64
glusterfs-3.12.1-2.el7.x86_64
glusterfs-cli-3.12.1-2.el7.x86_64
glusterfs-libs-3.12.1-2.el7.x86_64
glusterfs-api-3.12.1-2.el7.x86_64
glusterfs-fuse-3.12.1-2.el7.x86_64
libvirt-daemon-driver-storage-3.2.0-14.el7_4.3.x86_64
libvirt-python-3.2.0-3.el7_4.1.x86_64
libvirt-daemon-driver-storage-core-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-iscsi-3.2.0-14.el7_4.3.x86_64
libvirt-lock-sanlock-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-lxc-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-scsi-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-secret-3.2.0-14.el7_4.3.x86_64
libvirt-libs-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-qemu-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-logical-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-interface-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-kvm-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-config-network-3.2.0-14.el7_4.3.x86_64
libvirt-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-network-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-disk-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-nodedev-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-config-nwfilter-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-mpath-3.2.0-14.el7_4.3.x86_64
libvirt-client-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-nwfilter-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-rbd-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-3.2.0-14.el7_4.3.x86_64
libvirt-python-3.2.0-3.el7_4.1.x86_64
libvirt-daemon-driver-storage-core-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-iscsi-3.2.0-14.el7_4.3.x86_64
libvirt-lock-sanlock-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-lxc-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-scsi-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-secret-3.2.0-14.el7_4.3.x86_64
libvirt-libs-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-qemu-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-gluster-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-logical-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-interface-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-kvm-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-config-network-3.2.0-14.el7_4.3.x86_64
libvirt-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-network-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-disk-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-nodedev-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-config-nwfilter-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-mpath-3.2.0-14.el7_4.3.x86_64
libvirt-client-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-nwfilter-3.2.0-14.el7_4.3.x86_64
libvirt-daemon-driver-storage-rbd-3.2.0-14.el7_4.3.x86_64
qemu-kvm-ev-2.9.0-16.el7_4.8.1.x86_64
qemu-img-ev-2.9.0-16.el7_4.8.1.x86_64
ipxe-roms-qemu-20170123-1.git4e85b27.el7_4.1.noarch
qemu-kvm-common-ev-2.9.0-16.el7_4.8.1.x86_64
sanlock-python-3.5.0-1.el7.x86_64
sanlock-3.5.0-1.el7.x86_64
libvirt-lock-sanlock-3.2.0-14.el7_4.3.x86_64
sanlock-lib-3.5.0-1.el7.x86_64
selinux-policy-targeted-3.13.1-166.el7_4.5.noarch
libselinux-2.5-11.el7.x86_64
libselinux-utils-2.5-11.el7.x86_64
libselinux-python-2.5-11.el7.x86_64
selinux-policy-3.13.1-166.el7_4.5.noarch
kernel: 3.10.0-693.5.2.el7.x86_64 #1 SMP Fri Oct 13 10:46:25 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server 7.4 (Maipo)



How reproducible:
Always

Steps to Reproduce:
1. Have a data center with more than one host, a Gluster domain residing on a replica 3 volume, and another domain of a different type
2. Make 2 of the bricks of the Gluster volume on which the Gluster domain resides unreachable from the hosts


Actual results:

A read-only file system is reported for the Gluster domain, which resides on a replica 3 volume, while 2 of the volume's bricks are unreachable (which makes the volume read-only):

2017-11-05 14:39:25,886+0200 ERROR (jsonrpc/4) [storage.TaskManager.Task] (Task='21e71231-b46c-4df0-9a56-a03674758f1f') Unexpected error (task:875)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in createStorageDomain
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 2571, in createStorageDomain
    storageType, domVersion)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/nfsSD.py", line 83, in create
    version)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/nfsSD.py", line 50, in _preCreateValidation
    fileSD.validateFileSystemFeatures(sdUUID, domPath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 100, in validateFileSystemFeatures
    oop.getProcessPool(sdUUID).directTouch(testFilePath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 319, in directTouch
    ioproc.touch(path, flags, mode)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 575, in touch
    self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 459, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 30] Read-only file system
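
The failure above comes from vdsm's write-probe on the domain mountpoint (validateFileSystemFeatures calling directTouch on a test file). As a simplified illustration only (the helper name and return-value convention here are my own, not vdsm's actual code), the check boils down to attempting to create the test file and translating EROFS into a "domain not writable" verdict:

```python
import errno
import os

def domain_is_writable(mountpoint, test_file="__DIRECT_IO_TEST__"):
    """Probe a file storage domain for writability, in the spirit of
    vdsm's validateFileSystemFeatures(). Returns False when the mount
    is read-only instead of letting the OSError propagate."""
    path = os.path.join(mountpoint, test_file)
    try:
        # Attempt to create/open the probe file for writing.
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
        os.close(fd)
        return True
    except OSError as e:
        if e.errno == errno.EROFS:  # Errno 30, as in the traceback above
            return False
        raise
```

In the buggy flow, the OSError propagates out of createStorageDomain/produce instead of being confined to marking the single affected domain inactive, which is why one read-only domain can take the whole pool down.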


SpmStart fails on all the DC hosts:


Engine:

2017-11-05 14:34:10,427+02 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [4628c4ae] IrsBroker::Failed::GetStoragePoolInfoVDS: IRSGenericException: IRSErrorException: SpmStart failed
 

2017-11-05 14:34:12,334+02 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (EE-ManagedThreadFactory-engine-Thread-34337) [667a7af2] IrsBroker::Failed::DetachStorageDomainVDS: IRSGenericException: IRSErrorException: SpmStart failed
2017-11-05 14:34:12,336+02 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.DetachStorageDomainVDSCommand] (EE-ManagedThreadFactory-engine-Thread-34337) [667a7af2] FINISH, DetachStorageDomainVDSCommand, log id: 5965d2f
2017-11-05 14:34:12,336+02 ERROR [org.ovirt.engine.core.bll.storage.domain.ForceRemoveStorageDomainCommand] (EE-ManagedThreadFactory-engine-Thread-34337) [667a7af2] Could not force detach storage domain 'test_gluster_2': EngineException: org.ovirt.engine.core.vdsbroker.irsbroker.IrsSpmStartFailedException: IRSGenericException: IRSErrorException: SpmStart failed (Failed with error ENGINE and code 5001)


Expected results:
If a Gluster domain's file system is read-only, only that Gluster domain should be affected.


Additional info:



Gluster volume details:

Type: Replicate
Volume ID: 9c28d910-62b1-41dc-a5d4-3639812b0ac1
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gluster01.scl.lab.tlv.redhat.com:/gluster_volumes/jenkins_ge12_volume_1
Brick2: gluster02.scl.lab.tlv.redhat.com:/gluster_volumes/jenkins_ge12_volume_1
Brick3: gluster03.scl.lab.tlv.redhat.com:/gluster_volumes/jenkins_ge12_volume_1
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
storage.owner-gid: 36
storage.owner-uid: 36


[root@gluster01 ~]# gluster peer status
Number of Peers: 2

Hostname: 10.35.83.241
Uuid: ccdbbe2f-d0c2-404b-a50a-c686b99f3f78
State: Peer in Cluster (Disconnected)

Hostname: 10.35.83.242
Uuid: 88604198-b9f1-4796-a970-b85730bf6407
State: Peer in Cluster (Disconnected)

Comment 1 Allon Mureinik 2017-11-05 13:54:07 UTC
After no host can be selected as SPM, I'd expect the monitoring to conclude that the domain is the problem and set it to inactive.

Comment 2 Yaniv Lavi 2017-11-06 09:14:49 UTC
I'm raising the priority for the HCI use case.
We need to fix this by the oVirt GA.

Comment 3 Tal Nisan 2017-11-06 15:53:03 UTC
Adam, can you please scrub?

Comment 4 Adam Litke 2017-11-06 20:19:24 UTC
Yep... Working on it.

Comment 5 Adam Litke 2017-11-06 20:47:03 UTC
How are you making the two hosts unreachable?  I am finding the following errors in the vdsm.log which happen when trying to locate storage domains on storage.  This problem appears to block the produce operation for all domains which inhibits the domain monitor from starting.



2017-11-05 15:01:41,170+0200 DEBUG (monitor/4cc7fe5) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-3 /usr/bin/sudo -n /usr/sbin/lvm vgs --config ' devices { preferred_names = ["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter = [ '\''a|/dev/mapper/3600a09803830347a625d467a7974572d|/dev/mapper/3600a09803830347a625d467a79745738|/dev/mapper/3600a09803830347a625d467a79745739|/dev/mapper/3600a09803830347a625d467a79745741|/dev/mapper/3600a09803830347a625d467a79745742|/dev/mapper/3600a09803830347a625d467a79745743|/dev/mapper/3600a09803830347a625d467a79746175|/dev/mapper/3600a09803830347a625d467a79746432|/dev/mapper/3600a09803830347a625d467a79746433|'\'', '\''r|.*|'\'' ] }  global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0 }  backup {  retain_min = 50  retain_days = 0 } ' --noheadings --units b --nosuffix --separator '|' --ignoreskippedcluster -o uuid,name,attr,size,free,extent_size,extent_count,free_count,tags,vg_mda_size,vg_mda_free,lv_count,pv_count,pv_name 4cc7fe5f-f5f5-4732-8ea1-dfea0c391db0 (cwd None) (commands:70)

2017-11-05 15:01:41,272+0200 DEBUG (monitor/4cc7fe5) [storage.Misc.excCmd] FAILED: <err> = '  /dev/mapper/3600a09803830347a625d467a79745739: read failed after 0 of 4096 at 0: Input/output error\n  /dev/mapper/3600a09803830347a625d467a79745739: read failed after 0 of 4096 at 161061208064: Input/output error\n  /dev/mapper/3600a09803830347a625d467a79745739: read failed after 0 of 4096 at 161061265408: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79745739 was disabled\n  /dev/mapper/3600a09803830347a625d467a79745738: read failed after 0 of 4096 at 0: Input/output error\n  /dev/mapper/3600a09803830347a625d467a79745738: read failed after 0 of 4096 at 161061208064: Input/output error\n  /dev/mapper/3600a09803830347a625d467a79745738: read failed after 0 of 4096 at 161061265408: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79745738 was disabled\n  /dev/mapper/3600a09803830347a625d467a7974572d: read failed after 0 of 4096 at 0: Input/output error\n  /dev/mapper/3600a09803830347a625d467a7974572d: read failed after 0 of 4096 at 161061208064: Input/output error\n  /dev/mapper/3600a09803830347a625d467a7974572d: read failed after 0 of 4096 at 161061265408: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a7974572d was disabled\n  /dev/mapper/3600a09803830347a625d467a79745741: read failed after 0 of 4096 at 0: Input/output error\n  /dev/mapper/3600a09803830347a625d467a79745741: read failed after 0 of 4096 at 53687025664: Input/output error\n  /dev/mapper/3600a09803830347a625d467a79745741: read failed after 0 of 4096 at 53687083008: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79745741 was disabled\n  /dev/mapper/3600a09803830347a625d467a79745742: read failed after 0 of 4096 at 0: Input/output error\n  /dev/mapper/3600a09803830347a625d467a79745742: read failed after 0 of 4096 at 53687025664: Input/output error\n  /dev/mapper/3600a09803830347a625d467a79745742: read failed after 0 of 4096 at 53687083008: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79745742 was disabled\n  /dev/mapper/3600a09803830347a625d467a79745743: read failed after 0 of 4096 at 0: Input/output error\n  /dev/mapper/3600a09803830347a625d467a79745743: read failed after 0 of 4096 at 53687025664: Input/output error\n  /dev/mapper/3600a09803830347a625d467a79745743: read failed after 0 of 4096 at 53687083008: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79745743 was disabled\n  /dev/mapper/3600a09803830347a625d467a79746175: read failed after 0 of 4096 at 0: Input/output error\n  /dev/mapper/3600a09803830347a625d467a79746175: read failed after 0 of 4096 at 139607343104: Input/output error\n  /dev/mapper/3600a09803830347a625d467a79746175: read failed after 0 of 4096 at 139607400448: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79746175 was disabled\n  /dev/mapper/3600a09803830347a625d467a79746432: read failed after 0 of 4096 at 0: Input/output error\n  /dev/mapper/3600a09803830347a625d467a79746432: read failed after 0 of 4096 at 16106061824: Input/output error\n  /dev/mapper/3600a09803830347a625d467a79746432: read failed after 0 of 4096 at 16106119168: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79746432 was disabled\n  /dev/mapper/3600a09803830347a625d467a79746433: read failed after 0 of 4096 at 0: Input/output error\n  /dev/mapper/3600a09803830347a625d467a79746433: read failed after 0 of 4096 at 37585092608: Input/output error\n  /dev/mapper/3600a09803830347a625d467a79746433: read failed after 0 of 4096 at 37585149952: Input/output error\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79746433 was disabled\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79745739 was disabled\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79745738 was disabled\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a7974572d was disabled\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79745741 was disabled\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79745742 was disabled\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79745743 was disabled\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79746175 was disabled\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79746432 was disabled\n  WARNING: Error counts reached a limit of 3. Device /dev/mapper/3600a09803830347a625d467a79746433 was disabled\n  Volume group "4cc7fe5f-f5f5-4732-8ea1-dfea0c391db0" not found\n  Cannot process volume group 4cc7fe5f-f5f5-4732-8ea1-dfea0c391db0\n'; <rc> = 5 (commands:94)

Comment 6 Elad 2017-11-07 09:03:48 UTC
(In reply to Adam Litke from comment #5)
> How are you making the two hosts unreachable?  I am finding the following
> errors in the vdsm.log which happen when trying to locate storage domains on
> storage.  This problem appears to block the produce operation for all
> domains which inhibits the domain monitor from starting.  

It happened due to a malfunction in our Gluster setup: 2 of the 3 Gluster peers were down, which caused the Gluster volume's file system to become read-only.

*The domain's Gluster volume is replica 3, with a brick on each of the Gluster peers.

Comment 7 Adam Litke 2017-11-07 14:41:02 UTC
(In reply to Elad from comment #6)
> (In reply to Adam Litke from comment #5)
> > How are you making the two hosts unreachable?  I am finding the following
> > errors in the vdsm.log which happen when trying to locate storage domains on
> > storage.  This problem appears to block the produce operation for all
> > domains which inhibits the domain monitor from starting.  
> 
> It happened due to malfunction in our Gluster setup, in which 2/3 of the
> Gluster peers were down and this caused the Gluster volume FS to become RO. 
> 
> *The domain's Gluster volume is replica3 which consists of a brick on each
> one of the Gluster peers.

Did anything else go wrong during this time?  The lvm I/O errors are concerning to me (even if not directly related).

Comment 8 Adam Litke 2017-11-07 15:14:41 UTC
The vdsm logs don't seem to match up with the times in the engine logs: the engine starts showing errors at 11 AM, but I don't have any logs for the hosts until 3 PM.

Comment 9 Elad 2017-11-08 09:06:48 UTC
I don't have any more logs from the time of this occurrence. I'll reproduce again and provide the logs.

Comment 10 Yaniv Kaul 2017-11-16 08:28:06 UTC
(In reply to Elad from comment #9)
> I don't have any more logs from the time of this occurrence. I'll reproduce
> again and provide the logs.

Re-setting NEEDINFO for reproduction.

Comment 11 Elad 2017-12-27 14:11:57 UTC
I wasn't able to reproduce the issue. 
Fred, please try to work with the current supplied information.

Comment 12 Yaniv Kaul 2017-12-27 15:05:08 UTC
(In reply to Elad from comment #11)
> I wasn't able to reproduce the issue. 
> Fred, please try to work with the current supplied information.

I'm closing for the time being. Adam has investigated it above. It's a very good tier 4 test (which can easily be done with Gluster), so I suggest re-creating it (take 2 different Gluster setups with replica 3: keep one healthy, and on the other bring down 2 nodes).

Please re-open when reproduced.

Comment 13 Elad 2018-01-29 09:31:41 UTC
Created attachment 1387669 [details]
logs2

We encountered the bug once again.
Our Gluster cluster is based on 3 virtual machines; 2 of these machines entered a Paused state, which caused the replica 3 volumes they provide to become read-only file systems.
These volumes are used by our environments (for RHV automation).


2018-01-29 10:44:58,754+0200 ERROR (upgrade/6d1d3bb) [storage.StoragePool] FINISH thread <Thread(upgrade/6d1d3bb, started daemon 139700764600064)> failed (concurrent:201)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 189, in _upgradePoolDomain
    domain = sdCache.produce(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 110, in produce
    domain.getRealDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/glusterSD.py", line 55, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 391, in __init__
    validateFileSystemFeatures(manifest.sdUUID, manifest.mountpoint)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 104, in validateFileSystemFeatures
    oop.getProcessPool(sdUUID).directTouch(testFilePath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 319, in directTouch
    ioproc.touch(path, flags, mode)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 577, in touch
    self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 461, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 30] Read-only file system
2018-01-29 10:44:58,898+0200 DEBUG (upgrade/84c91da) [storage.StorageDomainManifest] Reading domain in path /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge5__nfs__1/84c
91da9-15ec-447f-92b5-a16016d917a2 (fileSD:168)



Hypervisor:
vdsm-4.20.17-1.el7ev.x86_64
libvirt-3.9.0-6.el7.x86_64
glusterfs-3.8.4-53.el7.x86_64

Engine:
rhvm-4.2.1.3-0.1.el7.noarch

Gluster servers:
glusterfs-server-3.12.2-1.el7.x86_64

Comment 14 Allon Mureinik 2018-01-29 13:16:26 UTC
Are there other domains in the setup?
Was the gluster domain the master SD?

Comment 15 Elad 2018-01-29 14:19:41 UTC
(In reply to Allon Mureinik from comment #14)
> Are there other domains in the setup?
Yes, NFS and iSCSI 
> Was the gluster domain the master SD?
The Gluster domain doesn't have to be master for the bug to reproduce

Comment 16 Elad 2018-01-30 09:35:38 UTC
Allon, this bug occurs also on 4.1.9. Should we clone it?

Comment 17 Allon Mureinik 2018-01-31 14:41:20 UTC
(In reply to Elad from comment #16)
> Allon, this bug occurs also on 4.1.9. Should we clone it?
It's an upstream oVirt bug, no need to clone - we can just retarget it if we decide to fix it in 4.1.z. But I'd like to get a sense of the scope of the fix before retargeting. This bug has probably been around forever, and 4.1.10 will probably be the last 4.1.z release. I don't want to introduce anything we aren't 100% sure of.

Comment 18 Charlie 2018-01-31 15:50:19 UTC
oVirt Version 4.1.2.2-1.el7.centos

Yesterday I had routing issues with a gateway, though I'm not sure why this happened, as the oVirt engine and hypervisors are behind the gateway on a private network...  But it looks Gluster related.  The odd thing is that GlusterFS is connected and I can read/write to the gluster mount without issue.


[root@terri-wolven 8fe3f7f3-15c9-4376-b20e-5e237b31b185]# ls -halt
total 16K
drwxr-xr-x.  4 vdsm kvm 4.0K Jan 31 15:48 .
drwxr-xr-x. 11 vdsm kvm 4.0K Jan 31 15:33 images
drwxr-xr-x.  6 vdsm kvm 4.0K Nov 21 22:31 ..
drwxr-xr-x.  2 vdsm kvm 4.0K Aug  3 13:42 dom_md




2018-01-31 15:41:19,353+0000 ERROR (monitor/8fe3f7f) [storage.Monitor] Setting up monitor for 8fe3f7f3-15c9-4376-b20e-5e237b31b185 failed (monitor:329)
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 326, in _setupLoop
    self._setupMonitor()
  File "/usr/share/vdsm/storage/monitor.py", line 349, in _setupMonitor
    self._produceDomain()
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 401, in wrapper
    value = meth(self, *a, **kw)
  File "/usr/share/vdsm/storage/monitor.py", line 367, in _produceDomain
    self.domain = sdCache.produce(self.sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 112, in produce
    domain.getRealDomain()
  File "/usr/share/vdsm/storage/sdc.py", line 53, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 136, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/share/vdsm/storage/sdc.py", line 153, in _findDomain
    return findMethod(sdUUID)
  File "/usr/share/vdsm/storage/glusterSD.py", line 55, in findDomain
    return GlusterStorageDomain(GlusterStorageDomain.findDomainPath(sdUUID))
  File "/usr/share/vdsm/storage/fileSD.py", line 364, in __init__
    validateFileSystemFeatures(manifest.sdUUID, manifest.mountpoint)
  File "/usr/share/vdsm/storage/fileSD.py", line 98, in validateFileSystemFeatures
    oop.getProcessPool(sdUUID).directTouch(testFilePath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 333, in directTouch
    ioproc.touch(path, flags, mode)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 582, in touch
    self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 466, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 5] Input/output error

Comment 19 Charlie 2018-01-31 15:52:07 UTC
Specifically, it looks like there is an error with the directTouch(testFilePath) call.  I manually ran 'touch' on the GlusterFS mount point and there was no issue.
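
A plain `touch` succeeding while directTouch fails is not necessarily contradictory: ioprocess performs the touch with O_DIRECT, which bypasses the page cache and goes to the storage itself, so it can surface I/O errors that cached access hides. A rough sketch of the difference (my own illustration, not the ioprocess implementation):

```python
import os

def plain_touch(path):
    # Goes through the page cache; may succeed even when the backing
    # storage is unhealthy, which is what a manual `touch` does.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    os.close(fd)

def direct_touch(path):
    # O_DIRECT bypasses the page cache, so the operation hits the storage
    # itself and surfaces EIO/EROFS that a cached touch would not.
    # getattr() keeps the sketch portable to platforms without O_DIRECT.
    flags = os.O_CREAT | os.O_WRONLY | getattr(os, "O_DIRECT", 0)
    fd = os.open(path, flags, 0o644)
    os.close(fd)
```

Note that some filesystems (tmpfs, for example) reject O_DIRECT outright with EINVAL, which is another way the two probes can diverge.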

Comment 20 Charlie 2018-01-31 16:20:43 UTC
I added an INFO log to outOfProcess.py at line 333:

OSError: [Errno 5] Input/output error

2018-01-31 16:16:27,654+0000 INFO  (tasks/8) [storage.oop] Touching File at: /rhev/data-center/mnt/glusterSD/glusterfs.test.org:_gv0/__DIRECT_IO_TEST__ ... (outOfProcess:333)

2018-01-31 16:16:27,661+0000 ERROR (tasks/8) [storage.StoragePool] Unexpected error (sp:384)


I attempted to 'move' said file to /tmp...

[root@terri-wolven 8fe3f7f3-15c9-4376-b20e-5e237b31b185]# mv /rhev/data-center/mnt/glusterSD/glusterfs.test.org:_gv0/__DIRECT_IO_TEST__ /tmp/
mv: error reading ‘/rhev/data-center/mnt/glusterSD/glusterfs.test.org:_gv0/__DIRECT_IO_TEST__’: Input/output error
mv: failed to extend ‘/tmp/__DIRECT_IO_TEST__’: Input/output error

Comment 21 Charlie 2018-01-31 16:34:57 UTC
(In reply to Charlie from comment #20)
> I added a INFO log to the outOfProcess.py...line 333..
> 
> OSError: [Errno 5] Input/output error
> 
> 2018-01-31 16:16:27,654+0000 INFO  (tasks/8) [storage.oop] Touching File at:
> /rhev/data-center/mnt/glusterSD/glusterfs.test.org:_gv0/__DIRECT_IO_TEST__
> ... (outOfProcess:333)
> 
> 2018-01-31 16:16:27,661+0000 ERROR (tasks/8) [storage.StoragePool]
> Unexpected error (sp:384)
> 
> 
> I attempted to 'move' said file to /tmp...
> 
> [root@terri-wolven 8fe3f7f3-15c9-4376-b20e-5e237b31b185]# mv
> /rhev/data-center/mnt/glusterSD/glusterfs.test.org:_gv0/__DIRECT_IO_TEST__
> /tmp/
> mv: error reading
> ‘/rhev/data-center/mnt/glusterSD/glusterfs.test.org:_gv0/__DIRECT_IO_TEST__’:
> Input/output error
> mv: failed to extend ‘/tmp/__DIRECT_IO_TEST__’: Input/output error

Odd... I just forced a umount of "/rhev/data-center/mnt/glusterSD/glusterfs.test.org:_gv0" and now all my normal NFS storage domains are up.  vdsm was still complaining about not being able to touch __DIRECT_IO_TEST__.

I then ran 'rm /rhev/data-center/mnt/glusterSD/glusterfs.test.org:_gv0/__DIRECT_IO_TEST__' and now my glusterfs storage domain is back up too.  

I believe this was related to some kind of IO locking on the __DIRECT_IO_TEST__ file pointer/node, and it would appear that forcing an umount and removing that file resolved the issue.

Comment 22 Fred Rolland 2018-02-12 14:14:31 UTC
(In reply to Elad from comment #15)
> (In reply to Allon Mureinik from comment #14)
> > Are there other domains in the setup?
> Yes, NFS and iSCSI 
> > Was the gluster domain the master SD?
> The Gluster domain doesn't have to be master for the bug to reproduce

I did the same test where the gluster SD is not the master, and it worked fine.
Only the gluster SD is reported as Inactive.

I will do another test where the gluster SD is the master

Comment 23 Fred Rolland 2018-02-13 09:48:41 UTC
It does reproduce when the gluster SD is the master.

- The gluster SD becomes inactive
- The engine tries to start the SPM with another master domain
- The startSpm verb fails when trying to build the domains.
The gluster SD is passed as "active" to VDSM (I guess to try to recover), but VDSM fails to produce the domain due to the read-only state.

A quite similar scenario happens if the master domain is an NFS SD and I change the permissions to read-only (chmod a-w __DIRECT_IO_TEST__); then, again, the whole DC goes down.

The exception in the NFS flow: (there are similar exceptions both in getSpmStatus and in connectStoragePool)


Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/storage/task.py", line 882, in _run
    return fn(*args, **kargs)
  File "<string>", line 2, in getSpmStatus
  File "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 633, in getSpmStatus
    status = self._getSpmStatusInfo(pool)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/hsm.py", line 627, in _getSpmStatusInfo
    (pool.spmRole,) + pool.getSpmStatus()))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sp.py", line 140, in getSpmStatus
    return self._backend.getSpmStatus()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/spbackends.py", line 432, in getSpmStatus
    lVer, spmId = self.masterDomain.inquireClusterLock()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 48, in __getattr__
    return getattr(self.getRealDomain(), attrName)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 51, in getRealDomain
    return self._cache._realProduce(self._sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 134, in _realProduce
    domain = self._findDomain(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/sdc.py", line 151, in _findDomain
    return findMethod(sdUUID)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/nfsSD.py", line 126, in findDomain
    return NfsStorageDomain(NfsStorageDomain.findDomainPath(sdUUID))
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 391, in __init__
    validateFileSystemFeatures(manifest.sdUUID, manifest.mountpoint)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/fileSD.py", line 104, in validateFileSystemFeatures
    oop.getProcessPool(sdUUID).directTouch(testFilePath)
  File "/usr/lib/python2.7/site-packages/vdsm/storage/outOfProcess.py", line 319, in directTouch
    ioproc.touch(path, flags, mode)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 575, in touch
    self.timeout)
  File "/usr/lib/python2.7/site-packages/ioprocess/__init__.py", line 459, in _sendCommand
    raise OSError(errcode, errstr)
OSError: [Errno 13] Permission denied
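
The merged patch ("storage: update SD role in domain state callback") addresses this in vdsm's role bookkeeping. As a conceptual sketch of the desired behavior only (hypothetical helper names, not the actual vdsm code), pool/SPM startup should isolate a domain whose produce fails with a read-only or permission error rather than aborting the whole operation:

```python
import errno

def produce_pool_domains(sd_uuids, produce):
    """Produce each storage domain for pool/SPM startup, isolating
    read-only domains instead of failing the whole operation.
    `produce` stands in for a callable like vdsm's sdCache.produce()."""
    produced, failed = {}, {}
    for sd_uuid in sd_uuids:
        try:
            produced[sd_uuid] = produce(sd_uuid)
        except OSError as e:
            if e.errno in (errno.EROFS, errno.EACCES):
                failed[sd_uuid] = e  # mark inactive; startup continues
            else:
                raise
    return produced, failed
```

With this shape, a single domain hitting Errno 30 (read-only) or Errno 13 (permission denied) is reported as inactive while the rest of the pool keeps running, matching the expected result stated in the description.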

Comment 25 Elad 2018-04-12 13:58:58 UTC
In the case of an 'OSError: [Errno 30] Read-only file system' while a Gluster domain is the master, reconstruct takes place and succeeds.

Also, I went over the latest RHV automation runs (tiers 1-3); no regressions related to the fix were found.

Used:
vdsm-4.20.23-1.el7ev.x86_64
glusterfs-3.8.4-54.3.el7.x86_64
rhvm-4.2.2.6-0.1.el7.noarch

Comment 26 Sandro Bonazzola 2018-04-18 12:26:01 UTC
This bug fix is included in the oVirt 4.2.2 release, published on March 28th, 2018.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

