Bug 1013338 - Error when removing snapshot - Storage domain layout corrupted if domain is mounted by two different paths (e.g. FQDN and IP)
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.4.0
Assignee: Allon Mureinik
QA Contact: Elad
URL:
Whiteboard: storage
Depends On:
Blocks: rhev3.4beta 1142926
 
Reported: 2013-09-29 11:37 UTC by Raz Tamir
Modified: 2016-02-10 17:05 UTC
CC List: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team: Storage
Target Upstream Version:


Attachments (Terms of Use)
engine + vdsm logs (deleted)
2013-09-29 11:37 UTC, Raz Tamir
all_vdsm_logs (deleted)
2013-09-29 13:49 UTC, Raz Tamir

Description Raz Tamir 2013-09-29 11:37:22 UTC
Created attachment 804671 [details]
engine + vdsm logs

Description of problem:
When trying to delete a snapshot that was created while the VM was running (a live snapshot), an error message appears:
"Error while executing action Remove Snapshot: Storage domain layout corrupted".

From the vdsm log, the problem occurs during "deleteImage":

Thread-321928::DEBUG::2013-09-29 11:41:12,621::task::974::TaskManager.Task::(_decref) Task=`73c7cb7d-425b-4245-b283-5d8efe3f290f`::ref 1 aborting False
Thread-321928::ERROR::2013-09-29 11:41:12,634::task::850::TaskManager.Task::(_setError) Task=`73c7cb7d-425b-4245-b283-5d8efe3f290f`::Unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/task.py", line 857, in _run
    return fn(*args, **kargs)
  File "/usr/share/vdsm/logUtils.py", line 45, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/storage/hsm.py", line 1543, in deleteImage
    dom.deleteImage(sdUUID, imgUUID, volsByImg)
  File "/usr/share/vdsm/storage/fileSD.py", line 343, in deleteImage
    currImgDir = getImagePath(sdUUID, imgUUID)
  File "/usr/share/vdsm/storage/fileSD.py", line 98, in getImagePath
    return os.path.join(getDomPath(sdUUID), 'images', imgUUID)
  File "/usr/share/vdsm/storage/fileSD.py", line 92, in getDomPath
    raise se.StorageDomainLayoutError(sdUUID)
StorageDomainLayoutError: Storage domain layout corrupted: ('d2d701ea-a185-483c-af8e-e71753dbb0cc',)
Thread-321928::DEBUG::2013-09-29 11:41:12,635::task::869::TaskManager.Task::(_run) Task=`73c7cb7d-425b-4245-b283-5d8efe3f290f`::Task._run: 73c7cb7d-425b-4245-b283-5d8efe3f290f ('d2d701ea-a1
85-483c-af8e-e71753dbb0cc', '0fc8637c-c5a9-463d-81d7-42266087900a', '6f151c8d-740c-459c-8c6d-61fb0b2273ed', 'false', 'false') {} failed - stopping task
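The traceback ends in getDomPath raising StorageDomainLayoutError. A plausible reading (the glob-based lookup below is an assumption about the implementation, reconstructed from the traceback, not the actual vdsm source) is that the domain path is resolved by searching every mount point for the sdUUID directory, and the lookup fails whenever the match is not unique:

```python
import glob
import os

class StorageDomainLayoutError(Exception):
    """Stand-in for vdsm's exception of the same name."""

def get_dom_path(sd_uuid, root='/rhev/data-center/mnt'):
    # Look for the domain's directory under every mount point.
    # When the same export is mounted twice (FQDN and IP), the same
    # sdUUID directory matches under two mount points, the lookup is
    # ambiguous, and the "layout corrupted" error is raised even
    # though neither copy of the domain is actually damaged.
    matches = glob.glob(os.path.join(root, '*', sd_uuid))
    if len(matches) != 1:
        raise StorageDomainLayoutError(sd_uuid)
    return matches[0]
```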

The NFS domain is somehow mounted twice on the host, once through the FQDN and once through the IP:

[root@gold-vdsc ~]# ll /rhev/data-center/mnt/
total 16
drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:03 10.35.64.106:_fastpass_ratamir_nfs1
drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:20 10.35.64.106:_fastpass_ratamir_nfs2
drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:03 tiger.qa.lab.tlv.redhat.com:_fastpass_ratamir_nfs1
drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:20 tiger.qa.lab.tlv.redhat.com:_fastpass_ratamir_nfs2


So it seems vdsm sees the same SDUUID (SD=d2d701ea-a185-483c-af8e-e71753dbb0cc) directory under 2 different paths:
[root@gold-vdsc ~]# tree /rhev/data-center/mnt/ | grep d2d
|   |-- d2d701ea-a185-483c-af8e-e71753dbb0cc
|   |-- d2d701ea-a185-483c-af8e-e71753dbb0cc
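A host can be checked for this condition by scanning the mount root for storage-domain directories that appear under more than one mount point. This is a hypothetical diagnostic helper written for this report, not part of vdsm:

```python
import os
from collections import defaultdict

def find_duplicate_domains(mnt_root='/rhev/data-center/mnt'):
    """Map each storage-domain UUID directory to the mount points
    that contain it; any UUID listed under more than one mount point
    indicates the duplicate-mount state shown above."""
    seen = defaultdict(list)
    for mount in os.listdir(mnt_root):
        mount_path = os.path.join(mnt_root, mount)
        if not os.path.isdir(mount_path):
            continue
        for entry in os.listdir(mount_path):
            seen[entry].append(mount)
    return {uuid: mounts for uuid, mounts in seen.items()
            if len(mounts) > 1}
```

On the host above this would report d2d701ea-a185-483c-af8e-e71753dbb0cc under both the IP and the FQDN mount points.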


Version-Release number of selected component (if applicable):
vdsm-4.12.0-138.gitab256be.el6ev.x86_64
rhevm-3.3.0-0.22.master.el6ev.noarch

How reproducible:
100%

Steps to Reproduce:
Setup: 1 NFS DC, 1 host, 2 storage domains.

1. Create a new VM.
2. Create a live snapshot of that VM.
3. When the snapshot status is "OK", "Power Off" the VM.
4. Delete the snapshot.

Actual results:
An error message appears:
"Error while executing action Remove Snapshot: Storage domain layout corrupted".

and the snapshot status becomes "BROKEN" in the webadmin portal.

Expected results:
The snapshot should be removed

Additional info:

Comment 1 Nir Soffer 2013-09-29 12:18:44 UTC
(In reply to ratamir from comment #0)
> Created attachment 804671 [details]
> engine + vdsm logs
> 
> Description of problem:
> When trying to delete a snapshot that was created when the vm running (live
> snapshot) ,

Can we see the creation of the snapshot in the logs? 

> 
> the nfs domain is somehow mounted twice on the host - once through fqdn once
> through ip:
> 
> [root@gold-vdsc ~]# ll /rhev/data-center/mnt/
> total 16
> drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:03 10.35.64.106:_fastpass_ratamir_nfs1
> drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:20 10.35.64.106:_fastpass_ratamir_nfs2
> drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:03
> tiger.qa.lab.tlv.redhat.com:_fastpass_ratamir_nfs1
> drwxr-xr-x. 3 vdsm kvm 4096 Sep 23 17:20
> tiger.qa.lab.tlv.redhat.com:_fastpass_ratamir_nfs2
> 
> 
> so it seems vdsm sees the same
> SDUUID(SD=d2d701ea-a185-483c-af8e-e71753dbb0cc) directory under 2 different
> paths:
> root@gold-vdsc ~]# tree /rhev/data-center/mnt/ | grep d2d
> |   |-- d2d701ea-a185-483c-af8e-e71753dbb0cc
> |   |-- d2d701ea-a185-483c-af8e-e71753dbb0cc

Can you check how nfs domain is mounted before and after the creation of the snapshot?

> Expected results:
> The snapshot should be removed

Can you tell on which version this operation did succeed?

Comment 2 Ayal Baron 2013-09-29 13:09:39 UTC
Please attach all vdsm logs (the attached 2 logs do not contain all the info necessary).

Please attach /proc/mounts as well.

The problem is, as you stated, that the domain is mounted twice; we need to figure out how you reached this state.

Comment 3 Raz Tamir 2013-09-29 13:49:03 UTC
Created attachment 804719 [details]
all_vdsm_logs

(In reply to Ayal Baron from comment #2)
> Please attach all vdsm logs (the attached 2 logs do not contain all the info
> necessary).
all vdsm logs added

> 
> Please attach /proc/mounts as well.
[root@gold-vdsc ~]# mount
/dev/mapper/vg0-lv_root on / type ext4 (rw)
proc on /proc type proc (rw)               
sysfs on /sys type sysfs (rw)              
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0")
/dev/sda1 on /boot type ext4 (rw)                                           
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)                      
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)                      
tiger.qa.lab.tlv.redhat.com:/fastpass/ratamir/nfs1 on /rhev/data-center/mnt/tiger.qa.lab.tlv.redhat.com:_fastpass_ratamir_nfs1 type nfs (rw,soft,nosharecache,timeo=600,retrans=6,nfsvers=3,addr=10.35.64.106)                                                                                                                                                                            
tiger.qa.lab.tlv.redhat.com:/fastpass/ratamir/nfs2 on /rhev/data-center/mnt/tiger.qa.lab.tlv.redhat.com:_fastpass_ratamir_nfs2 type nfs (rw,soft,nosharecache,timeo=600,retrans=6,nfsvers=3,addr=10.35.64.106)                                                                                                                                                                            
10.35.64.106:/fastpass/ratamir/nfs1 on /rhev/data-center/mnt/10.35.64.106:_fastpass_ratamir_nfs1 type nfs (rw,soft,nosharecache,timeo=600,retrans=6,nfsvers=3,addr=10.35.64.106)             
10.35.64.106:/fastpass/ratamir/nfs2 on /rhev/data-center/mnt/10.35.64.106:_fastpass_ratamir_nfs2 type nfs (rw,soft,nosharecache,timeo=600,retrans=6,nfsvers=3,addr=10.35.64.106)  

> 
> The problem is as you stated that the domain is mounted twice, we need to
> figure out how you reached this state.

Comment 4 Ayal Baron 2013-09-29 14:26:23 UTC
Unfortunately, these logs still do not contain the info from when these domains were mounted.

Comment 6 Ayal Baron 2013-09-29 19:08:57 UTC
Haim, why is this a regression?
Can you reproduce?

Comment 7 Allon Mureinik 2013-09-30 08:29:55 UTC
From the engine log:

2013-09-24 17:38:27,685 - Attempt to create the domain via IP mount (failed):
2013-09-24 17:38:27,685 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStorageDomainVDSCommand] (ajp-/127.0.0.1:8702-10) Command CreateStorageDomainVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSError
Exception: Failed in vdscommand to CreateStorageDomainVDS, error = Storage domain is not empty - requires cleaning: ('10.35.64.106:/fastpass/ratamir/nfs1',)
2013-09-24 17:38:27,685 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStorageDomainVDSCommand] (ajp-/127.0.0.1:8702-10) FINISH, CreateStorageDomainVDSCommand, log id: 9e368ee
2013-09-24 17:38:27,686 ERROR [org.ovirt.engine.core.bll.storage.AddNFSStorageDomainCommand] (ajp-/127.0.0.1:8702-10) Command org.ovirt.engine.core.bll.storage.AddNFSStorageDomainCommand throw Vdc Bll exception. With error message VdcBLL
Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed in vdscommand to CreateStorageDomainVDS, error = Storage domain is not empty - requires cleaning: ('10.35.64.106:/fast
pass/ratamir/nfs1',) (Failed with error StorageDomainNotEmpty and code 361)

There are several instances of creating a storage domain (either successful or failing) in the log, both from IP and from FQDN.
When the creation fails, no DisconnectStorageServerCommand is issued, resulting in the same path being mounted twice.

Action items: try to reproduce this scenario, and check:
1. Does this happen consistently? (QA)
2. Is the command reported with success=true (even though it failed), which would explain why no DisconnectStorageServer was issued? (dev)
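The missing cleanup described here is the usual connect/create pairing: once ConnectStorageServer has succeeded, a failed CreateStorageDomain must be followed by DisconnectStorageServer, or the mount leaks. The engine itself is Java; this is a minimal Python sketch of the intended contract, with placeholder callables standing in for the three engine commands:

```python
def add_storage_domain(connect, create, disconnect, connection):
    """Sketch of the connect/create/disconnect contract.
    `connect`, `create` and `disconnect` stand in for
    ConnectStorageServer / CreateStorageDomain /
    DisconnectStorageServer; `connection` is the export spec."""
    connect(connection)
    try:
        return create(connection)
    except Exception:
        # Without this cleanup, a failed creation leaves the export
        # mounted; a later attempt via the other address (IP vs
        # FQDN) then mounts the same domain a second time.
        disconnect(connection)
        raise
```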

Comment 8 Allon Mureinik 2013-09-30 08:32:18 UTC
In any event, resetting the component, since VDSM can't run a command that's never sent to it :-)

Comment 9 Raz Tamir 2013-09-30 10:41:52 UTC
(In reply to Allon Mureinik from comment #7)
> From the engine log:
> 
> 2013-09-24 17:38:27,685 - Attempt to create the domain via IP mount (failed):
> 2013-09-24 17:38:27,685 ERROR
> [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStorageDomainVDSCommand]
> (ajp-/127.0.0.1:8702-10) Command CreateStorageDomainVDS execution failed.
> Exception: VDSErrorException: VDSGenericException: VDSError
> Exception: Failed in vdscommand to CreateStorageDomainVDS, error = Storage
> domain is not empty - requires cleaning:
> ('10.35.64.106:/fastpass/ratamir/nfs1',)
> 2013-09-24 17:38:27,685 INFO 
> [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateStorageDomainVDSCommand]
> (ajp-/127.0.0.1:8702-10) FINISH, CreateStorageDomainVDSCommand, log id:
> 9e368ee
> 2013-09-24 17:38:27,686 ERROR
> [org.ovirt.engine.core.bll.storage.AddNFSStorageDomainCommand]
> (ajp-/127.0.0.1:8702-10) Command
> org.ovirt.engine.core.bll.storage.AddNFSStorageDomainCommand throw Vdc Bll
> exception. With error message VdcBLL
> Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException:
> VDSGenericException: VDSErrorException: Failed in vdscommand to
> CreateStorageDomainVDS, error = Storage domain is not empty - requires
> cleaning: ('10.35.64.106:/fast
> pass/ratamir/nfs1',) (Failed with error StorageDomainNotEmpty and code 361)
> 
> There are several instances of creating a storage domain (either successful
> or failing) in the log, both from IP and from FQDN.
> When the creation fails, there is no DisconnectStorageServerCommand upon
> failure, resulting in the same path being mounted twice.
> 
> Action items: need to try to reproduce this scenario, and see:
> 1. Does this happen consistently (QA)

- From the GUI, the mount is cleaned.
- From REST, it is not cleaned after a failed creation.

> 2. Check if the command is reported with success=true (even though it
> failed), explaining why no DisconnectStorageServer was issued (dev).

Comment 10 Allon Mureinik 2013-09-30 16:28:53 UTC
This is another aspect of the same root cause described in bug 991470.

Since this is a corner case only available via REST, proposing to postpone to 3.4.

Comment 12 Allon Mureinik 2014-02-16 08:50:39 UTC
Moved to ON_QA, since the root cause was solved by bug 991470.

Comment 13 Leonid Natapov 2014-02-18 11:24:38 UTC
Verified on 3.4.0-0.7.beta2.el6. No error when removing a live snapshot from a stopped VM.

Comment 15 Itamar Heim 2014-06-12 14:07:44 UTC
Closing as part of 3.4.0

