Bug 1357919 - Copying a raw sparse file takes a lot of time compared to a qcow sparse file using qemu-img (depends on Gluster bugs) [NEEDINFO]
Summary: Copying a raw sparse file take a lot of time comparing to qcow sparse file us...
Keywords:
Status: NEW
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Gluster
Version: 4.0.1.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-4.3.4
Assignee: Sahina Bose
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-07-19 15:10 UTC by Raz Tamir
Modified: 2019-04-14 13:36 UTC (History)
8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team: Gluster
kdhananj: needinfo? (sasundar)
sabose: ovirt-4.3?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?


Attachments
vdsm.log (deleted)
2016-07-19 15:10 UTC, Raz Tamir
no flags Details
raw-sparse-libgfapi-trace-output (deleted)
2017-06-02 05:57 UTC, Krutika Dhananjay
no flags Details
raw-sparse-xfs-strace-output (deleted)
2017-06-02 05:59 UTC, Krutika Dhananjay
no flags Details

Description Raz Tamir 2016-07-19 15:10:06 UTC
Created attachment 1181674 [details]
vdsm.log

Description of problem:
When copying disks on a file-based domain, there is a difference between raw and cow sparse disks of the same size:

raw, sparse 10GB on file
[host images]# du -a *
0       55d380fe-cfc2-42b5-884b-f5b8d6da0be7/dc01ddd5-8501-4c63-ad51-0e19e3854046
1028    55d380fe-cfc2-42b5-884b-f5b8d6da0be7/dc01ddd5-8501-4c63-ad51-0e19e3854046.lease
4       55d380fe-cfc2-42b5-884b-f5b8d6da0be7/dc01ddd5-8501-4c63-ad51-0e19e3854046.meta
1036    55d380fe-cfc2-42b5-884b-f5b8d6da0be7

Copy disk:
DEBUG::2016-07-19 17:09:27,761::qemuimg::224::QemuImg::(__init__) /usr/bin/taskset --cpu-list 0-1 /usr/bin/nice -n 19 /usr/bin/ionice -c 3 /usr/bin/qemu-img convert -p -t none 
-T none -f raw /rhev/data-center/ce185612-2017-4ca1-a76f-ee701f73bd33/5cf6dd4b-65db-413a-8226-95bac7ee9378/images/55d380fe-cfc2-42b5-884b-f5b8d6da0be7/dc01ddd5-8501-4c63-ad51-0e19e3854046 -O raw /rhev/data-center/m
nt/10.35.64.11:_vol_RHEV_Storage_storage__jenkins__ge5__nfs__0/5cf6dd4b-65db-413a-8226-95bac7ee9378/images/c2d5b640-5142-4c8a-b4f6-e1b750a2e742/137d8028-ad59-47e4-acf7-2aaf7b541fa1 (cwd None)
DEBUG::2016-07-19 17:09:27,776::image::140::Storage.Image::(_wait_for_qemuimg_operation) waiting for qemu-img operation to complete
...
DEBUG::2016-07-19 17:11:47,782::image::147::Storage.Image::(_wait_for_qemuimg_operation) qemu-img operation progress: 100.0%
DEBUG::2016-07-19 17:11:47,782::image::149::Storage.Image::(_wait_for_qemuimg_operation) qemu-img operation has completed
DEBUG::2016-07-19 17:11:47,783::utils::870::vds.stopwatch::(stopwatch) Copy volume dc01ddd5-8501-4c63-ad51-0e19e3854046: 140.01 seconds


************************************************


cow, sparse 10GB on file
[host images]# du -a *
200      976ee819-50b5-4156-bd3c-84aca4e50483/14a85f54-73d2-48cd-bb79-02a2f027afd7
1028     976ee819-50b5-4156-bd3c-84aca4e50483/14a85f54-73d2-48cd-bb79-02a2f027afd7.lease
4        976ee819-50b5-4156-bd3c-84aca4e50483/14a85f54-73d2-48cd-bb79-02a2f027afd7.meta
1236     976ee819-50b5-4156-bd3c-84aca4e50483

Copy disk:
DEBUG::2016-07-19 17:05:07,036::qemuimg::224::QemuImg::(__init__) /usr/bin/taskset --cpu-list 0-1 /usr/bin/nice -n 19 /usr/bin/ionice -c 3 /usr/bin/qemu-img convert -p -t none 
-T none -f qcow2 /rhev/data-center/ce185612-2017-4ca1-a76f-ee701f73bd33/5cf6dd4b-65db-413a-8226-95bac7ee9378/images/976ee819-50b5-4156-bd3c-84aca4e50483/14a85f54-73d2-48cd-bb79-02a2f027afd7 -O qcow2 -o compat=0.10 
/rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Storage_storage__jenkins__ge5__nfs__0/5cf6dd4b-65db-413a-8226-95bac7ee9378/images/6d9be53e-56b0-42cd-9694-275461a09151/91768f6c-d3c2-4c02-9fd6-7c37c0e36fdd (cwd None)
DEBUG::2016-07-19 17:05:07,051::image::140::Storage.Image::(_wait_for_qemuimg_operation) waiting for qemu-img operation to complete
bcb205cc-94f3-4ad6-ab89-7babb17abedf::DEBUG::2016-07-19 17:05:07,118::image::147::Storage.Image::(_wait_for_qemuimg_operation) qemu-img operation progress: 100.0%
DEBUG::2016-07-19 17:05:07,118::image::149::Storage.Image::(_wait_for_qemuimg_operation) qemu-img operation has completed
DEBUG::2016-07-19 17:05:07,119::utils::870::vds.stopwatch::(stopwatch) Copy volume 14a85f54-73d2-48cd-bb79-02a2f027afd7: 0.06 seconds

The difference in execution time is strange because both images have the same actual and provisioned size
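The gap `du` shows above between apparent and allocated size is exactly what makes the raw case pathological: qemu-img sees a 10 GiB file even though almost nothing is allocated. A minimal sketch of that gap (assuming Linux and a filesystem with sparse-file support; the temporary path is illustrative):

```python
import os
import tempfile

# Create an empty file and give it a 10 GiB apparent size without writing
# any data - the same shape as the raw sparse volume in the du output above.
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
os.truncate(path, 10 * 1024**3)

st = os.stat(path)
apparent = st.st_size            # what qemu-img sees as the image size
allocated = st.st_blocks * 512   # what du reports (st_blocks is in 512-byte units)

# On a sparse-capable filesystem, allocated is far smaller than apparent.
print(apparent, allocated)
os.unlink(path)
```

Without a way to ask the filesystem where the data actually is, a copy has to process the full apparent size.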



Version-Release number of selected component (if applicable):


How reproducible:
100%

Steps to Reproduce:
1. Create 2 disks on a file-based domain, both 10GB sparse, one in cow format and one in raw format
2. Copy each disk to the same domain it exists on

Actual results:


Expected results:


Additional info:

Comment 1 Yaniv Kaul 2016-07-20 07:03:53 UTC
1. So the bug is that raw/sparse is slow? This is not what the title says.
2. Can it be reproduced with qemu-img alone, without VDSM/oVirt? If so, it's a qemu bug, no?
3. Are you sure that the NFS server supports sparse? I'd try with NFSv4.2. See http://www.snia.org/sites/default/files/NFS_4.2_Final.pdf for more details.

Comment 2 Raz Tamir 2016-07-20 08:39:42 UTC
1. True
2. 
qCOW:
[root@green-vdsa tmp]# time /usr/bin/qemu-img convert -p -t none -T none -f qcow2 /rhev/data-center/ce185612-2017-4ca1-a76f-ee701f73bd33/5cf6dd4b-65db-413a-8226-95bac7ee9378/images/6b998279-5cda-448f-a1a3-f65aefbd4252/c9185df1-1d4a-4af5-ae1c-4b9c68edf5be -O qcow2 -o compat=0.10 /rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Storage_storage__jenkins__ge5__nfs__0/5cf6dd4b-65db-413a-8226-95bac7ee9378/images/tmp/tmp_vol
    (100.00/100%)

real    0m0.058s
user    0m0.020s
sys     0m0.017s


raw:
[root@green-vdsa tmp]# time /usr/bin/qemu-img convert -p -t none -T none -f raw /rhev/data-center/ce185612-2017-4ca1-a76f-ee701f73bd33/5cf6dd4b-65db-413a-8226-95bac7ee9378/images/35a17d2f-bbc3-444d-ae55-07df3936b247/f3cbae20-bd43-4e65-a2ff-829fdc1a6083 -O raw /rhev/data-center/mnt/10.35.64.11:_vol_RHEV_Storage_storage__jenkins__ge5__nfs__0/5cf6dd4b-65db-413a-8226-95bac7ee9378/images/tmp/tmp_vol
    (100.00/100%)

real    2m18.914s
user    0m2.201s
sys     0m2.586s

3. I will check this, but this issue seems to be related to the image format (raw vs qcow, both sparse), so if the server didn't support sparse files it would affect both formats

Comment 3 Allon Mureinik 2016-07-20 09:10:18 UTC
Niels - both files here are empty, but offhand it looks like in the raw case qemu-img reads the "entire" size, including blocks that aren't really allocated.

Should your recent work on SEEK_DATA address this?

Comment 4 Yaniv Kaul 2016-07-20 09:38:30 UTC
(In reply to ratamir from comment #2)

> 
> 3. I will check this, but this issue seems to be related the the image
> format,  raw vs qcow and both sparse so if the server doesn't support sparse
> it shouldn't support for both formats

The above statement is plain wrong. qcow2 is not sparse, it's thin - the FILE grows as you add content to it.
The raw sparse format tries to take advantage of the underlying file system's sparseness feature, allowing it to 'punch holes' where there's no need for allocation.

Comment 5 Raz Tamir 2016-07-20 11:30:26 UTC
(In reply to Yaniv Kaul from comment #4)
> (In reply to ratamir from comment #2)
> 
> > 
> > 3. I will check this, but this issue seems to be related the the image
> > format,  raw vs qcow and both sparse so if the server doesn't support sparse
> > it shouldn't support for both formats
> 
> The above statement is plain wrong. qcow2 is not sparse, it's thin - the
> FILE grows as you add content to it.
> The raw sparse format tries to take advantage of the underlying file
> system's sparseness feature, allowing to 'punch hole' in it where there's no
> need for allocation.

You are right;
I should have used the term thin-provisioned disks instead of sparse, because of the double meaning

Comment 6 Niels de Vos 2016-11-22 15:14:19 UTC
(In reply to Allon Mureinik from comment #3)
> Niels - both files here are empty, but offhand it looks like in the raw case
> qemu-img reads the "entire" size, including blocks that aren't really
> allocated.
> 
> Should your recent work on SEEK_DATA address this?

Sorry for the late reply, Allon!

When a recent qemu-img with the gluster block-driver is used to copy a sparse file, the sparseness of the file should be maintained. There are some caveats though.

Older (inc. RHEL) versions of the kernel can not do SEEK_DATA/HOLE over FUSE mounts. Support for SEEK_DATA/HOLE is being added to FUSE in RHEL through bug 1306396.

Unfortunately, volumes that have sharding enabled do not support SEEK_DATA/HOLE yet either... Bug 1301647 for upstream Gluster tracks (the lack of) progress there.

NFS added support for SEEK_DATA/HOLE recently (see comment #1). Not all NFS-servers support this, nor all NFS-clients. Debugging with 'rpcdebug' or capturing a network trace can tell you (or me) if SEEK_DATA/HOLE is used.

'cp' from coreutils does not use SEEK_DATA/HOLE, but the lower-level filesystem interfaces. The coreutils developers were planning to use lseek() instead of filesystem specific ioctl()'s. I did not follow progress there and do not know if patches for this exist already.
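Whether a given mount supports SEEK_DATA can be probed directly. A sketch (assuming Linux and Python 3.3+, which expose `os.SEEK_DATA`; the helper name `supports_seek_data` is ours, not part of any tool discussed here):

```python
import errno
import os

def supports_seek_data(path):
    """Probe whether the filesystem backing `path` supports SEEK_DATA,
    the call that lets qemu-img skip unallocated regions of a raw image."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, 0, os.SEEK_DATA)
        return True
    except OSError as e:
        # ENXIO means "no data at or after this offset": the call itself
        # is supported, the file is simply all hole past that point.
        if e.errno == errno.ENXIO:
            return True
        # EINVAL/EOPNOTSUPP-style errors indicate no support on this mount.
        return False
    finally:
        os.close(fd)
```

Running this against a file on the FUSE mount versus one on plain XFS would show the asymmetry described in the later comments.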

Comment 7 Maor 2017-02-12 18:09:51 UTC
(In reply to Niels de Vos from comment #6)
> (In reply to Allon Mureinik from comment #3)
....
> 
> NFS added support for SEEK_DATA/HOLE recently (see comment #1). Not all
> NFS-servers support this, nor all NFS-clients. Debugging with 'rpcdebug' or
> capturing a network trace can tell you (or me) if SEEK_DATA/HOLE is used.

Raz, can you please share the rpcdebug output?
Also, based on your test of qemu-img, shouldn't this be a qemu bug?

Comment 8 Raz Tamir 2017-02-13 14:16:35 UTC
(In reply to Maor from comment #7)
> (In reply to Niels de Vos from comment #6)
> > (In reply to Allon Mureinik from comment #3)
> ....
> > 
> > NFS added support for SEEK_DATA/HOLE recently (see comment #1). Not all
> > NFS-servers support this, nor all NFS-clients. Debugging with 'rpcdebug' or
> > capturing a network trace can tell you (or me) if SEEK_DATA/HOLE is used.
> 
> Raz, can u please share the debug of rpcdebug.
I don't have the output of rpcdebug because this bug is 6 months old.

> So, based on your test of qemu-img isn't that should be a qemu bug?
This might be a qemu bug, but the product uses qemu-img, so I opened it under ovirt-engine

Comment 9 Yaniv Lavi 2017-02-13 14:20:42 UTC
Moving to gluster to review impact on HCI.

Comment 10 Sahina Bose 2017-02-23 07:19:20 UTC
Krutika, would Bug 1301647 help address this?

Comment 11 Krutika Dhananjay 2017-03-24 07:25:55 UTC
(In reply to Sahina Bose from comment #10)
> Krutika, would Bug 1301647 help address this?

I'm not sure yet.

Here's some data I collected from my tests:

1. As Raz rightly reported, on a FUSE mount the time taken for the qemu-img command to complete for raw files was much longer than for qcow2 files. In my run, raw files took ~40 seconds to complete while the qcow2 case took less than a second.

2. I disabled shard and ran the tests again. Not much difference, except that the command finished slightly faster for raw than in the sharded case.

3. So with shard still disabled, I ran the same test with libgfapi and it was *much* slower than (2) for raw files.

This is the command I used for gfapi, if it helps:

[root@rhs-srv-07 ~]# truncate -s 10G /mnt/vm9.img  (this I did from the FUSE mount, but that's OK!)
                                                                                                                                                                                           
[root@rhs-srv-07 ~]# strace -ff -T -o /tmp/raw-no-shard-and-gfapi-2.log /usr/bin/qemu-img convert -p -t none -T none -f raw gluster://10.8.152.17/rep/vm9.img -O raw gluster://10.8.152.17/rep/vm9-output.img


Niels,

Is SEEK_HOLE/SEEK_DATA supposed to work with libgfapi?

I'm using glusterfs-3.8.10 FWIW.

-Krutika

PS: I do have strace output collected from all 3 runs for analysis. I'll do that some time next week, based on Niels' confirmation on the needinfo.

Comment 12 Krutika Dhananjay 2017-05-18 07:34:01 UTC
Niels,

Did you get a chance to see comment #11?

-Krutika

Comment 13 Niels de Vos 2017-05-18 08:59:11 UTC
SEEK_HOLE/SEEK_DATA is expected to work with libgfapi from glusterfs-3.8.10. Because libgfapi is completely userspace, doing an strace will not show much details. You will need to use ltrace instead.

  $ ltrace -f -x glfs* -o qemu-img.ltrace.log qemu-img ...

Some xlators (shard and disperse?) do not support SEEK_HOLE/SEEK_DATA, and they will/should return EINVAL or similar. qemu-img may fall-back to not using glfs_seek() in that case.

Also note that the version of QEMU is important. The enhancement of glfs_seek() with SEEK_DATA/SEEK_HOLE was included in upstream QEMU 2.7.0. I do not know if this was backported to qemu-kvm-rhev (or what version was used while testing).

Comment 14 Krutika Dhananjay 2017-06-02 05:56:21 UTC
(In reply to Niels de Vos from comment #13)
> SEEK_HOLE/SEEK_DATA is expected to work with libgfapi from glusterfs-3.8.10.
> Because libgfapi is completely userspace, doing an strace will not show much
> details. You will need to use ltrace instead.
> 
>   $ ltrace -f -x glfs* -o qemu-img.ltrace.log qemu-img ...
> 
> Some xlators (shard and disperse?) do not support SEEK_HOLE/SEEK_DATA, and
> they will/should return EINVAL or similar. qemu-img may fall-back to not
> using glfs_seek() in that case.

Thanks. I tried ltrace and it wasn't particularly useful because it prints addresses in place of actual parameters.

This time I ran this test on a plain replicate volume (no shard!) over libgfapi and captured trace output.

There was no improvement in time-to-completion. It took over a minute.
In the trace output, I see no occurrence of the seek fop. Instead, there are countless reads issued on the src file (presumably it is trying to look for data by reading instead of doing SEEK_DATA). When I run the same thing on XFS, it completes in under a second and there is a call to seek with the SEEK_DATA parameter.
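The countless reads match the generic fallback: without SEEK_DATA, the only way to keep the destination sparse is to read the whole apparent size in fixed-size chunks and test each chunk for zeroes. A rough illustrative sketch (not qemu-img's actual C implementation; `copy_skipping_zeros` is a made-up name):

```python
import os

CHUNK = 2 * 1024 * 1024  # 2 MiB, the read size seen in the strace output

def copy_skipping_zeros(src, dst):
    """Copy src to dst, seeking over all-zero chunks so dst stays sparse.
    Every byte of src's apparent size is still read - hence the slowness."""
    zero = bytes(CHUNK)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(CHUNK)
            if not chunk:
                break
            if chunk == zero[:len(chunk)]:
                fout.seek(len(chunk), os.SEEK_CUR)  # leave a hole in dst
            else:
                fout.write(chunk)
        fout.truncate()  # fix the final size if the file ends in a hole
```

For a 10 GiB raw image this reads 10 GiB regardless of allocation, which is consistent with the multi-minute timings above.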

Needless to say, I had to write https://review.gluster.org/17437 to make trace watch for the seek FOP. And of course the version of gluster is latest master, which means it has all of your SEEK fop enhancements.
This means that at this point the issue is not even about shard not supporting the SEEK fop.

I'm attaching both trace output and the strace output files for your reference.

Care to take a look?

-Krutika

> 
> Also note that the version of QEMU is important. The enhancement of
> glfs_seek() with SEEK_DATA/SEEK_HOLE was included in upstream QEMU 2.7.0. I
> do not know if this was backported to qemu-kvm-rhev (or what version was
> used while testing).

Comment 15 Krutika Dhananjay 2017-06-02 05:57:39 UTC
Created attachment 1284314 [details]
raw-sparse-libgfapi-trace-output

Comment 16 Krutika Dhananjay 2017-06-02 05:59:30 UTC
Created attachment 1284315 [details]
raw-sparse-xfs-strace-output

Comment 17 Krutika Dhananjay 2017-07-17 09:44:09 UTC
Niels,

Did you get a chance to see comment #14?

Comment 18 Krutika Dhananjay 2017-12-05 12:12:09 UTC
Hi Raz,

Could you please confirm if the steps to recreate the issue in https://bugzilla.redhat.com/show_bug.cgi?id=1357919#c11 are correct?
A lot of the analysis done in debugging this issue is based on comment 11, so I want to be sure I'm on the right track.

-Krutika

Comment 19 Niels de Vos 2017-12-05 14:02:53 UTC
On Tue, Dec 05, 2017 at 05:51:40PM +0530, Krutika Dhananjay wrote:
> So I checked the strace output from fuse point of view and this is what i
> saw:
>
> 1. In the raw sparse case, there was an attempt by the application to do
> seek_data which was returned with EOPNOTSUPPORTED error. So after this the
> app reads 2MB at a time from the src file.
> The larger the src file, the more time it takes for qemu-img convert to
> return as a result. The file itself was 10GB in size in my case.

That makes sense, and is expected in some cases. It is possible that the
FUSE kernel module does not support SEEK_HOLE/SEEK_DATA, and does not
pass it through to the Gluster client. Bug 1306396 was used for the
enhancement of FUSE in kernel-3.10.0-579.el7 and newer.

Which version of the kernel was used for testing?

> 2. However, when i create qcow2 file using `qemu-img create -f qcow2 ...`
> the file itself is only 193KB in size and with data.
> So when I attempt `qemu-img convert..` for copying, it only needs to read
> 193K of data and write to destination.

Well, yes, qemu-img knows the qcow2 format so it will only copy/convert
the data blocks that are in use.

> @Niels,
> Request you to confirm if the command completes in under 1s in libgfapi
> case (which it did not when id looked into it last on 2017-07-17 and i'd
> even left a needinfo on you back then too) and whether seek_data was being
> called and returned with success with gfapi. I've done my bit from fuse pov.

Unfortunately the ltrace output did not show much (maybe the debuginfo
was missing?). I'll try to reproduce it on one of the machines that
Kasturi provided.

Comment 20 Raz Tamir 2017-12-05 16:41:17 UTC
(In reply to Krutika Dhananjay from comment #18)
> Hi Raz,
> 
> Could you please confirm if the steps to recreate the issue in
> https://bugzilla.redhat.com/show_bug.cgi?id=1357919#c11 are correct?
> Lot of the analysis done in debugging this issue is based on comment 11. So
> I want to be sure I'm on the right track.
> 
> -Krutika

Hi Krutika,

I didn't test it on gluster, so except for step 1, which describes the result of this bug, I can't really confirm whether it works with gluster as you described

Comment 21 Krutika Dhananjay 2017-12-06 07:47:38 UTC
(In reply to Raz Tamir from comment #20)
> (In reply to Krutika Dhananjay from comment #18)
> > Hi Raz,
> > 
> > Could you please confirm if the steps to recreate the issue in
> > https://bugzilla.redhat.com/show_bug.cgi?id=1357919#c11 are correct?
> > Lot of the analysis done in debugging this issue is based on comment 11. So
> > I want to be sure I'm on the right track.
> > 
> > -Krutika
> 
> Hi Krutika,
> 
> I didn't test it on gluster so except step 1 which is describing the result
> of this bug I can't really confirm if with gluster it works as you described

Ok, in that case I'm assuming that the way to create a raw sparse src disk (which I will copy later) is by selecting the "Preallocated" option from the drop-down menu of the "New Virtual Disk" tab. And similarly for qcow2 it would be by selecting "Thin Provision".

If this is not correct, then please describe for me how you created the images "55d380fe-cfc2-42b5-884b-f5b8d6da0be7/dc01ddd5-8501-4c63-ad51-0e19e3854046" and "976ee819-50b5-4156-bd3c-84aca4e50483/14a85f54-73d2-48cd-bb79-02a2f027afd7" in comment #1.

I am not familiar with what "Create 2 disks on file based domain" means in the "steps to reproduce" section.

-Krutika

Comment 22 Raz Tamir 2017-12-06 12:07:17 UTC
(In reply to Krutika Dhananjay from comment #21)
> (In reply to Raz Tamir from comment #20)
> > (In reply to Krutika Dhananjay from comment #18)
> > > Hi Raz,
> > > 
> > > Could you please confirm if the steps to recreate the issue in
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1357919#c11 are correct?
> > > Lot of the analysis done in debugging this issue is based on comment 11. So
> > > I want to be sure I'm on the right track.
> > > 
> > > -Krutika
> > 
> > Hi Krutika,
> > 
> > I didn't test it on gluster so except step 1 which is describing the result
> > of this bug I can't really confirm if with gluster it works as you described
> 
> Ok in that case I'm assuming that the way to create a raw sparse src disk
> (which I will copy later) is by selecting the "Preallocated" option from the
> drop down menu of the "New Virtual Disk" tab. And similarly for qcow2 it
> would be by selecting "Thin Provision".

No,

In this case you would create a preallocated disk, which is not what we want.
We need to create 2 thin disks - 1 qcow and 1 raw.
You can do it by executing a REST API call to /api/disks with the body:

<disk>
    <alias>my_disk_name</alias>
    <format>raw</format>  <<<====== create 1 with 'raw' and 1 with 'cow'
    <provisioned_size>1073741824</provisioned_size>
    <sparse>true</sparse>
    <storage_domains>
        <storage_domain id="b692a210-8e2c-4862-b408-b2dbfa3a6d6f">
            <name>nfs_storage_domain_name</name>
            <type>data</type>
            <data_centers>
                <data_center id="480d7e0f-4587-41d7-a045-a0fd8a41b6af"/>
            </data_centers>
        </storage_domain>
    </storage_domains>
</disk>

Few things to ensure:
1) the storage domain ID must be the ID of a FILE storage domain
2) <sparse> must be 'true'
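For scripted reproduction, the request body above can be built with the standard library. A sketch (the `disk_xml` helper is ours; the engine URL, authentication, and the actual POST to /api/disks are deployment-specific and omitted):

```python
import xml.etree.ElementTree as ET

def disk_xml(alias, fmt, size_bytes, sd_id, sd_name, dc_id):
    """Build the <disk> body from the example above; fmt is 'raw' or 'cow'."""
    disk = ET.Element("disk")
    ET.SubElement(disk, "alias").text = alias
    ET.SubElement(disk, "format").text = fmt
    ET.SubElement(disk, "provisioned_size").text = str(size_bytes)
    ET.SubElement(disk, "sparse").text = "true"   # must be true for a thin disk
    sds = ET.SubElement(disk, "storage_domains")
    # The storage domain ID must belong to a FILE storage domain.
    sd = ET.SubElement(sds, "storage_domain", id=sd_id)
    ET.SubElement(sd, "name").text = sd_name
    ET.SubElement(sd, "type").text = "data"
    dcs = ET.SubElement(sd, "data_centers")
    ET.SubElement(dcs, "data_center", id=dc_id)
    return ET.tostring(disk, encoding="unicode")
```

Calling it once with 'raw' and once with 'cow' yields the two bodies needed for the reproduction steps.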

> 
> If this is not correct, then please describe for me how you created the
> images
> "55d380fe-cfc2-42b5-884b-f5b8d6da0be7/dc01ddd5-8501-4c63-ad51-0e19e3854046"
> and
> "976ee819-50b5-4156-bd3c-84aca4e50483/14a85f54-73d2-48cd-bb79-02a2f027afd7"
> in comment #1.
> 
> I am not familiar with what "Create 2 disks on file based domain" means in
> the "steps to reproduce" section.

This is the 1st point under "Few things to ensure"

Let me know if you need further assistance 
> 
> -Krutika

Comment 23 Sandro Bonazzola 2019-01-28 09:34:07 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 24 Sahina Bose 2019-01-30 12:20:58 UTC
Is there anything left here - do we need to raise any dependent gluster bugs?

