Bug 1694604 - gluster fuse mount crashed, when deleting 2T image file from RHV Manager UI
Summary: gluster fuse mount crashed, when deleting 2T image file from RHV Manager UI
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: rhhi
Version: rhhiv-1.6
Hardware: x86_64
OS: Linux
unspecified
urgent
Target Milestone: ---
: ---
Assignee: Sahina Bose
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On: 1694595 1696136
Blocks: RHHI-V-1-6-Release-Notes
Reported: 2019-04-01 08:48 UTC by SATHEESARAN
Modified: 2019-04-11 07:54 UTC

Fixed In Version:
Doc Type: Known Issue
Doc Text:
When sharding is enabled on a volume, a single file allocation operation creates all shards in a single batch. If the number of shards involved in an operation is greater than the number of entries allowed in the lru-cache, the inode associated with the file operation is freed while it is still in use. This leads to a crash in the mount process when deleting large files from volumes with sharding enabled, which causes all virtual machines that had mounted that storage to pause. To work around this issue, forcefully shut down the virtual machines and start them again.
Clone Of: 1694595
Environment:
Last Closed:
Target Upstream Version:



Description SATHEESARAN 2019-04-01 08:48:24 UTC
Description of problem:
------------------------
When deleting the 2TB image file, the gluster fuse mount process crashed.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHGS 3.4.4 ( glusterfs-3.12.2-47.el7rhgs )

How reproducible:
-----------------
1/1

Steps to Reproduce:
-------------------
1. Create an image file of 2T from RHV Manager UI
2. Delete the same image file after it's created successfully

Actual results:
---------------
Fuse mount crashed

Expected results:
-----------------
Everything should work fine, with no fuse mount crashes

--- Additional comment from SATHEESARAN on 2019-04-01 08:33:14 UTC ---

frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash: 
2019-04-01 07:57:53
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.12.2
/lib64/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x9d)[0x7fc72c186b9d]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7fc72c191114]
/lib64/libc.so.6(+0x36280)[0x7fc72a7c2280]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x9627)[0x7fc71f8ba627]
/usr/lib64/glusterfs/3.12.2/xlator/features/shard.so(+0x9ef1)[0x7fc71f8baef1]
/usr/lib64/glusterfs/3.12.2/xlator/cluster/distribute.so(+0x3ae9c)[0x7fc71fb15e9c]
/usr/lib64/glusterfs/3.12.2/xlator/cluster/replicate.so(+0x9e8c)[0x7fc71fd88e8c]
/usr/lib64/glusterfs/3.12.2/xlator/cluster/replicate.so(+0xb79b)[0x7fc71fd8a79b]
/usr/lib64/glusterfs/3.12.2/xlator/cluster/replicate.so(+0xc226)[0x7fc71fd8b226]
/usr/lib64/glusterfs/3.12.2/xlator/protocol/client.so(+0x17cbc)[0x7fc72413fcbc]
/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7fc72bf2ca00]
/lib64/libgfrpc.so.0(rpc_clnt_notify+0x26b)[0x7fc72bf2cd6b]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fc72bf28ae3]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x7586)[0x7fc727043586]
/usr/lib64/glusterfs/3.12.2/rpc-transport/socket.so(+0x9bca)[0x7fc727045bca]
/lib64/libglusterfs.so.0(+0x8a870)[0x7fc72c1e5870]
/lib64/libpthread.so.0(+0x7dd5)[0x7fc72afc2dd5]
/lib64/libc.so.6(clone+0x6d)[0x7fc72a889ead]

--- Additional comment from SATHEESARAN on 2019-04-01 08:37:56 UTC ---

1. RHHI-V Information
----------------------
RHV 4.3.3
RHGS 3.4.4

2. Cluster Information
-----------------------
[root@rhsqa-grafton11 ~]# gluster pe s
Number of Peers: 2

Hostname: rhsqa-grafton10.lab.eng.blr.redhat.com
Uuid: 46807597-245c-4596-9be3-f7f127aa4aa2
State: Peer in Cluster (Connected)
Other names:
10.70.45.32

Hostname: rhsqa-grafton12.lab.eng.blr.redhat.com
Uuid: 8a3bc1a5-07c1-4e1c-aa37-75ab15f29877
State: Peer in Cluster (Connected)
Other names:
10.70.45.34

3. Volume information
-----------------------
Affected volume: data
[root@rhsqa-grafton11 ~]# gluster volume info data
 
Volume Name: data
Type: Replicate
Volume ID: 9d5a9d10-f192-49ed-a6f0-c912224869e8
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: rhsqa-grafton10.lab.eng.blr.redhat.com:/gluster_bricks/data/data
Brick2: rhsqa-grafton11.lab.eng.blr.redhat.com:/gluster_bricks/data/data
Brick3: rhsqa-grafton12.lab.eng.blr.redhat.com:/gluster_bricks/data/data (arbiter)
Options Reconfigured:
cluster.granular-entry-heal: enable
performance.strict-o-direct: on
network.ping-timeout: 30
storage.owner-gid: 36
storage.owner-uid: 36
server.event-threads: 4
client.event-threads: 4
cluster.choose-local: off
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: off
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on
[root@rhsqa-grafton11 ~]# gluster volume status data
Status of volume: data
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick rhsqa-grafton10.lab.eng.blr.redhat.co
m:/gluster_bricks/data/data                 49154     0          Y       23403
Brick rhsqa-grafton11.lab.eng.blr.redhat.co
m:/gluster_bricks/data/data                 49154     0          Y       23285
Brick rhsqa-grafton12.lab.eng.blr.redhat.co
m:/gluster_bricks/data/data                 49154     0          Y       23296
Self-heal Daemon on localhost               N/A       N/A        Y       16195
Self-heal Daemon on rhsqa-grafton12.lab.eng
.blr.redhat.com                             N/A       N/A        Y       52917
Self-heal Daemon on rhsqa-grafton10.lab.eng
.blr.redhat.com                             N/A       N/A        Y       43829
 
Task Status of Volume data
------------------------------------------------------------------------------
There are no active volume tasks

Comment 2 Krutika Dhananjay 2019-04-03 15:09:30 UTC
Found the issue.

The devil is in the minor details, but I'll write it all down here anyway, more as a note to self, because it's very hard to keep the whole sequence and its fine-grained details in memory, and only a very, very specific case leads to this issue.

Reproducer on a smaller scale -
=============================
1. create a 1x3 volume.
2. set shard on it.
3. Set shard-block-size to 4MB
4. Set shard-lru-limit to 150
5. Turn off write-behind.
6. Start and fuse mount it.
7. qemu-img create -f raw -o preallocation=falloc /mnt/vm1.img 1G
8. Unlink it.

Why are these options configured this way? This bug is hit when the lru list is bigger than the deletion rate and more than lru-list-size shards are created, specifically by operations like fallocate that fallocate all shards in parallel, as opposed to something like writev, where there's a guarantee that I won't ever need to resolve > lru-limit shards for the same fop.
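
To make the fallocate-versus-writev contrast concrete, here is a rough sketch that counts how many shard blocks a single byte range covers. The helper name `shards_touched` and the 1 MiB write size are illustrative assumptions, not anything from the shard translator; only the 4MB block size, the 1G file size and the lru limit of 150 come from the reproducer above.

```python
def shards_touched(offset, length, block_size):
    """Number of shard blocks covered by the byte range [offset, offset + length)."""
    if length <= 0:
        return 0
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return last - first + 1


MiB = 1024 * 1024
GiB = 1024 * MiB
BLOCK_SIZE = 4 * MiB   # shard-block-size from the reproducer
LRU_LIMIT = 150        # shard-lru-limit from the reproducer

# Hypothetical 1 MiB write that straddles the boundary between blocks 0 and 1.
small_write = shards_touched(3 * MiB + 512 * 1024, 1 * MiB, BLOCK_SIZE)

# A preallocating fallocate over the whole 1G image covers every block at once.
full_fallocate = shards_touched(0, 1 * GiB, BLOCK_SIZE)

print(small_write, full_fallocate, full_fallocate > LRU_LIMIT)
```

The count includes block 0, which is the base file itself, so the number of separate shard inodes involved is one less than the value returned.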

Why is this important?
A single fallocate fop in shard resolves and creates all shards (in this case 256 > shard-lru-limit) in a single batch. Since the lru list can only hold 150 shards, after blocks 1..150 are resolved, shards 151..255 (skipping the base shard) will evict the first (255-150) shards even before they're fallocated. This means these (255-150) inodes get evicted from the lru list and inode_unlink()d. (In other words, it's not enough just to fill the lru list; some of the just-resolved participant shards, which are yet to be operated on, must also be forced out.)
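
The eviction arithmetic above can be checked with a toy LRU list. This is only a model, assuming a plain insertion-ordered list with oldest-first eviction; the function name `resolve_batch` is made up for illustration and does not correspond to any shard translator code.

```python
from collections import OrderedDict


def resolve_batch(shard_numbers, lru_limit):
    """Resolve shards in order; evict the oldest entry whenever the list is full."""
    lru = OrderedDict()
    evicted = []
    for n in shard_numbers:
        if len(lru) >= lru_limit:
            victim, _ = lru.popitem(last=False)  # drop the oldest entry
            evicted.append(victim)
        lru[n] = True
    return list(lru), evicted


# Shards 1..255 (block 0 is the base file and is skipped), lru limit 150.
resident, evicted = resolve_batch(range(1, 256), 150)

print(len(evicted), evicted[0], evicted[-1])
print(len(resident), resident[0], resident[-1])
```

In this model the evicted entries are exactly the earliest-resolved shards, which is what leaves them without a linked inode before the fallocate is sent on them.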

Now shard sends fallocate on all 256 shards. (255-150) of these will be added to the fsync list only, and the remaining 150 are part of both the fsync and lru lists.

Then comes an unlink on the file.
Shard deletion is initiated in the background, but it is done in batches of shard-deletion-rate shards at a time. The now-evicted shards also need to be resolved first before shard can send UNLINKs on them. inode_resolve() fails on them because they were inode_unlink()d as part of eviction. So a lookup is sent, and in the callback they're linked again, but inode_link() still finds the old inode and returns that (whose inode ctx is still valid). Unfortunately, shard_link_block_inode() ends up relying on local->loc.inode as the source of the base inode, which is NULL in background deletion. So when one of these evicted shards is added back to the list, the ref on the base shard is not taken, since the base shard itself is NULL. That's a missing ref, which is still OK as long as we know we shouldn't unref it at the end. But that's not what happens. Once these shards are deleted, in the cbk the base_shard gets unref'd (an unwanted extra unref) for each such previously evicted shard, because the inode ctx of the old inode returned by inode_link() is still valid and contains a reference to the base inode pointer.

SIMPLE SUMMARY OF THE ISSUE:
============================
Here are the expectations -
1. When a shard inode is added to the lru list, we ref the base shard. When it is added to the fsync list, we take another ref. This is to keep the base shard alive precisely for deletion operations.
2. When a shard is evicted from the lru list, the base shard is unref'd once. When it is evicted from the fsync list, the base shard needs to be unref'd again.

Simple stuff so far, basically undo what you did in step 1 once you're in step 2.

In this bug, step 2 was executed correctly everywhere. But in one particular scenario, when the shard is added to the LRU list, the base inode is not ref'd (because the pointer passed to the function is NULL).
But when that shard's life cycle ends, the shard translator *somehow* gets access to the base shard pointer and unrefs it, thinking it needs to undo whatever was done at the beginning.
So in this way, the number of refs keeps coming down even while io (unlinks) is happening on the shards. There end up being more unrefs than refs, which at some point leads to inode destruction and illegal memory access.
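
The ref/unref imbalance summarised above can be modelled with a bare counter. This is a sketch under stated assumptions, not the translator's code: the `Inode` class, the function name `delete_evicted_shards` and the starting count of 3 legitimate refs are all hypothetical, and the 105 shards are the (255-150) evicted ones from the reproducer.

```python
class Inode:
    """Stand-in for the base inode; only the reference count is modelled."""

    def __init__(self, refs):
        self.refs = refs


def delete_evicted_shards(num_shards, initial_refs, pass_base):
    """Return the shard index at which the base refcount hits zero, else None."""
    base = Inode(initial_refs)
    for i in range(1, num_shards + 1):
        # The old inode's ctx survived eviction and still points at the base.
        stale_ctx_base = base
        # Re-adding the shard to the lru list: a ref is taken only when the
        # caller passes a non-NULL base pointer.
        base_arg = base if pass_base else None
        if base_arg is not None:
            base_arg.refs += 1
        # Unlink callback: the base found through the ctx is always unref'd.
        stale_ctx_base.refs -= 1
        if base.refs == 0:
            return i  # base would be destroyed while still in use
    return None


balanced = delete_evicted_shards(105, initial_refs=3, pass_base=True)
unbalanced = delete_evicted_shards(105, initial_refs=3, pass_base=False)
print(balanced, unbalanced)
```

With the base pointer passed in, every unref is matched by a ref and the count never drops. With a NULL pointer, each previously evicted shard costs one ref, so the count reaches zero after as many shards as there were legitimate refs to begin with.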


There is a way to work around the issue by setting shard-lru-limit to a very high value. But exposing it to users can have unintended consequences, like high memory consumption. Or, if the value is too low, it will lead to frequent evictions and hence unnecessary lookups over the network. Besides, I introduced the option itself just to make testing easier. The option is NO_DOC and not exposed to users. It is purely meant for testing purposes.

As for the fix, I still need to think about it. There are multiple ways to "patch" the issue. But I need to be sure it won't break anything that was already working, and there are a lot of cases to consider where we need to confirm that the fix won't break anything.

