Bug 1595020 - Top File operations panel (Volume dashboard/Profile information) latency representation is confusing
Keywords:
Status: NEW
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: web-admin-tendrl-monitoring-integration
Version: rhgs-3.4
Hardware: All
OS: All
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Ankush Behl
QA Contact: sds-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-06-26 01:46 UTC by Anand Paladugu
Modified: 2019-04-11 08:19 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:


Attachments
File operations latency panel (deleted)
2018-06-26 02:06 UTC, Anand Paladugu


Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1515856 None ASSIGNED missing units and timestamps in various charts 2019-03-14 09:53:39 UTC

Internal Links: 1515856

Description Anand Paladugu 2018-06-26 01:46:20 UTC
Description of problem:  Top File operations panel latency representation is confusing


Version-Release number of selected component (if applicable): 3.4 (Sandbox Environment)


How reproducible: Always


Steps to Reproduce:
1. View the Top File operations panel in Volume dashboard/Profile information.

Actual results: For writes, the panel shows the unit as K. Is that K milliseconds? The panel information says that % latency is the fraction of the FOP response time. For other operations, no units are shown at all.

Expected results: Update the panel information to state clearly what % latency represents, and show appropriate units consistently for all data items.


Additional info:

Comment 2 Anand Paladugu 2018-06-26 02:06:14 UTC
Created attachment 1454526 [details]
File operations latency panel

Comment 3 Martin Bukatovic 2018-06-27 17:54:04 UTC
Assuming this was reported against the sandbox-usm1 instance, the full package
version list follows:

```
[root@sandbox-usm1-server ~]# rpm -qa | grep tendrl | sort
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch
```

Comment 4 Ankush Behl 2018-07-13 05:59:24 UTC
Right now the panel shows the number of hits, which is not percentage latency (% latency).

So we are going to make these changes to the panel:

1. Change the row heading label "% latency" to "Avg. Latency".
2. Show the average latency value, which will be in microseconds.
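
To illustrate what an explicit unit on the "Avg. Latency" row could look like, here is a minimal sketch. It assumes the raw value arrives in microseconds, as proposed above. The function name `format_latency_us` and the thresholds are hypothetical and are not part of tendrl or Grafana.

```python
def format_latency_us(value_us: float) -> str:
    """Render a latency given in microseconds with an explicit unit.

    Hypothetical helper: scales to ms or s so that a bare suffix such
    as "K" never has to stand in for a unit.
    """
    if value_us < 0:
        raise ValueError("latency cannot be negative")
    if value_us >= 1_000_000:
        return f"{value_us / 1_000_000:.2f} s"
    if value_us >= 1_000:
        return f"{value_us / 1_000:.2f} ms"
    return f"{value_us:.2f} us"


for sample in (12.0, 965.0, 1816.0):
    print(format_latency_us(sample))
```

Whatever scaling is chosen, the point is that every row carries a unit, so writes and other operations are labelled consistently.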

Comment 7 Martin Bukatovic 2018-08-07 17:16:04 UTC
As of tendrl-monitoring-integration-1.6.3-7.el7rhgs.noarch, the units of measure
are still unclear. Rechecked during dashboard overview.

Comment 13 Martin Bukatovic 2018-09-06 19:01:32 UTC
Here is an example of the output of the `gluster volume profile` command, which
this WA panel uses to get its data:

```
# gluster volume profile volume_gama_disperse_4_plus_2x2 info

Interval 2573 Stats:
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 us       0.00 us       0.00 us              1  RELEASEDIR
      0.28      12.00 us      12.00 us      12.00 us              1     OPENDIR
      2.07      90.00 us      90.00 us      90.00 us              1      STATFS
      2.14      46.50 us      27.00 us      66.00 us              2     READDIR
     28.95     629.50 us     296.00 us     963.00 us              2    GETXATTR
     66.57     965.00 us     428.00 us    1816.00 us              3      LOOKUP
```

I believe that to provide a meaningful description and to improve the
representation of the data this BZ talks about, we need to:

 * locate and link the gluster documentation of this feature
 * check with gluster eng. for additional insight

That said, the documentation I was able to find doesn't provide enough detail
for me to understand the meaning of %-latency:

https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.4/html/administration_guide/chap-monitoring_red_hat_storage_workload#profiling-volumes

Atin, could you help us understand the meaning of the %-latency field, so
that the WA team can improve the representation of the panel in question and
its description?
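
One reading that fits the sample output in this comment is that %-latency is each fop's share of the total time spent in the interval, that is, (Avg-latency x No. of calls) for that fop divided by the same product summed over all fops. The sketch below recomputes the column under that assumption. This is inferred from the sample figures only and is not taken from the gluster documentation. The names `parse_profile_rows` and `latency_share` are hypothetical helpers.

```python
import re

# Rows copied from the `gluster volume profile ... info` sample in this comment.
SAMPLE = """\
 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 us       0.00 us       0.00 us              1  RELEASEDIR
      0.28      12.00 us      12.00 us      12.00 us              1     OPENDIR
      2.07      90.00 us      90.00 us      90.00 us              1      STATFS
      2.14      46.50 us      27.00 us      66.00 us              2     READDIR
     28.95     629.50 us     296.00 us     963.00 us              2    GETXATTR
     66.57     965.00 us     428.00 us    1816.00 us              3      LOOKUP
"""

# One data row: %-latency, avg, min, max (all in us), call count, fop name.
ROW = re.compile(
    r"^\s*([\d.]+)\s+([\d.]+) us\s+([\d.]+) us\s+([\d.]+) us\s+(\d+)\s+(\S+)\s*$"
)


def parse_profile_rows(text):
    """Return one dict per data row; header and separator lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            pct, avg, _min, _max, calls, fop = match.groups()
            rows.append({
                "fop": fop,
                "reported_pct": float(pct),
                "avg_us": float(avg),
                "calls": int(calls),
            })
    return rows


def latency_share(rows):
    """Assumed formula: share of total (avg latency * calls), in percent."""
    total = sum(r["avg_us"] * r["calls"] for r in rows)
    if total == 0:
        return {r["fop"]: 0.0 for r in rows}
    return {r["fop"]: round(100 * r["avg_us"] * r["calls"] / total, 2) for r in rows}


rows = parse_profile_rows(SAMPLE)
shares = latency_share(rows)
for row in rows:
    print(row["fop"], row["reported_pct"], shares[row["fop"]])
```

If this reading is right, the column describes how the interval's total fop time is distributed across operations, not how slow any single call was, which is why it needs a different panel description than the average latency.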

Comment 14 Atin Mukherjee 2018-09-07 10:04:58 UTC
Pranith, can you help Martin with the question he asked?

Comment 15 Anand Paladugu 2018-09-09 17:18:13 UTC
If % latency shows the latency of a particular file operation as a percentage of the latency of all operations, then this is not useful to the customer. What would the customer do if the graph shows that STATFS is responsible for 50% of the latency?

Even if we decide to show the average latency for each operation, we still need to tell the customer how to use this information. If the average latency for RELEASEDIR is higher than normal, what would the customer do? Would they then look at other resources, like CPU and memory, to determine where the problem might be?

Comment 16 Pranith Kumar K 2018-09-10 06:33:45 UTC
(In reply to Anand Paladugu from comment #15)

"gluster volume profile" was created to do something similar to what "strace -Tc" does for an application/process. The way to use this tool is to run a particular workload while profiling is enabled and collect statistics on which fops the application triggers on the bricks.

> If % latency is showing the portion of latency of a particular file
> operation as a percentage of all operations, then this is not useful to
> customer.  What would customer do if the graph shows STATFS is responsible
> for 50% of latency ?

If 50% of the latency is contributed by STATFS and the application's performance does not meet the customer's needs, there are 2 things that we need to answer:
1) Are these STATFS calls coming from DHT (or any other internal xlator), which keeps doing STATFS to figure out which subvolume it should create new files on, or is it the application that is making all these STATFS calls?
    a) If it is DHT, then there seems to be a bug in DHT that needs to be fixed.
    b) If it is the application, we need to have a conversation with the customer to find out why the application makes so many statfs calls. Based on that, we should probably provide a feature/fix that caches STATFS, to reduce the %-latency of STATFS for the workload enough to meet the latency/throughput requirements of the customer/user.

> 
> Even if we decide to show average latency for each operation, we still need
> inform customer of how to use this info.  If average latency for RELEASEDIR
> is higher than normal what would customer do ?  Does he then look at other
> resources like CPU and MEM and determine where the problem might be ?

It needs to be used exactly how the customer would use "strace -Tc": to explain how the fs is behaving for the application workload. This output alone doesn't give all the information one may need to identify the bottlenecks in some cases, but these metrics let us figure out which subsystem to look at for problems.

For example: In 3.4.0 we were working on a performance regression issue where we found a problem based on profile info (https://bugzilla.redhat.com/show_bug.cgi?id=1558989#c57).
We compared 3.3.1 profile info vs 3.4.0 profile info, and the avg latency/%-latency were similar, so we had to suspect the layer above it, which is the network, and we found a regression introduced in the rpc layer.

Pranith

