Bug 1511884 - Incorrect Health status of hosts on cluster with many nodes [NEEDINFO]
Summary: Incorrect Health status of hosts on cluster with many nodes
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: web-admin-tendrl-monitoring-integration
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Shubhendu Tripathi
QA Contact: sds-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-10 11:03 UTC by Filip Balák
Modified: 2018-11-20 12:06 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
shtripat: needinfo? (fbalak)
shtripat: needinfo? (nthomas)


Attachments
Hosts dashboard (deleted)
2017-11-10 11:03 UTC, Filip Balák
Hosts dashboard (deleted)
2017-11-10 11:05 UTC, Filip Balák

Description Filip Balák 2017-11-10 11:03:53 UTC
Description of problem:
I have a cluster with 42 nodes and 7 volumes. The Health panel in the Hosts dashboard reports `Down` for some nodes, but I can access them via ssh and all services such as tendrl-node-agent and collectd are running on them. Other metrics for these nodes are reported normally. (As seen in the attached screenshots.)
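
For reference, a minimal sketch of the manual check described above, assuming key-based ssh access; the hostnames are placeholders, not nodes from this cluster (Python):

import subprocess

NODES = ["node01.example.com", "node02.example.com"]  # hypothetical hosts
SERVICES = ["tendrl-node-agent", "collectd"]

for node in NODES:
    for svc in SERVICES:
        # `systemctl is-active` prints the unit state and exits 0 only
        # when the unit is active.
        result = subprocess.run(
            ["ssh", node, "systemctl", "is-active", svc],
            capture_output=True, text=True)
        state = result.stdout.strip() or "unknown"
        print(f"{node}: {svc} -> {state}")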

Version-Release number of selected component (if applicable):
glusterfs-3.8.4-52.el7rhgs.x86_64
tendrl-collectd-selinux-1.5.3-2.el7rhgs.noarch
tendrl-commons-1.5.4-1.el7rhgs.noarch
tendrl-node-agent-1.5.4-1.el7rhgs.noarch
tendrl-selinux-1.5.3-2.el7rhgs.noarch
tendrl-gluster-integration-1.5.4-1.el7rhgs.noarch
tendrl-api-1.5.4-1.el7rhgs.noarch
tendrl-api-httpd-1.5.4-1.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-2.el7rhgs.noarch
tendrl-grafana-selinux-1.5.3-2.el7rhgs.noarch
tendrl-notifier-1.5.4-1.el7rhgs.noarch
tendrl-ansible-1.5.4-1.el7rhgs.noarch
tendrl-ui-1.5.4-1.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-2.el7rhgs.noarch

How reproducible:
70%. A large number of nodes is probably required to reproduce.

Steps to Reproduce:
1. Install tendrl on a big cluster with many nodes and multiple volumes.
2. Open the Hosts grafana dashboard.
3. Check the health panel for all nodes (a scripted variant of this check is sketched below).
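
Step 3 can also be done programmatically against Graphite's render API, which backs the Grafana panels in the Tendrl monitoring stack. A sketch follows; the host, port, and metric path are assumptions for illustration and depend on the actual cluster name:

import json
import urllib.request

GRAPHITE = "http://tendrl-server.example.com:10080"  # hypothetical host/port
TARGET = "tendrl.clusters.my-cluster.nodes.*.status"  # hypothetical metric path

url = f"{GRAPHITE}/render?target={TARGET}&format=json&from=-5min"
with urllib.request.urlopen(url) as resp:
    series = json.load(resp)

for s in series:
    # Each series carries [value, timestamp] pairs; take the latest
    # non-null value as the node's current state.
    latest = next((v for v, _ in reversed(s["datapoints"]) if v is not None), None)
    print(s["target"], "->", latest)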

Actual results:
Some of the nodes are reported as Down but are up and running.

Expected results:
All nodes should be reported correctly.

Additional info:

Comment 1 Filip Balák 2017-11-10 11:05:02 UTC
Created attachment 1350421 [details]
Hosts dashboard

Comment 5 Shubhendu Tripathi 2017-11-13 09:57:46 UTC
@Filip, while debugging I figured out that tendrl-gluster-integration was not running on any of the storage nodes. It looks like the storage nodes got restarted, and because tendrl-gluster-integration was not previously enabled to start on reboot, the services stayed down.

Also, for the nodes which are shown as DOWN in the dashboard, there are errors in the tendrl-node-agent service logs around service refresh etc. I have seen a few `os.fork` failures and suspect they could be due to insufficient resources on the nodes (illustrated below).
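
For context, `os.fork` raises OSError (typically EAGAIN or ENOMEM) when the kernel cannot allocate another process, which matches the low-resource suspicion. A small sketch of defensive handling; this is not Tendrl's actual code:

import errno
import os
import time

def fork_with_retry(retries=3, delay=1.0):
    # Retry transient resource failures; re-raise anything else.
    for attempt in range(retries):
        try:
            return os.fork()
        except OSError as exc:
            if exc.errno in (errno.EAGAIN, errno.ENOMEM) and attempt < retries - 1:
                time.sleep(delay)  # back off and hope resources free up
                continue
            raise

pid = fork_with_retry()
if pid == 0:
    os._exit(0)  # child does nothing in this sketch
os.waitpid(pid, 0)
print("fork succeeded")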

I would suggest refreshing the setup with the latest 1.5.4 build and keeping the node configuration at least as below; a pre-flight check for these minimums is sketched after the list.

Storage Nodes: 4GB RAM with dual core
Tendrl Server: 16GB RAM (as huge cluster) with quad core
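
A pre-flight sketch that checks a node against these suggested minimums; it reads /proc/meminfo, so it is Linux-only, and the role selection is a placeholder:

import os

MINIMUMS = {"storage": (4, 2), "server": (16, 4)}  # (GiB RAM, CPU cores)

def mem_gib():
    # MemTotal in /proc/meminfo is reported in kB.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) / (1024 * 1024)
    raise RuntimeError("MemTotal not found")

role = "storage"  # or "server" for the Tendrl server
need_ram, need_cores = MINIMUMS[role]
ram, cores = mem_gib(), os.cpu_count()
print(f"{role}: {ram:.1f} GiB RAM, {cores} cores "
      f"(suggested minimum: {need_ram} GiB, {need_cores} cores)")
if ram < need_ram or cores < need_cores:
    print("WARNING: below the suggested minimum")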

Once the cluster is set up again and imported into tendrl, let's look for any remaining failures.

Comment 6 Nishanth Thomas 2017-11-15 18:37:03 UTC
Based on the PM's comments, this use case is not addressed in the current release; proposing this for the next release.

