Bug 1509795 - SQL Connection failed / Dashboard failure / Galera Cluster not synced
Summary: SQL Connection failed / Dashboard failure / Galera Cluster not synced
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: mariadb-galera
Version: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Priority: low
Severity: low
Target Milestone: ---
Target Release: 12.0 (Pike)
Assignee: Damien Ciabrini
QA Contact: Udi Shkalim
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-06 05:12 UTC by Anup P
Modified: 2017-12-11 23:54 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-28 14:16:35 UTC


Attachments
Output of || sudo crm_resource --force-promote -r galera -V (deleted)
2017-11-06 05:12 UTC, Anup P

Description Anup P 2017-11-06 05:12:51 UTC
Created attachment 1348412 [details]
Output of || sudo crm_resource --force-promote -r galera -V

Description of problem:

The dashboard is unreachable and the CLI is also not working. The Galera cluster did not resync after one (out of three) controller nodes was rebooted.


Version-Release number of selected component (if applicable):

Red Hat OpenStack Platform 10

[stack@ucloud ~]$ sudo rpm -qa | grep rhosp
rhosp-director-images-ipa-10.0-20170615.1.el7ost.noarch
rhosp-director-images-10.0-20170615.1.el7ost.noarch
[stack@ucloud ~]$


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


[stack@ucloud ~]$ source stackrc
[stack@ucloud ~]$ openstack stack list
+--------------------------------------+------------+-----------------+----------------------+----------------------+
| ID                                   | Stack Name | Stack Status    | Creation Time        | Updated Time         |
+--------------------------------------+------------+-----------------+----------------------+----------------------+
| 2d6ffd1e-4ba4-43f0-b89e-6ee54a3546a5 | overcloud  | UPDATE_COMPLETE | 2017-10-20T21:00:43Z | 2017-10-24T08:50:08Z |
+--------------------------------------+------------+-----------------+----------------------+----------------------+


[stack@ucloud ~]$ source overcloudrc
[stack@ucloud ~]$ openstack server list
Discovering versions from the identity service failed when creating the password plugin. Attempting to determine version from URL.
Unable to establish connection to http://10.64.105.17:5000/v2.0/tokens: HTTPConnectionPool(host='10.64.105.17', port=5000): Max retries exceeded with url: /v2.0/tokens (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x3178ed0>: Failed to establish a new connection: [Errno 111] Connection refused',))
[stack@ucloud ~]$



+--------------------------------------+------------+-----------------+----------------------+----------------------+
[root@overcloud-controller-0 keystone]# tail keystone.log
2017-11-06 05:08:06.309 894674 WARNING oslo_db.sqlalchemy.engines [req-34dfa39f-6754-4318-91af-cca8e42669f7 - - - - -] SQL connection failed. -482 attempts left.
2017-11-06 05:08:06.512 894667 WARNING oslo_db.sqlalchemy.engines [req-b4bf821f-df6b-4908-8e67-01af100a330c - - - - -] SQL connection failed. -535 attempts left.
2017-11-06 05:08:06.541 894742 WARNING oslo_db.sqlalchemy.engines [req-8c4cbd10-208e-4b78-8a4d-a3143444abe8 - - - - -] SQL connection failed. -484 attempts left.
2017-11-06 05:08:06.796 894684 WARNING oslo_db.sqlalchemy.engines [req-131f4a8e-430e-4fbc-b8ab-171108e4539f - - - - -] SQL connection failed. -542 attempts left.
2017-11-06 05:08:06.819 894669 WARNING oslo_db.sqlalchemy.engines [req-1250aa29-ee24-4057-a1d1-98c20a5ca61d - - - - -] SQL connection failed. -422 attempts left.
2017-11-06 05:08:07.099 894572 WARNING oslo_db.sqlalchemy.engines [req-dc7e40a9-a18f-45f5-a443-42d621bc5b3b - - - - -] SQL connection failed. -568 attempts left.
2017-11-06 05:08:08.302 894670 WARNING oslo_db.sqlalchemy.engines [req-42bef569-dac0-4b9c-88c3-70ad97d00fd2 - - - - -] SQL connection failed. -485 attempts left.
2017-11-06 05:08:08.588 894701 WARNING oslo_db.sqlalchemy.engines [req-ffd3bdc2-c548-483d-877c-2e1e7193c2f3 - - - - -] SQL connection failed. -511 attempts left.
2017-11-06 05:08:08.602 894702 WARNING oslo_db.sqlalchemy.engines [req-f9ac49d5-3689-406c-9887-5b2cf60fec40 - - - - -] SQL connection failed. -523 attempts left.
2017-11-06 05:08:08.667 894703 WARNING oslo_db.sqlalchemy.engines [req-61a674e4-7c1e-460e-9a74-ed7be47df3e1 - - - - -] SQL connection failed. -533 attempts left.
[root@overcloud-controller-0 keystone]#

+-------------------------------------------------------------------------------------------------------------------+

Controller 0

[heat-admin@overcloud-controller-0 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-0 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Mon Nov  6 04:48:15 2017          Last change: Mon Nov  6 04:22:23 2017 by root via cibadmin on overcloud-controller-0

3 nodes and 19 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

 ip-172.16.73.23        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-10.64.105.17        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.16.72.24        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-172.16.76.29        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 Master/Slave Set: galera-master [galera]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-172.16.73.29        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.16.75.21        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 ]
     Stopped: [ overcloud-controller-2 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-1 ]
     Slaves: [ overcloud-controller-0 overcloud-controller-2 ]
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Started overcloud-controller-0

Failed Actions:
* galera_start_0 on overcloud-controller-2 'unknown error' (1): call=520, status=Timed Out, exitreason='none',
    last-rc-change='Sat Nov  4 09:28:34 2017', queued=0ms, exec=120005ms
* galera_start_0 on overcloud-controller-0 'unknown error' (1): call=1259, status=Timed Out, exitreason='none',
    last-rc-change='Sat Nov  4 09:30:34 2017', queued=0ms, exec=120004ms


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
  
  
[heat-admin@overcloud-controller-0 ~]$ sudo clustercheck
HTTP/1.1 503 Service Unavailable
Content-Type: text/plain
Connection: close
Content-Length: 36

Galera cluster node is not synced.



[heat-admin@overcloud-controller-0 ~]$ sudo tail -n 1000 /var/log/mariadb/mariadb.log | grep "Found saved state"  | grep -v ":-1"
171104  9:08:48 [Note] WSREP: Found saved state: b7e5134e-b5de-11e7-aff7-e6e8074a3e99:6832561
171104  9:18:06 [Note] WSREP: Found saved state: b7e5134e-b5de-11e7-aff7-e6e8074a3e99:6835285


[heat-admin@overcloud-controller-0 ~]$ sudo systemctl status mariadb
● mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disabled)
   Active: active (running) since Sat 2017-11-04 09:18:08 UTC; 1 day 19h ago
  Process: 1028947 ExecStartPost=/usr/libexec/mariadb-wait-ready $MAINPID (code=exited, status=0/SUCCESS)
  Process: 1028918 ExecStartPre=/usr/libexec/mariadb-prepare-db-dir %n (code=exited, status=0/SUCCESS)
 Main PID: 1028946 (mysqld_safe)
   CGroup: /system.slice/mariadb.service
           ├─1028946 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
           └─1030258 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --wsrep-provider=/usr/lib64/galera/libgalera...

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
[heat-admin@overcloud-controller-0 ~]$

+-------------------------------------------------------------------------------------------------------------------+

Controller 1

[heat-admin@overcloud-controller-1 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-0 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Mon Nov  6 04:49:44 2017          Last change: Mon Nov  6 04:22:23 2017 by root via cibadmin on overcloud-controller-0

3 nodes and 19 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

 ip-172.16.73.23        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-10.64.105.17        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.16.72.24        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-172.16.76.29        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 Master/Slave Set: galera-master [galera]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-172.16.73.29        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.16.75.21        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 ]
     Stopped: [ overcloud-controller-2 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-1 ]
     Slaves: [ overcloud-controller-0 overcloud-controller-2 ]
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Started overcloud-controller-0

Failed Actions:
* galera_start_0 on overcloud-controller-2 'unknown error' (1): call=520, status=Timed Out, exitreason='none',
    last-rc-change='Sat Nov  4 09:28:34 2017', queued=0ms, exec=120005ms
* galera_start_0 on overcloud-controller-0 'unknown error' (1): call=1259, status=Timed Out, exitreason='none',
    last-rc-change='Sat Nov  4 09:30:34 2017', queued=0ms, exec=120004ms


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
  
  
[heat-admin@overcloud-controller-1 ~]$ sudo clustercheck
HTTP/1.1 503 Service Unavailable
Content-Type: text/plain
Connection: close
Content-Length: 36

Galera cluster node is not synced.



[heat-admin@overcloud-controller-1 ~]$ sudo tail -n 1000 /var/log/mariadb/mariadb.log | grep "Found saved state"  | grep -v ":-1"
171104  9:08:58 [Note] WSREP: Found saved state: b7e5134e-b5de-11e7-aff7-e6e8074a3e99:6832561


[heat-admin@overcloud-controller-1 ~]$ sudo systemctl status mariadb
● mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disabled)
   Active: active (running) since Sat 2017-11-04 09:09:01 UTC; 1 day 19h ago
 Main PID: 247864 (mysqld_safe)
   CGroup: /system.slice/mariadb.service
           ├─247864 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
           └─248706 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --wsrep-provider=/usr/lib64/galera/libgalera_...

Nov 04 09:08:55 overcloud-controller-1.localdomain systemd[1]: Starting MariaDB database server...
Nov 04 09:08:56 overcloud-controller-1.localdomain mysqld_safe[247864]: 171104 09:08:56 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Nov 04 09:08:56 overcloud-controller-1.localdomain mysqld_safe[247864]: 171104 09:08:56 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Nov 04 09:08:56 overcloud-controller-1.localdomain mysqld_safe[247864]: 171104 09:08:56 mysqld_safe WSREP: Running position recovery with --log_error='/var...er.pid'
Nov 04 09:08:58 overcloud-controller-1.localdomain mysqld_safe[247864]: 171104 09:08:58 mysqld_safe WSREP: Recovered position b7e5134e-b5de-11e7-aff7-e6e80...6832561
Nov 04 09:09:01 overcloud-controller-1.localdomain systemd[1]: Started MariaDB database server.
Hint: Some lines were ellipsized, use -l to show in full.
[heat-admin@overcloud-controller-1 ~]$



+-------------------------------------------------------------------------------------------------------------------+

Controller 2

[heat-admin@overcloud-controller-2 ~]$ sudo pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: overcloud-controller-0 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Mon Nov  6 04:50:03 2017          Last change: Mon Nov  6 04:22:23 2017 by root via cibadmin on overcloud-controller-0

3 nodes and 19 resources configured

Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

 ip-172.16.73.23        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-10.64.105.17        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.16.72.24        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 Clone Set: haproxy-clone [haproxy]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-172.16.76.29        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 Master/Slave Set: galera-master [galera]
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 ip-172.16.73.29        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 ip-172.16.75.21        (ocf::heartbeat:IPaddr2):       Started overcloud-controller-0
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 overcloud-controller-1 ]
     Stopped: [ overcloud-controller-2 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ overcloud-controller-1 ]
     Slaves: [ overcloud-controller-0 overcloud-controller-2 ]
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Started overcloud-controller-0

Failed Actions:
* galera_start_0 on overcloud-controller-2 'unknown error' (1): call=520, status=Timed Out, exitreason='none',
    last-rc-change='Sat Nov  4 09:28:34 2017', queued=0ms, exec=120005ms
* galera_start_0 on overcloud-controller-0 'unknown error' (1): call=1259, status=Timed Out, exitreason='none',
    last-rc-change='Sat Nov  4 09:30:34 2017', queued=0ms, exec=120004ms


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
[heat-admin@overcloud-controller-2 ~]$ sudo clustercheck
HTTP/1.1 503 Service Unavailable
Content-Type: text/plain
Connection: close
Content-Length: 36

Galera cluster node is not synced.


[heat-admin@overcloud-controller-2 ~]$ sudo tail -n 1000 /var/log/mariadb/mariadb.log | grep "Found saved state"  | grep -v ":-1"
171103  4:47:48 [Note] WSREP: Found saved state: b7e5134e-b5de-11e7-aff7-e6e8074a3e99:6811407
171103  7:20:31 [Note] WSREP: Found saved state: b7e5134e-b5de-11e7-aff7-e6e8074a3e99:6832561


[heat-admin@overcloud-controller-2 ~]$ sudo systemctl status mariadb
● mariadb.service - MariaDB database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; disabled; vendor preset: disabled)
   Active: active (running) since Sat 2017-11-04 09:09:14 UTC; 1 day 19h ago
 Main PID: 1017828 (mysqld_safe)
   CGroup: /system.slice/mariadb.service
           ├─1017828 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
           └─1018672 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --wsrep-provider=/usr/lib64/galera/libgalera...

Nov 04 09:09:04 overcloud-controller-2.localdomain systemd[1]: Starting MariaDB database server...
Nov 04 09:09:05 overcloud-controller-2.localdomain mysqld_safe[1017828]: 171104 09:09:05 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Nov 04 09:09:05 overcloud-controller-2.localdomain mysqld_safe[1017828]: 171104 09:09:05 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Nov 04 09:09:05 overcloud-controller-2.localdomain mysqld_safe[1017828]: 171104 09:09:05 mysqld_safe WSREP: Running position recovery with --log_error='/va...er.pid'
Nov 04 09:09:07 overcloud-controller-2.localdomain mysqld_safe[1017828]: 171104 09:09:07 mysqld_safe WSREP: Recovered position b7e5134e-b5de-11e7-aff7-e6e8...6832561
Nov 04 09:09:14 overcloud-controller-2.localdomain systemd[1]: Started MariaDB database server.
Hint: Some lines were ellipsized, use -l to show in full.
[heat-admin@overcloud-controller-2 ~]$


+-------------------------------------------------------------------------------------------------------------------+


Expected results:


Additional info:

The steps described in the URL below were executed.

https://access.redhat.com/articles/1298564

Output file attached.
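
A quick sanity check before any forced promotion is to compare each controller's saved Galera position. This is only a sketch (not taken from the article above) and assumes the default /var/lib/mysql data directory shown in the systemctl status output above:

[heat-admin@overcloud-controller-0 ~]$ sudo cat /var/lib/mysql/grastate.dat
# grastate.dat holds the cluster uuid and the last committed seqno;
# the node with the highest seqno is the most advanced one.
[heat-admin@overcloud-controller-0 ~]$ sudo mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'"
# reports 'Primary' only when the local mysqld is part of a working cluster
# (may require database credentials, depending on how root access is configured).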

Comment 1 Michael Bayer 2017-11-06 14:21:08 UTC
Hi there -

Can you please attach complete SOS reports for all three controller nodes, including all logs (particularly /var/log/mysqld.log) and spanning the entire incident? Thanks.

Comment 3 Michael Bayer 2017-11-08 14:42:59 UTC
Per the attached SOS reports, you have MySQL running via systemd, which will conflict with Pacemaker. Your description above also indicates that systemctl sees MySQL as running:

Nov 04 09:08:55 overcloud-controller-1.localdomain systemd[1]: Starting MariaDB database server...
Nov 04 09:08:56 overcloud-controller-1.localdomain mysqld_safe[247864]: 171104 09:08:56 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
Nov 04 09:08:56 overcloud-controller-1.localdomain mysqld_safe[247864]: 171104 09:08:56 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
Nov 04 09:08:56 overcloud-controller-1.localdomain mysqld_safe[247864]: 171104 09:08:56 mysqld_safe WSREP: Running position recovery with --log_error='/var...er.pid'

When running Galera under Pacemaker with our current configuration, log output no longer goes to /var/log/mariadb/mariadb.log; instead you will see the output in /var/log/mysqld.log on each controller.

In this case we can see in mysqld.log that Pacemaker is unable to start Galera because it is already running via systemd. The MariaDB services should not be enabled, so that they don't come up on reboot, and they should never be started via systemctl.
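
A sketch of the remediation this implies, assuming the mariadb systemd unit and the galera Pacemaker resource names shown in this report (these exact steps are not confirmed in this bug):

On each controller:
[heat-admin@overcloud-controller-N ~]$ sudo systemctl stop mariadb       # stop the conflicting systemd-managed instance
[heat-admin@overcloud-controller-N ~]$ sudo systemctl disable mariadb    # make sure it stays down across reboots

Then, from any one controller:
[heat-admin@overcloud-controller-0 ~]$ sudo pcs resource cleanup galera  # clear the failed start actions so Pacemaker retries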

Comment 4 Anup P 2017-11-09 04:25:20 UTC
Thanks for the analysis. 

Do the logs show the reason for the first dashboard failure? All three controllers were up and running normally at that time. Also, do they show why the first 'pcs resource cleanup' did not help? This is the second occurrence of the issue.

What are the ideal recovery steps in this scenario, and are there any best practices for avoiding this issue?

Comment 5 Michael Bayer 2017-11-09 16:06:20 UTC
The first event I can see is some kind of network outage around 171102 13:20: keystone was unable to connect to the database or RabbitMQ, and the Galera nodes lost connectivity to each other in roughly the same window.

By 171103 0:51:55 the Galera cluster was up again and keystone stopped reporting SQL errors; however, it continued to have RabbitMQ connectivity errors that weren't resolved until the next Galera event.

The next event on the Galera side would have begun at 171103 04:23:22 when, as far as I can tell, mariadb was manually started using systemctl on controller 0, which interfered with the existing mysqld process started by Pacemaker. Pacemaker then decided that controller 0 was down, likely because the manually started mariadb instance put the database into a state indicating that WSREP was not available, and it stopped the mysqld process on controller 0 normally (but not the mariadb process, which uses a different pidfile that Pacemaker knows nothing about). The still-running mariadb process was then able to take over the place of the Pacemaker-initiated mysqld process entirely, e.g. binding to its ports. Twenty-five minutes after that, systemctl was used to start mariadb on controllers 1 and 2, which similarly interfered with the other two mysqld processes.

The SOS reports don't show the corosync logs from the time of the event, as they appear to have been truncated up to Nov 7, so I can't confirm the decisions Pacemaker was making. But all three controllers were up normally as of 171103 0:51:55:

controller 0 pacemaker mysql:
171103  0:51:55 [Note] WSREP: Member 2.0 (overcloud-controller-1.localdomain) synced with group.

controller 1 pacemaker mysql:
171103  0:51:55 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 6773293)
171103  0:51:55 [Note] WSREP: Synchronized with group, ready for connections

controller 2 pacemaker mysql:
171103  0:51:55 [Note] WSREP: Member 2.0 (overcloud-controller-1.localdomain) synced with group.

All three of those services show no events of any kind until 171103 4:23:32. Ten seconds prior to that, the mariadb service on controller 0 is started:

controller 0 systemd mariadb:
171103 04:23:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql

Note the above event is the first thing that the systemd mariadb daemon had logged since it was shut down as a normal part of cluster installation at 171020 21:36:20, two weeks earlier.

Once mariadb on controller 0 is started, the Pacemaker-managed mysqld is taken down ten seconds later (we would assume by Pacemaker itself, though I can't be 100% sure; it was a normal, non-crash shutdown):

controller 0 pacemaker mysql (first event logged since it was fully up at 0:51:55):
171103  4:23:32 [Note] /usr/libexec/mysqld: Normal shutdown


and at 171103 04:47 we see mariadb being started by systemd on controller 1 and 2:

171103 04:47:30 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql

171103 04:47:39 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql

From 4:23 on, we can see all kinds of flapping among the six services (3 Pacemaker mysqld, 3 systemd mariadb), which is pretty intricate to serialize, but basically the mariadb version of the cluster seems to have come up completely while Pacemaker (or something) was shutting down the Pacemaker-initiated mysqld processes. The database was still reachable.

Throughout this time, Keystone did not report SQL errors; the next one we see is 24 hours after its previous error, at 2017-11-04 00:16:35. The systemd-initiated mariadb cluster seems to have been more or less able to run, but it was also seeing things flap; working out exactly what was going on then would involve a lot more guesswork.

But in any case, I see no evidence of anything wrong with Galera or the resource agent: there was a network event on 1102 that seemed to precipitate things going down, the DB was then recovered, and at 1103 04:23:22 MariaDB was started via systemd, which is when things started breaking again.
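
To tell which supervisor owns a running mysqld on a given controller, a minimal sketch (assuming the unit and resource names used throughout this report) is:

[heat-admin@overcloud-controller-0 ~]$ systemctl is-active mariadb            # 'active' means a systemd-managed copy is running
[heat-admin@overcloud-controller-0 ~]$ ps -eo pid,ppid,cmd | grep '[m]ysqld'  # lists every running mysqld_safe/mysqld process and its parent
[heat-admin@overcloud-controller-0 ~]$ sudo pcs status | grep -A 2 galera     # shows whether the Pacemaker galera resource is Master, Slave or Stopped here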

Comment 6 Anup P 2017-11-27 12:22:55 UTC
Thanks for the in-depth analysis.

We have migrated the setup to different hardware and a different L2 network switch, and it is under observation. It has been working fine so far.

We would like to know whether Horizon is expected to fail when network fluctuations occur, and whether it detects that the network is reachable again and becomes operational afterwards. Any documentation regarding this would be much appreciated.

This bug may also be closed.

Comment 7 Anup P 2017-11-27 12:24:21 UTC
(In reply to Anup P from comment #6)
> Thanks for the indepth analysis. 
> 
> Have migrated setup to different **hardware & L2 network switch & under observation.
> Working fine so far.
> 
> Wish to know if horizon is expected to fail if network fluctuation is
> observed and if it detects if network is reachable again and becomes
> operational. Any documentation regarding this is much appreciated.
> 
> Also may close this bug.

Comment 8 Chris Jones 2017-11-28 14:16:35 UTC
Flagging Jason Rist from the UI team to take a look at the Horizon-related question in Comment #7.

Comment 9 Radomir Dopieralski 2017-12-11 21:29:28 UTC
Answering the question from comment 6: Horizon doesn't detect or react to network fluctuations or outages in any way, and by itself it stores no information except what is inside the user's session. It's a simple, stateless web application that merely talks to the other OpenStack services to display their answers or to perform the actions the user chooses. It doesn't even have any code running when nobody is refreshing a web page. It can't detect the network state other than by failing to communicate with services, it doesn't remember such failures, and when connectivity is restored it will work as if nothing had happened, because it has no storage or memory for its state. In the worst case, you might need to log in again or delete the cookies from your browser.

Comment 10 Jason E. Rist 2017-12-11 23:54:47 UTC
Thanks Radomir!

