Bug 1605246

Summary: [HA] [Scale] Cluster does not recover even after the controllers are up
Product: Red Hat OpenStack
Reporter: Sridhar Gaddam <sgaddam>
Component: opendaylight
Assignee: Vishal Thapar <vthapar>
Status: CLOSED WONTFIX
QA Contact: Tomas Jamrisko <tjamrisk>
Severity: high
Docs Contact:
Priority: medium
Version: 13.0 (Queens)
CC: aadam, mkolesni, skitt, smalleni, vpickard, vthapar
Target Milestone: z5
Keywords: Triaged, ZStream
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Whiteboard: HA
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-03-06 16:16:52 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments (Description: Flags):
Controller-0 logs: none
Controller-1 logs: none
Controller-2 logs: none

Description Sridhar Gaddam 2018-07-20 14:29:51 UTC
Description of problem:
Deployment with 3 OSP Controller + ODL nodes and 2 computes.

While running the Browbeat neutron Rally tests (concurrency 8 and 16, with times set to 500), the controllers were terminated due to OOM, and at some point two of the ODL controllers went down. Even after the controllers came back up, the cluster did not recover.

https://snapshot.raintank.io/dashboard/snapshot/E2gAMmlMToS65z7jU0CHs9WBd8WDd6bT
https://snapshot.raintank.io/dashboard/snapshot/bEPgnhlNwTZsYXRljishtqFoLHprHIKt
https://snapshot.raintank.io/dashboard/snapshot/QUuoBYxgHx1ZGVWH2bpMdgxo4Fdm71XK

Controller-0: was terminated due to OOM around "05:26 07/19/2017" and was up at "06:05 07/19/2017". Was terminated again at "08:22 07/19/2017" and up at "08:39 07/19/2017"
Controller-1: was terminated due to OOM around "05:45 07/19/2017" and was up at "07:56 07/19/2017"
Controller-2: was terminated due to OOM around "06:19 07/19/2017" and was up at "06:33 07/19/2017"
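
To make the overlap between these outages easier to see, here is a minimal Python sketch that sweeps over the reported down/up times and lists the intervals in which fewer than a majority of the three controllers were up. The timestamps are copied as reported above (they are approximate, "around" values), and the function name and the majority rule (2 of 3, as in a Raft-based cluster) are assumptions made for illustration, not something taken from the attached logs.

```python
from datetime import datetime

FMT = "%H:%M %m/%d/%Y"

# Outage windows (down, up), copied verbatim from the report above.
OUTAGES = {
    "Controller-0": [("05:26 07/19/2017", "06:05 07/19/2017"),
                     ("08:22 07/19/2017", "08:39 07/19/2017")],
    "Controller-1": [("05:45 07/19/2017", "07:56 07/19/2017")],
    "Controller-2": [("06:19 07/19/2017", "06:33 07/19/2017")],
}

UP, DOWN = 0, 1  # UP sorts first, so a node coming up is handled before one going down


def quorum_loss_windows(outages, cluster_size=3):
    """Return (start, end, nodes_down) intervals with fewer than a majority of nodes up.

    Hypothetical helper: assumes a simple majority quorum (cluster_size // 2 + 1).
    """
    majority = cluster_size // 2 + 1
    events = []
    for node, windows in outages.items():
        for down, up in windows:
            events.append((datetime.strptime(down, FMT), DOWN, node))
            events.append((datetime.strptime(up, FMT), UP, node))
    events.sort(key=lambda e: (e[0], e[1]))

    down_now = set()   # nodes currently down
    seen = set()       # nodes that were down at any point in the open window
    start = None
    result = []
    for when, kind, node in events:
        if kind == DOWN:
            down_now.add(node)
        else:
            down_now.discard(node)
        lost = cluster_size - len(down_now) < majority
        if lost:
            if start is None:
                start, seen = when, set()
            seen |= down_now
        elif start is not None:
            result.append((start, when, sorted(seen)))
            start = None
    return result


if __name__ == "__main__":
    for begin, end, nodes in quorum_loss_windows(OUTAGES):
        minutes = int((end - begin).total_seconds() // 60)
        print(f"{begin:%H:%M}-{end:%H:%M} ({minutes} min): {', '.join(nodes)} down")
```

With the times as reported, this yields two windows in which only one controller was up: 05:45 to 06:05 (Controller-0 and Controller-1 down) and 06:19 to 06:33 (Controller-1 and Controller-2 down).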

Some logs which could be of interest:

2018-07-20 12:50:34,070 | INFO  | d-dispatcher-151 | Shard                            | 224 - org.opendaylight.controller.sal-clustering-commons - 1.7.3.redhat-1 | member-0-shard-default-config (Leader): handleAppendEntriesReply - received unsuccessful reply: AppendEntriesReply [term=278, success=false, followerId=member-1-shard-default-config, logLastIndex=1607057, logLastTerm=278, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3], leader snapshotIndex: 1759997
2018-07-20 12:50:34,070 | INFO  | d-dispatcher-151 | Shard                            | 224 - org.opendaylight.controller.sal-clustering-commons - 1.7.3.redhat-1 | member-0-shard-default-config (Leader): follower member-1-shard-default-config last log term 278 conflicts with the leader's -1 - dec next index to 1410594 
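
When going through the attached controller logs, a small script can help pull the relevant numbers out of these messages. The following Python sketch parses an "AppendEntriesReply" line in the format shown above and compares the follower's logLastIndex with the leader's snapshotIndex. The regular expression and helper names are assumptions based only on the two log lines quoted here; other log formats may need adjustments.

```python
import re

# Matches the bracketed reply fields and the trailing leader snapshot index.
REPLY_RE = re.compile(
    r"AppendEntriesReply \[(?P<fields>[^\]]*)\], "
    r"leader snapshotIndex: (?P<snap>-?\d+)"
)

# Message part of the first log line quoted above.
SAMPLE = (
    "member-0-shard-default-config (Leader): handleAppendEntriesReply - "
    "received unsuccessful reply: AppendEntriesReply [term=278, success=false, "
    "followerId=member-1-shard-default-config, logLastIndex=1607057, "
    "logLastTerm=278, forceInstallSnapshot=false, payloadVersion=5, "
    "raftVersion=3], leader snapshotIndex: 1759997"
)


def parse_reply(line):
    """Extract the follower id and indexes from one log line, or None if no match."""
    match = REPLY_RE.search(line)
    if match is None:
        return None
    fields = dict(pair.split("=", 1) for pair in match.group("fields").split(", "))
    return {
        "follower": fields["followerId"],
        "success": fields["success"] == "true",
        "log_last_index": int(fields["logLastIndex"]),
        "leader_snapshot_index": int(match.group("snap")),
    }


def entries_behind_snapshot(reply):
    """Number of entries by which the follower's log trails the leader's snapshot."""
    return max(0, reply["leader_snapshot_index"] - reply["log_last_index"])


if __name__ == "__main__":
    reply = parse_reply(SAMPLE)
    print(reply["follower"], "is", entries_behind_snapshot(reply),
          "entries behind the leader snapshot")
```

For the line quoted above, the follower member-1-shard-default-config reports logLastIndex 1607057 against a leader snapshotIndex of 1759997, so its log ends 152940 entries before the leader's snapshot. Whether this gap explains the failed recovery is not established in this report; the sketch only makes the numbers easier to collect across many log lines.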

Version-Release number of selected component (if applicable):
opendaylight-8.3.0-1.el7ost.noarch

How reproducible:
Not very often

Steps to Reproduce:
1. Deploy OSP with ODL 3 Controllers and 2 Computes
2. Run Browbeat rally neutron tests with concurrency 8, 16 and times set to 500
3. ODL Controllers would get terminated due to OOM

Actual results:
The root cause of the OOM terminations should be identified separately; this bug is about the cluster not recovering even after the controllers are back up.

Expected results:
Once all the controllers are up, the cluster should synchronize and function normally.

Additional info:

Comment 1 Sridhar Gaddam 2018-07-20 14:36:38 UTC
Created attachment 1465006 [details]
Controller-0 logs

Comment 2 Sridhar Gaddam 2018-07-20 14:37:39 UTC
Created attachment 1465018 [details]
Controller-1 logs

Comment 3 Sridhar Gaddam 2018-07-20 14:38:17 UTC
Created attachment 1465026 [details]
Controller-2 logs

Comment 12 Franck Baudin 2019-03-06 16:16:52 UTC
As per the deprecation notice [1], closing this bug. Please reopen if relevant for RHOSP13, as this is the only version shipping ODL.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/release_notes/index#deprecated_functionality

Comment 13 Franck Baudin 2019-03-06 16:17:50 UTC
As per the deprecation notice [1], closing this bug. Please reopen if relevant for RHOSP13, as this is the only version shipping ODL.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/release_notes/index#deprecated_functionality