Bug 1605246 - [HA] [Scale] Cluster does not recover even after the controllers are up
Summary: [HA] [Scale] Cluster does not recover even after the controllers are up
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: opendaylight
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: z5
Target Release: 13.0 (Queens)
Assignee: Vishal Thapar
QA Contact: Tomas Jamrisko
URL:
Whiteboard: HA
Depends On:
Blocks:
 
Reported: 2018-07-20 14:29 UTC by Sridhar Gaddam
Modified: 2019-03-06 16:17 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-06 16:16:52 UTC
Target Upstream Version:


Attachments
Controller-0 logs (deleted)
2018-07-20 14:36 UTC, Sridhar Gaddam
Controller-1 logs (deleted)
2018-07-20 14:37 UTC, Sridhar Gaddam
Controller-2 logs (deleted)
2018-07-20 14:38 UTC, Sridhar Gaddam

Description Sridhar Gaddam 2018-07-20 14:29:51 UTC
Description of problem:
Deployment with 3 OSP Controller + ODL nodes and 2 computes.

While running the Browbeat neutron rally tests (concurrency 8 and 16, with times set to 500), the ODL controllers were terminated due to OOM, and at one point two of the three ODL controllers were down. Even after the controllers came back up, the cluster does not recover.

https://snapshot.raintank.io/dashboard/snapshot/E2gAMmlMToS65z7jU0CHs9WBd8WDd6bT
https://snapshot.raintank.io/dashboard/snapshot/bEPgnhlNwTZsYXRljishtqFoLHprHIKt
https://snapshot.raintank.io/dashboard/snapshot/QUuoBYxgHx1ZGVWH2bpMdgxo4Fdm71XK
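
For context, a minimal sketch of the kind of rally task such a Browbeat run drives, using the "constant" runner with the times and concurrency values mentioned above; the scenario name, args, and context below are illustrative placeholders, not the exact Browbeat-generated task:

import json

# Illustrative rally task (assumed, not taken from the actual Browbeat config):
# the "constant" runner with times=500 and concurrency=8 matches the values
# reported above; the scenario and its args are placeholders.
task = {
    "NeutronNetworks.create_and_list_networks": [
        {
            "args": {"network_create_args": {}},
            "runner": {"type": "constant", "times": 500, "concurrency": 8},
            "context": {"users": {"tenants": 1, "users_per_tenant": 1}},
        }
    ]
}

with open("neutron-networks-task.json", "w") as f:
    json.dump(task, f, indent=2)
# The file can then be run with: rally task start neutron-networks-task.json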

Controller-0: was terminated due to OOM around "05:26 07/19/2018", came back up at "06:05 07/19/2018", was terminated again at "08:22 07/19/2018", and came back up at "08:39 07/19/2018"
Controller-1: was terminated due to OOM around "05:45 07/19/2018" and came back up at "07:56 07/19/2018"
Controller-2: was terminated due to OOM around "06:19 07/19/2018" and came back up at "06:33 07/19/2018"
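
The OOM terminations above should also be visible in the kernel log on each controller node; a minimal sketch for pulling those messages out, assuming journalctl is available on the node:

import re
import subprocess

# Hedged sketch: greps the kernel journal for generic OOM-killer signatures.
# It does not assume which process (e.g. the ODL Karaf JVM) was killed.
kernel_log = subprocess.run(["journalctl", "-k", "--no-pager"],
                            capture_output=True, text=True).stdout
for line in kernel_log.splitlines():
    if re.search(r"oom-killer|Out of memory|Killed process", line):
        print(line)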

Some logs which could be of interest:

2018-07-20 12:50:34,070 | INFO  | d-dispatcher-151 | Shard                            | 224 - org.opendaylight.controller.sal-clustering-commons - 1.7.3.redhat-1 | member-0-shard-default-config (Leader): handleAppendEntriesReply - received unsuccessful reply: AppendEntriesReply [term=278, success=false, followerId=member-1-shard-default-config, logLastIndex=1607057, logLastTerm=278, forceInstallSnapshot=false, payloadVersion=5, raftVersion=3], leader snapshotIndex: 1759997
2018-07-20 12:50:34,070 | INFO  | d-dispatcher-151 | Shard                            | 224 - org.opendaylight.controller.sal-clustering-commons - 1.7.3.redhat-1 | member-0-shard-default-config (Leader): follower member-1-shard-default-config last log term 278 conflicts with the leader's -1 - dec next index to 1410594 
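
These entries show the shard leader repeatedly rejecting a follower's state and walking its next index backwards. A simplified sketch of the general Raft behaviour behind such messages (an illustration of the textbook algorithm, not OpenDaylight's actual implementation; all names and numbers in it are made up):

from dataclasses import dataclass

# Simplified illustration of a Raft leader handling an AppendEntriesReply.
# Not ODL's code; field and function names here are invented for the sketch.

@dataclass
class AppendEntriesReply:
    success: bool
    log_last_index: int
    log_last_term: int

@dataclass
class FollowerProgress:
    next_index: int
    match_index: int = 0

def on_append_entries_reply(progress: FollowerProgress,
                            reply: AppendEntriesReply,
                            leader_snapshot_index: int) -> str:
    """Return the (simplified) action the leader takes next."""
    if reply.success:
        progress.match_index = reply.log_last_index
        progress.next_index = reply.log_last_index + 1
        return "replicate next entries"
    if reply.log_last_index <= leader_snapshot_index:
        # The entries the follower needs may already be compacted into the
        # leader's snapshot (the conflicting term shows up as -1 above), so
        # catching the follower up can require sending a full snapshot.
        return "consider install snapshot"
    # Otherwise back next_index off and retry; a long series of back-offs
    # ("dec next index to ...") means the follower is far behind the leader.
    progress.next_index = max(1, progress.next_index - 1)
    return "retry append entries"

# Tiny example with made-up indices: the follower is behind, but its missing
# entries are still in the leader's log, so the leader just backs off.
print(on_append_entries_reply(FollowerProgress(next_index=101),
                              AppendEntriesReply(False, 90, 5),
                              leader_snapshot_index=50))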

Version-Release number of selected component (if applicable):
opendaylight-8.3.0-1.el7ost.noarch

How reproducible:
Not very often

Steps to Reproduce:
1. Deploy OSP with 3 Controller + ODL nodes and 2 Computes
2. Run the Browbeat rally neutron tests with concurrency 8 and 16, and times set to 500
3. The ODL controllers get terminated due to OOM

Actual results:
The root cause of the OOM issue needs to be identified separately, but this bug is about the cluster not recovering even after the controllers are back up.

Expected results:
Once all the controllers are up, the cluster should synchronize and function normally.
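
One way to verify whether the shards have actually converged after the controllers come back is to query each member's shard MBean over Jolokia. A minimal sketch, assuming the default port 8181, default admin/admin credentials, the default-config shard seen in the logs, and placeholder hostnames:

import requests

# Hedged sketch: hostnames, port, credentials and the MBean name pattern are
# assumptions based on a default ODL clustering setup; adjust as needed.
CONTROLLERS = {"controller-0": 0, "controller-1": 1, "controller-2": 2}
MBEAN = ("org.opendaylight.controller:Category=Shards,"
         "name=member-{member}-shard-default-config,"
         "type=DistributedConfigDatastore")

for host, member in CONTROLLERS.items():
    url = "http://{}:8181/jolokia/read/{}".format(host, MBEAN.format(member=member))
    try:
        value = requests.get(url, auth=("admin", "admin"),
                             timeout=10).json().get("value", {})
        # After recovery the shard should report one Leader and two Followers;
        # a missing leader or a member stuck in Candidate/IsolatedLeader state
        # indicates the cluster has not converged.
        print(host, value.get("RaftState"), "leader:", value.get("Leader"))
    except requests.RequestException as exc:
        print(host, "unreachable:", exc)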

Additional info:

Comment 1 Sridhar Gaddam 2018-07-20 14:36:38 UTC
Created attachment 1465006 [details]
Controller-0 logs

Comment 2 Sridhar Gaddam 2018-07-20 14:37:39 UTC
Created attachment 1465018 [details]
Controller-1 logs

Comment 3 Sridhar Gaddam 2018-07-20 14:38:17 UTC
Created attachment 1465026 [details]
Controller-2 logs

Comment 12 Franck Baudin 2019-03-06 16:16:52 UTC
As per the deprecation notice [1], closing this bug. Please reopen if relevant for RHOSP13, as this is the only version shipping ODL.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14/html-single/release_notes/index#deprecated_functionality
