Bug 1599433 - ASB deployment stuck in DeploymentAwaitingCancellation after upgrade to 3.10
Summary: ASB deployment stuck in DeploymentAwaitingCancellation after upgrade to 3.10
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Service Broker
Version: 3.10.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.10.z
Assignee: Fabian von Feilitzsch
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-07-09 19:29 UTC by Mike Fiedler
Modified: 2018-11-12 20:22 UTC
8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-12 20:22:41 UTC
Target Upstream Version:


Attachments
ansible upgrade log, master logs. (deleted) - 2018-07-09 19:29 UTC, Mike Fiedler
Inventory (deleted) - 2018-07-10 11:30 UTC, Mike Fiedler
openshift-ansible-service-broker events (deleted) - 2018-07-10 13:25 UTC, Mike Fiedler
controller log at level 5, ansible -vvv from upgrade, inventory (deleted) - 2018-07-11 15:14 UTC, Mike Fiedler

Description Mike Fiedler 2018-07-09 19:29:04 UTC
Created attachment 1457568 [details]
ansible upgrade log, master logs.

Description of problem:

Upgrading from 3.9.27 to 3.10.15. ASB is running fine before the upgrade, but after the upgrade completes (successfully, with no fatal errors), there are no pods running in the openshift-ansible-service-broker namespace.

oc describe dc/asb shows:

Deployment #1 (latest):
        Name:           asb-1
        Created:        about an hour ago
        Status:         Complete
        Replicas:       0 current / 0 desired
        Selector:       app=openshift-ansible-service-broker,deployment=asb-1,deploymentconfig=asb
        Labels:         app=openshift-ansible-service-broker,openshift.io/deployment-config.name=asb,service=asb
        Pods Status:    0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Deployment #2:
        Created:        12 minutes ago
        Status:         Failed
        Replicas:       0 current / 0 desired

Events:
  Type          Reason                          Age                     From                            Message
  ----          ------                          ----                    ----                            -------
  Normal        DeploymentAwaitingCancellation  12m (x3 over 12m)       deploymentconfig-controller     Deployment of version 0 awaiting cancellation of older running deployments
  Normal        DeploymentCancelled             12m (x2 over 12m)       deploymentconfig-controller     Cancelled deployment "asb-2" superceded by version 0




Version-Release number of selected component (if applicable): 3.10.15


How reproducible: 3 for 3 so far


Steps to Reproduce:
1.  Install a 3.9 cluster and verify the asb deployment is healthy
2.  Run the upgrade to 3.10.  I do see that the etcd data migration job completes successfully
3.  oc get pods -n openshift-ansible-service-broker

Actual results:

Nothing running. Checking the DC shows a cancelled deployment and a deployment awaiting cancellation.

Expected results:

asb running well

Additional info:

I've hit this multiple times. I'll attach the ansible -vvv upgrade log and the master logs. Let me know what else would help.

Comment 2 Zhang Cheng 2018-07-10 05:59:38 UTC
@Mike,

We have not hit this problem on our side before. Could you provide detailed reproduction steps? This would cause big trouble for customers and should be a "testblocker" for OCP 3.10 GA if the reproduction rate is high (I noticed you said you've hit this multiple times, 3 for 3 so far).

Comment 3 Zhang Cheng 2018-07-10 06:24:05 UTC
I'm setting "target release" to 3.10 and add "testblocker" keyword since reporter have hit this issue multiple time and 3 for 3 so far. This issue will block existing customer upgrade to 3.10.

Please correct me if I have mistake after we have more investigation. Thanks.

Comment 4 Jian Zhang 2018-07-10 06:24:47 UTC
FYI.
I found in the ansible log that the migration from ASB etcd to CRD failed, here:

2018-07-09 19:08:47,451 p=70049 u=root |  TASK [ansible_service_broker : scale up asb deploymentconfig] *******************************************************************************
2018-07-09 19:08:47,451 p=70049 u=root |  task path: /usr/share/ansible/openshift-ansible/roles/ansible_service_broker/tasks/migrate.yml:189
2018-07-09 19:08:47,506 p=70049 u=root |  Using module file /usr/share/ansible/openshift-ansible/roles/lib_openshift/library/oc_scale.py
2018-07-09 19:08:47,956 p=70049 u=root |  ok: [ec2-54-190-30-129.us-west-2.compute.amazonaws.com] => {
    "changed": false, 
    "invocation": {
        "module_args": {
            "debug": false, 
            "kind": "dc", 
            "kubeconfig": "/etc/origin/master/admin.kubeconfig", 
            "name": "asb", 
            "namespace": "openshift-ansible-service-broker", 
            "replicas": 1, 
            "state": "present"
        }
    }, 
    "result": [
        1
    ], 
    "state": "present"
}
2018-07-09 19:08:47,970 p=70049 u=root |  TASK [ansible_service_broker : Fail out because the ASB etcd to CRD migration was unsuccessful] *********************************************
2018-07-09 19:08:47,970 p=70049 u=root |  task path: /usr/share/ansible/openshift-ansible/roles/ansible_service_broker/tasks/migrate.yml:196
2018-07-09 19:08:48,008 p=70049 u=root |  skipping: [ec2-54-190-30-129.us-west-2.compute.amazonaws.com] => {
    "changed": false, 
    "skip_reason": "Conditional result was False"
}

Comment 5 Jian Zhang 2018-07-10 06:37:04 UTC
@Mike

The above log is from your attached log, so maybe something went wrong in the migration from ASB etcd to CRD. Just for your information.

Comment 6 Jian Zhang 2018-07-10 07:40:17 UTC
@Mike

Oh, sorry, my mistake, the migration did succeed; please ignore my comments above. But in the master-controllers log I saw these:
I0709 19:09:00.027106       1 scheduler.go:191] Failed to schedule pod: openshift-ansible-service-broker/asb-2-deploy
I0709 19:09:00.027188       1 factory.go:1375] Updating pod condition for openshift-ansible-service-broker/asb-2-deploy to (PodScheduled==False)

So I'd suggest checking your cluster nodes and their labels. Hope that helps.
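
For reference, a couple of commands that can help with that check (a generic sketch, not specific to this cluster):

    # list all nodes together with their labels
    oc get nodes --show-labels
    # look at the scheduling events in the broker namespace
    oc get events -n openshift-ansible-service-broker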

Comment 7 Zihan Tang 2018-07-10 08:03:19 UTC
In #comment 1, my steps for the upgrades are as below:
1. Install OCP v3.9.
2. Provision APBs in projects; I provisioned 4 instances across 2 projects.
3. Run the following upgrade playbooks:
ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/openshift-master/openshift_node_group.yml
ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.yml

Comment 8 Mike Fiedler 2018-07-10 11:30:02 UTC
2 additional SVT deployments hit this issue last night. The basic steps are the same as comment 7. I will attach the inventory used to install and upgrade - that is the only thing I can think of that might differ significantly. More details:

AWS cluster: 1 master, 1 etcd (separate from master), 1 infra node, 3 compute nodes. All instances are m4.xlarge (4 vCPU, 16 GB).

1. Install 3.9.27
2. Run cluster-loader (https://github.com/openshift/svt/tree/master/openshift_scalability) to create some deployments, services, builds, secrets, etc. The cluster-loader config is here: https://gist.github.com/mffiedler/6f12b4bd6206c7805038661c9dd22ce9#file-upgrade_small-yaml
3. Update yum repos on all systems to point to 3.10.15
4. Run ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/openshift-master/openshift_node_group.yml
5. Run ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_10/upgrade.yml

I am recreating the issue now and will check labels, but I did not see any MatchNodeSelector events, just the repeated deployment cancellations.   

Do I need to add new node labels after upgrading to 3.10?
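
For reference, one way to compare the selector on the upgraded dc with the labels actually present on the nodes (a generic sketch, nothing here is specific to this cluster):

    # show the node selector the asb deploymentconfig ended up with
    oc get dc asb -n openshift-ansible-service-broker -o yaml | grep -A 2 nodeSelector
    # list only the nodes carrying the 3.10 infra role label
    oc get nodes -l node-role.kubernetes.io/infra=true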

Comment 9 Mike Fiedler 2018-07-10 11:30:41 UTC
Created attachment 1457777 [details]
Inventory

Comment 10 Mike Fiedler 2018-07-10 13:24:49 UTC
Reproduced - I will keep an environment available today (Tues, July 10)


After the upgrade, the asb deploymentconfig had this nodeSelector:

      nodeSelector:
        node-role.kubernetes.io/infra: "true"

And I do have a node with that label.     I'll attach all events.

I tried oc rollout latest dc/asb and saw this error in the created pod's log:

root@ip-172-31-34-193: ~ # oc logs asb-1-zwr9n
Using config file mounted to /etc/ansible-service-broker/config.yaml
============================================================
==           Starting Ansible Service Broker...           ==
============================================================
2018/07/10 13:23:17 Unable to get log.logfile from config
[2018-07-10T13:23:17.051Z] [NOTICE] - Initializing clients...
[2018-07-10T13:23:17.051Z] [INFO] - == ETCD CX ==
[2018-07-10T13:23:17.051Z] [INFO] - EtcdHost:
[2018-07-10T13:23:17.051Z] [INFO] - EtcdPort: 0
[2018-07-10T13:23:17.051Z] [INFO] - Endpoints: [http://:0]
[2018-07-10T13:23:17.052Z] [ERROR] - client: etcd cluster is unavailable or misconfigured; error #0: dial tcp :0: getsockopt: connection refused 


It still thinks it is using etcd?
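
For reference, the running broker reads its configuration from a config map mounted at /etc/ansible-service-broker/config.yaml (per the startup banner above), so the deployed config can be inspected directly to see whether it still points at etcd rather than CRD storage. The config map name below is an assumption; use whatever `oc get configmap` actually lists:

    # list the config maps in the broker namespace
    oc get configmap -n openshift-ansible-service-broker
    # dump the broker config that the asb pod mounts (name assumed here to be broker-config)
    oc get configmap broker-config -n openshift-ansible-service-broker -o yaml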

Comment 11 Mike Fiedler 2018-07-10 13:25:59 UTC
Created attachment 1457818 [details]
openshift-ansible-service-broker events

Comment 13 Mike Fiedler 2018-07-11 01:21:47 UTC
The error messages in comment 10 were after an oc rollout latest dc/asb

Comment 14 Jian Zhang 2018-07-11 01:56:55 UTC
FYI, the version of the ASB is still 3.9 after the upgrade. See below:
root@ip-172-31-34-193: ~ # oc get pods -o yaml | grep image
      image: registry.reg-aws.openshift.com:443/openshift3/ose-ansible-service-broker:v3.9

Comment 15 Mike Fiedler 2018-07-11 02:08:17 UTC
@jiazha + team: Can you look at my inventory (attached)? When you upgrade to 3.10, do you change the ansible_service_broker_image_tag var, or maybe you don't have it in your inventory?

With the template service broker, the image gets updated to 3.10 even though the inventory has template_service_broker_version=v3.9.

Comment 16 Jian Zhang 2018-07-11 02:28:18 UTC
@Mike

No need to change the tag; it's openshift-ansible's responsibility to update it. Comment 14 shows the ASB status so that the developers can find the root cause faster.

Comment 17 Zihan Tang 2018-07-11 02:59:57 UTC
@Mike, I don't manually change `ansible_service_broker_image_tag` when upgrading.
Can you attach the upgrade log from #comment 10?

Comment 18 DeShuai Ma 2018-07-11 03:34:47 UTC
In the controllers log I find "Failed to schedule pod: openshift-ansible-service-broker/asb-2-deploy", so the new deployment was cancelled. Could you set the log level higher to catch why scheduling failed?

I0710 12:35:12.291694       1 event.go:218] Event(v1.ObjectReference{Kind:"DeploymentConfig", Namespace:"openshift-ansible-service-broker", Name:"asb", UID:"b06ed25c-843d-11e8-91d9-0205f8c4d8c0", APIVersion:"apps.openshift.io", ResourceVersion:"19243", FieldPath:""}): type: 'Normal' reason: 'DeploymentCreated' Created new replication controller "asb-2" for version 2
I0710 12:35:12.305921       1 scheduler.go:191] Failed to schedule pod: openshift-ansible-service-broker/asb-2-deploy
I0710 12:35:12.306075       1 factory.go:1375] Updating pod condition for openshift-ansible-service-broker/asb-2-deploy to (PodScheduled==False)
I0710 12:35:12.310625       1 scheduler.go:191] Failed to schedule pod: openshift-ansible-service-broker/asb-2-deploy
I0710 12:35:12.311189       1 factory.go:1375] Updating pod condition for openshift-ansible-service-broker/asb-2-deploy to (PodScheduled==False)
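
For reference, one way to get more verbose controller logging for a reproduction run (this is an assumption based on the standard openshift-ansible inventory variables, not something verified against this cluster) is to raise the debug level in the inventory before running the install/upgrade:

    [OSEv3:vars]
    # raise the log verbosity of the installed components; 5 is debug-level logging
    debug_level=5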

Comment 19 Mike Fiedler 2018-07-11 15:14:13 UTC
Reproduced on a new cluster with loglevel=5 and gathered the controller log and the corresponding ansible log. The cluster will be available until tomorrow. The cluster in comment 12 is no longer available.

The new master which can be used  for investigation is:  ec2-34-220-164-240.us-west-2.compute.amazonaws.com

I don't see the "Failed to schedule" message in these logs.   The only mentions of asb-2 are:

oc logs master-controllers-ip-172-31-28-243.us-west-2.compute.internal  -n kube-system | grep asb-2
I0711 15:02:10.280663       1 graph_builder.go:603] GraphBuilder process object: v1/ReplicationController, namespace openshift-ansible-service-broker, name asb-2, uid 26fcdb66-851b-11e8-8e58-025c2034c908, event type add                                                              
I0711 15:02:10.327809       1 controller_utils.go:193] Controller openshift-ansible-service-broker/asb-2 either never recorded expectations, or the ttl expired.                                                                                                                         
I0711 15:02:10.327885       1 replica_set.go:575] Finished syncing ReplicationController "openshift-ansible-service-broker/asb-2" (106.189µs)
I0711 15:02:10.375047       1 graph_builder.go:603] GraphBuilder process object: events.k8s.io/v1beta1/Event, namespace openshift-ansible-service-broker, name asb-2-deploy.1540588b9daf8cae, uid 2700c81a-851b-11e8-8e58-025c2034c908, event type add                                   
I0711 15:02:10.375066       1 graph_builder.go:603] GraphBuilder process object: events.k8s.io/v1beta1/Event, namespace openshift-ansible-service-broker, name asb-2-deploy.1540589075e58279, uid 3367a6b6-851b-11e8-8e58-025c2034c908, event type add

Comment 20 Mike Fiedler 2018-07-11 15:14:49 UTC
Created attachment 1458137 [details]
controller log at level 5, ansible -vvv from upgrade, inventory

Comment 22 Fabian von Feilitzsch 2018-07-16 18:03:44 UTC
I noticed this line in the events log:

asb-2-deploy.154002094eccc66a               Pod                                                   Warning   FailedScheduling                 default-scheduler                                      0/5 nodes are available: 5 node(s) didn't match node selector.

It looks like you set the nodeSelector for TSB/several other services to region: infra, but I don't see one for ASB. The default nodeSelector for ASB in 3.10 is node-role.kubernetes.io/infra=true. 

Do you still see this issue if you add 


    ansible_service_broker_node_selector={"region": "infra"}

to your inventory?
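
For reference, a minimal sketch of how that looks in context in the inventory (the variable and value are taken from the suggestion above; the group header is the standard one, adjust to your layout):

    [OSEv3:vars]
    # schedule the ansible service broker onto the nodes labelled region=infra,
    # the same label this inventory uses for the other infra components
    ansible_service_broker_node_selector={"region": "infra"}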

Comment 23 Mike Fiedler 2018-07-17 20:08:10 UTC
Adding the node selector before the upgrade fixes things - asb is running after the upgrade.  Why would it require an explicit node selector after the upgrade if it was not using one in 3.9?  Shouldn't it just use the default cluster node selector?

Clearing TestBlocker and lowering sev.

Comment 24 Fabian von Feilitzsch 2018-07-17 21:20:06 UTC
I think the default node selector selects for `node-role.kubernetes.io/compute=true`, but since the broker is an infrastructure piece it wants to run on infra nodes. The default selector for infra nodes is `node-role.kubernetes.io/infra=true`, whereas your inventory sets the infra label to region=infra instead. You already set the selector for the other components in the configuration (the template broker, logging, router, and registry), so the behavior of asb is in line with those components. You can set the selector for most of these components with the variable

openshift_hosted_infra_selector
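
For reference, a sketch of how that might look in the inventory, using the region=infra label this cluster already applies to its infra nodes (the exact value depends on your labelling scheme):

    [OSEv3:vars]
    # run the hosted infra components (router, registry, brokers, ...) on the region=infra nodes
    openshift_hosted_infra_selector="region=infra"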

Comment 25 Fabian von Feilitzsch 2018-11-12 20:22:41 UTC
This seems to have been a configuration issue rather than a bug in the deployment logic. If you're still seeing this issue even after making the configuration update mentioned above, please reopen.

