Bug 1688388 - [OCP 3.9] openshift_service_catalog install fails - wait to see that the apiserver is available
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.9.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: ---
Assignee: Casey Callendrello
QA Contact: Meng Bo
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-13 16:38 UTC by sperezto
Modified: 2019-03-15 08:59 UTC
CC List: 1 user

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-15 08:59:08 UTC
Target Upstream Version:



Description sperezto 2019-03-13 16:38:11 UTC
Description of problem:
The cluster is not able to resolve names under cluster.local unless the full suffix is given, and because of that two tasks fail in this part of "openshift-service-catalog/config.yml":

[Task1]
/usr/share/ansible/openshift-ansible/playbooks/openshift-service-catalog/private/roles/openshift_service_catalog/tasks/start_api_server.yml

# wait to see that the apiserver is available
- name: wait for api server to be ready
  uri:
    url: https://apiserver.kube-service-catalog.svc/healthz
    validate_certs: no
    return_content: yes
  environment:
    no_proxy: '*'
  register: api_health
  until: "'ok' in api_health.content"
  retries: 60
  delay: 10
  changed_when: false

[Task2]
/usr/share/ansible/openshift-ansible/playbooks/openshift-service-catalog/private/roles/template_service_broker/tasks/deploy.yml

# Check that the TSB is running
- name: Verify that TSB is running
  uri:
    url: https://apiserver.openshift-template-service-broker.svc/healthz
    validate_certs: no
    return_content: yes
  environment:
    no_proxy: '*'
  register: api_health
  until: "'ok' in api_health.content"
  retries: 60
  delay: 10
  changed_when: false
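Both checks can be reproduced from a master shell. The sketch below mirrors the uri tasks' retry logic (until/retries/delay), with -k standing in for validate_certs: no and --noproxy '*' for the no_proxy environment setting (a diagnostic sketch, not part of the installer):

```shell
#!/bin/sh
# Poll a healthz endpoint the way the Ansible uri tasks above do:
# retry until the response body contains "ok", or give up.
wait_for_healthz() {
    url=$1; retries=${2:-60}; delay=${3:-10}
    i=0
    while [ "$i" -lt "$retries" ]; do
        # -k ~ validate_certs: no; --noproxy '*' ~ no_proxy: '*'
        body=$(curl -ks --noproxy '*' "$url" 2>/dev/null)
        case $body in
            *ok*) echo "ready: $url"; return 0 ;;
        esac
        i=$((i + 1))
        sleep "$delay"
    done
    echo "timed out after $retries tries: $url" >&2
    return 1
}

# On a master:
#   wait_for_healthz https://apiserver.kube-service-catalog.svc/healthz
```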

Version-Release number of the following components:

openshift-ansible-roles-3.9.68-1.git.0.0c424ac.el7.noarch
ansible-2.4.6.0-1.el7ae.noarch
openshift-ansible-docs-3.9.68-1.git.0.0c424ac.el7.noarch
openshift-ansible-3.9.68-1.git.0.0c424ac.el7.noarch
openshift-ansible-playbooks-3.9.68-1.git.0.0c424ac.el7.noarch


[sperezto@master ~]$ oc version
oc v3.9.68
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://master.lab.example.com:443
openshift v3.9.68
kubernetes v1.9.1+a0ce1bc657
[sperezto@master ~]$ ansible --version
ansible 2.4.6.0
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/sperezto/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Sep 12 2018, 05:31:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]



How reproducible:

Steps to Reproduce:
1. RHEL 7.6 on KVM
2. IDM as DNS
3. # yum install atomic-openshift-utils
3.1.- hosts.yml
[OSEv3:children]
masters
nodes
etcd

# Set variables common for all OSEv3 hosts
[OSEv3:vars]
# SSH user, this user should allow ssh based auth without requiring a password
ansible_ssh_user=sperezto

# If ansible_ssh_user is not root, ansible_become must be set to true
ansible_become=true

# General Cluster Variables
openshift_deployment_type=openshift-enterprise
#oreg_url=lab.example.com/openshift3/ose-${component}:${version}
#openshift_examples_modify_imagestreams=true

openshift_deployment_type=openshift-enterprise
openshift_release=v3.9
openshift_image_tag=v3.9.68
openshift_pkg_version=-3.9.68
openshift_disable_check=disk_availability,docker_storage,memory_availability

# Cluster Authentication Variables
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/origin/master/htpasswd'}]
openshift_master_htpasswd_users={'admin':'$apr1$p56igIQy$P4X3f0pWUg3R79ipUyCZw.', 'developer': '$apr1$xIvLz.UM$XWOpa2S/lNAr7hq2NkvoV/'}

#OpenShift Networking Variables
#os_firewall_use_firewalld=true
openshift_master_api_port=443
openshift_master_console_port=443
openshift_master_default_subdomain=apps.lab.example.com
#openshift_master_cluster_public_hostname=master.lab.example.com
#openshift_master_cluster_hostname=master.lab.example.com
#openshift_override_hostname_check=true

## NFS is an unsupported configuration
openshift_enable_unsupported_configurations=true
openshift_hosted_registry_storage_kind=nfs
openshift_hosted_registry_storage_access_modes=['ReadWriteMany']
openshift_hosted_registry_storage_host=nfs.lab.example.com
openshift_hosted_registry_storage_nfs_directory=/var/nfsshare
openshift_hosted_registry_storage_volume_name=registry
openshift_hosted_registry_storage_volume_size=10Gi

#OAB's etcd configuration variables
openshift_hosted_etcd_storage_kind=nfs
openshift_hosted_etcd_storage_nfs_directory=/var/nfsshare
openshift_hosted_etcd_storage_host=nfs.lab.example.com
openshift_hosted_etcd_storage_volume_name=etcd-vol2
openshift_hosted_etcd_storage_access_modes=["ReadWriteOnce"]
openshift_hosted_etcd_storage_volume_size=1G
openshift_hosted_etcd_storage_labels={'storage': 'etcd'}

#######################
## Metrics system
########################
openshift_metrics_install_metrics=true
openshift_metrics_storage_kind=nfs
openshift_metrics_storage_access_modes=['ReadWriteOnce']
openshift_metrics_storage_nfs_directory=/var/nfsshare
openshift_metrics_storage_host=nfs.lab.example.com
openshift_metrics_storage_volume_name=metrics
openshift_metrics_storage_volume_size=2Gi
openshift_metrics_heapster_requests_memory=300M
openshift_metrics_hawkular_requests_memory=750M
openshift_metrics_cassandra_requests_memory=750M
openshift_metrics_duration=2
openshift_metrics_hawkular_nodeselector={'region': 'infra'}
openshift_metrics_cassandra_nodeselector={'region': 'infra'}
openshift_metrics_heapster_nodeselector={'region': 'infra'}

# host group for masters
[masters]
master.lab.example.com

# host group for etcd
[etcd]
master.lab.example.com

# host group for nodes, includes region info
[nodes]
master.lab.example.com  openshift_ip=192.168.122.20 
infra.lab.example.com   openshift_ip=192.168.122.21 openshift_node_labels="{'region': 'infra', 'zone': 'default'}"
node1.lab.example.com   openshift_ip=192.168.122.22 openshift_node_labels="{'region': 'primary', 'zone': 'bcn'}"
node2.lab.example.com   openshift_ip=192.168.122.23 openshift_node_labels="{'region': 'primary', 'zone': 'mad'}"

#[nfs]
#nfs.lab.example.com ansible_host=192.168.122.13

4.- ansible-playbook -i ~/hosts.yml prerequisites.yml
5.- ansible-playbook -i ~/hosts.yml deploy.yml

Actual results:

TASK [openshift_service_catalog : Label master.lab.example.com for APIServer and controller deployment] **************
ok: [master.lab.example.com]

TASK [openshift_service_catalog : wait for api server to be ready] ***************************************************
FAILED - RETRYING: wait for api server to be ready (60 retries left).
FAILED - RETRYING: wait for api server to be ready (59 retries left).
FAILED - RETRYING: wait for api server to be ready (58 retries left).
FAILED - RETRYING: wait for api server to be ready (57 retries left).
FAILED - RETRYING: wait for api server to be ready (56 retries left).

FAILED - RETRYING: wait for api server to be ready (5 retries left).
FAILED - RETRYING: wait for api server to be ready (4 retries left).
FAILED - RETRYING: wait for api server to be ready (3 retries left).
FAILED - RETRYING: wait for api server to be ready (2 retries left).
FAILED - RETRYING: wait for api server to be ready (1 retries left).
fatal: [master.lab.example.com]: FAILED! => {"attempts": 60, "changed": false, "content": "", "failed": true, "msg": "Status code was not [200]: Request failed: <urlopen error [Errno -2] Name or service not known>", "redirected": false, "status": -1, "url": "https://apiserver.kube-service-catalog.svc/healthz"}
 [WARNING]: Could not create retry file '/usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.retry'.
[Errno 13] Permission denied: u'/usr/share/ansible/openshift-ansible/playbooks/deploy_cluster.retry'


PLAY RECAP ***********************************************************************************************************
infra.lab.example.com      : ok=138  changed=52   unreachable=0    failed=0   
localhost                  : ok=12   changed=0    unreachable=0    failed=0   
master.lab.example.com     : ok=756  changed=273  unreachable=0    failed=1   
node1.lab.example.com      : ok=138  changed=52   unreachable=0    failed=0   
node2.lab.example.com      : ok=138  changed=52   unreachable=0    failed=0   


INSTALLER STATUS *****************************************************************************************************
Initialization             : Complete (0:00:45)
Health Check               : Complete (0:03:04)
etcd Install               : Complete (0:02:59)
Master Install             : Complete (0:04:59)
Master Additional Install  : Complete (0:05:19)
Node Install               : Complete (0:13:52)
Hosted Install             : Complete (0:02:26)
Web Console Install        : Complete (0:01:08)
Metrics Install            : Complete (0:02:37)
Service Catalog Install    : In Progress (0:11:40)
	This phase can be restarted by running: playbooks/openshift-service-catalog/config.yml

[sperezto@master ~]$ oc get hostsubnet 
NAME                     HOST                     HOST IP          SUBNET          EGRESS IPS
infra.lab.example.com    infra.lab.example.com    192.168.122.21   10.128.0.0/23   []
master.lab.example.com   master.lab.example.com   192.168.122.20   10.131.0.0/23   []
node1.lab.example.com    node1.lab.example.com    192.168.122.22   10.129.0.0/23   []
node2.lab.example.com    node2.lab.example.com    192.168.122.23   10.130.0.0/23   []

[sperezto@master ~]$ oc get pods -o wide --all-namespaces
NAMESPACE               NAME                          READY     STATUS    RESTARTS   AGE       IP               NODE
default                 docker-registry-1-sgpd4       1/1       Running   0          13m       10.128.0.4       infra.lab.example.com
default                 registry-console-1-9ljz8      1/1       Running   0          11m       10.131.0.4       master.lab.example.com
default                 router-1-cgw6t                1/1       Running   0          13m       192.168.122.21   infra.lab.example.com
kube-service-catalog    apiserver-d7tf4               1/1       Running   0          8m        10.131.0.5       master.lab.example.com
kube-service-catalog    controller-manager-5thrd      1/1       Running   0          8m        10.131.0.6       master.lab.example.com
openshift-infra         hawkular-cassandra-1-g469n    1/1       Running   0          9m        10.128.0.7       infra.lab.example.com
openshift-infra         hawkular-metrics-s7fm8        0/1       Pending   0          9m        <none>           <none>
openshift-infra         heapster-nddrk                0/1       Running   1          9m        10.128.0.8       infra.lab.example.com
openshift-web-console   webconsole-78779f6ff7-msc8w   1/1       Running   1          12m       10.131.0.3       master.lab.example.com

[sperezto@master ~]$ curl -k https://10.131.0.5:6443
{
  "paths": [
    "/apis",
    "/apis/servicecatalog.k8s.io",
    "/apis/servicecatalog.k8s.io/v1beta1",
    "/apis/settings.servicecatalog.k8s.io",
    "/healthz",
    "/healthz/etcd",
    "/healthz/ping",
    "/healthz/poststarthook/generic-apiserver-start-informers",
    "/healthz/poststarthook/start-service-catalog-apiserver-informers",
    "/metrics",
    "/swaggerapi",
    "/version"
  ]
}

[sperezto@master ~]$ curl -k https://apiserver.kube-service-catalog.svc/healthz
curl: (6) Could not resolve host: apiserver.kube-service-catalog.svc; Unknown error

[sperezto@master ~]$ curl -k https://apiserver.kube-service-catalog.svc.cluster.local/healthz
ok


[sperezto@master ~]$ cat /etc/resolv.conf 
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
# Ansible managed

search cluster.local example.com lab.example.com
domain lab.example.com
nameserver 192.168.122.20

[sperezto@master ~]$ for i in master infra node1 node2;do host $i;done
master.lab.example.com has address 192.168.122.20
infra.lab.example.com has address 192.168.122.21
node1.lab.example.com has address 192.168.122.22
node2.lab.example.com has address 192.168.122.23
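One caveat when diagnosing this: host and nslookup implement their own search-list handling, while the installer's uri check resolves through glibc/NSS. getent exercises the same resolver path, so it is a better probe here (a diagnostic sketch):

```shell
# getent resolves through glibc/NSS -- the same path the installer's
# health check uses -- so it honors the search/domain directives in
# /etc/resolv.conf, unlike nslookup's own search-list logic.
resolves() { getent ahosts "$1" > /dev/null; }

# On a master, compare the short service name with the FQDN:
#   resolves apiserver.kube-service-catalog.svc                # fails in this setup
#   resolves apiserver.kube-service-catalog.svc.cluster.local  # succeeds
```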

Expected results:

TASK [Set Service Catalog install 'Complete'] ************************************************************************
ok: [master.lab.example.com]

PLAY RECAP ***********************************************************************************************************
infra.lab.example.com      : ok=0    changed=0    unreachable=0    failed=0   
localhost                  : ok=11   changed=0    unreachable=0    failed=0   
master.lab.example.com     : ok=127  changed=48   unreachable=0    failed=0   
node1.lab.example.com      : ok=0    changed=0    unreachable=0    failed=0   
node2.lab.example.com      : ok=0    changed=0    unreachable=0    failed=0   


INSTALLER STATUS *****************************************************************************************************
Initialization             : Complete (0:00:51)
Service Catalog Install    : Complete (0:07:05)

Additional info:

If we change the URL in both tasks, appending the domain cluster.local, the cluster deploys successfully:

https://apiserver.kube-service-catalog.svc/healthz --> https://apiserver.kube-service-catalog.svc.cluster.local/healthz
https://apiserver.openshift-template-service-broker.svc/healthz --> https://apiserver.openshift-template-service-broker.svc.cluster.local/healthz
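As a stopgap, that edit can be scripted (a sketch, assuming the default openshift-ansible paths quoted above; sed -i.bak keeps backups of both files):

```shell
# Append .cluster.local to the two health-check URLs in the tasks
# quoted above (default openshift-ansible install paths).
ROLES=/usr/share/ansible/openshift-ansible/playbooks/openshift-service-catalog/private/roles
for f in "$ROLES/openshift_service_catalog/tasks/start_api_server.yml" \
         "$ROLES/template_service_broker/tasks/deploy.yml"; do
    if [ -f "$f" ]; then
        sed -i.bak 's#\.svc/healthz#.svc.cluster.local/healthz#' "$f"
    fi
done
```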

Comment 1 Scott Dodson 2019-03-13 17:32:19 UTC
You should have 'cluster.local' added to the search path in /etc/resolv.conf, can you provide the contents of that file?

Comment 2 sperezto 2019-03-14 08:19:49 UTC
Yes it does; the installer added it. However, the name can't be resolved unless the lookup uses the FQDN. I already tried with options ndots:5, no luck.

[sperezto@master ~]$ cat /etc/resolv.conf 
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
search cluster.local example.com lab.example.com
domain lab.example.com
nameserver 192.168.122.20

[sperezto@master ~]$ host master
master.lab.example.com has address 192.168.122.20


[sperezto@master ~]$ nslookup -debug apiserver.kube-service-catalog.svc.cluster.local
Server:		192.168.122.20
Address:	192.168.122.20#53

------------
    QUESTIONS:
	apiserver.kube-service-catalog.svc.cluster.local, type = A, class = IN
    ANSWERS:
    ->  apiserver.kube-service-catalog.svc.cluster.local
	internet address = 172.30.117.162
	ttl = 30
    AUTHORITY RECORDS:
    ADDITIONAL RECORDS:
------------
Name:	apiserver.kube-service-catalog.svc.cluster.local
Address: 172.30.117.162

[sperezto@master ~]$ nslookup -debug apiserver.kube-service-catalog.svc
Server:		192.168.122.20
Address:	192.168.122.20#53

------------
    QUESTIONS:
	apiserver.kube-service-catalog.svc, type = A, class = IN
    ANSWERS:
    AUTHORITY RECORDS:
    ->  .
	origin = a.root-servers.net
	mail addr = nstld.verisign-grs.com
	serial = 2019031400
	refresh = 1800
	retry = 900
	expire = 604800
	minimum = 86400
	ttl = 10800
    ADDITIONAL RECORDS:
------------
** server can't find apiserver.kube-service-catalog.svc: NXDOMAIN
Server:		192.168.122.20
Address:	192.168.122.20#53

------------
    QUESTIONS:
	apiserver.kube-service-catalog.svc.lab.example.com, type = A, class = IN
    ANSWERS:
    AUTHORITY RECORDS:
    ->  lab.example.com
	origin = idm.lab.example.com
	mail addr = hostmaster.lab.example.com
	serial = 1552387174
	refresh = 3600
	retry = 900
	expire = 1209600
	minimum = 3600
	ttl = 3600
    ADDITIONAL RECORDS:
------------
** server can't find apiserver.kube-service-catalog.svc: NXDOMAIN

Comment 3 Scott Dodson 2019-03-14 15:14:14 UTC
It should have tried cluster.local first, before trying example.com. I don't know why that would happen based on the manpage for resolv.conf; perhaps this is a side effect of example.com being in both the search and domain directives? I'm moving this to the networking edge team, which is responsible for DNS integration.
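One likely explanation, per resolv.conf(5): the search and domain keywords are mutually exclusive, and when both appear the last one wins. In the file shown above, domain lab.example.com comes after the search line, so the cluster.local search list is silently discarded. A quick way to check which directive actually takes effect (diagnostic sketch):

```shell
# Print the search/domain directive that actually takes effect:
# resolv.conf(5) says these keywords are mutually exclusive and the
# last occurrence in the file wins.
effective_search() {
    awk '/^(search|domain)[ \t]/ { last = $0 } END { print last }' "$1"
}

# effective_search /etc/resolv.conf
#   -> "domain lab.example.com" for the file quoted above,
#      i.e. the search list is overridden
```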

Comment 4 sperezto 2019-03-15 08:17:07 UTC
I removed the "domain" parameter from /etc/resolv.conf and hit a different error: "curl: (7) Failed connect to apiserver.openshift-template-service-broker.svc:443; Connection refused". There is no mention of the domain parameter in the documentation, is there?

[sperezto@master ~]$ cat /etc/resolv.conf 
# nameserver updated by /etc/NetworkManager/dispatcher.d/99-origin-dns.sh
search cluster.local example.com lab.example.com
nameserver 192.168.122.20

https://docs.openshift.com/container-platform/3.9/install_config/install/prerequisites.html#dns-config-prereq

Thanks in advance.

Comment 5 sperezto 2019-03-15 08:59:08 UTC
It could not connect because there was not enough memory to start the service. Once I added memory, the installation worked like a charm!

