Bug 1689836 - The OLM metrics should be queryable from the Prometheus UI
Summary: The OLM metrics should be queryable from the Prometheus UI
Keywords:
Status: MODIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Jeff Peeler
QA Contact: Jian Zhang
URL:
Whiteboard:
Duplicates: 1698530 1698533
Depends On:
Blocks:
 
Reported: 2019-03-18 09:24 UTC by Jian Zhang
Modified: 2019-04-16 03:19 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:



Description Jian Zhang 2019-03-18 09:24:08 UTC
Description of problem:
The OLM metrics are not queryable from the Prometheus UI. 

Version-Release number of selected component (if applicable):
Cluster version is 4.0.0-0.nightly-2019-03-15-063749
OLM commit:
               io.openshift.build.commit.id=840d806a3b20e5ebb7229631d0168864b1cfed12
               io.openshift.build.commit.url=https://github.com/operator-framework/operator-lifecycle-manager/commit/840d806a3b20e5ebb7229631d0168864b1cfed12
               io.openshift.build.source-location=https://github.com/operator-framework/operator-lifecycle-manager


How reproducible:
always

Steps to Reproduce:
1. Install the OCP 4.0
2. Log in to the Web console as the kubeadmin user, click "Monitoring" -> "Metrics", log in as the kubeadmin user, then query the metrics below (a CLI sketch of the same query follows the list):
install_plan_count
subscription_count
catalog_source_count
csv_count 
csv_upgrade_count
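
For reference, the same query can also be issued from the CLI against the Prometheus HTTP API. A minimal sketch, assuming the default openshift-monitoring stack and its prometheus-k8s route (adjust names to your cluster):

# Fetch a bearer token and the Prometheus route host, then query one of the metrics.
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query?query=csv_count"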


Actual results:
Got nothing.

Expected results:
The OLM metrics should be queryable from the Prometheus UI.

Additional info:

Comment 3 Jian Zhang 2019-04-04 06:02:00 UTC
Verification failed.
OLM version:  io.openshift.build.commit.id=9ba3512c5406b62179968e2432b284e9a30c321e

1. I didn't find any metrics by following the steps described in the original description.
2. I tried to `curl` the metrics port but got nothing, as shown below:

[jzhang@dhcp-140-18 444]$ oc get pods
NAME                                READY   STATUS    RESTARTS   AGE
catalog-operator-7db68c98fb-p4nss   1/1     Running   0          165m
olm-operator-5f7cfb8cdc-hljzb       1/1     Running   0          165m
olm-operators-fz8mb                 1/1     Running   0          163m
packageserver-c96b4d7b7-swcnx       1/1     Running   0          162m
packageserver-c96b4d7b7-wvpkh       1/1     Running   0          163m

[jzhang@dhcp-140-18 444]$ oc port-forward catalog-operator-7db68c98fb-p4nss 8081
Forwarding from 127.0.0.1:8081 -> 8081
Forwarding from [::1]:8081 -> 8081
Handling connection for 8081
E0404 13:58:05.171540   25765 portforward.go:331] an error occurred forwarding 8081 -> 8081: error forwarding port 8081 to pod f2856102d5b1107e3b3f4fb6c28d29f185d8149c2a05469c4425068b630be8de, uid : exit status 1: 2019/04/04 05:58:05 socat[12494] E connect(5, AF=2 127.0.0.1:8081, 16): Connection refused

[jzhang@dhcp-140-18 ~]$ curl -k http://localhost:8081/metrics -v
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 8081 (#0)
> GET /metrics HTTP/1.1
> Host: localhost:8081
> User-Agent: curl/7.59.0
> Accept: */*
> 
* Empty reply from server
* Connection #0 to host localhost left intact
curl: (52) Empty reply from server

^C[jzhang@dhcp-140-18 444]$ oc port-forward olm-operator-5f7cfb8cdc-hljzb 8081
Forwarding from 127.0.0.1:8081 -> 8081
Forwarding from [::1]:8081 -> 8081
Handling connection for 8081
E0404 14:01:10.845859   25782 portforward.go:331] an error occurred forwarding 8081 -> 8081: error forwarding port 8081 to pod 9f9967270f5ae39ffe5dd59369fe9c55a60354d28df1dd77643d500ef2a47a21, uid : exit status 1: 2019/04/04 06:01:10 socat[12363] E connect(5, AF=2 127.0.0.1:8081, 16): Connection refused
[jzhang@dhcp-140-18 ~]$ curl -k http://localhost:8081/metrics -v
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 8081 (#0)
> GET /metrics HTTP/1.1
> Host: localhost:8081
> User-Agent: curl/7.59.0
> Accept: */*
> 
* Empty reply from server
* Connection #0 to host localhost left intact
curl: (52) Empty reply from server

Comment 4 Jeff Peeler 2019-04-08 20:16:31 UTC
The implementation has changed since the first metrics addition, so curling an HTTP endpoint is not going to work. Shouldn't the test be more targeted at looking for the applicable OLM metrics in Prometheus, since that was the purpose of the latest change? If you want to do a sanity check to verify that metrics are being served at the pod level, curling over HTTPS is what you need to do.

I assume that port forward attempt was unsuccessful given that you got a connection refused error message. I'd try to forward like `oc port-forward -p mypod :8081` just to ensure that there's not a local port conflict in play here.
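
A minimal sketch of that sanity check (pod name taken from comment 3; exact flags may differ across oc versions):

# Forward a random local port to the pod's metrics port to avoid a local port conflict.
oc port-forward catalog-operator-7db68c98fb-p4nss :8081
# In a second terminal, replace <LOCAL_PORT> with the port printed above and curl over HTTPS:
curl -k https://localhost:<LOCAL_PORT>/metrics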

Comment 5 Jian Zhang 2019-04-10 08:43:38 UTC
Jeff,

Thanks for the information. I logged in to the Web console and still get the same errors, as below:
Click "Monitoring" -> "Metrics" -> "Log in with OpenShift" -> "Status" -> "Targets":

The status of the "openshift-operator-lifecycle-manager/catalog-operator" and "openshift-operator-lifecycle-manager/olm-operator" are DOWN. Errors:
Get https://10.128.0.3:8081/metrics: dial tcp 10.128.0.3:8081: connect: connection refused
Get https://10.128.0.7:8081/metrics: dial tcp 10.128.0.7:8081: connect: connection refused
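
A minimal sketch of checking what Prometheus is actually configured to scrape for these targets (this assumes OLM is discovered through ServiceMonitor objects; adjust to your setup):

# List the ServiceMonitors and the endpoints they resolve to in the OLM namespace.
oc -n openshift-operator-lifecycle-manager get servicemonitors
oc -n openshift-operator-lifecycle-manager get endpoints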

> If you want to do a sanity check to verify metrics are being served at the pod level, curling over https is what you need to do.
Thanks! Got it, but it seems the olm-operator/catalog-operator cannot handle the SSL connection, as shown below:
[jzhang@dhcp-140-18 ocp410]$ oc port-forward olm-operator-7bd6c84b68-5tkgm 8081
Forwarding from 127.0.0.1:8081 -> 8081
Forwarding from [::1]:8081 -> 8081
Handling connection for 8081
E0410 16:38:56.209774    3743 portforward.go:331] an error occurred forwarding 8081 -> 8081: error forwarding port 8081 to pod 03e2e7c8663cd342e5dcc82e23785bf75a1cfc63633bb851b79ea73fcd571d24, uid : exit status 1: 2019/04/10 08:38:56 socat[31905] E connect(5, AF=2 127.0.0.1:8081, 16): Connection refused
Handling connection for 8081

[jzhang@dhcp-140-18 ~]$ curl -k https://localhost:8081/metrics -v
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 8081 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* ignoring certificate verify locations due to disabled peer verification
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to localhost:8081 
* stopped the pause stream!
* Closing connection 0
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to localhost:8081 

> I'd try to forward like `oc port-forward -p mypod :8081` just to ensure that there's not a local port conflict in play here.

Thanks for your suggestion. However, there is no `-p` flag in my `oc port-forward`; maybe I need to update my `oc` client.
[jzhang@dhcp-140-18 ocp410]$ oc port-forward -p catalog-operator-6b65b948bf-8fx58 :8081
Error: unknown shorthand flag: 'p' in -p


Usage:
  oc port-forward TYPE/NAME [LOCAL_PORT:]REMOTE_PORT [...[LOCAL_PORT_N:]REMOTE_PORT_N] [flags]

Examples:
  # Listens on ports 5000 and 6000 locally, forwarding data to/from ports 5000 and 6000 in the pod
  oc port-forward mypod 5000 6000
  
  # Listens on port 8888 locally, forwarding to 5000 in the pod
  oc port-forward mypod 8888:5000
  
  # Listens on a random port locally, forwarding to 5000 in the pod
  oc port-forward mypod :5000
  
  # Listens on a random port locally, forwarding to 5000 in the pod
  oc port-forward mypod 0:5000

Options:
      --pod-running-timeout=1m0s: The length of time (like 5s, 2m, or 3h, higher than zero) to wait until at least one pod is running

Use "oc options" for a list of global command-line options (applies to all commands).

[jzhang@dhcp-140-18 ocp410]$ oc version
oc v4.0.0-0.177.0
kubernetes v1.12.4+6a9f178753
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://api.jian-410.qe.devcluster.openshift.com:6443
kubernetes v1.12.4+0ba401e

Comment 6 Jian Zhang 2019-04-12 02:34:37 UTC
*** Bug 1698530 has been marked as a duplicate of this bug. ***

Comment 7 Jian Zhang 2019-04-12 02:36:17 UTC
*** Bug 1698533 has been marked as a duplicate of this bug. ***

Comment 8 Frederic Branczyk 2019-04-12 18:39:10 UTC
As https://bugzilla.redhat.com/show_bug.cgi?id=1698530 is closed as a dupe of this, I just want to make sure that the expectation is that both olm-operator and catalog-operator targets are shown as UP in Prometheus when this bug is verified. Thanks :)

Comment 9 Jeff Peeler 2019-04-12 19:06:15 UTC
The PR to fix this issue is here:
https://github.com/operator-framework/operator-lifecycle-manager/pull/809

Will set to modified once it merges.

I too encourage QE to test using Prometheus itself for final verification.
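
For that final verification, a minimal sketch of checking target health through the Prometheus API (this assumes the prometheus-k8s route in openshift-monitoring and jq on the client):

TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')
# Both OLM targets should report health "up" once the fix is in place.
curl -sk -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/targets" \
  | jq '.data.activeTargets[] | select(.labels.namespace == "openshift-operator-lifecycle-manager") | {job: .labels.job, health: .health}'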

