Bug 1515447 - [3.6] Aggregated Logging replacing all log levels with '3' and '6' after upgrade to 3.5 from 3.4
Summary: [3.6] Aggregated Logging replacing all log levels with '3' and '6' after upgrade to 3.5 from 3.4
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.6.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 3.6.z
Assignee: Rich Megginson
QA Contact: Anping Li
URL:
Whiteboard:
Depends On: 1514110
Blocks: 1515448
 
Reported: 2017-11-20 19:39 UTC by Rich Megginson
Modified: 2018-10-08 12:34 UTC
6 users

Fixed In Version: openshift3/logging-fluentd:v3.6.173.0.94-2
Doc Type: Bug Fix
Doc Text:
Cause: Fluentd was adding a `level` field with a value of 3 or 6, overwriting any existing `level` field.
Consequence: The `level` field set by the application was removed, and the '3' or '6' value was not useful.
Fix: If the record already has a `level` field, check whether it is a close match to one of the canonical `level` values listed at https://github.com/ViaQ/elasticsearch-templates/blob/master/namespaces/_default_.yml#L63. For example, if `level` is "CRITICAL", convert it to "crit"; if it is "WARN", convert it to "warning"; and so on. If the value cannot be used directly or normalized, convert it to its string representation (Ruby `to_s` method) and store that string in the `level` field.
Result: If the record already has a `level` field, the value is normalized or preserved; otherwise, a value such as `info` or `err` is derived from the stream.
Clone Of: 1514110
: 1515448
Environment:
Last Closed: 2018-10-08 12:34:04 UTC



Description Rich Megginson 2017-11-20 19:39:29 UTC
+++ This bug was initially created as a clone of Bug #1514110 +++

Description of problem:
After upgrading the cluster from 3.4 to 3.5, all message log levels (as seen in Kibana/Elasticsearch) are either '3' or '6'. This makes solving issues harder, as filtering on errors etc. is now not possible in Kibana.

Version-Release number of selected component (if applicable):
openshift v3.5.5.31.36
kubernetes v1.5.2+43a9be4
openshift3/logging-elasticsearch:v3.5
openshift3/logging-fluentd:v3.5
openshift3/logging-kibana:v3.5

How reproducible:
Unconfirmed

Steps to Reproduce:
Unconfirmed

Actual results:
Messages contain 3 or 6 (numerals) as log levels.

Expected results:
Messages should contain (at least) the usual INFO, WARN, ERROR levels.

Additional info:

--- Additional comment from Rich Megginson on 2017-11-16 16:43:31 EST ---

In 3.6 and later we are strictly enforcing the use of the `level` field as defined in our data model: https://github.com/ViaQ/elasticsearch-templates/blob/master/namespaces/_default_.yml#L63

"Logging level as provided by: rsyslog(severitytext property), python's
        logging module, etc.
        Possible values are as listed here: http://sourceware.org/git/?p=glibc.git;a=blob;f=misc/sys/syslog.h;h=ee01478c4b19a954426a96448577c5a76e6647c0;hb=HEAD#l74 plus `trace` and `unknown`
        That is: alert crit debug emerg err info notice trace unknown warning
"

systemd PRIORITY levels are mapped here: https://github.com/ViaQ/fluent-plugin-viaq_data_model/blob/master/lib/fluent/plugin/filter_viaq_data_model_systemd.rb#L107

basically, map 1, 2, 3, ... to alert, crit, err, ...
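That numeric mapping can be sketched as a small Ruby table (illustrative only; the actual table lives in filter_viaq_data_model_systemd.rb, linked above):

```ruby
# syslog/systemd PRIORITY values (0-7) mapped to canonical ViaQ level names.
# This is a sketch of the mapping, not the plugin's actual code.
PRIORITY_TO_LEVEL = {
  0 => 'emerg',   1 => 'alert',  2 => 'crit', 3 => 'err',
  4 => 'warning', 5 => 'notice', 6 => 'info', 7 => 'debug'
}.freeze

def priority_to_level(priority)
  PRIORITY_TO_LEVEL.fetch(priority.to_i, 'unknown')
end
```

Note that under this table the numerals from the bug title line up with syslog severities: 3 is err and 6 is info.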

container json-file levels are mapped here: https://github.com/ViaQ/fluent-plugin-viaq_data_model/blob/master/lib/fluent/plugin/filter_viaq_data_model.rb#L247

This just maps stream=stdout to level=info and stream=stderr to level=err

It will overwrite any existing level field.

The problem is that we have to strictly enforce the data model, which means we must force the `level` field to be one of the values listed above.  If we don't strictly enforce the `level` field, then it just becomes a dumping ground for any value, and correlation of the `level` field across different applications becomes impossible.

We could do this: If the level field is set, convert to lower case.  If it matches one of the canonical values, use it.  We can also convert "error" to "err", "panic" to "emerg", "warn" to "warning" - that is, do some simple normalization.  If the value does not match, then we just use it as provided by the application, and the user is responsible for either doing some sort of post processing, or otherwise determining what is the meaning of the `level` field across different applications.
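That normalization could look roughly like the following Ruby sketch (the method name and alias table are illustrative, not the plugin's actual code):

```ruby
# Canonical `level` values from the ViaQ data model.
CANONICAL_LEVELS = %w[alert crit debug emerg err info notice trace unknown warning].freeze

# Simple alias normalization for "close" matches.
LEVEL_ALIASES = {
  'error' => 'err', 'panic' => 'emerg', 'warn' => 'warning', 'critical' => 'crit'
}.freeze

# Lowercase the value; use it if it is canonical, map known aliases,
# otherwise keep the application-provided value as-is.
def normalize_level(value)
  lowered = value.to_s.downcase
  return lowered if CANONICAL_LEVELS.include?(lowered)
  LEVEL_ALIASES.fetch(lowered, value.to_s)
end
```

So "WARN" would become "warning" and "INFO" would become "info", while an unrecognized value like "MYLEVEL" would pass through untouched.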

--- Additional comment from  on 2017-11-16 18:35:22 EST ---

Interesting. I also found this line in our fluentd configuration (/etc/fluent/configs.d/openshift/filter-k8s-record-transform.conf):

    level ${record['stream'] == 'stdout' ? 6 : 3}

which seems to relate to the problem at hand.

At the moment we have no plans for the use of custom values in the 'level' field. Getting the usual container logging levels working would be nice, and the matching/conversion sound good. What is the proposed method and timeline for the fix?

--- Additional comment from Rich Megginson on 2017-11-16 20:22:39 EST ---

(In reply to antti.niskanen from comment #2)
> Interesting. I also found this line in our fluentd configuration
> (/etc/fluent/configs.d/openshift/filter-k8s-record-transform.conf):
> 
>     level ${record['stream'] == 'stdout' ? 6 : 3}
> 
> which seems to relate to the problem at hand.

Yes.  That is the problematic code in 3.5, which was replaced with https://bugzilla.redhat.com/show_bug.cgi?id=1514110#c1 in 3.6 and later.
 
> At the moment we have no plans for the use of custom values in the 'level'
> field. Getting the usual container logging levels working would be nice, and
> the matching/conversion sound good. What is the proposed method and timeline
> for the fix?

The only level data we get for container logs is the "stream" field - "stdout" means "info" and "stderr" means "err".  There is no other log level information.  Would you be OK if we did something like this?

     level ${record['stream'] == 'stdout' ? 'info' : 'err'}

--- Additional comment from  on 2017-11-17 05:30:55 EST ---

(In reply to Rich Megginson from comment #3)
> The only level data we get for container logs is the "stream" field -
> "stdout" means "info" and "stderr" means "err".  There is no other log level
> information.

Now I kind of understand what you mean. For example, our users' applications write logs like this:

{"log":"{\"@timestamp\":\"2017-11-08T21:43:16.802+00:00\",\"@version\":1,\"message\":\"Checking mq connection failed: java.util.concurrent.TimeoutException\",\"logger_name\":\"checkActiveMQ\",\"thread_name\":\"qtp1681838063-40\",\"level\":\"WARN\",\"level_value\":30000}\n","stream":"stdout","time":"2017-11-08T21:43:16.803202093Z"}

Previously, the 'level' inside the message has been shown in Kibana (how...?), but now the 'stream' is the only defining factor.

Using 'stream' as you suggested would still not work as the warnings/errors seem to be mostly stdout.

Would it be possible to check for the message contents (JSON) for 'level' and use that as-is or convert/normalize if certain conditions are met?

--- Additional comment from Rich Megginson on 2017-11-17 09:58:57 EST ---

(In reply to antti.niskanen from comment #4)
> (In reply to Rich Megginson from comment #3)
> > The only level data we get for container logs is the "stream" field -
> > "stdout" means "info" and "stderr" means "err".  There is no other log level
> > information.
> 
> Now I kind of understand what you mean. For example, our users' applications
> write logs like this:
> 
> {"log":"{\"@timestamp\":\"2017-11-08T21:43:16.802+00:00\",\"@version\":1,
> \"message\":\"Checking mq connection failed:
> java.util.concurrent.TimeoutException\",\"logger_name\":\"checkActiveMQ\",
> \"thread_name\":\"qtp1681838063-40\",\"level\":\"WARN\",\"level_value\":
> 30000}\n","stream":"stdout","time":"2017-11-08T21:43:16.803202093Z"}
> 
> Previously, the 'level' inside the message has been shown in Kibana
> (how...?),

Because the kubernetes metadata filter has a side effect - if the value of the "log" field looks like a JSON string, it will parse it and put the fields into the record:

https://github.com/openshift/origin-aggregated-logging/blob/release-1.5/fluentd/configs.d/openshift/filter-k8s-meta.conf

https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter#configuration

merge_json_log - merge logs in JSON format as top level keys (default: true)

preserve_json_log - preserve JSON logs in raw form in the log key, only used if the previous option is true (default: true)

which means, in the above case, the fields @timestamp, @version, message, logger_name, thread_name, level and level_value are added to the record.
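A simplified sketch of that side effect (this is an illustration of the merge_json_log/preserve_json_log behavior, not the filter plugin's actual code):

```ruby
require 'json'

# If the 'log' field parses as a JSON object, add its keys at the top level
# of the record (merge_json_log); the raw 'log' field itself is kept
# (preserve_json_log). Key-collision behavior here is an assumption.
def merge_json_log(record)
  parsed = JSON.parse(record['log']) rescue nil
  parsed.is_a?(Hash) ? record.merge(parsed) : record
end

record = {
  'log'    => '{"message":"Checking mq connection failed","level":"WARN","level_value":30000}',
  'stream' => 'stdout',
  'time'   => '2017-11-08T21:43:16.803202093Z'
}
merged = merge_json_log(record)
# merged now contains top-level 'message', 'level', and 'level_value'
# fields in addition to 'log', 'stream', and 'time'.
```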

> but now the 'stream' is the only defining factor.
> 
> Using 'stream' as you suggested would still not work as the warnings/errors
> seem to be mostly stdout.
> 
> Would it be possible to check for the message contents (JSON) for 'level'
> and use that as-is or convert/normalize if certain conditions are met?

Yes, I think that is the correct approach:

if level field exists
  if level.lower looks like one of our canonical error levels
    record.level = canonical error level
  else
    preserve existing value
  fi
else
  record.level = info or err from stream
fi

In this case, that means level: INFO would be converted to level: info.
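Putting the pieces together, the proposed logic might look like this in Ruby (a sketch under the assumptions above; the real change would land in fluent-plugin-viaq_data_model):

```ruby
CANONICAL = %w[alert crit debug emerg err info notice trace unknown warning].freeze
ALIASES   = { 'warn' => 'warning', 'error' => 'err', 'panic' => 'emerg' }.freeze

# Normalize an existing level field, or derive one from the container stream.
def assign_level(record)
  if record.key?('level')
    lowered = record['level'].to_s.downcase
    record['level'] =
      if CANONICAL.include?(lowered)
        lowered                                        # e.g. INFO -> info
      else
        ALIASES.fetch(lowered, record['level'].to_s)   # preserve unknown values
      end
  else
    # No level field: fall back to the container stream.
    record['level'] = record['stream'] == 'stdout' ? 'info' : 'err'
  end
  record
end
```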

Comment 1 Rich Megginson 2017-12-11 22:26:10 UTC
logging-fluentd-docker-v3.6.173.0.83-3

Comment 2 Rich Megginson 2017-12-11 22:33:54 UTC
This is fixed in logging-fluentd-docker-v3.6.173.0.83-2

which is included in https://errata.devel.redhat.com/advisory/31698

RHBA-2017:31698-04 OpenShift Container Platform 3.6.z and 3.5.z image update

Comment 3 Anping Li 2017-12-20 11:16:31 UTC
Failed with both logging-fluentd/images/v3.6.173.0.83-2 and logging-fluentd/images/v3.6.173.0.90-1

My test steps are as follows:
1) oc debug logging-kibana-1-rj99r
$ echo '{"message":"anli_container_message_loglevel_debug", "level":"debug"}'
2) search for the record in ES
oc exec -c elasticsearch $es_pod_name --  curl -s -XGET --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/_search?pretty -d '{ "query" : { "match" : { "message" : "anli_container_message_loglevel_debug" } } }'

3) the result is as follows

    "_source" : {
        "message" : "anli_container_message_loglevel_debug",
        "docker" : {
          "container_id" : "3c615787a6ac7f2f9e3a9dc18835d74f7af6c19ef7b5190c256c09f5cdea0126"
        },
        "kubernetes" : {
          "container_name" : "kibana",
          "namespace_name" : "logging",
          "pod_name" : "logging-kibana-1-rj99r-debug",
          "pod_id" : "5c683163-e576-11e7-acfe-fa163ea31fda",
          "host" : "host-8-241-0.host.centralci.eng.rdu2.redhat.com",
          "master_url" : "https://kubernetes.default.svc.cluster.local",
          "namespace_id" : "ea000949-e4b4-11e7-918b-fa163ea31fda"
        },
        "level" : "info",
        "hostname" : "host-xxxx.example.com,

Comment 4 Rich Megginson 2018-01-04 02:51:56 UTC
koji_builds:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=636402
repositories:
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:rhaos-3.6-rhel-7-docker-candidate-69601-20180104021016
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:latest
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.94
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6.173.0.94-2
  brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/logging-fluentd:v3.6

Comment 5 Anping Li 2018-01-08 06:07:48 UTC
Verified and passed in v3.6.173.0.95-1


"_index" : "project.install-test.504d5cc0-f426-11e7-bd8f-fa163e48c543.2018.01.08",
      "_type" : "com.redhat.viaq.common",
      "_id" : "AWDUWp6Ql6XcCvEMVCIR",
      "_score" : 3.7972813,
      "_source" : {
        "message" : "anli_container_message_loglevel_debug",
        "level" : "debug",
        "docker" : {
          "container_id" : "69037f7e99a5ea63da3514f38179de1b9d51c22def88ec3210151c7064d56343"
        },


  "_source" : {
        "message" : "anli_container_message_loglevel_warn",
        "level" : "warning",
        "docker" : {
          "container_id" : "69037f7e99a5ea63da3514f38179de1b9d51c22def88ec3210151c7064d56343"
        },


      "_source" : {
        "message" : "anli_container_message_loglevel_nolevel",
        "level" : "info",

