Bug 1515643 - when registering 60 hosts concurrently: "java.lang.Exception: null" and lots of "org.candlepin.audit.EventSinkImpl - Rolling back hornetq transaction."
Status: NEW
Alias: None
Product: Red Hat Satellite 6
Classification: Red Hat
Component: Candlepin
Version: 6.3.0
Hardware: Unspecified
OS: Unspecified
Importance: low
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: jcallaha
Whiteboard: scale_lab
Depends On: 1520496
Reported: 2017-11-21 07:02 UTC by Jan Hutař
Modified: 2019-04-16 03:28 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1520495 (view as bug list)
Last Closed:
Target Upstream Version:

Attachments: none

Description Jan Hutař 2017-11-21 07:02:18 UTC
Description of problem:
Candlepin's error.log is full of the messages below; sometimes there is a traceback. During the process of registering, we often see a timeout reported on the capsule:

[Mon Nov 20 12:32:33.408971 2017] [proxy_http:error] [pid 3484] (70007)The timeout specified has expired: [client] AH01102: error reading status line from remote server
[Mon Nov 20 12:32:33.409006 2017] [proxy:error] [pid 3484] [client] AH00898: Error reading from remote server returned by /rhsm/consumers

Might it be related?

Version-Release number of selected component (if applicable):

How reproducible:
very often

Steps to Reproduce:
1. Have a Satellite with 10k hosts and continue registering at a rate of 60 new hosts every ~7 minutes (on the hosts, there is a retry mechanism that runs the registration up to 3 times in case of failure)
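The reproduction step above can be sketched roughly as below. This is not the actual scale-test harness; REGISTER_CMD, the org name, and the activation key are placeholders, and backgrounding 60 calls from one shell only approximates registrations that really run on separate hosts.

```shell
# Placeholder for the real "subscription-manager register ..." invocation;
# org and activation key names are made up for illustration.
REGISTER_CMD=${REGISTER_CMD:-"subscription-manager register --org Default_Organization --activationkey scale-test-key"}

# Retry mechanism from the steps above: up to 3 attempts per host.
register_with_retry() {
  for attempt in 1 2 3; do
    if $REGISTER_CMD; then
      return 0
    fi
    sleep 1   # brief back-off between attempts
  done
  return 1
}

# One burst of 60 "hosts" registering concurrently.
for i in $(seq 1 60); do
  register_with_retry >/dev/null 2>&1 &
done
wait
```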

Actual results:
Log full of:

2017-11-20 12:34:21,902 [thread=http-bio-8443-exec-42] [req=08662b3d-950d-491a-b7ad-bd0b3f0dee04, org=Default_Organization, csid=] WARN  org.candlepin.audit.EventSinkImpl - Rolling back hornetq transaction.
2017-11-20 12:34:27,680 [thread=http-bio-8443-exec-38] [req=27caf3af-ff1c-4d67-b579-e303d798130b, org=Default_Organization, csid=] WARN  org.candlepin.audit.EventSinkImpl - Rolling back hornetq transaction.
2017-11-20 12:34:30,502 [thread=http-bio-8443-exec-10] [req=dc099a24-779c-4c81-bb18-6e7aa20abb23, org=Default_Organization, csid=] WARN  org.candlepin.audit.EventSinkImpl - Rolling back hornetq transaction.

and this traceback from time to time:

2017-11-20 12:34:34,203 [thread=Finalizer] [=, org=, csid=] WARN  org.hornetq.core.client - HQ212016: I am closing a core ClientSession you left open. Please make sure you close all ClientSessions explicitly before letting them go out of 
scope! 394,651,300
java.lang.Exception: null
        at org.hornetq.core.client.impl.DelegatingSession.<init>( ~[hornetq-core-client-2.4.7.Final.jar:na]
        at org.hornetq.core.client.impl.ClientSessionFactoryImpl.createSessionInternal( ~[hornetq-core-client-2.4.7.Final.jar:na]
        at org.hornetq.core.client.impl.ClientSessionFactoryImpl.createSessionInternal( ~[hornetq-core-client-2.4.7.Final.jar:na]
        at org.hornetq.core.client.impl.ClientSessionFactoryImpl.createTransactedSession( ~[hornetq-core-client-2.4.7.Final.jar:na]
        at org.candlepin.audit.EventSinkImpl.getClientSession( ~[EventSinkImpl.class:na]
        at org.candlepin.audit.EventSinkImpl.sendEvents( ~[EventSinkImpl.class:na]
        at org.candlepin.servlet.filter.EventFilter.doFilter( ~[EventFilter.class:na]
        at ~[guice-servlet-3.0.jar:na]
        at ~[guice-servlet-3.0.jar:na]
        at org.candlepin.servlet.filter.ContentTypeHackFilter.doFilter( ~[ContentTypeHackFilter.class:na]
        at ~[guice-servlet-3.0.jar:na]
        at ~[guice-servlet-3.0.jar:na]
        at org.candlepin.common.filter.LoggingFilter.doFilter( ~[candlepin-common-2.0.3.jar:na]
        at ~[guice-servlet-3.0.jar:na]
        at ~[guice-servlet-3.0.jar:na]
        at org.candlepin.servlet.filter.CandlepinPersistFilter.doFilter( ~[CandlepinPersistFilter.class:na]
        at ~[guice-servlet-3.0.jar:na]
        at ~[guice-servlet-3.0.jar:na]
        at org.candlepin.servlet.filter.CandlepinScopeFilter.doFilter( ~[CandlepinScopeFilter.class:na]
        at ~[guice-servlet-3.0.jar:na]
        at ~[guice-servlet-3.0.jar:na]
        at ~[guice-servlet-3.0.jar:na]
        at ~[guice-servlet-3.0.jar:na]
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter( ~[catalina.jar:7.0.76]
        at org.apache.catalina.core.ApplicationFilterChain.doFilter( ~[catalina.jar:7.0.76]
        at org.apache.catalina.core.StandardWrapperValve.invoke( ~[catalina.jar:7.0.76]
        at org.apache.catalina.core.StandardContextValve.invoke( ~[catalina.jar:7.0.76]
        at org.apache.catalina.authenticator.AuthenticatorBase.invoke( ~[catalina.jar:7.0.76]
        at org.apache.catalina.core.StandardHostValve.invoke( ~[catalina.jar:7.0.76]
        at org.apache.catalina.valves.ErrorReportValve.invoke( ~[catalina.jar:7.0.76]
        at org.apache.catalina.core.StandardEngineValve.invoke( ~[catalina.jar:7.0.76]
        at org.apache.catalina.connector.CoyoteAdapter.service( ~[catalina.jar:7.0.76]
        at org.apache.coyote.http11.AbstractHttp11Processor.process( ~[tomcat-coyote.jar:7.0.76]
        at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process( ~[tomcat-coyote.jar:7.0.76]
        at$ ~[tomcat-coyote.jar:7.0.76]
        at java.util.concurrent.ThreadPoolExecutor.runWorker( ~[na:1.7.0_151]
        at java.util.concurrent.ThreadPoolExecutor$ ~[na:1.7.0_151]
        at org.apache.tomcat.util.threads.TaskThread$ ~[tomcat-coyote.jar:7.0.76]
        at ~[na:1.7.0_151]

Expected results:
No tracebacks in the logs.

Additional info:
# grep ^2017 /var/log/candlepin/error.log | wc -l
# grep ^2017 /var/log/candlepin/error.log | grep 'Rolling back hornetq transaction' | wc -l
# grep ^2017 /var/log/candlepin/error.log | grep 'HQ212016: I am closing a core ClientSession you left open. Please make sure you close all ClientSessions explicitly before letting them go out of scope' | wc -l
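To see whether the warnings line up with the registration bursts, the greps above can be extended to bucket the "Rolling back hornetq transaction" warnings per minute. A minimal sketch follows; point LOG at the real /var/log/candlepin/error.log on the Satellite (a tiny sample file is generated here only so the pipeline can be tried anywhere).

```shell
LOG=${LOG:-sample-error.log}   # use /var/log/candlepin/error.log on the Satellite

# Generate a small sample (taken from the log excerpt above) if no log exists.
if [ ! -f "$LOG" ]; then
  cat > "$LOG" <<'EOF'
2017-11-20 12:34:21,902 [thread=http-bio-8443-exec-42] WARN  org.candlepin.audit.EventSinkImpl - Rolling back hornetq transaction.
2017-11-20 12:34:27,680 [thread=http-bio-8443-exec-38] WARN  org.candlepin.audit.EventSinkImpl - Rolling back hornetq transaction.
2017-11-20 12:35:30,502 [thread=http-bio-8443-exec-10] WARN  org.candlepin.audit.EventSinkImpl - Rolling back hornetq transaction.
EOF
fi

# First 16 chars of "date time" = "YYYY-MM-DD HH:MM"; count warnings per minute.
grep 'Rolling back hornetq transaction' "$LOG" \
  | awk '{ print substr($1 " " $2, 1, 16) }' \
  | sort | uniq -c | sort -rn
```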

Comment 4 Chris "Ceiu" Rog 2018-01-17 17:19:03 UTC
The exception posted, while bad and something to address eventually, is not causing issues. It's simply a notice that we're not cleaning up a session properly and HornetQ's session finalizer is having to do it for us.

I'm looking into the logs a bit more to see if there's another, more severe, issue related to the timeouts you're seeing.

Comment 5 Chris "Ceiu" Rog 2018-01-19 21:02:45 UTC
I've looked into this a bit more and found some interesting details.

First, the good news: there aren't any timeouts occurring here. At least, not in the traditional sense. Every completed request came in well under typical timeout thresholds, and not a single one took more than five seconds to finish.

The bad news, however, is that one request in particular died in a nasty way [2]. It looks like one consumer's certificate has some missing or malformed data, and when it was being serialized for output, it killed the whole request thread. To a client, it looks like the request just hangs indefinitely until the client timeout is exceeded. This is likely where the timeouts are coming from.

That said, I would be interested to know whether this particular consumer always causes problems when fetched. If you could try fetching the consumer in question [1] and report the results, that would be a great help. The behavior I'm expecting is for the request to appear to hang as far as the output in candlepin.log goes, but for a rather lengthy exception to show up in /var/log/tomcat/localhost-<date>.log similar to the exception below [3].

[1] Consumer UUID: ec5f52b0-cda0-4fcb-98bb-edda44c01e4b

[2] 2017-11-20 13:37:42,153: 2c7d19f5-7774-45a4-bad5-fb9cf5b1e070 => GET /candlepin/consumers/ec5f52b0-cda0-4fcb-98bb-edda44c01e4b

Nov 20, 2017 1:37:42 PM org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet [default] in context with path [/candlepin] threw exception
org.jboss.resteasy.spi.UnhandledException: Response is committed, can't handle exception
        at org.jboss.resteasy.core.SynchronousDispatcher.writeException(
        at org.jboss.resteasy.core.SynchronousDispatcher.writeResponse(
        at org.jboss.resteasy.core.SynchronousDispatcher.invoke(
        ... <snip> ...
        at java.util.concurrent.ThreadPoolExecutor$
        at org.apache.tomcat.util.threads.TaskThread$
Caused by: com.fasterxml.jackson.databind.JsonMappingException: 202 (through reference chain: org.candlepin.model.Consumer["idCert"]->org.candlepin.model.IdentityCertificate["serial"]->org.candlepin.model.CertificateSerial["expiration"])
        at com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(
        at com.fasterxml.jackson.databind.JsonMappingException.wrapWithPath(
        at com.fasterxml.jackson.databind.ser.std.StdSerializer.wrapAndThrow(
        ... <snip> ...
        at org.jboss.resteasy.core.interception.AbstractWriterInterceptorContext.proceed(
        at org.jboss.resteasy.core.ServerResponseWriter.writeNomapResponse(
        at org.jboss.resteasy.core.SynchronousDispatcher.writeResponse(
        ... 43 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 202
        at sun.util.calendar.BaseCalendar.getCalendarDateFromFixedDate(
        at java.util.GregorianCalendar.computeFields(
        at java.util.GregorianCalendar.computeFields(
        ... <snip> ...
        at com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(
        at org.candlepin.common.jackson.CheckableBeanPropertyFilter.serializeAsField(
        at com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFieldsFiltered(
        ... 62 more
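The single-consumer fetch requested above can be sketched as follows. The UUID is the one from [1]; the Candlepin URL and the admin credentials are placeholders (a real Satellite typically authenticates with the client identity certificate rather than basic auth), so the sketch prints the command instead of running it.

```shell
CANDLEPIN=${CANDLEPIN:-https://satellite.example.com:8443/candlepin}  # placeholder host
UUID=ec5f52b0-cda0-4fcb-98bb-edda44c01e4b                             # consumer from [1]

# Print the command rather than running it, since the host and credentials are
# placeholders. While the real request runs, watch candlepin.log for the
# request appearing to hang and tomcat's localhost-<date>.log for the
# serialization exception shown above.
echo curl -k -u admin:changeme "$CANDLEPIN/consumers/$UUID"
```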

Comment 6 Jan Hutař 2018-01-22 10:02:43 UTC
Unfortunately I do not have the setup now; it was a setup for Satellite scale testing. I might be able to use a different setup for this, but I do not understand what you want me to do. Could you please elaborate a bit? I'm a bit lost here :)

Comment 7 Chris "Ceiu" Rog 2018-01-22 14:47:01 UTC
Could you set up and run this test again and look for more timeouts? If/when you see the issue pop up again, check whether the consumers that are timing out consistently show the same behavior even when fetched individually. From what I'm seeing in the logs, this is a data error that's causing requests to die, so it should be repeatable once triggered.

Comment 8 Jan Hutař 2018-01-26 11:17:43 UTC
So I was able to reproduce (I used 100 concurrent registrations this time, 5 times in a row; because there is some setup and teardown code in my test, it is not one registration right after another, and there is about 1 minute between actual runs).

Anyway, there are 10 retries per run in case it fails, and in the end none of the clients reported a registration failure, so although I have seen tracebacks in the log, the retries fixed it.

Is this the info you expected?

Comment 10 Chris "Ceiu" Rog 2018-01-26 14:39:20 UTC
I want to emphasize that the traceback you're seeing is a warning regarding connection cleanup and is not really significant. The timeout you were seeing in the original post here is the issue I was trying to see if you were able to reproduce. The good news is that it appears you were not. 

That being said, the original reported issue (the noisy traceback) is on our radar and will be addressed as the eventing subsystems are updated next.

Comment 11 Barnaby Court 2018-01-29 16:57:08 UTC
Jan, the original BZ was describing two issues.

1) A client failure that appears to be fixed by a retry according to comment 8. 
2) A noisy log statement from HornetQ. The HornetQ log message is not causing any server or client problems.

Based on the above, (1) is not reproducible and (2) is at best low priority. Would you prefer that we keep this BZ open for (2) or close it out since (1) was not reproducible?

Comment 12 Jan Hutař 2018-01-30 09:48:12 UTC
Hello. I do not have a strong opinion. I would keep this bug open for the low-priority issue of the ugly traceback that should be a nicer log message. But we can also close this one and create a new bug to keep things clear. Please just let me know and I'll do it if needed. Thank you!
