Bug 1692885 - Pulp scheduler cancels tasks even with a high worker_timeout specified - Worker has gone missing, removing from list of workers
Keywords:
Status: NEW
Alias: None
Product: Red Hat Satellite 6
Classification: Red Hat
Component: Pulp
Version: 6.5.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: jcallaha
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-26 16:11 UTC by Mike McCune
Modified: 2019-03-29 06:39 UTC (History)
2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:


Attachments (Terms of Use)
scheduler.py patch (deleted)
2019-03-26 16:16 UTC, Mike McCune

Description Mike McCune 2019-03-26 16:11:03 UTC
We are seeing environments where Pulp is removing workers from the system and canceling tasks that are executing:

pulp: pulp.server.async.scheduler:ERROR: Worker 'reserved_resource_worker-0@sat6.example.com' has gone missing, removing from list of workers
pulp: pulp.server.async.tasks:ERROR: The worker named reserved_resource_worker-0@sat6.example.com is missing. Canceling the tasks in its queue.
pulp.server.async.tasks:INFO: Task canceled: 4658bdd0-8f23-482a-a965-d2dbd313cf12

As a workaround, we have tried specifying high values for the worker_timeout setting in /etc/pulp/server.conf.

We have tried values of 60, 300, 600, and 900+, yet we still see workers being removed.
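For reference, the setting being adjusted looks like the following in /etc/pulp/server.conf (a sketch; the [tasks] section name is from memory and should be verified against the shipped config file):

```ini
[tasks]
# Seconds a worker may go without a recorded heartbeat before the
# scheduler considers it missing and cancels the tasks in its queue.
worker_timeout = 900
```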

The result is that Content View publishes/promotes, repository syncs, and other operations fail, and the only option is to restart the process. This can be business-impacting if there is automation in place that expects these operations to succeed. These operations continue to fail even on re-attempts, currently leaving the user no recourse short of code modification.

Will attach a patch to this BZ that we have applied to /usr/lib/python2.7/site-packages/pulp/server/async/scheduler.py; it comments out the delete calls:

  if worker.last_heartbeat < oldest_heartbeat_time:
      msg = _("Worker '%s' has gone missing, removing from list of workers") % worker.name
      _logger.error(msg)

      if worker.name.startswith(constants.SCHEDULER_WORKER_NAME):
          # worker.delete()
          _logger.error("Not deleting SCHEDULER worker - continuing on")
      else:
          # _delete_worker(worker.name)
          _logger.error("Not deleting worker - continuing on")

Will also try increasingly large values of worker_timeout (24h+) to see if there really is a bug in the time calculations in this code.

Comment 4 Mike McCune 2019-03-26 16:16:53 UTC
Created attachment 1548133 [details]
scheduler.py patch

*** WORKAROUND ***

Patch that can be utilized as a workaround to disable worker deletion. To apply:

1) make backup copy of scheduler.py

cp /usr/lib/python2.7/site-packages/pulp/server/async/scheduler.py /usr/lib/python2.7/site-packages/pulp/server/async/scheduler.bak-1692885

2) copy the attached scheduler.py from this case to your Satellite at:

/usr/lib/python2.7/site-packages/pulp/server/async/scheduler.py 

3) restart

foreman-maintain service restart

4) resume operations

Comment 5 Brian Bouterse 2019-03-26 16:33:00 UTC
With so many timeout values tried, and with clock drift not a possible cause due to workers and celerybeat running on the same machine, I deduce that the workers are not able to write their timestamps to the db as expected. Are we sure that workers are writing their heartbeats to the db?

The workers do that as part of their HeartbeatStep here https://github.com/pulp/pulp/blob/e5a22e13ae46fe86dccedc5bf214537c2b90ad0d/server/pulp/server/async/app.py#L119
That calls the handle_worker_heartbeat() method here:  https://github.com/pulp/pulp/blob/2-master/server/pulp/server/async/worker_watcher.py#L28-L57
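For illustration, the scheduler's missing-worker check boils down to comparing each worker's stored heartbeat against a cutoff derived from worker_timeout. A minimal sketch follows; it is simplified, not Pulp's actual code, and the function name and worker-list shape are assumptions:

```python
from datetime import datetime, timedelta


def find_missing_workers(workers, worker_timeout_seconds, now=None):
    """Return names of workers whose last heartbeat is older than the timeout.

    `workers` is a list of (name, last_heartbeat) tuples. This mirrors the
    comparison in pulp.server.async.scheduler: any worker whose heartbeat
    predates `now - worker_timeout` is treated as missing, and its queued
    tasks are then canceled.
    """
    now = now or datetime.utcnow()
    oldest_heartbeat_time = now - timedelta(seconds=worker_timeout_seconds)
    return [name for name, last_heartbeat in workers
            if last_heartbeat < oldest_heartbeat_time]
```

The key point for this bug: if workers cannot persist their heartbeats to the db (e.g. due to over-taxed I/O), every stored timestamp eventually falls behind the cutoff no matter how large worker_timeout is, which would match the observed behavior.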

Comment 6 Mike McCune 2019-03-27 19:20:23 UTC
We are going to try increasingly high values of worker_timeout (e.g. 3600) to rule out any issues with the algorithm for checking heartbeats.

This condition is likely a result of over-taxed I/O and slow storage, but we will continue to diagnose.

Comment 7 Mike McCune 2019-03-27 19:48:29 UTC
Timeout can be configured via the installer, you must use both params:

 satellite-installer --foreman-proxy-content-pulp-worker-timeout 3600 --katello-pulp-worker-timeout 3600

