Bug 1690510 - [osp15] load reaches >100
Summary: [osp15] load reaches >100
Keywords:
Status: CLOSED DUPLICATE of bug 1697561
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 15.0 (Stein)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Emilien Macchi
QA Contact: Sasha Smolyak
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-19 15:38 UTC by John Fulton
Modified: 2019-04-11 20:51 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-04-11 20:51:10 UTC
Target Upstream Version:


Attachments
tarball from pcp (deleted)
2019-03-19 15:53 UTC, John Fulton
no flags
watch -n 1 "uptime >> uptime.log" showing varying workload over time (deleted)
2019-03-31 20:50 UTC, John Fulton
no flags
watch -n 15 "top -b -o +%MEM | head -n 22" >> top_15 # left that running a few hours (deleted)
2019-04-01 18:36 UTC, John Fulton
no flags

Description John Fulton 2019-03-19 15:38:12 UTC
While testing an undercloud with 3 controllers and 2 compute/ceph-osd (HCI) nodes, the system hit a high load after 5 days of uptime. PCP was installed and the issue was reproduced, so I will attach these logs to the bug: 

 sudo dnf install -y pcp-zeroconf pcp-system-tools
 /var/log/pcp/pmlogger/undercloud-0*
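
The archives under that pmlogger directory can be replayed after the fact, e.g. with pmrep from pcp-system-tools (the archive name below is just an example, not the exact filename in the attachment):

 # replay the recorded load-average metric from one of the PCP archives
 pmrep -a /var/log/pcp/pmlogger/undercloud-0/20190319.00.10 kernel.all.load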

The test was to deploy an overcloud and then re-run the same deployment command with modifications to the Heat environment parameters. During the resulting Heat stack update, the failure below was seen. After restarting all of the containers on the undercloud (for C in $(podman ps | awk {'print $1'}); do podman restart $C; done) the issue went away until I attempted another stack update. I have since rebooted the node and I'm no longer seeing the issue.
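
For reference, the same container restart can be done in one line, since podman ps -q prints only container IDs (no header row to filter out):

 # restart every running container on the undercloud (same effect as the loop above)
 sudo podman restart $(podman ps -q)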


# ./deploy.sh
...
N8EUWDCAV', 'SwiftPassword': 'ssogYT4guZBFN4YkBy0L8i9rS', 'TackerPassword': 'ffyxTDe8TKMWDCzYsxPWMfw1T', 'TrovePassword': 'ZWSdVpHFaCpkDAPvzm0vO6NBE', 'ZaqarPassword': 'nFAEdOrb70yeUFNoxiGEpuusm'}, 'template': 'overcloud.yaml', 'version': 1.0}                           
Removing the current plan files
Uploading new plan files
Plan updated.
Processing templates in the directory /tmp/tripleoclient-gx2izh51/tripleo-heat-templates
Processing templates in the directory /tmp/tripleoclient-gx2izh51/tripleo-heat-templates
Exception occured while running the command
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/mistralclient/api/base.py", line 132, in _create
    resp = self.http_client.post(url, data)
  File "/usr/lib/python3.6/site-packages/mistralclient/api/httpclient.py", line 54, in decorator
    resp = func(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/mistralclient/api/httpclient.py", line 120, in post
    return self.session.post(self.base_url + url, data=body, **options)
  File "/usr/lib/python3.6/site-packages/keystoneauth1/session.py", line 1045, in post
    return self.request(url, 'POST', **kwargs)
  File "/usr/lib/python3.6/site-packages/keystoneauth1/session.py", line 890, in request
    raise exceptions.from_response(resp, method, url)
keystoneauth1.exceptions.http.InternalServerError: Internal Server Error (HTTP 500)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/tripleoclient/command.py", line 29, in run
    super(Command, self).run(parsed_args)
  File "/usr/lib/python3.6/site-packages/osc_lib/command/command.py", line 41, in run
    return super(Command, self).run(parsed_args)
  File "/usr/lib/python3.6/site-packages/cliff/command.py", line 184, in run
    return_code = self.take_action(parsed_args) or 0
  File "/usr/local/lib/python3.6/site-packages/tripleoclient/v1/overcloud_deploy.py", line 910, in take_action
    self._deploy_tripleo_heat_templates_tmpdir(stack, parsed_args)
  File "/usr/local/lib/python3.6/site-packages/tripleoclient/v1/overcloud_deploy.py", line 365, in _deploy_tripleo_heat_templates_tmpdir                     


top - 21:40:42 up 4 days, 11:01,  4 users,  load average: 36.57, 9.31, 13.92
Tasks: 551 total,  34 running, 503 sleeping,   0 stopped,  14 zombie
%Cpu0  :  0.5 us, 98.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.5 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us, 99.5 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.5 hi,  0.0 si,  0.0 st
%Cpu2  :  2.9 us, 96.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.8 hi,  0.0 si,  0.0 st
%Cpu3  :  0.5 us, 98.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.5 hi,  0.0 si,  0.0 st
%Cpu4  :  0.3 us, 98.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.5 hi,  0.3 si,  0.0 st
%Cpu5  :  0.0 us, 99.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.5 hi,  0.3 si,  0.0 st
%Cpu6  :  1.3 us, 98.1 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.5 hi,  0.0 si,  0.0 st
%Cpu7  :  1.1 us, 98.4 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.5 hi,  0.0 si,  0.0 st
MiB Mem :  19907.8 total,    153.6 free,  16746.5 used,   3007.7 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   1313.9 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND  
     84 root      20   0       0      0      0 R  97.8   0.0  92:44.75 kswapd0  
      1 root      20   0  245964  10116   3816 R  90.0   0.0  26:49.43 systemd  
 827874 root      20   0 1394456  24052   3376 R  49.1   0.1   0:03.23 podman   
  37626 root      20   0 1615396  26868      0 R  26.1   0.1   0:27.21 podman   
 369401 root      20   0  159412   3764   2524 S  25.1   0.0   0:25.44 sshd     
  41151 42422     20   0  345444 115552   6112 R  24.0   0.6  32:04.43 ironic-conducto

then later...

(undercloud) [stack@undercloud-0 ~]$ w
21:44:47 up 4 days, 11:05,  4 users,  load average: 176.59, 99.93, 49.73
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT

On the console I also saw:

[root@undercloud-0 ~]# [440029.159217] overlayfs: upperdir is in-use by another mount, accessing files from both mounts will result in undefined behavior.
[440029.161492] overlayfs: workdir is in-use by another mount, accessing files from both mounts will result in undefined behavior.
[440029.234548] SELinux: mount invalid.  Same superblock, different security settings for (dev mqueue, type mqueue)
[440040.278216] serial8250: too much work for irq4

Comment 1 John Fulton 2019-03-19 15:53:18 UTC
Created attachment 1545738 [details]
tarball from pcp

Comment 2 Emilien Macchi 2019-03-21 14:46:09 UTC
Wow. Nice catch.
John, are you able to tell which service managed by systemd and which podman containers are taking 90% and 49.1% respectively?

Thanks

Comment 5 John Fulton 2019-03-28 13:41:13 UTC
(In reply to Emilien Macchi from comment #2)
> Wow. Nice catch.
> John, are you able to tell which service managed by systemd and which podman
> containers are taking 90% and 49.1% respectively?
> 
> Thanks

Sorry, I don't have this information. Hopefully we can find out more if Michele reproduces it. If I encounter this issue again, I will update the bug.

Comment 6 John Fulton 2019-03-31 20:49:10 UTC
- Encountered this again on a 4-day-old undercloud.
- I started 'watch -n 1 "uptime >> uptime.log"' and let it run in a tmux session.
- Started an sosreport, but it was killed on plugin 88/106 (selinux).
- Tried again to run the sosreport and it was killed again on plugin 88/106 (selinux) [1].
- I'll attach the partial sosreport [1] and uptime.log.
- Note from uptime.log that the load returns to normal and then spikes to around 107 twice, as if it's following a pattern (see the one-liner after this list for pulling the load figures out of the log).
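
A quick way to pull the load figures back out of uptime.log and eyeball the spike pattern (just a one-liner, nothing fancy):

 # print only the 1/5/15 minute load averages from each captured uptime line
 awk -F'load average: ' '{print $2}' uptime.log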

Footnote

[1] 
An archive containing the collected information will be generated in                                     
/var/tmp/sos.7t8crle1 and may be provided to a Red Hat support                                           
representative.

Any information provided to Red Hat will be treated in accordance with                                   
the published support policies at:

  https://access.redhat.com/support/

The generated archive may contain data considered sensitive and its                                      
content should be reviewed by the originating organization before being                                  
passed to any third party.

No changes will be made to system configuration.

Press ENTER to continue, or CTRL-C to quit.

Please enter the case id that you are generating this report for []: bz1690510                           

 Setting up archive ...
 Setting up plugins ...
Not all environment variables set. Source the environment file for the user intended to connect to the OpenStack environment.
Not all environment variables set. Source the environment file for the user intended to connect to the OpenStack environment.
Not all environment variables set. Source the environment file for the user intended to connect to the OpenStack environment.
Not all environment variables set. Source the environment file for the user intended to connect to the OpenStack environment.
Not all environment variables set. Source the environment file for the user intended to connect to the OpenStack environment.
 Running plugins. Please wait ...

  Starting 88/106 selinux         [Running: logs openstack_octavia podman selinux]        Killed]e]      
[root@undercloud-0 ~]#

Comment 7 John Fulton 2019-03-31 20:50:35 UTC
Created attachment 1550319 [details]
watch -n 1 "uptime >> uptime.log" showing varying workload over time

Comment 13 John Fulton 2019-04-01 18:36:05 UTC
Created attachment 1550701 [details]
watch -n 15 "top -b -o +%MEM | head -n 22" >> top_15 # left that running a few hours
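
Roughly the same capture could also be done with a plain shell loop instead of watch (sketch only, using the same top_15 filename as the attachment title):

 # append a batch-mode top snapshot, sorted by memory usage, every 15 seconds
 while true; do top -b -n 1 -o +%MEM | head -n 22 >> top_15; sleep 15; done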

