Note: This is a beta release of Red Hat Bugzilla 5.0. The data contained within is a snapshot of the live data so any changes you make will not be reflected in the production Bugzilla. Also email is disabled so feel free to test any aspect of the site that you want. File any problems you find or give feedback here.
Bug 1600262 - pmdaproc only reads the first 1024 bytes of the /proc/*/status file resulting in lost metric values
Summary: pmdaproc only reads the first 1024 bytes of the /proc/*/status file resulting...
Keywords:
Status: VERIFIED
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: pcp
Version: 7.5
Hardware: Unspecified
OS: Unspecified
urgent
high
Target Milestone: rc
: ---
Assignee: Mark Goodwin
QA Contact: Michal Kolar
Vladimír Slávik
URL:
Whiteboard:
Depends On:
Blocks: 1636959 1636960
TreeView+ depends on / blocked
 
Reported: 2018-07-11 20:50 UTC by Kyle Walker
Modified: 2019-01-30 11:31 UTC (History)
11 users (show)

Fixed In Version: pcp-4.1.1
Doc Type: If docs needed, set a value
Doc Text:
Previously, the Performance Co-Pilot (PCP) toolkit failed to report certain process details on very large systems in some cases. The code reading the process details files was changed so that it can read data of arbitrary length, instead of only the first 1024 bytes. As a result, the described PCP error can no longer happen.
Clone Of:
: 1636959 1636960 (view as bug list)
Environment:
Last Closed:
Target Upstream Version:


Attachments (Terms of Use)

Description Kyle Walker 2018-07-11 20:50:35 UTC
Description of problem:
 The pmdaproc collector agent only reads the first 1024 bytes of the /proc/*/status files. This results, on exceptionally large systems, in metrics at the end of the file being unavailable intermittently. This can result in metrics being unavailable even for smaller systems. 

For example:

# diff -yW70 <(cat /proc/1/status) <(head -c 1024  /proc/1/status)
Name:	systemd				Name:	systemd
Umask:	0000				Umask:	0000
State:	S (sleeping)			State:	S (sleeping)
Tgid:	1				Tgid:	1
Ngid:	0				Ngid:	0
Pid:	1				Pid:	1
PPid:	0				PPid:	0
TracerPid:	0			TracerPid:	0
Uid:	0	0	0		Uid:	0	0	0
Gid:	0	0	0		Gid:	0	0	0
FDSize:	128				FDSize:	128
Groups:					Groups:	
VmPeak:	  193568 kB			VmPeak:	  193568 kB
VmSize:	  128032 kB			VmSize:	  128032 kB
VmLck:	       0 kB			VmLck:	       0 kB
VmPin:	       0 kB			VmPin:	       0 kB
VmHWM:	    6624 kB			VmHWM:	    6624 kB
VmRSS:	    6308 kB			VmRSS:	    6308 kB
RssAnon:	    2548 kB		RssAnon:	    2548 kB
RssFile:	    3760 kB		RssFile:	    3760 kB
RssShmem:	       0 kB		RssShmem:	       0 kB
VmData:	   84220 kB			VmData:	   84220 kB
VmStk:	     132 kB			VmStk:	     132 kB
VmExe:	    1408 kB			VmExe:	    1408 kB
VmLib:	    3716 kB			VmLib:	    3716 kB
VmPTE:	     124 kB			VmPTE:	     124 kB
VmSwap:	       0 kB			VmSwap:	       0 kB
Threads:	1			Threads:	1
SigQ:	0/3860			  |	SigQ:	1/3860
SigPnd:	0000000000000000		SigPnd:	0000000000000000
ShdPnd:	0000000000000000		ShdPnd:	0000000000000000
SigBlk:	7be3c0fe28014a03		SigBlk:	7be3c0fe28014a03
SigIgn:	0000000000001000		SigIgn:	0000000000001000
SigCgt:	00000001800004ec		SigCgt:	00000001800004ec
CapInh:	0000000000000000		CapInh:	0000000000000000
CapPrm:	0000001fffffffff		CapPrm:	0000001fffffffff
CapEff:	0000001fffffffff		CapEff:	0000001fffffffff
CapBnd:	0000001fffffffff		CapBnd:	0000001fffffffff
CapAmb:	0000000000000000		CapAmb:	0000000000000000
Seccomp:	0			Seccomp:	0
Speculation_Store_Bypass:		Speculation_Store_Bypass:
Cpus_allowed:	3			Cpus_allowed:	3
Cpus_allowed_list:	0-1		Cpus_allowed_list:	0-1
Mems_allowed:	00000000,00000	  /	Mems_allowed:	00000000,00000
Mems_allowed_list:	0	  <
voluntary_ctxt_switches:	  <
nonvoluntary_ctxt_switches:	  <

In the above, the proc.psinfo.{nvctxsw,vctxsw} entries will be 0.


Version-Release number of selected component (if applicable):
    pcp-3.12.2-5.el7.x86_64


How reproducible:
    Easily

Steps to Reproduce:
1. Install the pcp-zeroconf package
    # yum install pcp-zeroconf -y

2. Instantiate a hotproc configuration to match "systemd"
    # pmstore hotproc.control.config 'fname == "systemd"' 

3. Verify the contents of the proc.psinfo.nvctxsw entries against the underlying file contents:
    # pminfo -f hotproc.psinfo.nvctxsw

Actual results:

# tail -1 /proc/1/status
nonvoluntary_ctxt_switches:	2348

# pminfo -f hotproc.psinfo.nvctxsw

hotproc.psinfo.nvctxsw
    inst [1 or "000001 /usr/lib/systemd/systemd"] value 0

Expected results:

# tail -1 /proc/1/status
nonvoluntary_ctxt_switches:	2348

# pminfo -f hotproc.psinfo.nvctxsw

hotproc.psinfo.nvctxsw
    inst [1 or "000001 /usr/lib/systemd/systemd"] value 2348


Additional info:
 Though the above example is simplest to achieve, the issue can present itself as PM_ERR_APPVERSION for string metrics such as {,hot}proc.psinfo.cpusallowed if the system is sufficiently large enough the that Cpus_allowed_list field is beyond the 1024 bytes limit. Issue seems to be present upstream.

Comment 2 Mark Goodwin 2018-07-12 04:52:31 UTC
Thanks for the report Kyle, it is indeed a bug - the buffer is only 1K. The simplistic patch would just increase that to 8K (or so), but a looping read would be better. Related code should also be audited, which I'll undertake too.

Regards

Comment 3 Kyle Walker 2018-07-12 16:10:53 UTC
(In reply to Mark Goodwin from comment #2)
> Thanks for the report Kyle, it is indeed a bug - the buffer is only 1K. The
> simplistic patch would just increase that to 8K (or so), but a looping read
> would be better.

I agree. The looping read doesn't look like the simplest change based on the current implementation, but much more effective than increasing the buffer.

Thanks for picking it up so quickly!

- Kyle Walker

Comment 4 Mark Goodwin 2018-07-20 07:07:57 UTC
upstream fix posted (still need a QA test). See: 

https://github.com/goodwinos/pcp/commit/73ce19502de10cad6c8a10b2cc7d787bb4717575


commit 73ce19502de10cad6c8a10b2cc7d787bb4717575
Author: Mark Goodwin <mgoodwin@redhat.com>
Date:   Fri Jul 20 16:24:45 2018 +1000

    pmdaproc: rework fetch_proc_pid_stat to use a looping read
    
    RHBZ #1600262 https://bugzilla.redhat.com/show_bug.cgi?id=1600262
    
    fetch_proc_pid_stat() was reading /proc/PID/status with a fixed
    size 1024 byte buffer. This is no longer big enough on most
    machines, causing some metrics to have incorrect values; notably
    the Mems_allowed, Mems_allowed_list, voluntary_ctxt_switches and
    nonvoluntary_ctxt_switches fields were not being read or parsed.
    
    Rework this to use a looping read with a suitably sized malloc'd
    buffer. QA test forthcoming.
    
    modified:   src/pmdas/linux_proc/proc_pid.c

Comment 5 Mark Goodwin 2018-07-31 06:18:15 UTC
Additional fixes and QA tests for this BZ that will be in pcp-4.1.1

commit a3f27c07ea7f480c16070ada22835375d7c311a0
Author: Mark Goodwin <mgoodwin@redhat.com>
Date:   Sun Jul 29 16:22:52 2018 +1000

    pmdaproc: tweak proc.memory.maps to return empty string for kernel workers
    
    Recent changes broke qa/860 by returning -ENODATA when a process
    has zero length /proc/PID/maps. WHilst this is technically correct,
    it's not historically correct and we need to clear the error and
    instead return a zero length string (as pmdaproc has always done).

commit a5a9195e2c16dd042e547001d67ab47c5b50b3ad
Author: Mark Goodwin <mgoodwin@redhat.com>
Date:   Tue Jul 24 13:41:21 2018 +1000

    pmdaproc: additional proc_pid rework to fix short buffer issues
    
    RHBZ #1600262
    
    Fix short buffer/read issues with some additional proc metrics.
    proc.psinfo.environ (/proc/PID/environ), proc.memory.maps
    (/proc/PID/maps), proc.psinfo.nvctxsw and proc.psinfo.vctxsw
    (/proc/PID/status) and a few others.
    
    modified:   src/pmdas/linux_proc/proc_pid.c

commit 12304c9d71711951b763bbdfd03a4834e0d03114
Author: Mark Goodwin <mgoodwin@redhat.com>
Date:   Sun Jul 29 16:01:03 2018 +1000

    qa: remake qa/1350 to check voluntary_ctxt_switches
    
    nonvoluntary_ctxt_switches in /proc/PID/status can sometimes be zero

commit 0b1fb9c7d8e04173e1036a2186b96393727223e3
Author: Mark Goodwin <mgoodwin@redhat.com>
Date:   Sun Jul 29 12:11:04 2018 +1000

    qa: add test 1350 to test pmdaproc reading /proc/PID/status
    
    RHBZ #1600262 https://bugzilla.redhat.com/show_bug.cgi?id=1600262
    
    new file:   qa/1350
    new file:   qa/1350.out
    modified:   qa/group

Comment 6 Fedora Update System 2018-08-04 00:21:52 UTC
pcp-4.1.1-2.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-0baea70d93

Comment 7 Fedora Update System 2018-08-04 00:34:05 UTC
pcp-4.1.1-2.fc27 has been submitted as an update to Fedora 27. https://bodhi.fedoraproject.org/updates/FEDORA-2018-bedf3e464f

Comment 8 Fedora Update System 2018-08-04 20:32:04 UTC
pcp-4.1.1-2.fc27 has been pushed to the Fedora 27 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-bedf3e464f

Comment 9 Fedora Update System 2018-08-04 22:37:00 UTC
pcp-4.1.1-2.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-0baea70d93

Comment 18 Michal Kolar 2019-01-30 11:31:55 UTC
Reproduced against pcp-4.1.0-4.el7 and verified against pcp-4.3.0-2.el7.


Note You need to log in before you can comment on or make changes to this bug.