Bug 1510602 - locking: bring in upstream PREEMPT_RT rtlock patches to fix single-reader limitation
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel-rt
Version: 7.6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Scott Wood
QA Contact: Pei Zhang
Blocks: 1532680
Reported: 2017-11-07 19:05 UTC by Clark Williams
Modified: 2018-10-30 09:42 UTC (History)
Doc Type: No Doc Update
Last Closed: 2018-10-30 09:40:30 UTC


Attachments
Shell script to trace a run of rteval (deleted), 2017-11-07 19:10 UTC, Clark Williams
Guest call log of kernel-rt-3.10.0-880.rt56.823.el7.x86_64 (deleted), 2018-05-07 04:01 UTC, Pei Zhang


Links
Red Hat Product Errata RHSA-2018:3096 (last updated 2018-10-30 09:42:36 UTC)

Description Clark Williams 2017-11-07 19:05:53 UTC
Early versions of rwsem and rwlock implementation on PREEMPT_RT imposed a single-reader limitation due to preemption problems. Upstream has lifted that limitation with commits:

81f6cae6196d rtmutex: add rwlock implementation based on rtmutex
b7abb5b0ad58 rtmutex: add rwsem implementation based on rtmutex

The current RHEL-RT 7.4 has an early version of the rwsem fix and does *not* contain the corresponding rwlock commit. 

Backport commit 81f6cae6196d so that rwlocks allow multiple read lockers.

Comment 2 Clark Williams 2017-11-07 19:10:25 UTC
Created attachment 1349103
Shell script to trace a run of rteval

Script used to run ftrace on an rteval run looking for latency spikes

Comment 3 Clark Williams 2017-11-07 19:20:14 UTC
To see this problem, boot an RT kernel with SELinux enabled on a system with more than 64 cores. I've been using hp-dl580gen9-01.khw.lab.eng.bos.redhat.com.

Use the attached trace-rteval.sh script to run rteval with tracing:

$ sudo ./trace-rteval.sh --duration=2h --breakthresh=200 -z

This will kick off an rteval run with tracing set to stop when a cyclictest latency greater than 200us is seen. When the run stops, at least two reports will be generated:

    latency-trace-n.txt
    latency-trace-n-cpuX.txt

Note that it's possible for a latency spike to occur on more than one CPU at the same time, so there may be multiple per-cpu reports. The per-cpu reports will show the cpu where a latency occurred. Search the file for the string 'hit', which is the tracemark message that occurs when a threshold is hit, then search backwards for a trace showing that the function 'security_compute_av()' went through the slowpath and caused a latency spike. This is due to contention on the lock 'policy_rwlock' in security/selinux/ss/services.c, which is a reader/writer lock (rwlock).
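
The search described above can be sketched as a small shell helper. This is only an illustration: the function name find_av_before_hit is hypothetical, and it assumes the per-cpu report is plain text with one trace event per line. A bare search for 'hit' can also match unrelated words, so a narrower pattern may be needed on real reports.

```shell
# Sketch: locate the first 'hit' tracemark in a per-cpu report, then show
# the closest preceding lines that mention security_compute_av.
# Assumption: one trace event per line; function name is hypothetical.
find_av_before_hit() {
    local report="$1"
    local count="${2:-5}"
    local hit_line
    # Line number of the first line containing 'hit'.
    hit_line=$(grep -n -m1 'hit' "$report" | cut -d: -f1)
    if [ -z "$hit_line" ]; then
        echo "no 'hit' tracemark found in $report" >&2
        return 1
    fi
    # Keep only the part of the report up to the hit, filter it, and
    # print the last $count matches (prefixed with their line numbers).
    head -n "$hit_line" "$report" | grep -n 'security_compute_av' | tail -n "$count"
}

# Example (file name follows the pattern given above):
#   find_av_before_hit latency-trace-n-cpuX.txt 10
```

The matches closest to the end of the output are the ones nearest the threshold hit, which is where the slowpath trace should be looked for.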

Comment 4 Clark Williams 2017-11-27 18:07:16 UTC
It looks like we're going to revert the initial commits that partially implement reader/writer locks (just rwsems) for 7.5, so I think we should start from scratch and bring both rwsems and rwlocks into 7.6. 

See https://bugzilla.redhat.com/show_bug.cgi?id=1448770 for more info on why the initial commits are being reverted.

Comment 5 Clark Williams 2017-11-29 16:58:38 UTC
Adding KVM guys to the CC. We will want to either duplicate the KVM testing setup or keep that setup around so we can test when we bring back the reader/writer patches.

Comment 6 Clark Williams 2017-12-04 16:16:53 UTC
Just confirming that we'll be reverting the rwsem changes for 7.5, so this bug is now for all reader/writer changes in our 7.6 RT kernel.

Comment 7 Pei Zhang 2018-04-03 08:49:00 UTC
==update==

QE hit an XFS hang with the latest kernel-rt build.

3.10.0-864.rt56.806.el7.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.1.x86_64
libvirt-3.9.0-14.el7.x86_64
tuned-2.9.0-1.el7.noarch

Reproduce: 2/2


However, with Scott's latest debug test builds for QE, this XFS hang is gone.

I asked Luis about this on IRC; it seems the latest patch is missing from the latest kernel-rt build. I'll re-test once that patch is applied.

Comment 10 Pei Zhang 2018-05-07 03:50:56 UTC
Unfortunately, we hit the XFS hang again.

Version: kernel-rt-3.10.0-880.rt56.823.el7.x86_64

Testing:

The 24-hour cyclictest runs for scenarios (1) and (2) below finished, and then we hit a call trace during the cyclictest run for scenario (3).

(1) Single VM with 1 rt vCPU
(2) Single VM with 8 rt vCPUs
(3) Multiple VMs each with 1 rt vCPU


So this issue still exists, but it has become hard to reproduce.


Best Regards,
Pei

Comment 11 Pei Zhang 2018-05-07 04:01:49 UTC
Created attachment 1432478
Guest call log of kernel-rt-3.10.0-880.rt56.823.el7.x86_64

Comment 14 Marcelo Tosatti 2018-07-13 15:09:02 UTC
(In reply to Pei Zhang from comment #10)

Pei,

Just to confirm, this was without make -j or a polling thread?
(in either host or guest).

Comment 15 Pei Zhang 2018-07-23 13:06:10 UTC
(In reply to Marcelo Tosatti from comment #14)

Sorry, I missed this needinfo.

In both host and guest, we tested with # make -j$twice_housekeeping_cpus_num.

Comment 16 Marcelo Tosatti 2018-07-26 14:35:25 UTC
(In reply to Pei Zhang from comment #10)

Pei, 

Can you confirm there was no polling thread (or make -j)
in either guest or host in this case?

Comment 17 Pei Zhang 2018-07-27 05:07:56 UTC
(In reply to Marcelo Tosatti from comment #16)

In both host and guest, we tested with # make -j$twice_housekeeping_cpus_num in this case.

Comment 18 Pei Zhang 2018-07-27 05:30:26 UTC
(In reply to Pei Zhang from comment #17)

Marcelo, I'm not sure whether I answered your question, so I'd like to share all the testing steps (run by automation):


1. Install rhel7.6 host

2. Set up rt host

3. Install guest and do testing

  3.1 Single VM with 1 rt vCPU

     3.1.1. Install rhel7.6 guest

     3.1.2. Set up rt guest

     3.1.3. Do cyclictest testing; (1), (2) and (3) run at the same time.

        (1) In rt host, compile the kernel with $twice_housekeeping_CPUs_num jobs
        # make -j14

        (2) In rt guest, compile the kernel with $twice_housekeeping_vCPUs_num jobs
        # make -j2

        (3) In rt guest, start cyclictest
        # taskset -c 1 cyclictest -m -n -q -p95 -D 1h -h60 -t 1 -a 1 -b40 --notrace

  3.2 Single VM with 8 rt vCPUs

     3.2.1. Install rhel7.6 guest

     3.2.2. Set up rt guest

     3.2.3. Do cyclictest testing; (1), (2) and (3) run at the same time.

        (1) In rt host, compile the kernel with $twice_housekeeping_CPUs_num jobs
        # make -j14

        (2) In rt guest, compile the kernel with $twice_housekeeping_vCPUs_num jobs
        # make -j4

        (3) In rt guest, start cyclictest
        # taskset -c 1,2,3,4,5,6,7,8 cyclictest -m -n -q -p95 -D 24h -h60 -t 8 -a 1,2,3,4,5,6,7,8 -b40 --notrace

  3.3 Multiple VMs each with 1 rt vCPU

     3.3.1. Install 4 rhel7.6 guests

     3.3.2. Set up these 4 rt guests

     3.3.3. Do cyclictest testing; (1), (2) and (3) run at the same time.

        (1) In rt host, compile the kernel with $twice_housekeeping_CPUs_num jobs
        # make -j14

        (2) In each rt guest, compile the kernel with $twice_housekeeping_vCPUs_num jobs
        # make -j2

        (3) In each rt guest, start cyclictest
        # taskset -c 1 cyclictest -m -n -q -p95 -D 1h -h60 -t 1 -a 1 -b40 --notrace
   

Thanks,
Pei

Comment 19 Pei Zhang 2018-07-27 05:33:15 UTC
Additional info for comment 18: steps (1) and (2) are repeated until the 24-hour cyclictest run finishes.
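
The thread does not say how $twice_housekeeping_cpus_num is computed, so the following is only a sketch under an assumption: the housekeeping CPUs are given as a kernel-style CPU list (for example "0-6" or "0,2-7"), and the job count is twice the number of CPUs in that list. Both function names are hypothetical. A list of 7 housekeeping CPUs would be consistent with the "make -j14" used on the host above.

```shell
# Count the CPUs in a kernel-style CPU list such as "0,2-7".
# Assumption: the list is well formed (no spaces, ranges are low-high).
count_cpus() {
    local list="$1"
    local total=0
    local part lo hi
    local IFS=,
    for part in $list; do
        lo="${part%-*}"   # "2-7" -> 2, "0" -> 0
        hi="${part#*-}"   # "2-7" -> 7, "0" -> 0
        total=$(( total + hi - lo + 1 ))
    done
    echo "$total"
}

# Job count for make: twice the number of housekeeping CPUs.
twice_housekeeping_jobs() {
    echo $(( 2 * $(count_cpus "$1") ))
}

# Example: with housekeeping CPUs 0-6 this runs "make -j14".
#   make -j"$(twice_housekeeping_jobs 0-6)"
```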

Comment 20 Marcelo Tosatti 2018-07-27 13:34:37 UTC
(In reply to Pei Zhang from comment #18)
> (In reply to Pei Zhang from comment #17)
> > (In reply to Marcelo Tosatti from comment #16)
> > > (In reply to Pei Zhang from comment #10)
> > > > Unfortunately we hit xfs hang issue again.
> > > > 
> > > > Version: kernel-rt-3.10.0-880.rt56.823.el7.x86_64
> > > >  
> > > > 
> > > > Testing:
> > > > 
> > > > Finishing 24 hours cyclictest testing with below scenario(1) and (2), then
> > > > hit Call Trace issue with scenario (3) cyclitest testing process. 
> > > > 
> > > > (1)Single VM with 1 rt vCPU
> > > > (2)Single VM with 8 rt vCPUs
> > > > (3)Multiple VMs each with 1 rt vCPU
> > > > 
> > > > 
> > > > So this issue still exits, and it's become hard to reproduce.
> > > > 
> > > > 
> > > > Best Regards,
> > > > Pei
> > > 
> > > Pei, 
> > > 
> > > Can you confirm there was no polling thread (or make -j) 
> > > in either guest and host in this case?
> > 
> > In both host and guest, we both tested with # make
> > -j$twice_housekeeping_cpus_num in this case.
> 
> 
> Marcelo, I'm not sure if I answer your question. So I'd like to share all
> the steps of testing(using automation):
> 
> 
> 1. Install rhel7.6 host
> 
> 2. Setup rt host
> 
> 3. Install guest and do testing
> 
>   3.1 Single VM with 1 rt vCPU
> 
>      3.1.1. Install rhel7.6 guest
> 
>      3.1.2. Setup rt guest
> 
>      3.1.3. Do cyclictest testing,(1)(2)(3) are running at same time.
>   
>         (1)In rt host, compiling kernel with $twice_housekeeping_CPUs_num
>         # make -j14
>       
>         (2)In rt guest, compiling kernel with $twice_housekeeping_vCPUs_num
>         # make -j2
> 
>         (3)In rt guest, start cyclictest
>         # taskset -c 1 cyclictest -m -n -q -p95 -D 1h -h60 -t 1 -a 1 -b40
> --notrace
> 
>   3.2 Single VM with 8 rt vCPUs
> 
>      3.1.1. Install rhel7.6 guest
> 
>      3.1.2. Setup rt guest
> 
>      3.1.3. Do cyclictest testing,(1)(2)(3) are running at same time.
>   
>         (1)In rt host, compiling kernel with $twice_housekeeping_CPUs_num
>         # make -j14
>       
>         (2)In rt guest, compiling kernel with $twice_housekeeping_vCPUs_num
>         # make -j4
> 
>         (3)In rt guest, start cyclictest
>         # taskset -c 1,2,3,4,5,6,7,8 cyclictest -m -n -q -p95 -D 24h -h60 -t
> 8 -a 1,2,3,4,5,6,7,8 -b40 --notrace
> 
>   3.3 Multiple VMs each with 1 rt vCPU
> 
>      3.1.1. Install 4 rhel7.6 guests
> 
>      3.1.2. Setup these 4 rt guests
> 
>      3.1.3. Do cyclictest testing,(1)(2)(3) are running at same time.
>   
>         (1)In rt host, compiling kernel with $twice_housekeeping_CPUs_num
>         # make -j14
>       
>         (2)In each rt guest, compiling kernel with
> $twice_housekeeping_vCPUs_num
>         # make -j2.
> 
>         (3)In each rt guest, start cyclictest.
>         # taskset -c 1 cyclictest -m -n -q -p95 -D 1h -h60 -t 1 -a 1 -b40
> --notrace
>    
> 
> Thanks,
> Pei

Pei,

Yes, you do. The problem is as follows:

On the host, the kvm-vcpus (each one of them isolated to a single pcpu), 
have scheduling priority FIFO:1. 

This means that a kvm-vcpu will run until:

1) A higher -RT priority process is scheduled on that pcpu.
2) It sleeps.

Depending on the workload in the kvm-vcpu, the kvm-vcpu might never 
sleep, not allowing the worker threads on that pcpu (which have a lower priority than the kvm-vcpu process), to run. Which would explain why a certain
process is unable to execute for more than 600 seconds (the warning you see when
the message

"Bug 1590222 - INFO: task worker:194936 blocked for more than 600 seconds"

is printed).

The tracing suggested at 

https://bugzilla.redhat.com/show_bug.cgi?id=1590222#c12

Would allow us to know if that is happening or not (that, kvm-vcpu process 
never sleeping due to non-sleeping "make -j" workload). 
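
As a complement to the tracing, the scheduling setup described above can be inspected from the host. The sketch below filters the output of procps "ps -eLo tid,cls,rtprio,psr,comm" for SCHED_FIFO threads (class "FF") on one pCPU. The function name is hypothetical, and note that the psr column is the CPU a thread last ran on, not its affinity mask, so this is only a rough check.

```shell
# Read `ps -eLo tid,cls,rtprio,psr,comm` output on stdin and print
# "tid rtprio name" for every SCHED_FIFO thread that last ran on CPU $1.
# Thread names may contain spaces, so fields 5..NF are joined back together.
fifo_threads_on_cpu() {
    local cpu="$1"
    awk -v cpu="$cpu" '
        NR > 1 && $2 == "FF" && $4 == cpu {
            name = $5
            for (i = 6; i <= NF; i++) name = name " " $i
            print $1, $3, name
        }'
}

# Example: list FIFO threads on pCPU 3.
#   ps -eLo tid,cls,rtprio,psr,comm | fifo_threads_on_cpu 3
```

If a vCPU thread at priority 1 shows up on the pCPU together with lower-priority or SCHED_OTHER worker threads that never get to run, that would match the starvation scenario described in this comment.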



Comment 21 Scott Wood 2018-07-27 22:43:18 UTC
(In reply to Marcelo Tosatti from comment #20)
> Pei,
> 
> Yes, you do. The problem is as follows:
> 
> On the host, the kvm-vcpus (each one of them isolated to a single pcpu), 
> have scheduling priority FIFO:1. 
> 
> This means that a kvm-vcpu will run until:
> 
> 1) A higher -RT priority process is scheduled on that pcpu.
> 2) It sleeps.
> 
> Depending on the workload in the kvm-vcpu, the kvm-vcpu might never 
> sleep, not allowing the worker threads on that pcpu (which have a lower
> priority than the kvm-vcpu process), to run. Which would explain why a
> certain
> process is unable to execute for more than 600 seconds (the warning you see
> when
> the message
> 
> "Bug 1590222 - INFO: task worker:194936 blocked for more than 600 seconds"
> 
> is printed).

Wouldn't RT throttling kick in in that case?

Comment 22 Marcelo Tosatti 2018-07-28 17:57:46 UTC
(In reply to Scott Wood from comment #21)
> Wouldn't RT throttling kick in in that case?

-RT throttling is disabled for KVM-RT, and you don't want to enable it:

So after some time investigating the suggestion to
use -RT throttling to solve the problem of
maintenance tasks that need to run on the isolated
CPUs, the following facts become clear:

Benefits of -RT throttling:

1) Applications do not need to be aware of schedule-in
/schedule-out points (schedule-out points chosen by
the operating system are a generic, application-independent
solution).

Disadvantages of -RT throttling:

1) The schedule out points are not controlled.

Say you have latency spikes on the application:

latency
   |
   |    /\
   |   /  \
   |  /    \
   | /      \______________
   |___________________ time

For example due to internal DPDK timer handling.
The schedule-out points, not being coordinated
to happen at specific points in time, can happen
exactly at those latency spikes, adding the amount
of time reserved for non-realtime applications to run
to the latency spike.

In comparison, the DPDK application can choose when
to schedule out: for example when the queues are
empty and when it's not about to execute an internal
timer.

2) runqueue lock contention.

A high resolution -RT unthrottle timer is necessary on the host side
(it must be armed to fire N us after the runqueue is throttled).
This timer must acquire the runqueue lock, which has
large maximum latency. From Daniel's message:

> I already have a prototype of the scheduler, and the results are very
> good so far. As the number of migration is limited to 1 per CPU
> (during
> a task's period), there is almost no runqueue lock contention! This
> helps to address the push/pull locking latency we see on large box (I
> can see a reduction from 70us to < 2us with the same workload - a very
> high work on the realtime1 box (with 24 cpus)).

I don't have the numbers handy, but IIRC I saw > 100us runqueue lock
contention previously.

Either way, those numbers are not in the acceptable range.

Given that guaranteeing "runqueue lock contention below X us"
is a requirement which upstream kernel development will not take into
account, I don't see any option other than the suggestion
for users to fix this in their application.

https://lwn.net/Articles/296419/

"Ingo's suggestion was to raise the limit to ten seconds of CPU time. As
he (and others) pointed out: any SCHED_FIFO application which needs to
monopolize the CPU for that long has serious problems and needs to be
fixed."

Unless someone has another idea to perform this at the operating
system level, without using the runqueue lock, or reducing
runqueue lock contention, I'll reply to the DPDK thread which
included an example of how applications should perform this.

Comment 23 Scott Wood 2018-07-29 18:27:43 UTC
(In reply to Marcelo Tosatti from comment #22)
> (In reply to Scott Wood from comment #21)
> > Wouldn't RT throttling kick in in that case?
> 
> -RT throttling is disabled for KVM-RT, and you don't want to enable it:
> 
> So after some time investigating the suggestion to
> use -RT throttling to solve the problem of
> maintenance tasks that need to run on the isolated
> CPUs, the following facts become clear:

I wasn't suggesting it as a solution, just trying to understand the cause of the hangs that have been reported.  What is it that automatically disables RT throttling?  It's enabled on my system even after installing kernel-rt-kvm.

I tried recreating the load in bug 1590222, and just reproduced a somewhat similar hang that left a bunch of processes stuck in the D state even after all loads were killed.  I'll look more deeply into it tomorrow.

Comment 24 Luis Claudio R. Goncalves 2018-07-30 14:05:51 UTC
(In reply to Marcelo Tosatti from comment #22)

It is important to keep in mind that RT Throttling default behavior is not exactly what one would expect:

The RT_RUNTIME_SHARE sched feature is enabled by default and allows a CPU about to suffer RT throttling to borrow RT bandwidth from other CPUs. In a system where not all the CPUs are currently running RT tasks, the CPU running an RT task could keep this task running for longer than the RT throttling threshold. Depending on the load distribution in the system, this RT task could run for minutes without suffering the RT throttle. So, the expected behavior is only ensured if RT_RUNTIME_SHARE is explicitly disabled.

Also, Daniel Bristot introduced a sched feature called RT_RUNTIME_GREED that will only apply RT throttling when there are SCHED_OTHER (non-RT) tasks waiting for CPU time, even if the RT throttling threshold has been reached. RT_RUNTIME_GREED minimizes the RT throttling impact on performance and latency as much as possible.

More detailed notes:

- RT_RUNTIME_SHARE: https://bugzilla.redhat.com/show_bug.cgi?id=1459275#c39

- RT_RUNTIME_GREED: https://bugzilla.redhat.com/show_bug.cgi?id=1491722#c6
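
For anyone checking their own system, the state discussed above can be read from the usual kernel interfaces. This is a sketch, not a supported tool: the two function names are hypothetical, the sysctl files are /proc/sys/kernel/sched_rt_runtime_us and /proc/sys/kernel/sched_rt_period_us (a runtime of -1 means throttling is off), and the feature list is assumed to be readable from /sys/kernel/debug/sched_features, which needs debugfs mounted and root access. Disabled features appear there with a NO_ prefix.

```shell
# Describe RT throttling from the two sysctl values.
# $1 = sched_rt_runtime_us, $2 = sched_rt_period_us
rt_throttle_status() {
    local runtime="$1"
    local period="$2"
    if [ "$runtime" -lt 0 ]; then
        echo "disabled"
    else
        echo "enabled: RT tasks limited to ${runtime}us per ${period}us"
    fi
}

# Report whether a sched feature is on or off.
# $1 = feature name, $2 = contents of the sched_features file
feature_state() {
    local name="$1"
    local word
    for word in $2; do
        case "$word" in
            "$name")    echo "on";  return 0 ;;
            "NO_$name") echo "off"; return 0 ;;
        esac
    done
    echo "unknown"
    return 1
}

# Example (run as root on the host):
#   rt_throttle_status "$(cat /proc/sys/kernel/sched_rt_runtime_us)" \
#                      "$(cat /proc/sys/kernel/sched_rt_period_us)"
#   feature_state RT_RUNTIME_SHARE "$(cat /sys/kernel/debug/sched_features)"
```

On a kernel that does not carry the RT_RUNTIME_GREED patch, feature_state reports "unknown" for it.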

Comment 25 Marcelo Tosatti 2018-07-30 19:33:39 UTC
(In reply to Scott Wood from comment #23)
> (In reply to Marcelo Tosatti from comment #22)
> > (In reply to Scott Wood from comment #21)
> > > Wouldn't RT throttling kick in in that case?
> > 
> > -RT throttling is disabled for KVM-RT, and you don't want to enable it:
> > 
> > So after some time investigating the suggestion to
> > use -RT throttling to solve the problem of
> > maintenance tasks that need to run on the isolated
> > CPUs, the following facts become clear:
> 
> I wasn't suggesting it as a solution, just trying to understand the cause of
> the hangs that have been reported.  What is it that automatically disables
> RT throttling?  It's enabled on my system even after installing
> kernel-rt-kvm.
> 

realtime-virtual-host tuned profile.

> I tried recreating the load in bug 1590222, and just reproduced a somewhat
> similar hang that left a bunch of processes stuck in the D state even after
> all loads were killed.  I'll look more deeply into it tomorrow.

That would be very helpful. Do you have a reproducer?

Comment 29 Beth Uptagrafft 2018-09-17 15:25:37 UTC
*** Bug 1590222 has been marked as a duplicate of this bug. ***

Comment 31 Pei Zhang 2018-10-17 08:01:01 UTC
==Verification==

The regular 24-hour KVM-RT queuelat tests all pass, so QE considers this bug fixed.


Versions:
qemu-kvm-rhev-2.12.0-18.el7.x86_64
kernel-rt-3.10.0-957.rt56.910.el7.x86_64
tuned-2.10.0-6.el7.noarch
libvirt-4.5.0-10.el7.x86_64


Tests:
(1) Single VM with 1 rt vCPU: 24 hours queuelat testing Pass
(2) Single VM with 8 rt vCPUs: 24 hours queuelat testing Pass
(3) Multiple VMs each with 1 rt vCPU: 24 hours queuelat testing Pass

Moving this bug to 'VERIFIED'. Please correct me if I've made any mistakes. Thanks.


Note: The XFS hang mentioned in the comments above still exists; the bug below is tracking it.

Bug 1590222 - INFO: task worker:194936 blocked for more than 600 seconds

Comment 33 errata-xmlrpc 2018-10-30 09:40:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3096

