Bug 451157 - [Stratus 5.3][2/2] ttyS1 lost interrupt and it stops transmitting
Summary: [Stratus 5.3][2/2] ttyS1 lost interrupt and it stops transmitting
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Brian Maly
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On: 440121
Blocks: 443071 455256
 
Reported: 2008-06-13 01:53 UTC by Brian Maly
Modified: 2018-10-20 00:33 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-20 19:34:28 UTC
Target Upstream Version:


Attachments (Terms of Use)
patch 2/2 - locking re-ordered continued from bug 440121 (deleted)
2008-06-17 17:40 UTC, Andrius Benokraitis


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2009:0225 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update 2009-01-20 16:06:24 UTC

Description Brian Maly 2008-06-13 01:53:03 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.0.1) Gecko/20060313 Fedora/1.5.0.1-9 Firefox/1.5.0.1 pango-text

Description of problem:

Writes to /dev/ttyS1 frequently stall.  Analysis shows the 8250 driver
incorrectly detects the device as a 16550A with the UART_BUG_TXEN bug.  This
bug is exposed only on SMP systems.  When we boot with maxcpus=1 the problem
does not occur.

When this situation occurs, the tty_struct does not have the stopped flags set
and the uart_info->xmit buffer is not empty.  If interrupts were occurring the
data should be sent to the UART.  Because the data is not being sent, it seems a
"transmitter holding register empty" interrupt (THRI) is getting lost and
therefore outgoing data stops.

This kind of errant behaviour was discussed in June 2006 on LKML under the topic
UART_BUG_TXEN.  However I'm not sure the patches then proposed handled all the
issues.  In particular, there are SMP races between the 8250 interrupt service
routine (ISR) and non-ISR code when they access the IIR register of the UART.

This has severe impact to Stratus since the serial ports are used for management.


Version-Release number of selected component (if applicable): 2.6.18-86.el5


How reproducible:
On SMP systems, ttyS1 seems to always be detected with the false positive
UART_BUG_TXEN.  When the false positive occurs, the UART eventually locks up and
it stops outputting more data.

Steps to Reproduce:
1. Boot kernel with 16550A devices present.
2. Use chat to send initialization strings repeatedly to the modem connected to
ttyS1 -or- login interactively to ttyS1
3.
  
Actual results:
Data flow out from ttyS1 hangs.

Expected results:
Data flow should not hang.

Additional info:

My analysis of this problem follows.

How UART_BUG_TXEN gets set due to a false positive on SMP systems ---
The UART is initialized by the function serial8250_startup() in 8250.c.  At line
1755 the call to serial_link_irq_chain(up) connects the IRQ to the ISR in this
driver.  It is relevant that the ISR reads the IIR before it tries to acquire
the up->port.lock spinlock.  Reading the IIR clears THRI if THRI is the
interrupt cause, which breaks the detection logic that comes a few lines later
in serial8250_startup().  Line 1776 is the last step necessary for the ISR to be
entered.
1776 serial8250_set_mctrl(&up->port, up->port.mctrl);
1777 
1778 /*
1779  * Do a quick test to see if we receive an
1780  * interrupt when we enable the TX irq.
1781  */
1782 serial_outp(up, UART_IER, UART_IER_THRI);
1783 lsr = serial_in(up, UART_LSR);
1784 iir = serial_in(up, UART_IIR);
1785 serial_outp(up, UART_IER, 0);
1786 
1787 if (lsr & UART_LSR_TEMT && iir & UART_IIR_NO_INT) {
1788     if (!(up->bugs & UART_BUG_TXEN)) {
1789         up->bugs |= UART_BUG_TXEN;
1790         pr_debug("ttyS%d - enabling bad tx status workarounds\n",
1791                  port->line);
1792     }
1793 } else {
1794     up->bugs &= ~UART_BUG_TXEN;
1795 }
1796 
1797 spin_unlock_irqrestore(&up->port.lock, flags);
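The problem is that line 1782 causes an interrupt, the ISR is entered on
another processor, and the ISR reads the IIR before the IIR is read on line
1784.  The read on line 1784 then returns UART_IIR_NO_INT, so the test on line
1787 succeeds and UART_BUG_TXEN is set even though the UART is healthy.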
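
The interleaving described above can be replayed in a small user-space model.
This is only a sketch under stated assumptions: struct toy_uart and the toy_*
functions are hypothetical names invented for illustration, not driver code,
and the model assumes a healthy, idle 16550A whose pending THRI is cleared by
a read of the IIR.  The register bit values match the kernel's serial_reg.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values as defined in the kernel's include/linux/serial_reg.h. */
#define UART_IER_THRI   0x02 /* enable transmitter holding register int. */
#define UART_IIR_NO_INT 0x01 /* no interrupts pending */
#define UART_IIR_THRI   0x02 /* transmitter holding register empty */
#define UART_LSR_TEMT   0x40 /* transmitter empty */

/* Hypothetical toy model of a healthy, idle 16550A (not a driver structure). */
struct toy_uart {
    unsigned char ier;
    bool thri_pending;
};

/* Setting THRI in IER on an idle transmitter raises the interrupt;
 * clearing it withdraws the interrupt. */
static void toy_write_ier(struct toy_uart *u, unsigned char val)
{
    u->thri_pending = (val & UART_IER_THRI) != 0;
    u->ier = val;
}

/* Reading IIR reports a pending THRI once and clears it as a side effect. */
static unsigned char toy_read_iir(struct toy_uart *u)
{
    if (u->thri_pending) {
        u->thri_pending = false;
        return UART_IIR_THRI;
    }
    return UART_IIR_NO_INT;
}

/*
 * Replays the quick test (lines 1782-1795 quoted above).  When
 * isr_on_other_cpu is true, an ISR read of IIR is interleaved between
 * lines 1782 and 1784.  Returns true when the test would set UART_BUG_TXEN.
 */
bool toy_probe_sets_bug_txen(bool isr_on_other_cpu)
{
    struct toy_uart u = { 0, false };
    unsigned char lsr, iir;

    toy_write_ier(&u, UART_IER_THRI);      /* line 1782 */
    if (isr_on_other_cpu)
        (void)toy_read_iir(&u);            /* ISR reads IIR first */
    lsr = UART_LSR_TEMT;                   /* line 1783: transmitter idle */
    iir = toy_read_iir(&u);                /* line 1784 */
    toy_write_ier(&u, 0);                  /* line 1785 */
    assert(!u.thri_pending);               /* model invariant: THRI withdrawn */

    return (lsr & UART_LSR_TEMT) && (iir & UART_IIR_NO_INT);
}
```

In this model toy_probe_sets_bug_txen(false) returns false (the probe sees
THRI pending), while toy_probe_sets_bug_txen(true) returns true, which is the
false positive.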

How incorrectly detecting UART_BUG_TXEN causes output to stall ---
When usermode has more characters for the UART to transmit, the characters are
placed into the uart_info->xmit circular buffer and serial8250_start_tx() gets
called.  In that function we flow through the following code path:
1148     struct uart_8250_port *up = (struct uart_8250_port *)port;
1149 
1150     if (!(up->ier & UART_IER_THRI)) {
1151         up->ier |= UART_IER_THRI;
1152         serial_out(up, UART_IER, up->ier);
1153 
1154         if (up->bugs & UART_BUG_TXEN) {
1155             unsigned char lsr, iir;
1156             lsr = serial_in(up, UART_LSR);
1157             iir = serial_in(up, UART_IIR);
1158             if (lsr & UART_LSR_TEMT && iir & UART_IIR_NO_INT)
1159                 transmit_chars(up);
1160         }
1161     }
On NON-buggy UARTs line 1152 causes a THRI interrupt request.  If the IIR has
not been read by the ISR by the time it is read on line 1157, the value read by
line 1157 can indicate that THRI is pending; in this case, reading the IIR
clears the THRI status.  This IIR value does not contain the UART_IIR_NO_INT
bit, so line 1159 is bypassed and no characters are sent to the transmitter in
the UART, yet the interrupt cause is cleared.  Subsequent calls to this routine
do nothing because (up->ier & UART_IER_THRI) is already true.  This causes
output stalls.
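
The stall can be illustrated with a second user-space model.  Again this is a
sketch, not driver code: struct toy_port and the toy_* functions are
hypothetical, the transmit step is simplified to drain the whole buffer and
then disable THRI, and the model assumes the read on line 1157 always wins
the race against the ISR.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values as defined in the kernel's include/linux/serial_reg.h. */
#define UART_IER_THRI   0x02
#define UART_IIR_NO_INT 0x01
#define UART_IIR_THRI   0x02
#define UART_LSR_TEMT   0x40

/* Hypothetical toy model of a port backed by a healthy 16550A. */
struct toy_port {
    unsigned char ier;   /* models up->ier */
    bool thri_pending;   /* THRI currently asserted by the UART */
    bool bug_txen;       /* models up->bugs & UART_BUG_TXEN */
    int queued;          /* bytes waiting in the xmit buffer */
    int sent;            /* bytes handed to the transmitter */
};

/* Reading IIR reports a pending THRI once and clears it as a side effect. */
static unsigned char toy_read_iir(struct toy_port *p)
{
    if (p->thri_pending) {
        p->thri_pending = false;
        return UART_IIR_THRI;
    }
    return UART_IIR_NO_INT;
}

/* Simplified transmit step: drain the buffer, then disable THRI. */
static void toy_transmit_chars(struct toy_port *p)
{
    p->sent += p->queued;
    p->queued = 0;
    p->ier &= (unsigned char)~UART_IER_THRI;
}

/* Mirrors lines 1150-1161 of serial8250_start_tx() quoted above. */
static void toy_start_tx(struct toy_port *p)
{
    if (!(p->ier & UART_IER_THRI)) {
        p->ier |= UART_IER_THRI;
        p->thri_pending = true;                    /* line 1152 raises THRI */
        if (p->bug_txen) {
            unsigned char lsr = UART_LSR_TEMT;     /* line 1156 */
            unsigned char iir = toy_read_iir(p);   /* line 1157 clears THRI */
            if ((lsr & UART_LSR_TEMT) && (iir & UART_IIR_NO_INT))
                toy_transmit_chars(p);             /* line 1159 */
        }
    }
}

/* The ISR only runs, and only transmits, while THRI is still pending. */
static void toy_deliver_irq(struct toy_port *p)
{
    if (p->thri_pending && !(toy_read_iir(p) & UART_IIR_NO_INT))
        toy_transmit_chars(p);
}

/* Queue 4 bytes twice, calling start_tx each time; return bytes sent. */
int toy_bytes_sent(bool bug_txen)
{
    struct toy_port p = { 0, false, bug_txen, 0, 0 };
    int i;

    for (i = 0; i < 2; i++) {
        p.queued += 4;
        toy_start_tx(&p);
        toy_deliver_irq(&p);
    }
    assert(p.sent + p.queued == 8);   /* bytes are never lost, only stuck */
    return p.sent;
}
```

With the flag clear, toy_bytes_sent(false) returns 8.  With the false
positive, toy_bytes_sent(true) returns 0: the first call consumes the THRI
without transmitting, and the second call is skipped because THRI is already
set in the IER copy.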

Proposed fix ---
The attached patch blocks the UART from asserting its IRQ during the quick test
in serial8250_startup previously discussed.  Our tests show this eliminates the
problem on Stratus SMP systems.  (We have not yet duplicated the original
problem on a non-Stratus system.)
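
The effect of blocking the IRQ during the quick test can be pictured with the
same kind of user-space model.  This does not reproduce the attached patch;
it is a sketch in which the irq_masked flag is a hypothetical stand-in for
whatever mechanism keeps the ISR from running, and all toy_* names are
invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values as defined in the kernel's include/linux/serial_reg.h. */
#define UART_IER_THRI   0x02
#define UART_IIR_NO_INT 0x01
#define UART_IIR_THRI   0x02
#define UART_LSR_TEMT   0x40

/* Hypothetical toy model of a healthy, idle 16550A plus its IRQ line. */
struct toy_uart {
    unsigned char ier;
    bool thri_pending;
    bool irq_masked;     /* hypothetical stand-in for "IRQ blocked" */
};

/* Setting THRI in IER raises the interrupt; clearing it withdraws it. */
static void toy_write_ier(struct toy_uart *u, unsigned char val)
{
    u->thri_pending = (val & UART_IER_THRI) != 0;
    u->ier = val;
}

/* Reading IIR reports a pending THRI once and clears it as a side effect. */
static unsigned char toy_read_iir(struct toy_uart *u)
{
    if (u->thri_pending) {
        u->thri_pending = false;
        return UART_IIR_THRI;
    }
    return UART_IIR_NO_INT;
}

/* ISR on another CPU: it can only run if the IRQ actually reaches a CPU. */
static void toy_isr_on_other_cpu(struct toy_uart *u)
{
    if (u->thri_pending && !u->irq_masked)
        (void)toy_read_iir(u);
}

/* Quick test under the worst-case interleaving, with or without blocking.
 * Returns true when the test would set UART_BUG_TXEN. */
bool toy_probe_sets_bug_txen(bool block_irq)
{
    struct toy_uart u = { 0, false, false };
    unsigned char lsr, iir;

    u.irq_masked = block_irq;              /* block the IRQ before the test */
    toy_write_ier(&u, UART_IER_THRI);      /* line 1782 */
    toy_isr_on_other_cpu(&u);              /* ISR tries to get in first */
    lsr = UART_LSR_TEMT;                   /* line 1783: transmitter idle */
    iir = toy_read_iir(&u);                /* line 1784 */
    toy_write_ier(&u, 0);                  /* line 1785 */
    u.irq_masked = false;                  /* unblock once IER is cleared */
    assert(!u.thri_pending);               /* nothing left pending */

    return (lsr & UART_LSR_TEMT) && (iir & UART_IIR_NO_INT);
}
```

With the IRQ blocked, toy_probe_sets_bug_txen(true) returns false because the
probe itself is the first reader of the IIR.  Without blocking,
toy_probe_sets_bug_txen(false) returns true, the false positive.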







Version-Release number of selected component (if applicable):
5.3

How reproducible:
Sometimes



Comment 3 Andrius Benokraitis 2008-06-17 17:40:21 UTC
Created attachment 309643 [details]
patch 2/2 - locking re-ordered continued from bug 440121

Lock reordering patch continued from bug 440121 already committed to the 5.3
tree.

Comment 4 RHEL Product and Program Management 2008-06-17 17:51:16 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 RHEL Product and Program Management 2008-06-17 18:21:02 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 12 Don Zickus 2008-07-11 16:03:58 UTC
in kernel-2.6.18-96.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 14 Robert N. Evans 2008-07-21 19:28:32 UTC
Testing at Stratus has verified that the problem we reported was fixed in
kernel-2.6.18-96.el5.

Comment 15 Alan Cox 2008-07-22 13:14:26 UTC
Upstream activity: There are some further problems showing up with the way the
probing is done. They will want backporting at some point (certainly for any rt
patched system). The patch we have now should still go out as is however as it
is an improvement and the further fixes are _not_ for any regressions from
previous releases.


Comment 18 Chris Ward 2008-10-21 13:02:08 UTC
Attention Partners! 

RHEL 5.3 public Beta will be released soon. This URGENT priority/severity bug should have a fix in place in the recently released Partner Alpha drop, available at ftp://partners.redhat.com. If you haven't had a chance yet to test this bug, please do so at your earliest convenience, to ensure the highest possible quality bits in the upcoming Beta drop.

Thanks, more information about Beta testing to come.

 - Red Hat QE Partner Management

Comment 19 Robert N. Evans 2008-10-22 18:24:17 UTC
Stratus did a 24hr reboot test to verify this fix in the 5.3 Alpha (kernel 2.6.18-118.el5) using x86_64 architecture.  The system was rebooted 316 times and false positive detection of UART_BUG_TXEN never occurred.  This confirms the fix is working correctly in the tested kernel.

Comment 21 errata-xmlrpc 2009-01-20 19:34:28 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html

Comment 22 Markus Armbruster 2009-04-15 12:46:04 UTC
*** Bug 282231 has been marked as a duplicate of this bug. ***

