* Fw: [Bug 14749] New: Kernel locks up after a few minutes of heavy surfing
@ 2009-12-07 17:01 Stephen Hemminger
  2009-12-07 17:05 ` Stephen Hemminger
  2009-12-07 22:32 ` [RFC, PATCH] net: sock_queue_err_skb() and sk_forward_alloc corruption Eric Dumazet
  0 siblings, 2 replies; 4+ messages in thread
From: Stephen Hemminger @ 2009-12-07 17:01 UTC
  To: netdev



Begin forwarded message:

Date: Sun, 6 Dec 2009 13:40:19 GMT
From: bugzilla-daemon@bugzilla.kernel.org
To: shemminger@linux-foundation.org
Subject: [Bug 14749] New: Kernel locks up after a few minutes of heavy surfing


http://bugzilla.kernel.org/show_bug.cgi?id=14749

           Summary: Kernel locks up after a few minutes of heavy surfing
           Product: Networking
           Version: 2.5
    Kernel Version: 2.6.31.6
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: high
          Priority: P1
         Component: IPV4
        AssignedTo: shemminger@linux-foundation.org
        ReportedBy: rankincj@yahoo.com
        Regression: Yes


Created an attachment (id=24049)
 --> (http://bugzilla.kernel.org/attachment.cgi?id=24049)
Warnings found in kernel, relating to network corruption.

This bug is new as of 2.6.31.x kernels. After a short period of heavy surfing
(e.g. lots of tabs open in Firefox), the kernel will suddenly stop responding.
Nothing is written to the serial console, and the machine stops responding to
pings. My only clue so far has been a warning which I found once in my dmesg
log (attached).

I have already tried manually applying this patch from the upcoming -stable
queue:

net-fix-sk_forward_alloc-corruption.patch

to no effect.

I am currently switching back to Fedora's 2.6.31.6-145.fc12.i686 kernel to see
if it is more stable. (I cannot trust 2.6.31.6 any more.)

-- 
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.


-- 


* Re: [Bug 14749] New: Kernel locks up after a few minutes of heavy surfing
  2009-12-07 17:01 Fw: [Bug 14749] New: Kernel locks up after a few minutes of heavy surfing Stephen Hemminger
@ 2009-12-07 17:05 ` Stephen Hemminger
  2009-12-07 22:32 ` [RFC, PATCH] net: sock_queue_err_skb() and sk_forward_alloc corruption Eric Dumazet
  1 sibling, 0 replies; 4+ messages in thread
From: Stephen Hemminger @ 2009-12-07 17:05 UTC
  To: Stephen Hemminger; +Cc: netdev

On Mon, 7 Dec 2009 09:01:54 -0800
Stephen Hemminger <shemminger@vyatta.com> wrote:

> 
> 
> Begin forwarded message:
> 
> Date: Sun, 6 Dec 2009 13:40:19 GMT
> From: bugzilla-daemon@bugzilla.kernel.org
> To: shemminger@linux-foundation.org
> Subject: [Bug 14749] New: Kernel locks up after a few minutes of heavy surfing
> 
> 
> http://bugzilla.kernel.org/show_bug.cgi?id=14749
> 
>            Summary: Kernel locks up after a few minutes of heavy surfing
>            Product: Networking
>            Version: 2.5
>     Kernel Version: 2.6.31.6
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: high
>           Priority: P1
>          Component: IPV4
>         AssignedTo: shemminger@linux-foundation.org
>         ReportedBy: rankincj@yahoo.com
>         Regression: Yes
> 
> 
> Created an attachment (id=24049)
>  --> (http://bugzilla.kernel.org/attachment.cgi?id=24049)
> Warnings found in kernel, relating to network corruption.
> 
> This bug is new as of 2.6.31.x kernels. After a short period of heavy surfing
> (e.g. lots of tabs open in Firefox), the kernel will suddenly stop responding.
> Nothing is written to the serial console, and the machine stops responding to
> pings. My only clue so far has been a warning which I found once in my dmesg
> log (attached).
> 
> I have already tried manually applying this patch from the upcoming -stable
> queue:
> 
> net-fix-sk_forward_alloc-corruption.patch
> 
> to no effect.
> 
> I am currently switching back to Fedora's 2.6.31.6-145.fc12.i686 kernel to see
> if it is more stable. (I cannot trust 2.6.31.6 any more.)
> 

Putting the attachment inline, since developers are then more likely to read it.

------------[ cut here ]------------
WARNING: at /home/chris/LINUX/linux-2.6.31/net/core/stream.c:202 inet_csk_destroy_sock+0x77/0xd3()
Hardware name: Precision WorkStation 650
Modules linked in: tun snd_seq_oss snd_seq_midi snd_seq_dummy fuse nfsd lockd auth_rpcgss exportfs sunrpc autofs4 af_packet ipt_LOG nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_LOG nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 p4_clockmod speedstep_lib binfmt_misc dm_mirror dm_region_hash dm_log dm_mod uinput snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_event snd_seq_midi_emul snd_emu10k1 snd_ac97_codec snd_usb_audio ac97_bus snd_seq snd_pcm snd_usb_lib snd_rawmidi snd_seq_device snd_timer firewire_ohci ppdev uvcvideo floppy firewire_core snd_page_alloc snd_util_mem snd_hwdep parport_pc pwc psmouse videodev parport v4l1_compat crc_itu_t pcspkr snd sg i2c_i801 serio_raw soundcore dcdbas ext3 jbd mbcache sr_mod cdrom sd_mod pata_acpi sata_sil uhci_hcd ata_piix libata scsi_mod ehci_hcd e1000 usbcore thermal button radeon intel_agp ttm drm agpgart i2c_algo_bit cfbcopyarea cfbimgblt cfbfillrect [last unloaded: processor]
Pid: 32056, comm: rpm Not tainted 2.6.31.6 #1
Call Trace:
[<c1023ba8>] ? warn_slowpath_common+0x5d/0x70
[<c1023bc6>] ? warn_slowpath_null+0xb/0xd
[<c11871ca>] ? inet_csk_destroy_sock+0x77/0xd3
[<c119188f>] ? tcp_rcv_state_process+0x81f/0x9e8
[<c11966c3>] ? tcp_v4_do_rcv+0x128/0x16d
[<c1196b0d>] ? tcp_v4_rcv+0x405/0x640
[<c118003e>] ? ip_local_deliver_finish+0xf3/0x1ab
[<c117fcd9>] ? ip_rcv_finish+0x2a9/0x2cf
[<c117fa30>] ? ip_rcv_finish+0x0/0x2cf
[<c116b7c5>] ? netif_receive_skb+0x261/0x281
[<f8527bfc>] ? e1000_clean_rx_irq+0x31c/0x3c3 [e1000]
[<f852a6fa>] ? e1000_clean+0x2a7/0x3f5 [e1000]
[<c11c783c>] ? _spin_unlock_irqrestore+0xe/0x21
[<c10354c0>] ? hrtimer_run_pending+0xd/0xa5
[<c11c769b>] ? _spin_lock_irq+0xe/0x24
[<c116bce5>] ? net_rx_action+0x57/0xfd
[<c1027ea3>] ? __do_softirq+0x7a/0xe3
[<c1027e29>] ? __do_softirq+0x0/0xe3
<IRQ> [<c1027c3c>] ? irq_exit+0x29/0x63
[<c1004320>] ? do_IRQ+0x7c/0x8d
[<c1002f29>] ? common_interrupt+0x29/0x30
---[ end trace e643d9455a26ccf3 ]---
------------[ cut here ]------------
WARNING: at /home/chris/LINUX/linux-2.6.31/net/ipv4/af_inet.c:151 inet_sock_destruct+0xd8/0x138()
Hardware name: Precision WorkStation 650
Modules linked in: tun snd_seq_oss snd_seq_midi snd_seq_dummy fuse nfsd lockd auth_rpcgss exportfs sunrpc autofs4 af_packet ipt_LOG nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_LOG nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 p4_clockmod speedstep_lib binfmt_misc dm_mirror dm_region_hash dm_log dm_mod uinput snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_event snd_seq_midi_emul snd_emu10k1 snd_ac97_codec snd_usb_audio ac97_bus snd_seq snd_pcm snd_usb_lib snd_rawmidi snd_seq_device snd_timer firewire_ohci ppdev uvcvideo floppy firewire_core snd_page_alloc snd_util_mem snd_hwdep parport_pc pwc psmouse videodev parport v4l1_compat crc_itu_t pcspkr snd sg i2c_i801 serio_raw soundcore dcdbas ext3 jbd mbcache sr_mod cdrom sd_mod pata_acpi sata_sil uhci_hcd ata_piix libata scsi_mod ehci_hcd e1000 usbcore thermal button radeon intel_agp ttm drm agpgart i2c_algo_bit cfbcopyarea cfbimgblt cfbfillrect [last unloaded: processor]
Pid: 32056, comm: rpm Tainted: G W 2.6.31.6 #1
Call Trace:
[<c1023ba8>] ? warn_slowpath_common+0x5d/0x70
[<c1023bc6>] ? warn_slowpath_null+0xb/0xd
[<c11a1414>] ? inet_sock_destruct+0xd8/0x138
[<c1163243>] ? __sk_free+0x10/0xa2
[<c1196b4a>] ? tcp_v4_rcv+0x442/0x640
[<c118003e>] ? ip_local_deliver_finish+0xf3/0x1ab
[<c117fcd9>] ? ip_rcv_finish+0x2a9/0x2cf
[<c117fa30>] ? ip_rcv_finish+0x0/0x2cf
[<c116b7c5>] ? netif_receive_skb+0x261/0x281
[<f8527bfc>] ? e1000_clean_rx_irq+0x31c/0x3c3 [e1000]
[<f852a6fa>] ? e1000_clean+0x2a7/0x3f5 [e1000]
[<c11c783c>] ? _spin_unlock_irqrestore+0xe/0x21
[<c10354c0>] ? hrtimer_run_pending+0xd/0xa5
[<c11c769b>] ? _spin_lock_irq+0xe/0x24
[<c116bce5>] ? net_rx_action+0x57/0xfd
[<c1027ea3>] ? __do_softirq+0x7a/0xe3
[<c1027e29>] ? __do_softirq+0x0/0xe3
<IRQ> [<c1027c3c>] ? irq_exit+0x29/0x63
[<c1004320>] ? do_IRQ+0x7c/0x8d
[<c1002f29>] ? common_interrupt+0x29/0x30
---[ end trace e643d9455a26ccf4 ]---
------------[ cut here ]------------
WARNING: at /home/chris/LINUX/linux-2.6.31/net/ipv4/af_inet.c:154 inet_sock_destruct+0x11e/0x138()
Hardware name: Precision WorkStation 650
Modules linked in: tun snd_seq_oss snd_seq_midi snd_seq_dummy fuse nfsd lockd auth_rpcgss exportfs sunrpc autofs4 af_packet ipt_LOG nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_LOG nf_conntrack_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 p4_clockmod speedstep_lib binfmt_misc dm_mirror dm_region_hash dm_log dm_mod uinput snd_emu10k1_synth snd_emux_synth snd_seq_virmidi snd_seq_midi_event snd_seq_midi_emul snd_emu10k1 snd_ac97_codec snd_usb_audio ac97_bus snd_seq snd_pcm snd_usb_lib snd_rawmidi snd_seq_device snd_timer firewire_ohci ppdev uvcvideo floppy firewire_core snd_page_alloc snd_util_mem snd_hwdep parport_pc pwc psmouse videodev parport v4l1_compat crc_itu_t pcspkr snd sg i2c_i801 serio_raw soundcore dcdbas ext3 jbd mbcache sr_mod cdrom sd_mod pata_acpi sata_sil uhci_hcd ata_piix libata scsi_mod ehci_hcd e1000 usbcore thermal button radeon intel_agp ttm drm agpgart i2c_algo_bit cfbcopyarea cfbimgblt cfbfillrect [last unloaded: processor]
Pid: 32056, comm: rpm Tainted: G W 2.6.31.6 #1
Call Trace:
[<c1023ba8>] ? warn_slowpath_common+0x5d/0x70
[<c1023bc6>] ? warn_slowpath_null+0xb/0xd
[<c11a145a>] ? inet_sock_destruct+0x11e/0x138
[<c1163243>] ? __sk_free+0x10/0xa2
[<c1196b4a>] ? tcp_v4_rcv+0x442/0x640
[<c118003e>] ? ip_local_deliver_finish+0xf3/0x1ab
[<c117fcd9>] ? ip_rcv_finish+0x2a9/0x2cf
[<c117fa30>] ? ip_rcv_finish+0x0/0x2cf
[<c116b7c5>] ? netif_receive_skb+0x261/0x281
[<f8527bfc>] ? e1000_clean_rx_irq+0x31c/0x3c3 [e1000]
[<f852a6fa>] ? e1000_clean+0x2a7/0x3f5 [e1000]
[<c11c783c>] ? _spin_unlock_irqrestore+0xe/0x21
[<c10354c0>] ? hrtimer_run_pending+0xd/0xa5
[<c11c769b>] ? _spin_lock_irq+0xe/0x24
[<c116bce5>] ? net_rx_action+0x57/0xfd
[<c1027ea3>] ? __do_softirq+0x7a/0xe3
[<c1027e29>] ? __do_softirq+0x0/0xe3
<IRQ> [<c1027c3c>] ? irq_exit+0x29/0x63
[<c1004320>] ? do_IRQ+0x7c/0x8d
[<c1002f29>] ? common_interrupt+0x29/0x30
---[ end trace e643d9455a26ccf5 ]---


-- 


* [RFC, PATCH] net: sock_queue_err_skb() and sk_forward_alloc corruption
  2009-12-07 17:01 Fw: [Bug 14749] New: Kernel locks up after a few minutes of heavy surfing Stephen Hemminger
  2009-12-07 17:05 ` Stephen Hemminger
@ 2009-12-07 22:32 ` Eric Dumazet
  2009-12-26  1:29   ` David Miller
  1 sibling, 1 reply; 4+ messages in thread
From: Eric Dumazet @ 2009-12-07 22:32 UTC
  To: Stephen Hemminger, David S. Miller; +Cc: netdev

While investigating sk_forward_alloc corruption, I found two problems:

1) skb_tstamp_tx() is calling sock_queue_err_skb().

This is not safe as is, because the sock lock must be held
before calling sock_queue_err_skb().

The problem is that skb_tstamp_tx() won't be able to lock the sock...

skb_tstamp_tx() ->
	sock_queue_err_skb() ->
		sk_mem_charge(sk, skb->truesize) -> // PROBLEM :
			sk->sk_forward_alloc -= size; // MUST BE PROTECTED


2) UDP (again) sk_forward_alloc corruption 

__udp4_lib_err ->
	if (inet->recverr)
		ip_icmp_error() ->
			sock_queue_err_skb() // PROBLEM
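
Both call chains above end in a plain read-modify-write of
sk_forward_alloc performed without the sock lock. As a rough
illustration only, the userspace sketch below writes one bad
interleaving out by hand so that the result is deterministic.
struct model_sock, charge_serialized() and charge_interleaved() are
hypothetical names made up for this note; none of this is kernel code.

```c
#include <assert.h>

/* Hypothetical userspace stand-in; only the field under discussion is modeled. */
struct model_sock {
	int forward_alloc;	/* plain int, like sk_forward_alloc: no atomics */
};

/* Result when both charges are serialized, as they would be under the sock lock. */
int charge_serialized(int start, int size_a, int size_b)
{
	struct model_sock sk = { .forward_alloc = start };

	assert(size_a >= 0 && size_b >= 0);
	sk.forward_alloc -= size_a;
	sk.forward_alloc -= size_b;
	return sk.forward_alloc;
}

/*
 * Result when an unlocked path B loads the field before the lock holder A
 * has stored its own result. "x -= size" is a load, a subtract and a store,
 * so the steps are spelled out one by one.
 */
int charge_interleaved(int start, int size_a, int size_b)
{
	struct model_sock sk = { .forward_alloc = start };
	int seen_by_a, seen_by_b;

	assert(size_a >= 0 && size_b >= 0);
	seen_by_a = sk.forward_alloc;		/* A loads */
	seen_by_b = sk.forward_alloc;		/* B loads the same value */
	sk.forward_alloc = seen_by_a - size_a;	/* A stores */
	sk.forward_alloc = seen_by_b - size_b;	/* B stores, overwriting A's charge */
	return sk.forward_alloc;
}
```

Starting from 4096 with charges of 512 and 256, the serialized helper
returns 3328 while the interleaved one returns 3840: the first charge
has silently disappeared, which is the kind of drift the WARNINGs in
the bug report complain about when the socket is destroyed.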


Oh well...

I wonder if we could use a special version of skb_set_owner_r()/sock_rfree()
*without* sk_mem_charge()/sk_mem_uncharge() calls for this error queue.

(We don't call sk_rmem_schedule() anyway, so I guess the current usage is not
correct, even with the sock locked?)

Something like this (untested but compiled) patch?

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sock.h |   11 ++++++++++-
 net/core/sock.c    |    8 ++++++++
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 3f1a480..76277ce 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -961,6 +961,7 @@ extern struct sk_buff		*sock_rmalloc(struct sock *sk,
 					      gfp_t priority);
 extern void			sock_wfree(struct sk_buff *skb);
 extern void			sock_rfree(struct sk_buff *skb);
+extern void			sock_nocharge_rfree(struct sk_buff *skb);
 
 extern int			sock_setsockopt(struct socket *sock, int level,
 						int op, char __user *optval,
@@ -1383,6 +1384,14 @@ static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
 	sk_mem_charge(sk, skb->truesize);
 }
 
+static inline void skb_set_owner_nocharge_r(struct sk_buff *skb, struct sock *sk)
+{
+	skb_orphan(skb);
+	skb->sk = sk;
+	skb->destructor = sock_nocharge_rfree;
+	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
+}
+
 extern void sk_reset_timer(struct sock *sk, struct timer_list* timer,
 			   unsigned long expires);
 
@@ -1398,7 +1407,7 @@ static inline int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb)
 	if (atomic_read(&sk->sk_rmem_alloc) + skb->truesize >=
 	    (unsigned)sk->sk_rcvbuf)
 		return -ENOMEM;
-	skb_set_owner_r(skb, sk);
+	skb_set_owner_nocharge_r(skb, sk);
 	skb_queue_tail(&sk->sk_error_queue, skb);
 	if (!sock_flag(sk, SOCK_DEAD))
 		sk->sk_data_ready(sk, skb->len);
diff --git a/net/core/sock.c b/net/core/sock.c
index 76ff58d..181a39a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1284,6 +1284,14 @@ void sock_rfree(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(sock_rfree);
 
+void sock_nocharge_rfree(struct sk_buff *skb)
+{
+	struct sock *sk = skb->sk;
+
+	atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
+}
+EXPORT_SYMBOL(sock_nocharge_rfree);
+
 
 int sock_i_uid(struct sock *sk)
 {
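
The accounting change in the patch above can be tried out in
userspace. The following is a minimal sketch, assuming C11 atomics as
a stand-in for the kernel's atomic_t; struct model_sock, struct
model_skb and the model_* functions are hypothetical names that only
mirror the shape of skb_set_owner_nocharge_r() and
sock_nocharge_rfree().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct model_sock {
	atomic_int rmem_alloc;	/* stands in for sk_rmem_alloc (atomic) */
	int forward_alloc;	/* stands in for sk_forward_alloc (needs the lock) */
};

struct model_skb {
	struct model_sock *sk;
	unsigned int truesize;
	void (*destructor)(struct model_skb *skb);
};

/* Same shape as sock_nocharge_rfree(): undo the atomic accounting only. */
void model_nocharge_rfree(struct model_skb *skb)
{
	atomic_fetch_sub(&skb->sk->rmem_alloc, (int)skb->truesize);
}

/* Same shape as skb_set_owner_nocharge_r(): no sk_mem_charge() equivalent. */
void model_set_owner_nocharge_r(struct model_skb *skb, struct model_sock *sk)
{
	assert(skb->destructor == NULL);	/* the model has no skb_orphan() */
	skb->sk = sk;
	skb->destructor = model_nocharge_rfree;
	atomic_fetch_add(&sk->rmem_alloc, (int)skb->truesize);
}

/* Stand-in for freeing the skb: run the destructor, then drop ownership. */
void model_free_skb(struct model_skb *skb)
{
	if (skb->destructor)
		skb->destructor(skb);
	skb->destructor = NULL;
	skb->sk = NULL;
}
```

Queueing and then freeing a model skb moves rmem_alloc up and back
down by truesize while forward_alloc is never read or written, so in
this model the error-queue path has nothing left that needs the sock
lock.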


* Re: [RFC, PATCH] net: sock_queue_err_skb() and sk_forward_alloc corruption
  2009-12-07 22:32 ` [RFC, PATCH] net: sock_queue_err_skb() and sk_forward_alloc corruption Eric Dumazet
@ 2009-12-26  1:29   ` David Miller
  0 siblings, 0 replies; 4+ messages in thread
From: David Miller @ 2009-12-26  1:29 UTC
  To: eric.dumazet; +Cc: shemminger, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 07 Dec 2009 23:32:16 +0100

> I wonder if we could use a special version of skb_set_owner_r()/sock_rfree()
> *without* sk_mem_charge()/sk_mem_uncharge() calls for this error queue.
> 
> (We don't call sk_rmem_schedule() anyway, so I guess the current usage is not
> correct, even with the sock locked?)
> 
> Something like this (untested but compiled) patch?
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

I think this is legitimate in exactly this kind of case.

In the paths where we do this non-charging add, we have already
just made sure the receive queue is not over the limit.  Therefore
there are no paths where we can queue error skbs endlessly and
without any controls.

So I'm ok with this approach to fix these bugs.
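
The bound referred to here is the sk_rmem_alloc versus sk_rcvbuf test
at the top of sock_queue_err_skb(). A single-threaded userspace sketch
of that check, with hypothetical model_* names and C11 atomics assumed
in place of atomic_t, shows the queue refusing further skbs once the
limit is reached:

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

struct model_sock {
	atomic_int rmem_alloc;	/* stands in for sk_rmem_alloc */
	int rcvbuf;		/* stands in for sk_rcvbuf */
};

/* Same shape as the limit check followed by the non-charging add. */
int model_queue_err(struct model_sock *sk, unsigned int truesize)
{
	assert(sk->rcvbuf >= 0);
	if ((unsigned int)atomic_load(&sk->rmem_alloc) + truesize >=
	    (unsigned int)sk->rcvbuf)
		return -ENOMEM;
	atomic_fetch_add(&sk->rmem_alloc, (int)truesize);
	return 0;
}

/* Queue fixed-size error skbs until one is refused; return how many fit. */
int model_fill_error_queue(struct model_sock *sk, unsigned int truesize)
{
	int queued = 0;

	assert(truesize > 0);	/* otherwise the loop would never be refused */
	while (model_queue_err(sk, truesize) == 0)
		queued++;
	return queued;
}
```

With rcvbuf set to 4096 and a truesize of 1024, three skbs are
accepted and the fourth is refused with -ENOMEM, leaving rmem_alloc at
3072. The model ignores concurrency: the check and the add are
separate steps, so parallel callers could overshoot by a bounded
amount.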

