Dom0 crash with apache bench (ab)

* Dom0 crash with apache bench (ab)
@ 2015-07-28 13:09 Christoffer Dall
  2015-07-28 14:50 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 13+ messages in thread
From: Christoffer Dall @ 2015-07-28 13:09 UTC (permalink / raw)
  To: xen-devel

[-- Attachment #1.1: Type: text/plain, Size: 1586 bytes --]

Hi,

I've been doing some performance comparisons lately, and wanted to compare
the performance overhead of using Xen with apache bench, but unfortunately
the Dom0 kernel crashes when hitting it with ab from a remote machine.
Most other workloads seem to be stable, however, I do see similar crashes
if hitting Dom0 mysql with a mysql benchmark with a high level of
parallelism.

I use a 10G Mellanox MX354A Dual port FDR CX3 adapter for networking on a
Dell PowerEdge R320 system with a Xeon E5-2450 and 16 GB of RAM.

Interestingly, we had a similar-looking issue on arm64 recently, but that
was fixed with an APM-specific fix to the hypervisor, so I am guessing this
is unrelated, see:
http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02731.html
and the fix:
http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=50dcb3de603927db2fd87ba09e29c817415aaa44

I have tried with several Linux versions, v3.13, v3.18, v4.0-rc4, and v4.1,
same issue.  I have tried with Xen 4.5-0 release, and the Ubuntu packaged
Xen 4.4 release, same issue.

Examples of crash:
http://pastebin.ubuntu.com/11953498/
http://pastebin.ubuntu.com/11953443/

Running DomU with a bridge and running ab against apache running in a DomU
also causes the system to crash.

Note: The server also has an embedded 1G Broadcom NIC (although not
suitable for testing due to it being 1G and on a control network), and
using that for the test does not cause a system crash, so this points to
some difficulties with the Mellanox device and Xen.

Any ideas or advice would be greatly appreciated, thanks.

-Christoffer

[-- Attachment #1.2: Type: text/html, Size: 1976 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Dom0 crash with apache bench (ab)
  2015-07-28 13:09 Dom0 crash with apache bench (ab) Christoffer Dall
@ 2015-07-28 14:50 ` Konrad Rzeszutek Wilk
  2015-07-28 14:55   ` Ian Campbell
  0 siblings, 1 reply; 13+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-07-28 14:50 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: xen-devel

On Tue, Jul 28, 2015 at 03:09:31PM +0200, Christoffer Dall wrote:
> Hi,
> 
> I've been doing some performance comparisons lately, and wanted to compare
> the performance overhead of using Xen with apache bench, but unfortunately
> the Dom0 kernel crashes when hitting it with ab from a remote machine.
> Most other workloads seem to be stable, however, I do see similar crashes
> if hitting Dom0 mysql with a mysql benchmark with a high level of
> parallelism.
> 
> I use a 10G Mellanox MX354A Dual port FDR CX3 adapter for networking on a
> Dell PowerEdge R320 system with a Xeon E5-2450 and 16 GB of RAM.
> 
> Interestingly, we had a similar-looking issue on arm64 recently, but that
> was fixed with an APM-specific fix to the hypervisor, so I am guessing this
> is unrelated, see:
> http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02731.html
> and the fix:
> http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=50dcb3de603927db2fd87ba09e29c817415aaa44
> 
> I have tried with several Linux versions, v3.13, v3.18, v4.0-rc4, and v4.1,
> same issue.  I have tried with Xen 4.5-0 release, and the Ubuntu packaged
> Xen 4.4 release, same issue.
> 
> Examples of crash:
> http://pastebin.ubuntu.com/11953498/
> http://pastebin.ubuntu.com/11953443/

4.0-rc4?

Have you tried 4.1?

> 
> Running DomU with a bridge and running ab against apache running in a DomU
> also causes the system to crash.
> 
> Note: The server also has an embedded 1G Broadcom NIC (although not
> suitable for testing due to it being 1G and on a control network), and
> using that for the test does not cause a system crash, so this points to
> some difficulties with the Mellanox device and Xen.
> 
> Any ideas or advice is greatly appreciated, thanks.
> 
> -Christoffer

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Dom0 crash with apache bench (ab)
  2015-07-28 14:50 ` Konrad Rzeszutek Wilk
@ 2015-07-28 14:55   ` Ian Campbell
  2015-07-28 15:00     ` Christoffer Dall
  0 siblings, 1 reply; 13+ messages in thread
From: Ian Campbell @ 2015-07-28 14:55 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Christoffer Dall; +Cc: xen-devel

On Tue, 2015-07-28 at 10:50 -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, Jul 28, 2015 at 03:09:31PM +0200, Christoffer Dall wrote:
> > Hi,
> > 
> > I've been doing some performance comparisons lately, and wanted to 
> > compare
> > the performance overhead of using Xen with apache bench, but 
> > unfortunately
> > the Dom0 kernel crashes when hitting it with ab from a remote machine.
> > Most other workloads seem to be stable, however, I do see similar 
> > crashes
> > if hitting Dom0 mysql with a mysql benchmark with a high level of
> > parallelism.
> > 
> > I use a 10G Mellanox MX354A Dual port FDR CX3 adapter for networking on 
> > a
> > Dell PowerEdge R320 system with a Xeon E5-2450 and 16 GB of RAM.
> > 
> > Interestingly, we had a similar-looking issue on arm64 recently, but
> > that was fixed with an APM-specific fix to the hypervisor, so I am
> > guessing this is unrelated, see:
> > http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02731.html
> > and the fix:
> > http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=50dcb3de603927db2fd87ba09e29c817415aaa44
> > 
> > I have tried with several Linux versions, v3.13, v3.18, v4.0-rc4, and 
> > v4.1,
> > same issue.  I have tried with Xen 4.5-0 release, and the Ubuntu 
> > packaged
> > Xen 4.4 release, same issue.
> > 
> > Examples of crash:
> > http://pastebin.ubuntu.com/11953498/
> > http://pastebin.ubuntu.com/11953443/
> 
> 4.0-rc4?
> 
> Have you tried 4.1?

According to the previous paragraph, yes he has.

> 
> > 
> > Running DomU with a bridge and running ab against apache running in a 
> > DomU
> > also causes the system to crash.
> > 
> > Note: The server also has an embedded 1G Broadcom NIC (although not
> > suitable for testing due to it being 1G and on a control network), and
> > using that for the test does not cause a system crash, so this points 
> > to
> > some difficulties with the Mellanox device and Xen.
> > 
> > Any ideas or advice is greatly appreciated, thanks.
> > 
> > -Christoffer

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Dom0 crash with apache bench (ab)
  2015-07-28 14:55   ` Ian Campbell
@ 2015-07-28 15:00     ` Christoffer Dall
  2015-07-31 10:24       ` Stefano Stabellini
  0 siblings, 1 reply; 13+ messages in thread
From: Christoffer Dall @ 2015-07-28 15:00 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel


[-- Attachment #1.1: Type: text/plain, Size: 1885 bytes --]

On Tue, Jul 28, 2015 at 4:55 PM, Ian Campbell <ian.campbell@citrix.com>
wrote:

> On Tue, 2015-07-28 at 10:50 -0400, Konrad Rzeszutek Wilk wrote:
> > On Tue, Jul 28, 2015 at 03:09:31PM +0200, Christoffer Dall wrote:
> > > Hi,
> > >
> > > I've been doing some performance comparisons lately, and wanted to
> > > compare
> > > the performance overhead of using Xen with apache bench, but
> > > unfortunately
> > > the Dom0 kernel crashes when hitting it with ab from a remote machine.
> > > Most other workloads seem to be stable, however, I do see similar
> > > crashes
> > > if hitting Dom0 mysql with a mysql benchmark with a high level of
> > > parallelism.
> > >
> > > I use a 10G Mellanox MX354A Dual port FDR CX3 adapter for networking on
> > > a
> > > Dell PowerEdge R320 system with a Xeon E5-2450 and 16 GB of RAM.
> > >
> > > > Interestingly, we had a similar-looking issue on arm64 recently, but
> > > > that was fixed with an APM-specific fix to the hypervisor, so I am
> > > > guessing this is unrelated, see:
> > > > http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02731.html
> > > > and the fix:
> > > > http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=50dcb3de603927db2fd87ba09e29c817415aaa44
> > >
> > > I have tried with several Linux versions, v3.13, v3.18, v4.0-rc4, and
> > > v4.1,
> > > same issue.  I have tried with Xen 4.5-0 release, and the Ubuntu
> > > packaged
> > > Xen 4.4 release, same issue.
> > >
> > > Examples of crash:
> > > http://pastebin.ubuntu.com/11953498/
> > > http://pastebin.ubuntu.com/11953443/
> >
> > 4.0-rc4?
> >
> > Have you tried 4.1?
>
> According to the previous paragraph, yes he has.
>
Yes, I have.  Just to clarify, I used 4.0-rc4 because that's a branch
which contained arm64 PCI support and has been used for other measurements,
so this was simply my 'working tree'.

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Dom0 crash with apache bench (ab)
  2015-07-28 15:00     ` Christoffer Dall
@ 2015-07-31 10:24       ` Stefano Stabellini
  2015-07-31 10:28         ` David Vrabel
  0 siblings, 1 reply; 13+ messages in thread
From: Stefano Stabellini @ 2015-07-31 10:24 UTC (permalink / raw)
  To: Christoffer Dall; +Cc: David Vrabel, Wei Liu, Ian Campbell, xen-devel

[-- Attachment #1: Type: text/plain, Size: 4066 bytes --]

This is a Linux Dom0 crash on x86 (Dell PowerEdge R320, Xeon E5-2450),
CC'ing relevant people. As you can see from the links below the crash
is:

[ 253.619326] Call Trace:
[ 253.619330] <IRQ>
[ 253.619332] [<ffffffff815d7c25>] ? skb_copy_ubufs+0xa5/0x230
[ 253.619347] [<ffffffff815e8525>] __netif_receive_skb_core+0x6f5/0x940
[ 253.619353] [<ffffffff815e8788>] __netif_receive_skb+0x18/0x60
[ 253.619360] [<ffffffff815e87f8>] netif_receive_skb_internal+0x28/0x90
[ 253.619366] [<ffffffff815e91f5>] napi_gro_frags+0x125/0x1a0
[ 253.619378] [<ffffffffa01b1173>] mlx4_en_process_rx_cq+0x753/0xb50 [mlx4_en]
[ 253.619387] [<ffffffffa01b1657>] mlx4_en_poll_rx_cq+0x97/0x160 [mlx4_en]
[ 253.619393] [<ffffffff815e8bcd>] net_rx_action+0x13d/0x2f0
[ 253.619400] [<ffffffff8109fdea>] __do_softirq+0xda/0x1f0
[ 253.619406] [<ffffffff810a013d>] irq_exit+0x9d/0xb0
[ 253.619412] [<ffffffff813e3825>] xen_evtchn_do_upcall+0x35/0x50
[ 253.619420] [<ffffffff816c7bce>] xen_do_hypervisor_callback+0x1e/0x40
[ 253.619423] <EOI>
[ 253.619426] [<ffffffff811a7870>] ? shrink_dcache_for_umount+0x90/0x90
[ 253.619437] [<ffffffff811a7ad9>] ? d_alloc_pseudo+0x9/0x10
[ 253.619443] [<ffffffff815cbbed>] ? sock_alloc_file+0x4d/0x120
[ 253.619448] [<ffffffff815cdf78>] ? SYSC_accept4+0xb8/0x200
[ 253.619454] [<ffffffff811d0377>] ? SyS_epoll_wait+0x87/0xe0
[ 253.619459] [<ffffffff815cf5c9>] ? SyS_accept4+0x9/0x10
[ 253.619465] [<ffffffff816c630d>] ? system_call_fastpath+0x16/0x1b
[ 253.619469] Code: 4e 48 83 c4 08 5b 5d c3 66 0f 1f 44 00 00 e8 6b fc
ff ff eb e1 90 90 90 90 90 90 90 90 90 48 89 f8 48 89 d1 48 c1 e9 03 83
e2 07 <f3> 48 a5 89 d1 f3 a4 c3 20 4c 8b 06 4c 8b 4e 08 4c 8b 56 10 4c
[ 253.619513] RIP [<ffffffff81318b0d>] __memcpy+0xd/0x110
[ 253.619520] RSP <ffff88006b823c60>
[ 253.619524] ---[ end trace ba5d35a466b03856 ]---

On Tue, 28 Jul 2015, Christoffer Dall wrote:
> On Tue, Jul 28, 2015 at 4:55 PM, Ian Campbell <ian.campbell@citrix.com> wrote:
>       On Tue, 2015-07-28 at 10:50 -0400, Konrad Rzeszutek Wilk wrote:
>       > On Tue, Jul 28, 2015 at 03:09:31PM +0200, Christoffer Dall wrote:
>       > > Hi,
>       > >
>       > > I've been doing some performance comparisons lately, and wanted to
>       > > compare
>       > > the performance overhead of using Xen with apache bench, but
>       > > unfortunately
>       > > the Dom0 kernel crashes when hitting it with ab from a remote machine.
>       > > Most other workloads seem to be stable, however, I do see similar
>       > > crashes
>       > > if hitting Dom0 mysql with a mysql benchmark with a high level of
>       > > parallelism.
>       > >
>       > > I use a 10G Mellanox MX354A Dual port FDR CX3 adapter for networking on
>       > > a
>       > > Dell PowerEdge R320 system with a Xeon E5-2450 and 16 GB of RAM.
>       > >
>       > > Interestingly, we had a similar-looking issue on arm64 recently, but
>       > > that was fixed with an APM-specific fix to the hypervisor, so I am
>       > > guessing this is unrelated, see:
>       > > http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg02731.html
>       > > and the fix:
>       > > http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=50dcb3de603927db2fd87ba09e29c817415aaa44
>       > >
>       > > I have tried with several Linux versions, v3.13, v3.18, v4.0-rc4, and
>       > > v4.1,
>       > > same issue.  I have tried with Xen 4.5-0 release, and the Ubuntu
>       > > packaged
>       > > Xen 4.4 release, same issue.
>       > >
>       > > Examples of crash:
>       > > http://pastebin.ubuntu.com/11953498/
>       > > http://pastebin.ubuntu.com/11953443/
>       >
>       > 4.0-rc4?
>       >
>       > Have you tried 4.1?
>
> According to the previous paragraph, yes he has.
>
> Yes, I have.  Just to clarify, I used 4.0-rc4 because that's a branch which contained arm64 PCI support and has
> been used for other measurements, so this was simply my 'working tree'.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Dom0 crash with apache bench (ab)
  2015-07-31 10:24       ` Stefano Stabellini
@ 2015-07-31 10:28         ` David Vrabel
  2015-07-31 13:17           ` Christoffer Dall
  0 siblings, 1 reply; 13+ messages in thread
From: David Vrabel @ 2015-07-31 10:28 UTC (permalink / raw)
  To: Stefano Stabellini, Christoffer Dall; +Cc: Wei Liu, Ian Campbell, xen-devel

On 31/07/15 11:24, Stefano Stabellini wrote:
> This is a Linux Dom0 crash on x86 (Dell PowerEdge R320, Xeon E5-2450),
> CC'ing relevant people. As you can see from the links below the crash
> is:
> 
> [ 253.619326] Call Trace:
> [ 253.619330] <IRQ>
> [ 253.619332] [<ffffffff815d7c25>] ? skb_copy_ubufs+0xa5/0x230
> [ 253.619347] [<ffffffff815e8525>] __netif_receive_skb_core+0x6f5/0x940
> [ 253.619353] [<ffffffff815e8788>] __netif_receive_skb+0x18/0x60
> [ 253.619360] [<ffffffff815e87f8>] netif_receive_skb_internal+0x28/0x90
> [ 253.619366] [<ffffffff815e91f5>] napi_gro_frags+0x125/0x1a0
> [ 253.619378] [<ffffffffa01b1173>] mlx4_en_process_rx_cq+0x753/0xb50 [mlx4_en]
> [ 253.619387] [<ffffffffa01b1657>] mlx4_en_poll_rx_cq+0x97/0x160 [mlx4_en]

What makes you think this is Xen specific?  I suggest raising this with
the mlx4 maintainers.

David

> [ 253.619393] [<ffffffff815e8bcd>] net_rx_action+0x13d/0x2f0
> [ 253.619400] [<ffffffff8109fdea>] __do_softirq+0xda/0x1f0
> [ 253.619406] [<ffffffff810a013d>] irq_exit+0x9d/0xb0
> [ 253.619412] [<ffffffff813e3825>] xen_evtchn_do_upcall+0x35/0x50
> [ 253.619420] [<ffffffff816c7bce>] xen_do_hypervisor_callback+0x1e/0x40
> [ 253.619423] <EOI>
> [ 253.619426] [<ffffffff811a7870>] ? shrink_dcache_for_umount+0x90/0x90
> [ 253.619437] [<ffffffff811a7ad9>] ? d_alloc_pseudo+0x9/0x10
> [ 253.619443] [<ffffffff815cbbed>] ? sock_alloc_file+0x4d/0x120
> [ 253.619448] [<ffffffff815cdf78>] ? SYSC_accept4+0xb8/0x200
> [ 253.619454] [<ffffffff811d0377>] ? SyS_epoll_wait+0x87/0xe0
> [ 253.619459] [<ffffffff815cf5c9>] ? SyS_accept4+0x9/0x10
> [ 253.619465] [<ffffffff816c630d>] ? system_call_fastpath+0x16/0x1b
> [ 253.619469] Code: 4e 48 83 c4 08 5b 5d c3 66 0f 1f 44 00 00 e8 6b fc
> ff ff eb e1 90 90 90 90 90 90 90 90 90 48 89 f8 48 89 d1 48 c1 e9 03 83
> e2 07 <f3> 48 a5 89 d1 f3 a4 c3 20 4c 8b 06 4c 8b 4e 08 4c 8b 56 10 4c
> [ 253.619513] RIP [<ffffffff81318b0d>] __memcpy+0xd/0x110
> [ 253.619520] RSP <ffff88006b823c60>
> [ 253.619524] ---[ end trace ba5d35a466b03856 ]---

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Dom0 crash with apache bench (ab)
  2015-07-31 10:28         ` David Vrabel
@ 2015-07-31 13:17           ` Christoffer Dall
  2015-09-14 12:40             ` Christoffer Dall
  0 siblings, 1 reply; 13+ messages in thread
From: Christoffer Dall @ 2015-07-31 13:17 UTC (permalink / raw)
  To: David Vrabel; +Cc: xen-devel, Wei Liu, Ian Campbell, Stefano Stabellini


[-- Attachment #1.1: Type: text/plain, Size: 1093 bytes --]

On Fri, Jul 31, 2015 at 12:28 PM, David Vrabel <david.vrabel@citrix.com>
wrote:

> On 31/07/15 11:24, Stefano Stabellini wrote:
> > This is a Linux Dom0 crash on x86 (Dell PowerEdge R320, Xeon E5-2450),
> > CC'ing relevant people. As you can see from the links below the crash
> > is:
> >
> > [ 253.619326] Call Trace:
> > [ 253.619330] <IRQ>
> > [ 253.619332] [<ffffffff815d7c25>] ? skb_copy_ubufs+0xa5/0x230
> > [ 253.619347] [<ffffffff815e8525>] __netif_receive_skb_core+0x6f5/0x940
> > [ 253.619353] [<ffffffff815e8788>] __netif_receive_skb+0x18/0x60
> > [ 253.619360] [<ffffffff815e87f8>] netif_receive_skb_internal+0x28/0x90
> > [ 253.619366] [<ffffffff815e91f5>] napi_gro_frags+0x125/0x1a0
> > [ 253.619378] [<ffffffffa01b1173>] mlx4_en_process_rx_cq+0x753/0xb50 [mlx4_en]
> > [ 253.619387] [<ffffffffa01b1657>] mlx4_en_poll_rx_cq+0x97/0x160 [mlx4_en]
>
> What makes you think this is Xen specific?  I suggest raising this with
> the mlx4 maintainers.
>
>
Linux native and KVM guests (same hw, same kernel version+config) run just
fine under the same workload.

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Dom0 crash with apache bench (ab)
  2015-07-31 13:17           ` Christoffer Dall
@ 2015-09-14 12:40             ` Christoffer Dall
  2015-09-14 15:11               ` Konrad Rzeszutek Wilk
  2015-09-14 15:20               ` Ian Campbell
  0 siblings, 2 replies; 13+ messages in thread
From: Christoffer Dall @ 2015-09-14 12:40 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Ian Campbell, xen-devel, Wei Liu, David Vrabel,
	Stefano Stabellini

On Fri, Jul 31, 2015 at 03:17:56PM +0200, Christoffer Dall wrote:
> On Fri, Jul 31, 2015 at 12:28 PM, David Vrabel <david.vrabel@citrix.com>
> wrote:
> 
> > On 31/07/15 11:24, Stefano Stabellini wrote:
> > > This is a Linux Dom0 crash on x86 (Dell PowerEdge R320, Xeon E5-2450),
> > > CC'ing relevant people. As you can see from the links below the crash
> > > is:
> > >
> > > [ 253.619326] Call Trace:
> > > [ 253.619330] <IRQ>
> > > [ 253.619332] [<ffffffff815d7c25>] ? skb_copy_ubufs+0xa5/0x230
> > > [ 253.619347] [<ffffffff815e8525>] __netif_receive_skb_core+0x6f5/0x940
> > > [ 253.619353] [<ffffffff815e8788>] __netif_receive_skb+0x18/0x60
> > > [ 253.619360] [<ffffffff815e87f8>] netif_receive_skb_internal+0x28/0x90
> > > [ 253.619366] [<ffffffff815e91f5>] napi_gro_frags+0x125/0x1a0
> > > [ 253.619378] [<ffffffffa01b1173>] mlx4_en_process_rx_cq+0x753/0xb50 [mlx4_en]
> > > [ 253.619387] [<ffffffffa01b1657>] mlx4_en_poll_rx_cq+0x97/0x160 [mlx4_en]
> >
> > What makes you think this is Xen specific?  I suggest raising this with
> > the mlx4 maintainers.
> >
> >
> Linux native and KVM guests (same hw, same kernel version+config) run just
> fine under the same workload.
> 
Ping?

From the fact that bare-metal and KVM work fine with this hardware, I
still think it's reasonable to assume that it's a Xen issue and not an
mlx4 issue.

Is this completely flawed?

Thanks,
-Christoffer

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Dom0 crash with apache bench (ab)
  2015-09-14 12:40             ` Christoffer Dall
@ 2015-09-14 15:11               ` Konrad Rzeszutek Wilk
  2015-09-14 15:20               ` Ian Campbell
  1 sibling, 0 replies; 13+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-09-14 15:11 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, xen-devel,
	Christoffer Dall, David Vrabel

On Mon, Sep 14, 2015 at 02:40:08PM +0200, Christoffer Dall wrote:
> On Fri, Jul 31, 2015 at 03:17:56PM +0200, Christoffer Dall wrote:
> > On Fri, Jul 31, 2015 at 12:28 PM, David Vrabel <david.vrabel@citrix.com>
> > wrote:
> > 
> > > On 31/07/15 11:24, Stefano Stabellini wrote:
> > > > This is a Linux Dom0 crash on x86 (Dell PowerEdge R320, Xeon E5-2450),
> > > > CC'ing relevant people. As you can see from the links below the crash
> > > > is:
> > > >
> > > > [ 253.619326] Call Trace:
> > > > [ 253.619330] <IRQ>
> > > > [ 253.619332] [<ffffffff815d7c25>] ? skb_copy_ubufs+0xa5/0x230
> > > > [ 253.619347] [<ffffffff815e8525>] __netif_receive_skb_core+0x6f5/0x940
> > > > [ 253.619353] [<ffffffff815e8788>] __netif_receive_skb+0x18/0x60
> > > > [ 253.619360] [<ffffffff815e87f8>] netif_receive_skb_internal+0x28/0x90
> > > > [ 253.619366] [<ffffffff815e91f5>] napi_gro_frags+0x125/0x1a0
> > > > [ 253.619378] [<ffffffffa01b1173>] mlx4_en_process_rx_cq+0x753/0xb50
> > > [mlx4_en]
> > > > [ 253.619387] [<ffffffffa01b1657>] mlx4_en_poll_rx_cq+0x97/0x160
> > > [mlx4_en]
> > >
> > > What makes you think this is Xen specific?  I suggest raising this the
> > > the mlx4 maintainers.
> > >
> > >
> > Linux native and KVM guests (same hw, same kernel version+config) run just
> > fine under the same workload.
> > 
> Ping?
> 
> From the fact that bare-metal and KVM works fine with this hardware I
> still think it's reasonable to assume that it's a Xen issue and not a
> mlx4 issue.
> 
> Is this completely flawed?

I have a feeling it is an mlx4 issue but you don't easily reproduce it
under baremetal. Is there any way you could boot baremetal with
'iommu=soft swiotlb=force' to see if you can reproduce it under those
conditions?
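
Before drawing conclusions from such a native boot, it may be worth
confirming that the kernel really picked up both parameters. A minimal
sketch, assuming Python 3 is available on the test machine; the helper
names missing_params and check_running_kernel are hypothetical and only
illustrate the check against /proc/cmdline:

```python
from pathlib import Path

# The two parameters suggested in this thread for a bare-metal boot.
REQUIRED = ("iommu=soft", "swiotlb=force")


def missing_params(cmdline, required=REQUIRED):
    """Return the required boot parameters that are absent from cmdline."""
    present = set(cmdline.split())
    return [param for param in required if param not in present]


def check_running_kernel(path="/proc/cmdline"):
    """Check the command line of the currently running kernel.

    /proc/cmdline holds the parameters the running kernel was booted with.
    """
    return missing_params(Path(path).read_text())
```

Calling check_running_kernel() after the reboot returns an empty list
when both parameters are in effect; anything else means the test ran
without forced bounce buffering and says little about the driver.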

thanks!
> 
> Thanks,
> -Christoffer

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Dom0 crash with apache bench (ab)
  2015-09-14 12:40             ` Christoffer Dall
  2015-09-14 15:11               ` Konrad Rzeszutek Wilk
@ 2015-09-14 15:20               ` Ian Campbell
  2015-09-14 16:16                 ` Christoffer Dall
  2015-09-28 20:53                 ` Christoffer Dall
  1 sibling, 2 replies; 13+ messages in thread
From: Ian Campbell @ 2015-09-14 15:20 UTC (permalink / raw)
  To: Christoffer Dall, Christoffer Dall
  Cc: xen-devel, Wei Liu, David Vrabel, Stefano Stabellini

On Mon, 2015-09-14 at 14:40 +0200, Christoffer Dall wrote:
> On Fri, Jul 31, 2015 at 03:17:56PM +0200, Christoffer Dall wrote:
> > On Fri, Jul 31, 2015 at 12:28 PM, David Vrabel <david.vrabel@citrix.com
> > >
> > wrote:
> > 
> > > On 31/07/15 11:24, Stefano Stabellini wrote:
> > > > This is a Linux Dom0 crash on x86 (Dell PowerEdge R320, Xeon E5
> > > > -2450),
> > > > CC'ing relevant people. As you can see from the links below the
> > > > crash
> > > > is:
> > > > 
> > > > [ 253.619326] Call Trace:
> > > > [ 253.619330] <IRQ>
> > > > [ 253.619332] [<ffffffff815d7c25>] ? skb_copy_ubufs+0xa5/0x230
> > > > [ 253.619347] [<ffffffff815e8525>]
> > > > __netif_receive_skb_core+0x6f5/0x940
> > > > [ 253.619353] [<ffffffff815e8788>] __netif_receive_skb+0x18/0x60
> > > > [ 253.619360] [<ffffffff815e87f8>]
> > > > netif_receive_skb_internal+0x28/0x90
> > > > [ 253.619366] [<ffffffff815e91f5>] napi_gro_frags+0x125/0x1a0
> > > > [ 253.619378] [<ffffffffa01b1173>]
> > > > mlx4_en_process_rx_cq+0x753/0xb50
> > > [mlx4_en]
> > > > [ 253.619387] [<ffffffffa01b1657>] mlx4_en_poll_rx_cq+0x97/0x160
> > > [mlx4_en]
> > > 
> > > > What makes you think this is Xen specific?  I suggest raising this
> > > > with the mlx4 maintainers.
> > > 
> > > 
> > Linux native and KVM guests (same hw, same kernel version+config) run
> > just
> > fine under the same workload.
> > 
> Ping?
> 
> From the fact that bare-metal and KVM works fine with this hardware I
> still think it's reasonable to assume that it's a Xen issue and not a
> mlx4 issue.
> 
> Is this completely flawed?

My (somewhat educated) guess is that this is to do with the difference
between (pseudo-)physical addresses and machine (AKA real-physical)
addresses when running under Xen.

The way this often shows up is in drivers which do not make correct use of
the kernel's DMA APIs but which happen to work on native x86 because
physical==bus address on x86.
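
The failure mode can be illustrated with a toy model. This is only a
sketch under stated assumptions, not Xen or mlx4 code: memory is a
dictionary keyed by machine frame number, the pseudo-physical to
machine table is a plain dictionary, and ToyDomain, dma_map and receive
are hypothetical names invented for the illustration.

```python
class ToyDomain:
    """A domain whose pseudo-physical frames map to machine frames."""

    def __init__(self, p2m):
        # pseudo-physical frame number -> machine frame number
        self.p2m = dict(p2m)

    def dma_map(self, pfn):
        """Stand-in for a DMA API mapping: return the bus (machine) frame."""
        return self.p2m[pfn]


def receive(domain, machine_memory, pfn, data, use_dma_api):
    """Simulate a NIC writing a packet into a buffer owned by the domain.

    The device only understands machine frames. A driver that skips the
    DMA API hands it the pseudo-physical frame number instead.
    """
    bus_frame = domain.dma_map(pfn) if use_dma_api else pfn
    machine_memory[bus_frame] = data  # the device writes here
    # The kernel then reads its buffer, which really lives at p2m[pfn].
    return machine_memory.get(domain.p2m[pfn])


# Native x86 without an IOMMU behaves like an identity table.
native = ToyDomain({0: 0, 1: 1, 2: 2})
# Under Xen the table is arbitrary from the domain's point of view.
dom0 = ToyDomain({0: 7, 1: 3, 2: 5})

print(receive(native, {}, 1, "pkt", use_dma_api=False))  # packet arrives
print(receive(dom0, {}, 1, "pkt", use_dma_api=False))    # packet is lost
print(receive(dom0, {}, 1, "pkt", use_dma_api=True))     # packet arrives
```

In the toy model the shortcut is invisible on the identity table and
only breaks once the two address spaces differ, which matches the
pattern of a driver that works natively but fails in Dom0.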

Sometimes booting natively with 'iommu=soft swiotlb=force' can expose these
sorts of issues.

You are running 64-bit so I don't think the recent "config: Enable
NEED_DMA_MAP_STATE by default when SWIOTLB is selected" is likely to be
relevant (it's already unconditionally on for 64-bit).

The trace appears to be on rx from a physical nic, there shouldn't be any
magic Xen stuff (granted pages etc) getting themselves into that path at
all. If it were tx then maybe it might be an issue with foreign pages. In
any case I think you are able to repro with just dom0, i.e. never having
started a domU, is that right?

Ian.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Dom0 crash with apache bench (ab)
  2015-09-14 15:20               ` Ian Campbell
@ 2015-09-14 16:16                 ` Christoffer Dall
  2015-09-28 20:53                 ` Christoffer Dall
  1 sibling, 0 replies; 13+ messages in thread
From: Christoffer Dall @ 2015-09-14 16:16 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel, Wei Liu, David Vrabel, Stefano Stabellini


[-- Attachment #1.1: Type: text/plain, Size: 3001 bytes --]

On Mon, Sep 14, 2015 at 5:20 PM, Ian Campbell <ian.campbell@citrix.com>
wrote:

> On Mon, 2015-09-14 at 14:40 +0200, Christoffer Dall wrote:
> > On Fri, Jul 31, 2015 at 03:17:56PM +0200, Christoffer Dall wrote:
> > > On Fri, Jul 31, 2015 at 12:28 PM, David Vrabel <
> david.vrabel@citrix.com
> > > >
> > > wrote:
> > >
> > > > On 31/07/15 11:24, Stefano Stabellini wrote:
> > > > > This is a Linux Dom0 crash on x86 (Dell PowerEdge R320, Xeon E5
> > > > > -2450),
> > > > > CC'ing relevant people. As you can see from the links below the
> > > > > crash
> > > > > is:
> > > > >
> > > > > [ 253.619326] Call Trace:
> > > > > [ 253.619330] <IRQ>
> > > > > [ 253.619332] [<ffffffff815d7c25>] ? skb_copy_ubufs+0xa5/0x230
> > > > > [ 253.619347] [<ffffffff815e8525>]
> > > > > __netif_receive_skb_core+0x6f5/0x940
> > > > > [ 253.619353] [<ffffffff815e8788>] __netif_receive_skb+0x18/0x60
> > > > > [ 253.619360] [<ffffffff815e87f8>]
> > > > > netif_receive_skb_internal+0x28/0x90
> > > > > [ 253.619366] [<ffffffff815e91f5>] napi_gro_frags+0x125/0x1a0
> > > > > [ 253.619378] [<ffffffffa01b1173>]
> > > > > mlx4_en_process_rx_cq+0x753/0xb50
> > > > [mlx4_en]
> > > > > [ 253.619387] [<ffffffffa01b1657>] mlx4_en_poll_rx_cq+0x97/0x160
> > > > [mlx4_en]
> > > >
> > > > > What makes you think this is Xen specific?  I suggest raising this
> > > > > with the mlx4 maintainers.
> > > >
> > > >
> > > Linux native and KVM guests (same hw, same kernel version+config) run
> > > just
> > > fine under the same workload.
> > >
> > Ping?
> >
> > From the fact that bare-metal and KVM works fine with this hardware I
> > still think it's reasonable to assume that it's a Xen issue and not a
> > mlx4 issue.
> >
> > Is this completely flawed?
>
> My (somewhat educated) guess is that this is to do with the difference
> between (pseudo-)physical addresses and machine (AKA real-physical)
> addresses when running under Xen.
>
> The way this often shows up is in drivers which do not make correct use of
> the kernels DMA APIs but which happen to work on native x86 because
> physical==bus address on x86.
>
> Sometimes booting natively with 'iommu=soft swiotlb=force' can expose these
> sorts of issues.
>

I'll give this a try.


>
> You are running 64-bit so I don't think the recent "config: Enable
> NEED_DMA_MAP_STATE by default when SWIOTLB is selected" is likely to be
> relevant (it's already unconditionally on for 64-bit).
>
> The trace appears to be on rx from a physical nic, there shouldn't be any
> magic Xen stuff (granted pages etc) getting themselves into that path at
> all. If it were tx then maybe it might be an issue with foreign pages. In
> any case I think you are able to repro with just dom0, i.e. never having
> started a domU, is that right?
>

As far as I remember and as far as I can interpret my own e-mail, yes.

Thanks for the feedback, I'll try the suggested approaches and also try
using v4.3-rc1 and take it up with the mlx4 maintainers if I still see the
issue.

-Christoffer

* Re: Dom0 crash with apache bench (ab)
  2015-09-14 15:20               ` Ian Campbell
  2015-09-14 16:16                 ` Christoffer Dall
@ 2015-09-28 20:53                 ` Christoffer Dall
  2015-09-30 15:12                   ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 13+ messages in thread
From: Christoffer Dall @ 2015-09-28 20:53 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen-devel, Wei Liu, David Vrabel, Stefano Stabellini


On Mon, Sep 14, 2015 at 5:20 PM, Ian Campbell <ian.campbell@citrix.com>
wrote:

> On Mon, 2015-09-14 at 14:40 +0200, Christoffer Dall wrote:
> > On Fri, Jul 31, 2015 at 03:17:56PM +0200, Christoffer Dall wrote:
> > > On Fri, Jul 31, 2015 at 12:28 PM, David Vrabel <david.vrabel@citrix.com> wrote:
> > >
> > > > On 31/07/15 11:24, Stefano Stabellini wrote:
> > > > > This is a Linux Dom0 crash on x86 (Dell PowerEdge R320, Xeon E5-2450),
> > > > > CC'ing relevant people. As you can see from the links below the crash
> > > > > is:
> > > > >
> > > > > [ 253.619326] Call Trace:
> > > > > [ 253.619330] <IRQ>
> > > > > [ 253.619332] [<ffffffff815d7c25>] ? skb_copy_ubufs+0xa5/0x230
> > > > > [ 253.619347] [<ffffffff815e8525>] __netif_receive_skb_core+0x6f5/0x940
> > > > > [ 253.619353] [<ffffffff815e8788>] __netif_receive_skb+0x18/0x60
> > > > > [ 253.619360] [<ffffffff815e87f8>] netif_receive_skb_internal+0x28/0x90
> > > > > [ 253.619366] [<ffffffff815e91f5>] napi_gro_frags+0x125/0x1a0
> > > > > [ 253.619378] [<ffffffffa01b1173>] mlx4_en_process_rx_cq+0x753/0xb50 [mlx4_en]
> > > > > [ 253.619387] [<ffffffffa01b1657>] mlx4_en_poll_rx_cq+0x97/0x160 [mlx4_en]
> > > >
> > > > What makes you think this is Xen specific?  I suggest raising this
> > > > with the mlx4 maintainers.
> > > >
> > > >
> > > Linux native and KVM guests (same hw, same kernel version+config) run
> > > just fine under the same workload.
> > >
> > Ping?
> >
> > From the fact that bare-metal and KVM works fine with this hardware I
> > still think it's reasonable to assume that it's a Xen issue and not a
> > mlx4 issue.
> >
> > Is this completely flawed?
>
> My (somewhat educated) guess is that this is to do with the difference
> between (pseudo-)physical addresses and machine (AKA real-physical)
> addresses when running under Xen.
>
> The way this often shows up is in drivers which do not make correct use of
> the kernel's DMA APIs but which happen to work on native x86 because
> physical==bus address on x86.
>
> Sometimes booting natively with 'iommu=soft swiotlb=force' can expose these
> sorts of issues.
>

Indeed it does, on both v4.0 and v4.3-rc2.
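
To make the pseudo-physical versus machine address distinction concrete, here
is a small user-space sketch. It is only a toy model under stated
assumptions: the table contents, the page size and all toy_* names are made
up for illustration and are not Xen or Linux interfaces. It shows that an
address which a driver takes directly from the guest's view of memory is not
the address the device must use, and that a buffer which looks contiguous to
the guest need not be contiguous in machine memory.

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_NR_PAGES   4

/* Hypothetical pseudo-physical -> machine frame table for one guest. */
static const uint64_t toy_p2m[TOY_NR_PAGES] = { 0x8a3, 0x17, 0x5f0, 0x5f1 };

/* Translate a pseudo-physical address into the machine address a device
 * would have to use. Assumes paddr lies within the toy table. */
uint64_t toy_phys_to_machine(uint64_t paddr)
{
    uint64_t pfn = paddr >> TOY_PAGE_SHIFT;
    uint64_t offset = paddr & ((1ULL << TOY_PAGE_SHIFT) - 1);
    return (toy_p2m[pfn] << TOY_PAGE_SHIFT) | offset;
}

/* Buggy shortcut: treat the pseudo-physical address as the bus address.
 * This only works when physical == bus address, as on native x86. */
uint64_t toy_bus_addr_shortcut(uint64_t paddr)
{
    return paddr;
}

/* Returns 1 if [paddr, paddr + len) is contiguous in machine memory.
 * Assumes len > 0 and that the range lies within the toy table. */
int toy_machine_contiguous(uint64_t paddr, uint64_t len)
{
    uint64_t first = paddr >> TOY_PAGE_SHIFT;
    uint64_t last = (paddr + len - 1) >> TOY_PAGE_SHIFT;

    for (uint64_t pfn = first; pfn < last; pfn++)
        if (toy_p2m[pfn] + 1 != toy_p2m[pfn + 1])
            return 0;
    return 1;
}
```

With this table, pseudo-physical 0x1234 translates to machine 0x17234 while
the shortcut hands the device 0x1234, and the two-page range starting at 0x0
is not machine-contiguous even though the one starting at 0x2000 is.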


>
> You are running 64-bit so I don't think the recent "config: Enable
> NEED_DMA_MAP_STATE by default when SWIOTLB is selected" is likely to be
> relevant (it's already unconditionally on for 64-bit).
>
> The trace appears to be on rx from a physical nic, there shouldn't be any
> magic Xen stuff (granted pages etc) getting themselves into that path at
> all. If it were tx then maybe it might be an issue with foreign pages. In
> any case I think you are able to repro with just dom0, i.e. never having
> started a domU, is that right?
>
>
Yes, I can reproduce on Dom0.

I will send this to the Mellanox people.

Thanks,
-Christoffer

* Re: Dom0 crash with apache bench (ab)
  2015-09-28 20:53                 ` Christoffer Dall
@ 2015-09-30 15:12                   ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 13+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-09-30 15:12 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: David Vrabel, Wei Liu, Ian Campbell, Stefano Stabellini,
	xen-devel

On Mon, Sep 28, 2015 at 10:53:33PM +0200, Christoffer Dall wrote:
> On Mon, Sep 14, 2015 at 5:20 PM, Ian Campbell <ian.campbell@citrix.com>
> wrote:
> 
> > On Mon, 2015-09-14 at 14:40 +0200, Christoffer Dall wrote:
> > > On Fri, Jul 31, 2015 at 03:17:56PM +0200, Christoffer Dall wrote:
> > > > On Fri, Jul 31, 2015 at 12:28 PM, David Vrabel <david.vrabel@citrix.com> wrote:
> > > >
> > > > > On 31/07/15 11:24, Stefano Stabellini wrote:
> > > > > > This is a Linux Dom0 crash on x86 (Dell PowerEdge R320, Xeon E5-2450),
> > > > > > CC'ing relevant people. As you can see from the links below the crash
> > > > > > is:
> > > > > >
> > > > > > [ 253.619326] Call Trace:
> > > > > > [ 253.619330] <IRQ>
> > > > > > [ 253.619332] [<ffffffff815d7c25>] ? skb_copy_ubufs+0xa5/0x230
> > > > > > [ 253.619347] [<ffffffff815e8525>] __netif_receive_skb_core+0x6f5/0x940
> > > > > > [ 253.619353] [<ffffffff815e8788>] __netif_receive_skb+0x18/0x60
> > > > > > [ 253.619360] [<ffffffff815e87f8>] netif_receive_skb_internal+0x28/0x90
> > > > > > [ 253.619366] [<ffffffff815e91f5>] napi_gro_frags+0x125/0x1a0
> > > > > > [ 253.619378] [<ffffffffa01b1173>] mlx4_en_process_rx_cq+0x753/0xb50 [mlx4_en]
> > > > > > [ 253.619387] [<ffffffffa01b1657>] mlx4_en_poll_rx_cq+0x97/0x160 [mlx4_en]
> > > > >
> > > > > What makes you think this is Xen specific?  I suggest raising this
> > > > > with the mlx4 maintainers.
> > > > >
> > > > >
> > > > Linux native and KVM guests (same hw, same kernel version+config) run
> > > > just fine under the same workload.
> > > >
> > > Ping?
> > >
> > > From the fact that bare-metal and KVM works fine with this hardware I
> > > still think it's reasonable to assume that it's a Xen issue and not a
> > > mlx4 issue.
> > >
> > > Is this completely flawed?
> >
> > My (somewhat educated) guess is that this is to do with the difference
> > between (pseudo-)physical addresses and machine (AKA real-physical)
> > addresses when running under Xen.
> >
> > The way this often shows up is in drivers which do not make correct use of
> > the kernel's DMA APIs but which happen to work on native x86 because
> > physical==bus address on x86.
> >
> > Sometimes booting natively with 'iommu=soft swiotlb=force' can expose these
> > sorts of issues.
> >
> 
> Indeed it does, on both v4.0 and v4.3-rc2.

Yeeey!
> 
> 
> >
> > You are running 64-bit so I don't think the recent "config: Enable
> > NEED_DMA_MAP_STATE by default when SWIOTLB is selected" is likely to be
> > relevant (it's already unconditionally on for 64-bit).
> >
> > The trace appears to be on rx from a physical nic, there shouldn't be any
> > magic Xen stuff (granted pages etc) getting themselves into that path at
> > all. If it were tx then maybe it might be an issue with foreign pages. In
> > any case I think you are able to repro with just dom0, i.e. never having
> > started a domU, is that right?
> >
> >
> Yes, I can reproduce on Dom0.
> 
> I will send this to the Mellanox people.

Thank you :-) Though please do keep us (or at least me) on CC; this is an
interesting bug.

> 
> Thanks,
> -Christoffer

end of thread, other threads:[~2015-09-30 15:12 UTC | newest]

Thread overview: 13+ messages
2015-07-28 13:09 Dom0 crash with apache bench (ab) Christoffer Dall
2015-07-28 14:50 ` Konrad Rzeszutek Wilk
2015-07-28 14:55   ` Ian Campbell
2015-07-28 15:00     ` Christoffer Dall
2015-07-31 10:24       ` Stefano Stabellini
2015-07-31 10:28         ` David Vrabel
2015-07-31 13:17           ` Christoffer Dall
2015-09-14 12:40             ` Christoffer Dall
2015-09-14 15:11               ` Konrad Rzeszutek Wilk
2015-09-14 15:20               ` Ian Campbell
2015-09-14 16:16                 ` Christoffer Dall
2015-09-28 20:53                 ` Christoffer Dall
2015-09-30 15:12                   ` Konrad Rzeszutek Wilk
