[Qemu-devel] linux-user crashes on clone(2) when run on ppc host

* [Qemu-devel] linux-user crashes on clone(2) when run on ppc host
@ 2015-06-17  0:52 Emilio G. Cota
  2015-06-17  8:58 ` Peter Maydell
  0 siblings, 1 reply; 7+ messages in thread
From: Emilio G. Cota @ 2015-06-17  0:52 UTC
  To: Alexander Graf; +Cc: Riku Voipio, qemu-ppc, QEMU Developers

Hi,

I'm having trouble running a simple multithreaded program on a PowerPC host machine.

The machine I'm using is a ppc VM--I think it's running under KVM (I'm using
OVH's RunAbove Power8 service):
  admin@adsf:~/qemu$ uname -a
  Linux adsf 3.13.0-37-generic #64-Ubuntu SMP Mon Sep 22 21:27:09 UTC 2014 ppc64le ppc64le ppc64le GNU/Linux

The original program I tried was calling pthread_create, and it was segfaulting.
I then distilled it to a simpler test program, taken from
  https://lists.gnu.org/archive/html/qemu-devel/2005-10/msg00251.html
which simply does a clone(2):

/* gcc -O0 -g -o foo foo.c -pthread -static */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>   /* sleep() */

/* Child entry point: print once, then spin until the parent kills us. */
int thread_main(void *arg)
{
    printf("child: Hello world!\n");
    while (1)
        ;
    return 0;
}

/* Backing store for the child's stack; the child starts in the middle. */
unsigned long stack[8192];

int main(void)
{
    int pid;

    printf("About to clone: thread_main=%p\n", (void *)thread_main);
    /* CLONE_VM: the child shares the parent's address space. */
    pid = clone(thread_main, stack + 4096, CLONE_VM, NULL);
    if (pid == -1) {
        perror("clone");
        return 1;
    }
    printf("parent: clone successful; child pid is %d\n", pid);
    printf("parent: sleeping a bit\n");
    sleep(2);
    printf("parent: killing process\n");
    kill(pid, SIGTERM);
    return 0;
}

Doesn't work (linux-user on ppc64le host):
- x86_64 static binary, compiled natively
- ppc static binary, cross-compiled from x86 host
- ppc64le static binary, compiled natively on the ppc64le host
- ppc64le binary (i.e. non-static), compiled natively on the ppc64le host
- ppc64 binary, compiled natively on ppc64 host (running ppc64-linux-user)

Works:
- Any of the above running on x86_64 host (linux-user or native)
- ppc64le binary running natively on ppc64le host

The current HEAD of the tree is:
  commit 93f6d1c16036aaf34055d16f54ea770fb8d6d280
  Merge: 4316536 7a4dfd1
  Author: Peter Maydell <peter.maydell@linaro.org>
  Date:   Tue Jun 16 10:35:43 2015 +0100

I've tried older versions of qemu (e.g. v2.0, v1.7) and they don't work either.

The segfault for the ppc64le static binary is as follows:

admin@adsf:~/qemu$ ppc64le-linux-user/qemu-ppc64le foo
About to clone: thread_main=0x100008f0
Invalid data memory access: 0x00003fffa2f8a720
NIP 00000040009aeec8   LR 0000000010000660 CTR 00000040009aee68 XER 0000000000000000 CPU#1
MSR 8000000002806001 HID0 0000000000000000  HF 0000000002806001 idx 0
TB 00000000 00000000
GPR00 0000000000000078 0000000010019030 0000004000a52800 0000000000000000
GPR04 0000000010019030 0000000000000027 0000000000000000 0000000000000001
GPR08 0000000000000000 0000000000000001 0000000000000000 0000000000000007
GPR12 00000040009aee68 0000004000a57b60 0000000000000000 0000000000000000
GPR16 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24 0000000000000000 0000000000000000 0000000000000000 000000400084be10
GPR28 000000400084c148 0000000000000100 00000000100008f0 0000000000000000
CR 42000884  [ G  E  -  -  -  L  L  G  ]             RES ffffffffffffffff
FPR00 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR04 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR08 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR12 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR16 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR20 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR24 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR28 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPSCR 0000000000000000
Invalid segfault errno (42000000)
NIP 00000040009aeec8   LR 0000000010000660 CTR 00000040009aee68 XER 0000000000000000 CPU#1
MSR 8000000002806001 HID0 0000000000000000  HF 0000000002806001 idx 0
TB 00000000 00000000
GPR00 0000000000000078 0000000010019030 0000004000a52800 0000000000000000
GPR04 0000000010019030 0000000000000027 0000000000000000 0000000000000001
GPR08 0000000000000000 0000000000000001 0000000000000000 0000000000000007
GPR12 00000040009aee68 0000004000a57b60 0000000000000000 0000000000000000
GPR16 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24 0000000000000000 0000000000000000 0000000000000000 000000400084be10
GPR28 000000400084c148 0000000000000100 00000000100008f0 0000000000000000
CR 42000884  [ G  E  -  -  -  L  L  G  ]             RES ffffffffffffffff
FPR00 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR04 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR08 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR12 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR16 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR20 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR24 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPR28 0000000000000000 0000000000000000 0000000000000000 0000000000000000
FPSCR 0000000000000000

^C

gdb stack trace:
[...]
Using host libthread_db library "/lib/powerpc64le-linux-gnu/libthread_db.so.1".
[New Thread 0x3fffb7aaf170 (LWP 12287)]
About to clone: thread_main=0x100007f4
[New Thread 0x3fffb3a7f170 (LWP 12288)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x3fffb3a7f170 (LWP 12288)]
0x0000000060014828 in ppc_tb_set_jmp_target (jmp_addr=<error reading variable: Cannot access memory at address 0x3fffb3a2a748>, addr=<error reading variable: Cannot access memory at address 0x3fffb3a2a740>) at /home/admin/qemu/tcg/ppc/tcg-target.c:1247
1247    {
(gdb) bt
#0  0x0000000060014828 in ppc_tb_set_jmp_target (jmp_addr=<error reading variable: Cannot access memory at address 0x3fffb3a2a748>, addr=<error reading variable: Cannot access memory at address 0x3fffb3a2a740>) at /home/admin/qemu/tcg/ppc/tcg-target.c:1247
#1  0x0000000060009ce0 in tb_set_jmp_target (tb=0x3fffb3adf4f0, n=0, addr=1614371232) at /home/admin/qemu/include/exec/exec-all.h:286
#2  0x000000006000b648 in tb_reset_jump (tb=0x3fffb3adf4f0, n=0) at /home/admin/qemu/translate-all.c:907
#3  0x000000006000c5d0 in tb_link_page (tb=0x3fffb3adf4f0, phys_pc=274888060616, phys_page2=18446744073709551615) at /home/admin/qemu/translate-all.c:1363
#4  0x000000006000bccc in tb_gen_code (cpu=0x6241b570, pc=274888060616, cs_base=0, flags=41967617, cflags=0) at /home/admin/qemu/translate-all.c:1034
#5  0x000000006000e6f0 in tb_find_slow (env=0x62423990, pc=274888060616, cs_base=0, flags=41967617) at /home/admin/qemu/cpu-exec.c:299
#6  0x000000006000ea14 in tb_find_fast (env=0x62423990) at /home/admin/qemu/cpu-exec.c:327
#7  0x000000006000efe4 in cpu_ppc_exec (env=0x62423990) at /home/admin/qemu/cpu-exec.c:485
#8  0x00000000600716e4 in cpu_loop (env=0x62423990) at /home/admin/qemu/linux-user/main.c:1569
#9  0x0000000060083c50 in clone_func (arg=0x3fffffffcbb8) at /home/admin/qemu/linux-user/syscall.c:4536
#10 0x00003fffb7cc89d8 in start_thread (arg=0x3fffb3a7f170) at pthread_create.c:314
#11 0x00003fffb7c1ef00 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:104


Can you reproduce this on a real host? I wonder whether the fact that the
host here is a VM has to do with it.

Thanks,

		Emilio


* Re: [Qemu-devel] linux-user crashes on clone(2) when run on ppc host
  2015-06-17  0:52 [Qemu-devel] linux-user crashes on clone(2) when run on ppc host Emilio G. Cota
@ 2015-06-17  8:58 ` Peter Maydell
  2015-06-17 21:36   ` Emilio G. Cota
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Maydell @ 2015-06-17  8:58 UTC
  To: Emilio G. Cota
  Cc: Riku Voipio, qemu-ppc@nongnu.org, Alexander Graf, QEMU Developers

On 17 June 2015 at 01:52, Emilio G. Cota <cota@braap.org> wrote:
> Hi,
>
> I'm having trouble running a simple multithreaded program on a PowerPC host machine.
>
> The machine I'm using is a ppc VM--I think it's running under KVM (I'm using
> OVH's RunAbove Power8 service):
>   admin@adsf:~/qemu$ uname -a
>   Linux adsf 3.13.0-37-generic #64-Ubuntu SMP Mon Sep 22 21:27:09 UTC 2014 ppc64le ppc64le ppc64le GNU/Linux
>
> admin@adsf:~/qemu$ ppc64le-linux-user/qemu-ppc64le foo

Multithreaded binaries don't work with linux-user; there are a bunch
of known race conditions involving data structures we don't correctly
lock or make per-thread.

This is a long-standing issue; we're hoping we might get to fixing
it some time this year.

thanks
-- PMM


* Re: [Qemu-devel] linux-user crashes on clone(2) when run on ppc host
  2015-06-17  8:58 ` Peter Maydell
@ 2015-06-17 21:36   ` Emilio G. Cota
  2015-06-18  7:42     ` Peter Maydell
  0 siblings, 1 reply; 7+ messages in thread
From: Emilio G. Cota @ 2015-06-17 21:36 UTC
  To: Peter Maydell
  Cc: Riku Voipio, qemu-ppc@nongnu.org, Alexander Graf, QEMU Developers

On Wed, Jun 17, 2015 at 09:58:27 +0100, Peter Maydell wrote:
> On 17 June 2015 at 01:52, Emilio G. Cota <cota@braap.org> wrote:
> > I'm having trouble running a simple multithreaded program on a PowerPC host machine.
> >
> > The machine I'm using is a ppc VM--I think it's running under KVM (I'm using
> > OVH's RunAbove Power8 service):
> >   admin@adsf:~/qemu$ uname -a
> >   Linux adsf 3.13.0-37-generic #64-Ubuntu SMP Mon Sep 22 21:27:09 UTC 2014 ppc64le ppc64le ppc64le GNU/Linux
> >
> > admin@adsf:~/qemu$ ppc64le-linux-user/qemu-ppc64le foo
> 
> Multithreaded binaries don't work with linux-user; there are a bunch
> of known race conditions involving data structures we don't correctly
> lock or make per-thread.
> 
> This is a long-standing issue; we're hoping we might get to fixing
> it some time this year.

I don't think this is a race because it also breaks when
run on a single core (with taskset -c 0).

What data structures are you referring to? Are they ppc-specific?
On x86 hosts linux-user works reliably with multithreaded apps.
I'd expect any races on common code to show up on x86 as well.

Thanks,

		Emilio


* Re: [Qemu-devel] linux-user crashes on clone(2) when run on ppc host
  2015-06-17 21:36   ` Emilio G. Cota
@ 2015-06-18  7:42     ` Peter Maydell
  2015-06-18 14:23       ` Emilio G. Cota
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Maydell @ 2015-06-18  7:42 UTC
  To: Emilio G. Cota
  Cc: Riku Voipio, qemu-ppc@nongnu.org, Alexander Graf, QEMU Developers

On 17 June 2015 at 22:36, Emilio G. Cota <cota@braap.org> wrote:
> On Wed, Jun 17, 2015 at 09:58:27 +0100, Peter Maydell wrote:
>> On 17 June 2015 at 01:52, Emilio G. Cota <cota@braap.org> wrote:
>> > I'm having trouble running a simple multithreaded program on a PowerPC host machine.
>> >
>> > The machine I'm using is a ppc VM--I think it's running under KVM (I'm using
>> > OVH's RunAbove Power8 service):
>> >   admin@adsf:~/qemu$ uname -a
>> >   Linux adsf 3.13.0-37-generic #64-Ubuntu SMP Mon Sep 22 21:27:09 UTC 2014 ppc64le ppc64le ppc64le GNU/Linux
>> >
>> > admin@adsf:~/qemu$ ppc64le-linux-user/qemu-ppc64le foo
>>
>> Multithreaded binaries don't work with linux-user; there are a bunch
>> of known race conditions involving data structures we don't correctly
>> lock or make per-thread.
>>
>> This is a long-standing issue; we're hoping we might get to fixing
>> it some time this year.
>
> I don't think this is a race because it also breaks when
> run on a single core (with taskset -c 0).
>
> What data structures are you referring to? Are they ppc-specific?

None of the code generation data structures are locked at all --
if two threads try to generate code at the same time they'll
tend to clobber each other.

> On x86 hosts linux-user works reliably with multithreaded apps.

No, it doesn't :-) If any multithreaded app happens to run on
any host it is pure fluke.

-- PMM


* Re: [Qemu-devel] linux-user crashes on clone(2) when run on ppc host
  2015-06-18  7:42     ` Peter Maydell
@ 2015-06-18 14:23       ` Emilio G. Cota
  2015-06-18 14:55         ` Peter Maydell
  0 siblings, 1 reply; 7+ messages in thread
From: Emilio G. Cota @ 2015-06-18 14:23 UTC
  To: Peter Maydell
  Cc: Riku Voipio, qemu-ppc@nongnu.org, Alexander Graf, QEMU Developers

On Thu, Jun 18, 2015 at 08:42:40 +0100, Peter Maydell wrote:
> > What data structures are you referring to? Are they ppc-specific?
> 
> None of the code generation data structures are locked at all --
> if two threads try to generate code at the same time they'll
> tend to clobber each other.

AFAICT tb_gen_code is called with a mutex held (the sequence is
mutex->tb_find_fast->tb_find_slow->tb_gen_code in cpu-exec.c)

The only call to tb_gen_code that does not hold the lock in
user mode is in cpu_breakpoint_insert->breakpoint_invalidate->
tb_invalidate_phys_page_range->tb_gen_code. I'm not using
gdb so I guess I cannot trigger this.

Am I missing something?

> On 17 June 2015 at 22:36, Emilio G. Cota <cota@braap.org> wrote:
> > I don't think this is a race because it also breaks when
> > run on a single core (with taskset -c 0).

As I said, this problem doesn't seem to be a race.

Thanks,

		Emilio


* Re: [Qemu-devel] linux-user crashes on clone(2) when run on ppc host
  2015-06-18 14:23       ` Emilio G. Cota
@ 2015-06-18 14:55         ` Peter Maydell
  2015-06-18 18:36           ` Emilio G. Cota
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Maydell @ 2015-06-18 14:55 UTC
  To: Emilio G. Cota
  Cc: Riku Voipio, qemu-ppc@nongnu.org, Alexander Graf, QEMU Developers

On 18 June 2015 at 15:23, Emilio G. Cota <cota@braap.org> wrote:
> On Thu, Jun 18, 2015 at 08:42:40 +0100, Peter Maydell wrote:
>> > What data structures are you referring to? Are they ppc-specific?
>>
>> None of the code generation data structures are locked at all --
>> if two threads try to generate code at the same time they'll
>> tend to clobber each other.
>
> AFAICT tb_gen_code is called with a mutex held (the sequence is
> mutex->tb_find_fast->tb_find_slow->tb_gen_code in cpu-exec.c)
>
> The only call to tb_gen_code that does not hold the lock in
> user mode is in cpu_breakpoint_insert->breakpoint_invalidate->
> tb_invalidate_phys_page_range->tb_gen_code. I'm not using
> gdb so I guess I cannot trigger this.
>
> Am I missing something?

I'd forgotten we had that mutex. However it's not actually
a sufficient fix for the problem. What needs to happen is
that:
 (a) somebody actually sits down and figures out what data
structures we have and what locking/per-cpuness/etc they need,
ie a design
 (b) somebody implements that design

This is happening as part of the TCG multithreading work:
  http://wiki.qemu.org/Features/tcg-multithread

This is the bug we've had kicking around for a while about
multithreading races:
 https://bugs.launchpad.net/qemu/+bug/1098729

As just one example race, consider the possibility that
thread A calls tb_gen_code, which calls tb_alloc, which
calls tb_flush, which clears the whole code cache, and then
tb_gen_code starts generating code over the top of a TB
that thread B was in the middle of executing from...
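
The interleaving described above can be replayed by hand in a toy
model. This is only a sketch under loud assumptions: it is not QEMU
code, every name in it (toy_tb_alloc, toy_tb_gen, toy_flush_race,
CACHE_SIZE) is hypothetical, and the two "threads" are simulated by
calling the functions in the bad order from a single thread so that
the outcome is deterministic.

```c
#include <string.h>

/* Toy model only: none of these names exist in QEMU. */
#define CACHE_SIZE 64

static char code_cache[CACHE_SIZE];  /* stands in for the translated-code buffer */
static size_t cache_used;            /* bytes handed out so far */

/* Hand out space for one "translation block"; flush everything when full. */
char *toy_tb_alloc(size_t len)
{
    if (cache_used + len > CACHE_SIZE) {
        memset(code_cache, 0, sizeof(code_cache)); /* the "flush" */
        cache_used = 0;
    }
    char *tb = code_cache + cache_used;
    cache_used += len;
    return tb;
}

/* "Generate code": fill the block with a tag byte naming its owner. */
char *toy_tb_gen(char tag, size_t len)
{
    char *tb = toy_tb_alloc(len);
    memset(tb, tag, len);
    return tb;
}

/*
 * Replay one bad interleaving. Returns 1 if the block that "thread B"
 * was still executing from has been overwritten by "thread A".
 */
int toy_flush_race(void)
{
    memset(code_cache, 0, sizeof(code_cache));
    cache_used = 0;

    char *b_tb = toy_tb_gen('B', 48); /* B translates and starts executing */
    char *a_tb = toy_tb_gen('A', 32); /* A needs space: 48 + 32 > 64, so flush */

    /* B still holds b_tb, but the bytes behind it now belong to A. */
    return a_tb == b_tb && b_tb[0] == 'A';
}
```

In the model, nothing stops the allocator from recycling memory that
another thread still treats as live; a lock around code generation
alone does not help a thread that is executing rather than generating.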

>> On 17 June 2015 at 22:36, Emilio G. Cota <cota@braap.org> wrote:
>> > I don't think this is a race because it also breaks when
>> > run on a single core (with taskset -c 0).
>
> As I said, this problem doesn't seem to be a race.

The multiple threads will still all be racing with each
other on the single core.
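
As a hedged illustration of why pinning to one core does not rule out
races in general: the sketch below simulates two logical threads being
time-sliced on one core, with a context switch landing between one
thread's load and its store. The names (toy_thread, toy_load,
toy_store, toy_lost_update) are made up for this example, and the
schedule is replayed by hand rather than with real threads.

```c
/* Toy model: two logical threads time-sliced on ONE core. */
struct toy_thread {
    int tmp;  /* private "register" holding the value the thread loaded */
};

static int shared_counter;

/* First half of a non-atomic increment: read the shared value. */
void toy_load(struct toy_thread *t)
{
    t->tmp = shared_counter;
}

/* Second half: write back the private copy plus one. */
void toy_store(struct toy_thread *t)
{
    shared_counter = t->tmp + 1;
}

/* Replay a schedule where A is preempted between its load and store. */
int toy_lost_update(void)
{
    struct toy_thread a, b;

    shared_counter = 0;
    toy_load(&a);   /* A reads 0 */
    /* timer tick: the scheduler switches to B */
    toy_load(&b);   /* B reads 0 */
    toy_store(&b);  /* B writes 1 */
    /* the scheduler switches back to A */
    toy_store(&a);  /* A writes 1: B's update is lost */

    return shared_counter;  /* 1, although two increments ran */
}
```

Whether this kind of preemption explains the crash reported in this
thread is a separate question that the sketch does not answer.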

In general I don't see much benefit in detailed investigation
into the precise reason why a guest program crashes when
the whole area is known to be fundamentally not designed
right...

thanks
-- PMM


* Re: [Qemu-devel] linux-user crashes on clone(2) when run on ppc host
  2015-06-18 14:55         ` Peter Maydell
@ 2015-06-18 18:36           ` Emilio G. Cota
  0 siblings, 0 replies; 7+ messages in thread
From: Emilio G. Cota @ 2015-06-18 18:36 UTC
  To: Peter Maydell
  Cc: Riku Voipio, qemu-ppc@nongnu.org, Alexander Graf, QEMU Developers

On Thu, Jun 18, 2015 at 15:55:54 +0100, Peter Maydell wrote:
> I'd forgotten we had that mutex. However it's not actually
> a sufficient fix for the problem. What needs to happen is
> that:
>  (a) somebody actually sits down and figures out what data
> structures we have and what locking/per-cpuness/etc they need,
> ie a design
>  (b) somebody implements that design

I'm doing exactly (a) and (b). In doing so I've found this crash,
and I think that it is not due to races in QEMU--it seems
to be a ppc64 issue.

> This is happening as part of the TCG multithreading work:
>   http://wiki.qemu.org/Features/tcg-multithread

I'm closely following these discussions. My goal is to
have a sane multithreaded linux-user first, and then move on
to full-system.

> This is the bug we've had kicking around for a while about
> multithreading races:
>  https://bugs.launchpad.net/qemu/+bug/1098729

I've tried to reproduce it, but I can't (I could easily trigger
it with the qemu that is packaged by Ubuntu 14.04). I've
added a message to the thread stating this.

> As just one example race, consider the possibility that
> thread A calls tb_gen_code, which calls tb_alloc, which
> calls tb_flush, which clears the whole code cache, and then
> tb_gen_code starts generating code over the top of a TB
> that thread B was in the middle of executing from...

Agreed, this needs to be fixed. Certainly not the problem I'm
reporting here, however.

> >> On 17 June 2015 at 22:36, Emilio G. Cota <cota@braap.org> wrote:
> >> > I don't think this is a race because it also breaks when
> >> > run on a single core (with taskset -c 0).
> >
> > As I said, this problem doesn't seem to be a race.
> 
> The multiple threads will still all be racing with each
> other on the single core.

How? Even the bug report on launchpad cannot be triggered (on a
relatively ancient qemu) if the program is pinned to just one host core.

> In general I don't see much benefit in detailed investigation
> into the precise reason why a guest program crashes when
> the whole area is known to be fundamentally not designed
> right...

I'm working on such a design, not just on paper but with working
code. For instance, I'm trying to run qemu on ppc64 to test an
initial solution to the TSO-on-RMO problem, i.e. the memory
consistency mismatch. ppc64 is unfortunately the only SMP RMO
machine I could get access to--I'd be happy to test this on an
ARM SMP and forget about ppc64 for now if I could.
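
To make the mismatch concrete, here is a minimal sketch of the
message-passing pattern at stake. On a TSO machine such as x86, two
plain stores become visible in program order, so guest code can
publish data and then set a flag without explicit barriers; on a
host with weaker ordering, such as Power, that ordering has to be
requested explicitly. The sketch uses C11 atomics; the names
(payload, ready, mp_publish, mp_try_consume) are hypothetical, and
this is neither the initial solution mentioned above nor anything
QEMU does.

```c
#include <stdatomic.h>

static int payload;       /* plain data written before the flag */
static atomic_int ready;  /* flag published by the writer */

/* Writer: the release store orders the payload write before the flag. */
void mp_publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Reader: returns 1 and fills *out if the flag is set, 0 otherwise. */
int mp_try_consume(int *out)
{
    /* The acquire load orders the flag read before the payload read. */
    if (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        return 0;
    *out = payload;
    return 1;
}
```

With memory_order_relaxed in both places the code would still tend to
work on an x86 host, which is why this class of bug only shows up once
the host is weaker than the guest.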

To me this looks like a ppc64 issue, and I'd be very grateful
if ppc64 folks could take a look.

Thanks,

		Emilio


