Date: Fri, 14 Aug 2015 16:08:52 +0200
From: Kevin Wolf
To: Peter Lieven
Cc: qemu block, Stefan Hajnoczi, Alexander Bezzubikov, qemu-devel, Paolo Bonzini, John Snow
Subject: Re: [Qemu-devel] [Qemu-block] RFC cdrom in own thread?
Message-ID: <20150814140852.GB5856@noname.redhat.com>
In-Reply-To: <55CDF083.9080503@kamp.de>
References: <20150618074500.GB4270@noname.redhat.com> <558281B2.6020905@kamp.de>
 <20150618084241.GC4270@noname.redhat.com> <55828F73.3080809@kamp.de>
 <558415C8.3060207@kamp.de> <55880926.5070800@kamp.de> <55888430.50504@redhat.com>
 <55CDF083.9080503@kamp.de>

On 14.08.2015 at 15:43, Peter Lieven wrote:
> On 22.06.2015 at 23:54, John Snow wrote:
> >
> > On 06/22/2015 09:09 AM, Peter Lieven wrote:
> >> On 22.06.2015 at 11:25, Stefan Hajnoczi wrote:
> >>> On Fri, Jun 19, 2015 at 2:14 PM, Peter Lieven wrote:
> >>>> On 18.06.2015 at 11:36, Stefan Hajnoczi wrote:
> >>>>> On Thu, Jun 18, 2015 at 10:29 AM, Peter Lieven wrote:
> >>>>>> On 18.06.2015 at 10:42, Kevin Wolf wrote:
> >>>>>>> On 18.06.2015 at 10:30, Peter Lieven wrote:
> >>>>>>>> On 18.06.2015 at 09:45, Kevin Wolf wrote:
> >>>>>>>>> On 18.06.2015 at 09:12, Peter Lieven wrote:
> >>>>>>>>>> Thread 2 (Thread 0x7ffff5550700 (LWP 2636)):
> >>>>>>>>>> #0 0x00007ffff5d87aa3 in ppoll () from /lib/x86_64-linux-gnu/libc.so.6
> >>>>>>>>>> No symbol table info available.
> >>>>>>>>>> #1 0x0000555555955d91 in qemu_poll_ns (fds=0x5555563889c0, nfds=3,
> >>>>>>>>>>     timeout=4999424576) at qemu-timer.c:326
> >>>>>>>>>>         ts = {tv_sec = 4, tv_nsec = 999424576}
> >>>>>>>>>>         tvsec = 4
> >>>>>>>>>> #2 0x0000555555956feb in aio_poll (ctx=0x5555563528e0, blocking=true)
> >>>>>>>>>>     at aio-posix.c:231
> >>>>>>>>>>         node = 0x0
> >>>>>>>>>>         was_dispatching = false
> >>>>>>>>>>         ret = 1
> >>>>>>>>>>         progress = false
> >>>>>>>>>> #3 0x000055555594aeed in bdrv_prwv_co (bs=0x55555637eae0, offset=4292007936,
> >>>>>>>>>>     qiov=0x7ffff554f760, is_write=false, flags=0) at block.c:2699
> >>>>>>>>>>         aio_context = 0x5555563528e0
> >>>>>>>>>>         co = 0x5555563888a0
> >>>>>>>>>>         rwco = {bs = 0x55555637eae0, offset = 4292007936,
> >>>>>>>>>>           qiov = 0x7ffff554f760, is_write = false, ret = 2147483647, flags = 0}
> >>>>>>>>>> #4 0x000055555594afa9 in bdrv_rw_co (bs=0x55555637eae0, sector_num=8382828,
> >>>>>>>>>>     buf=0x7ffff44cc800 "(", nb_sectors=4, is_write=false, flags=0)
> >>>>>>>>>>     at block.c:2722
> >>>>>>>>>>         qiov = {iov = 0x7ffff554f780, niov = 1, nalloc = -1, size = 2048}
> >>>>>>>>>>         iov = {iov_base = 0x7ffff44cc800, iov_len = 2048}
> >>>>>>>>>> #5 0x000055555594b008 in bdrv_read (bs=0x55555637eae0, sector_num=8382828,
> >>>>>>>>>>     buf=0x7ffff44cc800 "(", nb_sectors=4) at block.c:2730
> >>>>>>>>>> No locals.
> >>>>>>>>>> #6 0x000055555599acef in blk_read (blk=0x555556376820, sector_num=8382828,
> >>>>>>>>>>     buf=0x7ffff44cc800 "(", nb_sectors=4) at block/block-backend.c:404
> >>>>>>>>>> No locals.
> >>>>>>>>>> #7 0x0000555555833ed2 in cd_read_sector (s=0x555556408f88, lba=2095707,
> >>>>>>>>>>     buf=0x7ffff44cc800 "(", sector_size=2048) at hw/ide/atapi.c:116
> >>>>>>>>>>         ret = 32767
> >>>>>>>>> Here is the problem: The ATAPI emulation uses synchronous blk_read()
> >>>>>>>>> instead of the AIO or coroutine interfaces. This means that it keeps
> >>>>>>>>> polling for request completion while it holds the BQL until the request
> >>>>>>>>> is completed.
> >>>>>>>> I will look at this.
> >>>>>> I need some further help. My way to "emulate" a hung NFS server is to
> >>>>>> block it in the firewall. Currently I face the problem that I cannot mount
> >>>>>> a CD ISO via libnfs (nfs://) without hanging QEMU (I previously tried with
> >>>>>> a kernel NFS mount). It reads a few sectors and then stalls (maybe another
> >>>>>> bug):
> >>>>>>
> >>>>>> (gdb) thread apply all bt full
> >>>>>>
> >>>>>> Thread 3 (Thread 0x7ffff0c21700 (LWP 29710)):
> >>>>>> #0 qemu_cond_broadcast (cond=cond@entry=0x555556259940) at
> >>>>>>     util/qemu-thread-posix.c:120
> >>>>>>         err = <optimized out>
> >>>>>>         __func__ = "qemu_cond_broadcast"
> >>>>>> #1 0x0000555555911164 in rfifolock_unlock (r=r@entry=0x555556259910) at
> >>>>>>     util/rfifolock.c:75
> >>>>>>         __PRETTY_FUNCTION__ = "rfifolock_unlock"
> >>>>>> #2 0x0000555555875921 in aio_context_release (ctx=ctx@entry=0x5555562598b0)
> >>>>>>     at async.c:329
> >>>>>> No locals.
> >>>>>> #3 0x000055555588434c in aio_poll (ctx=ctx@entry=0x5555562598b0,
> >>>>>>     blocking=blocking@entry=true) at aio-posix.c:272
> >>>>>>         node = <optimized out>
> >>>>>>         was_dispatching = false
> >>>>>>         i = <optimized out>
> >>>>>>         ret = <optimized out>
> >>>>>>         progress = false
> >>>>>>         timeout = 611734526
> >>>>>>         __PRETTY_FUNCTION__ = "aio_poll"
> >>>>>> #4 0x00005555558bc43d in bdrv_prwv_co (bs=bs@entry=0x55555627c0f0,
> >>>>>>     offset=offset@entry=7038976, qiov=qiov@entry=0x7ffff0c208f0,
> >>>>>>     is_write=is_write@entry=false, flags=flags@entry=(unknown: 0)) at
> >>>>>>     block/io.c:552
> >>>>>>         aio_context = 0x5555562598b0
> >>>>>>         co = <optimized out>
> >>>>>>         rwco = {bs = 0x55555627c0f0, offset = 7038976, qiov = 0x7ffff0c208f0,
> >>>>>>           is_write = false, ret = 2147483647, flags = (unknown: 0)}
> >>>>>> #5 0x00005555558bc533 in bdrv_rw_co (bs=0x55555627c0f0,
> >>>>>>     sector_num=sector_num@entry=13748, buf=buf@entry=0x555557874800 "(",
> >>>>>>     nb_sectors=nb_sectors@entry=4, is_write=is_write@entry=false,
> >>>>>>     flags=flags@entry=(unknown: 0)) at block/io.c:575
> >>>>>>         qiov = {iov = 0x7ffff0c208e0, niov = 1, nalloc = -1, size = 2048}
> >>>>>>         iov = {iov_base = 0x555557874800, iov_len = 2048}
> >>>>>> #6 0x00005555558bc593 in bdrv_read (bs=<optimized out>,
> >>>>>>     sector_num=sector_num@entry=13748, buf=buf@entry=0x555557874800 "(",
> >>>>>>     nb_sectors=nb_sectors@entry=4) at block/io.c:583
> >>>>>> No locals.
> >>>>>> #7 0x00005555558af75d in blk_read (blk=<optimized out>,
> >>>>>>     sector_num=sector_num@entry=13748, buf=buf@entry=0x555557874800 "(",
> >>>>>>     nb_sectors=nb_sectors@entry=4) at block/block-backend.c:493
> >>>>>>         ret = <optimized out>
> >>>>>> #8 0x00005555557abb88 in cd_read_sector (sector_size=<optimized out>,
> >>>>>>     buf=0x555557874800 "(", lba=3437, s=0x55555760db70) at hw/ide/atapi.c:116
> >>>>>>         ret = <optimized out>
> >>>>>> #9 ide_atapi_cmd_reply_end (s=0x55555760db70) at hw/ide/atapi.c:190
> >>>>>>         byte_count_limit = <optimized out>
> >>>>>>         size = <optimized out>
> >>>>>>         ret = 2
> >>>>> This is still the same scenario Kevin explained.
> >>>>>
> >>>>> The ATAPI CD-ROM emulation code is using synchronous blk_read(). This
> >>>>> function holds the QEMU global mutex while waiting for the I/O request
> >>>>> to complete. This blocks other vcpu threads and the main loop thread.
> >>>>>
> >>>>> The solution is to convert the CD-ROM emulation code to use
> >>>>> blk_aio_readv() instead of blk_read().
> >>>> I tried a little, but I am stuck with my approach. It reads one sector
> >>>> and then doesn't continue. Maybe someone with more knowledge
> >>>> of ATAPI/IDE could help?
> >>> Converting synchronous code to asynchronous requires an understanding
> >>> of the device's state transitions. Asynchronous code has to put the
> >>> device registers into a busy state until the request completes. It
> >>> also needs to handle hardware register accesses that occur while the
> >>> request is still pending.
> >> That was my assumption as well. But I don't know how to proceed...
> >>
> >>> I don't know ATAPI/IDE code well enough to suggest a fix.
> >> Maybe @John can help?
> >>
> >> Peter
> >>
> >
>
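To make Stefan's suggestion above a bit more concrete, here is a rough and
completely untested sketch of what an asynchronous cd_read_sector() could look
like. cd_read_sector_cb() is a made-up name, the 2352-byte raw sector case is
left out, and the bookkeeping that ide_atapi_cmd_reply_end() currently does
after the synchronous read would have to move into the callback. This is only
meant to illustrate the direction, not an actual patch:

/* Sketch against hw/ide/atapi.c: submit the read with blk_aio_readv(), keep
 * the drive in BSY while the request is in flight, and continue the PIO
 * state machine from the completion callback instead of blocking in
 * blk_read() under the BQL. */

static void cd_read_sector_cb(void *opaque, int ret)
{
    IDEState *s = opaque;

    s->status &= ~BUSY_STAT;            /* request finished, drop busy state */

    if (ret < 0) {
        ide_atapi_io_error(s, ret);     /* report the error to the guest */
        return;
    }

    /* (the 2352-byte case with buf + 16 and cd_data_to_raw() would need the
     * same treatment here) */

    s->lba++;
    s->io_buffer_index = 0;
    ide_atapi_cmd_reply_end(s);         /* resume the transfer */
}

static int cd_read_sector(IDEState *s, int lba, void *buf, int sector_size)
{
    if (sector_size != 2048) {
        return -EINVAL;
    }

    s->iov.iov_base = buf;
    s->iov.iov_len = 4 * BDRV_SECTOR_SIZE;
    qemu_iovec_init_external(&s->qiov, &s->iov, 1);

    /* The device stays busy until the AIO completes; register accesses that
     * happen in the meantime have to see BSY set. */
    s->status |= BUSY_STAT;

    blk_aio_readv(s->blk, (int64_t)lba << 2, &s->qiov, 4,
                  cd_read_sector_cb, s);
    return 0;
}

The tricky part is exactly what you ran into: every caller of cd_read_sector()
that currently expects the data to be in the buffer on return has to be
reworked so that it only continues from the callback.
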
> I looked into this again and it seems that the remaining problem (at least when the CDROM is
> mounted via libnfs) is the blk_drain_all() in bmdma_cmd_writeb. At least I end there if I have
> a proper OS booted and cut off the NFS server. The VM remains responsive until the guest OS
> issues a DMA cancel.
>
> I do not know what the proper solution is. I had the following ideas so far (not knowing if the
> approaches would be correct or not).
>
> a) Do not clear BM_STATUS_DMAING if we are not able to drain all requests. This works until
>    the connection is reestablished. The guest OS issues DMA cancel operations again and
>    again, but when the connectivity is back I end in the following assertion:
>
> qemu-system-x86_64: ./hw/ide/pci.h:65: bmdma_active_if: Assertion `bmdma->bus->retry_unit != (uint8_t)-1' failed.

I would have to check the specs to see if this is allowed.

> b) Call the aiocb with -ECANCELED and somehow (?) turn all the callbacks of the outstanding IOs
>    into NOPs.

This wouldn't be correct for write requests: We would tell the guest that
the request is cancelled when it's actually still in flight. At some
point it could still complete, however, and that's not expected by the
guest.

> c) Follow the hint in the comment in bmdma_cmd_writeb (however this works out):
>    * In the future we'll be able to safely cancel the I/O if the
>    * whole DMA operation will be submitted to disk with a single
>    * aio operation with preadv/pwritev.

Not sure how likely it is that cancelling that single AIOCB will actually
cancel the operation and not end up doing bdrv_drain_all() internally
instead because there is no good way of cancelling the request.
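Just to spell out what c) would mean in the stop branch of bmdma_cmd_writeb(),
assuming the whole transfer really were submitted as one AIOCB stored in
bm->bus->dma->aiocb (which is not the case today), something like this sketch,
untested:

    if (!(val & BM_CMD_START)) {
        if (bm->bus->dma->aiocb) {
            /* Cancel only the one outstanding request instead of draining
             * everything.  The caveat from above applies: if the driver
             * (libnfs here) cannot really abort the request,
             * blk_aio_cancel() waits for completion internally and we
             * block just like with blk_drain_all() today. */
            blk_aio_cancel(bm->bus->dma->aiocb);
        }
        bm->status &= ~BM_STATUS_DMAING;
    }

Kevin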