From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:39189)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <imammedo@redhat.com>) id 1ZEzrI-00054A-TP
	for qemu-devel@nongnu.org; Tue, 14 Jul 2015 09:02:54 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <imammedo@redhat.com>) id 1ZEzrD-0004ZC-TK
	for qemu-devel@nongnu.org; Tue, 14 Jul 2015 09:02:52 -0400
Received: from mx1.redhat.com ([209.132.183.28]:56225)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <imammedo@redhat.com>) id 1ZEzrD-0004Yp-Md
	for qemu-devel@nongnu.org; Tue, 14 Jul 2015 09:02:47 -0400
Received: from int-mx14.intmail.prod.int.phx2.redhat.com
	(int-mx14.intmail.prod.int.phx2.redhat.com [10.5.11.27])
	by mx1.redhat.com (Postfix) with ESMTPS id 3F91F376B79
	for <qemu-devel@nongnu.org>; Tue, 14 Jul 2015 13:02:47 +0000 (UTC)
Date: Tue, 14 Jul 2015 15:02:44 +0200
From: Igor Mammedov <imammedo@redhat.com>
Message-ID: <20150714150244.44c323eb@nial.brq.redhat.com>
In-Reply-To: <20150713231133-mutt-send-email-mst@redhat.com>
References: <1436442444-132020-1-git-send-email-imammedo@redhat.com>
	<1436442444-132020-5-git-send-email-imammedo@redhat.com>
	<20150709155919-mutt-send-email-mst@redhat.com>
	<559E7A65.6080908@redhat.com>
	<20150709164336-mutt-send-email-mst@redhat.com>
	<20150710121236.172d59e9@nial.brq.redhat.com>
	<20150713095252-mutt-send-email-mst@redhat.com>
	<20150713205513.1e7abe55@igors-macbook-pro.local>
	<20150713231133-mutt-send-email-mst@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH v4 4/7] pc: fix QEMU crashing when more
 than ~50 memory hotplugged
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>, qemu-devel@nongnu.org

On Mon, 13 Jul 2015 23:14:37 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Mon, Jul 13, 2015 at 08:55:13PM +0200, Igor Mammedov wrote:
> > On Mon, 13 Jul 2015 09:55:18 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Fri, Jul 10, 2015 at 12:12:36PM +0200, Igor Mammedov wrote:
> > > > On Thu, 9 Jul 2015 16:46:43 +0300
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > 
> > > > > On Thu, Jul 09, 2015 at 03:43:01PM +0200, Paolo Bonzini wrote:
> > > > > > 
> > > > > > 
> > > > > > On 09/07/2015 15:06, Michael S. Tsirkin wrote:
> > > > > > > > QEMU asserts in vhost due to hitting vhost backend limit
> > > > > > > > on number of supported memory regions.
> > > > > > > > 
> > > > > > > > Describe all hotplugged memory as one continuous range
> > > > > > > > to vhost with linear 1:1 HVA->GPA mapping in backend.
> > > > > > > > 
> > > > > > > > Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> > > > > > >
> > > > > > > Hmm - a bunch of work here to recombine MRs that memory
> > > > > > > listener interface breaks up.  In particular KVM could
> > > > > > > benefit from this too (on workloads that change the table a
> > > > > > > lot).  Can't we teach memory core to pass hva range as a
> > > > > > > single continuous range to memory listeners?
> > > > > > 
> > > > > > Memory listeners are based on memory regions, not HVA ranges.
> > > > > > 
> > > > > > Paolo
> > > > > 
> > > > > Many listeners care about HVA ranges. I know KVM and vhost do.
> > > > I'm not sure about KVM; it works just fine with fragmented memory
> > > > regions, and the same will apply to vhost once the module parameter
> > > > to increase the limit is merged.
> > > > 
> > > > But changing the generic memory listener interface to replace
> > > > HVA-mapped regions with an HVA container would lead to a case where
> > > > listeners won't see the exact layout that they might need.
> > > 
> > > I don't think they care, really.
> > > 
> > > > In addition, vhost itself will suffer from working with a big HVA,
> > > > since it allocates the log depending on size of memory => bigger log.
> > > 
> > > Not really - it allocates the log depending on the PA range.
> > > Leaving unused holes doesn't reduce its size.
> > If it used an HVA container instead, it would always allocate the
> > log for the max possible GPA, meaning that -m 1024,maxmem=1T would
> > waste a lot of memory, and more so for a bigger maxmem.
> > It's still possible to induce the worst case by plugging a pc-dimm at
> > the end of the hotplug-memory area by specifying its address explicitly.
> > That problem has existed since memory hot-add was introduced; I just
> > hadn't noticed it back then.
> 
> There you are then. Depending on maxmem seems cleaner as it's more
> predictable.
> 
> > It's perfectly fine to allocate the log by last GPA as long as
> > memory is nearly continuous, but memory hot-add makes it possible to
> > have a sparse layout with huge gaps between guest-mapped RAM,
> > which makes the current log handling inefficient.
> > 
> > I wonder how hard it would be to make log_size depend on present RAM
> > size rather than max present GPA so it wouldn't allocate excess 
> > memory for log.
> 
> We can simply map the unused parts of the log RESERVED.
That means the vhost listener would still have to get the RAM regions,
so that it knows which parts of the log it has to map with
mmap(MAP_NORESERVE) and drop with madvise(MADV_DONTNEED).

It would also require a custom allocator for the log that could manage
punching/unpunching holes in the log depending on the RAM layout.
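
To make the above a bit more concrete, here is a rough sketch of what
such a sparse log could look like. It is not QEMU code and only an
illustration under assumptions: one dirty bit per 4 KiB guest page,
Linux mmap()/madvise() semantics, and the helper names
log_bytes_for_gpa(), log_map_sparse() and log_punch_hole() are made up
for this example.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* MAP_ANONYMOUS, MAP_NORESERVE, madvise() on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumption: the log holds one dirty bit per 4 KiB guest page. */
#define LOG_PAGE_SIZE 0x1000ULL

/* Bytes of bitmap needed to cover guest physical addresses [0, last_gpa]. */
static size_t log_bytes_for_gpa(uint64_t last_gpa)
{
    uint64_t bits = last_gpa / LOG_PAGE_SIZE + 1;
    return (size_t)((bits + 7) / 8);
}

/* Reserve address space for the whole log without committing memory. */
static uint8_t *log_map_sparse(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? NULL : (uint8_t *)p;
}

/*
 * Hypothetical "punch" operation: release the host pages of the log that
 * only describe the GPA range [gpa_start, gpa_start + gpa_len), i.e. a
 * range with no RAM in it. Only host pages lying completely inside the
 * range are dropped; partially covered pages are left alone.
 */
static int log_punch_hole(uint8_t *log, uint64_t gpa_start, uint64_t gpa_len,
                          size_t host_page)
{
    uint64_t first_bit = (gpa_start + LOG_PAGE_SIZE - 1) / LOG_PAGE_SIZE;
    uint64_t end_bit = (gpa_start + gpa_len) / LOG_PAGE_SIZE;
    uint64_t first = (first_bit + 7) / 8; /* first byte fully in the hole */
    uint64_t end = end_bit / 8;           /* one past the last such byte  */

    /* madvise() needs a page-aligned start address. */
    assert(host_page != 0 && ((uintptr_t)log % host_page) == 0);

    first = (first + host_page - 1) / host_page * host_page; /* round up   */
    end = end / host_page * host_page;                       /* round down */
    if (end <= first) {
        return 0; /* hole is too small to free a whole host page */
    }
    return madvise(log + first, (size_t)(end - first), MADV_DONTNEED);
}
```

With these assumptions a log sized by a last GPA of 1T takes 32 MiB,
versus 32 KiB for 1G of RAM, so most of it could stay uncommitted.
"Unpunching" needs no explicit call in this sketch: a private anonymous
range dropped with MADV_DONTNEED is simply zero-filled again on the next
access.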

BTW, is it possible for a guest to force the vhost module to access a
NORESERVE area, and what would happen in that case?


> 
> That can be a natural continuation of this series, but
> I don't think it needs to block it.
> 
> > 
> > > 
> > > 
> > > > That's one of the reasons that in this patch HVA ranges in
> > > > memory map are compacted only for backend consumption,
> > > > QEMU's side of vhost uses exact map for internal purposes.
> > > > And the other reason is I don't know vhost enough to rewrite it
> > > > to use big HVA for everything.
> > > > 
> > > > > I guess we could create dummy MRs to fill in the holes left by
> > > > > memory hotplug?
> > > > it looks like a nice thing from vhost's POV but complicates the other side,
> > > 
> > > What other side do you have in mind?
> > > 
> > > > hence I dislike the idea of inventing dummy MRs for vhost's convenience.
> > memory core, but let's see what Paolo thinks about it.
> > 
> > > > 
> > > > 
> > > > > vhost already has logic to recombine
> > > > > consecutive chunks created by memory core.
> > > > which looks a bit complicated and I was thinking about simplifying
> > > > it some time in the future.
> > > 
>