* QEMU bumping memory bug analysis
@ 2015-06-05 16:43 Wei Liu
  2015-06-05 16:58 ` Ian Campbell
  2015-06-05 17:10 ` Stefano Stabellini
  0 siblings, 2 replies; 38+ messages in thread
From: Wei Liu @ 2015-06-05 16:43 UTC (permalink / raw)
  To: xen-devel
  Cc: wei.liu2, Ian Campbell, Stefano Stabellini, George Dunlap,
	Andrew Cooper, Ian Jackson, dslutz

Hi all

This bug is now considered a blocker for 4.6 release.

The premises of the problem remain the same (George's translated
version):

1. QEMU may need extra pages from Xen to implement option ROMS, and so at
   the moment it calls set_max_mem() to increase max_pages so that it can
   allocate more pages to the guest.  libxl doesn't know what max_pages a
   domain needs prior to qemu start-up.

2. Libxl doesn't know max_pages even after qemu start-up, because there
   is no mechanism to communicate between qemu and libxl.

3. QEMU calls xc_domain_setmaxmem to increase max_pages by N pages.
   Those pages are only accounted for in the hypervisor.  Libxl
   (currently) does not extract that value from the hypervisor.
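
(For illustration only, a minimal sketch of the pattern described in
premise 3, assuming libxenctrl's xc_domain_getinfo()/xc_domain_setmaxmem().
This is not QEMU's actual code, just the shape of the read/modify/write
it performs:)

    #include <stdint.h>
    #include <xenctrl.h>

    /* Sketch of a device model bumping max_pages for option ROM space.
     * Only Xen accounts for the change; libxl never hears about it, and
     * the get/set pair below is inherently racy. */
    static int bump_maxmem(xc_interface *xch, uint32_t domid,
                           uint64_t extra_kb)
    {
        xc_dominfo_t info;

        /* Read the current limit (reported in KiB) from the hypervisor. */
        if (xc_domain_getinfo(xch, domid, 1, &info) != 1 ||
            info.domid != domid)
            return -1;

        /* Write back the enlarged limit. */
        return xc_domain_setmaxmem(xch, domid, info.max_memkb + extra_kb);
    }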

Several solutions were proposed:

1. Add a new record type to the libxc migration stream and call setmaxmem
   in the middle of the xc migration stream.

The main objections are that calling xc_domain_setmaxmem in the middle of
the xc migration stream is a layering violation, and that it prevents us
from disaggregating domain construction to a less privileged domain.

2. Use the libxl toolstack save/restore blob to transmit max pages
   information to the remote end.

This is considered a bodge and has been proven not to work, because the
toolstack blob restore happens after xc_domain_restore.

3. Add a libxl layer that wraps the necessary information, taking over
   Andrew's work on libxl migration v2.  Having a libxl layer that's not
   part of migration v2 would be a waste of effort.

There are several obstacles for libxl migration v2 at the moment. The
libxl layer in migration v2 still has unresolved issues, and it has
inter-dependencies with Remus / COLO.

Most importantly it doesn't inherently solve the problem. It still
requires the current libxl JSON blob to contain information about max
pages (or information used to derive max pages).

Andrew, correct me if I'm wrong.

4. Add a non-user-configurable field in the current libxl JSON structure
   to record max pages information.

This is not desirable. All fields in the libxl JSON should be
user-configurable.

5. Add a user-configurable field in the current libxl JSON structure to
   record how much more memory this domain needs. The admin is required
   to fill in that value manually. In the meantime we revert the change
   in QEMU and declare QEMU with that change buggy.

There has been no response to this so far, but in fact I consider this
the most viable solution.

It's a simple enough solution that is achievable within the 4.6 time
frame. It doesn't prevent us from doing useful work in the future
(a disaggregated architecture with a stricter security policy). It provides
a way to work around buggy QEMU (the admin sets that value to prevent QEMU
from bumping the memory limit). It's orthogonal to migration v2, which
means it won't be blocked by migration v2 or block migration v2.

I tend to go with solution 5. Speak up if you don't agree with my
analysis or you think I have missed some aspects.

For the long term we need to:

1. Establish libxl as the arbitrator of how many pages a domain can have.
   Anything else doing things behind the arbitrator's back is considered
   buggy. This principle probably applies to other aspects of a domain as
   well.

2. Work out a solution for communication between QEMU and libxl. This
   can be expanded to cover other components in a Xen setup, but currently
   we only have QEMU.

Wei.


* Re: QEMU bumping memory bug analysis
  2015-06-05 16:43 QEMU bumping memory bug analysis Wei Liu
@ 2015-06-05 16:58 ` Ian Campbell
  2015-06-05 17:13   ` Stefano Stabellini
                     ` (2 more replies)
  2015-06-05 17:10 ` Stefano Stabellini
  1 sibling, 3 replies; 38+ messages in thread
From: Ian Campbell @ 2015-06-05 16:58 UTC (permalink / raw)
  To: Wei Liu
  Cc: Stefano Stabellini, George Dunlap, Andrew Cooper, Ian Jackson,
	dslutz, xen-devel

On Fri, 2015-06-05 at 17:43 +0100, Wei Liu wrote:

> 3. Add a libxl layer that wraps necessary information, take over
>    Andrew's work on libxl migration v2.  Having a libxl layer that's not
>    part of migration v2 is a waste of effort.
> 
> There are several obstacles for libxl migration v2 at the moment. Libxl
> layer in migration v2 still has unresolved issues. It has
> inter-dependency with Remus / COLO.
> 
> Most importantly it doesn't inherently solve the problem. It still
> requires the current libxl JSON blob to contain information about max
> pages

It doesn't require that; the whole point of the libxl layer is to
provide a suitable home for that information which is neither the current
libxl JSON blob (which is user-facing cfg data) nor the libxc stream
(which is opaque to libxl).

Once you have the general concept of the libxl layer, adding a new field
to it will be trivial (because it will have been designed to be
trivially extendable).

>  (or information used to derive max pages).
> 
> Andrew, correct me if I'm wrong.
> 
> 4. Add a none user configurable field in current libxl JSON structure to
>    record max pages information.
> 
> This is not desirable. All fields in libxl JSON should be user
> configurable.
> 
> 5. Add a user configurable field in current libxl JSON structure to
>    record how much more memory this domain needs. Admin is required to
>    fill in that value manually. In the mean time we revert the change in
>    QEMU and declare QEMU with that change buggy.
> 
> No response to this so far. But in fact I consider this the most viable
> solution.

I initially thought that this was just #4 in a silly hat and was
therefore no more acceptable than that.

But actually I think you are suggesting that users should have to
manually request additional RAM for option roms via some new interface
and that the old thing in qemu should be deprecated and removed?

How would a user know what value to use here? Just "a bigger one till it
works"? That's, well, not super...

Ian.


* Re: QEMU bumping memory bug analysis
  2015-06-05 16:43 QEMU bumping memory bug analysis Wei Liu
  2015-06-05 16:58 ` Ian Campbell
@ 2015-06-05 17:10 ` Stefano Stabellini
  2015-06-05 18:10   ` Wei Liu
                     ` (2 more replies)
  1 sibling, 3 replies; 38+ messages in thread
From: Stefano Stabellini @ 2015-06-05 17:10 UTC (permalink / raw)
  To: Wei Liu
  Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper,
	Ian Jackson, dslutz, xen-devel

On Fri, 5 Jun 2015, Wei Liu wrote:
> Hi all
> 
> This bug is now considered a blocker for 4.6 release.
> 
> The premises of the problem remain the same (George's translated
> version):
> 
> 1. QEMU may need extra pages from Xen to implement option ROMS, and so at
>    the moment it calls set_max_mem() to increase max_pages so that it can
>    allocate more pages to the guest.  libxl doesn't know what max_pages a
>    domain needs prior to qemu start-up.
> 
> 2. Libxl doesn't know max_pages even after qemu start-up, because there
>    is no mechanism to communicate between qemu and libxl.

I might not know what is the right design for the overall solution, but
I do know that libxl shouldn't have its own state tracking for
max_pages, because max_pages is kept, maintained and enforced by Xen.

Ian might still remember, but at the beginning of the xl/libxl project
we had a few simple design principles, one of which was that we should
not have two places where we keep track of the same thing. If Xen keeps
track of something, libxl should avoid it.

In this specific case, libxl can ask Xen at any time what max_pages is
for the domain, so I don't think that libxl should store it or have its
own tracking for it.

Even if QEMU called into libxl to change max_pages, I don't think that
libxl should store max_pages anywhere. It is already stored in Xen and
can be retrieved at any time.
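
As a minimal sketch of that "ask Xen on demand" approach (assuming the
existing libxl_domain_info(), which reports the Xen-enforced limit as
max_memkb; error handling abbreviated):

    #include <stdint.h>
    #include <libxl.h>

    /* Query the current limit from Xen each time it is needed, rather
     * than caching a copy in libxl. */
    static int query_maxmem_kb(libxl_ctx *ctx, uint32_t domid,
                               uint64_t *max_kb)
    {
        libxl_dominfo info;
        int rc;

        libxl_dominfo_init(&info);
        rc = libxl_domain_info(ctx, &info, domid);
        if (!rc)
            *max_kb = info.max_memkb;
        libxl_dominfo_dispose(&info);
        return rc;
    }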


> 3. QEMU calls xc_domain_setmaxmem to increase max_pages by N pages.
>    Those pages are only accounted for in the hypervisor.  Libxl
>    (currently) does not extract that value from the hypervisor.
> 
> Several solutions were proposed:
> 
> 1. Add a new record type in libxc migration stream and call setmaxmem
>    in the middle of xc migration stream.
> 
> Main objections are calling xc_domain_setmaxmem in the middle of xc
> migration stream is layer violation. Also this prevents us from
> disaggregating domain construction to a less privileged domain.

It seems to me that max_pages is one of the memory properties of the
domain, so it should be saved and restored together with the rest of
memory.


> 2. Use libxl toolstack save restore blob to tranmit max pages
>    information to remote end.
> 
> This is considered a bodge and has been proven not to work because
> toolstack blob restore happens after xc_domain_restore.

Saving and restoring max_pages in libxl seems to me like a layering
violation. I would avoid 2. 3. 4. and 5.


> 3. Add a libxl layer that wraps necessary information, take over
>    Andrew's work on libxl migration v2.  Having a libxl layer that's not
>    part of migration v2 is a waste of effort.
> 
> There are several obstacles for libxl migration v2 at the moment. Libxl
> layer in migration v2 still has unresolved issues. It has
> inter-dependency with Remus / COLO.
> 
> Most importantly it doesn't inherently solve the problem. It still
> requires the current libxl JSON blob to contain information about max
> pages (or information used to derive max pages).
> 
> Andrew, correct me if I'm wrong.
> 
> 4. Add a none user configurable field in current libxl JSON structure to
>    record max pages information.
> 
> This is not desirable. All fields in libxl JSON should be user
> configurable.
> 
> 5. Add a user configurable field in current libxl JSON structure to
>    record how much more memory this domain needs. Admin is required to
>    fill in that value manually. In the mean time we revert the change in
>    QEMU and declare QEMU with that change buggy.

QEMU 2.3.0 was released with that change in it, so it is not quite
possible to revert it. Also I think it is the right change for QEMU.


> No response to this so far. But in fact I consider this the most viable
> solution.
> 
> It's a simple enough solution that is achievable within 4.6 time frame.
> It doesn't prevent us from doing useful work in the future
> (disaggregated architecture with stricter security policy). It provides
> a way to work around buggy QEMU (admin sets that value to prevent QEMU
> from bumping memory limit). It's orgthogonal to migration v2 which means
> it won't be blocked by migration v2 or block migration v2.
> 
> I tend to go with solution 5. Speak up if you don't agree with my
> analysis or you think I miss some aspects.

I disagree


> For long term we need to:
> 
> 1. Establish libxl as the arbitrator how much pages a domain can have.
>    Anything else doing stuff behind arbitrator's back is considered
>    buggy. This principle probably apply to other aspects of a domain as
>    well.

I disagree that libxl should be the arbitrator of a property that is
stored, maintained and enforced by Xen. Xen should be the arbitrator.


> 2. Work out a solution communicate between QEMU and libxl. This can be
>    expanded to cover other components in a Xen setup, but currently we
>    only have QEMU.

Even if QEMU called into libxl to change maxmem, I don't think that
libxl should store maxmem anywhere. It is already stored in Xen.


* Re: QEMU bumping memory bug analysis
  2015-06-05 16:58 ` Ian Campbell
@ 2015-06-05 17:13   ` Stefano Stabellini
  2015-06-05 19:06     ` Wei Liu
  2015-06-05 17:17   ` Andrew Cooper
  2015-06-05 17:39   ` Wei Liu
  2 siblings, 1 reply; 38+ messages in thread
From: Stefano Stabellini @ 2015-06-05 17:13 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, Stefano Stabellini, George Dunlap, Andrew Cooper,
	Ian Jackson, dslutz, xen-devel

On Fri, 5 Jun 2015, Ian Campbell wrote:
> On Fri, 2015-06-05 at 17:43 +0100, Wei Liu wrote:
> 
> > 3. Add a libxl layer that wraps necessary information, take over
> >    Andrew's work on libxl migration v2.  Having a libxl layer that's not
> >    part of migration v2 is a waste of effort.
> > 
> > There are several obstacles for libxl migration v2 at the moment. Libxl
> > layer in migration v2 still has unresolved issues. It has
> > inter-dependency with Remus / COLO.
> > 
> > Most importantly it doesn't inherently solve the problem. It still
> > requires the current libxl JSON blob to contain information about max
> > pages
> 
> It doesn't require that, the whole point of the libxl layer is to
> provide a suitable home for that information which is not the current
> libxl json blob (which is user facing cfg data) or the libxc stream
> (which is opaque to libxl).
> 
> Once you have the general concept of the libxl layer, adding a new field
> to it will be trivial (because it will have been designed to be
> trivially extendable).
> 
> >  (or information used to derive max pages).
> > 
> > Andrew, correct me if I'm wrong.
> > 
> > 4. Add a none user configurable field in current libxl JSON structure to
> >    record max pages information.
> > 
> > This is not desirable. All fields in libxl JSON should be user
> > configurable.
> > 
> > 5. Add a user configurable field in current libxl JSON structure to
> >    record how much more memory this domain needs. Admin is required to
> >    fill in that value manually. In the mean time we revert the change in
> >    QEMU and declare QEMU with that change buggy.
> > 
> > No response to this so far. But in fact I consider this the most viable
> > solution.
> 
> I initially thought that this was just #4 in a silly hat and was
> therefore no more acceptable than that.
> 
> But actually I think you are suggesting that users should have to
> manually request additional RAM for option roms via some new interface
> and that the old thing in qemu should be deprecated and removed?
> 
> How would a user know what value to use here? Just "a bigger one till it
> works"? That's, well, not super...

This should not be a user-configurable field. In fact it only depends on
the QEMU version in use.


* Re: QEMU bumping memory bug analysis
  2015-06-05 16:58 ` Ian Campbell
  2015-06-05 17:13   ` Stefano Stabellini
@ 2015-06-05 17:17   ` Andrew Cooper
  2015-06-05 17:39   ` Wei Liu
  2 siblings, 0 replies; 38+ messages in thread
From: Andrew Cooper @ 2015-06-05 17:17 UTC (permalink / raw)
  To: Ian Campbell, Wei Liu
  Cc: George Dunlap, Stefano Stabellini, Ian Jackson, dslutz, xen-devel

On 05/06/15 17:58, Ian Campbell wrote:
> On Fri, 2015-06-05 at 17:43 +0100, Wei Liu wrote:
>
>> 3. Add a libxl layer that wraps necessary information, take over
>>    Andrew's work on libxl migration v2.  Having a libxl layer that's not
>>    part of migration v2 is a waste of effort.
>>
>> There are several obstacles for libxl migration v2 at the moment. Libxl
>> layer in migration v2 still has unresolved issues. It has
>> inter-dependency with Remus / COLO.
>>
>> Most importantly it doesn't inherently solve the problem. It still
>> requires the current libxl JSON blob to contain information about max
>> pages
> It doesn't require that, the whole point of the libxl layer is to
> provide a suitable home for that information which is not the current
> libxl json blob (which is user facing cfg data) or the libxc stream
> (which is opaque to libxl).
>
> Once you have the general concept of the libxl layer, adding a new field
> to it will be trivial (because it will have been designed to be
> trivially extendable).

Oh - I had not realised your intention of having some data blob in the
libxl stream which is not the JSON configuration.

Conceptually that would work, although this new information would still
have to be at the head of the stream, i.e. at the head of the libxc blob.

~Andrew


* Re: QEMU bumping memory bug analysis
  2015-06-05 16:58 ` Ian Campbell
  2015-06-05 17:13   ` Stefano Stabellini
  2015-06-05 17:17   ` Andrew Cooper
@ 2015-06-05 17:39   ` Wei Liu
  2 siblings, 0 replies; 38+ messages in thread
From: Wei Liu @ 2015-06-05 17:39 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, Stefano Stabellini, George Dunlap, Andrew Cooper,
	Ian Jackson, dslutz, xen-devel

On Fri, Jun 05, 2015 at 05:58:11PM +0100, Ian Campbell wrote:
> On Fri, 2015-06-05 at 17:43 +0100, Wei Liu wrote:
> 
> > 3. Add a libxl layer that wraps necessary information, take over
> >    Andrew's work on libxl migration v2.  Having a libxl layer that's not
> >    part of migration v2 is a waste of effort.
> > 
> > There are several obstacles for libxl migration v2 at the moment. Libxl
> > layer in migration v2 still has unresolved issues. It has
> > inter-dependency with Remus / COLO.
> > 
> > Most importantly it doesn't inherently solve the problem. It still
> > requires the current libxl JSON blob to contain information about max
> > pages
> 
> It doesn't require that, the whole point of the libxl layer is to
> provide a suitable home for that information which is not the current
> libxl json blob (which is user facing cfg data) or the libxc stream
> (which is opaque to libxl).
> 

Right, it doesn't have to be in the user-facing blob. I was wrong on
that one.

> Once you have the general concept of the libxl layer, adding a new field
> to it will be trivial (because it will have been designed to be
> trivially extendable).
> 

In light of Andrew's reply, when we talked about JSON we were referring
to subtly different things. This libxl layer might be a viable solution;
I need to check it again.

The concern about not being able to make it in time for 4.6 remains.

> >  (or information used to derive max pages).
> > 
> > Andrew, correct me if I'm wrong.
> > 
> > 4. Add a none user configurable field in current libxl JSON structure to
> >    record max pages information.
> > 
> > This is not desirable. All fields in libxl JSON should be user
> > configurable.
> > 
> > 5. Add a user configurable field in current libxl JSON structure to
> >    record how much more memory this domain needs. Admin is required to
> >    fill in that value manually. In the mean time we revert the change in
> >    QEMU and declare QEMU with that change buggy.
> > 
> > No response to this so far. But in fact I consider this the most viable
> > solution.
> 
> I initially thought that this was just #4 in a silly hat and was
> therefore no more acceptable than that.
> 
> But actually I think you are suggesting that users should have to
> manually request additional RAM for option roms via some new interface
> and that the old thing in qemu should be deprecated and removed?
> 

Yes.

I'm considering removing the need for xc_domain_setmaxmem regardless of
this bug, because it's going to cause problems for QEMU upstream stubdom
with a strict XSM policy and a deprivileged QEMU (which may not have the
privilege to call setmaxmem).

The security implication as it is now is big enough. One malicious guest
that controls QEMU has a vector to DoS the hypervisor by setting its own
max_pages to -1.
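
(A small sketch of that vector, as hypothetical attacker code, assuming
the device model keeps the privilege to call setmaxmem on the domain it
serves:)

    #include <xenctrl.h>

    /* From a compromised QEMU: lift our own domain's limit. Passing -1
     * converts to a huge unsigned value, i.e. effectively "no limit",
     * after which the guest can keep allocating until the host is
     * exhausted. */
    static void lift_own_limit(xc_interface *xch, uint32_t own_domid)
    {
        xc_domain_setmaxmem(xch, own_domid, -1);
    }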

> How would a user know what value to use here? Just "a bigger one till it
> works"? That's, well, not super...
> 

Not very good, but it should work. A bit of trial and error will give
you an acceptable value. And this is easily superseded by any other more
advanced solution.

This is going to be our last resort if Andrew's work is not a viable
solution within the 4.6 time frame.

Wei.

> Ian.


* Re: QEMU bumping memory bug analysis
  2015-06-05 17:10 ` Stefano Stabellini
@ 2015-06-05 18:10   ` Wei Liu
  2015-06-08 11:39     ` Stefano Stabellini
  2015-06-05 18:49   ` Ian Campbell
  2015-06-08 14:14   ` George Dunlap
  2 siblings, 1 reply; 38+ messages in thread
From: Wei Liu @ 2015-06-05 18:10 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Wei Liu, Ian Campbell, George Dunlap, Andrew Cooper, Ian Jackson,
	dslutz, xen-devel

On Fri, Jun 05, 2015 at 06:10:17PM +0100, Stefano Stabellini wrote:
> On Fri, 5 Jun 2015, Wei Liu wrote:
> > Hi all
> > 
> > This bug is now considered a blocker for 4.6 release.
> > 
> > The premises of the problem remain the same (George's translated
> > version):
> > 
> > 1. QEMU may need extra pages from Xen to implement option ROMS, and so at
> >    the moment it calls set_max_mem() to increase max_pages so that it can
> >    allocate more pages to the guest.  libxl doesn't know what max_pages a
> >    domain needs prior to qemu start-up.
> > 
> > 2. Libxl doesn't know max_pages even after qemu start-up, because there
> >    is no mechanism to communicate between qemu and libxl.
> 
> I might not know what is the right design for the overall solution, but
> I do know that libxl shouldn't have its own state tracking for
> max_pages, because max_pages is kept, maintained and enforced by Xen.
> 
> Ian might still remember, but at the beginning of the xl/libxl project,
> we had few simple design principles. One of which was that we should not
> have two places where we keep track of the same thing. If Xen keeps
> track of something, libxl should avoid it.
> 
> In this specific case, libxl can ask Xen at any time what max_pages is
> for the domain, so I don't think that libxl should store it or have its
> own tracking for it.
> 
> Even if QEMU called into libxl to change max_pages, I don't think that
> libxl should store max_pages anywhere. It is already stored in Xen and
> can be retrieved at any time.
> 

I think you're talking about keeping track of that record permanently. I
only care about getting the right value at the right time and transferring
it to the other end. Getting the value whenever needed is OK.

> 
> > 3. QEMU calls xc_domain_setmaxmem to increase max_pages by N pages.
> >    Those pages are only accounted for in the hypervisor.  Libxl
> >    (currently) does not extract that value from the hypervisor.
> > 
> > Several solutions were proposed:
> > 
> > 1. Add a new record type in libxc migration stream and call setmaxmem
> >    in the middle of xc migration stream.
> > 
> > Main objections are calling xc_domain_setmaxmem in the middle of xc
> > migration stream is layer violation. Also this prevents us from
> > disaggregating domain construction to a less privileged domain.
> 
> It seems to me that max_pages is one of the memory properties of the
> domain, so it should be saved and restored together with the rest of
> memory.
> 

#1 was actually referring to Don's patch. When processing a libxc record
in the middle of the stream, we should not alter the size of memory.
It's not safe because we don't know whether that record comes early
enough, before we exceed the limit.

The only safe way of doing it is to mandate that specific record at
the beginning of the libxc stream, which might have other implications.
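
(For illustration, a hypothetical layout for such a record; the
REC_TYPE value and body are invented, and only the 32-bit type/length
framing follows the migration v2 record convention:)

    #include <stdint.h>

    #define REC_TYPE_MAX_PAGES 0x7fff0001u  /* hypothetical, unallocated */

    /* Would have to appear at the head of the libxc stream so the
     * restore side can raise max_pages before any page data arrives. */
    struct rec_max_pages {
        uint32_t type;       /* REC_TYPE_MAX_PAGES */
        uint32_t length;     /* sizeof(uint64_t) */
        uint64_t max_pages;  /* limit to apply before PAGE_DATA records */
    };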

> 
> > 2. Use libxl toolstack save restore blob to tranmit max pages
> >    information to remote end.
> > 
> > This is considered a bodge and has been proven not to work because
> > toolstack blob restore happens after xc_domain_restore.
> 
> Saving and restoring max_pages in libxl seems to me like a layering
> violation. I would avoid 2. 3. 4. and 5.
> 

No, not really, if we follow the principle of "libxl is the
arbitrator": it needs to have thorough information on every aspect of
the domain and set up limits.

> 
> > 3. Add a libxl layer that wraps necessary information, take over
> >    Andrew's work on libxl migration v2.  Having a libxl layer that's not
> >    part of migration v2 is a waste of effort.
> > 
> > There are several obstacles for libxl migration v2 at the moment. Libxl
> > layer in migration v2 still has unresolved issues. It has
> > inter-dependency with Remus / COLO.
> > 
> > Most importantly it doesn't inherently solve the problem. It still
> > requires the current libxl JSON blob to contain information about max
> > pages (or information used to derive max pages).
> > 
> > Andrew, correct me if I'm wrong.
> > 
> > 4. Add a none user configurable field in current libxl JSON structure to
> >    record max pages information.
> > 
> > This is not desirable. All fields in libxl JSON should be user
> > configurable.
> > 
> > 5. Add a user configurable field in current libxl JSON structure to
> >    record how much more memory this domain needs. Admin is required to
> >    fill in that value manually. In the mean time we revert the change in
> >    QEMU and declare QEMU with that change buggy.
> 
> QEMU 2.3.0 was released with that change in it, so it is not quite
> possible to revert it. Also I think it is the right change for QEMU.
> 

It has security implications. Here is my reply copied from my mail to
Ian:

I'm considering removing the need for xc_domain_setmaxmem regardless of
this bug, because it's going to cause problems for QEMU upstream stubdom
with a strict XSM policy and a deprivileged QEMU (which may not have the
privilege to call setmaxmem).

The security implication as it is now is big enough. One malicious guest
that controls QEMU has a vector to DoS the hypervisor by setting its own
max_pages to -1.

> 
> > No response to this so far. But in fact I consider this the most viable
> > solution.
> > 
> > It's a simple enough solution that is achievable within 4.6 time frame.
> > It doesn't prevent us from doing useful work in the future
> > (disaggregated architecture with stricter security policy). It provides
> > a way to work around buggy QEMU (admin sets that value to prevent QEMU
> > from bumping memory limit). It's orgthogonal to migration v2 which means
> > it won't be blocked by migration v2 or block migration v2.
> > 
> > I tend to go with solution 5. Speak up if you don't agree with my
> > analysis or you think I miss some aspects.
> 
> I disagree
> 
> 
> > For long term we need to:
> > 
> > 1. Establish libxl as the arbitrator how much pages a domain can have.
> >    Anything else doing stuff behind arbitrator's back is considered
> >    buggy. This principle probably apply to other aspects of a domain as
> >    well.
> 
> I disagree that libxl should be the arbitrator of a property that is
> stored, maintained and enforced by Xen. Xen should be the arbitrator.
> 

But it would be a burden for Xen once the arbitration goes beyond the
issue we discuss here. It certainly doesn't have as much information as
libxl has. In the long run we would still end up having libxl do most
of the work, so we might as well establish the principle now.

Wei.

> 
> > 2. Work out a solution communicate between QEMU and libxl. This can be
> >    expanded to cover other components in a Xen setup, but currently we
> >    only have QEMU.
> 
> Even if QEMU called into libxl to change maxmem, I don't think that
> libxl should store maxmem anywhere. It is already stored in Xen.


* Re: QEMU bumping memory bug analysis
  2015-06-05 17:10 ` Stefano Stabellini
  2015-06-05 18:10   ` Wei Liu
@ 2015-06-05 18:49   ` Ian Campbell
  2015-06-08 11:40     ` Stefano Stabellini
  2015-06-08 14:14   ` George Dunlap
  2 siblings, 1 reply; 38+ messages in thread
From: Ian Campbell @ 2015-06-05 18:49 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson, dslutz,
	xen-devel

On Fri, 2015-06-05 at 18:10 +0100, Stefano Stabellini wrote:
> On Fri, 5 Jun 2015, Wei Liu wrote:
> > Hi all
> > 
> > This bug is now considered a blocker for 4.6 release.
> > 
> > The premises of the problem remain the same (George's translated
> > version):
> > 
> > 1. QEMU may need extra pages from Xen to implement option ROMS, and so at
> >    the moment it calls set_max_mem() to increase max_pages so that it can
> >    allocate more pages to the guest.  libxl doesn't know what max_pages a
> >    domain needs prior to qemu start-up.
> > 
> > 2. Libxl doesn't know max_pages even after qemu start-up, because there
> >    is no mechanism to communicate between qemu and libxl.
> 
> I might not know what is the right design for the overall solution, but
> I do know that libxl shouldn't have its own state tracking for
> max_pages, because max_pages is kept, maintained and enforced by Xen.
> 
> Ian might still remember, but at the beginning of the xl/libxl project,
> we had few simple design principles. One of which was that we should not
> have two places where we keep track of the same thing. If Xen keeps
> track of something, libxl should avoid it.

This isn't about libxl tracking something duplicating information in
Xen. It is about who gets to choose what that value is, which is not the
same as who stores that value.

So this is about libxl being the owner of what the current maxmem value
is. It can do this by using setmaxmem and getmaxmem to set and retrieve
the value with no state in libxl.

> I disagree that libxl should be the arbitrator of a property that is
> stored, maintained and enforced by Xen. Xen should be the arbitrator.

That's not what "arbitrator" means: an arbitrator decides what the value
should be, but that doesn't necessarily imply that it stores, maintains
or enforces that value.

> Even if QEMU called into libxl to change maxmem, I don't think that
> libxl should store maxmem anywhere. It is already stored in Xen.

I don't think anyone suggested otherwise, did they?

What locking is there around QEMU's read/modify/write of the maxmem
value? What happens if someone else also modifies maxmem at the same
time?

Ian.


* Re: QEMU bumping memory bug analysis
  2015-06-05 17:13   ` Stefano Stabellini
@ 2015-06-05 19:06     ` Wei Liu
  0 siblings, 0 replies; 38+ messages in thread
From: Wei Liu @ 2015-06-05 19:06 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Wei Liu, Ian Campbell, George Dunlap, Andrew Cooper, Ian Jackson,
	dslutz, xen-devel

On Fri, Jun 05, 2015 at 06:13:01PM +0100, Stefano Stabellini wrote:
> On Fri, 5 Jun 2015, Ian Campbell wrote:
> > On Fri, 2015-06-05 at 17:43 +0100, Wei Liu wrote:
> > 
> > > 3. Add a libxl layer that wraps necessary information, take over
> > >    Andrew's work on libxl migration v2.  Having a libxl layer that's not
> > >    part of migration v2 is a waste of effort.
> > > 
> > > There are several obstacles for libxl migration v2 at the moment. Libxl
> > > layer in migration v2 still has unresolved issues. It has
> > > inter-dependency with Remus / COLO.
> > > 
> > > Most importantly it doesn't inherently solve the problem. It still
> > > requires the current libxl JSON blob to contain information about max
> > > pages
> > 
> > It doesn't require that, the whole point of the libxl layer is to
> > provide a suitable home for that information which is not the current
> > libxl json blob (which is user facing cfg data) or the libxc stream
> > (which is opaque to libxl).
> > 
> > Once you have the general concept of the libxl layer, adding a new field
> > to it will be trivial (because it will have been designed to be
> > trivially extendable).
> > 
> > >  (or information used to derive max pages).
> > > 
> > > Andrew, correct me if I'm wrong.
> > > 
> > > 4. Add a none user configurable field in current libxl JSON structure to
> > >    record max pages information.
> > > 
> > > This is not desirable. All fields in libxl JSON should be user
> > > configurable.
> > > 
> > > 5. Add a user configurable field in current libxl JSON structure to
> > >    record how much more memory this domain needs. Admin is required to
> > >    fill in that value manually. In the mean time we revert the change in
> > >    QEMU and declare QEMU with that change buggy.
> > > 
> > > No response to this so far. But in fact I consider this the most viable
> > > solution.
> > 
> > I initially thought that this was just #4 in a silly hat and was
> > therefore no more acceptable than that.
> > 
> > But actually I think you are suggesting that users should have to
> > manually request additional RAM for option roms via some new interface
> > and that the old thing in qemu should be deprecated and removed?
> > 
> > How would a user know what value to use here? Just "a bigger one till it
> > works"? That's, well, not super...
> 
> This should not be a user configurable field. In fact it only depends on
> the QEMU version in use.

That field is generic, so we can use it to add some extra pages to the
domain. Using it to cover QEMU option ROMs would be one use case.  It's
not very nice, but it's straightforward.

Wei.


* Re: QEMU bumping memory bug analysis
  2015-06-05 18:10   ` Wei Liu
@ 2015-06-08 11:39     ` Stefano Stabellini
  2015-06-08 12:14       ` Andrew Cooper
  2015-06-08 13:10       ` Wei Liu
  0 siblings, 2 replies; 38+ messages in thread
From: Stefano Stabellini @ 2015-06-08 11:39 UTC (permalink / raw)
  To: Wei Liu
  Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper,
	Ian Jackson, dslutz, xen-devel

On Fri, 5 Jun 2015, Wei Liu wrote:
> On Fri, Jun 05, 2015 at 06:10:17PM +0100, Stefano Stabellini wrote:
> > On Fri, 5 Jun 2015, Wei Liu wrote:
> > > Hi all
> > > 
> > > This bug is now considered a blocker for 4.6 release.
> > > 
> > > The premises of the problem remain the same (George's translated
> > > version):
> > > 
> > > 1. QEMU may need extra pages from Xen to implement option ROMS, and so at
> > >    the moment it calls set_max_mem() to increase max_pages so that it can
> > >    allocate more pages to the guest.  libxl doesn't know what max_pages a
> > >    domain needs prior to qemu start-up.
> > > 
> > > 2. Libxl doesn't know max_pages even after qemu start-up, because there
> > >    is no mechanism to communicate between qemu and libxl.
> > 
> > I might not know what is the right design for the overall solution, but
> > I do know that libxl shouldn't have its own state tracking for
> > max_pages, because max_pages is kept, maintained and enforced by Xen.
> > 
> > Ian might still remember, but at the beginning of the xl/libxl project,
> > we had few simple design principles. One of which was that we should not
> > have two places where we keep track of the same thing. If Xen keeps
> > track of something, libxl should avoid it.
> > 
> > In this specific case, libxl can ask Xen at any time what max_pages is
> > for the domain, so I don't think that libxl should store it or have its
> > own tracking for it.
> > 
> > Even if QEMU called into libxl to change max_pages, I don't think that
> > libxl should store max_pages anywhere. It is already stored in Xen and
> > can be retrieved at any time.
> > 
> 
> I think you're talking about keep track of that record permanently. I
> only care about getting the right value at the right time and transfer
> it to the other end. Getting the value whenever needed is OK.
> 
> > 
> > > 3. QEMU calls xc_domain_setmaxmem to increase max_pages by N pages.
> > >    Those pages are only accounted for in the hypervisor.  Libxl
> > >    (currently) does not extract that value from the hypervisor.
> > > 
> > > Several solutions were proposed:
> > > 
> > > 1. Add a new record type in libxc migration stream and call setmaxmem
> > >    in the middle of xc migration stream.
> > > 
> > > Main objections are calling xc_domain_setmaxmem in the middle of xc
> > > migration stream is layer violation. Also this prevents us from
> > > disaggregating domain construction to a less privileged domain.
> > 
> > It seems to me that max_pages is one of the memory properties of the
> > domain, so it should be saved and restored together with the rest of
> > memory.
> > 
> 
> #1 was actually referring to Don's patch. When processing libxc record
> in the middle of the stream, we should not alter the size of memory.
> It's not safe because we don't know whether that record comes early
> enough before we exceed the limit.
> 
> The only safe way of doing it is to mandate that specific record at
> the beginning of libxc stream, which might have other implications.

I see, that makes sense


> > 
> > > 2. Use libxl toolstack save restore blob to tranmit max pages
> > >    information to remote end.
> > > 
> > > This is considered a bodge and has been proven not to work because
> > > toolstack blob restore happens after xc_domain_restore.
> > 
> > Saving and restoring max_pages in libxl seems to me like a layering
> > violation. I would avoid 2. 3. 4. and 5.
> > 
> 
> No, not really. If we follow the principle of "libxl is the
> arbitrator". It needs to have thorough information on every aspect of
> the domain and set up limit.

I am not sure I buy this "libxl is the arbitrator" concept. I am not
seeing libxl as adding much value in this context.

 
> > > 3. Add a libxl layer that wraps necessary information, take over
> > >    Andrew's work on libxl migration v2.  Having a libxl layer that's not
> > >    part of migration v2 is a waste of effort.
> > > 
> > > There are several obstacles for libxl migration v2 at the moment. Libxl
> > > layer in migration v2 still has unresolved issues. It has
> > > inter-dependency with Remus / COLO.
> > > 
> > > Most importantly it doesn't inherently solve the problem. It still
> > > requires the current libxl JSON blob to contain information about max
> > > pages (or information used to derive max pages).
> > > 
> > > Andrew, correct me if I'm wrong.
> > > 
> > > 4. Add a none user configurable field in current libxl JSON structure to
> > >    record max pages information.
> > > 
> > > This is not desirable. All fields in libxl JSON should be user
> > > configurable.
> > > 
> > > 5. Add a user configurable field in current libxl JSON structure to
> > >    record how much more memory this domain needs. Admin is required to
> > >    fill in that value manually. In the mean time we revert the change in
> > >    QEMU and declare QEMU with that change buggy.
> > 
> > QEMU 2.3.0 was released with that change in it, so it is not quite
> > possible to revert it. Also I think it is the right change for QEMU.
> > 
> 
> It has security implications. Here is my reply copied from my mail to
> Ian:
> 
> I'm considering removing xc_domain_setmaxmem needs regardless of this
> bug because that's going to cause problem in QEMU upstream stubdom with
> strict XSM policy and deprivileged QEMU (may not have privilege to call
> setmaxmem).

QEMU running in the stubdom should be able to set the maxmem for its
target domain, but not for the others.


> The security implication as it is now is big enough. One malicious guest
> that controls QEMU has a vector to DoS hypervisor by setting its own
> max_pages to -1;

Is that an issue in the implementation of XEN_DOMCTL_max_mem on the Xen
side?  Could you please give me a bit more info on this security issue?


> > 
> > > No response to this so far. But in fact I consider this the most viable
> > > solution.
> > > 
> > > It's a simple enough solution that is achievable within 4.6 time frame.
> > > It doesn't prevent us from doing useful work in the future
> > > (disaggregated architecture with stricter security policy). It provides
> > > a way to work around buggy QEMU (admin sets that value to prevent QEMU
> > > from bumping memory limit). It's orgthogonal to migration v2 which means
> > > it won't be blocked by migration v2 or block migration v2.
> > > 
> > > I tend to go with solution 5. Speak up if you don't agree with my
> > > analysis or you think I miss some aspects.
> > 
> > I disagree
> > 
> > 
> > > For long term we need to:
> > > 
> > > 1. Establish libxl as the arbitrator how much pages a domain can have.
> > >    Anything else doing stuff behind arbitrator's back is considered
> > >    buggy. This principle probably apply to other aspects of a domain as
> > >    well.
> > 
> > I disagree that libxl should be the arbitrator of a property that is
> > stored, maintained and enforced by Xen. Xen should be the arbitrator.
> > 
> 
> But it would be a burden for Xen once the arbitration goes beyond the
> issue we discuss here. It certainly doesn't have as much as information
> libxl has. In the long run we would still end up having libxl doing most
> of the work so we might as well establish the principle now.

I would like to see some concrete examples of how libxl is adding value
in this context instead of just adding layers of indirection.

 
> > 
> > > 2. Work out a solution communicate between QEMU and libxl. This can be
> > >    expanded to cover other components in a Xen setup, but currently we
> > >    only have QEMU.
> > 
> > Even if QEMU called into libxl to change maxmem, I don't think that
> > libxl should store maxmem anywhere. It is already stored in Xen.
> 


* Re: QEMU bumping memory bug analysis
  2015-06-05 18:49   ` Ian Campbell
@ 2015-06-08 11:40     ` Stefano Stabellini
  2015-06-08 12:11       ` Ian Campbell
  0 siblings, 1 reply; 38+ messages in thread
From: Stefano Stabellini @ 2015-06-08 11:40 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, Stefano Stabellini, George Dunlap, Andrew Cooper,
	Ian Jackson, dslutz, xen-devel

On Fri, 5 Jun 2015, Ian Campbell wrote:
> On Fri, 2015-06-05 at 18:10 +0100, Stefano Stabellini wrote:
> > On Fri, 5 Jun 2015, Wei Liu wrote:
> > > Hi all
> > > 
> > > This bug is now considered a blocker for 4.6 release.
> > > 
> > > The premises of the problem remain the same (George's translated
> > > version):
> > > 
> > > 1. QEMU may need extra pages from Xen to implement option ROMS, and so at
> > >    the moment it calls set_max_mem() to increase max_pages so that it can
> > >    allocate more pages to the guest.  libxl doesn't know what max_pages a
> > >    domain needs prior to qemu start-up.
> > > 
> > > 2. Libxl doesn't know max_pages even after qemu start-up, because there
> > >    is no mechanism to communicate between qemu and libxl.
> > 
> > I might not know what is the right design for the overall solution, but
> > I do know that libxl shouldn't have its own state tracking for
> > max_pages, because max_pages is kept, maintained and enforced by Xen.
> > 
> > Ian might still remember, but at the beginning of the xl/libxl project,
> > we had few simple design principles. One of which was that we should not
> > have two places where we keep track of the same thing. If Xen keeps
> > track of something, libxl should avoid it.
> 
> This isn't about libxl tracking something duplicating information in
> Xen. It is about who gets to choose what that value is, which is not the
> same as who stores that value.
> 
> So this is about libxl being the owner of what the current maxmem value
> is. It can so this by using setmaxmem and getmaxmem to set and retrieve
> the value with no state in libxl.
> 
> > I disagree that libxl should be the arbitrator of a property that is
> > stored, maintained and enforced by Xen. Xen should be the arbitrator.
> 
> That's not what "arbitrator" means, an arbitrator decides what the value
> should be, but that doesn't necessarily imply that it either stores,
> maintains nor enforces that value. 

The way you describe it, it looks like some kind of host-wide memory
policy manager, which also doesn't belong in xl/libxl, the same way
other memory management tools were never accepted into xl/libxl but kept
as separate daemons, like xenballoond or squeezed.

Let's step away from this specific issue for a second. If it is not a
host-wide policy manager but a per-VM layer on top of libxc, what value
is this indirection actually adding?


> > Even if QEMU called into libxl to change maxmem, I don't think that
> > libxl should store maxmem anywhere. It is already stored in Xen.
> 
> I don't think anyone suggested otherwise, did they?
> 
> What locking is there around QEMU's read/modify/write of the maxmem
> value? What happens if someone else also modifies maxmem at the same
> time?

It only happens at start of day, before the domain is unpaused. At the
time I couldn't come up with a scenario where it would be an issue,
unless the admin is purposely trying to shoot himself in the foot.


* Re: QEMU bumping memory bug analysis
  2015-06-08 11:40     ` Stefano Stabellini
@ 2015-06-08 12:11       ` Ian Campbell
  2015-06-08 13:22         ` Stefano Stabellini
  2015-06-08 14:53         ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 38+ messages in thread
From: Ian Campbell @ 2015-06-08 12:11 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson, dslutz,
	xen-devel

On Mon, 2015-06-08 at 12:40 +0100, Stefano Stabellini wrote:

> > > I disagree that libxl should be the arbitrator of a property that is
> > > stored, maintained and enforced by Xen. Xen should be the arbitrator.
> > 
> > That's not what "arbitrator" means, an arbitrator decides what the value
> > should be, but that doesn't necessarily imply that it either stores,
> > maintains nor enforces that value. 
> 
> The way you describe it, it looks like some kind of host wide memory
> policy manager,

Not necessarily.

We need to make a distinction between the entity which enforces
maxpages and the entity (or entities) which decide what maxpages should
be and tell Xen.

Xen is the enforcer. It is also therefore by necessity the holder of the
canonical value for the maxpages, since it has to be.

Xen cannot be the second: it simply doesn't have the necessary
information to make a decision about a domain's memory limit; it just
blindly does whatever it is told by the last thing to tell it something.

There are lots of ways the desired value of maxpages can be decided on,
given a bunch of entities all with some level of responsibility for the
maxpages of a given domain:

     1. One single entity which decides on the limit and tells Xen. I
        think this is the xapi model (not squeezed/balloond, they just
        ask xapi to do things AFAIK, so are a red-herring). The unwary
        might think "libxl" was in this category, but it isn't since it
        is in fact multiple instances of the library.
     2. A group of entities which cooperate as a distributed system such
        that any one of them can recalculate (from the shared
        state, or by inter-entity communication) the current desired
        value of max pages and propagate it to Xen. Prior to QEMU
        starting to play with the max pages the multiple libxl's on a
        system fell into this case, coordinating via xenstore
        transactions and libxl userdata.
     3. A group of entities which operate in isolation by only ever
        increasing or decreasing the max pages according to their own
        requirements, without reference to anyone else. When QEMU
        entered the fray, and with the various libxl fixes since, you
        might think we are implementing this model, but we aren't
        because the hypervisor interface is only "set", not
        "increment/decrement" and so there is a racy read/modify/write
        cycle in every entity now.
     4. Xen exposes multiple "maxpages" variables corresponding to the
        various entities which have a need to control this (or multiple
        per entity). Xen adds all of those up to come up with the actual
        maxpages.
     5. I thought there was one more but I've forgotten what it was.

We used to have #2; we now have a broken version of #3 (broken because
it is racy due to the Xen interface used).

Fixing #2 would be a case of adding a QMP command to allow libxl to ask
QEMU what its extra memory needs are at any given point (start of day,
post-hotplug event etc., or even on every memset).

Fixing #3 would be a case of adding a new hypercall interface to allow
the amount of memory to be changed by a delta instead of just set, and
having everyone use it. For migration you'd need to decide between
propagating the current value at stop-and-copy time and having everyone
migrate their requirements, vs having everyone re-register their
requirements on the far end.
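
(As a sketch of what that delta interface might look like, with a
hypothetical xc_domain_adjust_maxmem() -- no such libxc call exists
today -- emulated here by the existing get + set pair, which is exactly
the racy read/modify/write cycle an in-hypervisor, atomic version would
remove:)

    #include <stdint.h>
    #include <xenctrl.h>

    /* Hypothetical delta-based call: each entity only ever adds or
     * subtracts its own requirement. */
    static int xc_domain_adjust_maxmem(xc_interface *xch, uint32_t domid,
                                       int64_t delta_kb)
    {
        xc_dominfo_t info;

        if (xc_domain_getinfo(xch, domid, 1, &info) != 1 ||
            info.domid != domid)
            return -1;

        /* Racy stand-in: the real interface would apply the delta
         * atomically inside Xen. */
        return xc_domain_setmaxmem(xch, domid, info.max_memkb + delta_kb);
    }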

#4 is a new hypercall interface, which everyone would have to use. I
happen to think this is a pretty silly design.

> the same way as other memory management tools were never accepted into
> xl/libxl but kept to separate daemons, like xenballoond or squeezed.

I think squeezed and friends are a red herring here, since they are
either a client of one or more entities or, in the more distributed
ones, just one party among many.

> Let's step away from this specific issue for a second. If it is not an
> host way policy manager but a per-VM layer on top of libxc, what value
> is this indirection actually adding?

Given the raciness of libxc -- it is adding correctness.

> > What locking is there around QEMU's read/modify/write of the maxmem
> > value? What happens if someone else also modifies maxmem at the
> > same time?
> 
> It only happens at start of day before the domain is unpaused. At the
> time I couldn't come up with a scenario where it would be an issue,
> unless the admin is purposely trying to shut himself in the foot.

I don't think there is anything preventing a call to the toolstack to
set the amount of memory for a domain right after the domid exists.

In a system with something like balloond or squeezed I can well imagine
them doing something when a new domain was started, resulting in this
race condition being opened (even if that something is "ask libxl", not
"call xc_setmaxpages").

Ian.


* Re: QEMU bumping memory bug analysis
  2015-06-08 11:39     ` Stefano Stabellini
@ 2015-06-08 12:14       ` Andrew Cooper
  2015-06-08 13:01         ` Stefano Stabellini
  2015-06-08 13:10       ` Wei Liu
  1 sibling, 1 reply; 38+ messages in thread
From: Andrew Cooper @ 2015-06-08 12:14 UTC (permalink / raw)
  To: Stefano Stabellini, Wei Liu
  Cc: George Dunlap, Ian Jackson, Ian Campbell, dslutz, xen-devel

On 08/06/15 12:39, Stefano Stabellini wrote:
> On Fri, 5 Jun 2015, Wei Liu wrote:
>> On Fri, Jun 05, 2015 at 06:10:17PM +0100, Stefano Stabellini wrote:
>>> On Fri, 5 Jun 2015, Wei Liu wrote:
>>>> Hi all
>>>>
>>>> This bug is now considered a blocker for 4.6 release.
>>>>
>>>> The premises of the problem remain the same (George's translated
>>>> version):
>>>>
>>>> 1. QEMU may need extra pages from Xen to implement option ROMS, and so at
>>>>    the moment it calls set_max_mem() to increase max_pages so that it can
>>>>    allocate more pages to the guest.  libxl doesn't know what max_pages a
>>>>    domain needs prior to qemu start-up.
>>>>
>>>> 2. Libxl doesn't know max_pages even after qemu start-up, because there
>>>>    is no mechanism to communicate between qemu and libxl.
>>> I might not know what is the right design for the overall solution, but
>>> I do know that libxl shouldn't have its own state tracking for
>>> max_pages, because max_pages is kept, maintained and enforced by Xen.
>>>
>>> Ian might still remember, but at the beginning of the xl/libxl project,
>>> we had few simple design principles. One of which was that we should not
>>> have two places where we keep track of the same thing. If Xen keeps
>>> track of something, libxl should avoid it.
>>>
>>> In this specific case, libxl can ask Xen at any time what max_pages is
>>> for the domain, so I don't think that libxl should store it or have its
>>> own tracking for it.
>>>
>>> Even if QEMU called into libxl to change max_pages, I don't think that
>>> libxl should store max_pages anywhere. It is already stored in Xen and
>>> can be retrieved at any time.
>>>
>> I think you're talking about keep track of that record permanently. I
>> only care about getting the right value at the right time and transfer
>> it to the other end. Getting the value whenever needed is OK.
>>
>>>> 3. QEMU calls xc_domain_setmaxmem to increase max_pages by N pages.
>>>>    Those pages are only accounted for in the hypervisor.  Libxl
>>>>    (currently) does not extract that value from the hypervisor.
>>>>
>>>> Several solutions were proposed:
>>>>
>>>> 1. Add a new record type in libxc migration stream and call setmaxmem
>>>>    in the middle of xc migration stream.
>>>>
>>>> Main objections are calling xc_domain_setmaxmem in the middle of xc
>>>> migration stream is layer violation. Also this prevents us from
>>>> disaggregating domain construction to a less privileged domain.
>>> It seems to me that max_pages is one of the memory properties of the
>>> domain, so it should be saved and restored together with the rest of
>>> memory.
>>>
>> #1 was actually referring to Don's patch. When processing libxc record
>> in the middle of the stream, we should not alter the size of memory.
>> It's not safe because we don't know whether that record comes early
>> enough before we exceed the limit.
>>
>> The only safe way of doing it is to mandate that specific record at
>> the beginning of libxc stream, which might have other implications.
> I see, that makes sense
>
>
>>>> 2. Use libxl toolstack save restore blob to tranmit max pages
>>>>    information to remote end.
>>>>
>>>> This is considered a bodge and has been proven not to work because
>>>> toolstack blob restore happens after xc_domain_restore.
>>> Saving and restoring max_pages in libxl seems to me like a layering
>>> violation. I would avoid 2. 3. 4. and 5.
>>>
>> No, not really. If we follow the principle of "libxl is the
>> arbitrator". It needs to have thorough information on every aspect of
>> the domain and set up limit.
> I am not sure I buy this "libxl is the arbitrator" concept. I am not
> seeing libxl as adding much value in this context.
>
>  
>>>> 3. Add a libxl layer that wraps necessary information, take over
>>>>    Andrew's work on libxl migration v2.  Having a libxl layer that's not
>>>>    part of migration v2 is a waste of effort.
>>>>
>>>> There are several obstacles for libxl migration v2 at the moment. Libxl
>>>> layer in migration v2 still has unresolved issues. It has
>>>> inter-dependency with Remus / COLO.
>>>>
>>>> Most importantly it doesn't inherently solve the problem. It still
>>>> requires the current libxl JSON blob to contain information about max
>>>> pages (or information used to derive max pages).
>>>>
>>>> Andrew, correct me if I'm wrong.
>>>>
>>>> 4. Add a none user configurable field in current libxl JSON structure to
>>>>    record max pages information.
>>>>
>>>> This is not desirable. All fields in libxl JSON should be user
>>>> configurable.
>>>>
>>>> 5. Add a user configurable field in current libxl JSON structure to
>>>>    record how much more memory this domain needs. Admin is required to
>>>>    fill in that value manually. In the mean time we revert the change in
>>>>    QEMU and declare QEMU with that change buggy.
>>> QEMU 2.3.0 was released with that change in it, so it is not quite
>>> possible to revert it. Also I think it is the right change for QEMU.
>>>
>> It has security implications. Here is my reply copied from my mail to
>> Ian:
>>
>> I'm considering removing xc_domain_setmaxmem needs regardless of this
>> bug because that's going to cause problem in QEMU upstream stubdom with
>> strict XSM policy and deprivileged QEMU (may not have privilege to call
>> setmaxmem).
> QEMU running in the stubdom should be able to set the maxmem for its
> target domain, but not for the others.

At the moment, set_max_mem is the only method the toolstack has of
putting a hard upper bound on a domain's memory usage (along with a few
others like the shadow mem size and the PoD cache).

In a disaggregated case, no deprivileged entity should be able to play
with this limit.  Being able to do so renders the security moot, as a
compromised stubdom can force a host OOM.

~Andrew


* Re: QEMU bumping memory bug analysis
  2015-06-08 12:14       ` Andrew Cooper
@ 2015-06-08 13:01         ` Stefano Stabellini
  2015-06-08 13:33           ` Jan Beulich
  0 siblings, 1 reply; 38+ messages in thread
From: Stefano Stabellini @ 2015-06-08 13:01 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, George Dunlap,
	Ian Jackson, dslutz, xen-devel

On Mon, 8 Jun 2015, Andrew Cooper wrote:
> On 08/06/15 12:39, Stefano Stabellini wrote:
> > On Fri, 5 Jun 2015, Wei Liu wrote:
> >> On Fri, Jun 05, 2015 at 06:10:17PM +0100, Stefano Stabellini wrote:
> >>> On Fri, 5 Jun 2015, Wei Liu wrote:
> >>>> Hi all
> >>>>
> >>>> This bug is now considered a blocker for 4.6 release.
> >>>>
> >>>> The premises of the problem remain the same (George's translated
> >>>> version):
> >>>>
> >>>> 1. QEMU may need extra pages from Xen to implement option ROMS, and so at
> >>>>    the moment it calls set_max_mem() to increase max_pages so that it can
> >>>>    allocate more pages to the guest.  libxl doesn't know what max_pages a
> >>>>    domain needs prior to qemu start-up.
> >>>>
> >>>> 2. Libxl doesn't know max_pages even after qemu start-up, because there
> >>>>    is no mechanism to communicate between qemu and libxl.
> >>> I might not know what is the right design for the overall solution, but
> >>> I do know that libxl shouldn't have its own state tracking for
> >>> max_pages, because max_pages is kept, maintained and enforced by Xen.
> >>>
> >>> Ian might still remember, but at the beginning of the xl/libxl project,
> >>> we had few simple design principles. One of which was that we should not
> >>> have two places where we keep track of the same thing. If Xen keeps
> >>> track of something, libxl should avoid it.
> >>>
> >>> In this specific case, libxl can ask Xen at any time what max_pages is
> >>> for the domain, so I don't think that libxl should store it or have its
> >>> own tracking for it.
> >>>
> >>> Even if QEMU called into libxl to change max_pages, I don't think that
> >>> libxl should store max_pages anywhere. It is already stored in Xen and
> >>> can be retrieved at any time.
> >>>
> >> I think you're talking about keep track of that record permanently. I
> >> only care about getting the right value at the right time and transfer
> >> it to the other end. Getting the value whenever needed is OK.
> >>
> >>>> 3. QEMU calls xc_domain_setmaxmem to increase max_pages by N pages.
> >>>>    Those pages are only accounted for in the hypervisor.  Libxl
> >>>>    (currently) does not extract that value from the hypervisor.
> >>>>
> >>>> Several solutions were proposed:
> >>>>
> >>>> 1. Add a new record type in libxc migration stream and call setmaxmem
> >>>>    in the middle of xc migration stream.
> >>>>
> >>>> Main objections are calling xc_domain_setmaxmem in the middle of xc
> >>>> migration stream is layer violation. Also this prevents us from
> >>>> disaggregating domain construction to a less privileged domain.
> >>> It seems to me that max_pages is one of the memory properties of the
> >>> domain, so it should be saved and restored together with the rest of
> >>> memory.
> >>>
> >> #1 was actually referring to Don's patch. When processing libxc record
> >> in the middle of the stream, we should not alter the size of memory.
> >> It's not safe because we don't know whether that record comes early
> >> enough before we exceed the limit.
> >>
> >> The only safe way of doing it is to mandate that specific record at
> >> the beginning of libxc stream, which might have other implications.
> > I see, that makes sense
> >
> >
> >>>> 2. Use libxl toolstack save restore blob to tranmit max pages
> >>>>    information to remote end.
> >>>>
> >>>> This is considered a bodge and has been proven not to work because
> >>>> toolstack blob restore happens after xc_domain_restore.
> >>> Saving and restoring max_pages in libxl seems to me like a layering
> >>> violation. I would avoid 2. 3. 4. and 5.
> >>>
> >> No, not really. If we follow the principle of "libxl is the
> >> arbitrator". It needs to have thorough information on every aspect of
> >> the domain and set up limit.
> > I am not sure I buy this "libxl is the arbitrator" concept. I am not
> > seeing libxl as adding much value in this context.
> >
> >  
> >>>> 3. Add a libxl layer that wraps necessary information, take over
> >>>>    Andrew's work on libxl migration v2.  Having a libxl layer that's not
> >>>>    part of migration v2 is a waste of effort.
> >>>>
> >>>> There are several obstacles for libxl migration v2 at the moment. Libxl
> >>>> layer in migration v2 still has unresolved issues. It has
> >>>> inter-dependency with Remus / COLO.
> >>>>
> >>>> Most importantly it doesn't inherently solve the problem. It still
> >>>> requires the current libxl JSON blob to contain information about max
> >>>> pages (or information used to derive max pages).
> >>>>
> >>>> Andrew, correct me if I'm wrong.
> >>>>
> >>>> 4. Add a none user configurable field in current libxl JSON structure to
> >>>>    record max pages information.
> >>>>
> >>>> This is not desirable. All fields in libxl JSON should be user
> >>>> configurable.
> >>>>
> >>>> 5. Add a user configurable field in current libxl JSON structure to
> >>>>    record how much more memory this domain needs. Admin is required to
> >>>>    fill in that value manually. In the mean time we revert the change in
> >>>>    QEMU and declare QEMU with that change buggy.
> >>> QEMU 2.3.0 was released with that change in it, so it is not quite
> >>> possible to revert it. Also I think it is the right change for QEMU.
> >>>
> >> It has security implications. Here is my reply copied from my mail to
> >> Ian:
> >>
> >> I'm considering removing xc_domain_setmaxmem needs regardless of this
> >> bug because that's going to cause problem in QEMU upstream stubdom with
> >> strict XSM policy and deprivileged QEMU (may not have privilege to call
> >> setmaxmem).
> > QEMU running in the stubdom should be able to set the maxmem for its
> > target domain, but not for the others.
> 
> At the moment, set_max_mem is the only method the toolstack has of
> putting a hard upper bound on a domains memory usage (along with a few
> others like the shadow mem size, and PoD cache.)
> 
> In a disaggregated case, no deprivileged entity should be able to play
> with this limit.  Being able to do so renders the security moot, as a
> compromised stubdom can force a host OOM.

You have a good point, but I wouldn't say that it renders the security
moot: an attacker needs to break into the stubdom first, and even then
it only gets as far as OOMing the host, not compromising other people's
data.

Without a stubdom, with QEMU deprivileged in dom0, this wouldn't be a
problem because QEMU can set max_mem before dropping privileges. The
set_max_mem calls happen before unpausing the guest. Maybe something
similar could be arranged for stubdoms too.


* Re: QEMU bumping memory bug analysis
  2015-06-08 11:39     ` Stefano Stabellini
  2015-06-08 12:14       ` Andrew Cooper
@ 2015-06-08 13:10       ` Wei Liu
  2015-06-08 13:27         ` Stefano Stabellini
  1 sibling, 1 reply; 38+ messages in thread
From: Wei Liu @ 2015-06-08 13:10 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Wei Liu, Ian Campbell, George Dunlap, Andrew Cooper, Ian Jackson,
	dslutz, xen-devel

On Mon, Jun 08, 2015 at 12:39:52PM +0100, Stefano Stabellini wrote:
[...]
> > > > 3. Add a libxl layer that wraps necessary information, take over
> > > >    Andrew's work on libxl migration v2.  Having a libxl layer that's not
> > > >    part of migration v2 is a waste of effort.
> > > > 
> > > > There are several obstacles for libxl migration v2 at the moment. Libxl
> > > > layer in migration v2 still has unresolved issues. It has
> > > > inter-dependency with Remus / COLO.
> > > > 
> > > > Most importantly it doesn't inherently solve the problem. It still
> > > > requires the current libxl JSON blob to contain information about max
> > > > pages (or information used to derive max pages).
> > > > 
> > > > Andrew, correct me if I'm wrong.
> > > > 
> > > > 4. Add a none user configurable field in current libxl JSON structure to
> > > >    record max pages information.
> > > > 
> > > > This is not desirable. All fields in libxl JSON should be user
> > > > configurable.
> > > > 
> > > > 5. Add a user configurable field in current libxl JSON structure to
> > > >    record how much more memory this domain needs. Admin is required to
> > > >    fill in that value manually. In the mean time we revert the change in
> > > >    QEMU and declare QEMU with that change buggy.
> > > 
> > > QEMU 2.3.0 was released with that change in it, so it is not quite
> > > possible to revert it. Also I think it is the right change for QEMU.
> > > 
> > 
> > It has security implications. Here is my reply copied from my mail to
> > Ian:
> > 
> > I'm considering removing xc_domain_setmaxmem needs regardless of this
> > bug because that's going to cause problem in QEMU upstream stubdom with
> > strict XSM policy and deprivileged QEMU (may not have privilege to call
> > setmaxmem).
> 
> QEMU running in the stubdom should be able to set the maxmem for its
> target domain, but not for the others.
> 

Right, but this is still not safe. I will explain below.

> 
> > The security implication as it is now is big enough. One malicious guest
> > that controls QEMU has a vector to DoS hypervisor by setting its own
> > max_pages to -1;
> 
> Is that an issue in the implementation of XEN_DOMCTL_max_mem on the Xen
> side?  Could you please give me a bit more info on this security issue?
> 

Consider a malicious guest that has gained control of its QEMU. It
then calls xc_domain_setmaxmem to set its own max_pages inside the
hypervisor to -1 and starts calling populate_physmap inside the guest
kernel. This is going to make the hypervisor OOM.

XEN_DOMCTL_max_mem only sets the limit of the domain at the moment. I
don't think there is a sensible way of distinguishing a valid setmaxmem
call from a malicious setmaxmem call coming from QEMU. It's an
untrusted code base after all.
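
To make the exposure concrete, this is roughly all a compromised device
model would have to do (a minimal sketch against the libxc API as I
remember it; the domid, the value and the error handling are
illustrative only):

    #include <stdint.h>
    #include <xenctrl.h>

    /* Sketch only: what a compromised device model could issue today,
     * since XEN_DOMCTL_max_mem cannot tell a legitimate caller from a
     * malicious one. */
    int bump_own_limit(uint32_t domid)
    {
        xc_interface *xch = xc_interface_open(NULL, NULL, 0);
        int rc;

        if (!xch)
            return -1;

        /* Effectively "max_pages = -1": the upper bound is gone, and the
         * guest kernel can then populate_physmap until the host OOMs. */
        rc = xc_domain_setmaxmem(xch, domid, ~0UL /* KiB */);

        xc_interface_close(xch);
        return rc;
    }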

> 
> > > 
> > > > No response to this so far. But in fact I consider this the most viable
> > > > solution.
> > > > 
> > > > It's a simple enough solution that is achievable within 4.6 time frame.
> > > > It doesn't prevent us from doing useful work in the future
> > > > (disaggregated architecture with stricter security policy). It provides
> > > > a way to work around buggy QEMU (admin sets that value to prevent QEMU
> > > > from bumping memory limit). It's orgthogonal to migration v2 which means
> > > > it won't be blocked by migration v2 or block migration v2.
> > > > 
> > > > I tend to go with solution 5. Speak up if you don't agree with my
> > > > analysis or you think I miss some aspects.
> > > 
> > > I disagree
> > > 
> > > 
> > > > For long term we need to:
> > > > 
> > > > 1. Establish libxl as the arbitrator how much pages a domain can have.
> > > >    Anything else doing stuff behind arbitrator's back is considered
> > > >    buggy. This principle probably apply to other aspects of a domain as
> > > >    well.
> > > 
> > > I disagree that libxl should be the arbitrator of a property that is
> > > stored, maintained and enforced by Xen. Xen should be the arbitrator.
> > > 
> > 
> > But it would be a burden for Xen once the arbitration goes beyond the
> > issue we discuss here. It certainly doesn't have as much as information
> > libxl has. In the long run we would still end up having libxl doing most
> > of the work so we might as well establish the principle now.
> 
> I would like to see some concrete examples of how libxl is adding value
> in this context instead of just adding layers of indirection.
> 

No, not a layer of indirection. The fact that the initial max_pages is
calculated in libxl suggests that libxl already has the responsibility
to arbitrate that value.

Wei.


* Re: QEMU bumping memory bug analysis
  2015-06-08 12:11       ` Ian Campbell
@ 2015-06-08 13:22         ` Stefano Stabellini
  2015-06-08 13:52           ` Ian Campbell
  2015-06-08 14:20           ` George Dunlap
  2015-06-08 14:53         ` Konrad Rzeszutek Wilk
  1 sibling, 2 replies; 38+ messages in thread
From: Stefano Stabellini @ 2015-06-08 13:22 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, Stefano Stabellini, George Dunlap, Andrew Cooper,
	Ian Jackson, dslutz, xen-devel

On Mon, 8 Jun 2015, Ian Campbell wrote:
> On Mon, 2015-06-08 at 12:40 +0100, Stefano Stabellini wrote:
> 
> > > > I disagree that libxl should be the arbitrator of a property that is
> > > > stored, maintained and enforced by Xen. Xen should be the arbitrator.
> > > 
> > > That's not what "arbitrator" means, an arbitrator decides what the value
> > > should be, but that doesn't necessarily imply that it either stores,
> > > maintains nor enforces that value. 
> > 
> > The way you describe it, it looks like some kind of host wide memory
> > policy manager,
> 
> Not necessarily.
> 
> We need to make a distinction between the entity which enforces
> maxpages, that (or those) which decide on what maxpages should be and
> tells Xen.
> 
> Xen is the enforcer. It is also therefore by necessity the holder of the
> canonical value for the maxpages, since it has to be.
> 
> Xen cannot be the second, it simply doesn't have the necessary
> information to make a decision about a domain's memory limit, it just
> blindly does whatever it is told by the last thing to tell it something.
> 
> There are lots of ways the desired value of maxpages can be decided on,
> given a bunch of entities all with some level of responsibility for the
> maxpages of a given domain:
> 
>      1. One single entity which decides on the limit and tells Xen. I
>         think this is the xapi model (not squeezed/balloond, they just
>         ask xapi to do things AFAIK, so are a red-herring). The unwary
>         might think "libxl" was in this category, but it isn't since it
>         is in fact multiple instances of the library.
>      2. A group of entities which cooperate as a distributed system such
>         that any one of them can recalculate (from from the shared
>         state, or by intra entity communication) the current desired
>         value of max pages and propagate it to Xen. Prior to QEMU
>         starting to play with the max pages the multiple libxl's on a
>         system fell into this case, coordinating via xenstore
>         transactions and libxl userdata.

The current libxl/xl code doesn't use xenstore or userdata to coordinate
maxmem values. Xenstore is used to store the memory "target" for the
in-guest balloon driver. Userdata is used to store a few QEMU
properties.

Maxmem is set at runtime by libxl in two cases:

- when the user calls xl mem-max
xl mem-max -> libxl_domain_setmaxmem -> xc_domain_setmaxmem

libxl_domain_setmaxmem does a xenstore read to check the target value,
not the maxmem value. It doesn't allow the user to lower maxmem below
the current target value. Arguably this doesn't make too much sense.

- when libxl_set_memory_target is called with enforce = 1, that happens
when the user executes xl mem-set


There is no coordination as far as I can tell. Libxl is just exposing
xc_domain_setmaxmem straight to the user. There is no locking, no
synchronization, nothing. Maybe this is where our different views come
from: I am not seeing libxl doing any arbitrating at all, at least today.
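
For reference, the shape of that mem-max path is roughly the following
(a simplified sketch, not the actual libxl source; the constants libxl
adds and most of its error handling are omitted):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <xenctrl.h>
    #include <xenstore.h>

    /* Not the real libxl code, just the shape of the xl mem-max path:
     * xl mem-max -> libxl_domain_setmaxmem -> xc_domain_setmaxmem. */
    int sketch_setmaxmem(xc_interface *xch, struct xs_handle *xsh,
                         uint32_t domid, uint64_t max_memkb)
    {
        char path[64], *val;
        unsigned int len;
        uint64_t target_memkb;

        /* Read the in-guest balloon *target*, not the current maxmem. */
        snprintf(path, sizeof(path), "/local/domain/%u/memory/target", domid);
        val = xs_read(xsh, XBT_NULL, path, &len);
        if (!val)
            return -1;
        target_memkb = strtoull(val, NULL, 10);
        free(val);

        /* Refuse to lower the limit below that target... */
        if (max_memkb < target_memkb)
            return -1;

        /* ...then go straight to the hypervisor: no locking, no
         * coordination with anything else that may also set maxmem. */
        return xc_domain_setmaxmem(xch, domid, max_memkb);
    }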



>      3. A group of entities which operate in isolation by only ever
>         increasing or descreasing the max pages according to their own
>         requirements, without reference to anyone else. When QEMU
>         entered the fray, and with the various libxl fixes since, you
>         might think we are implementing this model, but we aren't
>         because the hypervisor interface is only "set", not
>         "increment/decrement" and so there is a racy read/modify/write
>         cycle in every entity now.

I don't think this is true: QEMU only sets maxmem at domain creation,
before "xl create" even returns. I think that is safe. We had an email
exchange at the time, I explained this behaviour, and the general
opinion was that it is acceptable. I don't understand why it is not
anymore.


>      4. Xen exposes multiple "maxpages" variables corresponding to the
>         various entities which have a need to control this (or multiple
>         per entity). Xen adds all of those up to come up with the actual
>         maxpages.
>      5. I thought there wasw one more but I've forgotten what it was.
> 
> We used to have #2, we now have a broken version of #3 (broken because
> it is racy due to the xen interface used).

I don't think we had #2; we always had a working version of #3 and we
still do, except for migration being broken.

I think libxc just needs to properly save/restore maxmem. I understand
that given the current interfaces it is difficult to do, but rethinking
the world because we cannot save/restore maxmem in a simple manner
doesn't seem the right answer to me.
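
In other words, something along these lines on either end (a sketch
only, against the libxc calls as I remember them; where it would
actually fit in the stream is exactly the hard part):

    #include <stdint.h>
    #include <xenctrl.h>

    /* Sending side: ask Xen what the limit currently is. */
    static int get_maxmem_kb(xc_interface *xch, uint32_t domid,
                             uint64_t *max_memkb)
    {
        xc_dominfo_t info;

        if (xc_domain_getinfo(xch, domid, 1, &info) != 1 ||
            info.domid != domid)
            return -1;
        *max_memkb = info.max_memkb;
        return 0;
    }

    /* Receiving side: apply it before replaying anything that could
     * allocate past the limit the domain was built with. */
    static int set_maxmem_kb(xc_interface *xch, uint32_t domid,
                             uint64_t max_memkb)
    {
        return xc_domain_setmaxmem(xch, domid, max_memkb);
    }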



> Fixing #2 would be a case of adding a qmp command to allow libxl to ask
> qmp what its extra memory needs are at any given point (start or day,
> post hotplug event etc, or even of every memset).
> 
> Fixing #3 would be a case of adding a new hypercall interface to allow
> the amount of memory to be changed by a delta instead of just set and
> having everyone use it. You'd need to decide on migration whether to
> propagate the current at stop and copy time and having everyone migrate
> their requirements vs having everyone reregister their requirements on
> the far end.
> 
> #4 is a new hypercall interface, which everyone would have to use. I
> happen to think this is a pretty silly design.
> 
> > the same way as other memory management tools were never accepted into
> > xl/libxl but kept to separate daemons, like xenballoond or squeezed.
> 
> I think squeezed and friends are a red-herring here, since they are
> either a client of one or more entities or in the more distributed ones
> are just one party among many.
> 
> > Let's step away from this specific issue for a second. If it is not an
> > host way policy manager but a per-VM layer on top of libxc, what value
> > is this indirection actually adding?
> 
> Given the raciness of libxc -- it is adding correctness.
> 
> > > What locking is there around QEMU's read/modify/write of the maxmem
> > > value? What happens if someone else also modifies maxmem at the
> > > same time?
> > 
> > It only happens at start of day before the domain is unpaused. At the
> > time I couldn't come up with a scenario where it would be an issue,
> > unless the admin is purposely trying to shut himself in the foot.
> 
> I don't think there is anything preventing a call to the toolstack to
> set the amount of memory for a domain right after the domid exists.
> 
> In a system with something like balloond or squeezed I can well imagine
> them doing something when a new domain was started resulting in opening
> of this race condition. (even if something is "ask libxl" not "call
> xc_setmaxpages").
> 
> Ian.
> 


* Re: QEMU bumping memory bug analysis
  2015-06-08 13:10       ` Wei Liu
@ 2015-06-08 13:27         ` Stefano Stabellini
  2015-06-08 13:32           ` Wei Liu
  0 siblings, 1 reply; 38+ messages in thread
From: Stefano Stabellini @ 2015-06-08 13:27 UTC (permalink / raw)
  To: Wei Liu
  Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper,
	Ian Jackson, dslutz, xen-devel

On Mon, 8 Jun 2015, Wei Liu wrote:
> On Mon, Jun 08, 2015 at 12:39:52PM +0100, Stefano Stabellini wrote:
> [...]
> > > > > 3. Add a libxl layer that wraps necessary information, take over
> > > > >    Andrew's work on libxl migration v2.  Having a libxl layer that's not
> > > > >    part of migration v2 is a waste of effort.
> > > > > 
> > > > > There are several obstacles for libxl migration v2 at the moment. Libxl
> > > > > layer in migration v2 still has unresolved issues. It has
> > > > > inter-dependency with Remus / COLO.
> > > > > 
> > > > > Most importantly it doesn't inherently solve the problem. It still
> > > > > requires the current libxl JSON blob to contain information about max
> > > > > pages (or information used to derive max pages).
> > > > > 
> > > > > Andrew, correct me if I'm wrong.
> > > > > 
> > > > > 4. Add a none user configurable field in current libxl JSON structure to
> > > > >    record max pages information.
> > > > > 
> > > > > This is not desirable. All fields in libxl JSON should be user
> > > > > configurable.
> > > > > 
> > > > > 5. Add a user configurable field in current libxl JSON structure to
> > > > >    record how much more memory this domain needs. Admin is required to
> > > > >    fill in that value manually. In the mean time we revert the change in
> > > > >    QEMU and declare QEMU with that change buggy.
> > > > 
> > > > QEMU 2.3.0 was released with that change in it, so it is not quite
> > > > possible to revert it. Also I think it is the right change for QEMU.
> > > > 
> > > 
> > > It has security implications. Here is my reply copied from my mail to
> > > Ian:
> > > 
> > > I'm considering removing xc_domain_setmaxmem needs regardless of this
> > > bug because that's going to cause problem in QEMU upstream stubdom with
> > > strict XSM policy and deprivileged QEMU (may not have privilege to call
> > > setmaxmem).
> > 
> > QEMU running in the stubdom should be able to set the maxmem for its
> > target domain, but not for the others.
> > 
> 
> Right, but this is still not safe. I will explain below.
> 
> > 
> > > The security implication as it is now is big enough. One malicious guest
> > > that controls QEMU has a vector to DoS hypervisor by setting its own
> > > max_pages to -1;
> > 
> > Is that an issue in the implementation of XEN_DOMCTL_max_mem on the Xen
> > side?  Could you please give me a bit more info on this security issue?
> > 
> 
> Consider now we have a malicious guest. It has gained control of QEMU.
> It then calls xc_domain_setmaxmem to set it's own max_pages inside
> hypervisor to be -1 and start calling populate_physmap inside the guest
> kernel. This is going to make hypervisor OOM.
> 
> XEN_DOMCTL_max_mem only sets the limit of domain at the moment. I don't
> think there is sensible way of distinguishing a valid setmaxmem call
> from a malicious setmaxmem call from QEMU. It's untrusted code base
> after all.

Valid maxmem changes happen before device-model/$DOMID/state is set to running.


> > 
> > > > 
> > > > > No response to this so far. But in fact I consider this the most viable
> > > > > solution.
> > > > > 
> > > > > It's a simple enough solution that is achievable within 4.6 time frame.
> > > > > It doesn't prevent us from doing useful work in the future
> > > > > (disaggregated architecture with stricter security policy). It provides
> > > > > a way to work around buggy QEMU (admin sets that value to prevent QEMU
> > > > > from bumping memory limit). It's orgthogonal to migration v2 which means
> > > > > it won't be blocked by migration v2 or block migration v2.
> > > > > 
> > > > > I tend to go with solution 5. Speak up if you don't agree with my
> > > > > analysis or you think I miss some aspects.
> > > > 
> > > > I disagree
> > > > 
> > > > 
> > > > > For long term we need to:
> > > > > 
> > > > > 1. Establish libxl as the arbitrator how much pages a domain can have.
> > > > >    Anything else doing stuff behind arbitrator's back is considered
> > > > >    buggy. This principle probably apply to other aspects of a domain as
> > > > >    well.
> > > > 
> > > > I disagree that libxl should be the arbitrator of a property that is
> > > > stored, maintained and enforced by Xen. Xen should be the arbitrator.
> > > > 
> > > 
> > > But it would be a burden for Xen once the arbitration goes beyond the
> > > issue we discuss here. It certainly doesn't have as much as information
> > > libxl has. In the long run we would still end up having libxl doing most
> > > of the work so we might as well establish the principle now.
> > 
> > I would like to see some concrete examples of how libxl is adding value
> > in this context instead of just adding layers of indirection.
> > 
> 
> No, not layer of indirection. The fact that initial max_pages being
> calculated in libxl suggests that libxl already has that responsibility
> to arbitrate that value.
> 
> Wei.
> 


* Re: QEMU bumping memory bug analysis
  2015-06-08 13:27         ` Stefano Stabellini
@ 2015-06-08 13:32           ` Wei Liu
  2015-06-08 13:38             ` Stefano Stabellini
  0 siblings, 1 reply; 38+ messages in thread
From: Wei Liu @ 2015-06-08 13:32 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Wei Liu, Ian Campbell, George Dunlap, Andrew Cooper, Ian Jackson,
	dslutz, xen-devel

On Mon, Jun 08, 2015 at 02:27:11PM +0100, Stefano Stabellini wrote:
> On Mon, 8 Jun 2015, Wei Liu wrote:
> > On Mon, Jun 08, 2015 at 12:39:52PM +0100, Stefano Stabellini wrote:
> > [...]
> > > > > > 3. Add a libxl layer that wraps necessary information, take over
> > > > > >    Andrew's work on libxl migration v2.  Having a libxl layer that's not
> > > > > >    part of migration v2 is a waste of effort.
> > > > > > 
> > > > > > There are several obstacles for libxl migration v2 at the moment. Libxl
> > > > > > layer in migration v2 still has unresolved issues. It has
> > > > > > inter-dependency with Remus / COLO.
> > > > > > 
> > > > > > Most importantly it doesn't inherently solve the problem. It still
> > > > > > requires the current libxl JSON blob to contain information about max
> > > > > > pages (or information used to derive max pages).
> > > > > > 
> > > > > > Andrew, correct me if I'm wrong.
> > > > > > 
> > > > > > 4. Add a none user configurable field in current libxl JSON structure to
> > > > > >    record max pages information.
> > > > > > 
> > > > > > This is not desirable. All fields in libxl JSON should be user
> > > > > > configurable.
> > > > > > 
> > > > > > 5. Add a user configurable field in current libxl JSON structure to
> > > > > >    record how much more memory this domain needs. Admin is required to
> > > > > >    fill in that value manually. In the mean time we revert the change in
> > > > > >    QEMU and declare QEMU with that change buggy.
> > > > > 
> > > > > QEMU 2.3.0 was released with that change in it, so it is not quite
> > > > > possible to revert it. Also I think it is the right change for QEMU.
> > > > > 
> > > > 
> > > > It has security implications. Here is my reply copied from my mail to
> > > > Ian:
> > > > 
> > > > I'm considering removing xc_domain_setmaxmem needs regardless of this
> > > > bug because that's going to cause problem in QEMU upstream stubdom with
> > > > strict XSM policy and deprivileged QEMU (may not have privilege to call
> > > > setmaxmem).
> > > 
> > > QEMU running in the stubdom should be able to set the maxmem for its
> > > target domain, but not for the others.
> > > 
> > 
> > Right, but this is still not safe. I will explain below.
> > 
> > > 
> > > > The security implication as it is now is big enough. One malicious guest
> > > > that controls QEMU has a vector to DoS hypervisor by setting its own
> > > > max_pages to -1;
> > > 
> > > Is that an issue in the implementation of XEN_DOMCTL_max_mem on the Xen
> > > side?  Could you please give me a bit more info on this security issue?
> > > 
> > 
> > Consider now we have a malicious guest. It has gained control of QEMU.
> > It then calls xc_domain_setmaxmem to set it's own max_pages inside
> > hypervisor to be -1 and start calling populate_physmap inside the guest
> > kernel. This is going to make hypervisor OOM.
> > 
> > XEN_DOMCTL_max_mem only sets the limit of domain at the moment. I don't
> > think there is sensible way of distinguishing a valid setmaxmem call
> > from a malicious setmaxmem call from QEMU. It's untrusted code base
> > after all.
> 
> Valid maxmem happens before device-model/$DOMID/state is set to running.
> 

But Xen wouldn't know about that; it can't read xenstore. Also,
device-model/$domid/state is writable by QEMU, so we can't trust its
content as an indicator either.

Wei.


* Re: QEMU bumping memory bug analysis
  2015-06-08 13:01         ` Stefano Stabellini
@ 2015-06-08 13:33           ` Jan Beulich
  0 siblings, 0 replies; 38+ messages in thread
From: Jan Beulich @ 2015-06-08 13:33 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Wei Liu, Ian Campbell, George Dunlap, Andrew Cooper, IanJackson,
	dslutz, xen-devel

>>> On 08.06.15 at 15:01, <stefano.stabellini@eu.citrix.com> wrote:
> On Mon, 8 Jun 2015, Andrew Cooper wrote:
>> At the moment, set_max_mem is the only method the toolstack has of
>> putting a hard upper bound on a domains memory usage (along with a few
>> others like the shadow mem size, and PoD cache.)
>> 
>> In a disaggregated case, no deprivileged entity should be able to play
>> with this limit.  Being able to do so renders the security moot, as a
>> compromised stubdom can force a host OOM.
> 
> You have a good point but I wouldn't say that it renders the security
> moot, as one needs to break into the stubdom first, and then it still
> only gets as far as OOMing the host, not compromising other people's
> data. 

OOMing the host is quite likely to compromise other people's (i.e.
VM's) data, due to Xen having to fail any request to it that
involves memory allocation. I'm sure there are code paths here
that end in domain_crash().

Jan


* Re: QEMU bumping memory bug analysis
  2015-06-08 13:32           ` Wei Liu
@ 2015-06-08 13:38             ` Stefano Stabellini
  2015-06-08 13:44               ` Andrew Cooper
  0 siblings, 1 reply; 38+ messages in thread
From: Stefano Stabellini @ 2015-06-08 13:38 UTC (permalink / raw)
  To: Wei Liu
  Cc: Ian Campbell, Stefano Stabellini, George Dunlap, Andrew Cooper,
	Ian Jackson, dslutz, xen-devel

On Mon, 8 Jun 2015, Wei Liu wrote:
> On Mon, Jun 08, 2015 at 02:27:11PM +0100, Stefano Stabellini wrote:
> > On Mon, 8 Jun 2015, Wei Liu wrote:
> > > On Mon, Jun 08, 2015 at 12:39:52PM +0100, Stefano Stabellini wrote:
> > > [...]
> > > > > > > 3. Add a libxl layer that wraps necessary information, take over
> > > > > > >    Andrew's work on libxl migration v2.  Having a libxl layer that's not
> > > > > > >    part of migration v2 is a waste of effort.
> > > > > > > 
> > > > > > > There are several obstacles for libxl migration v2 at the moment. Libxl
> > > > > > > layer in migration v2 still has unresolved issues. It has
> > > > > > > inter-dependency with Remus / COLO.
> > > > > > > 
> > > > > > > Most importantly it doesn't inherently solve the problem. It still
> > > > > > > requires the current libxl JSON blob to contain information about max
> > > > > > > pages (or information used to derive max pages).
> > > > > > > 
> > > > > > > Andrew, correct me if I'm wrong.
> > > > > > > 
> > > > > > > 4. Add a none user configurable field in current libxl JSON structure to
> > > > > > >    record max pages information.
> > > > > > > 
> > > > > > > This is not desirable. All fields in libxl JSON should be user
> > > > > > > configurable.
> > > > > > > 
> > > > > > > 5. Add a user configurable field in current libxl JSON structure to
> > > > > > >    record how much more memory this domain needs. Admin is required to
> > > > > > >    fill in that value manually. In the mean time we revert the change in
> > > > > > >    QEMU and declare QEMU with that change buggy.
> > > > > > 
> > > > > > QEMU 2.3.0 was released with that change in it, so it is not quite
> > > > > > possible to revert it. Also I think it is the right change for QEMU.
> > > > > > 
> > > > > 
> > > > > It has security implications. Here is my reply copied from my mail to
> > > > > Ian:
> > > > > 
> > > > > I'm considering removing xc_domain_setmaxmem needs regardless of this
> > > > > bug because that's going to cause problem in QEMU upstream stubdom with
> > > > > strict XSM policy and deprivileged QEMU (may not have privilege to call
> > > > > setmaxmem).
> > > > 
> > > > QEMU running in the stubdom should be able to set the maxmem for its
> > > > target domain, but not for the others.
> > > > 
> > > 
> > > Right, but this is still not safe. I will explain below.
> > > 
> > > > 
> > > > > The security implication as it is now is big enough. One malicious guest
> > > > > that controls QEMU has a vector to DoS hypervisor by setting its own
> > > > > max_pages to -1;
> > > > 
> > > > Is that an issue in the implementation of XEN_DOMCTL_max_mem on the Xen
> > > > side?  Could you please give me a bit more info on this security issue?
> > > > 
> > > 
> > > Consider now we have a malicious guest. It has gained control of QEMU.
> > > It then calls xc_domain_setmaxmem to set it's own max_pages inside
> > > hypervisor to be -1 and start calling populate_physmap inside the guest
> > > kernel. This is going to make hypervisor OOM.
> > > 
> > > XEN_DOMCTL_max_mem only sets the limit of domain at the moment. I don't
> > > think there is sensible way of distinguishing a valid setmaxmem call
> > > from a malicious setmaxmem call from QEMU. It's untrusted code base
> > > after all.
> > 
> > Valid maxmem happens before device-model/$DOMID/state is set to running.
> > 
> 
> But Xen wouldn't know about that. It couldn't read xenstore.

The way I would do it would be to drop the max_mem write privilege
from the stubdom at the time QEMU writes to device-model/$domid/state,
in a similar way to QEMU in dom0 dropping from root to an unprivileged
uid. Alternatively, as libxl is watching for changes to
device-model/$DOMID/state, libxl could be the one that takes maxmem
writing privileges away from the stubdom at that time.
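
Roughly, for the second variant (a sketch only: the watch plumbing is
the standard xenstore API, but the revoke step is hypothetical, because
no such call exists today and defining one would be part of the work):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <xenstore.h>

    /* Hypothetical: however the policy change is actually expressed
     * (XSM relabel, a dedicated domctl, ...) still needs designing. */
    extern int revoke_setmaxmem_from_stubdom(uint32_t stubdom_id,
                                             uint32_t target_domid);

    static int revoke_when_dm_running(struct xs_handle *xsh,
                                      uint32_t domid, uint32_t stubdom_id)
    {
        char path[64], **vec, *state;
        unsigned int num, len;
        bool running;

        snprintf(path, sizeof(path),
                 "/local/domain/0/device-model/%u/state", domid);
        if (!xs_watch(xsh, path, "dm-state"))
            return -1;

        for (;;) {
            vec = xs_read_watch(xsh, &num);  /* blocks until the node changes */
            if (!vec)
                return -1;
            free(vec);

            state = xs_read(xsh, XBT_NULL, path, &len);
            running = state && !strcmp(state, "running");
            free(state);
            if (running)
                return revoke_setmaxmem_from_stubdom(stubdom_id, domid);
        }
    }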


> Also device-mode/$domid/state is writable by QEMU so we can't trust
> the content as indicator either.

We can because the write happens before we unpause the guest


* Re: QEMU bumping memory bug analysis
  2015-06-08 13:38             ` Stefano Stabellini
@ 2015-06-08 13:44               ` Andrew Cooper
  2015-06-08 13:45                 ` Stefano Stabellini
  0 siblings, 1 reply; 38+ messages in thread
From: Andrew Cooper @ 2015-06-08 13:44 UTC (permalink / raw)
  To: Stefano Stabellini, Wei Liu
  Cc: George Dunlap, Ian Jackson, Ian Campbell, dslutz, xen-devel

On 08/06/15 14:38, Stefano Stabellini wrote:
>> Also device-mode/$domid/state is writable by QEMU so we can't trust
>> > the content as indicator either.
> We can because the write happens before we unpause the guest

Only when creating the domain fresh.  On resume, the guest has
possibly had the chance to code-inject via the qemu save format.  There
are many CVEs in this area, and I am not willing to bet that all of
them are fixed.

In XenServer, even loading VM state from the save file happens in the
deprivileged environment.

~Andrew


* Re: QEMU bumping memory bug analysis
  2015-06-08 13:44               ` Andrew Cooper
@ 2015-06-08 13:45                 ` Stefano Stabellini
  0 siblings, 0 replies; 38+ messages in thread
From: Stefano Stabellini @ 2015-06-08 13:45 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, George Dunlap,
	Ian Jackson, dslutz, xen-devel

On Mon, 8 Jun 2015, Andrew Cooper wrote:
> On 08/06/15 14:38, Stefano Stabellini wrote:
> >> Also device-mode/$domid/state is writable by QEMU so we can't trust
> >> > the content as indicator either.
> > We can because the write happens before we unpause the guest
> 
> Only when creating the domain fresh.  On resume, the guest has possibly
> had the chance to code-inject via the qemu save format.  There are many
> CVEs in this area, and I am not willing to be all of them are fixed.
> 
> In XenServer, even loading VM state from the save file happens in the
> deprivilelged environment.

QEMU doesn't do any maxmem changes at restore time.


* Re: QEMU bumping memory bug analysis
  2015-06-08 13:22         ` Stefano Stabellini
@ 2015-06-08 13:52           ` Ian Campbell
  2015-06-08 14:20           ` George Dunlap
  1 sibling, 0 replies; 38+ messages in thread
From: Ian Campbell @ 2015-06-08 13:52 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson, dslutz,
	xen-devel

On Mon, 2015-06-08 at 14:22 +0100, Stefano Stabellini wrote:
> On Mon, 8 Jun 2015, Ian Campbell wrote:
> > On Mon, 2015-06-08 at 12:40 +0100, Stefano Stabellini wrote:
> > 
> > > > > I disagree that libxl should be the arbitrator of a property that is
> > > > > stored, maintained and enforced by Xen. Xen should be the arbitrator.
> > > > 
> > > > That's not what "arbitrator" means, an arbitrator decides what the value
> > > > should be, but that doesn't necessarily imply that it either stores,
> > > > maintains nor enforces that value. 
> > > 
> > > The way you describe it, it looks like some kind of host wide memory
> > > policy manager,
> > 
> > Not necessarily.
> > 
> > We need to make a distinction between the entity which enforces
> > maxpages, that (or those) which decide on what maxpages should be and
> > tells Xen.
> > 
> > Xen is the enforcer. It is also therefore by necessity the holder of the
> > canonical value for the maxpages, since it has to be.
> > 
> > Xen cannot be the second, it simply doesn't have the necessary
> > information to make a decision about a domain's memory limit, it just
> > blindly does whatever it is told by the last thing to tell it something.
> > 
> > There are lots of ways the desired value of maxpages can be decided on,
> > given a bunch of entities all with some level of responsibility for the
> > maxpages of a given domain:
> > 
> >      1. One single entity which decides on the limit and tells Xen. I
> >         think this is the xapi model (not squeezed/balloond, they just
> >         ask xapi to do things AFAIK, so are a red-herring). The unwary
> >         might think "libxl" was in this category, but it isn't since it
> >         is in fact multiple instances of the library.
> >      2. A group of entities which cooperate as a distributed system such
> >         that any one of them can recalculate (from from the shared
> >         state, or by intra entity communication) the current desired
> >         value of max pages and propagate it to Xen. Prior to QEMU
> >         starting to play with the max pages the multiple libxl's on a
> >         system fell into this case, coordinating via xenstore
> >         transactions and libxl userdata.
> 
> The current libxl/xl code doesn't use xenstore or userdata to coordinate
> maxmem values. Xenstore is used to store the memory "target" for the
> in-guest balloon driver. Userdata is used to store a few QEMU
> properties.
> 
> Maxmem is set at runtime by libxl is two cases:
> 
> - when the user calls xl mem-max
> xl mem-max -> libxl_domain_setmaxmem -> xc_domain_setmaxmem
> 
> libxl_domain_setmaxmem does a xenstore read to check the target value,
> not the maxmem value. It doesn't allow the user to lower maxmem below
> the current target value. Arguably this doesn't make too much sense.
> 
> - when libxl_set_memory_target is called with enforce = 1, that happens
> when the user executes xl mem-set
> 
> 
> There is no coordination as far as I can tell. Libxl is just exposing
> xc_domain_setmaxmem straight to the user. There is no locking, no
> synchronization, nothing. Maybe this is where our different views come
> from: I am not seeing libxl doing any arbitrating at all, at least today.

I think this is consistent with what I describe as #2, except that, as
it happens, there is no need for communication: other than the new
value for the "user requested max memory size", all the other factors
are hardcoded constants (LIBXL_MAXMEM_CONSTANT and friends), so each
libxl can recalculate the correct value for max pages on its own.

If, for example, we supported multiple versions of libxl in parallel
with different constants, then this would be broken. Luckily we don't
support that.

> >      3. A group of entities which operate in isolation by only ever
> >         increasing or descreasing the max pages according to their own
> >         requirements, without reference to anyone else. When QEMU
> >         entered the fray, and with the various libxl fixes since, you
> >         might think we are implementing this model, but we aren't
> >         because the hypervisor interface is only "set", not
> >         "increment/decrement" and so there is a racy read/modify/write
> >         cycle in every entity now.
> 
> I don't think this is true: QEMU only sets maxmem at domain creation,
> before "xl create" even returns. I think that is safe. We had an email
> exchange at the time, I explained this behaviour and the general opinon
> was that it is acceptable. I don't understand why it is not anymore.

At the time we didn't appreciate the race inherent in this, I think. Or
else the argument is complex enough that it is going to need explaining
again (and again). In which case it really ought to be clearly written
down somewhere. The third option is that we ended up here by a series of
plausible looking changes (the frog in boiling water effect).

Anyway, without locking the current scheme is not #3 as defined above,
due to the R/M/W race; it is just something a bit like #3 (but broken).

Maybe there is an argument that the locking in libxl is sufficient at
start of day, I'm not sure. It certainly doesn't look like
libxl_set_memory_target(enforce=1) does anything to interlock against
domain creation, so if it were called half way through I'm not sure it
would do the right thing. IOW, I think libxl_set_memory_target is now
racy against parallel invocations of libxl_domain_setmaxmem or against
domain creation.

I think/hope the xs transaction makes libxl_set_memory_target safe
against itself, but I'm not sure: if the transaction fails and we go
around again we might use a stale ptr.max_memkb (because the lost
transaction might have been another libxl_set_memory_target).
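
The pattern I'm worried about is the usual transaction-retry pitfall,
roughly this (a generic sketch, not the actual libxl code; the helpers
are made up):

    #include <errno.h>
    #include <stdint.h>
    #include <xenctrl.h>
    #include <xenstore.h>

    /* Made-up helpers, just to keep the sketch short. */
    extern uint64_t query_maxmem_kb(xc_interface *xch, uint32_t domid);
    extern void write_target_kb(struct xs_handle *xsh, xs_transaction_t t,
                                uint32_t domid, uint64_t kb);

    static int sketch_set_memory_target(struct xs_handle *xsh,
                                        xc_interface *xch, uint32_t domid,
                                        uint64_t new_target_kb)
    {
        /* Read once, outside the retry loop: this is the suspect pattern. */
        uint64_t max_memkb = query_maxmem_kb(xch, domid);

        for (;;) {
            xs_transaction_t t = xs_transaction_start(xsh);
            if (t == XBT_NULL)
                return -1;

            write_target_kb(xsh, t, domid, new_target_kb);

            if (xs_transaction_end(xsh, t, false))
                break;                           /* committed */
            if (errno != EAGAIN)
                return -1;
            /* Retry: max_memkb is *not* re-read, so whatever the winner
             * of the race did to the limit is acted on with a stale value. */
        }

        return xc_domain_setmaxmem(xch, domid, max_memkb);
    }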

> >      4. Xen exposes multiple "maxpages" variables corresponding to the
> >         various entities which have a need to control this (or multiple
> >         per entity). Xen adds all of those up to come up with the actual
> >         maxpages.
> >      5. I thought there wasw one more but I've forgotten what it was.
> > 
> > We used to have #2, we now have a broken version of #3 (broken because
> > it is racy due to the xen interface used).
> 
> I don't think we had #2, we always had a working version of #3 and we
> still have, except for migration being broken.

Before QEMU got involved #2 and #3 were for most practical purposes the
same, I suppose (given the trivial nop coordination in #2 described
above).

Adding QEMU strictly at start of day into the mix didn't inherently
break that (if we assume the libxl creation lock is sufficient, which
I'm not sure of); it was the subsequent changes to libxl to use a R/M/W
for updates where things start to get much more fishy IMHO.

> I think libxc just needs to properly save/restore maxmem. I understand
> that given the current interfaces it is difficult to do, but rethinking
> the world because we cannot save/restore maxmem in a simple manner
> doesn't seem the right answer to me.

I think save/restore is just what has caused us to notice that the
current model has issues. You are correct that we could apply a
band-aid to (perhaps) fix save/restore with the current model, but I
don't think that changes the fact that the model has issues.

Ian.


* Re: QEMU bumping memory bug analysis
  2015-06-05 17:10 ` Stefano Stabellini
  2015-06-05 18:10   ` Wei Liu
  2015-06-05 18:49   ` Ian Campbell
@ 2015-06-08 14:14   ` George Dunlap
  2 siblings, 0 replies; 38+ messages in thread
From: George Dunlap @ 2015-06-08 14:14 UTC (permalink / raw)
  To: Stefano Stabellini, Wei Liu
  Cc: Andrew Cooper, Ian Jackson, Ian Campbell, dslutz, xen-devel

On 06/05/2015 06:10 PM, Stefano Stabellini wrote:
>> 5. Add a user configurable field in current libxl JSON structure to
>>    record how much more memory this domain needs. Admin is required to
>>    fill in that value manually. In the mean time we revert the change in
>>    QEMU and declare QEMU with that change buggy.
> 
> QEMU 2.3.0 was released with that change in it, so it is not quite
> possible to revert it. 

This is at least one reason it should never have been in qemu in the
first place, and why we should deprecate that as soon as possible.

 -George


* Re: QEMU bumping memory bug analysis
  2015-06-08 13:22         ` Stefano Stabellini
  2015-06-08 13:52           ` Ian Campbell
@ 2015-06-08 14:20           ` George Dunlap
  2015-06-08 15:01             ` Don Slutz
  1 sibling, 1 reply; 38+ messages in thread
From: George Dunlap @ 2015-06-08 14:20 UTC (permalink / raw)
  To: Stefano Stabellini, Ian Campbell
  Cc: Ian Jackson, Andrew Cooper, Wei Liu, dslutz, xen-devel

On 06/08/2015 02:22 PM, Stefano Stabellini wrote:
>>      3. A group of entities which operate in isolation by only ever
>>         increasing or descreasing the max pages according to their own
>>         requirements, without reference to anyone else. When QEMU
>>         entered the fray, and with the various libxl fixes since, you
>>         might think we are implementing this model, but we aren't
>>         because the hypervisor interface is only "set", not
>>         "increment/decrement" and so there is a racy read/modify/write
>>         cycle in every entity now.
> 
> I don't think this is true: QEMU only sets maxmem at domain creation,
> before "xl create" even returns. I think that is safe. We had an email
> exchange at the time, I explained this behaviour and the general opinon
> was that it is acceptable. I don't understand why it is not anymore.

Well for one, nobody on the hypervisor side seems to have been brought
in -- I definitely would have objected, and it sounds like AndyC would
have objected too.

I think we need to go back one level further.

So first, let's make a distinction between *pages*, which are actual
host RAM assigned to the guest and put in its p2m table, and *memory*,
which is a virtualization construct (i.e., virtual RAM and video memory
for virtual graphics cards).

The hypervisor only cares about *pages*.  It allocates pages to a
domain, it puts them in the p2m.  That's all it knows.

The purpose of max_pages in the hypervisor is to make sure that no guest
can allocate more host memory (pages) than it is allowed to have.

How many pages is a particular guest allowed to have?

Well pages are used for a number of purposes:
* To implement virtual RAM in the guest
* To implement video ram for virtual devices in qemu
* To implement virtual ROMs
* For magic "shared pages" used behind-the-scenes (not visible to the
guest)

(Feel free to add anything I missed.)

max_pages in the hypervisor must be set to the sum of all the pages the
domain is allowed to have.

So the first point is this: Xen doesn't have a clue about any of those.
 It doesn't know how much virtual RAM a guest has, how much video RAM it
has, how many virtual ROMs, how many magic shared pages, or anything.
All it knows are what pages are in the p2m table.

So although Xen certainly *enforces* max_pages, it is (at the moment) in
no position to *decide* what max_pages should be.

At the moment, in fact, nobody is.  There is no single place that has a
clear picture into how virtual RAM, guest devices, guest ROMs, and
"magic pages" convert into actual number of pages.  I think that's a bug.

And at the moment, pages in the p2m are allocated by a number of entities:
* In the libxc domain builder.
* In the guest balloon driver
* And now, in qemu, to allocate extra memory for virtual ROMs.

Did I miss anything?

For the first two, it's libxl that sets maxmem, based in its calculation
of the size of virtual RAM plus various other bits that will be needed.
 Having qemu *also* set maxmem was always the wrong thing to do, IMHO.

In theory, from the interface perspective, what libxl promises to
provide is virtual RAM.  When you say "memory=8192" in a domain config,
that means (or should mean) 8192MiB of virtual RAM, exclusive of video
RAM, virtual ROMs, and magic pages.  Then when you say "xl mem-set
4096", it should again be aiming at giving the VM the equivalent of
4096MiB of virtual RAM, exclusive of video RAM, &c &c.

We already have the problem that the balloon driver at the moment
doesn't actually know how big the guest RAM is, but is being told to
make a balloon exactly big enough to bring the total RAM down to a
specific target.

I think we do need to have some place in the middle that actually knows
how much memory is actually needed for the different sub-systems, so it
can calculate and set maxmem appropriately.  libxl is the obvious place.

What about this:
* Libxl has a maximum amount of RAM that qemu is *allowed* to use to set
up virtual ROMs, video ram for virtual devices, &c
* At start-of-day, it sets maxpages to PAGES(virtual RAM)+PAGES(magic) +
max_qemu_pages
* Qemu allocates as many pages as it needs for option ROMS, and writes
the amount that it actually did use into a special node in xenstore.
* When the domain is unpaused, libxl will set maxpages to PAGES(virtual
RAM) + PAGES(magic) + actual_qemu_pages that it gets from xenstore.
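
Roughly, on the libxl side, that would be something like the following
(a sketch: the xenstore node name and the helpers are made up for
illustration):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <xenctrl.h>
    #include <xenstore.h>

    /* Start of day: virtual RAM + magic pages + the budget qemu is
     * *allowed* to consume for ROMs / video RAM. */
    static int set_initial_limit(xc_interface *xch, uint32_t domid,
                                 uint64_t ram_kb, uint64_t magic_kb,
                                 uint64_t max_qemu_kb)
    {
        return xc_domain_setmaxmem(xch, domid, ram_kb + magic_kb + max_qemu_kb);
    }

    /* Just before unpause: shrink the budget to what qemu actually used,
     * as reported in a (made-up) xenstore node that qemu wrote. */
    static int tighten_limit(xc_interface *xch, struct xs_handle *xsh,
                             uint32_t domid, uint64_t ram_kb, uint64_t magic_kb)
    {
        char path[80], *val;
        unsigned int len;
        uint64_t used_kb;

        snprintf(path, sizeof(path),
                 "/local/domain/0/device-model/%u/memory-used-kb", domid);
        val = xs_read(xsh, XBT_NULL, path, &len);
        if (!val)
            return -1;
        used_kb = strtoull(val, NULL, 10);
        free(val);

        return xc_domain_setmaxmem(xch, domid, ram_kb + magic_kb + used_kb);
    }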

I think also that probably libxl, rather than setting a target amount
of memory the balloon driver is supposed to aim at, should set the
target size of the balloon.  Once qemu tells it how many pages are
actually being used for virtual devices, libxl can take them into
account when calculating that size.

We could, in theory, expose all this information in xenstore such that
*either* libxl or qemu would be able to calculate max_pages based on the
numbers that were written there.  And that would work if we could
enforce a lock-step between the toolstack and qemu, as we can between
Xen and the toolstack.  But I think setting anything like this in
stone is a really bad idea, which unfortunately excludes the idea of
putting it in qemu.

 -George


* Re: QEMU bumping memory bug analysis
  2015-06-08 12:11       ` Ian Campbell
  2015-06-08 13:22         ` Stefano Stabellini
@ 2015-06-08 14:53         ` Konrad Rzeszutek Wilk
  2015-06-08 15:20           ` George Dunlap
  1 sibling, 1 reply; 38+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-08 14:53 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, Stefano Stabellini, George Dunlap, Andrew Cooper,
	Ian Jackson, dslutz, xen-devel

On Mon, Jun 08, 2015 at 01:11:38PM +0100, Ian Campbell wrote:
> On Mon, 2015-06-08 at 12:40 +0100, Stefano Stabellini wrote:
> 
> > > > I disagree that libxl should be the arbitrator of a property that is
> > > > stored, maintained and enforced by Xen. Xen should be the arbitrator.
> > > 
> > > That's not what "arbitrator" means, an arbitrator decides what the value
> > > should be, but that doesn't necessarily imply that it either stores,
> > > maintains nor enforces that value. 
> > 
> > The way you describe it, it looks like some kind of host wide memory
> > policy manager,
> 
> Not necessarily.
> 
> We need to make a distinction between the entity which enforces
> maxpages, that (or those) which decide on what maxpages should be and
> tells Xen.
> 
> Xen is the enforcer. It is also therefore by necessity the holder of the
> canonical value for the maxpages, since it has to be.
> 
> Xen cannot be the second, it simply doesn't have the necessary
> information to make a decision about a domain's memory limit, it just
> blindly does whatever it is told by the last thing to tell it something.
> 
> There are lots of ways the desired value of maxpages can be decided on,
> given a bunch of entities all with some level of responsibility for the
> maxpages of a given domain:
> 
>      1. One single entity which decides on the limit and tells Xen. I
>         think this is the xapi model (not squeezed/balloond, they just
>         ask xapi to do things AFAIK, so are a red-herring). The unwary
>         might think "libxl" was in this category, but it isn't since it
>         is in fact multiple instances of the library.
>      2. A group of entities which cooperate as a distributed system such
>         that any one of them can recalculate (from from the shared
>         state, or by intra entity communication) the current desired
>         value of max pages and propagate it to Xen. Prior to QEMU
>         starting to play with the max pages the multiple libxl's on a
>         system fell into this case, coordinating via xenstore
>         transactions and libxl userdata.
>      3. A group of entities which operate in isolation by only ever
>         increasing or descreasing the max pages according to their own
>         requirements, without reference to anyone else. When QEMU
>         entered the fray, and with the various libxl fixes since, you
>         might think we are implementing this model, but we aren't
>         because the hypervisor interface is only "set", not
>         "increment/decrement" and so there is a racy read/modify/write
>         cycle in every entity now.
>      4. Xen exposes multiple "maxpages" variables corresponding to the
>         various entities which have a need to control this (or multiple
>         per entity). Xen adds all of those up to come up with the actual
>         maxpages.
>      5. I thought there wasw one more but I've forgotten what it was.

The claim hypercall kind of fits in this category. It sticks a stake
in the ground so that the 'memory' available for a guest will still be
available while QEMU, libxl, etc. go up and down.

And when libxl is done it unclaims the memory (which hopefully at this
point has all been given to the guest).
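
Roughly, from memory (I believe claiming zero pages is what cancels an
outstanding claim; build_and_populate stands in for the real
domain-building steps):

    #include <xenctrl.h>

    /* Placeholder for the real domain-building / populate steps. */
    extern int build_and_populate(xc_interface *xch, uint32_t domid,
                                  unsigned long nr_pages);

    static int build_with_claim(xc_interface *xch, uint32_t domid,
                                unsigned long nr_pages)
    {
        /* Stake the claim up front: fails if the host cannot honour it. */
        int rc = xc_domain_claim_pages(xch, domid, nr_pages);
        if (rc)
            return rc;

        rc = build_and_populate(xch, domid, nr_pages);

        /* Drop whatever is left of the claim; pages already allocated
         * stay with the guest. */
        xc_domain_claim_pages(xch, domid, 0);
        return rc;
    }
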
> 
> We used to have #2, we now have a broken version of #3 (broken because
> it is racy due to the xen interface used).
> 
> Fixing #2 would be a case of adding a qmp command to allow libxl to ask
> qmp what its extra memory needs are at any given point (start or day,
> post hotplug event etc, or even of every memset).
> 
> Fixing #3 would be a case of adding a new hypercall interface to allow
> the amount of memory to be changed by a delta instead of just set and
> having everyone use it. You'd need to decide on migration whether to
> propagate the current at stop and copy time and having everyone migrate
> their requirements vs having everyone reregister their requirements on
> the far end.
> 
> #4 is a new hypercall interface, which everyone would have to use. I
> happen to think this is a pretty silly design.
> 
> > the same way as other memory management tools were never accepted into
> > xl/libxl but kept to separate daemons, like xenballoond or squeezed.
> 
> I think squeezed and friends are a red-herring here, since they are
> either a client of one or more entities or in the more distributed ones
> are just one party among many.
> 
> > Let's step away from this specific issue for a second. If it is not an
> > host way policy manager but a per-VM layer on top of libxc, what value
> > is this indirection actually adding?
> 
> Given the raciness of libxc -- it is adding correctness.
> 
> > > What locking is there around QEMU's read/modify/write of the maxmem
> > > value? What happens if someone else also modifies maxmem at the
> > > same time?
> > 
> > It only happens at start of day before the domain is unpaused. At the
> > time I couldn't come up with a scenario where it would be an issue,
> > unless the admin is purposely trying to shut himself in the foot.
> 
> I don't think there is anything preventing a call to the toolstack to
> set the amount of memory for a domain right after the domid exists.
> 
> In a system with something like balloond or squeezed I can well imagine
> them doing something when a new domain was started resulting in opening
> of this race condition. (even if something is "ask libxl" not "call
> xc_setmaxpages").
> 
> Ian.
> 
> 


* Re: QEMU bumping memory bug analysis
  2015-06-08 14:20           ` George Dunlap
@ 2015-06-08 15:01             ` Don Slutz
  2015-06-08 15:37               ` George Dunlap
  0 siblings, 1 reply; 38+ messages in thread
From: Don Slutz @ 2015-06-08 15:01 UTC (permalink / raw)
  To: George Dunlap, Stefano Stabellini, Ian Campbell
  Cc: Ian Jackson, Andrew Cooper, Wei Liu, xen-devel@lists.xen.org

On 06/08/15 10:20, George Dunlap wrote:
> On 06/08/2015 02:22 PM, Stefano Stabellini wrote:
>>>      3. A group of entities which operate in isolation by only ever
>>>         increasing or decreasing the max pages according to their own
>>>         requirements, without reference to anyone else. When QEMU
>>>         entered the fray, and with the various libxl fixes since, you
>>>         might think we are implementing this model, but we aren't
>>>         because the hypervisor interface is only "set", not
>>>         "increment/decrement" and so there is a racy read/modify/write
>>>         cycle in every entity now.
>>
>> I don't think this is true: QEMU only sets maxmem at domain creation,
>> before "xl create" even returns. I think that is safe. We had an email
>> exchange at the time; I explained this behaviour and the general opinion
>> was that it was acceptable. I don't understand why it is not anymore.
> 
> Well for one, nobody on the hypervisor side seems to have been brought
> in -- I definitely would have objected, and it sounds like AndyC would
> have objected too.
> 
> I think we need to go back one level further.
> 
> So first, let's make a distinction between *pages*, which are actual
> host RAM assigned to the guest and put in its p2m table, and *memory*,
> which is a virtualization construct (i.e., virtual RAM and video memory
> for virtual graphics cards).
> 
> The hypervisor only cares about *pages*.  It allocates pages to a
> domain, it puts them in the p2m.  That's all it knows.
> 
> The purpose of max_pages in the hypervisor is to make sure that no guest
> can allocate more host memory (pages) than it is allowed to have.
> 
> How many pages is a particular guest allowed to have?
> 
> Well pages are used for a number of purposes:
> * To implement virtual RAM in the guest
> * To implement video ram for virtual devices in qemu
> * To implement virtual ROMs
> * For magic "shared pages" used behind-the-scenes (not visible to the
> guest)
> 
> (Feel free to add anything I missed.)
> 
> max_pages in the hypervisor must be set to the sum of all the pages the
> domain is allowed to have.
> 
> So the first point is this: Xen doesn't have a clue about any of those.
>  It doesn't know how much virtual RAM a guest has, how much video RAM it
> has, how many virtual ROMs, how many magic shared pages, or anything.
> All it knows are what pages are in the p2m table.
> 
> So although Xen certainly *enforces* max_pages, it is (at the moment) in
> no position to *decide* what max_pages should be.
> 
> At the moment, in fact, nobody is.  There is no single place that has a
> clear picture into how virtual RAM, guest devices, guest ROMs, and
> "magic pages" convert into actual number of pages.  I think that's a bug.
> 
> And at the moment, pages in the p2m are allocated by a number of entities:
> * In the libxc domain builder.
> * In the guest balloon driver
> * And now, in qemu, to allocate extra memory for virtual ROMs.

This is not correct.  QEMU and hvmloader both allocate pages for their
use.  LIBXL_MAXMEM_CONSTANT allows QEMU and hvmloader to allocate some
pages.  The QEMU change only comes into play after LIBXL_MAXMEM_CONSTANT
has been reached.
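
Roughly, the start-of-day clamp has this shape (a simplified sketch, not the
literal libxl code -- the exact formula varies by version, and for HVM further
slack such as LIBXL_HVM_EXTRA_MEMORY is also involved):

    #include <xenctrl.h>

    #define LIBXL_MAXMEM_CONSTANT 1024  /* the 1MiB of slack, in KiB */

    /* target_memkb is what is left of "memory=" once video RAM has been
     * taken out; QEMU and hvmloader live off the extra constant. */
    static int set_initial_maxmem(xc_interface *xch, uint32_t domid,
                                  uint64_t target_memkb)
    {
        return xc_domain_setmaxmem(xch, domid,
                                   target_memkb + LIBXL_MAXMEM_CONSTANT);
    }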

> 
> Did I miss anything?
> 
> For the first two, it's libxl that sets maxmem, based on its calculation
> of the size of virtual RAM plus various other bits that will be needed.
>  Having qemu *also* set maxmem was always the wrong thing to do, IMHO.
> 

It does it for all 3 (4?) because it adds LIBXL_MAXMEM_CONSTANT.

> In theory, from the interface perspective, what libxl promises to
> provide is virtual RAM.  When you say "memory=8192" in a domain config,
> that means (or should mean) 8192MiB of virtual RAM, exclusive of video
> RAM, virtual ROMs, and magic pages.  Then when you say "xl mem-set
> 4096", it should again be aiming at giving the VM the equivalent of
> 4096MiB of virtual RAM, exclusive of video RAM, &c &c.


Not what is currently done.  Virtual video RAM is subtracted from "memory=".

> 
> We already have the problem that the balloon driver at the moment
> doesn't actually know how big the guest RAM is, but is being told
> to make a balloon exactly big enough to bring the total RAM down to a
> specific target.
> 
> I think we do need to have some place in the middle that actually knows
> how much memory is actually needed for the different sub-systems, so it
> can calculate and set maxmem appropriately.  libxl is the obvious place.

Maybe.  So you want libxl to know the details of balloon overhead?  How
about the different sizes of all possible Option ROMs in all QEMU
versions?  What about hvmloader's usage of memory?

> 
> What about this:
> * Libxl has a maximum amount of RAM that qemu is *allowed* to use to set
> up virtual ROMs, video ram for virtual devices, &c
> * At start-of-day, it sets maxpages to PAGES(virtual RAM)+PAGES(magic) +
> max_qemu_pages
> * Qemu allocates as many pages as it needs for option ROMS, and writes
> the amount that it actually did use into a special node in xenstore.
> * When the domain is unpaused, libxl will set maxpages to PAGES(virtual
> RAM) + PAGES(magic) + actual_qemu_pages that it gets from xenstore.
> 

I think this does match what Wei Liu said:

On 06/05/15 15:06, Wei Liu wrote:
> On Fri, Jun 05, 2015 at 06:13:01PM +0100, Stefano Stabellini wrote:
>> On Fri, 5 Jun 2015, Ian Campbell wrote:
>>> On Fri, 2015-06-05 at 17:43 +0100, Wei Liu wrote:
...
>>>> 5. Add a user configurable field in current libxl JSON structure to
>>>>    record how much more memory this domain needs. Admin is required to
>>>>    fill in that value manually. In the mean time we revert the
>>>>    change in QEMU and declare QEMU with that change buggy.
>>>>
>>>> No response to this so far. But in fact I consider this the most viable
>>>> solution.
>>>
>>> I initially thought that this was just #4 in a silly hat and was
>>> therefore no more acceptable than that.
>>>
>>> But actually I think you are suggesting that users should have to
>>> manually request additional RAM for option roms via some new interface
>>> and that the old thing in qemu should be deprecated and removed?
>>>
>>> How would a user know what value to use here? Just "a bigger one till it
>>> works"? That's, well, not super...
>>
>> This should not be a user configurable field. In fact it only depends on
>> the QEMU version in use.
>
> That field is generic so that we can use it to add some extra pages to
> domain. Using it to cover QEMU option roms would be one use case.  It's
> not very nice, but it's straight-forward.
>
> Wei.

> I think also that probably libxl, rather than setting a target amount of
> memory the balloon driver is supposed to aim at, should set the target
> size of the balloon.  Once qemu tells it how many pages are actually
> being used for virtual devices,
> 
> We could, in theory, expose all this information in xenstore such that
> *either* libxl or qemu would be able to calculate max_pages based on the
> numbers that were written there.  And that would work if we could
> enforce a lock-step between the toolstack and qemu, as we can between
> Xen and the toolstack.  But I think setting anything like this in stone
> is a really bad idea, which unfortunately excludes the idea of putting
> it in qemu.
> 

   -Don Slutz

>  -George
> 


* Re: QEMU bumping memory bug analysis
  2015-06-08 14:53         ` Konrad Rzeszutek Wilk
@ 2015-06-08 15:20           ` George Dunlap
  2015-06-08 15:42             ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 38+ messages in thread
From: George Dunlap @ 2015-06-08 15:20 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Ian Campbell
  Cc: Wei Liu, Stefano Stabellini, Andrew Cooper, Ian Jackson, dslutz,
	xen-devel

On 06/08/2015 03:53 PM, Konrad Rzeszutek Wilk wrote:
> On Mon, Jun 08, 2015 at 01:11:38PM +0100, Ian Campbell wrote:
>> On Mon, 2015-06-08 at 12:40 +0100, Stefano Stabellini wrote:
>>
>>>>> I disagree that libxl should be the arbitrator of a property that is
>>>>> stored, maintained and enforced by Xen. Xen should be the arbitrator.
>>>>
>>>> That's not what "arbitrator" means: an arbitrator decides what the value
>>>> should be, but that doesn't necessarily imply that it either stores,
>>>> maintains or enforces that value.
>>>
>>> The way you describe it, it looks like some kind of host wide memory
>>> policy manager,
>>
>> Not necessarily.
>>
>> We need to make a distinction between the entity which enforces
>> maxpages and that (or those) which decide on what maxpages should be and
>> tell Xen.
>>
>> Xen is the enforcer. It is also therefore by necessity the holder of the
>> canonical value for the maxpages, since it has to be.
>>
>> Xen cannot be the second, it simply doesn't have the necessary
>> information to make a decision about a domain's memory limit, it just
>> blindly does whatever it is told by the last thing to tell it something.
>>
>> There are lots of ways the desired value of maxpages can be decided on,
>> given a bunch of entities all with some level of responsibility for the
>> maxpages of a given domain:
>>
>>      1. One single entity which decides on the limit and tells Xen. I
>>         think this is the xapi model (not squeezed/balloond, they just
>>         ask xapi to do things AFAIK, so are a red-herring). The unwary
>>         might think "libxl" was in this category, but it isn't since it
>>         is in fact multiple instances of the library.
>>      2. A group of entities which cooperate as a distributed system such
>>         that any one of them can recalculate (from the shared
>>         state, or by intra entity communication) the current desired
>>         value of max pages and propagate it to Xen. Prior to QEMU
>>         starting to play with the max pages the multiple libxl's on a
>>         system fell into this case, coordinating via xenstore
>>         transactions and libxl userdata.
>>      3. A group of entities which operate in isolation by only ever
>>         increasing or decreasing the max pages according to their own
>>         requirements, without reference to anyone else. When QEMU
>>         entered the fray, and with the various libxl fixes since, you
>>         might think we are implementing this model, but we aren't
>>         because the hypervisor interface is only "set", not
>>         "increment/decrement" and so there is a racy read/modify/write
>>         cycle in every entity now.
>>      4. Xen exposes multiple "maxpages" variables corresponding to the
>>         various entities which have a need to control this (or multiple
>>         per entity). Xen adds all of those up to come up with the actual
>>         maxpages.
>>      5. I thought there was one more but I've forgotten what it was.
> 
> The claim hypercall kind of fits in this category. It sticks a stake in the
> ground so that the 'memory' available for a guest will still be available when
> QEMU, libxl, etc. go up and down.
> 
> And when libxl is done it unclaims the memory (which hopefully at this
> point has all been given to the guest).

Right -- in an ideal world maxpages and claim would be set at the same
time, so that you wouldn't have to worry (for instance) about qemu
failing because although it was *allowed* to have more memory to
allocate ROMs, it wasn't able to due to lack of available host memory.

But deciding exactly how big either maxpages or the claim needs to be
involves either having a central place where all the values are stored
and calculated (1, 2, or 4), or having each of them just increment or
decrement the total based on their own need (3), with no way of
accounting for who asked for what.

 -George


* Re: QEMU bumping memory bug analysis
  2015-06-08 15:01             ` Don Slutz
@ 2015-06-08 15:37               ` George Dunlap
  2015-06-08 16:06                 ` Don Slutz
  2015-06-09 10:14                 ` Stefano Stabellini
  0 siblings, 2 replies; 38+ messages in thread
From: George Dunlap @ 2015-06-08 15:37 UTC (permalink / raw)
  To: Don Slutz, Stefano Stabellini, Ian Campbell
  Cc: Ian Jackson, Andrew Cooper, Wei Liu, xen-devel@lists.xen.org

On 06/08/2015 04:01 PM, Don Slutz wrote:
> On 06/08/15 10:20, George Dunlap wrote:
>> And at the moment, pages in the p2m are allocated by a number of entities:
>> * In the libxc domain builder.
>> * In the guest balloon driver
>> * And now, in qemu, to allocate extra memory for virtual ROMs.
> 
> This is not correct.  QEMU and hvmloader both allocate pages for their
> use.  LIBXL_MAXMEM_CONSTANT allows QEMU and hvmloader to allocate some
> pages.  The QEMU change only comes into play after LIBXL_MAXMEM_CONSTANT
> has been reached.

Thanks -- so the correct statement here is (in time order):

Pages in the p2m are allocated by a number of entities:
* In the libxc domain builder
* In qemu
* In hvmloader
* In the guest balloon driver

>> For the first two, it's libxl that sets maxmem, based on its calculation
>> of the size of virtual RAM plus various other bits that will be needed.
>>  Having qemu *also* set maxmem was always the wrong thing to do, IMHO.
>>
> 
> It does it for all 3 (4?) because it adds LIBXL_MAXMEM_CONSTANT.

So the correct statement is:

In the past, libxl has set maxmem for all of those, based on its
calculation of virtual RAM plus various other bits that might be needed
(including pages needed by qemu or hvmloader).

The change as of qemu $WHATEVER is that now qemu also sets it if it
finds that libxl didn't give it enough "slack".  That was always the
wrong thing to do, IMHO.

>> In theory, from the interface perspective, what libxl promises to
>> provide is virtual RAM.  When you say "memory=8192" in a domain config,
>> that means (or should mean) 8192MiB of virtual RAM, exclusive of video
>> RAM, virtual ROMs, and magic pages.  Then when you say "xl mem-set
>> 4096", it should again be aiming at giving the VM the equivalent of
>> 4096MiB of virtual RAM, exclusive of video RAM, &c &c.
> 
> 
> Not what is currently done.  Virtual video RAM is subtracted from "memory=".

Right.

After I sent this, it occurred to me that there were two sensible
interpretations of "memory=".  The first is, "This is how much virtual
RAM to give the guest.  Please allocate non-RAM pages in addition to
this."  The second is, "This is the total amount of host RAM I want the
guest to use.  Please take non-RAM pages from this total amount."

In reality we apparently do neither of these. :-)

I think both break the "principle of least surprise" in different ways,
but I suspect that admins on the whole would rather have the second
interpretation, as I think it makes their lives a bit easier.
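
For concreteness (made-up numbers): with memory=4096 and 16MiB of video RAM,
the first interpretation would give the guest 4096MiB of virtual RAM and have
the host spend 4096MiB plus the 16MiB of video RAM plus the ROM/magic slack;
the second would cap the host at 4096MiB total, leaving the guest with roughly
4096 minus 16 minus that slack as virtual RAM.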

>> We already have the problem that the balloon driver at the moment
>> doesn't actually know how big the guest RAM is, but is being told
>> to make a balloon exactly big enough to bring the total RAM down to a
>> specific target.
>>
>> I think we do need to have some place in the middle that actually knows
>> how much memory is actually needed for the different sub-systems, so it
>> can calculate and set maxmem appropriately.  libxl is the obvious place.
> 
> Maybe.  So you want libxl to know the details of balloon overhead?  How
> about the different sizes of all possible Option ROMs in all QEMU
> versions?  What about hvmloader's usage of memory?

I'm not sure what you mean by "balloon overhead", but if you mean "guest
pages wasted keeping track of pages which have been ballooned out", then
no, that's not what I mean.  Neither libxl nor the balloon driver keep
track of that at the moment.

I think that qemu needs to tell libxl how much memory it is using for
all of its needs -- including option ROMs.  (See my example below.)  For
older qemus we can just make some assumptions like we always have.

I do think it would make sense to have the hvmloader amount listed
somewhere explicitly.  I'm not sure how often hvmloader may need to
change the amount it uses for itself.

>> What about this:
>> * Libxl has a maximum amount of RAM that qemu is *allowed* to use to set
>> up virtual ROMs, video ram for virtual devices, &c
>> * At start-of-day, it sets maxpages to PAGES(virtual RAM)+PAGES(magic) +
>> max_qemu_pages
>> * Qemu allocates as many pages as it needs for option ROMS, and writes
>> the amount that it actually did use into a special node in xenstore.
>> * When the domain is unpaused, libxl will set maxpages to PAGES(virtual
>> RAM) + PAGES(magic) + actual_qemu_pages that it gets from xenstore.
>>
> 
> I think this does match What Wei Liu said:

The suggestion you quote below is that the *user* should have to put in
some number in the config file, not that qemu should write the number
into xenstore.

The key distinction of my suggestion was to set maxpages purposely high,
wait for qemu to use what it needs, then to reduce it down to what was
needed.
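
As a rough sketch of that handshake (the xenstore node and helper names here
are invented for illustration; real patches would need to settle the path,
permissions and transaction details):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <xenctrl.h>
    #include <xenstore.h>

    /* QEMU side: once option ROMs, vram etc. are set up, record how many
     * extra pages were actually allocated.  The node name is made up. */
    static void report_qemu_pages(struct xs_handle *xsh, int domid,
                                  unsigned long pages)
    {
        char path[64], val[32];
        snprintf(path, sizeof(path),
                 "/local/domain/%d/memory/qemu-pages", domid);
        snprintf(val, sizeof(val), "%lu", pages);
        xs_write(xsh, XBT_NULL, path, val, strlen(val));
    }

    /* libxl side, at unpause: clamp maxpages back down from the generous
     * start-of-day value to what was actually needed (4KiB pages). */
    static int clamp_maxpages(xc_interface *xch, struct xs_handle *xsh,
                              int domid, uint64_t ram_kb, uint64_t magic_kb)
    {
        char path[64];
        unsigned int len;
        snprintf(path, sizeof(path),
                 "/local/domain/%d/memory/qemu-pages", domid);
        char *v = xs_read(xsh, XBT_NULL, path, &len);
        unsigned long qemu_pages = v ? strtoul(v, NULL, 10) : 0;
        free(v);
        return xc_domain_setmaxmem(xch, domid,
                                   ram_kb + magic_kb + qemu_pages * 4);
    }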

 -George


* Re: QEMU bumping memory bug analysis
  2015-06-08 15:20           ` George Dunlap
@ 2015-06-08 15:42             ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 38+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-06-08 15:42 UTC (permalink / raw)
  To: George Dunlap
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Andrew Cooper,
	Ian Jackson, dslutz, xen-devel

On Mon, Jun 08, 2015 at 04:20:48PM +0100, George Dunlap wrote:
> On 06/08/2015 03:53 PM, Konrad Rzeszutek Wilk wrote:
> > On Mon, Jun 08, 2015 at 01:11:38PM +0100, Ian Campbell wrote:
> >> On Mon, 2015-06-08 at 12:40 +0100, Stefano Stabellini wrote:
> >>
> >>>>> I disagree that libxl should be the arbitrator of a property that is
> >>>>> stored, maintained and enforced by Xen. Xen should be the arbitrator.
> >>>>
> >>>> That's not what "arbitrator" means: an arbitrator decides what the value
> >>>> should be, but that doesn't necessarily imply that it either stores,
> >>>> maintains or enforces that value.
> >>>
> >>> The way you describe it, it looks like some kind of host wide memory
> >>> policy manager,
> >>
> >> Not necessarily.
> >>
> >> We need to make a distinction between the entity which enforces
> >> maxpages and that (or those) which decide on what maxpages should be and
> >> tell Xen.
> >>
> >> Xen is the enforcer. It is also therefore by necessity the holder of the
> >> canonical value for the maxpages, since it has to be.
> >>
> >> Xen cannot be the second, it simply doesn't have the necessary
> >> information to make a decision about a domain's memory limit, it just
> >> blindly does whatever it is told by the last thing to tell it something.
> >>
> >> There are lots of ways the desired value of maxpages can be decided on,
> >> given a bunch of entities all with some level of responsibility for the
> >> maxpages of a given domain:
> >>
> >>      1. One single entity which decides on the limit and tells Xen. I
> >>         think this is the xapi model (not squeezed/balloond, they just
> >>         ask xapi to do things AFAIK, so are a red-herring). The unwary
> >>         might think "libxl" was in this category, but it isn't since it
> >>         is in fact multiple instances of the library.
> >>      2. A group of entities which cooperate as a distributed system such
> >>         that any one of them can recalculate (from the shared
> >>         state, or by intra entity communication) the current desired
> >>         value of max pages and propagate it to Xen. Prior to QEMU
> >>         starting to play with the max pages the multiple libxl's on a
> >>         system fell into this case, coordinating via xenstore
> >>         transactions and libxl userdata.
> >>      3. A group of entities which operate in isolation by only ever
> >>         increasing or decreasing the max pages according to their own
> >>         requirements, without reference to anyone else. When QEMU
> >>         entered the fray, and with the various libxl fixes since, you
> >>         might think we are implementing this model, but we aren't
> >>         because the hypervisor interface is only "set", not
> >>         "increment/decrement" and so there is a racy read/modify/write
> >>         cycle in every entity now.
> >>      4. Xen exposes multiple "maxpages" variables corresponding to the
> >>         various entities which have a need to control this (or multiple
> >>         per entity). Xen adds all of those up to come up with the actual
> >>         maxpages.
> >>      5. I thought there was one more but I've forgotten what it was.
> > 
> > The claim hypercall kind of fits in this category. It sticks a stake in the
> > ground so that the 'memory' available for a guest will still be available when
> > QEMU, libxl, etc. go up and down.
> > 
> > And when libxl is done it unclaims the memory (which hopefully at this
> > point has all been given to the guest).
> 
> Right -- in an ideal world maxpages and claim would be set at the same
> time, so that you wouldn't have to worry (for instance) about qemu
> failing because although it was *allowed* to have more memory to
> allocate ROMs, it wasn't able to due to lack of available host memory.
> 
> But deciding exactly how big either maxpages or the claim needs to be
> involves either having a central place where all the values are stored
> and calculated (1, 2, or 4), or having each of them just increment or
> decrement the total based on their own need (3), with no way of
> accounting for who asked for what.


Right. I was just pointing out that with the claim call you have a
semi-atomic mechanism to deal with the amount of memory you need - so you
don't have to worry about the race aspect (another guest taking away
the memory you want to use).

So what is the next step? (I concur BTW with your assessment - just not
sure which group [1, 2, or 4 vs 3] is the right direction).


* Re: QEMU bumping memory bug analysis
  2015-06-08 15:37               ` George Dunlap
@ 2015-06-08 16:06                 ` Don Slutz
  2015-06-09 10:00                   ` George Dunlap
  2015-06-09 10:14                 ` Stefano Stabellini
  1 sibling, 1 reply; 38+ messages in thread
From: Don Slutz @ 2015-06-08 16:06 UTC (permalink / raw)
  To: George Dunlap, Stefano Stabellini, Ian Campbell
  Cc: Ian Jackson, Andrew Cooper, Wei Liu, xen-devel@lists.xen.org

On 06/08/15 11:37, George Dunlap wrote:
> On 06/08/2015 04:01 PM, Don Slutz wrote:
>> On 06/08/15 10:20, George Dunlap wrote:
>>> And at the moment, pages in the p2m are allocated by a number of entities:
>>> * In the libxc domain builder.
>>> * In the guest balloon driver
>>> * And now, in qemu, to allocate extra memory for virtual ROMs.
>>
>> This is not correct.  QEMU and hvmloader both allocate pages for their
>> use.  LIBXL_MAXMEM_CONSTANT allows QEMU and hvmloader to allocate some
>> pages.  The QEMU change only comes into play after LIBXL_MAXMEM_CONSTANT
>> has been reached.
> 
> Thanks -- so the correct statement here is (in time order):
> 
> Pages in the p2m are allocated by a number of entities:
> * In the libxc domain builder
> * In qemu
> * In hvmloader
> * In the guest balloon driver
> 

That is my understanding.  As Ian C pointed out there is a file:

docs/misc/libxl_memory.txt

That attempts to talk about this.

>>> For the first two, it's libxl that sets maxmem, based on its calculation
>>> of the size of virtual RAM plus various other bits that will be needed.
>>>  Having qemu *also* set maxmem was always the wrong thing to do, IMHO.
>>>
>>
>> It does it for all 3 (4?) because it adds LIBXL_MAXMEM_CONSTANT.
> 
> So the correct statement is:
> 
> In the past, libxl has set maxmem for all of those, based on its
> calculation of virtual RAM plus various other bits that might be needed
> (including pages needed by qemu or hvmloader).
> 
> The change as of qemu $WHATEVER is that now qemu also sets it if it
> finds that libxl didn't give it enough "slack".  That was always the
> wrong thing to do, IMHO.
> 

Ok.

>>> In theory, from the interface perspective, what libxl promises to
>>> provide is virtual RAM.  When you say "memory=8192" in a domain config,
>>> that means (or should mean) 8192MiB of virtual RAM, exclusive of video
>>> RAM, virtual ROMs, and magic pages.  Then when you say "xl mem-set
>>> 4096", it should again be aiming at giving the VM the equivalent of
>>> 4096MiB of virtual RAM, exclusive of video RAM, &c &c.
>>
>>
>> Not what is currently done.  Virtual video RAM is subtracted from "memory=".
> 
> Right.
> 
> After I sent this, it occurred to me that there were two sensible
> interpretations of "memory=".  The first is, "This is how much virtual
> RAM to give the guest.  Please allocate non-RAM pages in addition to
> this."  The second is, "This is the total amount of host RAM I want the
> guest to use.  Please take non-RAM pages from this total amount."
> 
> In reality we apparently do neither of these. :-)
> 
> I think both break the "principle of least surprise" in different ways,
> but I suspect that admins on the whole would rather have the second
> interpretation, as I think it makes their lives a bit easier.
> 

Before I knew as much about this as I currently do, I had assumed that the
second interpretation was what libxl did.  Normally video RAM is the
largest amount, and so the smaller deltas (LIBXL_MAXMEM_CONSTANT, 1MiB,
and LIBXL_HVM_EXTRA_MEMORY, 2MiB) just were not noticed.

There is also shadow memory, which needs to be in the above.

>>> We already have the problem that the balloon driver at the moment
>>> doesn't actually know how big the guest RAM is, but is being told
>>> to make a balloon exactly big enough to bring the total RAM down to a
>>> specific target.
>>>
>>> I think we do need to have some place in the middle that actually knows
>>> how much memory is actually needed for the different sub-systems, so it
>>> can calculate and set maxmem appropriately.  libxl is the obvious place.
>>
>> Maybe.  So you want libxl to know the details of balloon overhead?  How
>> about the different sizes of all possible Option ROMs in all QEMU
>> versions?  What about hvmloader's usage of memory?
> 
> I'm not sure what you mean by "balloon overhead", but if you mean "guest
> pages wasted keeping track of pages which have been ballooned out", then
> no, that's not what I mean.  Neither libxl nor the balloon driver keep
> track of that at the moment.
> 

I was trying to refer to:

NOTE: Because of the way ballooning works, the guest has to allocate
memory to keep track of maxmem pages, regardless of how much memory it
actually has available to it.  A guest with maxmem=262144 and
memory=8096 will report significantly less memory available for use than
a system with maxmem=8096 memory=8096 due to the memory overhead of
having to track the unused pages.

(from xl.cfg man page).
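
(For scale, with made-up but plausible numbers: if the bookkeeping costs on
the order of 64 bytes per 4KiB page, roughly what a Linux struct page takes,
then maxmem=262144 means about 67 million pages to track, i.e. around 4GiB of
guest memory consumed before the guest does anything useful with it.)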

> I think that qemu needs to tell libxl how much memory it is using for
> all of its needs -- including option ROMs.  (See my example below.)  For
> older qemus we can just make some assumptions like we always have.
> 

I am happy with this.  Note: I think libxl could determine this number
now without QEMU changes.  However it does depend on no other thread
changing a "staring" domain's memory while libxl is calculating this.

> I do think it would make sense to have the hvmloader amount listed
> somewhere explicitly.  I'm not sure how often hvmloader may need to
> change the amount it uses for itself.
> 

hvmloader uses yet a different method.  If
xc_domain_populate_physmap_exact() fails, it reduces guest RAM (if my
memory is correct).
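
Illustration of that pattern only (written against the libxc wrapper for
readability -- hvmloader runs inside the guest and issues the memory_op
hypercall directly, so its real code looks different):

    #include <xenctrl.h>

    /* Try to populate a fresh page at gpfn; if the maxmem headroom is
     * exhausted, give up a page of guest RAM instead and tell the
     * caller to reuse an existing RAM page. */
    static int populate_or_steal(xc_interface *xch, uint32_t domid,
                                 xen_pfn_t gpfn,
                                 unsigned long *guest_ram_pages)
    {
        if (xc_domain_populate_physmap_exact(xch, domid, 1 /* extent */,
                                             0 /* order */, 0 /* flags */,
                                             &gpfn) == 0)
            return 0;            /* got a fresh page within maxmem */

        (*guest_ram_pages)--;    /* shrink reported guest RAM by one page */
        return 1;                /* caller must relocate a RAM page */
    }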

>>> What about this:
>>> * Libxl has a maximum amount of RAM that qemu is *allowed* to use to set
>>> up virtual ROMs, video ram for virtual devices, &c
>>> * At start-of-day, it sets maxpages to PAGES(virtual RAM)+PAGES(magic) +
>>> max_qemu_pages
>>> * Qemu allocates as many pages as it needs for option ROMS, and writes
>>> the amount that it actually did use into a special node in xenstore.
>>> * When the domain is unpaused, libxl will set maxpages to PAGES(virtual
>>> RAM) + PAGES(magic) + actual_qemu_pages that it gets from xenstore.
>>>
>>
>> I think this does match what Wei Liu said:
> 
> The suggestion you quote below is that the *user* should have to put in
> some number in the config file, not that qemu should write the number
> into xenstore.
> 
> The key distinction of my suggestion was to set maxpages purposely high,
> wait for qemu to use what it needs, then to reduce it down to what was
> needed.
> 

Sorry, I did not get that.

   -Don Slutz

>  -George
> 


* Re: QEMU bumping memory bug analysis
  2015-06-08 16:06                 ` Don Slutz
@ 2015-06-09 10:00                   ` George Dunlap
  2015-06-09 10:17                     ` Wei Liu
  0 siblings, 1 reply; 38+ messages in thread
From: George Dunlap @ 2015-06-09 10:00 UTC (permalink / raw)
  To: Don Slutz, Stefano Stabellini, Ian Campbell
  Cc: Ian Jackson, Andrew Cooper, Wei Liu, xen-devel@lists.xen.org

On 06/08/2015 05:06 PM, Don Slutz wrote:
> On 06/08/15 11:37, George Dunlap wrote: 
>>>> We already have the problem that the balloon driver at the moment
>>>> doesn't actually know how big the guest RAM is, but is being told
>>>> to make a balloon exactly big enough to bring the total RAM down to a
>>>> specific target.
>>>>
>>>> I think we do need to have some place in the middle that actually knows
>>>> how much memory is actually needed for the different sub-systems, so it
>>>> can calculate and set maxmem appropriately.  libxl is the obvious place.
>>>
>>> Maybe.  So you want libxl to know the details of balloon overhead?  How
>>> about the different sizes of all possible Option ROMs in all QEMU
>>> versions?  What about hvmloader's usage of memory?
>>
>> I'm not sure what you mean by "balloon overhead", but if you mean "guest
>> pages wasted keeping track of pages which have been ballooned out", then
>> no, that's not what I mean.  Neither libxl nor the balloon driver keep
>> track of that at the moment.
>>
> 
> I was trying to refer to:
> 
> NOTE: Because of the way ballooning works, the guest has to allocate
> memory to keep track of maxmem pages, regardless of how much memory it
> actually has available to it.  A guest with maxmem=262144 and
> memory=8096 will report significantly less memory available for use than
> a system with maxmem=8096 memory=8096 due to the memory overhead of
> having to track the unused pages.
> 
> (from xl.cfg man page).

Right -- that's what I guessed.  As I said, at the moment the balloon
driver is (in theory) given the target and asked to balloon down such
that the tot_pages the guest uses would be equal to the tot_pages it
would use if it were given that amount of memory.  It's got nothing to
do with what a user process sees from inside the guest (which is what
this note is referring to).

>> I think that qemu needs to tell libxl how much memory it is using for
>> all of its needs -- including option ROMs.  (See my example below.)  For
>> older qemus we can just make some assumptions like we always have.
>>
> 
> I am happy with this.  Note: I think libxl could determine this number
> now without QEMU changes.  However it does depend on no other thread
> changing a "staring" domain's memory while libxl is calculating this.
> 
>> I do think it would make sense to have the hvmloader amount listed
>> somewhere explicitly.  I'm not sure how often hvmloader may need to
>> change the amount it uses for itself.
>>
> 
> hvmloader uses yet a different method.  If
> xc_domain_populate_physmap_exact() fails, it reduces guest RAM (if my
> memory is correct).

This makes me wonder if we could make qemu do the same thing.

 -George


* Re: QEMU bumping memory bug analysis
  2015-06-08 15:37               ` George Dunlap
  2015-06-08 16:06                 ` Don Slutz
@ 2015-06-09 10:14                 ` Stefano Stabellini
  2015-06-09 11:20                   ` George Dunlap
  2015-06-09 12:45                   ` Ian Campbell
  1 sibling, 2 replies; 38+ messages in thread
From: Stefano Stabellini @ 2015-06-09 10:14 UTC (permalink / raw)
  To: George Dunlap
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Andrew Cooper,
	Ian Jackson, Don Slutz, xen-devel@lists.xen.org

On Mon, 8 Jun 2015, George Dunlap wrote:
> I think that qemu needs to tell libxl how much memory it is using for
> all of its needs -- including option ROMs.  (See my example below.)  For
> older qemus we can just make some assumptions like we always have.

I'll just note that I have still not seen any evidence that telling
libxl would improve things. It seems to me that we are just adding one
more layer of indirection just to solve the migration issue elsewhere.
I haven't seen convincing examples that things would be better by telling
libxl yet, except that we would be able to fix the migration problem.

I don't think that the current solution is inherently racy. If we are
really interested in an honest evaluation of the current solution in
terms of races, I would be happy to do so. When I committed it, I didn't
do it without thinking. I don't think that libxl should be in the middle
of this, because I don't think it has anything to add.

That said, as I am not reading any new arguments and it doesn't look
like the thread is going in the right direction from my POV, I don't
think there is any need to repeat myself again, so I'll step away from
this thread and let you decide what you think is the best way forward.

Of course I'll evaluate any QEMU patches on their merits as always.


* Re: QEMU bumping memory bug analysis
  2015-06-09 10:00                   ` George Dunlap
@ 2015-06-09 10:17                     ` Wei Liu
  0 siblings, 0 replies; 38+ messages in thread
From: Wei Liu @ 2015-06-09 10:17 UTC (permalink / raw)
  To: George Dunlap
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Andrew Cooper,
	Ian Jackson, Don Slutz, xen-devel@lists.xen.org

On Tue, Jun 09, 2015 at 11:00:12AM +0100, George Dunlap wrote:
> >> I think that qemu needs to tell libxl how much memory it is using for
> >> all of its needs -- including option ROMs.  (See my example below.)  For
> >> older qemus we can just make some assumptions like we always have.
> >>
> > 
> > I am happy with this.  Note: I think libxl could determine this number
> > now without QEMU changes.  However it does depend on no other thread
> > changing a "staring" domain's memory while libxl is calculating this.
> > 
> >> I do think it would make sense to have the hvmloader amount listed
> >> somewhere explicitly.  I'm not sure how often hvmloader may need to
> >> change the amount it uses for itself.
> >>
> > 
> > hvmloader uses yet a different method.  If
> > xc_domain_populate_physmap_exact() fails, it reduces guest RAM (if my
> > memory is correct).
> 
> This makes me wonder if we could make qemu do the same thing.
> 

I don't think so, because QEMU can't adjust hvm_info.

>  -George


* Re: QEMU bumping memory bug analysis
  2015-06-09 10:14                 ` Stefano Stabellini
@ 2015-06-09 11:20                   ` George Dunlap
  2015-06-16 16:44                     ` Stefano Stabellini
  2015-06-09 12:45                   ` Ian Campbell
  1 sibling, 1 reply; 38+ messages in thread
From: George Dunlap @ 2015-06-09 11:20 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Wei Liu, Ian Campbell, Andrew Cooper, Ian Jackson, Don Slutz,
	xen-devel@lists.xen.org

On 06/09/2015 11:14 AM, Stefano Stabellini wrote:
> On Mon, 8 Jun 2015, George Dunlap wrote:
>> I think that qemu needs to tell libxl how much memory it is using for
>> all of its needs -- including option ROMs.  (See my example below.)  For
>> older qemus we can just make some assumptions like we always have.
> 
> I'll just note that I have still not seen any evidence that telling
> libxl would improve things. It seems to me that we are just adding one
> more layer of indirection just to solve the migration issue elsewhere.
> I haven't seen convincing examples that things would be better by telling
> libxl yet, except that we would be able to fix the migration problem.

So from a migration perspective, I agree there's not much difference
between "libxl reads max_pages from Xen, inserts it into the migration
stream, and sets it again on the other side" and "libxl has global
knowledge of what max_pages should be, inserts it into the migration
stream, and sets it again on the other side".

Improvements of having it in libxl:

1. libxl is not an external interface; we can change and improve things
as we go along.  Whatever we put in qemu we have to support indefinitely.

2. We can begin to solve the "where is the memory" mess that is the
current state of things.

3. In particular, we could conceivably actually change the interface to
be consistent, so that "memory" means "the maximum amount of memory used
by the guest including all overhead", rather than the random who knows
what we have now.  This will be impossible if we do it in qemu.

Having more than one place change max_pages would be poor design even if
we were still packaging everything together in the same release.  Having
an external entity with which we must be backwards-compatible change
max_pages is just a bad idea and always was.

> When I committed it, I didn't
> do it without thinking. I don't think that libxl should be in the middle
> of this, because I don't think it has anything to add.

Just to be clear, nobody thinks you did (at least, as far as I know).
The problem is that all of us are so specialized: some people work
almost entirely within qemu, others in the kernel, others in the
toolstack, others in Xen.  It's just natural for people who work mainly
within one component to look at the problem from a particular
perspective.  But it causes exactly this sort of problem, where the left
hand doesn't know what the right hand is doing, and the overall system
is really uncoordinated.

 -George


* Re: QEMU bumping memory bug analysis
  2015-06-09 10:14                 ` Stefano Stabellini
  2015-06-09 11:20                   ` George Dunlap
@ 2015-06-09 12:45                   ` Ian Campbell
  2015-06-17 13:35                     ` Stefano Stabellini
  1 sibling, 1 reply; 38+ messages in thread
From: Ian Campbell @ 2015-06-09 12:45 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Wei Liu, George Dunlap, Andrew Cooper, Ian Jackson, Don Slutz,
	xen-devel@lists.xen.org

On Tue, 2015-06-09 at 11:14 +0100, Stefano Stabellini wrote:
> I don't think that the current solution is inherently racy. If we are
> really interested in an honest evaluation of the current solution in
> terms of races, I would be happy to do so.

Consider a domain with 1M of RAM (==target and maxmem for the sake of
argument) and two simultaneous calls to libxl_set_memory_target, both
with relative=0 and enforce=1, one with target=3 and the other with
target=5.

target=3 call			target=5 call

				get ptr.max_memkb, gets 1

get ptr.max_memkb, gets 1
start transaction
do xenstore stuff, seeing target=1, setting target=3
memorykb = ptr.max_memkb - current_target_memkb + new_target_memkb;
memorykb = 1 - 1 + 3
xc_setmaxmem(3)
transaction commit (success)

Now target=maxmem=3

				start transaction
				do xenstore stuff, seeing target=3, setting target=5
				memorykb = ptr.max_memkb - current_target_memkb + new_target_memkb;
				memorykb = 1 - 3 + 5
				xc_setmaxmem(3)
				transaction commit (success)

Now target=5, maxmem=3.

The obvious fix of moving the get ptr.max_memkb inside the transaction
fails in a different way in the case where the first transaction commit
fails and is retried after the second one, I think.

Prior to 0c029c4da21 "libxl_set_memory_target: retain the same maxmem
offset on top of the current target" this issue didn't exist because
memorykb was just memorykb = new_target_memkb + videoram.

BTW, I noticed some other (unrelated) dubious stuff in there while
looking at this; in particular, the setmaxmem is not rolled back if
set_pod_target fails. Also the transiently "wrong" maxmem between the
setmaxmem and a failed transaction getting redone might possibly give
rise to some interesting cases, especially if anything else fails the
second time around the loop.

> When I committed it, I didn't do it without thinking.

Likewise I thought about it when reviewing, but it seems we have both
missed an important aspect.

Ian.


* Re: QEMU bumping memory bug analysis
  2015-06-09 11:20                   ` George Dunlap
@ 2015-06-16 16:44                     ` Stefano Stabellini
  0 siblings, 0 replies; 38+ messages in thread
From: Stefano Stabellini @ 2015-06-16 16:44 UTC (permalink / raw)
  To: George Dunlap
  Cc: Wei Liu, Ian Campbell, Stefano Stabellini, Andrew Cooper,
	Ian Jackson, Don Slutz, xen-devel@lists.xen.org

On Tue, 9 Jun 2015, George Dunlap wrote:
> On 06/09/2015 11:14 AM, Stefano Stabellini wrote:
> > On Mon, 8 Jun 2015, George Dunlap wrote:
> >> I think that qemu needs to tell libxl how much memory it is using for
> >> all of its needs -- including option ROMs.  (See my example below.)  For
> >> older qemus we can just make some assumptions like we always have.
> > 
> > I'll just note that I have still not seen any evidence that telling
> > libxl would improve things. It seems to me that we are just adding one
> > more layer of indirection just to solve the migration issue elsewhere.
> > I haven't seen convincing examples that things would be better by telling
> > libxl yet, except that we would be able to fix the migration problem.
> 
> So from a migration perspective, I agree there's not much difference
> between "libxl reads max_pages from Xen, inserts it into the migration
> stream, and sets it again on the other side" and "libxl has global
> knowledge of what max_pages should be, inserts it into the migration
> stream, and sets it again on the other side".
> 
> Improvements of having it in libxl:
> 
> 1. libxl is not an external interface; we can change and improve things
> as we go along.

Actually libxl is an external interface; did you mean libxc?


> Whatever we put in qemu we have to support indefinitely.

The maxmem call is already in QEMU :-)


> 2. We can begin to solve the "where is the memory" mess that is the
> current state of things.
>
> 3. In particular, we could conceivably actually change the interface to
> be consistent, so that "memory" means "the maximum amount of memory used
> by the guest including all overhead", rather than the random who knows
> what we have now. 

That's it, I agree.  But the issue is that we don't have a solution or
even the design of a solution. If we had, and the plan included
reverting the QEMU commit, then fine, let's do it.


> This will be impossible if we do it in qemu.
>
> Having more than one place change max_pages would be poor design even if
> we were still packaging everything together in the same release.  Having
> an external entity with which we must be backwards-compatible change
> max_pages is just a bad idea and always was.

I disagree; if it was such an obvious design mistake, more people would
have picked up on it immediately when the patch was posted the first
time around.  This is blurry terrain at best.


> > When I committed it, I didn't
> > do it without thinking. I don't think that libxl should be in the middle
> > of this, because I don't think it has anything to add.
> 
> Just to be clear, nobody thinks you did (at least, as far as I know).
> The problem is that all of us are so specialized: some people work
> almost entirely within qemu, others in the kernel, others in the
> toolstack, others in Xen.  It's just natural for people who work mainly
> within one component to look at the problem from a particular
> perspective.  But it causes exactly this sort of problem, where the left
> hand doesn't know what the right hand is doing, and the overall system
> is really uncoordinated.

That is true.


* Re: QEMU bumping memory bug analysis
  2015-06-09 12:45                   ` Ian Campbell
@ 2015-06-17 13:35                     ` Stefano Stabellini
  0 siblings, 0 replies; 38+ messages in thread
From: Stefano Stabellini @ 2015-06-17 13:35 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Wei Liu, Stefano Stabellini, George Dunlap, Andrew Cooper,
	Ian Jackson, Don Slutz, xen-devel@lists.xen.org

On Tue, 9 Jun 2015, Ian Campbell wrote:
> On Tue, 2015-06-09 at 11:14 +0100, Stefano Stabellini wrote:
> > I don't think that the current solution is inherently racy. If we are
> > really interested in an honest evaluation of the current solution in
> > terms of races, I would be happy to do so.
> 
> Consider a domain with 1M of RAM (==target and maxmem for the sake of
> argument) and two simultaneous calls to libxl_set_memory_target, both
> with relative=0 and enforce=1, one with target=3 and the other with
> target=5.
> 
> target=3 call			target=5 call
> 
> 				get ptr.max_memkb, gets 1
> 
> get ptr.max_memkb, gets 1
> start transaction
> do xenstore stuff, seeing target=1, setting target=3
> memorykb = ptr.max_memkb - current_target_memkb + new_target_memkb;
> memorykb = 1 - 1 + 3
> xc_setmaxmem(3)
> transaction commit (success)
> 
> Now target=maxmem=3
> 
> 				start transaction
> 				do xenstore stuff, seeing target=3, setting target=5
> 				memorykb = ptr.max_memkb - current_target_memkb + new_target_memkb;
> 				memorykb = 1 - 3 + 5
> 				xc_setmaxmem(3)
> 				transaction commit (success)
> 
> Now target=5, maxmem=3.

Yes, I see the issue. Pretty nasty!


> The obvious fix of moving the get ptr.max_memkb inside the transaction
> fails in a different way in the case where the first transaction commit
> fails and is retried after the second one, I think.

The underlying problem is that xc_domain_setmaxmem only takes absolute
values.  Retrieving the old maxmem value and setting the new one cannot
be done atomically, so they need to be protected by a lock or a
transaction.  You are right that moving get ptr.max_memkb inside the
transaction would only fix the problem if we also properly rolled back
the old maxmem value in case of failures.  But I don't think it is
actually possible because it could race against other
libxl_set_memory_target calls.  Am I wrong?  This would also be a
problem before 0c029c4da21.
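
Boiled down, the problematic shape is just two separate libxc operations with
nothing tying them together (a sketch, not the literal libxl code; assume
ctx, xch, domid and the two target values are already in scope):

    libxl_dominfo info;

    /* step 1: read the current limit */
    libxl_domain_info(ctx, &info, domid);

    uint64_t memorykb = info.max_memkb
                        - current_target_memkb + new_target_memkb;

    /* <-- a concurrent libxl_set_memory_target can run both steps in
     *     this window, so one caller's arithmetic is based on a value
     *     the other is about to overwrite */

    /* step 2: write the new limit back as an absolute value */
    xc_domain_setmaxmem(xch, domid, memorykb);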


> Prior to 0c029c4da21 "libxl_set_memory_target: retain the same maxmem
> offset on top of the current target" this issue didn't exist because
> memorykb was just memorykb = new_target_memkb + videoram.

It is true that reverting 0c029c4da21 would improve the situation.
However I think that you found a problem here that goes beyond
0c029c4da21 :-(


> BTW, I noticed some other (unrelated) dubious stuff in there while
> looking at this, in particular the setmaxmem is not rolled back if
> set_pod_target fails.

Yes, you are right! It is also not rolled back in case the xenstore
transaction fails with errno != EAGAIN, even before 0c029c4da21.


> Also the transiently "wrong" maxmem between the
> setmaxmem and a failed transaction getting redone might possibly give
> rise to some interesting cases, especially if anything else fails the
> second time around the loop.

I am thinking that the whole idea of an "enforce" option was bad to
begin with. Not only should we revert 0c029c4da21, but should we actually
just get rid of "enforce" altogether? I am suggesting it because
I don't think that reverting 0c029c4da21 would fix all the issues.

In a pre-0c029c4da21 world:

target=3 enforce=1

start transaction
do xenstore stuff, seeing target=1, setting target=3
memorykb = new_target_memkb + videoram;
xc_setmaxmem(3+videoram)
transaction commit (fail errno=ESOMETHING)

Now target=1, maxmem=3+videoram

Unless we say that we don't care about leaving the system in a
consistent state in case of failures != EAGAIN.

