From: Andrew Cooper
Subject: Re: QEMU bumping memory bug analysis
Date: Mon, 8 Jun 2015 13:14:58 +0100
Message-ID: <55758742.5060602@citrix.com>
References: <20150605164354.GK29102@zion.uk.xensource.com> <20150605181014.GM29102@zion.uk.xensource.com>
To: Stefano Stabellini, Wei Liu
Cc: George Dunlap, Ian Jackson, Ian Campbell, dslutz@verizon.com, xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

On 08/06/15 12:39, Stefano Stabellini wrote:
> On Fri, 5 Jun 2015, Wei Liu wrote:
>> On Fri, Jun 05, 2015 at 06:10:17PM +0100, Stefano Stabellini wrote:
>>> On Fri, 5 Jun 2015, Wei Liu wrote:
>>>> Hi all
>>>>
>>>> This bug is now considered a blocker for the 4.6 release.
>>>>
>>>> The premises of the problem remain the same (George's translated
>>>> version):
>>>>
>>>> 1. QEMU may need extra pages from Xen to implement option ROMs, and so at
>>>> the moment it calls set_max_mem() to increase max_pages so that it can
>>>> allocate more pages to the guest. libxl doesn't know what max_pages a
>>>> domain needs prior to qemu start-up.
>>>>
>>>> 2. Libxl doesn't know max_pages even after qemu start-up, because there
>>>> is no mechanism for qemu and libxl to communicate.
>>> I might not know what the right design for the overall solution is, but
>>> I do know that libxl shouldn't have its own state tracking for
>>> max_pages, because max_pages is kept, maintained and enforced by Xen.
>>>
>>> Ian might still remember, but at the beginning of the xl/libxl project,
>>> we had a few simple design principles. One of them was that we should not
>>> have two places where we keep track of the same thing. If Xen keeps
>>> track of something, libxl should avoid it.
>>>
>>> In this specific case, libxl can ask Xen at any time what max_pages is
>>> for the domain, so I don't think that libxl should store it or have its
>>> own tracking for it.
>>>
>>> Even if QEMU called into libxl to change max_pages, I don't think that
>>> libxl should store max_pages anywhere. It is already stored in Xen and
>>> can be retrieved at any time.
>>>
>> I think you're talking about keeping track of that record permanently. I
>> only care about getting the right value at the right time and transferring
>> it to the other end. Getting the value whenever needed is OK.
>>
>>>> 3. QEMU calls xc_domain_setmaxmem to increase max_pages by N pages.
>>>> Those pages are only accounted for in the hypervisor. Libxl
>>>> (currently) does not extract that value from the hypervisor.
>>>>
>>>> Several solutions were proposed:
>>>>
>>>> 1. Add a new record type to the libxc migration stream and call setmaxmem
>>>> in the middle of the xc migration stream.
>>>>
>>>> The main objection is that calling xc_domain_setmaxmem in the middle of
>>>> the xc migration stream is a layering violation. It also prevents us from
>>>> disaggregating domain construction to a less privileged domain.
>>> It seems to me that max_pages is one of the memory properties of the
>>> domain, so it should be saved and restored together with the rest of
>>> memory.
>>>
>> #1 was actually referring to Don's patch. When processing a libxc record
>> in the middle of the stream, we should not alter the size of memory.
>> It's not safe because we don't know whether that record comes early
>> enough, before we exceed the limit.
>>
>> The only safe way of doing it is to mandate that specific record at
>> the beginning of the libxc stream, which might have other implications.
> I see, that makes sense.
>
>
>>>> 2. Use the libxl toolstack save/restore blob to transmit max_pages
>>>> information to the remote end.
>>>>
>>>> This is considered a bodge and has been proven not to work, because
>>>> the toolstack blob restore happens after xc_domain_restore.
>>> Saving and restoring max_pages in libxl seems to me like a layering
>>> violation. I would avoid 2, 3, 4 and 5.
>>>
>> No, not really, if we follow the principle of "libxl is the
>> arbitrator": it needs to have thorough information on every aspect of
>> the domain and set up limits.
> I am not sure I buy this "libxl is the arbitrator" concept. I am not
> seeing libxl as adding much value in this context.
>
>
>>>> 3. Add a libxl layer that wraps the necessary information, taking over
>>>> Andrew's work on libxl migration v2. Having a libxl layer that's not
>>>> part of migration v2 is a waste of effort.
>>>>
>>>> There are several obstacles for libxl migration v2 at the moment. The
>>>> libxl layer in migration v2 still has unresolved issues, and it has
>>>> inter-dependencies with Remus / COLO.
>>>>
>>>> Most importantly, it doesn't inherently solve the problem. It still
>>>> requires the current libxl JSON blob to contain information about max
>>>> pages (or information used to derive max pages).
>>>>
>>>> Andrew, correct me if I'm wrong.
>>>>
>>>> 4. Add a non-user-configurable field to the current libxl JSON structure
>>>> to record max pages information.
>>>>
>>>> This is not desirable. All fields in libxl JSON should be
>>>> user-configurable.
>>>>
>>>> 5. Add a user-configurable field to the current libxl JSON structure to
>>>> record how much more memory this domain needs. The admin is required to
>>>> fill in that value manually. In the meantime we revert the change in
>>>> QEMU and declare QEMU with that change buggy.
>>> QEMU 2.3.0 was released with that change in it, so it is not quite
>>> possible to revert it. Also I think it is the right change for QEMU.
>>>
>> It has security implications. Here is my reply copied from my mail to
>> Ian:
>>
>> I'm considering removing the need for xc_domain_setmaxmem regardless of
>> this bug, because it is going to cause problems for a QEMU upstream
>> stubdom with a strict XSM policy and a deprivileged QEMU (which may not
>> have the privilege to call setmaxmem).
> QEMU running in the stubdom should be able to set the maxmem for its
> target domain, but not for the others.

At the moment, set_max_mem is the only method the toolstack has of putting
a hard upper bound on a domain's memory usage (along with a few others like
the shadow memory size and the PoD cache).

In a disaggregated case, no deprivileged entity should be able to play with
this limit. Being able to do so renders the security moot, as a compromised
stubdom can force a host OOM.

~Andrew
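
As an aside illustrating Stefano's point above that libxl can ask Xen at any
time what max_pages is: below is a minimal sketch, not part of the original
thread, of a privileged caller reading the current limit from the hypervisor
via libxc and raising it, rather than caching the value in the toolstack. It
assumes libxc's xc_domain_getinfo() and xc_domain_setmaxmem(); the domid and
the number of extra pages are placeholder values.

    /* Sketch: read the domain's current max memory limit from Xen and raise
     * it by "extra_pages" 4 KiB pages, instead of tracking it in the
     * toolstack.  Error handling is minimal. */
    #include <stdio.h>
    #include <stdint.h>
    #include <xenctrl.h>

    static int bump_maxmem(xc_interface *xch, uint32_t domid,
                           uint64_t extra_pages)
    {
        xc_dominfo_t info;

        /* Ask Xen for the domain's current state; nothing is cached here. */
        if ( xc_domain_getinfo(xch, domid, 1, &info) != 1 ||
             info.domid != domid )
        {
            fprintf(stderr, "Failed to look up domain %u\n", (unsigned)domid);
            return -1;
        }

        /* info.max_memkb reflects the hypervisor's current limit, in KiB;
         * each extra 4 KiB page adds 4 KiB to it. */
        return xc_domain_setmaxmem(xch, domid,
                                   info.max_memkb + extra_pages * 4);
    }

    int main(void)
    {
        xc_interface *xch = xc_interface_open(NULL, NULL, 0);
        int rc;

        if ( !xch )
            return 1;

        /* Placeholder: allow domain 1 sixteen extra pages (e.g. option ROMs). */
        rc = bump_maxmem(xch, 1, 16);

        xc_interface_close(xch);
        return rc ? 1 : 0;
    }

Nothing in this sketch requires the toolstack to remember max_pages between
calls, which is the property being argued for; the open question in the
thread is who is allowed to make the setmaxmem call at all.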