From: Andrew Cooper
Subject: Re: QEMU bumping memory bug analysis
Date: Mon, 8 Jun 2015 13:14:58 +0100
Message-ID: <55758742.5060602@citrix.com>
References: <20150605164354.GK29102@zion.uk.xensource.com> <20150605181014.GM29102@zion.uk.xensource.com>
To: Stefano Stabellini, Wei Liu
Cc: George Dunlap, Ian Jackson, Ian Campbell, dslutz@verizon.com, xen-devel@lists.xen.org
List-Id: xen-devel@lists.xenproject.org

On 08/06/15 12:39, Stefano Stabellini wrote:
> On Fri, 5 Jun 2015, Wei Liu wrote:
>> On Fri, Jun 05, 2015 at 06:10:17PM +0100, Stefano Stabellini wrote:
>>> On Fri, 5 Jun 2015, Wei Liu wrote:
>>>> Hi all
>>>>
>>>> This bug is now considered a blocker for the 4.6 release.
>>>>
>>>> The premises of the problem remain the same (George's translated
>>>> version):
>>>>
>>>> 1. QEMU may need extra pages from Xen to implement option ROMs, and so at
>>>> the moment it calls set_max_mem() to increase max_pages so that it can
>>>> allocate more pages to the guest. libxl doesn't know what max_pages a
>>>> domain needs prior to qemu start-up.
>>>>
>>>> 2. Libxl doesn't know max_pages even after qemu start-up, because there
>>>> is no mechanism for qemu and libxl to communicate.
>>> I might not know what the right design for the overall solution is, but
>>> I do know that libxl shouldn't have its own state tracking for
>>> max_pages, because max_pages is kept, maintained and enforced by Xen.
>>>
>>> Ian might still remember, but at the beginning of the xl/libxl project,
>>> we had a few simple design principles. One of them was that we should not
>>> have two places where we keep track of the same thing. If Xen keeps
>>> track of something, libxl should avoid it.
>>>
>>> In this specific case, libxl can ask Xen at any time what max_pages is
>>> for the domain, so I don't think that libxl should store it or have its
>>> own tracking for it.
>>>
>>> Even if QEMU called into libxl to change max_pages, I don't think that
>>> libxl should store max_pages anywhere. It is already stored in Xen and
>>> can be retrieved at any time.
>>>
>> I think you're talking about keeping track of that record permanently. I
>> only care about getting the right value at the right time and transferring
>> it to the other end. Getting the value whenever needed is OK.
>>
>>>> 3. QEMU calls xc_domain_setmaxmem to increase max_pages by N pages.
>>>> Those pages are only accounted for in the hypervisor. Libxl
>>>> (currently) does not extract that value from the hypervisor.
>>>>
>>>> Several solutions were proposed:
>>>>
>>>> 1. Add a new record type to the libxc migration stream and call setmaxmem
>>>> in the middle of the xc migration stream.
>>>>
>>>> The main objection is that calling xc_domain_setmaxmem in the middle of
>>>> the xc migration stream is a layering violation. It also prevents us from
>>>> disaggregating domain construction to a less privileged domain.
>>> It seems to me that max_pages is one of the memory properties of the
>>> domain, so it should be saved and restored together with the rest of
>>> memory.
>>>
>> #1 was actually referring to Don's patch. When processing a libxc record
>> in the middle of the stream, we should not alter the size of memory.
>> It's not safe because we don't know whether that record comes early
>> enough, before we exceed the limit.
>>
>> The only safe way of doing it is to mandate that specific record at
>> the beginning of the libxc stream, which might have other implications.
> I see, that makes sense.
>
>
>>>> 2. Use the libxl toolstack save/restore blob to transmit max_pages
>>>> information to the remote end.
>>>>
>>>> This is considered a bodge and has been proven not to work, because
>>>> the toolstack blob restore happens after xc_domain_restore.
>>> Saving and restoring max_pages in libxl seems to me like a layering
>>> violation. I would avoid 2, 3, 4 and 5.
>>>
>> No, not really, if we follow the principle of "libxl is the
>> arbitrator": it needs to have thorough information on every aspect of
>> the domain and set up limits.
> I am not sure I buy this "libxl is the arbitrator" concept. I am not
> seeing libxl as adding much value in this context.
>
>
>>>> 3. Add a libxl layer that wraps the necessary information, taking over
>>>> Andrew's work on libxl migration v2. Having a libxl layer that's not
>>>> part of migration v2 is a waste of effort.
>>>>
>>>> There are several obstacles for libxl migration v2 at the moment. The
>>>> libxl layer in migration v2 still has unresolved issues, and it has
>>>> inter-dependencies with Remus / COLO.
>>>>
>>>> Most importantly, it doesn't inherently solve the problem. It still
>>>> requires the current libxl JSON blob to contain information about max
>>>> pages (or information used to derive max pages).
>>>>
>>>> Andrew, correct me if I'm wrong.
>>>>
>>>> 4. Add a non-user-configurable field to the current libxl JSON structure
>>>> to record max pages information.
>>>>
>>>> This is not desirable. All fields in libxl JSON should be
>>>> user-configurable.
>>>>
>>>> 5. Add a user-configurable field to the current libxl JSON structure to
>>>> record how much more memory this domain needs. The admin is required to
>>>> fill in that value manually. In the meantime we revert the change in
>>>> QEMU and declare QEMU with that change buggy.
>>> QEMU 2.3.0 was released with that change in it, so it is not quite
>>> possible to revert it. Also I think it is the right change for QEMU.
>>>
>> It has security implications. Here is my reply copied from my mail to
>> Ian:
>>
>> I'm considering removing the need for xc_domain_setmaxmem regardless of
>> this bug, because it is going to cause problems for a QEMU upstream
>> stubdom with a strict XSM policy and a deprivileged QEMU (which may not
>> have the privilege to call setmaxmem).
> QEMU running in the stubdom should be able to set the maxmem for its
> target domain, but not for the others.

At the moment, set_max_mem is the only method the toolstack has of putting
a hard upper bound on a domain's memory usage (along with a few others like
the shadow memory size and the PoD cache).

In a disaggregated case, no deprivileged entity should be able to play with
this limit. Being able to do so renders the security moot, as a compromised
stubdom can force a host OOM.

~Andrew
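
As an aside illustrating Stefano's point above that libxl can ask Xen at any
time what max_pages is: below is a minimal sketch, not part of the original
thread, of a privileged caller reading the current limit from the hypervisor
via libxc and raising it, rather than caching the value in the toolstack. It
assumes libxc's xc_domain_getinfo() and xc_domain_setmaxmem(); the domid and
the number of extra pages are placeholder values.

    /* Sketch: read the domain's current max memory limit from Xen and raise
     * it by "extra_pages" 4 KiB pages, instead of tracking it in the
     * toolstack.  Error handling is minimal. */
    #include <stdio.h>
    #include <stdint.h>
    #include <xenctrl.h>

    static int bump_maxmem(xc_interface *xch, uint32_t domid,
                           uint64_t extra_pages)
    {
        xc_dominfo_t info;

        /* Ask Xen for the domain's current state; nothing is cached here. */
        if ( xc_domain_getinfo(xch, domid, 1, &info) != 1 ||
             info.domid != domid )
        {
            fprintf(stderr, "Failed to look up domain %u\n", (unsigned)domid);
            return -1;
        }

        /* info.max_memkb reflects the hypervisor's current limit, in KiB;
         * each extra 4 KiB page adds 4 KiB to it. */
        return xc_domain_setmaxmem(xch, domid,
                                   info.max_memkb + extra_pages * 4);
    }

    int main(void)
    {
        xc_interface *xch = xc_interface_open(NULL, NULL, 0);
        int rc;

        if ( !xch )
            return 1;

        /* Placeholder: allow domain 1 sixteen extra pages (e.g. option ROMs). */
        rc = bump_maxmem(xch, 1, 16);

        xc_interface_close(xch);
        return rc ? 1 : 0;
    }

Nothing in this sketch requires the toolstack to remember max_pages between
calls, which is the property being argued for; the open question in the
thread is who is allowed to make the setmaxmem call at all.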