From: Paolo Bonzini
Date: Thu, 18 Jun 2015 10:28:35 +0200
Message-ID: <55828133.90400@redhat.com>
References: <1434450415-11339-1-git-send-email-dgilbert@redhat.com>
 <1434450415-11339-2-git-send-email-dgilbert@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v7 01/42] Start documenting how postcopy works.
To: "Li, Liang Z", "Dr. David Alan Gilbert (git)", "qemu-devel@nongnu.org"
Cc: "aarcange@redhat.com", "yamahata@private.email.ne.jp",
 "quintela@redhat.com", "luis@cs.umu.se", "amit.shah@redhat.com",
 "david@gibson.dropbear.id.au"

On 18/06/2015 09:50, Li, Liang Z wrote:
> Do you have any idea or plan to deal with the failure happened during
> the postcopy phase?
>
> Lost the guest is too frightening for a cloud provider, we have a
> discussion with Alibaba, they said that they can't use the postcopy
> feature unless there is a mechanism to find the guest back.

There's no solution to this problem, except for rollback to a previous
snapshot.

To give an idea, an example of an intended use case for postcopy is
datacenter evacuation in 30 minutes after a tsunami alert.  That's not
a case where you care much about losing guests to network failures.

Why is there no solution?
Let's look at one of the best surveys on migration,
http://courses.cs.vt.edu/~cs5204/fall05-kafura/Papers/Migration/ProcessMigration.pdf
(warning, 59 pages!):

   [3.2] If only part of the task state is transferred to another
   node, the task can start executing sooner, and the initial
   migration costs are lower.

   [3.4] Fault resilience can be improved in several ways.  The impact
   of failures during migration can be reduced by maintaining process
   state on both the source and destination sites until the
   destination site instance is successfully promoted to a regular
   process and the source node is informed about this.

   [3.5] Migration algorithms should avoid linear dependencies on the
   amount of state to be transferred.  For example, the eager data
   transfer strategy has costs proportional to the address space size

"Pre"copy means "start copying *before* promoting the destination to be
the primary host" and it has such a linear dependency on the amount of
state to be transferred.

"Post"copy means "delay some copying to *after* promoting the
destination to be the primary host".

So we have:

                           Precopy     Postcopy
   3.2 Performance         - (1)       - (2)
   3.4 Fault resilience    +           -
   3.5 Scalability         -           +

   (1) smaller impact, longer freeze time
   (2) larger impact, extremely short freeze time

Postcopy can also limit the length of the non-resilient phase, by
starting with a precopy phase and only switching to postcopy after some
time.  Then you have:

                           Precopy     Hybrid      Postcopy
   3.2 Performance         - (1)       + (3)       - (2)
   3.4 Fault resilience    +           -           --
   3.5 Scalability         -           +           +

   (3) intermediate impact, extremely short freeze time

but there is still going to be a phase where migration is not resilient
to network faults.

Cloud operators can use a combination of precopy and postcopy.  For
example, I would not use postcopy for mass migration when doing host
updates, but it can be used as a last resort before a scheduled
downtime.

For example, say you're doing a rolling update and you want it complete
by next Sunday.
90% of the guests are shut down by the customers or can be migrated
successfully with precopy.  The others do not converge and their SLA
does not let you throttle them to complete precopy migration.  You then
tell your customers that either they shut down and restart their
instances before Saturday 8:00 PM, or they might be shut down forcibly.
Then for customers who haven't rebooted you can do postcopy---you have
alerted them that something might go wrong.

So even though postcopy would not be a first choice, it can still help
cloud operators.

Paolo
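
For illustration only, the rolling-update policy described above could
be sketched roughly as follows.  This is not QEMU code or any cloud
operator's real tooling: the Guest fields, the plan() helper and the
"dirty rate below link speed" convergence test are all assumptions made
up for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NOTHING = "guest is already shut down, nothing to migrate"
    PRECOPY = "migrate with precopy (resilient to network faults)"
    WAIT = "ask the customer to restart before the deadline"
    POSTCOPY = "last resort: migrate with postcopy"


@dataclass
class Guest:
    # All fields are hypothetical, chosen only for this sketch.
    name: str
    running: bool               # False if the customer shut it down
    dirty_rate_mbps: float      # how fast the guest dirties memory
    sla_allows_throttle: bool   # may we slow the guest down?


def precopy_converges(guest: Guest, link_mbps: float) -> bool:
    # Toy criterion (an assumption): precopy can finish only if pages
    # are sent faster than they are dirtied, or if the SLA lets us
    # throttle the guest until that becomes true.
    return guest.dirty_rate_mbps < link_mbps or guest.sla_allows_throttle


def plan(guest: Guest, link_mbps: float, deadline_passed: bool) -> Action:
    if not guest.running:
        return Action.NOTHING
    if precopy_converges(guest, link_mbps):
        return Action.PRECOPY
    # Precopy will not converge: postcopy only once the customer has
    # been warned and the announced deadline has passed.
    return Action.POSTCOPY if deadline_passed else Action.WAIT


if __name__ == "__main__":
    busy = Guest("busy-db", True, 2000.0, False)
    print(plan(busy, link_mbps=1000.0, deadline_passed=False).name)
    print(plan(busy, link_mbps=1000.0, deadline_passed=True).name)
```

The point of the sketch is only the ordering of the choices: precopy
whenever it can complete, postcopy only for the guests that are left
once the deadline has passed.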