From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:48896)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <pbonzini@redhat.com>) id 1Z5VBo-0003HM-Qs
	for qemu-devel@nongnu.org; Thu, 18 Jun 2015 04:28:50 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <pbonzini@redhat.com>) id 1Z5VBk-00036P-0S
	for qemu-devel@nongnu.org; Thu, 18 Jun 2015 04:28:48 -0400
Received: from mx1.redhat.com ([209.132.183.28]:35916)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <pbonzini@redhat.com>) id 1Z5VBj-00036L-Oy
	for qemu-devel@nongnu.org; Thu, 18 Jun 2015 04:28:43 -0400
Message-ID: <55828133.90400@redhat.com>
Date: Thu, 18 Jun 2015 10:28:35 +0200
From: Paolo Bonzini <pbonzini@redhat.com>
MIME-Version: 1.0
References: <1434450415-11339-1-git-send-email-dgilbert@redhat.com>
	<1434450415-11339-2-git-send-email-dgilbert@redhat.com>
	<F2CBF3009FA73547804AE4C663CAB28E55056F@shsmsx102.ccr.corp.intel.com>
In-Reply-To: <F2CBF3009FA73547804AE4C663CAB28E55056F@shsmsx102.ccr.corp.intel.com>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] [PATCH v7 01/42] Start documenting how postcopy
	works.
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: "Li, Liang Z" <liang.z.li@intel.com>, "Dr. David Alan Gilbert (git)" <dgilbert@redhat.com>, "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Cc: "aarcange@redhat.com" <aarcange@redhat.com>, "yamahata@private.email.ne.jp" <yamahata@private.email.ne.jp>, "quintela@redhat.com" <quintela@redhat.com>, "luis@cs.umu.se" <luis@cs.umu.se>, "amit.shah@redhat.com" <amit.shah@redhat.com>, "david@gibson.dropbear.id.au" <david@gibson.dropbear.id.au>


On 18/06/2015 09:50, Li, Liang Z wrote:
> Do you have any idea or plan for dealing with a failure that happens
> during the postcopy phase?
>
> Losing the guest is too frightening for a cloud provider. We had a
> discussion with Alibaba, and they said that they can't use the postcopy
> feature unless there is a mechanism to get the guest back.

There's no solution to this problem, except for rollback to a previous
snapshot.

To give an idea, one intended use case for postcopy is evacuating a
datacenter within 30 minutes of a tsunami alert.  That's not a
case where you care much about losing guests to network failures.

Why is there no solution?  Let's look at one of the best surveys on
migration,
http://courses.cs.vt.edu/~cs5204/fall05-kafura/Papers/Migration/ProcessMigration.pdf
(warning, 59 pages!):

  [3.2] If only part of the task state is transferred to another node,
  the task can start executing sooner, and the initial migration costs
  are lower.

  [3.4] Fault resilience can be improved in several ways. The impact of
  failures during migration can be reduced by maintaining process state
  on both the source and destination sites until the destination site
  instance is successfully promoted to a regular process and the source
  node is informed about this.

  [3.5] Migration algorithms should avoid linear dependencies on the
  amount of state to be transferred. For example, the eager data
  transfer strategy has costs proportional to the address space size

"Pre"copy means "start copying *before* promoting the destination to be
the primary host" and it has such a linear dependency on the amount of
state to be transferred. "Post"copy means "delay some copying to *after*
promoting the destination to be the primary host".
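
A toy model can make the linear dependency concrete.  The sketch below is
not QEMU's algorithm; the function name, the units (MB and MB/s) and the
default limits are all assumptions chosen for illustration.  Each precopy
round resends whatever the guest dirtied during the previous round, and
the guest is only frozen once the remainder fits in the allowed downtime.

```python
def precopy_time(ram_mb, bandwidth_mb_s, dirty_mb_s,
                 max_downtime_s=0.3, max_rounds=30):
    """Toy precopy model (hypothetical). Returns (total_seconds, converged)."""
    remaining = float(ram_mb)
    total = 0.0
    for _ in range(max_rounds):
        final_pass = remaining / bandwidth_mb_s
        if final_pass <= max_downtime_s:
            # What is left fits in the allowed freeze: stop the guest,
            # send the remainder, promote the destination.
            return total + final_pass, True
        # Send the current remainder while the guest keeps running.
        round_time = remaining / bandwidth_mb_s
        total += round_time
        # Memory dirtied during this round has to be sent again
        # (it can never exceed the guest's RAM size).
        remaining = min(float(ram_mb), dirty_mb_s * round_time)
    # Gave up: the guest dirties memory as fast as it can be sent.
    return total, False
```

In this model the first round alone costs ram_mb / bandwidth_mb_s seconds,
so doubling the guest's RAM roughly doubles the migration time, and a
guest whose dirty rate reaches the link bandwidth never converges.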

So we have:

                           Precopy            Postcopy
   3.2 Performance            - (1)             - (2)
   3.4 Fault resilience       +                 -
   3.5 Scalability            -                 +

      (1) smaller impact, longer freeze time
      (2) larger impact, extremely short freeze time

Postcopy can also limit the length of the non-resilient phase by
starting with a precopy phase and only switching to postcopy after some
time.  Then you have:

                           Precopy        Hybrid      Postcopy
   3.2 Performance            - (1)          + (3)        - (2)
   3.4 Fault resilience       +              -            --
   3.5 Scalability            -              +            +

      (3) intermediate impact, extremely short freeze time

but there is still going to be a phase where migration is not resilient
to network faults.
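
The trade-off can be sketched with another toy model.  Again this is only
an illustration under stated assumptions, not what QEMU does: the
function name and units are made up, the switch to postcopy happens at
the first round boundary once the precopy budget is spent, and in the
postcopy phase every outstanding page is sent exactly once.

```python
def hybrid_windows(ram_mb, bandwidth_mb_s, dirty_mb_s,
                   precopy_budget_s, max_rounds=1000):
    """Toy hybrid model (hypothetical).

    Returns (resilient_seconds, non_resilient_seconds): time spent in
    precopy, where a network fault only aborts the migration, and time
    spent in postcopy, where a network fault loses the guest.
    """
    remaining = float(ram_mb)
    elapsed = 0.0
    rounds = 0
    while elapsed < precopy_budget_s and remaining > 0 and rounds < max_rounds:
        round_time = remaining / bandwidth_mb_s
        elapsed += round_time
        # Memory dirtied during this round is still outstanding.
        remaining = min(float(ram_mb), dirty_mb_s * round_time)
        rounds += 1
    # After the switch the destination runs the guest, so each
    # outstanding page only has to cross the network once.
    return elapsed, remaining / bandwidth_mb_s
```

With a precopy budget of zero this is pure postcopy and the whole
transfer is non-resilient; a longer budget shrinks the non-resilient
window but never removes it unless precopy converges on its own.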

Cloud operators can use a combination of precopy and postcopy.  For
example, I would not use postcopy for mass migration when doing
host updates, but it can be used as a last resort before a scheduled
downtime.

For example, say you're doing a rolling update and you want it complete
by next Sunday.  90% of the guests are shut down by the customers or can
be migrated successfully with precopy.  The others do not converge and
their SLA does not let you throttle them to complete precopy migration.

You then tell your customers that either they shut down and restart their
instances before Saturday 8:00 PM, or they might be shut down forcibly.
Then, for customers who haven't rebooted, you can do postcopy, since you
have alerted them that something might go wrong.  So even though postcopy
would not be a first choice, it can still help cloud operators.
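
As a rough sketch, the policy in this example could be written down as
follows.  The Guest fields and the strategy names are hypothetical and
only mirror the scenario above; a real operator would have its own
inventory and SLA data.

```python
from dataclasses import dataclass


@dataclass
class Guest:
    # All fields are illustrative assumptions, not a real API.
    name: str
    running: bool
    precopy_converges: bool
    sla_allows_throttling: bool
    customer_notified: bool


def choose_strategy(guest):
    """Pick a migration strategy for one guest during a rolling update."""
    if not guest.running:
        return "none"               # shut down by the customer, nothing to move
    if guest.precopy_converges:
        return "precopy"            # resilient, so always the first choice
    if guest.sla_allows_throttling:
        return "throttled-precopy"  # slow the guest down until it converges
    if guest.customer_notified:
        return "postcopy"           # last resort, customer has been warned
    return "wait"                   # notify the customer before going further
```

Postcopy only appears at the bottom of the list, after every resilient
option has been ruled out and the customer knows about the risk.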

Paolo