QEMU-Devel Archive mirror
From: Avihai Horon <avihaih@nvidia.com>
To: Peter Xu <peterx@redhat.com>,
	"Maciej S. Szmigiero" <mail@maciej.szmigiero.name>
Cc: "Daniel P. Berrangé" <berrange@redhat.com>,
	"Fabiano Rosas" <farosas@suse.de>,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Cédric Le Goater" <clg@redhat.com>,
	"Eric Blake" <eblake@redhat.com>,
	"Markus Armbruster" <armbru@redhat.com>,
	"Joao Martins" <joao.m.martins@oracle.com>,
	qemu-devel@nongnu.org
Subject: Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer
Date: Tue, 7 May 2024 11:41:05 +0300
Message-ID: <7e855ccb-d5af-490f-94ab-61141fa30ba8@nvidia.com>
In-Reply-To: <ZjkZyP9Ty0TpTCTx@x1n>


On 06/05/2024 20:56, Peter Xu wrote:
>
> On Mon, May 06, 2024 at 06:26:46PM +0200, Maciej S. Szmigiero wrote:
>> On 29.04.2024 17:09, Peter Xu wrote:
>>> On Fri, Apr 26, 2024 at 07:34:09PM +0200, Maciej S. Szmigiero wrote:
>>>> On 24.04.2024 00:35, Peter Xu wrote:
>>>>> On Wed, Apr 24, 2024 at 12:25:08AM +0200, Maciej S. Szmigiero wrote:
>>>>>> On 24.04.2024 00:20, Peter Xu wrote:
>>>>>>> On Tue, Apr 23, 2024 at 06:15:35PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>> On 19.04.2024 17:31, Peter Xu wrote:
>>>>>>>>> On Fri, Apr 19, 2024 at 11:07:21AM +0100, Daniel P. Berrangé wrote:
>>>>>>>>>> On Thu, Apr 18, 2024 at 04:02:49PM -0400, Peter Xu wrote:
>>>>>>>>>>> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>>>>>> I think one of the reasons for these results is that mixed (RAM + device
>>>>>>>>>>>> state) multifd channels participate in the RAM sync process
>>>>>>>>>>>> (MULTIFD_FLAG_SYNC) whereas dedicated device state channels don't.
>>>>>>>>>>> Firstly, I'm wondering whether we can have better names for these new
>>>>>>>>>>> hooks.  Currently (commenting only on the async* stuff):
>>>>>>>>>>>
>>>>>>>>>>>        - complete_precopy_async
>>>>>>>>>>>        - complete_precopy
>>>>>>>>>>>        - complete_precopy_async_wait
>>>>>>>>>>>
>>>>>>>>>>> But perhaps better:
>>>>>>>>>>>
>>>>>>>>>>>        - complete_precopy_begin
>>>>>>>>>>>        - complete_precopy
>>>>>>>>>>>        - complete_precopy_end
>>>>>>>>>>>
>>>>>>>>>>> ?
>>>>>>>>>>>
>>>>>>>>>>> As I don't see why the device must do anything asynchronous in such a hook.
>>>>>>>>>>> To me it's more like you're splitting one process into multiple phases, so
>>>>>>>>>>> begin/end sounds more generic.
>>>>>>>>>>>
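[Illustration: a minimal C sketch of how the begin/end split suggested above could look in SaveVMHandlers. The existing save_live_complete_precopy hook is real QEMU; the *_begin/*_end members, which simply apply the proposed names with the usual save_live_ prefix, and their signatures are assumptions for illustration, not this series' actual code.]

    /* from migration/qemu-file.h */
    typedef struct QEMUFile QEMUFile;

    typedef struct SaveVMHandlers {
        /* ... existing handlers elided ... */

        /* Existing hook: runs during switchover with the VM stopped and
         * writes the device's remaining state into the main stream. */
        int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);

        /* Proposed split: "begin" kicks off extra completion work (e.g.
         * queueing buffers onto other threads/channels) before the
         * main-stream pass; "end" waits for that work to finish after it. */
        int (*save_live_complete_precopy_begin)(QEMUFile *f, void *opaque);
        int (*save_live_complete_precopy_end)(QEMUFile *f, void *opaque);
    } SaveVMHandlers;
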
>>>>>>>>>>> Then, with that in mind, IIUC we can already split ram_save_complete()
>>>>>>>>>>> into more than one phase too. For example, I would be curious whether the
>>>>>>>>>>> performance goes back to normal if we offload multifd_send_sync_main() into
>>>>>>>>>>> complete_precopy_end(), because we really only need one shot of that, and I
>>>>>>>>>>> am quite surprised it already greatly affects VFIO dumping its own state.
>>>>>>>>>>>
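[Illustration: a C sketch of the single-shot sync idea mentioned above; the hook name and its wiring are hypothetical, and while multifd_send_sync_main() is QEMU's existing multifd sync routine, its exact signature may differ between versions.]

    /* Hypothetical RAM "end" hook: issue one multifd sync after every
     * device has queued its completion data, instead of calling it from
     * ram_save_complete() while VFIO is still dumping its own state. */
    static int ram_save_complete_precopy_end(QEMUFile *f, void *opaque)
    {
        return multifd_send_sync_main();  /* exact signature assumed */
    }
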
>>>>>>>>>>> I would go one step further, along the lines of what Dan was asking: have you thought
>>>>>>>>>>> about dumping VFIO states via multifd even during iterations?  Would that
>>>>>>>>>>> help even more than this series (which IIUC only helps during the blackout
>>>>>>>>>>> phase)?
>>>>>>>>>> To dump during RAM iteration, the VFIO device will need to have
>>>>>>>>>> dirty tracking and iterate on its state, because the guest CPUs
>>>>>>>>>> will still be running, potentially changing VFIO state. That seems
>>>>>>>>>> impractical in the general case.
>>>>>>>>> We already do such iterations in vfio_save_iterate()?
>>>>>>>>>
>>>>>>>>> My understanding is that the recent VFIO work is based on the fact that the VFIO
>>>>>>>>> device can more or less track device state changes (besides being able to
>>>>>>>>> save/load full states).  E.g. I still remember that in our QE tests some old
>>>>>>>>> devices reported many more dirty pages than expected during the iterations,
>>>>>>>>> back when we were looking into an issue where a huge amount of dirty pages
>>>>>>>>> was reported.  But newer models seem to have fixed that and report much less.
>>>>>>>>>
>>>>>>>>> That issue was about GPUs, not NICs, though, and IIUC a major portion of such
>>>>>>>>> tracking used to be for GPU vRAM.  So maybe I was mixing these up, and
>>>>>>>>> maybe they work differently.
>>>>>>>> The device which this series was developed against (Mellanox ConnectX-7)
>>>>>>>> is already transferring its live state before the VM gets stopped (via
>>>>>>>> save_live_iterate SaveVMHandler).
>>>>>>>>
>>>>>>>> It's just that in addition to the live state it has more than 400 MiB
>>>>>>>> of state that cannot be transferred while the VM is still running.
>>>>>>>> And that fact hurts a lot with respect to the migration downtime.
>>>>>>>>
>>>>>>>> AFAIK it's a very similar story for (some) GPUs.
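[Illustration: back-of-envelope numbers, assuming the ~400 MiB figure above: 400 MiB is roughly 3.4 Gbit, which takes about 220 ms to transfer at 15 Gbps over a single channel, but only about 35 ms if spread across channels that together sustain 100 Gbps; that difference is essentially the downtime reduction this series is after.]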
>>>>>>> So during the iteration phase VFIO cannot yet leverage the multifd channels
>>>>>>> with this series, am I right?
>>>>>> That's right.
>>>>>>
>>>>>>> Is it possible to extend that use case too?
>>>>>> I guess so, but since this phase (iteration while the VM is still
>>>>>> running) doesn't impact downtime it is much less critical.
>>>>> But it affects the bandwidth: e.g. even with multifd enabled, the device
>>>>> iteration data will still bottleneck at ~15Gbps in the best case on a common
>>>>> system setup, even if the hosts are 100Gbps direct connected.  Would that
>>>>> be a concern in the future too, or is it a known problem that won't be fixed
>>>>> anyway?
>>>> I think any improvements to the migration performance are good, even if
>>>> they don't impact downtime.
>>>>
>>>> It's just that this patch set focuses on the downtime phase as the more
>>>> critical thing.
>>>>
>>>> After this gets improved there's no reason not to look at improving
>>>> performance of the VM live phase too, if it brings sensible gains.
>>>>
>>>>> I remember Avihai used to have a plan to look into similar issues; I hope
>>>>> this is exactly what he is looking for.  Otherwise changing the migration
>>>>> protocol from time to time is cumbersome; we always need to provide a flag
>>>>> to make sure old systems migrate the old way and new systems run the new
>>>>> way, and for such a relatively major change I'd want to double check
>>>>> how far we are from supporting offloading VFIO iteration data to multifd.
>>>> The device state transfer is indicated by a new flag in the multifd
>>>> header (MULTIFD_FLAG_DEVICE_STATE).
>>>>
>>>> If we are to use multifd channels for VM live phase transfers these
>>>> could simply re-use the same flag type.
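[Illustration: a C sketch of how a receiver might branch on such a flag. MULTIFD_FLAG_DEVICE_STATE is the flag named above and MultiFDRecvParams is the existing per-channel receive state, while the dispatch function and both helpers are hypothetical, not this series' actual code.]

    /* Called on a multifd channel after the packet header has been parsed
     * into p->flags. */
    static int multifd_recv_dispatch(MultiFDRecvParams *p, Error **errp)
    {
        if (p->flags & MULTIFD_FLAG_DEVICE_STATE) {
            /* Opaque device state: hand the buffer to the owning device's
             * load path instead of the RAM code. */
            return multifd_device_state_recv(p, errp);  /* hypothetical */
        }
        /* Otherwise this is ordinary multifd RAM data. */
        return multifd_ram_recv(p, errp);               /* hypothetical */
    }

The same dispatch shape would work for live-phase packets, which is part of why reusing the flag type is attractive.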
>>> Right, and that's also the major purpose of my request to consider both
>>> issues.
>>>
>>> If supporting the iteration phase is easy on top of this, I am wondering whether
>>> we should do this in one shot.  The problem is that even if the flag type can be
>>> reused, old/new qemu binaries may not be compatible and may not migrate
>>> well when:
>>>
>>>     - The old qemu only supports the downtime optimizations
>>>     - The new qemu supports both downtime + iteration optimizations
>> I think the situation here will be the same as with any new flag
>> affecting the migration wire protocol - if the old version of QEMU
>> doesn't support that flag then it has to be kept at its backward-compatible
>> setting for migration to succeed.
>>
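[Illustration: a small C sketch of the kind of gating this implies on the send side; migrate_multifd() exists in QEMU today, whereas the device-state switch and both helpers below are hypothetical stand-ins for whatever flag or parameter ends up controlling the feature.]

    if (migrate_multifd() && multifd_device_state_enabled()) {  /* hypothetical switch */
        /* New wire format: queue the device's buffers onto multifd channels. */
        ret = vfio_save_complete_via_multifd(vbasedev);          /* hypothetical */
    } else {
        /* Backward-compatible path: write everything into the main stream,
         * so an old destination can still parse it. */
        ret = vfio_save_complete_via_main_stream(f, vbasedev);   /* hypothetical */
    }
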
>>> IIUC, at least the device threads are currently created only at the end of
>>> migration, when switching over, for the downtime-only optimization (aka this
>>> series).  That means it won't be compatible with a new QEMU, as the
>>> threads there will need to be created before iteration starts in order to
>>> take iteration data.  So I believe we'll need yet another flag to tune that
>>> behavior, one for each optimization (downtime vs. data during
>>> iterations).  If they work mostly similarly, I want to avoid two flags.
>>> It'll be chaos for users to see such similar flags, and they'll be pretty
>>> confusing.
>> The VFIO loading threads are created from vfio_load_setup(), which is
>> called at the very beginning of the migration, so they should already be
>> there.
>>
>> However, they aren't currently prepared to receive VM live phase data.
>>
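[Illustration: a sketch of spawning the load thread from vfio_load_setup(), which is why the threads are already around early in migration. vfio_load_setup() and qemu_thread_create() exist in QEMU; the load_thread field, the thread function and the omitted error handling are simplified assumptions, not this series' actual code.]

    static int vfio_load_setup(QEMUFile *f, void *opaque)
    {
        VFIODevice *vbasedev = opaque;
        VFIOMigration *migration = vbasedev->migration;

        /* ... existing setup (switching the device to RESUMING, etc.) ... */

        /* The consumer thread exists from the very start of migration, so
         * in principle it could also be fed live-phase data, not just the
         * switchover-time buffers. */
        qemu_thread_create(&migration->load_thread, "vfio-load",
                           vfio_load_state_thread /* hypothetical */,
                           vbasedev, QEMU_THREAD_JOINABLE);
        return 0;
    }
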
>>> If possible, I wish we could spend some time looking into that, given they're so
>>> close, and if it's low-hanging fruit on top of this series, maybe we
>>> can consider doing that in one shot.
>> I'm still trying to figure out the complete explanation of why dedicated
>> device state channels improve downtime, as there were a bunch of holidays
>> here last week.
> No rush.  I am not sure whether it'll reduce downtime, but it may improve
> total migration time when multiple devices are used.
>
>> I will have a look later at what it would take to add VM live phase multifd
>> device state transfer support, and also at how invasive it would be, as I
>> think it's better to keep the number of code conflicts in a patch set
>> to a manageable size since that reduces the chance of accidentally
>> introducing regressions when forward-porting the patch set to git master.
> Yes, it makes sense.  It'll be good to look one step further in this case,
> then:
>
>    - If it's easy to add support, then we do it in one batch, or
>
>    - If it's not easy to add support, but we can find a compatible way so
>      that the ABI can stay transparent when adding it later, that'll also be nice, or
>
>    - If we have a solid clue that it should be a major separate piece of work,
>      and we definitely need a new flag, then we at least know we should simply
>      split the effort due to that complexity.
>
> The hope is that option (1)/(2) would work out.
>
> I hope Avihai can also chime in here (or please reach out to him), because I
> remember he used to consider proposing such a complete solution, but maybe I
> just misunderstood.  I suppose there's no harm in checking with him.

Yes, I was working on parallel VFIO migration, but with a different
approach (not over multifd), which I'm not sure is relevant to this series.
I've been skimming your discussions but haven't had the time to go
over Maciej's series thoroughly.
I will try to find time to do that next week and see if I can help.

Thanks.




Thread overview: 54+ messages
2024-04-16 14:42 [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 01/26] migration: Add x-channel-header pseudo-capability Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 02/26] migration: Add migration channel header send/receive Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 03/26] migration: Add send/receive header for main channel Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 04/26] multifd: change multifd_new_send_channel_create() param type Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 05/26] migration: Add a DestroyNotify parameter to socket_send_channel_create() Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 06/26] multifd: pass MFDSendChannelConnectData when connecting sending socket Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 07/26] migration/postcopy: pass PostcopyPChannelConnectData when connecting sending preempt socket Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 08/26] migration: Allow passing migration header in migration channel creation Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 09/26] migration: Add send/receive header for postcopy preempt channel Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 10/26] migration: Add send/receive header for multifd channel Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 11/26] migration/options: Mapped-ram is not channel header compatible Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 12/26] migration: Enable x-channel-header pseudo-capability Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 13/26] vfio/migration: Add save_{iterate, complete_precopy}_started trace events Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 14/26] migration/ram: Add load start trace event Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 15/26] migration/multifd: Zero p->flags before starting filling a packet Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 16/26] migration: Add save_live_complete_precopy_async{, wait} handlers Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 17/26] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 18/26] migration: Add load_finish handler and associated functions Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 19/26] migration: Add x-multifd-channels-device-state parameter Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 20/26] migration: Add MULTIFD_DEVICE_STATE migration channel type Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 21/26] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 22/26] migration/multifd: Convert multifd_send_pages::next_channel to atomic Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 23/26] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
2024-04-29 20:04   ` Peter Xu
2024-05-06 16:25     ` Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 24/26] migration/multifd: Add migration_has_device_state_support() Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 25/26] vfio/migration: Multifd device state transfer support - receive side Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 26/26] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
2024-04-17  8:36 ` [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer Daniel P. Berrangé
2024-04-17 12:11   ` Maciej S. Szmigiero
2024-04-17 16:35     ` Daniel P. Berrangé
2024-04-18  9:50       ` Maciej S. Szmigiero
2024-04-18 10:39         ` Daniel P. Berrangé
2024-04-18 18:14           ` Maciej S. Szmigiero
2024-04-18 20:02             ` Peter Xu
2024-04-19 10:07               ` Daniel P. Berrangé
2024-04-19 15:31                 ` Peter Xu
2024-04-23 16:15                   ` Maciej S. Szmigiero
2024-04-23 22:20                     ` Peter Xu
2024-04-23 22:25                       ` Maciej S. Szmigiero
2024-04-23 22:35                         ` Peter Xu
2024-04-26 17:34                           ` Maciej S. Szmigiero
2024-04-29 15:09                             ` Peter Xu
2024-05-06 16:26                               ` Maciej S. Szmigiero
2024-05-06 17:56                                 ` Peter Xu
2024-05-07  8:41                                   ` Avihai Horon [this message]
2024-05-07 16:13                                     ` Peter Xu
2024-05-07 17:23                                       ` Avihai Horon
2024-04-23 16:14               ` Maciej S. Szmigiero
2024-04-23 22:27                 ` Peter Xu
2024-04-26 17:35                   ` Maciej S. Szmigiero
2024-04-29 20:34                     ` Peter Xu
2024-04-19 10:20             ` Daniel P. Berrangé
