From: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name>
To: Peter Xu <peterx@redhat.com>
Cc: "Daniel P. Berrangé" <berrange@redhat.com>,
"Fabiano Rosas" <farosas@suse.de>,
"Alex Williamson" <alex.williamson@redhat.com>,
"Cédric Le Goater" <clg@redhat.com>,
"Eric Blake" <eblake@redhat.com>,
"Markus Armbruster" <armbru@redhat.com>,
"Avihai Horon" <avihaih@nvidia.com>,
"Joao Martins" <joao.m.martins@oracle.com>,
qemu-devel@nongnu.org
Subject: Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer
Date: Tue, 23 Apr 2024 18:14:18 +0200 [thread overview]
Message-ID: <9fa34f3e-2f65-442b-8872-513565bb335c@maciej.szmigiero.name> (raw)
In-Reply-To: <ZiF8aWVfW7kPuOtn@x1n>
On 18.04.2024 22:02, Peter Xu wrote:
> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
>> On 18.04.2024 12:39, Daniel P. Berrangé wrote:
>>> On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote:
>>>> On 17.04.2024 18:35, Daniel P. Berrangé wrote:
>>>>> On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
>>>>>> On 17.04.2024 10:36, Daniel P. Berrangé wrote:
>>>>>>> On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>> (..)
>>>>>>> That said, the idea of reserving channels specifically for VFIO doesn't
>>>>>>> make a whole lot of sense to me either.
>>>>>>>
>>>>>>> Once we've done the RAM transfer, and are in the switchover phase
>>>>>>> doing device state transfer, all the multifd channels are idle.
>>>>>>> We should just use all those channels to transfer the device state,
>>>>>>> in parallel. Reserving channels just guarantees many idle channels
>>>>>>> during RAM transfer, and further idle channels during vmstate
>>>>>>> transfer.
>>>>>>>
>>>>>>> IMHO it is more flexible to just use all available multifd channel
>>>>>>> resources all the time.
>>>>>>
>>>>>> The reason for having dedicated device state channels is that they
>>>>>> provide lower downtime in my tests.
>>>>>>
>>>>>> With either 15 or 11 mixed multifd channels (no dedicated device state
>>>>>> channels) I get a downtime of about 1250 msec.
>>>>>>
>>>>>> Comparing that with 15 total multifd channels / 4 dedicated device
>>>>>> state channels, which gives a downtime of about 1100 ms, using
>>>>>> dedicated channels yields about a 14% downtime improvement.
>>>>>
>>>>> Hmm, can you clarify /when/ the VFIO vmstate transfer takes
>>>>> place? Is it transferred concurrently with the RAM? I had thought
>>>>> this series still has the RAM transfer iterations running first,
>>>>> and then the VFIO vmstate at the end, simply making use of multifd
>>>>> channels for parallelism of the end phase. Your reply, though,
>>>>> makes me question my interpretation.
>>>>>
>>>>> Let me try to illustrate channel flow in various scenarios, time
>>>>> flowing left to right:
>>>>>
>>>>> 1. serialized RAM, then serialized VM state (ie historical migration)
>>>>>
>>>>> main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |
>>>>>
>>>>>
>>>>> 2. parallel RAM, then serialized VM state (ie today's multifd)
>>>>>
>>>>> main: | Init | | VM state |
>>>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>>
>>>>>
>>>>> 3. parallel RAM, then parallel VM state
>>>>>
>>>>> main: | Init | | VM state |
>>>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd4: | VFIO VM state |
>>>>> multifd5: | VFIO VM state |
>>>>>
>>>>>
>>>>> 4. parallel RAM and VFIO VM state, then remaining VM state
>>>>>
>>>>> main: | Init | | VM state |
>>>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd4: | VFIO VM state |
>>>>> multifd5: | VFIO VM state |
>>>>>
>>>>>
>>>>> I thought this series was implementing approx (3), but are you actually
>>>>> implementing (4), or something else entirely ?
>>>>
>>>> You are right that this series approximately implements the scheme
>>>> described as number 3 in your diagrams.
>>>
>>>> However, there are some additional details worth mentioning:
>>>> * There's some but relatively small amount of VFIO data being
>>>> transferred from the "save_live_iterate" SaveVMHandler while the VM is
>>>> still running.
>>>>
>>>> This is still happening via the main migration channel.
>>>> Parallelizing this transfer in the future might make sense too,
>>>> although obviously this doesn't impact the downtime.
>>>>
>>>> * After the VM is stopped and downtime starts the main (~ 400 MiB)
>>>> VFIO device state gets transferred via multifd channels.
>>>>
>>>> However, these multifd channels (if they are not dedicated to device
>>>> state transfer) aren't idle during that time.
>>>> Rather, they seem to be transferring the residual RAM data.
>>>>
>>>> That's most likely what causes the additional observed downtime
>>>> when dedicated device state transfer multifd channels aren't used.
>>>
>>> Ahh yes, I forgot about the residual dirty RAM; that makes sense as
>>> an explanation. Allow me to work through the scenarios though, as I
>>> still think my suggestion not to have separate dedicated channels is
>>> better....
>>>
>>>
>>> Lets say hypothetically we have an existing deployment today that
>>> uses 6 multifd channels for RAM. ie:
>>> main: | Init | | VM state |
>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>>
>>> That value of 6 was chosen because that corresponds to the amount
>>> of network & CPU utilization the admin wants to allow, for this
>>> VM to migrate. All 6 channels are fully utilized at all times.
>>>
>>>
>>> If we now want to parallelize VFIO VM state, the peak network
>>> and CPU utilization the admin wants to reserve for the VM should
>>> not change. Thus the admin will still want to configure only 6
>>> channels total.
>>>
>>> With your proposal the admin has to reduce RAM transfer to 4 of the
>>> channels, in order to then reserve 2 channels for VFIO VM state, so we
>>> get a flow like:
>>>
>>> main: | Init | | VM state |
>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd5: | VFIO VM state |
>>> multifd6: | VFIO VM state |
>>>
>>> This is bad, as it reduces performance of RAM transfer. VFIO VM
>>> state transfer is better, but that's not a net win overall.
>>>
>>>
>>>
>>> So lets say the admin was happy to increase the number of multifd
>>> channels from 6 to 8.
>>>
>>> This series proposes that they would leave RAM using 6 channels as
>>> before, and now reserve the 2 extra ones for VFIO VM state:
>>>
>>> main: | Init | | VM state |
>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd7: | VFIO VM state |
>>> multifd8: | VFIO VM state |
>>>
>>>
>>> RAM would perform as well as it did historically, and VM state would
>>> improve due to the 2 parallel channels, and not competing with the
>>> residual RAM transfer.
>>>
>>> This is what your latency comparison numbers show as a benefit for
>>> this channel reservation design.
>>>
>>> I believe this comparison is inappropriate / unfair though, as it is
>>> comparing a situation with 6 total channels against a situation with
>>> 8 total channels.
>>>
>>> If the admin was happy to increase the total channels to 8, then they
>>> should allow RAM to use all 8 channels, and then VFIO VM state +
>>> residual RAM to also use the very same set of 8 channels:
>>>
>>> main: | Init | | VM state |
>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd7: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd8: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>>
>>> This will speed up the initial RAM iterations still further & the final
>>> switchover phase even more. If residual RAM is larger than VFIO VM state,
>>> then it will dominate the switchover latency, so having VFIO VM state
>>> compete is not a problem. If VFIO VM state is larger than residual RAM,
>>> then allowing it access to all 8 channels instead of only 2 channels
>>> will be a clear win.
>>
>> I re-did the measurement with an increased number of multifd channels,
>> first to (total count/dedicated count) 25/0, then to 100/0.
>>
>> The results did not improve:
>> With the 25/0 mixed multifd channel config I still get around 1250 msec
>> downtime - the same as with the 15/0 or 11/0 mixed configs I measured
>> earlier.
>>
>> But with the (pretty insane) 100/0 mixed channel config the whole setup
>> gets so far into the law of diminishing returns that the results actually
>> get worse: the downtime is now about 1450 msec.
>> I guess that's from all the extra overhead of switching between 100
>> multifd channels.
>
> 100 threads are probably too much indeed.
>
> However, I agree with the question Dan raised, and I'd like to second it.
> So far it looks better if the multifd channels can be managed just like a
> pool of workers without assignments to specific jobs. It looks like this
> series is already getting there; it's a pity we lose that genericity only
> because of some side effects on the RAM sync semantics.
We don't lose any genericity since by default the transfer is done via
mixed RAM / device state multifd channels from a shared pool.
It's only when x-multifd-channels-device-state is set to a value > 0 that
the requested number of multifd channels gets dedicated to device state.
It could be seen as a fine-tuning option for cases where tests show that
it provides some benefit to the particular workload - just like many
other existing migration options.
A 14% downtime improvement is too much to waste - I'm not sure it's only
due to avoiding RAM syncs; it's possible that there are other subtle
performance interactions too.
For even more genericity this option could be named something like
x-multifd-channels-map and contain an array of channel settings like
"ram,ram,ram,device-state,device-state".
Then possible future uses of multifd channels wouldn't even need
a new dedicated option.
>>
>> I think one of the reasons for these results is that mixed (RAM + device
>> state) multifd channels participate in the RAM sync process
>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
>
> Firstly, I'm wondering whether we can have better names for these new
> hooks. Currently (only comment on the async* stuff):
>
> - complete_precopy_async
> - complete_precopy
> - complete_precopy_async_wait
>
> But perhaps better:
>
> - complete_precopy_begin
> - complete_precopy
> - complete_precopy_end
>
> ?
>
> I don't see why the device must do something asynchronous in such a hook.
> To me it's more like you're splitting one process into multiple phases, so
> begin/end sounds more generic.
Ack, I will rename these hooks to begin/end.
> Then, with that in mind, IIUC we can already split ram_save_complete()
> into >1 phases too. For example, I would be curious whether the performance
> will go back to normal if we offload multifd_send_sync_main() into
> complete_precopy_end(), because we really only need one shot of that, and I
> am quite surprised it already greatly affects VFIO dumping its own things.
AFAIK there's already just one multifd_send_sync_main() call during
downtime - the one made from the save_live_complete_precopy SaveVMHandler.
In order to truly never interfere with the device state transfer, the sync
would need to be ordered after the device state transfer is complete - that
is, after the VFIO complete_precopy_end (complete_precopy_async_wait)
handler returns.
> I would even ask one step further as what Dan was asking: have you thought
> about dumping VFIO states via multifd even during iterations? Would that
> help even more than this series (which IIUC only helps during the blackout
> phase)?
>
> It could mean that the "async*" hooks can be done differently, and I'm not
> sure whether they're needed at all, e.g. when threads are created during
> save_setup but cleaned up in save_cleanup.
Responded to this thread in another e-mail message.
> Thanks,
>
Thanks,
Maciej