From: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name>
To: Peter Xu <peterx@redhat.com>
Cc: "Daniel P. Berrangé" <berrange@redhat.com>,
"Fabiano Rosas" <farosas@suse.de>,
"Alex Williamson" <alex.williamson@redhat.com>,
"Cédric Le Goater" <clg@redhat.com>,
"Eric Blake" <eblake@redhat.com>,
"Markus Armbruster" <armbru@redhat.com>,
"Avihai Horon" <avihaih@nvidia.com>,
"Joao Martins" <joao.m.martins@oracle.com>,
qemu-devel@nongnu.org
Subject: Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer
Date: Tue, 23 Apr 2024 18:14:18 +0200 [thread overview]
Message-ID: <9fa34f3e-2f65-442b-8872-513565bb335c@maciej.szmigiero.name> (raw)
In-Reply-To: <ZiF8aWVfW7kPuOtn@x1n>
On 18.04.2024 22:02, Peter Xu wrote:
> On Thu, Apr 18, 2024 at 08:14:15PM +0200, Maciej S. Szmigiero wrote:
>> On 18.04.2024 12:39, Daniel P. Berrangé wrote:
>>> On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote:
>>>> On 17.04.2024 18:35, Daniel P. Berrangé wrote:
>>>>> On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
>>>>>> On 17.04.2024 10:36, Daniel P. Berrangé wrote:
>>>>>>> On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
>>>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
>> (..)
>>>>>>> That said, the idea of reserving channels specifically for VFIO doesn't
>>>>>>> make a whole lot of sense to me either.
>>>>>>>
>>>>>>> Once we've done the RAM transfer, and are in the switchover phase
>>>>>>> doing device state transfer, all the multifd channels are idle.
>>>>>>> We should just use all those channels to transfer the device state,
>>>>>>> in parallel. Reserving channels just guarantees many idle channels
>>>>>>> during RAM transfer, and further idle channels during vmstate
>>>>>>> transfer.
>>>>>>>
>>>>>>> IMHO it is more flexible to just use all available multifd channel
>>>>>>> resources all the time.
>>>>>>
>>>>>> The reason for having dedicated device state channels is that they
>>>>>> provide lower downtime in my tests.
>>>>>>
>>>>>> With either 15 or 11 mixed multifd channels (no dedicated device state
>>>>>> channels) I get a downtime of about 1250 msec.
>>>>>>
>>>>>> Comparing that with 15 total multifd channels / 4 dedicated device
>>>>>> state channels, which gives a downtime of about 1100 ms, using
>>>>>> dedicated channels yields about a 14% downtime improvement.
>>>>>
>>>>> Hmm, can you clarify /when/ the VFIO vmstate transfer takes
>>>>> place? Is it transferred concurrently with the RAM? I had thought
>>>>> this series still has the RAM transfer iterations running first,
>>>>> and then the VFIO vmstate at the end, simply making use of multifd
>>>>> channels for parallelism of the end phase. Your reply, though,
>>>>> makes me question my interpretation.
>>>>>
>>>>> Let me try to illustrate channel flow in various scenarios, time
>>>>> flowing left to right:
>>>>>
>>>>> 1. serialized RAM, then serialized VM state (ie historical migration)
>>>>>
>>>>> main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |
>>>>>
>>>>>
>>>>> 2. parallel RAM, then serialized VM state (ie today's multifd)
>>>>>
>>>>> main: | Init | | VM state |
>>>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>>
>>>>>
>>>>> 3. parallel RAM, then parallel VM state
>>>>>
>>>>> main: | Init | | VM state |
>>>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd4: | VFIO VM state |
>>>>> multifd5: | VFIO VM state |
>>>>>
>>>>>
>>>>> 4. parallel RAM and VFIO VM state, then remaining VM state
>>>>>
>>>>> main: | Init | | VM state |
>>>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>>> multifd4: | VFIO VM state |
>>>>> multifd5: | VFIO VM state |
>>>>>
>>>>>
>>>>> I thought this series was implementing approx (3), but are you actually
>>>>> implementing (4), or something else entirely ?
>>>>
>>>> You are right that this series approximately implements the scheme
>>>> described as number 3 in your diagrams.
>>>
>>>> However, there are some additional details worth mentioning:
>>>> * There's some but relatively small amount of VFIO data being
>>>> transferred from the "save_live_iterate" SaveVMHandler while the VM is
>>>> still running.
>>>>
>>>> This is still happening via the main migration channel.
>>>> Parallelizing this transfer in the future might make sense too,
>>>> although obviously this doesn't impact the downtime.
>>>>
>>>> * After the VM is stopped and downtime starts the main (~ 400 MiB)
>>>> VFIO device state gets transferred via multifd channels.
>>>>
>>>> However, these multifd channels (if they are not dedicated to device
>>>> state transfer) aren't idle during that time.
>>>> Rather, they seem to be transferring the residual RAM data.
>>>>
>>>> That's most likely what causes the additional observed downtime
>>>> when dedicated device state transfer multifd channels aren't used.
>>>
>>> Ahh yes, I forgot about the residual dirty RAM; that makes sense as
>>> an explanation. Allow me to work through the scenarios though, as I
>>> still think my suggestion not to have separate dedicated channels is
>>> better....
>>>
>>>
>>> Lets say hypothetically we have an existing deployment today that
>>> uses 6 multifd channels for RAM. ie:
>>> main: | Init | | VM state |
>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>>
>>> That value of 6 was chosen because that corresponds to the amount
>>> of network & CPU utilization the admin wants to allow, for this
>>> VM to migrate. All 6 channels are fully utilized at all times.
>>>
>>>
>>> If we now want to parallelize VFIO VM state, the peak network
>>> and CPU utilization the admin wants to reserve for the VM should
>>> not change. Thus the admin will still want to configure only 6
>>> channels total.
>>>
>>> With your proposal the admin has to reduce RAM transfer to 4 of the
>>> channels, in order to then reserve 2 channels for VFIO VM state, so we
>>> get a flow like:
>>>
>>> main: | Init | | VM state |
>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd5: | VFIO VM state |
>>> multifd6: | VFIO VM state |
>>>
>>> This is bad, as it reduces performance of RAM transfer. VFIO VM
>>> state transfer is better, but that's not a net win overall.
>>>
>>>
>>>
>>> So lets say the admin was happy to increase the number of multifd
>>> channels from 6 to 8.
>>>
>>> This series proposes that they would leave RAM using 6 channels as
>>> before, and now reserve the 2 extra ones for VFIO VM state:
>>>
>>> main: | Init | | VM state |
>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>>> multifd7: | VFIO VM state |
>>> multifd8: | VFIO VM state |
>>>
>>>
>>> RAM would perform as well as it did historically, and VM state would
>>> improve due to the 2 parallel channels, and not competing with the
>>> residual RAM transfer.
>>>
>>> This is what your latency comparison numbers show as a benefit for
>>> this channel reservation design.
>>>
>>> I believe this comparison is inappropriate / unfair though, as it is
>>> comparing a situation with 6 total channels against a situation with
>>> 8 total channels.
>>>
>>> If the admin was happy to increase the total channels to 8, then they
>>> should allow RAM to use all 8 channels, and then VFIO VM state +
>>> residual RAM to also use the very same set of 8 channels:
>>>
>>> main: | Init | | VM state |
>>> multifd1: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd2: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd3: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd4: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd5: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd6: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd7: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>> multifd8: | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>>>
>>> This will speed up the initial RAM iterations still further & the final
>>> switchover phase even more. If residual RAM is larger than VFIO VM state,
>>> then it will dominate the switchover latency, so having VFIO VM state
>>> compete is not a problem. If VFIO VM state is larger than residual RAM,
>>> then allowing it access to all 8 channels instead of only 2 channels
>>> will be a clear win.
>>
>> I re-did the measurement with an increased number of multifd channels,
>> first to (total count/dedicated count) 25/0, then to 100/0.
>>
>> The results did not improve:
>> With the 25/0 mixed multifd channel config I still get around 1250 msec
>> downtime - the same as with the 15/0 or 11/0 mixed configs I measured
>> earlier.
>>
>> But with the (pretty insane) 100/0 mixed channel config the whole setup
>> gets so far into the law of diminishing returns that the results actually
>> get worse: the downtime is now about 1450 msec.
>> I guess that's from all the extra overhead of switching between 100
>> multifd channels.
>
> 100 threads are probably too much indeed.
>
> However, I agree with the question Dan raised, and I'd like to second it.
> So far it looks better if the multifd channels can be managed just like a
> pool of workers without assignments to specific jobs. It looks like this
> series is already getting there; it's a pity we lose that genericity only
> because of some side effects on the RAM sync semantics.
We don't lose any genericity since by default the transfer is done via
mixed RAM / device state multifd channels from a shared pool.
It's only when x-multifd-channels-device-state is set to a value > 0 that
the requested number of multifd channels gets dedicated to device state.
It could be seen as a fine-tuning option for cases where tests show that
it provides some benefit to the particular workload - just like many
other existing migration options.
A 14% downtime improvement is too much to waste - I'm not sure it's only
due to avoiding RAM syncs; it's possible that there are other subtle
performance interactions too.
For even more genericity this option could be named something like
x-multifd-channels-map and contain an array of channel settings like
"ram,ram,ram,device-state,device-state".
Then possible future uses of multifd channels wouldn't even need
a new dedicated option.
>>
>> I think one of the reasons for these results is that mixed (RAM + device
>> state) multifd channels participate in the RAM sync process
>> (MULTIFD_FLAG_SYNC) whereas device state dedicated channels don't.
>
> Firstly, I'm wondering whether we can have better names for these new
> hooks. Currently (only comment on the async* stuff):
>
> - complete_precopy_async
> - complete_precopy
> - complete_precopy_async_wait
>
> But perhaps better:
>
> - complete_precopy_begin
> - complete_precopy
> - complete_precopy_end
>
> ?
>
> I don't see why the device must do something asynchronous in such a hook.
> To me it's more like you're splitting one process into multiple phases, so
> begin/end sounds more generic.
Ack, I will rename these hooks to begin/end.
> Then, with that in mind, IIUC we can already split ram_save_complete()
> into >1 phases too. For example, I would be curious whether the performance
> will go back to normal if we offload multifd_send_sync_main() into
> complete_precopy_end(), because we really only need one shot of that, and I
> am quite surprised it already greatly affects VFIO dumping its own things.
AFAIK there's already just one multifd_send_sync_main() call during
downtime - the one made from the save_live_complete_precopy SaveVMHandler.
In order to truly never interfere with the device state transfer, the sync
would need to be ordered after the device state transfer is complete - that
is, after the VFIO complete_precopy_end (complete_precopy_async_wait)
handler returns.
> I would even ask one step further as what Dan was asking: have you thought
> about dumping VFIO states via multifd even during iterations? Would that
> help even more than this series (which IIUC only helps during the blackout
> phase)?
>
> It could mean that the "async*" hooks can be done differently, and I'm not
> sure whether they're needed at all, e.g. when threads are created during
> save_setup but cleaned up in save_cleanup.
Responded to this thread in another e-mail message.
> Thanks,
>
Thanks,
Maciej