From: Xuewei Niu <niuxuewei97@gmail.com>
To: sgarzare@redhat.com
Cc: fupan.lfp@antgroup.com, mst@redhat.com,
niuxuewei.nxw@antgroup.com, niuxuewei97@gmail.com,
parav@nvidia.com, stefanha@redhat.com,
virtio-comment@lists.linux.dev
Subject: Re: [PATCH v6 RESEND] virtio-vsock: Add support for multi devices
Date: Wed, 9 Apr 2025 14:55:58 +0800
Message-ID: <20250409065558.3052615-1-niuxuewei.nxw@antgroup.com>
In-Reply-To: <CAGxU2F4Qejw2hd45SduH=OwzUZVR6xYJATRyDskukHU8+2nkGw@mail.gmail.com>
> On Mon, 7 Apr 2025 at 04:17, Xuewei Niu <niuxuewei97@gmail.com> wrote:
> >
> > > On Mon, Mar 31, 2025 at 02:18:27PM +0800, Xuewei Niu wrote:
> > > >> > On Wed, 26 Mar 2025 at 11:32, Stefano Garzarella <sgarzare@redhat.com> wrote:
> > > >> > > On Wed, Mar 26, 2025 at 06:00:31PM +0800, Xuewei Niu wrote:
> > > >> > > >> On Tue, Mar 25, 2025 at 11:19:46AM +0800, Xuewei Niu wrote:
> > > >> > > >> >> On Mon, Mar 24, 2025 at 02:43:35PM +0800, Xuewei Niu wrote:
> > > >> > > >> >> >This patch brings a new feature, called "multi devices", to the virtio
> > > >> > > >> >> >vsock. It introduces a "VIRTIO_VSOCK_F_MULTI_DEVICES" feature bit, and a
> > > >> > > >> >> >"device_order" field to the config for the virtio vsock.
> > > >> > > >> >> >
> > > >> > > >> >> >== Motivation ==
> > > >> > > >> >> >
> > > >> > > >> >> >Vsock is a lightweight and widely used data exchange mechanism between host
> > > >> > > >> >> >and guest. Currently, the virtio-vsock only supports one device, resulting
> > > >> > > >> >> >in the inability to enable more than one backend. For instance, two devices
> > > >> > > >> >> >are required: one to transfer data to the VMM via virtio-vsock,
> > > >> > > >> >>
> > > >> > > >> >> Come to think of it, AF_VSOCK defines CID 0 (VMADDR_CID_HYPERVISOR) to
> > > >> > > >> >> communicate with the hypervisor, but in virtio-vsock we never supported
> > > >> > > >> >> it. Could this be the use case?
> > > >> > > >> >>
> > > >> > > >> >> We could in this way add a new feature for those devices that
> > > >> > > >> >> communicate only with the VMM, where the CID of the VM is quite useless.
> > > >> > > >> >> So instead of having multiple CIDs per VM, we could continue to have a
> > > >> > > >> >> single CID, but the transport could support 2 devices, one to
> > > >> > > >> >> communicate with the VMM (CID = 0) and one to communicate with the host
> > > >> > > >> >> apps (CID = 2).
> > > >> > > >> >>
> > > >> > > >> >> Maybe this is orthogonal to this proposal, though, because it might
> > > >> > > >> >> still make sense to have multiple vsock devices, even though it's not
> > > >> > > >> >> very clear to me.
> > > >> > > >> >
> > > >> > > >> >In terms of the current situation, two devices are enough.
> > > >> > > >> >
> > > >> > > >> >We are the team of Kata Containers, so we are focusing on cloud-native
> > > >> > > >> >computing. What I mentioned below might be beyond the scope of the virtio
> > > >> > > >> >spec, just for your reference.
> > > >> > > >> >
> > > >> > > >> >The background is that the architecture of proxy mesh has been evolved over
> > > >> > > >> >the past few years: from per-pod to per-host (e.g. Istio Ambient Mesh[1]).
> > > >> > > >> >
> > > >> > > >> >Thanks to the TSI[2] and vhost-user protocol, network packets can bypass
> > > >> > > >> >both host and guest network stacks. It is possible to establish a fast path
> > > >> > > >> >between the pod and the proxy.
> > > >> > > >> >
> > > >> > > >> >When we have multiple networks, it is intuitive to have multiple NICs. So
> > > >> > > >> >does vsock.
> > > >> > > >>
> > > >> > > >> Be careful though, we don't want to complicate vsock to become like a
> > > >> > > >> NIC.
> > > >> > > >>
> > > >> > > >> >
> > > >> > > >> >When multiple networks are available, it means that it is possible to have
> > > >> > > >> >multiple proxies (i.e. user processes). In this case, two devices are not
> > > >> > > >> >enough. This feature makes vsock more flexible and scalable.
> > > >> > > >>
> > > >> > > >> This is a good point, but I really don't understand why a VM should have
> > > >> > > >> multiple CIDs assigned.
> > > >> > > >
> > > >> > > >I think priority is not the biggest issue here. So let us focus on how to
> > > >> > > >route the connection to the right device among more than two devices.
> > > >> > >
> > > >> > > That's why I was recommending a different approach. IMO the user should
> > > >> > > not do this, but that should be transparent, hidden in the driver.
> > > >> > >
> > > >> > > By supporting VMADDR_CID_HYPERVISOR, we know very well if a packet is to
> > > >> > > be sent to the VMM, then we have to use the device that supports it.
> > > >> > > Whereas if the user connects to VMADDR_CID_HOST we have to use the other
> > > >> > > device.
> > > >> > >
> > > >> > > The user doesn't have to do anything, only use the right destination CID
> > > >> > > if it wants to talk to the VMM or another host process.
> > > >> >
> > > >> > Obviously, if we want to support more than 2 devices, we need this
> > > >> > that you are proposing. But IMO we need also to support
> > > >> > VMADDR_CID_HYPERVISOR, and we should prevent the user from doing
> > > >> > bind() on a random CID if one of the two devices only talks to the
> > > >> > VMM.
> > > >>
> > > >> I agree with supporting `VMADDR_CID_HYPERVISOR` for virtio-vsock. I can
> > > >> work on this later.
> > >
> > > Would be nice to have both together, but I'm fine if you want to
> > > postpone it.
> > >
> > > >>
> > > >> > Because, again, how does the user know which CID to bind?
> > > >>
> > > >> Nice catch! I am trying to give a solution for this issue regarding the
> > > >> scenario of more than two devices.
> > > >>
> > > >> Let users access the `device_order` and the `guest_cid` field. Host user
> > > >> program and guest user program can make an advance agreement. For example,
> > > >> the first device (whose `device_order` is smallest) is used to communicate
> > > >> with host process 1, the second device is used to host process 2, and so
> > > >> on.
> > > >>
> > > >> If the guest user program wants to direct a message to host process 2,
> > > >> the steps would be:
> > > >>
> > > >> 1. Guest user program gets the second device's `guest_cid`.
> > > >> 2. Guest user program binds to the CID.
> > > >>
> > > >> This could work because the `device_order` is a VM-level
> > > >> configuration. (On the contrary, the `guest_cid` is a host-level
> > > >> configuration).
> > > >>
> > > >> If people don't need this feature (use 1 or 2 devices only), they can use
> > > >> vsock as the simple way. Otherwise, people should accept the more
> > > >> complicated way.
> > > >>
> > > >> WDYT?
> > > >
> > > >Or we can replace the device_order with the guest_lid (aka local id). The
> > > >guest_lid is a VM-level address space, while the guest_cid is a host-level
> > > >address space.
> > > >
> > > >```c
> > > >struct virtio_vsock_config {
> > > > __le64 guest_cid;
> > > > __le16 guest_lid; /* previous device_order */
> > > >};
> > > >```
> > > >
> > > >With this design, the relationship between the device and the guest_lid
> > > >should be set properly before building the guest app and launching the
> > > >VM.
> > > >
> > > >For example, host process 0's guest_lid is 1000, and host process 1's is
> > > >2000. Their guest_cid will be determined when the VM started. The device
> > > >table should be like this:
> > > >
> > > >* device0: process=VM guest_lid=0 guest_cid=0 <default device>
> > > >* device1: process=0 guest_lid=1000 guest_cid=x
> > > >* device2: process=1 guest_lid=2000 guest_cid=y
> > > >
> > > >The driver should expose an interface, such as ioctl, receiving a
> > > >local_cid. Guest apps can use it to obtain the actual guest_cid.
> > >
> > > No, please, I don't think adding virtio-specific behaviour in AF_VSOCK
> > > is what we want.
> > >
> > > Let's continue with device_order and see what others say.
> > >
> > > I think we need to try to get a better understanding of what to do,
> > > depending on the direction:
> > >
> > > - host -> guest: it might make sense multiple devices with different
> > > CIDs, and the host will know which one to use depending on the CID
> > > assigned to the device (e.g. vhost, vhost-user, device in VMM)
> > >
> > > - guest -> host: again I think we should differentiate the device to use
> > > depending on the destination CID which can be VMADDR_CID_HOST,
> > > VMADDR_CID_HYPERVISOR, or in the case where sibling communication is
> > > supported a CID >= 3, so maybe we should have some features or flags
> > > in the config space to describe destination CID supported for each
> > > device
> >
> > I don't understand the point of adding a new features/flags. Could you
> > explain a bit more?
>
> The idea is to inform the guest which addresses are reachable by the
> device, so the guest can easily decide which device to use. I'm
> talking about the destination, so CID_HOST(2), CID_HYPERVISOR(0) or a
> sibling VM (CID >=3).
>
> >
> > We have had the guest_cid field in the config space. The guest knows all
> > devices present in the VM.
>
> Okay, but how can the guest figure out from this information which
> device to use to talk to the hypervisor or an application in the host?
>
> >
> > If the app tries to bind a random CID, it will fail since the driver can't
> > find the device by the CID.
>
> I'm not talking about the source CID on which to do bind() (which I
> honestly don't like), but I'm talking about the destination CID on
> which to do connect().
>
> > > so that the guest knows which device to use depending on the destination
> > > CID.
> >
> > Yes, this is what I was describing in the previous comment. The message
> > will be directed to the device by the destination CID.
>
> Sorry, I don't understand how you do this without having an
> information from the device about what addresses it supports. Can you
> elaborate a bit?

Thanks for your explanation. So, as I understand it, you were describing
two cases:

1) Guest app as a server: the app MUST `bind()` to a CID that is available
   in the current VM.
2) Guest app as a client: the guest driver picks a device and uses that
   device's CID as the src CID, so the guest app doesn't need to `bind()`;
   it only needs to `connect()`.

The key point is who takes responsibility for picking a device:

1) I prefer that the guest app does this: `bind()` to pick a device, then
   `connect()`.
2) You prefer that the guest driver does this: the app only calls
   `connect()`, and the driver picks a device according to the dst CID.

Am I right?
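
To make the two options concrete, here is a minimal sketch of the lookup
each one implies. It is only a model, not driver code: `struct vsock_dev`,
its `dst_cid` field (standing in for "the destination this device can
reach", which the spec does not define today), and both function names are
hypothetical. Only the two `VMADDR_CID_*` values match the existing
AF_VSOCK definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VMADDR_CID_HYPERVISOR 0u
#define VMADDR_CID_HOST       2u

/* Hypothetical per-device view kept by the guest driver. */
struct vsock_dev {
    uint64_t guest_cid; /* CID read from the device's config space */
    uint64_t dst_cid;   /* assumption: the one destination this device reaches */
};

/*
 * Option 1: the guest app picks. The app bind()s to a source CID and the
 * driver looks for the device that owns it. Returns the device index, or
 * -1 if no device has that CID (the bind() would fail).
 */
int pick_by_src_cid(const struct vsock_dev *devs, size_t n, uint64_t src_cid)
{
    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (devs[i].guest_cid == src_cid)
            return (int)i;
    return -1;
}

/*
 * Option 2: the guest driver picks. The app only connect()s, and the
 * driver returns the first device that reaches the destination CID, or
 * -1 if none does. The source CID would then be that device's guest_cid.
 */
int pick_by_dst_cid(const struct vsock_dev *devs, size_t n, uint64_t dst_cid)
{
    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (devs[i].dst_cid == dst_cid)
            return (int)i;
    return -1;
}
```

With one device facing the VMM and one facing host apps, option 2 needs no
input from the app beyond the destination, while option 1 needs the app to
know a `guest_cid` in advance.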

I'm open to both ideas, but I have some concerns:

1) Two devices may be in different namespaces, e.g. host kernel (vhost) and
   hybrid vsock (vhost-user), which might lead to two identical CIDs (e.g.
   VMADDR_CID_HOST). If that happens, the driver can't distinguish them.
   We can avoid this by letting the guest app pick a device.
2) What if the number of VMs is too large? For instance, 1,000 VMs (1,000
   CIDs) would need at least 8,000 bytes of config space, at 8 bytes per
   CID. (This looks like an extreme example; I don't know whether it would
   happen in the real world.)
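
Both concerns can be checked with a few lines of C. This is a rough sketch
under my own assumptions: the helper names are hypothetical, and I assume
each reachable CID would be advertised as a 64-bit value like `guest_cid`.
Neither the list layout nor the helpers come from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VMADDR_CID_HOST 2u

/*
 * Concern 1: count how many devices claim to reach dst_cid. A result
 * greater than 1 means the driver cannot choose by destination alone.
 */
size_t count_reaching(const uint64_t *dst_cids, size_t n, uint64_t dst_cid)
{
    size_t hits = 0;

    assert(dst_cids != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (dst_cids[i] == dst_cid)
            hits++;
    return hits;
}

/*
 * Concern 2: config space needed to list n_cids reachable CIDs, assuming
 * one 64-bit entry per CID.
 */
size_t cid_list_bytes(size_t n_cids)
{
    return n_cids * sizeof(uint64_t);
}
```

For a vhost device and a vhost-user device that both face VMADDR_CID_HOST,
`count_reaching()` returns 2, which is the ambiguous case. For 1,000
sibling CIDs, `cid_list_bytes(1000)` gives the 8,000 bytes mentioned above.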

Thanks,
Xuewei