From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1755638AbbLDQcM (ORCPT ); Fri, 4 Dec 2015 11:32:12 -0500
Received: from mga11.intel.com ([192.55.52.93]:48786 "EHLO mga11.intel.com"
 rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
 id S1753803AbbLDQcI (ORCPT ); Fri, 4 Dec 2015 11:32:08 -0500
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.20,380,1444719600"; d="scan'208";a="854006813"
Subject: Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for
 SRIOV NIC
To: "Michael S. Tsirkin" , Alexander Duyck
References: <565BF285.4040507@intel.com> <565DB6FF.1050602@intel.com>
 <20151201171140-mutt-send-email-mst@redhat.com>
 <20151201193026-mutt-send-email-mst@redhat.com>
 <20151202105955-mutt-send-email-mst@redhat.com>
Cc: "Dong, Eddie" , "a.motakis@virtualopensystems.com" , Alex Williamson ,
 "b.reynal@virtualopensystems.com" , Bjorn Helgaas , "Wyborny, Carolyn" ,
 "Skidmore, Donald C" , "Jani, Nrupal" , Alexander Graf ,
 "kvm@vger.kernel.org" , Paolo Bonzini , "qemu-devel@nongnu.org" ,
 "Tantilov, Emil S" , Or Gerlitz , "Rustad, Mark D" , Eric Auger ,
 intel-wired-lan , "Kirsher, Jeffrey T" , "Brandeburg, Jesse" ,
 "Ronciak, John" , "linux-api@vger.kernel.org" ,
 "linux-kernel@vger.kernel.org" , "Williams, Mitch A" , Netdev ,
 "Nelson, Shannon" , Wei Yang , "zajec5@gmail.com"
From: "Lan, Tianyu"
Message-ID: <5661C000.8070201@intel.com>
Date: Sat, 5 Dec 2015 00:32:00 +0800
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101
 Thunderbird/38.3.0
MIME-Version: 1.0
In-Reply-To: <20151202105955-mutt-send-email-mst@redhat.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Michael & Alexander:
Thanks a lot for your comments and suggestions.

We still need to support Windows guests for migration, and this is why
our patches keep all the changes in the driver, since it's impossible to
change the Windows kernel.

The following is my idea for doing DMA tracking.

Inject an event to the VF driver after the memory-iterate stage and
before stopping the VCPU, and then have the VF driver mark all memory
currently in use for DMA as dirty. Newly allocated pages also need to be
marked dirty before the VCPU is stopped. All memory dirtied in this
window will be migrated in the stop-and-copy stage. We also need to make
sure the VF is disabled, by clearing its bus master enable bit, before
migrating this memory.

The DMA pages allocated by the VF driver also need to reserve space for
a dummy write.
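[ A minimal sketch of the "reserve space for a dummy write" idea above.
  The structure and function names (vf_dma_page, vf_dirty_all_dma_pages)
  are made up for illustration and are not part of the posted patches:
  each DMA page the VF driver hands to the device is over-allocated by a
  few bytes, and the migration-event handler stamps that reserved tail so
  the hypervisor's dirty-page logging picks the page up without the
  payload area being touched. ]

#include <linux/list.h>
#include <linux/mm.h>
#include <linux/types.h>

#define VF_DUMMY_WRITE_BYTES	4			/* reserved tail of each DMA page */
#define VF_DMA_DATA_LEN		(PAGE_SIZE - VF_DUMMY_WRITE_BYTES)

struct vf_dma_page {
	struct list_head node;
	void *vaddr;					/* kernel mapping of the DMA page */
};

static LIST_HEAD(vf_dma_pages);				/* filled in by the map/alloc paths */

/* Run from the migration-event handler, before the VCPU is stopped. */
static void vf_dirty_all_dma_pages(void)
{
	struct vf_dma_page *p;
	static u32 stamp;

	stamp++;
	list_for_each_entry(p, &vf_dma_pages, node) {
		u32 *tail = (u32 *)((u8 *)p->vaddr + VF_DMA_DATA_LEN);

		*tail = stamp;				/* dummy write: page is now dirty */
	}
}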
On 12/2/2015 7:44 PM, Michael S. Tsirkin wrote:
> On Tue, Dec 01, 2015 at 10:36:33AM -0800, Alexander Duyck wrote:
>> On Tue, Dec 1, 2015 at 9:37 AM, Michael S. Tsirkin wrote:
>>> On Tue, Dec 01, 2015 at 09:04:32AM -0800, Alexander Duyck wrote:
>>>> On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin wrote:
>>
>>>>> There are several components to this:
>>>>> - dma_map_* needs to prevent page from
>>>>> being migrated while device is running.
>>>>> For example, expose some kind of bitmap from guest
>>>>> to host, set bit there while page is mapped.
>>>>> What happens if we stop the guest and some
>>>>> bits are still set? See dma_alloc_coherent below
>>>>> for some ideas.
>>>>
>>>> Yeah, I could see something like this working. Maybe we could do
>>>> something like what was done for the NX bit and make use of the upper
>>>> order bits beyond the limits of the memory range to mark pages as
>>>> non-migratable?
>>>>
>>>> I'm curious. What we have with a DMA mapped region is essentially
>>>> shared memory between the guest and the device. How would we resolve
>>>> something like this with IVSHMEM, or are we blocked there as well in
>>>> terms of migration?
>>>
>>> I have some ideas. Will post later.
>>
>> I look forward to it.
>>
>>>>> - dma_unmap_* needs to mark page as dirty
>>>>> This can be done by writing into a page.
>>>>>
>>>>> - dma_sync_* needs to mark page as dirty
>>>>> This is trickier as we can not change the data.
>>>>> One solution is using atomics.
>>>>> For example:
>>>>> int x = ACCESS_ONCE(*p);
>>>>> cmpxchg(p, x, x);
>>>>> Seems to do a write without changing page
>>>>> contents.
>>>>
>>>> Like I said we can probably kill 2 birds with one stone by just
>>>> implementing our own dma_mark_clean() for x86 virtualized
>>>> environments.
>>>>
>>>> I'd say we could take your solution one step further and just use 0
>>>> instead of bothering to read the value. After all it won't write the
>>>> area if the value at the offset is not 0.
>>>
>>> Really almost any atomic that has no side effect will do.
>>> atomic or with 0
>>> atomic and with ffffffff
>>>
>>> It's just that cmpxchg already happens to have a portable
>>> wrapper.
>>
>> I was originally thinking maybe an atomic_add with 0 would be the way
>> to go.
>
> cmpxchg with any value too.
>
>> Either way though we still are using a locked prefix and
>> having to dirty a cache line per page which is going to come at some
>> cost.
>
> I agree. It's likely not necessary for everyone
> to be doing this: only people that both
> run within the VM and want migration to work
> need to do this logging.
>
> So set some module option to have driver tell hypervisor that it
> supports logging. If bus mastering is enabled before this, migration is
> blocked. Or even pass some flag from hypervisor so
> driver can detect it needs to log writes.
> I guess this could be put in device config somewhere,
> though in practice it's a global thing, not a per device one, so
> maybe we need some new channel to
> pass this flag to guest. CPUID?
> Or maybe we can put some kind of agent in the initrd
> and use the existing guest agent channel after all.
> agent in initrd could open up a lot of new possibilities.
>
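[ To make the "write without changing page contents" trick above concrete,
  here is a sketch of a hypothetical dma_mark_dirty() helper (the name
  mirrors the existing dma_mark_clean() mentioned above but is invented
  here) that a virtualization-aware x86 DMA layer could call from
  dma_unmap_*() and dma_sync_*_for_cpu(). It issues one harmless locked
  write per page covered by the mapping. ]

#include <linux/atomic.h>
#include <linux/compiler.h>
#include <linux/mm.h>

static inline void dma_mark_dirty(void *addr, size_t size)
{
	unsigned long start = (unsigned long)addr & PAGE_MASK;
	unsigned long end = PAGE_ALIGN((unsigned long)addr + size);
	unsigned long pg;

	for (pg = start; pg < end; pg += PAGE_SIZE) {
		unsigned long *p = (unsigned long *)pg;
		unsigned long x = ACCESS_ONCE(*p);

		/*
		 * Store back the value we just read: the data is unchanged,
		 * but the hypervisor's dirty logging sees a write.  If the
		 * compare fails, another CPU has just written to the page,
		 * so it is dirty anyway.
		 */
		cmpxchg(p, x, x);
	}
}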
>>>>> - dma_alloc_coherent memory (e.g. device rings)
>>>>> must be migrated after device stopped modifying it.
>>>>> Just stopping the VCPU is not enough:
>>>>> you must make sure device is not changing it.
>>>>>
>>>>> Or maybe the device has some kind of ring flush operation,
>>>>> if there was a reasonably portable way to do this
>>>>> (e.g. a flush capability could maybe be added to SRIOV)
>>>>> then hypervisor could do this.
>>>>
>>>> This is where things start to get messy. I was suggesting the
>>>> suspend/resume to resolve this bit, but it might be possible to also
>>>> deal with this by something like clearing the bus master
>>>> enable bit for the VF. If I am not mistaken that should disable MSI-X
>>>> interrupts and halt any DMA. That should work as long as you have
>>>> some mechanism that is tracking the pages in use for DMA.
>>>
>>> A bigger issue is recovering afterwards.
>>
>> Agreed.
>>
>>>>> In case you need to resume on source, you
>>>>> really need to follow the same path
>>>>> as on destination, preferably detecting
>>>>> device reset and restoring the device
>>>>> state.
>>>>
>>>> The problem with detecting the reset is that you would likely have to
>>>> be polling to do something like that.
>>>
>>> We could send some event to the guest to notify it about this
>>> through a new or existing channel.
>>>
>>> Or we could make it possible for userspace to trigger this,
>>> then notify guest through the guest agent.
>>
>> The first thing that comes to mind would be to use something like PCIe
>> Advanced Error Reporting, however I don't know if we can put a
>> requirement on the system supporting the q35 machine type or not in
>> order to support migration.
>
> You mean require pci express? This sounds quite reasonable.
>
>>>> I believe the fm10k driver
>>>> already has code like that in place where it will detect a reset as a
>>>> part of its watchdog, however the response time is something like 2
>>>> seconds for that. That was one of the reasons I preferred something
>>>> like hot-plug as that should be functioning as soon as the guest is up
>>>> and it is a mechanism that operates outside of the VF drivers.
>>>
>>> That's pretty minor.
>>> A bigger issue is making sure guest does not crash
>>> when device is suddenly reset under its legs.
>>
>> I know the ixgbevf driver should already have logic to address some of
>> that. If you look through the code there should be logic there for
>> surprise removal support in ixgbevf. The only issue is that unlike
>> fm10k it will not restore itself after a resume or slot_reset call.
>
> So if it's the question of the driver installing a slot_reset handler,
> this sounds quite reasonable.
>
> It would be nice to be able to detect that guest supports
> this removal, too, and block migration if it doesn't.
> For example, show this capability
> in an attribute in sysfs, make guest agent read that.
>
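[ Sketch of the slot_reset-based recovery being discussed, using the
  standard PCI error-handling callbacks; the vf_* names are hypothetical
  and this is not the actual ixgbevf code. A VF driver that registers
  these handlers can be asked by the PCI core to reinitialise after the
  device has been reset underneath it, instead of leaving the guest with
  a dead device. ]

#include <linux/pci.h>

static pci_ers_result_t vf_io_error_detected(struct pci_dev *pdev,
					     enum pci_channel_state state)
{
	/* Stop the data path here: detach the netdev, disable interrupts. */

	if (state == pci_channel_io_perm_failure)
		return PCI_ERS_RESULT_DISCONNECT;

	pci_disable_device(pdev);
	return PCI_ERS_RESULT_NEED_RESET;
}

static pci_ers_result_t vf_io_slot_reset(struct pci_dev *pdev)
{
	if (pci_enable_device_mem(pdev))
		return PCI_ERS_RESULT_DISCONNECT;

	pci_set_master(pdev);
	/* Re-program rings, MAC filters and interrupts before resuming. */
	return PCI_ERS_RESULT_RECOVERED;
}

static void vf_io_resume(struct pci_dev *pdev)
{
	/* Re-attach the netdev and restart the queues. */
}

static const struct pci_error_handlers vf_err_handlers = {
	.error_detected	= vf_io_error_detected,
	.slot_reset	= vf_io_slot_reset,
	.resume		= vf_io_resume,
};

/* Hooked up from the driver's struct pci_driver via .err_handler = &vf_err_handlers */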