From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S933697AbbLGPlF (ORCPT ); Mon, 7 Dec 2015 10:41:05 -0500
Received: from mga14.intel.com ([192.55.52.115]:13551 "EHLO mga14.intel.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S933530AbbLGPk6 (ORCPT ); Mon, 7 Dec 2015 10:40:58 -0500
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.20,395,1444719600"; d="scan'208";a="855681778"
Subject: Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for
	SRIOV NIC
To: Alexander Duyck, "Michael S. Tsirkin"
References: <565BF285.4040507@intel.com> <565DB6FF.1050602@intel.com>
	<20151201171140-mutt-send-email-mst@redhat.com>
	<20151201193026-mutt-send-email-mst@redhat.com>
	<20151202105955-mutt-send-email-mst@redhat.com>
	<5661C000.8070201@intel.com> <5661C86D.3010904@gmail.com>
Cc: "Dong, Eddie", "a.motakis@virtualopensystems.com", Alex Williamson,
	"b.reynal@virtualopensystems.com", Bjorn Helgaas, "Wyborny, Carolyn",
	"Skidmore, Donald C", "Jani, Nrupal", Alexander Graf,
	"kvm@vger.kernel.org", Paolo Bonzini, "qemu-devel@nongnu.org",
	"Tantilov, Emil S", Or Gerlitz, "Rustad, Mark D", Eric Auger,
	intel-wired-lan, "Kirsher, Jeffrey T", "Brandeburg, Jesse",
	"Ronciak, John", "linux-api@vger.kernel.org",
	"linux-kernel@vger.kernel.org", "Williams, Mitch A", Netdev,
	"Nelson, Shannon", Wei Yang, "zajec5@gmail.com"
From: "Lan, Tianyu"
Message-ID: <5665A884.2020102@intel.com>
Date: Mon, 7 Dec 2015 23:40:52 +0800
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101
	Thunderbird/38.4.0
MIME-Version: 1.0
In-Reply-To: <5661C86D.3010904@gmail.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 12/5/2015 1:07 AM, Alexander Duyck wrote:
>>
>> We still need to support Windows guest for migration and this is why our
>> patches keep all changes in the driver since it's impossible to change
>> Windows kernel.
>
> That is a poor argument. I highly doubt Microsoft is interested in
> having to modify all of the drivers that will support direct assignment
> in order to support migration. They would likely request something
> similar to what I have in that they will want a way to do DMA tracking
> with minimal modification required to the drivers.

This depends entirely on the NIC (or other device) vendors, and they
should decide whether or not to support migration. If yes, they will
modify their drivers.

If the only goal is to call suspend/resume during migration, the
feature will be meaningless: in most cases users don't want migration
to be very noticeable, so the service downtime is critical. Our target
is to apply SR-IOV NIC passthrough to cloud service and NFV (network
functions virtualization) projects, which are sensitive to network
performance and stability. In my opinion, we should give the device
driver a chance to implement its own migration job; it can simply call
the suspend and resume callbacks if it doesn't care about performance
during migration.

>
>> Following is my idea to do DMA tracking.
>>
>> Inject event to VF driver after memory iterate stage
>> and before stop VCPU and then VF driver marks dirty all
>> using DMA memory. The new allocated pages also need to
>> be marked dirty before stopping VCPU. All dirty memory
>> in this time slot will be migrated until stop-and-copy
>> stage. We also need to make sure to disable VF via clearing the
>> bus master enable bit for VF before migrating these memory.
>
> The ordering of your explanation here doesn't quite work. What needs to
> happen is that you have to disable DMA and then mark the pages as dirty.
> What the disabling of the BME does is signal to the hypervisor that
> the device is now stopped. The ixgbevf_suspend call already supported
> by the driver is almost exactly what is needed to take care of something
> like this.
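If I read you right, the ordering you describe would look roughly like
the sketch below. pci_clear_master() is the real kernel helper for
clearing BME; the adapter and ring layout here is only a rough stand-in
for the ixgbevf structures, not existing code.

#include <linux/pci.h>
#include <linux/mm.h>
#include <linux/types.h>

/* Stand-in structures for illustration only. */
struct vf_ring {
	unsigned int count;
	struct page **buffer_page;	/* one page per descriptor */
};

struct vf_adapter {
	struct pci_dev *pdev;
	unsigned int num_rx_queues;
	struct vf_ring **rx_ring;
};

/* Quiesce DMA first, then dirty the pages from the CPU so the
 * hypervisor's dirty logging picks them up. */
static void vf_stop_dma_and_mark_dirty(struct vf_adapter *adapter)
{
	unsigned int i, j;

	/* Clear bus master enable: no further DMA writes can land
	 * after this point. */
	pci_clear_master(adapter->pdev);

	for (i = 0; i < adapter->num_rx_queues; i++) {
		struct vf_ring *ring = adapter->rx_ring[i];

		for (j = 0; j < ring->count; j++) {
			volatile u8 *va =
				(volatile u8 *)page_address(ring->buffer_page[j]);

			/* Read-modify-write of one byte per page: the
			 * CPU store flips the page to dirty in the
			 * hypervisor's log without corrupting data. */
			*va = *va;
		}
	}
}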
Stopping DMA like that is exactly what I hope to avoid. That is why I
want to reserve a piece of space in each DMA page for a dummy write: it
lets us mark a page dirty without stopping DMA and without racing with
the DMA data. If we can't do that, we have to stop DMA for a short
time, mark all DMA pages dirty and then re-enable it. I am not sure how
much we gain from tracking all DMA memory this way while the device
keeps running during migration. I need to run some tests and compare
the results against stopping DMA directly in the last stage of
migration.
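A minimal sketch of the dummy-write idea, under the assumption that the
driver reserves the last byte of every DMA-mapped page and programs the
device so it never writes there (all names below are hypothetical):

#include <linux/mm.h>
#include <linux/types.h>

/* The last byte of each DMA page is reserved for the CPU; the device
 * is assumed never to touch it, so the store cannot race with inbound
 * DMA. */
#define VF_DIRTY_POKE_OFFSET	(PAGE_SIZE - 1)

static void vf_poke_page_dirty(struct page *page)
{
	volatile u8 *poke =
		(volatile u8 *)page_address(page) + VF_DIRTY_POKE_OFFSET;

	/* A plain CPU store is enough to set the page dirty in the
	 * hypervisor's dirty bitmap while DMA keeps running. */
	*poke = *poke;
}

The cost would be one reserved byte per page plus one store per tracked
page on each migration pass, instead of a full DMA stop.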
Tsirkin" Cc: Wei Yang , "Tantilov, Emil S" , "kvm@vger.kernel.org" , "qemu-devel@nongnu.org" , "Brandeburg, Jesse" , "Rustad, Mark D" , "Wyborny, Carolyn" , Eric Auger , "Skidmore, Donald C" , "zajec5@gmail.com" , Alexander Graf , intel-wired-lan , "Kirsher, Jeffrey T" , Or Gerlitz , "Williams, Mitch A" , "Jani, Nrupal" , Bjorn Helgaas , "a.motakis@virtualopensystems.com" , "b.reynal@virtualopensystems.com" , "linux-api@vger.kernel.org" , "Nelson, Shannon" , "Dong, Eddie" , Alex Williamson , "linux-kernel@vger.kernel.org" , "Ronciak, John" , Netdev , Paolo Bonzini On 12/5/2015 1:07 AM, Alexander Duyck wrote: >> >> We still need to support Windows guest for migration and this is why our >> patches keep all changes in the driver since it's impossible to change >> Windows kernel. > > That is a poor argument. I highly doubt Microsoft is interested in > having to modify all of the drivers that will support direct assignment > in order to support migration. They would likely request something > similar to what I have in that they will want a way to do DMA tracking > with minimal modification required to the drivers. This totally depends on the NIC or other devices' vendors and they should make decision to support migration or not. If yes, they would modify driver. If just target to call suspend/resume during migration, the feature will be meaningless. Most cases don't want to affect user during migration a lot and so the service down time is vital. Our target is to apply SRIOV NIC passthough to cloud service and NFV(network functions virtualization) projects which are sensitive to network performance and stability. From my opinion, We should give a change for device driver to implement itself migration job. Call suspend and resume callback in the driver if it doesn't care the performance during migration. > >> Following is my idea to do DMA tracking. >> >> Inject event to VF driver after memory iterate stage >> and before stop VCPU and then VF driver marks dirty all >> using DMA memory. The new allocated pages also need to >> be marked dirty before stopping VCPU. All dirty memory >> in this time slot will be migrated until stop-and-copy >> stage. We also need to make sure to disable VF via clearing the >> bus master enable bit for VF before migrating these memory. > > The ordering of your explanation here doesn't quite work. What needs to > happen is that you have to disable DMA and then mark the pages as dirty. > What the disabling of the BME does is signal to the hypervisor that > the device is now stopped. The ixgbevf_suspend call already supported > by the driver is almost exactly what is needed to take care of something > like this. This is why I hope to reserve a piece of space in the dma page to do dummy write. This can help to mark page dirty while not require to stop DMA and not race with DMA data. If can't do that, we have to stop DMA in a short time to mark all dma pages dirty and then reenable it. I am not sure how much we can get by this way to track all DMA memory with device running during migration. I need to do some tests and compare results with stop DMA diretly at last stage during migration. > > The question is how we would go about triggering it. I really don't > think the PCI configuration space approach is the right idea. > I wonder > if we couldn't get away with some sort of ACPI event instead. 
>> The dma page allocated by VF driver also needs to reserve space
>> to do dummy write.
>
> No, this will not work. If for example you have a VF driver allocating
> memory for a 9K receive how will that work? It isn't as if you can poke
> a hole in the contiguous memory.