From mboxrd@z Thu Jan 1 00:00:00 1970
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1752402AbbKZD4g (ORCPT ); Wed, 25 Nov 2015 22:56:36 -0500
Received: from mail-ig0-f193.google.com ([209.85.213.193]:35013
        "EHLO mail-ig0-f193.google.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1752012AbbKZD4d convert rfc822-to-8bit
        (ORCPT ); Wed, 25 Nov 2015 22:56:33 -0500
MIME-Version: 1.0
References: <1448372298-28386-1-git-send-email-tianyu.lan@intel.com>
        <5654722D.4010409@gmail.com> <56552888.90108@intel.com>
        <56556F98.5060507@intel.com>
Date: Wed, 25 Nov 2015 19:56:32 -0800
Subject: Re: [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
From: Alexander Duyck
To: "Dong, Eddie"
Cc: "Lan, Tianyu", "a.motakis@virtualopensystems.com", Alex Williamson,
        "b.reynal@virtualopensystems.com", Bjorn Helgaas, "Wyborny, Carolyn",
        "Skidmore, Donald C", "Jani, Nrupal", Alexander Graf,
        "kvm@vger.kernel.org", Paolo Bonzini, "qemu-devel@nongnu.org",
        "Tantilov, Emil S", Or Gerlitz, "Rustad, Mark D", "Michael S. Tsirkin",
        Eric Auger, intel-wired-lan, "Kirsher, Jeffrey T", "Brandeburg, Jesse",
        "Ronciak, John", "linux-api@vger.kernel.org",
        "linux-kernel@vger.kernel.org", "Williams, Mitch A", Netdev,
        "Nelson, Shannon", Wei Yang, "zajec5@gmail.com"
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8BIT
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Nov 25, 2015 at 7:15 PM, Dong, Eddie wrote:
>> On Wed, Nov 25, 2015 at 12:21 AM, Lan Tianyu wrote:
>> > On 2015-11-25 13:30, Alexander Duyck wrote:
>> >> No, what I am getting at is that you can't go around and modify the
>> >> configuration space for every possible device out there. This
>> >> solution won't scale.
>> >
>> >
>> > PCI config space regs are emulated by QEMU, so we can find free
>> > PCI config space regs for the faked PCI capability. Its position
>> > need not be permanent.
>>
>> Yes, but do you really want to edit every driver on every OS that you plan to
>> support this on? What about things like direct assignment of regular Ethernet
>> ports? What you really need is a solution that will work generically on any
>> existing piece of hardware out there.
>
> The fundamental assumption of this patch series is to modify the driver in the guest to self-emulate or track the device state, so that the migration may be possible.
> I don't think we can modify the OS without modifying the drivers, even using the PCIe hotplug mechanism.
> In the meantime, modifying the Windows OS is a big challenge given that only Microsoft can do it, while modifying the driver is relatively simple and manageable for device vendors, if the device vendor wants to support state-clone based migration.

The problem is the code you are presenting, even as a proof of concept,
is seriously flawed. It does a poor job of exposing how any of this
can be duplicated for any VF other than the one you are working on.

I am not saying you cannot modify the drivers; however, what you are
doing is far too invasive. Do you seriously plan on modifying all of
the PCI device drivers out there in order to allow any device that
might be direct assigned to a port to support migration? I certainly
hope not. That is why I have said that this solution will not scale.

What I am counter-proposing seems like a very simple proposition. It
can be implemented in two steps.

1. Look at modifying dma_mark_clean(). It is a function called in
the sync and unmap paths of lib/swiotlb.c. If you could somehow
modify it to take care of marking the pages you unmap for Rx as being
dirty, it will get you a good way towards your goal as it will allow
you to continue to do DMA while you are migrating the VM.

2. Look at making use of the existing PCI suspend/resume calls that
are there to support PCI power management. They have everything
needed to allow you to pause and resume DMA for the device before and
after the migration while retaining the driver state. If you can
implement something that allows you to trigger these calls from the
PCI subsystem, such as hot-plug, then you would have a generic solution
that can be easily reproduced for multiple drivers beyond those
supported by ixgbevf.

Thanks.

- Alex
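
For readers following the thread, the two steps proposed above can be sketched in plain userspace C. This is only a toy model under stated assumptions, not code from the patch series or from the kernel: every name here (sketch_dev, dirty_log, sketch_mark_dirty, sketch_rx, sketch_suspend, sketch_resume) is hypothetical, the guest is assumed to have 64 pages of 4 KiB, and a byte-array bitmap stands in for whatever dirty-page log the hypervisor actually keeps. It shows the unmap path marking Rx pages dirty (step 1) and a suspend/resume pair that stops DMA while keeping driver state (step 2).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12   /* assumption: 4 KiB pages */
#define SKETCH_NR_PAGES   64   /* assumption: toy guest with 64 pages */

/* Hypothetical stand-in for the migration dirty-page log. */
struct dirty_log {
    uint8_t bits[SKETCH_NR_PAGES / 8];
};

/* Hypothetical stand-in for a device plus its driver state. */
struct sketch_dev {
    bool dma_enabled;      /* may the device write guest memory? */
    uint32_t rx_packets;   /* driver state that must survive suspend */
};

/* Step 1: what a modified unmap hook might do, namely flag every
 * page covered by the buffer [addr, addr + len) as dirty. */
void sketch_mark_dirty(struct dirty_log *log, uint64_t addr, size_t len)
{
    if (len == 0)
        return;

    uint64_t first = addr >> SKETCH_PAGE_SHIFT;
    uint64_t last = (addr + len - 1) >> SKETCH_PAGE_SHIFT;

    assert(last < SKETCH_NR_PAGES);
    for (uint64_t pfn = first; pfn <= last; pfn++)
        log->bits[pfn / 8] |= (uint8_t)(1u << (pfn % 8));
}

bool sketch_page_is_dirty(const struct dirty_log *log, uint64_t pfn)
{
    assert(pfn < SKETCH_NR_PAGES);
    return (log->bits[pfn / 8] >> (pfn % 8)) & 1u;
}

/* Receive path: the device DMAs into the buffer and the driver then
 * unmaps it, which is where the pages get marked dirty. Returns false
 * if DMA is currently paused. */
bool sketch_rx(struct sketch_dev *dev, struct dirty_log *log,
               uint64_t addr, size_t len)
{
    if (!dev->dma_enabled)
        return false;
    sketch_mark_dirty(log, addr, len);
    dev->rx_packets++;
    return true;
}

/* Step 2: suspend pauses DMA but leaves driver state untouched;
 * resume re-enables DMA after the migration has completed. */
void sketch_suspend(struct sketch_dev *dev)
{
    dev->dma_enabled = false;
}

void sketch_resume(struct sketch_dev *dev)
{
    dev->dma_enabled = true;
}
```

In this model a migration would keep calling sketch_rx() while memory is being copied, re-send whatever pages the log reports dirty, call sketch_suspend() for the final pass, and call sketch_resume() on the destination. Neither function depends on a particular NIC, which is the generic property the proposal is after.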