* [Qemu-devel] [RFC PATCH qemu v2 0/5] vfio: SPAPR IOMMU v2 (memory preregistration support)
@ 2015-07-13 14:56 Alexey Kardashevskiy
  2015-07-13 14:56 ` [Qemu-devel] [RFC PATCH qemu v2 1/5] vfio: Switch from TARGET_PAGE_MASK to qemu_real_host_page_mask Alexey Kardashevskiy
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-13 14:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Michael Roth,
	David Gibson

Here are a few patches to prepare the existing memory listener for handling
memory preregistration for SPAPR guests running on POWER8.

This used to be part of the DDW patchset but is now separated out as
requested. I have left the version history in the changelog of 5/5 for
convenience.

Please comment, specifically about how many memory listeners we want to
have in VFIO.

The existing listener listens on the PCI address space, which is in fact
RAM on x86 but is not on sPAPR. Memory preregistration requires listening
on RAM for sPAPR too. A listener without a filter is tricky as it will be
notified about VIO address spaces as well, so having more than one listener
makes some sense. Any input is welcome.
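
For reference, the two-listener option looks roughly like the sketch below.
This is an illustration only, built from the structures and calls used in
5/5; the wrapper helper name is made up:

    static void vfio_spapr_register_listeners(VFIOContainer *container)
    {
        /* Existing listener: follows the group's PCI address space,
         * i.e. DMA windows on sPAPR, plain RAM on x86. */
        container->iommu_data.spapr.common.listener = vfio_memory_listener;
        memory_listener_register(&container->iommu_data.spapr.common.listener,
                                 container->space->as);

        /* Second listener: RAM only, so it is never notified about
         * VIO address spaces. */
        container->iommu_data.spapr.ram_listener = vfio_spapr_ram_listener;
        memory_listener_register(&container->iommu_data.spapr.ram_listener,
                                 &address_space_memory);
    }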

Thanks!


Alexey Kardashevskiy (5):
  vfio: Switch from TARGET_PAGE_MASK to qemu_real_host_page_mask
  vfio: Skip PCI BARs in memory listener
  vfio: Store IOMMU type in container
  vfio: Refactor memory listener to accommodate more IOMMU types
  vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)

 hw/vfio/common.c              | 207 ++++++++++++++++++++++++++++++++----------
 hw/vfio/pci.c                 |  30 +++---
 include/hw/vfio/vfio-common.h |   7 ++
 trace-events                  |   2 +
 4 files changed, 183 insertions(+), 63 deletions(-)

-- 
2.4.0.rc3.8.gfb3e7d5


* [Qemu-devel] [RFC PATCH qemu v2 1/5] vfio: Switch from TARGET_PAGE_MASK to qemu_real_host_page_mask
  2015-07-13 14:56 [Qemu-devel] [RFC PATCH qemu v2 0/5] vfio: SPAPR IOMMU v2 (memory preregistration support) Alexey Kardashevskiy
@ 2015-07-13 14:56 ` Alexey Kardashevskiy
  2015-07-13 19:13   ` Alex Williamson
  2015-07-13 14:56 ` [Qemu-devel] [RFC PATCH qemu v2 2/5] vfio: Skip PCI BARs in memory listener Alexey Kardashevskiy
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-13 14:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Michael Roth,
	David Gibson

The commits
4e51361d7 "cpu-all: complete "real" host page size API" and
f7ceed190 "vfio: cpu: Use "real" page size API"
started switching from TARGET_PAGE_MASK (hardcoded as 4K) to the real
host page size.

This finishes the transition by:
- %s/TARGET_PAGE_MASK/qemu_real_host_page_mask/
- %s/TARGET_PAGE_ALIGN/REAL_HOST_PAGE_ALIGN/
- removing the bitfield lengths for offsets in VFIOQuirk::data, as
qemu_real_host_page_mask is a runtime value rather than a compile-time
constant

This keeps using TARGET_PAGE_MASK for IOMMU regions, though, as it is
the minimum page size which IOMMU regions may use and, at the moment,
memory regions do not carry the actual page size.
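
As an aside, the arithmetic the conversion relies on can be illustrated
with a standalone snippet (not QEMU code; the page size and address are
arbitrary). With a runtime mask such as qemu_real_host_page_mask,
"~mask + 1" is the page size, which is why ROUND_UP(x, ~page_mask + 1)
replaces TARGET_PAGE_ALIGN(), and why the TARGET_PAGE_BITS bitfield
widths have to go - a bitfield width must be a compile-time constant:

    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t page_size = 64 * 1024;         /* e.g. a 64K host page */
        uint64_t mask = ~(page_size - 1);       /* like qemu_real_host_page_mask */
        uint64_t addr = 0x12345;

        uint64_t down = addr & mask;            /* round down to page start */
        uint64_t up   = (addr + ~mask) & mask;  /* ROUND_UP(addr, ~mask + 1) */

        printf("page 0x%" PRIx64 ": 0x%" PRIx64 " -> down 0x%" PRIx64
               ", up 0x%" PRIx64 "\n", ~mask + 1, addr, down, up);
        return 0;
    }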

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

In reality DMA windows are always a lot bigger than a single 4K page
and aligned to 32/64MB; maybe we should just use qemu_real_host_page_mask
here in all cases?
---
 hw/vfio/common.c | 26 ++++++++++++++++----------
 hw/vfio/pci.c    | 30 +++++++++++++++---------------
 2 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 85ee9b0..d115ec9 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -321,6 +321,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
     Int128 llend;
     void *vaddr;
     int ret;
+    const bool is_iommu = memory_region_is_iommu(section->mr);
+    const hwaddr page_mask =
+        is_iommu ? TARGET_PAGE_MASK : qemu_real_host_page_mask;
 
     if (vfio_listener_skipped_section(section)) {
         trace_vfio_listener_region_add_skip(
@@ -330,16 +333,16 @@ static void vfio_listener_region_add(MemoryListener *listener,
         return;
     }
 
-    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
-                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
+    if (unlikely((section->offset_within_address_space & ~page_mask) !=
+                 (section->offset_within_region & ~page_mask))) {
         error_report("%s received unaligned region", __func__);
         return;
     }
 
-    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
     llend = int128_make64(section->offset_within_address_space);
     llend = int128_add(llend, section->size);
-    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
+    llend = int128_and(llend, int128_exts64(page_mask));
 
     if (int128_ge(int128_make64(iova), llend)) {
         return;
@@ -347,7 +350,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
 
     memory_region_ref(section->mr);
 
-    if (memory_region_is_iommu(section->mr)) {
+    if (is_iommu) {
         VFIOGuestIOMMU *giommu;
 
         trace_vfio_listener_region_add_iommu(iova,
@@ -423,6 +426,9 @@ static void vfio_listener_region_del(MemoryListener *listener,
                                             iommu_data.type1.listener);
     hwaddr iova, end;
     int ret;
+    const bool is_iommu = memory_region_is_iommu(section->mr);
+    const hwaddr page_mask =
+        is_iommu ? TARGET_PAGE_MASK : qemu_real_host_page_mask;
 
     if (vfio_listener_skipped_section(section)) {
         trace_vfio_listener_region_del_skip(
@@ -432,13 +438,13 @@ static void vfio_listener_region_del(MemoryListener *listener,
         return;
     }
 
-    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
-                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
+    if (unlikely((section->offset_within_address_space & ~page_mask) !=
+                 (section->offset_within_region & ~page_mask))) {
         error_report("%s received unaligned region", __func__);
         return;
     }
 
-    if (memory_region_is_iommu(section->mr)) {
+    if (is_iommu) {
         VFIOGuestIOMMU *giommu;
 
         QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
@@ -459,9 +465,9 @@ static void vfio_listener_region_del(MemoryListener *listener,
          */
     }
 
-    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
     end = (section->offset_within_address_space + int128_get64(section->size)) &
-          TARGET_PAGE_MASK;
+          page_mask;
 
     if (iova >= end) {
         return;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 2ed877f..7694afe 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -50,16 +50,16 @@ typedef struct VFIOQuirk {
     struct VFIOPCIDevice *vdev;
     QLIST_ENTRY(VFIOQuirk) next;
     struct {
-        uint32_t base_offset:TARGET_PAGE_BITS;
-        uint32_t address_offset:TARGET_PAGE_BITS;
+        uint32_t base_offset;
+        uint32_t address_offset;
         uint32_t address_size:3;
         uint32_t bar:3;
 
         uint32_t address_match;
         uint32_t address_mask;
 
-        uint32_t address_val:TARGET_PAGE_BITS;
-        uint32_t data_offset:TARGET_PAGE_BITS;
+        uint32_t address_val;
+        uint32_t data_offset;
         uint32_t data_size:3;
 
         uint8_t flags;
@@ -1319,8 +1319,8 @@ static uint64_t vfio_generic_quirk_read(void *opaque,
 {
     VFIOQuirk *quirk = opaque;
     VFIOPCIDevice *vdev = quirk->vdev;
-    hwaddr base = quirk->data.address_match & TARGET_PAGE_MASK;
-    hwaddr offset = quirk->data.address_match & ~TARGET_PAGE_MASK;
+    hwaddr base = quirk->data.address_match & qemu_real_host_page_mask;
+    hwaddr offset = quirk->data.address_match & ~qemu_real_host_page_mask;
     uint64_t data;
 
     if (vfio_flags_enabled(quirk->data.flags, quirk->data.read_flags) &&
@@ -1349,8 +1349,8 @@ static void vfio_generic_quirk_write(void *opaque, hwaddr addr,
 {
     VFIOQuirk *quirk = opaque;
     VFIOPCIDevice *vdev = quirk->vdev;
-    hwaddr base = quirk->data.address_match & TARGET_PAGE_MASK;
-    hwaddr offset = quirk->data.address_match & ~TARGET_PAGE_MASK;
+    hwaddr base = quirk->data.address_match & qemu_real_host_page_mask;
+    hwaddr offset = quirk->data.address_match & ~qemu_real_host_page_mask;
 
     if (vfio_flags_enabled(quirk->data.flags, quirk->data.write_flags) &&
         ranges_overlap(addr, size, offset, quirk->data.address_mask + 1)) {
@@ -1650,9 +1650,9 @@ static void vfio_probe_ati_bar2_4000_quirk(VFIOPCIDevice *vdev, int nr)
 
     memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_generic_quirk, quirk,
                           "vfio-ati-bar2-4000-quirk",
-                          TARGET_PAGE_ALIGN(quirk->data.address_mask + 1));
+                          REAL_HOST_PAGE_ALIGN(quirk->data.address_mask + 1));
     memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem,
-                          quirk->data.address_match & TARGET_PAGE_MASK,
+                          quirk->data.address_match & qemu_real_host_page_mask,
                           &quirk->mem, 1);
 
     QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);
@@ -1888,7 +1888,7 @@ static void vfio_nvidia_88000_quirk_write(void *opaque, hwaddr addr,
     VFIOQuirk *quirk = opaque;
     VFIOPCIDevice *vdev = quirk->vdev;
     PCIDevice *pdev = &vdev->pdev;
-    hwaddr base = quirk->data.address_match & TARGET_PAGE_MASK;
+    hwaddr base = quirk->data.address_match & qemu_real_host_page_mask;
 
     vfio_generic_quirk_write(opaque, addr, data, size);
 
@@ -1943,9 +1943,9 @@ static void vfio_probe_nvidia_bar0_88000_quirk(VFIOPCIDevice *vdev, int nr)
 
     memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_nvidia_88000_quirk,
                           quirk, "vfio-nvidia-bar0-88000-quirk",
-                          TARGET_PAGE_ALIGN(quirk->data.address_mask + 1));
+                          REAL_HOST_PAGE_ALIGN(quirk->data.address_mask + 1));
     memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem,
-                          quirk->data.address_match & TARGET_PAGE_MASK,
+                          quirk->data.address_match & qemu_real_host_page_mask,
                           &quirk->mem, 1);
 
     QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);
@@ -1980,9 +1980,9 @@ static void vfio_probe_nvidia_bar0_1800_quirk(VFIOPCIDevice *vdev, int nr)
 
     memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_generic_quirk, quirk,
                           "vfio-nvidia-bar0-1800-quirk",
-                          TARGET_PAGE_ALIGN(quirk->data.address_mask + 1));
+                          REAL_HOST_PAGE_ALIGN(quirk->data.address_mask + 1));
     memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem,
-                          quirk->data.address_match & TARGET_PAGE_MASK,
+                          quirk->data.address_match & qemu_real_host_page_mask,
                           &quirk->mem, 1);
 
     QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);
-- 
2.4.0.rc3.8.gfb3e7d5


* [Qemu-devel] [RFC PATCH qemu v2 2/5] vfio: Skip PCI BARs in memory listener
  2015-07-13 14:56 [Qemu-devel] [RFC PATCH qemu v2 0/5] vfio: SPAPR IOMMU v2 (memory preregistration support) Alexey Kardashevskiy
  2015-07-13 14:56 ` [Qemu-devel] [RFC PATCH qemu v2 1/5] vfio: Switch from TARGET_PAGE_MASK to qemu_real_host_page_mask Alexey Kardashevskiy
@ 2015-07-13 14:56 ` Alexey Kardashevskiy
  2015-07-13 14:56 ` [Qemu-devel] [RFC PATCH qemu v2 3/5] vfio: Store IOMMU type in container Alexey Kardashevskiy
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-13 14:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Michael Roth,
	David Gibson

In some cases PCI BARs are registered as RAM via
memory_region_init_ram_ptr() and the vfio_memory_listener is called on
them too. However, DMA is never performed to/from these regions, so
just skip them.
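
For clarity, the resulting predicate looks roughly like this (the first
two checks are the pre-existing ones, paraphrased here; only the last
line is new):

    static bool vfio_listener_skipped_section(MemoryRegionSection *section)
    {
        /* Only RAM and IOMMU regions are interesting for DMA mapping. */
        return (!memory_region_is_ram(section->mr) &&
                !memory_region_is_iommu(section->mr)) ||
               /* Spurious high mappings from sizing 64-bit BARs; never
                * accessed by the CPU. */
               section->offset_within_address_space & (1ULL << 63) ||
               /* New: BARs exposed as RAM via memory_region_init_ram_ptr();
                * no DMA goes to/from them. */
               memory_region_is_skip_dump(section->mr);
    }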

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/vfio/common.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index d115ec9..225cdc7 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -248,7 +248,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
             * are never accessed by the CPU and beyond the address width of
             * some IOMMU hardware.  TODO: VFIO should tell us the IOMMU width.
             */
-           section->offset_within_address_space & (1ULL << 63);
+           section->offset_within_address_space & (1ULL << 63) ||
+           memory_region_is_skip_dump(section->mr);
 }
 
 static void vfio_iommu_map_notify(Notifier *n, void *data)
-- 
2.4.0.rc3.8.gfb3e7d5


* [Qemu-devel] [RFC PATCH qemu v2 3/5] vfio: Store IOMMU type in container
  2015-07-13 14:56 [Qemu-devel] [RFC PATCH qemu v2 0/5] vfio: SPAPR IOMMU v2 (memory preregistration support) Alexey Kardashevskiy
  2015-07-13 14:56 ` [Qemu-devel] [RFC PATCH qemu v2 1/5] vfio: Switch from TARGET_PAGE_MASK to qemu_real_host_page_mask Alexey Kardashevskiy
  2015-07-13 14:56 ` [Qemu-devel] [RFC PATCH qemu v2 2/5] vfio: Skip PCI BARs in memory listener Alexey Kardashevskiy
@ 2015-07-13 14:56 ` Alexey Kardashevskiy
  2015-07-13 14:56 ` [Qemu-devel] [RFC PATCH qemu v2 4/5] vfio: Refactor memory listener to accommodate more IOMMU types Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 12+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-13 14:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Michael Roth,
	David Gibson

So far we have managed not to store the IOMMU type anywhere, but since
we are going to implement different behavior for different IOMMU types
in the same memory listener, we need to know the IOMMU type after
initialization.

This adds an IOMMU type to VFIOContainer and initializes it. This also
adds SPAPR IOMMU data to the iommu_data union; for now it only includes
the existing Type1 data struct. Since zero is not used for any type,
no additional initialization is necessary for VFIOContainer::type.

This should cause no behavioral change.
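
The stored type is what later lets a single listener behave differently
per IOMMU backend; a minimal sketch of the intended use (the actual
dispatch is introduced in 4/5 and 5/5):

    switch (container->iommu_data.type) {
    case VFIO_TYPE1_IOMMU:
    case VFIO_TYPE1v2_IOMMU:
        /* map/unmap DMA via vfio_dma_map()/vfio_dma_unmap() */
        break;
    case VFIO_SPAPR_TCE_IOMMU:
        /* sPAPR TCE specific handling */
        break;
    }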

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v2:
* added VFIOContainer::iommu_data::spapr
---
 hw/vfio/common.c              | 11 ++++++-----
 include/hw/vfio/vfio-common.h |  6 ++++++
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 225cdc7..510ab64 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -683,8 +683,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto free_container_exit;
         }
 
-        ret = ioctl(fd, VFIO_SET_IOMMU,
-                    v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU);
+        container->iommu_data.type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU;
+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_data.type);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
             ret = -errno;
@@ -712,7 +712,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             ret = -errno;
             goto free_container_exit;
         }
-        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
+        container->iommu_data.type = VFIO_SPAPR_TCE_IOMMU;
+        ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_data.type);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
             ret = -errno;
@@ -731,10 +732,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
             goto free_container_exit;
         }
 
-        container->iommu_data.type1.listener = vfio_memory_listener;
+        container->iommu_data.spapr.common.listener = vfio_memory_listener;
         container->iommu_data.release = vfio_listener_release;
 
-        memory_listener_register(&container->iommu_data.type1.listener,
+        memory_listener_register(&container->iommu_data.spapr.common.listener,
                                  container->space->as);
 
     } else {
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 59a321d..135ea64 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -70,13 +70,19 @@ typedef struct VFIOType1 {
     bool initialized;
 } VFIOType1;
 
+typedef struct VFIOSPAPR {
+    VFIOType1 common;
+} VFIOSPAPR;
+
 typedef struct VFIOContainer {
     VFIOAddressSpace *space;
     int fd; /* /dev/vfio/vfio, empowered by the attached groups */
     struct {
         /* enable abstraction to support various iommu backends */
+        unsigned type;
         union {
             VFIOType1 type1;
+            VFIOSPAPR spapr;
         };
         void (*release)(struct VFIOContainer *);
     } iommu_data;
-- 
2.4.0.rc3.8.gfb3e7d5


* [Qemu-devel] [RFC PATCH qemu v2 4/5] vfio: Refactor memory listener to accommodate more IOMMU types
  2015-07-13 14:56 [Qemu-devel] [RFC PATCH qemu v2 0/5] vfio: SPAPR IOMMU v2 (memory preregistration support) Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2015-07-13 14:56 ` [Qemu-devel] [RFC PATCH qemu v2 3/5] vfio: Store IOMMU type in container Alexey Kardashevskiy
@ 2015-07-13 14:56 ` Alexey Kardashevskiy
  2015-07-13 14:56 ` [Qemu-devel] [RFC PATCH qemu v2 5/5] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
  2015-07-13 19:37 ` [Qemu-devel] [RFC PATCH qemu v2 0/5] vfio: SPAPR IOMMU v2 (memory preregistration support) Alex Williamson
  5 siblings, 0 replies; 12+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-13 14:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Michael Roth,
	David Gibson

The vfio_memory_listener is registered for the PCI address space. With
the Type1 IOMMU that falls back to @address_space_memory and the
listener is called on RAM blocks; on sPAPR the IOMMU is guest visible
and the listener is called on DMA windows. Therefore the Type1 IOMMU
only handled RAM regions and the sPAPR IOMMU only handled IOMMU regions.

With memory preregistration, there is a need to handle RAM regions in
the listener for the sPAPR IOMMU too. To make the next patches
simpler/nicer, this moves the DMA mapping/unmapping code into a case of
a switch statement.

This also separates the actual listener event handlers from the
callbacks; the handlers will be reused later with another memory
listener.

This should cause no change in behavior.
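
The callback/handler split is the usual container_of() pattern: the thin
MemoryListener callbacks only recover their VFIOContainer and forward to
a shared handler, so a second listener embedded in the same container
can reuse the handler. A standalone illustration (not QEMU code):

    #include <stddef.h>
    #include <stdio.h>

    #define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

    struct listener {
        void (*region_add)(struct listener *l, int region);
    };

    struct container {
        struct listener pci_listener;  /* follows the PCI address space */
        struct listener ram_listener;  /* follows plain RAM (see 5/5) */
    };

    /* Shared handler: all the real work happens here. */
    static void do_region_add(struct container *c, int region)
    {
        printf("container %p: add region %d\n", (void *)c, region);
    }

    /* Thin callbacks: recover the container, then forward. */
    static void pci_region_add(struct listener *l, int region)
    {
        do_region_add(container_of(l, struct container, pci_listener), region);
    }

    static void ram_region_add(struct listener *l, int region)
    {
        do_region_add(container_of(l, struct container, ram_listener), region);
    }

    int main(void)
    {
        struct container c = {
            .pci_listener = { .region_add = pci_region_add },
            .ram_listener = { .region_add = ram_region_add },
        };

        c.pci_listener.region_add(&c.pci_listener, 1);
        c.ram_listener.region_add(&c.ram_listener, 2);
        return 0;
    }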

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v2:
* split vfio_listener_region_add and vfio_listener_region_del to
listener callbacks and actual handlers; the latter will be reused in
memory preregistration
---
 hw/vfio/common.c | 77 ++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 52 insertions(+), 25 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 510ab64..daff0d9 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -313,11 +313,10 @@ out:
     rcu_read_unlock();
 }
 
-static void vfio_listener_region_add(MemoryListener *listener,
-                                     MemoryRegionSection *section)
+static void vfio_listener_region_do_add(VFIOContainer *container,
+                                        MemoryListener *listener,
+                                        MemoryRegionSection *section)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer,
-                                            iommu_data.type1.listener);
     hwaddr iova, end;
     Int128 llend;
     void *vaddr;
@@ -397,34 +396,43 @@ static void vfio_listener_region_add(MemoryListener *listener,
             section->offset_within_region +
             (iova - section->offset_within_address_space);
 
-    trace_vfio_listener_region_add_ram(iova, end - 1, vaddr);
+    switch (container->iommu_data.type) {
+    case VFIO_TYPE1_IOMMU:
+    case VFIO_TYPE1v2_IOMMU:
+        trace_vfio_listener_region_add_ram(iova, end - 1, vaddr);
 
-    ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
-    if (ret) {
-        error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
-                     "0x%"HWADDR_PRIx", %p) = %d (%m)",
-                     container, iova, end - iova, vaddr, ret);
+        ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
+        if (ret) {
+            error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
+                         "0x%"HWADDR_PRIx", %p) = %d (%m)",
+                         container, iova, end - iova, vaddr, ret);
+            goto error_exit;
+        }
+        break;
+    }
+
+    return;
+
+error_exit:
 
-        /*
-         * On the initfn path, store the first error in the container so we
-         * can gracefully fail.  Runtime, there's not much we can do other
-         * than throw a hardware error.
-         */
-        if (!container->iommu_data.type1.initialized) {
-            if (!container->iommu_data.type1.error) {
-                container->iommu_data.type1.error = ret;
-            }
-        } else {
-            hw_error("vfio: DMA mapping failed, unable to continue");
+    /*
+     * On the initfn path, store the first error in the container so we
+     * can gracefully fail.  Runtime, there's not much we can do other
+     * than throw a hardware error.
+     */
+    if (!container->iommu_data.type1.initialized) {
+        if (!container->iommu_data.type1.error) {
+            container->iommu_data.type1.error = ret;
         }
+    } else {
+        hw_error("vfio: DMA mapping failed, unable to continue");
     }
 }
 
-static void vfio_listener_region_del(MemoryListener *listener,
-                                     MemoryRegionSection *section)
+static void vfio_listener_region_do_del(VFIOContainer *container,
+                                        MemoryListener *listener,
+                                        MemoryRegionSection *section)
 {
-    VFIOContainer *container = container_of(listener, VFIOContainer,
-                                            iommu_data.type1.listener);
     hwaddr iova, end;
     int ret;
     const bool is_iommu = memory_region_is_iommu(section->mr);
@@ -485,6 +493,25 @@ static void vfio_listener_region_del(MemoryListener *listener,
     }
 }
 
+static void vfio_listener_region_add(MemoryListener *listener,
+                                     MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            iommu_data.type1.listener);
+
+    vfio_listener_region_do_add(container, listener, section);
+}
+
+
+static void vfio_listener_region_del(MemoryListener *listener,
+                                     MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            iommu_data.type1.listener);
+
+    vfio_listener_region_do_del(container, listener, section);
+}
+
 static const MemoryListener vfio_memory_listener = {
     .region_add = vfio_listener_region_add,
     .region_del = vfio_listener_region_del,
-- 
2.4.0.rc3.8.gfb3e7d5


* [Qemu-devel] [RFC PATCH qemu v2 5/5] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering)
  2015-07-13 14:56 [Qemu-devel] [RFC PATCH qemu v2 0/5] vfio: SPAPR IOMMU v2 (memory preregistration support) Alexey Kardashevskiy
                   ` (3 preceding siblings ...)
  2015-07-13 14:56 ` [Qemu-devel] [RFC PATCH qemu v2 4/5] vfio: Refactor memory listener to accommodate more IOMMU types Alexey Kardashevskiy
@ 2015-07-13 14:56 ` Alexey Kardashevskiy
  2015-07-13 19:37 ` [Qemu-devel] [RFC PATCH qemu v2 0/5] vfio: SPAPR IOMMU v2 (memory preregistration support) Alex Williamson
  5 siblings, 0 replies; 12+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-13 14:56 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Michael Roth,
	David Gibson

This makes use of the new "memory registering" feature. The idea is
to give userspace the ability to notify the host kernel about pages
which are going to be used for DMA. Having this information, the host
kernel can pin them all once per user process, do locked-pages
accounting (once) and not spend time doing that at run time, with
possible failures which cannot be handled nicely in some cases.

This adds a guest RAM memory listener which notifies a VFIO container
about memory which needs to be pinned/unpinned. VFIO MMIO regions
(i.e. "skip dump" regions) are skipped. This does not reuse the existing
listener without filters: at the moment of listener registration we
know exactly what we want to listen on, but it is tricky to do the
filtering from inside the listener as it will be called for IBM VIO
address spaces too (in addition to PCI and RAM).

The feature is only enabled for SPAPR IOMMU v2; host kernel changes
are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this
does not call it when v2 is detected and enabled.

This does not change the guest visible interface.
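
At the VFIO level, preregistration boils down to one ioctl per RAM
section; a sketch of the call (mirroring the hunk below - the helper
name is made up, and linux/vfio.h must come from a host kernel with
SPAPR IOMMU v2 support):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    static int vfio_prereg_range(int container_fd, void *vaddr, uint64_t size)
    {
        struct vfio_iommu_spapr_register_memory reg = {
            .argsz = sizeof(reg),
            .flags = 0,
            .vaddr = (uint64_t)(uintptr_t)vaddr,
            .size  = size,
        };

        /* Paired with VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY on region_del. */
        return ioctl(container_fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
    }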

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v2:
* added another listener for RAM


Changelog from the times when this was a part of DDW patchset:
v11:
* merged register_listener into vfio_memory_listener

v9:
* since there is no more SPAPR-specific data in container::iommu_data,
the memory preregistration fields are common and potentially can be used
by other architectures

v7:
* in vfio_spapr_ram_listener_region_del(), do unref() after ioctl()
* s'ramlistener'register_listener'

v6:
* fixed commit log (s/guest/userspace/), added note about no guest visible
change
* fixed error checking if ram registration failed
* added alignment check for section->offset_within_region

v5:
* simplified the patch
* added trace points
* added round_up() for the size
* SPAPR IOMMU v2 used
---
 hw/vfio/common.c              | 92 +++++++++++++++++++++++++++++++++++++++----
 include/hw/vfio/vfio-common.h |  1 +
 trace-events                  |  2 +
 3 files changed, 87 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index daff0d9..d8ce04a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -409,6 +409,19 @@ static void vfio_listener_region_do_add(VFIOContainer *container,
             goto error_exit;
         }
         break;
+
+    case VFIO_SPAPR_TCE_v2_IOMMU: {
+        struct vfio_iommu_spapr_register_memory reg = {
+            .argsz = sizeof(reg),
+            .flags = 0,
+            .vaddr = (uint64_t) vaddr,
+            .size = end - iova
+        };
+
+        ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_REGISTER_MEMORY, &reg);
+        trace_vfio_ram_register(reg.vaddr, reg.size, ret ? -errno : 0);
+        break;
+    }
     }
 
     return;
@@ -491,6 +504,26 @@ static void vfio_listener_region_do_del(VFIOContainer *container,
                      "0x%"HWADDR_PRIx") = %d (%m)",
                      container, iova, end - iova, ret);
     }
+
+    switch (container->iommu_data.type) {
+    case VFIO_SPAPR_TCE_v2_IOMMU:
+        if (!is_iommu) {
+            void *vaddr = memory_region_get_ram_ptr(section->mr) +
+                section->offset_within_region +
+                (iova - section->offset_within_address_space);
+            struct vfio_iommu_spapr_register_memory reg = {
+                .argsz = sizeof(reg),
+                .flags = 0,
+                .vaddr = (uint64_t) vaddr,
+                .size = end - iova
+            };
+
+            ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY,
+                        &reg);
+            trace_vfio_ram_unregister(reg.vaddr, reg.size, ret ? -errno : 0);
+        }
+        break;
+    }
 }
 
 static void vfio_listener_region_add(MemoryListener *listener,
@@ -517,8 +550,35 @@ static const MemoryListener vfio_memory_listener = {
     .region_del = vfio_listener_region_del,
 };
 
+static void vfio_spapr_ram_listener_region_add(MemoryListener *listener,
+                                               MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            iommu_data.spapr.ram_listener);
+
+    vfio_listener_region_do_add(container, listener, section);
+}
+
+
+static void vfio_spapr_ram_listener_region_del(MemoryListener *listener,
+                                               MemoryRegionSection *section)
+{
+    VFIOContainer *container = container_of(listener, VFIOContainer,
+                                            iommu_data.spapr.ram_listener);
+
+    vfio_listener_region_do_del(container, listener, section);
+}
+
+static const MemoryListener vfio_spapr_ram_listener = {
+    .region_add = vfio_spapr_ram_listener_region_add,
+    .region_del = vfio_spapr_ram_listener_region_del,
+};
+
 static void vfio_listener_release(VFIOContainer *container)
 {
+    if (container->iommu_data.type == VFIO_SPAPR_TCE_v2_IOMMU) {
+        memory_listener_unregister(&container->iommu_data.spapr.ram_listener);
+    }
     memory_listener_unregister(&container->iommu_data.type1.listener);
 }
 
@@ -732,14 +792,18 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
 
         container->iommu_data.type1.initialized = true;
 
-    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) ||
+               ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) {
+        bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU);
+
         ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
         if (ret) {
             error_report("vfio: failed to set group container: %m");
             ret = -errno;
             goto free_container_exit;
         }
-        container->iommu_data.type = VFIO_SPAPR_TCE_IOMMU;
+        container->iommu_data.type =
+            v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU;
         ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_data.type);
         if (ret) {
             error_report("vfio: failed to set iommu for container: %m");
@@ -752,18 +816,30 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
          * when container fd is closed so we do not call it explicitly
          * in this file.
          */
-        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
-        if (ret) {
-            error_report("vfio: failed to enable container: %m");
-            ret = -errno;
-            goto free_container_exit;
+        if (!v2) {
+            ret = ioctl(fd, VFIO_IOMMU_ENABLE);
+            if (ret) {
+                error_report("vfio: failed to enable container: %m");
+                ret = -errno;
+                goto free_container_exit;
+            }
         }
 
         container->iommu_data.spapr.common.listener = vfio_memory_listener;
         container->iommu_data.release = vfio_listener_release;
-
         memory_listener_register(&container->iommu_data.spapr.common.listener,
                                  container->space->as);
+        if (v2) {
+            container->iommu_data.spapr.ram_listener = vfio_spapr_ram_listener;
+            memory_listener_register(&container->iommu_data.spapr.ram_listener,
+                                     &address_space_memory);
+            if (container->iommu_data.spapr.common.error) {
+                error_report("vfio: RAM memory listener initialization failed for container");
+                goto listener_release_exit;
+            }
+
+            container->iommu_data.spapr.common.initialized = true;
+        }
 
     } else {
         error_report("vfio: No available IOMMU models");
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 135ea64..8edd572 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -72,6 +72,7 @@ typedef struct VFIOType1 {
 
 typedef struct VFIOSPAPR {
     VFIOType1 common;
+    MemoryListener ram_listener;
 } VFIOSPAPR;
 
 typedef struct VFIOContainer {
diff --git a/trace-events b/trace-events
index d24d80a..f859ad0 100644
--- a/trace-events
+++ b/trace-events
@@ -1582,6 +1582,8 @@ vfio_disconnect_container(int fd) "close container->fd=%d"
 vfio_put_group(int fd) "close group->fd=%d"
 vfio_get_device(const char * name, unsigned int flags, unsigned int num_regions, unsigned int num_irqs) "Device %s flags: %u, regions: %u, irqs: %u"
 vfio_put_base_device(int fd) "close vdev->fd=%d"
+vfio_ram_register(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
+vfio_ram_unregister(uint64_t va, uint64_t size, int ret) "va=%"PRIx64" size=%"PRIx64" ret=%d"
 
 # hw/vfio/platform.c
 vfio_platform_populate_regions(int region_index, unsigned long flag, unsigned long size, int fd, unsigned long offset) "- region %d flags = 0x%lx, size = 0x%lx, fd= %d, offset = 0x%lx"
-- 
2.4.0.rc3.8.gfb3e7d5


* Re: [Qemu-devel] [RFC PATCH qemu v2 1/5] vfio: Switch from TARGET_PAGE_MASK to qemu_real_host_page_mask
  2015-07-13 14:56 ` [Qemu-devel] [RFC PATCH qemu v2 1/5] vfio: Switch from TARGET_PAGE_MASK to qemu_real_host_page_mask Alexey Kardashevskiy
@ 2015-07-13 19:13   ` Alex Williamson
  2015-07-14  6:58     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2015-07-13 19:13 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Michael Roth, qemu-ppc, qemu-devel, David Gibson

On Tue, 2015-07-14 at 00:56 +1000, Alexey Kardashevskiy wrote:
> These started switching from TARGET_PAGE_MASK (hardcoded as 4K) to
> a real host page size:
> 4e51361d7 "cpu-all: complete "real" host page size API" and
> f7ceed190 "vfio: cpu: Use "real" page size API"
> 
> This finished the transition by:
> - %s/TARGET_PAGE_MASK/qemu_real_host_page_mask/
> - %s/TARGET_PAGE_ALIGN/REAL_HOST_PAGE_ALIGN/
> - removing bitfield length for offsets in VFIOQuirk::data as
> qemu_real_host_page_mask is not a macro

Can we assume that none of the changes to quirks have actually been
tested?  I don't really support them being bundled in here since they
really aren't related to what you're doing.  For DMA we generally want
to be host IOMMU page aligned, which we can generally assume is the same
as host page aligned, but quirks are simply I/O regions, so I think they
ought to continue to be target page aligned.

> This keeps using TARGET_PAGE_MASK for IOMMU regions though as it is
> the minimum page size which IOMMU regions may be using and at the moment
> memory regions do not carry the actual page size.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> 
> In reality DMA windows are always a lot bigger than a single 4K page
> and aligned to 32/64MB, may be only use here qemu_real_host_page_mask?

I don't understand what this is asking either.  While the bulk of memory
is going to be mapped in larger chunks, we do occasionally see 4k
mappings on x86, particularly in some of the legacy low memory areas.

> ---
>  hw/vfio/common.c | 26 ++++++++++++++++----------
>  hw/vfio/pci.c    | 30 +++++++++++++++---------------
>  2 files changed, 31 insertions(+), 25 deletions(-)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 85ee9b0..d115ec9 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -321,6 +321,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
>      Int128 llend;
>      void *vaddr;
>      int ret;
> +    const bool is_iommu = memory_region_is_iommu(section->mr);
> +    const hwaddr page_mask =
> +        is_iommu ? TARGET_PAGE_MASK : qemu_real_host_page_mask;
>  
>      if (vfio_listener_skipped_section(section)) {
>          trace_vfio_listener_region_add_skip(
> @@ -330,16 +333,16 @@ static void vfio_listener_region_add(MemoryListener *listener,
>          return;
>      }
>  
> -    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
> -                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
> +                 (section->offset_within_region & ~page_mask))) {
>          error_report("%s received unaligned region", __func__);
>          return;
>      }
>  
> -    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
>      llend = int128_make64(section->offset_within_address_space);
>      llend = int128_add(llend, section->size);
> -    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
> +    llend = int128_and(llend, int128_exts64(page_mask));
>  
>      if (int128_ge(int128_make64(iova), llend)) {
>          return;
> @@ -347,7 +350,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>  
>      memory_region_ref(section->mr);
>  
> -    if (memory_region_is_iommu(section->mr)) {
> +    if (is_iommu) {
>          VFIOGuestIOMMU *giommu;
>  
>          trace_vfio_listener_region_add_iommu(iova,
> @@ -423,6 +426,9 @@ static void vfio_listener_region_del(MemoryListener *listener,
>                                              iommu_data.type1.listener);
>      hwaddr iova, end;
>      int ret;
> +    const bool is_iommu = memory_region_is_iommu(section->mr);
> +    const hwaddr page_mask =
> +        is_iommu ? TARGET_PAGE_MASK : qemu_real_host_page_mask;
>  
>      if (vfio_listener_skipped_section(section)) {
>          trace_vfio_listener_region_del_skip(
> @@ -432,13 +438,13 @@ static void vfio_listener_region_del(MemoryListener *listener,
>          return;
>      }
>  
> -    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
> -                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
> +                 (section->offset_within_region & ~page_mask))) {
>          error_report("%s received unaligned region", __func__);
>          return;
>      }
>  
> -    if (memory_region_is_iommu(section->mr)) {
> +    if (is_iommu) {
>          VFIOGuestIOMMU *giommu;
>  
>          QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> @@ -459,9 +465,9 @@ static void vfio_listener_region_del(MemoryListener *listener,
>           */
>      }
>  
> -    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> +    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
>      end = (section->offset_within_address_space + int128_get64(section->size)) &
> -          TARGET_PAGE_MASK;
> +          page_mask;
>  
>      if (iova >= end) {
>          return;
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 2ed877f..7694afe 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -50,16 +50,16 @@ typedef struct VFIOQuirk {
>      struct VFIOPCIDevice *vdev;
>      QLIST_ENTRY(VFIOQuirk) next;
>      struct {
> -        uint32_t base_offset:TARGET_PAGE_BITS;
> -        uint32_t address_offset:TARGET_PAGE_BITS;
> +        uint32_t base_offset;
> +        uint32_t address_offset;
>          uint32_t address_size:3;
>          uint32_t bar:3;
>  
>          uint32_t address_match;
>          uint32_t address_mask;
>  
> -        uint32_t address_val:TARGET_PAGE_BITS;
> -        uint32_t data_offset:TARGET_PAGE_BITS;
> +        uint32_t address_val;
> +        uint32_t data_offset;
>          uint32_t data_size:3;
>  
>          uint8_t flags;
> @@ -1319,8 +1319,8 @@ static uint64_t vfio_generic_quirk_read(void *opaque,
>  {
>      VFIOQuirk *quirk = opaque;
>      VFIOPCIDevice *vdev = quirk->vdev;
> -    hwaddr base = quirk->data.address_match & TARGET_PAGE_MASK;
> -    hwaddr offset = quirk->data.address_match & ~TARGET_PAGE_MASK;
> +    hwaddr base = quirk->data.address_match & qemu_real_host_page_mask;
> +    hwaddr offset = quirk->data.address_match & ~qemu_real_host_page_mask;
>      uint64_t data;
>  
>      if (vfio_flags_enabled(quirk->data.flags, quirk->data.read_flags) &&
> @@ -1349,8 +1349,8 @@ static void vfio_generic_quirk_write(void *opaque, hwaddr addr,
>  {
>      VFIOQuirk *quirk = opaque;
>      VFIOPCIDevice *vdev = quirk->vdev;
> -    hwaddr base = quirk->data.address_match & TARGET_PAGE_MASK;
> -    hwaddr offset = quirk->data.address_match & ~TARGET_PAGE_MASK;
> +    hwaddr base = quirk->data.address_match & qemu_real_host_page_mask;
> +    hwaddr offset = quirk->data.address_match & ~qemu_real_host_page_mask;
>  
>      if (vfio_flags_enabled(quirk->data.flags, quirk->data.write_flags) &&
>          ranges_overlap(addr, size, offset, quirk->data.address_mask + 1)) {
> @@ -1650,9 +1650,9 @@ static void vfio_probe_ati_bar2_4000_quirk(VFIOPCIDevice *vdev, int nr)
>  
>      memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_generic_quirk, quirk,
>                            "vfio-ati-bar2-4000-quirk",
> -                          TARGET_PAGE_ALIGN(quirk->data.address_mask + 1));
> +                          REAL_HOST_PAGE_ALIGN(quirk->data.address_mask + 1));
>      memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem,
> -                          quirk->data.address_match & TARGET_PAGE_MASK,
> +                          quirk->data.address_match & qemu_real_host_page_mask,
>                            &quirk->mem, 1);
>  
>      QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);
> @@ -1888,7 +1888,7 @@ static void vfio_nvidia_88000_quirk_write(void *opaque, hwaddr addr,
>      VFIOQuirk *quirk = opaque;
>      VFIOPCIDevice *vdev = quirk->vdev;
>      PCIDevice *pdev = &vdev->pdev;
> -    hwaddr base = quirk->data.address_match & TARGET_PAGE_MASK;
> +    hwaddr base = quirk->data.address_match & qemu_real_host_page_mask;
>  
>      vfio_generic_quirk_write(opaque, addr, data, size);
>  
> @@ -1943,9 +1943,9 @@ static void vfio_probe_nvidia_bar0_88000_quirk(VFIOPCIDevice *vdev, int nr)
>  
>      memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_nvidia_88000_quirk,
>                            quirk, "vfio-nvidia-bar0-88000-quirk",
> -                          TARGET_PAGE_ALIGN(quirk->data.address_mask + 1));
> +                          REAL_HOST_PAGE_ALIGN(quirk->data.address_mask + 1));
>      memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem,
> -                          quirk->data.address_match & TARGET_PAGE_MASK,
> +                          quirk->data.address_match & qemu_real_host_page_mask,
>                            &quirk->mem, 1);
>  
>      QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);
> @@ -1980,9 +1980,9 @@ static void vfio_probe_nvidia_bar0_1800_quirk(VFIOPCIDevice *vdev, int nr)
>  
>      memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_generic_quirk, quirk,
>                            "vfio-nvidia-bar0-1800-quirk",
> -                          TARGET_PAGE_ALIGN(quirk->data.address_mask + 1));
> +                          REAL_HOST_PAGE_ALIGN(quirk->data.address_mask + 1));
>      memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem,
> -                          quirk->data.address_match & TARGET_PAGE_MASK,
> +                          quirk->data.address_match & qemu_real_host_page_mask,
>                            &quirk->mem, 1);
>  
>      QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);


* Re: [Qemu-devel] [RFC PATCH qemu v2 0/5] vfio: SPAPR IOMMU v2 (memory preregistration support)
  2015-07-13 14:56 [Qemu-devel] [RFC PATCH qemu v2 0/5] vfio: SPAPR IOMMU v2 (memory preregistration support) Alexey Kardashevskiy
                   ` (4 preceding siblings ...)
  2015-07-13 14:56 ` [Qemu-devel] [RFC PATCH qemu v2 5/5] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
@ 2015-07-13 19:37 ` Alex Williamson
  5 siblings, 0 replies; 12+ messages in thread
From: Alex Williamson @ 2015-07-13 19:37 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Michael Roth, qemu-ppc, qemu-devel, David Gibson

On Tue, 2015-07-14 at 00:56 +1000, Alexey Kardashevskiy wrote:
> Here are few patches to prepare an existing listener for handling memory
> preregistration for SPAPR guests running on POWER8.
> 
> This used to be a part of DDW patchset but now is separated as requested.
> I left versions in changelog of 5/5 for convenience.
> 
> Please comment, specifically about how many memory listeners we want to
> have in VFIO.
> 
> The existing listener listens on PCI address space which in fact is RAM
> on x86 but is not on sPAPR. Memory preregistration requires listening
> on RAM for sPAPR too. Having a listener without a filter is tricky as it
> will be notified on VIO address spaces too so having more that one listener
> makes some sense. Any input is welcome.

What I think you're doing here is using the same core memory listener
functions wrapped in separate memory listeners so that you can look only
at the address spaces you're interested in, iommu vs ram.  I think that
largely achieves the consolidation I was looking for.  I don't
particularly care how many different ways you register memory listeners,
but I don't want to have a new set of memory listener callbacks that do
largely the same thing as the existing callback.  Thanks,

Alex


* Re: [Qemu-devel] [RFC PATCH qemu v2 1/5] vfio: Switch from TARGET_PAGE_MASK to qemu_real_host_page_mask
  2015-07-13 19:13   ` Alex Williamson
@ 2015-07-14  6:58     ` Alexey Kardashevskiy
  2015-07-14 12:17       ` Alexey Kardashevskiy
  2015-07-14 22:28       ` Alex Williamson
  0 siblings, 2 replies; 12+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-14  6:58 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Michael Roth, qemu-ppc, qemu-devel, David Gibson

On 07/14/2015 05:13 AM, Alex Williamson wrote:
> On Tue, 2015-07-14 at 00:56 +1000, Alexey Kardashevskiy wrote:
>> These started switching from TARGET_PAGE_MASK (hardcoded as 4K) to
>> a real host page size:
>> 4e51361d7 "cpu-all: complete "real" host page size API" and
>> f7ceed190 "vfio: cpu: Use "real" page size API"
>>
>> This finished the transition by:
>> - %s/TARGET_PAGE_MASK/qemu_real_host_page_mask/
>> - %s/TARGET_PAGE_ALIGN/REAL_HOST_PAGE_ALIGN/
>> - removing bitfield length for offsets in VFIOQuirk::data as
>> qemu_real_host_page_mask is not a macro
>
> Can we assume that none of the changes to quirks have actually been
> tested?

No, why? :) I tried it on one of the NVIDIA cards I have here -
VGA compatible controller: NVIDIA Corporation GM107GL [Quadro K2200] (rev a2).
The driver was from NVIDIA (not nouveau) and the test was "acos" (a
basic CUDA test).

> I don't really support them being bundled in here since they
> really aren't related to what you're doing.

This makes sense, I'll move them to a separate patch and add a note about
how it helps on a 64K-page host.

> For DMA we generally want
> to be host IOMMU page aligned,

Do all known IOMMUs use a constant page size? An IOMMU memory region does
not carry an IOMMU page size/mask; I wanted to add one there but I am not
sure it is generic enough.

> which we can generally assume is the same
> as host page aligned,

They are almost never the same on sPAPR for 32bit windows...

> but quirks are simply I/O regions, so I think they
> ought to continue to be target page aligned.

Without s/TARGET_PAGE_MASK/qemu_real_host_page_mask/, the
&vfio_nvidia_88000_quirk region makes kvm_set_phys_mem() fail and exit,
as the size of the section is 0x88000. It still works with x-mmap=false
(or with TCG, I suppose) though.


>> This keeps using TARGET_PAGE_MASK for IOMMU regions though as it is
>> the minimum page size which IOMMU regions may be using and at the moment
>> memory regions do not carry the actual page size.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>
>> In reality DMA windows are always a lot bigger than a single 4K page
>> and aligned to 32/64MB, may be only use here qemu_real_host_page_mask?
>
> I don't understand what this is asking either.  While the bulk of memory
> is going to be mapped in larger chunks, we do occasionally see 4k
> mappings on x86, particularly in some of the legacy low memory areas.


The question was not about individual mappings - these are handled by an
IOMMU memory region notifier; here we are dealing with DMA windows, which
are always megabytes in size, but nothing really prevents a guest from
requesting a 4K _window_ (with a single TCE entry). Whether we want to
support such small windows or not - that was the question.


>
>> ---
>>   hw/vfio/common.c | 26 ++++++++++++++++----------
>>   hw/vfio/pci.c    | 30 +++++++++++++++---------------
>>   2 files changed, 31 insertions(+), 25 deletions(-)
>>
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index 85ee9b0..d115ec9 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -321,6 +321,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>       Int128 llend;
>>       void *vaddr;
>>       int ret;
>> +    const bool is_iommu = memory_region_is_iommu(section->mr);
>> +    const hwaddr page_mask =
>> +        is_iommu ? TARGET_PAGE_MASK : qemu_real_host_page_mask;
>>
>>       if (vfio_listener_skipped_section(section)) {
>>           trace_vfio_listener_region_add_skip(
>> @@ -330,16 +333,16 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>           return;
>>       }
>>
>> -    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
>> -                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
>> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
>> +                 (section->offset_within_region & ~page_mask))) {
>>           error_report("%s received unaligned region", __func__);
>>           return;
>>       }
>>
>> -    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
>> +    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
>>       llend = int128_make64(section->offset_within_address_space);
>>       llend = int128_add(llend, section->size);
>> -    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
>> +    llend = int128_and(llend, int128_exts64(page_mask));
>>
>>       if (int128_ge(int128_make64(iova), llend)) {
>>           return;
>> @@ -347,7 +350,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>
>>       memory_region_ref(section->mr);
>>
>> -    if (memory_region_is_iommu(section->mr)) {
>> +    if (is_iommu) {
>>           VFIOGuestIOMMU *giommu;
>>
>>           trace_vfio_listener_region_add_iommu(iova,
>> @@ -423,6 +426,9 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>                                               iommu_data.type1.listener);
>>       hwaddr iova, end;
>>       int ret;
>> +    const bool is_iommu = memory_region_is_iommu(section->mr);
>> +    const hwaddr page_mask =
>> +        is_iommu ? TARGET_PAGE_MASK : qemu_real_host_page_mask;
>>
>>       if (vfio_listener_skipped_section(section)) {
>>           trace_vfio_listener_region_del_skip(
>> @@ -432,13 +438,13 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>           return;
>>       }
>>
>> -    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
>> -                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
>> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
>> +                 (section->offset_within_region & ~page_mask))) {
>>           error_report("%s received unaligned region", __func__);
>>           return;
>>       }
>>
>> -    if (memory_region_is_iommu(section->mr)) {
>> +    if (is_iommu) {
>>           VFIOGuestIOMMU *giommu;
>>
>>           QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
>> @@ -459,9 +465,9 @@ static void vfio_listener_region_del(MemoryListener *listener,
>>            */
>>       }
>>
>> -    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
>> +    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
>>       end = (section->offset_within_address_space + int128_get64(section->size)) &
>> -          TARGET_PAGE_MASK;
>> +          page_mask;
>>
>>       if (iova >= end) {
>>           return;
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 2ed877f..7694afe 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -50,16 +50,16 @@ typedef struct VFIOQuirk {
>>       struct VFIOPCIDevice *vdev;
>>       QLIST_ENTRY(VFIOQuirk) next;
>>       struct {
>> -        uint32_t base_offset:TARGET_PAGE_BITS;
>> -        uint32_t address_offset:TARGET_PAGE_BITS;
>> +        uint32_t base_offset;
>> +        uint32_t address_offset;
>>           uint32_t address_size:3;
>>           uint32_t bar:3;
>>
>>           uint32_t address_match;
>>           uint32_t address_mask;
>>
>> -        uint32_t address_val:TARGET_PAGE_BITS;
>> -        uint32_t data_offset:TARGET_PAGE_BITS;
>> +        uint32_t address_val;
>> +        uint32_t data_offset;
>>           uint32_t data_size:3;
>>
>>           uint8_t flags;
>> @@ -1319,8 +1319,8 @@ static uint64_t vfio_generic_quirk_read(void *opaque,
>>   {
>>       VFIOQuirk *quirk = opaque;
>>       VFIOPCIDevice *vdev = quirk->vdev;
>> -    hwaddr base = quirk->data.address_match & TARGET_PAGE_MASK;
>> -    hwaddr offset = quirk->data.address_match & ~TARGET_PAGE_MASK;
>> +    hwaddr base = quirk->data.address_match & qemu_real_host_page_mask;
>> +    hwaddr offset = quirk->data.address_match & ~qemu_real_host_page_mask;
>>       uint64_t data;
>>
>>       if (vfio_flags_enabled(quirk->data.flags, quirk->data.read_flags) &&
>> @@ -1349,8 +1349,8 @@ static void vfio_generic_quirk_write(void *opaque, hwaddr addr,
>>   {
>>       VFIOQuirk *quirk = opaque;
>>       VFIOPCIDevice *vdev = quirk->vdev;
>> -    hwaddr base = quirk->data.address_match & TARGET_PAGE_MASK;
>> -    hwaddr offset = quirk->data.address_match & ~TARGET_PAGE_MASK;
>> +    hwaddr base = quirk->data.address_match & qemu_real_host_page_mask;
>> +    hwaddr offset = quirk->data.address_match & ~qemu_real_host_page_mask;
>>
>>       if (vfio_flags_enabled(quirk->data.flags, quirk->data.write_flags) &&
>>           ranges_overlap(addr, size, offset, quirk->data.address_mask + 1)) {
>> @@ -1650,9 +1650,9 @@ static void vfio_probe_ati_bar2_4000_quirk(VFIOPCIDevice *vdev, int nr)
>>
>>       memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_generic_quirk, quirk,
>>                             "vfio-ati-bar2-4000-quirk",
>> -                          TARGET_PAGE_ALIGN(quirk->data.address_mask + 1));
>> +                          REAL_HOST_PAGE_ALIGN(quirk->data.address_mask + 1));
>>       memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem,
>> -                          quirk->data.address_match & TARGET_PAGE_MASK,
>> +                          quirk->data.address_match & qemu_real_host_page_mask,
>>                             &quirk->mem, 1);
>>
>>       QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);
>> @@ -1888,7 +1888,7 @@ static void vfio_nvidia_88000_quirk_write(void *opaque, hwaddr addr,
>>       VFIOQuirk *quirk = opaque;
>>       VFIOPCIDevice *vdev = quirk->vdev;
>>       PCIDevice *pdev = &vdev->pdev;
>> -    hwaddr base = quirk->data.address_match & TARGET_PAGE_MASK;
>> +    hwaddr base = quirk->data.address_match & qemu_real_host_page_mask;
>>
>>       vfio_generic_quirk_write(opaque, addr, data, size);
>>
>> @@ -1943,9 +1943,9 @@ static void vfio_probe_nvidia_bar0_88000_quirk(VFIOPCIDevice *vdev, int nr)
>>
>>       memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_nvidia_88000_quirk,
>>                             quirk, "vfio-nvidia-bar0-88000-quirk",
>> -                          TARGET_PAGE_ALIGN(quirk->data.address_mask + 1));
>> +                          REAL_HOST_PAGE_ALIGN(quirk->data.address_mask + 1));
>>       memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem,
>> -                          quirk->data.address_match & TARGET_PAGE_MASK,
>> +                          quirk->data.address_match & qemu_real_host_page_mask,
>>                             &quirk->mem, 1);
>>
>>       QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);
>> @@ -1980,9 +1980,9 @@ static void vfio_probe_nvidia_bar0_1800_quirk(VFIOPCIDevice *vdev, int nr)
>>
>>       memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_generic_quirk, quirk,
>>                             "vfio-nvidia-bar0-1800-quirk",
>> -                          TARGET_PAGE_ALIGN(quirk->data.address_mask + 1));
>> +                          REAL_HOST_PAGE_ALIGN(quirk->data.address_mask + 1));
>>       memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem,
>> -                          quirk->data.address_match & TARGET_PAGE_MASK,
>> +                          quirk->data.address_match & qemu_real_host_page_mask,
>>                             &quirk->mem, 1);
>>
>>       QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);
>
>
>


-- 
Alexey

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [RFC PATCH qemu v2 1/5] vfio: Switch from TARGET_PAGE_MASK to qemu_real_host_page_mask
  2015-07-14  6:58     ` Alexey Kardashevskiy
@ 2015-07-14 12:17       ` Alexey Kardashevskiy
  2015-07-14 22:28       ` Alex Williamson
  1 sibling, 0 replies; 12+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-14 12:17 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Michael Roth, qemu-ppc, qemu-devel, David Gibson

On 07/14/2015 04:58 PM, Alexey Kardashevskiy wrote:
> On 07/14/2015 05:13 AM, Alex Williamson wrote:
>> On Tue, 2015-07-14 at 00:56 +1000, Alexey Kardashevskiy wrote:
>>> These started switching from TARGET_PAGE_MASK (hardcoded as 4K) to
>>> a real host page size:
>>> 4e51361d7 "cpu-all: complete "real" host page size API" and
>>> f7ceed190 "vfio: cpu: Use "real" page size API"
>>>
>>> This finished the transition by:
>>> - %s/TARGET_PAGE_MASK/qemu_real_host_page_mask/
>>> - %s/TARGET_PAGE_ALIGN/REAL_HOST_PAGE_ALIGN/
>>> - removing bitfield length for offsets in VFIOQuirk::data as
>>> qemu_real_host_page_mask is not a macro
>>
>> Can we assume that none of the changes to quirks have actually been
>> tested?
>
> No, why? :) I tried it on one of the NVIDIAs I got here -
> VGA compatible controller: NVIDIA Corporation GM107GL [Quadro K2200] (rev a2)
> The driver was from NVIDIA (not nouveau) and the test was "acos" (some
> basic CUDA test).

Ufff. My bad. CUDA only works if I remove the quirks altogether.

Using the host page size helps to boot the guest with this Quadro (otherwise 
it aborts in kvm_set_phys_mem()), but a simple CUDA test produces an EEH:

aik@aiktest-u14:~$ ./acos
[   47.488858] EEH: Frozen PHB#0-PE#1 detected
[   47.488861] EEH: PE location: vfio_vfio-pci:0000:00:01.0, PHB location: 
vfio_vfio-pci:0000:00:01.0

[   51.559681] Unable to handle kernel paging request for data at address 
0x00000000
[   51.559873] Faulting instruction address: 0xd00000000550d168
[   51.560038] Oops: Kernel access of bad area, sig: 11 [#1]


Does this mean "nack" to %s/TARGET_PAGE_MASK/qemu_real_host_page_mask/ in 
quirks?


-- 
Alexey

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [RFC PATCH qemu v2 1/5] vfio: Switch from TARGET_PAGE_MASK to qemu_real_host_page_mask
  2015-07-14  6:58     ` Alexey Kardashevskiy
  2015-07-14 12:17       ` Alexey Kardashevskiy
@ 2015-07-14 22:28       ` Alex Williamson
  2015-07-15  0:49         ` Alexey Kardashevskiy
  1 sibling, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2015-07-14 22:28 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Michael Roth, qemu-ppc, qemu-devel, David Gibson

On Tue, 2015-07-14 at 16:58 +1000, Alexey Kardashevskiy wrote:
> On 07/14/2015 05:13 AM, Alex Williamson wrote:
> > On Tue, 2015-07-14 at 00:56 +1000, Alexey Kardashevskiy wrote:
> >> These started switching from TARGET_PAGE_MASK (hardcoded as 4K) to
> >> a real host page size:
> >> 4e51361d7 "cpu-all: complete "real" host page size API" and
> >> f7ceed190 "vfio: cpu: Use "real" page size API"
> >>
> >> This finished the transition by:
> >> - %s/TARGET_PAGE_MASK/qemu_real_host_page_mask/
> >> - %s/TARGET_PAGE_ALIGN/REAL_HOST_PAGE_ALIGN/
> >> - removing bitfield length for offsets in VFIOQuirk::data as
> >> qemu_real_host_page_mask is not a macro
> >
> > Can we assume that none of the changes to quirks have actually been
> > tested?
> 
> No, why? :) I tried it on one of the NVIDIAs I got here -
> VGA compatible controller: NVIDIA Corporation GM107GL [Quadro K2200] (rev a2)
> The driver was from NVIDIA (not nouveau) and the test was "acos" (some 
> basic CUDA test).

That's only one of a handful or more quirks.  The VGA related quirks are
all for backdoors in the bootstrap process and MMIO access to config
space, things that I would not expect to see on power.  So power
probably isn't a useful host to test these things.

> > I don't really support them being bundled in here since they
> > really aren't related to what you're doing.
> 
> This makes sense, I'll move them to a separate patch and add a note how it 
> helps on a 64k-pages host.
> 
> > For DMA we generally want
> > to be host IOMMU page aligned,
> 
> Do all known IOMMUs use a constant page size? IOMMU memory region does not 
> have an IOMMU page size/mask and I wanted to add it there but not sure if 
> it is generic enough.

AMD supports nearly any power-of-2 size (>=4k), Intel supports 4k +
optionally 2M and 1G.  The vfio type1 iommu driver looks for physically
contiguous ranges to give to the hardware iommu driver, which makes use
of whatever optimal page size it can.  Therefore, we really don't care
what the hardware page size is beyond the assumption that it supports
4k.  When hugepages are used by the VM, we expect the iommu will
automatically make use of them if supported.  A non-VM vfio userspace
driver might care a little bit more about the supported page sizes, I
imagine.
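
(A rough sketch of what "looking for physically contiguous ranges" can mean
in practice; the PFN array, the fixed 4K page size and the map_range()
helper below are made-up illustrations, not the actual type1 code:)

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for handing one IOVA -> host-physical range to the hardware
 * IOMMU driver, which is then free to back it with larger pages if it can. */
static void map_range(uint64_t iova, uint64_t pfn, size_t npages)
{
    printf("map iova %#llx -> pfn %#llx, %zu x 4K pages\n",
           (unsigned long long)iova, (unsigned long long)pfn, npages);
}

int main(void)
{
    /* Hypothetical PFNs backing six consecutive 4K pages of guest memory. */
    uint64_t pfns[] = { 0x1000, 0x1001, 0x1002, 0x5000, 0x5001, 0x9000 };
    size_t n = sizeof(pfns) / sizeof(pfns[0]);
    uint64_t iova = 0x100000;
    size_t start = 0;

    /* Batch physically contiguous pages into a single map call. */
    for (size_t i = 1; i <= n; i++) {
        if (i == n || pfns[i] != pfns[i - 1] + 1) {
            map_range(iova + start * 4096, pfns[start], i - start);
            start = i;
        }
    }
    return 0;
}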

> > which we can generally assume is the same
> > as host page aligned,
> 
> They are almost never the same on sPAPR for 32bit windows...
> 
> > but quirks are simply I/O regions, so I think they
> > ought to continue to be target page aligned.
> 
> Without s/TARGET_PAGE_MASK/qemu_real_host_page_mask/, 
> &vfio_nvidia_88000_quirk fails+exits in kvm_set_phys_mem() as the size of 
> section is 0x88000. It still works with x-mmap=false (or TCG, I suppose) 
> though.

Think about what this is doing to the guest.  There's a 4k window
(because PCIe extended config space is 4k) at offset 0x88000 of the MMIO
BAR that allows access to PCI config space of the device.  With 4k pages
this all aligns quite nicely and only config space accesses are trapped.
With 64k pages, you're trapping everything from 0x80000 to 0x8ffff.  We
have no idea what else might live in that space and what kind of
performance impact it'll cause to the operation of the device.  If you
don't know that you need it and can't meet the mapping criteria, don't
enable it.  BTW, this quirk is for GeForce, not Quadro.  The region
seems not to be used by Quadro drivers and we can't programmatically tell
the difference between Quadro and GeForce hardware, so we leave it
enabled for either on x86.
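
(To spell out the arithmetic: a tiny standalone illustration with an assumed
address_match of 0x88000, not code taken from hw/vfio/pci.c.)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t address_match = 0x88000;              /* config window in the BAR */
    uint64_t masks[] = { ~0xfffULL, ~0xffffULL };  /* 4K and 64K page masks */

    for (int i = 0; i < 2; i++) {
        uint64_t base = address_match & masks[i];
        uint64_t size = ~masks[i] + 1;
        printf("page size %#llx: quirk traps [%#llx, %#llx)\n",
               (unsigned long long)size,
               (unsigned long long)base,
               (unsigned long long)(base + size));
    }
    /* 4K pages:  [0x88000, 0x89000), only the config backdoor is trapped.
     * 64K pages: [0x80000, 0x90000), everything else in that 64K chunk of
     * the BAR is trapped too, with unknown performance cost. */
    return 0;
}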

Maybe these really should be real_host_page_size, but I can't really
picture how an underlying 64k page size host gets imposed on a guest
that's only aware of a 4k page size.  For instance, what prevents a 4k
guest from mapping PCI BARs with 4k alignment.  Multiple BARs for
different devices fit within a 64k host page.  MMU mappings would seem
to have similar issues.  These quirks need to be on the granularity we
take a target page fault, which I imagine is the same as the
real_host_page_size.  I'd expect that unless you're going to support
consumer graphics or crappy realtek NICs, none of these quirks are
relevant to you.

> >> This keeps using TARGET_PAGE_MASK for IOMMU regions though as it is
> >> the minimum page size which IOMMU regions may be using and at the moment
> >> memory regions do not carry the actual page size.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >>
> >> In reality DMA windows are always a lot bigger than a single 4K page
> >> and aligned to 32/64MB, may be only use here qemu_real_host_page_mask?
> >
> > I don't understand what this is asking either.  While the bulk of memory
> > is going to be mapped in larger chunks, we do occasionally see 4k
> > mappings on x86, particularly in some of the legacy low memory areas.
> 
> 
> The question was not about individual mappings - these are handled by a 
> iommu memory region notifier; here we are dealing with DMA windows which 
> are always megabytes but nothing really prevents/prohibits a guest from 
> requesting a 4K _window_ (with a single TCE entry). Whether we want to 
> support such small windows or not - this was a question.

But you're using the same memory listener that does deal with 4k
mappings.  What optimization would you hope to achieve by assuming only
larger mappings that compensates for the lack of generality?  Thanks,

Alex

> >> ---
> >>   hw/vfio/common.c | 26 ++++++++++++++++----------
> >>   hw/vfio/pci.c    | 30 +++++++++++++++---------------
> >>   2 files changed, 31 insertions(+), 25 deletions(-)
> >>
> >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> >> index 85ee9b0..d115ec9 100644
> >> --- a/hw/vfio/common.c
> >> +++ b/hw/vfio/common.c
> >> @@ -321,6 +321,9 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>       Int128 llend;
> >>       void *vaddr;
> >>       int ret;
> >> +    const bool is_iommu = memory_region_is_iommu(section->mr);
> >> +    const hwaddr page_mask =
> >> +        is_iommu ? TARGET_PAGE_MASK : qemu_real_host_page_mask;
> >>
> >>       if (vfio_listener_skipped_section(section)) {
> >>           trace_vfio_listener_region_add_skip(
> >> @@ -330,16 +333,16 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>           return;
> >>       }
> >>
> >> -    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
> >> -                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
> >> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
> >> +                 (section->offset_within_region & ~page_mask))) {
> >>           error_report("%s received unaligned region", __func__);
> >>           return;
> >>       }
> >>
> >> -    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> >> +    iova = ROUND_UP(section->offset_within_address_space, ~page_mask + 1);
> >>       llend = int128_make64(section->offset_within_address_space);
> >>       llend = int128_add(llend, section->size);
> >> -    llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
> >> +    llend = int128_and(llend, int128_exts64(page_mask));
> >>
> >>       if (int128_ge(int128_make64(iova), llend)) {
> >>           return;
> >> @@ -347,7 +350,7 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >>
> >>       memory_region_ref(section->mr);
> >>
> >> -    if (memory_region_is_iommu(section->mr)) {
> >> +    if (is_iommu) {
> >>           VFIOGuestIOMMU *giommu;
> >>
> >>           trace_vfio_listener_region_add_iommu(iova,
> >> @@ -423,6 +426,9 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>                                               iommu_data.type1.listener);
> >>       hwaddr iova, end;
> >>       int ret;
> >> +    const bool is_iommu = memory_region_is_iommu(section->mr);
> >> +    const hwaddr page_mask =
> >> +        is_iommu ? TARGET_PAGE_MASK : qemu_real_host_page_mask;
> >>
> >>       if (vfio_listener_skipped_section(section)) {
> >>           trace_vfio_listener_region_del_skip(
> >> @@ -432,13 +438,13 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>           return;
> >>       }
> >>
> >> -    if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) !=
> >> -                 (section->offset_within_region & ~TARGET_PAGE_MASK))) {
> >> +    if (unlikely((section->offset_within_address_space & ~page_mask) !=
> >> +                 (section->offset_within_region & ~page_mask))) {
> >>           error_report("%s received unaligned region", __func__);
> >>           return;
> >>       }
> >>
> >> -    if (memory_region_is_iommu(section->mr)) {
> >> +    if (is_iommu) {
> >>           VFIOGuestIOMMU *giommu;
> >>
> >>           QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> >> @@ -459,9 +465,9 @@ static void vfio_listener_region_del(MemoryListener *listener,
> >>            */
> >>       }
> >>
> >> -    iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> >> +    iova = REAL_HOST_PAGE_ALIGN(section->offset_within_address_space);
> >>       end = (section->offset_within_address_space + int128_get64(section->size)) &
> >> -          TARGET_PAGE_MASK;
> >> +          page_mask;
> >>
> >>       if (iova >= end) {
> >>           return;
> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >> index 2ed877f..7694afe 100644
> >> --- a/hw/vfio/pci.c
> >> +++ b/hw/vfio/pci.c
> >> @@ -50,16 +50,16 @@ typedef struct VFIOQuirk {
> >>       struct VFIOPCIDevice *vdev;
> >>       QLIST_ENTRY(VFIOQuirk) next;
> >>       struct {
> >> -        uint32_t base_offset:TARGET_PAGE_BITS;
> >> -        uint32_t address_offset:TARGET_PAGE_BITS;
> >> +        uint32_t base_offset;
> >> +        uint32_t address_offset;
> >>           uint32_t address_size:3;
> >>           uint32_t bar:3;
> >>
> >>           uint32_t address_match;
> >>           uint32_t address_mask;
> >>
> >> -        uint32_t address_val:TARGET_PAGE_BITS;
> >> -        uint32_t data_offset:TARGET_PAGE_BITS;
> >> +        uint32_t address_val;
> >> +        uint32_t data_offset;
> >>           uint32_t data_size:3;
> >>
> >>           uint8_t flags;
> >> @@ -1319,8 +1319,8 @@ static uint64_t vfio_generic_quirk_read(void *opaque,
> >>   {
> >>       VFIOQuirk *quirk = opaque;
> >>       VFIOPCIDevice *vdev = quirk->vdev;
> >> -    hwaddr base = quirk->data.address_match & TARGET_PAGE_MASK;
> >> -    hwaddr offset = quirk->data.address_match & ~TARGET_PAGE_MASK;
> >> +    hwaddr base = quirk->data.address_match & qemu_real_host_page_mask;
> >> +    hwaddr offset = quirk->data.address_match & ~qemu_real_host_page_mask;
> >>       uint64_t data;
> >>
> >>       if (vfio_flags_enabled(quirk->data.flags, quirk->data.read_flags) &&
> >> @@ -1349,8 +1349,8 @@ static void vfio_generic_quirk_write(void *opaque, hwaddr addr,
> >>   {
> >>       VFIOQuirk *quirk = opaque;
> >>       VFIOPCIDevice *vdev = quirk->vdev;
> >> -    hwaddr base = quirk->data.address_match & TARGET_PAGE_MASK;
> >> -    hwaddr offset = quirk->data.address_match & ~TARGET_PAGE_MASK;
> >> +    hwaddr base = quirk->data.address_match & qemu_real_host_page_mask;
> >> +    hwaddr offset = quirk->data.address_match & ~qemu_real_host_page_mask;
> >>
> >>       if (vfio_flags_enabled(quirk->data.flags, quirk->data.write_flags) &&
> >>           ranges_overlap(addr, size, offset, quirk->data.address_mask + 1)) {
> >> @@ -1650,9 +1650,9 @@ static void vfio_probe_ati_bar2_4000_quirk(VFIOPCIDevice *vdev, int nr)
> >>
> >>       memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_generic_quirk, quirk,
> >>                             "vfio-ati-bar2-4000-quirk",
> >> -                          TARGET_PAGE_ALIGN(quirk->data.address_mask + 1));
> >> +                          REAL_HOST_PAGE_ALIGN(quirk->data.address_mask + 1));
> >>       memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem,
> >> -                          quirk->data.address_match & TARGET_PAGE_MASK,
> >> +                          quirk->data.address_match & qemu_real_host_page_mask,
> >>                             &quirk->mem, 1);
> >>
> >>       QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);
> >> @@ -1888,7 +1888,7 @@ static void vfio_nvidia_88000_quirk_write(void *opaque, hwaddr addr,
> >>       VFIOQuirk *quirk = opaque;
> >>       VFIOPCIDevice *vdev = quirk->vdev;
> >>       PCIDevice *pdev = &vdev->pdev;
> >> -    hwaddr base = quirk->data.address_match & TARGET_PAGE_MASK;
> >> +    hwaddr base = quirk->data.address_match & qemu_real_host_page_mask;
> >>
> >>       vfio_generic_quirk_write(opaque, addr, data, size);
> >>
> >> @@ -1943,9 +1943,9 @@ static void vfio_probe_nvidia_bar0_88000_quirk(VFIOPCIDevice *vdev, int nr)
> >>
> >>       memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_nvidia_88000_quirk,
> >>                             quirk, "vfio-nvidia-bar0-88000-quirk",
> >> -                          TARGET_PAGE_ALIGN(quirk->data.address_mask + 1));
> >> +                          REAL_HOST_PAGE_ALIGN(quirk->data.address_mask + 1));
> >>       memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem,
> >> -                          quirk->data.address_match & TARGET_PAGE_MASK,
> >> +                          quirk->data.address_match & qemu_real_host_page_mask,
> >>                             &quirk->mem, 1);
> >>
> >>       QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);
> >> @@ -1980,9 +1980,9 @@ static void vfio_probe_nvidia_bar0_1800_quirk(VFIOPCIDevice *vdev, int nr)
> >>
> >>       memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_generic_quirk, quirk,
> >>                             "vfio-nvidia-bar0-1800-quirk",
> >> -                          TARGET_PAGE_ALIGN(quirk->data.address_mask + 1));
> >> +                          REAL_HOST_PAGE_ALIGN(quirk->data.address_mask + 1));
> >>       memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem,
> >> -                          quirk->data.address_match & TARGET_PAGE_MASK,
> >> +                          quirk->data.address_match & qemu_real_host_page_mask,
> >>                             &quirk->mem, 1);
> >>
> >>       QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next);
> >
> >
> >
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [RFC PATCH qemu v2 1/5] vfio: Switch from TARGET_PAGE_MASK to qemu_real_host_page_mask
  2015-07-14 22:28       ` Alex Williamson
@ 2015-07-15  0:49         ` Alexey Kardashevskiy
  0 siblings, 0 replies; 12+ messages in thread
From: Alexey Kardashevskiy @ 2015-07-15  0:49 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Michael Roth, qemu-ppc, qemu-devel, David Gibson

On 07/15/2015 08:28 AM, Alex Williamson wrote:
> On Tue, 2015-07-14 at 16:58 +1000, Alexey Kardashevskiy wrote:
>> On 07/14/2015 05:13 AM, Alex Williamson wrote:
>>> On Tue, 2015-07-14 at 00:56 +1000, Alexey Kardashevskiy wrote:
>>>> These started switching from TARGET_PAGE_MASK (hardcoded as 4K) to
>>>> a real host page size:
>>>> 4e51361d7 "cpu-all: complete "real" host page size API" and
>>>> f7ceed190 "vfio: cpu: Use "real" page size API"
>>>>
>>>> This finished the transition by:
>>>> - %s/TARGET_PAGE_MASK/qemu_real_host_page_mask/
>>>> - %s/TARGET_PAGE_ALIGN/REAL_HOST_PAGE_ALIGN/
>>>> - removing bitfield length for offsets in VFIOQuirk::data as
>>>> qemu_real_host_page_mask is not a macro
>>>
>>> Can we assume that none of the changes to quirks have actually been
>>> tested?
>>
>> No, why? :) I tried it on one of the NVIDIAs I got here -
>> VGA compatible controller: NVIDIA Corporation GM107GL [Quadro K2200] (rev a2)
>> The driver was from NVIDIA (not nouveau) and the test was "acos" (some
>> basic CUDA test).
>
> That's only one of a handful or more quirks.  The VGA related quirks are
> all for backdoors in the bootstrap process and MMIO access to config
> space, things that I would not expect to see on power.  So power
> probably isn't a useful host to test these things.
>
>>> I don't really support them being bundled in here since they
>>> really aren't related to what you're doing.
>>
>> This makes sense, I'll move them to a separate patch and add a note how it
>> helps on a 64k-pages host.
>>
>>> For DMA we generally want
>>> to be host IOMMU page aligned,
>>
>> Do all known IOMMUs use a constant page size? IOMMU memory region does not
>> have an IOMMU page size/mask and I wanted to add it there but not sure if
>> it is generic enough.
>
> AMD supports nearly any power-of-2 size (>=4k), Intel supports 4k +
> optionally 2M and 1G.  The vfio type1 iommu driver looks for physically
> contiguous ranges to give to the hardware iommu driver, which makes use
> of whatever optimal page size it can.  Therefore, we really don't care
> what the hardware page size is beyond the assumption that it supports
> 4k.  When hugepages are used by the VM, we expect the iommu will
> automatically make use of them if supported.  A non-VM vfio userspace
> driver might care a little bit more about the supported page sizes, I
> imagine.

Oh. Cooler than us (p8).

>>> which we can generally assume is the same
>>> as host page aligned,
>>
>> They are almost never the same on sPAPR for 32bit windows...
>>
>>> but quirks are simply I/O regions, so I think they
>>> ought to continue to be target page aligned.
>>
>> Without s/TARGET_PAGE_MASK/qemu_real_host_page_mask/,
>> &vfio_nvidia_88000_quirk fails+exits in kvm_set_phys_mem() as the size of
>> section is 0x88000. It still works with x-mmap=false (or TCG, I suppose)
>> though.
>
> Think about what this is doing to the guest.  There's a 4k window
> (because PCIe extended config space is 4k) at offset 0x88000 of the MMIO
> BAR that allows access to PCI config space of the device.  With 4k pages
> this all aligns quite nicely and only config space accesses are trapped.
> With 64k pages, you're trapping everything from 0x80000 to 0x8ffff.  We
> have no idea what else might live in that space and what kind of
> performance impact it'll cause to the operation of the device.  If you

If the alternative is not to work at all, then working slower (potentially) is ok.


> don't know that you need it and can't meet the mapping criteria, don't
> enable it.  BTW, this quirk is for GeForce, not Quadro.  The region
> seems not to be used by Quadro drivers and we can't programmatically tell
> the difference between Quadro and GeForce hardware, so we leave it
> enabled for either on x86.

What kind of driver exploits this? If it is Windows-only, then I can safely 
drop all of these hacks.

Or is it an adapter's BIOS? Or the host BIOS?

>
> Maybe these really should be real_host_page_size, but I can't really
> picture how an underlying 64k page size host gets imposed on a guest
> that's only aware of a 4k page size.  For instance, what prevents a 4k
> guest from mapping PCI BARs with 4k alignment?  Multiple BARs for
> different devices fit within a 64k host page.  MMU mappings would seem
> to have similar issues.  These quirks need to be on the granularity we
> take a target page fault, which I imagine is the same as the
> real_host_page_size.  I'd expect that unless you're going to support
> consumer graphics or crappy realtek NICs, none of these quirks are
> relevant to you.
>
>>>> This keeps using TARGET_PAGE_MASK for IOMMU regions though as it is
>>>> the minimum page size which IOMMU regions may be using and at the moment
>>>> memory regions do not carry the actual page size.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>
>>>> In reality DMA windows are always a lot bigger than a single 4K page
>>>> and aligned to 32/64MB, may be only use here qemu_real_host_page_mask?
>>>
>>> I don't understand what this is asking either.  While the bulk of memory
>>> is going to be mapped in larger chunks, we do occasionally see 4k
>>> mappings on x86, particularly in some of the legacy low memory areas.
>>
>>
>> The question was not about individual mappings - these are handled by a
>> iommu memory region notifier; here we are dealing with DMA windows which
>> are always megabytes but nothing really prevents/prohibits a guest from
>> requesting a 4K _window_ (with a single TCE entry). Whether we want to
>> support such small windows or not - this was a question.
>
> But you're using the same memory listener that does deal with 4k
> mappings.  What optimization would you hope to achieve by assuming only
> larger mappings that compensates for the lack of generality?  Thanks,

I gave up here already - I reworked this part in v3 :)

Thanks for review and education.


-- 
Alexey

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread

Thread overview: 12+ messages
2015-07-13 14:56 [Qemu-devel] [RFC PATCH qemu v2 0/5] vfio: SPAPR IOMMU v2 (memory preregistration support) Alexey Kardashevskiy
2015-07-13 14:56 ` [Qemu-devel] [RFC PATCH qemu v2 1/5] vfio: Switch from TARGET_PAGE_MASK to qemu_real_host_page_mask Alexey Kardashevskiy
2015-07-13 19:13   ` Alex Williamson
2015-07-14  6:58     ` Alexey Kardashevskiy
2015-07-14 12:17       ` Alexey Kardashevskiy
2015-07-14 22:28       ` Alex Williamson
2015-07-15  0:49         ` Alexey Kardashevskiy
2015-07-13 14:56 ` [Qemu-devel] [RFC PATCH qemu v2 2/5] vfio: Skip PCI BARs in memory listener Alexey Kardashevskiy
2015-07-13 14:56 ` [Qemu-devel] [RFC PATCH qemu v2 3/5] vfio: Store IOMMU type in container Alexey Kardashevskiy
2015-07-13 14:56 ` [Qemu-devel] [RFC PATCH qemu v2 4/5] vfio: Refactor memory listener to accommodate more IOMMU types Alexey Kardashevskiy
2015-07-13 14:56 ` [Qemu-devel] [RFC PATCH qemu v2 5/5] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
2015-07-13 19:37 ` [Qemu-devel] [RFC PATCH qemu v2 0/5] vfio: SPAPR IOMMU v2 (memory preregistration support) Alex Williamson
