* [v3 00/15] Add VT-d Posted-Interrupts support
@ 2015-06-24  5:18 Feng Wu
  2015-06-24  5:18 ` [v3 01/15] Vt-d Posted-intterrupt (PI) design Feng Wu
                   ` (14 more replies)
  0 siblings, 15 replies; 155+ messages in thread
From: Feng Wu @ 2015-06-24  5:18 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, feng.wu

VT-d Posted-Interrupts is an enhancement to CPU-side Posted-Interrupts.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when the guest is running in non-root mode.

You can find the VT-d Posted-Interrupts spec at the following URL:
http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html

This patch set follows this design:
http://article.gmane.org/gmane.comp.emulators.xen.devel/236476

v3:
Changelogs are at the head of each patch.

v2:
1. Add the design doc.
2. Coding style fix.
3. Add some comments for struct pi_desc.
4. Extend 'struct iremap_entry' to a more common format.
5. Delete the atomic helper functions for pi descriptor manipulation.
6. Add the new command line in docs/misc/xen-command-line.markdown.
7. Use macros to replace some magic numbers.

Feng Wu (15):
  Vt-d Posted-intterrupt (PI) design
  Add helper macro for X86_FEATURE_CX16 feature detection
  Add cmpxchg16b support for x86-64
  iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature
  vt-d: VT-d Posted-Interrupts feature detection
  vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  vmx: Initialize VT-d Posted-Interrupts Descriptor
  Suppress posting interrupts when 'SN' is set
  vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts
  vt-d: Add API to update IRTE when VT-d PI is used
  Update IRTE according to guest interrupt config changes
  vmx: posted-interrupt handling when vCPU is blocked
  vmx: Properly handle notification event when vCPU is running
  Update Posted-Interrupts Descriptor during vCPU scheduling
  Add a command line parameter for VT-d posted-interrupts

 docs/misc/vtd-pi.txt                   | 333 +++++++++++++++++++++++++++++++++
 docs/misc/xen-command-line.markdown    |   9 +-
 xen/arch/x86/hvm/hvm.c                 |   6 +
 xen/arch/x86/hvm/vmx/vmcs.c            |  21 +++
 xen/arch/x86/hvm/vmx/vmx.c             | 263 +++++++++++++++++++++++++-
 xen/common/schedule.c                  |   4 +
 xen/drivers/passthrough/io.c           |  96 +++++++++-
 xen/drivers/passthrough/iommu.c        |  12 +-
 xen/drivers/passthrough/vtd/intremap.c | 190 ++++++++++++++-----
 xen/drivers/passthrough/vtd/iommu.c    |  18 +-
 xen/drivers/passthrough/vtd/iommu.h    |  45 +++--
 xen/drivers/passthrough/vtd/utils.c    |  10 +-
 xen/include/asm-arm/domain.h           |   2 +
 xen/include/asm-x86/cpufeature.h       |   2 +
 xen/include/asm-x86/hvm/hvm.h          |   3 +
 xen/include/asm-x86/hvm/vmx/vmcs.h     |  27 ++-
 xen/include/asm-x86/hvm/vmx/vmx.h      |  18 ++
 xen/include/asm-x86/iommu.h            |   2 +
 xen/include/asm-x86/x86_64/system.h    |  28 +++
 xen/include/xen/iommu.h                |   2 +-
 xen/include/xen/types.h                |   5 +
 21 files changed, 1019 insertions(+), 77 deletions(-)
 create mode 100644 docs/misc/vtd-pi.txt

-- 
2.1.0


* [v3 01/15] Vt-d Posted-intterrupt (PI) design
  2015-06-24  5:18 [v3 00/15] Add VT-d Posted-Interrupts support Feng Wu
@ 2015-06-24  5:18 ` Feng Wu
  2015-06-24  6:15   ` Meng Xu
  2015-07-08  7:21   ` Tian, Kevin
  2015-06-24  5:18 ` [v3 02/15] Add helper macro for X86_FEATURE_CX16 feature detection Feng Wu
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 155+ messages in thread
From: Feng Wu @ 2015-06-24  5:18 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, feng.wu

Add the design doc for VT-d PI.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
 docs/misc/vtd-pi.txt | 333 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 333 insertions(+)
 create mode 100644 docs/misc/vtd-pi.txt

diff --git a/docs/misc/vtd-pi.txt b/docs/misc/vtd-pi.txt
new file mode 100644
index 0000000..f41c688
--- /dev/null
+++ b/docs/misc/vtd-pi.txt
@@ -0,0 +1,333 @@
+Authors: Feng Wu <feng.wu@intel.com>
+
+VT-d Posted-interrupt (PI) design for XEN
+
+Background
+==========
+With the development of virtualization, there are more and more device
+assignment requirements. However, today when a VM is running with
+assigned devices (such as a NIC), external interrupt handling for the assigned
+devices always needs VMM intervention.
+
+VT-d Posted-interrupt is an enhanced method to handle interrupts
+in the virtualization environment. Interrupt posting is the process by
+which an interrupt request is recorded in a memory-resident
+posted-interrupt-descriptor structure by the root-complex, followed by
+an optional notification event issued to the CPU complex.
+
+With VT-d Posted-interrupt we can get the following advantages:
+- Direct delivery of external interrupts to running vCPUs without VMM
+intervention
+- Decrease the interrupt migration complexity. On vCPU migration, software
+can atomically co-migrate all interrupts targeting the migrating vCPU. For
+virtual machines with assigned devices, migrating a vCPU across pCPUs
+either incurs the overhead of forwarding interrupts in software (e.g. via
+VMM-generated IPIs), or the complexity of independently migrating each interrupt
+targeting the vCPU to the new pCPU. However, after enabling VT-d PI, the
+destination vCPU of an external interrupt from assigned devices is stored in the
+IRTE (i.e. Posted-interrupt Descriptor Address). When the vCPU is migrated to
+another pCPU, we set this new pCPU in the 'NDST' field of the Posted-interrupt
+descriptor, which makes the interrupt migration automatic.
+
+Here is what Xen currently does for external interrupts from assigned devices:
+
+When a VM is running and an external interrupt from an assigned device occurs
+for it, a VM-Exit happens, then:
+
+vmx_do_extint() --> do_IRQ() --> __do_IRQ_guest() --> hvm_do_IRQ_dpci() -->
+raise_softirq_for(pirq_dpci) --> raise_softirq(HVM_DPCI_SOFTIRQ)
+
+softirq HVM_DPCI_SOFTIRQ is bound to dpci_softirq()
+
+dpci_softirq() --> hvm_dirq_assist() --> vmsi_deliver_pirq() --> vmsi_deliver() -->
+vmsi_inj_irq() --> vlapic_set_irq()
+
+vlapic_set_irq() does the following things:
+1. If CPU-side posted-interrupt is supported, call vmx_deliver_posted_intr() to deliver
+the virtual interrupt via posted-interrupt infrastructure.
+2. Else if CPU-side posted-interrupt is not supported, set the related vIRR in vLAPIC
+page and call vcpu_kick() to kick the related vCPU. Before VM-Entry, vmx_intr_assist()
+will help to inject the interrupt to guests.
+
+However, once VT-d PI is supported, when a guest is running in non-root mode
+and an external interrupt from an assigned device occurs for it, no VM-Exit is
+needed. The guest can handle the interrupt entirely in non-root mode, thus
+avoiding all of the above code flow.
+
+Posted-interrupt Introduction
+=============================
+There are two components to the Posted-interrupt architecture:
+Processor Support and Root-Complex Support
+
+- Processor Support
+Posted-interrupt processing is a feature by which a processor processes
+the virtual interrupts by recording them as pending on the virtual-APIC
+page.
+
+Posted-interrupt processing is enabled by setting the "process posted
+interrupts" VM-execution control. The processing is performed in response
+to the arrival of an interrupt with the posted-interrupt notification vector.
+In response to such an interrupt, the processor processes virtual interrupts
+recorded in a data structure called a posted-interrupt descriptor.
+
+For more information about APICv and CPU-side Posted-interrupt, please refer
+to Chapter 29, and Section 29.6 in particular, of the Intel SDM:
+http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
+
+- Root-Complex Support
+Interrupt posting is the process by which an interrupt request (from IOAPIC
+or MSI/MSIx capable sources) is recorded in a memory-resident
+posted-interrupt-descriptor structure by the root-complex, followed by
+an optional notification event issued to the CPU complex. The interrupt
+request arriving at the root-complex carries the identity of the interrupt
+request source and a 'remapping-index'. The remapping-index is used to
+look up an entry in the memory-resident interrupt-remap-table. Unlike
+with interrupt-remapping, the interrupt-remap-table-entry for a
+posted-interrupt specifies a virtual-vector and a pointer to the posted-interrupt
+descriptor. The virtual-vector specifies the vector of the interrupt to be
+recorded in the posted-interrupt descriptor. The posted-interrupt descriptor
+hosts storage for the virtual-vectors and contains the attributes of the
+notification event (interrupt) to be issued to the CPU complex to inform
+CPU/software about pending interrupts recorded in the posted-interrupt
+descriptor.
+
+For more information about VT-d PI, please refer to
+http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
+
+Important Definitions
+=====================
+There are some changes to IRTE and posted-interrupt descriptor after
+VT-d PI is introduced:
+IRTE:
+Posted-interrupt Descriptor Address: the address of the posted-interrupt descriptor
+Virtual Vector: the guest vector of the interrupt
+URG: indicates if the interrupt is urgent
+
+Posted-interrupt descriptor:
+The Posted Interrupt Descriptor hosts the following fields:
+Posted Interrupt Request (PIR): Provide storage for posting (recording) interrupts (one bit
+per vector, for up to 256 vectors).
+
+Outstanding Notification (ON): Indicate if there is a notification event outstanding (not
+processed by processor or software) for this Posted Interrupt Descriptor. When this field is 0,
+hardware modifies it from 0 to 1 when generating a notification event, and the entity receiving
+the notification event (processor or software) resets it as part of posted interrupt processing.
+
+Suppress Notification (SN): Indicate if a notification event is to be suppressed (not
+generated) for non-urgent interrupt requests (interrupts processed through an IRTE with
+URG=0).
+
+Notification Vector (NV): Specify the vector for notification event (interrupt).
+
+Notification Destination (NDST): Specify the physical APIC-ID of the destination logical
+processor for the notification event.
+
+Design Overview
+===============
+In this design, we will cover the following items:
+1. Add a variable to control whether VT-d posted-interrupt is enabled.
+2. VT-d PI feature detection.
+3. Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff.
+4. Extend IRTE structure to support VT-d PI.
+5. Introduce a new global vector which is used for waking up the blocked vCPU.
+6. Update IRTE when guest modifies the interrupt configuration (MSI/MSIx configuration).
+7. Update posted-interrupt descriptor during vCPU scheduling (when the state
+of the vCPU transitions among RUNSTATE_running / RUNSTATE_blocked /
+RUNSTATE_runnable / RUNSTATE_offline).
+8. How to wake up a blocked vCPU when an interrupt is posted for it (wakeup notification handler).
+9. New boot command line option for Xen, which lets the user control the VT-d PI feature.
+10. Multicast/broadcast and lowest priority interrupts consideration.
+
+
+Implementation details
+======================
+- New variable to control VT-d PI
+
+Like variable 'iommu_intremap' for interrupt remapping, it is very straightforward
+to add a new one 'iommu_intpost' for posted-interrupt. 'iommu_intpost' is set
+only when interrupt remapping and VT-d posted-interrupt are both enabled.
+
+- VT-d PI feature detection.
+Bit 59 in VT-d Capability Register is used to report VT-d Posted-interrupt support.
+
+- Extend posted-interrupt descriptor structure to cover VT-d PI specific stuff.
+Here is the new structure for posted-interrupt descriptor:
+
+struct pi_desc {
+    DECLARE_BITMAP(pir, NR_VECTORS);
+    union {
+        struct
+        {
+        u16 on     : 1,  /* bit 256 - Outstanding Notification */
+            sn     : 1,  /* bit 257 - Suppress Notification */
+            rsvd_1 : 14; /* bit 271:258 - Reserved */
+        u8  nv;          /* bit 279:272 - Notification Vector */
+        u8  rsvd_2;      /* bit 287:280 - Reserved */
+        u32 ndst;        /* bit 319:288 - Notification Destination */
+        };
+        u64 control;
+    };
+    u32 rsvd[6];
+} __attribute__ ((aligned (64)));
+
+- Extend IRTE structure to support VT-d PI.
+
+Here is the new structure for IRTE:
+/* interrupt remap entry */
+struct iremap_entry {
+  union {
+    struct { u64 lo, hi; };
+    struct {
+        u16 p       : 1,
+            fpd     : 1,
+            dm      : 1,
+            rh      : 1,
+            tm      : 1,
+            dlm     : 3,
+            avail   : 4,
+            res_1   : 4;
+        u8  vector;
+        u8  res_2;
+        u32 dst;
+        u16 sid;
+        u16 sq      : 2,
+            svt     : 2,
+            res_3   : 12;
+        u32 res_4   : 32;
+    } remap;
+    struct {
+        u16 p       : 1,
+            fpd     : 1,
+            res_1   : 6,
+            avail   : 4,
+            res_2   : 2,
+            urg     : 1,
+            im      : 1;
+        u8  vector;
+        u8  res_3;
+        u32 res_4   : 6,
+            pda_l   : 26;
+        u16 sid;
+        u16 sq      : 2,
+            svt     : 2,
+            res_5   : 12;
+        u32 pda_h;
+    } post;
+  };
+};
+
+- Introduce a new global vector which is used to wake up the blocked vCPU.
+
+Currently, there is a global vector 'posted_intr_vector', which is used as the
+global notification vector for all vCPUs in the system. This vector is stored in
+the VMCS, and the CPU treats it as a _special_ vector, used to notify the related
+pCPU when an interrupt is recorded in the posted-interrupt descriptor.
+
+This existing global vector is a _special_ vector to the CPU, which handles it in a
+_special_ way compared to normal vectors. Please refer to Section 29.6 in the Intel SDM
+http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-manual-325462.pdf
+for more information about how the CPU handles it.
+
+With VT-d PI, the VT-d engine can issue a notification event when the
+assigned devices issue interrupts. We need to add a new global vector to
+wake up the blocked vCPU; please refer to the later section in this design for
+how this new global vector is used.
+
+- Update IRTE when guest modifies the interrupt configuration (MSI/MSIx configuration).
+After VT-d PI is introduced, the format of IRTE is changed as follows:
+	Descriptor Address: the address of the posted-interrupt descriptor
+	Virtual Vector: the guest vector of the interrupt
+	URG: indicates if the interrupt is urgent
+	Other fields continue to have the same meaning
+
+'Descriptor Address' tells the destination vCPU of this interrupt, since
+each vCPU has a dedicated posted-interrupt descriptor.
+
+'Virtual Vector' tells the guest vector of the interrupt.
+
+When the guest changes the configuration of an interrupt, such as its
+CPU affinity or vector, we need to update the associated IRTE accordingly.
+
+- Update posted-interrupt descriptor during vCPU scheduling
+
+The basic idea here is:
+1. When vCPU's state is RUNSTATE_running,
+        - Set 'NV' to 'posted_intr_vector'.
+        - Clear 'SN' to accept posted-interrupts.
+        - Set 'NDST' to the pCPU on which the vCPU will be running.
+2. When vCPU's state is RUNSTATE_blocked,
+        - Set 'NV' to 'pi_wakeup_vector', so we can wake up the
+          related vCPU when posted-interrupt happens for it.
+          Please refer to the above section about the new global vector.
+        - Clear 'SN' to accept posted-interrupts
+3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
+        - Set 'SN' to suppress non-urgent interrupts
+          (Currently, we only support non-urgent interrupts).
+          When the vCPU is in RUNSTATE_runnable or RUNSTATE_offline,
+          there is no need to accept posted-interrupt notification events:
+          since we don't change the behavior of the scheduler when the interrupt
+          occurs, we still need to wait for the next scheduling of the vCPU.
+          When external interrupts from assigned devices occur, the interrupts
+          are recorded in PIR, and will be synced to IRR before VM-Entry.
+        - Set 'NV' to 'posted_intr_vector'.
+
+- How to wake up a blocked vCPU when an interrupt is posted for it (wakeup notification handler).
+
+Here is the scenario for the usage of the new global vector:
+
+1. vCPU0 is running on pCPU0
+2. vCPU0 is blocked and vCPU1 is currently running on pCPU0
+3. An external interrupt from an assigned device occurs for vCPU0. If we
+still use 'posted_intr_vector' as the notification vector for vCPU0, the
+notification event for vCPU0 (the event will go to pCPU0) will be consumed
+by vCPU1 incorrectly (remember this is a special vector to the CPU). The worst
+case is that vCPU0 will never be woken up again since the wakeup event
+for it is always consumed by other vCPUs incorrectly. So we need to introduce
+another global vector, named 'pi_wakeup_vector', to wake up the blocked vCPU.
+
+After using 'pi_wakeup_vector' for vCPU0, the VT-d engine will issue the notification
+event using this new vector. This new vector is not a SPECIAL one to the CPU;
+it is just a normal vector. The CPU simply receives a normal external interrupt,
+so we get control in the handler of this new vector. In this handler, the hypervisor
+can do whatever is needed, such as waking up the blocked vCPU.
+
+Here is what we do for the blocked vCPU:
+1. Define a per-cpu list 'pi_blocked_vcpu', which stores the blocked
+vCPUs on the pCPU.
+2. When the vCPU's state is changed to RUNSTATE_blocked, insert the vCPU
+into the per-cpu list belonging to the pCPU it was running on.
+3. When the vCPU is unblocked, remove the vCPU from the related pCPU list.
+
+In the handler of 'pi_wakeup_vector', we do:
+1. Get the physical CPU.
+2. Iterate over the list 'pi_blocked_vcpu' of the current pCPU; if 'ON' is set
+for a vCPU, we unblock it.
+
+- New boot command line option for Xen, which lets the user control the VT-d PI feature.
+
+Like 'intremap' for interrupt remapping, we add a new boot command line
+option 'intpost' for posted-interrupts.
+
+- Multicast/broadcast and lowest priority interrupts consideration.
+
+With VT-d PI, the destination vCPU information of an external interrupt
+from assigned devices is stored in the IRTE, which leads to the following
+design considerations:
+1. Multicast/broadcast interrupts cannot be posted.
+2. For lowest-priority interrupts, newer Intel CPUs/chipsets/root-complexes
+(starting from Nehalem) ignore the TPR value, and instead support two other
+ways (configurable by BIOS) to handle lowest-priority interrupts:
+	A) Round robin: In this method, the chipset simply delivers lowest priority
+interrupts in a round-robin manner across all the available logical CPUs. While
+this provides good load balancing, this was not the best thing to do always as
+interrupts from the same device (like NIC) will start running on all the CPUs
+thrashing caches and taking locks. This led to the next scheme.
+	B) Vector hashing: In this method, hardware would apply a hash function
+on the vector value in the interrupt request, and use that hash to pick a logical
+CPU to route the lowest priority interrupt. This way, a given vector always goes
+to the same logical CPU, avoiding the thrashing problem above.
+
+So the gist of the above is that lowest-priority interrupts have never been delivered as
+"lowest priority" in physical hardware.
+
+I will emulate vector hashing for posted-interrupt for XEN.
-- 
2.1.0
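
The rules in the design doc's "Update posted-interrupt descriptor during
vCPU scheduling" section can be sketched as a pure function on the 64-bit
'control' word of the descriptor. This is only a sketch under assumptions:
the bit offsets are taken from the struct pi_desc comments in the doc, the
two vector numbers are placeholders (not Xen's real values), and the enum
and function names are hypothetical. A real implementation would also have
to apply the new value atomically, since hardware can set 'ON' concurrently.

```c
#include <stdint.h>

/* Offsets inside the 64-bit 'control' word (descriptor bits 319:256). */
#define PI_ON          (1ULL << 0)                 /* bit 256 */
#define PI_SN          (1ULL << 1)                 /* bit 257 */
#define PI_NV_SHIFT    16                          /* bits 279:272 */
#define PI_NV_MASK     (0xffULL << PI_NV_SHIFT)
#define PI_NDST_SHIFT  32                          /* bits 319:288 */
#define PI_NDST_MASK   (0xffffffffULL << PI_NDST_SHIFT)

/* Placeholder vector numbers for illustration only. */
#define POSTED_INTR_VECTOR 0xf2
#define PI_WAKEUP_VECTOR   0xf1

/* Hypothetical stand-ins for the RUNSTATE_* values named in the doc. */
enum runstate { RS_RUNNING, RS_RUNNABLE, RS_BLOCKED, RS_OFFLINE };

/* Replace the Notification Vector field, leaving other bits untouched. */
static uint64_t pi_set_nv(uint64_t control, uint8_t nv)
{
    return (control & ~PI_NV_MASK) | ((uint64_t)nv << PI_NV_SHIFT);
}

/* Compute the new control word for a vCPU entering runstate 'rs'. */
static uint64_t pi_control_for_runstate(uint64_t control, enum runstate rs,
                                        uint32_t dest_apic_id)
{
    switch ( rs )
    {
    case RS_RUNNING:
        /* Notify the pCPU the vCPU will run on, with the special vector. */
        control = pi_set_nv(control, POSTED_INTR_VECTOR);
        control &= ~PI_SN;
        control = (control & ~PI_NDST_MASK) |
                  ((uint64_t)dest_apic_id << PI_NDST_SHIFT);
        break;

    case RS_BLOCKED:
        /* Use the normal wakeup vector so the hypervisor gets control. */
        control = pi_set_nv(control, PI_WAKEUP_VECTOR);
        control &= ~PI_SN;
        break;

    case RS_RUNNABLE:
    case RS_OFFLINE:
        /* Suppress notifications; PIR is synced to IRR before VM-Entry. */
        control |= PI_SN;
        control = pi_set_nv(control, POSTED_INTR_VECTOR);
        break;
    }

    return control;
}
```

Keeping the update as a pure function on the control word makes it easy to
wrap in a compare-and-exchange retry loop, which is one way the concurrent
hardware update of 'ON' could be tolerated.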


* [v3 02/15] Add helper macro for X86_FEATURE_CX16 feature detection
  2015-06-24  5:18 [v3 00/15] Add VT-d Posted-Interrupts support Feng Wu
  2015-06-24  5:18 ` [v3 01/15] Vt-d Posted-intterrupt (PI) design Feng Wu
@ 2015-06-24  5:18 ` Feng Wu
  2015-06-24 17:31   ` Andrew Cooper
  2015-07-08  7:23   ` Tian, Kevin
  2015-06-24  5:18 ` [v3 03/15] Add cmpxchg16b support for x86-64 Feng Wu
                   ` (12 subsequent siblings)
  14 siblings, 2 replies; 155+ messages in thread
From: Feng Wu @ 2015-06-24  5:18 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, feng.wu

Add macro cpu_has_cx16 to detect X86_FEATURE_CX16 feature.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v3:
- Newly added. We need to atomically update the IRTE in PI format
  via CMPXCHG16B which is only available with this feature.
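
For reference, CMPXCHG16B support is reported in CPUID leaf 1, ECX bit 13.
The sketch below only decodes that bit from a raw ECX value; the helper
name is hypothetical, and Xen itself goes through boot_cpu_has() on its
cached feature words rather than a raw register value.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.01H:ECX bit 13 reports CMPXCHG16B availability. */
#define CPUID1_ECX_CX16_BIT 13

/* Hypothetical helper: test the CX16 bit in a CPUID.01H:ECX value. */
static bool ecx_has_cx16(uint32_t cpuid1_ecx)
{
    return (cpuid1_ecx >> CPUID1_ECX_CX16_BIT) & 1u;
}
```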

 xen/include/asm-x86/cpufeature.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index 7963a3a..63c1fe8 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -216,6 +216,8 @@
 
 #define cpu_has_cpuid_faulting	boot_cpu_has(X86_FEATURE_CPUID_FAULTING)
 
+#define cpu_has_cx16            boot_cpu_has(X86_FEATURE_CX16)
+
 enum _cache_type {
     CACHE_TYPE_NULL = 0,
     CACHE_TYPE_DATA = 1,
-- 
2.1.0


* [v3 03/15] Add cmpxchg16b support for x86-64
  2015-06-24  5:18 [v3 00/15] Add VT-d Posted-Interrupts support Feng Wu
  2015-06-24  5:18 ` [v3 01/15] Vt-d Posted-intterrupt (PI) design Feng Wu
  2015-06-24  5:18 ` [v3 02/15] Add helper macro for X86_FEATURE_CX16 feature detection Feng Wu
@ 2015-06-24  5:18 ` Feng Wu
  2015-06-24 18:35   ` Andrew Cooper
  2015-07-10 12:57   ` Jan Beulich
  2015-06-24  5:18 ` [v3 04/15] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature Feng Wu
                   ` (11 subsequent siblings)
  14 siblings, 2 replies; 155+ messages in thread
From: Feng Wu @ 2015-06-24  5:18 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, feng.wu

This patch adds cmpxchg16b support for x86-64, so software
can perform 128-bit atomic write/read.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v3:
Newly added.
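
The calling convention described in the comment of this patch (return the
initial value in MEM; success means RETURN equals OLD) can be illustrated
with a plain C reference model. The model below is deliberately NOT atomic
and all names in it are hypothetical; it only shows the semantics a caller
relies on, not how the instruction is implemented.

```c
#include <stdint.h>

/* Same shape as the uint128_t this patch adds to xen/include/xen/types.h. */
typedef struct {
    uint64_t low;
    uint64_t high;
} u128_model_t;

/*
 * NON-ATOMIC model of a 16-byte compare-and-exchange: if *mem equals old,
 * store new_val into *mem. Always return the value *mem held on entry.
 */
static u128_model_t cmpxchg16b_model(u128_model_t *mem, u128_model_t old,
                                     u128_model_t new_val)
{
    u128_model_t prev = *mem;

    if ( prev.low == old.low && prev.high == old.high )
        *mem = new_val;

    return prev;
}

/* Returns 1 if the exchange took place, i.e. the returned value equals old. */
static int cmpxchg16b_model_ok(u128_model_t *mem, u128_model_t old,
                               u128_model_t new_val)
{
    u128_model_t prev = cmpxchg16b_model(mem, old, new_val);

    return prev.low == old.low && prev.high == old.high;
}
```

A caller that fails the comparison would typically reload the current value
and retry, which is the usual pattern for updating a 16-byte IRTE.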

 xen/include/asm-x86/x86_64/system.h | 28 ++++++++++++++++++++++++++++
 xen/include/xen/types.h             |  5 +++++
 2 files changed, 33 insertions(+)

diff --git a/xen/include/asm-x86/x86_64/system.h b/xen/include/asm-x86/x86_64/system.h
index 662813a..a910d00 100644
--- a/xen/include/asm-x86/x86_64/system.h
+++ b/xen/include/asm-x86/x86_64/system.h
@@ -6,6 +6,34 @@
                                    (unsigned long)(n),sizeof(*(ptr))))
 
 /*
+ * Atomic 16 bytes compare and exchange.  Compare OLD with MEM, if
+ * identical, store NEW in MEM.  Return the initial value in MEM.
+ * Success is indicated by comparing RETURN with OLD.
+ *
+ * This function can only be called when cpu_has_cx16 is true.
+ */
+
+static always_inline uint128_t __cmpxchg16b(
+    volatile void *ptr, uint128_t old, uint128_t new)
+{
+    uint128_t prev;
+
+    ASSERT(cpu_has_cx16);
+
+    asm volatile ( "lock; cmpxchg16b %4"
+                   : "=d" (prev.high), "=a" (prev.low)
+                   : "c" (new.high), "b" (new.low),
+                   "m" (*__xg((volatile void *)ptr)),
+                   "0" (old.high), "1" (old.low)
+                   : "memory" );
+
+    return prev;
+}
+
+#define cmpxchg16b(ptr,o,n)                                             \
+    __cmpxchg16b((ptr), *(uint128_t *)(o), *(uint128_t *)(n))
+
+/*
  * This function causes value _o to be changed to _n at location _p.
  * If this access causes a fault then we return 1, otherwise we return 0.
  * If no fault occurs then _o is updated to the value we saw at _p. If this
diff --git a/xen/include/xen/types.h b/xen/include/xen/types.h
index 8596ded..30f8a44 100644
--- a/xen/include/xen/types.h
+++ b/xen/include/xen/types.h
@@ -47,6 +47,11 @@ typedef         __u64           uint64_t;
 typedef         __u64           u_int64_t;
 typedef         __s64           int64_t;
 
+typedef struct {
+        uint64_t low;
+        uint64_t high;
+} uint128_t;
+
 struct domain;
 struct vcpu;
 
-- 
2.1.0


* [v3 04/15] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature
  2015-06-24  5:18 [v3 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (2 preceding siblings ...)
  2015-06-24  5:18 ` [v3 03/15] Add cmpxchg16b support for x86-64 Feng Wu
@ 2015-06-24  5:18 ` Feng Wu
  2015-06-25  9:06   ` Andrew Cooper
  2015-07-08  7:30   ` Tian, Kevin
  2015-06-24  5:18 ` [v3 05/15] vt-d: VT-d Posted-Interrupts feature detection Feng Wu
                   ` (10 subsequent siblings)
  14 siblings, 2 replies; 155+ messages in thread
From: Feng Wu @ 2015-06-24  5:18 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, feng.wu

VT-d Posted-Interrupts is an enhancement to CPU-side Posted-Interrupts.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when the guest is running in non-root mode.

This patch adds variable 'iommu_intpost' to control whether VT-d
posted-interrupt is enabled in the generic IOMMU code.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v3:
- Remove pointless initializer for 'iommu_intpost'.
- Some adjustment for "if no intremap then no intpost" logic.
    * For parse_iommu_param(), move it to the end of the function,
      so we don't need to add the same logic when introducing the
      new kernel parameter 'intpost' in a later patch.
    * Add this logic in iommu_setup() after iommu_hardware_setup()
      is called.

 xen/drivers/passthrough/iommu.c | 10 +++++++++-
 xen/include/xen/iommu.h         |  2 +-
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index 06cb38f..597f676 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -39,6 +39,7 @@ static void iommu_dump_p2m_table(unsigned char key);
  *   no-snoop                   Disable VT-d Snoop Control
  *   no-qinval                  Disable VT-d Queued Invalidation
  *   no-intremap                Disable VT-d Interrupt Remapping
+ *   no-intpost                 Disable VT-d Interrupt posting
  */
 custom_param("iommu", parse_iommu_param);
 bool_t __initdata iommu_enable = 1;
@@ -51,6 +52,7 @@ bool_t __read_mostly iommu_passthrough;
 bool_t __read_mostly iommu_snoop = 1;
 bool_t __read_mostly iommu_qinval = 1;
 bool_t __read_mostly iommu_intremap = 1;
+bool_t __read_mostly iommu_intpost;
 bool_t __read_mostly iommu_hap_pt_share = 1;
 bool_t __read_mostly iommu_debug;
 bool_t __read_mostly amd_iommu_perdev_intremap = 1;
@@ -112,6 +114,9 @@ static void __init parse_iommu_param(char *s)
 
         s = ss + 1;
     } while ( ss );
+
+    if ( !iommu_intremap )
+        iommu_intpost = 0;
 }
 
 int iommu_domain_init(struct domain *d)
@@ -305,6 +310,9 @@ int __init iommu_setup(void)
         panic("Couldn't enable %s and iommu=required/force",
               !iommu_enabled ? "IOMMU" : "Interrupt Remapping");
 
+    if ( !iommu_intremap )
+        iommu_intpost = 0;
+
     if ( !iommu_enabled )
     {
         iommu_snoop = 0;
@@ -372,7 +380,7 @@ void iommu_crash_shutdown(void)
     const struct iommu_ops *ops = iommu_get_ops();
     if ( iommu_enabled )
         ops->crash_shutdown();
-    iommu_enabled = iommu_intremap = 0;
+    iommu_enabled = iommu_intremap = iommu_intpost = 0;
 }
 
 bool_t iommu_has_feature(struct domain *d, enum iommu_feature feature)
diff --git a/xen/include/xen/iommu.h b/xen/include/xen/iommu.h
index b30bf41..a123cce 100644
--- a/xen/include/xen/iommu.h
+++ b/xen/include/xen/iommu.h
@@ -31,7 +31,7 @@
 extern bool_t iommu_enable, iommu_enabled;
 extern bool_t force_iommu, iommu_verbose;
 extern bool_t iommu_workaround_bios_bug, iommu_passthrough;
-extern bool_t iommu_snoop, iommu_qinval, iommu_intremap;
+extern bool_t iommu_snoop, iommu_qinval, iommu_intremap, iommu_intpost;
 extern bool_t iommu_hap_pt_share;
 extern bool_t iommu_debug;
 extern bool_t amd_iommu_perdev_intremap;
-- 
2.1.0


* [v3 05/15] vt-d: VT-d Posted-Interrupts feature detection
  2015-06-24  5:18 [v3 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (3 preceding siblings ...)
  2015-06-24  5:18 ` [v3 04/15] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature Feng Wu
@ 2015-06-24  5:18 ` Feng Wu
  2015-06-25 10:21   ` Andrew Cooper
  2015-07-08  7:32   ` Tian, Kevin
  2015-06-24  5:18 ` [v3 06/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts Feng Wu
                   ` (9 subsequent siblings)
  14 siblings, 2 replies; 155+ messages in thread
From: Feng Wu @ 2015-06-24  5:18 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, feng.wu

VT-d Posted-Interrupts is an enhancement to CPU-side Posted-Interrupts.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when the guest is running in non-root mode.

This patch adds feature detection logic for VT-d posted-interrupt.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v3:
- Remove the "if no intremap then no intpost" logic in
  intel_vtd_setup(); it is covered in iommu_setup().
- Add "if no intremap then no intpost" logic at the end
  of init_vtd_hw(), which is called by vtd_resume().

So the logic exists in the following three places:
- parse_iommu_param()
- iommu_setup()
- init_vtd_hw()
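
The intended gating, as described by this changelog and the comment in the
patch, can be summarised as a single predicate: posting is usable only if
interrupt remapping is on, the VT-d capability register reports posting
(bit 59), and the CPU has CMPXCHG16B. The function below is a hypothetical
standalone sketch of that policy, not code from the patch; only the
cap_intr_post() decoding is copied from it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same decoding as the cap_intr_post() macro added by this patch. */
#define cap_intr_post(c) (((c) >> 59) & 1)

/*
 * Hypothetical predicate: may posted interrupts stay enabled for one
 * VT-d engine, given what was requested and what the platform offers?
 */
static bool intpost_usable(bool want_intpost, bool intremap,
                           uint64_t cap, bool has_cx16)
{
    /* "If no intremap then no intpost". */
    if ( !intremap )
        return false;

    /* Need hardware posting support and a 16-byte atomic IRTE update. */
    if ( !cap_intr_post(cap) || !has_cx16 )
        return false;

    return want_intpost;
}
```

Since the features are enabled only if every VT-d engine supports them,
a caller would evaluate this per engine and clear the global flag as soon
as one engine fails the check.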

 xen/drivers/passthrough/vtd/iommu.c | 18 ++++++++++++++++--
 xen/drivers/passthrough/vtd/iommu.h |  1 +
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 9053a1f..4221185 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -2071,6 +2071,9 @@ static int init_vtd_hw(void)
                 disable_intremap(drhd->iommu);
     }
 
+    if ( !iommu_intremap )
+        iommu_intpost = 0;
+
     /*
      * Set root entries for each VT-d engine.  After set root entry,
      * must globally invalidate context cache, and then globally
@@ -2133,8 +2136,8 @@ int __init intel_vtd_setup(void)
     }
 
     /* We enable the following features only if they are supported by all VT-d
-     * engines: Snoop Control, DMA passthrough, Queued Invalidation and
-     * Interrupt Remapping.
+     * engines: Snoop Control, DMA passthrough, Queued Invalidation, Interrupt
+     * Remapping, and Posted Interrupt.
      */
     for_each_drhd_unit ( drhd )
     {
@@ -2162,6 +2165,15 @@ int __init intel_vtd_setup(void)
         if ( iommu_intremap && !ecap_intr_remap(iommu->ecap) )
             iommu_intremap = 0;
 
+        /*
+         * We cannot use posted interrupt if X86_FEATURE_CX16 is
+         * not supported, since we count on this feature to
+         * atomically update 16-byte IRTE in posted format.
+         */
+        if ( iommu_intpost &&
+             (!cap_intr_post(iommu->cap) || !cpu_has_cx16) )
+            iommu_intpost = 0;
+
         if ( !vtd_ept_page_compatible(iommu) )
             iommu_hap_pt_share = 0;
 
@@ -2187,6 +2199,7 @@ int __init intel_vtd_setup(void)
     P(iommu_passthrough, "Dom0 DMA Passthrough");
     P(iommu_qinval, "Queued Invalidation");
     P(iommu_intremap, "Interrupt Remapping");
+    P(iommu_intpost, "Posted Interrupt");
     P(iommu_hap_pt_share, "Shared EPT tables");
 #undef P
 
@@ -2206,6 +2219,7 @@ int __init intel_vtd_setup(void)
     iommu_passthrough = 0;
     iommu_qinval = 0;
     iommu_intremap = 0;
+    iommu_intpost = 0;
     return ret;
 }
 
diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
index 80f8830..e807253 100644
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -69,6 +69,7 @@
 /*
  * Decoding Capability Register
  */
+#define cap_intr_post(c)       (((c) >> 59) & 1)
 #define cap_read_drain(c)      (((c) >> 55) & 1)
 #define cap_write_drain(c)     (((c) >> 54) & 1)
 #define cap_max_amask_val(c)   (((c) >> 48) & 0x3f)
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 155+ messages in thread
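
The gating rule in patch 05 above can be sketched in isolation as follows. This is only an illustration under stated assumptions: intpost_usable() and its parameters are hypothetical names, not part of the patch; only cap_intr_post() is taken from it. The sketch assumes posting is kept only when interrupt remapping is on, CX16 is available, and every engine reports the capability bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same decoding as the patch: bit 59 of the capability register. */
#define cap_intr_post(c) (((c) >> 59) & 1)

/*
 * Hypothetical helper, not part of the patch: decide whether interrupt
 * posting may stay enabled, given one capability register per engine.
 */
static bool intpost_usable(bool want_intpost, bool intremap, bool has_cx16,
                           const uint64_t *caps, size_t nr_engines)
{
    size_t i;

    assert(caps != NULL || nr_engines == 0);

    /* Posting depends on remapping and on a 16-byte atomic IRTE update. */
    if ( !want_intpost || !intremap || !has_cx16 )
        return false;

    /* A single engine without the capability disables the feature. */
    for ( i = 0; i < nr_engines; i++ )
        if ( !cap_intr_post(caps[i]) )
            return false;

    return true;
}
```

A caller would evaluate this once after scanning all engines and clear the posting flag if it returns false.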

* [v3 06/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  2015-06-24  5:18 [v3 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (4 preceding siblings ...)
  2015-06-24  5:18 ` [v3 05/15] vt-d: VT-d Posted-Interrupts feature detection Feng Wu
@ 2015-06-24  5:18 ` Feng Wu
  2015-06-29 15:04   ` Andrew Cooper
                     ` (2 more replies)
  2015-06-24  5:18 ` [v3 07/15] vmx: Initialize VT-d Posted-Interrupts Descriptor Feng Wu
                   ` (8 subsequent siblings)
  14 siblings, 3 replies; 155+ messages in thread
From: Feng Wu @ 2015-06-24  5:18 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, feng.wu

Extend struct pi_desc according to VT-d Posted-Interrupts Spec.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v3:
- Use u32 instead of u64 for the bitfield in 'struct pi_desc'

 xen/include/asm-x86/hvm/vmx/vmcs.h | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index 1104bda..dedfaef 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -81,8 +81,19 @@ struct vmx_domain {
 
 struct pi_desc {
     DECLARE_BITMAP(pir, NR_VECTORS);
-    u32 control;
-    u32 rsvd[7];
+    union {
+        struct
+        {
+        u16 on     : 1,  /* bit 256 - Outstanding Notification */
+            sn     : 1,  /* bit 257 - Suppress Notification */
+            rsvd_1 : 14; /* bit 271:258 - Reserved */
+        u8  nv;          /* bit 279:272 - Notification Vector */
+        u8  rsvd_2;      /* bit 287:280 - Reserved */
+        u32 ndst;        /* bit 319:288 - Notification Destination */
+        };
+        u64 control;
+    };
+    u32 rsvd[6];
 } __attribute__ ((aligned (64)));
 
 #define ept_get_wl(ept)   ((ept)->ept_wl)
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 155+ messages in thread
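
For readers who want to check the bit positions in the comments of patch 06 above, here is a small sketch that packs the same fields into the 64-bit control word with explicit shifts rather than bit-fields. The PI_CTRL_* macros and pi_control_*() helpers are hypothetical names used only for this illustration; the positions follow the comments in the patch (descriptor bit 256 is bit 0 of the control word).

```c
#include <stdint.h>

/* Positions inside the 64-bit control word (descriptor bits 319:256). */
#define PI_CTRL_ON          (UINT64_C(1) << 0)  /* descriptor bit 256 */
#define PI_CTRL_SN          (UINT64_C(1) << 1)  /* descriptor bit 257 */
#define PI_CTRL_NV_SHIFT    16                  /* descriptor bits 279:272 */
#define PI_CTRL_NDST_SHIFT  32                  /* descriptor bits 319:288 */

/* Hypothetical helper: build a control word from its fields. */
static uint64_t pi_control_pack(int on, int sn, uint8_t nv, uint32_t ndst)
{
    return (on ? PI_CTRL_ON : 0) |
           (sn ? PI_CTRL_SN : 0) |
           ((uint64_t)nv << PI_CTRL_NV_SHIFT) |
           ((uint64_t)ndst << PI_CTRL_NDST_SHIFT);
}

/* Hypothetical helper: extract the notification vector. */
static uint8_t pi_control_nv(uint64_t control)
{
    return (uint8_t)((control >> PI_CTRL_NV_SHIFT) & 0xff);
}

/* Hypothetical helper: extract the notification destination. */
static uint32_t pi_control_ndst(uint64_t control)
{
    return (uint32_t)(control >> PI_CTRL_NDST_SHIFT);
}
```

The shift-based form avoids depending on the compiler's bit-field layout, which may be a useful cross-check for the struct definition.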

* [v3 07/15] vmx: Initialize VT-d Posted-Interrupts Descriptor
  2015-06-24  5:18 [v3 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (5 preceding siblings ...)
  2015-06-24  5:18 ` [v3 06/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts Feng Wu
@ 2015-06-24  5:18 ` Feng Wu
  2015-06-29 15:32   ` Andrew Cooper
  2015-07-08  7:53   ` Tian, Kevin
  2015-06-24  5:18 ` [v3 08/15] Suppress posting interrupts when 'SN' is set Feng Wu
                   ` (7 subsequent siblings)
  14 siblings, 2 replies; 155+ messages in thread
From: Feng Wu @ 2015-06-24  5:18 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, feng.wu

This patch initializes the VT-d Posted-interrupt Descriptor.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v3:
- Move pi_desc_init() to xen/arch/x86/hvm/vmx/vmcs.c
- Remove the 'inline' flag of pi_desc_init()

 xen/arch/x86/hvm/vmx/vmcs.c       | 18 ++++++++++++++++++
 xen/include/asm-x86/hvm/vmx/vmx.h |  2 ++
 2 files changed, 20 insertions(+)

diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 3aff365..11dc1b5 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -40,6 +40,7 @@
 #include <asm/flushtlb.h>
 #include <asm/shadow.h>
 #include <asm/tboot.h>
+#include <asm/apic.h>
 
 static bool_t __read_mostly opt_vpid_enabled = 1;
 boolean_param("vpid", opt_vpid_enabled);
@@ -921,6 +922,20 @@ void virtual_vmcs_vmwrite(void *vvmcs, u32 vmcs_encoding, u64 val)
     virtual_vmcs_exit(vvmcs);
 }
 
+static void pi_desc_init(struct vcpu *v)
+{
+    uint32_t dest;
+
+    v->arch.hvm_vmx.pi_desc.nv = posted_intr_vector;
+
+    dest = cpu_physical_id(v->processor);
+
+    if ( x2apic_enabled )
+        v->arch.hvm_vmx.pi_desc.ndst = dest;
+    else
+        v->arch.hvm_vmx.pi_desc.ndst = MASK_INSR(dest, PI_xAPIC_NDST_MASK);
+}
+
 static int construct_vmcs(struct vcpu *v)
 {
     struct domain *d = v->domain;
@@ -1054,6 +1069,9 @@ static int construct_vmcs(struct vcpu *v)
 
     if ( cpu_has_vmx_posted_intr_processing )
     {
+        if ( iommu_intpost )
+            pi_desc_init(v);
+
         __vmwrite(PI_DESC_ADDR, virt_to_maddr(&v->arch.hvm_vmx.pi_desc));
         __vmwrite(POSTED_INTR_NOTIFICATION_VECTOR, posted_intr_vector);
     }
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 35f804a..5853563 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -89,6 +89,8 @@ typedef enum {
 #define EPT_EMT_WB              6
 #define EPT_EMT_RSV2            7
 
+#define PI_xAPIC_NDST_MASK      0xFF00
+
 void vmx_asm_vmexit_handler(struct cpu_user_regs);
 void vmx_asm_do_vmentry(void);
 void vmx_intr_assist(void);
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 155+ messages in thread
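
The NDST encoding used by pi_desc_init() in patch 07 above can be sketched as below. pi_ndst() is a hypothetical name for this illustration, and the MASK_INSR() definition is reproduced here on the assumption that it has the same shape as Xen's macro (shift the value to the lowest set bit of the mask, then mask it).

```c
#include <stdbool.h>
#include <stdint.h>

#define PI_xAPIC_NDST_MASK 0xFF00u

/* Assumed to mirror Xen's MASK_INSR(): insert v into the field m. */
#define MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))

/*
 * Hypothetical helper: in x2APIC mode the full 32-bit APIC ID is used,
 * in xAPIC mode the 8-bit APIC ID goes into bits 15:8.
 */
static uint32_t pi_ndst(uint32_t apic_id, bool x2apic)
{
    return x2apic ? apic_id : MASK_INSR(apic_id, PI_xAPIC_NDST_MASK);
}
```

For example, APIC ID 5 would be stored as 5 in x2APIC mode and as 0x500 in xAPIC mode.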

* [v3 08/15] Suppress posting interrupts when 'SN' is set
  2015-06-24  5:18 [v3 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (6 preceding siblings ...)
  2015-06-24  5:18 ` [v3 07/15] vmx: Initialize VT-d Posted-Interrupts Descriptor Feng Wu
@ 2015-06-24  5:18 ` Feng Wu
  2015-06-29 15:41   ` Andrew Cooper
                     ` (2 more replies)
  2015-06-24  5:18 ` [v3 09/15] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts Feng Wu
                   ` (6 subsequent siblings)
  14 siblings, 3 replies; 155+ messages in thread
From: Feng Wu @ 2015-06-24  5:18 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, feng.wu

Urgent interrupts are not supported yet, so all interrupts are
treated as non-urgent. Consequently, we must not send a
posted-interrupt notification when 'SN' is set.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v3:
- Use cmpxchg to test SN/ON and set ON

 xen/arch/x86/hvm/vmx/vmx.c | 32 ++++++++++++++++++++++++++++----
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 0837627..b94ef6a 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1686,6 +1686,8 @@ static void __vmx_deliver_posted_interrupt(struct vcpu *v)
 
 static void vmx_deliver_posted_intr(struct vcpu *v, u8 vector)
 {
+    struct pi_desc old, new, prev;
+
     if ( pi_test_and_set_pir(vector, &v->arch.hvm_vmx.pi_desc) )
         return;
 
@@ -1698,13 +1700,35 @@ static void vmx_deliver_posted_intr(struct vcpu *v, u8 vector)
          */
         pi_set_on(&v->arch.hvm_vmx.pi_desc);
     }
-    else if ( !pi_test_and_set_on(&v->arch.hvm_vmx.pi_desc) )
+    else
     {
+        prev.control = 0;
+
+        do {
+            old.control = v->arch.hvm_vmx.pi_desc.control &
+                          ~(1 << POSTED_INTR_ON | 1 << POSTED_INTR_SN);
+            new.control = v->arch.hvm_vmx.pi_desc.control |
+                          1 << POSTED_INTR_ON;
+
+            /*
+             * Urgent interrupts are not supported yet, so all
+             * interrupts are treated as non-urgent and we must not
+             * send a posted-interrupt notification when 'SN' is set.
+             * Likewise, if 'ON' is already set, a notification is
+             * already outstanding and we must not send another one.
+             */
+            if ( prev.sn || prev.on )
+            {
+                vcpu_kick(v);
+                return;
+            }
+
+            prev.control = cmpxchg(&v->arch.hvm_vmx.pi_desc.control,
+                                   old.control, new.control);
+        } while ( prev.control != old.control );
+
         __vmx_deliver_posted_interrupt(v);
-        return;
     }
-
-    vcpu_kick(v);
 }
 
 static void vmx_sync_pir_to_irr(struct vcpu *v)
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 155+ messages in thread
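
The test-and-set rule in patch 08 above ("set ON only if neither SN nor ON is set") can be sketched with C11 atomics as follows. This is not the patch's code: it uses <stdatomic.h> instead of Xen's cmpxchg(), and pi_try_set_on(), PI_ON and PI_SN are hypothetical names for this illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PI_ON (UINT64_C(1) << 0)  /* Outstanding Notification */
#define PI_SN (UINT64_C(1) << 1)  /* Suppress Notification */

/*
 * Hypothetical helper: atomically set ON unless SN or ON is already set.
 * Returns true if ON was newly set (the caller would send the
 * notification), false otherwise (the caller would kick the vCPU).
 */
static bool pi_try_set_on(_Atomic uint64_t *control)
{
    uint64_t old = atomic_load(control);

    do {
        if ( old & (PI_SN | PI_ON) )
            return false;
        /* On failure, 'old' is reloaded with the current value. */
    } while ( !atomic_compare_exchange_weak(control, &old, old | PI_ON) );

    return true;
}
```

Reading the control word once per iteration, and deriving both the expected and the new value from that single snapshot, keeps the two values consistent with each other.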

* [v3 09/15] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts
  2015-06-24  5:18 [v3 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (7 preceding siblings ...)
  2015-06-24  5:18 ` [v3 08/15] Suppress posting interrupts when 'SN' is set Feng Wu
@ 2015-06-24  5:18 ` Feng Wu
  2015-06-29 16:04   ` Andrew Cooper
                     ` (2 more replies)
  2015-06-24  5:18 ` [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used Feng Wu
                   ` (5 subsequent siblings)
  14 siblings, 3 replies; 155+ messages in thread
From: Feng Wu @ 2015-06-24  5:18 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, feng.wu

Extend struct iremap_entry according to VT-d Posted-Interrupts Spec.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v3:
- Use u32 instead of u64 to define the bitfields in 'struct iremap_entry'
- Limit using bitfield if possible

 xen/drivers/passthrough/vtd/intremap.c | 92 +++++++++++++++++-----------------
 xen/drivers/passthrough/vtd/iommu.h    | 42 ++++++++++------
 xen/drivers/passthrough/vtd/utils.c    | 10 ++--
 3 files changed, 80 insertions(+), 64 deletions(-)

diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index 0333686..b7a42f6 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -123,9 +123,9 @@ static u16 hpetid_to_bdf(unsigned int hpet_id)
 static void set_ire_sid(struct iremap_entry *ire,
                         unsigned int svt, unsigned int sq, unsigned int sid)
 {
-    ire->hi.svt = svt;
-    ire->hi.sq = sq;
-    ire->hi.sid = sid;
+    ire->remap.svt = svt;
+    ire->remap.sq = sq;
+    ire->remap.sid = sid;
 }
 
 static void set_ioapic_source_id(int apic_id, struct iremap_entry *ire)
@@ -220,7 +220,7 @@ static unsigned int alloc_remap_entry(struct iommu *iommu, unsigned int nr)
         else
             p = &iremap_entries[i % (1 << IREMAP_ENTRY_ORDER)];
 
-        if ( p->lo_val || p->hi_val ) /* not a free entry */
+        if ( p->lo || p->hi ) /* not a free entry */
             found = 0;
         else if ( ++found == nr )
             break;
@@ -254,7 +254,7 @@ static int remap_entry_to_ioapic_rte(
     GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
                      iremap_entries, iremap_entry);
 
-    if ( iremap_entry->hi_val == 0 && iremap_entry->lo_val == 0 )
+    if ( iremap_entry->hi == 0 && iremap_entry->lo == 0 )
     {
         dprintk(XENLOG_ERR VTDPREFIX,
                 "%s: index (%d) get an empty entry!\n",
@@ -264,13 +264,13 @@ static int remap_entry_to_ioapic_rte(
         return -EFAULT;
     }
 
-    old_rte->vector = iremap_entry->lo.vector;
-    old_rte->delivery_mode = iremap_entry->lo.dlm;
-    old_rte->dest_mode = iremap_entry->lo.dm;
-    old_rte->trigger = iremap_entry->lo.tm;
+    old_rte->vector = iremap_entry->remap.vector;
+    old_rte->delivery_mode = iremap_entry->remap.dlm;
+    old_rte->dest_mode = iremap_entry->remap.dm;
+    old_rte->trigger = iremap_entry->remap.tm;
     old_rte->__reserved_2 = 0;
     old_rte->dest.logical.__reserved_1 = 0;
-    old_rte->dest.logical.logical_dest = iremap_entry->lo.dst >> 8;
+    old_rte->dest.logical.logical_dest = iremap_entry->remap.dst >> 8;
 
     unmap_vtd_domain_page(iremap_entries);
     spin_unlock_irqrestore(&ir_ctrl->iremap_lock, flags);
@@ -318,27 +318,28 @@ static int ioapic_rte_to_remap_entry(struct iommu *iommu,
     if ( rte_upper )
     {
         if ( x2apic_enabled )
-            new_ire.lo.dst = value;
+            new_ire.remap.dst = value;
         else
-            new_ire.lo.dst = (value >> 24) << 8;
+            new_ire.remap.dst = (value >> 24) << 8;
     }
     else
     {
         *(((u32 *)&new_rte) + 0) = value;
-        new_ire.lo.fpd = 0;
-        new_ire.lo.dm = new_rte.dest_mode;
-        new_ire.lo.tm = new_rte.trigger;
-        new_ire.lo.dlm = new_rte.delivery_mode;
+        new_ire.remap.fpd = 0;
+        new_ire.remap.dm = new_rte.dest_mode;
+        new_ire.remap.tm = new_rte.trigger;
+        new_ire.remap.dlm = new_rte.delivery_mode;
         /* Hardware require RH = 1 for LPR delivery mode */
-        new_ire.lo.rh = (new_ire.lo.dlm == dest_LowestPrio);
-        new_ire.lo.avail = 0;
-        new_ire.lo.res_1 = 0;
-        new_ire.lo.vector = new_rte.vector;
-        new_ire.lo.res_2 = 0;
+        new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
+        new_ire.remap.avail = 0;
+        new_ire.remap.res_1 = 0;
+        new_ire.remap.vector = new_rte.vector;
+        new_ire.remap.res_2 = 0;
 
         set_ioapic_source_id(IO_APIC_ID(apic), &new_ire);
-        new_ire.hi.res_1 = 0;
-        new_ire.lo.p = 1;     /* finally, set present bit */
+        new_ire.remap.res_3 = 0;
+        new_ire.remap.res_4 = 0;
+        new_ire.remap.p = 1;     /* finally, set present bit */
 
         /* now construct new ioapic rte entry */
         remap_rte->vector = new_rte.vector;
@@ -511,7 +512,7 @@ static int remap_entry_to_msi_msg(
     GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, index,
                      iremap_entries, iremap_entry);
 
-    if ( iremap_entry->hi_val == 0 && iremap_entry->lo_val == 0 )
+    if ( iremap_entry->hi == 0 && iremap_entry->lo == 0 )
     {
         dprintk(XENLOG_ERR VTDPREFIX,
                 "%s: index (%d) get an empty entry!\n",
@@ -524,25 +525,25 @@ static int remap_entry_to_msi_msg(
     msg->address_hi = MSI_ADDR_BASE_HI;
     msg->address_lo =
         MSI_ADDR_BASE_LO |
-        ((iremap_entry->lo.dm == 0) ?
+        ((iremap_entry->remap.dm == 0) ?
             MSI_ADDR_DESTMODE_PHYS:
             MSI_ADDR_DESTMODE_LOGIC) |
-        ((iremap_entry->lo.dlm != dest_LowestPrio) ?
+        ((iremap_entry->remap.dlm != dest_LowestPrio) ?
             MSI_ADDR_REDIRECTION_CPU:
             MSI_ADDR_REDIRECTION_LOWPRI);
     if ( x2apic_enabled )
-        msg->dest32 = iremap_entry->lo.dst;
+        msg->dest32 = iremap_entry->remap.dst;
     else
-        msg->dest32 = (iremap_entry->lo.dst >> 8) & 0xff;
+        msg->dest32 = (iremap_entry->remap.dst >> 8) & 0xff;
     msg->address_lo |= MSI_ADDR_DEST_ID(msg->dest32);
 
     msg->data =
         MSI_DATA_TRIGGER_EDGE |
         MSI_DATA_LEVEL_ASSERT |
-        ((iremap_entry->lo.dlm != dest_LowestPrio) ?
+        ((iremap_entry->remap.dlm != dest_LowestPrio) ?
             MSI_DATA_DELIVERY_FIXED:
             MSI_DATA_DELIVERY_LOWPRI) |
-        iremap_entry->lo.vector;
+        iremap_entry->remap.vector;
 
     unmap_vtd_domain_page(iremap_entries);
     spin_unlock_irqrestore(&ir_ctrl->iremap_lock, flags);
@@ -601,29 +602,30 @@ static int msi_msg_to_remap_entry(
     memcpy(&new_ire, iremap_entry, sizeof(struct iremap_entry));
 
     /* Set interrupt remapping table entry */
-    new_ire.lo.fpd = 0;
-    new_ire.lo.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
-    new_ire.lo.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
-    new_ire.lo.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;
+    new_ire.remap.fpd = 0;
+    new_ire.remap.dm = (msg->address_lo >> MSI_ADDR_DESTMODE_SHIFT) & 0x1;
+    new_ire.remap.tm = (msg->data >> MSI_DATA_TRIGGER_SHIFT) & 0x1;
+    new_ire.remap.dlm = (msg->data >> MSI_DATA_DELIVERY_MODE_SHIFT) & 0x1;
     /* Hardware require RH = 1 for LPR delivery mode */
-    new_ire.lo.rh = (new_ire.lo.dlm == dest_LowestPrio);
-    new_ire.lo.avail = 0;
-    new_ire.lo.res_1 = 0;
-    new_ire.lo.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
-                        MSI_DATA_VECTOR_MASK;
-    new_ire.lo.res_2 = 0;
+    new_ire.remap.rh = (new_ire.remap.dlm == dest_LowestPrio);
+    new_ire.remap.avail = 0;
+    new_ire.remap.res_1 = 0;
+    new_ire.remap.vector = (msg->data >> MSI_DATA_VECTOR_SHIFT) &
+                            MSI_DATA_VECTOR_MASK;
+    new_ire.remap.res_2 = 0;
     if ( x2apic_enabled )
-        new_ire.lo.dst = msg->dest32;
+        new_ire.remap.dst = msg->dest32;
     else
-        new_ire.lo.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
-                          & 0xff) << 8;
+        new_ire.remap.dst = ((msg->address_lo >> MSI_ADDR_DEST_ID_SHIFT)
+                             & 0xff) << 8;
 
     if ( pdev )
         set_msi_source_id(pdev, &new_ire);
     else
         set_hpet_source_id(msi_desc->hpet_id, &new_ire);
-    new_ire.hi.res_1 = 0;
-    new_ire.lo.p = 1;    /* finally, set present bit */
+    new_ire.remap.res_3 = 0;
+    new_ire.remap.res_4 = 0;
+    new_ire.remap.p = 1;    /* finally, set present bit */
 
     /* now construct new MSI/MSI-X rte entry */
     remap_rte = (struct msi_msg_remap_entry *)msg;
diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
index e807253..49daa70 100644
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -289,29 +289,43 @@ struct dma_pte {
 /* interrupt remap entry */
 struct iremap_entry {
   union {
-    u64 lo_val;
+    struct { u64 lo, hi; };
     struct {
-        u64 p       : 1,
+        u16 p       : 1,
             fpd     : 1,
             dm      : 1,
             rh      : 1,
             tm      : 1,
             dlm     : 3,
             avail   : 4,
-            res_1   : 4,
-            vector  : 8,
-            res_2   : 8,
-            dst     : 32;
-    }lo;
-  };
-  union {
-    u64 hi_val;
+            res_1   : 4;
+        u8  vector;
+        u8  res_2;
+        u32 dst;
+        u16 sid;
+        u16 sq      : 2,
+            svt     : 2,
+            res_3   : 12;
+        u32 res_4   : 32;
+    } remap;
     struct {
-        u64 sid     : 16,
-            sq      : 2,
+        u16 p       : 1,
+            fpd     : 1,
+            res_1   : 6,
+            avail   : 4,
+            res_2   : 2,
+            urg     : 1,
+            im      : 1;
+        u8  vector;
+        u8  res_3;
+        u32 res_4   : 6,
+            pda_l   : 26;
+        u16 sid;
+        u16 sq      : 2,
             svt     : 2,
-            res_1   : 44;
-    }hi;
+            res_5   : 12;
+        u32 pda_h;
+    } post;
   };
 };
 
diff --git a/xen/drivers/passthrough/vtd/utils.c b/xen/drivers/passthrough/vtd/utils.c
index bd14c02..a5fe237 100644
--- a/xen/drivers/passthrough/vtd/utils.c
+++ b/xen/drivers/passthrough/vtd/utils.c
@@ -238,14 +238,14 @@ static void dump_iommu_info(unsigned char key)
                 else
                     p = &iremap_entries[i % (1 << IREMAP_ENTRY_ORDER)];
 
-                if ( !p->lo.p )
+                if ( !p->remap.p )
                     continue;
                 printk("  %04x:  %x   %x  %04x %08x %02x    %x   %x  %x  %x  %x"
                     "   %x %x\n", i,
-                    (u32)p->hi.svt, (u32)p->hi.sq, (u32)p->hi.sid,
-                    (u32)p->lo.dst, (u32)p->lo.vector, (u32)p->lo.avail,
-                    (u32)p->lo.dlm, (u32)p->lo.tm, (u32)p->lo.rh,
-                    (u32)p->lo.dm, (u32)p->lo.fpd, (u32)p->lo.p);
+                    (u32)p->remap.svt, (u32)p->remap.sq, (u32)p->remap.sid,
+                    (u32)p->remap.dst, (u32)p->remap.vector, (u32)p->remap.avail,
+                    (u32)p->remap.dlm, (u32)p->remap.tm, (u32)p->remap.rh,
+                    (u32)p->remap.dm, (u32)p->remap.fpd, (u32)p->remap.p);
                 print_cnt++;
             }
             if ( iremap_entries )
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 155+ messages in thread

* [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used
  2015-06-24  5:18 [v3 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (8 preceding siblings ...)
  2015-06-24  5:18 ` [v3 09/15] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts Feng Wu
@ 2015-06-24  5:18 ` Feng Wu
  2015-06-29 16:22   ` Andrew Cooper
                     ` (2 more replies)
  2015-06-24  5:18 ` [v3 11/15] Update IRTE according to guest interrupt config changes Feng Wu
                   ` (4 subsequent siblings)
  14 siblings, 3 replies; 155+ messages in thread
From: Feng Wu @ 2015-06-24  5:18 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, feng.wu

This patch adds an API to update the IRTE for posted-interrupts
when the guest changes its MSI/MSI-X configuration.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v3:
- Remove "adding PDA_MASK()" when updating 'pda_l' and 'pda_h' for IRTE.
- Change the return type of pi_update_irte() to int.
- Remove some pointless printk messages in pi_update_irte().
- Use structure assignment instead of memcpy() for irte copy.

 xen/drivers/passthrough/vtd/intremap.c | 98 ++++++++++++++++++++++++++++++++++
 xen/drivers/passthrough/vtd/iommu.h    |  2 +
 xen/include/asm-x86/iommu.h            |  2 +
 3 files changed, 102 insertions(+)

diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
index b7a42f6..401a9d1 100644
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -900,3 +900,101 @@ void iommu_disable_x2apic_IR(void)
     for_each_drhd_unit ( drhd )
         disable_qinval(drhd->iommu);
 }
+
+static inline void setup_posted_irte(
+    struct iremap_entry *new_ire, struct pi_desc *pi_desc, uint8_t gvec)
+{
+    new_ire->post.urg = 0;
+    new_ire->post.vector = gvec;
+    new_ire->post.pda_l = virt_to_maddr(pi_desc) >> (32 - PDA_LOW_BIT);
+    new_ire->post.pda_h = virt_to_maddr(pi_desc) >> 32;
+
+    new_ire->post.res_1 = 0;
+    new_ire->post.res_2 = 0;
+    new_ire->post.res_3 = 0;
+    new_ire->post.res_4 = 0;
+    new_ire->post.res_5 = 0;
+
+    new_ire->post.im = 1;
+}
+
+/*
+ * This function is used to update the IRTE for posted-interrupt
+ * when guest changes MSI/MSI-X information.
+ */
+int pi_update_irte(struct vcpu *v, struct pirq *pirq, uint8_t gvec)
+{
+    struct irq_desc *desc;
+    struct msi_desc *msi_desc;
+    int remap_index;
+    int rc = 0;
+    struct pci_dev *pci_dev;
+    struct acpi_drhd_unit *drhd;
+    struct iommu *iommu;
+    struct ir_ctrl *ir_ctrl;
+    struct iremap_entry *iremap_entries = NULL, *p = NULL;
+    struct iremap_entry new_ire;
+    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+    unsigned long flags;
+    uint128_t old_ire, ret;
+
+    desc = pirq_spin_lock_irq_desc(pirq, NULL);
+    if ( !desc )
+        return -ENOMEM;
+
+    msi_desc = desc->msi_desc;
+    if ( !msi_desc )
+    {
+        rc = -EBADSLT;
+        goto unlock_out;
+    }
+
+    pci_dev = msi_desc->dev;
+    if ( !pci_dev )
+    {
+        rc = -ENODEV;
+        goto unlock_out;
+    }
+
+    remap_index = msi_desc->remap_index;
+    drhd = acpi_find_matched_drhd_unit(pci_dev);
+    if ( !drhd )
+    {
+        rc = -ENODEV;
+        goto unlock_out;
+    }
+
+    iommu = drhd->iommu;
+    ir_ctrl = iommu_ir_ctrl(iommu);
+    if ( !ir_ctrl )
+    {
+        rc = -ENODEV;
+        goto unlock_out;
+    }
+
+    spin_lock_irqsave(&ir_ctrl->iremap_lock, flags);
+
+    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index, iremap_entries, p);
+    new_ire = *p;
+
+    /* Setup/Update interrupt remapping table entry. */
+    setup_posted_irte(&new_ire, pi_desc, gvec);
+
+    do {
+        old_ire = *(uint128_t *)p;
+        ret = cmpxchg16b(p, &old_ire, &new_ire);
+    } while ( memcmp(&ret, &old_ire, sizeof(old_ire)) );
+
+    iommu_flush_cache_entry(p, sizeof(struct iremap_entry));
+    iommu_flush_iec_index(iommu, 0, remap_index);
+
+    if ( iremap_entries )
+        unmap_vtd_domain_page(iremap_entries);
+
+    spin_unlock_irqrestore(&ir_ctrl->iremap_lock, flags);
+
+ unlock_out:
+    spin_unlock_irq(&desc->lock);
+
+    return rc;
+}
diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
index 49daa70..9ce941e 100644
--- a/xen/drivers/passthrough/vtd/iommu.h
+++ b/xen/drivers/passthrough/vtd/iommu.h
@@ -329,6 +329,8 @@ struct iremap_entry {
   };
 };
 
+#define PDA_LOW_BIT    26
+
 /* Max intr remapping table page order is 8, as max number of IRTEs is 64K */
 #define IREMAP_PAGE_ORDER  8
 
diff --git a/xen/include/asm-x86/iommu.h b/xen/include/asm-x86/iommu.h
index e7a65da..2a1523e 100644
--- a/xen/include/asm-x86/iommu.h
+++ b/xen/include/asm-x86/iommu.h
@@ -32,6 +32,8 @@ int iommu_supports_eim(void);
 int iommu_enable_x2apic_IR(void);
 void iommu_disable_x2apic_IR(void);
 
+int pi_update_irte(struct vcpu *v, struct pirq *pirq, uint8_t gvec);
+
 #endif /* !__ARCH_X86_IOMMU_H__ */
 /*
  * Local variables:
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 155+ messages in thread
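
The way setup_posted_irte() in patch 10 above splits the posted-interrupt descriptor address across 'pda_l' and 'pda_h' can be sketched like this. struct pda_split, pda_encode() and pda_decode() are hypothetical names for this illustration; the sketch assumes the descriptor is 64-byte aligned, as the aligned(64) attribute on struct pi_desc suggests.

```c
#include <assert.h>
#include <stdint.h>

#define PDA_LOW_BIT 26  /* width of pda_l, as in the patch */

/* Hypothetical container for the two IRTE address fields. */
struct pda_split {
    uint32_t pda_l;  /* address bits 31:6  */
    uint32_t pda_h;  /* address bits 63:32 */
};

static struct pda_split pda_encode(uint64_t maddr)
{
    struct pda_split s;

    /* The low 6 bits are not stored, so they must be zero. */
    assert((maddr & 63) == 0);

    s.pda_l = (uint32_t)(maddr >> (32 - PDA_LOW_BIT)) &
              ((UINT32_C(1) << PDA_LOW_BIT) - 1);
    s.pda_h = (uint32_t)(maddr >> 32);

    return s;
}

static uint64_t pda_decode(struct pda_split s)
{
    return ((uint64_t)s.pda_h << 32) |
           ((uint64_t)s.pda_l << (32 - PDA_LOW_BIT));
}
```

Decoding an encoded address should give back the original value, which is a quick way to sanity-check the shift amounts.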

* [v3 11/15] Update IRTE according to guest interrupt config changes
  2015-06-24  5:18 [v3 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (9 preceding siblings ...)
  2015-06-24  5:18 ` [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used Feng Wu
@ 2015-06-24  5:18 ` Feng Wu
  2015-06-29 16:46   ` Andrew Cooper
                     ` (2 more replies)
  2015-06-24  5:18 ` [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked Feng Wu
                   ` (3 subsequent siblings)
  14 siblings, 3 replies; 155+ messages in thread
From: Feng Wu @ 2015-06-24  5:18 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, feng.wu

When the guest changes its interrupt configuration (such as the vector)
for direct-assigned devices, we need to update the associated IRTE
with the new guest vector, so that external interrupts from the assigned
devices can be injected into the guest without a VM-Exit.

For lowest-priority interrupts, we use a vector-hashing mechanism to find
the destination vCPU. This follows the hardware behavior, since modern
Intel CPUs use vector hashing to handle lowest-priority interrupts.

Multicast/broadcast interrupts cannot be handled via interrupt posting,
so they still use interrupt remapping.

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v3:
- Use a bitmap to store all the possible destination vCPUs of an
  interrupt, then try to find the right destination from the bitmap
- Fix typos and make some small changes

 xen/drivers/passthrough/io.c | 96 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 95 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 9b77334..18e24e1 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -26,6 +26,7 @@
 #include <asm/hvm/iommu.h>
 #include <asm/hvm/support.h>
 #include <xen/hvm/irq.h>
+#include <asm/io_apic.h>
 
 static DEFINE_PER_CPU(struct list_head, dpci_list);
 
@@ -199,6 +200,78 @@ void free_hvm_irq_dpci(struct hvm_irq_dpci *dpci)
     xfree(dpci);
 }
 
+/*
+ * The purpose of this routine is to find the right destination vCPU for
+ * an interrupt which will be delivered by VT-d posted-interrupt. There
+ * are several cases as below:
+ *
+ * - For lowest-priority interrupts, we find the destination vCPU from the
+ *   guest vector using a vector-hashing mechanism and return it. This
+ *   follows the hardware behavior, since modern Intel CPUs use vector
+ *   hashing to handle lowest-priority interrupts.
+ * - Otherwise, for single-destination interrupts, it is straightforward to
+ *   find the destination vCPU and return it.
+ * - Multicast/broadcast interrupts cannot be handled via interrupt
+ *   posting, so return NULL.
+ *
+ *   Here are the details of the vector-hashing mechanism:
+ *   1. For lowest-priority interrupts, record all the possible destination
+ *      vCPUs in a bitmap.
+ *   2. Use "gvec % number of possible destination vCPUs" to pick the
+ *      destination vCPU from the bitmap for the lowest-priority interrupt.
+ */
+static struct vcpu *pi_find_dest_vcpu(struct domain *d, uint8_t dest_id,
+                                      uint8_t dest_mode, uint8_t delivery_mode,
+                                      uint8_t gvec)
+{
+    unsigned long *dest_vcpu_bitmap = NULL;
+    unsigned int dest_vcpu_num = 0, idx = 0;
+    int size = (d->max_vcpus + BITS_PER_LONG - 1) / BITS_PER_LONG;
+    struct vcpu *v, *dest = NULL;
+    int i;
+
+    dest_vcpu_bitmap = xzalloc_array(unsigned long, size);
+    if ( !dest_vcpu_bitmap )
+    {
+        dprintk(XENLOG_G_INFO,
+                "dom%d: failed to allocate memory\n", d->domain_id);
+        return NULL;
+    }
+
+    for_each_vcpu ( d, v )
+    {
+        if ( !vlapic_match_dest(vcpu_vlapic(v), NULL, 0,
+                                dest_id, dest_mode) )
+            continue;
+
+        __set_bit(v->vcpu_id, dest_vcpu_bitmap);
+        dest_vcpu_num++;
+    }
+
+    if ( delivery_mode == dest_LowestPrio )
+    {
+        if ( dest_vcpu_num != 0 )
+        {
+            for ( i = 0; i <= gvec % dest_vcpu_num; i++ )
+                idx = find_next_bit(dest_vcpu_bitmap, d->max_vcpus, idx) + 1;
+            idx--;
+
+            BUG_ON(idx >= d->max_vcpus || idx < 0);
+            dest = d->vcpu[idx];
+        }
+    }
+    else if ( dest_vcpu_num == 1 )
+    {
+        idx = find_first_bit(dest_vcpu_bitmap, d->max_vcpus);
+        BUG_ON(idx >= d->max_vcpus || idx < 0);
+        dest = d->vcpu[idx];
+    }
+
+    xfree(dest_vcpu_bitmap);
+
+    return dest;
+}
+
 int pt_irq_create_bind(
     struct domain *d, xen_domctl_bind_pt_irq_t *pt_irq_bind)
 {
@@ -257,7 +330,7 @@ int pt_irq_create_bind(
     {
     case PT_IRQ_TYPE_MSI:
     {
-        uint8_t dest, dest_mode;
+        uint8_t dest, dest_mode, delivery_mode;
         int dest_vcpu_id;
 
         if ( !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
@@ -330,11 +403,32 @@ int pt_irq_create_bind(
         /* Calculate dest_vcpu_id for MSI-type pirq migration. */
         dest = pirq_dpci->gmsi.gflags & VMSI_DEST_ID_MASK;
         dest_mode = !!(pirq_dpci->gmsi.gflags & VMSI_DM_MASK);
+        delivery_mode = (pirq_dpci->gmsi.gflags >> GFLAGS_SHIFT_DELIV_MODE) &
+                        VMSI_DELIV_MASK;
         dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
         pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
         spin_unlock(&d->event_lock);
         if ( dest_vcpu_id >= 0 )
             hvm_migrate_pirqs(d->vcpu[dest_vcpu_id]);
+
+        /* Use interrupt posting if it is supported */
+        if ( iommu_intpost )
+        {
+            struct vcpu *vcpu = pi_find_dest_vcpu(d, dest, dest_mode,
+                                        delivery_mode, pirq_dpci->gmsi.gvec);
+
+            if ( !vcpu )
+                dprintk(XENLOG_G_WARNING,
+                        "dom%u: failed to find the dest vCPU for PI, guest "
+                        "vector:0x%x, falling back to software delivery of "
+                        "the interrupts.\n", d->domain_id, pirq_dpci->gmsi.gvec);
+            else if ( pi_update_irte(vcpu, info, pirq_dpci->gmsi.gvec) != 0 )
+                dprintk(XENLOG_G_WARNING,
+                        "%pv: failed to update PI IRTE, guest vector:0x%x, "
+                        "falling back to software delivery of the interrupts.\n",
+                        vcpu, pirq_dpci->gmsi.gvec);
+        }
+
         break;
     }
 
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 155+ messages in thread
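
The vector-hashing selection described in patch 11 above (pick the "gvec % N"-th candidate out of N possible destination vCPUs) can be sketched on a plain bitmap as follows. This is a simplified stand-in, not the patch's code: pi_hash_dest() and bitmap_test() are hypothetical names, and the loops replace Xen's find_next_bit().

```c
#include <limits.h>
#include <stdint.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: test bit 'id' in a bitmap of unsigned longs. */
static int bitmap_test(const unsigned long *bitmap, unsigned int id)
{
    return (bitmap[id / BITS_PER_LONG] >> (id % BITS_PER_LONG)) & 1;
}

/*
 * Hypothetical helper: return the vCPU id of the (gvec % count)-th set
 * bit, where count is the number of set bits, or -1 if none is set.
 */
static int pi_hash_dest(const unsigned long *bitmap, unsigned int nbits,
                        uint8_t gvec)
{
    unsigned int id, count = 0, target;

    for ( id = 0; id < nbits; id++ )
        count += bitmap_test(bitmap, id);

    if ( count == 0 )
        return -1;

    target = gvec % count;
    for ( id = 0; id < nbits; id++ )
    {
        if ( !bitmap_test(bitmap, id) )
            continue;
        if ( target-- == 0 )
            return (int)id;
    }

    return -1; /* not reached when count != 0 */
}
```

With candidates {1, 2, 4} and guest vector 0x31 (49), the index is 49 % 3 = 1, so the second candidate, vCPU 2, would be chosen.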

* [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-06-24  5:18 [v3 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (10 preceding siblings ...)
  2015-06-24  5:18 ` [v3 11/15] Update IRTE according to guest interrupt config changes Feng Wu
@ 2015-06-24  5:18 ` Feng Wu
  2015-06-29 17:07   ` Andrew Cooper
                     ` (3 more replies)
  2015-06-24  5:18 ` [v3 13/15] vmx: Properly handle notification event when vCPU is running Feng Wu
                   ` (2 subsequent siblings)
  14 siblings, 4 replies; 155+ messages in thread
From: Feng Wu @ 2015-06-24  5:18 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, feng.wu

This patch includes the following aspects:
- Add a global vector to wake up the blocked vCPU
  when an interrupt is being posted to it (This
  part was suggested by Yang Zhang <yang.z.zhang@intel.com>).
- Add a new per-vCPU tasklet to wake up the blocked
  vCPU. It can be used in the case vcpu_unblock
  cannot be called directly.
- Define two per-cpu variables:
      * pi_blocked_vcpu:
      A list storing the vCPUs which were blocked on this pCPU.

      * pi_blocked_vcpu_lock:
      The spinlock to protect pi_blocked_vcpu.
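
To make the wakeup path above concrete, here is a minimal, single-threaded
sketch in plain C. It is not the Xen code: struct sketch_vcpu, sketch_block()
and sketch_wakeup() are hypothetical names, the per-CPU spinlock is left out,
and the vCPU is unblocked directly rather than through a tasklet and the
scheduler. It only shows the idea of scanning the list of the pCPU that
received the wakeup vector and waking every vCPU whose 'ON' bit is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PCPUS 4

/* Hypothetical, simplified stand-in for a vCPU's posted-interrupt state. */
struct sketch_vcpu {
    bool on;                    /* models the descriptor's 'ON' bit */
    bool blocked;               /* models the scheduler's blocked state */
    struct sketch_vcpu *next;   /* link in the per-pCPU blocked list */
};

/* One singly linked list head per pCPU, modelling pi_blocked_vcpu. */
static struct sketch_vcpu *blocked_list[NR_PCPUS];

/* Record that v blocked and that its wakeup events target pCPU 'cpu'. */
static void sketch_block(struct sketch_vcpu *v, unsigned int cpu)
{
    v->blocked = true;
    v->next = blocked_list[cpu];
    blocked_list[cpu] = v;
}

/*
 * Models the wakeup handler: walk the list of the pCPU that took the
 * wakeup vector and unblock every vCPU with an outstanding notification.
 * Returns the number of vCPUs woken.
 */
static unsigned int sketch_wakeup(unsigned int cpu)
{
    unsigned int woken = 0;
    struct sketch_vcpu **pp = &blocked_list[cpu];

    while ( *pp != NULL )
    {
        struct sketch_vcpu *v = *pp;

        if ( v->on )
        {
            *pp = v->next;      /* unlink from the blocked list */
            v->next = NULL;
            v->blocked = false;
            woken++;
        }
        else
            pp = &v->next;
    }

    return woken;
}
```

The cost of sketch_wakeup() grows with the number of vCPUs blocked on one
pCPU, which is the latency concern noted in the FIXME of the real handler.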

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v3:
- This patch is generated by merging the following three patches in v2:
   [RFC v2 09/15] Add a new per-vCPU tasklet to wakeup the blocked vCPU
   [RFC v2 10/15] vmx: Define two per-cpu variables
   [RFC v2 11/15] vmx: Add a global wake-up vector for VT-d Posted-Interrupts
- rename 'vcpu_wakeup_tasklet' to 'pi_vcpu_wakeup_tasklet'
- Move the definition of 'pi_vcpu_wakeup_tasklet' to 'struct arch_vmx_struct'
- rename 'vcpu_wakeup_tasklet_handler' to 'pi_vcpu_wakeup_tasklet_handler'
- Make pi_wakeup_interrupt() static
- Rename 'blocked_vcpu_list' to 'pi_blocked_vcpu_list'
- move 'pi_blocked_vcpu_list' to 'struct arch_vmx_struct'
- Rename 'blocked_vcpu' to 'pi_blocked_vcpu'
- Rename 'blocked_vcpu_lock' to 'pi_blocked_vcpu_lock'

 xen/arch/x86/hvm/vmx/vmcs.c        |  3 +++
 xen/arch/x86/hvm/vmx/vmx.c         | 54 ++++++++++++++++++++++++++++++++++++++
 xen/include/asm-x86/hvm/hvm.h      |  1 +
 xen/include/asm-x86/hvm/vmx/vmcs.h |  5 ++++
 xen/include/asm-x86/hvm/vmx/vmx.h  |  5 ++++
 5 files changed, 68 insertions(+)

diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 11dc1b5..0c5ce3f 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -631,6 +631,9 @@ int vmx_cpu_up(void)
     if ( cpu_has_vmx_vpid )
         vpid_sync_all();
 
+    INIT_LIST_HEAD(&per_cpu(pi_blocked_vcpu, cpu));
+    spin_lock_init(&per_cpu(pi_blocked_vcpu_lock, cpu));
+
     return 0;
 }
 
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index b94ef6a..7db6009 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -82,7 +82,20 @@ static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content);
 static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content);
 static void vmx_invlpg_intercept(unsigned long vaddr);
 
+/*
+ * We maintain a per-CPU linked-list of vCPUs, so in PI wakeup handler we
+ * can find which vCPU should be woken up.
+ */
+DEFINE_PER_CPU(struct list_head, pi_blocked_vcpu);
+DEFINE_PER_CPU(spinlock_t, pi_blocked_vcpu_lock);
+
 uint8_t __read_mostly posted_intr_vector;
+uint8_t __read_mostly pi_wakeup_vector;
+
+static void pi_vcpu_wakeup_tasklet_handler(unsigned long arg)
+{
+    vcpu_unblock((struct vcpu *)arg);
+}
 
 static int vmx_domain_initialise(struct domain *d)
 {
@@ -148,11 +161,19 @@ static int vmx_vcpu_initialise(struct vcpu *v)
     if ( v->vcpu_id == 0 )
         v->arch.user_regs.eax = 1;
 
+    tasklet_init(
+        &v->arch.hvm_vmx.pi_vcpu_wakeup_tasklet,
+        pi_vcpu_wakeup_tasklet_handler,
+        (unsigned long)v);
+
+    INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
+
     return 0;
 }
 
 static void vmx_vcpu_destroy(struct vcpu *v)
 {
+    tasklet_kill(&v->arch.hvm_vmx.pi_vcpu_wakeup_tasklet);
     /*
      * There are cases that domain still remains in log-dirty mode when it is
      * about to be destroyed (ex, user types 'xl destroy <dom>'), in which case
@@ -1848,6 +1869,33 @@ static struct hvm_function_table __initdata vmx_function_table = {
     .enable_msr_exit_interception = vmx_enable_msr_exit_interception,
 };
 
+/*
+ * Handle VT-d posted-interrupt when VCPU is blocked.
+ */
+static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
+{
+    struct arch_vmx_struct *vmx;
+    unsigned int cpu = smp_processor_id();
+
+    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
+
+    /*
+     * FIXME: The length of the list depends on how many
+     * vCPUs are currently blocked on this specific pCPU.
+     * This may hurt the interrupt latency if the list
+     * grows to too many entries.
+     */
+    list_for_each_entry(vmx, &per_cpu(pi_blocked_vcpu, cpu),
+                        pi_blocked_vcpu_list)
+        if ( vmx->pi_desc.on )
+            tasklet_schedule(&vmx->pi_vcpu_wakeup_tasklet);
+
+    spin_unlock(&per_cpu(pi_blocked_vcpu_lock, cpu));
+
+    ack_APIC_irq();
+    this_cpu(irq_count)++;
+}
+
 const struct hvm_function_table * __init start_vmx(void)
 {
     set_in_cr4(X86_CR4_VMXE);
@@ -1884,11 +1932,17 @@ const struct hvm_function_table * __init start_vmx(void)
     }
 
     if ( cpu_has_vmx_posted_intr_processing )
+    {
         alloc_direct_apic_vector(&posted_intr_vector, event_check_interrupt);
+
+        if ( iommu_intpost )
+            alloc_direct_apic_vector(&pi_wakeup_vector, pi_wakeup_interrupt);
+    }
     else
     {
         vmx_function_table.deliver_posted_intr = NULL;
         vmx_function_table.sync_pir_to_irr = NULL;
+        vmx_function_table.pi_desc_update = NULL;
     }
 
     if ( cpu_has_vmx_ept
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index 77eeac5..e621c30 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -195,6 +195,7 @@ struct hvm_function_table {
     void (*deliver_posted_intr)(struct vcpu *v, u8 vector);
     void (*sync_pir_to_irr)(struct vcpu *v);
     void (*handle_eoi)(u8 vector);
+    void (*pi_desc_update)(struct vcpu *v, int old_state);
 
     /*Walk nested p2m  */
     int (*nhvm_hap_walk_L1_p2m)(struct vcpu *v, paddr_t L2_gpa,
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index dedfaef..b6b34d1 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -162,6 +162,11 @@ struct arch_vmx_struct {
     struct page_info     *vmwrite_bitmap;
 
     struct page_info     *pml_pg;
+
+    /* Tasklet for pi_wakeup_blocked_vcpu() */
+    struct tasklet       pi_vcpu_wakeup_tasklet;
+
+    struct list_head     pi_blocked_vcpu_list;
 };
 
 int vmx_create_vmcs(struct vcpu *v);
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 5853563..663af33 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -29,6 +29,11 @@
 #include <asm/hvm/trace.h>
 #include <asm/hvm/vmx/vmcs.h>
 
+DECLARE_PER_CPU(struct list_head, pi_blocked_vcpu);
+DECLARE_PER_CPU(spinlock_t, pi_blocked_vcpu_lock);
+
+extern uint8_t pi_wakeup_vector;
+
 typedef union {
     struct {
         u64 r       :   1,  /* bit 0 - Read permission */
-- 
2.1.0


* [v3 13/15] vmx: Properly handle notification event when vCPU is running
  2015-06-24  5:18 [v3 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (11 preceding siblings ...)
  2015-06-24  5:18 ` [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked Feng Wu
@ 2015-06-24  5:18 ` Feng Wu
  2015-07-08 11:03   ` Tian, Kevin
  2015-07-10 14:40   ` Jan Beulich
  2015-06-24  5:18 ` [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling Feng Wu
  2015-06-24  5:18 ` [v3 15/15] Add a command line parameter for VT-d posted-interrupts Feng Wu
  14 siblings, 2 replies; 155+ messages in thread
From: Feng Wu @ 2015-06-24  5:18 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, feng.wu

When a vCPU is running in root mode and a notification event
has been injected to it, we need to set VCPU_KICK_SOFTIRQ for
the current cpu, so the pending interrupt in PIRR will be
synced to vIRR in time, before the next VM-entry.
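
The race described here can be modelled with a small, self-contained C
sketch. This is an illustration only, not the code in entry.S: struct
sketch_cpu, sketch_sync_pir_to_irr() and sketch_vmentry() are hypothetical
names, and the PIR and vIRR are reduced to 32-bit masks. It shows that an
event arriving after the sync but before the softirq check is only seen by
the guest on this entry if the handler raises the softirq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified per-pCPU state; not a Xen structure. */
struct sketch_cpu {
    bool     kick_pending;  /* models VCPU_KICK_SOFTIRQ being pending */
    uint32_t pir;           /* posted vectors not yet synced */
    uint32_t virr;          /* vectors the guest will see after VM entry */
};

/* Models the sync step done on the way to VM entry. */
static void sketch_sync_pir_to_irr(struct sketch_cpu *c)
{
    c->virr |= c->pir;
    c->pir = 0;
}

/*
 * Models the VM-entry path. If late_bit >= 0 (and < 32), a posted interrupt
 * for that bit arrives after the sync but before the softirq check.
 * raise_kick selects whether the notification handler raises the softirq.
 * Returns the number of passes through the entry loop.
 */
static unsigned int sketch_vmentry(struct sketch_cpu *c, int late_bit,
                                   bool raise_kick)
{
    unsigned int passes = 0;

    for ( ; ; )
    {
        passes++;
        sketch_sync_pir_to_irr(c);

        if ( late_bit >= 0 )                /* the late notification */
        {
            c->pir |= UINT32_C(1) << late_bit;
            if ( raise_kick )
                c->kick_pending = true;
            late_bit = -1;                  /* the event fires only once */
        }

        if ( !c->kick_pending )             /* softirq check before entry */
            break;                          /* proceed to VM entry */

        c->kick_pending = false;            /* process softirqs, then retry */
    }

    return passes;
}
```

With raise_kick set, the loop runs a second pass and the late vector reaches
vIRR before the guest is entered; without it, the vector stays in PIR.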

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v3:
- Make pi_notification_interrupt() static

 xen/arch/x86/hvm/vmx/vmx.c | 55 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 7db6009..5795afd 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -1896,6 +1896,59 @@ static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
     this_cpu(irq_count)++;
 }
 
+/*
+ * Handle VT-d posted-interrupt when VCPU is running.
+ */
+
+static void pi_notification_interrupt(struct cpu_user_regs *regs)
+{
+    /*
+     * We get here when a vCPU is running in root-mode
+     * (such as via hypercall, or any other reasons which
+     * can result in VM-Exit), and before vCPU is back to
+     * non-root, external interrupts from an assigned
+     * device happen and a notification event is delivered
+     * to this logical CPU.
+     *
+     * We need to set VCPU_KICK_SOFTIRQ for the current
+     * cpu, just like __vmx_deliver_posted_interrupt().
+     *
+     * So the pending interrupt in PIRR will be synced to
+     * vIRR in time, before the next VM-entry.
+     *
+     * Please refer to the following code fragments from
+     * xen/arch/x86/hvm/vmx/entry.S:
+     *
+     * .Lvmx_do_vmentry
+     *
+     *  ......
+     *  point 1
+     *
+     *  cmp  %ecx,(%rdx,%rax,1)
+     *  jnz  .Lvmx_process_softirqs
+     *
+     *  ......
+     *
+     *  je   .Lvmx_launch
+     *
+     *  ......
+     *
+     * .Lvmx_process_softirqs:
+     *  sti
+     *  call do_softirq
+     *  jmp  .Lvmx_do_vmentry
+     *
+     *  If VT-d engine issues a notification event at
+     *  point 1 above, it cannot be delivered to the guest
+     *  during this VM-entry without raising the softirq
+     *  in this notification handler.
+     */
+    raise_softirq(VCPU_KICK_SOFTIRQ);
+
+    ack_APIC_irq();
+    this_cpu(irq_count)++;
+}
+
 const struct hvm_function_table * __init start_vmx(void)
 {
     set_in_cr4(X86_CR4_VMXE);
@@ -1933,7 +1986,7 @@ const struct hvm_function_table * __init start_vmx(void)
 
     if ( cpu_has_vmx_posted_intr_processing )
     {
-        alloc_direct_apic_vector(&posted_intr_vector, event_check_interrupt);
+        alloc_direct_apic_vector(&posted_intr_vector, pi_notification_interrupt);
 
         if ( iommu_intpost )
             alloc_direct_apic_vector(&pi_wakeup_vector, pi_wakeup_interrupt);
-- 
2.1.0


* [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-06-24  5:18 [v3 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (12 preceding siblings ...)
  2015-06-24  5:18 ` [v3 13/15] vmx: Properly handle notification event when vCPU is running Feng Wu
@ 2015-06-24  5:18 ` Feng Wu
       [not found]   ` <55918214.4030102@citrix.com>
                     ` (2 more replies)
  2015-06-24  5:18 ` [v3 15/15] Add a command line parameter for VT-d posted-interrupts Feng Wu
  14 siblings, 3 replies; 155+ messages in thread
From: Feng Wu @ 2015-06-24  5:18 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, feng.wu

The basic idea here is:
1. When vCPU's state is RUNSTATE_running,
        - set 'NV' to 'Notification Vector'.
        - Clear 'SN' to accept PI.
        - set 'NDST' to the right pCPU.
2. When vCPU's state is RUNSTATE_blocked,
        - set 'NV' to 'Wake-up Vector', so we can wake up the
          related vCPU when posted-interrupt happens for it.
        - Clear 'SN' to accept PI.
3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
        - Set 'SN' to suppress non-urgent interrupts.
          (Currently, we only support non-urgent interrupts)
        - Set 'NV' back to 'Notification Vector' if needed.
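
The three cases above can be sketched as a small state function over the
64-bit control word of the descriptor. This is a hedged illustration using
C11 atomics, not the patch's implementation: the bit positions (ON at bit 0,
SN at bit 1, NV at bits 23:16, NDST at bits 63:32 of the control word) are
assumed from the VT-d specification, the vector numbers are made up, and
sketch_pi_update() and the SK_* names are hypothetical. For the blocked case
it also writes NDST and refuses to block when 'ON' is already set.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout of the 64-bit control word (see lead-in). */
#define PI_ON          (UINT64_C(1) << 0)
#define PI_SN          (UINT64_C(1) << 1)
#define PI_NV_SHIFT    16
#define PI_NV_MASK     (UINT64_C(0xff) << PI_NV_SHIFT)
#define PI_NDST_SHIFT  32
#define PI_NDST_MASK   (UINT64_C(0xffffffff) << PI_NDST_SHIFT)

/* Made-up vector numbers, for illustration only. */
#define SK_NOTIFY_VECTOR 0xf2
#define SK_WAKEUP_VECTOR 0xf1

enum sketch_runstate { SK_RUNNING, SK_RUNNABLE, SK_BLOCKED, SK_OFFLINE };

static uint64_t set_field(uint64_t ctl, uint64_t mask, unsigned int shift,
                          uint64_t val)
{
    return (ctl & ~mask) | ((val << shift) & mask);
}

/*
 * Update the control word for a transition into new_state. dest is the
 * APIC ID that should receive notifications. Returns false when the vCPU
 * must not block because an interrupt was already posted ('ON' set).
 */
static bool sketch_pi_update(_Atomic uint64_t *control,
                             enum sketch_runstate new_state, uint32_t dest)
{
    uint64_t old = atomic_load(control), new;

    do {
        new = old;

        switch ( new_state )
        {
        case SK_RUNNING:
            new = set_field(new, PI_NV_MASK, PI_NV_SHIFT, SK_NOTIFY_VECTOR);
            new = set_field(new, PI_NDST_MASK, PI_NDST_SHIFT, dest);
            new &= ~PI_SN;
            break;

        case SK_BLOCKED:
            if ( old & PI_ON )
                return false;
            new = set_field(new, PI_NV_MASK, PI_NV_SHIFT, SK_WAKEUP_VECTOR);
            new = set_field(new, PI_NDST_MASK, PI_NDST_SHIFT, dest);
            new &= ~PI_SN;
            break;

        case SK_RUNNABLE:
        case SK_OFFLINE:
            new |= PI_SN;
            new = set_field(new, PI_NV_MASK, PI_NV_SHIFT, SK_NOTIFY_VECTOR);
            break;
        }
        /* On failure 'old' is refreshed and the update is recomputed. */
    } while ( !atomic_compare_exchange_weak(control, &old, new) );

    return true;
}
```

Doing the whole update in one compare-and-exchange loop means hardware can
never observe a half-updated combination of 'NV', 'NDST' and 'SN'.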

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v3:
* Use write_atomic() to update 'NV' and 'NDST' fields.
* Use MASK_INSR() to get the value for 'NDST' field
* Add ASSERT_UNREACHABLE() for the break case in vmx_pi_desc_update()
* Remove pointless NULL assignment to 'vmx_function_table.pi_desc_update'
* Call hvm_funcs.pi_desc_update() in arch-specific files
* coding style

 xen/arch/x86/hvm/hvm.c             |   6 ++
 xen/arch/x86/hvm/vmx/vmx.c         | 122 +++++++++++++++++++++++++++++++++++++
 xen/common/schedule.c              |   4 ++
 xen/include/asm-arm/domain.h       |   2 +
 xen/include/asm-x86/hvm/hvm.h      |   2 +
 xen/include/asm-x86/hvm/vmx/vmcs.h |   7 +++
 xen/include/asm-x86/hvm/vmx/vmx.h  |  11 ++++
 7 files changed, 154 insertions(+)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 2736802..64ce381 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -6475,6 +6475,12 @@ enum hvm_intblk nhvm_interrupt_blocked(struct vcpu *v)
     return hvm_funcs.nhvm_intr_blocked(v);
 }
 
+void arch_pi_desc_update(struct vcpu *v, int old_state)
+{
+    if ( is_hvm_vcpu(v) && hvm_funcs.pi_desc_update )
+        hvm_funcs.pi_desc_update(v, old_state);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 5795afd..cf4f292 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -168,6 +168,7 @@ static int vmx_vcpu_initialise(struct vcpu *v)
 
     INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
 
+    v->arch.hvm_vmx.pi_block_cpu = -1;
     return 0;
 }
 
@@ -1778,6 +1779,124 @@ static void vmx_handle_eoi(u8 vector)
     __vmwrite(GUEST_INTR_STATUS, status);
 }
 
+static void vmx_pi_desc_update(struct vcpu *v, int old_state)
+{
+    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+    struct pi_desc old, new;
+    unsigned long flags;
+
+    ASSERT(iommu_intpost);
+
+    switch ( v->runstate.state )
+    {
+    case RUNSTATE_runnable:
+    case RUNSTATE_offline:
+        /*
+         * We don't need to send notification event to a non-running
+         * vcpu, the interrupt information will be delivered to it before
+         * VM-ENTRY when the vcpu is scheduled to run next time.
+         */
+        pi_set_sn(pi_desc);
+
+        /*
+         * If the state is transferred from RUNSTATE_blocked,
+         * we should set 'NV' field back to posted_intr_vector,
+         * so the Posted-Interrupts can be delivered to the vCPU
+         * by VT-d HW after it is scheduled to run.
+         */
+        if ( old_state == RUNSTATE_blocked )
+        {
+            write_atomic((uint8_t *)&pi_desc->nv, posted_intr_vector);
+
+            /*
+             * Delete the vCPU from the related block list
+             * if we are resuming from blocked state
+             */
+            ASSERT(v->arch.hvm_vmx.pi_block_cpu != -1);
+            spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+                              v->arch.hvm_vmx.pi_block_cpu), flags);
+            list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
+            spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
+                                   v->arch.hvm_vmx.pi_block_cpu), flags);
+            v->arch.hvm_vmx.pi_block_cpu = -1;
+        }
+        break;
+
+    case RUNSTATE_blocked:
+        ASSERT(v->arch.hvm_vmx.pi_block_cpu == -1);
+
+        /*
+         * The vCPU is blocked on the block list. Add the blocked
+         * vCPU on the list of the v->arch.hvm_vmx.pi_block_cpu,
+         * which is the destination of the wake-up notification event.
+         */
+        v->arch.hvm_vmx.pi_block_cpu = v->processor;
+        spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+                          v->arch.hvm_vmx.pi_block_cpu), flags);
+        list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
+                      &per_cpu(pi_blocked_vcpu, v->arch.hvm_vmx.pi_block_cpu));
+        spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
+                               v->arch.hvm_vmx.pi_block_cpu), flags);
+
+        do {
+            old.control = new.control = pi_desc->control;
+
+            /*
+             * We should not block the vCPU if
+             * an interrupt was posted for it.
+             */
+
+            if ( old.on )
+            {
+                /*
+                 * The vCPU will be removed from the block list
+                 * during its state transferring from RUNSTATE_blocked
+                 * to RUNSTATE_runnable after the following tasklet
+                 * is executed.
+                 */
+                tasklet_schedule(&v->arch.hvm_vmx.pi_vcpu_wakeup_tasklet);
+                return;
+            }
+
+            /*
+             * Change the 'NDST' field to v->arch.hvm_vmx.pi_block_cpu,
+             * so when external interrupts from assigned devices happen,
+             * wakeup notification event will go to
+             * v->arch.hvm_vmx.pi_block_cpu, then in pi_wakeup_interrupt()
+             * we can find the vCPU in the right list to wake up.
+             */
+            if ( x2apic_enabled )
+                new.ndst = cpu_physical_id(v->arch.hvm_vmx.pi_block_cpu);
+            else
+                new.ndst = MASK_INSR(cpu_physical_id(
+                                     v->arch.hvm_vmx.pi_block_cpu),
+                                     PI_xAPIC_NDST_MASK);
+            new.sn = 0;
+            new.nv = pi_wakeup_vector;
+        } while ( cmpxchg(&pi_desc->control, old.control, new.control)
+                  != old.control );
+        break;
+
+    case RUNSTATE_running:
+        ASSERT( pi_desc->sn == 1 );
+
+        if ( x2apic_enabled )
+            write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
+        else
+            write_atomic(&pi_desc->ndst,
+                         MASK_INSR(cpu_physical_id(v->processor),
+                         PI_xAPIC_NDST_MASK));
+
+        pi_clear_sn(pi_desc);
+
+        break;
+
+    default:
+        ASSERT_UNREACHABLE();
+        break;
+    }
+}
+
 void vmx_hypervisor_cpuid_leaf(uint32_t sub_idx,
                                uint32_t *eax, uint32_t *ebx,
                                uint32_t *ecx, uint32_t *edx)
@@ -1989,7 +2108,10 @@ const struct hvm_function_table * __init start_vmx(void)
         alloc_direct_apic_vector(&posted_intr_vector, pi_notification_interrupt);
 
         if ( iommu_intpost )
+        {
             alloc_direct_apic_vector(&pi_wakeup_vector, pi_wakeup_interrupt);
+            vmx_function_table.pi_desc_update = vmx_pi_desc_update;
+        }
     }
     else
     {
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 6b02f98..20727d6 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -142,6 +142,7 @@ static inline void vcpu_runstate_change(
     struct vcpu *v, int new_state, s_time_t new_entry_time)
 {
     s_time_t delta;
+    int old_state;
 
     ASSERT(v->runstate.state != new_state);
     ASSERT(spin_is_locked(per_cpu(schedule_data,v->processor).schedule_lock));
@@ -157,7 +158,10 @@ static inline void vcpu_runstate_change(
         v->runstate.state_entry_time = new_entry_time;
     }
 
+    old_state = v->runstate.state;
     v->runstate.state = new_state;
+
+    arch_pi_desc_update(v, old_state);
 }
 
 void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate)
diff --git a/xen/include/asm-arm/domain.h b/xen/include/asm-arm/domain.h
index f1a087e..9603cf0 100644
--- a/xen/include/asm-arm/domain.h
+++ b/xen/include/asm-arm/domain.h
@@ -265,6 +265,8 @@ static inline unsigned int domain_max_vcpus(const struct domain *d)
     return MAX_VIRT_CPUS;
 }
 
+static inline void arch_pi_desc_update(struct vcpu *v, int old_state) {}
+
 #endif /* __ASM_DOMAIN_H__ */
 
 /*
diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index e621c30..e175417 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -510,6 +510,8 @@ bool_t nhvm_vmcx_hap_enabled(struct vcpu *v);
 /* interrupt */
 enum hvm_intblk nhvm_interrupt_blocked(struct vcpu *v);
 
+void arch_pi_desc_update(struct vcpu *v, int old_state);
+
 #ifndef NDEBUG
 /* Permit use of the Forced Emulation Prefix in HVM guests */
 extern bool_t opt_hvm_fep;
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index b6b34d1..ea8fbe5 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -167,6 +167,13 @@ struct arch_vmx_struct {
     struct tasklet       pi_vcpu_wakeup_tasklet;
 
     struct list_head     pi_blocked_vcpu_list;
+
+    /*
+     * Before vCPU is blocked, it is added to the global per-cpu list
+     * of 'pi_block_cpu', then VT-d engine can send wakeup notification
+     * event to 'pi_block_cpu' and wake up the related vCPU.
+     */
+    int                  pi_block_cpu;
 };
 
 int vmx_create_vmcs(struct vcpu *v);
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 663af33..ea02e21 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -108,6 +108,7 @@ void vmx_update_cpu_exec_control(struct vcpu *v);
 void vmx_update_secondary_exec_control(struct vcpu *v);
 
 #define POSTED_INTR_ON  0
+#define POSTED_INTR_SN  1
 static inline int pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
 {
     return test_and_set_bit(vector, pi_desc->pir);
@@ -133,6 +134,16 @@ static inline unsigned long pi_get_pir(struct pi_desc *pi_desc, int group)
     return xchg(&pi_desc->pir[group], 0);
 }
 
+static inline void pi_set_sn(struct pi_desc *pi_desc)
+{
+    set_bit(POSTED_INTR_SN, &pi_desc->control);
+}
+
+static inline void pi_clear_sn(struct pi_desc *pi_desc)
+{
+    clear_bit(POSTED_INTR_SN, &pi_desc->control);
+}
+
 /*
  * Exit Reasons
  */
-- 
2.1.0


* [v3 15/15] Add a command line parameter for VT-d posted-interrupts
  2015-06-24  5:18 [v3 00/15] Add VT-d Posted-Interrupts support Feng Wu
                   ` (13 preceding siblings ...)
  2015-06-24  5:18 ` [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling Feng Wu
@ 2015-06-24  5:18 ` Feng Wu
  2015-07-08 11:25   ` Tian, Kevin
  14 siblings, 1 reply; 155+ messages in thread
From: Feng Wu @ 2015-06-24  5:18 UTC (permalink / raw)
  To: xen-devel
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, jbeulich,
	yang.z.zhang, feng.wu

Enable VT-d Posted-Interrupts and add a command line
parameter for it.
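
As a rough illustration of how such a sub-option behaves, the following
self-contained C sketch parses a comma-separated "iommu=" value with an
optional "no-" prefix. It is a simplified model, not parse_iommu_param()
itself: struct sketch_iommu_opts and sketch_parse_iommu() are hypothetical
names, only the intremap and intpost sub-options are handled, and the
defaults and the "no intremap then no intpost" rule are taken from this
series.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical container for the two options modelled here. */
struct sketch_iommu_opts {
    bool intremap;
    bool intpost;
};

/* Parse e.g. "verbose,no-intpost"; unknown sub-options are ignored. */
static void sketch_parse_iommu(const char *arg, struct sketch_iommu_opts *o)
{
    char buf[128];
    char *s, *ss;

    /* Defaults as of this patch: both features on. */
    o->intremap = true;
    o->intpost = true;

    strncpy(buf, arg, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    s = buf;

    do {
        bool val;

        ss = strchr(s, ',');
        if ( ss )
            *ss = '\0';

        /* A leading "no-" negates the sub-option. */
        val = strncmp(s, "no-", 3) != 0;
        if ( !val )
            s += 3;

        if ( !strcmp(s, "intremap") )
            o->intremap = val;
        else if ( !strcmp(s, "intpost") )
            o->intpost = val;

        if ( ss )
            s = ss + 1;
    } while ( ss );

    /* Interrupt posting depends on interrupt remapping. */
    if ( !o->intremap )
        o->intpost = false;
}
```

Under this model, booting with "iommu=no-intpost" disables posting only,
while "iommu=no-intremap" disables both.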

Signed-off-by: Feng Wu <feng.wu@intel.com>
---
v3:
Remove the redundant "no intremap then no intpost" logic

 docs/misc/xen-command-line.markdown | 9 ++++++++-
 xen/drivers/passthrough/iommu.c     | 4 +++-
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index aa684c0..f8ec15f 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -848,7 +848,7 @@ debug hypervisor only).
 > Default: `new` unless directed-EOI is supported
 
 ### iommu
-> `= List of [ <boolean> | force | required | intremap | qinval | snoop | sharept | dom0-passthrough | dom0-strict | amd-iommu-perdev-intremap | workaround_bios_bug | verbose | debug ]`
+> `= List of [ <boolean> | force | required | intremap | intpost | qinval | snoop | sharept | dom0-passthrough | dom0-strict | amd-iommu-perdev-intremap | workaround_bios_bug | verbose | debug ]`
 
 > Sub-options:
 
@@ -875,6 +875,13 @@ debug hypervisor only).
 >> Control the use of interrupt remapping (DMA remapping will always be enabled
 >> if IOMMU functionality is enabled).
 
+> `intpost`
+
+> Default: `true`
+
+>> Control the use of interrupt posting. Interrupt posting is dependent on
+>> interrupt remapping.
+
 > `qinval` (VT-d)
 
 > Default: `true`
diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
index 597f676..e13251c 100644
--- a/xen/drivers/passthrough/iommu.c
+++ b/xen/drivers/passthrough/iommu.c
@@ -52,7 +52,7 @@ bool_t __read_mostly iommu_passthrough;
 bool_t __read_mostly iommu_snoop = 1;
 bool_t __read_mostly iommu_qinval = 1;
 bool_t __read_mostly iommu_intremap = 1;
-bool_t __read_mostly iommu_intpost;
+bool_t __read_mostly iommu_intpost = 1;
 bool_t __read_mostly iommu_hap_pt_share = 1;
 bool_t __read_mostly iommu_debug;
 bool_t __read_mostly amd_iommu_perdev_intremap = 1;
@@ -97,6 +97,8 @@ static void __init parse_iommu_param(char *s)
             iommu_qinval = val;
         else if ( !strcmp(s, "intremap") )
             iommu_intremap = val;
+        else if ( !strcmp(s, "intpost") )
+            iommu_intpost = val;
         else if ( !strcmp(s, "debug") )
         {
             iommu_debug = val;
-- 
2.1.0


* Re: [v3 01/15] Vt-d Posted-intterrupt (PI) design
  2015-06-24  5:18 ` [v3 01/15] Vt-d Posted-intterrupt (PI) design Feng Wu
@ 2015-06-24  6:15   ` Meng Xu
  2015-06-24  6:19     ` Wu, Feng
  2015-07-08  7:21   ` Tian, Kevin
  1 sibling, 1 reply; 155+ messages in thread
From: Meng Xu @ 2015-06-24  6:15 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel@lists.xen.org, Jan Beulich, yang.z.zhang@intel.com

Hi Feng,

One minor thing:

> +Important Definitions
> +==================
> +There are some changes to IRTE and posted-interrupt descriptor after
> +VT-d PI is introduced:
> +IRTE:
It seems that you forgot to define IRTE. :-)

I guess it stands for Interrupt Remapping Table Entry? (Probably I'm wrong. :-))

Thanks,

Meng


-----------
Meng Xu
PhD Student in Computer and Information Science
University of Pennsylvania
http://www.cis.upenn.edu/~mengxu/


* Re: [v3 01/15] Vt-d Posted-intterrupt (PI) design
  2015-06-24  6:15   ` Meng Xu
@ 2015-06-24  6:19     ` Wu, Feng
  0 siblings, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-06-24  6:19 UTC (permalink / raw)
  To: Meng Xu
  Cc: Tian, Kevin, Keir Fraser, George Dunlap, Andrew Cooper,
	xen-devel@lists.xen.org, Jan Beulich, Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Meng Xu [mailto:xumengpanda@gmail.com]
> Sent: Wednesday, June 24, 2015 2:16 PM
> To: Wu, Feng
> Cc: xen-devel@lists.xen.org; Tian, Kevin; Keir Fraser; George Dunlap; Andrew
> Cooper; Jan Beulich; Zhang, Yang Z
> Subject: Re: [Xen-devel] [v3 01/15] Vt-d Posted-intterrupt (PI) design
> 
> Hi Feng,
> 
> One minor thing:
> 
> > +Important Definitions
> > +==================
> > +There are some changes to IRTE and posted-interrupt descriptor after
> > +VT-d PI is introduced:
> > +IRTE:
> It seems that you forgot to define IRTE. :-)
> 
> I guess it stands for Interrupt Remapping Table Entry? (Probably I'm wrong. :-))

Yes, you're right. I will add this in the next version. Thanks for pointing it out!

Thanks,
Feng

> 
> Thanks,
> 
> Meng
> 
> 
> -----------
> Meng Xu
> PhD Student in Computer and Information Science
> University of Pennsylvania
> http://www.cis.upenn.edu/~mengxu/


* Re: [v3 02/15] Add helper macro for X86_FEATURE_CX16 feature detection
  2015-06-24  5:18 ` [v3 02/15] Add helper macro for X86_FEATURE_CX16 feature detection Feng Wu
@ 2015-06-24 17:31   ` Andrew Cooper
  2015-07-08  7:23   ` Tian, Kevin
  1 sibling, 0 replies; 155+ messages in thread
From: Andrew Cooper @ 2015-06-24 17:31 UTC (permalink / raw)
  To: Feng Wu, xen-devel
  Cc: george.dunlap, yang.z.zhang, kevin.tian, keir, jbeulich

On 24/06/15 06:18, Feng Wu wrote:
> Add macro cpu_has_cx16 to detect X86_FEATURE_CX16 feature.
>
> Signed-off-by: Feng Wu <feng.wu@intel.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>


* Re: [v3 03/15] Add cmpxchg16b support for x86-64
  2015-06-24  5:18 ` [v3 03/15] Add cmpxchg16b support for x86-64 Feng Wu
@ 2015-06-24 18:35   ` Andrew Cooper
  2015-07-08  7:06     ` Wu, Feng
  2015-07-10 12:57   ` Jan Beulich
  1 sibling, 1 reply; 155+ messages in thread
From: Andrew Cooper @ 2015-06-24 18:35 UTC (permalink / raw)
  To: Feng Wu, xen-devel
  Cc: george.dunlap, yang.z.zhang, kevin.tian, keir, jbeulich

On 24/06/15 06:18, Feng Wu wrote:
> This patch adds cmpxchg16b support for x86-64, so software
> can perform 128-bit atomic write/read.
>
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
> v3:
> Newly added.
>
>  xen/include/asm-x86/x86_64/system.h | 28 ++++++++++++++++++++++++++++
>  xen/include/xen/types.h             |  5 +++++
>  2 files changed, 33 insertions(+)
>
> diff --git a/xen/include/asm-x86/x86_64/system.h b/xen/include/asm-x86/x86_64/system.h
> index 662813a..a910d00 100644
> --- a/xen/include/asm-x86/x86_64/system.h
> +++ b/xen/include/asm-x86/x86_64/system.h
> @@ -6,6 +6,34 @@
>                                     (unsigned long)(n),sizeof(*(ptr))))
>  
>  /*
> + * Atomic 16 bytes compare and exchange.  Compare OLD with MEM, if
> + * identical, store NEW in MEM.  Return the initial value in MEM.
> + * Success is indicated by comparing RETURN with OLD.
> + *
> + * This function can only be called when cpu_has_cx16 is true.
> + */
> +
> +static always_inline uint128_t __cmpxchg16b(
> +    volatile void *ptr, uint128_t old, uint128_t new)

It is not nice for register scheduling taking uint128_t's by value. 
Instead, I would pass them by pointer and let the inlining sort the
eventual references out.

> +{
> +    uint128_t prev;
> +
> +    ASSERT(cpu_has_cx16);

Given that if this assertion were to fail, cmpxchg16b would fail with
#UD, I would hand-code a asm_fixup section which in turn panics.  This
avoids a situation where non-debug builds could die with an unqualified
#UD exception.

Also, you must enforce 16-byte alignment of the memory reference, as
described in the manual.
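
A possible shape for the by-pointer variant with an alignment check is
sketched below. It is only an illustration of the two suggestions, not a
proposed replacement: sketch_u128 and sketch_cmpxchg16b() are hypothetical
names, the #UD fixup is not modelled, a plain assert() stands in for Xen's
ASSERT(), and on non-x86-64 builds a non-atomic fallback is used purely so
the example compiles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 128-bit type; the attribute gives it 16-byte alignment. */
typedef struct {
    uint64_t low;
    uint64_t high;
} __attribute__((aligned(16))) sketch_u128;

/*
 * Compare *ptr with *old. If they match, store *new in *ptr and return
 * true. Otherwise copy the value found in *ptr into *old and return false.
 */
static inline bool sketch_cmpxchg16b(volatile sketch_u128 *ptr,
                                     sketch_u128 *old,
                                     const sketch_u128 *new)
{
    /* cmpxchg16b requires a 16-byte aligned memory operand. */
    assert(((uintptr_t)ptr & 15) == 0);

#if defined(__x86_64__)
    unsigned char ok;

    /* RDX:RAX holds the expected value, RCX:RBX the replacement. */
    __asm__ __volatile__ ( "lock; cmpxchg16b %1\n\t"
                           "sete %0"
                           : "=q" (ok), "+m" (*ptr),
                             "+a" (old->low), "+d" (old->high)
                           : "b" (new->low), "c" (new->high)
                           : "memory", "cc" );
    return ok;
#else
    /* Non-atomic fallback, for illustration on other architectures only. */
    if ( ptr->low == old->low && ptr->high == old->high )
    {
        ptr->low = new->low;
        ptr->high = new->high;
        return true;
    }
    old->low = ptr->low;
    old->high = ptr->high;
    return false;
#endif
}
```

Passing pointers lets the compiler load the halves straight into the
registers the instruction needs once the function is inlined.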

~Andrew

> +
> +    asm volatile ( "lock; cmpxchg16b %4"
> +                   : "=d" (prev.high), "=a" (prev.low)
> +                   : "c" (new.high), "b" (new.low),
> +                   "m" (*__xg((volatile void *)ptr)),
> +                   "0" (old.high), "1" (old.low)
> +                   : "memory" );
> +
> +    return prev;
> +}
> +
> +#define cmpxchg16b(ptr,o,n)                                             \
> +    __cmpxchg16b((ptr), *(uint128_t *)(o), *(uint128_t *)(n))
> +
> +/*
>   * This function causes value _o to be changed to _n at location _p.
>   * If this access causes a fault then we return 1, otherwise we return 0.
>   * If no fault occurs then _o is updated to the value we saw at _p. If this
> diff --git a/xen/include/xen/types.h b/xen/include/xen/types.h
> index 8596ded..30f8a44 100644
> --- a/xen/include/xen/types.h
> +++ b/xen/include/xen/types.h
> @@ -47,6 +47,11 @@ typedef         __u64           uint64_t;
>  typedef         __u64           u_int64_t;
>  typedef         __s64           int64_t;
>  
> +typedef struct {
> +        uint64_t low;
> +        uint64_t high;
> +} uint128_t;
> +
>  struct domain;
>  struct vcpu;
>  


* Re: [v3 04/15] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature
  2015-06-24  5:18 ` [v3 04/15] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature Feng Wu
@ 2015-06-25  9:06   ` Andrew Cooper
  2015-06-25  9:47     ` Wu, Feng
  2015-07-08  7:30   ` Tian, Kevin
  1 sibling, 1 reply; 155+ messages in thread
From: Andrew Cooper @ 2015-06-25  9:06 UTC (permalink / raw)
  To: Feng Wu, xen-devel
  Cc: george.dunlap, yang.z.zhang, kevin.tian, keir, jbeulich

On 24/06/15 06:18, Feng Wu wrote:
> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> With VT-d Posted-Interrupts enabled, external interrupts from
> direct-assigned devices can be delivered to guests without VMM
> intervention when guest is running in non-root mode.
>
> This patch adds the variable 'iommu_intpost' to control whether VT-d
> posted interrupts are enabled in the generic IOMMU code.
>
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
> v3:
> - Remove pointless initializer for 'iommu_intpost'.
> - Some adjustment for "if no intremap then no intpost" logic.
>     * For parse_iommu_param(), move it to the end of the function,
>       so we don't need to add the same logic when introducing the
>       new kernel parameter 'intpost' in later patch.
>     * Add this logic in iommu_setup() after iommu_hardware_setup()
>       is called.
>
>  xen/drivers/passthrough/iommu.c | 10 +++++++++-
>  xen/include/xen/iommu.h         |  2 +-
>  2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
> index 06cb38f..597f676 100644
> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -39,6 +39,7 @@ static void iommu_dump_p2m_table(unsigned char key);
>   *   no-snoop                   Disable VT-d Snoop Control
>   *   no-qinval                  Disable VT-d Queued Invalidation
>   *   no-intremap                Disable VT-d Interrupt Remapping
> + *   no-intpost                 Disable VT-d Interrupt posting
>   */
>  custom_param("iommu", parse_iommu_param);
>  bool_t __initdata iommu_enable = 1;
> @@ -51,6 +52,7 @@ bool_t __read_mostly iommu_passthrough;
>  bool_t __read_mostly iommu_snoop = 1;
>  bool_t __read_mostly iommu_qinval = 1;
>  bool_t __read_mostly iommu_intremap = 1;
> +bool_t __read_mostly iommu_intpost;
>  bool_t __read_mostly iommu_hap_pt_share = 1;
>  bool_t __read_mostly iommu_debug;
>  bool_t __read_mostly amd_iommu_perdev_intremap = 1;
> @@ -112,6 +114,9 @@ static void __init parse_iommu_param(char *s)
>  
>          s = ss + 1;
>      } while ( ss );
> +
> +    if ( !iommu_intremap )
> +        iommu_intpost = 0;

This check is redundant - It will unconditionally be performed by
iommu_setup().  I would just drop the hunk.

However, what you are missing is parsing logic to catch a command line
configuration such as "iommu=intremap,no-intpost"

~Andrew

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 04/15] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature
  2015-06-25  9:06   ` Andrew Cooper
@ 2015-06-25  9:47     ` Wu, Feng
  2015-06-25 10:16       ` Andrew Cooper
  0 siblings, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-06-25  9:47 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel@lists.xen.org
  Cc: Tian, Kevin, Wu, Feng, george.dunlap@eu.citrix.com,
	jbeulich@suse.com, Zhang, Yang Z, keir@xen.org



> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Thursday, June 25, 2015 5:06 PM
> To: Wu, Feng; xen-devel@lists.xen.org
> Cc: Tian, Kevin; keir@xen.org; george.dunlap@eu.citrix.com;
> jbeulich@suse.com; Zhang, Yang Z
> Subject: Re: [Xen-devel] [v3 04/15] iommu: Add iommu_intpost to control VT-d
> Posted-Interrupts feature
> 
> On 24/06/15 06:18, Feng Wu wrote:
> > VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> > With VT-d Posted-Interrupts enabled, external interrupts from
> > direct-assigned devices can be delivered to guests without VMM
> > intervention when guest is running in non-root mode.
> >
> > This patch adds the variable 'iommu_intpost' to control whether VT-d
> > posted interrupts are enabled in the generic IOMMU code.
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > ---
> > v3:
> > - Remove pointless initializer for 'iommu_intpost'.
> > - Some adjustment for "if no intremap then no intpost" logic.
> >     * For parse_iommu_param(), move it to the end of the function,
> >       so we don't need to add the same logic when introducing the
> >       new kernel parameter 'intpost' in later patch.
> >     * Add this logic in iommu_setup() after iommu_hardware_setup()
> >       is called.
> >
> >  xen/drivers/passthrough/iommu.c | 10 +++++++++-
> >  xen/include/xen/iommu.h         |  2 +-
> >  2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/xen/drivers/passthrough/iommu.c
> b/xen/drivers/passthrough/iommu.c
> > index 06cb38f..597f676 100644
> > --- a/xen/drivers/passthrough/iommu.c
> > +++ b/xen/drivers/passthrough/iommu.c
> > @@ -39,6 +39,7 @@ static void iommu_dump_p2m_table(unsigned char
> key);
> >   *   no-snoop                   Disable VT-d Snoop Control
> >   *   no-qinval                  Disable VT-d Queued Invalidation
> >   *   no-intremap                Disable VT-d Interrupt Remapping
> > + *   no-intpost                 Disable VT-d Interrupt posting
> >   */
> >  custom_param("iommu", parse_iommu_param);
> >  bool_t __initdata iommu_enable = 1;
> > @@ -51,6 +52,7 @@ bool_t __read_mostly iommu_passthrough;
> >  bool_t __read_mostly iommu_snoop = 1;
> >  bool_t __read_mostly iommu_qinval = 1;
> >  bool_t __read_mostly iommu_intremap = 1;
> > +bool_t __read_mostly iommu_intpost;
> >  bool_t __read_mostly iommu_hap_pt_share = 1;
> >  bool_t __read_mostly iommu_debug;
> >  bool_t __read_mostly amd_iommu_perdev_intremap = 1;
> > @@ -112,6 +114,9 @@ static void __init parse_iommu_param(char *s)
> >
> >          s = ss + 1;
> >      } while ( ss );
> > +
> > +    if ( !iommu_intremap )
> > +        iommu_intpost = 0;
> 
> This check is redundant - It will unconditionally be performed by
> iommu_setup().  I would just drop the hunk.
> 
> However, what you are missing is parsing logic to catch a command line
> configuration such as "iommu=intremap,no-intpost"

Doesn't the above code cover this command line?

Thanks,
Feng

> 
> ~Andrew

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 04/15] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature
  2015-06-25  9:47     ` Wu, Feng
@ 2015-06-25 10:16       ` Andrew Cooper
  2015-06-25 12:47         ` Wu, Feng
  0 siblings, 1 reply; 155+ messages in thread
From: Andrew Cooper @ 2015-06-25 10:16 UTC (permalink / raw)
  To: Wu, Feng, xen-devel@lists.xen.org
  Cc: george.dunlap@eu.citrix.com, Zhang, Yang Z, Tian, Kevin,
	keir@xen.org, jbeulich@suse.com

On 25/06/15 10:47, Wu, Feng wrote:
>
>> -----Original Message-----
>> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
>> Sent: Thursday, June 25, 2015 5:06 PM
>> To: Wu, Feng; xen-devel@lists.xen.org
>> Cc: Tian, Kevin; keir@xen.org; george.dunlap@eu.citrix.com;
>> jbeulich@suse.com; Zhang, Yang Z
>> Subject: Re: [Xen-devel] [v3 04/15] iommu: Add iommu_intpost to control VT-d
>> Posted-Interrupts feature
>>
>> On 24/06/15 06:18, Feng Wu wrote:
>>> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
>>> With VT-d Posted-Interrupts enabled, external interrupts from
>>> direct-assigned devices can be delivered to guests without VMM
>>> intervention when guest is running in non-root mode.
>>>
>>> This patch adds the variable 'iommu_intpost' to control whether VT-d
>>> posted interrupts are enabled in the generic IOMMU code.
>>>
>>> Signed-off-by: Feng Wu <feng.wu@intel.com>
>>> ---
>>> v3:
>>> - Remove pointless initializer for 'iommu_intpost'.
>>> - Some adjustment for "if no intremap then no intpost" logic.
>>>     * For parse_iommu_param(), move it to the end of the function,
>>>       so we don't need to add the same logic when introducing the
>>>       new kernel parameter 'intpost' in later patch.
>>>     * Add this logic in iommu_setup() after iommu_hardware_setup()
>>>       is called.
>>>
>>>  xen/drivers/passthrough/iommu.c | 10 +++++++++-
>>>  xen/include/xen/iommu.h         |  2 +-
>>>  2 files changed, 10 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/xen/drivers/passthrough/iommu.c
>> b/xen/drivers/passthrough/iommu.c
>>> index 06cb38f..597f676 100644
>>> --- a/xen/drivers/passthrough/iommu.c
>>> +++ b/xen/drivers/passthrough/iommu.c
>>> @@ -39,6 +39,7 @@ static void iommu_dump_p2m_table(unsigned char
>> key);
>>>   *   no-snoop                   Disable VT-d Snoop Control
>>>   *   no-qinval                  Disable VT-d Queued Invalidation
>>>   *   no-intremap                Disable VT-d Interrupt Remapping
>>> + *   no-intpost                 Disable VT-d Interrupt posting
>>>   */
>>>  custom_param("iommu", parse_iommu_param);
>>>  bool_t __initdata iommu_enable = 1;
>>> @@ -51,6 +52,7 @@ bool_t __read_mostly iommu_passthrough;
>>>  bool_t __read_mostly iommu_snoop = 1;
>>>  bool_t __read_mostly iommu_qinval = 1;
>>>  bool_t __read_mostly iommu_intremap = 1;
>>> +bool_t __read_mostly iommu_intpost;
>>>  bool_t __read_mostly iommu_hap_pt_share = 1;
>>>  bool_t __read_mostly iommu_debug;
>>>  bool_t __read_mostly amd_iommu_perdev_intremap = 1;
>>> @@ -112,6 +114,9 @@ static void __init parse_iommu_param(char *s)
>>>
>>>          s = ss + 1;
>>>      } while ( ss );
>>> +
>>> +    if ( !iommu_intremap )
>>> +        iommu_intpost = 0;
>> This check is redundant - It will unconditionally be performed by
>> iommu_setup().  I would just drop the hunk.
>>
>> However, what you are missing is parsing logic to catch a command line
>> configuration such as "iommu=intremap,no-intpost"
> Doesn't the above code cover this command line?

How would you expect it to? I do not see any strcmp( , "intpost") in
this patch.

~Andrew
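
For illustration only: a minimal sketch of the kind of token parsing the
question above is asking about, so that "iommu=intremap,no-intpost" is
actually recognised. struct iommu_opts and parse_iommu_opts() are
hypothetical stand-ins and do not claim to match parse_iommu_param().

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the iommu_* flags in the patch. */
struct iommu_opts {
    bool intremap;
    bool intpost;
};

/* Parse a comma-separated list such as "intremap,no-intpost". */
static void parse_iommu_opts(const char *s, struct iommu_opts *o)
{
    while ( *s )
    {
        size_t len = strcspn(s, ",");   /* length of the current token */
        bool val = true;

        /* A "no-" prefix inverts the boolean. */
        if ( len >= 3 && !strncmp(s, "no-", 3) )
        {
            val = false;
            s += 3;
            len -= 3;
        }

        if ( len == 8 && !strncmp(s, "intremap", 8) )
            o->intremap = val;
        else if ( len == 7 && !strncmp(s, "intpost", 7) )
            o->intpost = val;
        /* Unknown tokens are silently ignored in this sketch. */

        s += len;
        if ( *s == ',' )
            s++;
    }
}
```

Without the "intpost" comparison, the second token would fall through
unrecognised and the flag would keep whatever value it had before parsing.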

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 05/15] vt-d: VT-d Posted-Interrupts feature detection
  2015-06-24  5:18 ` [v3 05/15] vt-d: VT-d Posted-Interrupts feature detection Feng Wu
@ 2015-06-25 10:21   ` Andrew Cooper
  2015-06-25 13:02     ` Wu, Feng
  2015-07-08  7:32   ` Tian, Kevin
  1 sibling, 1 reply; 155+ messages in thread
From: Andrew Cooper @ 2015-06-25 10:21 UTC (permalink / raw)
  To: Feng Wu, xen-devel
  Cc: yang.z.zhang, george.dunlap, kevin.tian, keir, jbeulich

On 24/06/15 06:18, Feng Wu wrote:
> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> With VT-d Posted-Interrupts enabled, external interrupts from
> direct-assigned devices can be delivered to guests without VMM
> intervention when guest is running in non-root mode.
>
> This patch adds feature detection logic for VT-d posted-interrupt.
>
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
> v3:
> - Remove the "if no intremap then no intpost" logic in
>   intel_vtd_setup(), it is covered in the iommu_setup().
> - Add "if no intremap then no intpost" logic in the end
>   of init_vtd_hw() which is called by vtd_resume().
>
> So the logic exists in the following three places:
> - parse_iommu_param()
> - iommu_setup()
> - init_vtd_hw()
>
>  xen/drivers/passthrough/vtd/iommu.c | 18 ++++++++++++++++--
>  xen/drivers/passthrough/vtd/iommu.h |  1 +
>  2 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
> index 9053a1f..4221185 100644
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -2071,6 +2071,9 @@ static int init_vtd_hw(void)
>                  disable_intremap(drhd->iommu);
>      }
>  
> +    if ( !iommu_intremap )
> +        iommu_intpost = 0;
> +

Yet again, this curious logic.

There should be exactly 1 place which performs this dependency check,
not scattered in multiple places.

>      /*
>       * Set root entries for each VT-d engine.  After set root entry,
>       * must globally invalidate context cache, and then globally
> @@ -2133,8 +2136,8 @@ int __init intel_vtd_setup(void)
>      }
>  
>      /* We enable the following features only if they are supported by all VT-d
> -     * engines: Snoop Control, DMA passthrough, Queued Invalidation and
> -     * Interrupt Remapping.
> +     * engines: Snoop Control, DMA passthrough, Queued Invalidation, Interrupt
> +     * Remapping, and Posted Interrupt
>       */
>      for_each_drhd_unit ( drhd )
>      {
> @@ -2162,6 +2165,15 @@ int __init intel_vtd_setup(void)
>          if ( iommu_intremap && !ecap_intr_remap(iommu->ecap) )
>              iommu_intremap = 0;
>  
> +        /*
> +         * We cannot use posted interrupt if X86_FEATURE_CX16 is
> +         * not supported, since we count on this feature to
> +         * atomically update 16-byte IRTE in posted format.
> +         */
> +        if ( !iommu_intremap &&
> +             (!cap_intr_post(iommu->cap) || !cpu_has_cx16) )
> +            iommu_intpost = 0;

intremap is not relevant to CX16 being a prerequisite of intpost.

~Andrew

> +
>          if ( !vtd_ept_page_compatible(iommu) )
>              iommu_hap_pt_share = 0;
>  
> @@ -2187,6 +2199,7 @@ int __init intel_vtd_setup(void)
>      P(iommu_passthrough, "Dom0 DMA Passthrough");
>      P(iommu_qinval, "Queued Invalidation");
>      P(iommu_intremap, "Interrupt Remapping");
> +    P(iommu_intpost, "Posted Interrupt");
>      P(iommu_hap_pt_share, "Shared EPT tables");
>  #undef P
>  
> @@ -2206,6 +2219,7 @@ int __init intel_vtd_setup(void)
>      iommu_passthrough = 0;
>      iommu_qinval = 0;
>      iommu_intremap = 0;
> +    iommu_intpost = 0;
>      return ret;
>  }
>  
> diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
> index 80f8830..e807253 100644
> --- a/xen/drivers/passthrough/vtd/iommu.h
> +++ b/xen/drivers/passthrough/vtd/iommu.h
> @@ -69,6 +69,7 @@
>  /*
>   * Decoding Capability Register
>   */
> +#define cap_intr_post(c)       (((c) >> 59) & 1)
>  #define cap_read_drain(c)      (((c) >> 55) & 1)
>  #define cap_write_drain(c)     (((c) >> 54) & 1)
>  #define cap_max_amask_val(c)   (((c) >> 48) & 0x3f)
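
For illustration only: one way to read the "exactly 1 place" remark
above is to have a single helper own the intremap/intpost dependency and
call it once, after the command line and the hardware have both had their
say. The struct and function names below are hypothetical and are not
taken from the Xen tree.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the global iommu_* flags. */
struct iommu_features {
    bool intremap;
    bool intpost;
};

/* The only function that knows intpost depends on intremap. */
static void iommu_resolve_deps(struct iommu_features *f)
{
    if ( !f->intremap )
        f->intpost = false;
}

/*
 * Hypothetical setup path: combine what the user asked for with what the
 * hardware reports, then resolve dependencies exactly once.
 */
static void iommu_setup_sketch(struct iommu_features *f,
                               bool hw_intremap, bool hw_intpost)
{
    f->intremap = f->intremap && hw_intremap;
    f->intpost  = f->intpost && hw_intpost;

    iommu_resolve_deps(f);
}
```

Whether a resume path also needs to call such a helper depends on whether
intremap can change there, which this sketch does not try to answer.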

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 04/15] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature
  2015-06-25 10:16       ` Andrew Cooper
@ 2015-06-25 12:47         ` Wu, Feng
  0 siblings, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-06-25 12:47 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel@lists.xen.org
  Cc: Tian, Kevin, Wu, Feng, george.dunlap@eu.citrix.com,
	jbeulich@suse.com, Zhang, Yang Z, keir@xen.org



> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Thursday, June 25, 2015 6:16 PM
> To: Wu, Feng; xen-devel@lists.xen.org
> Cc: Tian, Kevin; keir@xen.org; george.dunlap@eu.citrix.com;
> jbeulich@suse.com; Zhang, Yang Z
> Subject: Re: [Xen-devel] [v3 04/15] iommu: Add iommu_intpost to control VT-d
> Posted-Interrupts feature
> 
> On 25/06/15 10:47, Wu, Feng wrote:
> >
> >> -----Original Message-----
> >> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> >> Sent: Thursday, June 25, 2015 5:06 PM
> >> To: Wu, Feng; xen-devel@lists.xen.org
> >> Cc: Tian, Kevin; keir@xen.org; george.dunlap@eu.citrix.com;
> >> jbeulich@suse.com; Zhang, Yang Z
> >> Subject: Re: [Xen-devel] [v3 04/15] iommu: Add iommu_intpost to control
> VT-d
> >> Posted-Interrupts feature
> >>
> >> On 24/06/15 06:18, Feng Wu wrote:
> >>> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> >>> With VT-d Posted-Interrupts enabled, external interrupts from
> >>> direct-assigned devices can be delivered to guests without VMM
> >>> intervention when guest is running in non-root mode.
> >>>
> >>> This patch adds the variable 'iommu_intpost' to control whether VT-d
> >>> posted interrupts are enabled in the generic IOMMU code.
> >>>
> >>> Signed-off-by: Feng Wu <feng.wu@intel.com>
> >>> ---
> >>> v3:
> >>> - Remove pointless initializer for 'iommu_intpost'.
> >>> - Some adjustment for "if no intremap then no intpost" logic.
> >>>     * For parse_iommu_param(), move it to the end of the function,
> >>>       so we don't need to add the same logic when introducing the
> >>>       new kernel parameter 'intpost' in later patch.
> >>>     * Add this logic in iommu_setup() after iommu_hardware_setup()
> >>>       is called.
> >>>
> >>>  xen/drivers/passthrough/iommu.c | 10 +++++++++-
> >>>  xen/include/xen/iommu.h         |  2 +-
> >>>  2 files changed, 10 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/xen/drivers/passthrough/iommu.c
> >> b/xen/drivers/passthrough/iommu.c
> >>> index 06cb38f..597f676 100644
> >>> --- a/xen/drivers/passthrough/iommu.c
> >>> +++ b/xen/drivers/passthrough/iommu.c
> >>> @@ -39,6 +39,7 @@ static void iommu_dump_p2m_table(unsigned char
> >> key);
> >>>   *   no-snoop                   Disable VT-d Snoop Control
> >>>   *   no-qinval                  Disable VT-d Queued Invalidation
> >>>   *   no-intremap                Disable VT-d Interrupt Remapping
> >>> + *   no-intpost                 Disable VT-d Interrupt posting
> >>>   */
> >>>  custom_param("iommu", parse_iommu_param);
> >>>  bool_t __initdata iommu_enable = 1;
> >>> @@ -51,6 +52,7 @@ bool_t __read_mostly iommu_passthrough;
> >>>  bool_t __read_mostly iommu_snoop = 1;
> >>>  bool_t __read_mostly iommu_qinval = 1;
> >>>  bool_t __read_mostly iommu_intremap = 1;
> >>> +bool_t __read_mostly iommu_intpost;
> >>>  bool_t __read_mostly iommu_hap_pt_share = 1;
> >>>  bool_t __read_mostly iommu_debug;
> >>>  bool_t __read_mostly amd_iommu_perdev_intremap = 1;
> >>> @@ -112,6 +114,9 @@ static void __init parse_iommu_param(char *s)
> >>>
> >>>          s = ss + 1;
> >>>      } while ( ss );
> >>> +
> >>> +    if ( !iommu_intremap )
> >>> +        iommu_intpost = 0;
> >> This check is redundant - It will unconditionally be performed by
> >> iommu_setup().  I would just drop the hunk.
> >>
> >> However, what you are missing is parsing logic to catch a command line
> >> configuration such as "iommu=intremap,no-intpost"
> > Doesn't the above code cover this command line?
> 
> How would you expect it to? I do not see any strcmp( , "intpost") in
> this patch.

Oh, the command-line handling logic is covered in [15/15].

Thanks,
Feng
> 
> ~Andrew

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 05/15] vt-d: VT-d Posted-Interrupts feature detection
  2015-06-25 10:21   ` Andrew Cooper
@ 2015-06-25 13:02     ` Wu, Feng
  0 siblings, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-06-25 13:02 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel@lists.xen.org
  Cc: Tian, Kevin, Wu, Feng, george.dunlap@eu.citrix.com,
	jbeulich@suse.com, Zhang, Yang Z, keir@xen.org



> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Thursday, June 25, 2015 6:21 PM
> To: Wu, Feng; xen-devel@lists.xen.org
> Cc: keir@xen.org; jbeulich@suse.com; Tian, Kevin; Zhang, Yang Z;
> george.dunlap@eu.citrix.com
> Subject: Re: [v3 05/15] vt-d: VT-d Posted-Interrupts feature detection
> 
> On 24/06/15 06:18, Feng Wu wrote:
> > VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> > With VT-d Posted-Interrupts enabled, external interrupts from
> > direct-assigned devices can be delivered to guests without VMM
> > intervention when guest is running in non-root mode.
> >
> > This patch adds feature detection logic for VT-d posted-interrupt.
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > ---
> > v3:
> > - Remove the "if no intremap then no intpost" logic in
> >   intel_vtd_setup(), it is covered in the iommu_setup().
> > - Add "if no intremap then no intpost" logic in the end
> >   of init_vtd_hw() which is called by vtd_resume().
> >
> > So the logic exists in the following three places:
> > - parse_iommu_param()
> > - iommu_setup()
> > - init_vtd_hw()
> >
> >  xen/drivers/passthrough/vtd/iommu.c | 18 ++++++++++++++++--
> >  xen/drivers/passthrough/vtd/iommu.h |  1 +
> >  2 files changed, 17 insertions(+), 2 deletions(-)
> >
> > diff --git a/xen/drivers/passthrough/vtd/iommu.c
> b/xen/drivers/passthrough/vtd/iommu.c
> > index 9053a1f..4221185 100644
> > --- a/xen/drivers/passthrough/vtd/iommu.c
> > +++ b/xen/drivers/passthrough/vtd/iommu.c
> > @@ -2071,6 +2071,9 @@ static int init_vtd_hw(void)
> >                  disable_intremap(drhd->iommu);
> >      }
> >
> > +    if ( !iommu_intremap )
> > +        iommu_intpost = 0;
> > +
> 
> Yet again, this curious logic.
> 
> There should be exactly 1 place which performs this dependency check,
> not scattered in multiple places.

It seems a little hard to find the one right place to perform this check,
but I will think about it more.

> 
> >      /*
> >       * Set root entries for each VT-d engine.  After set root entry,
> >       * must globally invalidate context cache, and then globally
> > @@ -2133,8 +2136,8 @@ int __init intel_vtd_setup(void)
> >      }
> >
> >      /* We enable the following features only if they are supported by all
> VT-d
> > -     * engines: Snoop Control, DMA passthrough, Queued Invalidation and
> > -     * Interrupt Remapping.
> > +     * engines: Snoop Control, DMA passthrough, Queued Invalidation,
> Interrupt
> > +     * Remapping, and Posted Interrupt
> >       */
> >      for_each_drhd_unit ( drhd )
> >      {
> > @@ -2162,6 +2165,15 @@ int __init intel_vtd_setup(void)
> >          if ( iommu_intremap && !ecap_intr_remap(iommu->ecap) )
> >              iommu_intremap = 0;
> >
> > +        /*
> > +         * We cannot use posted interrupt if X86_FEATURE_CX16 is
> > +         * not supported, since we count on this feature to
> > +         * atomically update 16-byte IRTE in posted format.
> > +         */
> > +        if ( !iommu_intremap &&
> > +             (!cap_intr_post(iommu->cap) || !cpu_has_cx16) )
> > +            iommu_intpost = 0;
> 
> intremap is not relevant to CX16 being a prerequisite of intpost.
> 

I think I made a typo here; it should be like this:

+        if ( !iommu_intremap || !cap_intr_post(iommu->cap) || !cpu_has_cx16 )
+            iommu_intpost = 0;

Thanks,
Feng

> ~Andrew
> 
> > +
> >          if ( !vtd_ept_page_compatible(iommu) )
> >              iommu_hap_pt_share = 0;
> >
> > @@ -2187,6 +2199,7 @@ int __init intel_vtd_setup(void)
> >      P(iommu_passthrough, "Dom0 DMA Passthrough");
> >      P(iommu_qinval, "Queued Invalidation");
> >      P(iommu_intremap, "Interrupt Remapping");
> > +    P(iommu_intpost, "Posted Interrupt");
> >      P(iommu_hap_pt_share, "Shared EPT tables");
> >  #undef P
> >
> > @@ -2206,6 +2219,7 @@ int __init intel_vtd_setup(void)
> >      iommu_passthrough = 0;
> >      iommu_qinval = 0;
> >      iommu_intremap = 0;
> > +    iommu_intpost = 0;
> >      return ret;
> >  }
> >
> > diff --git a/xen/drivers/passthrough/vtd/iommu.h
> b/xen/drivers/passthrough/vtd/iommu.h
> > index 80f8830..e807253 100644
> > --- a/xen/drivers/passthrough/vtd/iommu.h
> > +++ b/xen/drivers/passthrough/vtd/iommu.h
> > @@ -69,6 +69,7 @@
> >  /*
> >   * Decoding Capability Register
> >   */
> > +#define cap_intr_post(c)       (((c) >> 59) & 1)
> >  #define cap_read_drain(c)      (((c) >> 55) & 1)
> >  #define cap_write_drain(c)     (((c) >> 54) & 1)
> >  #define cap_max_amask_val(c)   (((c) >> 48) & 0x3f)
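
For illustration only: the difference between the condition as posted in
v3 and the corrected one can be made concrete with two small functions.
The function names are hypothetical; the conditions themselves are copied
from the hunk and from the correction in this reply.

```c
#include <stdbool.h>

/*
 * Condition as posted in v3: intpost is only cleared when intremap is
 * already off, so a missing PI capability or missing CX16 goes unnoticed
 * whenever intremap is on.
 */
static bool intpost_as_posted(bool intpost, bool intremap,
                              bool cap_pi, bool cx16)
{
    if ( !intremap && (!cap_pi || !cx16) )
        intpost = false;

    return intpost;
}

/* Corrected condition: any one missing prerequisite disables intpost. */
static bool intpost_corrected(bool intpost, bool intremap,
                              bool cap_pi, bool cx16)
{
    if ( !intremap || !cap_pi || !cx16 )
        intpost = false;

    return intpost;
}
```

With intremap on, CX16 present and the PI capability bit clear, the first
function leaves intpost enabled while the second clears it.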

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 06/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  2015-06-24  5:18 ` [v3 06/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts Feng Wu
@ 2015-06-29 15:04   ` Andrew Cooper
  2015-07-08  7:48   ` Tian, Kevin
  2015-07-10 13:08   ` Jan Beulich
  2 siblings, 0 replies; 155+ messages in thread
From: Andrew Cooper @ 2015-06-29 15:04 UTC (permalink / raw)
  To: Feng Wu, xen-devel
  Cc: yang.z.zhang, george.dunlap, kevin.tian, keir, jbeulich

On 24/06/15 06:18, Feng Wu wrote:
> Extend struct pi_desc according to VT-d Posted-Interrupts Spec.
>
> Signed-off-by: Feng Wu <feng.wu@intel.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

Although this, like many other patches in the series, needs a VT-d
maintainer's ack/review.

> ---
> v3:
> - Use u32 instead of u64 for the bitfield in 'struct pi_desc'
>
>  xen/include/asm-x86/hvm/vmx/vmcs.h | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
> index 1104bda..dedfaef 100644
> --- a/xen/include/asm-x86/hvm/vmx/vmcs.h
> +++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
> @@ -81,8 +81,19 @@ struct vmx_domain {
>  
>  struct pi_desc {
>      DECLARE_BITMAP(pir, NR_VECTORS);
> -    u32 control;
> -    u32 rsvd[7];
> +    union {
> +        struct
> +        {
> +        u16 on     : 1,  /* bit 256 - Outstanding Notification */
> +            sn     : 1,  /* bit 257 - Suppress Notification */
> +            rsvd_1 : 14; /* bit 271:258 - Reserved */
> +        u8  nv;          /* bit 279:272 - Notification Vector */
> +        u8  rsvd_2;      /* bit 287:280 - Reserved */
> +        u32 ndst;        /* bit 319:288 - Notification Destination */
> +        };
> +        u64 control;
> +    };
> +    u32 rsvd[6];
>  } __attribute__ ((aligned (64)));
>  
>  #define ept_get_wl(ept)   ((ept)->ept_wl)
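
For illustration only: the layout in the hunk above can be checked at
compile time. This sketch assumes 256 vectors, a little-endian target and
GCC/Clang bitfield allocation (lowest bits first); struct pi_desc_sketch
is a hypothetical copy, not the Xen definition.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_VECTORS_SKETCH 256   /* assumed vector count */

struct pi_desc_sketch {
    uint64_t pir[NR_VECTORS_SKETCH / 64];  /* bits 255:0 */
    union {
        struct {
            uint16_t on     : 1,   /* bit 256 - Outstanding Notification */
                     sn     : 1,   /* bit 257 - Suppress Notification */
                     rsvd_1 : 14;  /* bits 271:258 - Reserved */
            uint8_t  nv;           /* bits 279:272 - Notification Vector */
            uint8_t  rsvd_2;       /* bits 287:280 - Reserved */
            uint32_t ndst;         /* bits 319:288 - Notification Destination */
        };
        uint64_t control;          /* the same 64 bits as one word */
    };
    uint32_t rsvd[6];
} __attribute__((aligned(64)));

/* The descriptor is one 64-byte cache line. */
_Static_assert(sizeof(struct pi_desc_sketch) == 64, "descriptor is 64 bytes");
/* Bit 256 starts at byte 32; NDST (bit 288) starts at byte 36. */
_Static_assert(offsetof(struct pi_desc_sketch, control) == 32, "control offset");
_Static_assert(offsetof(struct pi_desc_sketch, ndst) == 36, "ndst offset");
```

Static assertions of this kind catch accidental padding if a field width
changes later.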

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 07/15] vmx: Initialize VT-d Posted-Interrupts Descriptor
  2015-06-24  5:18 ` [v3 07/15] vmx: Initialize VT-d Posted-Interrupts Descriptor Feng Wu
@ 2015-06-29 15:32   ` Andrew Cooper
  2015-06-30  1:46     ` Wu, Feng
  2015-06-30  2:32     ` Dario Faggioli
  2015-07-08  7:53   ` Tian, Kevin
  1 sibling, 2 replies; 155+ messages in thread
From: Andrew Cooper @ 2015-06-29 15:32 UTC (permalink / raw)
  To: Feng Wu, xen-devel
  Cc: kevin.tian, keir, george.dunlap, Dario Faggioli, jbeulich,
	yang.z.zhang

On 24/06/15 06:18, Feng Wu wrote:
> This patch initializes the VT-d Posted-interrupt Descriptor.
>
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
> v3:
> - Move pi_desc_init() to xen/arch/x86/hvm/vmx/vmcs.c
> - Remove the 'inline' flag of pi_desc_init()
>
>  xen/arch/x86/hvm/vmx/vmcs.c       | 18 ++++++++++++++++++
>  xen/include/asm-x86/hvm/vmx/vmx.h |  2 ++
>  2 files changed, 20 insertions(+)
>
> diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
> index 3aff365..11dc1b5 100644
> --- a/xen/arch/x86/hvm/vmx/vmcs.c
> +++ b/xen/arch/x86/hvm/vmx/vmcs.c
> @@ -40,6 +40,7 @@
>  #include <asm/flushtlb.h>
>  #include <asm/shadow.h>
>  #include <asm/tboot.h>
> +#include <asm/apic.h>
>  
>  static bool_t __read_mostly opt_vpid_enabled = 1;
>  boolean_param("vpid", opt_vpid_enabled);
> @@ -921,6 +922,20 @@ void virtual_vmcs_vmwrite(void *vvmcs, u32 vmcs_encoding, u64 val)
>      virtual_vmcs_exit(vvmcs);
>  }
>  
> +static void pi_desc_init(struct vcpu *v)
> +{
> +    uint32_t dest;
> +
> +    v->arch.hvm_vmx.pi_desc.nv = posted_intr_vector;
> +
> +    dest = cpu_physical_id(v->processor);

I am fairly sure that this is not a safe use of v->processor. 
Everything else in this patch looks fine, but I would like review from
people more familiar with scheduling.

~Andrew

> +
> +    if ( x2apic_enabled )
> +        v->arch.hvm_vmx.pi_desc.ndst = dest;
> +    else
> +        v->arch.hvm_vmx.pi_desc.ndst = MASK_INSR(dest, PI_xAPIC_NDST_MASK);
> +}
> +
>  static int construct_vmcs(struct vcpu *v)
>  {
>      struct domain *d = v->domain;
> @@ -1054,6 +1069,9 @@ static int construct_vmcs(struct vcpu *v)
>  
>      if ( cpu_has_vmx_posted_intr_processing )
>      {
> +        if ( iommu_intpost )
> +            pi_desc_init(v);
> +
>          __vmwrite(PI_DESC_ADDR, virt_to_maddr(&v->arch.hvm_vmx.pi_desc));
>          __vmwrite(POSTED_INTR_NOTIFICATION_VECTOR, posted_intr_vector);
>      }
> diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
> index 35f804a..5853563 100644
> --- a/xen/include/asm-x86/hvm/vmx/vmx.h
> +++ b/xen/include/asm-x86/hvm/vmx/vmx.h
> @@ -89,6 +89,8 @@ typedef enum {
>  #define EPT_EMT_WB              6
>  #define EPT_EMT_RSV2            7
>  
> +#define PI_xAPIC_NDST_MASK      0xFF00
> +
>  void vmx_asm_vmexit_handler(struct cpu_user_regs);
>  void vmx_asm_do_vmentry(void);
>  void vmx_intr_assist(void);
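
For illustration only: the NDST encoding in pi_desc_init() can be sketched
in isolation. MASK_INSR() below is a local definition assumed to behave
like the helper used in the hunk, and pi_ndst_for() is a hypothetical
name. In xAPIC mode the 8-bit APIC ID is placed in bits 15:8; in x2APIC
mode the 32-bit ID is used as is.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mask-insert helper: shift 'v' to the lowest set bit of 'm'. */
#define MASK_INSR(v, m) (((v) * ((m) & -(m))) & (m))

#define PI_xAPIC_NDST_MASK 0xFF00u

/* Compute the NDST field for a given physical APIC ID. */
static uint32_t pi_ndst_for(uint32_t apic_id, bool x2apic)
{
    return x2apic ? apic_id : MASK_INSR(apic_id, PI_xAPIC_NDST_MASK);
}
```

This only covers the encoding. It says nothing about when the value has
to be recomputed if the vCPU moves to another processor, which is the
question raised above.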

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 08/15] Suppress posting interrupts when 'SN' is set
  2015-06-24  5:18 ` [v3 08/15] Suppress posting interrupts when 'SN' is set Feng Wu
@ 2015-06-29 15:41   ` Andrew Cooper
  2015-06-30  1:48     ` Wu, Feng
  2015-07-08  9:06   ` Tian, Kevin
  2015-07-10 13:20   ` Jan Beulich
  2 siblings, 1 reply; 155+ messages in thread
From: Andrew Cooper @ 2015-06-29 15:41 UTC (permalink / raw)
  To: Feng Wu, xen-devel
  Cc: george.dunlap, yang.z.zhang, kevin.tian, keir, jbeulich

On 24/06/15 06:18, Feng Wu wrote:
> Currently, we don't support urgent interrupts; all interrupts
> are treated as non-urgent, so we cannot send a
> posted interrupt when 'SN' is set.
>
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
> v3:
> use cmpxchg to test SN/ON and set ON
>
>  xen/arch/x86/hvm/vmx/vmx.c | 32 ++++++++++++++++++++++++++++----
>  1 file changed, 28 insertions(+), 4 deletions(-)
>
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 0837627..b94ef6a 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -1686,6 +1686,8 @@ static void __vmx_deliver_posted_interrupt(struct vcpu *v)
>  
>  static void vmx_deliver_posted_intr(struct vcpu *v, u8 vector)
>  {
> +    struct pi_desc old, new, prev;

These should be moved into the else clause below, to reduce their scope.

> +
>      if ( pi_test_and_set_pir(vector, &v->arch.hvm_vmx.pi_desc) )
>          return;
>  
> @@ -1698,13 +1700,35 @@ static void vmx_deliver_posted_intr(struct vcpu *v, u8 vector)
>           */
>          pi_set_on(&v->arch.hvm_vmx.pi_desc);
>      }
> -    else if ( !pi_test_and_set_on(&v->arch.hvm_vmx.pi_desc) )
> +    else
>      {
> +        prev.control = 0;
> +
> +        do {
> +            old.control = v->arch.hvm_vmx.pi_desc.control &
> +                          ~(1 << POSTED_INTR_ON | 1 << POSTED_INTR_SN);

Brackets around each of the 1 << $NNN please.

> +            new.control = v->arch.hvm_vmx.pi_desc.control |
> +                          1 << POSTED_INTR_ON;
> +
> +            /*
> +             * Currently, we don't support urgent interrupt, all
> +             * interrupts are recognized as non-urgent interrupt,
> +             * so we cannot send posted-interrupt when 'SN' is set.
> +             * Besides that, if 'ON' is already set, we cannot set
> +             * posted-interrupts as well.
> +             */
> +            if ( prev.sn || prev.on )
> +            {
> +                vcpu_kick(v);
> +                return;
> +            }
> +
> +            prev.control = cmpxchg(&v->arch.hvm_vmx.pi_desc.control,
> +                                   old.control, new.control);
> +        } while ( prev.control != old.control );
> +
>          __vmx_deliver_posted_interrupt(v);
> -        return;
>      }
> -
> -    vcpu_kick(v);

This removes a vcpu_kick() from the eoi_exitmap_changed path, which I
suspect is not what you intend.

~Andrew

>  }
>  
>  static void vmx_sync_pir_to_irr(struct vcpu *v)
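
For illustration only: "test SN/ON and set ON" as a compare-and-swap loop,
written with C11 atomics rather than Xen's cmpxchg(). The bit positions
are assumed from the pi_desc layout (ON = bit 0, SN = bit 1 of the control
word), and pi_try_set_on() is a hypothetical name.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions within the 64-bit control word. */
#define POSTED_INTR_ON 0
#define POSTED_INTR_SN 1

/*
 * Try to set ON.  Returns true if this call set ON, meaning a
 * notification should be sent.  Returns false if SN or ON was already
 * set, in which case the caller falls back to kicking the vCPU.
 */
static bool pi_try_set_on(_Atomic uint64_t *control)
{
    const uint64_t on = UINT64_C(1) << POSTED_INTR_ON;
    const uint64_t sn = UINT64_C(1) << POSTED_INTR_SN;
    uint64_t old = atomic_load(control);

    for ( ; ; )
    {
        uint64_t new_val;

        if ( old & (on | sn) )
            return false;

        new_val = old | on;

        /* On failure 'old' is refreshed with the current value. */
        if ( atomic_compare_exchange_weak(control, &old, new_val) )
            return true;
    }
}
```

A caller would send the notification when this returns true and call
vcpu_kick() otherwise, so the kick path stays reachable.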

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 09/15] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts
  2015-06-24  5:18 ` [v3 09/15] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts Feng Wu
@ 2015-06-29 16:04   ` Andrew Cooper
  2015-06-30  1:52     ` Wu, Feng
  2015-07-08  9:10   ` Tian, Kevin
  2015-07-10 13:27   ` Jan Beulich
  2 siblings, 1 reply; 155+ messages in thread
From: Andrew Cooper @ 2015-06-29 16:04 UTC (permalink / raw)
  To: Feng Wu, xen-devel
  Cc: yang.z.zhang, george.dunlap, kevin.tian, keir, jbeulich

On 24/06/15 06:18, Feng Wu wrote:
> diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
> index e807253..49daa70 100644
> --- a/xen/drivers/passthrough/vtd/iommu.h
> +++ b/xen/drivers/passthrough/vtd/iommu.h
> @@ -289,29 +289,43 @@ struct dma_pte {
>  /* interrupt remap entry */
>  struct iremap_entry {
>    union {
> -    u64 lo_val;
> +    struct { u64 lo, hi; };
>      struct {
> -        u64 p       : 1,
> +        u16 p       : 1,
>              fpd     : 1,
>              dm      : 1,
>              rh      : 1,
>              tm      : 1,
>              dlm     : 3,
>              avail   : 4,
> -            res_1   : 4,
> -            vector  : 8,
> -            res_2   : 8,
> -            dst     : 32;
> -    }lo;
> -  };
> -  union {
> -    u64 hi_val;
> +            res_1   : 4;
> +        u8  vector;
> +        u8  res_2;
> +        u32 dst;
> +        u16 sid;
> +        u16 sq      : 2,
> +            svt     : 2,
> +            res_3   : 12;
> +        u32 res_4   : 32;

res_4 does not need to be a bitfield.

> +    } remap;
>      struct {
> -        u64 sid     : 16,
> -            sq      : 2,
> +        u16 p       : 1,
> +            fpd     : 1,
> +            res_1   : 6,
> +            avail   : 4,
> +            res_2   : 2,
> +            urg     : 1,
> +            im      : 1;

I think "im" needs exposing in both the post and remap unions, as it is
the bit which identifies which representation to use.
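
Roughly what I have in mind, as a sketch rather than a drop-in replacement: "im" takes the top bit of the first 16 bits in both views, and res_4 in the remap view is a plain member. The struct and helper names are made up for illustration, and the layout assumes a little-endian target with the GCC/Clang bit-field rules.

```c
#include <stdint.h>

/* Illustrative layout only; field names follow the patch. */
struct irte_sketch {
    union {
        struct { uint64_t lo, hi; };
        struct {
            uint16_t p:1, fpd:1, dm:1, rh:1, tm:1, dlm:3,
                     avail:4, res_1:3, im:1;
            uint8_t  vector;
            uint8_t  res_2;
            uint32_t dst;
            uint16_t sid;
            uint16_t sq:2, svt:2, res_3:12;
            uint32_t res_4;              /* plain member, no bit-field */
        } remap;
        struct {
            uint16_t p:1, fpd:1, res_1:6, avail:4, res_2:2, urg:1, im:1;
            uint8_t  vector;
            uint8_t  res_3;
            uint32_t res_4:6, pda_l:26;
            uint16_t sid;
            uint16_t sq:2, svt:2, res_5:12;
            uint32_t pda_h;
        } post;
    };
};

_Static_assert(sizeof(struct irte_sketch) == 16, "IRTE must be 128 bits");

/* Either view can be used to decide which format an entry is in. */
static inline int irte_is_posted(const struct irte_sketch *e)
{
    return e->remap.im;
}
```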

> +        u8  vector;
> +        u8  res_3;
> +        u32 res_4   : 6,
> +            pda_l   : 26;
> +        u16 sid;
> +        u16 sq      : 2,
>              svt     : 2,
> -            res_1   : 44;
> -    }hi;
> +            res_5   : 12;
> +        u32 pda_h;
> +    } post;
>    };
>  };
>  
> diff --git a/xen/drivers/passthrough/vtd/utils.c b/xen/drivers/passthrough/vtd/utils.c
> index bd14c02..a5fe237 100644
> --- a/xen/drivers/passthrough/vtd/utils.c
> +++ b/xen/drivers/passthrough/vtd/utils.c
> @@ -238,14 +238,14 @@ static void dump_iommu_info(unsigned char key)
>                  else
>                      p = &iremap_entries[i % (1 << IREMAP_ENTRY_ORDER)];
>  
> -                if ( !p->lo.p )
> +                if ( !p->remap.p )
>                      continue;
>                  printk("  %04x:  %x   %x  %04x %08x %02x    %x   %x  %x  %x  %x"
>                      "   %x %x\n", i,
> -                    (u32)p->hi.svt, (u32)p->hi.sq, (u32)p->hi.sid,
> -                    (u32)p->lo.dst, (u32)p->lo.vector, (u32)p->lo.avail,
> -                    (u32)p->lo.dlm, (u32)p->lo.tm, (u32)p->lo.rh,
> -                    (u32)p->lo.dm, (u32)p->lo.fpd, (u32)p->lo.p);
> +                    (u32)p->remap.svt, (u32)p->remap.sq, (u32)p->remap.sid,
> +                    (u32)p->remap.dst, (u32)p->remap.vector, (u32)p->remap.avail,
> +                    (u32)p->remap.dlm, (u32)p->remap.tm, (u32)p->remap.rh,
> +                    (u32)p->remap.dm, (u32)p->remap.fpd, (u32)p->remap.p);

This printing is only valid if "im" is 0.  As this series adds support
for the posted format, I would suggest you extend this debugging here to
deal with both formats.
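
As a rough sketch of what a format-aware dump could look like, here is a helper that works on the raw 64-bit halves and switches on "im". The bit positions follow the struct in this patch; the function name and the output format are hypothetical and not part of the series.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Describe one IRTE given its raw halves. Returns 0 for a non-present
 * entry, otherwise the snprintf() result.
 */
static int irte_describe(uint64_t lo, uint64_t hi, char *buf, size_t len)
{
    unsigned int vector = (lo >> 16) & 0xff;
    unsigned int sid = hi & 0xffff;

    if ( !(lo & 1) )                       /* P clear: entry not present */
        return 0;

    if ( (lo >> 15) & 1 )                  /* IM set: posted format */
    {
        /* pda_l holds address bits 31:6, pda_h holds bits 63:32. */
        uint64_t pda = ((hi >> 32) << 32) | (((lo >> 38) & 0x3ffffff) << 6);

        return snprintf(buf, len, "post  sid=%04x vec=%02x urg=%u pda=%016llx",
                        sid, vector, (unsigned int)((lo >> 14) & 1),
                        (unsigned long long)pda);
    }

    /* IM clear: remapped format, destination ID in bits 63:32 of lo. */
    return snprintf(buf, len, "remap sid=%04x vec=%02x dst=%08x",
                    sid, vector, (unsigned int)(lo >> 32));
}
```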

~Andrew

>                  print_cnt++;
>              }
>              if ( iremap_entries )

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used
  2015-06-24  5:18 ` [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used Feng Wu
@ 2015-06-29 16:22   ` Andrew Cooper
  2015-07-08  9:59   ` Tian, Kevin
  2015-07-10 14:01   ` Jan Beulich
  2 siblings, 0 replies; 155+ messages in thread
From: Andrew Cooper @ 2015-06-29 16:22 UTC (permalink / raw)
  To: Feng Wu, xen-devel
  Cc: yang.z.zhang, george.dunlap, kevin.tian, keir, jbeulich

On 24/06/15 06:18, Feng Wu wrote:
> This patch adds an API which is used to update the IRTE
> for posted-interrupt when guest changes MSI/MSI-X information.
>
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
> v3:
> - Remove "adding PDA_MASK()" when updating 'pda_l' and 'pda_h' for IRTE.
> - Change the return type of pi_update_irte() to int.
> - Remove some pointless printk message in pi_update_irte().
> - Use structure assignment instead of memcpy() for irte copy.
>
>  xen/drivers/passthrough/vtd/intremap.c | 98 ++++++++++++++++++++++++++++++++++
>  xen/drivers/passthrough/vtd/iommu.h    |  2 +
>  xen/include/asm-x86/iommu.h            |  2 +
>  3 files changed, 102 insertions(+)
>
> diff --git a/xen/drivers/passthrough/vtd/intremap.c b/xen/drivers/passthrough/vtd/intremap.c
> index b7a42f6..401a9d1 100644
> --- a/xen/drivers/passthrough/vtd/intremap.c
> +++ b/xen/drivers/passthrough/vtd/intremap.c
> @@ -900,3 +900,101 @@ void iommu_disable_x2apic_IR(void)
>      for_each_drhd_unit ( drhd )
>          disable_qinval(drhd->iommu);
>  }
> +
> +static inline void setup_posted_irte(

No need for "inline" here.

> +    struct iremap_entry *new_ire, struct pi_desc *pi_desc, uint8_t gvec)

const struct pi_desc *pi_desc

> +{
> +    new_ire->post.urg = 0;

I would start by memset()ing the entire structure to 0, then filling in
the non-zero bits.
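
Something along these lines, as a sketch only. The struct below mirrors just the 'post' layout, the function name and the pi_desc_maddr parameter are made up for illustration, and which fields the real function carries over from the existing entry (sid, sq, svt and so on) is not shown here.

```c
#include <stdint.h>
#include <string.h>

/* Posted-format view only; assumes the GCC/Clang bit-field layout. */
struct posted_irte {
    uint16_t p:1, fpd:1, res_1:6, avail:4, res_2:2, urg:1, im:1;
    uint8_t  vector;
    uint8_t  res_3;
    uint32_t res_4:6, pda_l:26;
    uint16_t sid;
    uint16_t sq:2, svt:2, res_5:12;
    uint32_t pda_h;
};

static void setup_posted_irte_sketch(struct posted_irte *ire,
                                     uint64_t pi_desc_maddr, uint8_t gvec)
{
    /* Zero everything first, so reserved bits and urg need no assignment. */
    memset(ire, 0, sizeof(*ire));

    ire->p      = 1;
    ire->im     = 1;
    ire->vector = gvec;
    /* The descriptor is 64-byte aligned: bits 31:6 go low, 63:32 go high. */
    ire->pda_l  = (pi_desc_maddr >> 6) & 0x3ffffff;
    ire->pda_h  = pi_desc_maddr >> 32;
}
```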

~Andrew

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 11/15] Update IRTE according to guest interrupt config changes
  2015-06-24  5:18 ` [v3 11/15] Update IRTE according to guest interrupt config changes Feng Wu
@ 2015-06-29 16:46   ` Andrew Cooper
  2015-07-08 10:22   ` Tian, Kevin
  2015-07-10 14:23   ` Jan Beulich
  2 siblings, 0 replies; 155+ messages in thread
From: Andrew Cooper @ 2015-06-29 16:46 UTC (permalink / raw)
  To: Feng Wu, xen-devel
  Cc: yang.z.zhang, george.dunlap, kevin.tian, keir, jbeulich

On 24/06/15 06:18, Feng Wu wrote:
> When guest changes its interrupt configuration (such as, vector, etc.)
> for direct-assigned devices, we need to update the associated IRTE
> with the new guest vector, so external interrupts from the assigned
> devices can be injected to guests without VM-Exit.
>
> For lowest-priority interrupts, we use vector-hashing mechanism to find
> the destination vCPU. This follows the hardware behavior, since modern
> Intel CPUs use vector hashing to handle the lowest-priority interrupt.
>
> Multicast/broadcast interrupts cannot be handled via interrupt posting,
> so interrupt remapping is still used for them.
>
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
> v3:
> - Use a bitmap to store all the possible destination vCPUs of an
> interrupt, then try to find the right destination from the bitmap
> - Typo and some small changes
>
>  xen/drivers/passthrough/io.c | 96 +++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 95 insertions(+), 1 deletion(-)
>
> diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> index 9b77334..18e24e1 100644
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -26,6 +26,7 @@
>  #include <asm/hvm/iommu.h>
>  #include <asm/hvm/support.h>
>  #include <xen/hvm/irq.h>
> +#include <asm/io_apic.h>
>  
>  static DEFINE_PER_CPU(struct list_head, dpci_list);
>  
> @@ -199,6 +200,78 @@ void free_hvm_irq_dpci(struct hvm_irq_dpci *dpci)
>      xfree(dpci);
>  }
>  
> +/*
> + * The purpose of this routine is to find the right destination vCPU for
> + * an interrupt which will be delivered by VT-d posted-interrupt. There
> + * are several cases as below:
> + *
> + * - For lowest-priority interrupts, we find the destination vCPU from the
> + *   guest vector using vector-hashing mechanism and return true. This follows
> + *   the hardware behavior, since modern Intel CPUs use vector hashing to
> + *   handle the lowest-priority interrupt.
> + * - Otherwise, for single destination interrupt, it is straightforward to
> + *   find the destination vCPU and return true.
> + * - For multicast/broadcast vCPU, we cannot handle it via interrupt posting,
> + *   so return false.

s/false/NULL/ ?

> + *
> + *   Here are the details of the vector-hashing mechanism:
> + *   1. For lowest-priority interrupts, store all the possible destination
> + *      vCPUs in an array.
> + *   2. Use "gvec % max number of destination vCPUs" to find the right
> + *      destination vCPU in the array for the lowest-priority interrupt.
> + */
> +static struct vcpu *pi_find_dest_vcpu(struct domain *d, uint8_t dest_id,

dest_id should clearly be 32bits rather than 8.

> +                                      uint8_t dest_mode, uint8_t delivery_mode,
> +                                      uint8_t gvec)
> +{
> +    unsigned long *dest_vcpu_bitmap = NULL;
> +    unsigned int dest_vcpu_num = 0, idx = 0;
> +    int size = (d->max_vcpus + BITS_PER_LONG - 1) / BITS_PER_LONG;

unsigned int, and "size" is far too generic of a name.

> +    struct vcpu *v, *dest = NULL;
> +    int i;

Also unsigned.

> +
> +    dest_vcpu_bitmap = xzalloc_array(unsigned long, size);
> +    if ( !dest_vcpu_bitmap )
> +    {
> +        dprintk(XENLOG_G_INFO,
> +                "dom%d: failed to allocate memory\n", d->domain_id);
> +        return NULL;
> +    }
> +
> +    for_each_vcpu ( d, v )
> +    {
> +        if ( !vlapic_match_dest(vcpu_vlapic(v), NULL, 0,
> +                                dest_id, dest_mode) )
> +            continue;
> +
> +        __set_bit(v->vcpu_id, dest_vcpu_bitmap);
> +        dest_vcpu_num++;
> +    }
> +
> +    if ( delivery_mode == dest_LowestPrio )
> +    {
> +        if (  dest_vcpu_num != 0 )

Too many spaces inside the brackets.

> +        {
> +            for ( i = 0; i <= gvec % dest_vcpu_num; i++)
> +                idx = find_next_bit(dest_vcpu_bitmap, d->max_vcpus, idx) + 1;
> +            idx--;
> +
> +            BUG_ON(idx >= d->max_vcpus || idx < 0);
> +            dest = d->vcpu[idx];
> +        }
> +    }
> +    else if (  dest_vcpu_num == 1 )
> +    {
> +        idx = find_first_bit(dest_vcpu_bitmap, d->max_vcpus);
> +        BUG_ON(idx >= d->max_vcpus || idx < 0);

find_first_bit() is unsigned, so can never be less than 0.
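
For what it is worth, the selection logic can be written with unsigned types throughout and no negative checks. A minimal standalone sketch follows; it uses a plain uint64_t array instead of Xen's bitmap helpers, and the function names are hypothetical. It returns nr_vcpus when no vCPU matches.

```c
#include <stdint.h>

#define WORD_BITS 64u

static int bit_is_set(const uint64_t *bitmap, unsigned int bit)
{
    return (bitmap[bit / WORD_BITS] >> (bit % WORD_BITS)) & 1;
}

/*
 * Pick the (gvec % count)-th set bit, where count is the number of
 * candidate vCPUs in the bitmap. Returns nr_vcpus if the bitmap is empty.
 */
static unsigned int pick_lowest_prio_dest(const uint64_t *bitmap,
                                          unsigned int nr_vcpus, uint8_t gvec)
{
    unsigned int i, count = 0, target, seen = 0;

    for ( i = 0; i < nr_vcpus; i++ )
        count += bit_is_set(bitmap, i);

    if ( count == 0 )
        return nr_vcpus;

    target = gvec % count;

    for ( i = 0; i < nr_vcpus; i++ )
        if ( bit_is_set(bitmap, i) && seen++ == target )
            return i;

    return nr_vcpus;
}
```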

> +        dest = d->vcpu[idx];
> +    }
> +
> +    xfree(dest_vcpu_bitmap);
> +
> +    return dest;
> +}
> +
>  int pt_irq_create_bind(
>      struct domain *d, xen_domctl_bind_pt_irq_t *pt_irq_bind)
>  {
> @@ -257,7 +330,7 @@ int pt_irq_create_bind(
>      {
>      case PT_IRQ_TYPE_MSI:
>      {
> -        uint8_t dest, dest_mode;
> +        uint8_t dest, dest_mode, delivery_mode;
>          int dest_vcpu_id;
>  
>          if ( !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
> @@ -330,11 +403,32 @@ int pt_irq_create_bind(
>          /* Calculate dest_vcpu_id for MSI-type pirq migration. */
>          dest = pirq_dpci->gmsi.gflags & VMSI_DEST_ID_MASK;
>          dest_mode = !!(pirq_dpci->gmsi.gflags & VMSI_DM_MASK);
> +        delivery_mode = (pirq_dpci->gmsi.gflags >> GFLAGS_SHIFT_DELIV_MODE) &
> +                        VMSI_DELIV_MASK;
>          dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
>          pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
>          spin_unlock(&d->event_lock);
>          if ( dest_vcpu_id >= 0 )
>              hvm_migrate_pirqs(d->vcpu[dest_vcpu_id]);
> +
> +        /* Use interrupt posting if it is supported */
> +        if ( iommu_intpost )
> +        {
> +            struct vcpu *vcpu = pi_find_dest_vcpu(d, dest, dest_mode,
> +                                        delivery_mode, pirq_dpci->gmsi.gvec);
> +
> +            if ( !vcpu )
> +                dprintk(XENLOG_G_WARNING,
> +                        "dom%u: failed to find the dest vCPU for PI, guest "
> +                        "vector:0x%x use software way to deliver the "
> +                        " interrupts.\n", d->domain_id, pirq_dpci->gmsi.gvec);

If this is normal for a multicast interrupt, it must not be a WARNING
level error message.  It probably shouldn't even be a message at all.

~Andrew

> +            else if ( pi_update_irte( vcpu, info, pirq_dpci->gmsi.gvec ) != 0 )
> +                dprintk(XENLOG_G_WARNING,
> +                        "%pv: failed to update PI IRTE, guest vector:0x%x "
> +                        "use software way to deliver the interrupts.\n",
> +                        vcpu, pirq_dpci->gmsi.gvec);
> +        }
> +
>          break;
>      }
>  

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-06-24  5:18 ` [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked Feng Wu
@ 2015-06-29 17:07   ` Andrew Cooper
  2015-07-08 10:36     ` Wu, Feng
       [not found]   ` <559181F9.6020106@citrix.com>
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 155+ messages in thread
From: Andrew Cooper @ 2015-06-29 17:07 UTC (permalink / raw)
  To: Feng Wu, xen-devel
  Cc: yang.z.zhang, george.dunlap, kevin.tian, keir, jbeulich

On 24/06/15 06:18, Feng Wu wrote:
> This patch includes the following aspects:
> - Add a global vector to wake up the blocked vCPU
>   when an interrupt is being posted to it (This
>   part was suggested by Yang Zhang <yang.z.zhang@intel.com>).
> - Adds a new per-vCPU tasklet to wakeup the blocked
>   vCPU. It can be used in the case vcpu_unblock
>   cannot be called directly.
> - Define two per-cpu variables:
>       * pi_blocked_vcpu:
>       A list storing the vCPUs which were blocked on this pCPU.
>
>       * pi_blocked_vcpu_lock:
>       The spinlock to protect pi_blocked_vcpu.
>
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
> v3:
> - This patch is generated by merging the following three patches in v2:
>    [RFC v2 09/15] Add a new per-vCPU tasklet to wakeup the blocked vCPU
>    [RFC v2 10/15] vmx: Define two per-cpu variables
>    [RFC v2 11/15] vmx: Add a global wake-up vector for VT-d Posted-Interrupts
> - rename 'vcpu_wakeup_tasklet' to 'pi_vcpu_wakeup_tasklet'
> - Move the definition of 'pi_vcpu_wakeup_tasklet' to 'struct arch_vmx_struct'
> - rename 'vcpu_wakeup_tasklet_handler' to 'pi_vcpu_wakeup_tasklet_handler'
> - Make pi_wakeup_interrupt() static
> - Rename 'blocked_vcpu_list' to 'pi_blocked_vcpu_list'
> - move 'pi_blocked_vcpu_list' to 'struct arch_vmx_struct'
> - Rename 'blocked_vcpu' to 'pi_blocked_vcpu'
> - Rename 'blocked_vcpu_lock' to 'pi_blocked_vcpu_lock'
>
>  xen/arch/x86/hvm/vmx/vmcs.c        |  3 +++
>  xen/arch/x86/hvm/vmx/vmx.c         | 54 ++++++++++++++++++++++++++++++++++++++
>  xen/include/asm-x86/hvm/hvm.h      |  1 +
>  xen/include/asm-x86/hvm/vmx/vmcs.h |  5 ++++
>  xen/include/asm-x86/hvm/vmx/vmx.h  |  5 ++++
>  5 files changed, 68 insertions(+)
>
> diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
> index 11dc1b5..0c5ce3f 100644
> --- a/xen/arch/x86/hvm/vmx/vmcs.c
> +++ b/xen/arch/x86/hvm/vmx/vmcs.c
> @@ -631,6 +631,9 @@ int vmx_cpu_up(void)
>      if ( cpu_has_vmx_vpid )
>          vpid_sync_all();
>  
> +    INIT_LIST_HEAD(&per_cpu(pi_blocked_vcpu, cpu));
> +    spin_lock_init(&per_cpu(pi_blocked_vcpu_lock, cpu));
> +
>      return 0;
>  }
>  
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index b94ef6a..7db6009 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -82,7 +82,20 @@ static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content);
>  static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content);
>  static void vmx_invlpg_intercept(unsigned long vaddr);
>  
> +/*
> + * We maintain a per-CPU linked list of vCPUs, so that the PI wakeup
> + * handler can find which vCPU should be woken up.
> + */
> +DEFINE_PER_CPU(struct list_head, pi_blocked_vcpu);
> +DEFINE_PER_CPU(spinlock_t, pi_blocked_vcpu_lock);
> +
>  uint8_t __read_mostly posted_intr_vector;
> +uint8_t __read_mostly pi_wakeup_vector;
> +
> +static void pi_vcpu_wakeup_tasklet_handler(unsigned long arg)
> +{
> +    vcpu_unblock((struct vcpu *)arg);
> +}
>  
>  static int vmx_domain_initialise(struct domain *d)
>  {
> @@ -148,11 +161,19 @@ static int vmx_vcpu_initialise(struct vcpu *v)
>      if ( v->vcpu_id == 0 )
>          v->arch.user_regs.eax = 1;
>  
> +    tasklet_init(
> +        &v->arch.hvm_vmx.pi_vcpu_wakeup_tasklet,
> +        pi_vcpu_wakeup_tasklet_handler,
> +        (unsigned long)v);

c/s f6dd295 indicates that the global tasklet lock causes a bottleneck
when injecting interrupts, and replaced a tasklet with a softirq to fix
the scalability issue.

I would expect exactly the same bottleneck to exist here.

> +
> +    INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> +
>      return 0;
>  }
>  
>  static void vmx_vcpu_destroy(struct vcpu *v)
>  {
> +    tasklet_kill(&v->arch.hvm_vmx.pi_vcpu_wakeup_tasklet);
>      /*
>       * There are cases that domain still remains in log-dirty mode when it is
>       * about to be destroyed (ex, user types 'xl destroy <dom>'), in which case
> @@ -1848,6 +1869,33 @@ static struct hvm_function_table __initdata vmx_function_table = {
>      .enable_msr_exit_interception = vmx_enable_msr_exit_interception,
>  };
>  
> +/*
> + * Handle VT-d posted-interrupt when VCPU is blocked.
> + */
> +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> +{
> +    struct arch_vmx_struct *vmx;
> +    unsigned int cpu = smp_processor_id();
> +
> +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));

this_cpu($foo) should be used in preference to per_cpu($foo, $myself).

However, always hoist repeated uses of this/per_cpu into local
variables, as the compiler is unable to elide repeated accesses (because
of a deliberate anti-optimisation behind the scenes).

spinlock_t *lock = &this_cpu(pi_blocked_vcpu_lock);
struct list_head *blocked_vcpus = &this_cpu(pi_blocked_vcpu);

~Andrew

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 07/15] vmx: Initialize VT-d Posted-Interrupts Descriptor
  2015-06-29 15:32   ` Andrew Cooper
@ 2015-06-30  1:46     ` Wu, Feng
  2015-06-30  2:32     ` Dario Faggioli
  1 sibling, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-06-30  1:46 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel@lists.xen.org
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	Dario Faggioli, jbeulich@suse.com, Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Monday, June 29, 2015 11:33 PM
> To: Wu, Feng; xen-devel@lists.xen.org
> Cc: keir@xen.org; jbeulich@suse.com; Tian, Kevin; Zhang, Yang Z;
> george.dunlap@eu.citrix.com; Dario Faggioli
> Subject: Re: [v3 07/15] vmx: Initialize VT-d Posted-Interrupts Descriptor
> 
> On 24/06/15 06:18, Feng Wu wrote:
> > This patch initializes the VT-d Posted-interrupt Descriptor.
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > ---
> > v3:
> > - Move pi_desc_init() to xen/arch/x86/hvm/vmx/vmcs.c
> > - Remove the 'inline' flag of pi_desc_init()
> >
> >  xen/arch/x86/hvm/vmx/vmcs.c       | 18 ++++++++++++++++++
> >  xen/include/asm-x86/hvm/vmx/vmx.h |  2 ++
> >  2 files changed, 20 insertions(+)
> >
> > diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
> > index 3aff365..11dc1b5 100644
> > --- a/xen/arch/x86/hvm/vmx/vmcs.c
> > +++ b/xen/arch/x86/hvm/vmx/vmcs.c
> > @@ -40,6 +40,7 @@
> >  #include <asm/flushtlb.h>
> >  #include <asm/shadow.h>
> >  #include <asm/tboot.h>
> > +#include <asm/apic.h>
> >
> >  static bool_t __read_mostly opt_vpid_enabled = 1;
> >  boolean_param("vpid", opt_vpid_enabled);
> > @@ -921,6 +922,20 @@ void virtual_vmcs_vmwrite(void *vvmcs, u32 vmcs_encoding, u64 val)
> >      virtual_vmcs_exit(vvmcs);
> >  }
> >
> > +static void pi_desc_init(struct vcpu *v)
> > +{
> > +    uint32_t dest;
> > +
> > +    v->arch.hvm_vmx.pi_desc.nv = posted_intr_vector;
> > +
> > +    dest = cpu_physical_id(v->processor);
> 
> I am fairly sure that this is not a safe use of v->processor.
> Everything else in this patch looks fine, but I would like review from
> people more familiar with scheduling.

It would be very helpful if you could elaborate a bit on why it is not
safe to use v->processor here.

Thanks,
Feng

> 
> ~Andrew
> 
> > +
> > +    if ( x2apic_enabled )
> > +        v->arch.hvm_vmx.pi_desc.ndst = dest;
> > +    else
> > +        v->arch.hvm_vmx.pi_desc.ndst = MASK_INSR(dest, PI_xAPIC_NDST_MASK);
> > +}
> > +
> >  static int construct_vmcs(struct vcpu *v)
> >  {
> >      struct domain *d = v->domain;
> > @@ -1054,6 +1069,9 @@ static int construct_vmcs(struct vcpu *v)
> >
> >      if ( cpu_has_vmx_posted_intr_processing )
> >      {
> > +        if ( iommu_intpost )
> > +            pi_desc_init(v);
> > +
> >          __vmwrite(PI_DESC_ADDR, virt_to_maddr(&v->arch.hvm_vmx.pi_desc));
> >          __vmwrite(POSTED_INTR_NOTIFICATION_VECTOR, posted_intr_vector);
> >      }
> > diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
> > index 35f804a..5853563 100644
> > --- a/xen/include/asm-x86/hvm/vmx/vmx.h
> > +++ b/xen/include/asm-x86/hvm/vmx/vmx.h
> > @@ -89,6 +89,8 @@ typedef enum {
> >  #define EPT_EMT_WB              6
> >  #define EPT_EMT_RSV2            7
> >
> > +#define PI_xAPIC_NDST_MASK      0xFF00
> > +
> >  void vmx_asm_vmexit_handler(struct cpu_user_regs);
> >  void vmx_asm_do_vmentry(void);
> >  void vmx_intr_assist(void);

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 08/15] Suppress posting interrupts when 'SN' is set
  2015-06-29 15:41   ` Andrew Cooper
@ 2015-06-30  1:48     ` Wu, Feng
  0 siblings, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-06-30  1:48 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel@lists.xen.org
  Cc: Tian, Kevin, Wu, Feng, george.dunlap@eu.citrix.com,
	jbeulich@suse.com, Zhang, Yang Z, keir@xen.org



> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Monday, June 29, 2015 11:41 PM
> To: Wu, Feng; xen-devel@lists.xen.org
> Cc: Tian, Kevin; keir@xen.org; george.dunlap@eu.citrix.com;
> jbeulich@suse.com; Zhang, Yang Z
> Subject: Re: [Xen-devel] [v3 08/15] Suppress posting interrupts when 'SN' is set
> 
> On 24/06/15 06:18, Feng Wu wrote:
> > Currently, we don't support urgent interrupt, all interrupts
> > are recognized as non-urgent interrupt, so we cannot send
> > posted-interrupt when 'SN' is set.
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > ---
> > v3:
> > use cmpxchg to test SN/ON and set ON
> >
> >  xen/arch/x86/hvm/vmx/vmx.c | 32 ++++++++++++++++++++++++++++----
> >  1 file changed, 28 insertions(+), 4 deletions(-)
> >
> > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> > index 0837627..b94ef6a 100644
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -1686,6 +1686,8 @@ static void __vmx_deliver_posted_interrupt(struct vcpu *v)
> >
> >  static void vmx_deliver_posted_intr(struct vcpu *v, u8 vector)
> >  {
> > +    struct pi_desc old, new, prev;
> 
> These should be moved into the else clause below, to reduce their scope.
> 
> > +
> >      if ( pi_test_and_set_pir(vector, &v->arch.hvm_vmx.pi_desc) )
> >          return;
> >
> > @@ -1698,13 +1700,35 @@ static void vmx_deliver_posted_intr(struct vcpu *v, u8 vector)
> >           */
> >          pi_set_on(&v->arch.hvm_vmx.pi_desc);
> >      }
> > -    else if ( !pi_test_and_set_on(&v->arch.hvm_vmx.pi_desc) )
> > +    else
> >      {
> > +        prev.control = 0;
> > +
> > +        do {
> > +            old.control = v->arch.hvm_vmx.pi_desc.control &
> > +                          ~(1 << POSTED_INTR_ON | 1 << POSTED_INTR_SN);
> 
> Brackets around each of the 1 << $NNN please.
> 
> > +            new.control = v->arch.hvm_vmx.pi_desc.control |
> > +                          1 << POSTED_INTR_ON;
> > +
> > +            /*
> > +             * Currently, we don't support urgent interrupt, all
> > +             * interrupts are recognized as non-urgent interrupt,
> > +             * so we cannot send posted-interrupt when 'SN' is set.
> > +             * Besides that, if 'ON' is already set, we cannot set
> > +             * posted-interrupts as well.
> > +             */
> > +            if ( prev.sn || prev.on )
> > +            {
> > +                vcpu_kick(v);
> > +                return;
> > +            }
> > +
> > +            prev.control = cmpxchg(&v->arch.hvm_vmx.pi_desc.control,
> > +                                   old.control, new.control);
> > +        } while ( prev.control != old.control );
> > +
> >          __vmx_deliver_posted_interrupt(v);
> > -        return;
> >      }
> > -
> > -    vcpu_kick(v);
> 
> This removes a vcpu_kick() from the eoi_exitmap_changed path, which I
> suspect is not what you intend.

Oops.. thanks for pointing it out!

Thanks,
Feng

> 
> ~Andrew
> 
> >  }
> >
> >  static void vmx_sync_pir_to_irr(struct vcpu *v)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 09/15] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts
  2015-06-29 16:04   ` Andrew Cooper
@ 2015-06-30  1:52     ` Wu, Feng
  0 siblings, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-06-30  1:52 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel@lists.xen.org
  Cc: Tian, Kevin, Wu, Feng, george.dunlap@eu.citrix.com,
	jbeulich@suse.com, Zhang, Yang Z, keir@xen.org



> -----Original Message-----
> From: xen-devel-bounces@lists.xen.org
> [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of Andrew Cooper
> Sent: Tuesday, June 30, 2015 12:05 AM
> To: Wu, Feng; xen-devel@lists.xen.org
> Cc: Zhang, Yang Z; george.dunlap@eu.citrix.com; Tian, Kevin; keir@xen.org;
> jbeulich@suse.com
> Subject: Re: [Xen-devel] [v3 09/15] vt-d: Extend struct iremap_entry to support
> VT-d Posted-Interrupts
> 
> On 24/06/15 06:18, Feng Wu wrote:
> > diff --git a/xen/drivers/passthrough/vtd/iommu.h b/xen/drivers/passthrough/vtd/iommu.h
> > index e807253..49daa70 100644
> > --- a/xen/drivers/passthrough/vtd/iommu.h
> > +++ b/xen/drivers/passthrough/vtd/iommu.h
> > @@ -289,29 +289,43 @@ struct dma_pte {
> >  /* interrupt remap entry */
> >  struct iremap_entry {
> >    union {
> > -    u64 lo_val;
> > +    struct { u64 lo, hi; };
> >      struct {
> > -        u64 p       : 1,
> > +        u16 p       : 1,
> >              fpd     : 1,
> >              dm      : 1,
> >              rh      : 1,
> >              tm      : 1,
> >              dlm     : 3,
> >              avail   : 4,
> > -            res_1   : 4,
> > -            vector  : 8,
> > -            res_2   : 8,
> > -            dst     : 32;
> > -    }lo;
> > -  };
> > -  union {
> > -    u64 hi_val;
> > +            res_1   : 4;
> > +        u8  vector;
> > +        u8  res_2;
> > +        u32 dst;
> > +        u16 sid;
> > +        u16 sq      : 2,
> > +            svt     : 2,
> > +            res_3   : 12;
> > +        u32 res_4   : 32;
> 
> res_4 does not need to be a bitfield.
> 
> > +    } remap;
> >      struct {
> > -        u64 sid     : 16,
> > -            sq      : 2,
> > +        u16 p       : 1,
> > +            fpd     : 1,
> > +            res_1   : 6,
> > +            avail   : 4,
> > +            res_2   : 2,
> > +            urg     : 1,
> > +            im      : 1;
> 
> I think "im" needs exposing in both the post and remap unions, as it is
> the bit which identifies which representation to use.

Reasonable.

> 
> > +        u8  vector;
> > +        u8  res_3;
> > +        u32 res_4   : 6,
> > +            pda_l   : 26;
> > +        u16 sid;
> > +        u16 sq      : 2,
> >              svt     : 2,
> > -            res_1   : 44;
> > -    }hi;
> > +            res_5   : 12;
> > +        u32 pda_h;
> > +    } post;
> >    };
> >  };
> >
> > diff --git a/xen/drivers/passthrough/vtd/utils.c b/xen/drivers/passthrough/vtd/utils.c
> > index bd14c02..a5fe237 100644
> > --- a/xen/drivers/passthrough/vtd/utils.c
> > +++ b/xen/drivers/passthrough/vtd/utils.c
> > @@ -238,14 +238,14 @@ static void dump_iommu_info(unsigned char key)
> >                  else
> >                      p = &iremap_entries[i % (1 << IREMAP_ENTRY_ORDER)];
> >
> > -                if ( !p->lo.p )
> > +                if ( !p->remap.p )
> >                      continue;
> >                  printk("  %04x:  %x   %x  %04x %08x %02x    %x   %x  %x  %x  %x"
> >                      "   %x %x\n", i,
> > -                    (u32)p->hi.svt, (u32)p->hi.sq, (u32)p->hi.sid,
> > -                    (u32)p->lo.dst, (u32)p->lo.vector, (u32)p->lo.avail,
> > -                    (u32)p->lo.dlm, (u32)p->lo.tm, (u32)p->lo.rh,
> > -                    (u32)p->lo.dm, (u32)p->lo.fpd, (u32)p->lo.p);
> > +                    (u32)p->remap.svt, (u32)p->remap.sq, (u32)p->remap.sid,
> > +                    (u32)p->remap.dst, (u32)p->remap.vector, (u32)p->remap.avail,
> > +                    (u32)p->remap.dlm, (u32)p->remap.tm, (u32)p->remap.rh,
> > +                    (u32)p->remap.dm, (u32)p->remap.fpd, (u32)p->remap.p);
> 
> This printing is only valid if "im" is 0.  As this series adds support
> for the posted format, I would suggest you extend this debugging here to
> deal with both formats.

Good suggestion. As this patch only adjusts for the changes to
'struct iremap_entry', I will add the new logic in a separate patch.

Thanks,
Feng

> 
> ~Andrew
> 
> >                  print_cnt++;
> >              }
> >              if ( iremap_entries )
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 07/15] vmx: Initialize VT-d Posted-Interrupts Descriptor
  2015-06-29 15:32   ` Andrew Cooper
  2015-06-30  1:46     ` Wu, Feng
@ 2015-06-30  2:32     ` Dario Faggioli
  1 sibling, 0 replies; 155+ messages in thread
From: Dario Faggioli @ 2015-06-30  2:32 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: kevin.tian, keir, george.dunlap, xen-devel, jbeulich,
	yang.z.zhang, Feng Wu


[-- Attachment #1.1: Type: text/plain, Size: 2183 bytes --]

On Mon, 2015-06-29 at 16:32 +0100, Andrew Cooper wrote:
> On 24/06/15 06:18, Feng Wu wrote:

> > diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
> > index 3aff365..11dc1b5 100644
> > --- a/xen/arch/x86/hvm/vmx/vmcs.c
> > +++ b/xen/arch/x86/hvm/vmx/vmcs.c
> > @@ -40,6 +40,7 @@
> >  #include <asm/flushtlb.h>
> >  #include <asm/shadow.h>
> >  #include <asm/tboot.h>
> > +#include <asm/apic.h>
> >  
> >  static bool_t __read_mostly opt_vpid_enabled = 1;
> >  boolean_param("vpid", opt_vpid_enabled);
> > @@ -921,6 +922,20 @@ void virtual_vmcs_vmwrite(void *vvmcs, u32 vmcs_encoding, u64 val)
> >      virtual_vmcs_exit(vvmcs);
> >  }
> >  
> > +static void pi_desc_init(struct vcpu *v)
> > +{
> > +    uint32_t dest;
> > +
> > +    v->arch.hvm_vmx.pi_desc.nv = posted_intr_vector;
> > +
> > +    dest = cpu_physical_id(v->processor);
> 
> I am fairly sure that this is not a safe use of v->processor. 
> Everything else in this patch looks fine, but I would like review from
> people more familiar with scheduling.
> 
It is unsafe, IMO.

However, this is called by vcpu_initialise(), within alloc_vcpu(), which
is in turn triggered by XEN_DOMCTL_max_vcpus. When we get there, the
vcpu is RUNSTATE_offline, it's got _VPF_down in pause_flags, and the
whole domain is paused.

Honestly, I don't think there is much chance of v->processor changing
under our feet in this case. A comment briefly explaining this would
be useful, though, I think.

George?

In any case, even if v->processor were at risk of changing, Feng,
what's the risk of going live with spurious content in dest? Is it
just that the first posted interrupt will be consumed by the wrong vCPU,
but after that, things will settle properly? If yes, I think that would
be ok too (but really, I think things are safe, although by chance, in
this case).

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
       [not found]   ` <559181F9.6020106@citrix.com>
@ 2015-06-30  2:51     ` Dario Faggioli
  2015-06-30  2:59       ` Wu, Feng
  0 siblings, 1 reply; 155+ messages in thread
From: Dario Faggioli @ 2015-06-30  2:51 UTC (permalink / raw)
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel,
	jbeulich, yang.z.zhang, feng.wu


[-- Attachment #1.1: Type: text/plain, Size: 3716 bytes --]

So, first of all, thanks Andrew for drawing my attention to this...

On Mon, 2015-06-29 at 18:35 +0100, Andrew Cooper wrote:
-- 
Dario Faggioli <dario.faggioli@citrix.com>
Citrix Inc.
> This patch includes the following aspects:
> - Add a global vector to wake up the blocked vCPU
>   when an interrupt is being posted to it (This
>   part was sugguested by Yang Zhang <yang.z.zhang@intel.com>).
> - Adds a new per-vCPU tasklet to wakeup the blocked
>   vCPU. It can be used in the case vcpu_unblock
>   cannot be called directly.
> - Define two per-cpu variables:
>       * pi_blocked_vcpu:
>       A list storing the vCPUs which were blocked on this pCPU.
> 
>       * pi_blocked_vcpu_lock:
>       The spinlock to protect pi_blocked_vcpu.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
>
I do have a question. I don't have much expertise with this part of the
code base (HVM stuff, low level details of event/interrupt delivery),
so, it may be a stupid one, in which case, sorry in advance for the
noise.

> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index b94ef6a..7db6009 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>  
> +/*
> + * Handle VT-d posted-interrupt when VCPU is blocked.
> + */
> +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> +{
> +    struct arch_vmx_struct *vmx;
> +    unsigned int cpu = smp_processor_id();
> +
> +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
> +
> +    /*
> +     * FIXME: The length of the list depends on how many
> +     * vCPU is current blocked on this specific pCPU.
> +     * This may hurt the interrupt latency if the list
> +     * grows to too many entries.
> +     */
> +    list_for_each_entry(vmx, &per_cpu(pi_blocked_vcpu, cpu),
> +                        pi_blocked_vcpu_list)
> +        if ( vmx->pi_desc.on )
> +            tasklet_schedule(&vmx->pi_vcpu_wakeup_tasklet);
> +
> +    spin_unlock(&per_cpu(pi_blocked_vcpu_lock, cpu));
> +
> +    ack_APIC_irq();
> +    this_cpu(irq_count)++;
> +}
> +
Quoting the design document in patch 1:

+Here is the scenario for the usage of the new global vector:
+
+1. vCPU0 is running on pCPU0
+2. vCPU0 is blocked and vCPU1 is currently running on pCPU0
+3. An external interrupt from an assigned device occurs for vCPU0, if we
+still use 'posted_intr_vector' as the notification vector for vCPU0, the
+notification event for vCPU0 (the event will go to pCPU1) will be consumed
+by vCPU1 incorrectly (remember this is a special vector to CPU). The worst
+case is that vCPU0 will never be woken up again since the wakeup event
+for it is always consumed by other vCPUs incorrectly. So we need introduce
+another global vector, naming 'pi_wakeup_vector' to wake up the blocked vCPU.
+
+After using 'pi_wakeup_vector' for vCPU0, VT-d engine will issue notification
+event using this new vector. Since this new vector is not a SPECIAL one to CPU,
+it is just a normal vector. To cpu, it just receives an normal external interrupt,
+then we can get control in the handler of this new vector. In this case, hypervisor
+can do something in it, such as wakeup the blocked vCPU.

Let's assume that there are two vCPUs blocked, waiting for a (posted)
interrupt, on pCPU0, and that they are vCPU2 and vCPU4, while vCPU12 is
running there.

AFAIU the code above, when an interrupt arrives on pCPU0, you scan the
list, find both vCPU2 and vCPU4, which both have pi_desc.on set to true,
and hence you kick (via the tasklet) both of them?
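
To make the scenario concrete, here is a minimal user-space model of the
loop as posted. It is only a sketch: struct blocked_vcpu, the kicks
counter and both function names are hypothetical stand-ins (the counter
replaces tasklet_schedule()), and it assumes that vCPU2 and vCPU4 really
can both have pi_desc.on set at the same time, which is exactly the
point in question.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry on a pCPU's pi_blocked_vcpu list. */
struct blocked_vcpu {
    int vcpu_id;
    bool on;                    /* models pi_desc.on */
    unsigned int kicks;         /* counts calls that would be tasklet_schedule() */
    struct blocked_vcpu *next;
};

/* Models the posted loop: every entry with ON set is kicked, none is removed. */
static void wakeup_handler_as_posted(struct blocked_vcpu *head)
{
    struct blocked_vcpu *v;

    for ( v = head; v != NULL; v = v->next )
        if ( v->on )
            v->kicks++;
}

/*
 * vCPU2 and vCPU4 are blocked on the same pCPU, both with ON set.
 * Deliver 'notifications' wakeup interrupts and return the total
 * number of kicks issued to the two of them.
 */
static unsigned int kicks_after(unsigned int notifications)
{
    struct blocked_vcpu vcpu4 = { .vcpu_id = 4, .on = true, .kicks = 0, .next = NULL };
    struct blocked_vcpu vcpu2 = { .vcpu_id = 2, .on = true, .kicks = 0, .next = &vcpu4 };
    unsigned int i;

    for ( i = 0; i < notifications; i++ )
        wakeup_handler_as_posted(&vcpu2);

    return vcpu2.kicks + vcpu4.kicks;
}
```

In this model one notification kicks both vCPUs, and each further
notification kicks both of them again, since nothing clears ON or takes
them off the list in between.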

Again, if this can't happen due to some details of PI that I'm unaware
of, sorry... :-)

Regards,
Dario







^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
       [not found]   ` <55918214.4030102@citrix.com>
@ 2015-06-30  2:58     ` Dario Faggioli
  2015-07-02  4:32       ` Wu, Feng
  0 siblings, 1 reply; 155+ messages in thread
From: Dario Faggioli @ 2015-06-30  2:58 UTC (permalink / raw)
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel,
	jbeulich, yang.z.zhang, feng.wu


[-- Attachment #1.1: Type: text/plain, Size: 1543 bytes --]

On Mon, 2015-06-29 at 18:36 +0100, Andrew Cooper wrote:

> 
> The basic idea here is:
> 1. When vCPU's state is RUNSTATE_running,
>         - set 'NV' to 'Notification Vector'.
>         - Clear 'SN' to accpet PI.
>         - set 'NDST' to the right pCPU.
> 2. When vCPU's state is RUNSTATE_blocked,
>         - set 'NV' to 'Wake-up Vector', so we can wake up the
>           related vCPU when posted-interrupt happens for it.
>         - Clear 'SN' to accpet PI.
> 3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
>         - Set 'SN' to suppress non-urgent interrupts.
>           (Current, we only support non-urgent interrupts)
>         - Set 'NV' back to 'Notification Vector' if needed.
> 
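
The three cases above can be sketched as a single update function. This
is a simplified model, not the patch itself: struct pi_desc_model, enum
runstate_model and the two vector values are placeholders made up for
illustration, and the plain stores here ignore that the real descriptor
is shared with hardware and has to be updated atomically.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the RUNSTATE_* values named above. */
enum runstate_model { RS_RUNNING, RS_RUNNABLE, RS_BLOCKED, RS_OFFLINE };

/* Only the descriptor fields the description talks about. */
struct pi_desc_model {
    uint8_t  nv;    /* notification vector */
    bool     sn;    /* suppress notification */
    uint32_t ndst;  /* notification destination (pCPU) */
};

/* Placeholder vector numbers, chosen only so the two can be told apart. */
#define MODEL_NOTIFICATION_VECTOR 0xf2
#define MODEL_WAKEUP_VECTOR       0xf1

static void pi_desc_update(struct pi_desc_model *d,
                           enum runstate_model new_state, uint32_t dest)
{
    switch ( new_state )
    {
    case RS_RUNNING:
        d->nv = MODEL_NOTIFICATION_VECTOR;
        d->sn = false;                     /* accept PI */
        d->ndst = dest;                    /* the pCPU we now run on */
        break;

    case RS_BLOCKED:
        d->nv = MODEL_WAKEUP_VECTOR;       /* so the hypervisor gets control */
        d->sn = false;                     /* accept PI */
        break;

    case RS_RUNNABLE:
    case RS_OFFLINE:
        d->sn = true;                      /* suppress non-urgent interrupts */
        d->nv = MODEL_NOTIFICATION_VECTOR; /* back to the normal vector */
        break;
    }
}
```

Note that NDST is only touched in the running case, as in the list
above; the blocked case leaves it at whatever the last running update
set.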
It might be me, but it feels a bit odd to see RUNSTATE-s being (ab)used
directly for this, as it does feel odd to see arch specific code being
added in there.

Can't this be done in context_switch(), which is already architecture
specific? I was thinking of something very similar to what has been done
for PSR, i.e., on x86, put everything in __context_switch().

Looking at who's prev and who's next, and at what pause_flags each has
set, you should be able to implement all of the above logic.

Or am I missing something?

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-06-30  2:51     ` Fwd: " Dario Faggioli
@ 2015-06-30  2:59       ` Wu, Feng
  2015-06-30  9:46         ` Dario Faggioli
  0 siblings, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-06-30  2:59 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel, jbeulich@suse.com,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Tuesday, June 30, 2015 10:52 AM
> To: Wu, Feng
> Cc: keir@xen.org; jbeulich@suse.com; andrew.cooper3@citrix.com; Tian,
> Kevin; Zhang, Yang Z; george.dunlap@eu.citrix.com; Wu, Feng; xen-devel
> Subject: Re: Fwd: [v3 12/15] vmx: posted-interrupt handling when vCPU is
> blocked
> 
> So, first of all, thanks Andrew for drawing my attention on this...
> 
> On Mon, 2015-06-29 at 18:35 +0100, Andrew Cooper wrote:
> --
> Dario Faggioli <dario.faggioli@citrix.com>
> Citrix Inc.
> > This patch includes the following aspects:
> > - Add a global vector to wake up the blocked vCPU
> >   when an interrupt is being posted to it (This
> >   part was sugguested by Yang Zhang <yang.z.zhang@intel.com>).
> > - Adds a new per-vCPU tasklet to wakeup the blocked
> >   vCPU. It can be used in the case vcpu_unblock
> >   cannot be called directly.
> > - Define two per-cpu variables:
> >       * pi_blocked_vcpu:
> >       A list storing the vCPUs which were blocked on this pCPU.
> >
> >       * pi_blocked_vcpu_lock:
> >       The spinlock to protect pi_blocked_vcpu.
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > ---
> >
> I do have a question. I don't have much expertise with this part of the
> code base (HVM stuff, low level details of event/interrupt delivery),
> so, it may be a stupid one, in which case, sorry in advance for the
> noise.

No problem, any comments are welcome! :)

> 
> > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> > index b94ef6a..7db6009 100644
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> >
> > +/*
> > + * Handle VT-d posted-interrupt when VCPU is blocked.
> > + */
> > +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> > +{
> > +    struct arch_vmx_struct *vmx;
> > +    unsigned int cpu = smp_processor_id();
> > +
> > +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
> > +
> > +    /*
> > +     * FIXME: The length of the list depends on how many
> > +     * vCPU is current blocked on this specific pCPU.
> > +     * This may hurt the interrupt latency if the list
> > +     * grows to too many entries.
> > +     */
> > +    list_for_each_entry(vmx, &per_cpu(pi_blocked_vcpu, cpu),
> > +                        pi_blocked_vcpu_list)
> > +        if ( vmx->pi_desc.on )
> > +            tasklet_schedule(&vmx->pi_vcpu_wakeup_tasklet);
> > +
> > +    spin_unlock(&per_cpu(pi_blocked_vcpu_lock, cpu));
> > +
> > +    ack_APIC_irq();
> > +    this_cpu(irq_count)++;
> > +}
> > +
> Quoting the design document in patch 1:
> 
> +Here is the scenario for the usage of the new global vector:
> +
> +1. vCPU0 is running on pCPU0
> +2. vCPU0 is blocked and vCPU1 is currently running on pCPU0
> +3. An external interrupt from an assigned device occurs for vCPU0, if we
> +still use 'posted_intr_vector' as the notification vector for vCPU0, the
> +notification event for vCPU0 (the event will go to pCPU1) will be consumed
> +by vCPU1 incorrectly (remember this is a special vector to CPU). The worst
> +case is that vCPU0 will never be woken up again since the wakeup event
> +for it is always consumed by other vCPUs incorrectly. So we need introduce
> +another global vector, naming 'pi_wakeup_vector' to wake up the blocked
> vCPU.
> +
> +After using 'pi_wakeup_vector' for vCPU0, VT-d engine will issue notification
> +event using this new vector. Since this new vector is not a SPECIAL one to
> CPU,
> +it is just a normal vector. To cpu, it just receives an normal external
> interrupt,
> +then we can get control in the handler of this new vector. In this case,
> hypervisor
> +can do something in it, such as wakeup the blocked vCPU.
> 
> Let's assume that there are two vCPUs blocked, waiting for a (posted)
> interrupt, on pCPU0, and that they are vCPU2 and vCPU4, while vCPU12 is
> running there.
> 
> AFAIU the code above, when an interrupt arrives on pCPU0, you scan the
> list, find both vCPU2 and vCPU4, which both have pi_desc.on set to true,
> and hence you kick (via the tasklet) both of them?

Yes, that is the case. Do you have any questions about it?

Thanks,
Feng

> 
> Again, if this can't happen due to some details I ignore about PI,
> sorry... :-)
> 
> Regards,
> Dario
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-06-30  2:59       ` Wu, Feng
@ 2015-06-30  9:46         ` Dario Faggioli
  0 siblings, 0 replies; 155+ messages in thread
From: Dario Faggioli @ 2015-06-30  9:46 UTC (permalink / raw)
  To: Wu, Feng
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel, jbeulich@suse.com,
	Zhang, Yang Z


[-- Attachment #1.1: Type: text/plain, Size: 2412 bytes --]

On Tue, 2015-06-30 at 02:59 +0000, Wu, Feng wrote:

> > Quoting the design document in patch 1:
> > 
> > +Here is the scenario for the usage of the new global vector:
> > +
> > +1. vCPU0 is running on pCPU0
> > +2. vCPU0 is blocked and vCPU1 is currently running on pCPU0
> > +3. An external interrupt from an assigned device occurs for vCPU0, if we
> > +still use 'posted_intr_vector' as the notification vector for vCPU0, the
> > +notification event for vCPU0 (the event will go to pCPU1) will be consumed
> > +by vCPU1 incorrectly (remember this is a special vector to CPU). The worst
> > +case is that vCPU0 will never be woken up again since the wakeup event
> > +for it is always consumed by other vCPUs incorrectly. So we need introduce
> > +another global vector, naming 'pi_wakeup_vector' to wake up the blocked
> > vCPU.
> > +
> > +After using 'pi_wakeup_vector' for vCPU0, VT-d engine will issue notification
> > +event using this new vector. Since this new vector is not a SPECIAL one to
> > CPU,
> > +it is just a normal vector. To cpu, it just receives an normal external
> > interrupt,
> > +then we can get control in the handler of this new vector. In this case,
> > hypervisor
> > +can do something in it, such as wakeup the blocked vCPU.
> > 
> > Let's assume that there are two vCPUs blocked, waiting for a (posted)
> > interrupt, on pCPU0, and that they are vCPU2 and vCPU4, while vCPU12 is
> > running there.
> > 
> > AFAIU the code above, when an interrupt arrives on pCPU0, you scan the
> > list, find both vCPU2 and vCPU4, which both have pi_desc.on set to true,
> > and hence you kick (via the tasklet) both of them?
> 
> Yes, that is the case. Do you have any questions about it?
> 
The question is whether that is how things should work since, from reading
the design document, my understanding was that you wanted a certain
interrupt to wake up a specific vCPU.

But perhaps I'm failing to understand what really happens, and how the
'special vector' vs. 'normal vector' thing works (due to my lack of
confidence in this area).

Am I actually talking nonsense?

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-06-24  5:18 ` [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked Feng Wu
  2015-06-29 17:07   ` Andrew Cooper
       [not found]   ` <559181F9.6020106@citrix.com>
@ 2015-06-30 10:11   ` Andrew Cooper
  2015-07-01 13:26     ` Dario Faggioli
  2015-07-02  4:25     ` Wu, Feng
  2015-07-08 11:00   ` Tian, Kevin
  3 siblings, 2 replies; 155+ messages in thread
From: Andrew Cooper @ 2015-06-30 10:11 UTC (permalink / raw)
  To: Feng Wu, xen-devel
  Cc: yang.z.zhang, george.dunlap, kevin.tian, keir, jbeulich

On 24/06/15 06:18, Feng Wu wrote:
> @@ -1848,6 +1869,33 @@ static struct hvm_function_table __initdata vmx_function_table = {
>      .enable_msr_exit_interception = vmx_enable_msr_exit_interception,
>  };
>  
> +/*
> + * Handle VT-d posted-interrupt when VCPU is blocked.
> + */
> +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> +{
> +    struct arch_vmx_struct *vmx;
> +    unsigned int cpu = smp_processor_id();
> +
> +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
> +
> +    /*
> +     * FIXME: The length of the list depends on how many
> +     * vCPU is current blocked on this specific pCPU.
> +     * This may hurt the interrupt latency if the list
> +     * grows to too many entries.
> +     */
> +    list_for_each_entry(vmx, &per_cpu(pi_blocked_vcpu, cpu),
> +                        pi_blocked_vcpu_list)
> +        if ( vmx->pi_desc.on )
> +            tasklet_schedule(&vmx->pi_vcpu_wakeup_tasklet);

There is a logical bug here.  If we have two NV's delivered to this
pcpu, we will kick the first vcpu twice.

On finding desc.on, a kick should be scheduled, then the vcpu removed
from this list.  With desc.on set, we know for certain that another NV
will not arrive for it until it has been scheduled again and the
interrupt posted.
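
A rough user-space sketch of that "kick, then remove" walk, for
illustration only: struct blocked_vcpu, the kicks counter (standing in
for tasklet_schedule()) and the function name are hypothetical, and the
hand-rolled singly linked list replaces the real per-cpu list and its
lock.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry on a pCPU's pi_blocked_vcpu list. */
struct blocked_vcpu {
    int vcpu_id;
    bool on;                    /* models pi_desc.on */
    unsigned int kicks;         /* counts calls that would be tasklet_schedule() */
    struct blocked_vcpu *next;
};

/*
 * Walk the list once. Each entry with ON set is kicked and unlinked,
 * so a second notification arriving before it runs cannot kick it again.
 */
static void wakeup_handler_dequeue(struct blocked_vcpu **head)
{
    struct blocked_vcpu **link = head;

    while ( *link != NULL )
    {
        struct blocked_vcpu *v = *link;

        if ( v->on )
        {
            v->kicks++;
            *link = v->next;    /* unlink; 'link' already points at the successor */
            v->next = NULL;
        }
        else
            link = &v->next;    /* keep the entry and move on */
    }
}
```

In the hypervisor itself the same shape would presumably be a
removal-safe list walk under pi_blocked_vcpu_lock; the vcpu would then
have to be put back on the list the next time it blocks.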

~Andrew

> +
> +    spin_unlock(&per_cpu(pi_blocked_vcpu_lock, cpu));
> +
> +    ack_APIC_irq();
> +    this_cpu(irq_count)++;
> +}
> +
>  const struct hvm_function_table * __init start_vmx(void)
>  {
>      set_in_cr4(X86_CR4_VMXE);
>

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-06-30 10:11   ` Andrew Cooper
@ 2015-07-01 13:26     ` Dario Faggioli
  2015-07-02  4:27       ` Wu, Feng
  2015-07-02  4:25     ` Wu, Feng
  1 sibling, 1 reply; 155+ messages in thread
From: Dario Faggioli @ 2015-07-01 13:26 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: kevin.tian, keir, george.dunlap, xen-devel, jbeulich,
	yang.z.zhang, Feng Wu


[-- Attachment #1.1: Type: text/plain, Size: 2045 bytes --]

On Tue, 2015-06-30 at 11:11 +0100, Andrew Cooper wrote:
> On 24/06/15 06:18, Feng Wu wrote:

> > +/*
> > + * Handle VT-d posted-interrupt when VCPU is blocked.
> > + */
> > +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> > +{
> > +    struct arch_vmx_struct *vmx;
> > +    unsigned int cpu = smp_processor_id();
> > +
> > +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
> > +
> > +    /*
> > +     * FIXME: The length of the list depends on how many
> > +     * vCPU is current blocked on this specific pCPU.
> > +     * This may hurt the interrupt latency if the list
> > +     * grows to too many entries.
> > +     */
> > +    list_for_each_entry(vmx, &per_cpu(pi_blocked_vcpu, cpu),
> > +                        pi_blocked_vcpu_list)
> > +        if ( vmx->pi_desc.on )
> > +            tasklet_schedule(&vmx->pi_vcpu_wakeup_tasklet);
> 
> There is a logical bug here.  If we have two NV's delivered to this
> pcpu, we will kick the first vcpu twice.
> 
> On finding desc.on, a kick should be scheduled, then the vcpu removed
> from this list.  With desc.on set, we know for certain that another NV
> will not arrive for it until it has been scheduled again and the
> interrupt posted.
> 
Yes, that seems a possible issue (and one that should indeed be
avoided).

I'm still unsure about the one that I raised myself but, if it is
possible to have more than one vcpu in a pcpu list, with desc.on==true,
then it looks to me that we kick all of them, for each notification.

Adding to what Andrew spotted: if there are a bunch of vcpus queued with
desc.on==true, and a bunch of notifications arrive before the tasklet
gets executed, we'll be kicking the whole bunch of them a bunch of
times! :-/

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-06-30 10:11   ` Andrew Cooper
  2015-07-01 13:26     ` Dario Faggioli
@ 2015-07-02  4:25     ` Wu, Feng
  1 sibling, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-02  4:25 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel@lists.xen.org
  Cc: Tian, Kevin, Wu, Feng, george.dunlap@eu.citrix.com,
	jbeulich@suse.com, Zhang, Yang Z, keir@xen.org



> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Tuesday, June 30, 2015 6:12 PM
> To: Wu, Feng; xen-devel@lists.xen.org
> Cc: keir@xen.org; jbeulich@suse.com; Tian, Kevin; Zhang, Yang Z;
> george.dunlap@eu.citrix.com
> Subject: Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
> 
> On 24/06/15 06:18, Feng Wu wrote:
> > @@ -1848,6 +1869,33 @@ static struct hvm_function_table __initdata
> vmx_function_table = {
> >      .enable_msr_exit_interception = vmx_enable_msr_exit_interception,
> >  };
> >
> > +/*
> > + * Handle VT-d posted-interrupt when VCPU is blocked.
> > + */
> > +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> > +{
> > +    struct arch_vmx_struct *vmx;
> > +    unsigned int cpu = smp_processor_id();
> > +
> > +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
> > +
> > +    /*
> > +     * FIXME: The length of the list depends on how many
> > +     * vCPU is current blocked on this specific pCPU.
> > +     * This may hurt the interrupt latency if the list
> > +     * grows to too many entries.
> > +     */
> > +    list_for_each_entry(vmx, &per_cpu(pi_blocked_vcpu, cpu),
> > +                        pi_blocked_vcpu_list)
> > +        if ( vmx->pi_desc.on )
> > +            tasklet_schedule(&vmx->pi_vcpu_wakeup_tasklet);
> 
> There is a logical bug here.  If we have two NV's delivered to this
> pcpu, we will kick the first vcpu twice.
> 
> On finding desc.on, a kick should be scheduled, then the vcpu removed
> from this list.  

So removing the vCPU from the blocking list here can avoid kicking the
vCPU multiple times, right?

Thanks,
Feng

> With desc.on set, we know for certain that another NV
> will not arrive for it until it has been scheduled again and the
> interrupt posted.
> 
> ~Andrew
> 
> > +
> > +    spin_unlock(&per_cpu(pi_blocked_vcpu_lock, cpu));
> > +
> > +    ack_APIC_irq();
> > +    this_cpu(irq_count)++;
> > +}
> > +
> >  const struct hvm_function_table * __init start_vmx(void)
> >  {
> >      set_in_cr4(X86_CR4_VMXE);
> >

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-01 13:26     ` Dario Faggioli
@ 2015-07-02  4:27       ` Wu, Feng
  2015-07-02  8:30         ` Dario Faggioli
  0 siblings, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-07-02  4:27 UTC (permalink / raw)
  To: Dario Faggioli, Andrew Cooper
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	xen-devel@lists.xen.org, jbeulich@suse.com, Zhang, Yang Z,
	Wu, Feng



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Wednesday, July 01, 2015 9:26 PM
> To: Andrew Cooper
> Cc: Wu, Feng; xen-devel@lists.xen.org; Zhang, Yang Z;
> george.dunlap@eu.citrix.com; Tian, Kevin; keir@xen.org; jbeulich@suse.com
> Subject: Re: [Xen-devel] [v3 12/15] vmx: posted-interrupt handling when vCPU
> is blocked
> 
> On Tue, 2015-06-30 at 11:11 +0100, Andrew Cooper wrote:
> > On 24/06/15 06:18, Feng Wu wrote:
> 
> > > +/*
> > > + * Handle VT-d posted-interrupt when VCPU is blocked.
> > > + */
> > > +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> > > +{
> > > +    struct arch_vmx_struct *vmx;
> > > +    unsigned int cpu = smp_processor_id();
> > > +
> > > +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
> > > +
> > > +    /*
> > > +     * FIXME: The length of the list depends on how many
> > > +     * vCPU is current blocked on this specific pCPU.
> > > +     * This may hurt the interrupt latency if the list
> > > +     * grows to too many entries.
> > > +     */
> > > +    list_for_each_entry(vmx, &per_cpu(pi_blocked_vcpu, cpu),
> > > +                        pi_blocked_vcpu_list)
> > > +        if ( vmx->pi_desc.on )
> > > +            tasklet_schedule(&vmx->pi_vcpu_wakeup_tasklet);
> >
> > There is a logical bug here.  If we have two NV's delivered to this
> > pcpu, we will kick the first vcpu twice.
> >
> > On finding desc.on, a kick should be scheduled, then the vcpu removed
> > from this list.  With desc.on set, we know for certain that another NV
> > will not arrive for it until it has been scheduled again and the
> > interrupt posted.
> >
> Yes, that seems a possible issue (and one that should indeed be
> avoided).
> 
> I'm still unsure about the one that I raised myself but, if it is
> possible to have more than one vcpu in a pcpu list, with desc.on==true,
> then it looks to me that we kick all of them, for each notification.
> 
> Added what Andrew's spotted, if there are a bunch of vcpus, queued with
> desc.on==ture, and a bunch of notifications arrives before the tasklet
> gets executed, we'll be kicking the whole bunch of them for a bunch of
> times! :-/

As Andrew mentioned, removing the vCPUs with desc.on = true from the
list avoids kicking vCPUs multiple times.

Thanks,
Feng

> 
> Regards,
> Dario
> 
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-06-30  2:58     ` Fwd: " Dario Faggioli
@ 2015-07-02  4:32       ` Wu, Feng
  2015-07-02  4:34         ` Wu, Feng
  2015-07-02  8:20         ` Dario Faggioli
  0 siblings, 2 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-02  4:32 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel, jbeulich@suse.com,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Tuesday, June 30, 2015 10:58 AM
> To: Wu, Feng
> Cc: xen-devel; keir@xen.org; jbeulich@suse.com; andrew.cooper3@citrix.com;
> Tian, Kevin; Zhang, Yang Z; george.dunlap@eu.citrix.com; Wu, Feng
> Subject: Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU
> scheduling
> 
> On Mon, 2015-06-29 at 18:36 +0100, Andrew Cooper wrote:
> 
> >
> > The basic idea here is:
> > 1. When vCPU's state is RUNSTATE_running,
> >         - set 'NV' to 'Notification Vector'.
> >         - Clear 'SN' to accpet PI.
> >         - set 'NDST' to the right pCPU.
> > 2. When vCPU's state is RUNSTATE_blocked,
> >         - set 'NV' to 'Wake-up Vector', so we can wake up the
> >           related vCPU when posted-interrupt happens for it.
> >         - Clear 'SN' to accpet PI.
> > 3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
> >         - Set 'SN' to suppress non-urgent interrupts.
> >           (Current, we only support non-urgent interrupts)
> >         - Set 'NV' back to 'Notification Vector' if needed.
> >
> It might be me, but it feels a bit odd to see RUNSTATE-s being (ab)used
> directly for this, as it does feel odd to see arch specific code being
> added in there.
> 
> Can't this be done in context_switch(), which is already architecture
> specific? I was thinking to something very similar to what has been done
> for PSR, i.e., on x86, put everything in __context_switch().
> 
> Looking at who's prev and who's next, and at what pause_flags each has
> set, you should be able to implement all of the above logic.
> 
> Or am I missing something?

As mentioned in the description of this patch, we need to do something
here when the vCPU's state changes. Can we get the state transition,
such as "running -> blocking", in __context_switch()?

Thanks,
Feng

> 
> Regards,
> Dario
> 
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-02  4:32       ` Wu, Feng
@ 2015-07-02  4:34         ` Wu, Feng
  2015-07-02  8:20         ` Dario Faggioli
  1 sibling, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-02  4:34 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel, jbeulich@suse.com,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Wu, Feng
> Sent: Thursday, July 02, 2015 12:33 PM
> To: Dario Faggioli
> Cc: xen-devel; keir@xen.org; jbeulich@suse.com; andrew.cooper3@citrix.com;
> Tian, Kevin; Zhang, Yang Z; george.dunlap@eu.citrix.com; Wu, Feng
> Subject: RE: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU
> scheduling
> 
> 
> 
> > -----Original Message-----
> > From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> > Sent: Tuesday, June 30, 2015 10:58 AM
> > To: Wu, Feng
> > Cc: xen-devel; keir@xen.org; jbeulich@suse.com;
> andrew.cooper3@citrix.com;
> > Tian, Kevin; Zhang, Yang Z; george.dunlap@eu.citrix.com; Wu, Feng
> > Subject: Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during
> vCPU
> > scheduling
> >
> > On Mon, 2015-06-29 at 18:36 +0100, Andrew Cooper wrote:
> >
> > >
> > > The basic idea here is:
> > > 1. When vCPU's state is RUNSTATE_running,
> > >         - set 'NV' to 'Notification Vector'.
> > >         - Clear 'SN' to accpet PI.
> > >         - set 'NDST' to the right pCPU.
> > > 2. When vCPU's state is RUNSTATE_blocked,
> > >         - set 'NV' to 'Wake-up Vector', so we can wake up the
> > >           related vCPU when posted-interrupt happens for it.
> > >         - Clear 'SN' to accpet PI.
> > > 3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
> > >         - Set 'SN' to suppress non-urgent interrupts.
> > >           (Current, we only support non-urgent interrupts)
> > >         - Set 'NV' back to 'Notification Vector' if needed.
> > >
> > It might be me, but it feels a bit odd to see RUNSTATE-s being (ab)used
> > directly for this, as it does feel odd to see arch specific code being
> > added in there.
> >
> > Can't this be done in context_switch(), which is already architecture
> > specific? I was thinking to something very similar to what has been done
> > for PSR, i.e., on x86, put everything in __context_switch().
> >
> > Looking at who's prev and who's next, and at what pause_flags each has
> > set, you should be able to implement all of the above logic.
> >
> > Or am I missing something?
> 
> As mentioned in the description of this patch, here we need to do
> something when the vCPU's state is changed, can we get the
> state transition in __context_switch(), such as "running -> blocking"?
> 
> Thanks,
> Feng

And 'vcpu_runstate_change' is a central place where the vCPU's
state gets changed.

Thanks,
Feng

> 
> >
> > Regards,
> > Dario
> >
> > --
> > <<This happens because I choose it to happen!>> (Raistlin Majere)
> > -----------------------------------------------------------------
> > Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> > Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-02  4:32       ` Wu, Feng
  2015-07-02  4:34         ` Wu, Feng
@ 2015-07-02  8:20         ` Dario Faggioli
  2015-07-09  3:09           ` Wu, Feng
  1 sibling, 1 reply; 155+ messages in thread
From: Dario Faggioli @ 2015-07-02  8:20 UTC (permalink / raw)
  To: Wu, Feng
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel, jbeulich@suse.com,
	Zhang, Yang Z


[-- Attachment #1.1: Type: text/plain, Size: 3896 bytes --]

On Thu, 2015-07-02 at 04:32 +0000, Wu, Feng wrote:
> 
> 
> > -----Original Message-----
> > From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> > Sent: Tuesday, June 30, 2015 10:58 AM
> > To: Wu, Feng
> > Cc: xen-devel; keir@xen.org; jbeulich@suse.com; andrew.cooper3@citrix.com;
> > Tian, Kevin; Zhang, Yang Z; george.dunlap@eu.citrix.com; Wu, Feng
> > Subject: Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU
> > scheduling
> > 
> > On Mon, 2015-06-29 at 18:36 +0100, Andrew Cooper wrote:
> > 
> > >
> > > The basic idea here is:
> > > 1. When vCPU's state is RUNSTATE_running,
> > >         - set 'NV' to 'Notification Vector'.
> > >         - Clear 'SN' to accept PI.
> > >         - set 'NDST' to the right pCPU.
> > > 2. When vCPU's state is RUNSTATE_blocked,
> > >         - set 'NV' to 'Wake-up Vector', so we can wake up the
> > >           related vCPU when posted-interrupt happens for it.
> > >         - Clear 'SN' to accept PI.
> > > 3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
> > >         - Set 'SN' to suppress non-urgent interrupts.
> > >           (Currently, we only support non-urgent interrupts)
> > >         - Set 'NV' back to 'Notification Vector' if needed.
> > >
> > It might be me, but it feels a bit odd to see RUNSTATE-s being (ab)used
> > directly for this, as it does feel odd to see arch specific code being
> > added in there.
> > 
> > Can't this be done in context_switch(), which is already architecture
> > specific? I was thinking to something very similar to what has been done
> > for PSR, i.e., on x86, put everything in __context_switch().
> > 
> > Looking at who's prev and who's next, and at what pause_flags each has
> > set, you should be able to implement all of the above logic.
> > 
> > Or am I missing something?
> 
> As mentioned in the description of this patch, here we need to do
> something when the vCPU's state is changed, can we get the
> state transition in __context_switch(), such as "running -> blocking"?
> 
Well, in the patch description you mention how you've done it, so of
course it mentions runstates.

That does not necessarily mean "we need to do something" in
vcpu_runstate_change(). Actually, that's exactly what I'm asking: can
you check whether this thing that you need doing can be done somewhere
other than in vcpu_runstate_change()?

In fact, looking at how, where and what for runstates are used, that
really does not feel right, at least to me. What you seem to be
interested in is whether a vCPU blocks and/or unblocks. Runstates are an
abstraction, built on top of (mostly) pause_flags, like _VPF_blocked
(look at how runstate is updated).

I think you should not build on top of such abstraction, but on top of
pause_flags directly. I had a quick look, and it indeed seems to me that
you can get all you need from there too. It might even result in the
code looking simpler (but that's of course hard to tell without actually
trying). In fact, inside the context switching code, you already know
that prev was running so, if it has the proper flag set, it means it's
blocking (i.e., going to RUNSTATE_blocked, in runstates language); if
not, it is probably being preempted (i.e., going to RUNSTATE_runnable).
Therefore, you can enact all your logic, even without any need to keep
track of the previous runstate, and without needing to build up a full
state machine and looking at all possible transitions.
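
To make the idea concrete, here is a minimal sketch of what I have in
mind. It is not Xen code: every name in it (the SKETCH_* constants, the
two structs, pi_switch_from() and pi_switch_to()) is a hypothetical
stand-in, the vector numbers are made up, and 'NDST' handling is left
out entirely. It only shows that the blocked/preempted distinction can
be read from pause_flags at context switch time.

```c
#include <stdbool.h>
#include <stdint.h>

/* All names below are hypothetical stand-ins, not Xen's real definitions. */
#define SKETCH_VPF_BLOCKED   (1UL << 0)  /* plays the role of _VPF_blocked */
#define SKETCH_NOTIFY_VECTOR 0xf2        /* made-up 'Notification Vector' */
#define SKETCH_WAKEUP_VECTOR 0xf1        /* made-up 'Wake-up Vector' */

struct pi_desc_sketch {
    uint8_t nv;   /* 'NV': vector used for the notification event */
    bool    sn;   /* 'SN': suppress (non-urgent) notifications */
};

struct vcpu_sketch {
    unsigned long pause_flags;
    struct pi_desc_sketch pi;
};

/* Called for the vCPU being switched out; we know it was running. */
static void pi_switch_from(struct vcpu_sketch *prev)
{
    if ( prev->pause_flags & SKETCH_VPF_BLOCKED )
    {
        /* Blocking: keep accepting PIs, but ask for a wake-up event. */
        prev->pi.nv = SKETCH_WAKEUP_VECTOR;
        prev->pi.sn = false;
    }
    else
    {
        /* Preempted but still runnable: suppress notifications. */
        prev->pi.nv = SKETCH_NOTIFY_VECTOR;
        prev->pi.sn = true;
    }
}

/* Called for the vCPU that is about to run. */
static void pi_switch_to(struct vcpu_sketch *next)
{
    next->pi.nv = SKETCH_NOTIFY_VECTOR;
    next->pi.sn = false;
}
```

A context switch would then call pi_switch_from(prev) followed by
pi_switch_to(next), with no record of the previous runstate needed.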

So, can you have a look at whether that solution can fly? Because, if it
does, I think it would be a lot better.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-02  4:27       ` Wu, Feng
@ 2015-07-02  8:30         ` Dario Faggioli
  2015-07-02  8:58           ` Wu, Feng
  2015-07-02 10:30           ` Andrew Cooper
  0 siblings, 2 replies; 155+ messages in thread
From: Dario Faggioli @ 2015-07-02  8:30 UTC (permalink / raw)
  To: Wu, Feng
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	Andrew Cooper, xen-devel@lists.xen.org, jbeulich@suse.com,
	Zhang, Yang Z


[-- Attachment #1.1: Type: text/plain, Size: 2334 bytes --]

On Thu, 2015-07-02 at 04:27 +0000, Wu, Feng wrote:

> > > > +    list_for_each_entry(vmx, &per_cpu(pi_blocked_vcpu, cpu),
> > > > +                        pi_blocked_vcpu_list)
> > > > +        if ( vmx->pi_desc.on )
> > > > +            tasklet_schedule(&vmx->pi_vcpu_wakeup_tasklet);
> > >
> > > There is a logical bug here.  If we have two NV's delivered to this
> > > pcpu, we will kick the first vcpu twice.
> > >
> > > On finding desc.on, a kick should be scheduled, then the vcpu removed
> > > from this list.  With desc.on set, we know for certain that another NV
> > > will not arrive for it until it has been scheduled again and the
> > > interrupt posted.
> > >
> > Yes, that seems a possible issue (and one that should indeed be
> > avoided).
> > 
> > I'm still unsure about the one that I raised myself but, if it is
> > possible to have more than one vcpu in a pcpu list, with desc.on==true,
> > then it looks to me that we kick all of them, for each notification.
> > 
> > Added what Andrew's spotted, if there are a bunch of vcpus, queued with
> > desc.on==ture, and a bunch of notifications arrives before the tasklet
> > gets executed, we'll be kicking the whole bunch of them for a bunch of
> > times! :-/
> 
> As Andrew mentioned, removing the vCPUs with desc.on = true from the
> list can avoid kick vCPUs for multiple times.
> 
It avoids kicking vcpus multiple times if more than one notification
arrives, yes.

It is, however, not effective in making sure that, even with only one
notification, you only kick the interested vcpu.

This is the third time that I ask:
 (1) whether it is possible to have more than one vcpu queued on one
     pcpu PI blocked list with desc.on set (I really believe it is);
 (2) if yes, whether it is TheRightThing(TM) to kick all of them, as
     soon as any notification arrives, instead of putting together a
     mechanism for kicking only a specific one.

The fact that you're not answering is not so much of a big deal for
me... I'll just keep asking! :-D


Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-02  8:30         ` Dario Faggioli
@ 2015-07-02  8:58           ` Wu, Feng
  2015-07-02 10:09             ` Dario Faggioli
  2015-07-02 10:30           ` Andrew Cooper
  1 sibling, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-07-02  8:58 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	Andrew Cooper, xen-devel@lists.xen.org, jbeulich@suse.com,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Thursday, July 02, 2015 4:30 PM
> To: Wu, Feng
> Cc: Andrew Cooper; xen-devel@lists.xen.org; Zhang, Yang Z;
> george.dunlap@eu.citrix.com; Tian, Kevin; keir@xen.org; jbeulich@suse.com
> Subject: Re: [Xen-devel] [v3 12/15] vmx: posted-interrupt handling when vCPU
> is blocked
> 
> On Thu, 2015-07-02 at 04:27 +0000, Wu, Feng wrote:
> 
> > > > > +    list_for_each_entry(vmx, &per_cpu(pi_blocked_vcpu, cpu),
> > > > > +                        pi_blocked_vcpu_list)
> > > > > +        if ( vmx->pi_desc.on )
> > > > > +            tasklet_schedule(&vmx->pi_vcpu_wakeup_tasklet);
> > > >
> > > > There is a logical bug here.  If we have two NV's delivered to this
> > > > pcpu, we will kick the first vcpu twice.
> > > >
> > > > On finding desc.on, a kick should be scheduled, then the vcpu removed
> > > > from this list.  With desc.on set, we know for certain that another NV
> > > > will not arrive for it until it has been scheduled again and the
> > > > interrupt posted.
> > > >
> > > Yes, that seems a possible issue (and one that should indeed be
> > > avoided).
> > >
> > > I'm still unsure about the one that I raised myself but, if it is
> > > possible to have more than one vcpu in a pcpu list, with desc.on==true,
> > > then it looks to me that we kick all of them, for each notification.
> > >
> > > Added what Andrew's spotted, if there are a bunch of vcpus, queued with
> > > desc.on==ture, and a bunch of notifications arrives before the tasklet
> > > gets executed, we'll be kicking the whole bunch of them for a bunch of
> > > times! :-/
> >
> > As Andrew mentioned, removing the vCPUs with desc.on = true from the
> > list can avoid kick vCPUs for multiple times.
> >
> It avoids kicking vcpus multiple times if more than one notification
> arrives, yes.
> 
> It is, therefore, not effective in making sure that, even with only one
> notification, you only kick the interested vcpu.
> 
> This is the third time that I ask:
>  (1) whether it is possible to have more vcpus queued on one pcpu PI
>      blocked list with desc.on (I really believe it is);

I think it is. Please see the following scenario:

While the CPU has interrupts masked, an external interrupt from the
assigned device arrives for vCPU2, which is blocked, so the wakeup
notification event handler has no chance to run. A while later, another
wakeup notification event occurs for vCPU4, which is blocked on the same
pCPU. Once the CPU unmasks interrupts, the wakeup notification handler
gets called, and we find:
	vCPU2, desc.on = 1 and vCPU4, desc.on = 1
So in the handler we need to kick both of them.

>  (2) if yes, whether it is TheRightThing(TM) to kick all of them, as
>      soon as any notification arrives, instead that putting together a
>      mechanism for kicking only a specific one.
> 
Why can't we kick all of them? 'desc.on = 1' means there is a pending
interrupt, so when we meet this condition, kicking the related vCPU
should be the right thing to do.
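
As a rough illustration of the scenario above (not the patch code), the
scan could look like the sketch below. The struct and pi_wakeup_scan()
are hypothetical names, a plain array stands in for the per-pCPU blocked
list, and the 'kicked' flag stands in for both scheduling the wakeup
tasklet and removing the vCPU from the list as Andrew suggested.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry on a pCPU's PI blocked list. */
struct blocked_vcpu_sketch {
    int  id;
    bool on;       /* mirrors desc.on */
    bool kicked;   /* set where the real code would schedule the wakeup */
};

/* Kick every listed vCPU whose 'ON' bit is set; return how many were kicked. */
static size_t pi_wakeup_scan(struct blocked_vcpu_sketch *list, size_t n)
{
    size_t kicked = 0;

    for ( size_t i = 0; i < n; i++ )
    {
        /* Skip entries already kicked, as if they had left the list. */
        if ( list[i].on && !list[i].kicked )
        {
            list[i].kicked = true;
            kicked++;
        }
    }

    return kicked;
}
```

With vCPU2 and vCPU4 both having 'ON' set and a third blocked vCPU
without it, the first handler run kicks exactly two vCPUs, and a second
run for the stacked notification finds nothing left to kick.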

Thanks,
Feng

> The fact that you're not answering is not so much of a big deal for
> me... I'll just keep asking! :-D
> 
> 
> Regards,
> Dario
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-02  8:58           ` Wu, Feng
@ 2015-07-02 10:09             ` Dario Faggioli
  2015-07-02 10:41               ` Wu, Feng
  0 siblings, 1 reply; 155+ messages in thread
From: Dario Faggioli @ 2015-07-02 10:09 UTC (permalink / raw)
  To: Wu, Feng
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	Andrew Cooper, xen-devel@lists.xen.org, jbeulich@suse.com,
	Zhang, Yang Z


[-- Attachment #1.1: Type: text/plain, Size: 2239 bytes --]

On Thu, 2015-07-02 at 08:58 +0000, Wu, Feng wrote:

> > -----Original Message-----
> > From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> >
> > This is the third time that I ask:
> >  (1) whether it is possible to have more vcpus queued on one pcpu PI
> >      blocked list with desc.on (I really believe it is);
> 
> I think it is, please see the following scenario:
> 
> When cpu masks the interrupts, and an external interrupt occurs for the
> assigned device while the target vCPU2 is blocked, the wakeup notification
> event handler has no chance to run, after a while, another wakeup
> notification event for vCPU4 blocking on the same pCPU occurs,
> after cpu unmakes the interrupts, wakeup notification handler
> gets called. Then we get:
> 	vCPU2, desc.on = 1 and vCPU4, desc.on = 1
> Then in the handler we need to kick both of them.
> 
Ok, first of all, thanks for answering! :-)

And yes, this makes sense.

> >  (2) if yes, whether it is TheRightThing(TM) to kick all of them, as
> >      soon as any notification arrives, instead that putting together a
> >      mechanism for kicking only a specific one.
> > 
> Why can't we kick all of them, 'desc.on = 1' means there is a pending
> interrupt, when we meet this condition, kicking the related vCPU should
> be the right thing to do.
> 
Right, I see it now. I felt like I was missing something, and that's why
I was asking you to elaborate a bit more.
Thanks again for having done this. I was missing/forgetting half of the
way desc.on is actually handled, sorry for this.

BTW, I'm finding it hard to read this series from the archives; there
appear to be some threading issues and some missing messages. I also
don't have it in my inbox, because my filters failed to spot and flag it
properly. If you send a new version, please, Cc me, so it will be easier
for me to look at all the patches, and provide a more helpful review.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-02  8:30         ` Dario Faggioli
  2015-07-02  8:58           ` Wu, Feng
@ 2015-07-02 10:30           ` Andrew Cooper
  2015-07-02 10:56             ` Wu, Feng
  2015-07-02 12:04             ` Dario Faggioli
  1 sibling, 2 replies; 155+ messages in thread
From: Andrew Cooper @ 2015-07-02 10:30 UTC (permalink / raw)
  To: Dario Faggioli, Wu, Feng
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	xen-devel@lists.xen.org, jbeulich@suse.com, Zhang, Yang Z

On 02/07/15 09:30, Dario Faggioli wrote:
> On Thu, 2015-07-02 at 04:27 +0000, Wu, Feng wrote:
>
>>>>> +    list_for_each_entry(vmx, &per_cpu(pi_blocked_vcpu, cpu),
>>>>> +                        pi_blocked_vcpu_list)
>>>>> +        if ( vmx->pi_desc.on )
>>>>> +            tasklet_schedule(&vmx->pi_vcpu_wakeup_tasklet);
>>>> There is a logical bug here.  If we have two NV's delivered to this
>>>> pcpu, we will kick the first vcpu twice.
>>>>
>>>> On finding desc.on, a kick should be scheduled, then the vcpu removed
>>>> from this list.  With desc.on set, we know for certain that another NV
>>>> will not arrive for it until it has been scheduled again and the
>>>> interrupt posted.
>>>>
>>> Yes, that seems a possible issue (and one that should indeed be
>>> avoided).
>>>
>>> I'm still unsure about the one that I raised myself but, if it is
>>> possible to have more than one vcpu in a pcpu list, with desc.on==true,
>>> then it looks to me that we kick all of them, for each notification.
>>>
>>> Added what Andrew's spotted, if there are a bunch of vcpus, queued with
>>> desc.on==ture, and a bunch of notifications arrives before the tasklet
>>> gets executed, we'll be kicking the whole bunch of them for a bunch of
>>> times! :-/
>> As Andrew mentioned, removing the vCPUs with desc.on = true from the
>> list can avoid kick vCPUs for multiple times.
>>
> It avoids kicking vcpus multiple times if more than one notification
> arrives, yes.
>
> It is, therefore, not effective in making sure that, even with only one
> notification, you only kick the interested vcpu.
>
> This is the third time that I ask:
>  (1) whether it is possible to have more vcpus queued on one pcpu PI 
>      blocked list with desc.on (I really believe it is);
>  (2) if yes, whether it is TheRightThing(TM) to kick all of them, as
>      soon as any notification arrives, instead that putting together a
>      mechanism for kicking only a specific one.

We will receive one NV each time the hardware manages to
successfully set desc.on.

If multiple NVs stack up and we proactively drain the list, we will
subsequently search the list to completion for all remaining NVs, due
to finding no appropriate entries.

I can't currently decide whether this will be quicker or slower overall,
or (most likely) it will even out to equal in the general case.

~Andrew

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-02 10:09             ` Dario Faggioli
@ 2015-07-02 10:41               ` Wu, Feng
  0 siblings, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-02 10:41 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	Andrew Cooper, xen-devel@lists.xen.org, jbeulich@suse.com,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Thursday, July 02, 2015 6:10 PM
> To: Wu, Feng
> Cc: Andrew Cooper; xen-devel@lists.xen.org; Zhang, Yang Z;
> george.dunlap@eu.citrix.com; Tian, Kevin; keir@xen.org; jbeulich@suse.com
> Subject: Re: [Xen-devel] [v3 12/15] vmx: posted-interrupt handling when vCPU
> is blocked
> 
> On Thu, 2015-07-02 at 08:58 +0000, Wu, Feng wrote:
> 
> > > -----Original Message-----
> > > From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> > >
> > > This is the third time that I ask:
> > >  (1) whether it is possible to have more vcpus queued on one pcpu PI
> > >      blocked list with desc.on (I really believe it is);
> >
> > I think it is, please see the following scenario:
> >
> > When cpu masks the interrupts, and an external interrupt occurs for the
> > assigned device while the target vCPU2 is blocked, the wakeup notification
> > event handler has no chance to run, after a while, another wakeup
> > notification event for vCPU4 blocking on the same pCPU occurs,
> > after cpu unmakes the interrupts, wakeup notification handler
> > gets called. Then we get:
> > 	vCPU2, desc.on = 1 and vCPU4, desc.on = 1
> > Then in the handler we need to kick both of them.
> >
> Ok, first of all, thanks for answering! :-)
> 
> And yes, this makes sense.
> 
> > >  (2) if yes, whether it is TheRightThing(TM) to kick all of them, as
> > >      soon as any notification arrives, instead that putting together a
> > >      mechanism for kicking only a specific one.
> > >
> > Why can't we kick all of them, 'desc.on = 1' means there is a pending
> > interrupt, when we meet this condition, kicking the related vCPU should
> > be the right thing to do.
> >
> Right, I see it now. I felt like I was missing something, and that's why
> I was asking to you to elaborate a bit more.
> Thanks again for having done this. I was missing/forgetting half of the
> way desc.on is actually handled, sorry for this.
> 
> BTW, I'm finding it hard reading this series from the archives; there
> appears to be some threading issues and some missing messages. I also
> don't have it in my inbox, because my filters failed to spot and flag it
> properly. If you send a new version, please, Cc me, so it will be easier
> for me to look at all the patches, and provide a more helpful review.

Sure, thanks for the review!

Thanks,
Feng

> 
> Thanks and Regards,
> Dario
> 
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-02 10:30           ` Andrew Cooper
@ 2015-07-02 10:56             ` Wu, Feng
  2015-07-02 12:04             ` Dario Faggioli
  1 sibling, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-02 10:56 UTC (permalink / raw)
  To: Andrew Cooper, Dario Faggioli
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	xen-devel@lists.xen.org, jbeulich@suse.com, Zhang, Yang Z,
	Wu, Feng



> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Thursday, July 02, 2015 6:30 PM
> To: Dario Faggioli; Wu, Feng
> Cc: xen-devel@lists.xen.org; Zhang, Yang Z; george.dunlap@eu.citrix.com;
> Tian, Kevin; keir@xen.org; jbeulich@suse.com
> Subject: Re: [Xen-devel] [v3 12/15] vmx: posted-interrupt handling when vCPU
> is blocked
> 
> On 02/07/15 09:30, Dario Faggioli wrote:
> > On Thu, 2015-07-02 at 04:27 +0000, Wu, Feng wrote:
> >
> >>>>> +    list_for_each_entry(vmx, &per_cpu(pi_blocked_vcpu, cpu),
> >>>>> +                        pi_blocked_vcpu_list)
> >>>>> +        if ( vmx->pi_desc.on )
> >>>>> +            tasklet_schedule(&vmx->pi_vcpu_wakeup_tasklet);
> >>>> There is a logical bug here.  If we have two NV's delivered to this
> >>>> pcpu, we will kick the first vcpu twice.
> >>>>
> >>>> On finding desc.on, a kick should be scheduled, then the vcpu removed
> >>>> from this list.  With desc.on set, we know for certain that another NV
> >>>> will not arrive for it until it has been scheduled again and the
> >>>> interrupt posted.
> >>>>
> >>> Yes, that seems a possible issue (and one that should indeed be
> >>> avoided).
> >>>
> >>> I'm still unsure about the one that I raised myself but, if it is
> >>> possible to have more than one vcpu in a pcpu list, with desc.on==true,
> >>> then it looks to me that we kick all of them, for each notification.
> >>>
> >>> Added what Andrew's spotted, if there are a bunch of vcpus, queued with
> >>> desc.on==ture, and a bunch of notifications arrives before the tasklet
> >>> gets executed, we'll be kicking the whole bunch of them for a bunch of
> >>> times! :-/
> >> As Andrew mentioned, removing the vCPUs with desc.on = true from the
> >> list can avoid kick vCPUs for multiple times.
> >>
> > It avoids kicking vcpus multiple times if more than one notification
> > arrives, yes.
> >
> > It is, therefore, not effective in making sure that, even with only one
> > notification, you only kick the interested vcpu.
> >
> > This is the third time that I ask:
> >  (1) whether it is possible to have more vcpus queued on one pcpu PI
> >      blocked list with desc.on (I really believe it is);
> >  (2) if yes, whether it is TheRightThing(TM) to kick all of them, as
> >      soon as any notification arrives, instead that putting together a
> >      mechanism for kicking only a specific one.
> 
> We will receive one NV for every time the hardware managed to
> successfully set desc.on
> 
> If multiple stack up and we proactively drain the list, we will
> subsequently search the list to completion for all remaining NV's, due
> to finding no appropriate entries.
> 
> I can't currently decide whether this will be quicker or slower overall,
> or (most likely) it will even out to equal in the general case.

What do you mean by "general case"?

Thanks,
Feng

> 
> ~Andrew

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-02 10:30           ` Andrew Cooper
  2015-07-02 10:56             ` Wu, Feng
@ 2015-07-02 12:04             ` Dario Faggioli
  2015-07-02 12:10               ` Wu, Feng
  2015-07-02 12:16               ` Andrew Cooper
  1 sibling, 2 replies; 155+ messages in thread
From: Dario Faggioli @ 2015-07-02 12:04 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	xen-devel@lists.xen.org, jbeulich@suse.com, Zhang, Yang Z,
	Wu, Feng


[-- Attachment #1.1: Type: text/plain, Size: 1616 bytes --]

On Thu, 2015-07-02 at 11:30 +0100, Andrew Cooper wrote:
> On 02/07/15 09:30, Dario Faggioli wrote:

> > It is, therefore, not effective in making sure that, even with only one
> > notification, you only kick the interested vcpu.
> >
> > This is the third time that I ask:
> >  (1) whether it is possible to have more vcpus queued on one pcpu PI 
> >      blocked list with desc.on (I really believe it is);
> >  (2) if yes, whether it is TheRightThing(TM) to kick all of them, as
> >      soon as any notification arrives, instead that putting together a
> >      mechanism for kicking only a specific one.
> 
> We will receive one NV for every time the hardware managed to
> successfully set desc.on
> 
Right, I see it now, thanks.

> If multiple stack up and we proactively drain the list, we will
> subsequently search the list to completion for all remaining NV's, due
> to finding no appropriate entries.
> 
> I can't currently decide whether this will be quicker or slower overall,
> or (most likely) it will even out to equal in the general case.
> 
Well, given the thing works as you (two) just described, I think
draining the list is the only thing we can do.

In fact, AFAICT, since we can't know for what vcpu a particular
notification is intended, we don't have alternatives to waking them all,
do we?

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-02 12:04             ` Dario Faggioli
@ 2015-07-02 12:10               ` Wu, Feng
  2015-07-02 12:16               ` Andrew Cooper
  1 sibling, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-02 12:10 UTC (permalink / raw)
  To: Dario Faggioli, Andrew Cooper
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	xen-devel@lists.xen.org, jbeulich@suse.com, Zhang, Yang Z,
	Wu, Feng



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Thursday, July 02, 2015 8:04 PM
> To: Andrew Cooper
> Cc: Wu, Feng; Tian, Kevin; keir@xen.org; george.dunlap@eu.citrix.com;
> xen-devel@lists.xen.org; jbeulich@suse.com; Zhang, Yang Z
> Subject: Re: [Xen-devel] [v3 12/15] vmx: posted-interrupt handling when vCPU
> is blocked
> 
> On Thu, 2015-07-02 at 11:30 +0100, Andrew Cooper wrote:
> > On 02/07/15 09:30, Dario Faggioli wrote:
> 
> > > It is, therefore, not effective in making sure that, even with only one
> > > notification, you only kick the interested vcpu.
> > >
> > > This is the third time that I ask:
> > >  (1) whether it is possible to have more vcpus queued on one pcpu PI
> > >      blocked list with desc.on (I really believe it is);
> > >  (2) if yes, whether it is TheRightThing(TM) to kick all of them, as
> > >      soon as any notification arrives, instead that putting together a
> > >      mechanism for kicking only a specific one.
> >
> > We will receive one NV for every time the hardware managed to
> > successfully set desc.on
> >
> Right, I see it now, thanks.
> 
> > If multiple stack up and we proactively drain the list, we will
> > subsequently search the list to completion for all remaining NV's, due
> > to finding no appropriate entries.
> >
> > I can't currently decide whether this will be quicker or slower overall,
> > or (most likely) it will even out to equal in the general case.
> >
> Well, given the thing works as you (two) just described, I think
> draining the list is the only thing we can do.
> 
> In fact, AFAICT, since we can't know for what vcpu a particular
> notification is intended,

Exactly. When a notification event happens, the hardware sets 'ON' and
software looks for the vCPUs with 'ON' set. Software doesn't know which
vCPU the wakeup event is targeting, so the only thing it can do is kick
the vCPUs with desc.on = 1.

Thanks,
Feng

> we don't have alternatives to waking them all,
> do we?
> 
> Dario
> 
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-02 12:04             ` Dario Faggioli
  2015-07-02 12:10               ` Wu, Feng
@ 2015-07-02 12:16               ` Andrew Cooper
  2015-07-02 12:38                 ` Dario Faggioli
  1 sibling, 1 reply; 155+ messages in thread
From: Andrew Cooper @ 2015-07-02 12:16 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	xen-devel@lists.xen.org, jbeulich@suse.com, Zhang, Yang Z,
	Wu, Feng

On 02/07/15 13:04, Dario Faggioli wrote:
> On Thu, 2015-07-02 at 11:30 +0100, Andrew Cooper wrote:
>> On 02/07/15 09:30, Dario Faggioli wrote:
>>> It is, therefore, not effective in making sure that, even with only one
>>> notification, you only kick the interested vcpu.
>>>
>>> This is the third time that I ask:
>>>  (1) whether it is possible to have more vcpus queued on one pcpu PI 
>>>      blocked list with desc.on (I really believe it is);
>>>  (2) if yes, whether it is TheRightThing(TM) to kick all of them, as
>>>      soon as any notification arrives, instead that putting together a
>>>      mechanism for kicking only a specific one.
>> We will receive one NV for every time the hardware managed to
>> successfully set desc.on
>>
> Right, I see it now, thanks.
>
>> If multiple stack up and we proactively drain the list, we will
>> subsequently search the list to completion for all remaining NV's, due
>> to finding no appropriate entries.
>>
>> I can't currently decide whether this will be quicker or slower overall,
>> or (most likely) it will even out to equal in the general case.
>>
> Well, given the thing works as you (two) just described, I think
> draining the list is the only thing we can do.
>
> In fact, AFAICT, since we can't know for what vcpu a particular
> notification is intended, we don't have alternatives to waking them all,
> do we?

Perhaps you misunderstand.

Every single vcpu has a PI descriptor which is shared memory with hardware.

An NV is delivered strictly when hardware atomically changes desc.on from
0 to 1, i.e. the first time that an outstanding notification arrives.
(iirc, desc.on is later cleared by hardware when the vcpu is scheduled
and the vector(s) actually injected.)

Part of the scheduling modifications alter when a vcpu is eligible to
have NVs delivered on its behalf.  Non-scheduled vcpus get NVs while
scheduled vcpus have direct injection instead.

Therefore, in the case that an NV arrives, we know for certain that one
of the NV-eligible vcpus has had desc.on set by hardware, and we can
uniquely identify it by searching for the vcpu for which desc.on is set.

In the case of stacked NV's, we cannot associate which specific vcpu
caused which NV, but we know that we will get one NV per vcpu needing
kicking.
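
A minimal sketch of the rule described above, assuming a much-simplified descriptor that models only the ON bit. The names pi_desc_model, post_interrupt and sync_and_clear are hypothetical and are not taken from the patch series; real hardware performs the update on the full descriptor, and this only illustrates why a notification is raised once per 0 -> 1 transition.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified model of a PI descriptor: only the ON bit. */
struct pi_desc_model {
    atomic_uint on;   /* 0 = nothing outstanding, 1 = notification outstanding */
};

/*
 * Model of an interrupt being posted for a vcpu.  Returns true when a
 * notification (NV) would be raised, which is only on the 0 -> 1
 * transition of ON.  Further posts while ON is already 1 return false.
 */
static bool post_interrupt(struct pi_desc_model *desc)
{
    assert(desc);
    /* Atomically set ON and observe its previous value. */
    return atomic_exchange(&desc->on, 1u) == 0u;
}

/* Model of ON being cleared once the pending vectors have been injected. */
static void sync_and_clear(struct pi_desc_model *desc)
{
    assert(desc);
    atomic_store(&desc->on, 0u);
}
```

Under this model, several interrupts posted to the same vcpu before it is next scheduled yield a single NV, which matches the "one NV per vcpu needing kicking" observation.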

~Andrew

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-02 12:16               ` Andrew Cooper
@ 2015-07-02 12:38                 ` Dario Faggioli
  2015-07-02 12:59                   ` Andrew Cooper
  0 siblings, 1 reply; 155+ messages in thread
From: Dario Faggioli @ 2015-07-02 12:38 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	xen-devel@lists.xen.org, jbeulich@suse.com, Zhang, Yang Z,
	Wu, Feng


[-- Attachment #1.1: Type: text/plain, Size: 2714 bytes --]

On Thu, 2015-07-02 at 13:16 +0100, Andrew Cooper wrote:
> On 02/07/15 13:04, Dario Faggioli wrote:
> > On Thu, 2015-07-02 at 11:30 +0100, Andrew Cooper wrote:

> >> I can't currently decide whether this will be quicker or slower overall,
> >> or (most likely) it will even out to equal in the general case.
> >>
> > Well, given the thing works as you (two) just described, I think
> > draining the list is the only thing we can do.
> >
> > In fact, AFAICT, since we can't know for what vcpu a particular
> > notification is intended, we don't have alternatives to waking them all,
> > do we?
> 
> Perhaps you misunderstand.
> 
I'm quite sure I was, though I think I'm getting it now.

> Every single vcpu has a PI descriptor which is shared memory with hardware.
> 
Right.

> An NV is delivered strictly when hardware atomically changes desc.on from
> 0 to 1.  i.e. the first time that an outstanding notification arrives.
> (iirc, desc.on is later cleared by hardware when the vcpu is scheduled
> and the vector(s) actually injected.)
> 
> Part of the scheduling modifications alter when a vcpu is eligible to
> have NV's delivered on its behalf.  non-scheduled vcpus get NV's while
> scheduled vcpus have direct injection instead.
> 
Blocked vcpus, AFAICT. But that's not relevant here.

> Therefore, in the case that an NV arrives, we know for certain that one
> of the NV-eligible vcpus has had desc.on set by hardware, and we can
> uniquely identify it by searching for the vcpu for which desc.on is set.
> 
Yeah, but we can have more than one of them. You said "I can't currently
decide whether this will be quicker or slower", which I read like you
were suggesting that not draining the queue was a plausible alternative,
while I now think it's not.

Perhaps you were not meaning anything like that, so it was not necessary
for me to point this out, in which case, sorry for the noise. :-)

> In the case of stacked NV's, we cannot associate which specific vcpu
> caused which NV, but we know that we will get one NV per vcpu needing
> kicking.
> 
Exactly, and that's what I'm talking about, and why I'm saying that
waking everyone is the only solution. The bottom line being that, even
in case this is deemed too slow, we don't have the option of waking only
one vcpu at each NV, as we wouldn't know who to wake, and hence we'd
need to make things faster in some other way.


Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-02 12:38                 ` Dario Faggioli
@ 2015-07-02 12:59                   ` Andrew Cooper
  2015-07-03  1:33                     ` Wu, Feng
  0 siblings, 1 reply; 155+ messages in thread
From: Andrew Cooper @ 2015-07-02 12:59 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	xen-devel@lists.xen.org, jbeulich@suse.com, Zhang, Yang Z,
	Wu, Feng

On 02/07/15 13:38, Dario Faggioli wrote:
> On Thu, 2015-07-02 at 13:16 +0100, Andrew Cooper wrote:
>> On 02/07/15 13:04, Dario Faggioli wrote:
>>> On Thu, 2015-07-02 at 11:30 +0100, Andrew Cooper wrote:
>>>> I can't currently decide whether this will be quicker or slower overall,
>>>> or (most likely) it will even out to equal in the general case.
>>>>
>>> Well, given the thing works as you (two) just described, I think
>>> draining the list is the only thing we can do.
>>>
>>> In fact, AFAICT, since we can't know for what vcpu a particular
>>> notification is intended, we don't have alternatives to waking them all,
>>> do we?
>> Perhaps you misunderstand.
>>
> I'm quite sure I was. While I think now I'm getting it.
>
>> Every single vcpu has a PI descriptor which is shared memory with hardware.
>>
> Right.
>
>> An NV is delivered strictly when hardware atomically changes desc.on from
>> 0 to 1.  i.e. the first time that an outstanding notification arrives.
>> (iirc, desc.on is later cleared by hardware when the vcpu is scheduled
>> and the vector(s) actually injected.)
>>
>> Part of the scheduling modifications alter when a vcpu is eligible to
>> have NV's delivered on its behalf.  non-scheduled vcpus get NV's while
>> scheduled vcpus have direct injection instead.
>>
> Blocked vcpus, AFAICT. But that's not relevant here.
>
>> Therefore, in the case that an NV arrives, we know for certain that one
>> of the NV-eligible vcpus has had desc.on set by hardware, and we can
>> uniquely identify it by searching for the vcpu for which desc.on is set.
>>
> Yeah, but we can have more than one of them. You said "I can't currently
> decide whether this will be quicker or slower", which I read like you
> were suggesting that not draining the queue was a plausible alternative,
> while I now think it's not.
>
> Perhaps you were not meaning anything like that, so it was not necessary
> for me to point this out, in which case, sorry for the noise. :-)

To be clear, assuming that a kicked vcpu is removed from the list, both
options (kicking exactly one vcpu or kicking all vcpus) will function.
The end result after all processing of NVs will be that every vcpu with
desc.on set will be kicked exactly once.

I just was concerned about the O() of searching the list on a subsequent
NV, knowing that we most likely took the relevant entry off the list on
the previous NV.

>
>> In the case of stacked NV's, we cannot associate which specific vcpu
>> caused which NV, but we know that we will get one NV per vcpu needing
>> kicking.
>>
> Exactly, and that's what I'm talking about, and why I'm saying that
> waking everyone is the only solution. The bottom line being that, even
> in case this is deemed too slow, we don't have the option of waking only
> one vcpu at each NV, as we wouldn't know who to wake, and hence we'd
> need to make things faster in some other way.

Ah - I see your point now.

Yes - kicking exactly one vcpu per NV could result in a different vcpu
being deferred based on the interrupt activity of other vcpus and its
position in the list.

In which case, we should eagerly kick all vcpus.
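
A minimal sketch of the "eagerly kick all" handling, assuming a singly linked per-pCPU list of blocked vcpus and that a kicked vcpu is unlinked as it is woken. The type blocked_vcpu and the function drain_blocked_list are hypothetical stand-ins, not the code from the series, and locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a blocked vcpu queued on a per-pCPU list. */
struct blocked_vcpu {
    bool on;                    /* models desc.on */
    bool kicked;                /* set when the vcpu is woken */
    struct blocked_vcpu *next;
};

/*
 * Handle one NV: walk the whole list, kick every vcpu whose ON bit is
 * set, and unlink it so later NVs do not revisit it.  Returns the number
 * of vcpus kicked, which is 0 for a stacked NV whose vcpu was already
 * drained by an earlier pass.
 */
static unsigned int drain_blocked_list(struct blocked_vcpu **head)
{
    unsigned int kicked = 0;
    struct blocked_vcpu **pp = head;

    assert(head);

    while ( *pp )
    {
        struct blocked_vcpu *v = *pp;

        if ( v->on )
        {
            *pp = v->next;      /* unlink; pp now refers to the successor */
            v->next = NULL;
            v->kicked = true;
            kicked++;
        }
        else
            pp = &v->next;      /* leave in place and advance */
    }

    return kicked;
}
```

With this shape, no vcpu is deferred because of its position in the list, at the cost that a later stacked NV walks the remaining entries and finds nothing to kick.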

~Andrew

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-02 12:59                   ` Andrew Cooper
@ 2015-07-03  1:33                     ` Wu, Feng
  0 siblings, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-03  1:33 UTC (permalink / raw)
  To: Andrew Cooper, Dario Faggioli
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	xen-devel@lists.xen.org, jbeulich@suse.com, Zhang, Yang Z,
	Wu, Feng



> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Thursday, July 02, 2015 9:00 PM
> To: Dario Faggioli
> Cc: Wu, Feng; Tian, Kevin; keir@xen.org; george.dunlap@eu.citrix.com;
> xen-devel@lists.xen.org; jbeulich@suse.com; Zhang, Yang Z
> Subject: Re: [Xen-devel] [v3 12/15] vmx: posted-interrupt handling when vCPU
> is blocked
> 
> On 02/07/15 13:38, Dario Faggioli wrote:
> > On Thu, 2015-07-02 at 13:16 +0100, Andrew Cooper wrote:
> >> On 02/07/15 13:04, Dario Faggioli wrote:
> >>> On Thu, 2015-07-02 at 11:30 +0100, Andrew Cooper wrote:
> >>>> I can't currently decide whether this will be quicker or slower overall,
> >>>> or (most likely) it will even out to equal in the general case.
> >>>>
> >>> Well, given the thing works as you (two) just described, I think
> >>> draining the list is the only thing we can do.
> >>>
> >>> In fact, AFAICT, since we can't know for what vcpu a particular
> >>> notification is intended, we don't have alternatives to waking them all,
> >>> do we?
> >> Perhaps you misunderstand.
> >>
> > I'm quite sure I was. While I think now I'm getting it.
> >
> >> Every single vcpu has a PI descriptor which is shared memory with
> hardware.
> >>
> > Right.
> >
> >> An NV is delivered strictly when hardware atomically changes desc.on from
> >> 0 to 1.  i.e. the first time that an outstanding notification arrives.
> >> (iirc, desc.on is later cleared by hardware when the vcpu is scheduled
> >> and the vector(s) actually injected.)
> >>
> >> Part of the scheduling modifications alter when a vcpu is eligible to
> >> have NV's delivered on its behalf.  non-scheduled vcpus get NV's while
> >> scheduled vcpus have direct injection instead.
> >>
> > Blocked vcpus, AFAICT. But that's not relevant here.
> >
> >> Therefore, in the case that an NV arrives, we know for certain that one
> >> of the NV-eligible vcpus has had desc.on set by hardware, and we can
> >> uniquely identify it by searching for the vcpu for which desc.on is set.
> >>
> > Yeah, but we can have more than one of them. You said "I can't currently
> > decide whether this will be quicker or slower", which I read like you
> > were suggesting that not draining the queue was a plausible alternative,
> > while I now think it's not.
> >
> > Perhaps you were not meaning anything like that, so it was not necessary
> > for me to point this out, in which case, sorry for the noise. :-)
> 
> To be clear, (assuming that a kicked vcpu is removed from the list),
> then both options of kicking exactly one vcpu or kicking all vcpus will
> function.  The end result after all processing of NVs will be that every
> vcpu with desc.on set will be kicked exactly once.

We need to remove the kicked vCPU from the list; this will be included
in the next version.

Thanks,
Feng

> 
> I just was concerned about the O() of searching the list on a subsequent
> NV, knowing that we most likely took the relevant entry off the list on
> the previous NV.
> 
> >
> >> In the case of stacked NV's, we cannot associate which specific vcpu
> >> caused which NV, but we know that we will get one NV per vcpu needing
> >> kicking.
> >>
> > Exactly, and that's what I'm talking about, and why I'm saying that
> > waking everyone is the only solution. The bottom line being that, even
> > in case this is deemed too slow, we don't have the option of waking only
> > one vcpu at each NV, as we wouldn't know who to wake, and hence we'd
> > need to make things faster in some other way.
> 
> Ah - I see your point now.
> 
> Yes - kicking exactly one vcpu per NV could result in a different vcpu
> being deferred based on the interrupt activity of other vcpus and its
> position in the list.
> 
> In which case, we should eagerly kick all vcpus.
> 
> ~Andrew

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 03/15] Add cmpxchg16b support for x86-64
  2015-06-24 18:35   ` Andrew Cooper
@ 2015-07-08  7:06     ` Wu, Feng
  2015-07-08  8:12       ` Jan Beulich
  0 siblings, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-07-08  7:06 UTC (permalink / raw)
  To: Andrew Cooper, xen-devel@lists.xen.org
  Cc: Tian, Kevin, Wu, Feng, george.dunlap@eu.citrix.com,
	jbeulich@suse.com, Zhang, Yang Z, keir@xen.org



> -----Original Message-----
> From: xen-devel-bounces@lists.xen.org
> [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of Andrew Cooper
> Sent: Thursday, June 25, 2015 2:35 AM
> To: Wu, Feng; xen-devel@lists.xen.org
> Cc: george.dunlap@eu.citrix.com; Zhang, Yang Z; Tian, Kevin; keir@xen.org;
> jbeulich@suse.com
> Subject: Re: [Xen-devel] [v3 03/15] Add cmpxchg16b support for x86-64
> 
> On 24/06/15 06:18, Feng Wu wrote:
> > This patch adds cmpxchg16b support for x86-64, so software
> > can perform 128-bit atomic write/read.
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > ---
> > v3:
> > Newly added.
> >
> >  xen/include/asm-x86/x86_64/system.h | 28
> ++++++++++++++++++++++++++++
> >  xen/include/xen/types.h             |  5 +++++
> >  2 files changed, 33 insertions(+)
> >
> > diff --git a/xen/include/asm-x86/x86_64/system.h
> b/xen/include/asm-x86/x86_64/system.h
> > index 662813a..a910d00 100644
> > --- a/xen/include/asm-x86/x86_64/system.h
> > +++ b/xen/include/asm-x86/x86_64/system.h
> > @@ -6,6 +6,34 @@
> >                                     (unsigned
> long)(n),sizeof(*(ptr))))
> >
> >  /*
> > + * Atomic 16 bytes compare and exchange.  Compare OLD with MEM, if
> > + * identical, store NEW in MEM.  Return the initial value in MEM.
> > + * Success is indicated by comparing RETURN with OLD.
> > + *
> > + * This function can only be called when cpu_has_cx16 is true.
> > + */
> > +
> > +static always_inline uint128_t __cmpxchg16b(
> > +    volatile void *ptr, uint128_t old, uint128_t new)
> 
> It is not nice for register scheduling taking uint128_t's by value.
> Instead, I would pass them by pointer and let the inlining sort the
> eventual references out.
> 
> > +{
> > +    uint128_t prev;
> > +
> > +    ASSERT(cpu_has_cx16);
> 
> Given that if this assertion were to fail, cmpxchg16b would fail with
> #UD, I would hand-code a asm_fixup section which in turn panics.  This
> avoids a situation where non-debug builds could die with an unqualified
> #UD exception.

Is there an existing way to panic the hypervisor from assembler code? I
couldn't find one; I would appreciate it if you could point it out.

> 
> Also, you must enforce 16-byte alignment of the memory reference, as
> described in the manual.

What should I do if the caller passes data that is not 16-byte aligned
(struct iremap_entry in this case)? Does this mean I need to define
it like this?

struct iremap_entry {

......

} __attribute__ ((aligned (16)));

Thanks,
Feng

> 
> ~Andrew
> 
> > +
> > +    asm volatile ( "lock; cmpxchg16b %4"
> > +                   : "=d" (prev.high), "=a" (prev.low)
> > +                   : "c" (new.high), "b" (new.low),
> > +                   "m" (*__xg((volatile void *)ptr)),
> > +                   "0" (old.high), "1" (old.low)
> > +                   : "memory" );
> > +
> > +    return prev;
> > +}
> > +
> > +#define cmpxchg16b(ptr,o,n)
> \
> > +    __cmpxchg16b((ptr), *(uint128_t *)(o), *(uint128_t *)(n))
> > +
> > +/*
> >   * This function causes value _o to be changed to _n at location _p.
> >   * If this access causes a fault then we return 1, otherwise we return 0.
> >   * If no fault occurs then _o is updated to the value we saw at _p. If this
> > diff --git a/xen/include/xen/types.h b/xen/include/xen/types.h
> > index 8596ded..30f8a44 100644
> > --- a/xen/include/xen/types.h
> > +++ b/xen/include/xen/types.h
> > @@ -47,6 +47,11 @@ typedef         __u64           uint64_t;
> >  typedef         __u64           u_int64_t;
> >  typedef         __s64           int64_t;
> >
> > +typedef struct {
> > +        uint64_t low;
> > +        uint64_t high;
> > +} uint128_t;
> > +
> >  struct domain;
> >  struct vcpu;
> >
> 
> 

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 01/15] Vt-d Posted-intterrupt (PI) design
  2015-06-24  5:18 ` [v3 01/15] Vt-d Posted-intterrupt (PI) design Feng Wu
  2015-06-24  6:15   ` Meng Xu
@ 2015-07-08  7:21   ` Tian, Kevin
  2015-07-08  7:29     ` Wu, Feng
  1 sibling, 1 reply; 155+ messages in thread
From: Tian, Kevin @ 2015-07-08  7:21 UTC (permalink / raw)
  To: Wu, Feng, xen-devel@lists.xen.org
  Cc: Zhang, Yang Z, andrew.cooper3@citrix.com, keir@xen.org,
	george.dunlap@eu.citrix.com, jbeulich@suse.com

> From: Wu, Feng
> Sent: Wednesday, June 24, 2015 1:18 PM
> 
> Add the design doc for VT-d PI.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>


> +So, gist of above is that, lowest priority interrupts has never been delivered as
> +"lowest priority" in physical hardware.
> +
> +I will emulate vector hashing for posted-interrupt for XEN.

"I will" is not a good usage in design doc. :-)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 02/15] Add helper macro for X86_FEATURE_CX16 feature detection
  2015-06-24  5:18 ` [v3 02/15] Add helper macro for X86_FEATURE_CX16 feature detection Feng Wu
  2015-06-24 17:31   ` Andrew Cooper
@ 2015-07-08  7:23   ` Tian, Kevin
  1 sibling, 0 replies; 155+ messages in thread
From: Tian, Kevin @ 2015-07-08  7:23 UTC (permalink / raw)
  To: Wu, Feng, xen-devel@lists.xen.org
  Cc: Zhang, Yang Z, andrew.cooper3@citrix.com, keir@xen.org,
	george.dunlap@eu.citrix.com, jbeulich@suse.com

> From: Wu, Feng
> Sent: Wednesday, June 24, 2015 1:18 PM
> 
> Add macro cpu_has_cx16 to detect X86_FEATURE_CX16 feature.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 01/15] Vt-d Posted-intterrupt (PI) design
  2015-07-08  7:21   ` Tian, Kevin
@ 2015-07-08  7:29     ` Wu, Feng
  0 siblings, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-08  7:29 UTC (permalink / raw)
  To: Tian, Kevin, xen-devel@lists.xen.org
  Cc: Wu, Feng, george.dunlap@eu.citrix.com, andrew.cooper3@citrix.com,
	jbeulich@suse.com, Zhang, Yang Z, keir@xen.org



> -----Original Message-----
> From: Tian, Kevin
> Sent: Wednesday, July 08, 2015 3:21 PM
> To: Wu, Feng; xen-devel@lists.xen.org
> Cc: keir@xen.org; jbeulich@suse.com; andrew.cooper3@citrix.com; Zhang,
> Yang Z; george.dunlap@eu.citrix.com
> Subject: RE: [v3 01/15] Vt-d Posted-intterrupt (PI) design
> 
> > From: Wu, Feng
> > Sent: Wednesday, June 24, 2015 1:18 PM
> >
> > Add the design doc for VT-d PI.
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> 
> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
> 
> 
> > +So, gist of above is that, lowest priority interrupts has never been delivered
> as
> > +"lowest priority" in physical hardware.
> > +
> > +I will emulate vector hashing for posted-interrupt for XEN.
> 
> "I will" is not a good usage in design doc. :-)

Thanks for the review, I will rephrase it! :)

Thanks,
Feng

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 04/15] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature
  2015-06-24  5:18 ` [v3 04/15] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature Feng Wu
  2015-06-25  9:06   ` Andrew Cooper
@ 2015-07-08  7:30   ` Tian, Kevin
  1 sibling, 0 replies; 155+ messages in thread
From: Tian, Kevin @ 2015-07-08  7:30 UTC (permalink / raw)
  To: Wu, Feng, xen-devel@lists.xen.org
  Cc: Zhang, Yang Z, andrew.cooper3@citrix.com, keir@xen.org,
	george.dunlap@eu.citrix.com, jbeulich@suse.com

> From: Wu, Feng
> Sent: Wednesday, June 24, 2015 1:18 PM
> 
> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> With VT-d Posted-Interrupts enabled, external interrupts from
> direct-assigned devices can be delivered to guests without VMM
> intervention when guest is running in non-root mode.
> 
> This patch adds variable 'iommu_intpost' to control whether enable VT-d
> posted-interrupt or not in the generic IOMMU code.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>

Reviewed-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 05/15] vt-d: VT-d Posted-Interrupts feature detection
  2015-06-24  5:18 ` [v3 05/15] vt-d: VT-d Posted-Interrupts feature detection Feng Wu
  2015-06-25 10:21   ` Andrew Cooper
@ 2015-07-08  7:32   ` Tian, Kevin
  2015-07-08  8:00     ` Wu, Feng
  1 sibling, 1 reply; 155+ messages in thread
From: Tian, Kevin @ 2015-07-08  7:32 UTC (permalink / raw)
  To: Wu, Feng, xen-devel@lists.xen.org
  Cc: Zhang, Yang Z, andrew.cooper3@citrix.com, keir@xen.org,
	george.dunlap@eu.citrix.com, jbeulich@suse.com

> From: Wu, Feng
> Sent: Wednesday, June 24, 2015 1:18 PM
> 
> VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> With VT-d Posted-Interrupts enabled, external interrupts from
> direct-assigned devices can be delivered to guests without VMM
> intervention when guest is running in non-root mode.
> 
> This patch adds feature detection logic for VT-d posted-interrupt.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
> v3:
> - Remove the "if no intremap then no intpost" logic in
>   intel_vtd_setup(), it is covered in the iommu_setup().
> - Add "if no intremap then no intpost" logic in the end
>   of init_vtd_hw() which is called by vtd_resume().
> 
> So the logic exists in the following three places:
> - parse_iommu_param()
> - iommu_setup()
> - init_vtd_hw()
> 
>  xen/drivers/passthrough/vtd/iommu.c | 18 ++++++++++++++++--
>  xen/drivers/passthrough/vtd/iommu.h |  1 +
>  2 files changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/xen/drivers/passthrough/vtd/iommu.c
> b/xen/drivers/passthrough/vtd/iommu.c
> index 9053a1f..4221185 100644
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -2071,6 +2071,9 @@ static int init_vtd_hw(void)
>                  disable_intremap(drhd->iommu);
>      }
> 
> +    if ( !iommu_intremap )
> +        iommu_intpost = 0;
> +
>      /*
>       * Set root entries for each VT-d engine.  After set root entry,
>       * must globally invalidate context cache, and then globally
> @@ -2133,8 +2136,8 @@ int __init intel_vtd_setup(void)
>      }
> 
>      /* We enable the following features only if they are supported by all VT-d
> -     * engines: Snoop Control, DMA passthrough, Queued Invalidation and
> -     * Interrupt Remapping.
> +     * engines: Snoop Control, DMA passthrough, Queued Invalidation, Interrupt
> +     * Remapping, and Posted Interrupt
>       */
>      for_each_drhd_unit ( drhd )
>      {
> @@ -2162,6 +2165,15 @@ int __init intel_vtd_setup(void)
>          if ( iommu_intremap && !ecap_intr_remap(iommu->ecap) )
>              iommu_intremap = 0;
> 
> +        /*
> +         * We cannot use posted interrupt if X86_FEATURE_CX16 is
> +         * not supported, since we count on this feature to
> +         * atomically update 16-byte IRTE in posted format.
> +         */
> +        if ( !iommu_intremap &&
> +             (!cap_intr_post(iommu->cap) || !cpu_has_cx16) )
> +            iommu_intpost = 0;
> +

Looks like a typo here: && -> ||
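
A minimal sketch of the check with the operator corrected as suggested, written as a standalone predicate. The function intpost_usable and its boolean parameters are hypothetical; in the patch the inputs are iommu_intremap, cap_intr_post(iommu->cap) and cpu_has_cx16, and the result clears iommu_intpost.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Posted interrupts are usable only when interrupt remapping is enabled,
 * the VT-d engine reports posted-interrupt support, and the CPU supports
 * cmpxchg16b (needed to update a 16-byte IRTE atomically).
 */
static bool intpost_usable(bool intremap, bool cap_intr_post, bool has_cx16)
{
    /* With "||", any single missing prerequisite disables the feature. */
    if ( !intremap || !cap_intr_post || !has_cx16 )
        return false;

    return true;
}
```

With the original "&&", the feature would be disabled only when remapping is off and one of the other prerequisites is also missing, so a system with remapping on but no CX16 would wrongly keep it enabled.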

Thanks
Kevin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 06/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  2015-06-24  5:18 ` [v3 06/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts Feng Wu
  2015-06-29 15:04   ` Andrew Cooper
@ 2015-07-08  7:48   ` Tian, Kevin
  2015-07-10 13:08   ` Jan Beulich
  2 siblings, 0 replies; 155+ messages in thread
From: Tian, Kevin @ 2015-07-08  7:48 UTC (permalink / raw)
  To: Wu, Feng, xen-devel@lists.xen.org
  Cc: Zhang, Yang Z, andrew.cooper3@citrix.com, keir@xen.org,
	george.dunlap@eu.citrix.com, jbeulich@suse.com

> From: Wu, Feng
> Sent: Wednesday, June 24, 2015 1:18 PM
> 
> Extend struct pi_desc according to VT-d Posted-Interrupts Spec.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>

Acked-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 07/15] vmx: Initialize VT-d Posted-Interrupts Descriptor
  2015-06-24  5:18 ` [v3 07/15] vmx: Initialize VT-d Posted-Interrupts Descriptor Feng Wu
  2015-06-29 15:32   ` Andrew Cooper
@ 2015-07-08  7:53   ` Tian, Kevin
  1 sibling, 0 replies; 155+ messages in thread
From: Tian, Kevin @ 2015-07-08  7:53 UTC (permalink / raw)
  To: Wu, Feng, xen-devel@lists.xen.org
  Cc: Zhang, Yang Z, andrew.cooper3@citrix.com, keir@xen.org,
	george.dunlap@eu.citrix.com, jbeulich@suse.com

> From: Wu, Feng
> Sent: Wednesday, June 24, 2015 1:18 PM
> 
> This patch initializes the VT-d Posted-interrupt Descriptor.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>

Acked-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 05/15] vt-d: VT-d Posted-Interrupts feature detection
  2015-07-08  7:32   ` Tian, Kevin
@ 2015-07-08  8:00     ` Wu, Feng
  0 siblings, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-08  8:00 UTC (permalink / raw)
  To: Tian, Kevin, xen-devel@lists.xen.org
  Cc: Wu, Feng, george.dunlap@eu.citrix.com, andrew.cooper3@citrix.com,
	jbeulich@suse.com, Zhang, Yang Z, keir@xen.org



> -----Original Message-----
> From: Tian, Kevin
> Sent: Wednesday, July 08, 2015 3:32 PM
> To: Wu, Feng; xen-devel@lists.xen.org
> Cc: keir@xen.org; jbeulich@suse.com; andrew.cooper3@citrix.com; Zhang,
> Yang Z; george.dunlap@eu.citrix.com
> Subject: RE: [v3 05/15] vt-d: VT-d Posted-Interrupts feature detection
> 
> > From: Wu, Feng
> > Sent: Wednesday, June 24, 2015 1:18 PM
> >
> > VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> > With VT-d Posted-Interrupts enabled, external interrupts from
> > direct-assigned devices can be delivered to guests without VMM
> > intervention when guest is running in non-root mode.
> >
> > This patch adds feature detection logic for VT-d posted-interrupt.
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > ---
> > v3:
> > - Remove the "if no intremap then no intpost" logic in
> >   intel_vtd_setup(), it is covered in the iommu_setup().
> > - Add "if no intremap then no intpost" logic in the end
> >   of init_vtd_hw() which is called by vtd_resume().
> >
> > So the logic exists in the following three places:
> > - parse_iommu_param()
> > - iommu_setup()
> > - init_vtd_hw()
> >
> >  xen/drivers/passthrough/vtd/iommu.c | 18 ++++++++++++++++--
> >  xen/drivers/passthrough/vtd/iommu.h |  1 +
> >  2 files changed, 17 insertions(+), 2 deletions(-)
> >
> > diff --git a/xen/drivers/passthrough/vtd/iommu.c
> > b/xen/drivers/passthrough/vtd/iommu.c
> > index 9053a1f..4221185 100644
> > --- a/xen/drivers/passthrough/vtd/iommu.c
> > +++ b/xen/drivers/passthrough/vtd/iommu.c
> > @@ -2071,6 +2071,9 @@ static int init_vtd_hw(void)
> >                  disable_intremap(drhd->iommu);
> >      }
> >
> > +    if ( !iommu_intremap )
> > +        iommu_intpost = 0;
> > +
> >      /*
> >       * Set root entries for each VT-d engine.  After set root entry,
> >       * must globally invalidate context cache, and then globally
> > @@ -2133,8 +2136,8 @@ int __init intel_vtd_setup(void)
> >      }
> >
> >      /* We enable the following features only if they are supported by all
> VT-d
> > -     * engines: Snoop Control, DMA passthrough, Queued Invalidation and
> > -     * Interrupt Remapping.
> > +     * engines: Snoop Control, DMA passthrough, Queued Invalidation,
> Interrupt
> > +     * Remapping, and Posted Interrupt
> >       */
> >      for_each_drhd_unit ( drhd )
> >      {
> > @@ -2162,6 +2165,15 @@ int __init intel_vtd_setup(void)
> >          if ( iommu_intremap && !ecap_intr_remap(iommu->ecap) )
> >              iommu_intremap = 0;
> >
> > +        /*
> > +         * We cannot use posted interrupt if X86_FEATURE_CX16 is
> > +         * not supported, since we count on this feature to
> > +         * atomically update 16-byte IRTE in posted format.
> > +         */
> > +        if ( !iommu_intremap &&
> > +             (!cap_intr_post(iommu->cap) || !cpu_has_cx16) )
> > +            iommu_intpost = 0;
> > +
> 
> Looks a typo here. &&->||

Yes, this is a typo. Thanks for the review.

Thanks,
Feng
> 
> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 03/15] Add cmpxchg16b support for x86-64
  2015-07-08  7:06     ` Wu, Feng
@ 2015-07-08  8:12       ` Jan Beulich
  2015-07-08  8:33         ` Wu, Feng
  2015-07-08  8:50         ` Andrew Cooper
  0 siblings, 2 replies; 155+ messages in thread
From: Jan Beulich @ 2015-07-08  8:12 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir@xen.org, george.dunlap@eu.citrix.com,
	Andrew Cooper, xen-devel@lists.xen.org, YangZ Zhang

>>> On 08.07.15 at 09:06, <feng.wu@intel.com> wrote:

> 
>> -----Original Message-----
>> From: xen-devel-bounces@lists.xen.org 
>> [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of Andrew Cooper
>> Sent: Thursday, June 25, 2015 2:35 AM
>> To: Wu, Feng; xen-devel@lists.xen.org 
>> Cc: george.dunlap@eu.citrix.com; Zhang, Yang Z; Tian, Kevin; keir@xen.org;
>> jbeulich@suse.com 
>> Subject: Re: [Xen-devel] [v3 03/15] Add cmpxchg16b support for x86-64
>> 
>> On 24/06/15 06:18, Feng Wu wrote:
>> > This patch adds cmpxchg16b support for x86-64, so software
>> > can perform 128-bit atomic write/read.
>> >
>> > Signed-off-by: Feng Wu <feng.wu@intel.com>
>> > ---
>> > v3:
>> > Newly added.
>> >
>> >  xen/include/asm-x86/x86_64/system.h | 28
>> ++++++++++++++++++++++++++++
>> >  xen/include/xen/types.h             |  5 +++++
>> >  2 files changed, 33 insertions(+)
>> >
>> > diff --git a/xen/include/asm-x86/x86_64/system.h
>> b/xen/include/asm-x86/x86_64/system.h
>> > index 662813a..a910d00 100644
>> > --- a/xen/include/asm-x86/x86_64/system.h
>> > +++ b/xen/include/asm-x86/x86_64/system.h
>> > @@ -6,6 +6,34 @@
>> >                                     (unsigned
>> long)(n),sizeof(*(ptr))))
>> >
>> >  /*
>> > + * Atomic 16 bytes compare and exchange.  Compare OLD with MEM, if
>> > + * identical, store NEW in MEM.  Return the initial value in MEM.
>> > + * Success is indicated by comparing RETURN with OLD.
>> > + *
>> > + * This function can only be called when cpu_has_cx16 is true.
>> > + */
>> > +
>> > +static always_inline uint128_t __cmpxchg16b(
>> > +    volatile void *ptr, uint128_t old, uint128_t new)
>> 
>> It is not nice for register scheduling taking uint128_t's by value.
>> Instead, I would pass them by pointer and let the inlining sort the
>> eventual references out.
>> 
>> > +{
>> > +    uint128_t prev;
>> > +
>> > +    ASSERT(cpu_has_cx16);
>> 
>> Given that if this assertion were to fail, cmpxchg16b would fail with
>> #UD, I would hand-code a asm_fixup section which in turn panics.  This
>> avoids a situation where non-debug builds could die with an unqualified
>> #UD exception.
> 
> Is there an existing way to panic the hypervisor from assembler code? I
> couldn't find one; I would appreciate it if you could point it out.

I'm not convinced such a #UD would be a significant problem: Looking
at the disassembly will show the cause right away. The out of line
ud2-s in some of VMX'es inline assembly wrappers are far worse.

As to panic()ing from assembly code:

	movq	$<string-label>, %rdi
	call	panic

>> Also, you must enforce 16-byte alignment of the memory reference, as
>> described in the manual.
> 
> What should I do if the caller passes data that is not 16-byte aligned
> (struct iremap_entry in this case)? Does this mean I need to define
> it like this?
> 
> struct iremap_entry {
> 
> ......
> 
> } __attribute__ ((aligned (16)));

How would that help? The table entries hardware uses are supposed
to be 16-byte aligned anyway, aren't they? I think Andrew's "enforce"
really means ASSERT() or BUG_ON(), again to avoid an unqualified
exception. However - see above.
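
As a minimal sketch of such a check, the alignment of the memory operand
could be tested as below. The helper name is_aligned_16() is hypothetical
and not part of the patch; in the hypervisor the test would sit inside an
ASSERT() or BUG_ON() ahead of the cmpxchg16b.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): true if ptr is suitable as a
 * cmpxchg16b memory operand, i.e. its low four address bits are clear.
 */
static inline bool is_aligned_16(const volatile void *ptr)
{
    return ((uintptr_t)ptr & 0xf) == 0;
}
```

Since the remapping table base is page aligned and each entry is 16 bytes,
a check like this is expected to hold for every table entry.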

Plus, all that said, without having seen the actual use sites of
cmpxchg16b yet, I'm not at all convinced we really need this patch.

Jan


* Re: [v3 03/15] Add cmpxchg16b support for x86-64
  2015-07-08  8:12       ` Jan Beulich
@ 2015-07-08  8:33         ` Wu, Feng
  2015-07-08  8:43           ` Jan Beulich
  2015-07-08  8:50         ` Andrew Cooper
  1 sibling, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-07-08  8:33 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	Andrew Cooper, xen-devel@lists.xen.org, Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, July 08, 2015 4:13 PM
> To: Wu, Feng
> Cc: Andrew Cooper; george.dunlap@eu.citrix.com; Tian, Kevin; Zhang, Yang Z;
> xen-devel@lists.xen.org; keir@xen.org
> Subject: RE: [Xen-devel] [v3 03/15] Add cmpxchg16b support for x86-64
> 
> >>> On 08.07.15 at 09:06, <feng.wu@intel.com> wrote:
> 
> >
> >> -----Original Message-----
> >> From: xen-devel-bounces@lists.xen.org
> >> [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of Andrew Cooper
> >> Sent: Thursday, June 25, 2015 2:35 AM
> >> To: Wu, Feng; xen-devel@lists.xen.org
> >> Cc: george.dunlap@eu.citrix.com; Zhang, Yang Z; Tian, Kevin; keir@xen.org;
> >> jbeulich@suse.com
> >> Subject: Re: [Xen-devel] [v3 03/15] Add cmpxchg16b support for x86-64
> >>
> >> On 24/06/15 06:18, Feng Wu wrote:
> >> > This patch adds cmpxchg16b support for x86-64, so software
> >> > can perform 128-bit atomic write/read.
> >> >
> >> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> >> > ---
> >> > v3:
> >> > Newly added.
> >> >
> >> >  xen/include/asm-x86/x86_64/system.h | 28
> >> ++++++++++++++++++++++++++++
> >> >  xen/include/xen/types.h             |  5 +++++
> >> >  2 files changed, 33 insertions(+)
> >> >
> >> > diff --git a/xen/include/asm-x86/x86_64/system.h
> >> b/xen/include/asm-x86/x86_64/system.h
> >> > index 662813a..a910d00 100644
> >> > --- a/xen/include/asm-x86/x86_64/system.h
> >> > +++ b/xen/include/asm-x86/x86_64/system.h
> >> > @@ -6,6 +6,34 @@
> >> >                                     (unsigned
> >> long)(n),sizeof(*(ptr))))
> >> >
> >> >  /*
> >> > + * Atomic 16 bytes compare and exchange.  Compare OLD with MEM, if
> >> > + * identical, store NEW in MEM.  Return the initial value in MEM.
> >> > + * Success is indicated by comparing RETURN with OLD.
> >> > + *
> >> > + * This function can only be called when cpu_has_cx16 is true.
> >> > + */
> >> > +
> >> > +static always_inline uint128_t __cmpxchg16b(
> >> > +    volatile void *ptr, uint128_t old, uint128_t new)
> >>
> >> It is not nice for register scheduling taking uint128_t's by value.
> >> Instead, I would pass them by pointer and let the inlining sort the
> >> eventual references out.
> >>
> >> > +{
> >> > +    uint128_t prev;
> >> > +
> >> > +    ASSERT(cpu_has_cx16);
> >>
> >> Given that if this assertion were to fail, cmpxchg16b would fail with
> >> #UD, I would hand-code a asm_fixup section which in turn panics.  This
> >> avoids a situation where non-debug builds could die with an unqualified
> >> #UD exception.
> >
> > Is there an existing way to panic the hypervisor in assembler code, I
> > don't find it, it would be appreciated if you can point it out.
> 
> I'm not convinced such a #UD would be a significant problem: Looking
> at the disassembly will show the cause right away. The out of line
> ud2-s in some of VMX'es inline assembly wrappers are far worse.
> 

So, do you agree with the fixup section or not?

> As to panic()ing from assembly code:
> 
> 	movq	$<string-label>, %rdi
> 	call	panic
> 
> >> Also, you must enforce 16-byte alignment of the memory reference, as
> >> described in the manual.
> >
> > What should I do if the caller passes an non 16-byte alignment data
> > (struct iremap_entry in this case) ? Do this mean I need to define
> > it like this?
> >
> > struct iremap_entry {
> >
> > ......
> >
> > } __attribute__ ((aligned (16)));
> 
> How would that help? The table entries hardware uses are supposed
> to be 16-byte aligned anyway, aren't they?

Oh, yes, the base address of the remapping table is 4K aligned.

> I think Andrew's "enforce"
> really means ASSERT() or BUG_ON(), again to avoid an unqualified
> exception. However - see above.
> 
> Plus, all that said, without having seen the actual use sites of
> cmpxchg16b yet, I'm not at all convinced we really need this patch.

After introducing the posted format in the IRTE, some fields span both the
high 64 bits and the low 64 bits, such as pda_h and pda_l. How do we make
sure that updating the pda field is atomic?

Thanks,
Feng

> 
> Jan


* Re: [v3 03/15] Add cmpxchg16b support for x86-64
  2015-07-08  8:33         ` Wu, Feng
@ 2015-07-08  8:43           ` Jan Beulich
  2015-07-08  8:50             ` Wu, Feng
  0 siblings, 1 reply; 155+ messages in thread
From: Jan Beulich @ 2015-07-08  8:43 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir@xen.org, george.dunlap@eu.citrix.com,
	Andrew Cooper, xen-devel@lists.xen.org, Yang Z Zhang

>>> On 08.07.15 at 10:33, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Wednesday, July 08, 2015 4:13 PM
>> >>> On 08.07.15 at 09:06, <feng.wu@intel.com> wrote:
>> >> From: xen-devel-bounces@lists.xen.org 
>> >> [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of Andrew Cooper
>> >> Sent: Thursday, June 25, 2015 2:35 AM
>> >> On 24/06/15 06:18, Feng Wu wrote:
>> >> > +{
>> >> > +    uint128_t prev;
>> >> > +
>> >> > +    ASSERT(cpu_has_cx16);
>> >>
>> >> Given that if this assertion were to fail, cmpxchg16b would fail with
>> >> #UD, I would hand-code a asm_fixup section which in turn panics.  This
>> >> avoids a situation where non-debug builds could die with an unqualified
>> >> #UD exception.
>> >
>> > Is there an existing way to panic the hypervisor in assembler code, I
>> > don't find it, it would be appreciated if you can point it out.
>> 
>> I'm not convinced such a #UD would be a significant problem: Looking
>> at the disassembly will show the cause right away. The out of line
>> ud2-s in some of VMX'es inline assembly wrappers are far worse.
> 
> So, do you agree with the fixup section or not?

I'd rather not go that route, unless Andrew or you manage to
convince me otherwise.

>> I think Andrew's "enforce"
>> really means ASSERT() or BUG_ON(), again to avoid an unqualified
>> exception. However - see above.
>> 
>> Plus, all that said, without having seen the actual use sites of
>> cmpxchg16b yet, I'm not at all convinced we really need this patch.
> 
> After introducing posted format in IRTE, some fields exist in both the
> High 64 bit and the low 64 bit,such as pda_h and pda_l, how to make
> sure it is atomic when updating the pda field?

Is there a need for updating these _after_ initially setting up an
entry?

Jan


* Re: [v3 03/15] Add cmpxchg16b support for x86-64
  2015-07-08  8:12       ` Jan Beulich
  2015-07-08  8:33         ` Wu, Feng
@ 2015-07-08  8:50         ` Andrew Cooper
  1 sibling, 0 replies; 155+ messages in thread
From: Andrew Cooper @ 2015-07-08  8:50 UTC (permalink / raw)
  To: Jan Beulich, Feng Wu
  Cc: george.dunlap@eu.citrix.com, YangZ Zhang, Kevin Tian,
	keir@xen.org, xen-devel@lists.xen.org

On 08/07/2015 09:12, Jan Beulich wrote:
>
>>>
>>>> +{
>>>> +    uint128_t prev;
>>>> +
>>>> +    ASSERT(cpu_has_cx16);
>>> Given that if this assertion were to fail, cmpxchg16b would fail with
>>> #UD, I would hand-code a asm_fixup section which in turn panics.  This
>>> avoids a situation where non-debug builds could die with an unqualified
>>> #UD exception.
>> Is there an existing way to panic the hypervisor in assembler code, I
>> don't find it, it would be appreciated if you can point it out.

When I asked for this, I was thinking of having an assertion frame with
the cmpxchg16b instruction in the place of the regular ud2a.  This way,
if it were to fail with #UD, there is a more useful error message.

However, there is no easy way of doing this at the moment, and it is an
obscure set of circumstances, so probably not worth the hassle.

> I'm not convinced such a #UD would be a significant problem: Looking
> at the disassembly will show the cause right away. The out of line
> ud2-s in some of VMX'es inline assembly wrappers are far worse.

Unqualified #UDs are harder to debug than qualified ones, and I have an
annoying habit of hitting them.  In some copious free time, I want to
continue the work started with c/s 0a3e27e and 881d6bf.  git grep
suggests there isn't actually too much to fix up in this regard.

~Andrew


* Re: [v3 03/15] Add cmpxchg16b support for x86-64
  2015-07-08  8:43           ` Jan Beulich
@ 2015-07-08  8:50             ` Wu, Feng
  0 siblings, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-08  8:50 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	Andrew Cooper, xen-devel@lists.xen.org, Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, July 08, 2015 4:44 PM
> To: Wu, Feng
> Cc: Andrew Cooper; george.dunlap@eu.citrix.com; Tian, Kevin; Zhang, Yang Z;
> xen-devel@lists.xen.org; keir@xen.org
> Subject: RE: [Xen-devel] [v3 03/15] Add cmpxchg16b support for x86-64
> 
> >>> On 08.07.15 at 10:33, <feng.wu@intel.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Wednesday, July 08, 2015 4:13 PM
> >> >>> On 08.07.15 at 09:06, <feng.wu@intel.com> wrote:
> >> >> From: xen-devel-bounces@lists.xen.org
> >> >> [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of Andrew Cooper
> >> >> Sent: Thursday, June 25, 2015 2:35 AM
> >> >> On 24/06/15 06:18, Feng Wu wrote:
> >> >> > +{
> >> >> > +    uint128_t prev;
> >> >> > +
> >> >> > +    ASSERT(cpu_has_cx16);
> >> >>
> >> >> Given that if this assertion were to fail, cmpxchg16b would fail with
> >> >> #UD, I would hand-code a asm_fixup section which in turn panics.  This
> >> >> avoids a situation where non-debug builds could die with an unqualified
> >> >> #UD exception.
> >> >
> >> > Is there an existing way to panic the hypervisor in assembler code, I
> >> > don't find it, it would be appreciated if you can point it out.
> >>
> >> I'm not convinced such a #UD would be a significant problem: Looking
> >> at the disassembly will show the cause right away. The out of line
> >> ud2-s in some of VMX'es inline assembly wrappers are far worse.
> >
> > So, do you agree with the fixup section or not?
> 
> > I'd rather not go that route, unless Andrew or you manage to
> convince me otherwise.
> 
> >> I think Andrew's "enforce"
> >> really means ASSERT() or BUG_ON(), again to avoid an unqualified
> >> exception. However - see above.
> >>
> >> Plus, all that said, without having seen the actual use sites of
> >> cmpxchg16b yet, I'm not at all convinced we really need this patch.
> >
> > After introducing posted format in IRTE, some fields exist in both the
> > High 64 bit and the low 64 bit,such as pda_h and pda_l, how to make
> > sure it is atomic when updating the pda field?
> 
> Is there a need for updating these _after_ initially setting up an
> entry?

Each time the guest sets the affinity, we need to change this
field to refer to the new destination.
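
A minimal sketch of such an update follows, assuming the entry can be
viewed as two 64-bit halves. The names struct entry128 and
entry128_cmpxchg() are hypothetical (the real struct iremap_entry uses
bitfields), and the operands are passed by pointer as suggested earlier in
the thread. The non-x86 branch is not atomic and is only there so the
sketch compiles elsewhere.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 128-bit entry; cmpxchg16b needs a 16-byte aligned operand. */
struct entry128 {
    uint64_t lo;
    uint64_t hi;
} __attribute__((aligned(16)));

/*
 * Replace *ptr with *next if it still equals *old.  Returns true on
 * success.  On failure, *old is refreshed with the value found in memory.
 * The caller must know the CPU supports cmpxchg16b.
 */
static inline bool entry128_cmpxchg(volatile struct entry128 *ptr,
                                    struct entry128 *old,
                                    const struct entry128 *next)
{
#if defined(__x86_64__)
    unsigned char ok;

    __asm__ __volatile__ (
        "lock cmpxchg16b %1\n\t"   /* compares RDX:RAX, stores RCX:RBX */
        "sete %0"                  /* ZF is set when the store happened */
        : "=q" (ok), "+m" (*ptr), "+a" (old->lo), "+d" (old->hi)
        : "b" (next->lo), "c" (next->hi)
        : "memory", "cc" );

    return ok;
#else
    /* NOT atomic: placeholder so the sketch builds on other targets. */
    if ( ptr->lo == old->lo && ptr->hi == old->hi )
    {
        ptr->lo = next->lo;
        ptr->hi = next->hi;
        return true;
    }
    old->lo = ptr->lo;
    old->hi = ptr->hi;
    return false;
#endif
}
```

With a helper of this shape, both halves of the pda field change in one
step, so the hardware never observes a new pda_l paired with an old pda_h.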

Thanks,
Feng

> 
> Jan


* Re: [v3 08/15] Suppress posting interrupts when 'SN' is set
  2015-06-24  5:18 ` [v3 08/15] Suppress posting interrupts when 'SN' is set Feng Wu
  2015-06-29 15:41   ` Andrew Cooper
@ 2015-07-08  9:06   ` Tian, Kevin
  2015-07-08 10:11     ` Wu, Feng
  2015-07-10 13:20   ` Jan Beulich
  2 siblings, 1 reply; 155+ messages in thread
From: Tian, Kevin @ 2015-07-08  9:06 UTC (permalink / raw)
  To: Wu, Feng, xen-devel@lists.xen.org
  Cc: Zhang, Yang Z, andrew.cooper3@citrix.com, keir@xen.org,
	george.dunlap@eu.citrix.com, jbeulich@suse.com

> From: Wu, Feng
> Sent: Wednesday, June 24, 2015 1:18 PM
> 
> Currently, we don't support urgent interrupt, all interrupts
> are recognized as non-urgent interrupt, so we cannot send
> posted-interrupt when 'SN' is set.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
> v3:
> use cmpxchg to test SN/ON and set ON
> 
>  xen/arch/x86/hvm/vmx/vmx.c | 32 ++++++++++++++++++++++++++++----
>  1 file changed, 28 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 0837627..b94ef6a 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -1686,6 +1686,8 @@ static void __vmx_deliver_posted_interrupt(struct vcpu *v)
> 
>  static void vmx_deliver_posted_intr(struct vcpu *v, u8 vector)
>  {
> +    struct pi_desc old, new, prev;
> +

move to 'else if'.

>      if ( pi_test_and_set_pir(vector, &v->arch.hvm_vmx.pi_desc) )
>          return;
> 
> @@ -1698,13 +1700,35 @@ static void vmx_deliver_posted_intr(struct vcpu *v, u8
> vector)
>           */
>          pi_set_on(&v->arch.hvm_vmx.pi_desc);
>      }
> -    else if ( !pi_test_and_set_on(&v->arch.hvm_vmx.pi_desc) )
> +    else
>      {
> +        prev.control = 0;
> +
> +        do {
> +            old.control = v->arch.hvm_vmx.pi_desc.control &
> +                          ~(1 << POSTED_INTR_ON | 1 << POSTED_INTR_SN);
> +            new.control = v->arch.hvm_vmx.pi_desc.control |
> +                          1 << POSTED_INTR_ON;
> +
> +            /*
> +             * Currently, we don't support urgent interrupt, all
> +             * interrupts are recognized as non-urgent interrupt,
> +             * so we cannot send posted-interrupt when 'SN' is set.
> +             * Besides that, if 'ON' is already set, we cannot set
> +             * posted-interrupts as well.
> +             */
> +            if ( prev.sn || prev.on )
> +            {
> +                vcpu_kick(v);
> +                return;
> +            }

Would it make more sense to move the above check after the cmpxchg?

> +
> +            prev.control = cmpxchg(&v->arch.hvm_vmx.pi_desc.control,
> +                                   old.control, new.control);
> +        } while ( prev.control != old.control );
> +
>          __vmx_deliver_posted_interrupt(v);
> -        return;
>      }
> -
> -    vcpu_kick(v);
>  }
> 
>  static void vmx_sync_pir_to_irr(struct vcpu *v)
> --
> 2.1.0


* Re: [v3 09/15] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts
  2015-06-24  5:18 ` [v3 09/15] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts Feng Wu
  2015-06-29 16:04   ` Andrew Cooper
@ 2015-07-08  9:10   ` Tian, Kevin
  2015-07-10 13:27   ` Jan Beulich
  2 siblings, 0 replies; 155+ messages in thread
From: Tian, Kevin @ 2015-07-08  9:10 UTC (permalink / raw)
  To: Wu, Feng, xen-devel@lists.xen.org
  Cc: Zhang, Yang Z, andrew.cooper3@citrix.com, keir@xen.org,
	george.dunlap@eu.citrix.com, jbeulich@suse.com

> From: Wu, Feng
> Sent: Wednesday, June 24, 2015 1:18 PM
> 
> Extend struct iremap_entry according to VT-d Posted-Interrupts Spec.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>

Acked-by: Kevin Tian <kevin.tian@intel.com>


* Re: [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used
  2015-06-24  5:18 ` [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used Feng Wu
  2015-06-29 16:22   ` Andrew Cooper
@ 2015-07-08  9:59   ` Tian, Kevin
  2015-07-08 10:12     ` Wu, Feng
  2015-07-10 14:01   ` Jan Beulich
  2 siblings, 1 reply; 155+ messages in thread
From: Tian, Kevin @ 2015-07-08  9:59 UTC (permalink / raw)
  To: Wu, Feng, xen-devel@lists.xen.org
  Cc: Zhang, Yang Z, andrew.cooper3@citrix.com, keir@xen.org,
	george.dunlap@eu.citrix.com, jbeulich@suse.com

> From: Wu, Feng
> Sent: Wednesday, June 24, 2015 1:18 PM
> 
> This patch adds an API which is used to update the IRTE
> for posted-interrupt when guest changes MSI/MSI-X information.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>

Acked-by: Kevin Tian <kevin.tian@intel.com>, with one small comment:

> +int pi_update_irte(struct vcpu *v, struct pirq *pirq, uint8_t gvec)
> +{
> +    struct irq_desc *desc;
> +    struct msi_desc *msi_desc;
> +    int remap_index;
> +    int rc = 0;
> +    struct pci_dev *pci_dev;
> +    struct acpi_drhd_unit *drhd;
> +    struct iommu *iommu;
> +    struct ir_ctrl *ir_ctrl;
> +    struct iremap_entry *iremap_entries = NULL, *p = NULL;
> +    struct iremap_entry new_ire;
> +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +    unsigned long flags;
> +    uint128_t old_ire, ret;
> +
> +    desc = pirq_spin_lock_irq_desc(pirq, NULL);
> +    if ( !desc )
> +        return -ENOMEM;

-EINVAL?


* Re: [v3 08/15] Suppress posting interrupts when 'SN' is set
  2015-07-08  9:06   ` Tian, Kevin
@ 2015-07-08 10:11     ` Wu, Feng
  2015-07-08 11:31       ` Tian, Kevin
  0 siblings, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-07-08 10:11 UTC (permalink / raw)
  To: Tian, Kevin, xen-devel@lists.xen.org
  Cc: Wu, Feng, george.dunlap@eu.citrix.com, andrew.cooper3@citrix.com,
	jbeulich@suse.com, Zhang, Yang Z, keir@xen.org



> -----Original Message-----
> From: Tian, Kevin
> Sent: Wednesday, July 08, 2015 5:06 PM
> To: Wu, Feng; xen-devel@lists.xen.org
> Cc: keir@xen.org; jbeulich@suse.com; andrew.cooper3@citrix.com; Zhang,
> Yang Z; george.dunlap@eu.citrix.com
> Subject: RE: [v3 08/15] Suppress posting interrupts when 'SN' is set
> 
> > From: Wu, Feng
> > Sent: Wednesday, June 24, 2015 1:18 PM
> >
> > Currently, we don't support urgent interrupt, all interrupts
> > are recognized as non-urgent interrupt, so we cannot send
> > posted-interrupt when 'SN' is set.
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > ---
> > v3:
> > use cmpxchg to test SN/ON and set ON
> >
> >  xen/arch/x86/hvm/vmx/vmx.c | 32 ++++++++++++++++++++++++++++----
> >  1 file changed, 28 insertions(+), 4 deletions(-)
> >
> > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> > index 0837627..b94ef6a 100644
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -1686,6 +1686,8 @@ static void __vmx_deliver_posted_interrupt(struct
> vcpu *v)
> >
> >  static void vmx_deliver_posted_intr(struct vcpu *v, u8 vector)
> >  {
> > +    struct pi_desc old, new, prev;
> > +
> 
> move to 'else if'.
> 
> >      if ( pi_test_and_set_pir(vector, &v->arch.hvm_vmx.pi_desc) )
> >          return;
> >
> > @@ -1698,13 +1700,35 @@ static void vmx_deliver_posted_intr(struct vcpu
> *v, u8
> > vector)
> >           */
> >          pi_set_on(&v->arch.hvm_vmx.pi_desc);
> >      }
> > -    else if ( !pi_test_and_set_on(&v->arch.hvm_vmx.pi_desc) )
> > +    else
> >      {
> > +        prev.control = 0;
> > +
> > +        do {
> > +            old.control = v->arch.hvm_vmx.pi_desc.control &
> > +                          ~(1 << POSTED_INTR_ON | 1 <<
> POSTED_INTR_SN);
> > +            new.control = v->arch.hvm_vmx.pi_desc.control |
> > +                          1 << POSTED_INTR_ON;
> > +
> > +            /*
> > +             * Currently, we don't support urgent interrupt, all
> > +             * interrupts are recognized as non-urgent interrupt,
> > +             * so we cannot send posted-interrupt when 'SN' is set.
> > +             * Besides that, if 'ON' is already set, we cannot set
> > +             * posted-interrupts as well.
> > +             */
> > +            if ( prev.sn || prev.on )
> > +            {
> > +                vcpu_kick(v);
> > +                return;
> > +            }
> 
> would it make more sense to move above check after cmpxchg?

My original idea is that we only need to do the check when
prev.control != old.control, which means the cmpxchg did not
complete successfully. If we add the check between the cmpxchg
and while ( prev.control != old.control ), the logic seems less
clear, since we don't need to check prev.sn and prev.on
when cmpxchg succeeds in setting the new value.
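
For comparison, here is a standalone sketch of the same idea written with
C11 atomics rather than Xen's cmpxchg(). The name pi_try_set_on() and the
bit positions PI_ON and PI_SN are assumptions made for illustration. The
check runs on the most recently observed value, so it is only repeated
when the compare-exchange fails.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only. */
#define PI_ON (UINT64_C(1) << 0)   /* outstanding notification */
#define PI_SN (UINT64_C(1) << 1)   /* suppress notification */

/*
 * Try to set ON.  Returns true if ON was newly set, in which case the
 * caller would send the notification.  Returns false if SN or ON was
 * already set, in which case the caller would kick the vCPU instead.
 */
static bool pi_try_set_on(_Atomic uint64_t *control)
{
    uint64_t old = atomic_load(control);

    do {
        if ( old & (PI_ON | PI_SN) )
            return false;
        /* On failure, 'old' is reloaded with the current value. */
    } while ( !atomic_compare_exchange_weak(control, &old, old | PI_ON) );

    return true;
}
```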

Thanks,
Feng

> 
> > +
> > +            prev.control = cmpxchg(&v->arch.hvm_vmx.pi_desc.control,
> > +                                   old.control, new.control);
> > +        } while ( prev.control != old.control );
> > +
> >          __vmx_deliver_posted_interrupt(v);
> > -        return;
> >      }
> > -
> > -    vcpu_kick(v);
> >  }
> >
> >  static void vmx_sync_pir_to_irr(struct vcpu *v)
> > --
> > 2.1.0


* Re: [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used
  2015-07-08  9:59   ` Tian, Kevin
@ 2015-07-08 10:12     ` Wu, Feng
  0 siblings, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-08 10:12 UTC (permalink / raw)
  To: Tian, Kevin, xen-devel@lists.xen.org
  Cc: Wu, Feng, george.dunlap@eu.citrix.com, andrew.cooper3@citrix.com,
	jbeulich@suse.com, Zhang, Yang Z, keir@xen.org



> -----Original Message-----
> From: Tian, Kevin
> Sent: Wednesday, July 08, 2015 6:00 PM
> To: Wu, Feng; xen-devel@lists.xen.org
> Cc: keir@xen.org; jbeulich@suse.com; andrew.cooper3@citrix.com; Zhang,
> Yang Z; george.dunlap@eu.citrix.com
> Subject: RE: [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used
> 
> > From: Wu, Feng
> > Sent: Wednesday, June 24, 2015 1:18 PM
> >
> > This patch adds an API which is used to update the IRTE
> > for posted-interrupt when guest changes MSI/MSI-X information.
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> 
> Acked-by: Kevin Tian <kevin.tian@intel.com>, with one small comment:
> 
> > +int pi_update_irte(struct vcpu *v, struct pirq *pirq, uint8_t gvec)
> > +{
> > +    struct irq_desc *desc;
> > +    struct msi_desc *msi_desc;
> > +    int remap_index;
> > +    int rc = 0;
> > +    struct pci_dev *pci_dev;
> > +    struct acpi_drhd_unit *drhd;
> > +    struct iommu *iommu;
> > +    struct ir_ctrl *ir_ctrl;
> > +    struct iremap_entry *iremap_entries = NULL, *p = NULL;
> > +    struct iremap_entry new_ire;
> > +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > +    unsigned long flags;
> > +    uint128_t old_ire, ret;
> > +
> > +    desc = pirq_spin_lock_irq_desc(pirq, NULL);
> > +    if ( !desc )
> > +        return -ENOMEM;
> 
> -EINVAL?
> 

I think -EINVAL is reasonable.

Thanks,
Feng


* Re: [v3 11/15] Update IRTE according to guest interrupt config changes
  2015-06-24  5:18 ` [v3 11/15] Update IRTE according to guest interrupt config changes Feng Wu
  2015-06-29 16:46   ` Andrew Cooper
@ 2015-07-08 10:22   ` Tian, Kevin
  2015-07-08 10:31     ` Wu, Feng
  2015-07-10 14:23   ` Jan Beulich
  2 siblings, 1 reply; 155+ messages in thread
From: Tian, Kevin @ 2015-07-08 10:22 UTC (permalink / raw)
  To: Wu, Feng, xen-devel@lists.xen.org
  Cc: Zhang, Yang Z, andrew.cooper3@citrix.com, keir@xen.org,
	george.dunlap@eu.citrix.com, jbeulich@suse.com

> From: Wu, Feng
> Sent: Wednesday, June 24, 2015 1:18 PM
> 
> When guest changes its interrupt configuration (such as, vector, etc.)
> for direct-assigned devices, we need to update the associated IRTE
> with the new guest vector, so external interrupts from the assigned
> devices can be injected to guests without VM-Exit.
> 
> For lowest-priority interrupts, we use vector-hashing mechanism to find
> the destination vCPU. This follows the hardware behavior, since modern
> Intel CPUs use vector hashing to handle the lowest-priority interrupt.
> 
> For multicast/broadcast vCPU, we cannot handle it via interrupt posting,
> still use interrupt remapping.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
> v3:
> - Use bitmap to store the all the possible destination vCPUs of an
> interrupt, then trying to find the right destination from the bitmap
> - Typo and some small changes
> 
>  xen/drivers/passthrough/io.c | 96
> +++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 95 insertions(+), 1 deletion(-)
> 
> diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> index 9b77334..18e24e1 100644
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -26,6 +26,7 @@
>  #include <asm/hvm/iommu.h>
>  #include <asm/hvm/support.h>
>  #include <xen/hvm/irq.h>
> +#include <asm/io_apic.h>
> 
>  static DEFINE_PER_CPU(struct list_head, dpci_list);
> 
> @@ -199,6 +200,78 @@ void free_hvm_irq_dpci(struct hvm_irq_dpci *dpci)
>      xfree(dpci);
>  }
> 
> +/*
> + * The purpose of this routine is to find the right destination vCPU for
> + * an interrupt which will be delivered by VT-d posted-interrupt. There
> + * are several cases as below:

If you aim to have this interface common to more usages, don't restrict to
VT-d posted-interrupt which should be just an example.

> + *
> + * - For lowest-priority interrupts, we find the destination vCPU from the
> + *   guest vector using vector-hashing mechanism and return true. This follows
> + *   the hardware behavior, since modern Intel CPUs use vector hashing to
> + *   handle the lowest-priority interrupt.

Does AMD use same hashing mechanism? Can this interface be reused by
other IOMMU type or it's an Intel specific implementation?

> + * - Otherwise, for single destination interrupt, it is straightforward to
> + *   find the destination vCPU and return true.
> + * - For multicast/broadcast vCPU, we cannot handle it via interrupt posting,
> + *   so return false.
> + *
> + *   Here is the details about the vector-hashing mechanism:
> + *   1. For lowest-priority interrupts, store all the possible destination
> + *      vCPUs in an array.
> + *   2. Use "gvec % max number of destination vCPUs" to find the right
> + *      destination vCPU in the array for the lowest-priority interrupt.
> + */
> +static struct vcpu *pi_find_dest_vcpu(struct domain *d, uint8_t dest_id,
> +                                      uint8_t dest_mode, uint8_t delivery_mode,
> +                                      uint8_t gvec)
> +{
> +    unsigned long *dest_vcpu_bitmap = NULL;
> +    unsigned int dest_vcpu_num = 0, idx = 0;
> +    int size = (d->max_vcpus + BITS_PER_LONG - 1) / BITS_PER_LONG;
> +    struct vcpu *v, *dest = NULL;
> +    int i;
> +
> +    dest_vcpu_bitmap = xzalloc_array(unsigned long, size);
> +    if ( !dest_vcpu_bitmap )
> +    {
> +        dprintk(XENLOG_G_INFO,
> +                "dom%d: failed to allocate memory\n", d->domain_id);
> +        return NULL;
> +    }
> +
> +    for_each_vcpu ( d, v )
> +    {
> +        if ( !vlapic_match_dest(vcpu_vlapic(v), NULL, 0,
> +                                dest_id, dest_mode) )
> +            continue;
> +
> +        __set_bit(v->vcpu_id, dest_vcpu_bitmap);
> +        dest_vcpu_num++;
> +    }
> +
> +    if ( delivery_mode == dest_LowestPrio )
> +    {
> +        if (  dest_vcpu_num != 0 )
> +        {

Having 'idx=0' here is more readable than initializing it earlier.

> +            for ( i = 0; i <= gvec % dest_vcpu_num; i++)
> +                idx = find_next_bit(dest_vcpu_bitmap, d->max_vcpus, idx) + 1;
> +            idx--;
> +
> +            BUG_ON(idx >= d->max_vcpus || idx < 0);

idx is an unsigned int, so it can't be < 0.

> +            dest = d->vcpu[idx];
> +        }
> +    }
> +    else if (  dest_vcpu_num == 1 )

A comment would be appreciated here to explain that this condition means a
fixed destination, while multicast/broadcast will have num as ZERO.

> +    {
> +        idx = find_first_bit(dest_vcpu_bitmap, d->max_vcpus);
> +        BUG_ON(idx >= d->max_vcpus || idx < 0);
> +        dest = d->vcpu[idx];
> +    }
> +
> +    xfree(dest_vcpu_bitmap);
> +
> +    return dest;
> +}
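
The vector-hashing selection above can be sketched standalone as follows.
This is only an illustration under assumptions: the names test_bit_ul()
and pick_dest_by_vector_hash() are hypothetical, and a plain bit scan
stands in for find_next_bit(). It returns the index of the
(gvec % count)-th set bit, counting from zero, which is the bit the quoted
loop ends on.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: test bit 'nr' in an array of unsigned long. */
static int test_bit_ul(const unsigned long *bitmap, unsigned int nr)
{
    return (bitmap[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}

/*
 * Pick a destination among the candidates marked in 'bitmap' (nbits wide)
 * by hashing the guest vector: take the (gvec % count)-th set bit.
 * Returns the bit index, or -1 if no candidate is set.
 */
static int pick_dest_by_vector_hash(const unsigned long *bitmap,
                                    unsigned int nbits, unsigned char gvec)
{
    unsigned int count = 0, seen = 0, target, i;

    for ( i = 0; i < nbits; i++ )
        count += test_bit_ul(bitmap, i);

    if ( count == 0 )
        return -1;

    target = gvec % count;

    for ( i = 0; i < nbits; i++ )
    {
        if ( !test_bit_ul(bitmap, i) )
            continue;
        if ( seen++ == target )
            return (int)i;
    }

    return -1;
}
```

With candidates 1, 3 and 6 set, vector 0 selects vCPU 1, vector 1 selects
vCPU 3 and vector 2 selects vCPU 6, after which the pattern repeats.
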
> +
>  int pt_irq_create_bind(
>      struct domain *d, xen_domctl_bind_pt_irq_t *pt_irq_bind)
>  {
> @@ -257,7 +330,7 @@ int pt_irq_create_bind(
>      {
>      case PT_IRQ_TYPE_MSI:
>      {
> -        uint8_t dest, dest_mode;
> +        uint8_t dest, dest_mode, delivery_mode;
>          int dest_vcpu_id;
> 
>          if ( !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
> @@ -330,11 +403,32 @@ int pt_irq_create_bind(
>          /* Calculate dest_vcpu_id for MSI-type pirq migration. */
>          dest = pirq_dpci->gmsi.gflags & VMSI_DEST_ID_MASK;
>          dest_mode = !!(pirq_dpci->gmsi.gflags & VMSI_DM_MASK);
> +        delivery_mode = (pirq_dpci->gmsi.gflags >> GFLAGS_SHIFT_DELIV_MODE) &
> +                        VMSI_DELIV_MASK;
>          dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
>          pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
>          spin_unlock(&d->event_lock);
>          if ( dest_vcpu_id >= 0 )
>              hvm_migrate_pirqs(d->vcpu[dest_vcpu_id]);
> +
> +        /* Use interrupt posting if it is supported */
> +        if ( iommu_intpost )
> +        {
> +            struct vcpu *vcpu = pi_find_dest_vcpu(d, dest, dest_mode,
> +                                        delivery_mode, pirq_dpci->gmsi.gvec);
> +
> +            if ( !vcpu )
> +                dprintk(XENLOG_G_WARNING,
> +                        "dom%u: failed to find the dest vCPU for PI, guest "
> +                        "vector:0x%x use software way to deliver the "
> +                        " interrupts.\n", d->domain_id, pirq_dpci->gmsi.gvec);

If software delivery is a normal behavior, no printk here.

Thanks
Kevin


* Re: [v3 11/15] Update IRTE according to guest interrupt config changes
  2015-07-08 10:22   ` Tian, Kevin
@ 2015-07-08 10:31     ` Wu, Feng
  2015-07-08 11:46       ` Tian, Kevin
  0 siblings, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-07-08 10:31 UTC (permalink / raw)
  To: Tian, Kevin, xen-devel@lists.xen.org
  Cc: Wu, Feng, george.dunlap@eu.citrix.com, andrew.cooper3@citrix.com,
	jbeulich@suse.com, Zhang, Yang Z, keir@xen.org



> -----Original Message-----
> From: Tian, Kevin
> Sent: Wednesday, July 08, 2015 6:23 PM
> To: Wu, Feng; xen-devel@lists.xen.org
> Cc: keir@xen.org; jbeulich@suse.com; andrew.cooper3@citrix.com; Zhang,
> Yang Z; george.dunlap@eu.citrix.com
> Subject: RE: [v3 11/15] Update IRTE according to guest interrupt config
> changes
> 
> > From: Wu, Feng
> > Sent: Wednesday, June 24, 2015 1:18 PM
> >
> > When guest changes its interrupt configuration (such as, vector, etc.)
> > for direct-assigned devices, we need to update the associated IRTE
> > with the new guest vector, so external interrupts from the assigned
> > devices can be injected to guests without VM-Exit.
> >
> > For lowest-priority interrupts, we use vector-hashing mechanism to find
> > the destination vCPU. This follows the hardware behavior, since modern
> > Intel CPUs use vector hashing to handle the lowest-priority interrupt.
> >
> > For multicast/broadcast vCPU, we cannot handle it via interrupt posting,
> > still use interrupt remapping.
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > ---
> > v3:
> > - Use bitmap to store the all the possible destination vCPUs of an
> > interrupt, then trying to find the right destination from the bitmap
> > - Typo and some small changes
> >
> >  xen/drivers/passthrough/io.c | 96
> > +++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 95 insertions(+), 1 deletion(-)
> >
> > diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> > index 9b77334..18e24e1 100644
> > --- a/xen/drivers/passthrough/io.c
> > +++ b/xen/drivers/passthrough/io.c
> > @@ -26,6 +26,7 @@
> >  #include <asm/hvm/iommu.h>
> >  #include <asm/hvm/support.h>
> >  #include <xen/hvm/irq.h>
> > +#include <asm/io_apic.h>
> >
> >  static DEFINE_PER_CPU(struct list_head, dpci_list);
> >
> > @@ -199,6 +200,78 @@ void free_hvm_irq_dpci(struct hvm_irq_dpci *dpci)
> >      xfree(dpci);
> >  }
> >
> > +/*
> > + * The purpose of this routine is to find the right destination vCPU for
> > + * an interrupt which will be delivered by VT-d posted-interrupt. There
> > + * are several cases as below:
> 
> If you aim to have this interface common to more usages, don't restrict to
> VT-d posted-interrupt which should be just an example.

Yes, making this a common interface should be better.

> 
> > + *
> > + * - For lowest-priority interrupts, we find the destination vCPU from the
> > + *   guest vector using vector-hashing mechanism and return true. This
> follows
> > + *   the hardware behavior, since modern Intel CPUs use vector hashing to
> > + *   handle the lowest-priority interrupt.
> 
> Does AMD use same hashing mechanism? Can this interface be reused by
> other IOMMU type or it's an Intel specific implementation?

I am not sure how AMD handles lowest-priority interrupts. Intel hardware
engineers told me that recent Intel platforms use this method to deliver
lowest-priority interrupts. What do you mean by "other IOMMU type"?

Thanks,
Feng

> 
> > + * - Otherwise, for single destination interrupt, it is straightforward to
> > + *   find the destination vCPU and return true.
> > + * - For multicast/broadcast vCPU, we cannot handle it via interrupt posting,
> > + *   so return false.
> > + *
> > + *   Here is the details about the vector-hashing mechanism:
> > + *   1. For lowest-priority interrupts, store all the possible destination
> > + *      vCPUs in an array.
> > + *   2. Use "gvec % max number of destination vCPUs" to find the right
> > + *      destination vCPU in the array for the lowest-priority interrupt.
> > + */
> > +static struct vcpu *pi_find_dest_vcpu(struct domain *d, uint8_t dest_id,
> > +                                      uint8_t dest_mode, uint8_t
> delivery_mode,
> > +                                      uint8_t gvec)
> > +{
> > +    unsigned long *dest_vcpu_bitmap = NULL;
> > +    unsigned int dest_vcpu_num = 0, idx = 0;
> > +    int size = (d->max_vcpus + BITS_PER_LONG - 1) / BITS_PER_LONG;
> > +    struct vcpu *v, *dest = NULL;
> > +    int i;
> > +
> > +    dest_vcpu_bitmap = xzalloc_array(unsigned long, size);
> > +    if ( !dest_vcpu_bitmap )
> > +    {
> > +        dprintk(XENLOG_G_INFO,
> > +                "dom%d: failed to allocate memory\n", d->domain_id);
> > +        return NULL;
> > +    }
> > +
> > +    for_each_vcpu ( d, v )
> > +    {
> > +        if ( !vlapic_match_dest(vcpu_vlapic(v), NULL, 0,
> > +                                dest_id, dest_mode) )
> > +            continue;
> > +
> > +        __set_bit(v->vcpu_id, dest_vcpu_bitmap);
> > +        dest_vcpu_num++;
> > +    }
> > +
> > +    if ( delivery_mode == dest_LowestPrio )
> > +    {
> > +        if (  dest_vcpu_num != 0 )
> > +        {
> 
> Having 'idx=0' here is more readable than initializing it earlier.
> 
> > +            for ( i = 0; i <= gvec % dest_vcpu_num; i++)
> > +                idx = find_next_bit(dest_vcpu_bitmap, d->max_vcpus,
> idx) + 1;
> > +            idx--;
> > +
> > +            BUG_ON(idx >= d->max_vcpus || idx < 0);
> 
> idx is unsigned int. can't <0
> 
> > +            dest = d->vcpu[idx];
> > +        }
> > +    }
> > +    else if (  dest_vcpu_num == 1 )
> 
> A comment would be appreciated to explain that this condition means a
> fixed destination, while multicast/broadcast will have num as ZERO.
> 
> > +    {
> > +        idx = find_first_bit(dest_vcpu_bitmap, d->max_vcpus);
> > +        BUG_ON(idx >= d->max_vcpus || idx < 0);
> > +        dest = d->vcpu[idx];
> > +    }
> > +
> > +    xfree(dest_vcpu_bitmap);
> > +
> > +    return dest;
> > +}
> > +
> >  int pt_irq_create_bind(
> >      struct domain *d, xen_domctl_bind_pt_irq_t *pt_irq_bind)
> >  {
> > @@ -257,7 +330,7 @@ int pt_irq_create_bind(
> >      {
> >      case PT_IRQ_TYPE_MSI:
> >      {
> > -        uint8_t dest, dest_mode;
> > +        uint8_t dest, dest_mode, delivery_mode;
> >          int dest_vcpu_id;
> >
> >          if ( !(pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) )
> > @@ -330,11 +403,32 @@ int pt_irq_create_bind(
> >          /* Calculate dest_vcpu_id for MSI-type pirq migration. */
> >          dest = pirq_dpci->gmsi.gflags & VMSI_DEST_ID_MASK;
> >          dest_mode = !!(pirq_dpci->gmsi.gflags & VMSI_DM_MASK);
> > +        delivery_mode = (pirq_dpci->gmsi.gflags >>
> GFLAGS_SHIFT_DELIV_MODE) &
> > +                        VMSI_DELIV_MASK;
> >          dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
> >          pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
> >          spin_unlock(&d->event_lock);
> >          if ( dest_vcpu_id >= 0 )
> >              hvm_migrate_pirqs(d->vcpu[dest_vcpu_id]);
> > +
> > +        /* Use interrupt posting if it is supported */
> > +        if ( iommu_intpost )
> > +        {
> > +            struct vcpu *vcpu = pi_find_dest_vcpu(d, dest, dest_mode,
> > +                                        delivery_mode,
> pirq_dpci->gmsi.gvec);
> > +
> > +            if ( !vcpu )
> > +                dprintk(XENLOG_G_WARNING,
> > +                        "dom%u: failed to find the dest vCPU for PI,
> guest "
> > +                        "vector:0x%x use software way to deliver the
> "
> > +                        " interrupts.\n", d->domain_id,
> pirq_dpci->gmsi.gvec);
> 
> If software delivery is a normal behavior, no printk here.
> 
> Thanks
> Kevin
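
For reference, the vector-hashing selection described in the patch comment
("gvec % number of destination vCPUs" used as an index into the set of
matching vCPUs) can be sketched in isolation as below. This is only a
minimal sketch under stated assumptions, not the code from the patch: the
helper names bitmap_test() and pick_lowest_prio_dest() are hypothetical,
and a plain loop stands in for Xen's find_next_bit().

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: return nonzero if bit 'nr' is set in 'map'. */
static int bitmap_test(const unsigned long *map, unsigned int nr)
{
    return (map[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL;
}

/*
 * Hypothetical helper: pick the destination of a lowest-priority
 * interrupt.  'map' has one bit set per candidate vCPU id.  The chosen
 * vCPU is the (gvec % count)-th set bit, counting from zero.
 * Returns the vCPU id, or -1 if no vCPU matched.
 */
static int pick_lowest_prio_dest(const unsigned long *map,
                                 unsigned int nr_vcpus,
                                 unsigned char gvec)
{
    unsigned int id, count = 0, target;

    /* Count the candidate vCPUs. */
    for ( id = 0; id < nr_vcpus; id++ )
        if ( bitmap_test(map, id) )
            count++;

    if ( count == 0 )
        return -1;

    /* Walk the set bits until the target index is reached. */
    target = gvec % count;
    for ( id = 0; id < nr_vcpus; id++ )
        if ( bitmap_test(map, id) && target-- == 0 )
            return (int)id;

    return -1; /* not reached when count > 0 */
}
```

With candidates {2, 3, 5} and gvec 0x41, the index is 65 % 3 = 2, so the
third candidate (vCPU 5) is selected. Counting the candidates first also
makes the "no destination" case explicit instead of relying on a BUG_ON.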

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-06-29 17:07   ` Andrew Cooper
@ 2015-07-08 10:36     ` Wu, Feng
  2015-07-08 10:48       ` Jan Beulich
  0 siblings, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-07-08 10:36 UTC (permalink / raw)
  To: Jan Beulich (JBeulich@suse.com)
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	Andrew Cooper, xen-devel@lists.xen.org, Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Tuesday, June 30, 2015 1:07 AM
> To: Wu, Feng; xen-devel@lists.xen.org
> Cc: keir@xen.org; jbeulich@suse.com; Tian, Kevin; Zhang, Yang Z;
> george.dunlap@eu.citrix.com
> Subject: Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
> 
> On 24/06/15 06:18, Feng Wu wrote:
> > This patch includes the following aspects:
> > - Add a global vector to wake up the blocked vCPU
> >   when an interrupt is being posted to it (This
> >   part was suggested by Yang Zhang <yang.z.zhang@intel.com>).
> > - Adds a new per-vCPU tasklet to wakeup the blocked
> >   vCPU. It can be used in the case vcpu_unblock
> >   cannot be called directly.
> > - Define two per-cpu variables:
> >       * pi_blocked_vcpu:
> >       A list storing the vCPUs which were blocked on this pCPU.
> >
> >       * pi_blocked_vcpu_lock:
> >       The spinlock to protect pi_blocked_vcpu.
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > ---
> > v3:
> > - This patch is generated by merging the following three patches in v2:
> >    [RFC v2 09/15] Add a new per-vCPU tasklet to wakeup the blocked vCPU
> >    [RFC v2 10/15] vmx: Define two per-cpu variables
> >    [RFC v2 11/15] vmx: Add a global wake-up vector for VT-d
> Posted-Interrupts
> > - rename 'vcpu_wakeup_tasklet' to 'pi_vcpu_wakeup_tasklet'
> > - Move the definition of 'pi_vcpu_wakeup_tasklet' to 'struct arch_vmx_struct'
> > - rename 'vcpu_wakeup_tasklet_handler' to
> 'pi_vcpu_wakeup_tasklet_handler'
> > - Make pi_wakeup_interrupt() static
> > - Rename 'blocked_vcpu_list' to 'pi_blocked_vcpu_list'
> > - move 'pi_blocked_vcpu_list' to 'struct arch_vmx_struct'
> > - Rename 'blocked_vcpu' to 'pi_blocked_vcpu'
> > - Rename 'blocked_vcpu_lock' to 'pi_blocked_vcpu_lock'
> >
> >  xen/arch/x86/hvm/vmx/vmcs.c        |  3 +++
> >  xen/arch/x86/hvm/vmx/vmx.c         | 54
> ++++++++++++++++++++++++++++++++++++++
> >  xen/include/asm-x86/hvm/hvm.h      |  1 +
> >  xen/include/asm-x86/hvm/vmx/vmcs.h |  5 ++++
> >  xen/include/asm-x86/hvm/vmx/vmx.h  |  5 ++++
> >  5 files changed, 68 insertions(+)
> >
> > diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
> > index 11dc1b5..0c5ce3f 100644
> > --- a/xen/arch/x86/hvm/vmx/vmcs.c
> > +++ b/xen/arch/x86/hvm/vmx/vmcs.c
> > @@ -631,6 +631,9 @@ int vmx_cpu_up(void)
> >      if ( cpu_has_vmx_vpid )
> >          vpid_sync_all();
> >
> > +    INIT_LIST_HEAD(&per_cpu(pi_blocked_vcpu, cpu));
> > +    spin_lock_init(&per_cpu(pi_blocked_vcpu_lock, cpu));
> > +
> >      return 0;
> >  }
> >
> > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> > index b94ef6a..7db6009 100644
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -82,7 +82,20 @@ static int vmx_msr_read_intercept(unsigned int msr,
> uint64_t *msr_content);
> >  static int vmx_msr_write_intercept(unsigned int msr, uint64_t
> msr_content);
> >  static void vmx_invlpg_intercept(unsigned long vaddr);
> >
> > +/*
> > + * We maintain a per-CPU linked-list of vCPU, so in PI wakeup handler we
> > + * can find which vCPU should be woken up.
> > + */
> > +DEFINE_PER_CPU(struct list_head, pi_blocked_vcpu);
> > +DEFINE_PER_CPU(spinlock_t, pi_blocked_vcpu_lock);
> > +
> >  uint8_t __read_mostly posted_intr_vector;
> > +uint8_t __read_mostly pi_wakeup_vector;
> > +
> > +static void pi_vcpu_wakeup_tasklet_handler(unsigned long arg)
> > +{
> > +    vcpu_unblock((struct vcpu *)arg);
> > +}
> >
> >  static int vmx_domain_initialise(struct domain *d)
> >  {
> > @@ -148,11 +161,19 @@ static int vmx_vcpu_initialise(struct vcpu *v)
> >      if ( v->vcpu_id == 0 )
> >          v->arch.user_regs.eax = 1;
> >
> > +    tasklet_init(
> > +        &v->arch.hvm_vmx.pi_vcpu_wakeup_tasklet,
> > +        pi_vcpu_wakeup_tasklet_handler,
> > +        (unsigned long)v);
> 
> c/s f6dd295 indicates that the global tasklet lock causes a bottleneck
> when injecting interrupts, and replaced a tasklet with a softirq to fix
> the scalability issue.
> 
> I would expect exactly the bottleneck to exist here.

I am still considering this comment. Jan, what is your opinion about this?

Thanks,
Feng

> 
> > +
> > +    INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> > +
> >      return 0;
> >  }
> >
> >  static void vmx_vcpu_destroy(struct vcpu *v)
> >  {
> > +    tasklet_kill(&v->arch.hvm_vmx.pi_vcpu_wakeup_tasklet);
> >      /*
> >       * There are cases that domain still remains in log-dirty mode when it
> is
> >       * about to be destroyed (ex, user types 'xl destroy <dom>'), in which
> case
> > @@ -1848,6 +1869,33 @@ static struct hvm_function_table __initdata
> vmx_function_table = {
> >      .enable_msr_exit_interception = vmx_enable_msr_exit_interception,
> >  };
> >
> > +/*
> > + * Handle VT-d posted-interrupt when VCPU is blocked.
> > + */
> > +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> > +{
> > +    struct arch_vmx_struct *vmx;
> > +    unsigned int cpu = smp_processor_id();
> > +
> > +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
> 
> this_cpu($foo) should be used in preference to per_cpu($foo, $myself).
> 
> However, always hoist repeated uses of this/per_cpu into local
> variables, as the compiler is unable to elide repeated accesses (because
> of a deliberate anti-optimisation behind the scenes).
> 
> spinlock_t *lock = &this_cpu(pi_blocked_vcpu_lock);
> struct list_head *blocked_vcpus = &this_cpu(pi_blocked_vcpu);
> 
> ~Andrew

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-08 10:36     ` Wu, Feng
@ 2015-07-08 10:48       ` Jan Beulich
  0 siblings, 0 replies; 155+ messages in thread
From: Jan Beulich @ 2015-07-08 10:48 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir@xen.org, george.dunlap@eu.citrix.com,
	Andrew Cooper, xen-devel@lists.xen.org, Yang Z Zhang

>>> On 08.07.15 at 12:36, <feng.wu@intel.com> wrote:
>> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
>> Sent: Tuesday, June 30, 2015 1:07 AM
>> On 24/06/15 06:18, Feng Wu wrote:
>> > @@ -148,11 +161,19 @@ static int vmx_vcpu_initialise(struct vcpu *v)
>> >      if ( v->vcpu_id == 0 )
>> >          v->arch.user_regs.eax = 1;
>> >
>> > +    tasklet_init(
>> > +        &v->arch.hvm_vmx.pi_vcpu_wakeup_tasklet,
>> > +        pi_vcpu_wakeup_tasklet_handler,
>> > +        (unsigned long)v);
>> 
>> c/s f6dd295 indicates that the global tasklet lock causes a bottleneck
>> when injecting interrupts, and replaced a tasklet with a softirq to fix
>> the scalability issue.
>> 
>> I would expect exactly the bottleneck to exist here.
> 
> I am still considering this comment. Jan, what is your opinion about this?

"My opinion" here is that I expect you to respond to Andrew.

Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-06-24  5:18 ` [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked Feng Wu
                     ` (2 preceding siblings ...)
  2015-06-30 10:11   ` Andrew Cooper
@ 2015-07-08 11:00   ` Tian, Kevin
  2015-07-08 11:02     ` Wu, Feng
  2015-07-08 12:46     ` Jan Beulich
  3 siblings, 2 replies; 155+ messages in thread
From: Tian, Kevin @ 2015-07-08 11:00 UTC (permalink / raw)
  To: Wu, Feng, xen-devel@lists.xen.org
  Cc: Zhang, Yang Z, andrew.cooper3@citrix.com, keir@xen.org,
	george.dunlap@eu.citrix.com, jbeulich@suse.com

> From: Wu, Feng
> Sent: Wednesday, June 24, 2015 1:18 PM
> 
> This patch includes the following aspects:
> - Add a global vector to wake up the blocked vCPU
>   when an interrupt is being posted to it (This
>   part was suggested by Yang Zhang <yang.z.zhang@intel.com>).
> - Adds a new per-vCPU tasklet to wakeup the blocked
>   vCPU. It can be used in the case vcpu_unblock
>   cannot be called directly.
> - Define two per-cpu variables:
>       * pi_blocked_vcpu:
>       A list storing the vCPUs which were blocked on this pCPU.
> 
>       * pi_blocked_vcpu_lock:
>       The spinlock to protect pi_blocked_vcpu.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
> v3:
> - This patch is generated by merging the following three patches in v2:
>    [RFC v2 09/15] Add a new per-vCPU tasklet to wakeup the blocked vCPU
>    [RFC v2 10/15] vmx: Define two per-cpu variables
>    [RFC v2 11/15] vmx: Add a global wake-up vector for VT-d Posted-Interrupts
> - rename 'vcpu_wakeup_tasklet' to 'pi_vcpu_wakeup_tasklet'
> - Move the definition of 'pi_vcpu_wakeup_tasklet' to 'struct arch_vmx_struct'
> - rename 'vcpu_wakeup_tasklet_handler' to 'pi_vcpu_wakeup_tasklet_handler'
> - Make pi_wakeup_interrupt() static
> - Rename 'blocked_vcpu_list' to 'pi_blocked_vcpu_list'
> - move 'pi_blocked_vcpu_list' to 'struct arch_vmx_struct'
> - Rename 'blocked_vcpu' to 'pi_blocked_vcpu'
> - Rename 'blocked_vcpu_lock' to 'pi_blocked_vcpu_lock'
> 
>  xen/arch/x86/hvm/vmx/vmcs.c        |  3 +++
>  xen/arch/x86/hvm/vmx/vmx.c         | 54
> ++++++++++++++++++++++++++++++++++++++
>  xen/include/asm-x86/hvm/hvm.h      |  1 +
>  xen/include/asm-x86/hvm/vmx/vmcs.h |  5 ++++
>  xen/include/asm-x86/hvm/vmx/vmx.h  |  5 ++++
>  5 files changed, 68 insertions(+)
> 
> diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
> index 11dc1b5..0c5ce3f 100644
> --- a/xen/arch/x86/hvm/vmx/vmcs.c
> +++ b/xen/arch/x86/hvm/vmx/vmcs.c
> @@ -631,6 +631,9 @@ int vmx_cpu_up(void)
>      if ( cpu_has_vmx_vpid )
>          vpid_sync_all();
> 
> +    INIT_LIST_HEAD(&per_cpu(pi_blocked_vcpu, cpu));
> +    spin_lock_init(&per_cpu(pi_blocked_vcpu_lock, cpu));
> +
>      return 0;
>  }
> 
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index b94ef6a..7db6009 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -82,7 +82,20 @@ static int vmx_msr_read_intercept(unsigned int msr, uint64_t
> *msr_content);
>  static int vmx_msr_write_intercept(unsigned int msr, uint64_t msr_content);
>  static void vmx_invlpg_intercept(unsigned long vaddr);
> 
> +/*
> + * We maintain a per-CPU linked-list of vCPU, so in PI wakeup handler we
> + * can find which vCPU should be woken up.
> + */
> +DEFINE_PER_CPU(struct list_head, pi_blocked_vcpu);
> +DEFINE_PER_CPU(spinlock_t, pi_blocked_vcpu_lock);
> +
>  uint8_t __read_mostly posted_intr_vector;
> +uint8_t __read_mostly pi_wakeup_vector;
> +
> +static void pi_vcpu_wakeup_tasklet_handler(unsigned long arg)
> +{
> +    vcpu_unblock((struct vcpu *)arg);
> +}
> 
>  static int vmx_domain_initialise(struct domain *d)
>  {
> @@ -148,11 +161,19 @@ static int vmx_vcpu_initialise(struct vcpu *v)
>      if ( v->vcpu_id == 0 )
>          v->arch.user_regs.eax = 1;
> 
> +    tasklet_init(
> +        &v->arch.hvm_vmx.pi_vcpu_wakeup_tasklet,
> +        pi_vcpu_wakeup_tasklet_handler,
> +        (unsigned long)v);
> +
> +    INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> +
>      return 0;
>  }
> 
>  static void vmx_vcpu_destroy(struct vcpu *v)
>  {
> +    tasklet_kill(&v->arch.hvm_vmx.pi_vcpu_wakeup_tasklet);
>      /*
>       * There are cases that domain still remains in log-dirty mode when it is
>       * about to be destroyed (ex, user types 'xl destroy <dom>'), in which case
> @@ -1848,6 +1869,33 @@ static struct hvm_function_table __initdata
> vmx_function_table = {
>      .enable_msr_exit_interception = vmx_enable_msr_exit_interception,
>  };
> 
> +/*
> + * Handle VT-d posted-interrupt when VCPU is blocked.
> + */
> +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> +{
> +    struct arch_vmx_struct *vmx;
> +    unsigned int cpu = smp_processor_id();
> +
> +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
> +
> +    /*
> +     * FIXME: The length of the list depends on how many
> +     * vCPU is current blocked on this specific pCPU.
> +     * This may hurt the interrupt latency if the list
> +     * grows to too many entries.
> +     */

let's go with this linked list first until a real issue is identified.

> +    list_for_each_entry(vmx, &per_cpu(pi_blocked_vcpu, cpu),
> +                        pi_blocked_vcpu_list)
> +        if ( vmx->pi_desc.on )
> +            tasklet_schedule(&vmx->pi_vcpu_wakeup_tasklet);

I'm not sure where the vCPU is removed from the list (possibly in a later
patch), but removing it from the list at this point should be safe and the
right way to go. IIRC Andrew and others raised a similar concern earlier. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-08 11:00   ` Tian, Kevin
@ 2015-07-08 11:02     ` Wu, Feng
  2015-07-08 12:46     ` Jan Beulich
  1 sibling, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-08 11:02 UTC (permalink / raw)
  To: Tian, Kevin, xen-devel@lists.xen.org
  Cc: Wu, Feng, george.dunlap@eu.citrix.com, andrew.cooper3@citrix.com,
	jbeulich@suse.com, Zhang, Yang Z, keir@xen.org



> -----Original Message-----
> From: Tian, Kevin
> Sent: Wednesday, July 08, 2015 7:00 PM
> To: Wu, Feng; xen-devel@lists.xen.org
> Cc: keir@xen.org; jbeulich@suse.com; andrew.cooper3@citrix.com; Zhang,
> Yang Z; george.dunlap@eu.citrix.com
> Subject: RE: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
> 
> > From: Wu, Feng
> > Sent: Wednesday, June 24, 2015 1:18 PM
> >
> > This patch includes the following aspects:
> > - Add a global vector to wake up the blocked vCPU
> >   when an interrupt is being posted to it (This
> >   part was suggested by Yang Zhang <yang.z.zhang@intel.com>).
> > - Adds a new per-vCPU tasklet to wakeup the blocked
> >   vCPU. It can be used in the case vcpu_unblock
> >   cannot be called directly.
> > - Define two per-cpu variables:
> >       * pi_blocked_vcpu:
> >       A list storing the vCPUs which were blocked on this pCPU.
> >
> >       * pi_blocked_vcpu_lock:
> >       The spinlock to protect pi_blocked_vcpu.
> >
> > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > ---
> > v3:
> > - This patch is generated by merging the following three patches in v2:
> >    [RFC v2 09/15] Add a new per-vCPU tasklet to wakeup the blocked vCPU
> >    [RFC v2 10/15] vmx: Define two per-cpu variables
> >    [RFC v2 11/15] vmx: Add a global wake-up vector for VT-d
> Posted-Interrupts
> > - rename 'vcpu_wakeup_tasklet' to 'pi_vcpu_wakeup_tasklet'
> > - Move the definition of 'pi_vcpu_wakeup_tasklet' to 'struct arch_vmx_struct'
> > - rename 'vcpu_wakeup_tasklet_handler' to
> 'pi_vcpu_wakeup_tasklet_handler'
> > - Make pi_wakeup_interrupt() static
> > - Rename 'blocked_vcpu_list' to 'pi_blocked_vcpu_list'
> > - move 'pi_blocked_vcpu_list' to 'struct arch_vmx_struct'
> > - Rename 'blocked_vcpu' to 'pi_blocked_vcpu'
> > - Rename 'blocked_vcpu_lock' to 'pi_blocked_vcpu_lock'
> >
> >  xen/arch/x86/hvm/vmx/vmcs.c        |  3 +++
> >  xen/arch/x86/hvm/vmx/vmx.c         | 54
> > ++++++++++++++++++++++++++++++++++++++
> >  xen/include/asm-x86/hvm/hvm.h      |  1 +
> >  xen/include/asm-x86/hvm/vmx/vmcs.h |  5 ++++
> >  xen/include/asm-x86/hvm/vmx/vmx.h  |  5 ++++
> >  5 files changed, 68 insertions(+)
> >
> > diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
> > index 11dc1b5..0c5ce3f 100644
> > --- a/xen/arch/x86/hvm/vmx/vmcs.c
> > +++ b/xen/arch/x86/hvm/vmx/vmcs.c
> > @@ -631,6 +631,9 @@ int vmx_cpu_up(void)
> >      if ( cpu_has_vmx_vpid )
> >          vpid_sync_all();
> >
> > +    INIT_LIST_HEAD(&per_cpu(pi_blocked_vcpu, cpu));
> > +    spin_lock_init(&per_cpu(pi_blocked_vcpu_lock, cpu));
> > +
> >      return 0;
> >  }
> >
> > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> > index b94ef6a..7db6009 100644
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -82,7 +82,20 @@ static int vmx_msr_read_intercept(unsigned int msr,
> uint64_t
> > *msr_content);
> >  static int vmx_msr_write_intercept(unsigned int msr, uint64_t
> msr_content);
> >  static void vmx_invlpg_intercept(unsigned long vaddr);
> >
> > +/*
> > + * We maintain a per-CPU linked-list of vCPU, so in PI wakeup handler we
> > + * can find which vCPU should be woken up.
> > + */
> > +DEFINE_PER_CPU(struct list_head, pi_blocked_vcpu);
> > +DEFINE_PER_CPU(spinlock_t, pi_blocked_vcpu_lock);
> > +
> >  uint8_t __read_mostly posted_intr_vector;
> > +uint8_t __read_mostly pi_wakeup_vector;
> > +
> > +static void pi_vcpu_wakeup_tasklet_handler(unsigned long arg)
> > +{
> > +    vcpu_unblock((struct vcpu *)arg);
> > +}
> >
> >  static int vmx_domain_initialise(struct domain *d)
> >  {
> > @@ -148,11 +161,19 @@ static int vmx_vcpu_initialise(struct vcpu *v)
> >      if ( v->vcpu_id == 0 )
> >          v->arch.user_regs.eax = 1;
> >
> > +    tasklet_init(
> > +        &v->arch.hvm_vmx.pi_vcpu_wakeup_tasklet,
> > +        pi_vcpu_wakeup_tasklet_handler,
> > +        (unsigned long)v);
> > +
> > +    INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> > +
> >      return 0;
> >  }
> >
> >  static void vmx_vcpu_destroy(struct vcpu *v)
> >  {
> > +    tasklet_kill(&v->arch.hvm_vmx.pi_vcpu_wakeup_tasklet);
> >      /*
> >       * There are cases that domain still remains in log-dirty mode when it
> is
> >       * about to be destroyed (ex, user types 'xl destroy <dom>'), in which
> case
> > @@ -1848,6 +1869,33 @@ static struct hvm_function_table __initdata
> > vmx_function_table = {
> >      .enable_msr_exit_interception = vmx_enable_msr_exit_interception,
> >  };
> >
> > +/*
> > + * Handle VT-d posted-interrupt when VCPU is blocked.
> > + */
> > +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> > +{
> > +    struct arch_vmx_struct *vmx;
> > +    unsigned int cpu = smp_processor_id();
> > +
> > +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
> > +
> > +    /*
> > +     * FIXME: The length of the list depends on how many
> > +     * vCPU is current blocked on this specific pCPU.
> > +     * This may hurt the interrupt latency if the list
> > +     * grows to too many entries.
> > +     */
> 
> let's go with this linked list first until a real issue is identified.
> 
> > +    list_for_each_entry(vmx, &per_cpu(pi_blocked_vcpu, cpu),
> > +                        pi_blocked_vcpu_list)
> > +        if ( vmx->pi_desc.on )
> > +            tasklet_schedule(&vmx->pi_vcpu_wakeup_tasklet);
> 
> Not sure where the vcpu is removed from the list (possibly in later patch).
> But at least removing vcpu from the list at this point should be safe and
> right way to go. IIRC Andrew and other guys raised similar concern earlier. :-)

Thanks for the comments!
Yes, although it will eventually be removed in another patch, we need to remove
the vCPU right here to avoid kicking it multiple times.

Thanks,
Feng

> 
> Thanks
> Kevin
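
The agreed change (drop a vCPU from the per-pCPU blocked list at the moment
its wakeup is requested, so it is not kicked again by a later wakeup
interrupt) could look roughly like the sketch below. This is a
self-contained illustration under stated assumptions, not the Xen code:
struct list_node, struct blocked_vcpu and wake_pending() are hypothetical
stand-ins for Xen's list_head, arch_vmx_struct and pi_wakeup_interrupt(),
the 'woken' flag stands in for scheduling the wakeup tasklet, and the
pi_blocked_vcpu_lock that the real handler holds is omitted.

```c
#include <stddef.h>

/* Minimal doubly-linked list, standing in for Xen's struct list_head. */
struct list_node {
    struct list_node *prev, *next;
};

/* Hypothetical stand-in for the per-vCPU state kept on the blocked list. */
struct blocked_vcpu {
    struct list_node link;
    int on;     /* stand-in for pi_desc.on: a notification is outstanding */
    int woken;  /* set where the real code would schedule the wakeup */
};

static void list_init(struct list_node *head)
{
    head->prev = head->next = head;
}

static void list_add_tail(struct list_node *n, struct list_node *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

static void list_del_init(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = n;
}

/*
 * Walk the blocked list, wake every vCPU with an outstanding notification
 * and unlink it in the same pass.  'next' is saved before unlinking so the
 * walk survives the removal.  Returns the number of vCPUs woken.
 */
static unsigned int wake_pending(struct list_node *head)
{
    struct list_node *pos, *next;
    unsigned int woken = 0;

    for ( pos = head->next; pos != head; pos = next )
    {
        struct blocked_vcpu *v = (struct blocked_vcpu *)
            ((char *)pos - offsetof(struct blocked_vcpu, link));

        next = pos->next;
        if ( !v->on )
            continue;

        list_del_init(pos);
        v->woken = 1;
        woken++;
    }

    return woken;
}
```

Because woken entries are unlinked immediately, a second wakeup interrupt
on the same pCPU only visits vCPUs that are still blocked without a
pending notification.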

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 13/15] vmx: Properly handle notification event when vCPU is running
  2015-06-24  5:18 ` [v3 13/15] vmx: Properly handle notification event when vCPU is running Feng Wu
@ 2015-07-08 11:03   ` Tian, Kevin
  2015-07-10 14:40   ` Jan Beulich
  1 sibling, 0 replies; 155+ messages in thread
From: Tian, Kevin @ 2015-07-08 11:03 UTC (permalink / raw)
  To: Wu, Feng, xen-devel@lists.xen.org
  Cc: Zhang, Yang Z, andrew.cooper3@citrix.com, keir@xen.org,
	george.dunlap@eu.citrix.com, jbeulich@suse.com

> From: Wu, Feng
> Sent: Wednesday, June 24, 2015 1:18 PM
> 
> When a vCPU is running in Root mode and a notification event
> has been injected to it, we need to set VCPU_KICK_SOFTIRQ for
> the current cpu, so the pending interrupt in PIRR will be
> synced to vIRR before VM-Exit in time.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>

Acked-by: Kevin Tian <kevin.tian@intel.com>

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-06-24  5:18 ` [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling Feng Wu
       [not found]   ` <55918214.4030102@citrix.com>
@ 2015-07-08 11:24   ` Tian, Kevin
  2015-07-10 14:48   ` Jan Beulich
  2 siblings, 0 replies; 155+ messages in thread
From: Tian, Kevin @ 2015-07-08 11:24 UTC (permalink / raw)
  To: Wu, Feng, xen-devel@lists.xen.org
  Cc: Zhang, Yang Z, andrew.cooper3@citrix.com, keir@xen.org,
	george.dunlap@eu.citrix.com, jbeulich@suse.com

> From: Wu, Feng
> Sent: Wednesday, June 24, 2015 1:18 PM
> 
> The basic idea here is:
> 1. When vCPU's state is RUNSTATE_running,
>         - set 'NV' to 'Notification Vector'.
>         - Clear 'SN' to accept PI.
>         - set 'NDST' to the right pCPU.
> 2. When vCPU's state is RUNSTATE_blocked,
>         - set 'NV' to 'Wake-up Vector', so we can wake up the
>           related vCPU when posted-interrupt happens for it.
>         - Clear 'SN' to accept PI.
> 3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
>         - Set 'SN' to suppress non-urgent interrupts.
>           (Currently, we only support non-urgent interrupts)
>         - Set 'NV' back to 'Notification Vector' if needed.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>

Acked-by: Kevin Tian <kevin.tian@intel.com>
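
The three cases in the commit message amount to a small state table. A
minimal sketch of that table follows, assuming a simplified descriptor:
struct pi_fields, enum runstate, pi_update() and the two vector values are
hypothetical names and numbers chosen for illustration, and the real code
would need to update the descriptor atomically rather than field by field.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for Xen's RUNSTATE_* values. */
enum runstate { RS_RUNNING, RS_RUNNABLE, RS_BLOCKED, RS_OFFLINE };

/* Simplified view of the descriptor fields the commit message mentions. */
struct pi_fields {
    uint8_t  nv;    /* notification vector */
    bool     sn;    /* suppress notification */
    uint32_t ndst;  /* notification destination (pCPU) */
};

/* Illustrative vector numbers only; the real ones are allocated at boot. */
#define NOTIFICATION_VECTOR 0xf2
#define WAKEUP_VECTOR       0xf1

/* Apply the per-runstate settings listed in the commit message. */
static void pi_update(struct pi_fields *pi, enum runstate rs,
                      uint32_t dest_cpu)
{
    switch ( rs )
    {
    case RS_RUNNING:
        pi->nv = NOTIFICATION_VECTOR;
        pi->sn = false;             /* accept posted interrupts */
        pi->ndst = dest_cpu;        /* deliver to the pCPU we run on */
        break;

    case RS_BLOCKED:
        pi->nv = WAKEUP_VECTOR;     /* notification wakes the vCPU */
        pi->sn = false;
        break;

    case RS_RUNNABLE:
    case RS_OFFLINE:
        pi->sn = true;              /* suppress non-urgent interrupts */
        pi->nv = NOTIFICATION_VECTOR;
        break;
    }
}
```

Note that only the running case touches 'NDST' in this sketch, matching
the commit message, which lists it only under RUNSTATE_running.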

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 15/15] Add a command line parameter for VT-d posted-interrupts
  2015-06-24  5:18 ` [v3 15/15] Add a command line parameter for VT-d posted-interrupts Feng Wu
@ 2015-07-08 11:25   ` Tian, Kevin
  0 siblings, 0 replies; 155+ messages in thread
From: Tian, Kevin @ 2015-07-08 11:25 UTC (permalink / raw)
  To: Wu, Feng, xen-devel@lists.xen.org
  Cc: Zhang, Yang Z, andrew.cooper3@citrix.com, keir@xen.org,
	george.dunlap@eu.citrix.com, jbeulich@suse.com

> From: Wu, Feng
> Sent: Wednesday, June 24, 2015 1:18 PM
> 
> Enable VT-d Posted-Interrupts and add a command line
> parameter for it.
> 
> Signed-off-by: Feng Wu <feng.wu@intel.com>
> ---
> v3:
> Remove the redundant "no intremp then no intpost" logic
> 
>  docs/misc/xen-command-line.markdown | 9 ++++++++-
>  xen/drivers/passthrough/iommu.c     | 4 +++-
>  2 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/docs/misc/xen-command-line.markdown
> b/docs/misc/xen-command-line.markdown
> index aa684c0..f8ec15f 100644
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -875,6 +875,13 @@ debug hypervisor only).
>  >> Control the use of interrupt remapping (DMA remapping will always be enabled
>  >> if IOMMU functionality is enabled).
> 
> +> `intpost`
> +
> +> Default: `true`
> +
> +>> Control the use of interrupt posting, interrupt posting is dependant on
> +>> interrupt remapping.

"Control the use of interrupt posting, which depends on the availability of interrupt remapping."

> +
>  > `qinval` (VT-d)
> 
>  > Default: `true`
> diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c
> index 597f676..e13251c 100644
> --- a/xen/drivers/passthrough/iommu.c
> +++ b/xen/drivers/passthrough/iommu.c
> @@ -52,7 +52,7 @@ bool_t __read_mostly iommu_passthrough;
>  bool_t __read_mostly iommu_snoop = 1;
>  bool_t __read_mostly iommu_qinval = 1;
>  bool_t __read_mostly iommu_intremap = 1;
> -bool_t __read_mostly iommu_intpost;
> +bool_t __read_mostly iommu_intpost = 1;
>  bool_t __read_mostly iommu_hap_pt_share = 1;
>  bool_t __read_mostly iommu_debug;
>  bool_t __read_mostly amd_iommu_perdev_intremap = 1;
> @@ -97,6 +97,8 @@ static void __init parse_iommu_param(char *s)
>              iommu_qinval = val;
>          else if ( !strcmp(s, "intremap") )
>              iommu_intremap = val;
> +        else if ( !strcmp(s, "intpost") )
> +            iommu_intpost = val;
>          else if ( !strcmp(s, "debug") )
>          {
>              iommu_debug = val;
> --
> 2.1.0

Reviewed-by: Kevin Tian <kevin.tian@intel.com>
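
To illustrate how a boolean sub-option such as "intpost" is typically
handled, here is a standalone sketch of a comma-separated option parser.
It is not the parse_iommu_param() from the patch: struct iommu_opts and
parse_iommu_opts() are hypothetical, the "no-" prefix for negation is
assumed, and the final dependency check (no interrupt remapping means no
interrupt posting) is shown here only to make the documented dependency
concrete; the series applies that rule elsewhere.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical container for the two options discussed above. */
struct iommu_opts {
    bool intremap;
    bool intpost;
};

/*
 * Parse a comma-separated option string in place, e.g.
 * "no-intpost" or "intremap,no-intpost".  Unknown options are ignored.
 */
static void parse_iommu_opts(char *s, struct iommu_opts *o)
{
    while ( s && *s )
    {
        char *comma = strchr(s, ',');
        bool val;

        if ( comma )
            *comma = '\0';

        /* A leading "no-" negates the option that follows it. */
        val = strncmp(s, "no-", 3) != 0;
        if ( !val )
            s += 3;

        if ( !strcmp(s, "intremap") )
            o->intremap = val;
        else if ( !strcmp(s, "intpost") )
            o->intpost = val;

        s = comma ? comma + 1 : NULL;
    }

    /* Interrupt posting depends on interrupt remapping. */
    if ( !o->intremap )
        o->intpost = false;
}
```

Starting from both options enabled, "no-intpost" disables only posting,
while "no-intremap" ends up disabling both.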

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 08/15] Suppress posting interrupts when 'SN' is set
  2015-07-08 10:11     ` Wu, Feng
@ 2015-07-08 11:31       ` Tian, Kevin
  2015-07-08 11:58         ` Wu, Feng
  0 siblings, 1 reply; 155+ messages in thread
From: Tian, Kevin @ 2015-07-08 11:31 UTC (permalink / raw)
  To: Wu, Feng, xen-devel@lists.xen.org
  Cc: Zhang, Yang Z, andrew.cooper3@citrix.com, keir@xen.org,
	george.dunlap@eu.citrix.com, jbeulich@suse.com

> From: Wu, Feng
> Sent: Wednesday, July 08, 2015 6:11 PM
> > From: Tian, Kevin
> > Sent: Wednesday, July 08, 2015 5:06 PM
> >
> > > From: Wu, Feng
> > > Sent: Wednesday, June 24, 2015 1:18 PM
> > >
> > > Currently, we don't support urgent interrupt, all interrupts
> > > are recognized as non-urgent interrupt, so we cannot send
> > > posted-interrupt when 'SN' is set.
> > >
> > > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > > ---
> > > v3:
> > > use cmpxchg to test SN/ON and set ON
> > >
> > >  xen/arch/x86/hvm/vmx/vmx.c | 32
> ++++++++++++++++++++++++++++----
> > >  1 file changed, 28 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> > > index 0837627..b94ef6a 100644
> > > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > > @@ -1686,6 +1686,8 @@ static void __vmx_deliver_posted_interrupt(struct
> > vcpu *v)
> > >
> > >  static void vmx_deliver_posted_intr(struct vcpu *v, u8 vector)
> > >  {
> > > +    struct pi_desc old, new, prev;
> > > +
> >
> > move to 'else if'.
> >
> > >      if ( pi_test_and_set_pir(vector, &v->arch.hvm_vmx.pi_desc) )
> > >          return;
> > >
> > > @@ -1698,13 +1700,35 @@ static void vmx_deliver_posted_intr(struct vcpu
> > *v, u8
> > > vector)
> > >           */
> > >          pi_set_on(&v->arch.hvm_vmx.pi_desc);
> > >      }
> > > -    else if ( !pi_test_and_set_on(&v->arch.hvm_vmx.pi_desc) )
> > > +    else
> > >      {
> > > +        prev.control = 0;
> > > +
> > > +        do {
> > > +            old.control = v->arch.hvm_vmx.pi_desc.control &
> > > +                          ~(1 << POSTED_INTR_ON | 1 <<
> > POSTED_INTR_SN);
> > > +            new.control = v->arch.hvm_vmx.pi_desc.control |
> > > +                          1 << POSTED_INTR_ON;
> > > +
> > > +            /*
> > > +             * Currently, we don't support urgent interrupt, all
> > > +             * interrupts are recognized as non-urgent interrupt,
> > > +             * so we cannot send posted-interrupt when 'SN' is set.
> > > +             * Besides that, if 'ON' is already set, we cannot set
> > > +             * posted-interrupts as well.
> > > +             */
> > > +            if ( prev.sn || prev.on )
> > > +            {
> > > +                vcpu_kick(v);
> > > +                return;
> > > +            }
> >
> > would it make more sense to move above check after cmpxchg?
> 
> My original idea is that, we only need to do the check when
> prev.control != old.control, which means the cmpxchg is not
> successful completed. If we add the check between cmpxchg
> and while ( prev.control != old.control ), it seems the logic is
> not so clear, since we don't need to check prev.sn and prev.on
> when cmxchg succeeds in setting the new value.
> 
> Thanks,
> Feng
> 

Then it'd be clearer if you move the check to the start of the loop, so
you can avoid two additional reads when prev.on/sn is set. :-)

Thanks
Kevin
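
The suggestion above (test SN/ON first, then attempt the exchange) can be
sketched as follows. This is only an illustration using C11 atomics, not the
Xen code: pi_try_set_on(), PI_ON and PI_SN are hypothetical names, and the bit
positions are chosen arbitrarily. A caller would send the notification when
the function returns true and kick the vCPU otherwise.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define PI_ON (UINT64_C(1) << 0)
#define PI_SN (UINT64_C(1) << 1)

/*
 * Try to set ON in the control word.
 * Returns true if this caller set ON, false if ON or SN was already set.
 */
static bool pi_try_set_on(_Atomic uint64_t *control)
{
    uint64_t old = atomic_load(control);

    for ( ; ; )
    {
        /* Check first, so no exchange is attempted when ON/SN is set. */
        if ( old & (PI_ON | PI_SN) )
            return false;

        /* On failure, 'old' is refreshed with the current value. */
        if ( atomic_compare_exchange_weak(control, &old, old | PI_ON) )
            return true;
    }
}
```

Because a failed exchange hands back the current value, the loop re-tests
ON/SN on that value without an extra read of the descriptor.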


* Re: [v3 11/15] Update IRTE according to guest interrupt config changes
  2015-07-08 10:31     ` Wu, Feng
@ 2015-07-08 11:46       ` Tian, Kevin
  2015-07-08 11:52         ` Wu, Feng
  0 siblings, 1 reply; 155+ messages in thread
From: Tian, Kevin @ 2015-07-08 11:46 UTC (permalink / raw)
  To: Wu, Feng, xen-devel@lists.xen.org
  Cc: Zhang, Yang Z, andrew.cooper3@citrix.com, keir@xen.org,
	george.dunlap@eu.citrix.com, jbeulich@suse.com

> From: Wu, Feng
> Sent: Wednesday, July 08, 2015 6:32 PM
> 
> 
> 
> > -----Original Message-----
> > From: Tian, Kevin
> > Sent: Wednesday, July 08, 2015 6:23 PM
> > To: Wu, Feng; xen-devel@lists.xen.org
> > Cc: keir@xen.org; jbeulich@suse.com; andrew.cooper3@citrix.com; Zhang,
> > Yang Z; george.dunlap@eu.citrix.com
> > Subject: RE: [v3 11/15] Update IRTE according to guest interrupt config
> > changes
> >
> > > From: Wu, Feng
> > > Sent: Wednesday, June 24, 2015 1:18 PM
> > >
> > > When guest changes its interrupt configuration (such as, vector, etc.)
> > > for direct-assigned devices, we need to update the associated IRTE
> > > with the new guest vector, so external interrupts from the assigned
> > > devices can be injected to guests without VM-Exit.
> > >
> > > For lowest-priority interrupts, we use vector-hashing mechamisn to find
> > > the destination vCPU. This follows the hardware behavior, since modern
> > > Intel CPUs use vector hashing to handle the lowest-priority interrupt.
> > >
> > > For multicast/broadcast vCPU, we cannot handle it via interrupt posting,
> > > still use interrupt remapping.
> > >
> > > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > > ---
> > > v3:
> > > - Use bitmap to store the all the possible destination vCPUs of an
> > > interrupt, then trying to find the right destination from the bitmap
> > > - Typo and some small changes
> > >
> > >  xen/drivers/passthrough/io.c | 96
> > > +++++++++++++++++++++++++++++++++++++++++++-
> > >  1 file changed, 95 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> > > index 9b77334..18e24e1 100644
> > > --- a/xen/drivers/passthrough/io.c
> > > +++ b/xen/drivers/passthrough/io.c
> > > @@ -26,6 +26,7 @@
> > >  #include <asm/hvm/iommu.h>
> > >  #include <asm/hvm/support.h>
> > >  #include <xen/hvm/irq.h>
> > > +#include <asm/io_apic.h>
> > >
> > >  static DEFINE_PER_CPU(struct list_head, dpci_list);
> > >
> > > @@ -199,6 +200,78 @@ void free_hvm_irq_dpci(struct hvm_irq_dpci *dpci)
> > >      xfree(dpci);
> > >  }
> > >
> > > +/*
> > > + * The purpose of this routine is to find the right destination vCPU for
> > > + * an interrupt which will be delivered by VT-d posted-interrupt. There
> > > + * are several cases as below:
> >
> > If you aim to have this interface common to more usages, don't restrict to
> > VT-d posted-interrupt which should be just an example.
> 
> Yes, making this a common interface should be better.
> 
> >
> > > + *
> > > + * - For lowest-priority interrupts, we find the destination vCPU from the
> > > + *   guest vector using vector-hashing mechanism and return true. This
> > follows
> > > + *   the hardware behavior, since modern Intel CPUs use vector hashing to
> > > + *   handle the lowest-priority interrupt.
> >
> > Does AMD use same hashing mechanism? Can this interface be reused by
> > other IOMMU type or it's an Intel specific implementation?
> 
> I am not sure how AMD handle lowest-priority. Intel hardware guys told me
> recent Intel hardware platform use this method to deliver lowest-priority
> interrupts. What do you mean by "other IOMMU type"?
> 

The OS doesn't assume how vector hashing is done at the hardware level, so it
should be fine to use the Intel algorithm in this emulation path. However, my
point is just about the comment "since modern Intel CPUs use vector hashing to
handle the lowest-priority interrupt". It's not because Intel does so; it's an
implementation choice that you use the Intel algorithm here.

Thanks
Kevin
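
For readers unfamiliar with the idea, vector hashing over a bitmap of
candidate vCPUs can be sketched roughly as below. This is a minimal
illustration under assumptions, not the patch's implementation:
pick_dest_vcpu() is a hypothetical name, the bitmap is limited to 64 vCPUs,
and "vector modulo number of candidates" is assumed as the hash.

```c
#include <stdint.h>

/*
 * Pick the (vector % count)-th set bit of 'candidates', where bit i set
 * means vCPU i is a possible destination. Returns -1 if there is none.
 */
static int pick_dest_vcpu(uint64_t candidates, uint8_t vector)
{
    unsigned int count = 0, target, i;

    /* Count the candidate vCPUs. */
    for ( i = 0; i < 64; i++ )
        if ( candidates & (UINT64_C(1) << i) )
            count++;

    if ( count == 0 )
        return -1;

    target = vector % count;

    /* Walk the set bits until the target-th one is reached. */
    for ( i = 0; i < 64; i++ )
    {
        if ( !(candidates & (UINT64_C(1) << i)) )
            continue;
        if ( target-- == 0 )
            return (int)i;
    }

    return -1; /* not reached when count > 0 */
}
```

The same vector always maps to the same vCPU for a given candidate set, which
is what makes the choice deterministic from the guest's point of view.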


* Re: [v3 11/15] Update IRTE according to guest interrupt config changes
  2015-07-08 11:46       ` Tian, Kevin
@ 2015-07-08 11:52         ` Wu, Feng
  2015-07-08 11:54           ` Tian, Kevin
  0 siblings, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-07-08 11:52 UTC (permalink / raw)
  To: Tian, Kevin, xen-devel@lists.xen.org
  Cc: Wu, Feng, george.dunlap@eu.citrix.com, andrew.cooper3@citrix.com,
	jbeulich@suse.com, Zhang, Yang Z, keir@xen.org



> -----Original Message-----
> From: Tian, Kevin
> Sent: Wednesday, July 08, 2015 7:46 PM
> To: Wu, Feng; xen-devel@lists.xen.org
> Cc: keir@xen.org; jbeulich@suse.com; andrew.cooper3@citrix.com; Zhang,
> Yang Z; george.dunlap@eu.citrix.com
> Subject: RE: [v3 11/15] Update IRTE according to guest interrupt config
> changes
> 
> > From: Wu, Feng
> > Sent: Wednesday, July 08, 2015 6:32 PM
> >
> >
> >
> > > -----Original Message-----
> > > From: Tian, Kevin
> > > Sent: Wednesday, July 08, 2015 6:23 PM
> > > To: Wu, Feng; xen-devel@lists.xen.org
> > > Cc: keir@xen.org; jbeulich@suse.com; andrew.cooper3@citrix.com; Zhang,
> > > Yang Z; george.dunlap@eu.citrix.com
> > > Subject: RE: [v3 11/15] Update IRTE according to guest interrupt config
> > > changes
> > >
> > > > From: Wu, Feng
> > > > Sent: Wednesday, June 24, 2015 1:18 PM
> > > >
> > > > When guest changes its interrupt configuration (such as, vector, etc.)
> > > > for direct-assigned devices, we need to update the associated IRTE
> > > > with the new guest vector, so external interrupts from the assigned
> > > > devices can be injected to guests without VM-Exit.
> > > >
> > > > For lowest-priority interrupts, we use vector-hashing mechamisn to find
> > > > the destination vCPU. This follows the hardware behavior, since modern
> > > > Intel CPUs use vector hashing to handle the lowest-priority interrupt.
> > > >
> > > > For multicast/broadcast vCPU, we cannot handle it via interrupt posting,
> > > > still use interrupt remapping.
> > > >
> > > > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > > > ---
> > > > v3:
> > > > - Use bitmap to store the all the possible destination vCPUs of an
> > > > interrupt, then trying to find the right destination from the bitmap
> > > > - Typo and some small changes
> > > >
> > > >  xen/drivers/passthrough/io.c | 96
> > > > +++++++++++++++++++++++++++++++++++++++++++-
> > > >  1 file changed, 95 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> > > > index 9b77334..18e24e1 100644
> > > > --- a/xen/drivers/passthrough/io.c
> > > > +++ b/xen/drivers/passthrough/io.c
> > > > @@ -26,6 +26,7 @@
> > > >  #include <asm/hvm/iommu.h>
> > > >  #include <asm/hvm/support.h>
> > > >  #include <xen/hvm/irq.h>
> > > > +#include <asm/io_apic.h>
> > > >
> > > >  static DEFINE_PER_CPU(struct list_head, dpci_list);
> > > >
> > > > @@ -199,6 +200,78 @@ void free_hvm_irq_dpci(struct hvm_irq_dpci
> *dpci)
> > > >      xfree(dpci);
> > > >  }
> > > >
> > > > +/*
> > > > + * The purpose of this routine is to find the right destination vCPU for
> > > > + * an interrupt which will be delivered by VT-d posted-interrupt. There
> > > > + * are several cases as below:
> > >
> > > If you aim to have this interface common to more usages, don't restrict to
> > > VT-d posted-interrupt which should be just an example.
> >
> > Yes, making this a common interface should be better.
> >
> > >
> > > > + *
> > > > + * - For lowest-priority interrupts, we find the destination vCPU from the
> > > > + *   guest vector using vector-hashing mechanism and return true. This
> > > follows
> > > > + *   the hardware behavior, since modern Intel CPUs use vector
> hashing to
> > > > + *   handle the lowest-priority interrupt.
> > >
> > > Does AMD use same hashing mechanism? Can this interface be reused by
> > > other IOMMU type or it's an Intel specific implementation?
> >
> > I am not sure how AMD handle lowest-priority. Intel hardware guys told me
> > recent Intel hardware platform use this method to deliver lowest-priority
> > interrupts. What do you mean by "other IOMMU type"?
> >
> 
> OS doesn't assume how vector hashing is done in hardware level. So it should
> be fine to use Intel algorithm in this emulation path. However my point is just
> about the comment " since modern Intel CPUs use vector hashing to handle
> the lowest-priority interrupt". It's not because Intel does so. It's the
> implementation option that you choose Intel algorithm here.

Here I can say that we choose vector hashing for lowest-priority handling, and
list Intel as an example of hardware that uses it. Okay?

Thanks,
Feng

> 
> Thanks
> Kevin


* Re: [v3 11/15] Update IRTE according to guest interrupt config changes
  2015-07-08 11:52         ` Wu, Feng
@ 2015-07-08 11:54           ` Tian, Kevin
  0 siblings, 0 replies; 155+ messages in thread
From: Tian, Kevin @ 2015-07-08 11:54 UTC (permalink / raw)
  To: Wu, Feng, xen-devel@lists.xen.org
  Cc: Zhang, Yang Z, andrew.cooper3@citrix.com, keir@xen.org,
	george.dunlap@eu.citrix.com, jbeulich@suse.com

> From: Wu, Feng
> Sent: Wednesday, July 08, 2015 7:52 PM
> > > > > + * - For lowest-priority interrupts, we find the destination vCPU from the
> > > > > + *   guest vector using vector-hashing mechanism and return true. This
> > > > follows
> > > > > + *   the hardware behavior, since modern Intel CPUs use vector
> > hashing to
> > > > > + *   handle the lowest-priority interrupt.
> > > >
> > > > Does AMD use same hashing mechanism? Can this interface be reused by
> > > > other IOMMU type or it's an Intel specific implementation?
> > >
> > > I am not sure how AMD handle lowest-priority. Intel hardware guys told me
> > > recent Intel hardware platform use this method to deliver lowest-priority
> > > interrupts. What do you mean by "other IOMMU type"?
> > >
> >
> > OS doesn't assume how vector hashing is done in hardware level. So it should
> > be fine to use Intel algorithm in this emulation path. However my point is just
> > about the comment " since modern Intel CPUs use vector hashing to handle
> > the lowest-priority interrupt". It's not because Intel does so. It's the
> > implementation option that you choose Intel algorithm here.
> 
> here I can mention: we choose vector-hashing for lowest-priority handling and
> list Intel as an example to use it, okay?
> 

Yes. :-)

Thanks
Kevin


* Re: [v3 08/15] Suppress posting interrupts when 'SN' is set
  2015-07-08 11:31       ` Tian, Kevin
@ 2015-07-08 11:58         ` Wu, Feng
  0 siblings, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-08 11:58 UTC (permalink / raw)
  To: Tian, Kevin, xen-devel@lists.xen.org
  Cc: Wu, Feng, george.dunlap@eu.citrix.com, andrew.cooper3@citrix.com,
	jbeulich@suse.com, Zhang, Yang Z, keir@xen.org



> -----Original Message-----
> From: Tian, Kevin
> Sent: Wednesday, July 08, 2015 7:31 PM
> To: Wu, Feng; xen-devel@lists.xen.org
> Cc: keir@xen.org; jbeulich@suse.com; andrew.cooper3@citrix.com; Zhang,
> Yang Z; george.dunlap@eu.citrix.com
> Subject: RE: [v3 08/15] Suppress posting interrupts when 'SN' is set
> 
> > From: Wu, Feng
> > Sent: Wednesday, July 08, 2015 6:11 PM
> > > From: Tian, Kevin
> > > Sent: Wednesday, July 08, 2015 5:06 PM
> > >
> > > > From: Wu, Feng
> > > > Sent: Wednesday, June 24, 2015 1:18 PM
> > > >
> > > > Currently, we don't support urgent interrupt, all interrupts
> > > > are recognized as non-urgent interrupt, so we cannot send
> > > > posted-interrupt when 'SN' is set.
> > > >
> > > > Signed-off-by: Feng Wu <feng.wu@intel.com>
> > > > ---
> > > > v3:
> > > > use cmpxchg to test SN/ON and set ON
> > > >
> > > >  xen/arch/x86/hvm/vmx/vmx.c | 32
> > ++++++++++++++++++++++++++++----
> > > >  1 file changed, 28 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> > > > index 0837627..b94ef6a 100644
> > > > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > > > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > > > @@ -1686,6 +1686,8 @@ static void
> __vmx_deliver_posted_interrupt(struct
> > > vcpu *v)
> > > >
> > > >  static void vmx_deliver_posted_intr(struct vcpu *v, u8 vector)
> > > >  {
> > > > +    struct pi_desc old, new, prev;
> > > > +
> > >
> > > move to 'else if'.
> > >
> > > >      if ( pi_test_and_set_pir(vector, &v->arch.hvm_vmx.pi_desc) )
> > > >          return;
> > > >
> > > > @@ -1698,13 +1700,35 @@ static void vmx_deliver_posted_intr(struct
> vcpu
> > > *v, u8
> > > > vector)
> > > >           */
> > > >          pi_set_on(&v->arch.hvm_vmx.pi_desc);
> > > >      }
> > > > -    else if ( !pi_test_and_set_on(&v->arch.hvm_vmx.pi_desc) )
> > > > +    else
> > > >      {
> > > > +        prev.control = 0;
> > > > +
> > > > +        do {
> > > > +            old.control = v->arch.hvm_vmx.pi_desc.control &
> > > > +                          ~(1 << POSTED_INTR_ON | 1 <<
> > > POSTED_INTR_SN);
> > > > +            new.control = v->arch.hvm_vmx.pi_desc.control |
> > > > +                          1 << POSTED_INTR_ON;
> > > > +
> > > > +            /*
> > > > +             * Currently, we don't support urgent interrupt, all
> > > > +             * interrupts are recognized as non-urgent interrupt,
> > > > +             * so we cannot send posted-interrupt when 'SN' is set.
> > > > +             * Besides that, if 'ON' is already set, we cannot set
> > > > +             * posted-interrupts as well.
> > > > +             */
> > > > +            if ( prev.sn || prev.on )
> > > > +            {
> > > > +                vcpu_kick(v);
> > > > +                return;
> > > > +            }
> > >
> > > would it make more sense to move above check after cmpxchg?
> >
> > My original idea is that, we only need to do the check when
> > prev.control != old.control, which means the cmpxchg is not
> > successful completed. If we add the check between cmpxchg
> > and while ( prev.control != old.control ), it seems the logic is
> > not so clear, since we don't need to check prev.sn and prev.on
> > when cmxchg succeeds in setting the new value.
> >
> > Thanks,
> > Feng
> >
> 
> Then it'd be clearer if you move the check the start of the loop, so
> you can avoid two additional reads when the prev.on/sn is set. :-)

Good idea!

Thanks,
Feng

> 
> Thanks
> Kevin


* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-08 11:00   ` Tian, Kevin
  2015-07-08 11:02     ` Wu, Feng
@ 2015-07-08 12:46     ` Jan Beulich
  2015-07-08 13:09       ` Andrew Cooper
  2015-07-08 22:31       ` Tian, Kevin
  1 sibling, 2 replies; 155+ messages in thread
From: Jan Beulich @ 2015-07-08 12:46 UTC (permalink / raw)
  To: Feng Wu, Kevin Tian
  Cc: george.dunlap@eu.citrix.com, andrew.cooper3@citrix.com,
	Yang Z Zhang, keir@xen.org, xen-devel@lists.xen.org

>>> On 08.07.15 at 13:00, <kevin.tian@intel.com> wrote:
>> @@ -1848,6 +1869,33 @@ static struct hvm_function_table __initdata
>> vmx_function_table = {
>>      .enable_msr_exit_interception = vmx_enable_msr_exit_interception,
>>  };
>> 
>> +/*
>> + * Handle VT-d posted-interrupt when VCPU is blocked.
>> + */
>> +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
>> +{
>> +    struct arch_vmx_struct *vmx;
>> +    unsigned int cpu = smp_processor_id();
>> +
>> +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
>> +
>> +    /*
>> +     * FIXME: The length of the list depends on how many
>> +     * vCPU is current blocked on this specific pCPU.
>> +     * This may hurt the interrupt latency if the list
>> +     * grows to too many entries.
>> +     */
> 
> let's go with this linked list first until a real issue is identified.

This is exactly the way of thinking I dislike when it comes to code
that isn't intended to be experimental only: We shouldn't wait
for problems to surface when we already can see them. I.e. if
there are no plans to deal with this, I'd ask for the feature to be
off by default and be properly marked experimental in the
command line option documentation (so people know to stay
away from it).

Jan


* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-08 12:46     ` Jan Beulich
@ 2015-07-08 13:09       ` Andrew Cooper
  2015-07-08 22:49         ` Tian, Kevin
  2015-07-08 22:31       ` Tian, Kevin
  1 sibling, 1 reply; 155+ messages in thread
From: Andrew Cooper @ 2015-07-08 13:09 UTC (permalink / raw)
  To: Jan Beulich, Feng Wu, Kevin Tian
  Cc: george.dunlap@eu.citrix.com, Yang Z Zhang, keir@xen.org,
	xen-devel@lists.xen.org

On 08/07/2015 13:46, Jan Beulich wrote:
>>>> On 08.07.15 at 13:00, <kevin.tian@intel.com> wrote:
>>> @@ -1848,6 +1869,33 @@ static struct hvm_function_table __initdata
>>> vmx_function_table = {
>>>      .enable_msr_exit_interception = vmx_enable_msr_exit_interception,
>>>  };
>>>
>>> +/*
>>> + * Handle VT-d posted-interrupt when VCPU is blocked.
>>> + */
>>> +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
>>> +{
>>> +    struct arch_vmx_struct *vmx;
>>> +    unsigned int cpu = smp_processor_id();
>>> +
>>> +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
>>> +
>>> +    /*
>>> +     * FIXME: The length of the list depends on how many
>>> +     * vCPU is current blocked on this specific pCPU.
>>> +     * This may hurt the interrupt latency if the list
>>> +     * grows to too many entries.
>>> +     */
>> let's go with this linked list first until a real issue is identified.
> This is exactly the way of thinking I dislike when it comes to code
> that isn't intended to be experimental only: We shouldn't wait
> for problems to surface when we already can see them. I.e. if
> there are no plans to deal with this, I'd ask for the feature to be
> off by default and be properly marked experimental in the
> command line option documentation (so people know to stay
> away from it).

And in this specific case, there is no balancing of vcpus across the
pcpu lists.

One can construct a pathological case using pinning and pausing to get
almost every vcpu on a single pcpu list, and vcpus receiving fewer
interrupts will exacerbate the problem by staying on the list for longer
periods of time.

IMO, the PI feature cannot be declared as done/supported with this bug
remaining.  OTOH, it is fine to be experimental, and disabled by default
for people who wish to experiment.

~Andrew
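
To make the cost being discussed concrete, here is a simplified sketch of a
wakeup handler scanning one pCPU's list of blocked vCPUs. It is an assumption
about the general shape only: struct blocked_vcpu and wakeup_scan() are
hypothetical, locking and list removal are omitted, and the boolean fields
stand in for the descriptor's ON bit and the unblock call.

```c
#include <stdbool.h>
#include <stddef.h>

struct blocked_vcpu {
    struct blocked_vcpu *next;
    bool on;     /* stand-in for the descriptor's ON bit */
    bool woken;  /* stand-in for "the vCPU was unblocked" */
};

/* Walk one pCPU's list; returns the number of entries visited. */
static size_t wakeup_scan(struct blocked_vcpu *head)
{
    size_t visited = 0;

    for ( struct blocked_vcpu *v = head; v != NULL; v = v->next )
    {
        visited++;
        /* Only vCPUs with a pending notification are woken. */
        if ( v->on )
            v->woken = true;
    }

    return visited;
}
```

Every wakeup interrupt visits every entry, so the work done in interrupt
context grows linearly with the number of vCPUs blocked on that pCPU.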


* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-08 12:46     ` Jan Beulich
  2015-07-08 13:09       ` Andrew Cooper
@ 2015-07-08 22:31       ` Tian, Kevin
  1 sibling, 0 replies; 155+ messages in thread
From: Tian, Kevin @ 2015-07-08 22:31 UTC (permalink / raw)
  To: Jan Beulich, Wu, Feng
  Cc: george.dunlap@eu.citrix.com, andrew.cooper3@citrix.com,
	Zhang, Yang Z, keir@xen.org, xen-devel@lists.xen.org

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, July 08, 2015 8:46 PM
> 
> >>> On 08.07.15 at 13:00, <kevin.tian@intel.com> wrote:
> >> @@ -1848,6 +1869,33 @@ static struct hvm_function_table __initdata
> >> vmx_function_table = {
> >>      .enable_msr_exit_interception = vmx_enable_msr_exit_interception,
> >>  };
> >>
> >> +/*
> >> + * Handle VT-d posted-interrupt when VCPU is blocked.
> >> + */
> >> +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> >> +{
> >> +    struct arch_vmx_struct *vmx;
> >> +    unsigned int cpu = smp_processor_id();
> >> +
> >> +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
> >> +
> >> +    /*
> >> +     * FIXME: The length of the list depends on how many
> >> +     * vCPU is current blocked on this specific pCPU.
> >> +     * This may hurt the interrupt latency if the list
> >> +     * grows to too many entries.
> >> +     */
> >
> > let's go with this linked list first until a real issue is identified.
> 
> This is exactly the way of thinking I dislike when it comes to code
> that isn't intended to be experimental only: We shouldn't wait
> for problems to surface when we already can see them. I.e. if
> there are no plans to deal with this, I'd ask for the feature to be
> off by default and be properly marked experimental in the
> command line option documentation (so people know to stay
> away from it).
> 

I don't see a big problem here. For a typical server consolidation ratio
of 1:4 or 1:8, the linked list should be short. It's not experimental; it's
based on our judgment that a linked list should be fine here. Any
structure may have potential problems that we don't know about now,
just like the tasklet scalability issue Andrew raised earlier (in that case
we can't blame anyone for using tasklets before the issue was reported).

So there's no plan to address this, since we don't see a real problem
now. I'd suggest that Feng remove the comment, since FIXME implies
something experimental. Or if you have a better suggestion, please
articulate it instead of asking people to stay away from the feature.

Thanks
Kevin


* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-08 13:09       ` Andrew Cooper
@ 2015-07-08 22:49         ` Tian, Kevin
  2015-07-09  7:25           ` Jan Beulich
  0 siblings, 1 reply; 155+ messages in thread
From: Tian, Kevin @ 2015-07-08 22:49 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich, Wu, Feng
  Cc: george.dunlap@eu.citrix.com, Zhang, Yang Z, keir@xen.org,
	xen-devel@lists.xen.org

> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Wednesday, July 08, 2015 9:09 PM
> 
> On 08/07/2015 13:46, Jan Beulich wrote:
> >>>> On 08.07.15 at 13:00, <kevin.tian@intel.com> wrote:
> >>> @@ -1848,6 +1869,33 @@ static struct hvm_function_table __initdata
> >>> vmx_function_table = {
> >>>      .enable_msr_exit_interception = vmx_enable_msr_exit_interception,
> >>>  };
> >>>
> >>> +/*
> >>> + * Handle VT-d posted-interrupt when VCPU is blocked.
> >>> + */
> >>> +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> >>> +{
> >>> +    struct arch_vmx_struct *vmx;
> >>> +    unsigned int cpu = smp_processor_id();
> >>> +
> >>> +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
> >>> +
> >>> +    /*
> >>> +     * FIXME: The length of the list depends on how many
> >>> +     * vCPU is current blocked on this specific pCPU.
> >>> +     * This may hurt the interrupt latency if the list
> >>> +     * grows to too many entries.
> >>> +     */
> >> let's go with this linked list first until a real issue is identified.
> > This is exactly the way of thinking I dislike when it comes to code
> > that isn't intended to be experimental only: We shouldn't wait
> > for problems to surface when we already can see them. I.e. if
> > there are no plans to deal with this, I'd ask for the feature to be
> > off by default and be properly marked experimental in the
> > command line option documentation (so people know to stay
> > away from it).
> 
> And in this specific case, there is no balancing of vcpus across the
> pcpus lists.
> 
> One can construct a pathological case using pinning and pausing to get
> almost every vcpu on a single pcpu list, and vcpus recieving fewer
> interrupts will exasperate the problem by staying on the list for longer
> periods of time.

In that extreme case I believe contention in other code paths will be
much larger than the overhead caused by this structure's limitation.

> 
> IMO, the PI feature cannot be declared as done/supported with this bug
> remaining.  OTOH, it is fine to be experimental, and disabled by default
> for people who wish to experiment.
> 

Again, I don't expect to see it disabled as experimental. In a good production
environment, where vcpus are well balanced and interrupt latency matters,
a linked list should be efficient here. In a bad environment like the extreme
case you raised, I don't know whether tuning just the interrupt path really
matters.

Thanks
Kevin


* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-02  8:20         ` Dario Faggioli
@ 2015-07-09  3:09           ` Wu, Feng
  2015-07-09  8:18             ` Dario Faggioli
  2015-07-09 11:19             ` George Dunlap
  0 siblings, 2 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-09  3:09 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel, jbeulich@suse.com,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Thursday, July 02, 2015 4:21 PM
> To: Wu, Feng
> Cc: xen-devel; keir@xen.org; jbeulich@suse.com; andrew.cooper3@citrix.com;
> Tian, Kevin; Zhang, Yang Z; george.dunlap@eu.citrix.com
> Subject: Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU
> scheduling
> 
> On Thu, 2015-07-02 at 04:32 +0000, Wu, Feng wrote:
> >
> >
> > > -----Original Message-----
> > > From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> > > Sent: Tuesday, June 30, 2015 10:58 AM
> > > To: Wu, Feng
> > > Cc: xen-devel; keir@xen.org; jbeulich@suse.com;
> andrew.cooper3@citrix.com;
> > > Tian, Kevin; Zhang, Yang Z; george.dunlap@eu.citrix.com; Wu, Feng
> > > Subject: Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during
> vCPU
> > > scheduling
> > >
> > > On Mon, 2015-06-29 at 18:36 +0100, Andrew Cooper wrote:
> > >
> > > >
> > > > The basic idea here is:
> > > > 1. When vCPU's state is RUNSTATE_running,
> > > >         - set 'NV' to 'Notification Vector'.
> > > >         - Clear 'SN' to accpet PI.
> > > >         - set 'NDST' to the right pCPU.
> > > > 2. When vCPU's state is RUNSTATE_blocked,
> > > >         - set 'NV' to 'Wake-up Vector', so we can wake up the
> > > >           related vCPU when posted-interrupt happens for it.
> > > >         - Clear 'SN' to accpet PI.
> > > > 3. When vCPU's state is RUNSTATE_runnable/RUNSTATE_offline,
> > > >         - Set 'SN' to suppress non-urgent interrupts.
> > > >           (Current, we only support non-urgent interrupts)
> > > >         - Set 'NV' back to 'Notification Vector' if needed.
> > > >
> > > It might be me, but it feels a bit odd to see RUNSTATE-s being (ab)used
> > > directly for this, as it does feel odd to see arch specific code being
> > > added in there.
> > >
> > > Can't this be done in context_switch(), which is already architecture
> > > specific? I was thinking to something very similar to what has been done
> > > for PSR, i.e., on x86, put everything in __context_switch().
> > >
> > > Looking at who's prev and who's next, and at what pause_flags each has
> > > set, you should be able to implement all of the above logic.
> > >
> > > Or am I missing something?
> >
> > As mentioned in the description of this patch, here we need to do
> > something when the vCPU's state is changed, can we get the
> > state transition in __context_switch(), such as "running -> blocking"?
> >
> Well, in the patch description you mention how you've done it, so of
> course it mentions runstates.
> 
> That does not necessarily means "we need to do something" in
> vcpu_runstate_change(). Actually, that's exactly what I'm asking: can
> you check whether this thing that you need doing can be done somewhere
> else than in vcpu_runstaete_change() ?

Why do you think vcpu_runstate_change() is not the right place to do this?

> 
> In fact, looking at how, where and what for, runstetes are used, that
> really does not feel right, at least to me. What you seem to be
> interested is whether a vCPU blocks and/or unblocks

Not just blocking and unblocking; we need to track each state transition and
update the posted-interrupt descriptor accordingly.

> Runstates are an
> abstraction, build up on top of (mostly) pause_flags, like _VPF_blocked
> (look at how runstate is updated).
> 
> I think you should not build on top of such abstraction, but on top of
> pause_flags directly.

I don't think building on top of the vCPU state is unsuitable for my case.
The abstraction is there, so why can't people use it?

Thanks,
Feng
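
The state-to-descriptor mapping quoted at the top of this message can be
sketched as a single update routine. This is an illustration only: the enum,
struct pi_fields, pi_update() and both vector values are hypothetical, and a
real implementation would have to update the descriptor atomically.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the runstates named in the thread. */
enum runstate { RS_RUNNING, RS_RUNNABLE, RS_BLOCKED, RS_OFFLINE };

/* Hypothetical vector numbers, for illustration only. */
#define NOTIFY_VECTOR 0xf2
#define WAKEUP_VECTOR 0xf1

struct pi_fields {
    uint8_t  nv;    /* notification vector */
    bool     sn;    /* suppress notification */
    uint32_t ndst;  /* notification destination (pCPU) */
};

static void pi_update(struct pi_fields *pi, enum runstate new_state,
                      uint32_t pcpu_dest)
{
    switch ( new_state )
    {
    case RS_RUNNING:
        pi->nv = NOTIFY_VECTOR;
        pi->sn = false;          /* accept posted interrupts */
        pi->ndst = pcpu_dest;
        break;

    case RS_BLOCKED:
        pi->nv = WAKEUP_VECTOR;  /* wake the vCPU on notification */
        pi->sn = false;
        break;

    case RS_RUNNABLE:
    case RS_OFFLINE:
        pi->sn = true;           /* suppress non-urgent interrupts */
        pi->nv = NOTIFY_VECTOR;
        break;
    }
}
```

Whether this routine is driven from the runstate change or from pause_flags in
the context-switch path is exactly the open question in this sub-thread; the
mapping itself is the same either way.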

> I had a quick look, and it indeed seems to me that
> you can get all you need from there too. It might even result in the
> code looking simpler (but that's of course hard to tell without actually
> trying). In fact, inside the context switching code, you already know
> that prev was running so, if it has the proper flag set, it means it's
> blocking (i.e., going to RUNSTATE_blocked, in runstates language), if
> not, it maybe is being preempted (i.e., going to RUNSTATE_runnable).
> Therefore, you can enact all your logic, even without any need to keep
> track of the previous runstate, and without needing to build up a full
> state machine and looking at all possible transitions.
> 
> So, can you have a look at whether that solution can fly? Because, if it
> does, I think it would be a lot better.
> 
> Regards,
> Dario
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-08 22:49         ` Tian, Kevin
@ 2015-07-09  7:25           ` Jan Beulich
  2015-07-10  6:21             ` Wu, Feng
  0 siblings, 1 reply; 155+ messages in thread
From: Jan Beulich @ 2015-07-09  7:25 UTC (permalink / raw)
  To: Feng Wu, Kevin Tian
  Cc: george.dunlap@eu.citrix.com, Andrew Cooper, YangZ Zhang,
	keir@xen.org, xen-devel@lists.xen.org

>>> On 09.07.15 at 00:49, <kevin.tian@intel.com> wrote:
>>  From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
>> Sent: Wednesday, July 08, 2015 9:09 PM
>> On 08/07/2015 13:46, Jan Beulich wrote:
>> >>>> On 08.07.15 at 13:00, <kevin.tian@intel.com> wrote:
>> >>> @@ -1848,6 +1869,33 @@ static struct hvm_function_table __initdata
>> >>> vmx_function_table = {
>> >>>      .enable_msr_exit_interception = vmx_enable_msr_exit_interception,
>> >>>  };
>> >>>
>> >>> +/*
>> >>> + * Handle VT-d posted-interrupt when VCPU is blocked.
>> >>> + */
>> >>> +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
>> >>> +{
>> >>> +    struct arch_vmx_struct *vmx;
>> >>> +    unsigned int cpu = smp_processor_id();
>> >>> +
>> >>> +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
>> >>> +
>> >>> +    /*
>> >>> +     * FIXME: The length of the list depends on how many
>> >>> +     * vCPU is current blocked on this specific pCPU.
>> >>> +     * This may hurt the interrupt latency if the list
>> >>> +     * grows to too many entries.
>> >>> +     */
>> >> let's go with this linked list first until a real issue is identified.
>> > This is exactly the way of thinking I dislike when it comes to code
>> > that isn't intended to be experimental only: We shouldn't wait
>> > for problems to surface when we already can see them. I.e. if
>> > there are no plans to deal with this, I'd ask for the feature to be
>> > off by default and be properly marked experimental in the
>> > command line option documentation (so people know to stay
>> > away from it).
>> 
>> And in this specific case, there is no balancing of vcpus across the
>> pcpus lists.
>> 
>> One can construct a pathological case using pinning and pausing to get
>> almost every vcpu on a single pcpu list, and vcpus receiving fewer
>> interrupts will exacerbate the problem by staying on the list for longer
>> periods of time.
> 
> In that extreme case I believe contention in other code paths will
> be much larger than the overhead caused by this structure limitation.

Examples?

>> IMO, the PI feature cannot be declared as done/supported with this bug
>> remaining.  OTOH, it is fine to be experimental, and disabled by default
>> for people who wish to experiment.
>> 
> 
> Again, I don't expect to see it disabled as experimental. For a good
> production environment where vcpus are well balanced and interrupt
> latency is sensitive, a linked list should be efficient here. For a bad
> environment like the extreme case you raised, I don't know whether it
> really matters to just tune the interrupt path.

Can you _guarantee_ that everything potentially leading to such a
pathological situation is covered by XSA-77? And even if it is now,
removing elements from the waiver list would become significantly
more difficult if disconnected behavior like this one would need to
be taken into account.

Please understand that history has told us to be rather more careful
than might seem necessary with this: ATS originally having been
enabled by default is one bold example, and the recent flood of MSI
related XSAs is another; I suppose I could find more. All affecting
code originating from Intel, apparently written with only functionality
in mind, while having left out (other than basic) security considerations.

IOW, with my committer role hat on, the feature is going to be
experimental (and hence default off) unless the issue here gets
addressed. And no, I cannot immediately suggest a good approach,
and with all of the rush before the feature freeze I also can't justify
taking a lot of time to think of options. 

Jan
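
To make the list-length concern above concrete, here is a minimal sketch
of a per-pCPU list of blocked vCPUs and the scan a wakeup handler would
perform on it. It is an illustration only, not the code from the patch:
struct blocked_vcpu, block_on() and wakeup_scan() are hypothetical names,
the 'pending' flag merely stands in for the descriptor's ON bit, and the
per-pCPU spinlock is left out. The point is that each wakeup interrupt
visits every entry on that pCPU's list, so its cost grows with the number
of vCPUs blocked there.

```c
#include <assert.h>
#include <stddef.h>

#define NR_PCPUS 4

/* Hypothetical, simplified stand-in for a blocked vCPU's PI state. */
struct blocked_vcpu {
    int id;
    int pending;                /* models the 'ON' bit of the PI descriptor */
    int woken;                  /* set once the handler has unblocked it */
    struct blocked_vcpu *next;
};

/* One singly linked list of blocked vCPUs per pCPU. */
static struct blocked_vcpu *blocked_list[NR_PCPUS];

/* Queue a vCPU on the list of the pCPU it blocked on. */
static void block_on(int pcpu, struct blocked_vcpu *v)
{
    assert(pcpu >= 0 && pcpu < NR_PCPUS);
    v->next = blocked_list[pcpu];
    blocked_list[pcpu] = v;
}

/*
 * Model of the wakeup handler: walk the whole list of this pCPU, unlink
 * and wake every vCPU with a pending interrupt. Returns the number of
 * entries visited, i.e. the work done with the list lock held.
 */
static unsigned int wakeup_scan(int pcpu)
{
    unsigned int visited = 0;
    struct blocked_vcpu **pp = &blocked_list[pcpu];

    while (*pp != NULL) {
        struct blocked_vcpu *v = *pp;

        visited++;
        if (v->pending) {
            *pp = v->next;      /* unlink from the list */
            v->next = NULL;
            v->woken = 1;
        } else {
            pp = &v->next;
        }
    }
    return visited;
}
```

With three vCPUs queued on pCPU 0 and only one of them pending,
wakeup_scan(0) still visits all three entries; pinning most vCPUs of a
host to one pCPU makes that number arbitrarily large.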

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-09  3:09           ` Wu, Feng
@ 2015-07-09  8:18             ` Dario Faggioli
  2015-07-09 11:19             ` George Dunlap
  1 sibling, 0 replies; 155+ messages in thread
From: Dario Faggioli @ 2015-07-09  8:18 UTC (permalink / raw)
  To: Wu, Feng
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel, jbeulich@suse.com,
	Zhang, Yang Z


On Thu, 2015-07-09 at 03:09 +0000, Wu, Feng wrote:

> > -----Original Message-----
> > From: Dario Faggioli [mailto:dario.faggioli@citrix.com]

> > In fact, looking at how, where and what for, runstates are used, that
> > really does not feel right, at least to me. What you seem to be
> > interested in is whether a vCPU blocks and/or unblocks
> 
> Not just blocks, unblocks, we need to track the state transition, and
> update posted-interrupt descriptor accordingly.
> 
And why is that so? What is it that runstates gives you, that you don't
find where I'm suggesting to look? What's so in specific need of knowing
the runstate?

> > Runstates are an
> > abstraction, built up on top of (mostly) pause_flags, like _VPF_blocked
> > (look at how runstate is updated).
> > 
> > I think you should not build on top of such abstraction, but on top of
> > pause_flags directly.
> 
> I don't think building on top of vCPU state is unsuitable for my case;
> the abstraction is there, why cannot people use it?
> 
Because, IMO, your stuff is a low level feature, and it does not feel
right to me to build a low level feature on top of a (quite) high level
abstraction, such as runstates.

The fact that your feature is arch specific is quite a clear sign of
that. Runstates aren't. Actually, very few of what is in schedule.c is.
A notable exception is the low level context switching logic... and, in
fact, that's exactly where I think your stuff should live, if possible
(and it looks possible to me; I take counterexamples, of course).

And it's not an aesthetic and/or some kind of 'layering violation' issue
(although, it certainly is one, and it makes the code harder to
understand and follow, both wrt your own feature, and in general), it
really looks like calling for problems and potential maintenance issues.

Anyway, I guess this is George's call. Let's see if/when he finds some
time to give us an opinion. :-)

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-09  3:09           ` Wu, Feng
  2015-07-09  8:18             ` Dario Faggioli
@ 2015-07-09 11:19             ` George Dunlap
  2015-07-09 11:29               ` George Dunlap
  2015-07-09 11:38               ` Wu, Feng
  1 sibling, 2 replies; 155+ messages in thread
From: George Dunlap @ 2015-07-09 11:19 UTC (permalink / raw)
  To: Wu, Feng
  Cc: Tian, Kevin, keir@xen.org, andrew.cooper3@citrix.com,
	Dario Faggioli, xen-devel, jbeulich@suse.com, Zhang, Yang Z

On Thu, Jul 9, 2015 at 4:09 AM, Wu, Feng <feng.wu@intel.com> wrote:
>> That does not necessarily mean "we need to do something" in
>> vcpu_runstate_change(). Actually, that's exactly what I'm asking: can
>> you check whether this thing that you need doing can be done somewhere
>> else than in vcpu_runstate_change() ?
>
> Why do you think vcpu_runstate_change() is not the right place to do this?

Because what the vcpu_runstate_change() function does at the moment is
*update the vcpu runstate variable*.  It doesn't actually change the
runstate -- the runstate is changed in the various bits of code that
call it; and it's not designed to be a generic place to put hooks on
the runstate changing.

I haven't done a thorough review of this yet, but at least looking
through this patch, and skimming the titles, I don't see anywhere you
handle migration -- what happens if a vcpu that's blocked / offline /
runnable migrates from one cpu to another?  Is the information
updated?

The right thing to do in this situation is either to change
vcpu_runstate_change() so that it is the central place to make all (or
most) hooks happen; or to add a set of architectural hooks (similar to
the SCHED_OP() hooks) in the various places you need them.

I'm inclined to think that the second is the better option, if for no
other reason than that it makes it clearer which states are handled.

 -George
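
For readers unfamiliar with the second option, a minimal sketch of what
"a set of architectural hooks (similar to the SCHED_OP() hooks)" could
look like follows. Everything in it is an assumption made for
illustration: struct arch_sched_hooks, ARCH_SCHED_HOOK() and the
sched_vcpu_*() wrappers are hypothetical names and do not exist in Xen.
The idea shown is only that generic scheduler code calls an optional
per-architecture callback at each place where the vCPU's state really
changes, so the set of handled events is visible in one table.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified vCPU; only the field needed for the illustration. */
struct vcpu {
    int id;
    int processor;
};

/* Hypothetical per-architecture callback table; any entry may be NULL. */
struct arch_sched_hooks {
    void (*vcpu_block)(struct vcpu *v);
    void (*vcpu_wake)(struct vcpu *v);
    void (*vcpu_migrate)(struct vcpu *v, int new_cpu);
};

/* Invoke a hook only if the architecture provides one. */
#define ARCH_SCHED_HOOK(hooks, fn, ...)          \
    do {                                         \
        if ((hooks)->fn != NULL)                 \
            (hooks)->fn(__VA_ARGS__);            \
    } while (0)

/* Generic code paths: do the common work, then give the arch a chance. */
static void sched_vcpu_block(const struct arch_sched_hooks *h, struct vcpu *v)
{
    assert(h != NULL);
    ARCH_SCHED_HOOK(h, vcpu_block, v);
}

static void sched_vcpu_wake(const struct arch_sched_hooks *h, struct vcpu *v)
{
    assert(h != NULL);
    ARCH_SCHED_HOOK(h, vcpu_wake, v);
}

static void sched_vcpu_migrate(const struct arch_sched_hooks *h,
                               struct vcpu *v, int new_cpu)
{
    assert(h != NULL);
    v->processor = new_cpu;     /* generic part of the migration */
    ARCH_SCHED_HOOK(h, vcpu_migrate, v, new_cpu);
}

/* Example arch callbacks that only count how often they are invoked. */
static int nr_block, nr_migrate;

static void pi_on_block(struct vcpu *v)
{
    (void)v;
    nr_block++;
}

static void pi_on_migrate(struct vcpu *v, int new_cpu)
{
    (void)v;
    (void)new_cpu;
    nr_migrate++;
}
```

An architecture that does not care about wake-ups simply leaves that
entry NULL, and the generic path skips the call.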

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-09 11:19             ` George Dunlap
@ 2015-07-09 11:29               ` George Dunlap
  2015-07-09 11:38               ` Wu, Feng
  1 sibling, 0 replies; 155+ messages in thread
From: George Dunlap @ 2015-07-09 11:29 UTC (permalink / raw)
  To: Wu, Feng
  Cc: Tian, Kevin, keir@xen.org, andrew.cooper3@citrix.com,
	Dario Faggioli, xen-devel, jbeulich@suse.com, Zhang, Yang Z

On Thu, Jul 9, 2015 at 12:19 PM, George Dunlap
<George.Dunlap@eu.citrix.com> wrote:
> On Thu, Jul 9, 2015 at 4:09 AM, Wu, Feng <feng.wu@intel.com> wrote:
>>> That does not necessarily mean "we need to do something" in
>>> vcpu_runstate_change(). Actually, that's exactly what I'm asking: can
>>> you check whether this thing that you need doing can be done somewhere
>>> else than in vcpu_runstate_change() ?
>>
>> Why do you think vcpu_runstate_change() is not the right place to do this?
>
> Because what the vcpu_runstate_change() function does at the moment is
> *update the vcpu runstate variable*.  It doesn't actually change the
> runstate -- the runstate is changed in the various bits of code that
> call it; and it's not designed to be a generic place to put hooks on
> the runstate changing.

At first glance vcpu_urgent_count_update() might be seen as such a
hook; but the key here is that vcpu_urgent_count_update() is mainly
updating the is_urgent flag *of the vcpu* based on the various
scheduler-related flags.  In that sense it's doing exactly what
vcpu_runstate_change() is doing.

 -George

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-09 11:19             ` George Dunlap
  2015-07-09 11:29               ` George Dunlap
@ 2015-07-09 11:38               ` Wu, Feng
  2015-07-09 12:42                 ` Dario Faggioli
  2015-07-09 12:53                 ` George Dunlap
  1 sibling, 2 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-09 11:38 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tian, Kevin, keir@xen.org, andrew.cooper3@citrix.com,
	Dario Faggioli, xen-devel, jbeulich@suse.com, Zhang, Yang Z,
	Wu, Feng



> -----Original Message-----
> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of George
> Dunlap
> Sent: Thursday, July 09, 2015 7:20 PM
> To: Wu, Feng
> Cc: Dario Faggioli; Tian, Kevin; keir@xen.org; andrew.cooper3@citrix.com;
> xen-devel; jbeulich@suse.com; Zhang, Yang Z
> Subject: Re: [Xen-devel] Fwd: [v3 14/15] Update Posted-Interrupts Descriptor
> during vCPU scheduling
> 
> On Thu, Jul 9, 2015 at 4:09 AM, Wu, Feng <feng.wu@intel.com> wrote:
> >> That does not necessarily mean "we need to do something" in
> >> vcpu_runstate_change(). Actually, that's exactly what I'm asking: can
> >> you check whether this thing that you need doing can be done somewhere
> >> else than in vcpu_runstate_change() ?
> >
> > Why do you think vcpu_runstate_change() is not the right place to do this?
> 
> Because what the vcpu_runstate_change() function does at the moment is
> *update the vcpu runstate variable*.  It doesn't actually change the
> runstate -- the runstate is changed in the various bits of code that
> call it; and it's not designed to be a generic place to put hooks on
> the runstate changing.
> 
> I haven't done a thorough review of this yet, but at least looking
> through this patch, and skimming the titles, I don't see anywhere you
> handle migration -- what happens if a vcpu that's blocked / offline /
> runnable migrates from one cpu to another?  Is the information
> updated?

Thanks for your review!

The migration is handled in arch_pi_desc_update() which is called
by vcpu_runstate_change().

> 
> The right thing to do in this situation is either to change
> vcpu_runstate_change() so that it is the central place to make all (or
> most) hooks happen;

Yes, this is my implementation. I think vcpu_runstate_change()
is the _central_ place to do things when vCPU state is changed. This
makes things clear and simple. I call an arch hook to update the
posted-interrupt descriptor in this function.


> or to add a set of architectural hooks (similar to
> the SCHED_OP() hooks) in the various places you need them.

I don't have a picture of this method, but from your comments, it seems
we need to put the logic in many different places, and must be very
careful not to miss any of them. I think the above method is clearer
and more straightforward, since we have a central place to handle all
the cases. Anyway, if you prefer this one, it would be highly
appreciated if you could give a more detailed solution! Thank you!

Thanks,
Feng

> 
> I'm inclined to think that the second is the better option, if for no
> other reason than that it makes it clearer which states are handled.
> 
>  -George

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-09 11:38               ` Wu, Feng
@ 2015-07-09 12:42                 ` Dario Faggioli
  2015-07-10  0:07                   ` Wu, Feng
  2015-07-09 12:53                 ` George Dunlap
  1 sibling, 1 reply; 155+ messages in thread
From: Dario Faggioli @ 2015-07-09 12:42 UTC (permalink / raw)
  To: Wu, Feng
  Cc: Tian, Kevin, keir@xen.org, George Dunlap,
	andrew.cooper3@citrix.com, xen-devel, jbeulich@suse.com,
	Zhang, Yang Z


On Thu, 2015-07-09 at 11:38 +0000, Wu, Feng wrote:
> 
> > -----Original Message-----
> > From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of George
> > Dunlap

> > > Why do you think vcpu_runstate_change() is not the right place to do this?
> > 
> > Because what the vcpu_runstate_change() function does at the moment is
> > *update the vcpu runstate variable*.  It doesn't actually change the
> > runstate -- the runstate is changed in the various bits of code that
> > call it; and it's not designed to be a generic place to put hooks on
> > the runstate changing.
> > 
> > I haven't done a thorough review of this yet, but at least looking
> > through this patch, and skimming the titles, I don't see anywhere you
> > handle migration -- what happens if a vcpu that's blocked / offline /
> > runnable migrates from one cpu to another?  Is the information
> > updated?
> 
> Thanks for your review!
> 
> The migration is handled in arch_pi_desc_update() which is called
> by vcpu_runstate_change().
> 
> > 
> > The right thing to do in this situation is either to change
> > vcpu_runstate_change() so that it is the central place to make all (or
> > most) hooks happen;
> 
> Yes, this is my implementation. I think vcpu_runstate_change()
> is the _central_ place to do things when vCPU state is changed. This
> makes things clear and simple. I call an arch hook to update the
> posted-interrupt descriptor in this function.
> 
Perhaps, one way to double check this line of reasoning (the fact that
you think this needs to lie on top of runstates, and more specifically
in that function), would be to come up with some kind of "list of
requirements", not taking runstates into account.

I know there is a design document for this series (and I also know I
could have commented on it earlier, sorry for that), but that itself
mentions runstates, which does not help.

What I mean is, can you describe when each specific operation
needs to happen? Something like "descriptor needs to be updated like
this upon migration", "notification should be disabled when vcpu starts
running", "notification method should be changed that other way when
vcpu is preempted", etc.

This would help a lot, IMO, in figuring out the actual functional
requirements that need to be satisfied for things to work well. Once
that is done, we can go check in the code where is the best place to put
each call, hook, or whatever.


Note that I've already tried to infer the above, by looking at the
patches, and that is making me think that it would be possible to
implement things in another way. But maybe I'm missing something. So it
would be really valuable if you, with all your knowledge of how PI
should work, could do it.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-09 11:38               ` Wu, Feng
  2015-07-09 12:42                 ` Dario Faggioli
@ 2015-07-09 12:53                 ` George Dunlap
  2015-07-09 13:44                   ` Jan Beulich
  2015-07-10  0:15                   ` Wu, Feng
  1 sibling, 2 replies; 155+ messages in thread
From: George Dunlap @ 2015-07-09 12:53 UTC (permalink / raw)
  To: Wu, Feng
  Cc: Tian, Kevin, keir@xen.org, andrew.cooper3@citrix.com,
	Dario Faggioli, xen-devel, jbeulich@suse.com, Zhang, Yang Z

On 07/09/2015 12:38 PM, Wu, Feng wrote:
> 
> 
>> -----Original Message-----
>> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of George
>> Dunlap
>> Sent: Thursday, July 09, 2015 7:20 PM
>> To: Wu, Feng
>> Cc: Dario Faggioli; Tian, Kevin; keir@xen.org; andrew.cooper3@citrix.com;
>> xen-devel; jbeulich@suse.com; Zhang, Yang Z
>> Subject: Re: [Xen-devel] Fwd: [v3 14/15] Update Posted-Interrupts Descriptor
>> during vCPU scheduling
>>
>> On Thu, Jul 9, 2015 at 4:09 AM, Wu, Feng <feng.wu@intel.com> wrote:
>>>> That does not necessarily mean "we need to do something" in
>>>> vcpu_runstate_change(). Actually, that's exactly what I'm asking: can
>>>> you check whether this thing that you need doing can be done somewhere
>>>> else than in vcpu_runstate_change() ?
>>>
>>> Why do you think vcpu_runstate_change() is not the right place to do this?
>>
>> Because what the vcpu_runstate_change() function does at the moment is
>> *update the vcpu runstate variable*.  It doesn't actually change the
>> runstate -- the runstate is changed in the various bits of code that
>> call it; and it's not designed to be a generic place to put hooks on
>> the runstate changing.
>>
>> I haven't done a thorough review of this yet, but at least looking
>> through this patch, and skimming the titles, I don't see anywhere you
>> handle migration -- what happens if a vcpu that's blocked / offline /
>> runnable migrates from one cpu to another?  Is the information
>> updated?
> 
> Thanks for your review!

And I'd like to say -- sorry that I didn't notice this issue sooner; I
know you've had your series posted for quite a while, but I didn't
realize until last week that it actually involved the scheduler.  It's
really my fault for not paying closer attention -- you did CC me in v2
back in June.

> The migration is handled in arch_pi_desc_update() which is called
> by vcpu_runstate_change().

Well as far as I can tell from looking at the code,
vcpu_runstate_change() will not be called when migrating a vcpu which is
already blocked.

Consider the following scenario:
- v1 blocks on pcpu 0.
 - vcpu_runstate_change() will do everything necessary for v1 on p0.
- The scheduler does load balancing and moves v1 to p1, calling
vcpu_migrate().  Because the vcpu is still blocked,
vcpu_runstate_change() is not called.
- A device interrupt is generated.

What happens to the interrupt?  Does everything still work properly, or
will the device wake-up interrupt go to the wrong pcpu (p0 rather than p1)?

>> or to add a set of architectural hooks (similar to
>> the SCHED_OP() hooks) in the various places you need them.
> 
> I don't have a picture of this method, but from your comments, it seems
> we need to put the logic in many different places, and must be very
> careful not to miss any of them. I think the above method is clearer
> and more straightforward, since we have a central place to handle all
> the cases. Anyway, if you prefer this one, it would be highly
> appreciated if you could give a more detailed solution! Thank you!

Well you can check to make sure you've caught at least all the places
you had before by searching for vcpu_runstate_change(). :-)

Using the callback method also can help prompt you to think about other
times you may need to do something.  For instance, you might still
consider searching for SCHED_OP() everywhere in schedule.c and seeing if
that's a place you need to do something (similar to the migration thing
above).

Anyway, the most detailed thing I can say at this time is to look at
SCHED_OP() and see if doing  something like that, but for architectural
callbacks, makes sense.

I'll come back and take a closer look a bit later.

 -George

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-09 12:53                 ` George Dunlap
@ 2015-07-09 13:44                   ` Jan Beulich
  2015-07-09 14:18                     ` Dario Faggioli
  2015-07-10  0:15                   ` Wu, Feng
  1 sibling, 1 reply; 155+ messages in thread
From: Jan Beulich @ 2015-07-09 13:44 UTC (permalink / raw)
  To: George Dunlap
  Cc: Kevin Tian, Feng Wu, andrew.cooper3@citrix.com, Dario Faggioli,
	xen-devel, Yang Z Zhang, keir@xen.org

>>> On 09.07.15 at 14:53, <george.dunlap@eu.citrix.com> wrote:
>> The migration is handled in arch_pi_desc_update() which is called
>> by vcpu_runstate_change().
> 
> Well as far as I can tell from looking at the code,
> vcpu_runstate_change() will not be called when migrating a vcpu which is
> already blocked.
> 
> Consider the following scenario:
> - v1 blocks on pcpu 0.
>  - vcpu_runstate_change() will do everything necessary for v1 on p0.
> - The scheduler does load balancing and moves v1 to p1, calling
> vcpu_migrate().  Because the vcpu is still blocked,
> vcpu_runstate_change() is not called.
> - A device interrupt is generated.
> 
> What happens to the interrupt?  Does everything still work properly, or
> will the device wake-up interrupt go to the wrong pcpu (p0 rather than p1)?

I think much of this was discussed before, since I also disliked the
hooking into vcpu_runstate_change(). What I remember having
been told is that it really only matters which pCPU's list a vCPU is
on, not what v->processor says. And I sort of accepted that
moving the vCPU between lists when v->processor changes
without the vCPU actually running on that pCPU is pretty
pointless (and perhaps wasteful, but in any case cluttering more
code than necessary).

Jan
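
A small model may help picture the behaviour described above. It is a
sketch under stated assumptions, not the logic of the patch: the struct
and function names are hypothetical, 'pi_ndst' stands in for the
descriptor's notification destination, and the claim that both the list
and the destination are fixed at block time is taken from the explanation
recalled in this message rather than verified against the code. Under
those assumptions, moving a blocked vCPU only changes v->processor, and
the wakeup notification still reaches the pCPU whose list holds the vCPU.

```c
#include <assert.h>

/* Hypothetical, simplified vCPU; field names are illustrative only. */
struct vcpu {
    int processor;      /* pCPU the scheduler currently associates with it */
    int pi_ndst;        /* models the descriptor's notification destination */
    int blocked_list;   /* pCPU whose blocked list holds it, -1 if none */
};

/* The vCPU blocks: queue it on, and direct wakeups to, its current pCPU. */
static void vcpu_block_on_current(struct vcpu *v)
{
    v->blocked_list = v->processor;
    v->pi_ndst = v->processor;
}

/* Load balancing moves a blocked vCPU: only v->processor changes. */
static void migrate_blocked(struct vcpu *v, int new_cpu)
{
    assert(v->blocked_list >= 0);   /* must already be blocked */
    v->processor = new_cpu;
}

/* pCPU that receives the wakeup notification for this vCPU. */
static int wakeup_target(const struct vcpu *v)
{
    return v->pi_ndst;
}
```

In George's scenario (v1 blocks on p0, then is moved to p1), this model
delivers the wakeup to p0, which is also where v1 is still queued, so the
handler on p0 finds it even though v->processor now says p1.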

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-09 13:44                   ` Jan Beulich
@ 2015-07-09 14:18                     ` Dario Faggioli
  2015-07-09 14:27                       ` George Dunlap
  0 siblings, 1 reply; 155+ messages in thread
From: Dario Faggioli @ 2015-07-09 14:18 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Feng Wu, George Dunlap, andrew.cooper3@citrix.com,
	xen-devel, Yang Z Zhang, keir@xen.org


On Thu, 2015-07-09 at 14:44 +0100, Jan Beulich wrote:
> >>> On 09.07.15 at 14:53, <george.dunlap@eu.citrix.com> wrote:

> > Consider the following scenario:
> > - v1 blocks on pcpu 0.
> >  - vcpu_runstate_change() will do everything necessary for v1 on p0.
> > - The scheduler does load balancing and moves v1 to p1, calling
> > vcpu_migrate().  Because the vcpu is still blocked,
> > vcpu_runstate_change() is not called.
> > - A device interrupt is generated.
> > 
> > What happens to the interrupt?  Does everything still work properly, or
> > will the device wake-up interrupt go to the wrong pcpu (p0 rather than p1)?
> 
> I think much of this was discussed before, since I also disliked the
> hooking into vcpu_runstate_change(). What I remember having
> been told is that it really only matters which pCPU's list a vCPU is
> on, not what v->processor says. 
>
Right.

But, as far as I could understand from the patches I've seen, a vcpu
ends up in a list when it blocks, and when it blocks there will be a
context switch, and hence we can deal with the queueing during the
context switch itself (which is, in part, an arch specific operation
already).

What am I missing?

Maybe (looking at vmx_pi_desc_update() in patch 14), something that is
right now dealt with by means of RUNSTATE_offline? What (if yes)? Can
(if yes) it be explained in some "non runstate"-based way, so we can
judge whether that could also be achieved differently than by actually
hooking in vcpu_runstate_change()?

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-09 14:18                     ` Dario Faggioli
@ 2015-07-09 14:27                       ` George Dunlap
  2015-07-09 14:47                         ` Dario Faggioli
  2015-07-10  5:59                         ` Wu, Feng
  0 siblings, 2 replies; 155+ messages in thread
From: George Dunlap @ 2015-07-09 14:27 UTC (permalink / raw)
  To: Dario Faggioli, Jan Beulich
  Cc: Kevin Tian, keir@xen.org, andrew.cooper3@citrix.com, xen-devel,
	Yang Z Zhang, Feng Wu

On 07/09/2015 03:18 PM, Dario Faggioli wrote:
> On Thu, 2015-07-09 at 14:44 +0100, Jan Beulich wrote:
>>>>> On 09.07.15 at 14:53, <george.dunlap@eu.citrix.com> wrote:
> 
>>> Consider the following scenario:
>>> - v1 blocks on pcpu 0.
>>>  - vcpu_runstate_change() will do everything necessary for v1 on p0.
>>> - The scheduler does load balancing and moves v1 to p1, calling
>>> vcpu_migrate().  Because the vcpu is still blocked,
>>> vcpu_runstate_change() is not called.
>>> - A device interrupt is generated.
>>>
>>> What happens to the interrupt?  Does everything still work properly, or
>>> will the device wake-up interrupt go to the wrong pcpu (p0 rather than p1)?
>>
>> I think much of this was discussed before, since I also disliked the
>> hooking into vcpu_runstate_change(). What I remember having
>> been told is that it really only matters which pCPU's list a vCPU is
>> on, not what v->processor says. 
>>
> Right.
> 
> But, as far as I could understand from the patches I've seen, a vcpu
> ends up in a list when it blocks, and when it blocks there will be a
> context switch, and hence we can deal with the queueing during the
> context switch itself (which is, in part, an arch specific operation
> already).
> 
> What am I missing?

I think what you're missing is that Jan is answering my question about
migrating a blocked vcpu, not arguing that vcpu_runstate_change() is the
right way to go.  At least that's how I understood him. :-)

But regarding context_switch: I think the reason we need more hooks than
that is that context_switch only changes into and out of running state.
 There are also changes that need to happen when you change from blocked
to offline, offline to blocked, blocked to runnable, &c; these don't go
through context_switch.  That's why I was suggesting some architectural
equivalents to the SCHED_OP() callbacks to be added to vcpu_wake &c.

vcpu_runstate_change() is at the moment a nice quiet cul-de-sac that
just does a little bit of accounting; I'd rather not have it suddenly
become a major thoroughfare for runstate change hooks, if we can avoid
it. :-)

 -George
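
The point about context_switch can be illustrated with a short sketch.
The enum below mirrors the four runstates (running, runnable, blocked,
offline) with local, hypothetical names, and the helper simply encodes
the statement above that a context switch is only involved when a vCPU
enters or leaves the running state. It enumerates all ordered pairs of
distinct states without claiming that each of them occurs in practice.

```c
#include <stdbool.h>

/* Local stand-ins for the four runstates; names are illustrative. */
enum runstate { RS_RUNNING, RS_RUNNABLE, RS_BLOCKED, RS_OFFLINE, RS_COUNT };

/* A transition passes through context_switch only if 'running' is involved. */
static bool via_context_switch(enum runstate from, enum runstate to)
{
    return from != to && (from == RS_RUNNING || to == RS_RUNNING);
}

/* Count the transitions a context-switch-only hook would never see. */
static unsigned int transitions_needing_other_hooks(void)
{
    unsigned int n = 0;

    for (int from = 0; from < RS_COUNT; from++)
        for (int to = 0; to < RS_COUNT; to++)
            if (from != to &&
                !via_context_switch((enum runstate)from, (enum runstate)to))
                n++;
    return n;
}
```

Of the twelve possible transitions, six (for example blocked to runnable,
or blocked to offline) never touch the running state, which is why hooks
in places such as vcpu_wake would be needed in addition.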

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-09 14:27                       ` George Dunlap
@ 2015-07-09 14:47                         ` Dario Faggioli
  2015-07-10  5:59                         ` Wu, Feng
  1 sibling, 0 replies; 155+ messages in thread
From: Dario Faggioli @ 2015-07-09 14:47 UTC (permalink / raw)
  To: George Dunlap
  Cc: Kevin Tian, Feng Wu, andrew.cooper3@citrix.com, xen-devel,
	Jan Beulich, Yang Z Zhang, keir@xen.org


On Thu, 2015-07-09 at 15:27 +0100, George Dunlap wrote:
> On 07/09/2015 03:18 PM, Dario Faggioli wrote:
> > On Thu, 2015-07-09 at 14:44 +0100, Jan Beulich wrote:
> >>>>> On 09.07.15 at 14:53, <george.dunlap@eu.citrix.com> wrote:
> > 
> >>> Consider the following scenario:
> >>> - v1 blocks on pcpu 0.
> >>>  - vcpu_runstate_change() will do everything necessary for v1 on p0.
> >>> - The scheduler does load balancing and moves v1 to p1, calling
> >>> vcpu_migrate().  Because the vcpu is still blocked,
> >>> vcpu_runstate_change() is not called.
> >>> - A device interrupt is generated.
> >>>
> >>> What happens to the interrupt?  Does everything still work properly, or
> >>> will the device wake-up interrupt go to the wrong pcpu (p0 rather than p1)?
> >>
> >> I think much of this was discussed before, since I also disliked the
> >> hooking into vcpu_runstate_change(). What I remember having
> >> been told is that it really only matters which pCPU's list a vCPU is
> >> on, not what v->processor says. 
> >>
> > Right.
> > 
> > But, as far as I could understand from the patches I've seen, a vcpu
> > ends up in a list when it blocks, and when it blocks there will be a
> > > context switch, and hence we can deal with the queueing during the
> > context switch itself (which is, in part, an arch specific operation
> > already).
> > 
> > What am I missing?
> 
> I think what you're missing is that Jan is answering my question about
> migrating a blocked vcpu, not arguing that vcpu_runstate_change() is the
> right way to go.  At least that's how I understood him. :-)
> 
Eheh, no, I got that... I just wanted to take one more chance to try to
get an answer (from Feng, mainly, but from anyone who knows or has an
idea, in reality! :-D) on why another approach is not possible.

Sorry for abusing the reply for my own mean purposes. :-D

> But regarding context_switch: I think the reason we need more hooks than
> that is that context_switch only changes into and out of running state.
>
Sure.

> There are also changes that need to happen when you change from blocked
> to offline, offline to blocked, blocked to runnable, &c; these don't go
> through context_switch.  
>
Right, those ones. And about them, I could say for the third time that
having a non-runstate biased description of what we need would help to
understand where to do things, whether there are suitable spots already
or we need to add new ones, etc.... But I guess I don't want to be so
repetitive! :-PP

> That's why I was suggesting some architectural
> equivalents to the SCHED_OP() callbacks to be added to vcpu_wake &c.
> 
Yep. If (or at least for those cases where) there really is no other
place already, adding these would be TheRightThing for me as well.

> vcpu_runstate_change() is at the moment a nice quiet cul-de-sac that
> just does a little bit of accounting; 
>
Indeed!

> I'd rather not have it suddenly
> become a major thoroughfare for runstate change hooks, if we can avoid
> it. :-)
> 
Yep, that was my understanding of it, and my idea too.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-09 12:42                 ` Dario Faggioli
@ 2015-07-10  0:07                   ` Wu, Feng
  2015-07-10 12:40                     ` Dario Faggioli
  0 siblings, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-07-10  0:07 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tian, Kevin, keir@xen.org, George Dunlap,
	andrew.cooper3@citrix.com, xen-devel, jbeulich@suse.com,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Thursday, July 09, 2015 8:42 PM
> To: Wu, Feng
> Cc: George Dunlap; Tian, Kevin; keir@xen.org; andrew.cooper3@citrix.com;
> xen-devel; jbeulich@suse.com; Zhang, Yang Z
> Subject: Re: [Xen-devel] Fwd: [v3 14/15] Update Posted-Interrupts Descriptor
> during vCPU scheduling
> 
> On Thu, 2015-07-09 at 11:38 +0000, Wu, Feng wrote:
> >
> > > -----Original Message-----
> > > From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of
> George
> > > Dunlap
> 
> > > > Why do you think vcpu_runstate_change() is not the right place to do
> this?
> > >
> > > Because what the vcpu_runstate_change() function does at the moment is
> > > *update the vcpu runstate variable*.  It doesn't actually change the
> > > runstate -- the runstate is changed in the various bits of code that
> > > call it; and it's not designed to be a generic place to put hooks on
> > > the runstate changing.
> > >
> > > I haven't done a thorough review of this yet, but at least looking
> > > through this patch, and skimming the titles, I don't see anywhere you
> > > handle migration -- what happens if a vcpu that's blocked / offline /
> > > runnable migrates from one cpu to another?  Is the information
> > > updated?
> >
> > Thanks for your review!
> >
> > The migration is handled in arch_pi_desc_update() which is called
> > by vcpu_runstate_change().
> >
> > >
> > > The right thing to do in this situation is either to change
> > > vcpu_runstate_change() so that it is the central place to make all (or
> > > most) hooks happen;
> >
> > Yes, this is my implementation. I think vcpu_runstate_change()
> > is the _central_ place to do things when vCPU state is changed. This
> > makes things clear and simple. I call an arch hook to update
> > posted-interrupt descriptor in this function.
> >
> Perhaps, one way to double check this line of reasoning (the fact that
> you think this needs to lay on top of runstates, and more specifically
> in that function), would be to come up with some kind of "list of
> requirements", not taking runstates into account.
> 
> I know there is a design document for this series (and I also know I
> could have commented on it earlier, sorry for that), but that itself
> mentions runstates, which does not help.
> 
> What I mean is, can you describe when each specific operation
> needs to happen? Something like "descriptor needs to be updated like
> this upon migration", "notification should be disabled when vcpu starts
> running", "notification method should be changed that other way when
> vcpu is preempted", etc.

I cannot see the difference; I think the requirements are clearly listed in
the design doc and in the comments of this patch.

> 
> This would help a lot, IMO, figuring out the actual functional
> requirements that needs to be satisfied for things to work well. Once
> that is done, we can go check in the code where is the best place to put
> each call, hook, or whatever.
> 
> 
> Note that I've already tried to infer the above, by looking at the
> patches, and that is making me think that it would be possible to
> implement things in another way. But maybe I'm missing something. So it
> would be really valuable if you, with all your knowledge of how PI
> should work, could do it.

I have been describing how PI works, what the purpose of the two vectors is,
and how special they are, from the beginning.

Thanks,
Feng


> 
> Thanks and Regards,
> Dario
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-09 12:53                 ` George Dunlap
  2015-07-09 13:44                   ` Jan Beulich
@ 2015-07-10  0:15                   ` Wu, Feng
  1 sibling, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-10  0:15 UTC (permalink / raw)
  To: George Dunlap
  Cc: Tian, Kevin, keir@xen.org, andrew.cooper3@citrix.com,
	Dario Faggioli, xen-devel, jbeulich@suse.com, Zhang, Yang Z,
	Wu, Feng



> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
> Sent: Thursday, July 09, 2015 8:53 PM
> To: Wu, Feng
> Cc: Dario Faggioli; Tian, Kevin; keir@xen.org; andrew.cooper3@citrix.com;
> xen-devel; jbeulich@suse.com; Zhang, Yang Z
> Subject: Re: [Xen-devel] Fwd: [v3 14/15] Update Posted-Interrupts Descriptor
> during vCPU scheduling
> 
> On 07/09/2015 12:38 PM, Wu, Feng wrote:
> >
> >
> >> -----Original Message-----
> >> From: dunlapg@gmail.com [mailto:dunlapg@gmail.com] On Behalf Of
> George
> >> Dunlap
> >> Sent: Thursday, July 09, 2015 7:20 PM
> >> To: Wu, Feng
> >> Cc: Dario Faggioli; Tian, Kevin; keir@xen.org; andrew.cooper3@citrix.com;
> >> xen-devel; jbeulich@suse.com; Zhang, Yang Z
> >> Subject: Re: [Xen-devel] Fwd: [v3 14/15] Update Posted-Interrupts
> Descriptor
> >> during vCPU scheduling
> >>
> >> On Thu, Jul 9, 2015 at 4:09 AM, Wu, Feng <feng.wu@intel.com> wrote:
> >>>> That does not necessarily means "we need to do something" in
> >>>> vcpu_runstate_change(). Actually, that's exactly what I'm asking: can
> >>>> you check whether this thing that you need doing can be done somewhere
> >>>> else than in vcpu_runstate_change() ?
> >>>
> >>> Why do you think vcpu_runstate_change() is not the right place to do
> this?
> >>
> >> Because what the vcpu_runstate_change() function does at the moment is
> >> *update the vcpu runstate variable*.  It doesn't actually change the
> >> runstate -- the runstate is changed in the various bits of code that
> >> call it; and it's not designed to be a generic place to put hooks on
> >> the runstate changing.
> >>
> >> I haven't done a thorough review of this yet, but at least looking
> >> through this patch, and skimming the titles, I don't see anywhere you
> >> handle migration -- what happens if a vcpu that's blocked / offline /
> >> runnable migrates from one cpu to another?  Is the information
> >> updated?
> >
> > Thanks for your review!
> 
> And I'd like to say -- sorry that I didn't notice this issue sooner; I
> know you've had your series posted for quite a while, but I didn't
> realize until last week that it actually involved the scheduler.  It's
> really my fault for not paying closer attention -- you did CC me in v2
> back in June.
> 
> > The migration is handled in arch_pi_desc_update() which is called
> > by vcpu_runstate_change().
> 
> Well as far as I can tell from looking at the code,
> vcpu_runstate_change() will not be called when migrating a vcpu which is
> already blocked.
> 
> Consider the following scenario:
> - v1 blocks on pcpu 0.
>  - vcpu_runstate_change() will do everything necessary for v1 on p0.
> - The scheduler does load balancing and moves v1 to p1, calling
> vcpu_migrate().  Because the vcpu is still blocked,
> vcpu_runstate_change() is not called.
> - A device interrupt is generated.
> 
> What happens to the interrupt?  Does everything still work properly, or
> will the device wake-up interrupt go to the wrong pcpu (p0 rather than p1)?

I think it works correctly. Before blocking, we save v->processor and put
the vCPU on that pCPU's per-cpu list. Even if the vCPU is later migrated to
another pCPU, the wakeup notification event still goes to the original one
(p0 in this case). This is what I want: in p0's list we can find and
unblock the blocked vCPU. That is the point.
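
To illustrate the scheme described above, here is a minimal, self-contained
sketch. It is not code from the patch series: every name in it (pi_vcpu,
pi_block, pi_migrate, pi_wakeup, the "pending" flag standing in for the
descriptor's outstanding-notification state) is hypothetical, and locking
is omitted. It only shows why the wakeup handler on the original pCPU can
still find a vCPU whose v->processor has changed since it blocked.

```c
#include <assert.h>
#include <stddef.h>

#define PI_NR_PCPUS 4            /* hypothetical pCPU count, illustration only */

struct pi_vcpu {
    int id;
    unsigned int processor;      /* pCPU the scheduler currently assigns */
    unsigned int pi_cpu;         /* pCPU whose blocked list holds this vCPU */
    int blocked;
    int pending;                 /* assumed stand-in for "notification outstanding" */
    struct pi_vcpu *next;
};

/* One singly linked list of blocked vCPUs per pCPU. */
static struct pi_vcpu *pi_blocked_list[PI_NR_PCPUS];

/* On blocking: record the current pCPU and queue the vCPU on its list. */
static void pi_block(struct pi_vcpu *v)
{
    v->blocked = 1;
    v->pi_cpu = v->processor;
    v->next = pi_blocked_list[v->pi_cpu];
    pi_blocked_list[v->pi_cpu] = v;
}

/* Migrating a blocked vCPU changes v->processor only; the list is untouched. */
static void pi_migrate(struct pi_vcpu *v, unsigned int new_cpu)
{
    v->processor = new_cpu;
}

/*
 * Wakeup handler running on 'cpu': unlink and unblock every queued vCPU
 * that has a pending notification. Returns the number of vCPUs woken.
 */
static int pi_wakeup(unsigned int cpu)
{
    struct pi_vcpu **pp = &pi_blocked_list[cpu];
    int woken = 0;

    while (*pp != NULL) {
        struct pi_vcpu *v = *pp;

        if (v->pending) {
            *pp = v->next;       /* unlink from this pCPU's list */
            v->next = NULL;
            v->blocked = 0;
            woken++;
        } else {
            pp = &v->next;
        }
    }
    return woken;
}
```

In this model, a vCPU that blocks on p0 and is then moved to p1 is still
woken by the handler on p0 and not by the one on p1. Note that the scan in
pi_wakeup() is linear in the list length, which is the latency concern
raised elsewhere in this thread.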

Thanks,
Feng

> 
> >  or to add a set of architectural hooks (similar to
> >> the SCHED_OP() hooks) in the various places you need them.
> >
> > I don't have a picture of this method, but from your comments, seems
> > we need to put the logic to many different places, and must be very
> > careful so as to not miss some places. I think the above method
> > is more clear and straightforward, since we have a central place to
> > handle all the cases. Anyway, if you prefer to this one, it would be
> > highly appreciated if you can give a more detailed solution! Thank you!
> 
> Well you can check to make sure you've caught at least all the places
> you had before by searching for vcpu_runstate_change(). :-)
> 
> Using the callback method also can help prompt you to think about other
> times you may need to do something.  For instance, you might still
> consider searching for SCHED_OP() everywhere in schedule.c and seeing if
> that's a place you need to do something (similar to the migration thing
> above).
> 
> Anyway, the most detailed thing I can say at this time is to look at
> SCHED_OP() and see if doing  something like that, but for architectural
> callbacks, makes sense.
> 
> I'll come back and take a closer look a bit later.
> 
>  -George
> 

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-09 14:27                       ` George Dunlap
  2015-07-09 14:47                         ` Dario Faggioli
@ 2015-07-10  5:59                         ` Wu, Feng
  2015-07-10  6:22                           ` Jan Beulich
  1 sibling, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-07-10  5:59 UTC (permalink / raw)
  To: George Dunlap, Dario Faggioli, Jan Beulich
  Cc: Tian, Kevin, keir@xen.org, andrew.cooper3@citrix.com, xen-devel,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
> Sent: Thursday, July 09, 2015 10:28 PM
> To: Dario Faggioli; Jan Beulich
> Cc: Tian, Kevin; Wu, Feng; andrew.cooper3@citrix.com; xen-devel; Zhang, Yang
> Z; keir@xen.org
> Subject: Re: [Xen-devel] Fwd: [v3 14/15] Update Posted-Interrupts Descriptor
> during vCPU scheduling
> 
> On 07/09/2015 03:18 PM, Dario Faggioli wrote:
> > On Thu, 2015-07-09 at 14:44 +0100, Jan Beulich wrote:
> >>>>> On 09.07.15 at 14:53, <george.dunlap@eu.citrix.com> wrote:
> >
> >>> Consider the following scenario:
> >>> - v1 blocks on pcpu 0.
> >>>  - vcpu_runstate_change() will do everything necessary for v1 on p0.
> >>> - The scheduler does load balancing and moves v1 to p1, calling
> >>> vcpu_migrate().  Because the vcpu is still blocked,
> >>> vcpu_runstate_change() is not called.
> >>> - A device interrupt is generated.
> >>>
> >>> What happens to the interrupt?  Does everything still work properly, or
> >>> will the device wake-up interrupt go to the wrong pcpu (p0 rather than p1)?
> >>
> >> I think much of this was discussed before, since I also disliked the
> >> hooking into vcpu_runstate_change(). What I remember having
> >> been told is that it really only matters which pCPU's list a vCPU is
> >> on, not what v->processor says.
> >>
> > Right.
> >
> > But, as far as I could understand from the patches I've seen, a vcpu
> > ends up in a list when it blocks, and when it blocks there will be a
> > > context switch, and hence we can deal with the queueing during the
> > context switch itself (which is, in part, an arch specific operation
> > already).
> >
> > What am I missing?
> 
> I think what you're missing is that Jan is answering my question about
> migrating a blocked vcpu, not arguing that vcpu_runstate_change() is the
> right way to go.  At least that's how I understood him. :-)
> 
> But regarding context_switch: I think the reason we need more hooks than
> that is that context_switch only changes into and out of running state.
>  There are also changes that need to happen when you change from blocked
> to offline, offline to blocked, blocked to runnable, &c; these don't go
> through context_switch.  That's why I was suggesting some architectural
> equivalents to the SCHED_OP() callbacks to be added to vcpu_wake &c.
> 
> vcpu_runstate_change() is at the moment a nice quiet cul-de-sac that
> just does a little bit of accounting; I'd rather not have it suddenly
> become a major thoroughfare for runstate change hooks, if we can avoid
> it. :-)

So in my understanding, vcpu_runstate_change() is a central place to
do this, which is good. However, this function was originally designed
only for accounting, so it is a little intrusive to make it so important
by adding the hooks to it.

If you agree with doing all this in a central place, maybe we can create
an arch hook for 'struct scheduler' to do this and call it in all the places
vcpu_runstate_change() gets called. What is your opinion about this?

Thanks,
Feng

> 
>  -George

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-09  7:25           ` Jan Beulich
@ 2015-07-10  6:21             ` Wu, Feng
  2015-07-10  6:32               ` Jan Beulich
  0 siblings, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-07-10  6:21 UTC (permalink / raw)
  To: Jan Beulich, Tian, Kevin
  Cc: keir@xen.org, george.dunlap@eu.citrix.com, Andrew Cooper,
	xen-devel@lists.xen.org, Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, July 09, 2015 3:26 PM
> To: Wu, Feng; Tian, Kevin
> Cc: Andrew Cooper; george.dunlap@eu.citrix.com; Zhang, Yang Z;
> xen-devel@lists.xen.org; keir@xen.org
> Subject: RE: [Xen-devel] [v3 12/15] vmx: posted-interrupt handling when vCPU
> is blocked
> 
> >>> On 09.07.15 at 00:49, <kevin.tian@intel.com> wrote:
> >>  From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> >> Sent: Wednesday, July 08, 2015 9:09 PM
> >> On 08/07/2015 13:46, Jan Beulich wrote:
> >> >>>> On 08.07.15 at 13:00, <kevin.tian@intel.com> wrote:
> >> >>> @@ -1848,6 +1869,33 @@ static struct hvm_function_table __initdata
> >> >>> vmx_function_table = {
> >> >>>      .enable_msr_exit_interception =
> vmx_enable_msr_exit_interception,
> >> >>>  };
> >> >>>
> >> >>> +/*
> >> >>> + * Handle VT-d posted-interrupt when VCPU is blocked.
> >> >>> + */
> >> >>> +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> >> >>> +{
> >> >>> +    struct arch_vmx_struct *vmx;
> >> >>> +    unsigned int cpu = smp_processor_id();
> >> >>> +
> >> >>> +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
> >> >>> +
> >> >>> +    /*
> >> >>> +     * FIXME: The length of the list depends on how many
> >> >>> +     * vCPUs are currently blocked on this specific pCPU.
> >> >>> +     * This may hurt the interrupt latency if the list
> >> >>> +     * grows to too many entries.
> >> >>> +     */
> >> >> let's go with this linked list first until a real issue is identified.
> >> > This is exactly the way of thinking I dislike when it comes to code
> >> > that isn't intended to be experimental only: We shouldn't wait
> >> > for problems to surface when we already can see them. I.e. if
> >> > there are no plans to deal with this, I'd ask for the feature to be
> >> > off by default and be properly marked experimental in the
> >> > command line option documentation (so people know to stay
> >> > away from it).
> >>
> >> And in this specific case, there is no balancing of vcpus across the
> >> pcpus lists.
> >>
> >> One can construct a pathological case using pinning and pausing to get
> >> almost every vcpu on a single pcpu list, and vcpus receiving fewer
> >> interrupts will exacerbate the problem by staying on the list for longer
> >> periods of time.
> >
> > In that extreme case I believe many contentions in other code paths will
> > be much larger than overhead caused by this structure limitation.
> 
> Examples?
> 
> >> IMO, the PI feature cannot be declared as done/supported with this bug
> >> remaining.  OTOH, it is fine to be experimental, and disabled by default
> >> for people who wish to experiment.
> >>
> >
> > Again, I don't expect to see it disabled as experimental. For good
> > production
> > environment where vcpus are well balanced and interrupt latency is
> > sensitive,
> > linked list should be efficient here. For bad environment like extreme case
> > you raised, I don't know whether it really matters to just tune interrupt
> > path.
> 
> Can you _guarantee_ that everything potentially leading to such a
> pathological situation is covered by XSA-77? And even if it is now,
> removing elements from the waiver list would become significantly
> more difficult if disconnected behavior like this one would need to
> be taken into account.
> 
> Please understand that history has told us to be rather more careful
> than might seem necessary with this: ATS originally having been
> enabled by default is one bold example, and the recent flood of MSI
> related XSAs is another; I suppose I could find more. All affecting
> code originating from Intel, apparently written with only functionality
> in mind, while having left out (other than basic) security considerations.
> 
> IOW, with my committer role hat on, the feature is going to be
> experimental (and hence default off) unless the issue here gets
> addressed. And no, I cannot immediately suggest a good approach,
> and with all of the rush before the feature freeze I also can't justify
> taking a lot of time to think of options.
> 

Is it acceptable to you if I only add the blocked vcpus that have
assigned devices to the list? I think that should shorten the
length of the list.

Thanks,
Feng

> Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-10  5:59                         ` Wu, Feng
@ 2015-07-10  6:22                           ` Jan Beulich
  2015-07-10 11:05                             ` Dario Faggioli
  0 siblings, 1 reply; 155+ messages in thread
From: Jan Beulich @ 2015-07-10  6:22 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir@xen.org, George Dunlap,
	andrew.cooper3@citrix.com, Dario Faggioli, xen-devel,
	Yang Z Zhang

>>> On 10.07.15 at 07:59, <feng.wu@intel.com> wrote:
> If you agree with doing all this in a central place, maybe we can create
> an arch hook for 'struct scheduler' to do this and call it in all the places
> vcpu_runstate_change() gets called. What is your opinion about this?

Doing this in a central place is certainly the right approach, but
adding an arch hook that needs to be called everywhere
vcpu_runstate_change() is called wouldn't serve that purpose. Instead
we'd need to replace all current vcpu_runstate_change() calls
with calls to a new function calling both this and the to-be-added
arch hook.
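
A minimal sketch of that shape, purely for illustration: the names below
(rs_vcpu, rs_runstate_change, rs_arch_hook, rs_set_runstate) are
hypothetical stand-ins and not Xen's actual functions or signatures. The
point is only that call sites use one wrapper, so the accounting function
stays as it is and the arch hook cannot be forgotten at any call site.

```c
#include <assert.h>

enum rs_state { RS_RUNNING, RS_RUNNABLE, RS_BLOCKED, RS_OFFLINE };

struct rs_vcpu {
    enum rs_state state;
    int accounting_updates;      /* counts calls to the accounting function */
    int arch_hook_calls;         /* counts calls to the arch hook */
    enum rs_state last_old;      /* last transition seen by the arch hook */
    enum rs_state last_new;
};

/* Stand-in for the existing accounting-only function; left unchanged. */
static void rs_runstate_change(struct rs_vcpu *v, enum rs_state new_state)
{
    v->state = new_state;
    v->accounting_updates++;
}

/* Hypothetical arch hook, e.g. where a PI descriptor update would go. */
static void rs_arch_hook(struct rs_vcpu *v, enum rs_state old_state,
                         enum rs_state new_state)
{
    v->last_old = old_state;
    v->last_new = new_state;
    v->arch_hook_calls++;
}

/* New single entry point that replaces direct calls to the accounting function. */
static void rs_set_runstate(struct rs_vcpu *v, enum rs_state new_state)
{
    enum rs_state old_state = v->state;

    rs_runstate_change(v, new_state);
    rs_arch_hook(v, old_state, new_state);
}
```

Whether runstate transitions are the right trigger at all is a separate
question; this sketch only shows the mechanical structure of the wrapper.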

But please wait for George's / Dario's feedback, because they
seem to be even less convinced than me about your model of
tying the updates to runstate changes.

Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-10  6:21             ` Wu, Feng
@ 2015-07-10  6:32               ` Jan Beulich
  2015-07-10  7:29                 ` Wu, Feng
  0 siblings, 1 reply; 155+ messages in thread
From: Jan Beulich @ 2015-07-10  6:32 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir@xen.org, george.dunlap@eu.citrix.com,
	Andrew Cooper, xen-devel@lists.xen.org, Yang Z Zhang

>>> On 10.07.15 at 08:21, <feng.wu@intel.com> wrote:

> 
>> -----Original Message-----
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Thursday, July 09, 2015 3:26 PM
>> To: Wu, Feng; Tian, Kevin
>> Cc: Andrew Cooper; george.dunlap@eu.citrix.com; Zhang, Yang Z;
>> xen-devel@lists.xen.org; keir@xen.org 
>> Subject: RE: [Xen-devel] [v3 12/15] vmx: posted-interrupt handling when vCPU
>> is blocked
>> 
>> >>> On 09.07.15 at 00:49, <kevin.tian@intel.com> wrote:
>> >>  From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
>> >> Sent: Wednesday, July 08, 2015 9:09 PM
>> >> On 08/07/2015 13:46, Jan Beulich wrote:
>> >> >>>> On 08.07.15 at 13:00, <kevin.tian@intel.com> wrote:
>> >> >>> @@ -1848,6 +1869,33 @@ static struct hvm_function_table __initdata
>> >> >>> vmx_function_table = {
>> >> >>>      .enable_msr_exit_interception =
>> vmx_enable_msr_exit_interception,
>> >> >>>  };
>> >> >>>
>> >> >>> +/*
>> >> >>> + * Handle VT-d posted-interrupt when VCPU is blocked.
>> >> >>> + */
>> >> >>> +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
>> >> >>> +{
>> >> >>> +    struct arch_vmx_struct *vmx;
>> >> >>> +    unsigned int cpu = smp_processor_id();
>> >> >>> +
>> >> >>> +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
>> >> >>> +
>> >> >>> +    /*
>> >> >>> +     * FIXME: The length of the list depends on how many
>> >> >>> +     * vCPUs are currently blocked on this specific pCPU.
>> >> >>> +     * This may hurt the interrupt latency if the list
>> >> >>> +     * grows to too many entries.
>> >> >>> +     */
>> >> >> let's go with this linked list first until a real issue is identified.
>> >> > This is exactly the way of thinking I dislike when it comes to code
>> >> > that isn't intended to be experimental only: We shouldn't wait
>> >> > for problems to surface when we already can see them. I.e. if
>> >> > there are no plans to deal with this, I'd ask for the feature to be
>> >> > off by default and be properly marked experimental in the
>> >> > command line option documentation (so people know to stay
>> >> > away from it).
>> >>
>> >> And in this specific case, there is no balancing of vcpus across the
>> >> pcpus lists.
>> >>
>> >> One can construct a pathological case using pinning and pausing to get
>> >> almost every vcpu on a single pcpu list, and vcpus receiving fewer
>> >> interrupts will exacerbate the problem by staying on the list for longer
>> >> periods of time.
>> >
>> > In that extreme case I believe many contentions in other code paths will
>> > be much larger than overhead caused by this structure limitation.
>> 
>> Examples?
>> 
>> >> IMO, the PI feature cannot be declared as done/supported with this bug
>> >> remaining.  OTOH, it is fine to be experimental, and disabled by default
>> >> for people who wish to experiment.
>> >>
>> >
>> > Again, I don't expect to see it disabled as experimental. For good
>> > production
>> > environment where vcpus are well balanced and interrupt latency is
>> > sensitive,
>> > linked list should be efficient here. For bad environment like extreme case
>> > you raised, I don't know whether it really matters to just tune interrupt
>> > path.
>> 
>> Can you _guarantee_ that everything potentially leading to such a
>> pathological situation is covered by XSA-77? And even if it is now,
>> removing elements from the waiver list would become significantly
>> more difficult if disconnected behavior like this one would need to
>> be taken into account.
>> 
>> Please understand that history has told us to be rather more careful
>> than might seem necessary with this: ATS originally having been
>> enabled by default is one bold example, and the recent flood of MSI
>> related XSAs is another; I suppose I could find more. All affecting
>> code originating from Intel, apparently written with only functionality
>> in mind, while having left out (other than basic) security considerations.
>> 
>> IOW, with my committer role hat on, the feature is going to be
>> experimental (and hence default off) unless the issue here gets
>> addressed. And no, I cannot immediately suggest a good approach,
>> and with all of the rush before the feature freeze I also can't justify
>> taking a lot of time to think of options.
> 
> Is it acceptable to you if I only add the blocked vcpus that have
> assigned devices to the list? I think that should shorten the
> length of the list.

I actually implied this to be the case already, i.e.
- if it's not, this needs to be fixed anyway,
- it's not going to eliminate the concern (just think of a couple of
  many-vCPU guests all having devices assigned).

Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-10  6:32               ` Jan Beulich
@ 2015-07-10  7:29                 ` Wu, Feng
  2015-07-10  8:49                   ` Jan Beulich
  0 siblings, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-07-10  7:29 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	Andrew Cooper, xen-devel@lists.xen.org, Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Friday, July 10, 2015 2:32 PM
> To: Wu, Feng
> Cc: Andrew Cooper; george.dunlap@eu.citrix.com; Tian, Kevin; Zhang, Yang Z;
> xen-devel@lists.xen.org; keir@xen.org
> Subject: RE: [Xen-devel] [v3 12/15] vmx: posted-interrupt handling when vCPU
> is blocked
> 
> >>> On 10.07.15 at 08:21, <feng.wu@intel.com> wrote:
> 
> >
> >> -----Original Message-----
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Thursday, July 09, 2015 3:26 PM
> >> To: Wu, Feng; Tian, Kevin
> >> Cc: Andrew Cooper; george.dunlap@eu.citrix.com; Zhang, Yang Z;
> >> xen-devel@lists.xen.org; keir@xen.org
> >> Subject: RE: [Xen-devel] [v3 12/15] vmx: posted-interrupt handling when
> vCPU
> >> is blocked
> >>
> >> >>> On 09.07.15 at 00:49, <kevin.tian@intel.com> wrote:
> >> >>  From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> >> >> Sent: Wednesday, July 08, 2015 9:09 PM
> >> >> On 08/07/2015 13:46, Jan Beulich wrote:
> >> >> >>>> On 08.07.15 at 13:00, <kevin.tian@intel.com> wrote:
> >> >> >>> @@ -1848,6 +1869,33 @@ static struct hvm_function_table
> __initdata
> >> >> >>> vmx_function_table = {
> >> >> >>>      .enable_msr_exit_interception =
> >> vmx_enable_msr_exit_interception,
> >> >> >>>  };
> >> >> >>>
> >> >> >>> +/*
> >> >> >>> + * Handle VT-d posted-interrupt when VCPU is blocked.
> >> >> >>> + */
> >> >> >>> +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
> >> >> >>> +{
> >> >> >>> +    struct arch_vmx_struct *vmx;
> >> >> >>> +    unsigned int cpu = smp_processor_id();
> >> >> >>> +
> >> >> >>> +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
> >> >> >>> +
> >> >> >>> +    /*
> >> >> >>> +     * FIXME: The length of the list depends on how many
> >> >> >>> +     * vCPUs are currently blocked on this specific pCPU.
> >> >> >>> +     * This may hurt the interrupt latency if the list
> >> >> >>> +     * grows to too many entries.
> >> >> >>> +     */
> >> >> >> let's go with this linked list first until a real issue is identified.
> >> >> > This is exactly the way of thinking I dislike when it comes to code
> >> >> > that isn't intended to be experimental only: We shouldn't wait
> >> >> > for problems to surface when we already can see them. I.e. if
> >> >> > there are no plans to deal with this, I'd ask for the feature to be
> >> >> > off by default and be properly marked experimental in the
> >> >> > command line option documentation (so people know to stay
> >> >> > away from it).
> >> >>
> >> >> And in this specific case, there is no balancing of vcpus across the
> >> >> pcpus lists.
> >> >>
> >> >> One can construct a pathological case using pinning and pausing to get
> >> >> almost every vcpu on a single pcpu list, and vcpus receiving fewer
> >> >> interrupts will exacerbate the problem by staying on the list for longer
> >> >> periods of time.
> >> >
> >> > In that extreme case I believe many contentions in other code paths will
> >> > be much larger than overhead caused by this structure limitation.
> >>
> >> Examples?
> >>
> >> >> IMO, the PI feature cannot be declared as done/supported with this bug
> >> >> remaining.  OTOH, it is fine to be experimental, and disabled by default
> >> >> for people who wish to experiment.
> >> >>
> >> >
> >> > Again, I don't expect to see it disabled as experimental. For good
> >> > production
> >> > environment where vcpus are well balanced and interrupt latency is
> >> > sensitive,
> >> > linked list should be efficient here. For bad environment like extreme case
> >> > you raised, I don't know whether it really matters to just tune interrupt
> >> > path.
> >>
> >> Can you _guarantee_ that everything potentially leading to such a
> >> pathological situation is covered by XSA-77? And even if it is now,
> >> removing elements from the waiver list would become significantly
> >> more difficult if disconnected behavior like this one would need to
> >> be taken into account.
> >>
> >> Please understand that history has told us to be rather more careful
> >> than might seem necessary with this: ATS originally having been
> >> enabled by default is one bold example, and the recent flood of MSI
> >> related XSAs is another; I suppose I could find more. All affecting
> >> code originating from Intel, apparently written with only functionality
> >> in mind, while having left out (other than basic) security considerations.
> >>
> >> IOW, with my committer role hat on, the feature is going to be
> >> experimental (and hence default off) unless the issue here gets
> >> addressed. And no, I cannot immediately suggest a good approach,
> >> and with all of the rush before the feature freeze I also can't justify
> >> taking a lot of time to think of options.
> >
> > Is it acceptable to you if I only add the blocked vcpus that have
> > assigned devices to the list? I think that should shorten the
> > length of the list.
> 
> I actually implied this to be the case already, i.e.
> - if it's not, this needs to be fixed anyway,
> - it's not going to eliminate the concern (just think of a couple of
>   many-vCPU guests all having devices assigned).

So how about allocating multiple wakeup vectors (say, 16; maybe
we can make this configurable) and multiplexing them amongst all the
blocked vCPUs?
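
A rough sketch of what such multiplexing could look like, under stated
assumptions: the slot-selection policy (vcpu_id modulo the vector count),
the per-(pCPU, vector) list layout, and all names (mv_vcpu, mv_pick_slot,
mv_block, mv_scan_cost) are hypothetical and not taken from any patch.
Each wakeup vector gets its own list, so a handler only scans the vCPUs
that share its vector.

```c
#include <assert.h>
#include <stddef.h>

#define MV_NR_PCPUS        2     /* hypothetical pCPU count */
#define MV_NR_WAKEUP_VECS 16     /* the "say, 16" from the proposal */

struct mv_vcpu {
    unsigned int vcpu_id;
    struct mv_vcpu *next;
};

/* One blocked list per (pCPU, wakeup vector) pair. */
static struct mv_vcpu *mv_list[MV_NR_PCPUS][MV_NR_WAKEUP_VECS];

/* Assumed policy: spread vCPUs over the vectors by ID. */
static unsigned int mv_pick_slot(const struct mv_vcpu *v)
{
    return v->vcpu_id % MV_NR_WAKEUP_VECS;
}

/* Queue a blocking vCPU on the list of its chosen vector; returns the slot. */
static unsigned int mv_block(struct mv_vcpu *v, unsigned int cpu)
{
    unsigned int slot = mv_pick_slot(v);

    v->next = mv_list[cpu][slot];
    mv_list[cpu][slot] = v;
    return slot;
}

/* Number of entries the handler for (cpu, slot) would have to walk. */
static unsigned int mv_scan_cost(unsigned int cpu, unsigned int slot)
{
    unsigned int n = 0;

    for (const struct mv_vcpu *v = mv_list[cpu][slot]; v != NULL; v = v->next)
        n++;
    return n;
}
```

With 64 vCPUs blocked on one pCPU, each handler in this model walks 4
entries instead of 64. The list length is still proportional to the number
of blocked vCPUs, only divided by a constant, and an ID-based policy does
not by itself prevent many vCPUs from landing in the same slot.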

Thanks,
Feng

> 
> Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-10  7:29                 ` Wu, Feng
@ 2015-07-10  8:49                   ` Jan Beulich
  2015-07-10  8:57                     ` Wu, Feng
  0 siblings, 1 reply; 155+ messages in thread
From: Jan Beulich @ 2015-07-10  8:49 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir@xen.org, george.dunlap@eu.citrix.com,
	Andrew Cooper, xen-devel@lists.xen.org, Yang Z Zhang

>>> On 10.07.15 at 09:29, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Friday, July 10, 2015 2:32 PM
>> >>> On 10.07.15 at 08:21, <feng.wu@intel.com> wrote:
>> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Thursday, July 09, 2015 3:26 PM
>> >> >>> On 09.07.15 at 00:49, <kevin.tian@intel.com> wrote:
>> >> >>  From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
>> >> >> Sent: Wednesday, July 08, 2015 9:09 PM
>> >> >> On 08/07/2015 13:46, Jan Beulich wrote:
>> >> >> >>>> On 08.07.15 at 13:00, <kevin.tian@intel.com> wrote:
>> >> >> >>> @@ -1848,6 +1869,33 @@ static struct hvm_function_table
>> __initdata
>> >> >> >>> vmx_function_table = {
>> >> >> >>>      .enable_msr_exit_interception =
>> >> vmx_enable_msr_exit_interception,
>> >> >> >>>  };
>> >> >> >>>
>> >> >> >>> +/*
>> >> >> >>> + * Handle VT-d posted-interrupt when VCPU is blocked.
>> >> >> >>> + */
>> >> >> >>> +static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
>> >> >> >>> +{
>> >> >> >>> +    struct arch_vmx_struct *vmx;
>> >> >> >>> +    unsigned int cpu = smp_processor_id();
>> >> >> >>> +
>> >> >> >>> +    spin_lock(&per_cpu(pi_blocked_vcpu_lock, cpu));
>> >> >> >>> +
>> >> >> >>> +    /*
>> >> >> >>> +     * FIXME: The length of the list depends on how many
>> >> >> >>> +     * vCPUs are currently blocked on this specific pCPU.
>> >> >> >>> +     * This may hurt the interrupt latency if the list
>> >> >> >>> +     * grows to too many entries.
>> >> >> >>> +     */
>> >> >> >> let's go with this linked list first until a real issue is identified.
>> >> >> > This is exactly the way of thinking I dislike when it comes to code
>> >> >> > that isn't intended to be experimental only: We shouldn't wait
>> >> >> > for problems to surface when we already can see them. I.e. if
>> >> >> > there are no plans to deal with this, I'd ask for the feature to be
>> >> >> > off by default and be properly marked experimental in the
>> >> >> > command line option documentation (so people know to stay
>> >> >> > away from it).
>> >> >>
>> >> >> And in this specific case, there is no balancing of vcpus across the
>> >> >> pcpus lists.
>> >> >>
>> >> >> One can construct a pathological case using pinning and pausing to get
>> >> >> almost every vcpu on a single pcpu list, and vcpus receiving fewer
>> >> >> interrupts will exacerbate the problem by staying on the list for longer
>> >> >> periods of time.
>> >> >
>> >> > In that extreme case I believe many contentions in other code paths will
>> >> > be much larger than overhead caused by this structure limitation.
>> >>
>> >> Examples?
>> >>
>> >> >> IMO, the PI feature cannot be declared as done/supported with this bug
>> >> >> remaining.  OTOH, it is fine to be experimental, and disabled by default
>> >> >> for people who wish to experiment.
>> >> >>
>> >> >
>> >> > Again, I don't expect to see it disabled as experimental. For a good
>> >> > production environment where vcpus are well balanced and interrupt
>> >> > latency is sensitive, a linked list should be efficient here. For a bad
>> >> > environment like the extreme case you raised, I don't know whether it
>> >> > really matters to just tune the interrupt path.
>> >>
>> >> Can you _guarantee_ that everything potentially leading to such a
>> >> pathological situation is covered by XSA-77? And even if it is now,
>> >> removing elements from the waiver list would become significantly
>> >> more difficult if disconnected behavior like this one would need to
>> >> be taken into account.
>> >>
>> >> Please understand that history has told us to be rather more careful
>> >> than might seem necessary with this: ATS originally having been
>> >> enabled by default is one bold example, and the recent flood of MSI
>> >> related XSAs is another; I suppose I could find more. All affecting
>> >> code originating from Intel, apparently written with only functionality
>> >> in mind, while having left out (other than basic) security considerations.
>> >>
>> >> IOW, with my committer role hat on, the feature is going to be
>> >> experimental (and hence default off) unless the issue here gets
>> >> addressed. And no, I cannot immediately suggest a good approach,
>> >> and with all of the rush before the feature freeze I also can't justify
>> >> taking a lot of time to think of options.
>> >
>> > Is it acceptable to you if I only add the blocked vcpus that have
>> > assigned devices to the list? I think that should shorten the
>> > length of the list.
>> 
>> I actually implied this to be the case already, i.e.
>> - if it's not, this needs to be fixed anyway,
>> - it's not going to eliminate the concern (just think of a couple of
>>   many-vCPU guests all having devices assigned).
> 
> So how about allocating multiple wakeup vectors (say, 16; maybe
> we can make this configurable) and multiplex them amongst all the
> blocked vCPUs?

For such an approach to be effective, you'd need to know up front
how many vCPU-s you may need to deal with, or allocate vectors
on demand. Plus you'd need to convince us that spending additional
vectors (which we're already short of on certain big systems) is the
only viable solution to the issue (which I don't think it is).

Jan


* Re: [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked
  2015-07-10  8:49                   ` Jan Beulich
@ 2015-07-10  8:57                     ` Wu, Feng
  0 siblings, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-10  8:57 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	Andrew Cooper, xen-devel@lists.xen.org, Zhang, Yang Z, Wu, Feng



> >>> On 10.07.15 at 09:29, <feng.wu@intel.com> wrote:
> > So how about allocating multiple wakeup vectors (say, 16; maybe
> > we can make this configurable) and multiplex them amongst all the
> > blocked vCPUs?
> 
> For such an approach to be effective, you'd need to know up front
> how many vCPU-s you may need to deal with, or allocate vectors
> on demand. Plus you'd need to convince us that spending additional
> vectors (which we're already short of on certain big systems) is the
> only viable solution to the issue (which I don't think it is).

Well, in that case, maybe you need to provide a better solution!

Thanks,
Feng

> 
> Jan


* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-10  6:22                           ` Jan Beulich
@ 2015-07-10 11:05                             ` Dario Faggioli
  2015-07-14  5:44                               ` Wu, Feng
  2015-07-14 14:08                               ` Wu, Feng
  0 siblings, 2 replies; 155+ messages in thread
From: Dario Faggioli @ 2015-07-10 11:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Feng Wu, George Dunlap, andrew.cooper3@citrix.com,
	xen-devel, Yang Z Zhang, keir@xen.org



On Fri, 2015-07-10 at 07:22 +0100, Jan Beulich wrote:
> >>> On 10.07.15 at 07:59, <feng.wu@intel.com> wrote:
> > If you agree with doing all this in a central place, maybe we can create
> > an arch hook for 'struct scheduler' to do this and call it in all the places
> > vcpu_runstate_change() gets called. What is your opinion about this?
> 
> Doing this in a central place is certainly the right approach, but
> adding an arch hook that needs to be called everywhere
> vcpu_runstate_change() wouldn't serve that purpose. 
>
Indeed.

> Instead
> we'd need to replace all current vcpu_runstate_change() calls
> with calls to a new function calling both this and the to be added
> arch hook.
> 
Well, I also see the value of having this done in one place, but not to
the point of adding something like this.

> But please wait for George's / Dario's feedback, because they
> seem to be even less convinced than me about your model of
> tying the updates to runstate changes.
> 
Indeed. George stated very well the reason why vcpu_runstate_change()
should not be used, and suggested arch hooks to be added in the relevant
places. I particularly like this idea as not only would it leave
vcpu_runstate_change() alone, but it would also help disentangle this
from runstates, which, IMO, is also important.

So, can we identify the state (runstate? :-/) transitions that need
intercepting, and find a suitable place to put the hooks? I mean,
something like this:

 - running-->blocked: can be handled in the arch specific part of
                      context switch (similarly to CMT/CAT, which
                      already hooks into there). So, in this case, no
                      need to add any hook, as arch specific code is
                      called already;

 - running-->runnable: same as above;

 - running-->offline: not sure if you need to take action on this. If
                      yes, context switch should be fine as well;

 - blocked-->runnable: I think we need this, don't we? If yes, we
                       probably want an arch hook in vcpu_wake();

 - blocked-->offline: do you need it? Well, the hook in wake should work
                      for this as well, if yes;

 - runnable/running-->offline: if necessary, we want a hook in
                               vcpu_sleep_nosync().

Another way to look at this, less biased toward runstates (i.e., what
I've been asking for for a while), would be:

 - do you need to perform an action upon context switch (on prev and/or
   next vcpu)? If yes, there's an arch specific path in there already;
 - do you need to perform an action when a vcpu wakes-up? If yes, we
   need an arch hook in vcpu_wake();
 - do you need to perform an action when a vcpu goes to sleep? If yes,
   we need an arch hook in vcpu_sleep_nosync();

I think this makes a more than fair solution. I happen to like it even
better than the centralized approach, actually! That is for personal
taste, but also because I think it may be useful for others too, in
future, to be able to execute arch specific code, e.g., upon wake-up,
in which case we will be able to use the hook that we're introducing
here for PI.
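
To make the above a bit more concrete, here is a minimal, standalone
sketch of what such hooks could look like. Everything in it is an
assumption made for illustration: the struct layout, the hook table and
all the *_sketch names are hypothetical and are not existing Xen code.
The hooks only record which event fired; the real PI descriptor update
would go in their place.

```c
#include <stddef.h>

/* Hypothetical event markers, used only to show which hook fired. */
enum pi_event { PI_EV_NONE, PI_EV_WAKE, PI_EV_SLEEP };

/* Hypothetical, heavily simplified stand-in for Xen's struct vcpu. */
struct vcpu {
    int vcpu_id;
    enum pi_event last_event;   /* illustrative field, not a real one */
};

/* Hypothetical per-arch hook table; the names are assumptions. */
struct arch_sched_hooks {
    void (*vcpu_wake)(struct vcpu *v);   /* for use in vcpu_wake() */
    void (*vcpu_sleep)(struct vcpu *v);  /* for use in vcpu_sleep_nosync() */
};

/* Placeholders for the arch specific (here: PI) work. */
static void pi_vcpu_wake(struct vcpu *v)  { v->last_event = PI_EV_WAKE; }
static void pi_vcpu_sleep(struct vcpu *v) { v->last_event = PI_EV_SLEEP; }

static const struct arch_sched_hooks pi_hooks = {
    .vcpu_wake  = pi_vcpu_wake,
    .vcpu_sleep = pi_vcpu_sleep,
};

/* NULL would mean "this architecture installs no hooks". */
static const struct arch_sched_hooks *arch_hooks = &pi_hooks;

/* Generic wake-up path: do the common work, then call the arch hook. */
static void vcpu_wake_sketch(struct vcpu *v)
{
    /* ... generic wake-up work would go here ... */
    if ( arch_hooks && arch_hooks->vcpu_wake )
        arch_hooks->vcpu_wake(v);
}

/* Generic sleep path: same pattern as the wake-up path. */
static void vcpu_sleep_nosync_sketch(struct vcpu *v)
{
    /* ... generic sleep work would go here ... */
    if ( arch_hooks && arch_hooks->vcpu_sleep )
        arch_hooks->vcpu_sleep(v);
}
```

The context switch case needs nothing new in this sketch, since an arch
specific path is called there already; only wake-up and sleep get a hook,
and neither of them depends on runstates.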

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-10  0:07                   ` Wu, Feng
@ 2015-07-10 12:40                     ` Dario Faggioli
  2015-07-10 13:47                       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 155+ messages in thread
From: Dario Faggioli @ 2015-07-10 12:40 UTC (permalink / raw)
  To: Wu, Feng
  Cc: Tian, Kevin, keir@xen.org, George Dunlap,
	andrew.cooper3@citrix.com, xen-devel, jbeulich@suse.com,
	Zhang, Yang Z



On Fri, 2015-07-10 at 00:07 +0000, Wu, Feng wrote:

> > From: Dario Faggioli [mailto:dario.faggioli@citrix.com]

> > What I mean is, can you describe when each specific operation
> > needs to happen? Something like "descriptor needs to be updated like
> > this upon migration", "notification should be disabled when vcpu starts
> > running", "notification method should be changed that other way when
> > vcpu is preempted", etc.
> 
> I cannot see the differences, I think the requirements are clearly listed in
> the design doc and the comments of this patch.
> 
The difference is, and is IMO quite a big one, this: do you need to do
something when a vcpu wakes up, perhaps depending on whether it is runnable
or not immediately after that, or when a vcpu enters runstate
RUNSTATE_runnable?

IOW, are you interested in the event, or in the change that such an
event causes, as far as a particular subsystem (in this case
accounting/information reporting) is concerned?

And no, the fact that when a vcpu wakes up, if it's runnable, it enters
the RUNSTATE_runnable runstate is not enough to say that they're the same
thing! Runstates are an abstraction used for accounting and for reporting
information to the higher levels.
So, why not use it? No reason, and in fact it's used a lot! For
instance, xenalyze (and tracing in general) uses it; getdomaininfo()
uses it; XEN_DOMCTL_getvcpuinfo uses it.

However, there is no one single feature (e.g., for hardware enablement,
like yours) that I can find, within Xen, that builds on top of runstates
(the only exception is credit1 scheduler, and only it, using
runstate.state_entry_time once... and I think that's quite bad of it,
FWIW).

Theoretically speaking, runstates could well disappear, or change
meaning, or be replaced by something else, and only the accounting and
reporting code (as far as the hypervisor is concerned, of course) would
suffer/need changing.

I think, OTOH, that you should really be interested in making sure you
intercept an event, in this example a wake-up, and adding an
architectural hook in vcpu_wake() is certainly a way of doing that.

In fact, even if runstates ever go away or get changed, vcpus
are always going to wake up! :-)

Regards,
Dario

> > 
> > This would help a lot, IMO, figuring out the actual functional
> > requirements that need to be satisfied for things to work well. Once
> > that is done, we can go check in the code where is the best place to put
> > each call, hook, or whatever.
> > 
> > 
> > Note that I've already tried to infer the above, by looking at the
> > patches, and that is making me think that it would be possible to
> > implement things in another way. But maybe I'm missing something. So it
> > would be really valuable if you, with all your knowledge of how PI
> > should work, could do it.
> 
> I keep describing how PI works, what the purpose of the two vectors are,
> how special they are from the beginning.
> 
> Thanks,
> Feng
> 
> 
> > 
> > Thanks and Regards,
> > Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


* Re: [v3 03/15] Add cmpxchg16b support for x86-64
  2015-06-24  5:18 ` [v3 03/15] Add cmpxchg16b support for x86-64 Feng Wu
  2015-06-24 18:35   ` Andrew Cooper
@ 2015-07-10 12:57   ` Jan Beulich
  1 sibling, 0 replies; 155+ messages in thread
From: Jan Beulich @ 2015-07-10 12:57 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel,
	yang.z.zhang

>>> On 24.06.15 at 07:18, <feng.wu@intel.com> wrote:
> --- a/xen/include/xen/types.h
> +++ b/xen/include/xen/types.h
> @@ -47,6 +47,11 @@ typedef         __u64           uint64_t;
>  typedef         __u64           u_int64_t;
>  typedef         __s64           int64_t;
>  
> +typedef struct {
> +        uint64_t low;
> +        uint64_t high;
> +} uint128_t;

This violates the C standard; if you did this in an x86-specific file
(or conditional upon BITS_PER_LONG >= 64) I can't see why you
couldn't use gcc's __uint128_t built-in type (available on arches
with host word width >= 64 bits).
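
A minimal sketch of that alternative, assuming a gcc-compatible compiler
on a 64-bit target. The uint128_sketch_t name and the u128_* helpers are
hypothetical and exist only for this illustration; they are not part of
the patch.

```c
#include <stdint.h>

/* gcc and clang define __SIZEOF_INT128__ where the 128-bit type exists. */
#if defined(__SIZEOF_INT128__)
typedef __uint128_t uint128_sketch_t;
#else
#error "no compiler-provided 128-bit integer type on this target"
#endif

/* Build a 128-bit value from its two 64-bit halves. */
static inline uint128_sketch_t u128_make(uint64_t low, uint64_t high)
{
    return ((uint128_sketch_t)high << 64) | low;
}

/* Extract the low 64 bits. */
static inline uint64_t u128_low(uint128_sketch_t x)
{
    return (uint64_t)x;
}

/* Extract the high 64 bits. */
static inline uint64_t u128_high(uint128_sketch_t x)
{
    return (uint64_t)(x >> 64);
}
```

With a scalar type like this the low/high halves are obtained by shifting
rather than through struct members, and no type with a uint*_t style name
has to be defined by hand.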

Jan


* Re: [v3 06/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  2015-06-24  5:18 ` [v3 06/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts Feng Wu
  2015-06-29 15:04   ` Andrew Cooper
  2015-07-08  7:48   ` Tian, Kevin
@ 2015-07-10 13:08   ` Jan Beulich
  2015-07-15  2:40     ` Wu, Feng
  2015-07-15  3:13     ` Wu, Feng
  2 siblings, 2 replies; 155+ messages in thread
From: Jan Beulich @ 2015-07-10 13:08 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel,
	yang.z.zhang

>>> On 24.06.15 at 07:18, <feng.wu@intel.com> wrote:
> @@ -81,8 +81,19 @@ struct vmx_domain {
>  
>  struct pi_desc {
>      DECLARE_BITMAP(pir, NR_VECTORS);
> -    u32 control;
> -    u32 rsvd[7];
> +    union {
> +        struct
> +        {
> +        u16 on     : 1,  /* bit 256 - Outstanding Notification */
> +            sn     : 1,  /* bit 257 - Suppress Notification */
> +            rsvd_1 : 14; /* bit 271:258 - Reserved */
> +        u8  nv;          /* bit 279:272 - Notification Vector */
> +        u8  rsvd_2;      /* bit 287:280 - Reserved */
> +        u32 ndst;        /* bit 319:288 - Notification Destination */
> +        };
> +        u64 control;
> +    };

So current code, afaics, uses e.g. test_and_set_bit() to set ON.
By also declaring this as a bitfield you're opening the structure for
non-atomic accesses. If that's correct, why is other code not
being changed to _only_ use the bitfield mechanism (likely also
eliminating the need for it being a union with the now 64-bit
"control")? If atomic accesses are required, then I'd strongly
suggest against making this a bit field.

And in no event can I see why "ndst" needs to be union-ized
with "control" if it doesn't need to be updated atomically with
e.g. "nv".
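
To illustrate the atomic-only variant, here is a minimal sketch that
keeps "control" a plain 64-bit word and goes through mask based
accessors exclusively. It assumes gcc's __atomic builtins; the struct
name, the *_sketch suffixes and the exact accessor set are assumptions
for illustration, with the bit positions taken from the comments in the
patch (ON is bit 256 of the descriptor, i.e. bit 0 of "control", SN is
bit 257, i.e. bit 1).

```c
#include <stdbool.h>
#include <stdint.h>

#define POSTED_INTR_ON 0   /* Outstanding Notification */
#define POSTED_INTR_SN 1   /* Suppress Notification */

/* Simplified 64-byte descriptor: no bitfields, just one control word. */
struct pi_desc_sketch {
    uint64_t pir[4];    /* 256 posted-interrupt request bits */
    uint64_t control;   /* ON, SN, NV and NDST all live in here */
    uint64_t rsvd[3];
};

/* Atomically set ON; returns whether it was already set before. */
static inline bool pi_test_and_set_on_sketch(struct pi_desc_sketch *pi)
{
    uint64_t mask = UINT64_C(1) << POSTED_INTR_ON;
    uint64_t old = __atomic_fetch_or(&pi->control, mask, __ATOMIC_SEQ_CST);

    return (old & mask) != 0;
}

/* Atomically set SN, leaving all other bits untouched. */
static inline void pi_set_sn_sketch(struct pi_desc_sketch *pi)
{
    __atomic_fetch_or(&pi->control, UINT64_C(1) << POSTED_INTR_SN,
                      __ATOMIC_SEQ_CST);
}

/* Atomically clear SN, leaving all other bits untouched. */
static inline void pi_clear_sn_sketch(struct pi_desc_sketch *pi)
{
    __atomic_fetch_and(&pi->control, ~(UINT64_C(1) << POSTED_INTR_SN),
                       __ATOMIC_SEQ_CST);
}
```

With only these accessors touching "control", no plain bitfield
read-modify-write of the same word can race with the hardware or with
another CPU.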

Jan


* Re: [v3 08/15] Suppress posting interrupts when 'SN' is set
  2015-06-24  5:18 ` [v3 08/15] Suppress posting interrupts when 'SN' is set Feng Wu
  2015-06-29 15:41   ` Andrew Cooper
  2015-07-08  9:06   ` Tian, Kevin
@ 2015-07-10 13:20   ` Jan Beulich
  2 siblings, 0 replies; 155+ messages in thread
From: Jan Beulich @ 2015-07-10 13:20 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel,
	yang.z.zhang

>>> On 24.06.15 at 07:18, <feng.wu@intel.com> wrote:
> @@ -1698,13 +1700,35 @@ static void vmx_deliver_posted_intr(struct vcpu *v, u8 vector)
>           */
>          pi_set_on(&v->arch.hvm_vmx.pi_desc);
>      }
> -    else if ( !pi_test_and_set_on(&v->arch.hvm_vmx.pi_desc) )
> +    else
>      {
> +        prev.control = 0;
> +
> +        do {
> +            old.control = v->arch.hvm_vmx.pi_desc.control &
> +                          ~(1 << POSTED_INTR_ON | 1 << POSTED_INTR_SN);
> +            new.control = v->arch.hvm_vmx.pi_desc.control |
> +                          1 << POSTED_INTR_ON;
> +
> +            /*
> +             * Currently, we don't support urgent interrupt, all
> +             * interrupts are recognized as non-urgent interrupt,
> +             * so we cannot send posted-interrupt when 'SN' is set.
> +             * Besides that, if 'ON' is already set, we cannot set
> +             * posted-interrupts as well.
> +             */
> +            if ( prev.sn || prev.on )
> +            {
> +                vcpu_kick(v);
> +                return;
> +            }
> +
> +            prev.control = cmpxchg(&v->arch.hvm_vmx.pi_desc.control,
> +                                   old.control, new.control);
> +        } while ( prev.control != old.control );

This pretty clearly demonstrates that mixing bitfields and non-bitfield
mask operations makes code hard to read: How is one supposed to
see at the first glance that e.g. prev.on and
old.control & (1 << POSTED_INTR_ON) are the same thing?
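
For comparison, a minimal sketch of the same loop written with masks
only, so that the test and the update visibly refer to the same bits.
It assumes gcc's __atomic builtins instead of Xen's cmpxchg(), and the
function and mask names are hypothetical. It returns true when ON was
newly set (a notification should be sent) and false when SN or ON was
already set (where the patch calls vcpu_kick() instead).

```c
#include <stdbool.h>
#include <stdint.h>

#define POSTED_INTR_ON 0
#define POSTED_INTR_SN 1

#define PI_ON_MASK (UINT64_C(1) << POSTED_INTR_ON)
#define PI_SN_MASK (UINT64_C(1) << POSTED_INTR_SN)

/* Try to set ON in *control unless SN or ON is already set. */
static bool pi_try_set_on_sketch(uint64_t *control)
{
    uint64_t old = __atomic_load_n(control, __ATOMIC_RELAXED);

    do {
        /* Suppressed or already pending: nothing to post. */
        if ( old & (PI_ON_MASK | PI_SN_MASK) )
            return false;
        /* On failure the builtin reloads 'old' and the check is redone. */
    } while ( !__atomic_compare_exchange_n(control, &old, old | PI_ON_MASK,
                                           false, __ATOMIC_SEQ_CST,
                                           __ATOMIC_RELAXED) );

    return true;
}
```

The check and the compare-and-exchange both operate on the value that was
actually read, which avoids testing one variable while exchanging another.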

Jan


* Re: [v3 09/15] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts
  2015-06-24  5:18 ` [v3 09/15] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts Feng Wu
  2015-06-29 16:04   ` Andrew Cooper
  2015-07-08  9:10   ` Tian, Kevin
@ 2015-07-10 13:27   ` Jan Beulich
  2 siblings, 0 replies; 155+ messages in thread
From: Jan Beulich @ 2015-07-10 13:27 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel,
	yang.z.zhang

>>> On 24.06.15 at 07:18, <feng.wu@intel.com> wrote:
> --- a/xen/drivers/passthrough/vtd/iommu.h
> +++ b/xen/drivers/passthrough/vtd/iommu.h
> @@ -289,29 +289,43 @@ struct dma_pte {
>  /* interrupt remap entry */
>  struct iremap_entry {
>    union {
> -    u64 lo_val;
> +    struct { u64 lo, hi; };
>      struct {
> -        u64 p       : 1,
> +        u16 p       : 1,
>              fpd     : 1,
>              dm      : 1,
>              rh      : 1,
>              tm      : 1,
>              dlm     : 3,
>              avail   : 4,
> -            res_1   : 4,
> -            vector  : 8,
> -            res_2   : 8,
> -            dst     : 32;
> -    }lo;
> -  };
> -  union {
> -    u64 hi_val;
> +            res_1   : 4;
> +        u8  vector;
> +        u8  res_2;
> +        u32 dst;
> +        u16 sid;
> +        u16 sq      : 2,
> +            svt     : 2,
> +            res_3   : 12;
> +        u32 res_4   : 32;
> +    } remap;
>      struct {
> -        u64 sid     : 16,
> -            sq      : 2,
> +        u16 p       : 1,
> +            fpd     : 1,
> +            res_1   : 6,
> +            avail   : 4,
> +            res_2   : 2,
> +            urg     : 1,
> +            im      : 1;
> +        u8  vector;
> +        u8  res_3;
> +        u32 res_4   : 6,
> +            pda_l   : 26;
> +        u16 sid;
> +        u16 sq      : 2,
>              svt     : 2,
> -            res_1   : 44;
> -    }hi;
> +            res_5   : 12;
> +        u32 pda_h;
> +    } post;
>    };
>  };

Considering that various of the fields are identical between the two
formats I wonder whether that shouldn't be presented in the layout
here: sid, svt, sq, vector could all be a single structure field, with
other parts becoming sub-unions. Or would that cause problems
elsewhere?
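
One possible shape of such a layout, as a minimal sketch only. It assumes
gcc-style bitfield allocation on a little-endian target and C11 anonymous
structs/unions; the struct name and the "flags" / "res" member names are
made up for the illustration, while the field names and widths follow the
patch above.

```c
#include <stdint.h>

struct iremap_entry_sketch {
    union {
        struct { uint64_t lo, hi; };          /* raw view */
        struct {
            union {                           /* bits 15:0 differ per format */
                struct {
                    uint16_t p : 1, fpd : 1, dm : 1, rh : 1, tm : 1,
                             dlm : 3, avail : 4, res_1 : 4;
                } remap;
                struct {
                    uint16_t p : 1, fpd : 1, res_1 : 6, avail : 4,
                             res_2 : 2, urg : 1, im : 1;
                } post;
            } flags;
            uint8_t  vector;                  /* shared by both formats */
            uint8_t  res;
            union {                           /* bits 63:32 differ per format */
                uint32_t dst;                 /* remapped format */
                struct { uint32_t res_4 : 6, pda_l : 26; };   /* posted */
            };
            uint16_t sid;                     /* shared by both formats */
            uint16_t sq : 2, svt : 2, res_5 : 12;             /* shared */
            union {                           /* bits 127:96 differ per format */
                uint32_t res_6;               /* remapped format: reserved */
                uint32_t pda_h;               /* posted format */
            };
        };
    };
};

/* The IRTE is 128 bits wide; fail the build if the layout disagrees. */
_Static_assert(sizeof(struct iremap_entry_sketch) == 16,
               "IRTE sketch must be 16 bytes");
```

Code that only needs sid, svt, sq or vector would then not have to care
which of the two formats an entry is in.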

Jan


* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-10 12:40                     ` Dario Faggioli
@ 2015-07-10 13:47                       ` Konrad Rzeszutek Wilk
  2015-07-10 13:59                         ` Dario Faggioli
  0 siblings, 1 reply; 155+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-07-10 13:47 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tian, Kevin, keir@xen.org, George Dunlap,
	andrew.cooper3@citrix.com, xen-devel, jbeulich@suse.com,
	Zhang, Yang Z, Wu, Feng

On Fri, Jul 10, 2015 at 02:40:17PM +0200, Dario Faggioli wrote:
> On Fri, 2015-07-10 at 00:07 +0000, Wu, Feng wrote:
> 
> > > From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> 
> > > What I mean is, can you describe when each specific operation
> > > needs to happen? Something like "descriptor needs to be updated like
> > > this upon migration", "notification should be disabled when vcpu starts
> > > running", "notification method should be changed that other way when
> > > vcpu is preempted", etc.
> > 
> > I cannot see the differences, I think the requirements are clearly listed in
> > the design doc and the comments of this patch.
> > 
> The difference is, and is IMO quite a big one, this: do you need to do
> something when a vcpu wakes up, perhaps depending on whether it is runnable
> or not immediately after that, or when a vcpu enters runstate
> RUNSTATE_runnable?
> 
> IOW, are you interested in the event, or in the change that such an
> event causes, as far as a particular subsystem (in this case
> accounting/information reporting) is concerned?
> 
> And no, the fact that when a vcpu wakes up, if it's runnable, it enters
> the RUNSTATE_runnable runstate is not enough to say that they're the same
> thing! Runstates are an abstraction used for accounting and for reporting
> information to the higher levels.
> So, why not use it? No reason, and in fact it's used a lot! For
> instance, xenalyze (and tracing in general) uses it; getdomaininfo()
> uses it; XEN_DOMCTL_getvcpuinfo uses it.
> 
> However, there is no one single feature (e.g., for hardware enablement,
> like yours) that I can find, within Xen, that builds on top of runstates
> (the only exception is credit1 scheduler, and only it, using
> runstate.state_entry_time once... and I think that's quite bad of it,
> FWIW).

The Linux kernel uses them. It ends up reporting the time spent in
RUNSTATE_runnable as 'steal time'. I.e., if you run 'top' and see 'st', that is it.
> 
> Theoretically speaking, runstates could well disappear, or change
> meaning, or be replaced by something else, and only the accounting and
> reporting code (as far as the hypervisor is concerned, of course) would
> suffer/need changing.

Please don't remove them! They helped me in tracking down a situation
where guests had 20% of their time-slice taken out by a global spinlock!


* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-10 13:47                       ` Konrad Rzeszutek Wilk
@ 2015-07-10 13:59                         ` Dario Faggioli
  0 siblings, 0 replies; 155+ messages in thread
From: Dario Faggioli @ 2015-07-10 13:59 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Tian, Kevin, keir@xen.org, George Dunlap,
	andrew.cooper3@citrix.com, xen-devel, jbeulich@suse.com,
	Zhang, Yang Z, Wu, Feng



On Fri, 2015-07-10 at 09:47 -0400, Konrad Rzeszutek Wilk wrote:
> On Fri, Jul 10, 2015 at 02:40:17PM +0200, Dario Faggioli wrote:

> > However, there is no one single feature (e.g., for hardware enablement,
> > like yours) that I can find, within Xen, that builds on top of runstates
> > (the only exception is credit1 scheduler, and only it, using
> > runstate.state_entry_time once... and I think that's quite bad of it,
> > FWIW).
> 
> > The Linux kernel uses them. It ends up reporting the time spent in
> > RUNSTATE_runnable as 'steal time'. I.e., if you run 'top' and see 'st', that is it.
>
It sure does, and it is fully within its rights to do so! I think I
said that they're there for accounting and information reporting to
upper layers, and Linux is what populates one of the (various) upper
layers, as far as Xen is concerned, isn't it?

What I'm arguing about is not them being there at all, nor about them
being used for their own purpose, it's much rather about them being
(ab)used for hardware enablement _in_Xen_ itself. :-)

> > Theoretically speaking, runstates could well disappear, or change
> > meaning, or be replaced by something else, and only the accounting and
> > reporting code (as far as the hypervisor is concerned, of course) would
> > suffer/need changing.
> 
> Please don't remove them! They helped me in tracking down a situation
> where guests had 20% of their time-slice taken out by a global spinlock!
>
Hey, I said 'theoretically speaking'! :-)

I don't have, and am not aware of, any plan to get rid of them.
So, fear nothing, they're not going anywhere... or are they?
Mhwuawhuahwua :-D :-D

Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used
  2015-06-24  5:18 ` [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used Feng Wu
  2015-06-29 16:22   ` Andrew Cooper
  2015-07-08  9:59   ` Tian, Kevin
@ 2015-07-10 14:01   ` Jan Beulich
  2015-07-15  6:04     ` Wu, Feng
  2 siblings, 1 reply; 155+ messages in thread
From: Jan Beulich @ 2015-07-10 14:01 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel,
	yang.z.zhang

>>> On 24.06.15 at 07:18, <feng.wu@intel.com> wrote:
> This patch adds an API which is used to update the IRTE
> for posted-interrupt when guest changes MSI/MSI-X information.

This is again an example where adding a dead function complicates
review: How will I know here why this statement is correct, namely
why MSI/MSI-X are affected but IO-APIC isn't?

> +int pi_update_irte(struct vcpu *v, struct pirq *pirq, uint8_t gvec)
> +{
> +    struct irq_desc *desc;
> +    struct msi_desc *msi_desc;
> +    int remap_index;
> +    int rc = 0;
> +    struct pci_dev *pci_dev;
> +    struct acpi_drhd_unit *drhd;
> +    struct iommu *iommu;
> +    struct ir_ctrl *ir_ctrl;
> +    struct iremap_entry *iremap_entries = NULL, *p = NULL;
> +    struct iremap_entry new_ire;
> +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;

I suppose some of the pointers above could become pointers to
const?

> +    unsigned long flags;
> +    uint128_t old_ire, ret;
> +
> +    desc = pirq_spin_lock_irq_desc(pirq, NULL);
> +    if ( !desc )
> +        return -ENOMEM;
> +
> +    msi_desc = desc->msi_desc;
> +    if ( !msi_desc )
> +    {
> +        rc = -EBADSLT;
> +        goto unlock_out;
> +    }
> +
> +    pci_dev = msi_desc->dev;
> +    if ( !pci_dev )
> +    {
> +        rc = -ENODEV;
> +        goto unlock_out;
> +    }
> +
> +    remap_index = msi_desc->remap_index;
> +    drhd = acpi_find_matched_drhd_unit(pci_dev);
> +    if ( !drhd )
> +    {
> +        rc = -ENODEV;
> +        goto unlock_out;
> +    }
> +
> +    iommu = drhd->iommu;
> +    ir_ctrl = iommu_ir_ctrl(iommu);
> +    if ( !ir_ctrl )
> +    {
> +        rc = -ENODEV;
> +        goto unlock_out;
> +    }
> +
> +    spin_lock_irqsave(&ir_ctrl->iremap_lock, flags);

Interrupts are unconditionally disabled here already. The question,
though, is whether you really need to hold on to the IRQ descriptor
lock across the entire function. Much, of course, depends on which
other locks you assume the caller holds.

I'm particularly worried by the call to acpi_find_matched_drhd_unit()
- is it maybe worth storing the iommu pointer in struct msi_desc?

> +    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index, iremap_entries, p);
> +    new_ire = *p;
> +
> +    /* Setup/Update interrupt remapping table entry. */
> +    setup_posted_irte(&new_ire, pi_desc, gvec);
> +
> +    do {
> +        old_ire = *(uint128_t *)p;

This cast suggests that you might want to go beyond what Andrew
said on cmpxchg16b()'s parameters: Perhaps they'd better be
void * instead of uint128_t *.

> +        ret = cmpxchg16b(p, &old_ire, &new_ire);
> +    } while ( memcmp(&ret, &old_ire, sizeof(old_ire)) );

Doesn't setup_posted_irte() need to move inside this loop, as it
tries to preserve certain fields? Or else, what is the cmpxchg16b
loop guarding against (i.e. why isn't this just a single one)?
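
To illustrate the concern above, here is a minimal sketch of a
compare-and-swap retry loop that rebuilds the new value from the latest
snapshot on every attempt. It is not the patch's code: it uses a 64-bit
C11 atomic as a stand-in for the 128-bit IRTE, and PRESERVE_MASK,
build_entry() and update_entry() are hypothetical names (build_entry()
plays the role of setup_posted_irte()).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout: the low 32 bits must be preserved across the
 * update, the high 32 bits are the fields being rewritten. */
#define PRESERVE_MASK UINT64_C(0x00000000ffffffff)

/* Build the new entry from a snapshot: keep the preserved bits and
 * replace the rest (stand-in for setup_posted_irte()). */
static uint64_t build_entry(uint64_t snapshot, uint32_t payload)
{
    return (snapshot & PRESERVE_MASK) | ((uint64_t)payload << 32);
}

static void update_entry(_Atomic uint64_t *entry, uint32_t payload)
{
    uint64_t old, new;

    assert(entry);
    old = atomic_load(entry);

    do {
        /* Recompute from the current snapshot on every attempt. On
         * failure, atomic_compare_exchange_weak() stores the value it
         * found into 'old', so the next iteration sees fresh data. */
        new = build_entry(old, payload);
    } while ( !atomic_compare_exchange_weak(entry, &old, new) );
}
```

If the new value were computed once, before the loop, a concurrent change
to the preserved bits would be silently overwritten by the retry.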

> +    iommu_flush_cache_entry(p, sizeof(struct iremap_entry));

sizeof(*p)

Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 11/15] Update IRTE according to guest interrupt config changes
  2015-06-24  5:18 ` [v3 11/15] Update IRTE according to guest interrupt config changes Feng Wu
  2015-06-29 16:46   ` Andrew Cooper
  2015-07-08 10:22   ` Tian, Kevin
@ 2015-07-10 14:23   ` Jan Beulich
  2 siblings, 0 replies; 155+ messages in thread
From: Jan Beulich @ 2015-07-10 14:23 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel,
	yang.z.zhang

>>> On 24.06.15 at 07:18, <feng.wu@intel.com> wrote:
> +static struct vcpu *pi_find_dest_vcpu(struct domain *d, uint8_t dest_id,

Looks like there's nothing modifying d, i.e. should be const.

But how can dest_id be uint8_t when we support x2APIC in the
guest?

> +                                      uint8_t dest_mode, uint8_t delivery_mode,

According to vlapic_match_dest()'s parameter types dest_mode
ought to be bool_t.

> +                                      uint8_t gvec)
> +{
> +    unsigned long *dest_vcpu_bitmap = NULL;

Pointless initializer.

> +    unsigned int dest_vcpu_num = 0, idx = 0;
> +    int size = (d->max_vcpus + BITS_PER_LONG - 1) / BITS_PER_LONG;

unsigned int size = BITS_TO_LONGS(d->max_vcpus);

> +    struct vcpu *v, *dest = NULL;
> +    int i;

unsigned int

> +    dest_vcpu_bitmap = xzalloc_array(unsigned long, size);
> +    if ( !dest_vcpu_bitmap )
> +    {
> +        dprintk(XENLOG_G_INFO,
> +                "dom%d: failed to allocate memory\n", d->domain_id);
> +        return NULL;
> +    }
> +
> +    for_each_vcpu ( d, v )
> +    {
> +        if ( !vlapic_match_dest(vcpu_vlapic(v), NULL, 0,
> +                                dest_id, dest_mode) )
> +            continue;
> +
> +        __set_bit(v->vcpu_id, dest_vcpu_bitmap);
> +        dest_vcpu_num++;
> +    }
> +
> +    if ( delivery_mode == dest_LowestPrio )
> +    {
> +        if (  dest_vcpu_num != 0 )
> +        {
> +            for ( i = 0; i <= gvec % dest_vcpu_num; i++)

Unless the compiler recognizes this, you do a divide on each iteration.
Make this the initializer of i instead.
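
A minimal sketch of that suggestion, under stated assumptions: the modulo
is evaluated once in the loop initializer and the loop then counts down
over the set bits. This does not use Xen's find_next_bit(); it walks the
bitmap with a local test_bit_sketch() helper, and pick_dest() plus the
local BITS_PER_LONG / BITS_TO_LONGS definitions are illustrative only.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG    ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Return non-zero if 'bit' is set in the bitmap. */
static int test_bit_sketch(const unsigned long *map, unsigned int bit)
{
    return (map[bit / BITS_PER_LONG] >> (bit % BITS_PER_LONG)) & 1UL;
}

/*
 * Return the index of the (gvec % nr_set)-th set bit (0-based) among the
 * first 'nbits' bits, or 'nbits' if there is no such bit.
 */
static unsigned int pick_dest(const unsigned long *map, unsigned int nbits,
                              unsigned int nr_set, unsigned int gvec)
{
    unsigned int i, idx;

    assert(map);

    if ( nr_set == 0 )
        return nbits;

    /* The division happens exactly once, here in the initializer. */
    for ( i = gvec % nr_set, idx = 0; idx < nbits; idx++ )
    {
        if ( !test_bit_sketch(map, idx) )
            continue;
        if ( i-- == 0 )
            return idx;
    }

    return nbits;
}
```

Because idx is bounded by nbits in the loop condition, no out-of-range
index is ever used to look at the bitmap.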

> +                idx = find_next_bit(dest_vcpu_bitmap, d->max_vcpus, idx) + 1;
> +            idx--;
> +
> +            BUG_ON(idx >= d->max_vcpus || idx < 0);

If you really think you need this, then this is too late: find_next_bit()
is undefined for an out of range last argument.

> +            dest = d->vcpu[idx];
> +        }
> +    }
> +    else if (  dest_vcpu_num == 1 )
> +    {
> +        idx = find_first_bit(dest_vcpu_bitmap, d->max_vcpus);
> +        BUG_ON(idx >= d->max_vcpus || idx < 0);

This seems pretty pointless considering how dest_vcpu_num
ends up being non-zero.

> +        dest = d->vcpu[idx];
> +    }

else BUG()? Or at least some gdprintk()? Oh, I see you have one in
the caller - I guess that's fine then.

> @@ -330,11 +403,32 @@ int pt_irq_create_bind(
>          /* Calculate dest_vcpu_id for MSI-type pirq migration. */
>          dest = pirq_dpci->gmsi.gflags & VMSI_DEST_ID_MASK;
>          dest_mode = !!(pirq_dpci->gmsi.gflags & VMSI_DM_MASK);
> +        delivery_mode = (pirq_dpci->gmsi.gflags >> GFLAGS_SHIFT_DELIV_MODE) &
> +                        VMSI_DELIV_MASK;
>          dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
>          pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
>          spin_unlock(&d->event_lock);
>          if ( dest_vcpu_id >= 0 )
>              hvm_migrate_pirqs(d->vcpu[dest_vcpu_id]);
> +
> +        /* Use interrupt posting if it is supported */
> +        if ( iommu_intpost )
> +        {
> +            struct vcpu *vcpu = pi_find_dest_vcpu(d, dest, dest_mode,
> +                                        delivery_mode, pirq_dpci->gmsi.gvec);
> +
> +            if ( !vcpu )
> +                dprintk(XENLOG_G_WARNING,
> +                        "dom%u: failed to find the dest vCPU for PI, guest "
> +                        "vector:0x%x use software way to deliver the "
> +                        " interrupts.\n", d->domain_id, pirq_dpci->gmsi.gvec);
> +            else if ( pi_update_irte( vcpu, info, pirq_dpci->gmsi.gvec ) != 0 )
> +                dprintk(XENLOG_G_WARNING,
> +                        "%pv: failed to update PI IRTE, guest vector:0x%x "
> +                        "use software way to deliver the interrupts.\n",
> +                        vcpu, pirq_dpci->gmsi.gvec);

%#x instead of 0x%x generally please. For vectors, however,
%02x please. Also please don't break long format strings, so one
can grep for them. I'd suggest shortening them as much as possible
anyway (namely drop "use software way to deliver the interrupts.").
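
For reference, a small sketch of what these conversions produce, using
standard snprintf() rather than Xen's printk (which is assumed here to
follow the same rules for these specifiers). fmt_hex() and
fmt_pi_warning() are hypothetical helpers, and the message text is only
an example of a shortened, unbroken format string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* "%#x" adds the 0x prefix itself (and prints plain "0" for zero). */
static void fmt_hex(char *buf, size_t len, unsigned int val)
{
    assert(buf && len);
    snprintf(buf, len, "%#x", val);
}

/* Vectors as two hex digits; the format string stays on one line so it
 * can be found with grep. */
static void fmt_pi_warning(char *buf, size_t len, unsigned int domid,
                           unsigned int gvec)
{
    assert(buf && len);
    snprintf(buf, len, "dom%u: failed to find the dest vCPU for PI, guest vector %02x", domid, gvec);
}
```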

Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 13/15] vmx: Properly handle notification event when vCPU is running
  2015-06-24  5:18 ` [v3 13/15] vmx: Properly handle notification event when vCPU is running Feng Wu
  2015-07-08 11:03   ` Tian, Kevin
@ 2015-07-10 14:40   ` Jan Beulich
  1 sibling, 0 replies; 155+ messages in thread
From: Jan Beulich @ 2015-07-10 14:40 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel,
	yang.z.zhang

>>> On 24.06.15 at 07:18, <feng.wu@intel.com> wrote:
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -1896,6 +1896,59 @@ static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
>      this_cpu(irq_count)++;
>  }
>  
> +/*
> + * Handle VT-d posted-interrupt when VCPU is running.
> + */
> +
> +static void pi_notification_interrupt(struct cpu_user_regs *regs)

Stray blank line and improper multi-line comment.

> +{
> +    /*
> +     * We get here when a vCPU is running in root-mode
> +     * (such as via hypercall, or any other reasons which
> +     * can result in VM-Exit), and before vCPU is back to
> +     * non-root, external interrupts from an assigned
> +     * device happen and a notification event is delivered
> +     * to this logical CPU.
> +     *
> +     * we need to set VCPU_KICK_SOFTIRQ for the current
> +     * cpu, just like __vmx_deliver_posted_interrupt().
> +     *
> +     * So the pending interrupt in PIRR will be synced to
> +     * vIRR before VM-Exit in time.

Please make better use of the 80 characters per line, making
the textual description better stand out from the following
quoted code.

Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-06-24  5:18 ` [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling Feng Wu
       [not found]   ` <55918214.4030102@citrix.com>
  2015-07-08 11:24   ` Tian, Kevin
@ 2015-07-10 14:48   ` Jan Beulich
  2 siblings, 0 replies; 155+ messages in thread
From: Jan Beulich @ 2015-07-10 14:48 UTC (permalink / raw)
  To: Feng Wu
  Cc: kevin.tian, keir, george.dunlap, andrew.cooper3, xen-devel,
	yang.z.zhang

>>> On 24.06.15 at 07:18, <feng.wu@intel.com> wrote:
> --- a/xen/arch/x86/hvm/hvm.c
> +++ b/xen/arch/x86/hvm/hvm.c
> @@ -6475,6 +6475,12 @@ enum hvm_intblk nhvm_interrupt_blocked(struct vcpu *v)
>      return hvm_funcs.nhvm_intr_blocked(v);
>  }
>  
> +void arch_pi_desc_update(struct vcpu *v, int old_state)
> +{
> +    if ( is_hvm_vcpu(v) && hvm_funcs.pi_desc_update )
> +        hvm_funcs.pi_desc_update(v, old_state);
> +}

Shouldn't this use has_hvm_container_vcpu()?

> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -168,6 +168,7 @@ static int vmx_vcpu_initialise(struct vcpu *v)
>  
>      INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
>  
> +    v->arch.hvm_vmx.pi_block_cpu = -1;

The field should be of unsigned type, at which point it might be better
(but isn't required) to store NR_CPUS here (and below), and check
against that value further down.
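
A minimal sketch of that idea, with hypothetical names: an unsigned field
that uses NR_CPUS as the "not on any list" marker instead of -1. The
struct, the helper names and the NR_CPUS value are assumptions made for
illustration, not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 256u /* assumed value for the sketch */

struct pi_block_state {
    unsigned int pi_block_cpu; /* NR_CPUS means "not blocked" */
};

static void pi_block_init(struct pi_block_state *s)
{
    s->pi_block_cpu = NR_CPUS;
}

static void pi_block_set(struct pi_block_state *s, unsigned int cpu)
{
    assert(cpu < NR_CPUS);
    s->pi_block_cpu = cpu;
}

/* Any value below NR_CPUS is a valid CPU number. */
static bool pi_block_is_set(const struct pi_block_state *s)
{
    return s->pi_block_cpu < NR_CPUS;
}
```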

> @@ -1778,6 +1779,124 @@ static void vmx_handle_eoi(u8 vector)
>      __vmwrite(GUEST_INTR_STATUS, status);
>  }
>  
> +static void vmx_pi_desc_update(struct vcpu *v, int old_state)
> +{
> +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +    struct pi_desc old, new;
> +    unsigned long flags;
> +
> +    ASSERT(iommu_intpost);
> +
> +    switch ( v->runstate.state )
> +    {
> +    case RUNSTATE_runnable:
> +    case RUNSTATE_offline:
> +        /*
> +         * We don't need to send notification event to a non-running
> +         * vcpu, the interrupt information will be delivered to it before
> +         * VM-ENTRY when the vcpu is scheduled to run next time.
> +         */
> +        pi_set_sn(pi_desc);
> +
> +        /*
> +         * If the state is transferred from RUNSTATE_blocked,
> +         * we should set 'NV' field back to posted_intr_vector,
> +         * so the Posted-Interrupts can be delivered to the vCPU
> +         * by VT-d HW after it is scheduled to run.
> +         */
> +        if ( old_state == RUNSTATE_blocked )
> +        {
> +            write_atomic((uint8_t*)&new.nv, posted_intr_vector);
> +
> +            /*
> +             * Delete the vCPU from the related block list
> +             * if we are resuming from blocked state
> +             */
> +            ASSERT(v->arch.hvm_vmx.pi_block_cpu != -1);
> +            spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> +                              v->arch.hvm_vmx.pi_block_cpu), flags);
> +            list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
> +            spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
> +                                   v->arch.hvm_vmx.pi_block_cpu), flags);
> +            v->arch.hvm_vmx.pi_block_cpu = -1;
> +        }
> +        break;
> +
> +    case RUNSTATE_blocked:
> +        ASSERT(v->arch.hvm_vmx.pi_block_cpu == -1);
> +
> +        /*
> +         * The vCPU is blocked on the block list. Add the blocked
> +         * vCPU on the list of the v->arch.hvm_vmx.pi_block_cpu,
> +         * which is the destination of the wake-up notification event.
> +         */
> +        v->arch.hvm_vmx.pi_block_cpu = v->processor;
> +        spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> +                          v->arch.hvm_vmx.pi_block_cpu), flags);
> +        list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
> +                      &per_cpu(pi_blocked_vcpu, v->arch.hvm_vmx.pi_block_cpu));
> +        spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
> +                               v->arch.hvm_vmx.pi_block_cpu), flags);
> +
> +        do {
> +            old.control = new.control = pi_desc->control;
> +
> +            /*
> +             * We should not block the vCPU if
> +             * an interrupt was posted for it.
> +             */
> +
> +            if ( old.on )
> +            {
> +                /*
> +                 * The vCPU will be removed from the block list
> +                 * during its state transferring from RUNSTATE_blocked
> +                 * to RUNSTATE_runnable after the following tasklet
> +                 * is executed.
> +                 */
> +                tasklet_schedule(&v->arch.hvm_vmx.pi_vcpu_wakeup_tasklet);
> +                return;

"break" here has the same effect and looks better overall (single
function return point).

> +            }
> +
> +            /*
> +             * Change the 'NDST' field to v->arch.hvm_vmx.pi_block_cpu,
> +             * so when external interrupts from assigned devices happen,
> +             * wakeup notification event will go to
> +             * v->arch.hvm_vmx.pi_block_cpu, then in pi_wakeup_interrupt()
> +             * we can find the vCPU in the right list to wake up.
> +             */
> +            if ( x2apic_enabled )
> +                new.ndst = cpu_physical_id(v->arch.hvm_vmx.pi_block_cpu);
> +            else
> +                new.ndst = MASK_INSR(cpu_physical_id(
> +                                     v->arch.hvm_vmx.pi_block_cpu),
> +                                     PI_xAPIC_NDST_MASK);

This isn't the first time I see this - please add a helper inline function.

> +            new.sn = 0;
> +            new.nv = pi_wakeup_vector;
> +        } while ( cmpxchg(&pi_desc->control, old.control, new.control)
> +                  != old.control );
> +        break;
> +
> +    case RUNSTATE_running:
> +        ASSERT( pi_desc->sn == 1 );

Stray blanks.

> +        if ( x2apic_enabled )
> +            write_atomic(&new.ndst, cpu_physical_id(v->processor));
> +        else
> +            write_atomic(&new.ndst,
> +                         MASK_INSR(cpu_physical_id(v->processor),
> +                         PI_xAPIC_NDST_MASK));

Ah, another use case for that helper...
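
Such a helper could look roughly like the sketch below. The name
pi_ndst() is hypothetical, and the MASK_INSR() and PI_xAPIC_NDST_MASK
definitions are local copies assumed to mirror the ones the patch relies
on; in the real code the x2APIC flag would come from x2apic_enabled and
the APIC ID from cpu_physical_id().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match the definitions used by the series. */
#define MASK_INSR(v, m)    (((v) * ((m) & -(m))) & (m))
#define PI_xAPIC_NDST_MASK 0xFF00u

/*
 * Compute the NDST value for a destination APIC ID: the full 32-bit ID
 * in x2APIC mode, the 8-bit ID shifted into bits 15:8 in xAPIC mode.
 */
static inline uint32_t pi_ndst(bool x2apic, uint32_t apic_id)
{
    assert(x2apic || apic_id <= 0xff);

    return x2apic ? apic_id : MASK_INSR(apic_id, PI_xAPIC_NDST_MASK);
}
```

Both call sites would then reduce to a single expression, for example
new.ndst = pi_ndst(x2apic_enabled, cpu_physical_id(cpu)).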

Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-10 11:05                             ` Dario Faggioli
@ 2015-07-14  5:44                               ` Wu, Feng
  2015-07-14 14:08                               ` Wu, Feng
  1 sibling, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-14  5:44 UTC (permalink / raw)
  To: Dario Faggioli, Jan Beulich, George Dunlap
  Cc: Tian, Kevin, keir@xen.org, andrew.cooper3@citrix.com, xen-devel,
	Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Friday, July 10, 2015 7:06 PM
> To: Jan Beulich
> Cc: Wu, Feng; andrew.cooper3@citrix.com; George Dunlap; Tian, Kevin; Zhang,
> Yang Z; xen-devel; keir@xen.org
> Subject: Re: [Xen-devel] Fwd: [v3 14/15] Update Posted-Interrupts Descriptor
> during vCPU scheduling
> 
> On Fri, 2015-07-10 at 07:22 +0100, Jan Beulich wrote:
> > >>> On 10.07.15 at 07:59, <feng.wu@intel.com> wrote:
> > > If you agree with doing all this in a central place, maybe we can create
> > > an arch hook for 'struct scheduler' to do this and call it in all the places
> > > vcpu_runstate_change() gets called. What is your opinion about this?
> >
> > Doing this in a central place is certainly the right approach, but
> > adding an arch hook that needs to be called everywhere
> > vcpu_runstate_change() wouldn't serve that purpose.
> >
> Indeed.
> 
> > Instead
> > we'd need to replace all current vcpu_runstate_change() calls
> > with calls to a new function calling both this and the to be added
> > arch hook.
> >
> Well, I also see the value of having this done in one place, but not to
> the point of adding something like this.
> 
> > But please wait for George's / Dario's feedback, because they
> > seem to be even less convinced than me about your model of
> > tying the updates to runstate changes.
> >
> Indeed. George stated very well the reason why vcpu_runstate_change()
> should not be used, and suggested arch hooks to be added in the relevant
> places. I particularly like this idea as, not only it would leave
> vcpu_runstate_change() alone, but it would also help disentangling this
> from runstates, which, IMO, is also important.
> 
> So, can we identify the state (runstate? :-/) transitions that needs
> intercepting, and find a suitable place where to place hooks? I mean,
> something like this:
> 
>  - running-->blocked: can be handled in the arch specific part of
>                       context switch (similarly to CMT/CAT, which
>                       already hooks into there). So, in this case, no
>                       need to add any hook, as arch specific code is
>                       called already;
> 
>  - running-->runnable: same as above;
> 
>  - running-->offline: not sure if you need to take action on this. If
>                       yes, context switch should be fine as well;
> 
>  - blocked-->runnable: I think we need this, don't we? If yes, we
>                        probably want an arch hook in vcpu_wake();
> 
>  - blocked-->offline: do you need it? Well, the hook in wake should work
>                       for this as well, if yes;
> 
>  - runnable/running-->offline: if necessary, we want an hook in
>                                vcpu_sleep_nosync().
> 
> Another way to look at this, less biased toward runstates (i.e., what
> I've been asking for since a while), would be:
> 
>  - do you need to perform an action upon context switch (on prev and/or
>    next vcpu)? If yes, there's an arch specific path in there already;
>  - do you need to perform an action when a vcpu wakes-up? If yes, we
>    need an arch hook in vcpu_wake();
>  - do you need to perform an action when a vcpu goes to sleep? If yes,
>    we need an arch hook in vcpu_sleep_nosync();
> 
> I think this makes a more than fair solution. I happen to like it even
> better than the centralized approach, actually! That is for personal
> taste, but also because I think it may be useful for others too, in
> future, to be able to execute arch specific code, e.g., upon wakes-up,
> in which case we will be able to use the hook that we're introducing
> here for PI.
> 

Hi George, any ideas about this?

Thanks,
Feng

> Thanks and Regards,
> Dario
> 
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-10 11:05                             ` Dario Faggioli
  2015-07-14  5:44                               ` Wu, Feng
@ 2015-07-14 14:08                               ` Wu, Feng
  2015-07-14 14:54                                 ` Jan Beulich
  2015-07-14 16:02                                 ` Dario Faggioli
  1 sibling, 2 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-14 14:08 UTC (permalink / raw)
  To: Dario Faggioli, Jan Beulich
  Cc: Tian, Kevin, keir@xen.org, George Dunlap,
	andrew.cooper3@citrix.com, xen-devel, Zhang, Yang Z, Wu, Feng



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Friday, July 10, 2015 7:06 PM
> To: Jan Beulich
> Cc: Wu, Feng; andrew.cooper3@citrix.com; George Dunlap; Tian, Kevin; Zhang,
> Yang Z; xen-devel; keir@xen.org
> Subject: Re: [Xen-devel] Fwd: [v3 14/15] Update Posted-Interrupts Descriptor
> during vCPU scheduling
> 
> On Fri, 2015-07-10 at 07:22 +0100, Jan Beulich wrote:
> > >>> On 10.07.15 at 07:59, <feng.wu@intel.com> wrote:
> > > If you agree with doing all this in a central place, maybe we can create
> > > an arch hook for 'struct scheduler' to do this and call it in all the places
> > > vcpu_runstate_change() gets called. What is your opinion about this?
> >
> > Doing this in a central place is certainly the right approach, but
> > adding an arch hook that needs to be called everywhere
> > vcpu_runstate_change() wouldn't serve that purpose.
> >
> Indeed.
> 
> > Instead
> > we'd need to replace all current vcpu_runstate_change() calls
> > with calls to a new function calling both this and the to be added
> > arch hook.
> >
> Well, I also see the value of having this done in one place, but not to
> the point of adding something like this.
> 
> > But please wait for George's / Dario's feedback, because they
> > seem to be even less convinced than me about your model of
> > tying the updates to runstate changes.
> >
> Indeed. George stated very well the reason why vcpu_runstate_change()
> should not be used, and suggested arch hooks to be added in the relevant
> places. I particularly like this idea as, not only it would leave
> vcpu_runstate_change() alone, but it would also help disentangling this
> from runstates, which, IMO, is also important.
> 
> So, can we identify the state (runstate? :-/) transitions that needs
> intercepting, and find a suitable place where to place hooks? I mean,
> something like this:
> 
>  - running-->blocked: can be handled in the arch specific part of
>                       context switch (similarly to CMT/CAT, which
>                       already hooks into there). So, in this case, no
>                       need to add any hook, as arch specific code is
>                       called already;
> 
>  - running-->runnable: same as above;
> 
>  - running-->offline: not sure if you need to take action on this. If
>                       yes, context switch should be fine as well;
> 
>  - blocked-->runnable: I think we need this, don't we? If yes, we
>                        probably want an arch hook in vcpu_wake();
> 
>  - blocked-->offline: do you need it? Well, the hook in wake should work
>                       for this as well, if yes;
> 
>  - runnable/running-->offline: if necessary, we want an hook in
>                                vcpu_sleep_nosync().
> 
> Another way to look at this, less biased toward runstates (i.e., what
> I've been asking for since a while), would be:
> 
>  - do you need to perform an action upon context switch (on prev and/or
>    next vcpu)? If yes, there's an arch specific path in there already;
>  - do you need to perform an action when a vcpu wakes-up? If yes, we
>    need an arch hook in vcpu_wake();
>  - do you need to perform an action when a vcpu goes to sleep? If yes,
>    we need an arch hook in vcpu_sleep_nosync();
> 
> I think this makes a more than fair solution. I happen to like it even
> better than the centralized approach, actually! That is for personal
> taste, but also because I think it may be useful for others too, in
> future, to be able to execute arch specific code, e.g., upon wakes-up,
> in which case we will be able to use the hook that we're introducing
> here for PI.
> 
> Thanks and Regards,
> Dario

Hi Dario,

Thanks for the suggestion! I made a draft patch for this idea. It may have
some issues since it is only a prototype; I am posting it here to check
whether it meets your expectations. If it does, I can continue in this
direction, which may speed up the upstreaming process. Thanks a lot!

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 6eebc1a..7e678c8 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -740,6 +740,81 @@ static void vmx_ctxt_switch_from(struct vcpu *v)
     vmx_save_guest_msrs(v);
     vmx_restore_host_msrs();
     vmx_save_dr(v);
+
+    if ( iommu_intpost )
+    {
+        struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+        struct pi_desc old, new;
+        unsigned long flags;
+
+        if ( vcpu_runnable(v) || !test_bit(_VPF_blocked, &v->pause_flags) )
+        {
+            /*
+             * The vCPU is preempted or put to sleep. We don't need to send
+             * notification event to a non-running vcpu, the interrupt
+             * information will be delivered to it before VM-ENTRY when
+             * the vcpu is scheduled to run next time.
+             */
+            pi_set_sn(pi_desc);
+
+        }
+        else if ( test_bit(_VPF_blocked, &v->pause_flags) )
+        {
+            /* The vCPU is blocked */
+            ASSERT(v->arch.hvm_vmx.pi_block_cpu == -1);
+
+            /*
+             * The vCPU is blocked on the block list. Add the blocked
+             * vCPU on the list of the v->arch.hvm_vmx.pi_block_cpu,
+             * which is the destination of the wake-up notification event.
+             */
+            v->arch.hvm_vmx.pi_block_cpu = v->processor;
+            spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+                              v->arch.hvm_vmx.pi_block_cpu), flags);
+            list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
+                          &per_cpu(pi_blocked_vcpu, v->arch.hvm_vmx.pi_block_cpu));
+            spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
+                               v->arch.hvm_vmx.pi_block_cpu), flags);
+
+            do {
+                old.control = new.control = pi_desc->control;
+
+                /*
+                 * We should not block the vCPU if
+                 * an interrupt was posted for it.
+                 */
+
+                if ( old.on )
+                {
+                    /*
+                     * The vCPU will be removed from the block list
+                     * during its state transferring from RUNSTATE_blocked
+                     * to RUNSTATE_runnable after the following tasklet
+                     * is executed.
+                     */
+                    tasklet_schedule(&v->arch.hvm_vmx.pi_vcpu_wakeup_tasklet);
+                    return;
+                }
+
+                /*
+                 * Change the 'NDST' field to v->arch.hvm_vmx.pi_block_cpu,
+                 * so when external interrupts from assigned devices happen,
+                 * wakeup notification event will go to
+                 * v->arch.hvm_vmx.pi_block_cpu, then in pi_wakeup_interrupt()
+                 * we can find the vCPU in the right list to wake up.
+                 */
+                if ( x2apic_enabled )
+                    new.ndst = cpu_physical_id(v->arch.hvm_vmx.pi_block_cpu);
+                else
+                    new.ndst = MASK_INSR(cpu_physical_id(
+                                     v->arch.hvm_vmx.pi_block_cpu),
+                                     PI_xAPIC_NDST_MASK);
+                new.sn = 0;
+                new.nv = pi_wakeup_vector;
+            } while ( cmpxchg(&pi_desc->control, old.control, new.control)
+                      != old.control );
+        }
+    }
 }

 static void vmx_ctxt_switch_to(struct vcpu *v)
@@ -764,6 +839,22 @@ static void vmx_ctxt_switch_to(struct vcpu *v)

     vmx_restore_guest_msrs(v);
     vmx_restore_dr(v);
+
+    if ( iommu_intpost )
+    {
+        struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+
+        ASSERT( pi_desc->sn == 1 );
+
+        if ( x2apic_enabled )
+            write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
+        else
+            write_atomic(&pi_desc->ndst,
+                         MASK_INSR(cpu_physical_id(v->processor),
+                         PI_xAPIC_NDST_MASK));
+
+        pi_clear_sn(pi_desc);
+    }
 }


@@ -1900,6 +1991,42 @@ static void vmx_pi_desc_update(struct vcpu *v, int old_state)
     }
 } 
+void arch_vcpu_wake(struct vcpu *v)
+{
+    if ( !iommu_intpost || (v->runstate.state != RUNSTATE_blocked) )
+        return;
+
+    if ( likely(vcpu_runnable(v)) ||
+         !test_bit(_VPF_blocked, &v->pause_flags) )
+    {
+        struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+        unsigned long flags;
+
+        /*
+         * blocked -> runnable/offline
+         * If the state is transferred from RUNSTATE_blocked,
+         * we should set 'NV' field back to posted_intr_vector,
+         * so the Posted-Interrupts can be delivered to the vCPU
+         * by VT-d HW after it is scheduled to run.
+         */
+        write_atomic((uint8_t*)&pi_desc->nv, posted_intr_vector);
+
+        /*
+         * Delete the vCPU from the related block list
+         * if we are resuming from blocked state
+         */
+        if (v->arch.hvm_vmx.pi_block_cpu != -1)
+        {
+            spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+                              v->arch.hvm_vmx.pi_block_cpu), flags);
+            list_del(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
+            spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
+                                   v->arch.hvm_vmx.pi_block_cpu), flags);
+            v->arch.hvm_vmx.pi_block_cpu = -1;
+        }
+    }
+}
+
 void vmx_hypervisor_cpuid_leaf(uint32_t sub_idx,
                                uint32_t *eax, uint32_t *ebx,
                                uint32_t *ecx, uint32_t *edx)
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 20727d6..7b08797 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -397,6 +397,8 @@ void vcpu_wake(struct vcpu *v)
             vcpu_runstate_change(v, RUNSTATE_offline, NOW());
     }

+    arch_vcpu_wake(v);
+
     vcpu_schedule_unlock_irqrestore(lock, flags, v);

     TRACE_2D(TRC_SCHED_WAKE, v->domain->domain_id, v->vcpu_id);
diff --git a/xen/include/asm-arm/domain.h b/xen/include/asm-arm/domain.h
index 9603cf0..be5aebf 100644
--- a/xen/include/asm-arm/domain.h
+++ b/xen/include/asm-arm/domain.h
@@ -266,6 +266,7 @@ static inline unsigned int domain_max_vcpus(const struct domain *d)
 }

 static void arch_pi_desc_update(struct vcpu *v, int old_state) {}
+static void arch_vcpu_wake(struct vcpu *v) {}

 #endif /* __ASM_DOMAIN_H__ */

diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index e175417..38c796c 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -511,6 +511,7 @@ bool_t nhvm_vmcx_hap_enabled(struct vcpu *v);
 enum hvm_intblk nhvm_interrupt_blocked(struct vcpu *v);

 void arch_pi_desc_update(struct vcpu *v, int old_state);
+void arch_vcpu_wake(struct vcpu *v);

 #ifndef NDEBUG
 /* Permit use of the Forced Emulation Prefix in HVM guests */



> 
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply related	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-14 14:08                               ` Wu, Feng
@ 2015-07-14 14:54                                 ` Jan Beulich
  2015-07-14 15:20                                   ` Dario Faggioli
  2015-07-14 16:02                                 ` Dario Faggioli
  1 sibling, 1 reply; 155+ messages in thread
From: Jan Beulich @ 2015-07-14 14:54 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir@xen.org, George Dunlap,
	andrew.cooper3@citrix.com, Dario Faggioli, xen-devel,
	Yang Z Zhang

>>> On 14.07.15 at 16:08, <feng.wu@intel.com> wrote:
> Thanks for the suggestion! I made a draft patch for this idea, It may have
> some issues since It is just a draft version, kind of like prototype, I post
> it here just like to know whether it is meet your expectation, if it is I
> can continue with this direction and this may speed up the upstreaming
> process. Thanks a lot!

FWIW this looks okay to me as a draft (i.e. minus mechanical issues).
If it meets your requirements, I think this would nicely eliminate all the
objections against the earlier model. But let's see what Dario and
George think...

Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-14 14:54                                 ` Jan Beulich
@ 2015-07-14 15:20                                   ` Dario Faggioli
  2015-07-14 16:41                                     ` George Dunlap
  0 siblings, 1 reply; 155+ messages in thread
From: Dario Faggioli @ 2015-07-14 15:20 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, Feng Wu, George Dunlap, andrew.cooper3@citrix.com,
	xen-devel, Yang Z Zhang, keir@xen.org


[-- Attachment #1.1: Type: text/plain, Size: 1334 bytes --]

On Tue, 2015-07-14 at 15:54 +0100, Jan Beulich wrote:
> >>> On 14.07.15 at 16:08, <feng.wu@intel.com> wrote:
> > Thanks for the suggestion! I made a draft patch for this idea, It may have
> > some issues since It is just a draft version, kind of like prototype, I post
> > it here just like to know whether it is meet your expectation, if it is I
> > can continue with this direction and this may speed up the upstreaming
> > process. Thanks a lot!
> 
> FWIW this looks okay to me as a draft (i.e. minus mechanical issues).
> If it meets your requirements, I think this would nicely eliminate all the
> objections against the earlier model. But let's see what Dario and
> George think...
> 
I'll reply to Feng's email in more detail (even considering that
it's a draft), but yes, indeed, this looks *a lot* nicer, a way better
way of interacting with the scheduler!

The approach is exactly the one I had in mind and was asking Feng to
investigate, and I like the result much better than the old
runstate-based hack-ish model. :-)

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-14 14:08                               ` Wu, Feng
  2015-07-14 14:54                                 ` Jan Beulich
@ 2015-07-14 16:02                                 ` Dario Faggioli
  2015-07-15  0:54                                   ` Wu, Feng
  2015-07-17  7:46                                   ` Wu, Feng
  1 sibling, 2 replies; 155+ messages in thread
From: Dario Faggioli @ 2015-07-14 16:02 UTC (permalink / raw)
  To: Wu, Feng
  Cc: Tian, Kevin, keir@xen.org, George Dunlap,
	andrew.cooper3@citrix.com, xen-devel, Jan Beulich, Zhang, Yang Z


[-- Attachment #1.1: Type: text/plain, Size: 8880 bytes --]

On Tue, 2015-07-14 at 14:08 +0000, Wu, Feng wrote:
> 
> > -----Original Message-----
> > From: Dario Faggioli [mailto:dario.faggioli@citrix.com]

> >  - do you need to perform an action upon context switch (on prev and/or
> >    next vcpu)? If yes, there's an arch specific path in there already;
> >  - do you need to perform an action when a vcpu wakes-up? If yes, we
> >    need an arch hook in vcpu_wake();
> >  - do you need to perform an action when a vcpu goes to sleep? If yes,
> >    we need an arch hook in vcpu_sleep_nosync();
> > 
> > I think this makes a more than fair solution. I happen to like it even
> > better than the centralized approach, actually! That is for personal
> > taste, but also because I think it may be useful for others too, in
> > future, to be able to execute arch specific code, e.g., upon wakes-up,
> > in which case we will be able to use the hook that we're introducing
> > here for PI.
> > 
> > Thanks and Regards,
> > Dario
> 
> Hi Dario,
> 
Hi,

> Thanks for the suggestion! I made a draft patch for this idea, 
>
Great!

> It may have
> some issues since It is just a draft version, kind of like prototype, I post
> it here just like to know whether it is meet your expectation, if it is I
> can continue with this direction and this may speed up the upstreaming
> process.
>
Yes, I think this is a good approach, and the proper way for this
feature to interact with the scheduler.

I appreciate it is a draft, so I'm not performing a thorough review, but
I'll try to at least give some comments, in the hope that it helps.

> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 6eebc1a..7e678c8 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -740,6 +740,81 @@ static void vmx_ctxt_switch_from(struct vcpu *v)
>      vmx_save_guest_msrs(v);
>      vmx_restore_host_msrs();
>      vmx_save_dr(v);
> +
> +    if ( iommu_intpost )
> +    {
>
I'd put a helper together ( vmx_<something>_pi() ) and put the body of
this if in it.

Then, either just call it unconditionally from here and have, in the
helper, something like this:

 if ( !iommu_intpost )
   return;

Or just have this in here:

 if ( iommu_intpost )
  vmx_<something>_pi();
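
A minimal sketch of the first variant (early return inside the helper),
assuming a plain boolean stands in for Xen's iommu_intpost and that the
helper is called vmx_pre_ctx_switch_pi(); that name and the pi_updates
counter are hypothetical and only make the sketch observable:

```c
#include <stdbool.h>

/* Stand-in for Xen's global iommu_intpost flag (assumption). */
static bool iommu_intpost;

/* Counts how often the PI work ran; exists only for this sketch. */
static int pi_updates;

/* Hypothetical name; the thread leaves it as vmx_<something>_pi(). */
static void vmx_pre_ctx_switch_pi(void)
{
    if ( !iommu_intpost )
        return;

    /* The body of the original "if ( iommu_intpost )" block goes here. */
    pi_updates++;
}

/* Simplified caller: the switch-out path calls the helper unconditionally. */
static void ctxt_switch_from(void)
{
    vmx_pre_ctx_switch_pi();
}
```

With this shape the caller stays a one-liner, and the feature check lives
next to the code it guards.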

> +        struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +        struct pi_desc old, new;
> +        unsigned long flags;
> +
> +        if ( vcpu_runnable(v) || !test_bit(_VPF_blocked, &v->pause_flags) )
> +        {
>
Aha! So, AFAICT, this means we can deal with preemptions, sleeps and
blockings (as can be seen below) here in _ctxt_switch_from, i.e., we
don't have to call into this code from vcpu_sleep_nosync(), like we were
when tying this to vcpu_runstate_change()... nice! :-D

> +            /*
> +             * The vCPU is preempted or sleeped. 
>
"has been preempted or went to sleep" ?

> We don't need to send
> +             * notification event to a non-running vcpu, the interrupt
> +             * information will be delivered to it before VM-ENTRY when
> +             * the vcpu is scheduled to run next time.
> +             */
> +            pi_set_sn(pi_desc);
> +
> +        }
> +        else if ( test_bit(_VPF_blocked, &v->pause_flags) )
> +        {
> +            /* The vCPU is blocked */
>
This comment does not add much, I'd kill it.

> +            ASSERT(v->arch.hvm_vmx.pi_block_cpu == -1);
> +
> +            /*
> +             * The vCPU is blocked on the block list. 
>
What about "The vCPU is blocking, we need to add it to one of the per
pCPU lists."

> Add the blocked
> +             * vCPU on the list of the v->arch.hvm_vmx.pi_block_cpu,
>
What you're doing seems more "Add the vCPU to the blocked list of
v->processor, which will be the target of the wake-up notification".

> +             * which is the destination of the wake-up notification event.
> +             */
> +            v->arch.hvm_vmx.pi_block_cpu = v->processor;
> +            spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> +                              v->arch.hvm_vmx.pi_block_cpu), flags);
> +            list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
> +                          &per_cpu(pi_blocked_vcpu, v->arch.hvm_vmx.pi_block_cpu));
> +            spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
> +                               v->arch.hvm_vmx.pi_block_cpu), flags);
> +
> +            do {
> +                old.control = new.control = pi_desc->control;
> +
> +                /*
> +                 * We should not block the vCPU if
> +                 * an interrupt was posted for it.
> +                 */
> +
> +                if ( old.on )
> +                {
> +                    /*
> +                     * The vCPU will be removed from the block list
> +                     * during its state transferring from RUNSTATE_blocked
> +                     * to RUNSTATE_runnable after the following tasklet
> +                     * is executed.
>
We can avoid referencing RUNSTATEs at all, can't we? Just say something
about the vCPU leaving the blocked vCPUs list on the wake-up path.

> +                     */
> +                    tasklet_schedule(&v->arch.hvm_vmx.pi_vcpu_wakeup_tasklet);
> +                    return;
> +                }
> +
> +                /*
> +                 * Change the 'NDST' field to v->arch.hvm_vmx.pi_block_cpu,
> +                 * so when external interrupts from assigned devices happen,
> +                 * wakeup notification event will go to
> +                 * v->arch.hvm_vmx.pi_block_cpu, then in pi_wakeup_interrupt()
> +                 * we can find the vCPU in the right list to wake up.
> +                 */
> +                if ( x2apic_enabled )
> +                    new.ndst = cpu_physical_id(v->arch.hvm_vmx.pi_block_cpu);
> +                else
> +                    new.ndst = MASK_INSR(cpu_physical_id(
> +                                     v->arch.hvm_vmx.pi_block_cpu),
> +                                     PI_xAPIC_NDST_MASK);
> +                new.sn = 0;
> +                new.nv = pi_wakeup_vector;
> +            } while ( cmpxchg(&pi_desc->control, old.control, new.control)
> +                      != old.control );
> +        }
> +    }
ISTR Jan had some comments on this code (variable names, etc.). It
probably goes without saying that those still apply.

>  static void vmx_ctxt_switch_to(struct vcpu *v)
> @@ -764,6 +839,22 @@ static void vmx_ctxt_switch_to(struct vcpu *v)
> 
>      vmx_restore_guest_msrs(v);
>      vmx_restore_dr(v);
> +
> +    if ( iommu_intpost )
> +    {
>
You may consider having a helper for this too, for symmetry with the
above case, but this is less of an issue, IMO.

> +        struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +
> +        ASSERT( pi_desc->sn == 1 );
                  ^space

Above you wrote:

  ASSERT(v->arch.hvm_vmx.pi_block_cpu == -1);
         ^no space

Please, pick up one format (ideally, following suit from other
occurrences in the file, if any), and be consistent.

> +
> +        if ( x2apic_enabled )
> +            write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
> +        else
> +            write_atomic(&pi_desc->ndst,
> +                         MASK_INSR(cpu_physical_id(v->processor),
> +                         PI_xAPIC_NDST_MASK));
> +
> +        pi_clear_sn(pi_desc);
> +    }
>  }

> +void arch_vcpu_wake(struct vcpu *v)
> +{
> +    if ( !iommu_intpost || (v->runstate.state != RUNSTATE_blocked) )
> +        return;
> +
> +    if ( likely(vcpu_runnable(v)) ||
> +         !test_bit(_VPF_blocked, &v->pause_flags) )
> +    {
Invert this and bail if true? Well, a matter of taste, I guess... but it
will save one level of indentation.
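
A small sketch of the inversion, assuming two booleans stand in for
vcpu_runnable(v) and test_bit(_VPF_blocked, &v->pause_flags); the toy
struct and function names are hypothetical. By De Morgan, the negation of
"runnable || !blocked" is "!runnable && blocked", so both forms do the
same work:

```c
#include <stdbool.h>

/* Toy stand-in for the two predicates used in the draft (assumption). */
struct toy_vcpu {
    bool runnable;      /* models vcpu_runnable(v) */
    bool blocked_flag;  /* models test_bit(_VPF_blocked, &v->pause_flags) */
    int  wake_work;     /* incremented when the wake-up PI work runs */
};

/* Nested form, as in the draft. */
static void wake_nested(struct toy_vcpu *v)
{
    if ( v->runnable || !v->blocked_flag )
    {
        v->wake_work++;
    }
}

/* Inverted form: bail out early, the body sits one level shallower. */
static void wake_early_return(struct toy_vcpu *v)
{
    if ( !v->runnable && v->blocked_flag )
        return;

    v->wake_work++;
}
```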

> +        struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> +        unsigned long flags;
> +
> +        /*
> +         * blocked -> runnable/offline
> +         * If the state is transferred from RUNSTATE_blocked,
> +         * we should set 'NV' field back to posted_intr_vector,
> +         * so the Posted-Interrupts can be delivered to the vCPU
> +         * by VT-d HW after it is scheduled to run.
> +         */
>
Again, make the comment describe things in a RUNSTATE independent way
(e.g., in terms of 'generic states', like "it's preempted", "it's
blocked", "it's runnable"; or in terms of flags; or both).

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-14 15:20                                   ` Dario Faggioli
@ 2015-07-14 16:41                                     ` George Dunlap
  0 siblings, 0 replies; 155+ messages in thread
From: George Dunlap @ 2015-07-14 16:41 UTC (permalink / raw)
  To: Dario Faggioli, Jan Beulich
  Cc: Kevin Tian, keir@xen.org, andrew.cooper3@citrix.com, xen-devel,
	Yang Z Zhang, Feng Wu

On 07/14/2015 04:20 PM, Dario Faggioli wrote:
> On Tue, 2015-07-14 at 15:54 +0100, Jan Beulich wrote:
>>>>> On 14.07.15 at 16:08, <feng.wu@intel.com> wrote:
>>> Thanks for the suggestion! I made a draft patch for this idea, It may have
>>> some issues since It is just a draft version, kind of like prototype, I post
>>> it here just like to know whether it is meet your expectation, if it is I
>>> can continue with this direction and this may speed up the upstreaming
>>> process. Thanks a lot!
>>
>> FWIW this looks okay to me as a draft (i.e. minus mechanical issues).
>> If it meets your requirements, I think this would nicely eliminate all the
>> objections against the earlier model. But let's see what Dario and
>> George think...
>>
> I'll reply to Feng's email in more detail (even considering that
> it's a draft), but yes, indeed, this looks *a lot* nicer, a way better
> way of interacting with the scheduler!
> 
> The approach is exactly the one I had in mind and was asking Feng to
> investigate, and I like the result much better than the old
> runstate-based hack-ish model. :-)

Yes, I prefer this approach -- thank you.

 -George

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-14 16:02                                 ` Dario Faggioli
@ 2015-07-15  0:54                                   ` Wu, Feng
  2015-07-17  7:46                                   ` Wu, Feng
  1 sibling, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-15  0:54 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tian, Kevin, keir@xen.org, George Dunlap,
	andrew.cooper3@citrix.com, xen-devel, Jan Beulich, Zhang, Yang Z,
	Wu, Feng



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Wednesday, July 15, 2015 12:03 AM
> To: Wu, Feng
> Cc: Jan Beulich; Tian, Kevin; keir@xen.org; George Dunlap;
> andrew.cooper3@citrix.com; xen-devel; Zhang, Yang Z
> Subject: Re: [Xen-devel] Fwd: [v3 14/15] Update Posted-Interrupts Descriptor
> during vCPU scheduling
> 
> On Tue, 2015-07-14 at 14:08 +0000, Wu, Feng wrote:
> >
> > > -----Original Message-----
> > > From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> 
> > >  - do you need to perform an action upon context switch (on prev and/or
> > >    next vcpu)? If yes, there's an arch specific path in there already;
> > >  - do you need to perform an action when a vcpu wakes-up? If yes, we
> > >    need an arch hook in vcpu_wake();
> > >  - do you need to perform an action when a vcpu goes to sleep? If yes,
> > >    we need an arch hook in vcpu_sleep_nosync();
> > >
> > > I think this makes a more than fair solution. I happen to like it even
> > > better than the centralized approach, actually! That is for personal
> > > taste, but also because I think it may be useful for others too, in
> > > future, to be able to execute arch specific code, e.g., upon wakes-up,
> > > in which case we will be able to use the hook that we're introducing
> > > here for PI.
> > >
> > > Thanks and Regards,
> > > Dario
> >
> > Hi Dario,
> >
> Hi,
> 
> > Thanks for the suggestion! I made a draft patch for this idea,
> >
> Great!
> 
> > It may have
> > some issues since It is just a draft version, kind of like prototype, I post
> > it here just like to know whether it is meet your expectation, if it is I
> > can continue with this direction and this may speed up the upstreaming
> > process.
> >
> Yes, I think this is a good approach, and the proper way for this
> feature to interact with the scheduler.
> 
> I appreciate it is a draft, so I'm not performing a thorough review, but
> I'll try to at least give some comments, in the hope that it helps.

Thanks, any comments are good for my next post!

> 
> > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> > index 6eebc1a..7e678c8 100644
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -740,6 +740,81 @@ static void vmx_ctxt_switch_from(struct vcpu *v)
> >      vmx_save_guest_msrs(v);
> >      vmx_restore_host_msrs();
> >      vmx_save_dr(v);
> > +
> > +    if ( iommu_intpost )
> > +    {
> >
> I'd put a helper together ( vmx_<something>_pi() ) and put the body of
> this if in it.
> 
> Then, either just call it unconditionally from here and have, in the
> helper, something like this:
> 
>  if ( !iommu_intpost )
>    return;
> 
> Or just have this in here:
> 
>  if ( iommu_intpost )
>   vmx_<something>_pi();
> 
> > +        struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > +        struct pi_desc old, new;
> > +        unsigned long flags;
> > +
> > +        if ( vcpu_runnable(v) || !test_bit(_VPF_blocked,
> &v->pause_flags) )
> > +        {
> >
> Aha! So, AFAICT, this means we can deal with preemptions, sleeps and
> blockings (as can be seen below) here in _ctxt_switch_from,

Yes, we can handle the above cases here. From the runstate point of
view, this hook covers the following transitions:

running -> runnable
running -> blocked
running -> offline
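
As a rough sketch of how the draft's condition maps these three
transitions onto two actions (the enum and function names are
hypothetical, and the two booleans stand in for vcpu_runnable(v) and the
_VPF_blocked test):

```c
#include <stdbool.h>

/* What the switch-out path does with the PI descriptor (hypothetical names). */
enum pi_switch_action {
    PI_SUPPRESS_NOTIFICATION, /* set SN: preempted, sleeping or going offline */
    PI_ARM_WAKEUP             /* queue on the blocked list, switch NV/NDST */
};

static enum pi_switch_action pi_action_on_switch_out(bool runnable,
                                                     bool blocked_flag)
{
    if ( runnable || !blocked_flag )
        return PI_SUPPRESS_NOTIFICATION;

    /* Only reached when !runnable && blocked_flag. */
    return PI_ARM_WAKEUP;
}
```

Under these assumptions running -> runnable and running -> offline both
end up with SN set, and only running -> blocked arms the wake-up path.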

> i.e., we
> don't have to call into this code from vcpu_sleep_nosync(), like we were
> when tying this to vcpu_runstate_change()... nice! :-D

Yes, in vcpu_sleep_nosync() the main cases are:
runnable -> offline: we don't need to do anything for PI
running -> offline: covered here

So I think we don't need to add an arch hook in vcpu_sleep_nosync().

> 
> > +            /*
> > +             * The vCPU is preempted or sleeped.
> >
> "has been preempted or went to sleep" ?
> 
> > We don't need to send
> > +             * notification event to a non-running vcpu, the interrupt
> > +             * information will be delivered to it before VM-ENTRY when
> > +             * the vcpu is scheduled to run next time.
> > +             */
> > +            pi_set_sn(pi_desc);
> > +
> > +        }
> > +        else if ( test_bit(_VPF_blocked, &v->pause_flags) )
> > +        {
> > +            /* The vCPU is blocked */
> >
> This comment does not add much, I'd kill it.
> 
> > +            ASSERT(v->arch.hvm_vmx.pi_block_cpu == -1);
> > +
> > +            /*
> > +             * The vCPU is blocked on the block list.
> >
> What about "The vCPU is blocking, we need to add it to one of the per
> pCPU lists."
> 
> > Add the blocked
> > +             * vCPU on the list of the v->arch.hvm_vmx.pi_block_cpu,
> >
> What you're doing seems more "Add the vCPU to the blocked list of
> v->processor, which will be the target of the wake-up notification".

Yes, but v->arch.hvm_vmx.pi_block_cpu gets the value of v->processor here.
So maybe we can improve the description here.

> 
> > +             * which is the destination of the wake-up notification event.
> > +             */
> > +            v->arch.hvm_vmx.pi_block_cpu = v->processor;
> > +            spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
> > +                              v->arch.hvm_vmx.pi_block_cpu),
> flags);
> > +            list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
> > +                          &per_cpu(pi_blocked_vcpu,
> v->arch.hvm_vmx.pi_block_cpu));
> > +            spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
> > +                               v->arch.hvm_vmx.pi_block_cpu),
> flags);
> > +
> > +            do {
> > +                old.control = new.control = pi_desc->control;
> > +
> > +                /*
> > +                 * We should not block the vCPU if
> > +                 * an interrupt was posted for it.
> > +                 */
> > +
> > +                if ( old.on )
> > +                {
> > +                    /*
> > +                     * The vCPU will be removed from the block list
> > +                     * during its state transferring from
> RUNSTATE_blocked
> > +                     * to RUNSTATE_runnable after the following
> tasklet
> > +                     * is executed.
> >
> We can avoid referencing RUNSTATEs at all, can't we? Just say something
> about the vCPU leaving the blocked vCPUs list on the wake-up path.

Sure, I just copied this code over from the original version.

> 
> > +                     */
> > +
> tasklet_schedule(&v->arch.hvm_vmx.pi_vcpu_wakeup_tasklet);
> > +                    return;
> > +                }
> > +
> > +                /*
> > +                 * Change the 'NDST' field to
> v->arch.hvm_vmx.pi_block_cpu,
> > +                 * so when external interrupts from assigned devices
> happen,
> > +                 * wakeup notification event will go to
> > +                 * v->arch.hvm_vmx.pi_block_cpu, then in
> pi_wakeup_interrupt()
> > +                 * we can find the vCPU in the right list to wake up.
> > +                 */
> > +                if ( x2apic_enabled )
> > +                    new.ndst =
> cpu_physical_id(v->arch.hvm_vmx.pi_block_cpu);
> > +                else
> > +                    new.ndst = MASK_INSR(cpu_physical_id(
> > +
> v->arch.hvm_vmx.pi_block_cpu),
> > +                                     PI_xAPIC_NDST_MASK);
> > +                new.sn = 0;
> > +                new.nv = pi_wakeup_vector;
> > +            } while ( cmpxchg(&pi_desc->control, old.control,
> new.control)
> > +                      != old.control );
> > +        }
> > +    }
> ISTR Jan had some comments on this code (variable names, etc.). It
> probably goes without saying that those still apply.

Absolutely, I will address Jan's comments in the next version.

> 
> >  static void vmx_ctxt_switch_to(struct vcpu *v)
> > @@ -764,6 +839,22 @@ static void vmx_ctxt_switch_to(struct vcpu *v)
> >
> >      vmx_restore_guest_msrs(v);
> >      vmx_restore_dr(v);
> > +
> > +    if ( iommu_intpost )
> > +    {
> >
> You may consider having a helper for this too, for symmetry with the
> above case, but this is less of an issue, IMO.
> 
> > +        struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > +
> > +        ASSERT( pi_desc->sn == 1 );
>                   ^space
> 
> Above you wrote:
> 
>   ASSERT(v->arch.hvm_vmx.pi_block_cpu == -1);
>          ^no space
> 
> Please, pick up one format (ideally, following suit from other
> occurrences in the file, if any), and be consistent.
> 
> > +
> > +        if ( x2apic_enabled )
> > +            write_atomic(&pi_desc->ndst,
> cpu_physical_id(v->processor));
> > +        else
> > +            write_atomic(&pi_desc->ndst,
> > +                         MASK_INSR(cpu_physical_id(v->processor),
> > +                         PI_xAPIC_NDST_MASK));
> > +
> > +        pi_clear_sn(pi_desc);
> > +    }
> >  }
> 
> > +void arch_vcpu_wake(struct vcpu *v)
> > +{
> > +    if ( !iommu_intpost || (v->runstate.state != RUNSTATE_blocked) )
> > +        return;
> > +
> > +    if ( likely(vcpu_runnable(v)) ||
> > +         !test_bit(_VPF_blocked, &v->pause_flags) )
> > +    {
> Invert this and bail if true? Well, a matter of taste, I guess... but it
> will save one level of indentation.
> 
> > +        struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> > +        unsigned long flags;
> > +
> > +        /*
> > +         * blocked -> runnable/offline
> > +         * If the state is transferred from RUNSTATE_blocked,
> > +         * we should set 'NV' field back to posted_intr_vector,
> > +         * so the Posted-Interrupts can be delivered to the vCPU
> > +         * by VT-d HW after it is scheduled to run.
> > +         */
> >
> Again, make the comment describe things in a RUNSTATE independent way
> (e.g., in terms of 'generic states', like "it's preempted", "it's
> blocked", "it's runnable"; or in terms of flags; or both).

Thanks for your comments and suggestion, Dario!

Thanks,
Feng

> 
> Thanks and Regards,
> Dario
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 06/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  2015-07-10 13:08   ` Jan Beulich
@ 2015-07-15  2:40     ` Wu, Feng
  2015-07-15  8:20       ` Jan Beulich
  2015-07-15  3:13     ` Wu, Feng
  1 sibling, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-07-15  2:40 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel@lists.xen.org, Zhang, Yang Z,
	Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Friday, July 10, 2015 9:08 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: Re: [v3 06/15] vmx: Extend struct pi_desc to support VT-d
> Posted-Interrupts
> 
> >>> On 24.06.15 at 07:18, <feng.wu@intel.com> wrote:
> > @@ -81,8 +81,19 @@ struct vmx_domain {
> >
> >  struct pi_desc {
> >      DECLARE_BITMAP(pir, NR_VECTORS);
> > -    u32 control;
> > -    u32 rsvd[7];
> > +    union {
> > +        struct
> > +        {
> > +        u16 on     : 1,  /* bit 256 - Outstanding Notification */
> > +            sn     : 1,  /* bit 257 - Suppress Notification */
> > +            rsvd_1 : 14; /* bit 271:258 - Reserved */
> > +        u8  nv;          /* bit 279:272 - Notification Vector */
> > +        u8  rsvd_2;      /* bit 287:280 - Reserved */
> > +        u32 ndst;        /* bit 319:288 - Notification Destination */
> > +        };
> > +        u64 control;
> > +    };
> 
> So current code, afaics, uses e.g. test_and_set_bit() to set ON.
> By also declaring this as a bitfield you're opening the structure for
> non-atomic accesses. If that's correct, why is other code not
> being changed to _only_ use the bitfield mechanism (likely also
> eliminating the need for it being a union with the now 64-bit
> "control"? If atomic accesses are required, then I'd strongly
> suggest against making this a bit field.
> 
> And in no event can I see why "ndst" needs to be union-ized
> with "control" if it doesn't need to be updated atomically with
> e.g. "nv".
> 

When the vCPU is about to block, we need to update "nv" and "ndst"
atomically, so that the wakeup notification event is delivered to
the right destination.
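
A minimal sketch of such an update, assuming C11 atomics in place of
Xen's cmpxchg() and explicit masks in place of the bitfields. The bit
positions follow the comments in the quoted struct (ON at bit 0, SN at
bit 1, NV at bits 23:16, NDST at bits 63:32 of the 64-bit control word);
the function name is hypothetical, and dest is taken to be already
encoded for xAPIC or x2APIC:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PI_ON          (UINT64_C(1) << 0)
#define PI_SN          (UINT64_C(1) << 1)
#define PI_NV_SHIFT    16
#define PI_NV_MASK     (UINT64_C(0xff) << PI_NV_SHIFT)
#define PI_NDST_SHIFT  32
#define PI_NDST_MASK   (UINT64_C(0xffffffff) << PI_NDST_SHIFT)

/*
 * Switch NV and NDST to the wake-up vector and destination and clear SN,
 * all in one compare-and-swap. Returns false without changing anything
 * if ON is already set, i.e. an interrupt was posted and the vCPU should
 * not block.
 */
static bool pi_prepare_block(_Atomic uint64_t *control,
                             uint8_t wakeup_vector, uint32_t dest)
{
    uint64_t old = atomic_load(control);
    uint64_t new;

    do {
        if ( old & PI_ON )
            return false;

        new = old & ~(PI_SN | PI_NV_MASK | PI_NDST_MASK);
        new |= (uint64_t)wakeup_vector << PI_NV_SHIFT;
        new |= (uint64_t)dest << PI_NDST_SHIFT;
        /* On failure, old is reloaded with the current value and we retry. */
    } while ( !atomic_compare_exchange_weak(control, &old, new) );

    return true;
}
```

Because NV and NDST change in the same swap, the hardware should never
see the new vector paired with the old destination, or the reverse.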

Thanks,
Feng

> Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 06/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  2015-07-10 13:08   ` Jan Beulich
  2015-07-15  2:40     ` Wu, Feng
@ 2015-07-15  3:13     ` Wu, Feng
  1 sibling, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-15  3:13 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel@lists.xen.org, Zhang, Yang Z,
	Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Friday, July 10, 2015 9:08 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: Re: [v3 06/15] vmx: Extend struct pi_desc to support VT-d
> Posted-Interrupts
> 
> >>> On 24.06.15 at 07:18, <feng.wu@intel.com> wrote:
> > @@ -81,8 +81,19 @@ struct vmx_domain {
> >
> >  struct pi_desc {
> >      DECLARE_BITMAP(pir, NR_VECTORS);
> > -    u32 control;
> > -    u32 rsvd[7];
> > +    union {
> > +        struct
> > +        {
> > +        u16 on     : 1,  /* bit 256 - Outstanding Notification */
> > +            sn     : 1,  /* bit 257 - Suppress Notification */
> > +            rsvd_1 : 14; /* bit 271:258 - Reserved */
> > +        u8  nv;          /* bit 279:272 - Notification Vector */
> > +        u8  rsvd_2;      /* bit 287:280 - Reserved */
> > +        u32 ndst;        /* bit 319:288 - Notification Destination */
> > +        };
> > +        u64 control;
> > +    };
> 
> So current code, afaics, uses e.g. test_and_set_bit() to set ON.
> By also declaring this as a bitfield you're opening the structure for
> non-atomic accesses. If that's correct, why is other code not
> being changed to _only_ use the bitfield mechanism (likely also
> eliminating the need for it being a union with the now 64-bit
> "control"? If atomic accesses are required, then I'd strongly
> suggest against making this a bit field.

All these fields are defined in the hardware spec, so it would be a
little strange to define nv and ndst explicitly but not on and sn.
But you are right, I should use one mechanism consistently to access
them, e.g. the bitfields.
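
For comparison, a rough sketch of the atomic-only alternative Jan
describes, where ON and SN are never touched through bitfields. It
assumes C11 atomics in place of Xen's test_and_set_bit() and friends,
and the helper names are only illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PI_ON  (UINT64_C(1) << 0)  /* Outstanding Notification */
#define PI_SN  (UINT64_C(1) << 1)  /* Suppress Notification */

/* Atomically set ON and report whether it was already set. */
static bool pi_test_and_set_on(_Atomic uint64_t *control)
{
    return (atomic_fetch_or(control, PI_ON) & PI_ON) != 0;
}

/* Atomically set SN without disturbing the other bits. */
static void pi_set_sn(_Atomic uint64_t *control)
{
    atomic_fetch_or(control, PI_SN);
}

/* Atomically clear SN without disturbing the other bits. */
static void pi_clear_sn(_Atomic uint64_t *control)
{
    atomic_fetch_and(control, ~PI_SN);
}
```

Every access here is a single read-modify-write on the whole control
word, which avoids the mixed atomic and non-atomic accesses raised in
the review.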

Thanks,
Feng

> 
> And in no event can I see why "ndst" needs to be union-ized
> with "control" if it doesn't need to be updated atomically with
> e.g. "nv".
> 
> Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used
  2015-07-10 14:01   ` Jan Beulich
@ 2015-07-15  6:04     ` Wu, Feng
  2015-07-15  8:24       ` Jan Beulich
  0 siblings, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-07-15  6:04 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel@lists.xen.org, Zhang, Yang Z,
	Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Friday, July 10, 2015 10:02 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: Re: [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used
> 
> >>> On 24.06.15 at 07:18, <feng.wu@intel.com> wrote:
> > This patch adds an API which is used to update the IRTE
> > for posted-interrupt when guest changes MSI/MSI-X information.
> 
> This is again an example where adding a dead function complicates
> review: How will I know here why this statement is correct, namely
> why MSI/MSI-X are affected but IO-APIC isn't?
> 
> > +int pi_update_irte(struct vcpu *v, struct pirq *pirq, uint8_t gvec)
> > +{
> > +    struct irq_desc *desc;
> > +    struct msi_desc *msi_desc;
> > +    int remap_index;
> > +    int rc = 0;
> > +    struct pci_dev *pci_dev;
> > +    struct acpi_drhd_unit *drhd;
> > +    struct iommu *iommu;
> > +    struct ir_ctrl *ir_ctrl;
> > +    struct iremap_entry *iremap_entries = NULL, *p = NULL;
> > +    struct iremap_entry new_ire;
> > +    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
> 
> I suppose some of the pointers above could become pointers to
> const?
> 
> > +    unsigned long flags;
> > +    uint128_t old_ire, ret;
> > +
> > +    desc = pirq_spin_lock_irq_desc(pirq, NULL);
> > +    if ( !desc )
> > +        return -ENOMEM;
> > +
> > +    msi_desc = desc->msi_desc;
> > +    if ( !msi_desc )
> > +    {
> > +        rc = -EBADSLT;
> > +        goto unlock_out;
> > +    }
> > +
> > +    pci_dev = msi_desc->dev;
> > +    if ( !pci_dev )
> > +    {
> > +        rc = -ENODEV;
> > +        goto unlock_out;
> > +    }
> > +
> > +    remap_index = msi_desc->remap_index;
> > +    drhd = acpi_find_matched_drhd_unit(pci_dev);
> > +    if ( !drhd )
> > +    {
> > +        rc = -ENODEV;
> > +        goto unlock_out;
> > +    }
> > +
> > +    iommu = drhd->iommu;
> > +    ir_ctrl = iommu_ir_ctrl(iommu);
> > +    if ( !ir_ctrl )
> > +    {
> > +        rc = -ENODEV;
> > +        goto unlock_out;
> > +    }
> > +
> > +    spin_lock_irqsave(&ir_ctrl->iremap_lock, flags);
> 
> Interrupts are unconditionally disabled here already. Question
> though is whether you really need to hold on to the IRQ descriptor
> lock across the entire function. Much of course depends on what
> other locks you maybe imply to be held by the caller.

I searched through much of the current code where the irq desc lock is
used, and I think the unlock operation can be placed before acquiring
iremap_lock.

> 
> I'm particularly worried by the call to acpi_find_matched_drhd_unit()
> - is it maybe worth storing the iommu pointer in struct msi_desc?

I think it is worth doing; Andrew also mentioned this point before. I
tend to make this an independent piece of work and do it later: the 4.6
release is coming, and I am still trying my best to target it. Could you
please share your concern here? Is it performance, or something else?
Thanks!

> 
> > +    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index,
> iremap_entries, p);
> > +    new_ire = *p;
> > +
> > +    /* Setup/Update interrupt remapping table entry. */
> > +    setup_posted_irte(&new_ire, pi_desc, gvec);
> > +
> > +    do {
> > +        old_ire = *(uint128_t *)p;
> 
> This cast suggests that you might want to go beyond what Andrew
> said on cmpxchg16b()'s parameters: Perhaps they'd better be
> void * instead of uint128_t *.

In that case, I need to do the cast in __cmpxchg16b(), right?

> 
> > +        ret = cmpxchg16b(p, &old_ire, &new_ire);
> > +    } while ( memcmp(&ret, &old_ire, sizeof(old_ire)) );
> 
> Doesn't setup_posted_irte() need to move inside this loop, as it
> tries to preserve certain fields? Or else, what is the cmpxchg16b
> loop guarding against (i.e. why isn't this just a single one)?

Why do we need to move setup_posted_irte() inside the loop? "new_ire"
will not be changed after setup, right? Here we need to make sure
the 128-bit IRTE is updated atomically, especially the high part
and the low part of the posted-interrupt descriptor address.

Thanks,
Feng

> 
> > +    iommu_flush_cache_entry(p, sizeof(struct iremap_entry));
> 
> sizeof(*p)
> 
> Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 06/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  2015-07-15  2:40     ` Wu, Feng
@ 2015-07-15  8:20       ` Jan Beulich
  2015-07-15  8:26         ` Wu, Feng
  0 siblings, 1 reply; 155+ messages in thread
From: Jan Beulich @ 2015-07-15  8:20 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel@lists.xen.org, Yang Z Zhang

>>> On 15.07.15 at 04:40, <feng.wu@intel.com> wrote:

> 
>> -----Original Message-----
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Friday, July 10, 2015 9:08 PM
>> To: Wu, Feng
>> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
>> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org 
>> Subject: Re: [v3 06/15] vmx: Extend struct pi_desc to support VT-d
>> Posted-Interrupts
>> 
>> >>> On 24.06.15 at 07:18, <feng.wu@intel.com> wrote:
>> > @@ -81,8 +81,19 @@ struct vmx_domain {
>> >
>> >  struct pi_desc {
>> >      DECLARE_BITMAP(pir, NR_VECTORS);
>> > -    u32 control;
>> > -    u32 rsvd[7];
>> > +    union {
>> > +        struct
>> > +        {
>> > +        u16 on     : 1,  /* bit 256 - Outstanding Notification */
>> > +            sn     : 1,  /* bit 257 - Suppress Notification */
>> > +            rsvd_1 : 14; /* bit 271:258 - Reserved */
>> > +        u8  nv;          /* bit 279:272 - Notification Vector */
>> > +        u8  rsvd_2;      /* bit 287:280 - Reserved */
>> > +        u32 ndst;        /* bit 319:288 - Notification Destination */
>> > +        };
>> > +        u64 control;
>> > +    };
>> 
>> So current code, afaics, uses e.g. test_and_set_bit() to set ON.
>> By also declaring this as a bitfield you're opening the structure for
>> non-atomic accesses. If that's correct, why is other code not
>> being changed to _only_ use the bitfield mechanism (likely also
>> eliminating the need for it being a union with the now 64-bit
>> "control"? If atomic accesses are required, then I'd strongly
>> suggest against making this a bit field.
>> 
>> And in no event can I see why "ndst" needs to be union-ized
>> with "control" if it doesn't need to be updated atomically with
>> e.g. "nv".
>> 
> 
> When the vCPU is to be blocked, we need to atomically update
> the "nv" and "ndst", then the wakeup notification event can be
> delivered to the right destination.

Okay. Your reply made me go through the patches again to check
where updates to nv/ndst happen - what's the reason they aren't
being updated as a pair in patch 14's RUNSTATE_running handling
(or in the replacement draft's vmx_ctxt_switch_to() adjustment)?

Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used
  2015-07-15  6:04     ` Wu, Feng
@ 2015-07-15  8:24       ` Jan Beulich
  2015-07-15  8:38         ` Wu, Feng
  0 siblings, 1 reply; 155+ messages in thread
From: Jan Beulich @ 2015-07-15  8:24 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel@lists.xen.org, Yang Z Zhang

>>> On 15.07.15 at 08:04, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Friday, July 10, 2015 10:02 PM
>> I'm particularly worried by the call to acpi_find_matched_drhd_unit()
>> - is it maybe worth storing the iommu pointer in struct msi_desc?
> 
> I think it worth, Like Andrew also mentioned this point before. I tend
> to make this a independent work and do it later, since the 4.6 release
> is coming, I am still try my best to target it. Could you please share
> your concern here, performance? Or other things? Thanks!

Interrupt latency in particular.

>> > +    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index,
>> iremap_entries, p);
>> > +    new_ire = *p;
>> > +
>> > +    /* Setup/Update interrupt remapping table entry. */
>> > +    setup_posted_irte(&new_ire, pi_desc, gvec);
>> > +
>> > +    do {
>> > +        old_ire = *(uint128_t *)p;
>> 
>> This cast suggests that you might want to go beyond what Andrew
>> said on cmpxchg16b()'s parameters: Perhaps they'd better be
>> void * instead of uint128_t *.
> 
> In that case, I need to do the cast in __cmpxchg16b(), right?

Where needed, yes. But that would limit casting to just a single place.

>> > +        ret = cmpxchg16b(p, &old_ire, &new_ire);
>> > +    } while ( memcmp(&ret, &old_ire, sizeof(old_ire)) );
>> 
>> Doesn't setup_posted_irte() need to move inside this loop, as it
>> tries to preserve certain fields? Or else, what is the cmpxchg16b
>> loop guarding against (i.e. why isn't this just a single one)?
> 
> Why need we move setup_posted_irte() inside the loop? "new_ire"
> will not be changed after setup, right? Here we need to make sure
> the 128b IRTE is updated atomically, especially for the high part
> of posted-interrupt descriptor address and the low part of it.

There are two possible scenarios:

1) There are bits that can be updated behind the back of the code
here. In that case you need to loop, and each iteration of the loop
needs to re-fetch the current value (not doing so would make the
loop infinite).

2) No racing updates are possible; all you care about is atomicity
of the update. In that case you don't need a loop around the
cmpxchg16b().

Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 06/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  2015-07-15  8:20       ` Jan Beulich
@ 2015-07-15  8:26         ` Wu, Feng
  2015-07-15  8:36           ` Jan Beulich
  0 siblings, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-07-15  8:26 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel@lists.xen.org, Zhang, Yang Z,
	Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, July 15, 2015 4:20 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: RE: [v3 06/15] vmx: Extend struct pi_desc to support VT-d
> Posted-Interrupts
> 
> >>> On 15.07.15 at 04:40, <feng.wu@intel.com> wrote:
> 
> >
> >> -----Original Message-----
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Friday, July 10, 2015 9:08 PM
> >> To: Wu, Feng
> >> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> >> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> >> Subject: Re: [v3 06/15] vmx: Extend struct pi_desc to support VT-d
> >> Posted-Interrupts
> >>
> >> >>> On 24.06.15 at 07:18, <feng.wu@intel.com> wrote:
> >> > @@ -81,8 +81,19 @@ struct vmx_domain {
> >> >
> >> >  struct pi_desc {
> >> >      DECLARE_BITMAP(pir, NR_VECTORS);
> >> > -    u32 control;
> >> > -    u32 rsvd[7];
> >> > +    union {
> >> > +        struct
> >> > +        {
> >> > +        u16 on     : 1,  /* bit 256 - Outstanding Notification */
> >> > +            sn     : 1,  /* bit 257 - Suppress Notification */
> >> > +            rsvd_1 : 14; /* bit 271:258 - Reserved */
> >> > +        u8  nv;          /* bit 279:272 - Notification Vector */
> >> > +        u8  rsvd_2;      /* bit 287:280 - Reserved */
> >> > +        u32 ndst;        /* bit 319:288 - Notification Destination */
> >> > +        };
> >> > +        u64 control;
> >> > +    };
> >>
> >> So current code, afaics, uses e.g. test_and_set_bit() to set ON.
> >> By also declaring this as a bitfield you're opening the structure for
> >> non-atomic accesses. If that's correct, why is other code not
> >> being changed to _only_ use the bitfield mechanism (likely also
> >> eliminating the need for it being a union with the now 64-bit
> >> "control"? If atomic accesses are required, then I'd strongly
> >> suggest against making this a bit field.
> >>
> >> And in no event can I see why "ndst" needs to be union-ized
> >> with "control" if it doesn't need to be updated atomically with
> >> e.g. "nv".
> >>
> >
> > When the vCPU is to be blocked, we need to atomically update
> > the "nv" and "ndst", then the wakeup notification event can be
> > delivered to the right destination.
> 
> Okay. Your reply made me go through the patches again to check
> where updates to nv/ndst happen - what's the reason they aren't
> being updated as a pair in patch 14's RUNSTATE_running handling
> (or in the replacement draft's vmx_ctxt_switch_to() adjustment)?

It is because we can only enter the running state from runnable, in
which the NV field has already been changed back to 'posted_intr_vector',
so we don't need to do it here again.

Thanks,
Feng

> 
> Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 06/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  2015-07-15  8:26         ` Wu, Feng
@ 2015-07-15  8:36           ` Jan Beulich
  2015-07-15  8:43             ` Wu, Feng
  0 siblings, 1 reply; 155+ messages in thread
From: Jan Beulich @ 2015-07-15  8:36 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel@lists.xen.org, Yang Z Zhang

>>> On 15.07.15 at 10:26, <feng.wu@intel.com> wrote:

> 
>> -----Original Message-----
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Wednesday, July 15, 2015 4:20 PM
>> To: Wu, Feng
>> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
>> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org 
>> Subject: RE: [v3 06/15] vmx: Extend struct pi_desc to support VT-d
>> Posted-Interrupts
>> 
>> >>> On 15.07.15 at 04:40, <feng.wu@intel.com> wrote:
>> 
>> >
>> >> -----Original Message-----
>> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Friday, July 10, 2015 9:08 PM
>> >> To: Wu, Feng
>> >> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
>> >> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org 
>> >> Subject: Re: [v3 06/15] vmx: Extend struct pi_desc to support VT-d
>> >> Posted-Interrupts
>> >>
>> >> >>> On 24.06.15 at 07:18, <feng.wu@intel.com> wrote:
>> >> > @@ -81,8 +81,19 @@ struct vmx_domain {
>> >> >
>> >> >  struct pi_desc {
>> >> >      DECLARE_BITMAP(pir, NR_VECTORS);
>> >> > -    u32 control;
>> >> > -    u32 rsvd[7];
>> >> > +    union {
>> >> > +        struct
>> >> > +        {
>> >> > +        u16 on     : 1,  /* bit 256 - Outstanding Notification */
>> >> > +            sn     : 1,  /* bit 257 - Suppress Notification */
>> >> > +            rsvd_1 : 14; /* bit 271:258 - Reserved */
>> >> > +        u8  nv;          /* bit 279:272 - Notification Vector */
>> >> > +        u8  rsvd_2;      /* bit 287:280 - Reserved */
>> >> > +        u32 ndst;        /* bit 319:288 - Notification Destination */
>> >> > +        };
>> >> > +        u64 control;
>> >> > +    };
>> >>
>> >> So current code, afaics, uses e.g. test_and_set_bit() to set ON.
>> >> By also declaring this as a bitfield you're opening the structure for
>> >> non-atomic accesses. If that's correct, why is other code not
>> >> being changed to _only_ use the bitfield mechanism (likely also
>> >> eliminating the need for it being a union with the now 64-bit
>> >> "control"? If atomic accesses are required, then I'd strongly
>> >> suggest against making this a bit field.
>> >>
>> >> And in no event can I see why "ndst" needs to be union-ized
>> >> with "control" if it doesn't need to be updated atomically with
>> >> e.g. "nv".
>> >>
>> >
>> > When the vCPU is to be blocked, we need to atomically update
>> > the "nv" and "ndst", then the wakeup notification event can be
>> > delivered to the right destination.
>> 
>> Okay. Your reply made me go through the patches again to check
>> where updates to nv/ndst happen - what's the reason they aren't
>> being updated as a pair in patch 14's RUNSTATE_running handling
>> (or in the replacement draft's vmx_ctxt_switch_to() adjustment)?
> 
> It is because, we can only enter running state from runnable, in which,
> the NV field has been already changed back to ' posted_intr_vector ',
> we don't need to do it here again.

Without sitting in the runstate update path anymore, I can't see how
you would get to see all transitions to runnable.

Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used
  2015-07-15  8:24       ` Jan Beulich
@ 2015-07-15  8:38         ` Wu, Feng
  2015-07-15  8:46           ` Jan Beulich
  0 siblings, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-07-15  8:38 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel@lists.xen.org, Zhang, Yang Z,
	Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, July 15, 2015 4:25 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: RE: [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used
> 
> >>> On 15.07.15 at 08:04, <feng.wu@intel.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Friday, July 10, 2015 10:02 PM
> >> I'm particularly worried by the call to acpi_find_matched_drhd_unit()
> >> - is it maybe worth storing the iommu pointer in struct msi_desc?
> >
> > I think it worth, Like Andrew also mentioned this point before. I tend
> > to make this a independent work and do it later, since the 4.6 release
> > is coming, I am still try my best to target it. Could you please share
> > your concern here, performance? Or other things? Thanks!
> 
> Interrupt latency in particular.

This IRTE update operation is not frequent. It happens only a few times,
mostly in the initialization phase of the guest. And even when the guest
sets the affinity, if the MSI/MSI-X configuration doesn't change, QEMU
will not ask Xen to update it.

> 
> >> > +    GET_IREMAP_ENTRY(ir_ctrl->iremap_maddr, remap_index,
> >> iremap_entries, p);
> >> > +    new_ire = *p;
> >> > +
> >> > +    /* Setup/Update interrupt remapping table entry. */
> >> > +    setup_posted_irte(&new_ire, pi_desc, gvec);
> >> > +
> >> > +    do {
> >> > +        old_ire = *(uint128_t *)p;
> >>
> >> This cast suggests that you might want to go beyond what Andrew
> >> said on cmpxchg16b()'s parameters: Perhaps they'd better be
> >> void * instead of uint128_t *.
> >
> > In that case, I need to do the cast in __cmpxchg16b(), right?
> 
> Where needed, yes. But that would limit casting to just a single place.
> 
> >> > +        ret = cmpxchg16b(p, &old_ire, &new_ire);
> >> > +    } while ( memcmp(&ret, &old_ire, sizeof(old_ire)) );
> >>
> >> Doesn't setup_posted_irte() need to move inside this loop, as it
> >> tries to preserve certain fields? Or else, what is the cmpxchg16b
> >> loop guarding against (i.e. why isn't this just a single one)?
> >
> > Why need we move setup_posted_irte() inside the loop? "new_ire"
> > will not be changed after setup, right? Here we need to make sure
> > the 128b IRTE is updated atomically, especially for the high part
> > of posted-interrupt descriptor address and the low part of it.
> 
> There are two possible scenarios:
> 
> 1) There are bits that can be updated behind the back of the code
> here. In that case you need to loop, and each iteration of the loop
> needs to re-fetch the current value (not doing so would make the
> loop infinite).

Oh, yes, I think I made a mistake here; I have been too hasty these days.
Sorry for that! I think I need to do it like this:

    do {
        new_ire = *p;

        /* Setup/Update interrupt remapping table entry. */
        setup_posted_irte(&new_ire, pi_desc, gvec);

        old_ire = *(uint128_t *)p;
        ret = cmpxchg16b(p, &old_ire, &new_ire);
    } while ( memcmp(&ret, &old_ire, sizeof(old_ire)) );

Thanks,
Feng

> 
> 2) No racing updates are possible; all you care about is atomicity
> of the update. In that case you don't need a loop around the
> cmpxchg16b().
> 
> Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 06/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  2015-07-15  8:36           ` Jan Beulich
@ 2015-07-15  8:43             ` Wu, Feng
  2015-07-15  9:28               ` Jan Beulich
  0 siblings, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-07-15  8:43 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel@lists.xen.org, Zhang, Yang Z,
	Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, July 15, 2015 4:36 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: RE: [v3 06/15] vmx: Extend struct pi_desc to support VT-d
> Posted-Interrupts
> 
> >>> On 15.07.15 at 10:26, <feng.wu@intel.com> wrote:
> 
> >
> >> -----Original Message-----
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Wednesday, July 15, 2015 4:20 PM
> >> To: Wu, Feng
> >> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> >> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> >> Subject: RE: [v3 06/15] vmx: Extend struct pi_desc to support VT-d
> >> Posted-Interrupts
> >>
> >> >>> On 15.07.15 at 04:40, <feng.wu@intel.com> wrote:
> >>
> >> >
> >> >> -----Original Message-----
> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> Sent: Friday, July 10, 2015 9:08 PM
> >> >> To: Wu, Feng
> >> >> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian,
> Kevin;
> >> >> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> >> >> Subject: Re: [v3 06/15] vmx: Extend struct pi_desc to support VT-d
> >> >> Posted-Interrupts
> >> >>
> >> >> >>> On 24.06.15 at 07:18, <feng.wu@intel.com> wrote:
> >> >> > @@ -81,8 +81,19 @@ struct vmx_domain {
> >> >> >
> >> >> >  struct pi_desc {
> >> >> >      DECLARE_BITMAP(pir, NR_VECTORS);
> >> >> > -    u32 control;
> >> >> > -    u32 rsvd[7];
> >> >> > +    union {
> >> >> > +        struct
> >> >> > +        {
> >> >> > +        u16 on     : 1,  /* bit 256 - Outstanding Notification */
> >> >> > +            sn     : 1,  /* bit 257 - Suppress Notification */
> >> >> > +            rsvd_1 : 14; /* bit 271:258 - Reserved */
> >> >> > +        u8  nv;          /* bit 279:272 - Notification Vector */
> >> >> > +        u8  rsvd_2;      /* bit 287:280 - Reserved */
> >> >> > +        u32 ndst;        /* bit 319:288 - Notification Destination
> */
> >> >> > +        };
> >> >> > +        u64 control;
> >> >> > +    };
> >> >>
> >> >> So current code, afaics, uses e.g. test_and_set_bit() to set ON.
> >> >> By also declaring this as a bitfield you're opening the structure for
> >> >> non-atomic accesses. If that's correct, why is other code not
> >> >> being changed to _only_ use the bitfield mechanism (likely also
> >> >> eliminating the need for it being a union with the now 64-bit
> >> >> "control"? If atomic accesses are required, then I'd strongly
> >> >> suggest against making this a bit field.
> >> >>
> >> >> And in no event can I see why "ndst" needs to be union-ized
> >> >> with "control" if it doesn't need to be updated atomically with
> >> >> e.g. "nv".
> >> >>
> >> >
> >> > When the vCPU is to be blocked, we need to atomically update
> >> > the "nv" and "ndst", then the wakeup notification event can be
> >> > delivered to the right destination.
> >>
> >> Okay. Your reply made me go through the patches again to check
> >> where updates to nv/ndst happen - what's the reason they aren't
> >> being updated as a pair in patch 14's RUNSTATE_running handling
> >> (or in the replacement draft's vmx_ctxt_switch_to() adjustment)?
> >
> > It is because, we can only enter running state from runnable, in which,
> > the NV field has been already changed back to ' posted_intr_vector ',
> > we don't need to do it here again.
> 
> Without sitting in the runstate update path anymore, I can't see how
> you would get to see all transitions to runnable.

Sorry, I cannot understand the above comment well. Do you mean that
after switching to the new method (arch hooks) to update the
posted-interrupt descriptor, I cannot track all the state transitions to
runnable?

Thanks,
Feng

> 
> Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used
  2015-07-15  8:38         ` Wu, Feng
@ 2015-07-15  8:46           ` Jan Beulich
  2015-07-15  8:55             ` Wu, Feng
  0 siblings, 1 reply; 155+ messages in thread
From: Jan Beulich @ 2015-07-15  8:46 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel@lists.xen.org, Yang Z Zhang

>>> On 15.07.15 at 10:38, <feng.wu@intel.com> wrote:

> 
>> -----Original Message-----
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Wednesday, July 15, 2015 4:25 PM
>> To: Wu, Feng
>> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
>> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org 
>> Subject: RE: [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used
>> 
>> >>> On 15.07.15 at 08:04, <feng.wu@intel.com> wrote:
>> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Friday, July 10, 2015 10:02 PM
>> >> I'm particularly worried by the call to acpi_find_matched_drhd_unit()
>> >> - is it maybe worth storing the iommu pointer in struct msi_desc?
>> >
>> > I think it worth, Like Andrew also mentioned this point before. I tend
>> > to make this a independent work and do it later, since the 4.6 release
>> > is coming, I am still try my best to target it. Could you please share
>> > your concern here, performance? Or other things? Thanks!
>> 
>> Interrupt latency in particular.
> 
> This update IRTE operation is not so frequently. It only happens in few 
> times,
> especially in the initialization phase of the guest. And even the guest set
> the affinity, in the MSI/MSIx configuration doesn't change, QEMU will not
> ask Xen to update it.

When the guest sets the affinity, the MSI{,-X} configuration is
rather likely to change (at least for Linux guests).

>> There are two possible scenarios:
>> 
>> 1) There are bits that can be updated behind the back of the code
>> here. In that case you need to loop, and each iteration of the loop
>> needs to re-fetch the current value (not doing so would make the
>> loop infinite).
> 
> Oh, yes, I think I made a mistake here, it is too hastily these days,
> Sorry for that! I think I need do it like this:
> 
>     do {
>         new_ire = *p;
> 
>         /* Setup/Update interrupt remapping table entry. */
>         setup_posted_irte(&new_ire, pi_desc, gvec);
> 
>         old_ire = *(uint128_t *)p;
>         ret = cmpxchg16b(p, &old_ire, &new_ire);
>     } while ( memcmp(&ret, &old_ire, sizeof(old_ire)) );

So since you put this in a loop again, would you mind pointing out
which bits can get modified behind our back?

Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used
  2015-07-15  8:46           ` Jan Beulich
@ 2015-07-15  8:55             ` Wu, Feng
  2015-07-15  9:32               ` Jan Beulich
  0 siblings, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-07-15  8:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel@lists.xen.org, Zhang, Yang Z,
	Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, July 15, 2015 4:46 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: RE: [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used
> 
> >>> On 15.07.15 at 10:38, <feng.wu@intel.com> wrote:
> 
> >
> >> -----Original Message-----
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Wednesday, July 15, 2015 4:25 PM
> >> To: Wu, Feng
> >> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> >> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> >> Subject: RE: [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used
> >>
> >> >>> On 15.07.15 at 08:04, <feng.wu@intel.com> wrote:
> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> Sent: Friday, July 10, 2015 10:02 PM
> >> >> I'm particularly worried by the call to acpi_find_matched_drhd_unit()
> >> >> - is it maybe worth storing the iommu pointer in struct msi_desc?
> >> >
> >> > I think it worth, Like Andrew also mentioned this point before. I tend
> >> > to make this a independent work and do it later, since the 4.6 release
> >> > is coming, I am still try my best to target it. Could you please share
> >> > your concern here, performance? Or other things? Thanks!
> >>
> >> Interrupt latency in particular.
> >
> > This update IRTE operation is not so frequently. It only happens in few
> > times,
> > especially in the initialization phase of the guest. And even the guest set
> > the affinity, in the MSI/MSIx configuration doesn't change, QEMU will not
> > ask Xen to update it.
> 
> When the guest sets the affinity, the MSI{,-X} configuration is
> rather likely to change (at least for Linux guests).

Yes, it is. But I'd say it is not a frequent operation. In my test, it only
happens in the initialization phase, and some updates don't reach Xen since
the configuration is the same (QEMU filters them). I agree I will change
this; my question is whether we can do it a little later, so that I can
focus on some other critical issues before 4.6 is released, which may give
this patch set a better chance of catching up with 4.6. Is this okay with
you?

Thanks,
Feng

> 
> >> There are two possible scenarios:
> >>
> >> 1) There are bits that can be updated behind the back of the code
> >> here. In that case you need to loop, and each iteration of the loop
> >> needs to re-fetch the current value (not doing so would make the
> >> loop infinite).
> >
> > Oh, yes, I think I made a mistake here, it is too hastily these days,
> > Sorry for that! I think I need do it like this:
> >
> >     do {
> >         new_ire = *p;
> >
> >         /* Setup/Update interrupt remapping table entry. */
> >         setup_posted_irte(&new_ire, pi_desc, gvec);
> >
> >         old_ire = *(uint128_t *)p;
> >         ret = cmpxchg16b(p, &old_ire, &new_ire);
> >     } while ( memcmp(&ret, &old_ire, sizeof(old_ire)) );
> 
> So since you put this in a loop again, would you mind pointing out
> which bits can get modified behind our back?
> 
> Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 06/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  2015-07-15  8:43             ` Wu, Feng
@ 2015-07-15  9:28               ` Jan Beulich
  2015-07-15  9:30                 ` Wu, Feng
  0 siblings, 1 reply; 155+ messages in thread
From: Jan Beulich @ 2015-07-15  9:28 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel@lists.xen.org, Yang Z Zhang

>>> On 15.07.15 at 10:43, <feng.wu@intel.com> wrote:

> 
>> -----Original Message-----
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Wednesday, July 15, 2015 4:36 PM
>> To: Wu, Feng
>> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
>> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org 
>> Subject: RE: [v3 06/15] vmx: Extend struct pi_desc to support VT-d
>> Posted-Interrupts
>> 
>> >>> On 15.07.15 at 10:26, <feng.wu@intel.com> wrote:
>> 
>> >
>> >> -----Original Message-----
>> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Wednesday, July 15, 2015 4:20 PM
>> >> To: Wu, Feng
>> >> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
>> >> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org 
>> >> Subject: RE: [v3 06/15] vmx: Extend struct pi_desc to support VT-d
>> >> Posted-Interrupts
>> >>
>> >> >>> On 15.07.15 at 04:40, <feng.wu@intel.com> wrote:
>> >>
>> >> >
>> >> >> -----Original Message-----
>> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> >> Sent: Friday, July 10, 2015 9:08 PM
>> >> >> To: Wu, Feng
>> >> >> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian,
>> Kevin;
>> >> >> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org 
>> >> >> Subject: Re: [v3 06/15] vmx: Extend struct pi_desc to support VT-d
>> >> >> Posted-Interrupts
>> >> >>
>> >> >> >>> On 24.06.15 at 07:18, <feng.wu@intel.com> wrote:
>> >> >> > @@ -81,8 +81,19 @@ struct vmx_domain {
>> >> >> >
>> >> >> >  struct pi_desc {
>> >> >> >      DECLARE_BITMAP(pir, NR_VECTORS);
>> >> >> > -    u32 control;
>> >> >> > -    u32 rsvd[7];
>> >> >> > +    union {
>> >> >> > +        struct
>> >> >> > +        {
>> >> >> > +        u16 on     : 1,  /* bit 256 - Outstanding Notification */
>> >> >> > +            sn     : 1,  /* bit 257 - Suppress Notification */
>> >> >> > +            rsvd_1 : 14; /* bit 271:258 - Reserved */
>> >> >> > +        u8  nv;          /* bit 279:272 - Notification Vector */
>> >> >> > +        u8  rsvd_2;      /* bit 287:280 - Reserved */
>> >> >> > +        u32 ndst;        /* bit 319:288 - Notification Destination
>> */
>> >> >> > +        };
>> >> >> > +        u64 control;
>> >> >> > +    };
>> >> >>
>> >> >> So current code, afaics, uses e.g. test_and_set_bit() to set ON.
>> >> >> By also declaring this as a bitfield you're opening the structure for
>> >> >> non-atomic accesses. If that's correct, why is other code not
>> >> >> being changed to _only_ use the bitfield mechanism (likely also
>> >> >> eliminating the need for it being a union with the now 64-bit
>> >> >> "control"? If atomic accesses are required, then I'd strongly
>> >> >> suggest against making this a bit field.
>> >> >>
>> >> >> And in no event can I see why "ndst" needs to be union-ized
>> >> >> with "control" if it doesn't need to be updated atomically with
>> >> >> e.g. "nv".
>> >> >>
>> >> >
>> >> > When the vCPU is to be blocked, we need to atomically update
>> >> > the "nv" and "ndst", then the wakeup notification event can be
>> >> > delivered to the right destination.
>> >>
>> >> Okay. Your reply made me go through the patches again to check
>> >> where updates to nv/ndst happen - what's the reason they aren't
>> >> being updated as a pair in patch 14's RUNSTATE_running handling
>> >> (or in the replacement draft's vmx_ctxt_switch_to() adjustment)?
>> >
>> > It is because, we can only enter running state from runnable, in which,
>> > the NV field has been already changed back to ' posted_intr_vector ',
>> > we don't need to do it here again.
>> 
>> Without sitting in the runstate update path anymore, I can't see how
>> you would get to see all transitions to runnable.
> 
> Sorry, I cannot understanding the above comments well. Do you mean
> after using the new method (arch hooks ) to update posted-interrupt
> descriptor, I cannot track all the state transitions to runnable?

Not sure if "track" is the right word here, but yes.

Jan
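
The concern about mixing bitfield accesses with atomic ones can be made
concrete with a stand-alone sketch that treats the control word purely as a
64-bit integer. The bit positions follow the comments in the quoted struct
(ON is descriptor bit 256, i.e. bit 0 of "control"; NV occupies bits 23:16;
NDST occupies bits 63:32). The pi_model_* names and the use of C11 atomics
are assumptions made for illustration, not Xen's API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Positions inside the 64-bit control word (descriptor bits 256..319). */
#define PI_ON          (UINT64_C(1) << 0)   /* Outstanding Notification */
#define PI_SN          (UINT64_C(1) << 1)   /* Suppress Notification */
#define PI_NV_SHIFT    16                   /* Notification Vector, 8 bits */
#define PI_NV_MASK     (UINT64_C(0xff) << PI_NV_SHIFT)
#define PI_NDST_SHIFT  32                   /* Notification Destination */
#define PI_NDST_MASK   (UINT64_C(0xffffffff) << PI_NDST_SHIFT)

/* Atomically set ON; returns true if it was already set. */
static bool pi_model_test_and_set_on(_Atomic uint64_t *control)
{
    return (atomic_fetch_or(control, PI_ON) & PI_ON) != 0;
}

/* Update NV and NDST as one unit, leaving ON, SN and reserved bits alone. */
static void pi_model_set_nv_ndst(_Atomic uint64_t *control,
                                 uint8_t nv, uint32_t ndst)
{
    uint64_t old = atomic_load(control);
    uint64_t new;

    do {
        new = old & ~(PI_NV_MASK | PI_NDST_MASK);
        new |= ((uint64_t)nv << PI_NV_SHIFT) |
               ((uint64_t)ndst << PI_NDST_SHIFT);
    } while ( !atomic_compare_exchange_weak(control, &old, new) );
}
```

With this layout every access goes through an atomic operation on the same
64-bit object, so an observer never sees a new NV paired with an old NDST.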

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 06/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts
  2015-07-15  9:28               ` Jan Beulich
@ 2015-07-15  9:30                 ` Wu, Feng
  0 siblings, 0 replies; 155+ messages in thread
From: Wu, Feng @ 2015-07-15  9:30 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Tian, Kevin, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel@lists.xen.org, Zhang, Yang Z,
	Wu, Feng



> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, July 15, 2015 5:28 PM
> To: Wu, Feng
> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> Subject: RE: [v3 06/15] vmx: Extend struct pi_desc to support VT-d
> Posted-Interrupts
> 
> >>> On 15.07.15 at 10:43, <feng.wu@intel.com> wrote:
> 
> >
> >> -----Original Message-----
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Wednesday, July 15, 2015 4:36 PM
> >> To: Wu, Feng
> >> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian, Kevin;
> >> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> >> Subject: RE: [v3 06/15] vmx: Extend struct pi_desc to support VT-d
> >> Posted-Interrupts
> >>
> >> >>> On 15.07.15 at 10:26, <feng.wu@intel.com> wrote:
> >>
> >> >
> >> >> -----Original Message-----
> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> Sent: Wednesday, July 15, 2015 4:20 PM
> >> >> To: Wu, Feng
> >> >> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian,
> Kevin;
> >> >> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> >> >> Subject: RE: [v3 06/15] vmx: Extend struct pi_desc to support VT-d
> >> >> Posted-Interrupts
> >> >>
> >> >> >>> On 15.07.15 at 04:40, <feng.wu@intel.com> wrote:
> >> >>
> >> >> >
> >> >> >> -----Original Message-----
> >> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >> >> Sent: Friday, July 10, 2015 9:08 PM
> >> >> >> To: Wu, Feng
> >> >> >> Cc: andrew.cooper3@citrix.com; george.dunlap@eu.citrix.com; Tian,
> >> Kevin;
> >> >> >> Zhang, Yang Z; xen-devel@lists.xen.org; keir@xen.org
> >> >> >> Subject: Re: [v3 06/15] vmx: Extend struct pi_desc to support VT-d
> >> >> >> Posted-Interrupts
> >> >> >>
> >> >> >> >>> On 24.06.15 at 07:18, <feng.wu@intel.com> wrote:
> >> >> >> > @@ -81,8 +81,19 @@ struct vmx_domain {
> >> >> >> >
> >> >> >> >  struct pi_desc {
> >> >> >> >      DECLARE_BITMAP(pir, NR_VECTORS);
> >> >> >> > -    u32 control;
> >> >> >> > -    u32 rsvd[7];
> >> >> >> > +    union {
> >> >> >> > +        struct
> >> >> >> > +        {
> >> >> >> > +        u16 on     : 1,  /* bit 256 - Outstanding Notification */
> >> >> >> > +            sn     : 1,  /* bit 257 - Suppress Notification */
> >> >> >> > +            rsvd_1 : 14; /* bit 271:258 - Reserved */
> >> >> >> > +        u8  nv;          /* bit 279:272 - Notification Vector */
> >> >> >> > +        u8  rsvd_2;      /* bit 287:280 - Reserved */
> >> >> >> > +        u32 ndst;        /* bit 319:288 - Notification
> Destination
> >> */
> >> >> >> > +        };
> >> >> >> > +        u64 control;
> >> >> >> > +    };
> >> >> >>
> >> >> >> So current code, afaics, uses e.g. test_and_set_bit() to set ON.
> >> >> >> By also declaring this as a bitfield you're opening the structure for
> >> >> >> non-atomic accesses. If that's correct, why is other code not
> >> >> >> being changed to _only_ use the bitfield mechanism (likely also
> >> >> >> eliminating the need for it being a union with the now 64-bit
> >> >> >> "control"? If atomic accesses are required, then I'd strongly
> >> >> >> suggest against making this a bit field.
> >> >> >>
> >> >> >> And in no event can I see why "ndst" needs to be union-ized
> >> >> >> with "control" if it doesn't need to be updated atomically with
> >> >> >> e.g. "nv".
> >> >> >>
> >> >> >
> >> >> > When the vCPU is to be blocked, we need to atomically update
> >> >> > the "nv" and "ndst", then the wakeup notification event can be
> >> >> > delivered to the right destination.
> >> >>
> >> >> Okay. Your reply made me go through the patches again to check
> >> >> where updates to nv/ndst happen - what's the reason they aren't
> >> >> being updated as a pair in patch 14's RUNSTATE_running handling
> >> >> (or in the replacement draft's vmx_ctxt_switch_to() adjustment)?
> >> >
> >> > It is because, we can only enter running state from runnable, in which,
> >> > the NV field has been already changed back to ' posted_intr_vector ',
> >> > we don't need to do it here again.
> >>
> >> Without sitting in the runstate update path anymore, I can't see how
> >> you would get to see all transitions to runnable.
> >
> > Sorry, I cannot understanding the above comments well. Do you mean
> > after using the new method (arch hooks ) to update posted-interrupt
> > descriptor, I cannot track all the state transitions to runnable?
> 
> Not sure if "track" is the right word here, but yes.
> 

The new method is still in development; let's see how it turns out. :)

Thanks,
Feng

> Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used
  2015-07-15  8:55             ` Wu, Feng
@ 2015-07-15  9:32               ` Jan Beulich
  0 siblings, 0 replies; 155+ messages in thread
From: Jan Beulich @ 2015-07-15  9:32 UTC (permalink / raw)
  To: Feng Wu
  Cc: Kevin Tian, keir@xen.org, george.dunlap@eu.citrix.com,
	andrew.cooper3@citrix.com, xen-devel@lists.xen.org, Yang Z Zhang

>>> On 15.07.15 at 10:55, <feng.wu@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Wednesday, July 15, 2015 4:46 PM
>> >>> On 15.07.15 at 10:38, <feng.wu@intel.com> wrote:
>> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> Sent: Wednesday, July 15, 2015 4:25 PM
>> >> >>> On 15.07.15 at 08:04, <feng.wu@intel.com> wrote:
>> >> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >> >> Sent: Friday, July 10, 2015 10:02 PM
>> >> >> I'm particularly worried by the call to acpi_find_matched_drhd_unit()
>> >> >> - is it maybe worth storing the iommu pointer in struct msi_desc?
>> >> >
>> >> > I think it worth, Like Andrew also mentioned this point before. I tend
>> >> > to make this a independent work and do it later, since the 4.6 release
>> >> > is coming, I am still try my best to target it. Could you please share
>> >> > your concern here, performance? Or other things? Thanks!
>> >>
>> >> Interrupt latency in particular.
>> >
>> > This update IRTE operation is not so frequently. It only happens in few
>> > times,
>> > especially in the initialization phase of the guest. And even the guest set
>> > the affinity, in the MSI/MSIx configuration doesn't change, QEMU will not
>> > ask Xen to update it.
>> 
>> When the guest sets the affinity, the MSI{,-X} configuration is
>> rather likely to change (at least for Linux guests).
> 
> Yes, it is. But I'd say, it is not a frequent operation. In my test, it only 
> happens
> in the initialization phase and some updates doesn't go the Xen since the
> configuration is the same (QEMU filters it).

Can I please ask you to move away from this way of thinking? What
you see in experiments is useful from a functionality pov, but pretty
meaningless from a security perspective. For that, you'd rather start
thinking about what a _malicious_ guest might be doing.

> And I agree I will change this,
> my question is that can we put this a little late, and I can focus on some
> other critical issue before 4.6 is release, which may make more chance for
> this patch to catch up with 4.6. Is this okay for you?

As long as the feature (due to the other issue) remains experimental,
is off by default, and the code has a prominent comment outlining the
intended improvement, I'd be fine, yes.

Jan

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-14 16:02                                 ` Dario Faggioli
  2015-07-15  0:54                                   ` Wu, Feng
@ 2015-07-17  7:46                                   ` Wu, Feng
  2015-07-17 10:13                                     ` Dario Faggioli
  1 sibling, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-07-17  7:46 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tian, Kevin, keir@xen.org, George Dunlap,
	andrew.cooper3@citrix.com, xen-devel, Jan Beulich, Zhang, Yang Z,
	Wu, Feng



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Wednesday, July 15, 2015 12:03 AM
> To: Wu, Feng
> Cc: Jan Beulich; Tian, Kevin; keir@xen.org; George Dunlap;
> andrew.cooper3@citrix.com; xen-devel; Zhang, Yang Z
> Subject: Re: [Xen-devel] Fwd: [v3 14/15] Update Posted-Interrupts Descriptor
> during vCPU scheduling
> 

Hi Dario,

I finished the new patch with arch hooks, but something seems to be wrong:
after assigning the NIC to the guest, I ping another host from the guest and
the latency is too high. So far I have not found the reason after debugging the
code for some time. I am posting the patch here to see if you can find any
obvious logic errors in it. Or could you have a look at it to double-check
whether this patch does exactly the same thing as
[v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling,
which works well? A thorough review would be highly appreciated! Thank you
very much!

    vmx: Add some scheduler hooks for VT-d posted interrupts

    This patch adds the following arch hooks to the scheduler:
    - vmx_pre_ctx_switch_pi():
    Called in vmx_ctxt_switch_from(). It updates the posted
    interrupt descriptor when the vCPU is preempted, goes to sleep,
    or is blocked.

    - vmx_post_ctx_switch_pi():
    Called in vmx_ctxt_switch_to(). It updates the posted
    interrupt descriptor when the vCPU is about to run.

    - arch_vcpu_wake():
    Will be called in vcpu_wake() in a later patch. It updates
    the posted interrupt descriptor when the vCPU is unblocked.

    Suggested-by: Dario Faggioli <dario.faggioli@citrix.com>
    Signed-off-by: Feng Wu <feng.wu@intel.com>

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 6d25a32..82d797f 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -168,6 +168,7 @@ static int vmx_vcpu_initialise(struct vcpu *v)

     INIT_LIST_HEAD(&v->arch.hvm_vmx.pi_blocked_vcpu_list);

+    v->arch.hvm_vmx.pi_block_cpu = -1;
     return 0;
 }

@@ -725,6 +726,139 @@ static void vmx_fpu_leave(struct vcpu *v)
     }
 }

+void arch_vcpu_wake(struct vcpu *v)
+{
+    if ( !iommu_intpost || !is_hvm_vcpu(v) )
+        return;
+
+    if ( likely(vcpu_runnable(v)) ||
+         !test_bit(_VPF_blocked, &v->pause_flags) )
+    {
+        struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+        unsigned long flags;
+
+        /*
+         * blocked -> runnable/offline
+         * If the state is transferred from RUNSTATE_blocked,
+         * we should set 'NV' field back to posted_intr_vector,
+         * so the Posted-Interrupts can be delivered to the vCPU
+         * by VT-d HW after it is scheduled to run.
+         */
+        write_atomic((uint8_t *)&pi_desc->nv, posted_intr_vector);
+
+        /*
+         * Delete the vCPU from the related block list
+         * if we are resuming from blocked state
+         */
+        if ( v->arch.hvm_vmx.pi_block_cpu != -1 )
+        {
+            spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+                              v->arch.hvm_vmx.pi_block_cpu), flags);
+            list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
+            spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
+                                   v->arch.hvm_vmx.pi_block_cpu), flags);
+            v->arch.hvm_vmx.pi_block_cpu = -1;
+        }
+    }
+}
+
+static void vmx_pre_ctx_switch_pi(struct vcpu *v)
+{
+    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+    struct pi_desc old, new;
+    unsigned long flags;
+
+    if ( !iommu_intpost || !is_hvm_vcpu(v) )
+        return;
+
+    if ( vcpu_runnable(v) || !test_bit(_VPF_blocked, &v->pause_flags) )
+    {
+        /*
+         * The vCPU has been preempted or went to sleep. We don't need to send
+         * notification event to a non-running vcpu, the interrupt information
+         * will be delivered to it before VM-ENTRY when the vcpu is scheduled
+         * to run next time.
+         */
+        pi_set_sn(pi_desc);
+
+    }
+    else if ( test_bit(_VPF_blocked, &v->pause_flags) )
+    {
+        ASSERT(v->arch.hvm_vmx.pi_block_cpu == -1);
+
+        /*
+         * The vCPU is blocking, we need to add it to one of the per pCPU lists.
+         * We save v->processor to v->arch.hvm_vmx.pi_block_cpu and use it for
+         * the per-CPU list, we also save it to posted-interrupt descriptor and
+         * make it as the destination of the wake-up notification event.
+         */
+        v->arch.hvm_vmx.pi_block_cpu = v->processor;
+        spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+                          v->arch.hvm_vmx.pi_block_cpu), flags);
+        list_add_tail(&v->arch.hvm_vmx.pi_blocked_vcpu_list,
+                      &per_cpu(pi_blocked_vcpu, v->arch.hvm_vmx.pi_block_cpu));
+        spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
+                           v->arch.hvm_vmx.pi_block_cpu), flags);
+
+        do {
+            old.control = new.control = pi_desc->control;
+
+            /* Should not block the vCPU if an interrupt was posted for it */
+
+            if ( old.on )
+            {
+                ASSERT(v->arch.hvm_vmx.pi_block_cpu != -1);
+
+                spin_lock_irqsave(&per_cpu(pi_blocked_vcpu_lock,
+                                  v->arch.hvm_vmx.pi_block_cpu), flags);
+                list_del_init(&v->arch.hvm_vmx.pi_blocked_vcpu_list);
+                spin_unlock_irqrestore(&per_cpu(pi_blocked_vcpu_lock,
+                                       v->arch.hvm_vmx.pi_block_cpu), flags);
+                v->arch.hvm_vmx.pi_block_cpu = -1;
+
+                tasklet_schedule(&v->arch.hvm_vmx.pi_vcpu_wakeup_tasklet);
+
+                return;
+            }
+
+            /*
+             * Change the 'NDST' field to v->arch.hvm_vmx.pi_block_cpu,
+             * so when external interrupts from assigned devices happen,
+             * wakeup notification event will go to
+             * v->arch.hvm_vmx.pi_block_cpu, then in pi_wakeup_interrupt()
+             * we can find the vCPU in the right list to wake up.
+             */
+            if ( x2apic_enabled )
+                new.ndst = cpu_physical_id(v->arch.hvm_vmx.pi_block_cpu);
+            else
+                new.ndst = MASK_INSR(cpu_physical_id(
+                                 v->arch.hvm_vmx.pi_block_cpu),
+                                 PI_xAPIC_NDST_MASK);
+            new.sn = 0;
+            new.nv = pi_wakeup_vector;
+        } while ( cmpxchg(&pi_desc->control, old.control, new.control)
+                  != old.control );
+    }
+}
+
+static void vmx_post_ctx_switch_pi(struct vcpu *v)
+{
+    struct pi_desc *pi_desc = &v->arch.hvm_vmx.pi_desc;
+
+    if ( !iommu_intpost || !is_hvm_vcpu(v) )
+        return;
+
+    ASSERT(pi_desc->sn == 1);
+
+    if ( x2apic_enabled )
+        write_atomic(&pi_desc->ndst, cpu_physical_id(v->processor));
+    else
+        write_atomic(&pi_desc->ndst,
+                     MASK_INSR(cpu_physical_id(v->processor),
+                     PI_xAPIC_NDST_MASK));
+
+    pi_clear_sn(pi_desc);
+}
+
 static void vmx_ctxt_switch_from(struct vcpu *v)
 {
     /*
@@ -739,6 +873,7 @@ static void vmx_ctxt_switch_from(struct vcpu *v)
     vmx_save_guest_msrs(v);
     vmx_restore_host_msrs();
     vmx_save_dr(v);
+    vmx_pre_ctx_switch_pi(v);
 }

 static void vmx_ctxt_switch_to(struct vcpu *v)
@@ -763,6 +898,7 @@ static void vmx_ctxt_switch_to(struct vcpu *v)

     vmx_restore_guest_msrs(v);
     vmx_restore_dr(v);
+    vmx_post_ctx_switch_pi(v);
 }


diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index e621c30..db546eb 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -510,6 +510,7 @@ bool_t nhvm_vmcx_hap_enabled(struct vcpu *v);
 /* interrupt */
 enum hvm_intblk nhvm_interrupt_blocked(struct vcpu *v);

+void arch_vcpu_wake(struct vcpu *v);
 #ifndef NDEBUG
 /* Permit use of the Forced Emulation Prefix in HVM guests */
 extern bool_t opt_hvm_fep;
diff --git a/xen/include/asm-x86/hvm/vmx/vmcs.h b/xen/include/asm-x86/hvm/vmx/vmcs.h
index b6b34d1..ea8fbe5 100644
--- a/xen/include/asm-x86/hvm/vmx/vmcs.h
+++ b/xen/include/asm-x86/hvm/vmx/vmcs.h
@@ -167,6 +167,13 @@ struct arch_vmx_struct {
     struct tasklet       pi_vcpu_wakeup_tasklet;

     struct list_head     pi_blocked_vcpu_list;
+
+    /*
+     * Before a vCPU is blocked, it is added to the global per-cpu list
+     * of 'pi_block_cpu', so that the VT-d engine can send the wakeup
+     * notification event to 'pi_block_cpu' and wake up the related vCPU.
+     */
+    int                  pi_block_cpu;
 };

 int vmx_create_vmcs(struct vcpu *v);

diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 6b02f98..bb18c8c 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -393,6 +393,8 @@ void vcpu_wake(struct vcpu *v)
             vcpu_runstate_change(v, RUNSTATE_offline, NOW());
     }

+    arch_vcpu_wake(v);
+
     vcpu_schedule_unlock_irqrestore(lock, flags, v);

     TRACE_2D(TRC_SCHED_WAKE, v->domain->domain_id, v->vcpu_id);

Thanks,
Feng

> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
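
The descriptor transitions that the three hooks in the patch above are meant
to perform can be summarised in a single-threaded model. This is a sketch
under stated assumptions, not the patch itself: locking, the per-pCPU lists
and atomics are omitted, pCPU numbers stand in for APIC IDs, the two vector
values are arbitrary placeholders, and the xAPIC mask of 0xff00 (destination
ID in NDST bits 15:8) is assumed rather than taken from the posting. All
model_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { MODEL_POSTED_VECTOR = 0xf2, MODEL_WAKEUP_VECTOR = 0xf1 }; /* placeholders */
#define MODEL_XAPIC_NDST_MASK 0xff00u /* assumed xAPIC layout: ID in bits 15:8 */

struct vcpu_model {
    bool on, sn;        /* Outstanding / Suppress Notification */
    uint8_t nv;         /* Notification Vector */
    uint32_t ndst;      /* Notification Destination */
    int processor;      /* pCPU the vCPU last ran on */
    int block_cpu;      /* pCPU whose blocked list holds the vCPU, or -1 */
    bool blocked, x2apic;
};

/* Encode a destination ID for the NDST field. */
static uint32_t model_ndst(uint32_t apic_id, bool x2apic)
{
    return x2apic ? apic_id : (apic_id << 8) & MODEL_XAPIC_NDST_MASK;
}

/* Models vmx_pre_ctx_switch_pi(); returns false if blocking is aborted. */
static bool model_switch_from(struct vcpu_model *v)
{
    if ( !v->blocked )
    {
        v->sn = true;           /* preempted or sleeping: suppress events */
        return true;
    }
    if ( v->on )
        return false;           /* an interrupt is already posted */
    v->block_cpu = v->processor;
    v->ndst = model_ndst((uint32_t)v->block_cpu, v->x2apic);
    v->sn = false;
    v->nv = MODEL_WAKEUP_VECTOR;
    return true;
}

/* Models arch_vcpu_wake(): blocked -> runnable. */
static void model_wake(struct vcpu_model *v)
{
    v->blocked = false;
    v->nv = MODEL_POSTED_VECTOR;
    v->block_cpu = -1;
}

/* Models vmx_post_ctx_switch_pi(): the vCPU is about to run on 'cpu'. */
static void model_switch_to(struct vcpu_model *v, int cpu)
{
    v->processor = cpu;
    v->ndst = model_ndst((uint32_t)cpu, v->x2apic);
    v->sn = false;
}
```

Stepping a model like this through preempt, block, wake and run sequences is
one way to compare the intended field values against those produced by the
runstate-based [v3 14/15] patch.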

^ permalink raw reply related	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-17  7:46                                   ` Wu, Feng
@ 2015-07-17 10:13                                     ` Dario Faggioli
  2015-07-17 22:57                                       ` Wu, Feng
  0 siblings, 1 reply; 155+ messages in thread
From: Dario Faggioli @ 2015-07-17 10:13 UTC (permalink / raw)
  To: Wu, Feng
  Cc: Tian, Kevin, keir@xen.org, George Dunlap,
	andrew.cooper3@citrix.com, xen-devel, Jan Beulich, Zhang, Yang Z


On Fri, 2015-07-17 at 07:46 +0000, Wu, Feng wrote:
> Hi Dario,
> 
Hi,

> I finished the new patch with arch hooks, but seems something is wrong,
> after assigning the NIC to guest, I ping some guy from the guest, the
> latency is too big. 
>
Ok. What numbers are we talking about, just to have an idea?

> So far I've not found the reason after debugging the code
> for some time. 
>
Yeah, well, I can imagine it's a bit tricky.

> I post the path here to see if you can find any obvious logic
> errors in it. Or could you have a look at it to double check whether this patch
> exactly does the same thing as
> [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling,
> which works well. 
>
I will certainly have a look, and I'll try to compare it with the
previous one.

However, I guess you are testing the new version of the series, where
more things than just this changed, or are you actually comparing
branches, the only difference between which is the use of the new or the
old patch?

Also, I don't have any PI enabled hardware handy so, please, keep
looking and debugging yourself, as you're in a better position than me,
as I can only do code inspection.

I'll let you know what I find.

Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-17 10:13                                     ` Dario Faggioli
@ 2015-07-17 22:57                                       ` Wu, Feng
  2015-07-18 13:43                                         ` Dario Faggioli
  0 siblings, 1 reply; 155+ messages in thread
From: Wu, Feng @ 2015-07-17 22:57 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Tian, Kevin, keir@xen.org, George Dunlap,
	andrew.cooper3@citrix.com, xen-devel, Jan Beulich, Zhang, Yang Z,
	Wu, Feng



> -----Original Message-----
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Friday, July 17, 2015 6:14 PM
> To: Wu, Feng
> Cc: Jan Beulich; Tian, Kevin; keir@xen.org; George Dunlap;
> andrew.cooper3@citrix.com; xen-devel; Zhang, Yang Z
> Subject: Re: [Xen-devel] Fwd: [v3 14/15] Update Posted-Interrupts Descriptor
> during vCPU scheduling
> 
> On Fri, 2015-07-17 at 07:46 +0000, Wu, Feng wrote:
> > Hi Dario,
> >
> Hi,
> 
> > I finished the new patch with arch hooks, but seems something is wrong,
> > after assigning the NIC to guest, I ping some guy from the guest, the
> > latency is too big.
> >
> Ok. What numbers are we talking about, just to have an idea.

It can be hundreds of microseconds.

> 
> > So far I've not found the reason after debugging the code
> > for some time.
> >
> Yeah, well, I can imagine it's a bit tricky.
> 
> > I post the path here to see if you can find any obvious logic
> > errors in it. Or could you have a look at it to double check whether this patch
> > exactly does the same thing as
> > [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling,
> > which works well.
> >
> A will certainly have a look, and I'll try to compare it with the
> previous one.
> 
> However, I guess you are testing the new version of the series, where
> more things than just this changed, or are you actually comparing
> branches, the only difference between which is the use of the new or the
> old patch?

Yes, I only changed the scheduler parts compared to the old version.

> 
> Also, I don't have any PI enabled hardware handy so, please, keep
> looking and debugging yourself, as you're in a better position than me,
> as I can only do code inspection.
> 
> I'll let you know what I find.

Sure, I will continue to debug it. Since you are the scheduler expert, it
would be helpful if you could give some comments at the same time! :)

Thanks,
Feng

> 
> Regards,
> Dario
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 155+ messages in thread

* Re: Fwd: [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling
  2015-07-17 22:57                                       ` Wu, Feng
@ 2015-07-18 13:43                                         ` Dario Faggioli
  0 siblings, 0 replies; 155+ messages in thread
From: Dario Faggioli @ 2015-07-18 13:43 UTC (permalink / raw)
  To: Wu, Feng
  Cc: Tian, Kevin, keir@xen.org, George Dunlap,
	andrew.cooper3@citrix.com, xen-devel, Jan Beulich, Zhang, Yang Z


On Fri, 2015-07-17 at 22:57 +0000, Wu, Feng wrote:

> > From: Dario Faggioli [mailto:dario.faggioli@citrix.com]

> > However, I guess you are testing the new version of the series, where
> > more things than just this changed, or are you actually comparing
> > branches, the only difference between which is the use of the new or the
> > old patch?
> 
> Yes, I only changed this scheduler things compared to the old version.
> 
Ok.

> > Also, I don't have any PI enabled hardware handy so, please, keep
> > looking and debugging yourself, as you're in a better position than me,
> > as I can only do code inspection.
> > 
> > I'll let you know what I find.
> 
> Sure, I will continue to debug it. Since you are the scheduler expert, it
> should be helpful if you can give some comments at the same time!:)
> 
I will have a look. I couldn't on Friday, but will get on it on Monday
morning.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 155+ messages in thread

end of thread, other threads:[~2015-07-18 13:43 UTC | newest]

Thread overview: 155+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-24  5:18 [v3 00/15] Add VT-d Posted-Interrupts support Feng Wu
2015-06-24  5:18 ` [v3 01/15] Vt-d Posted-intterrupt (PI) design Feng Wu
2015-06-24  6:15   ` Meng Xu
2015-06-24  6:19     ` Wu, Feng
2015-07-08  7:21   ` Tian, Kevin
2015-07-08  7:29     ` Wu, Feng
2015-06-24  5:18 ` [v3 02/15] Add helper macro for X86_FEATURE_CX16 feature detection Feng Wu
2015-06-24 17:31   ` Andrew Cooper
2015-07-08  7:23   ` Tian, Kevin
2015-06-24  5:18 ` [v3 03/15] Add cmpxchg16b support for x86-64 Feng Wu
2015-06-24 18:35   ` Andrew Cooper
2015-07-08  7:06     ` Wu, Feng
2015-07-08  8:12       ` Jan Beulich
2015-07-08  8:33         ` Wu, Feng
2015-07-08  8:43           ` Jan Beulich
2015-07-08  8:50             ` Wu, Feng
2015-07-08  8:50         ` Andrew Cooper
2015-07-10 12:57   ` Jan Beulich
2015-06-24  5:18 ` [v3 04/15] iommu: Add iommu_intpost to control VT-d Posted-Interrupts feature Feng Wu
2015-06-25  9:06   ` Andrew Cooper
2015-06-25  9:47     ` Wu, Feng
2015-06-25 10:16       ` Andrew Cooper
2015-06-25 12:47         ` Wu, Feng
2015-07-08  7:30   ` Tian, Kevin
2015-06-24  5:18 ` [v3 05/15] vt-d: VT-d Posted-Interrupts feature detection Feng Wu
2015-06-25 10:21   ` Andrew Cooper
2015-06-25 13:02     ` Wu, Feng
2015-07-08  7:32   ` Tian, Kevin
2015-07-08  8:00     ` Wu, Feng
2015-06-24  5:18 ` [v3 06/15] vmx: Extend struct pi_desc to support VT-d Posted-Interrupts Feng Wu
2015-06-29 15:04   ` Andrew Cooper
2015-07-08  7:48   ` Tian, Kevin
2015-07-10 13:08   ` Jan Beulich
2015-07-15  2:40     ` Wu, Feng
2015-07-15  8:20       ` Jan Beulich
2015-07-15  8:26         ` Wu, Feng
2015-07-15  8:36           ` Jan Beulich
2015-07-15  8:43             ` Wu, Feng
2015-07-15  9:28               ` Jan Beulich
2015-07-15  9:30                 ` Wu, Feng
2015-07-15  3:13     ` Wu, Feng
2015-06-24  5:18 ` [v3 07/15] vmx: Initialize VT-d Posted-Interrupts Descriptor Feng Wu
2015-06-29 15:32   ` Andrew Cooper
2015-06-30  1:46     ` Wu, Feng
2015-06-30  2:32     ` Dario Faggioli
2015-07-08  7:53   ` Tian, Kevin
2015-06-24  5:18 ` [v3 08/15] Suppress posting interrupts when 'SN' is set Feng Wu
2015-06-29 15:41   ` Andrew Cooper
2015-06-30  1:48     ` Wu, Feng
2015-07-08  9:06   ` Tian, Kevin
2015-07-08 10:11     ` Wu, Feng
2015-07-08 11:31       ` Tian, Kevin
2015-07-08 11:58         ` Wu, Feng
2015-07-10 13:20   ` Jan Beulich
2015-06-24  5:18 ` [v3 09/15] vt-d: Extend struct iremap_entry to support VT-d Posted-Interrupts Feng Wu
2015-06-29 16:04   ` Andrew Cooper
2015-06-30  1:52     ` Wu, Feng
2015-07-08  9:10   ` Tian, Kevin
2015-07-10 13:27   ` Jan Beulich
2015-06-24  5:18 ` [v3 10/15] vt-d: Add API to update IRTE when VT-d PI is used Feng Wu
2015-06-29 16:22   ` Andrew Cooper
2015-07-08  9:59   ` Tian, Kevin
2015-07-08 10:12     ` Wu, Feng
2015-07-10 14:01   ` Jan Beulich
2015-07-15  6:04     ` Wu, Feng
2015-07-15  8:24       ` Jan Beulich
2015-07-15  8:38         ` Wu, Feng
2015-07-15  8:46           ` Jan Beulich
2015-07-15  8:55             ` Wu, Feng
2015-07-15  9:32               ` Jan Beulich
2015-06-24  5:18 ` [v3 11/15] Update IRTE according to guest interrupt config changes Feng Wu
2015-06-29 16:46   ` Andrew Cooper
2015-07-08 10:22   ` Tian, Kevin
2015-07-08 10:31     ` Wu, Feng
2015-07-08 11:46       ` Tian, Kevin
2015-07-08 11:52         ` Wu, Feng
2015-07-08 11:54           ` Tian, Kevin
2015-07-10 14:23   ` Jan Beulich
2015-06-24  5:18 ` [v3 12/15] vmx: posted-interrupt handling when vCPU is blocked Feng Wu
2015-06-29 17:07   ` Andrew Cooper
2015-07-08 10:36     ` Wu, Feng
2015-07-08 10:48       ` Jan Beulich
     [not found]   ` <559181F9.6020106@citrix.com>
2015-06-30  2:51     ` Fwd: " Dario Faggioli
2015-06-30  2:59       ` Wu, Feng
2015-06-30  9:46         ` Dario Faggioli
2015-06-30 10:11   ` Andrew Cooper
2015-07-01 13:26     ` Dario Faggioli
2015-07-02  4:27       ` Wu, Feng
2015-07-02  8:30         ` Dario Faggioli
2015-07-02  8:58           ` Wu, Feng
2015-07-02 10:09             ` Dario Faggioli
2015-07-02 10:41               ` Wu, Feng
2015-07-02 10:30           ` Andrew Cooper
2015-07-02 10:56             ` Wu, Feng
2015-07-02 12:04             ` Dario Faggioli
2015-07-02 12:10               ` Wu, Feng
2015-07-02 12:16               ` Andrew Cooper
2015-07-02 12:38                 ` Dario Faggioli
2015-07-02 12:59                   ` Andrew Cooper
2015-07-03  1:33                     ` Wu, Feng
2015-07-02  4:25     ` Wu, Feng
2015-07-08 11:00   ` Tian, Kevin
2015-07-08 11:02     ` Wu, Feng
2015-07-08 12:46     ` Jan Beulich
2015-07-08 13:09       ` Andrew Cooper
2015-07-08 22:49         ` Tian, Kevin
2015-07-09  7:25           ` Jan Beulich
2015-07-10  6:21             ` Wu, Feng
2015-07-10  6:32               ` Jan Beulich
2015-07-10  7:29                 ` Wu, Feng
2015-07-10  8:49                   ` Jan Beulich
2015-07-10  8:57                     ` Wu, Feng
2015-07-08 22:31       ` Tian, Kevin
2015-06-24  5:18 ` [v3 13/15] vmx: Properly handle notification event when vCPU is running Feng Wu
2015-07-08 11:03   ` Tian, Kevin
2015-07-10 14:40   ` Jan Beulich
2015-06-24  5:18 ` [v3 14/15] Update Posted-Interrupts Descriptor during vCPU scheduling Feng Wu
     [not found]   ` <55918214.4030102@citrix.com>
2015-06-30  2:58     ` Fwd: " Dario Faggioli
2015-07-02  4:32       ` Wu, Feng
2015-07-02  4:34         ` Wu, Feng
2015-07-02  8:20         ` Dario Faggioli
2015-07-09  3:09           ` Wu, Feng
2015-07-09  8:18             ` Dario Faggioli
2015-07-09 11:19             ` George Dunlap
2015-07-09 11:29               ` George Dunlap
2015-07-09 11:38               ` Wu, Feng
2015-07-09 12:42                 ` Dario Faggioli
2015-07-10  0:07                   ` Wu, Feng
2015-07-10 12:40                     ` Dario Faggioli
2015-07-10 13:47                       ` Konrad Rzeszutek Wilk
2015-07-10 13:59                         ` Dario Faggioli
2015-07-09 12:53                 ` George Dunlap
2015-07-09 13:44                   ` Jan Beulich
2015-07-09 14:18                     ` Dario Faggioli
2015-07-09 14:27                       ` George Dunlap
2015-07-09 14:47                         ` Dario Faggioli
2015-07-10  5:59                         ` Wu, Feng
2015-07-10  6:22                           ` Jan Beulich
2015-07-10 11:05                             ` Dario Faggioli
2015-07-14  5:44                               ` Wu, Feng
2015-07-14 14:08                               ` Wu, Feng
2015-07-14 14:54                                 ` Jan Beulich
2015-07-14 15:20                                   ` Dario Faggioli
2015-07-14 16:41                                     ` George Dunlap
2015-07-14 16:02                                 ` Dario Faggioli
2015-07-15  0:54                                   ` Wu, Feng
2015-07-17  7:46                                   ` Wu, Feng
2015-07-17 10:13                                     ` Dario Faggioli
2015-07-17 22:57                                       ` Wu, Feng
2015-07-18 13:43                                         ` Dario Faggioli
2015-07-10  0:15                   ` Wu, Feng
2015-07-08 11:24   ` Tian, Kevin
2015-07-10 14:48   ` Jan Beulich
2015-06-24  5:18 ` [v3 15/15] Add a command line parameter for VT-d posted-interrupts Feng Wu
2015-07-08 11:25   ` Tian, Kevin
