LKML Archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/7] Linux RISC-V IOMMU Support
@ 2024-04-18 16:32 Tomasz Jeznach
  2024-04-18 16:32 ` [PATCH v2 1/7] dt-bindings: iommu: riscv: Add bindings for RISC-V IOMMU Tomasz Jeznach
                   ` (6 more replies)
  0 siblings, 7 replies; 30+ messages in thread
From: Tomasz Jeznach @ 2024-04-18 16:32 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley
  Cc: Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux,
	Tomasz Jeznach

This patch series introduces support for RISC-V IOMMU architected
hardware into the Linux kernel.

The RISC-V IOMMU specification, which this series is based on, is
ratified and available at GitHub/riscv-non-isa [1].

At a high level, the RISC-V IOMMU specification defines:

1) Data structures:
  - Device-context: Associates devices with address spaces and holds
    per-device parameters for address translations.
  - Process-contexts: Associates different virtual address spaces based
    on device-provided process identification numbers.
  - MSI page table configuration used to direct an MSI to a guest
    interrupt file in an IMSIC.
2) In-memory queue interface:
  - Command-queue for issuing commands to the IOMMU.
  - Fault/event queue for reporting faults and events.
  - Page-request queue for reporting "Page Request" messages received
    from PCIe devices.
  - Message-signaled and wire-signaled interrupt mechanisms.
3) Memory-mapped programming interface:
  - Mandatory and optional register layout and description.
  - Software guidelines for device initialization and capabilities discovery.


This series introduces RISC-V IOMMU hardware initialization and complete
single-stage translation with paging domain support.

The patches are organized as follows:

Patch 1: Introduces minimal required device tree bindings for the driver.
Patch 2: Defines RISC-V IOMMU data structures, hardware programming interface
         registers layout, and minimal initialization code for enabling global
         pass-through for all connected masters.
Patch 3: Implements the device driver for PCIe implementation of RISC-V IOMMU
         architected hardware.
Patch 4: Introduces IOMMU interfaces to the kernel subsystem.
Patch 5: Implements device directory management with discovery sequences for
         I/O mapped or in-memory device directory table location, hardware
         capabilities discovery, and device to domain attach implementation.
Patch 6: Implements command and fault queue, and introduces directory cache
         invalidation sequences.
Patch 7: Implements paging domain, with page table using the same format as the
         CPU’s MMU. This patch series enables only 4K mappings; complete support
         for large page mappings will be introduced in follow-up patch series.

Follow-up patch series, providing large page support and updated walk cache
management based on the revised specification, and complete ATS/PRI/SVA support,
will be posted to GitHub [2] in the next few days.

Changes from v1:

  This version includes major reorganization of the code related to queue
  and page table management, removal of all ATS/PRI/SVA features to be addressed
  in follow-up patch series, removal of unnecessary checks, and adoption of new
  interfaces for identity and paging domain allocations.

Apologies for the delay in sending v2 series, and thank you for valuable feedback
and patience with last patch series.

Best regards,
 Tomasz Jeznach

[1] link: https://github.com/riscv-non-isa/riscv-iommu
[2] link: https://github.com/tjeznach/linux
v1 link:  https://lore.kernel.org/linux-iommu/cover.1689792825.git.tjeznach@rivosinc.com/

Tomasz Jeznach (7):
  dt-bindings: iommu: riscv: Add bindings for RISC-V IOMMU
  iommu/riscv: Add RISC-V IOMMU platform device driver
  iommu/riscv: Add RISC-V IOMMU PCIe device driver
  iommu/riscv: Enable IOMMU registration and device probe.
  iommu/riscv: Device directory management.
  iommu/riscv: Command and fault queue support
  iommu/riscv: Paging domain support

 .../bindings/iommu/riscv,iommu.yaml           |  149 ++
 MAINTAINERS                                   |   14 +
 drivers/iommu/Kconfig                         |    1 +
 drivers/iommu/Makefile                        |    2 +-
 drivers/iommu/riscv/Kconfig                   |   23 +
 drivers/iommu/riscv/Makefile                  |    3 +
 drivers/iommu/riscv/iommu-bits.h              |  782 +++++++++
 drivers/iommu/riscv/iommu-pci.c               |  154 ++
 drivers/iommu/riscv/iommu-platform.c          |   94 ++
 drivers/iommu/riscv/iommu.c                   | 1441 +++++++++++++++++
 drivers/iommu/riscv/iommu.h                   |   88 +
 11 files changed, 2750 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
 create mode 100644 drivers/iommu/riscv/Kconfig
 create mode 100644 drivers/iommu/riscv/Makefile
 create mode 100644 drivers/iommu/riscv/iommu-bits.h
 create mode 100644 drivers/iommu/riscv/iommu-pci.c
 create mode 100644 drivers/iommu/riscv/iommu-platform.c
 create mode 100644 drivers/iommu/riscv/iommu.c
 create mode 100644 drivers/iommu/riscv/iommu.h


base-commit: 0bbac3facb5d6cc0171c45c9873a2dc96bea9680
-- 
2.34.1


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2 1/7] dt-bindings: iommu: riscv: Add bindings for RISC-V IOMMU
  2024-04-18 16:32 [PATCH v2 0/7] Linux RISC-V IOMMU Support Tomasz Jeznach
@ 2024-04-18 16:32 ` Tomasz Jeznach
  2024-04-18 17:04   ` Conor Dooley
  2024-04-22 14:04   ` Rob Herring
  2024-04-18 16:32 ` [PATCH v2 2/7] iommu/riscv: Add RISC-V IOMMU platform device driver Tomasz Jeznach
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 30+ messages in thread
From: Tomasz Jeznach @ 2024-04-18 16:32 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley
  Cc: Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux,
	Tomasz Jeznach

Add bindings for the RISC-V IOMMU device drivers.

Co-developed-by: Anup Patel <apatel@ventanamicro.com>
Signed-off-by: Anup Patel <apatel@ventanamicro.com>
Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
---
 .../bindings/iommu/riscv,iommu.yaml           | 149 ++++++++++++++++++
 MAINTAINERS                                   |   7 +
 2 files changed, 156 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/iommu/riscv,iommu.yaml

diff --git a/Documentation/devicetree/bindings/iommu/riscv,iommu.yaml b/Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
new file mode 100644
index 000000000000..d6522ddd43fa
--- /dev/null
+++ b/Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
@@ -0,0 +1,149 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/iommu/riscv,iommu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: RISC-V IOMMU Architecture Implementation
+
+maintainers:
+  - Tomasz Jeznach <tjeznach@rivosinc.com>
+
+description: |+
+  The RISC-V IOMMU provides memory address translation and isolation for
+  input and output devices, supporting per-device translation context,
+  shared process address spaces including the ATS and PRI components of
+  the PCIe specification, two stage address translation and MSI remapping.
+  It supports identical translation table format to the RISC-V address
+  translation tables with page level access and protection attributes.
+  Hardware uses in-memory command and fault reporting queues with wired
+  interrupt or MSI notifications.
+
+  Visit https://github.com/riscv-non-isa/riscv-iommu for more details.
+
+  For information on assigning RISC-V IOMMU to its peripheral devices,
+  see generic IOMMU bindings.
+
+properties:
+  # For PCIe IOMMU hardware compatible property should contain the vendor
+  # and device ID according to the PCI Bus Binding specification.
+  # Since PCI provides built-in identification methods, compatible is not
+  # actually required. For non-PCIe hardware implementations 'riscv,iommu'
+  # should be specified along with 'reg' property providing MMIO location.
+  compatible:
+    oneOf:
+      - items:
+          - const: riscv,pci-iommu
+          - const: pci1efd,edf1
+      - items:
+          - const: pci1efd,edf1
+      - items:
+          - const: riscv,iommu
+
+  reg:
+    maxItems: 1
+    description:
+      For non-PCI devices this represents base address and size of for the
+      IOMMU memory mapped registers interface.
+      For PCI IOMMU hardware implementation this should represent an address
+      of the IOMMU, as defined in the PCI Bus Binding reference. The reg
+      property is a five-cell address encoded as (phys.hi phys.mid phys.lo
+      size.hi size.lo), where phys.hi should contain the device's BDF as
+      0b00000000 bbbbbbbb dddddfff 00000000. The other cells should be zero.
+
+  '#iommu-cells':
+    const: 1
+    description:
+      Has to be one. The single cell describes the requester id emitted
+      by a master to the IOMMU.
+
+  interrupts:
+    minItems: 1
+    maxItems: 4
+    description:
+      Wired interrupt vectors available for RISC-V IOMMU to notify the
+      RISC-V HARTS. The cause to interrupt vector is software defined
+      using IVEC IOMMU register.
+
+  msi-parent: true
+
+  power-domains:
+    maxItems: 1
+
+required:
+  - compatible
+  - reg
+  - '#iommu-cells'
+
+additionalProperties: false
+
+examples:
+  - |+
+    /* Example 1 (IOMMU device with wired interrupts) */
+    #include <dt-bindings/interrupt-controller/irq.h>
+
+    iommu1: iommu@1bccd000 {
+        compatible = "riscv,iommu";
+        reg = <0x1bccd000 0x1000>;
+        interrupt-parent = <&aplic_smode>;
+        interrupts = <32 IRQ_TYPE_LEVEL_HIGH>,
+                     <33 IRQ_TYPE_LEVEL_HIGH>,
+                     <34 IRQ_TYPE_LEVEL_HIGH>,
+                     <35 IRQ_TYPE_LEVEL_HIGH>;
+        #iommu-cells = <1>;
+    };
+
+    /* Device with two IOMMU device IDs, 0 and 7 */
+    master1 {
+        iommus = <&iommu1 0>, <&iommu1 7>;
+    };
+
+  - |+
+    /* Example 2 (IOMMU device with shared wired interrupt) */
+    #include <dt-bindings/interrupt-controller/irq.h>
+
+    iommu2: iommu@1bccd000 {
+        compatible = "riscv,iommu";
+        reg = <0x1bccd000 0x1000>;
+        interrupt-parent = <&aplic_smode>;
+        interrupts = <32 IRQ_TYPE_LEVEL_HIGH>;
+        #iommu-cells = <1>;
+    };
+
+  - |+
+    /* Example 3 (IOMMU device with MSIs) */
+    iommu3: iommu@1bcdd000 {
+        compatible = "riscv,iommu";
+        reg = <0x1bccd000 0x1000>;
+        msi-parent = <&imsics_smode>;
+        #iommu-cells = <1>;
+    };
+
+  - |+
+    /* Example 4 (IOMMU PCIe device with MSIs) */
+    bus {
+        #address-cells = <2>;
+        #size-cells = <2>;
+
+        pcie@30000000 {
+            device_type = "pci";
+            #address-cells = <3>;
+            #size-cells = <2>;
+            reg = <0x0 0x30000000  0x0 0x1000000>;
+            ranges = <0x02000000 0x0 0x41000000  0x0 0x41000000  0x0 0x0f000000>;
+
+            /*
+             * The IOMMU manages all functions in this PCI domain except
+             * itself. Omit BDF 00:01.0.
+             */
+            iommu-map = <0x0 &iommu0 0x0 0x8
+                         0x9 &iommu0 0x9 0xfff7>;
+
+            /* The IOMMU programming interface uses slot 00:01.0 */
+            iommu0: iommu@1,0 {
+               compatible = "pci1efd,edf1";
+               reg = <0x800 0 0 0 0>;
+               #iommu-cells = <1>;
+            };
+        };
+    };
diff --git a/MAINTAINERS b/MAINTAINERS
index c23fda1aa1f0..2657f9eae84c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18966,6 +18966,13 @@ F:	arch/riscv/
 N:	riscv
 K:	riscv
 
+RISC-V IOMMU
+M:	Tomasz Jeznach <tjeznach@rivosinc.com>
+L:	iommu@lists.linux.dev
+L:	linux-riscv@lists.infradead.org
+S:	Maintained
+F:	Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
+
 RISC-V MICROCHIP FPGA SUPPORT
 M:	Conor Dooley <conor.dooley@microchip.com>
 M:	Daire McNamara <daire.mcnamara@microchip.com>
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 2/7] iommu/riscv: Add RISC-V IOMMU platform device driver
  2024-04-18 16:32 [PATCH v2 0/7] Linux RISC-V IOMMU Support Tomasz Jeznach
  2024-04-18 16:32 ` [PATCH v2 1/7] dt-bindings: iommu: riscv: Add bindings for RISC-V IOMMU Tomasz Jeznach
@ 2024-04-18 16:32 ` Tomasz Jeznach
  2024-04-18 21:22   ` Robin Murphy
  2024-04-18 16:32 ` [PATCH v2 3/7] iommu/riscv: Add RISC-V IOMMU PCIe " Tomasz Jeznach
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Tomasz Jeznach @ 2024-04-18 16:32 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley
  Cc: Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux,
	Tomasz Jeznach

Introduce platform device driver for implementation of RISC-V IOMMU
architected hardware.

Hardware interface definition located in file iommu-bits.h is based on
ratified RISC-V IOMMU Architecture Specification version 1.0.0.

This patch implements platform device initialization, early check and
configuration of the IOMMU interfaces and enables global pass-through
address translation mode (iommu_mode == BARE), without registering
hardware instance in the IOMMU subsystem.

Link: https://github.com/riscv-non-isa/riscv-iommu
Co-developed-by: Nick Kossifidis <mick@ics.forth.gr>
Signed-off-by: Nick Kossifidis <mick@ics.forth.gr>
Co-developed-by: Sebastien Boeuf <seb@rivosinc.com>
Signed-off-by: Sebastien Boeuf <seb@rivosinc.com>
Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
---
 MAINTAINERS                          |   6 +
 drivers/iommu/Kconfig                |   1 +
 drivers/iommu/Makefile               |   2 +-
 drivers/iommu/riscv/Kconfig          |  16 +
 drivers/iommu/riscv/Makefile         |   2 +
 drivers/iommu/riscv/iommu-bits.h     | 707 +++++++++++++++++++++++++++
 drivers/iommu/riscv/iommu-platform.c |  94 ++++
 drivers/iommu/riscv/iommu.c          |  89 ++++
 drivers/iommu/riscv/iommu.h          |  62 +++
 9 files changed, 978 insertions(+), 1 deletion(-)
 create mode 100644 drivers/iommu/riscv/Kconfig
 create mode 100644 drivers/iommu/riscv/Makefile
 create mode 100644 drivers/iommu/riscv/iommu-bits.h
 create mode 100644 drivers/iommu/riscv/iommu-platform.c
 create mode 100644 drivers/iommu/riscv/iommu.c
 create mode 100644 drivers/iommu/riscv/iommu.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 2657f9eae84c..051599c76585 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18972,6 +18972,12 @@ L:	iommu@lists.linux.dev
 L:	linux-riscv@lists.infradead.org
 S:	Maintained
 F:	Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
+F:	drivers/iommu/riscv/Kconfig
+F:	drivers/iommu/riscv/Makefile
+F:	drivers/iommu/riscv/iommu-bits.h
+F:	drivers/iommu/riscv/iommu-platform.c
+F:	drivers/iommu/riscv/iommu.c
+F:	drivers/iommu/riscv/iommu.h
 
 RISC-V MICROCHIP FPGA SUPPORT
 M:	Conor Dooley <conor.dooley@microchip.com>
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 0af39bbbe3a3..ae762db0365e 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -195,6 +195,7 @@ config MSM_IOMMU
 source "drivers/iommu/amd/Kconfig"
 source "drivers/iommu/intel/Kconfig"
 source "drivers/iommu/iommufd/Kconfig"
+source "drivers/iommu/riscv/Kconfig"
 
 config IRQ_REMAP
 	bool "Support for Interrupt Remapping"
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 542760d963ec..5e5a83c6c2aa 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
-obj-y += amd/ intel/ arm/ iommufd/
+obj-y += amd/ intel/ arm/ iommufd/ riscv/
 obj-$(CONFIG_IOMMU_API) += iommu.o
 obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
diff --git a/drivers/iommu/riscv/Kconfig b/drivers/iommu/riscv/Kconfig
new file mode 100644
index 000000000000..d02326bddb4c
--- /dev/null
+++ b/drivers/iommu/riscv/Kconfig
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0-only
+# RISC-V IOMMU support
+
+config RISCV_IOMMU
+	def_bool y if RISCV && 64BIT && MMU
+	depends on RISCV && 64BIT && MMU
+	select DMA_OPS
+	select IOMMU_API
+	select IOMMU_IOVA
+	help
+	  Support for implementations of the RISC-V IOMMU architecture that
+	  complements the RISC-V MMU capabilities, providing similar address
+	  translation and protection functions for accesses from I/O devices.
+
+	  Say Y here if your SoC includes an IOMMU device implementing
+	  the RISC-V IOMMU architecture.
diff --git a/drivers/iommu/riscv/Makefile b/drivers/iommu/riscv/Makefile
new file mode 100644
index 000000000000..e4c189de58d3
--- /dev/null
+++ b/drivers/iommu/riscv/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_RISCV_IOMMU) += iommu.o iommu-platform.o
diff --git a/drivers/iommu/riscv/iommu-bits.h b/drivers/iommu/riscv/iommu-bits.h
new file mode 100644
index 000000000000..ba093c29de9f
--- /dev/null
+++ b/drivers/iommu/riscv/iommu-bits.h
@@ -0,0 +1,707 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright © 2022-2024 Rivos Inc.
+ * Copyright © 2023 FORTH-ICS/CARV
+ * Copyright © 2023 RISC-V IOMMU Task Group
+ *
+ * RISC-V IOMMU - Register Layout and Data Structures.
+ *
+ * Based on the 'RISC-V IOMMU Architecture Specification', Version 1.0
+ * Published at  https://github.com/riscv-non-isa/riscv-iommu
+ *
+ */
+
+#ifndef _RISCV_IOMMU_BITS_H_
+#define _RISCV_IOMMU_BITS_H_
+
+#include <linux/types.h>
+#include <linux/bitfield.h>
+#include <linux/bits.h>
+
+/*
+ * Chapter 5: Memory Mapped register interface
+ */
+
+/* Common field positions */
+#define RISCV_IOMMU_PPN_FIELD		GENMASK_ULL(53, 10)
+#define RISCV_IOMMU_QUEUE_LOGSZ_FIELD	GENMASK_ULL(4, 0)
+#define RISCV_IOMMU_QUEUE_INDEX_FIELD	GENMASK_ULL(31, 0)
+#define RISCV_IOMMU_QUEUE_ENABLE	BIT(0)
+#define RISCV_IOMMU_QUEUE_INTR_ENABLE	BIT(1)
+#define RISCV_IOMMU_QUEUE_MEM_FAULT	BIT(8)
+#define RISCV_IOMMU_QUEUE_OVERFLOW	BIT(9)
+#define RISCV_IOMMU_QUEUE_ACTIVE	BIT(16)
+#define RISCV_IOMMU_QUEUE_BUSY		BIT(17)
+
+#define RISCV_IOMMU_ATP_PPN_FIELD	GENMASK_ULL(43, 0)
+#define RISCV_IOMMU_ATP_MODE_FIELD	GENMASK_ULL(63, 60)
+
+/* 5.3 IOMMU Capabilities (64bits) */
+#define RISCV_IOMMU_REG_CAP		0x0000
+#define RISCV_IOMMU_CAP_VERSION		GENMASK_ULL(7, 0)
+#define RISCV_IOMMU_CAP_S_SV32		BIT_ULL(8)
+#define RISCV_IOMMU_CAP_S_SV39		BIT_ULL(9)
+#define RISCV_IOMMU_CAP_S_SV48		BIT_ULL(10)
+#define RISCV_IOMMU_CAP_S_SV57		BIT_ULL(11)
+#define RISCV_IOMMU_CAP_SVPBMT		BIT_ULL(15)
+#define RISCV_IOMMU_CAP_G_SV32		BIT_ULL(16)
+#define RISCV_IOMMU_CAP_G_SV39		BIT_ULL(17)
+#define RISCV_IOMMU_CAP_G_SV48		BIT_ULL(18)
+#define RISCV_IOMMU_CAP_G_SV57		BIT_ULL(19)
+#define RISCV_IOMMU_CAP_AMO_MRIF	BIT_ULL(21)
+#define RISCV_IOMMU_CAP_MSI_FLAT	BIT_ULL(22)
+#define RISCV_IOMMU_CAP_MSI_MRIF	BIT_ULL(23)
+#define RISCV_IOMMU_CAP_AMO_HWAD	BIT_ULL(24)
+#define RISCV_IOMMU_CAP_ATS		BIT_ULL(25)
+#define RISCV_IOMMU_CAP_T2GPA		BIT_ULL(26)
+#define RISCV_IOMMU_CAP_END		BIT_ULL(27)
+#define RISCV_IOMMU_CAP_IGS		GENMASK_ULL(29, 28)
+#define RISCV_IOMMU_CAP_HPM		BIT_ULL(30)
+#define RISCV_IOMMU_CAP_DBG		BIT_ULL(31)
+#define RISCV_IOMMU_CAP_PAS		GENMASK_ULL(37, 32)
+#define RISCV_IOMMU_CAP_PD8		BIT_ULL(38)
+#define RISCV_IOMMU_CAP_PD17		BIT_ULL(39)
+#define RISCV_IOMMU_CAP_PD20		BIT_ULL(40)
+
+#define RISCV_IOMMU_CAP_VERSION_VER_MASK	0xF0
+#define RISCV_IOMMU_CAP_VERSION_REV_MASK	0x0F
+
+/**
+ * enum riscv_iommu_igs_settings - Interrupt Generation Support Settings
+ * @RISCV_IOMMU_CAP_IGS_MSI: I/O MMU supports only MSI generation
+ * @RISCV_IOMMU_CAP_IGS_WSI: I/O MMU supports only Wired-Signaled interrupt
+ * @RISCV_IOMMU_CAP_IGS_BOTH: I/O MMU supports both MSI and WSI generation
+ * @RISCV_IOMMU_CAP_IGS_RSRV: Reserved for standard use
+ */
+enum riscv_iommu_igs_settings {
+	RISCV_IOMMU_CAP_IGS_MSI = 0,
+	RISCV_IOMMU_CAP_IGS_WSI = 1,
+	RISCV_IOMMU_CAP_IGS_BOTH = 2,
+	RISCV_IOMMU_CAP_IGS_RSRV = 3
+};
+
+/* 5.4 Features control register (32bits) */
+#define RISCV_IOMMU_REG_FCTL		0x0008
+#define RISCV_IOMMU_FCTL_BE		BIT(0)
+#define RISCV_IOMMU_FCTL_WSI		BIT(1)
+#define RISCV_IOMMU_FCTL_GXL		BIT(2)
+
+/* 5.5 Device-directory-table pointer (64bits) */
+#define RISCV_IOMMU_REG_DDTP		0x0010
+#define RISCV_IOMMU_DDTP_MODE		GENMASK_ULL(3, 0)
+#define RISCV_IOMMU_DDTP_BUSY		BIT_ULL(4)
+#define RISCV_IOMMU_DDTP_PPN		RISCV_IOMMU_PPN_FIELD
+
+/**
+ * enum riscv_iommu_ddtp_modes - I/O MMU translation modes
+ * @RISCV_IOMMU_DDTP_MODE_OFF: No inbound transactions allowed
+ * @RISCV_IOMMU_DDTP_MODE_BARE: Pass-through mode
+ * @RISCV_IOMMU_DDTP_MODE_1LVL: One-level DDT
+ * @RISCV_IOMMU_DDTP_MODE_2LVL: Two-level DDT
+ * @RISCV_IOMMU_DDTP_MODE_3LVL: Three-level DDT
+ * @RISCV_IOMMU_DDTP_MODE_MAX: Max value allowed by specification
+ */
+enum riscv_iommu_ddtp_modes {
+	RISCV_IOMMU_DDTP_MODE_OFF = 0,
+	RISCV_IOMMU_DDTP_MODE_BARE = 1,
+	RISCV_IOMMU_DDTP_MODE_1LVL = 2,
+	RISCV_IOMMU_DDTP_MODE_2LVL = 3,
+	RISCV_IOMMU_DDTP_MODE_3LVL = 4,
+	RISCV_IOMMU_DDTP_MODE_MAX = 4
+};
+
+/* 5.6 Command Queue Base (64bits) */
+#define RISCV_IOMMU_REG_CQB		0x0018
+#define RISCV_IOMMU_CQB_ENTRIES		RISCV_IOMMU_QUEUE_LOGSZ_FIELD
+#define RISCV_IOMMU_CQB_PPN		RISCV_IOMMU_PPN_FIELD
+
+/* 5.7 Command Queue head (32bits) */
+#define RISCV_IOMMU_REG_CQH		0x0020
+#define RISCV_IOMMU_CQH_INDEX		RISCV_IOMMU_QUEUE_INDEX_FIELD
+
+/* 5.8 Command Queue tail (32bits) */
+#define RISCV_IOMMU_REG_CQT		0x0024
+#define RISCV_IOMMU_CQT_INDEX		RISCV_IOMMU_QUEUE_INDEX_FIELD
+
+/* 5.9 Fault Queue Base (64bits) */
+#define RISCV_IOMMU_REG_FQB		0x0028
+#define RISCV_IOMMU_FQB_ENTRIES		RISCV_IOMMU_QUEUE_LOGSZ_FIELD
+#define RISCV_IOMMU_FQB_PPN		RISCV_IOMMU_PPN_FIELD
+
+/* 5.10 Fault Queue Head (32bits) */
+#define RISCV_IOMMU_REG_FQH		0x0030
+#define RISCV_IOMMU_FQH_INDEX		RISCV_IOMMU_QUEUE_INDEX_FIELD
+
+/* 5.11 Fault Queue tail (32bits) */
+#define RISCV_IOMMU_REG_FQT		0x0034
+#define RISCV_IOMMU_FQT_INDEX		RISCV_IOMMU_QUEUE_INDEX_FIELD
+
+/* 5.12 Page Request Queue base (64bits) */
+#define RISCV_IOMMU_REG_PQB		0x0038
+#define RISCV_IOMMU_PQB_ENTRIES		RISCV_IOMMU_QUEUE_LOGSZ_FIELD
+#define RISCV_IOMMU_PQB_PPN		RISCV_IOMMU_PPN_FIELD
+
+/* 5.13 Page Request Queue head (32bits) */
+#define RISCV_IOMMU_REG_PQH		0x0040
+#define RISCV_IOMMU_PQH_INDEX		RISCV_IOMMU_QUEUE_INDEX_FIELD
+
+/* 5.14 Page Request Queue tail (32bits) */
+#define RISCV_IOMMU_REG_PQT		0x0044
+#define RISCV_IOMMU_PQT_INDEX_MASK	RISCV_IOMMU_QUEUE_INDEX_FIELD
+
+/* 5.15 Command Queue CSR (32bits) */
+#define RISCV_IOMMU_REG_CQCSR		0x0048
+#define RISCV_IOMMU_CQCSR_CQEN		RISCV_IOMMU_QUEUE_ENABLE
+#define RISCV_IOMMU_CQCSR_CIE		RISCV_IOMMU_QUEUE_INTR_ENABLE
+#define RISCV_IOMMU_CQCSR_CQMF		RISCV_IOMMU_QUEUE_MEM_FAULT
+#define RISCV_IOMMU_CQCSR_CMD_TO	BIT(9)
+#define RISCV_IOMMU_CQCSR_CMD_ILL	BIT(10)
+#define RISCV_IOMMU_CQCSR_FENCE_W_IP	BIT(11)
+#define RISCV_IOMMU_CQCSR_CQON		RISCV_IOMMU_QUEUE_ACTIVE
+#define RISCV_IOMMU_CQCSR_BUSY		RISCV_IOMMU_QUEUE_BUSY
+
+/* 5.16 Fault Queue CSR (32bits) */
+#define RISCV_IOMMU_REG_FQCSR		0x004C
+#define RISCV_IOMMU_FQCSR_FQEN		RISCV_IOMMU_QUEUE_ENABLE
+#define RISCV_IOMMU_FQCSR_FIE		RISCV_IOMMU_QUEUE_INTR_ENABLE
+#define RISCV_IOMMU_FQCSR_FQMF		RISCV_IOMMU_QUEUE_MEM_FAULT
+#define RISCV_IOMMU_FQCSR_FQOF		RISCV_IOMMU_QUEUE_OVERFLOW
+#define RISCV_IOMMU_FQCSR_FQON		RISCV_IOMMU_QUEUE_ACTIVE
+#define RISCV_IOMMU_FQCSR_BUSY		RISCV_IOMMU_QUEUE_BUSY
+
+/* 5.17 Page Request Queue CSR (32bits) */
+#define RISCV_IOMMU_REG_PQCSR		0x0050
+#define RISCV_IOMMU_PQCSR_PQEN		RISCV_IOMMU_QUEUE_ENABLE
+#define RISCV_IOMMU_PQCSR_PIE		RISCV_IOMMU_QUEUE_INTR_ENABLE
+#define RISCV_IOMMU_PQCSR_PQMF		RISCV_IOMMU_QUEUE_MEM_FAULT
+#define RISCV_IOMMU_PQCSR_PQOF		RISCV_IOMMU_QUEUE_OVERFLOW
+#define RISCV_IOMMU_PQCSR_PQON		RISCV_IOMMU_QUEUE_ACTIVE
+#define RISCV_IOMMU_PQCSR_BUSY		RISCV_IOMMU_QUEUE_BUSY
+
+/* 5.18 Interrupt Pending Status (32bits) */
+#define RISCV_IOMMU_REG_IPSR		0x0054
+
+#define RISCV_IOMMU_INTR_CQ		0
+#define RISCV_IOMMU_INTR_FQ		1
+#define RISCV_IOMMU_INTR_PM		2
+#define RISCV_IOMMU_INTR_PQ		3
+#define RISCV_IOMMU_INTR_COUNT		4
+
+#define RISCV_IOMMU_IPSR_CIP		BIT(RISCV_IOMMU_INTR_CQ)
+#define RISCV_IOMMU_IPSR_FIP		BIT(RISCV_IOMMU_INTR_FQ)
+#define RISCV_IOMMU_IPSR_PMIP		BIT(RISCV_IOMMU_INTR_PM)
+#define RISCV_IOMMU_IPSR_PIP		BIT(RISCV_IOMMU_INTR_PQ)
+
+/* 5.19 Performance monitoring counter overflow status (32bits) */
+#define RISCV_IOMMU_REG_IOCOUNTOVF	0x0058
+#define RISCV_IOMMU_IOCOUNTOVF_CY	BIT(0)
+#define RISCV_IOMMU_IOCOUNTOVF_HPM	GENMASK_ULL(31, 1)
+
+/* 5.20 Performance monitoring counter inhibits (32bits) */
+#define RISCV_IOMMU_REG_IOCOUNTINH	0x005C
+#define RISCV_IOMMU_IOCOUNTINH_CY	BIT(0)
+#define RISCV_IOMMU_IOCOUNTINH_HPM	GENMASK(31, 1)
+
+/* 5.21 Performance monitoring cycles counter (64bits) */
+#define RISCV_IOMMU_REG_IOHPMCYCLES     0x0060
+#define RISCV_IOMMU_IOHPMCYCLES_COUNTER	GENMASK_ULL(62, 0)
+#define RISCV_IOMMU_IOHPMCYCLES_OVF	BIT_ULL(63)
+
+/* 5.22 Performance monitoring event counters (31 * 64bits) */
+#define RISCV_IOMMU_REG_IOHPMCTR_BASE	0x0068
+#define RISCV_IOMMU_REG_IOHPMCTR(_n)	(RISCV_IOMMU_REG_IOHPMCTR_BASE + ((_n) * 0x8))
+
+/* 5.23 Performance monitoring event selectors (31 * 64bits) */
+#define RISCV_IOMMU_REG_IOHPMEVT_BASE	0x0160
+#define RISCV_IOMMU_REG_IOHPMEVT(_n)	(RISCV_IOMMU_REG_IOHPMEVT_BASE + ((_n) * 0x8))
+#define RISCV_IOMMU_IOHPMEVT_CNT	31
+#define RISCV_IOMMU_IOHPMEVT_EVENT_ID	GENMASK_ULL(14, 0)
+#define RISCV_IOMMU_IOHPMEVT_DMASK	BIT_ULL(15)
+#define RISCV_IOMMU_IOHPMEVT_PID_PSCID	GENMASK_ULL(35, 16)
+#define RISCV_IOMMU_IOHPMEVT_DID_GSCID	GENMASK_ULL(59, 36)
+#define RISCV_IOMMU_IOHPMEVT_PV_PSCV	BIT_ULL(60)
+#define RISCV_IOMMU_IOHPMEVT_DV_GSCV	BIT_ULL(61)
+#define RISCV_IOMMU_IOHPMEVT_IDT	BIT_ULL(62)
+#define RISCV_IOMMU_IOHPMEVT_OF		BIT_ULL(63)
+
+/**
+ * enum riscv_iommu_hpmevent_id - Performance-monitoring event identifier
+ *
+ * @RISCV_IOMMU_HPMEVENT_INVALID: Invalid event, do not count
+ * @RISCV_IOMMU_HPMEVENT_URQ: Untranslated requests
+ * @RISCV_IOMMU_HPMEVENT_TRQ: Translated requests
+ * @RISCV_IOMMU_HPMEVENT_ATS_RQ: ATS translation requests
+ * @RISCV_IOMMU_HPMEVENT_TLB_MISS: TLB misses
+ * @RISCV_IOMMU_HPMEVENT_DD_WALK: Device directory walks
+ * @RISCV_IOMMU_HPMEVENT_PD_WALK: Process directory walks
+ * @RISCV_IOMMU_HPMEVENT_S_VS_WALKS: S/VS-Stage page table walks
+ * @RISCV_IOMMU_HPMEVENT_G_WALKS: G-Stage page table walks
+ * @RISCV_IOMMU_HPMEVENT_MAX: Value to denote maximum Event IDs
+ */
+enum riscv_iommu_hpmevent_id {
+	RISCV_IOMMU_HPMEVENT_INVALID    = 0,
+	RISCV_IOMMU_HPMEVENT_URQ        = 1,
+	RISCV_IOMMU_HPMEVENT_TRQ        = 2,
+	RISCV_IOMMU_HPMEVENT_ATS_RQ     = 3,
+	RISCV_IOMMU_HPMEVENT_TLB_MISS   = 4,
+	RISCV_IOMMU_HPMEVENT_DD_WALK    = 5,
+	RISCV_IOMMU_HPMEVENT_PD_WALK    = 6,
+	RISCV_IOMMU_HPMEVENT_S_VS_WALKS = 7,
+	RISCV_IOMMU_HPMEVENT_G_WALKS    = 8,
+	RISCV_IOMMU_HPMEVENT_MAX        = 9
+};
+
+/* 5.24 Translation request IOVA (64bits) */
+#define RISCV_IOMMU_REG_TR_REQ_IOVA     0x0258
+#define RISCV_IOMMU_TR_REQ_IOVA_VPN	GENMASK_ULL(63, 12)
+
+/* 5.25 Translation request control (64bits) */
+#define RISCV_IOMMU_REG_TR_REQ_CTL	0x0260
+#define RISCV_IOMMU_TR_REQ_CTL_GO_BUSY	BIT_ULL(0)
+#define RISCV_IOMMU_TR_REQ_CTL_PRIV	BIT_ULL(1)
+#define RISCV_IOMMU_TR_REQ_CTL_EXE	BIT_ULL(2)
+#define RISCV_IOMMU_TR_REQ_CTL_NW	BIT_ULL(3)
+#define RISCV_IOMMU_TR_REQ_CTL_PID	GENMASK_ULL(31, 12)
+#define RISCV_IOMMU_TR_REQ_CTL_PV	BIT_ULL(32)
+#define RISCV_IOMMU_TR_REQ_CTL_DID	GENMASK_ULL(63, 40)
+
+/* 5.26 Translation request response (64bits) */
+#define RISCV_IOMMU_REG_TR_RESPONSE	0x0268
+#define RISCV_IOMMU_TR_RESPONSE_FAULT	BIT_ULL(0)
+#define RISCV_IOMMU_TR_RESPONSE_PBMT	GENMASK_ULL(8, 7)
+#define RISCV_IOMMU_TR_RESPONSE_SZ	BIT_ULL(9)
+#define RISCV_IOMMU_TR_RESPONSE_PPN	RISCV_IOMMU_PPN_FIELD
+
+/* 5.27 Interrupt cause to vector (64bits) */
+#define RISCV_IOMMU_REG_IVEC		0x02F8
+#define RISCV_IOMMU_IVEC_CIV		GENMASK_ULL(3, 0)
+#define RISCV_IOMMU_IVEC_FIV		GENMASK_ULL(7, 4)
+#define RISCV_IOMMU_IVEC_PMIV		GENMASK_ULL(11, 8)
+#define RISCV_IOMMU_IVEC_PIV		GENMASK_ULL(15, 12)
+
+/* 5.28 MSI Configuration table (32 * 64bits) */
+#define RISCV_IOMMU_REG_MSI_CONFIG	0x0300
+#define RISCV_IOMMU_REG_MSI_ADDR(_n)	(RISCV_IOMMU_REG_MSI_CONFIG + ((_n) * 0x10))
+#define RISCV_IOMMU_MSI_ADDR		GENMASK_ULL(55, 2)
+#define RISCV_IOMMU_REG_MSI_DATA(_n)	(RISCV_IOMMU_REG_MSI_CONFIG + ((_n) * 0x10) + 0x08)
+#define RISCV_IOMMU_MSI_DATA		GENMASK_ULL(31, 0)
+#define RISCV_IOMMU_REG_MSI_VEC_CTL(_n)	(RISCV_IOMMU_REG_MSI_CONFIG + ((_n) * 0x10) + 0x0C)
+#define RISCV_IOMMU_MSI_VEC_CTL_M	BIT_ULL(0)
+
+#define RISCV_IOMMU_REG_SIZE	0x1000
+
+/*
+ * Chapter 2: Data structures
+ */
+
+/*
+ * Device Directory Table macros for non-leaf nodes
+ */
+#define RISCV_IOMMU_DDTE_VALID	BIT_ULL(0)
+#define RISCV_IOMMU_DDTE_PPN	RISCV_IOMMU_PPN_FIELD
+
+/**
+ * struct riscv_iommu_dc - Device Context
+ * @tc: Translation Control
+ * @iohgatp: I/O Hypervisor guest address translation and protection
+ *	     (Second stage context)
+ * @ta: Translation Attributes
+ * @fsc: First stage context
+ * @msiptp: MSI page table pointer
+ * @msi_addr_mask: MSI address mask
+ * @msi_addr_pattern: MSI address pattern
+ * @_reserved: Reserved for future use, padding
+ *
+ * This structure is used for leaf nodes on the Device Directory Table,
+ * in case RISCV_IOMMU_CAP_MSI_FLAT is not set, the bottom 4 fields are
+ * not present and are skipped with pointer arithmetic to avoid
+ * casting, check out riscv_iommu_get_dc().
+ * See section 2.1 for more details
+ */
+struct riscv_iommu_dc {
+	u64 tc;
+	u64 iohgatp;
+	u64 ta;
+	u64 fsc;
+	u64 msiptp;
+	u64 msi_addr_mask;
+	u64 msi_addr_pattern;
+	u64 _reserved;
+};
+
+/* Translation control fields */
+#define RISCV_IOMMU_DC_TC_V		BIT_ULL(0)
+#define RISCV_IOMMU_DC_TC_EN_ATS	BIT_ULL(1)
+#define RISCV_IOMMU_DC_TC_EN_PRI	BIT_ULL(2)
+#define RISCV_IOMMU_DC_TC_T2GPA		BIT_ULL(3)
+#define RISCV_IOMMU_DC_TC_DTF		BIT_ULL(4)
+#define RISCV_IOMMU_DC_TC_PDTV		BIT_ULL(5)
+#define RISCV_IOMMU_DC_TC_PRPR		BIT_ULL(6)
+#define RISCV_IOMMU_DC_TC_GADE		BIT_ULL(7)
+#define RISCV_IOMMU_DC_TC_SADE		BIT_ULL(8)
+#define RISCV_IOMMU_DC_TC_DPE		BIT_ULL(9)
+#define RISCV_IOMMU_DC_TC_SBE		BIT_ULL(10)
+#define RISCV_IOMMU_DC_TC_SXL		BIT_ULL(11)
+
+/* Second-stage (aka G-stage) context fields */
+#define RISCV_IOMMU_DC_IOHGATP_PPN	RISCV_IOMMU_ATP_PPN_FIELD
+#define RISCV_IOMMU_DC_IOHGATP_GSCID	GENMASK_ULL(59, 44)
+#define RISCV_IOMMU_DC_IOHGATP_MODE	RISCV_IOMMU_ATP_MODE_FIELD
+
+/**
+ * enum riscv_iommu_dc_iohgatp_modes - Guest address translation/protection modes
+ * @RISCV_IOMMU_DC_IOHGATP_MODE_BARE: No translation/protection
+ * @RISCV_IOMMU_DC_IOHGATP_MODE_SV32X4: Sv32x4 (2-bit extension of Sv32), when fctl.GXL == 1
+ * @RISCV_IOMMU_DC_IOHGATP_MODE_SV39X4: Sv39x4 (2-bit extension of Sv39), when fctl.GXL == 0
+ * @RISCV_IOMMU_DC_IOHGATP_MODE_SV48X4: Sv48x4 (2-bit extension of Sv48), when fctl.GXL == 0
+ * @RISCV_IOMMU_DC_IOHGATP_MODE_SV57X4: Sv57x4 (2-bit extension of Sv57), when fctl.GXL == 0
+ */
+enum riscv_iommu_dc_iohgatp_modes {
+	RISCV_IOMMU_DC_IOHGATP_MODE_BARE = 0,
+	RISCV_IOMMU_DC_IOHGATP_MODE_SV32X4 = 8,
+	RISCV_IOMMU_DC_IOHGATP_MODE_SV39X4 = 8,
+	RISCV_IOMMU_DC_IOHGATP_MODE_SV48X4 = 9,
+	RISCV_IOMMU_DC_IOHGATP_MODE_SV57X4 = 10
+};
+
+/* Translation attributes fields */
+#define RISCV_IOMMU_DC_TA_PSCID		GENMASK_ULL(31, 12)
+
+/* First-stage context fields */
+#define RISCV_IOMMU_DC_FSC_PPN		RISCV_IOMMU_ATP_PPN_FIELD
+#define RISCV_IOMMU_DC_FSC_MODE		RISCV_IOMMU_ATP_MODE_FIELD
+
+/**
+ * enum riscv_iommu_dc_fsc_atp_modes - First stage address translation/protection modes
+ * @RISCV_IOMMU_DC_FSC_MODE_BARE: No translation/protection
+ * @RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV32: Sv32, when dc.tc.SXL == 1
+ * @RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39: Sv39, when dc.tc.SXL == 0
+ * @RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV48: Sv48, when dc.tc.SXL == 0
+ * @RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57: Sv57, when dc.tc.SXL == 0
+ * @RISCV_IOMMU_DC_FSC_PDTP_MODE_PD8: 1lvl PDT, 8bit process ids
+ * @RISCV_IOMMU_DC_FSC_PDTP_MODE_PD17: 2lvl PDT, 17bit process ids
+ * @RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20: 3lvl PDT, 20bit process ids
+ *
+ * FSC holds IOSATP when RISCV_IOMMU_DC_TC_PDTV is 0 and PDTP otherwise.
+ * IOSATP controls the first stage address translation (same as the satp register on
+ * the RISC-V MMU), and PDTP holds the process directory table, used to select a
+ * first stage page table based on a process id (for devices that support multiple
+ * process ids).
+ */
+enum riscv_iommu_dc_fsc_atp_modes {
+	RISCV_IOMMU_DC_FSC_MODE_BARE = 0,
+	RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV32 = 8,
+	RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39 = 8,
+	RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV48 = 9,
+	RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57 = 10,
+	RISCV_IOMMU_DC_FSC_PDTP_MODE_PD8 = 1,
+	RISCV_IOMMU_DC_FSC_PDTP_MODE_PD17 = 2,
+	RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20 = 3
+};
+
+/* MSI page table pointer */
+#define RISCV_IOMMU_DC_MSIPTP_PPN	RISCV_IOMMU_ATP_PPN_FIELD
+#define RISCV_IOMMU_DC_MSIPTP_MODE	RISCV_IOMMU_ATP_MODE_FIELD
+#define RISCV_IOMMU_DC_MSIPTP_MODE_OFF	0
+#define RISCV_IOMMU_DC_MSIPTP_MODE_FLAT	1
+
+/* MSI address mask */
+#define RISCV_IOMMU_DC_MSI_ADDR_MASK	GENMASK_ULL(51, 0)
+
+/* MSI address pattern */
+#define RISCV_IOMMU_DC_MSI_PATTERN	GENMASK_ULL(51, 0)
+
+/**
+ * struct riscv_iommu_pc - Process Context
+ * @ta: Translation Attributes
+ * @fsc: First stage context
+ *
+ * This structure is used for leaf nodes on the Process Directory Table
+ * See section 2.3 for more details
+ */
+struct riscv_iommu_pc {
+	u64 ta;
+	u64 fsc;
+};
+
+/* Translation attributes fields */
+#define RISCV_IOMMU_PC_TA_V	BIT_ULL(0)
+#define RISCV_IOMMU_PC_TA_ENS	BIT_ULL(1)
+#define RISCV_IOMMU_PC_TA_SUM	BIT_ULL(2)
+#define RISCV_IOMMU_PC_TA_PSCID	GENMASK_ULL(31, 12)
+
+/* First stage context fields */
+#define RISCV_IOMMU_PC_FSC_PPN	RISCV_IOMMU_ATP_PPN_FIELD
+#define RISCV_IOMMU_PC_FSC_MODE	RISCV_IOMMU_ATP_MODE_FIELD
+
+/*
+ * Chapter 3: In-memory queue interface
+ */
+
+/**
+ * struct riscv_iommu_command - Generic I/O MMU command structure
+ * @dword0: Includes the opcode and the function identifier
+ * @dword1: Opcode specific data
+ *
+ * The commands are interpreted as two 64bit fields, where the first
+ * 7bits of the first field are the opcode which also defines the
+ * command's format, followed by a 3bit field that specifies the
+ * function invoked by that command, and the rest is opcode-specific.
+ * This is a generic struct which will be populated differently
+ * according to each command. For more infos on the commands and
+ * the command queue check section 3.1.
+ */
+struct riscv_iommu_command {
+	u64 dword0;
+	u64 dword1;
+};
+
+/* Fields on dword0, common for all commands */
+#define RISCV_IOMMU_CMD_OPCODE	GENMASK_ULL(6, 0)
+#define	RISCV_IOMMU_CMD_FUNC	GENMASK_ULL(9, 7)
+
+/* 3.1.1 I/O MMU Page-table cache invalidation */
+/* Fields on dword0 */
+#define RISCV_IOMMU_CMD_IOTINVAL_OPCODE		1
+#define RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA	0
+#define RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA	1
+#define RISCV_IOMMU_CMD_IOTINVAL_AV		BIT_ULL(10)
+#define RISCV_IOMMU_CMD_IOTINVAL_PSCID		GENMASK_ULL(31, 12)
+#define RISCV_IOMMU_CMD_IOTINVAL_PSCV		BIT_ULL(32)
+#define RISCV_IOMMU_CMD_IOTINVAL_GV		BIT_ULL(33)
+#define RISCV_IOMMU_CMD_IOTINVAL_GSCID		GENMASK_ULL(59, 44)
+/* dword1[61:10] is the 4K-aligned page address */
+#define RISCV_IOMMU_CMD_IOTINVAL_ADDR		GENMASK_ULL(61, 10)
+
+/* 3.1.2 I/O MMU Command Queue Fences */
+/* Fields on dword0 */
+#define RISCV_IOMMU_CMD_IOFENCE_OPCODE		2
+#define RISCV_IOMMU_CMD_IOFENCE_FUNC_C		0
+#define RISCV_IOMMU_CMD_IOFENCE_AV		BIT_ULL(10)
+#define RISCV_IOMMU_CMD_IOFENCE_WSI		BIT_ULL(11)
+#define RISCV_IOMMU_CMD_IOFENCE_PR		BIT_ULL(12)
+#define RISCV_IOMMU_CMD_IOFENCE_PW		BIT_ULL(13)
+#define RISCV_IOMMU_CMD_IOFENCE_DATA		GENMASK_ULL(63, 32)
+/* dword1 is the address, word-size aligned and shifted to the right by two bits. */
+
+/* 3.1.3 I/O MMU Directory cache invalidation */
+/* Fields on dword0 */
+#define RISCV_IOMMU_CMD_IODIR_OPCODE		3
+#define RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_DDT	0
+#define RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_PDT	1
+#define RISCV_IOMMU_CMD_IODIR_PID		GENMASK_ULL(31, 12)
+#define RISCV_IOMMU_CMD_IODIR_DV		BIT_ULL(33)
+#define RISCV_IOMMU_CMD_IODIR_DID		GENMASK_ULL(63, 40)
+/* dword1 is reserved for standard use */
+
+/* 3.1.4 I/O MMU PCIe ATS */
+/* Fields on dword0 */
+#define RISCV_IOMMU_CMD_ATS_OPCODE		4
+#define RISCV_IOMMU_CMD_ATS_FUNC_INVAL		0
+#define RISCV_IOMMU_CMD_ATS_FUNC_PRGR		1
+#define RISCV_IOMMU_CMD_ATS_PID			GENMASK_ULL(31, 12)
+#define RISCV_IOMMU_CMD_ATS_PV			BIT_ULL(32)
+#define RISCV_IOMMU_CMD_ATS_DSV			BIT_ULL(33)
+#define RISCV_IOMMU_CMD_ATS_RID			GENMASK_ULL(55, 40)
+#define RISCV_IOMMU_CMD_ATS_DSEG		GENMASK_ULL(63, 56)
+/* dword1 is the ATS payload, two different payload types for INVAL and PRGR */
+
+/* ATS.INVAL payload*/
+#define RISCV_IOMMU_CMD_ATS_INVAL_G		BIT_ULL(0)
+/* Bits 1 - 10 are zeroed */
+#define RISCV_IOMMU_CMD_ATS_INVAL_S		BIT_ULL(11)
+#define RISCV_IOMMU_CMD_ATS_INVAL_UADDR		GENMASK_ULL(63, 12)
+
+/* ATS.PRGR payload */
+/* Bits 0 - 31 are zeroed */
+#define RISCV_IOMMU_CMD_ATS_PRGR_PRG_INDEX	GENMASK_ULL(40, 32)
+/* Bits 41 - 43 are zeroed */
+#define RISCV_IOMMU_CMD_ATS_PRGR_RESP_CODE	GENMASK_ULL(47, 44)
+#define RISCV_IOMMU_CMD_ATS_PRGR_DST_ID		GENMASK_ULL(63, 48)
+
+/**
+ * struct riscv_iommu_fq_record - Fault/Event Queue Record
+ * @hdr: Header, includes fault/event cause, PID/DID, transaction type etc
+ * @_reserved: Low 32bits for custom use, high 32bits for standard use
+ * @iotval: Transaction-type/cause specific format
+ * @iotval2: Cause specific format
+ *
+ * The fault/event queue reports events and failures raised when
+ * processing transactions. Each record is a 32byte structure where
+ * the first dword has a fixed format for providing generic infos
+ * regarding the fault/event, and two more dwords are there for
+ * fault/event-specific information. For more details see section
+ * 3.2.
+ */
+struct riscv_iommu_fq_record {
+	u64 hdr;
+	u64 _reserved;
+	u64 iotval;
+	u64 iotval2;
+};
+
+/* Fields on header */
+#define RISCV_IOMMU_FQ_HDR_CAUSE	GENMASK_ULL(11, 0)
+#define RISCV_IOMMU_FQ_HDR_PID		GENMASK_ULL(31, 12)
+#define RISCV_IOMMU_FQ_HDR_PV		BIT_ULL(32)
+#define RISCV_IOMMU_FQ_HDR_PRIV		BIT_ULL(33)
+#define RISCV_IOMMU_FQ_HDR_TTYPE	GENMASK_ULL(39, 34)
+#define RISCV_IOMMU_FQ_HDR_DID		GENMASK_ULL(63, 40)
+
+/**
+ * enum riscv_iommu_fq_causes - Fault/event cause values
+ * @RISCV_IOMMU_FQ_CAUSE_INST_FAULT: Instruction access fault
+ * @RISCV_IOMMU_FQ_CAUSE_RD_ADDR_MISALIGNED: Read address misaligned
+ * @RISCV_IOMMU_FQ_CAUSE_RD_FAULT: Read load fault
+ * @RISCV_IOMMU_FQ_CAUSE_WR_ADDR_MISALIGNED: Write/AMO address misaligned
+ * @RISCV_IOMMU_FQ_CAUSE_WR_FAULT: Write/AMO access fault
+ * @RISCV_IOMMU_FQ_CAUSE_INST_FAULT_S: Instruction page fault
+ * @RISCV_IOMMU_FQ_CAUSE_RD_FAULT_S: Read page fault
+ * @RISCV_IOMMU_FQ_CAUSE_WR_FAULT_S: Write/AMO page fault
+ * @RISCV_IOMMU_FQ_CAUSE_INST_FAULT_VS: Instruction guest page fault
+ * @RISCV_IOMMU_FQ_CAUSE_RD_FAULT_VS: Read guest page fault
+ * @RISCV_IOMMU_FQ_CAUSE_WR_FAULT_VS: Write/AMO guest page fault
+ * @RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED: All inbound transactions disallowed
+ * @RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT: DDT entry load access fault
+ * @RISCV_IOMMU_FQ_CAUSE_DDT_INVALID: DDT entry invalid
+ * @RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED: DDT entry misconfigured
+ * @RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED: Transaction type disallowed
+ * @RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT: MSI PTE load access fault
+ * @RISCV_IOMMU_FQ_CAUSE_MSI_INVALID: MSI PTE invalid
+ * @RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED: MSI PTE misconfigured
+ * @RISCV_IOMMU_FQ_CAUSE_MRIF_FAULT: MRIF access fault
+ * @RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT: PDT entry load access fault
+ * @RISCV_IOMMU_FQ_CAUSE_PDT_INVALID: PDT entry invalid
+ * @RISCV_IOMMU_FQ_CAUSE_PDT_MISCONFIGURED: PDT entry misconfigured
+ * @RISCV_IOMMU_FQ_CAUSE_DDT_CORRUPTED: DDT data corruption
+ * @RISCV_IOMMU_FQ_CAUSE_PDT_CORRUPTED: PDT data corruption
+ * @RISCV_IOMMU_FQ_CAUSE_MSI_PT_CORRUPTED: MSI page table data corruption
+ * @RISCV_IOMMU_FQ_CAUSE_MRIF_CORRUIPTED: MRIF data corruption
+ * @RISCV_IOMMU_FQ_CAUSE_INTERNAL_DP_ERROR: Internal data path error
+ * @RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT: IOMMU MSI write access fault
+ * @RISCV_IOMMU_FQ_CAUSE_PT_CORRUPTED: First/second stage page table data corruption
+ *
+ * Values are on table 11 of the spec, encodings 275 - 2047 are reserved for standard
+ * use, and 2048 - 4095 for custom use.
+ */
+enum riscv_iommu_fq_causes {
+	RISCV_IOMMU_FQ_CAUSE_INST_FAULT = 1,
+	RISCV_IOMMU_FQ_CAUSE_RD_ADDR_MISALIGNED = 4,
+	RISCV_IOMMU_FQ_CAUSE_RD_FAULT = 5,
+	RISCV_IOMMU_FQ_CAUSE_WR_ADDR_MISALIGNED = 6,
+	RISCV_IOMMU_FQ_CAUSE_WR_FAULT = 7,
+	RISCV_IOMMU_FQ_CAUSE_INST_FAULT_S = 12,
+	RISCV_IOMMU_FQ_CAUSE_RD_FAULT_S = 13,
+	RISCV_IOMMU_FQ_CAUSE_WR_FAULT_S = 15,
+	RISCV_IOMMU_FQ_CAUSE_INST_FAULT_VS = 20,
+	RISCV_IOMMU_FQ_CAUSE_RD_FAULT_VS = 21,
+	RISCV_IOMMU_FQ_CAUSE_WR_FAULT_VS = 23,
+	RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED = 256,
+	RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT = 257,
+	RISCV_IOMMU_FQ_CAUSE_DDT_INVALID = 258,
+	RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED = 259,
+	RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED = 260,
+	RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT = 261,
+	RISCV_IOMMU_FQ_CAUSE_MSI_INVALID = 262,
+	RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED = 263,
+	RISCV_IOMMU_FQ_CAUSE_MRIF_FAULT = 264,
+	RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT = 265,
+	RISCV_IOMMU_FQ_CAUSE_PDT_INVALID = 266,
+	RISCV_IOMMU_FQ_CAUSE_PDT_MISCONFIGURED = 267,
+	RISCV_IOMMU_FQ_CAUSE_DDT_CORRUPTED = 268,
+	RISCV_IOMMU_FQ_CAUSE_PDT_CORRUPTED = 269,
+	RISCV_IOMMU_FQ_CAUSE_MSI_PT_CORRUPTED = 270,
+	RISCV_IOMMU_FQ_CAUSE_MRIF_CORRUIPTED = 271,
+	RISCV_IOMMU_FQ_CAUSE_INTERNAL_DP_ERROR = 272,
+	RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT = 273,
+	RISCV_IOMMU_FQ_CAUSE_PT_CORRUPTED = 274
+};
+
+/**
+ * enum riscv_iommu_fq_ttypes: Fault/event transaction types
+ * @RISCV_IOMMU_FQ_TTYPE_NONE: None. Fault not caused by an inbound transaction.
+ * @RISCV_IOMMU_FQ_TTYPE_UADDR_INST_FETCH: Instruction fetch from untranslated address
+ * @RISCV_IOMMU_FQ_TTYPE_UADDR_RD: Read from untranslated address
+ * @RISCV_IOMMU_FQ_TTYPE_UADDR_WR: Write/AMO to untranslated address
+ * @RISCV_IOMMU_FQ_TTYPE_TADDR_INST_FETCH: Instruction fetch from translated address
+ * @RISCV_IOMMU_FQ_TTYPE_TADDR_RD: Read from translated address
+ * @RISCV_IOMMU_FQ_TTYPE_TADDR_WR: Write/AMO to translated address
+ * @RISCV_IOMMU_FQ_TTYPE_PCIE_ATS_REQ: PCIe ATS translation request
+ * @RISCV_IOMMU_FW_TTYPE_PCIE_MSG_REQ: PCIe message request
+ *
+ * Values are on table 12 of the spec, type 4 and 10 - 31 are reserved for standard use
+ * and 31 - 63 for custom use.
+ */
+enum riscv_iommu_fq_ttypes {
+	RISCV_IOMMU_FQ_TTYPE_NONE = 0,
+	RISCV_IOMMU_FQ_TTYPE_UADDR_INST_FETCH = 1,
+	RISCV_IOMMU_FQ_TTYPE_UADDR_RD = 2,
+	RISCV_IOMMU_FQ_TTYPE_UADDR_WR = 3,
+	RISCV_IOMMU_FQ_TTYPE_TADDR_INST_FETCH = 5,
+	RISCV_IOMMU_FQ_TTYPE_TADDR_RD = 6,
+	RISCV_IOMMU_FQ_TTYPE_TADDR_WR = 7,
+	RISCV_IOMMU_FQ_TTYPE_PCIE_ATS_REQ = 8,
+	RISCV_IOMMU_FW_TTYPE_PCIE_MSG_REQ = 9,
+};
+
+/**
+ * struct riscv_iommu_pq_record - PCIe Page Request record
+ * @hdr: Header, includes PID, DID etc
+ * @payload: Holds the page address, request group and permission bits
+ *
+ * For more infos on the PCIe Page Request queue see chapter 3.3.
+ */
+struct riscv_iommu_pq_record {
+	u64 hdr;
+	u64 payload;
+};
+
+/* Header fields */
+#define RISCV_IOMMU_PREQ_HDR_PID	GENMASK_ULL(31, 12)
+#define RISCV_IOMMU_PREQ_HDR_PV		BIT_ULL(32)
+#define RISCV_IOMMU_PREQ_HDR_PRIV	BIT_ULL(33)
+#define RISCV_IOMMU_PREQ_HDR_EXEC	BIT_ULL(34)
+#define RISCV_IOMMU_PREQ_HDR_DID	GENMASK_ULL(63, 40)
+
+/* Payload fields */
+#define RISCV_IOMMU_PREQ_PAYLOAD_R	BIT_ULL(0)
+#define RISCV_IOMMU_PREQ_PAYLOAD_W	BIT_ULL(1)
+#define RISCV_IOMMU_PREQ_PAYLOAD_L	BIT_ULL(2)
+#define RISCV_IOMMU_PREQ_PAYLOAD_M	GENMASK_ULL(2, 0)	/* Mask of RWL for convenience */
+#define RISCV_IOMMU_PREQ_PRG_INDEX	GENMASK_ULL(11, 3)
+#define RISCV_IOMMU_PREQ_UADDR		GENMASK_ULL(63, 12)
+
+/**
+ * struct riscv_iommu_msi_pte - MSI Page Table Entry
+ * @pte: MSI PTE
+ * @mrif_info: Memory-resident interrupt file info
+ *
+ * The MSI Page Table is used for virtualizing MSIs, so that when
+ * a device sends an MSI to a guest, the IOMMU can reroute it
+ * by translating the MSI address, either to a guest interrupt file
+ * or a memory resident interrupt file (MRIF). Note that this page table
+ * is an array of MSI PTEs, not a multi-level pt, each entry
+ * is a leaf entry. For more infos check out the AIA spec, chapter 9.5.
+ *
+ * Also in basic mode the mrif_info field is ignored by the IOMMU and can
+ * be used by software, any other reserved fields on pte must be zeroed-out
+ * by software.
+ */
+struct riscv_iommu_msi_pte {
+	u64 pte;
+	u64 mrif_info;
+};
+
+/* Fields on pte */
+#define RISCV_IOMMU_MSI_PTE_V		BIT_ULL(0)
+#define RISCV_IOMMU_MSI_PTE_M		GENMASK_ULL(2, 1)
+#define RISCV_IOMMU_MSI_PTE_MRIF_ADDR	GENMASK_ULL(53, 7)	/* When M == 1 (MRIF mode) */
+#define RISCV_IOMMU_MSI_PTE_PPN		RISCV_IOMMU_PPN_FIELD	/* When M == 3 (basic mode) */
+#define RISCV_IOMMU_MSI_PTE_C		BIT_ULL(63)
+
+/* Fields on mrif_info */
+#define RISCV_IOMMU_MSI_MRIF_NID	GENMASK_ULL(9, 0)
+#define RISCV_IOMMU_MSI_MRIF_NPPN	RISCV_IOMMU_PPN_FIELD
+#define RISCV_IOMMU_MSI_MRIF_NID_MSB	BIT_ULL(60)
+
+#endif /* _RISCV_IOMMU_BITS_H_ */
diff --git a/drivers/iommu/riscv/iommu-platform.c b/drivers/iommu/riscv/iommu-platform.c
new file mode 100644
index 000000000000..770086ae2ab3
--- /dev/null
+++ b/drivers/iommu/riscv/iommu-platform.c
@@ -0,0 +1,94 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * RISC-V IOMMU as a platform device
+ *
+ * Copyright © 2023 FORTH-ICS/CARV
+ * Copyright © 2023-2024 Rivos Inc.
+ *
+ * Authors
+ *	Nick Kossifidis <mick@ics.forth.gr>
+ *	Tomasz Jeznach <tjeznach@rivosinc.com>
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/of_platform.h>
+#include <linux/platform_device.h>
+
+#include "iommu-bits.h"
+#include "iommu.h"
+
+static int riscv_iommu_platform_probe(struct platform_device *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct riscv_iommu_device *iommu = NULL;
+	struct resource *res = NULL;
+	int vec;
+
+	iommu = devm_kzalloc(dev, sizeof(*iommu), GFP_KERNEL);
+	if (!iommu)
+		return -ENOMEM;
+
+	iommu->dev = dev;
+	iommu->reg = devm_platform_get_and_ioremap_resource(pdev, 0, &res);
+	if (IS_ERR(iommu->reg))
+		return dev_err_probe(dev, PTR_ERR(iommu->reg),
+				     "could not map register region\n");
+
+	dev_set_drvdata(dev, iommu);
+
+	/* Check device reported capabilities / features. */
+	iommu->caps = riscv_iommu_readq(iommu, RISCV_IOMMU_REG_CAP);
+	iommu->fctl = riscv_iommu_readl(iommu, RISCV_IOMMU_REG_FCTL);
+
+	/* For now we only support WSI */
+	switch (FIELD_GET(RISCV_IOMMU_CAP_IGS, iommu->caps)) {
+	case RISCV_IOMMU_CAP_IGS_WSI:
+	case RISCV_IOMMU_CAP_IGS_BOTH:
+		break;
+	default:
+		return dev_err_probe(dev, -ENODEV,
+				     "unable to use wire-signaled interrupts\n");
+	}
+
+	iommu->irqs_count = platform_irq_count(pdev);
+	if (iommu->irqs_count <= 0)
+		return dev_err_probe(dev, -ENODEV,
+				     "no IRQ resources provided\n");
+
+	for (vec = 0; vec < iommu->irqs_count; vec++)
+		iommu->irqs[vec] = platform_get_irq(pdev, vec);
+
+	/* Enable wire-signaled interrupts, fctl.WSI */
+	if (!(iommu->fctl & RISCV_IOMMU_FCTL_WSI)) {
+		iommu->fctl ^= RISCV_IOMMU_FCTL_WSI;
+		riscv_iommu_writel(iommu, RISCV_IOMMU_REG_FCTL, iommu->fctl);
+	}
+
+	return riscv_iommu_init(iommu);
+};
+
+static void riscv_iommu_platform_remove(struct platform_device *pdev)
+{
+	riscv_iommu_remove(dev_get_drvdata(&pdev->dev));
+};
+
+static const struct of_device_id riscv_iommu_of_match[] = {
+	{.compatible = "riscv,iommu",},
+	{},
+};
+
+MODULE_DEVICE_TABLE(of, riscv_iommu_of_match);
+
+static struct platform_driver riscv_iommu_platform_driver = {
+	.probe = riscv_iommu_platform_probe,
+	.remove_new = riscv_iommu_platform_remove,
+	.driver = {
+		.name = "riscv,iommu",
+		.of_match_table = riscv_iommu_of_match,
+		.suppress_bind_attrs = true,
+	},
+};
+
+module_driver(riscv_iommu_platform_driver, platform_driver_register,
+	      platform_driver_unregister);
diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
new file mode 100644
index 000000000000..af68c89200a9
--- /dev/null
+++ b/drivers/iommu/riscv/iommu.c
@@ -0,0 +1,89 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * IOMMU API for RISC-V IOMMU implementations.
+ *
+ * Copyright © 2022-2024 Rivos Inc.
+ * Copyright © 2023 FORTH-ICS/CARV
+ *
+ * Authors
+ *	Tomasz Jeznach <tjeznach@rivosinc.com>
+ *	Nick Kossifidis <mick@ics.forth.gr>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/compiler.h>
+#include <linux/dma-mapping.h>
+#include <linux/init.h>
+#include <linux/iommu.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+#include "iommu-bits.h"
+#include "iommu.h"
+
+MODULE_DESCRIPTION("Driver for RISC-V IOMMU");
+MODULE_AUTHOR("Tomasz Jeznach <tjeznach@rivosinc.com>");
+MODULE_AUTHOR("Nick Kossifidis <mick@ics.forth.gr>");
+MODULE_LICENSE("GPL");
+
+/* Timeouts in [us] */
+#define RISCV_IOMMU_DDTP_TIMEOUT	50000
+
+static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
+{
+	u64 ddtp;
+
+	/* Hardware must be configured in OFF | BARE mode at system initialization. */
+	riscv_iommu_readq_timeout(iommu, RISCV_IOMMU_REG_DDTP,
+				  ddtp, !(ddtp & RISCV_IOMMU_DDTP_BUSY),
+				  10, RISCV_IOMMU_DDTP_TIMEOUT);
+	if (FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp) > RISCV_IOMMU_DDTP_MODE_BARE)
+		return -EBUSY;
+
+	/* Configure accesses to in-memory data structures for CPU-native byte order. */
+	if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) != !!(iommu->fctl & RISCV_IOMMU_FCTL_BE)) {
+		if (!(iommu->caps & RISCV_IOMMU_CAP_END))
+			return -EINVAL;
+		riscv_iommu_writel(iommu, RISCV_IOMMU_REG_FCTL,
+				   iommu->fctl ^ RISCV_IOMMU_FCTL_BE);
+		iommu->fctl = riscv_iommu_readl(iommu, RISCV_IOMMU_REG_FCTL);
+		if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) != !!(iommu->fctl & RISCV_IOMMU_FCTL_BE))
+			return -EINVAL;
+	}
+
+	dma_set_mask_and_coherent(iommu->dev,
+				  DMA_BIT_MASK(FIELD_GET(RISCV_IOMMU_CAP_PAS, iommu->caps)));
+
+	return 0;
+}
+
+void riscv_iommu_remove(struct riscv_iommu_device *iommu)
+{
+	iommu_device_sysfs_remove(&iommu->iommu);
+}
+
+int riscv_iommu_init(struct riscv_iommu_device *iommu)
+{
+	int rc;
+
+	rc = riscv_iommu_init_check(iommu);
+	if (rc)
+		return dev_err_probe(iommu->dev, rc, "unexpected device state\n");
+	/*
+	 * Placeholder for a complete IOMMU device initialization.
+	 * For now, only bare minimum: enable global identity mapping mode and register sysfs.
+	 */
+	riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP,
+			   FIELD_PREP(RISCV_IOMMU_DDTP_MODE, RISCV_IOMMU_DDTP_MODE_BARE));
+
+	rc = iommu_device_sysfs_add(&iommu->iommu, NULL, NULL, "riscv-iommu@%s",
+				    dev_name(iommu->dev));
+	if (WARN(rc, "cannot register sysfs interface\n"))
+		goto err_sysfs;
+
+	return 0;
+
+err_sysfs:
+	return rc;
+}
diff --git a/drivers/iommu/riscv/iommu.h b/drivers/iommu/riscv/iommu.h
new file mode 100644
index 000000000000..700e33dc2446
--- /dev/null
+++ b/drivers/iommu/riscv/iommu.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright © 2022-2024 Rivos Inc.
+ * Copyright © 2023 FORTH-ICS/CARV
+ *
+ * Authors
+ *	Tomasz Jeznach <tjeznach@rivosinc.com>
+ *	Nick Kossifidis <mick@ics.forth.gr>
+ */
+
+#ifndef _RISCV_IOMMU_H_
+#define _RISCV_IOMMU_H_
+
+#include <linux/iommu.h>
+#include <linux/types.h>
+#include <linux/iopoll.h>
+
+#include "iommu-bits.h"
+
+struct riscv_iommu_device {
+	/* iommu core interface */
+	struct iommu_device iommu;
+
+	/* iommu hardware */
+	struct device *dev;
+
+	/* hardware control register space */
+	void __iomem *reg;
+
+	/* supported and enabled hardware capabilities */
+	u64 caps;
+	u32 fctl;
+
+	/* available interrupt numbers, MSI or WSI */
+	unsigned int irqs[RISCV_IOMMU_INTR_COUNT];
+	unsigned int irqs_count;
+};
+
+int riscv_iommu_init(struct riscv_iommu_device *iommu);
+void riscv_iommu_remove(struct riscv_iommu_device *iommu);
+
+#define riscv_iommu_readl(iommu, addr) \
+	readl_relaxed((iommu)->reg + (addr))
+
+#define riscv_iommu_readq(iommu, addr) \
+	readq_relaxed((iommu)->reg + (addr))
+
+#define riscv_iommu_writel(iommu, addr, val) \
+	writel_relaxed((val), (iommu)->reg + (addr))
+
+#define riscv_iommu_writeq(iommu, addr, val) \
+	writeq_relaxed((val), (iommu)->reg + (addr))
+
+#define riscv_iommu_readq_timeout(iommu, addr, val, cond, delay_us, timeout_us) \
+	readx_poll_timeout(readq_relaxed, (iommu)->reg + (addr), val, cond, \
+			   delay_us, timeout_us)
+
+#define riscv_iommu_readl_timeout(iommu, addr, val, cond, delay_us, timeout_us) \
+	readx_poll_timeout(readl_relaxed, (iommu)->reg + (addr), val, cond, \
+			   delay_us, timeout_us)
+
+#endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 3/7] iommu/riscv: Add RISC-V IOMMU PCIe device driver
  2024-04-18 16:32 [PATCH v2 0/7] Linux RISC-V IOMMU Support Tomasz Jeznach
  2024-04-18 16:32 ` [PATCH v2 1/7] dt-bindings: iommu: riscv: Add bindings for RISC-V IOMMU Tomasz Jeznach
  2024-04-18 16:32 ` [PATCH v2 2/7] iommu/riscv: Add RISC-V IOMMU platform device driver Tomasz Jeznach
@ 2024-04-18 16:32 ` Tomasz Jeznach
  2024-04-18 22:07   ` Robin Murphy
  2024-04-18 16:32 ` [PATCH v2 4/7] iommu/riscv: Enable IOMMU registration and device probe Tomasz Jeznach
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Tomasz Jeznach @ 2024-04-18 16:32 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley
  Cc: Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux,
	Tomasz Jeznach

Introduce device driver for PCIe implementation
of RISC-V IOMMU architected hardware.

IOMMU hardware and system support for MSI or MSI-X is
required by this implementation.

Vendor and device identifiers used in this patch
matches QEMU implementation of the RISC-V IOMMU PCIe
device, from Rivos VID (0x1efd) range allocated by the PCI-SIG.

Link: https://lore.kernel.org/qemu-devel/20240307160319.675044-1-dbarboza@ventanamicro.com/
Co-developed-by: Nick Kossifidis <mick@ics.forth.gr>
Signed-off-by: Nick Kossifidis <mick@ics.forth.gr>
Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
---
 MAINTAINERS                     |   1 +
 drivers/iommu/riscv/Kconfig     |   6 ++
 drivers/iommu/riscv/Makefile    |   1 +
 drivers/iommu/riscv/iommu-pci.c | 154 ++++++++++++++++++++++++++++++++
 4 files changed, 162 insertions(+)
 create mode 100644 drivers/iommu/riscv/iommu-pci.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 051599c76585..4da290d5e9db 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -18975,6 +18975,7 @@ F:	Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
 F:	drivers/iommu/riscv/Kconfig
 F:	drivers/iommu/riscv/Makefile
 F:	drivers/iommu/riscv/iommu-bits.h
+F:	drivers/iommu/riscv/iommu-pci.c
 F:	drivers/iommu/riscv/iommu-platform.c
 F:	drivers/iommu/riscv/iommu.c
 F:	drivers/iommu/riscv/iommu.h
diff --git a/drivers/iommu/riscv/Kconfig b/drivers/iommu/riscv/Kconfig
index d02326bddb4c..711326992585 100644
--- a/drivers/iommu/riscv/Kconfig
+++ b/drivers/iommu/riscv/Kconfig
@@ -14,3 +14,9 @@ config RISCV_IOMMU
 
 	  Say Y here if your SoC includes an IOMMU device implementing
 	  the RISC-V IOMMU architecture.
+
+config RISCV_IOMMU_PCI
+	def_bool y if RISCV_IOMMU && PCI_MSI
+	depends on RISCV_IOMMU && PCI_MSI
+	help
+	  Support for the PCI implementation of RISC-V IOMMU architecture.
diff --git a/drivers/iommu/riscv/Makefile b/drivers/iommu/riscv/Makefile
index e4c189de58d3..f54c9ed17d41 100644
--- a/drivers/iommu/riscv/Makefile
+++ b/drivers/iommu/riscv/Makefile
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0-only
 obj-$(CONFIG_RISCV_IOMMU) += iommu.o iommu-platform.o
+obj-$(CONFIG_RISCV_IOMMU_PCI) += iommu-pci.o
diff --git a/drivers/iommu/riscv/iommu-pci.c b/drivers/iommu/riscv/iommu-pci.c
new file mode 100644
index 000000000000..9263c6e475be
--- /dev/null
+++ b/drivers/iommu/riscv/iommu-pci.c
@@ -0,0 +1,154 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * Copyright © 2022-2024 Rivos Inc.
+ * Copyright © 2023 FORTH-ICS/CARV
+ *
+ * RISCV IOMMU as a PCIe device
+ *
+ * Authors
+ *	Tomasz Jeznach <tjeznach@rivosinc.com>
+ *	Nick Kossifidis <mick@ics.forth.gr>
+ */
+
+#include <linux/compiler.h>
+#include <linux/init.h>
+#include <linux/iommu.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+
+#include "iommu-bits.h"
+#include "iommu.h"
+
+/* Rivos Inc. assigned PCI Vendor and Device IDs */
+#ifndef PCI_VENDOR_ID_RIVOS
+#define PCI_VENDOR_ID_RIVOS             0x1efd
+#endif
+
+#ifndef PCI_DEVICE_ID_RIVOS_IOMMU
+#define PCI_DEVICE_ID_RIVOS_IOMMU       0xedf1
+#endif
+
+static int riscv_iommu_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
+{
+	struct device *dev = &pdev->dev;
+	struct riscv_iommu_device *iommu;
+	int rc, vec;
+
+	rc = pci_enable_device_mem(pdev);
+	if (rc)
+		return rc;
+
+	rc = pci_request_mem_regions(pdev, KBUILD_MODNAME);
+	if (rc)
+		goto fail;
+
+	pci_set_master(pdev);
+
+	if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM))
+		goto fail;
+
+	if (pci_resource_len(pdev, 0) < RISCV_IOMMU_REG_SIZE)
+		goto fail;
+
+	iommu = devm_kzalloc(dev, sizeof(*iommu), GFP_KERNEL);
+	if (!iommu)
+		goto fail;
+
+	iommu->dev = dev;
+	iommu->reg = pci_iomap(pdev, 0, RISCV_IOMMU_REG_SIZE);
+
+	if (!iommu->reg)
+		goto fail;
+
+	dev_set_drvdata(dev, iommu);
+
+	/* Check device reported capabilities / features. */
+	iommu->caps = riscv_iommu_readq(iommu, RISCV_IOMMU_REG_CAP);
+	iommu->fctl = riscv_iommu_readl(iommu, RISCV_IOMMU_REG_FCTL);
+
+	/* The PCI driver only uses MSIs, make sure the IOMMU supports this */
+	switch (FIELD_GET(RISCV_IOMMU_CAP_IGS, iommu->caps)) {
+	case RISCV_IOMMU_CAP_IGS_MSI:
+	case RISCV_IOMMU_CAP_IGS_BOTH:
+		break;
+	default:
+		dev_err(dev, "unable to use message-signaled interrupts\n");
+		rc = -ENODEV;
+		goto fail_unmap;
+	}
+
+	/* Allocate and assign IRQ vectors for the various events */
+	rc = pci_alloc_irq_vectors(pdev, 1, RISCV_IOMMU_INTR_COUNT,
+				   PCI_IRQ_MSIX | PCI_IRQ_MSI);
+	if (rc <= 0) {
+		dev_err(dev, "unable to allocate irq vectors\n");
+		goto fail_unmap;
+	}
+	for (vec = 0; vec < rc; vec++) {
+		iommu->irqs[vec] = msi_get_virq(dev, vec);
+		if (!iommu->irqs[vec])
+			break;
+	}
+	iommu->irqs_count = vec;
+
+	/* Enable message-signaled interrupts, fctl.WSI */
+	if (iommu->fctl & RISCV_IOMMU_FCTL_WSI) {
+		iommu->fctl ^= RISCV_IOMMU_FCTL_WSI;
+		riscv_iommu_writel(iommu, RISCV_IOMMU_REG_FCTL, iommu->fctl);
+	}
+
+	rc = riscv_iommu_init(iommu);
+	if (!rc)
+		return 0;
+
+fail_unmap:
+	iounmap(iommu->reg);
+	pci_free_irq_vectors(pdev);
+fail:
+	pci_release_regions(pdev);
+	pci_clear_master(pdev);
+	pci_disable_device(pdev);
+	return rc;
+}
+
+static void riscv_iommu_pci_remove(struct pci_dev *pdev)
+{
+	struct riscv_iommu_device *iommu = dev_get_drvdata(&pdev->dev);
+
+	riscv_iommu_remove(iommu);
+	iounmap(iommu->reg);
+	pci_free_irq_vectors(pdev);
+	pci_release_regions(pdev);
+	pci_clear_master(pdev);
+	pci_disable_device(pdev);
+}
+
+static const struct pci_device_id riscv_iommu_pci_tbl[] = {
+	{PCI_VENDOR_ID_RIVOS, PCI_DEVICE_ID_RIVOS_IOMMU,
+	 PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0},
+	{0,}
+};
+
+MODULE_DEVICE_TABLE(pci, riscv_iommu_pci_tbl);
+
+static const struct of_device_id riscv_iommu_of_match[] = {
+	{.compatible = "riscv,pci-iommu",},
+	{},
+};
+
+MODULE_DEVICE_TABLE(of, riscv_iommu_of_match);
+
+static struct pci_driver riscv_iommu_pci_driver = {
+	.name = KBUILD_MODNAME,
+	.id_table = riscv_iommu_pci_tbl,
+	.probe = riscv_iommu_pci_probe,
+	.remove = riscv_iommu_pci_remove,
+	.driver = {
+		.of_match_table = riscv_iommu_of_match,
+		.suppress_bind_attrs = true,
+	},
+};
+
+module_pci_driver(riscv_iommu_pci_driver);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 4/7] iommu/riscv: Enable IOMMU registration and device probe.
  2024-04-18 16:32 [PATCH v2 0/7] Linux RISC-V IOMMU Support Tomasz Jeznach
                   ` (2 preceding siblings ...)
  2024-04-18 16:32 ` [PATCH v2 3/7] iommu/riscv: Add RISC-V IOMMU PCIe " Tomasz Jeznach
@ 2024-04-18 16:32 ` Tomasz Jeznach
  2024-04-18 16:32 ` [PATCH v2 5/7] iommu/riscv: Device directory management Tomasz Jeznach
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 30+ messages in thread
From: Tomasz Jeznach @ 2024-04-18 16:32 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley
  Cc: Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux,
	Tomasz Jeznach

Advertise IOMMU device and its core API.
Only minimal implementation for single identity domain type, without
per-group domain protection.

Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
---
 drivers/iommu/riscv/iommu.c | 69 +++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index af68c89200a9..d38317cb2493 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -18,6 +18,7 @@
 #include <linux/iommu.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
+#include <linux/pci.h>
 
 #include "iommu-bits.h"
 #include "iommu.h"
@@ -30,6 +31,67 @@ MODULE_LICENSE("GPL");
 /* Timeouts in [us] */
 #define RISCV_IOMMU_DDTP_TIMEOUT	50000
 
+static int riscv_iommu_attach_identity_domain(struct iommu_domain *domain,
+					      struct device *dev)
+{
+	/* Global pass-through already enabled, do nothing for now. */
+	return 0;
+}
+
+static struct iommu_domain riscv_iommu_identity_domain = {
+	.type = IOMMU_DOMAIN_IDENTITY,
+	.ops = &(const struct iommu_domain_ops) {
+		.attach_dev = riscv_iommu_attach_identity_domain,
+	}
+};
+
+static int riscv_iommu_device_domain_type(struct device *dev)
+{
+	return IOMMU_DOMAIN_IDENTITY;
+}
+
+static struct iommu_group *riscv_iommu_device_group(struct device *dev)
+{
+	if (dev_is_pci(dev))
+		return pci_device_group(dev);
+	return generic_device_group(dev);
+}
+
+static int riscv_iommu_of_xlate(struct device *dev, const struct of_phandle_args *args)
+{
+	return iommu_fwspec_add_ids(dev, args->args, 1);
+}
+
+static struct iommu_device *riscv_iommu_probe_device(struct device *dev)
+{
+	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+	struct riscv_iommu_device *iommu;
+
+	if (!fwspec->iommu_fwnode->dev || !fwspec->num_ids)
+		return ERR_PTR(-ENODEV);
+
+	iommu = dev_get_drvdata(fwspec->iommu_fwnode->dev);
+	if (!iommu)
+		return ERR_PTR(-ENODEV);
+
+	return &iommu->iommu;
+}
+
+static void riscv_iommu_probe_finalize(struct device *dev)
+{
+	iommu_setup_dma_ops(dev, 0, U64_MAX);
+}
+
+static const struct iommu_ops riscv_iommu_ops = {
+	.owner = THIS_MODULE,
+	.of_xlate = riscv_iommu_of_xlate,
+	.identity_domain = &riscv_iommu_identity_domain,
+	.def_domain_type = riscv_iommu_device_domain_type,
+	.device_group = riscv_iommu_device_group,
+	.probe_device = riscv_iommu_probe_device,
+	.probe_finalize = riscv_iommu_probe_finalize,
+};
+
 static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
 {
 	u64 ddtp;
@@ -60,6 +122,7 @@ static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
 
 void riscv_iommu_remove(struct riscv_iommu_device *iommu)
 {
+	iommu_device_unregister(&iommu->iommu);
 	iommu_device_sysfs_remove(&iommu->iommu);
 }
 
@@ -82,8 +145,14 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
 	if (WARN(rc, "cannot register sysfs interface\n"))
 		goto err_sysfs;
 
+	rc = iommu_device_register(&iommu->iommu, &riscv_iommu_ops, iommu->dev);
+	if (WARN(rc, "cannot register iommu interface\n"))
+		goto err_iommu;
+
 	return 0;
 
+err_iommu:
+	iommu_device_sysfs_remove(&iommu->iommu);
 err_sysfs:
 	return rc;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 5/7] iommu/riscv: Device directory management.
  2024-04-18 16:32 [PATCH v2 0/7] Linux RISC-V IOMMU Support Tomasz Jeznach
                   ` (3 preceding siblings ...)
  2024-04-18 16:32 ` [PATCH v2 4/7] iommu/riscv: Enable IOMMU registration and device probe Tomasz Jeznach
@ 2024-04-18 16:32 ` Tomasz Jeznach
  2024-04-19 12:40   ` Jason Gunthorpe
  2024-04-22  5:11   ` Baolu Lu
  2024-04-18 16:32 ` [PATCH v2 6/7] iommu/riscv: Command and fault queue support Tomasz Jeznach
  2024-04-18 16:32 ` [PATCH v2 7/7] iommu/riscv: Paging domain support Tomasz Jeznach
  6 siblings, 2 replies; 30+ messages in thread
From: Tomasz Jeznach @ 2024-04-18 16:32 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley
  Cc: Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux,
	Tomasz Jeznach

Introduce device context allocation and device directory tree
management including capabilities discovery sequence, as described
in Chapter 2.1 of the RISC-V IOMMU Architecture Specification.

Device directory mode will be auto detected using DDTP WARL property,
using highest mode supported by the driver and hardware. If none
supported can be configured, driver will fall back to global pass-through.

First level DDTP page can be located in I/O (detected using DDTP WARL)
and system memory.

Only identity protection domain is supported by this implementation.

Co-developed-by: Nick Kossifidis <mick@ics.forth.gr>
Signed-off-by: Nick Kossifidis <mick@ics.forth.gr>
Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
---
 drivers/iommu/riscv/iommu.c | 369 +++++++++++++++++++++++++++++++++++-
 drivers/iommu/riscv/iommu.h |   5 +
 2 files changed, 365 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index d38317cb2493..721cc71cb959 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -16,6 +16,7 @@
 #include <linux/dma-mapping.h>
 #include <linux/init.h>
 #include <linux/iommu.h>
+#include <linux/iopoll.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
 #include <linux/pci.h>
@@ -31,13 +32,350 @@ MODULE_LICENSE("GPL");
 /* Timeouts in [us] */
 #define RISCV_IOMMU_DDTP_TIMEOUT	50000
 
-static int riscv_iommu_attach_identity_domain(struct iommu_domain *domain,
-					      struct device *dev)
+/* RISC-V IOMMU PPN <> PHYS address conversions, PHYS <=> PPN[53:10] */
+#define phys_to_ppn(va)  (((va) >> 2) & (((1ULL << 44) - 1) << 10))
+#define ppn_to_phys(pn)	 (((pn) << 2) & (((1ULL << 44) - 1) << 12))
+
+#define dev_to_iommu(dev) \
+	container_of((dev)->iommu->iommu_dev, struct riscv_iommu_device, iommu)
+
+/* Device resource-managed allocations */
+struct riscv_iommu_devres {
+	unsigned long addr;
+	unsigned int order;
+};
+
+static void riscv_iommu_devres_pages_release(struct device *dev, void *res)
+{
+	struct riscv_iommu_devres *devres = res;
+
+	free_pages(devres->addr, devres->order);
+}
+
+static int riscv_iommu_devres_pages_match(struct device *dev, void *res, void *p)
+{
+	struct riscv_iommu_devres *devres = res;
+	struct riscv_iommu_devres *target = p;
+
+	return devres->addr == target->addr;
+}
+
+static unsigned long riscv_iommu_get_pages(struct riscv_iommu_device *iommu, unsigned int order)
+{
+	struct riscv_iommu_devres *devres;
+	struct page *pages;
+
+	pages = alloc_pages_node(dev_to_node(iommu->dev),
+				 GFP_KERNEL_ACCOUNT | __GFP_ZERO, order);
+	if (unlikely(!pages)) {
+		dev_err(iommu->dev, "Page allocation failed, order %u\n", order);
+		return 0;
+	}
+
+	devres = devres_alloc(riscv_iommu_devres_pages_release,
+			      sizeof(struct riscv_iommu_devres), GFP_KERNEL);
+
+	if (unlikely(!devres)) {
+		__free_pages(pages, order);
+		return 0;
+	}
+
+	devres->addr = (unsigned long)page_address(pages);
+	devres->order = order;
+
+	devres_add(iommu->dev, devres);
+
+	return devres->addr;
+}
+
+static void riscv_iommu_free_pages(struct riscv_iommu_device *iommu, unsigned long addr)
+{
+	struct riscv_iommu_devres devres = { .addr = addr };
+
+	devres_release(iommu->dev, riscv_iommu_devres_pages_release,
+		       riscv_iommu_devres_pages_match, &devres);
+}
+
+/* Lookup and initialize device context info structure. */
+static struct riscv_iommu_dc *riscv_iommu_get_dc(struct riscv_iommu_device *iommu,
+						 unsigned int devid, bool fetch)
+{
+	const bool base_format = !(iommu->caps & RISCV_IOMMU_CAP_MSI_FLAT);
+	unsigned int depth;
+	unsigned long ddt, ptr, old, new;
+	u8 ddi_bits[3] = { 0 };
+	u64 *ddtp = NULL;
+
+	/* Make sure the mode is valid */
+	if (iommu->ddt_mode < RISCV_IOMMU_DDTP_MODE_1LVL ||
+	    iommu->ddt_mode > RISCV_IOMMU_DDTP_MODE_3LVL)
+		return NULL;
+
+	/*
+	 * Device id partitioning for base format:
+	 * DDI[0]: bits 0 - 6   (1st level) (7 bits)
+	 * DDI[1]: bits 7 - 15  (2nd level) (9 bits)
+	 * DDI[2]: bits 16 - 23 (3rd level) (8 bits)
+	 *
+	 * For extended format:
+	 * DDI[0]: bits 0 - 5   (1st level) (6 bits)
+	 * DDI[1]: bits 6 - 14  (2nd level) (9 bits)
+	 * DDI[2]: bits 15 - 23 (3rd level) (9 bits)
+	 */
+	if (base_format) {
+		ddi_bits[0] = 7;
+		ddi_bits[1] = 7 + 9;
+		ddi_bits[2] = 7 + 9 + 8;
+	} else {
+		ddi_bits[0] = 6;
+		ddi_bits[1] = 6 + 9;
+		ddi_bits[2] = 6 + 9 + 9;
+	}
+
+	/* Make sure device id is within range */
+	depth = iommu->ddt_mode - RISCV_IOMMU_DDTP_MODE_1LVL;
+	if (devid >= (1 << ddi_bits[depth]))
+		return NULL;
+
+	/* Get to the level of the non-leaf node that holds the device context */
+	for (ddtp = iommu->ddt_root; depth-- > 0;) {
+		const int split = ddi_bits[depth];
+		/*
+		 * Each non-leaf node is 64bits wide and on each level
+		 * nodes are indexed by DDI[depth].
+		 */
+		ddtp += (devid >> split) & 0x1FF;
+
+		/*
+		 * Check if this node has been populated and if not
+		 * allocate a new level and populate it.
+		 */
+		do {
+			ddt = READ_ONCE(*(unsigned long *)ddtp);
+			if (ddt & RISCV_IOMMU_DDTE_VALID) {
+				ddtp = __va(ppn_to_phys(ddt));
+				break;
+			}
+
+			/* Fetch only, do not allocate new device context. */
+			if (fetch)
+				return NULL;
+
+			ptr = riscv_iommu_get_pages(iommu, 0);
+			if (!ptr)
+				return NULL;
+
+			new = phys_to_ppn(__pa(ptr)) | RISCV_IOMMU_DDTE_VALID;
+			old = cmpxchg_relaxed((unsigned long *)ddtp, ddt, new);
+
+			if (old == ddt) {
+				ddtp = (u64 *)ptr;
+				break;
+			}
+
+			/* Race setting DDT detected, re-read and retry. */
+			riscv_iommu_free_pages(iommu, ptr);
+		} while (1);
+	}
+
+	/*
+	 * Grab the node that matches DDI[depth], note that when using base
+	 * format the device context is 4 * 64bits, and the extended format
+	 * is 8 * 64bits, hence the (3 - base_format) below.
+	 */
+	ddtp += (devid & ((64 << base_format) - 1)) << (3 - base_format);
+
+	return (struct riscv_iommu_dc *)ddtp;
+}
+
+/*
+ * Discover supported DDT modes starting from requested value,
+ * configure DDTP register with accepted mode and root DDT address.
+ * Accepted iommu->ddt_mode is updated on success.
+ */
+static int riscv_iommu_set_ddtp_mode(struct riscv_iommu_device *iommu,
+				     unsigned int ddtp_mode)
+{
+	struct device *dev = iommu->dev;
+	u64 ddtp, rq_ddtp;
+	unsigned int mode, rq_mode = ddtp_mode;
+	int rc;
+
+	rc = readq_relaxed_poll_timeout(iommu->reg + RISCV_IOMMU_REG_DDTP,
+					ddtp, !(ddtp & RISCV_IOMMU_DDTP_BUSY),
+					10, RISCV_IOMMU_DDTP_TIMEOUT);
+	if (rc < 0)
+		return -EBUSY;
+
+	/* Disallow state transition from xLVL to xLVL. */
+	switch (FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp)) {
+	case RISCV_IOMMU_DDTP_MODE_BARE:
+	case RISCV_IOMMU_DDTP_MODE_OFF:
+		break;
+	default:
+		if (rq_mode != RISCV_IOMMU_DDTP_MODE_BARE &&
+		    rq_mode != RISCV_IOMMU_DDTP_MODE_OFF)
+			return -EINVAL;
+		break;
+	}
+
+	do {
+		rq_ddtp = FIELD_PREP(RISCV_IOMMU_DDTP_MODE, rq_mode);
+		if (rq_mode > RISCV_IOMMU_DDTP_MODE_BARE)
+			rq_ddtp |= phys_to_ppn(iommu->ddt_phys);
+
+		riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP, rq_ddtp);
+
+		rc = readq_relaxed_poll_timeout(iommu->reg + RISCV_IOMMU_REG_DDTP,
+						ddtp, !(ddtp & RISCV_IOMMU_DDTP_BUSY),
+						10, RISCV_IOMMU_DDTP_TIMEOUT);
+		if (rc < 0) {
+			dev_warn(dev, "timeout when setting ddtp (ddt mode: %u, read: %llx)\n",
+				 rq_mode, ddtp);
+			return -EBUSY;
+		}
+
+		/* Verify IOMMU hardware accepts new DDTP config. */
+		mode = FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp);
+
+		if (rq_mode == mode)
+			break;
+
+		/* Hardware mandatory DDTP mode has not been accepted. */
+		if (rq_mode < RISCV_IOMMU_DDTP_MODE_1LVL && rq_ddtp != ddtp) {
+			dev_warn(dev, "DDTP update failed hw: %llx vs %llx\n", ddtp, rq_ddtp);
+			return -EINVAL;
+		}
+
+		/*
+		 * Mode field is WARL, an IOMMU may support a subset of
+		 * directory table levels in which case if we tried to set
+		 * an unsupported number of levels we'll readback either
+		 * a valid xLVL or off/bare. If we got off/bare, try again
+		 * with a smaller xLVL.
+		 */
+		if (mode < RISCV_IOMMU_DDTP_MODE_1LVL &&
+		    rq_mode > RISCV_IOMMU_DDTP_MODE_1LVL) {
+			dev_dbg(dev, "DDTP hw mode %u vs %u\n", mode, rq_mode);
+			rq_mode--;
+			continue;
+		}
+
+		/*
+		 * We tried all supported modes and IOMMU hardware failed to
+		 * accept new settings, something went very wrong since off/bare
+		 * and at least one xLVL must be supported.
+		 */
+		dev_warn(dev, "DDTP hw mode %u, failed to set %u\n", mode, ddtp_mode);
+		return -EINVAL;
+	} while (1);
+
+	iommu->ddt_mode = mode;
+	if (mode != ddtp_mode)
+		dev_warn(dev, "DDTP failover to %u mode, requested %u\n",
+			 mode, ddtp_mode);
+
+	return 0;
+}
+
+static int riscv_iommu_ddt_alloc(struct riscv_iommu_device *iommu)
 {
-	/* Global pass-through already enabled, do nothing for now. */
+	u64 ddtp;
+	unsigned int mode;
+
+	riscv_iommu_readq_timeout(iommu, RISCV_IOMMU_REG_DDTP,
+				  ddtp, !(ddtp & RISCV_IOMMU_DDTP_BUSY),
+				  10, RISCV_IOMMU_DDTP_TIMEOUT);
+
+	if (ddtp & RISCV_IOMMU_DDTP_BUSY)
+		return -EBUSY;
+
+	/*
+	 * It is optional for the hardware to report a fixed address for device
+	 * directory root page when DDT.MODE is OFF or BARE.
+	 */
+	mode = FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp);
+	if (mode != RISCV_IOMMU_DDTP_MODE_BARE && mode != RISCV_IOMMU_DDTP_MODE_OFF) {
+		/* Use WARL to discover hardware fixed DDT PPN */
+		riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP,
+				   FIELD_PREP(RISCV_IOMMU_DDTP_MODE, mode));
+		riscv_iommu_readl_timeout(iommu, RISCV_IOMMU_REG_DDTP,
+					  ddtp, !(ddtp & RISCV_IOMMU_DDTP_BUSY),
+					  10, RISCV_IOMMU_DDTP_TIMEOUT);
+		if (ddtp & RISCV_IOMMU_DDTP_BUSY)
+			return -EBUSY;
+
+		iommu->ddt_phys = ppn_to_phys(ddtp);
+		if (iommu->ddt_phys)
+			iommu->ddt_root = devm_ioremap(iommu->dev, iommu->ddt_phys, PAGE_SIZE);
+		if (iommu->ddt_root)
+			memset(iommu->ddt_root, 0, PAGE_SIZE);
+	}
+
+	if (!iommu->ddt_root) {
+		iommu->ddt_root = (u64 *)riscv_iommu_get_pages(iommu, 0);
+		iommu->ddt_phys = __pa(iommu->ddt_root);
+	}
+
+	if (!iommu->ddt_root)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int riscv_iommu_attach_domain(struct riscv_iommu_device *iommu,
+				     struct device *dev,
+				     struct iommu_domain *iommu_domain)
+{
+	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+	struct riscv_iommu_dc *dc;
+	u64 fsc, ta, tc;
+	int i;
+
+	if (!iommu_domain) {
+		ta = 0;
+		tc = 0;
+		fsc = 0;
+	} else if (iommu_domain->type == IOMMU_DOMAIN_IDENTITY) {
+		ta = 0;
+		tc = RISCV_IOMMU_DC_TC_V;
+		fsc = FIELD_PREP(RISCV_IOMMU_DC_FSC_MODE, RISCV_IOMMU_DC_FSC_MODE_BARE);
+	} else {
+		/* This should never happen. */
+		return -ENODEV;
+	}
+
+	/* Update existing or allocate new entries in device directory */
+	for (i = 0; i < fwspec->num_ids; i++) {
+		dc = riscv_iommu_get_dc(iommu, fwspec->ids[i], !iommu_domain);
+		if (!dc && !iommu_domain)
+			continue;
+		if (!dc)
+			return -ENODEV;
+
+		/* Swap device context, update TC valid bit as the last operation */
+		xchg64(&dc->fsc, fsc);
+		xchg64(&dc->ta, ta);
+		xchg64(&dc->tc, tc);
+
+		/* Device context invalidation will be required. Ignoring for now. */
+	}
+
 	return 0;
 }
 
+static int riscv_iommu_attach_identity_domain(struct iommu_domain *iommu_domain,
+					      struct device *dev)
+{
+	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
+
+	/* Global pass-through already enabled, do nothing. */
+	if (iommu->ddt_mode == RISCV_IOMMU_DDTP_MODE_BARE)
+		return 0;
+
+	return riscv_iommu_attach_domain(iommu, dev, iommu_domain);
+}
+
 static struct iommu_domain riscv_iommu_identity_domain = {
 	.type = IOMMU_DOMAIN_IDENTITY,
 	.ops = &(const struct iommu_domain_ops) {
@@ -82,6 +420,13 @@ static void riscv_iommu_probe_finalize(struct device *dev)
 	iommu_setup_dma_ops(dev, 0, U64_MAX);
 }
 
+static void riscv_iommu_release_device(struct device *dev)
+{
+	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
+
+	riscv_iommu_attach_domain(iommu, dev, NULL);
+}
+
 static const struct iommu_ops riscv_iommu_ops = {
 	.owner = THIS_MODULE,
 	.of_xlate = riscv_iommu_of_xlate,
@@ -90,6 +435,7 @@ static const struct iommu_ops riscv_iommu_ops = {
 	.device_group = riscv_iommu_device_group,
 	.probe_device = riscv_iommu_probe_device,
 	.probe_finalize = riscv_iommu_probe_finalize,
+	.release_device = riscv_iommu_release_device,
 };
 
 static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
@@ -124,6 +470,7 @@ void riscv_iommu_remove(struct riscv_iommu_device *iommu)
 {
 	iommu_device_unregister(&iommu->iommu);
 	iommu_device_sysfs_remove(&iommu->iommu);
+	riscv_iommu_set_ddtp_mode(iommu, RISCV_IOMMU_DDTP_MODE_OFF);
 }
 
 int riscv_iommu_init(struct riscv_iommu_device *iommu)
@@ -133,12 +480,14 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
 	rc = riscv_iommu_init_check(iommu);
 	if (rc)
 		return dev_err_probe(iommu->dev, rc, "unexpected device state\n");
-	/*
-	 * Placeholder for a complete IOMMU device initialization.
-	 * For now, only bare minimum: enable global identity mapping mode and register sysfs.
-	 */
-	riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP,
-			   FIELD_PREP(RISCV_IOMMU_DDTP_MODE, RISCV_IOMMU_DDTP_MODE_BARE));
+
+	rc = riscv_iommu_ddt_alloc(iommu);
+	if (WARN(rc, "cannot allocate device directory\n"))
+		goto err_init;
+
+	rc = riscv_iommu_set_ddtp_mode(iommu, RISCV_IOMMU_DDTP_MODE_MAX);
+	if (WARN(rc, "cannot enable iommu device\n"))
+		goto err_init;
 
 	rc = iommu_device_sysfs_add(&iommu->iommu, NULL, NULL, "riscv-iommu@%s",
 				    dev_name(iommu->dev));
@@ -154,5 +503,7 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
 err_iommu:
 	iommu_device_sysfs_remove(&iommu->iommu);
 err_sysfs:
+	riscv_iommu_set_ddtp_mode(iommu, RISCV_IOMMU_DDTP_MODE_OFF);
+err_init:
 	return rc;
 }
diff --git a/drivers/iommu/riscv/iommu.h b/drivers/iommu/riscv/iommu.h
index 700e33dc2446..f1696926582c 100644
--- a/drivers/iommu/riscv/iommu.h
+++ b/drivers/iommu/riscv/iommu.h
@@ -34,6 +34,11 @@ struct riscv_iommu_device {
 	/* available interrupt numbers, MSI or WSI */
 	unsigned int irqs[RISCV_IOMMU_INTR_COUNT];
 	unsigned int irqs_count;
+
+	/* device directory */
+	unsigned int ddt_mode;
+	dma_addr_t ddt_phys;
+	u64 *ddt_root;
 };
 
 int riscv_iommu_init(struct riscv_iommu_device *iommu);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 6/7] iommu/riscv: Command and fault queue support
  2024-04-18 16:32 [PATCH v2 0/7] Linux RISC-V IOMMU Support Tomasz Jeznach
                   ` (4 preceding siblings ...)
  2024-04-18 16:32 ` [PATCH v2 5/7] iommu/riscv: Device directory management Tomasz Jeznach
@ 2024-04-18 16:32 ` Tomasz Jeznach
  2024-04-18 16:32 ` [PATCH v2 7/7] iommu/riscv: Paging domain support Tomasz Jeznach
  6 siblings, 0 replies; 30+ messages in thread
From: Tomasz Jeznach @ 2024-04-18 16:32 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley
  Cc: Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux,
	Tomasz Jeznach

Introduce device command submission and fault reporting queues,
as described in Chapter 3.1 and 3.2 of the RISC-V IOMMU Architecture
Specification.

Command and fault queues are instantiated in contiguous system memory
local to IOMMU device domain, or mapped from fixed I/O space provided
by the hardware implementation. Detection of the location and maximum
allowed size of the queue utilize WARL properties of queue base control
register. Driver implementation will try to allocate up to 128KB of
system memory, while respecting hardware supported maximum queue size.

Interrupts allocation is based on interrupt vectors availability and
ditributed to all queues in simple round-robin fashion. For hardware
Implementation with fixed event type to interrupt vector assignment
IVEC WARL property is used to discover such mappings.

Address translation, command and queue fault handling in this change
is limited to simple fault reporing without taking any action.

Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
---
 drivers/iommu/riscv/iommu-bits.h |  75 +++++
 drivers/iommu/riscv/iommu.c      | 473 ++++++++++++++++++++++++++++++-
 drivers/iommu/riscv/iommu.h      |  21 ++
 3 files changed, 567 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/riscv/iommu-bits.h b/drivers/iommu/riscv/iommu-bits.h
index ba093c29de9f..40c379222821 100644
--- a/drivers/iommu/riscv/iommu-bits.h
+++ b/drivers/iommu/riscv/iommu-bits.h
@@ -704,4 +704,79 @@ struct riscv_iommu_msi_pte {
 #define RISCV_IOMMU_MSI_MRIF_NPPN	RISCV_IOMMU_PPN_FIELD
 #define RISCV_IOMMU_MSI_MRIF_NID_MSB	BIT_ULL(60)
 
+/* Helper functions: command structure builders. */
+
+static inline void riscv_iommu_cmd_inval_vma(struct riscv_iommu_command *cmd)
+{
+	cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IOTINVAL_OPCODE) |
+		      FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA);
+	cmd->dword1 = 0;
+}
+
+static inline void riscv_iommu_cmd_inval_set_addr(struct riscv_iommu_command *cmd,
+						  u64 addr)
+{
+	cmd->dword1 = FIELD_PREP(RISCV_IOMMU_CMD_IOTINVAL_ADDR, phys_to_pfn(addr));
+	cmd->dword0 |= RISCV_IOMMU_CMD_IOTINVAL_AV;
+}
+
+static inline void riscv_iommu_cmd_inval_set_pscid(struct riscv_iommu_command *cmd,
+						   int pscid)
+{
+	cmd->dword0 |= FIELD_PREP(RISCV_IOMMU_CMD_IOTINVAL_PSCID, pscid) |
+		       RISCV_IOMMU_CMD_IOTINVAL_PSCV;
+}
+
+static inline void riscv_iommu_cmd_inval_set_gscid(struct riscv_iommu_command *cmd,
+						   int gscid)
+{
+	cmd->dword0 |= FIELD_PREP(RISCV_IOMMU_CMD_IOTINVAL_GSCID, gscid) |
+		       RISCV_IOMMU_CMD_IOTINVAL_GV;
+}
+
+static inline void riscv_iommu_cmd_iofence(struct riscv_iommu_command *cmd)
+{
+	cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IOFENCE_OPCODE) |
+		      FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IOFENCE_FUNC_C) |
+		      RISCV_IOMMU_CMD_IOFENCE_PR | RISCV_IOMMU_CMD_IOFENCE_PW;
+	cmd->dword1 = 0;
+}
+
+static inline void riscv_iommu_cmd_iofence_set_av(struct riscv_iommu_command *cmd,
+						  u64 addr, u32 data)
+{
+	cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IOFENCE_OPCODE) |
+		      FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IOFENCE_FUNC_C) |
+		      FIELD_PREP(RISCV_IOMMU_CMD_IOFENCE_DATA, data) |
+		      RISCV_IOMMU_CMD_IOFENCE_AV;
+	cmd->dword1 = addr >> 2;
+}
+
+static inline void riscv_iommu_cmd_iodir_inval_ddt(struct riscv_iommu_command *cmd)
+{
+	cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IODIR_OPCODE) |
+		      FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_DDT);
+	cmd->dword1 = 0;
+}
+
+static inline void riscv_iommu_cmd_iodir_inval_pdt(struct riscv_iommu_command *cmd)
+{
+	cmd->dword0 = FIELD_PREP(RISCV_IOMMU_CMD_OPCODE, RISCV_IOMMU_CMD_IODIR_OPCODE) |
+		      FIELD_PREP(RISCV_IOMMU_CMD_FUNC, RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_PDT);
+	cmd->dword1 = 0;
+}
+
+static inline void riscv_iommu_cmd_iodir_set_did(struct riscv_iommu_command *cmd,
+						 unsigned int devid)
+{
+	cmd->dword0 |= FIELD_PREP(RISCV_IOMMU_CMD_IODIR_DID, devid) |
+		       RISCV_IOMMU_CMD_IODIR_DV;
+}
+
+static inline void riscv_iommu_cmd_iodir_set_pid(struct riscv_iommu_command *cmd,
+						 unsigned int pasid)
+{
+	cmd->dword0 |= FIELD_PREP(RISCV_IOMMU_CMD_IODIR_PID, pasid);
+}
+
 #endif /* _RISCV_IOMMU_BITS_H_ */
diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index 721cc71cb959..a4f74588cdc2 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -30,7 +30,14 @@ MODULE_AUTHOR("Nick Kossifidis <mick@ics.forth.gr>");
 MODULE_LICENSE("GPL");
 
 /* Timeouts in [us] */
-#define RISCV_IOMMU_DDTP_TIMEOUT	50000
+#define RISCV_IOMMU_QCSR_TIMEOUT	150000
+#define RISCV_IOMMU_QUEUE_TIMEOUT	150000
+#define RISCV_IOMMU_DDTP_TIMEOUT	10000000
+#define RISCV_IOMMU_IOTINVAL_TIMEOUT	90000000
+
+/* Number of entries per CMD/FLT queue, should be <= INT_MAX */
+#define RISCV_IOMMU_DEF_CQ_COUNT	8192
+#define RISCV_IOMMU_DEF_FQ_COUNT	4096
 
 /* RISC-V IOMMU PPN <> PHYS address conversions, PHYS <=> PPN[53:10] */
 #define phys_to_ppn(va)  (((va) >> 2) & (((1ULL << 44) - 1) << 10))
@@ -96,6 +103,417 @@ static void riscv_iommu_free_pages(struct riscv_iommu_device *iommu, unsigned lo
 		       riscv_iommu_devres_pages_match, &devres);
 }
 
+/*
+ * Hardware queue allocation and management.
+ */
+
+/* Setup queue base, control registers and default queue length */
+#define RISCV_IOMMU_QUEUE_INIT(q, name) do {					\
+	struct riscv_iommu_queue *_q = q;					\
+	_q->qid = RISCV_IOMMU_INTR_ ## name;					\
+	_q->qbr = RISCV_IOMMU_REG_ ## name ## B;				\
+	_q->qcr = RISCV_IOMMU_REG_ ## name ## CSR;				\
+	_q->mask = _q->mask ?: (RISCV_IOMMU_DEF_ ## name ## _COUNT) - 1;	\
+} while (0)
+
+/* Note: offsets are the same for all queues */
+#define Q_HEAD(q) ((q)->qbr + (RISCV_IOMMU_REG_CQH - RISCV_IOMMU_REG_CQB))
+#define Q_TAIL(q) ((q)->qbr + (RISCV_IOMMU_REG_CQT - RISCV_IOMMU_REG_CQB))
+#define Q_ITEM(q, index) ((q)->mask & (index))
+#define Q_IPSR(q) BIT((q)->qid)
+
+/*
+ * Discover queue ring buffer hardware configuration, allocate in-memory
+ * ring buffer or use fixed I/O memory location, configure queue base register.
+ * Must be called before hardware queue is enabled.
+ *
+ * @queue - data structure, configured with RISCV_IOMMU_QUEUE_INIT()
+ * @entry_size - queue single element size in bytes.
+ */
+static int riscv_iommu_queue_alloc(struct riscv_iommu_device *iommu,
+				   struct riscv_iommu_queue *queue,
+				   size_t entry_size)
+{
+	unsigned int logsz;
+	unsigned long addr = 0;
+	u64 qb, rb;
+
+	/*
+	 * Use WARL base register property to discover maximum allowed
+	 * number of entries and optional fixed IO address for queue location.
+	 */
+	riscv_iommu_writeq(iommu, queue->qbr, RISCV_IOMMU_QUEUE_LOGSZ_FIELD);
+	qb = riscv_iommu_readq(iommu, queue->qbr);
+
+	/*
+	 * Calculate and verify hardware supported queue length, as reported
+	 * by the field LOGSZ, where max queue length is equal to 2^(LOGSZ + 1).
+	 * Update queue size based on hardware supported value.
+	 */
+	logsz = ilog2(queue->mask);
+	if (logsz > FIELD_GET(RISCV_IOMMU_QUEUE_LOGSZ_FIELD, qb))
+		logsz = FIELD_GET(RISCV_IOMMU_QUEUE_LOGSZ_FIELD, qb);
+
+	/*
+	 * Use WARL base register property to discover an optional fixed IO address
+	 * for queue ring buffer location. Otherwise allocate contigus system memory.
+	 */
+	if (FIELD_GET(RISCV_IOMMU_PPN_FIELD, qb)) {
+		const size_t queue_size = entry_size << (logsz + 1);
+
+		queue->phys = ppn_to_phys(FIELD_GET(RISCV_IOMMU_PPN_FIELD, qb));
+		queue->base = devm_ioremap(iommu->dev, queue->phys, queue_size);
+	} else {
+		do {
+			const size_t queue_size = entry_size << (logsz + 1);
+
+			addr = riscv_iommu_get_pages(iommu, (unsigned int)get_order(queue_size));
+			queue->base = (u64 *)addr;
+			queue->phys = __pa(addr);
+		} while (!queue->base && logsz-- > 0);
+	}
+
+	if (!queue->base)
+		return -ENOMEM;
+
+	qb = phys_to_ppn(queue->phys) |
+	     FIELD_PREP(RISCV_IOMMU_QUEUE_LOGSZ_FIELD, logsz);
+
+	/* Update base register and read back to verify hw accepted our write */
+	riscv_iommu_writeq(iommu, queue->qbr, qb);
+	rb = riscv_iommu_readq(iommu, queue->qbr);
+	if (rb != qb) {
+		if (addr)
+			riscv_iommu_free_pages(iommu, addr);
+		return -ENODEV;
+	}
+
+	/* Update actual queue mask */
+	queue->mask = (2U << logsz) - 1;
+
+	dev_dbg(iommu->dev, "queue #%u allocated 2^%u entries", queue->qid, logsz + 1);
+
+	return 0;
+}
+
+/* Check interrupt queue status, IPSR */
+static irqreturn_t riscv_iommu_queue_ipsr(int irq, void *data)
+{
+	struct riscv_iommu_queue *queue = (struct riscv_iommu_queue *)data;
+
+	if (riscv_iommu_readl(queue->iommu, RISCV_IOMMU_REG_IPSR) & Q_IPSR(queue))
+		return IRQ_WAKE_THREAD;
+
+	return IRQ_NONE;
+}
+
+static int riscv_iommu_queue_vec(struct riscv_iommu_device *iommu, int n)
+{
+	/* Reuse IVEC.CIV mask for all interrupt vectors mapping. */
+	return (iommu->ivec >> (n * 4)) & RISCV_IOMMU_IVEC_CIV;
+}
+
+/*
+ * Enable queue processing in the hardware, register interrupt handler.
+ *
+ * @queue - data structure, already allocated with riscv_iommu_queue_alloc()
+ * @irq_handler - threaded interrupt handler.
+ */
+static int riscv_iommu_queue_enable(struct riscv_iommu_device *iommu,
+				    struct riscv_iommu_queue *queue,
+				    irq_handler_t irq_handler)
+{
+	const unsigned int irq = iommu->irqs[riscv_iommu_queue_vec(iommu, queue->qid)];
+	u32 csr;
+	int rc;
+
+	if (queue->iommu)
+		return -EBUSY;
+
+	/* Polling not implemented */
+	if (!irq)
+		return -ENODEV;
+
+	queue->iommu = iommu;
+	rc = request_threaded_irq(irq, riscv_iommu_queue_ipsr, irq_handler,
+				  IRQF_ONESHOT | IRQF_SHARED, dev_name(iommu->dev), queue);
+	if (rc) {
+		queue->iommu = NULL;
+		return rc;
+	}
+
+	/*
+	 * Enable queue with interrupts, clear any memory fault if any.
+	 * Wait for the hardware to acknowledge request and activate queue processing.
+	 * Note: All CSR bitfields are in the same offsets for all queues.
+	 */
+	riscv_iommu_writel(iommu, queue->qcr,
+			   RISCV_IOMMU_QUEUE_ENABLE |
+			   RISCV_IOMMU_QUEUE_INTR_ENABLE |
+			   RISCV_IOMMU_QUEUE_MEM_FAULT);
+
+	riscv_iommu_readl_timeout(iommu, queue->qcr,
+				  csr, !(csr & RISCV_IOMMU_QUEUE_BUSY),
+				  10, RISCV_IOMMU_QCSR_TIMEOUT);
+
+	if (RISCV_IOMMU_QUEUE_ACTIVE != (csr & (RISCV_IOMMU_QUEUE_ACTIVE |
+						RISCV_IOMMU_QUEUE_BUSY |
+						RISCV_IOMMU_QUEUE_MEM_FAULT))) {
+		/* Best effort to stop and disable failing hardware queue. */
+		riscv_iommu_writel(iommu, queue->qcr, 0);
+		free_irq(irq, queue);
+		queue->iommu = NULL;
+		return -EBUSY;
+	}
+
+	/* Clear any pending interrupt flag. */
+	riscv_iommu_writel(iommu, RISCV_IOMMU_REG_IPSR, Q_IPSR(queue));
+
+	return 0;
+}
+
+/*
+ * Disable queue. Wait for the hardware to acknowledge request and
+ * stop processing enqueued requests. Report errors but continue.
+ */
+static void riscv_iommu_queue_disable(struct riscv_iommu_queue *queue)
+{
+	struct riscv_iommu_device *iommu = queue->iommu;
+	u32 csr;
+
+	if (!iommu)
+		return;
+
+	free_irq(iommu->irqs[riscv_iommu_queue_vec(iommu, queue->qid)], queue);
+	riscv_iommu_writel(iommu, queue->qcr, 0);
+	riscv_iommu_readl_timeout(iommu, queue->qcr,
+				  csr, !(csr & RISCV_IOMMU_QUEUE_BUSY),
+				  10, RISCV_IOMMU_QCSR_TIMEOUT);
+
+	if (csr & (RISCV_IOMMU_QUEUE_ACTIVE | RISCV_IOMMU_QUEUE_BUSY))
+		dev_err(iommu->dev, "fail to disable hardware queue #%u, csr 0x%x\n",
+			queue->qid, csr);
+
+	queue->iommu = NULL;
+}
+
+/*
+ * Returns number of available valid queue entries and the first item index or negative
+ * error code.  Update shadow producer index if necessary.
+ */
+static int riscv_iommu_queue_consume(struct riscv_iommu_queue *queue, unsigned int *index)
+{
+	unsigned int head = atomic_read(&queue->head);
+	unsigned int tail = atomic_read(&queue->tail);
+	unsigned int last = Q_ITEM(queue, tail);
+	int available = (int)(tail - head);
+
+	*index = head;
+
+	if (available > 0)
+		return available;
+
+	/* read hardware producer index, check reserved register bits are not set. */
+	if (riscv_iommu_readl_timeout(queue->iommu, Q_TAIL(queue),
+				      tail, (tail & ~queue->mask) == 0,
+				      0, RISCV_IOMMU_QUEUE_TIMEOUT))
+		return -EBUSY;
+
+	if (tail == last)
+		return 0;
+
+	/* update shadow producer index */
+	return (int)(atomic_add_return((tail - last) & queue->mask, &queue->tail) - head);
+}
+
+/*
+ * Release processed queue entries, should match riscv_iommu_queue_consume() calls.
+ */
+static void riscv_iommu_queue_release(struct riscv_iommu_queue *queue, int count)
+{
+	const unsigned int head = atomic_add_return(count, &queue->head);
+
+	riscv_iommu_writel(queue->iommu, Q_HEAD(queue), Q_ITEM(queue, head));
+}
+
+/* Return actual consumer index based on hardware reported queue head index. */
+static unsigned int riscv_iommu_queue_cons(struct riscv_iommu_queue *queue)
+{
+	const unsigned int cons = atomic_read(&queue->head);
+	const unsigned int last = Q_ITEM(queue, cons);
+	unsigned int head;
+
+	if (riscv_iommu_readl_timeout(queue->iommu, Q_HEAD(queue), head,
+				      !(head & ~queue->mask), 0, RISCV_IOMMU_QUEUE_TIMEOUT))
+		return cons;
+
+	return cons + ((head - last) & queue->mask);
+}
+
+/* Wait for submitted item to be processed. */
+static int riscv_iommu_queue_wait(struct riscv_iommu_queue *queue, unsigned int index,
+				  unsigned int timeout_us)
+{
+	unsigned int cons = atomic_read(&queue->head);
+
+	/* Already processed by the consumer */
+	if ((int)(cons - index) > 0)
+		return 0;
+
+	/* Monitor consumer index */
+	return readx_poll_timeout(riscv_iommu_queue_cons, queue, cons, (int)(cons - index) > 0,
+				  0, timeout_us);
+}
+
+/* Enqueue an entry and wait to be processed if timeout_us > 0 */
+static int riscv_iommu_queue_send(struct riscv_iommu_queue *queue,
+				  void *entry, size_t entry_size,
+				  unsigned int timeout_us)
+{
+	unsigned int prod;
+	unsigned int head;
+	unsigned int tail;
+	unsigned long flags;
+
+	/* Do not preempt submission flow. */
+	local_irq_save(flags);
+
+	/* 1. Allocate some space in the queue */
+	prod = atomic_inc_return(&queue->prod) - 1;
+	head = atomic_read(&queue->head);
+
+	/* 2. Wait for space availability. */
+	if ((prod - head) > queue->mask) {
+		if (readx_poll_timeout(atomic_read, &queue->head,
+				       head, (prod - head) < queue->mask,
+				       0, RISCV_IOMMU_QUEUE_TIMEOUT))
+			goto err_busy;
+	} else if ((prod - head) == queue->mask) {
+		const unsigned int last = Q_ITEM(queue, head);
+
+		if (riscv_iommu_readl_timeout(queue->iommu, Q_HEAD(queue), head,
+					      !(head & ~queue->mask) && head != last,
+					      0, RISCV_IOMMU_QUEUE_TIMEOUT))
+			goto err_busy;
+		atomic_add((head - last) & queue->mask, &queue->head);
+	}
+
+	/* 3. Store entry in the ring buffer. */
+	memcpy(queue->base + Q_ITEM(queue, prod) * entry_size, entry, entry_size);
+
+	/* 4. Wait for all previous entries to be ready */
+	if (readx_poll_timeout(atomic_read, &queue->tail, tail, prod == tail,
+			       0, RISCV_IOMMU_QUEUE_TIMEOUT))
+		goto err_busy;
+
+	/* 5. Complete submission and restore local interrupts */
+	dma_wmb();
+	riscv_iommu_writel(queue->iommu, Q_TAIL(queue), Q_ITEM(queue, prod + 1));
+	atomic_inc(&queue->tail);
+	local_irq_restore(flags);
+
+	if (timeout_us)
+		return WARN_ON(riscv_iommu_queue_wait(queue, prod, timeout_us));
+
+	return 0;
+
+err_busy:
+	local_irq_restore(flags);
+	return -EBUSY;
+}
+
+/*
+ * IOMMU Command queue chapter 3.1
+ */
+
+/* Command queue interrupt handler thread function */
+static irqreturn_t riscv_iommu_cmdq_process(int irq, void *data)
+{
+	const struct riscv_iommu_queue *queue = (struct riscv_iommu_queue *)data;
+	unsigned int ctrl;
+
+	/* Clear MF/CQ errors, complete error recovery to be implemented. */
+	ctrl = riscv_iommu_readl(queue->iommu, queue->qcr);
+	if (ctrl & (RISCV_IOMMU_CQCSR_CQMF | RISCV_IOMMU_CQCSR_CMD_TO |
+		    RISCV_IOMMU_CQCSR_CMD_ILL | RISCV_IOMMU_CQCSR_FENCE_W_IP)) {
+		riscv_iommu_writel(queue->iommu, queue->qcr, ctrl);
+		dev_warn(queue->iommu->dev,
+			 "Queue #%u error; fault:%d timeout:%d illegal:%d fence_w_ip:%d\n",
+			 queue->qid,
+			 !!(ctrl & RISCV_IOMMU_CQCSR_CQMF),
+			 !!(ctrl & RISCV_IOMMU_CQCSR_CMD_TO),
+			 !!(ctrl & RISCV_IOMMU_CQCSR_CMD_ILL),
+			 !!(ctrl & RISCV_IOMMU_CQCSR_FENCE_W_IP));
+	}
+
+	/* Placeholder for command queue interrupt notifiers */
+
+	/* Clear command interrupt pending. */
+	riscv_iommu_writel(queue->iommu, RISCV_IOMMU_REG_IPSR, Q_IPSR(queue));
+
+	return IRQ_HANDLED;
+}
+
+/* Send command to the IOMMU command queue */
+static int riscv_iommu_cmd_send(struct riscv_iommu_device *iommu,
+				struct riscv_iommu_command *cmd,
+				unsigned int timeout_us)
+{
+	return riscv_iommu_queue_send(&iommu->cmdq, cmd, sizeof(*cmd), timeout_us);
+}
+
+/*
+ * IOMMU Fault/Event queue chapter 3.2
+ */
+
+static void riscv_iommu_fault(struct riscv_iommu_device *iommu,
+			      struct riscv_iommu_fq_record *event)
+{
+	unsigned int err = FIELD_GET(RISCV_IOMMU_FQ_HDR_CAUSE, event->hdr);
+	unsigned int devid = FIELD_GET(RISCV_IOMMU_FQ_HDR_DID, event->hdr);
+
+	/* Placeholder for future fault handling implementation, report only. */
+	if (err)
+		dev_warn_ratelimited(iommu->dev,
+				     "Fault %d devid: 0x%x iotval: %llx iotval2: %llx\n",
+				     err, devid, event->iotval, event->iotval2);
+}
+
+/* Fault queue interrupt handler thread function */
+static irqreturn_t riscv_iommu_fltq_process(int irq, void *data)
+{
+	struct riscv_iommu_queue *queue = (struct riscv_iommu_queue *)data;
+	struct riscv_iommu_device *iommu = queue->iommu;
+	struct riscv_iommu_fq_record *events;
+	unsigned int ctrl, idx;
+	int cnt, len;
+
+	events = (struct riscv_iommu_fq_record *)queue->base;
+
+	/* Clear fault interrupt pending and process all received fault events. */
+	riscv_iommu_writel(iommu, RISCV_IOMMU_REG_IPSR, Q_IPSR(queue));
+
+	do {
+		cnt = riscv_iommu_queue_consume(queue, &idx);
+		for (len = 0; len < cnt; idx++, len++)
+			riscv_iommu_fault(iommu, &events[Q_ITEM(queue, idx)]);
+		riscv_iommu_queue_release(queue, cnt);
+	} while (cnt > 0);
+
+	/* Clear MF/OF errors, complete error recovery to be implemented. */
+	ctrl = riscv_iommu_readl(iommu, queue->qcr);
+	if (ctrl & (RISCV_IOMMU_FQCSR_FQMF | RISCV_IOMMU_FQCSR_FQOF)) {
+		riscv_iommu_writel(iommu, queue->qcr, ctrl);
+		dev_warn(iommu->dev,
+			 "Queue #%u error; memory fault:%d overflow:%d\n",
+			 queue->qid,
+			 !!(ctrl & RISCV_IOMMU_FQCSR_FQMF),
+			 !!(ctrl & RISCV_IOMMU_FQCSR_FQOF));
+	}
+
+	return IRQ_HANDLED;
+}
+
 /* Lookup and initialize device context info structure. */
 static struct riscv_iommu_dc *riscv_iommu_get_dc(struct riscv_iommu_device *iommu,
 						 unsigned int devid, bool fetch)
@@ -197,6 +615,7 @@ static int riscv_iommu_set_ddtp_mode(struct riscv_iommu_device *iommu,
 				     unsigned int ddtp_mode)
 {
 	struct device *dev = iommu->dev;
+	struct riscv_iommu_command cmd;
 	u64 ddtp, rq_ddtp;
 	unsigned int mode, rq_mode = ddtp_mode;
 	int rc;
@@ -275,7 +694,17 @@ static int riscv_iommu_set_ddtp_mode(struct riscv_iommu_device *iommu,
 		dev_warn(dev, "DDTP failover to %u mode, requested %u\n",
 			 mode, ddtp_mode);
 
-	return 0;
+	/* Invalidate device context cache */
+	riscv_iommu_cmd_iodir_inval_ddt(&cmd);
+	riscv_iommu_cmd_send(iommu, &cmd, 0);
+
+	/* Invalidate address translation cache */
+	riscv_iommu_cmd_inval_vma(&cmd);
+	riscv_iommu_cmd_send(iommu, &cmd, 0);
+
+	/* IOFENCE.C */
+	riscv_iommu_cmd_iofence(&cmd);
+	return riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_IOTINVAL_TIMEOUT);
 }
 
 static int riscv_iommu_ddt_alloc(struct riscv_iommu_device *iommu)
@@ -460,6 +889,23 @@ static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
 			return -EINVAL;
 	}
 
+	/* Distribute interrupt vectors, always use first vector for CIV */
+	iommu->ivec = 0;
+	if (iommu->irqs_count) {
+		iommu->ivec |= FIELD_PREP(RISCV_IOMMU_IVEC_FIV, 1 % iommu->irqs_count);
+		iommu->ivec |= FIELD_PREP(RISCV_IOMMU_IVEC_PIV, 2 % iommu->irqs_count);
+		iommu->ivec |= FIELD_PREP(RISCV_IOMMU_IVEC_PMIV, 3 % iommu->irqs_count);
+	}
+	riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_IVEC, iommu->ivec);
+
+	/* Read back and verify */
+	iommu->ivec = riscv_iommu_readq(iommu, RISCV_IOMMU_REG_IVEC);
+	if (riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IVEC_CIV) >= RISCV_IOMMU_INTR_COUNT ||
+	    riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IVEC_FIV) >= RISCV_IOMMU_INTR_COUNT ||
+	    riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IVEC_PIV) >= RISCV_IOMMU_INTR_COUNT ||
+	    riscv_iommu_queue_vec(iommu, RISCV_IOMMU_IVEC_PMIV) >= RISCV_IOMMU_INTR_COUNT)
+		return -EINVAL;
+
 	dma_set_mask_and_coherent(iommu->dev,
 				  DMA_BIT_MASK(FIELD_GET(RISCV_IOMMU_CAP_PAS, iommu->caps)));
 
@@ -471,12 +917,17 @@ void riscv_iommu_remove(struct riscv_iommu_device *iommu)
 	iommu_device_unregister(&iommu->iommu);
 	iommu_device_sysfs_remove(&iommu->iommu);
 	riscv_iommu_set_ddtp_mode(iommu, RISCV_IOMMU_DDTP_MODE_OFF);
+	riscv_iommu_queue_disable(&iommu->cmdq);
+	riscv_iommu_queue_disable(&iommu->fltq);
 }
 
 int riscv_iommu_init(struct riscv_iommu_device *iommu)
 {
 	int rc;
 
+	RISCV_IOMMU_QUEUE_INIT(&iommu->cmdq, CQ);
+	RISCV_IOMMU_QUEUE_INIT(&iommu->fltq, FQ);
+
 	rc = riscv_iommu_init_check(iommu);
 	if (rc)
 		return dev_err_probe(iommu->dev, rc, "unexpected device state\n");
@@ -485,6 +936,22 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
 	if (WARN(rc, "cannot allocate device directory\n"))
 		goto err_init;
 
+	rc = riscv_iommu_queue_alloc(iommu, &iommu->cmdq, sizeof(struct riscv_iommu_command));
+	if (WARN(rc, "cannot allocate command queue\n"))
+		goto err_init;
+
+	rc = riscv_iommu_queue_alloc(iommu, &iommu->fltq, sizeof(struct riscv_iommu_fq_record));
+	if (WARN(rc, "cannot allocate fault queue\n"))
+		goto err_init;
+
+	rc = riscv_iommu_queue_enable(iommu, &iommu->cmdq, riscv_iommu_cmdq_process);
+	if (WARN(rc, "cannot enable command queue\n"))
+		goto err_init;
+
+	rc = riscv_iommu_queue_enable(iommu, &iommu->fltq, riscv_iommu_fltq_process);
+	if (WARN(rc, "cannot enable fault queue\n"))
+		goto err_init;
+
 	rc = riscv_iommu_set_ddtp_mode(iommu, RISCV_IOMMU_DDTP_MODE_MAX);
 	if (WARN(rc, "cannot enable iommu device\n"))
 		goto err_init;
@@ -505,5 +972,7 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
 err_sysfs:
 	riscv_iommu_set_ddtp_mode(iommu, RISCV_IOMMU_DDTP_MODE_OFF);
 err_init:
+	riscv_iommu_queue_disable(&iommu->fltq);
+	riscv_iommu_queue_disable(&iommu->cmdq);
 	return rc;
 }
diff --git a/drivers/iommu/riscv/iommu.h b/drivers/iommu/riscv/iommu.h
index f1696926582c..03e0c45bc7e1 100644
--- a/drivers/iommu/riscv/iommu.h
+++ b/drivers/iommu/riscv/iommu.h
@@ -17,6 +17,22 @@
 
 #include "iommu-bits.h"
 
+struct riscv_iommu_device;
+
+struct riscv_iommu_queue {
+	atomic_t prod;				/* unbounded producer allocation index */
+	atomic_t head;				/* unbounded shadow ring buffer consumer index */
+	atomic_t tail;				/* unbounded shadow ring buffer producer index */
+	unsigned int mask;			/* index mask, queue length - 1 */
+	unsigned int irq;			/* allocated interrupt number */
+	struct riscv_iommu_device *iommu;	/* iommu device handling the queue when active */
+	void *base;				/* ring buffer kernel pointer */
+	dma_addr_t phys;			/* ring buffer physical address */
+	u16 qbr;				/* base register offset, head and tail reference */
+	u16 qcr;				/* control and status register offset */
+	u8 qid;					/* queue identifier, same as RISCV_IOMMU_INTR_XX */
+};
+
 struct riscv_iommu_device {
 	/* iommu core interface */
 	struct iommu_device iommu;
@@ -34,6 +50,11 @@ struct riscv_iommu_device {
 	/* available interrupt numbers, MSI or WSI */
 	unsigned int irqs[RISCV_IOMMU_INTR_COUNT];
 	unsigned int irqs_count;
+	unsigned int ivec;
+
+	/* hardware queues */
+	struct riscv_iommu_queue cmdq;
+	struct riscv_iommu_queue fltq;
 
 	/* device directory */
 	unsigned int ddt_mode;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 7/7] iommu/riscv: Paging domain support
  2024-04-18 16:32 [PATCH v2 0/7] Linux RISC-V IOMMU Support Tomasz Jeznach
                   ` (5 preceding siblings ...)
  2024-04-18 16:32 ` [PATCH v2 6/7] iommu/riscv: Command and fault queue support Tomasz Jeznach
@ 2024-04-18 16:32 ` Tomasz Jeznach
  2024-04-19 12:56   ` Jason Gunthorpe
                     ` (2 more replies)
  6 siblings, 3 replies; 30+ messages in thread
From: Tomasz Jeznach @ 2024-04-18 16:32 UTC (permalink / raw)
  To: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley
  Cc: Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux,
	Tomasz Jeznach

Introduce first-stage address translation support.

Page table configured by the IOMMU driver will use the same format
as the CPU’s MMU, and will fallback to identity translation if the
page table format configured for the MMU is not supported by the
IOMMU hardware.

This change introduces IOTINVAL.VMA command, required to invalidate
any cached IOATC entries after mapping is updated and/or removed from
the paging domain. Invalidations for the non-leaf page entries will
be added to the driver code in separate patch series, following spec
update to clarify non-leaf cache invalidation command. With this patch,
allowing only 4K mappings and keeping non-leaf page entries in memory
this should be a reasonable simplification.

Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
---
 drivers/iommu/riscv/Kconfig |   1 +
 drivers/iommu/riscv/iommu.c | 467 +++++++++++++++++++++++++++++++++++-
 2 files changed, 466 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/riscv/Kconfig b/drivers/iommu/riscv/Kconfig
index 711326992585..6f9fb396034a 100644
--- a/drivers/iommu/riscv/Kconfig
+++ b/drivers/iommu/riscv/Kconfig
@@ -7,6 +7,7 @@ config RISCV_IOMMU
 	select DMA_OPS
 	select IOMMU_API
 	select IOMMU_IOVA
+	select IOMMU_DMA
 	help
 	  Support for implementations of the RISC-V IOMMU architecture that
 	  complements the RISC-V MMU capabilities, providing similar address
diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
index a4f74588cdc2..32ddc372432d 100644
--- a/drivers/iommu/riscv/iommu.c
+++ b/drivers/iommu/riscv/iommu.c
@@ -46,6 +46,10 @@ MODULE_LICENSE("GPL");
 #define dev_to_iommu(dev) \
 	container_of((dev)->iommu->iommu_dev, struct riscv_iommu_device, iommu)
 
+/* IOMMU PSCID allocation namespace. */
+static DEFINE_IDA(riscv_iommu_pscids);
+#define RISCV_IOMMU_MAX_PSCID		BIT(20)
+
 /* Device resource-managed allocations */
 struct riscv_iommu_devres {
 	unsigned long addr;
@@ -752,12 +756,77 @@ static int riscv_iommu_ddt_alloc(struct riscv_iommu_device *iommu)
 	return 0;
 }
 
+struct riscv_iommu_bond {
+	struct list_head list;
+	struct rcu_head rcu;
+	struct device *dev;
+};
+
+/* This struct contains protection domain specific IOMMU driver data. */
+struct riscv_iommu_domain {
+	struct iommu_domain domain;
+	struct list_head bonds;
+	int pscid;
+	int numa_node;
+	int amo_enabled:1;
+	unsigned int pgd_mode;
+	/* paging domain */
+	unsigned long pgd_root;
+};
+
+#define iommu_domain_to_riscv(iommu_domain) \
+	container_of(iommu_domain, struct riscv_iommu_domain, domain)
+
+/*
+ * Send IOTLB.INVAL for whole address space for ranges larger than 2MB.
+ * This limit will be replaced with range invalidations, if supported by
+ * the hardware, when RISC-V IOMMU architecture specification update for
+ * range invalidations update will be available.
+ */
+#define RISCV_IOMMU_IOTLB_INVAL_LIMIT	(2 << 20)
+
+static void riscv_iommu_iotlb_inval(struct riscv_iommu_domain *domain,
+				    unsigned long start, unsigned long end)
+{
+	struct riscv_iommu_bond *bond;
+	struct riscv_iommu_device *iommu;
+	struct riscv_iommu_command cmd;
+	unsigned long len = end - start + 1;
+	unsigned long iova;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(bond, &domain->bonds, list) {
+		iommu = dev_to_iommu(bond->dev);
+		riscv_iommu_cmd_inval_vma(&cmd);
+		riscv_iommu_cmd_inval_set_pscid(&cmd, domain->pscid);
+		if (len > 0 && len < RISCV_IOMMU_IOTLB_INVAL_LIMIT) {
+			for (iova = start; iova < end; iova += PAGE_SIZE) {
+				riscv_iommu_cmd_inval_set_addr(&cmd, iova);
+				riscv_iommu_cmd_send(iommu, &cmd, 0);
+			}
+		} else {
+			riscv_iommu_cmd_send(iommu, &cmd, 0);
+		}
+	}
+
+	list_for_each_entry_rcu(bond, &domain->bonds, list) {
+		iommu = dev_to_iommu(bond->dev);
+
+		riscv_iommu_cmd_iofence(&cmd);
+		riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_QUEUE_TIMEOUT);
+	}
+	rcu_read_unlock();
+}
+
 static int riscv_iommu_attach_domain(struct riscv_iommu_device *iommu,
 				     struct device *dev,
 				     struct iommu_domain *iommu_domain)
 {
 	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+	struct riscv_iommu_domain *domain;
 	struct riscv_iommu_dc *dc;
+	struct riscv_iommu_bond *bond = NULL, *b;
+	struct riscv_iommu_command cmd;
 	u64 fsc, ta, tc;
 	int i;
 
@@ -769,6 +838,20 @@ static int riscv_iommu_attach_domain(struct riscv_iommu_device *iommu,
 		ta = 0;
 		tc = RISCV_IOMMU_DC_TC_V;
 		fsc = FIELD_PREP(RISCV_IOMMU_DC_FSC_MODE, RISCV_IOMMU_DC_FSC_MODE_BARE);
+	} else if (iommu_domain->type & __IOMMU_DOMAIN_PAGING) {
+		domain = iommu_domain_to_riscv(iommu_domain);
+
+		ta = FIELD_PREP(RISCV_IOMMU_PC_TA_PSCID, domain->pscid);
+		tc = RISCV_IOMMU_DC_TC_V;
+		if (domain->amo_enabled)
+			tc |= RISCV_IOMMU_DC_TC_SADE;
+		fsc = FIELD_PREP(RISCV_IOMMU_PC_FSC_MODE, domain->pgd_mode) |
+		      FIELD_PREP(RISCV_IOMMU_PC_FSC_PPN, virt_to_pfn(domain->pgd_root));
+
+		bond = kzalloc(sizeof(*bond), GFP_KERNEL);
+		if (!bond)
+			return -ENOMEM;
+		bond->dev = dev;
 	} else {
 		/* This should never happen. */
 		return -ENODEV;
@@ -787,12 +870,390 @@ static int riscv_iommu_attach_domain(struct riscv_iommu_device *iommu,
 		xchg64(&dc->ta, ta);
 		xchg64(&dc->tc, tc);
 
-		/* Device context invalidation will be required. Ignoring for now. */
+		if (!(tc & RISCV_IOMMU_DC_TC_V))
+			continue;
+
+		/* Invalidate device context cache */
+		riscv_iommu_cmd_iodir_inval_ddt(&cmd);
+		riscv_iommu_cmd_iodir_set_did(&cmd, fwspec->ids[i]);
+		riscv_iommu_cmd_send(iommu, &cmd, 0);
+
+		if (FIELD_GET(RISCV_IOMMU_PC_FSC_MODE, fsc) == RISCV_IOMMU_DC_FSC_MODE_BARE)
+			continue;
+
+		/* Invalidate last valid PSCID */
+		riscv_iommu_cmd_inval_vma(&cmd);
+		riscv_iommu_cmd_inval_set_pscid(&cmd, FIELD_GET(RISCV_IOMMU_DC_TA_PSCID, ta));
+		riscv_iommu_cmd_send(iommu, &cmd, 0);
+	}
+
+	/* Synchronize directory update */
+	riscv_iommu_cmd_iofence(&cmd);
+	riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_IOTINVAL_TIMEOUT);
+
+	/* Track domain to devices mapping. */
+	if (bond)
+		list_add_rcu(&bond->list, &domain->bonds);
+
+	/* Remove tracking from previous domain, if needed. */
+	iommu_domain = iommu_get_domain_for_dev(dev);
+	if (iommu_domain && !!(iommu_domain->type & __IOMMU_DOMAIN_PAGING)) {
+		domain = iommu_domain_to_riscv(iommu_domain);
+		bond = NULL;
+		rcu_read_lock();
+		list_for_each_entry_rcu(b, &domain->bonds, list) {
+			if (b->dev == dev) {
+				bond = b;
+				break;
+			}
+		}
+		rcu_read_unlock();
+
+		if (bond) {
+			list_del_rcu(&bond->list);
+			kfree_rcu(bond, rcu);
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * IOVA page translation tree management.
+ */
+
+#define IOMMU_PAGE_SIZE_4K     BIT_ULL(12)
+#define IOMMU_PAGE_SIZE_2M     BIT_ULL(21)
+#define IOMMU_PAGE_SIZE_1G     BIT_ULL(30)
+#define IOMMU_PAGE_SIZE_512G   BIT_ULL(39)
+
+#define PT_SHIFT (PAGE_SHIFT - ilog2(sizeof(pte_t)))
+
+static void riscv_iommu_flush_iotlb_all(struct iommu_domain *iommu_domain)
+{
+	struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
+
+	riscv_iommu_iotlb_inval(domain, 0, ULONG_MAX);
+}
+
+static void riscv_iommu_iotlb_sync(struct iommu_domain *iommu_domain,
+				   struct iommu_iotlb_gather *gather)
+{
+	struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
+
+	riscv_iommu_iotlb_inval(domain, gather->start, gather->end);
+}
+
+static inline size_t get_page_size(size_t size)
+{
+	if (size >= IOMMU_PAGE_SIZE_512G)
+		return IOMMU_PAGE_SIZE_512G;
+	if (size >= IOMMU_PAGE_SIZE_1G)
+		return IOMMU_PAGE_SIZE_1G;
+	if (size >= IOMMU_PAGE_SIZE_2M)
+		return IOMMU_PAGE_SIZE_2M;
+	return IOMMU_PAGE_SIZE_4K;
+}
+
+#define _io_pte_present(pte)	((pte) & (_PAGE_PRESENT | _PAGE_PROT_NONE))
+#define _io_pte_leaf(pte)	((pte) & _PAGE_LEAF)
+#define _io_pte_none(pte)	((pte) == 0)
+#define _io_pte_entry(pn, prot)	((_PAGE_PFN_MASK & ((pn) << _PAGE_PFN_SHIFT)) | (prot))
+
+static void riscv_iommu_pte_free(struct riscv_iommu_domain *domain,
+				 unsigned long pte, struct list_head *freelist)
+{
+	unsigned long *ptr;
+	int i;
+
+	if (!_io_pte_present(pte) || _io_pte_leaf(pte))
+		return;
+
+	ptr = (unsigned long *)pfn_to_virt(__page_val_to_pfn(pte));
+
+	/* Recursively free all sub page table pages */
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		pte = READ_ONCE(ptr[i]);
+		if (!_io_pte_none(pte) && cmpxchg_relaxed(ptr + i, pte, 0) == pte)
+			riscv_iommu_pte_free(domain, pte, freelist);
+	}
+
+	if (freelist)
+		list_add_tail(&virt_to_page(ptr)->lru, freelist);
+	else
+		free_page((unsigned long)ptr);
+}
+
+static unsigned long *riscv_iommu_pte_alloc(struct riscv_iommu_domain *domain,
+					    unsigned long iova, size_t pgsize, gfp_t gfp)
+{
+	unsigned long *ptr = (unsigned long *)domain->pgd_root;
+	unsigned long pte, old;
+	int level = domain->pgd_mode - RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39 + 2;
+	struct page *page;
+
+	do {
+		const int shift = PAGE_SHIFT + PT_SHIFT * level;
+
+		ptr += ((iova >> shift) & (PTRS_PER_PTE - 1));
+		/*
+		 * Note: returned entry might be a non-leaf if there was existing mapping
+		 * with smaller granularity. Up to the caller to replace and invalidate.
+		 */
+		if (((size_t)1 << shift) == pgsize)
+			return ptr;
+pte_retry:
+		pte = READ_ONCE(*ptr);
+		/*
+		 * This is very likely incorrect as we should not be adding new mapping
+		 * with smaller granularity on top of existing 2M/1G mapping. Fail.
+		 */
+		if (_io_pte_present(pte) && _io_pte_leaf(pte))
+			return NULL;
+		/*
+		 * Non-leaf entry is missing, allocate and try to add to the page table.
+		 * This might race with other mappings, retry on error.
+		 */
+		if (_io_pte_none(pte)) {
+			page = alloc_pages_node(domain->numa_node, __GFP_ZERO | gfp, 0);
+			if (!page)
+				return NULL;
+			old = pte;
+			pte = _io_pte_entry(page_to_pfn(page), _PAGE_TABLE);
+			if (cmpxchg_relaxed(ptr, old, pte) != old) {
+				__free_pages(page, 0);
+				goto pte_retry;
+			}
+		}
+		ptr = (unsigned long *)pfn_to_virt(__page_val_to_pfn(pte));
+	} while (level-- > 0);
+
+	return NULL;
+}
+
+static unsigned long *riscv_iommu_pte_fetch(struct riscv_iommu_domain *domain,
+					    unsigned long iova, size_t *pte_pgsize)
+{
+	unsigned long *ptr = (unsigned long *)domain->pgd_root;
+	unsigned long pte;
+	int level = domain->pgd_mode - RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39 + 2;
+
+	do {
+		const int shift = PAGE_SHIFT + PT_SHIFT * level;
+
+		ptr += ((iova >> shift) & (PTRS_PER_PTE - 1));
+		pte = READ_ONCE(*ptr);
+		if (_io_pte_present(pte) && _io_pte_leaf(pte)) {
+			*pte_pgsize = (size_t)1 << shift;
+			return ptr;
+		}
+		if (_io_pte_none(pte))
+			return NULL;
+		ptr = (unsigned long *)pfn_to_virt(__page_val_to_pfn(pte));
+	} while (level-- > 0);
+
+	return NULL;
+}
+
+static int riscv_iommu_map_pages(struct iommu_domain *iommu_domain,
+				 unsigned long iova, phys_addr_t phys,
+				 size_t pgsize, size_t pgcount, int prot,
+				 gfp_t gfp, size_t *mapped)
+{
+	struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
+	size_t size = 0;
+	size_t page_size = get_page_size(pgsize);
+	unsigned long *ptr;
+	unsigned long pte, old, pte_prot;
+
+	if (!(prot & IOMMU_WRITE))
+		pte_prot = _PAGE_BASE | _PAGE_READ;
+	else if (domain->amo_enabled)
+		pte_prot = _PAGE_BASE | _PAGE_READ | _PAGE_WRITE;
+	else
+		pte_prot = _PAGE_BASE | _PAGE_READ | _PAGE_WRITE | _PAGE_DIRTY;
+
+	while (pgcount) {
+		ptr = riscv_iommu_pte_alloc(domain, iova, page_size, gfp);
+		if (!ptr) {
+			*mapped = size;
+			return -ENOMEM;
+		}
+
+		old = READ_ONCE(*ptr);
+		pte = _io_pte_entry(phys_to_pfn(phys), pte_prot);
+		if (cmpxchg_relaxed(ptr, old, pte) != old)
+			continue;
+
+		/* TODO: non-leaf page invalidation is pending spec update */
+		riscv_iommu_pte_free(domain, old, NULL);
+
+		size += page_size;
+		iova += page_size;
+		phys += page_size;
+		--pgcount;
 	}
 
+	*mapped = size;
+
 	return 0;
 }
 
+static size_t riscv_iommu_unmap_pages(struct iommu_domain *iommu_domain,
+				      unsigned long iova, size_t pgsize, size_t pgcount,
+				      struct iommu_iotlb_gather *gather)
+{
+	struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
+	size_t size = pgcount << __ffs(pgsize);
+	unsigned long *ptr, old;
+	size_t unmapped = 0;
+	size_t pte_size;
+
+	while (unmapped < size) {
+		ptr = riscv_iommu_pte_fetch(domain, iova, &pte_size);
+		if (!ptr)
+			return unmapped;
+
+		/* partial unmap is not allowed, fail. */
+		if (iova & ~(pte_size - 1))
+			return unmapped;
+
+		old = READ_ONCE(*ptr);
+		if (cmpxchg_relaxed(ptr, old, 0) != old)
+			continue;
+
+		iommu_iotlb_gather_add_page(&domain->domain, gather, iova,
+					    pte_size);
+
+		iova += pte_size;
+		unmapped += pte_size;
+	}
+
+	return unmapped;
+}
+
+static phys_addr_t riscv_iommu_iova_to_phys(struct iommu_domain *iommu_domain, dma_addr_t iova)
+{
+	struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
+	unsigned long pte_size;
+	unsigned long *ptr;
+
+	ptr = riscv_iommu_pte_fetch(domain, iova, &pte_size);
+	if (_io_pte_none(*ptr) || !_io_pte_present(*ptr))
+		return 0;
+
+	return pfn_to_phys(__page_val_to_pfn(*ptr)) | (iova & (pte_size - 1));
+}
+
+static void riscv_iommu_free_paging_domain(struct iommu_domain *iommu_domain)
+{
+	struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
+
+	WARN_ON(!list_empty(&domain->bonds));
+
+	if (domain->pgd_root) {
+		const unsigned long pfn = virt_to_pfn(domain->pgd_root);
+
+		riscv_iommu_pte_free(domain, _io_pte_entry(pfn, _PAGE_TABLE), NULL);
+	}
+
+	if ((int)domain->pscid > 0)
+		ida_free(&riscv_iommu_pscids, domain->pscid);
+
+	kfree(domain);
+}
+
+static bool riscv_iommu_pt_supported(struct riscv_iommu_device *iommu, int pgd_mode)
+{
+	switch (pgd_mode) {
+	case RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39:
+		return iommu->caps & RISCV_IOMMU_CAP_S_SV39;
+
+	case RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV48:
+		return iommu->caps & RISCV_IOMMU_CAP_S_SV48;
+
+	case RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57:
+		return iommu->caps & RISCV_IOMMU_CAP_S_SV57;
+	}
+	return false;
+}
+
+static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
+					    struct device *dev)
+{
+	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
+	struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
+	struct page *page;
+
+	if (!riscv_iommu_pt_supported(iommu, domain->pgd_mode))
+		return -ENODEV;
+
+	domain->numa_node = dev_to_node(iommu->dev);
+	domain->amo_enabled = !!(iommu->caps & RISCV_IOMMU_CAP_AMO_HWAD);
+
+	if (!domain->pgd_root) {
+		page = alloc_pages_node(domain->numa_node,
+					GFP_KERNEL_ACCOUNT | __GFP_ZERO, 0);
+		if (!page)
+			return -ENOMEM;
+		domain->pgd_root = (unsigned long)page_to_virt(page);
+	}
+
+	return riscv_iommu_attach_domain(iommu, dev, iommu_domain);
+}
+
+static const struct iommu_domain_ops riscv_iommu_paging_domain_ops = {
+	.attach_dev = riscv_iommu_attach_paging_domain,
+	.free = riscv_iommu_free_paging_domain,
+	.map_pages = riscv_iommu_map_pages,
+	.unmap_pages = riscv_iommu_unmap_pages,
+	.iova_to_phys = riscv_iommu_iova_to_phys,
+	.iotlb_sync = riscv_iommu_iotlb_sync,
+	.flush_iotlb_all = riscv_iommu_flush_iotlb_all,
+};
+
+static struct iommu_domain *riscv_iommu_alloc_paging_domain(struct device *dev)
+{
+	struct riscv_iommu_domain *domain;
+
+	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+	if (!domain)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD_RCU(&domain->bonds);
+
+	domain->pscid = ida_alloc_range(&riscv_iommu_pscids, 1,
+					RISCV_IOMMU_MAX_PSCID - 1, GFP_KERNEL);
+	if (domain->pscid < 0) {
+		kfree(domain);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/*
+	 * Note: RISC-V Privilege spec mandates that virtual addresses
+	 * need to be sign-extended, so if (VA_BITS - 1) is set, all
+	 * bits >= VA_BITS need to also be set or else we'll get a
+	 * page fault. However the code that creates the mappings
+	 * above us (e.g. iommu_dma_alloc_iova()) won't do that for us
+	 * for now, so we'll end up with invalid virtual addresses
+	 * to map. As a workaround until we get this sorted out
+	 * limit the available virtual addresses to VA_BITS - 1.
+	 */
+	domain->domain.geometry.aperture_start = 0;
+	domain->domain.geometry.aperture_end = DMA_BIT_MASK(VA_BITS - 1);
+	domain->domain.geometry.force_aperture = true;
+
+	/*
+	 * Follow system address translation mode.
+	 * RISC-V IOMMU ATP mode values match RISC-V CPU SATP mode values.
+	 */
+	domain->pgd_mode = satp_mode >> SATP_MODE_SHIFT;
+	domain->numa_node = NUMA_NO_NODE;
+	domain->domain.ops = &riscv_iommu_paging_domain_ops;
+
+	return &domain->domain;
+}
+
 static int riscv_iommu_attach_identity_domain(struct iommu_domain *iommu_domain,
 					      struct device *dev)
 {
@@ -814,7 +1275,7 @@ static struct iommu_domain riscv_iommu_identity_domain = {
 
 static int riscv_iommu_device_domain_type(struct device *dev)
 {
-	return IOMMU_DOMAIN_IDENTITY;
+	return 0;
 }
 
 static struct iommu_group *riscv_iommu_device_group(struct device *dev)
@@ -858,8 +1319,10 @@ static void riscv_iommu_release_device(struct device *dev)
 
 static const struct iommu_ops riscv_iommu_ops = {
 	.owner = THIS_MODULE,
+	.pgsize_bitmap = SZ_4K,
 	.of_xlate = riscv_iommu_of_xlate,
 	.identity_domain = &riscv_iommu_identity_domain,
+	.domain_alloc_paging = riscv_iommu_alloc_paging_domain,
 	.def_domain_type = riscv_iommu_device_domain_type,
 	.device_group = riscv_iommu_device_group,
 	.probe_device = riscv_iommu_probe_device,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/7] dt-bindings: iommu: riscv: Add bindings for RISC-V IOMMU
  2024-04-18 16:32 ` [PATCH v2 1/7] dt-bindings: iommu: riscv: Add bindings for RISC-V IOMMU Tomasz Jeznach
@ 2024-04-18 17:04   ` Conor Dooley
  2024-04-24 22:37     ` Tomasz Jeznach
  2024-04-22 14:04   ` Rob Herring
  1 sibling, 1 reply; 30+ messages in thread
From: Conor Dooley @ 2024-04-18 17:04 UTC (permalink / raw)
  To: Tomasz Jeznach
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux

[-- Attachment #1: Type: text/plain, Size: 3123 bytes --]

On Thu, Apr 18, 2024 at 09:32:19AM -0700, Tomasz Jeznach wrote:
> Add bindings for the RISC-V IOMMU device drivers.
> 
> Co-developed-by: Anup Patel <apatel@ventanamicro.com>
> Signed-off-by: Anup Patel <apatel@ventanamicro.com>
> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
> ---
>  .../bindings/iommu/riscv,iommu.yaml           | 149 ++++++++++++++++++
>  MAINTAINERS                                   |   7 +
>  2 files changed, 156 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
> 
> diff --git a/Documentation/devicetree/bindings/iommu/riscv,iommu.yaml b/Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
> new file mode 100644
> index 000000000000..d6522ddd43fa
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
> @@ -0,0 +1,149 @@
> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/iommu/riscv,iommu.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: RISC-V IOMMU Architecture Implementation
> +
> +maintainers:
> +  - Tomasz Jeznach <tjeznach@rivosinc.com>
> +
> +description: |+

FYI, the + here is probably not needed.

> +  The RISC-V IOMMU provides memory address translation and isolation for
> +  input and output devices, supporting per-device translation context,
> +  shared process address spaces including the ATS and PRI components of
> +  the PCIe specification, two stage address translation and MSI remapping.
> +  It supports identical translation table format to the RISC-V address
> +  translation tables with page level access and protection attributes.
> +  Hardware uses in-memory command and fault reporting queues with wired
> +  interrupt or MSI notifications.
> +
> +  Visit https://github.com/riscv-non-isa/riscv-iommu for more details.
> +
> +  For information on assigning RISC-V IOMMU to its peripheral devices,
> +  see generic IOMMU bindings.
> +
> +properties:
> +  # For PCIe IOMMU hardware compatible property should contain the vendor
> +  # and device ID according to the PCI Bus Binding specification.
> +  # Since PCI provides built-in identification methods, compatible is not
> +  # actually required. For non-PCIe hardware implementations 'riscv,iommu'
> +  # should be specified along with 'reg' property providing MMIO location.

I dunno, I'd like to see soc-specific compatibles for implementations of
the RISC-V IOMMU. If you need a DT compatible for use in QEMU, I'd
suggest doing what was done for the aplic and having a dedicated
compatible for that and disallow having "riscv,iommu" in isolation.

> +  compatible:
> +    oneOf:
> +      - items:
> +          - const: riscv,pci-iommu
> +          - const: pci1efd,edf1
> +      - items:
> +          - const: pci1efd,edf1

Why are both versions allowed? If the former is more understandable,
can't we just go with that?

> +      - items:
> +          - const: riscv,iommu

Other than the compatible setup I think this is pretty decent though,
Conor.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/7] iommu/riscv: Add RISC-V IOMMU platform device driver
  2024-04-18 16:32 ` [PATCH v2 2/7] iommu/riscv: Add RISC-V IOMMU platform device driver Tomasz Jeznach
@ 2024-04-18 21:22   ` Robin Murphy
  2024-04-24 21:59     ` Tomasz Jeznach
  0 siblings, 1 reply; 30+ messages in thread
From: Robin Murphy @ 2024-04-18 21:22 UTC (permalink / raw)
  To: Tomasz Jeznach, Joerg Roedel, Will Deacon, Paul Walmsley
  Cc: Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux

On 2024-04-18 5:32 pm, Tomasz Jeznach wrote:
> Introduce platform device driver for implementation of RISC-V IOMMU
> architected hardware.
> 
> Hardware interface definition located in file iommu-bits.h is based on
> ratified RISC-V IOMMU Architecture Specification version 1.0.0.
> 
> This patch implements platform device initialization, early check and
> configuration of the IOMMU interfaces and enables global pass-through
> address translation mode (iommu_mode == BARE), without registering
> hardware instance in the IOMMU subsystem.
> 
> Link: https://github.com/riscv-non-isa/riscv-iommu
> Co-developed-by: Nick Kossifidis <mick@ics.forth.gr>
> Signed-off-by: Nick Kossifidis <mick@ics.forth.gr>
> Co-developed-by: Sebastien Boeuf <seb@rivosinc.com>
> Signed-off-by: Sebastien Boeuf <seb@rivosinc.com>
> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
> ---
>   MAINTAINERS                          |   6 +
>   drivers/iommu/Kconfig                |   1 +
>   drivers/iommu/Makefile               |   2 +-
>   drivers/iommu/riscv/Kconfig          |  16 +
>   drivers/iommu/riscv/Makefile         |   2 +
>   drivers/iommu/riscv/iommu-bits.h     | 707 +++++++++++++++++++++++++++
>   drivers/iommu/riscv/iommu-platform.c |  94 ++++
>   drivers/iommu/riscv/iommu.c          |  89 ++++
>   drivers/iommu/riscv/iommu.h          |  62 +++
>   9 files changed, 978 insertions(+), 1 deletion(-)
>   create mode 100644 drivers/iommu/riscv/Kconfig
>   create mode 100644 drivers/iommu/riscv/Makefile
>   create mode 100644 drivers/iommu/riscv/iommu-bits.h
>   create mode 100644 drivers/iommu/riscv/iommu-platform.c
>   create mode 100644 drivers/iommu/riscv/iommu.c
>   create mode 100644 drivers/iommu/riscv/iommu.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 2657f9eae84c..051599c76585 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -18972,6 +18972,12 @@ L:	iommu@lists.linux.dev
>   L:	linux-riscv@lists.infradead.org
>   S:	Maintained
>   F:	Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
> +F:	drivers/iommu/riscv/Kconfig
> +F:	drivers/iommu/riscv/Makefile
> +F:	drivers/iommu/riscv/iommu-bits.h
> +F:	drivers/iommu/riscv/iommu-platform.c
> +F:	drivers/iommu/riscv/iommu.c
> +F:	drivers/iommu/riscv/iommu.h

I'm pretty sure a single "F: drivers/iommu/riscv/" pattern will suffice.

>   RISC-V MICROCHIP FPGA SUPPORT
>   M:	Conor Dooley <conor.dooley@microchip.com>
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 0af39bbbe3a3..ae762db0365e 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -195,6 +195,7 @@ config MSM_IOMMU
>   source "drivers/iommu/amd/Kconfig"
>   source "drivers/iommu/intel/Kconfig"
>   source "drivers/iommu/iommufd/Kconfig"
> +source "drivers/iommu/riscv/Kconfig"
>   
>   config IRQ_REMAP
>   	bool "Support for Interrupt Remapping"
> diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> index 542760d963ec..5e5a83c6c2aa 100644
> --- a/drivers/iommu/Makefile
> +++ b/drivers/iommu/Makefile
> @@ -1,5 +1,5 @@
>   # SPDX-License-Identifier: GPL-2.0
> -obj-y += amd/ intel/ arm/ iommufd/
> +obj-y += amd/ intel/ arm/ iommufd/ riscv/
>   obj-$(CONFIG_IOMMU_API) += iommu.o
>   obj-$(CONFIG_IOMMU_API) += iommu-traces.o
>   obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
> diff --git a/drivers/iommu/riscv/Kconfig b/drivers/iommu/riscv/Kconfig
> new file mode 100644
> index 000000000000..d02326bddb4c
> --- /dev/null
> +++ b/drivers/iommu/riscv/Kconfig
> @@ -0,0 +1,16 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +# RISC-V IOMMU support
> +
> +config RISCV_IOMMU
> +	def_bool y if RISCV && 64BIT && MMU

Drop the dependencies here, they're already dependencies. However, also 
consider allowing users to configure this out without disabling 
IOMMU_SUPPORT entirely - I imagine other IOMMU implementations are going 
to end up paired wiothg RISC-V CPUs sooner or later. Furthermore, if 
it's a regular driver model driver, consider allowing it to build as a 
module. Not to mention that the help text below is rather pointless if 
there's no prompt offered in the first place.

> +	depends on RISCV && 64BIT && MMU
> +	select DMA_OPS

Drop this, you're not (and shouldn't be) architecture code implementing 
DMA ops.

> +	select IOMMU_API
> +	select IOMMU_IOVA

Drop this, you're not using the IOVA library either.

> +	help
> +	  Support for implementations of the RISC-V IOMMU architecture that
> +	  complements the RISC-V MMU capabilities, providing similar address
> +	  translation and protection functions for accesses from I/O devices.
> +
> +	  Say Y here if your SoC includes an IOMMU device implementing
> +	  the RISC-V IOMMU architecture.
> diff --git a/drivers/iommu/riscv/Makefile b/drivers/iommu/riscv/Makefile
> new file mode 100644
> index 000000000000..e4c189de58d3
> --- /dev/null
> +++ b/drivers/iommu/riscv/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_RISCV_IOMMU) += iommu.o iommu-platform.o
> diff --git a/drivers/iommu/riscv/iommu-bits.h b/drivers/iommu/riscv/iommu-bits.h
> new file mode 100644
> index 000000000000..ba093c29de9f
> --- /dev/null
> +++ b/drivers/iommu/riscv/iommu-bits.h
> @@ -0,0 +1,707 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright © 2022-2024 Rivos Inc.
> + * Copyright © 2023 FORTH-ICS/CARV
> + * Copyright © 2023 RISC-V IOMMU Task Group
> + *
> + * RISC-V IOMMU - Register Layout and Data Structures.
> + *
> + * Based on the 'RISC-V IOMMU Architecture Specification', Version 1.0
> + * Published at  https://github.com/riscv-non-isa/riscv-iommu
> + *
> + */
> +
> +#ifndef _RISCV_IOMMU_BITS_H_
> +#define _RISCV_IOMMU_BITS_H_
> +
> +#include <linux/types.h>
> +#include <linux/bitfield.h>
> +#include <linux/bits.h>
> +
> +/*
> + * Chapter 5: Memory Mapped register interface
> + */
> +
> +/* Common field positions */
> +#define RISCV_IOMMU_PPN_FIELD		GENMASK_ULL(53, 10)
> +#define RISCV_IOMMU_QUEUE_LOGSZ_FIELD	GENMASK_ULL(4, 0)
> +#define RISCV_IOMMU_QUEUE_INDEX_FIELD	GENMASK_ULL(31, 0)
> +#define RISCV_IOMMU_QUEUE_ENABLE	BIT(0)
> +#define RISCV_IOMMU_QUEUE_INTR_ENABLE	BIT(1)
> +#define RISCV_IOMMU_QUEUE_MEM_FAULT	BIT(8)
> +#define RISCV_IOMMU_QUEUE_OVERFLOW	BIT(9)
> +#define RISCV_IOMMU_QUEUE_ACTIVE	BIT(16)
> +#define RISCV_IOMMU_QUEUE_BUSY		BIT(17)
> +
> +#define RISCV_IOMMU_ATP_PPN_FIELD	GENMASK_ULL(43, 0)
> +#define RISCV_IOMMU_ATP_MODE_FIELD	GENMASK_ULL(63, 60)
> +
> +/* 5.3 IOMMU Capabilities (64bits) */
> +#define RISCV_IOMMU_REG_CAP		0x0000
> +#define RISCV_IOMMU_CAP_VERSION		GENMASK_ULL(7, 0)
> +#define RISCV_IOMMU_CAP_S_SV32		BIT_ULL(8)
> +#define RISCV_IOMMU_CAP_S_SV39		BIT_ULL(9)
> +#define RISCV_IOMMU_CAP_S_SV48		BIT_ULL(10)
> +#define RISCV_IOMMU_CAP_S_SV57		BIT_ULL(11)
> +#define RISCV_IOMMU_CAP_SVPBMT		BIT_ULL(15)
> +#define RISCV_IOMMU_CAP_G_SV32		BIT_ULL(16)
> +#define RISCV_IOMMU_CAP_G_SV39		BIT_ULL(17)
> +#define RISCV_IOMMU_CAP_G_SV48		BIT_ULL(18)
> +#define RISCV_IOMMU_CAP_G_SV57		BIT_ULL(19)
> +#define RISCV_IOMMU_CAP_AMO_MRIF	BIT_ULL(21)
> +#define RISCV_IOMMU_CAP_MSI_FLAT	BIT_ULL(22)
> +#define RISCV_IOMMU_CAP_MSI_MRIF	BIT_ULL(23)
> +#define RISCV_IOMMU_CAP_AMO_HWAD	BIT_ULL(24)
> +#define RISCV_IOMMU_CAP_ATS		BIT_ULL(25)
> +#define RISCV_IOMMU_CAP_T2GPA		BIT_ULL(26)
> +#define RISCV_IOMMU_CAP_END		BIT_ULL(27)
> +#define RISCV_IOMMU_CAP_IGS		GENMASK_ULL(29, 28)
> +#define RISCV_IOMMU_CAP_HPM		BIT_ULL(30)
> +#define RISCV_IOMMU_CAP_DBG		BIT_ULL(31)
> +#define RISCV_IOMMU_CAP_PAS		GENMASK_ULL(37, 32)
> +#define RISCV_IOMMU_CAP_PD8		BIT_ULL(38)
> +#define RISCV_IOMMU_CAP_PD17		BIT_ULL(39)
> +#define RISCV_IOMMU_CAP_PD20		BIT_ULL(40)
> +
> +#define RISCV_IOMMU_CAP_VERSION_VER_MASK	0xF0
> +#define RISCV_IOMMU_CAP_VERSION_REV_MASK	0x0F
> +
> +/**
> + * enum riscv_iommu_igs_settings - Interrupt Generation Support Settings
> + * @RISCV_IOMMU_CAP_IGS_MSI: I/O MMU supports only MSI generation
> + * @RISCV_IOMMU_CAP_IGS_WSI: I/O MMU supports only Wired-Signaled interrupt
> + * @RISCV_IOMMU_CAP_IGS_BOTH: I/O MMU supports both MSI and WSI generation
> + * @RISCV_IOMMU_CAP_IGS_RSRV: Reserved for standard use
> + */
> +enum riscv_iommu_igs_settings {
> +	RISCV_IOMMU_CAP_IGS_MSI = 0,
> +	RISCV_IOMMU_CAP_IGS_WSI = 1,
> +	RISCV_IOMMU_CAP_IGS_BOTH = 2,
> +	RISCV_IOMMU_CAP_IGS_RSRV = 3
> +};
> +
> +/* 5.4 Features control register (32bits) */
> +#define RISCV_IOMMU_REG_FCTL		0x0008
> +#define RISCV_IOMMU_FCTL_BE		BIT(0)
> +#define RISCV_IOMMU_FCTL_WSI		BIT(1)
> +#define RISCV_IOMMU_FCTL_GXL		BIT(2)
> +
> +/* 5.5 Device-directory-table pointer (64bits) */
> +#define RISCV_IOMMU_REG_DDTP		0x0010
> +#define RISCV_IOMMU_DDTP_MODE		GENMASK_ULL(3, 0)
> +#define RISCV_IOMMU_DDTP_BUSY		BIT_ULL(4)
> +#define RISCV_IOMMU_DDTP_PPN		RISCV_IOMMU_PPN_FIELD
> +
> +/**
> + * enum riscv_iommu_ddtp_modes - I/O MMU translation modes
> + * @RISCV_IOMMU_DDTP_MODE_OFF: No inbound transactions allowed
> + * @RISCV_IOMMU_DDTP_MODE_BARE: Pass-through mode
> + * @RISCV_IOMMU_DDTP_MODE_1LVL: One-level DDT
> + * @RISCV_IOMMU_DDTP_MODE_2LVL: Two-level DDT
> + * @RISCV_IOMMU_DDTP_MODE_3LVL: Three-level DDT
> + * @RISCV_IOMMU_DDTP_MODE_MAX: Max value allowed by specification
> + */
> +enum riscv_iommu_ddtp_modes {
> +	RISCV_IOMMU_DDTP_MODE_OFF = 0,
> +	RISCV_IOMMU_DDTP_MODE_BARE = 1,
> +	RISCV_IOMMU_DDTP_MODE_1LVL = 2,
> +	RISCV_IOMMU_DDTP_MODE_2LVL = 3,
> +	RISCV_IOMMU_DDTP_MODE_3LVL = 4,
> +	RISCV_IOMMU_DDTP_MODE_MAX = 4
> +};
> +
> +/* 5.6 Command Queue Base (64bits) */
> +#define RISCV_IOMMU_REG_CQB		0x0018
> +#define RISCV_IOMMU_CQB_ENTRIES		RISCV_IOMMU_QUEUE_LOGSZ_FIELD
> +#define RISCV_IOMMU_CQB_PPN		RISCV_IOMMU_PPN_FIELD
> +
> +/* 5.7 Command Queue head (32bits) */
> +#define RISCV_IOMMU_REG_CQH		0x0020
> +#define RISCV_IOMMU_CQH_INDEX		RISCV_IOMMU_QUEUE_INDEX_FIELD
> +
> +/* 5.8 Command Queue tail (32bits) */
> +#define RISCV_IOMMU_REG_CQT		0x0024
> +#define RISCV_IOMMU_CQT_INDEX		RISCV_IOMMU_QUEUE_INDEX_FIELD
> +
> +/* 5.9 Fault Queue Base (64bits) */
> +#define RISCV_IOMMU_REG_FQB		0x0028
> +#define RISCV_IOMMU_FQB_ENTRIES		RISCV_IOMMU_QUEUE_LOGSZ_FIELD
> +#define RISCV_IOMMU_FQB_PPN		RISCV_IOMMU_PPN_FIELD
> +
> +/* 5.10 Fault Queue Head (32bits) */
> +#define RISCV_IOMMU_REG_FQH		0x0030
> +#define RISCV_IOMMU_FQH_INDEX		RISCV_IOMMU_QUEUE_INDEX_FIELD
> +
> +/* 5.11 Fault Queue tail (32bits) */
> +#define RISCV_IOMMU_REG_FQT		0x0034
> +#define RISCV_IOMMU_FQT_INDEX		RISCV_IOMMU_QUEUE_INDEX_FIELD
> +
> +/* 5.12 Page Request Queue base (64bits) */
> +#define RISCV_IOMMU_REG_PQB		0x0038
> +#define RISCV_IOMMU_PQB_ENTRIES		RISCV_IOMMU_QUEUE_LOGSZ_FIELD
> +#define RISCV_IOMMU_PQB_PPN		RISCV_IOMMU_PPN_FIELD
> +
> +/* 5.13 Page Request Queue head (32bits) */
> +#define RISCV_IOMMU_REG_PQH		0x0040
> +#define RISCV_IOMMU_PQH_INDEX		RISCV_IOMMU_QUEUE_INDEX_FIELD
> +
> +/* 5.14 Page Request Queue tail (32bits) */
> +#define RISCV_IOMMU_REG_PQT		0x0044
> +#define RISCV_IOMMU_PQT_INDEX_MASK	RISCV_IOMMU_QUEUE_INDEX_FIELD
> +
> +/* 5.15 Command Queue CSR (32bits) */
> +#define RISCV_IOMMU_REG_CQCSR		0x0048
> +#define RISCV_IOMMU_CQCSR_CQEN		RISCV_IOMMU_QUEUE_ENABLE
> +#define RISCV_IOMMU_CQCSR_CIE		RISCV_IOMMU_QUEUE_INTR_ENABLE
> +#define RISCV_IOMMU_CQCSR_CQMF		RISCV_IOMMU_QUEUE_MEM_FAULT
> +#define RISCV_IOMMU_CQCSR_CMD_TO	BIT(9)
> +#define RISCV_IOMMU_CQCSR_CMD_ILL	BIT(10)
> +#define RISCV_IOMMU_CQCSR_FENCE_W_IP	BIT(11)
> +#define RISCV_IOMMU_CQCSR_CQON		RISCV_IOMMU_QUEUE_ACTIVE
> +#define RISCV_IOMMU_CQCSR_BUSY		RISCV_IOMMU_QUEUE_BUSY
> +
> +/* 5.16 Fault Queue CSR (32bits) */
> +#define RISCV_IOMMU_REG_FQCSR		0x004C
> +#define RISCV_IOMMU_FQCSR_FQEN		RISCV_IOMMU_QUEUE_ENABLE
> +#define RISCV_IOMMU_FQCSR_FIE		RISCV_IOMMU_QUEUE_INTR_ENABLE
> +#define RISCV_IOMMU_FQCSR_FQMF		RISCV_IOMMU_QUEUE_MEM_FAULT
> +#define RISCV_IOMMU_FQCSR_FQOF		RISCV_IOMMU_QUEUE_OVERFLOW
> +#define RISCV_IOMMU_FQCSR_FQON		RISCV_IOMMU_QUEUE_ACTIVE
> +#define RISCV_IOMMU_FQCSR_BUSY		RISCV_IOMMU_QUEUE_BUSY
> +
> +/* 5.17 Page Request Queue CSR (32bits) */
> +#define RISCV_IOMMU_REG_PQCSR		0x0050
> +#define RISCV_IOMMU_PQCSR_PQEN		RISCV_IOMMU_QUEUE_ENABLE
> +#define RISCV_IOMMU_PQCSR_PIE		RISCV_IOMMU_QUEUE_INTR_ENABLE
> +#define RISCV_IOMMU_PQCSR_PQMF		RISCV_IOMMU_QUEUE_MEM_FAULT
> +#define RISCV_IOMMU_PQCSR_PQOF		RISCV_IOMMU_QUEUE_OVERFLOW
> +#define RISCV_IOMMU_PQCSR_PQON		RISCV_IOMMU_QUEUE_ACTIVE
> +#define RISCV_IOMMU_PQCSR_BUSY		RISCV_IOMMU_QUEUE_BUSY
> +
> +/* 5.18 Interrupt Pending Status (32bits) */
> +#define RISCV_IOMMU_REG_IPSR		0x0054
> +
> +#define RISCV_IOMMU_INTR_CQ		0
> +#define RISCV_IOMMU_INTR_FQ		1
> +#define RISCV_IOMMU_INTR_PM		2
> +#define RISCV_IOMMU_INTR_PQ		3
> +#define RISCV_IOMMU_INTR_COUNT		4
> +
> +#define RISCV_IOMMU_IPSR_CIP		BIT(RISCV_IOMMU_INTR_CQ)
> +#define RISCV_IOMMU_IPSR_FIP		BIT(RISCV_IOMMU_INTR_FQ)
> +#define RISCV_IOMMU_IPSR_PMIP		BIT(RISCV_IOMMU_INTR_PM)
> +#define RISCV_IOMMU_IPSR_PIP		BIT(RISCV_IOMMU_INTR_PQ)
> +
> +/* 5.19 Performance monitoring counter overflow status (32bits) */
> +#define RISCV_IOMMU_REG_IOCOUNTOVF	0x0058
> +#define RISCV_IOMMU_IOCOUNTOVF_CY	BIT(0)
> +#define RISCV_IOMMU_IOCOUNTOVF_HPM	GENMASK_ULL(31, 1)
> +
> +/* 5.20 Performance monitoring counter inhibits (32bits) */
> +#define RISCV_IOMMU_REG_IOCOUNTINH	0x005C
> +#define RISCV_IOMMU_IOCOUNTINH_CY	BIT(0)
> +#define RISCV_IOMMU_IOCOUNTINH_HPM	GENMASK(31, 1)
> +
> +/* 5.21 Performance monitoring cycles counter (64bits) */
> +#define RISCV_IOMMU_REG_IOHPMCYCLES     0x0060
> +#define RISCV_IOMMU_IOHPMCYCLES_COUNTER	GENMASK_ULL(62, 0)
> +#define RISCV_IOMMU_IOHPMCYCLES_OVF	BIT_ULL(63)
> +
> +/* 5.22 Performance monitoring event counters (31 * 64bits) */
> +#define RISCV_IOMMU_REG_IOHPMCTR_BASE	0x0068
> +#define RISCV_IOMMU_REG_IOHPMCTR(_n)	(RISCV_IOMMU_REG_IOHPMCTR_BASE + ((_n) * 0x8))
> +
> +/* 5.23 Performance monitoring event selectors (31 * 64bits) */
> +#define RISCV_IOMMU_REG_IOHPMEVT_BASE	0x0160
> +#define RISCV_IOMMU_REG_IOHPMEVT(_n)	(RISCV_IOMMU_REG_IOHPMEVT_BASE + ((_n) * 0x8))
> +#define RISCV_IOMMU_IOHPMEVT_CNT	31
> +#define RISCV_IOMMU_IOHPMEVT_EVENT_ID	GENMASK_ULL(14, 0)
> +#define RISCV_IOMMU_IOHPMEVT_DMASK	BIT_ULL(15)
> +#define RISCV_IOMMU_IOHPMEVT_PID_PSCID	GENMASK_ULL(35, 16)
> +#define RISCV_IOMMU_IOHPMEVT_DID_GSCID	GENMASK_ULL(59, 36)
> +#define RISCV_IOMMU_IOHPMEVT_PV_PSCV	BIT_ULL(60)
> +#define RISCV_IOMMU_IOHPMEVT_DV_GSCV	BIT_ULL(61)
> +#define RISCV_IOMMU_IOHPMEVT_IDT	BIT_ULL(62)
> +#define RISCV_IOMMU_IOHPMEVT_OF		BIT_ULL(63)
> +
> +/**
> + * enum riscv_iommu_hpmevent_id - Performance-monitoring event identifier
> + *
> + * @RISCV_IOMMU_HPMEVENT_INVALID: Invalid event, do not count
> + * @RISCV_IOMMU_HPMEVENT_URQ: Untranslated requests
> + * @RISCV_IOMMU_HPMEVENT_TRQ: Translated requests
> + * @RISCV_IOMMU_HPMEVENT_ATS_RQ: ATS translation requests
> + * @RISCV_IOMMU_HPMEVENT_TLB_MISS: TLB misses
> + * @RISCV_IOMMU_HPMEVENT_DD_WALK: Device directory walks
> + * @RISCV_IOMMU_HPMEVENT_PD_WALK: Process directory walks
> + * @RISCV_IOMMU_HPMEVENT_S_VS_WALKS: S/VS-Stage page table walks
> + * @RISCV_IOMMU_HPMEVENT_G_WALKS: G-Stage page table walks
> + * @RISCV_IOMMU_HPMEVENT_MAX: Value to denote maximum Event IDs
> + */
> +enum riscv_iommu_hpmevent_id {
> +	RISCV_IOMMU_HPMEVENT_INVALID    = 0,
> +	RISCV_IOMMU_HPMEVENT_URQ        = 1,
> +	RISCV_IOMMU_HPMEVENT_TRQ        = 2,
> +	RISCV_IOMMU_HPMEVENT_ATS_RQ     = 3,
> +	RISCV_IOMMU_HPMEVENT_TLB_MISS   = 4,
> +	RISCV_IOMMU_HPMEVENT_DD_WALK    = 5,
> +	RISCV_IOMMU_HPMEVENT_PD_WALK    = 6,
> +	RISCV_IOMMU_HPMEVENT_S_VS_WALKS = 7,
> +	RISCV_IOMMU_HPMEVENT_G_WALKS    = 8,
> +	RISCV_IOMMU_HPMEVENT_MAX        = 9
> +};
> +
> +/* 5.24 Translation request IOVA (64bits) */
> +#define RISCV_IOMMU_REG_TR_REQ_IOVA     0x0258
> +#define RISCV_IOMMU_TR_REQ_IOVA_VPN	GENMASK_ULL(63, 12)
> +
> +/* 5.25 Translation request control (64bits) */
> +#define RISCV_IOMMU_REG_TR_REQ_CTL	0x0260
> +#define RISCV_IOMMU_TR_REQ_CTL_GO_BUSY	BIT_ULL(0)
> +#define RISCV_IOMMU_TR_REQ_CTL_PRIV	BIT_ULL(1)
> +#define RISCV_IOMMU_TR_REQ_CTL_EXE	BIT_ULL(2)
> +#define RISCV_IOMMU_TR_REQ_CTL_NW	BIT_ULL(3)
> +#define RISCV_IOMMU_TR_REQ_CTL_PID	GENMASK_ULL(31, 12)
> +#define RISCV_IOMMU_TR_REQ_CTL_PV	BIT_ULL(32)
> +#define RISCV_IOMMU_TR_REQ_CTL_DID	GENMASK_ULL(63, 40)
> +
> +/* 5.26 Translation request response (64bits) */
> +#define RISCV_IOMMU_REG_TR_RESPONSE	0x0268
> +#define RISCV_IOMMU_TR_RESPONSE_FAULT	BIT_ULL(0)
> +#define RISCV_IOMMU_TR_RESPONSE_PBMT	GENMASK_ULL(8, 7)
> +#define RISCV_IOMMU_TR_RESPONSE_SZ	BIT_ULL(9)
> +#define RISCV_IOMMU_TR_RESPONSE_PPN	RISCV_IOMMU_PPN_FIELD
> +
> +/* 5.27 Interrupt cause to vector (64bits) */
> +#define RISCV_IOMMU_REG_IVEC		0x02F8
> +#define RISCV_IOMMU_IVEC_CIV		GENMASK_ULL(3, 0)
> +#define RISCV_IOMMU_IVEC_FIV		GENMASK_ULL(7, 4)
> +#define RISCV_IOMMU_IVEC_PMIV		GENMASK_ULL(11, 8)
> +#define RISCV_IOMMU_IVEC_PIV		GENMASK_ULL(15, 12)
> +
> +/* 5.28 MSI Configuration table (32 * 64bits) */
> +#define RISCV_IOMMU_REG_MSI_CONFIG	0x0300
> +#define RISCV_IOMMU_REG_MSI_ADDR(_n)	(RISCV_IOMMU_REG_MSI_CONFIG + ((_n) * 0x10))
> +#define RISCV_IOMMU_MSI_ADDR		GENMASK_ULL(55, 2)
> +#define RISCV_IOMMU_REG_MSI_DATA(_n)	(RISCV_IOMMU_REG_MSI_CONFIG + ((_n) * 0x10) + 0x08)
> +#define RISCV_IOMMU_MSI_DATA		GENMASK_ULL(31, 0)
> +#define RISCV_IOMMU_REG_MSI_VEC_CTL(_n)	(RISCV_IOMMU_REG_MSI_CONFIG + ((_n) * 0x10) + 0x0C)
> +#define RISCV_IOMMU_MSI_VEC_CTL_M	BIT_ULL(0)
> +
> +#define RISCV_IOMMU_REG_SIZE	0x1000
> +
> +/*
> + * Chapter 2: Data structures
> + */
> +
> +/*
> + * Device Directory Table macros for non-leaf nodes
> + */
> +#define RISCV_IOMMU_DDTE_VALID	BIT_ULL(0)
> +#define RISCV_IOMMU_DDTE_PPN	RISCV_IOMMU_PPN_FIELD
> +
> +/**
> + * struct riscv_iommu_dc - Device Context
> + * @tc: Translation Control
> + * @iohgatp: I/O Hypervisor guest address translation and protection
> + *	     (Second stage context)
> + * @ta: Translation Attributes
> + * @fsc: First stage context
> + * @msiptp: MSI page table pointer
> + * @msi_addr_mask: MSI address mask
> + * @msi_addr_pattern: MSI address pattern
> + * @_reserved: Reserved for future use, padding
> + *
> + * This structure is used for leaf nodes on the Device Directory Table,
> + * in case RISCV_IOMMU_CAP_MSI_FLAT is not set, the bottom 4 fields are
> + * not present and are skipped with pointer arithmetic to avoid
> + * casting, check out riscv_iommu_get_dc().
> + * See section 2.1 for more details
> + */
> +struct riscv_iommu_dc {
> +	u64 tc;
> +	u64 iohgatp;
> +	u64 ta;
> +	u64 fsc;
> +	u64 msiptp;
> +	u64 msi_addr_mask;
> +	u64 msi_addr_pattern;
> +	u64 _reserved;
> +};
> +
> +/* Translation control fields */
> +#define RISCV_IOMMU_DC_TC_V		BIT_ULL(0)
> +#define RISCV_IOMMU_DC_TC_EN_ATS	BIT_ULL(1)
> +#define RISCV_IOMMU_DC_TC_EN_PRI	BIT_ULL(2)
> +#define RISCV_IOMMU_DC_TC_T2GPA		BIT_ULL(3)
> +#define RISCV_IOMMU_DC_TC_DTF		BIT_ULL(4)
> +#define RISCV_IOMMU_DC_TC_PDTV		BIT_ULL(5)
> +#define RISCV_IOMMU_DC_TC_PRPR		BIT_ULL(6)
> +#define RISCV_IOMMU_DC_TC_GADE		BIT_ULL(7)
> +#define RISCV_IOMMU_DC_TC_SADE		BIT_ULL(8)
> +#define RISCV_IOMMU_DC_TC_DPE		BIT_ULL(9)
> +#define RISCV_IOMMU_DC_TC_SBE		BIT_ULL(10)
> +#define RISCV_IOMMU_DC_TC_SXL		BIT_ULL(11)
> +
> +/* Second-stage (aka G-stage) context fields */
> +#define RISCV_IOMMU_DC_IOHGATP_PPN	RISCV_IOMMU_ATP_PPN_FIELD
> +#define RISCV_IOMMU_DC_IOHGATP_GSCID	GENMASK_ULL(59, 44)
> +#define RISCV_IOMMU_DC_IOHGATP_MODE	RISCV_IOMMU_ATP_MODE_FIELD
> +
> +/**
> + * enum riscv_iommu_dc_iohgatp_modes - Guest address translation/protection modes
> + * @RISCV_IOMMU_DC_IOHGATP_MODE_BARE: No translation/protection
> + * @RISCV_IOMMU_DC_IOHGATP_MODE_SV32X4: Sv32x4 (2-bit extension of Sv32), when fctl.GXL == 1
> + * @RISCV_IOMMU_DC_IOHGATP_MODE_SV39X4: Sv39x4 (2-bit extension of Sv39), when fctl.GXL == 0
> + * @RISCV_IOMMU_DC_IOHGATP_MODE_SV48X4: Sv48x4 (2-bit extension of Sv48), when fctl.GXL == 0
> + * @RISCV_IOMMU_DC_IOHGATP_MODE_SV57X4: Sv57x4 (2-bit extension of Sv57), when fctl.GXL == 0
> + */
> +enum riscv_iommu_dc_iohgatp_modes {
> +	RISCV_IOMMU_DC_IOHGATP_MODE_BARE = 0,
> +	RISCV_IOMMU_DC_IOHGATP_MODE_SV32X4 = 8,
> +	RISCV_IOMMU_DC_IOHGATP_MODE_SV39X4 = 8,
> +	RISCV_IOMMU_DC_IOHGATP_MODE_SV48X4 = 9,
> +	RISCV_IOMMU_DC_IOHGATP_MODE_SV57X4 = 10
> +};
> +
> +/* Translation attributes fields */
> +#define RISCV_IOMMU_DC_TA_PSCID		GENMASK_ULL(31, 12)
> +
> +/* First-stage context fields */
> +#define RISCV_IOMMU_DC_FSC_PPN		RISCV_IOMMU_ATP_PPN_FIELD
> +#define RISCV_IOMMU_DC_FSC_MODE		RISCV_IOMMU_ATP_MODE_FIELD
> +
> +/**
> + * enum riscv_iommu_dc_fsc_atp_modes - First stage address translation/protection modes
> + * @RISCV_IOMMU_DC_FSC_MODE_BARE: No translation/protection
> + * @RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV32: Sv32, when dc.tc.SXL == 1
> + * @RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39: Sv39, when dc.tc.SXL == 0
> + * @RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV48: Sv48, when dc.tc.SXL == 0
> + * @RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57: Sv57, when dc.tc.SXL == 0
> + * @RISCV_IOMMU_DC_FSC_PDTP_MODE_PD8: 1lvl PDT, 8bit process ids
> + * @RISCV_IOMMU_DC_FSC_PDTP_MODE_PD17: 2lvl PDT, 17bit process ids
> + * @RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20: 3lvl PDT, 20bit process ids
> + *
> + * FSC holds IOSATP when RISCV_IOMMU_DC_TC_PDTV is 0 and PDTP otherwise.
> + * IOSATP controls the first stage address translation (same as the satp register on
> + * the RISC-V MMU), and PDTP holds the process directory table, used to select a
> + * first stage page table based on a process id (for devices that support multiple
> + * process ids).
> + */
> +enum riscv_iommu_dc_fsc_atp_modes {
> +	RISCV_IOMMU_DC_FSC_MODE_BARE = 0,
> +	RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV32 = 8,
> +	RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39 = 8,
> +	RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV48 = 9,
> +	RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57 = 10,
> +	RISCV_IOMMU_DC_FSC_PDTP_MODE_PD8 = 1,
> +	RISCV_IOMMU_DC_FSC_PDTP_MODE_PD17 = 2,
> +	RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20 = 3
> +};
> +
> +/* MSI page table pointer */
> +#define RISCV_IOMMU_DC_MSIPTP_PPN	RISCV_IOMMU_ATP_PPN_FIELD
> +#define RISCV_IOMMU_DC_MSIPTP_MODE	RISCV_IOMMU_ATP_MODE_FIELD
> +#define RISCV_IOMMU_DC_MSIPTP_MODE_OFF	0
> +#define RISCV_IOMMU_DC_MSIPTP_MODE_FLAT	1
> +
> +/* MSI address mask */
> +#define RISCV_IOMMU_DC_MSI_ADDR_MASK	GENMASK_ULL(51, 0)
> +
> +/* MSI address pattern */
> +#define RISCV_IOMMU_DC_MSI_PATTERN	GENMASK_ULL(51, 0)
> +
> +/**
> + * struct riscv_iommu_pc - Process Context
> + * @ta: Translation Attributes
> + * @fsc: First stage context
> + *
> + * This structure is used for leaf nodes on the Process Directory Table
> + * See section 2.3 for more details
> + */
> +struct riscv_iommu_pc {
> +	u64 ta;
> +	u64 fsc;
> +};
> +
> +/* Translation attributes fields */
> +#define RISCV_IOMMU_PC_TA_V	BIT_ULL(0)
> +#define RISCV_IOMMU_PC_TA_ENS	BIT_ULL(1)
> +#define RISCV_IOMMU_PC_TA_SUM	BIT_ULL(2)
> +#define RISCV_IOMMU_PC_TA_PSCID	GENMASK_ULL(31, 12)
> +
> +/* First stage context fields */
> +#define RISCV_IOMMU_PC_FSC_PPN	RISCV_IOMMU_ATP_PPN_FIELD
> +#define RISCV_IOMMU_PC_FSC_MODE	RISCV_IOMMU_ATP_MODE_FIELD
> +
> +/*
> + * Chapter 3: In-memory queue interface
> + */
> +
> +/**
> + * struct riscv_iommu_command - Generic I/O MMU command structure
> + * @dword0: Includes the opcode and the function identifier
> + * @dword1: Opcode specific data
> + *
> + * The commands are interpreted as two 64bit fields, where the first
> + * 7bits of the first field are the opcode which also defines the
> + * command's format, followed by a 3bit field that specifies the
> + * function invoked by that command, and the rest is opcode-specific.
> + * This is a generic struct which will be populated differently
> + * according to each command. For more infos on the commands and
> + * the command queue check section 3.1.
> + */
> +struct riscv_iommu_command {
> +	u64 dword0;
> +	u64 dword1;
> +};
> +
> +/* Fields on dword0, common for all commands */
> +#define RISCV_IOMMU_CMD_OPCODE	GENMASK_ULL(6, 0)
> +#define	RISCV_IOMMU_CMD_FUNC	GENMASK_ULL(9, 7)
> +
> +/* 3.1.1 I/O MMU Page-table cache invalidation */
> +/* Fields on dword0 */
> +#define RISCV_IOMMU_CMD_IOTINVAL_OPCODE		1
> +#define RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA	0
> +#define RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA	1
> +#define RISCV_IOMMU_CMD_IOTINVAL_AV		BIT_ULL(10)
> +#define RISCV_IOMMU_CMD_IOTINVAL_PSCID		GENMASK_ULL(31, 12)
> +#define RISCV_IOMMU_CMD_IOTINVAL_PSCV		BIT_ULL(32)
> +#define RISCV_IOMMU_CMD_IOTINVAL_GV		BIT_ULL(33)
> +#define RISCV_IOMMU_CMD_IOTINVAL_GSCID		GENMASK_ULL(59, 44)
> +/* dword1[61:10] is the 4K-aligned page address */
> +#define RISCV_IOMMU_CMD_IOTINVAL_ADDR		GENMASK_ULL(61, 10)
> +
> +/* 3.1.2 I/O MMU Command Queue Fences */
> +/* Fields on dword0 */
> +#define RISCV_IOMMU_CMD_IOFENCE_OPCODE		2
> +#define RISCV_IOMMU_CMD_IOFENCE_FUNC_C		0
> +#define RISCV_IOMMU_CMD_IOFENCE_AV		BIT_ULL(10)
> +#define RISCV_IOMMU_CMD_IOFENCE_WSI		BIT_ULL(11)
> +#define RISCV_IOMMU_CMD_IOFENCE_PR		BIT_ULL(12)
> +#define RISCV_IOMMU_CMD_IOFENCE_PW		BIT_ULL(13)
> +#define RISCV_IOMMU_CMD_IOFENCE_DATA		GENMASK_ULL(63, 32)
> +/* dword1 is the address, word-size aligned and shifted to the right by two bits. */
> +
> +/* 3.1.3 I/O MMU Directory cache invalidation */
> +/* Fields on dword0 */
> +#define RISCV_IOMMU_CMD_IODIR_OPCODE		3
> +#define RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_DDT	0
> +#define RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_PDT	1
> +#define RISCV_IOMMU_CMD_IODIR_PID		GENMASK_ULL(31, 12)
> +#define RISCV_IOMMU_CMD_IODIR_DV		BIT_ULL(33)
> +#define RISCV_IOMMU_CMD_IODIR_DID		GENMASK_ULL(63, 40)
> +/* dword1 is reserved for standard use */
> +
> +/* 3.1.4 I/O MMU PCIe ATS */
> +/* Fields on dword0 */
> +#define RISCV_IOMMU_CMD_ATS_OPCODE		4
> +#define RISCV_IOMMU_CMD_ATS_FUNC_INVAL		0
> +#define RISCV_IOMMU_CMD_ATS_FUNC_PRGR		1
> +#define RISCV_IOMMU_CMD_ATS_PID			GENMASK_ULL(31, 12)
> +#define RISCV_IOMMU_CMD_ATS_PV			BIT_ULL(32)
> +#define RISCV_IOMMU_CMD_ATS_DSV			BIT_ULL(33)
> +#define RISCV_IOMMU_CMD_ATS_RID			GENMASK_ULL(55, 40)
> +#define RISCV_IOMMU_CMD_ATS_DSEG		GENMASK_ULL(63, 56)
> +/* dword1 is the ATS payload, two different payload types for INVAL and PRGR */
> +
> +/* ATS.INVAL payload*/
> +#define RISCV_IOMMU_CMD_ATS_INVAL_G		BIT_ULL(0)
> +/* Bits 1 - 10 are zeroed */
> +#define RISCV_IOMMU_CMD_ATS_INVAL_S		BIT_ULL(11)
> +#define RISCV_IOMMU_CMD_ATS_INVAL_UADDR		GENMASK_ULL(63, 12)
> +
> +/* ATS.PRGR payload */
> +/* Bits 0 - 31 are zeroed */
> +#define RISCV_IOMMU_CMD_ATS_PRGR_PRG_INDEX	GENMASK_ULL(40, 32)
> +/* Bits 41 - 43 are zeroed */
> +#define RISCV_IOMMU_CMD_ATS_PRGR_RESP_CODE	GENMASK_ULL(47, 44)
> +#define RISCV_IOMMU_CMD_ATS_PRGR_DST_ID		GENMASK_ULL(63, 48)
> +
> +/**
> + * struct riscv_iommu_fq_record - Fault/Event Queue Record
> + * @hdr: Header, includes fault/event cause, PID/DID, transaction type etc
> + * @_reserved: Low 32bits for custom use, high 32bits for standard use
> + * @iotval: Transaction-type/cause specific format
> + * @iotval2: Cause specific format
> + *
> + * The fault/event queue reports events and failures raised when
> + * processing transactions. Each record is a 32byte structure where
> + * the first dword has a fixed format for providing generic infos
> + * regarding the fault/event, and two more dwords are there for
> + * fault/event-specific information. For more details see section
> + * 3.2.
> + */
> +struct riscv_iommu_fq_record {
> +	u64 hdr;
> +	u64 _reserved;
> +	u64 iotval;
> +	u64 iotval2;
> +};
> +
> +/* Fields on header */
> +#define RISCV_IOMMU_FQ_HDR_CAUSE	GENMASK_ULL(11, 0)
> +#define RISCV_IOMMU_FQ_HDR_PID		GENMASK_ULL(31, 12)
> +#define RISCV_IOMMU_FQ_HDR_PV		BIT_ULL(32)
> +#define RISCV_IOMMU_FQ_HDR_PRIV		BIT_ULL(33)
> +#define RISCV_IOMMU_FQ_HDR_TTYPE	GENMASK_ULL(39, 34)
> +#define RISCV_IOMMU_FQ_HDR_DID		GENMASK_ULL(63, 40)
> +
> +/**
> + * enum riscv_iommu_fq_causes - Fault/event cause values
> + * @RISCV_IOMMU_FQ_CAUSE_INST_FAULT: Instruction access fault
> + * @RISCV_IOMMU_FQ_CAUSE_RD_ADDR_MISALIGNED: Read address misaligned
> + * @RISCV_IOMMU_FQ_CAUSE_RD_FAULT: Read load fault
> + * @RISCV_IOMMU_FQ_CAUSE_WR_ADDR_MISALIGNED: Write/AMO address misaligned
> + * @RISCV_IOMMU_FQ_CAUSE_WR_FAULT: Write/AMO access fault
> + * @RISCV_IOMMU_FQ_CAUSE_INST_FAULT_S: Instruction page fault
> + * @RISCV_IOMMU_FQ_CAUSE_RD_FAULT_S: Read page fault
> + * @RISCV_IOMMU_FQ_CAUSE_WR_FAULT_S: Write/AMO page fault
> + * @RISCV_IOMMU_FQ_CAUSE_INST_FAULT_VS: Instruction guest page fault
> + * @RISCV_IOMMU_FQ_CAUSE_RD_FAULT_VS: Read guest page fault
> + * @RISCV_IOMMU_FQ_CAUSE_WR_FAULT_VS: Write/AMO guest page fault
> + * @RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED: All inbound transactions disallowed
> + * @RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT: DDT entry load access fault
> + * @RISCV_IOMMU_FQ_CAUSE_DDT_INVALID: DDT entry invalid
> + * @RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED: DDT entry misconfigured
> + * @RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED: Transaction type disallowed
> + * @RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT: MSI PTE load access fault
> + * @RISCV_IOMMU_FQ_CAUSE_MSI_INVALID: MSI PTE invalid
> + * @RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED: MSI PTE misconfigured
> + * @RISCV_IOMMU_FQ_CAUSE_MRIF_FAULT: MRIF access fault
> + * @RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT: PDT entry load access fault
> + * @RISCV_IOMMU_FQ_CAUSE_PDT_INVALID: PDT entry invalid
> + * @RISCV_IOMMU_FQ_CAUSE_PDT_MISCONFIGURED: PDT entry misconfigured
> + * @RISCV_IOMMU_FQ_CAUSE_DDT_CORRUPTED: DDT data corruption
> + * @RISCV_IOMMU_FQ_CAUSE_PDT_CORRUPTED: PDT data corruption
> + * @RISCV_IOMMU_FQ_CAUSE_MSI_PT_CORRUPTED: MSI page table data corruption
> + * @RISCV_IOMMU_FQ_CAUSE_MRIF_CORRUIPTED: MRIF data corruption
> + * @RISCV_IOMMU_FQ_CAUSE_INTERNAL_DP_ERROR: Internal data path error
> + * @RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT: IOMMU MSI write access fault
> + * @RISCV_IOMMU_FQ_CAUSE_PT_CORRUPTED: First/second stage page table data corruption
> + *
> + * Values are on table 11 of the spec, encodings 275 - 2047 are reserved for standard
> + * use, and 2048 - 4095 for custom use.
> + */
> +enum riscv_iommu_fq_causes {
> +	RISCV_IOMMU_FQ_CAUSE_INST_FAULT = 1,
> +	RISCV_IOMMU_FQ_CAUSE_RD_ADDR_MISALIGNED = 4,
> +	RISCV_IOMMU_FQ_CAUSE_RD_FAULT = 5,
> +	RISCV_IOMMU_FQ_CAUSE_WR_ADDR_MISALIGNED = 6,
> +	RISCV_IOMMU_FQ_CAUSE_WR_FAULT = 7,
> +	RISCV_IOMMU_FQ_CAUSE_INST_FAULT_S = 12,
> +	RISCV_IOMMU_FQ_CAUSE_RD_FAULT_S = 13,
> +	RISCV_IOMMU_FQ_CAUSE_WR_FAULT_S = 15,
> +	RISCV_IOMMU_FQ_CAUSE_INST_FAULT_VS = 20,
> +	RISCV_IOMMU_FQ_CAUSE_RD_FAULT_VS = 21,
> +	RISCV_IOMMU_FQ_CAUSE_WR_FAULT_VS = 23,
> +	RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED = 256,
> +	RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT = 257,
> +	RISCV_IOMMU_FQ_CAUSE_DDT_INVALID = 258,
> +	RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED = 259,
> +	RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED = 260,
> +	RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT = 261,
> +	RISCV_IOMMU_FQ_CAUSE_MSI_INVALID = 262,
> +	RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED = 263,
> +	RISCV_IOMMU_FQ_CAUSE_MRIF_FAULT = 264,
> +	RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT = 265,
> +	RISCV_IOMMU_FQ_CAUSE_PDT_INVALID = 266,
> +	RISCV_IOMMU_FQ_CAUSE_PDT_MISCONFIGURED = 267,
> +	RISCV_IOMMU_FQ_CAUSE_DDT_CORRUPTED = 268,
> +	RISCV_IOMMU_FQ_CAUSE_PDT_CORRUPTED = 269,
> +	RISCV_IOMMU_FQ_CAUSE_MSI_PT_CORRUPTED = 270,
> +	RISCV_IOMMU_FQ_CAUSE_MRIF_CORRUIPTED = 271,
> +	RISCV_IOMMU_FQ_CAUSE_INTERNAL_DP_ERROR = 272,
> +	RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT = 273,
> +	RISCV_IOMMU_FQ_CAUSE_PT_CORRUPTED = 274
> +};
> +
> +/**
> + * enum riscv_iommu_fq_ttypes: Fault/event transaction types
> + * @RISCV_IOMMU_FQ_TTYPE_NONE: None. Fault not caused by an inbound transaction.
> + * @RISCV_IOMMU_FQ_TTYPE_UADDR_INST_FETCH: Instruction fetch from untranslated address
> + * @RISCV_IOMMU_FQ_TTYPE_UADDR_RD: Read from untranslated address
> + * @RISCV_IOMMU_FQ_TTYPE_UADDR_WR: Write/AMO to untranslated address
> + * @RISCV_IOMMU_FQ_TTYPE_TADDR_INST_FETCH: Instruction fetch from translated address
> + * @RISCV_IOMMU_FQ_TTYPE_TADDR_RD: Read from translated address
> + * @RISCV_IOMMU_FQ_TTYPE_TADDR_WR: Write/AMO to translated address
> + * @RISCV_IOMMU_FQ_TTYPE_PCIE_ATS_REQ: PCIe ATS translation request
> + * @RISCV_IOMMU_FW_TTYPE_PCIE_MSG_REQ: PCIe message request
> + *
> + * Values are on table 12 of the spec, type 4 and 10 - 31 are reserved for standard use
> + * and 31 - 63 for custom use.
> + */
> +enum riscv_iommu_fq_ttypes {
> +	RISCV_IOMMU_FQ_TTYPE_NONE = 0,
> +	RISCV_IOMMU_FQ_TTYPE_UADDR_INST_FETCH = 1,
> +	RISCV_IOMMU_FQ_TTYPE_UADDR_RD = 2,
> +	RISCV_IOMMU_FQ_TTYPE_UADDR_WR = 3,
> +	RISCV_IOMMU_FQ_TTYPE_TADDR_INST_FETCH = 5,
> +	RISCV_IOMMU_FQ_TTYPE_TADDR_RD = 6,
> +	RISCV_IOMMU_FQ_TTYPE_TADDR_WR = 7,
> +	RISCV_IOMMU_FQ_TTYPE_PCIE_ATS_REQ = 8,
> +	RISCV_IOMMU_FW_TTYPE_PCIE_MSG_REQ = 9,
> +};
> +
> +/**
> + * struct riscv_iommu_pq_record - PCIe Page Request record
> + * @hdr: Header, includes PID, DID etc
> + * @payload: Holds the page address, request group and permission bits
> + *
> + * For more infos on the PCIe Page Request queue see chapter 3.3.
> + */
> +struct riscv_iommu_pq_record {
> +	u64 hdr;
> +	u64 payload;
> +};
> +
> +/* Header fields */
> +#define RISCV_IOMMU_PREQ_HDR_PID	GENMASK_ULL(31, 12)
> +#define RISCV_IOMMU_PREQ_HDR_PV		BIT_ULL(32)
> +#define RISCV_IOMMU_PREQ_HDR_PRIV	BIT_ULL(33)
> +#define RISCV_IOMMU_PREQ_HDR_EXEC	BIT_ULL(34)
> +#define RISCV_IOMMU_PREQ_HDR_DID	GENMASK_ULL(63, 40)
> +
> +/* Payload fields */
> +#define RISCV_IOMMU_PREQ_PAYLOAD_R	BIT_ULL(0)
> +#define RISCV_IOMMU_PREQ_PAYLOAD_W	BIT_ULL(1)
> +#define RISCV_IOMMU_PREQ_PAYLOAD_L	BIT_ULL(2)
> +#define RISCV_IOMMU_PREQ_PAYLOAD_M	GENMASK_ULL(2, 0)	/* Mask of RWL for convenience */
> +#define RISCV_IOMMU_PREQ_PRG_INDEX	GENMASK_ULL(11, 3)
> +#define RISCV_IOMMU_PREQ_UADDR		GENMASK_ULL(63, 12)
> +
> +/**
> + * struct riscv_iommu_msi_pte - MSI Page Table Entry
> + * @pte: MSI PTE
> + * @mrif_info: Memory-resident interrupt file info
> + *
> + * The MSI Page Table is used for virtualizing MSIs, so that when
> + * a device sends an MSI to a guest, the IOMMU can reroute it
> + * by translating the MSI address, either to a guest interrupt file
> + * or a memory resident interrupt file (MRIF). Note that this page table
> + * is an array of MSI PTEs, not a multi-level pt, each entry
> + * is a leaf entry. For more infos check out the AIA spec, chapter 9.5.
> + *
> + * Also in basic mode the mrif_info field is ignored by the IOMMU and can
> + * be used by software, any other reserved fields on pte must be zeroed-out
> + * by software.
> + */
> +struct riscv_iommu_msi_pte {
> +	u64 pte;
> +	u64 mrif_info;
> +};
> +
> +/* Fields on pte */
> +#define RISCV_IOMMU_MSI_PTE_V		BIT_ULL(0)
> +#define RISCV_IOMMU_MSI_PTE_M		GENMASK_ULL(2, 1)
> +#define RISCV_IOMMU_MSI_PTE_MRIF_ADDR	GENMASK_ULL(53, 7)	/* When M == 1 (MRIF mode) */
> +#define RISCV_IOMMU_MSI_PTE_PPN		RISCV_IOMMU_PPN_FIELD	/* When M == 3 (basic mode) */
> +#define RISCV_IOMMU_MSI_PTE_C		BIT_ULL(63)
> +
> +/* Fields on mrif_info */
> +#define RISCV_IOMMU_MSI_MRIF_NID	GENMASK_ULL(9, 0)
> +#define RISCV_IOMMU_MSI_MRIF_NPPN	RISCV_IOMMU_PPN_FIELD
> +#define RISCV_IOMMU_MSI_MRIF_NID_MSB	BIT_ULL(60)
> +
> +#endif /* _RISCV_IOMMU_BITS_H_ */
> diff --git a/drivers/iommu/riscv/iommu-platform.c b/drivers/iommu/riscv/iommu-platform.c
> new file mode 100644
> index 000000000000..770086ae2ab3
> --- /dev/null
> +++ b/drivers/iommu/riscv/iommu-platform.c
> @@ -0,0 +1,94 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * RISC-V IOMMU as a platform device
> + *
> + * Copyright © 2023 FORTH-ICS/CARV
> + * Copyright © 2023-2024 Rivos Inc.
> + *
> + * Authors
> + *	Nick Kossifidis <mick@ics.forth.gr>
> + *	Tomasz Jeznach <tjeznach@rivosinc.com>
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/of_platform.h>
> +#include <linux/platform_device.h>
> +
> +#include "iommu-bits.h"
> +#include "iommu.h"
> +
> +static int riscv_iommu_platform_probe(struct platform_device *pdev)
> +{
> +	struct device *dev = &pdev->dev;
> +	struct riscv_iommu_device *iommu = NULL;
> +	struct resource *res = NULL;
> +	int vec;
> +
> +	iommu = devm_kzalloc(dev, sizeof(*iommu), GFP_KERNEL);
> +	if (!iommu)
> +		return -ENOMEM;
> +
> +	iommu->dev = dev;
> +	iommu->reg = devm_platform_get_and_ioremap_resource(pdev, 0, &res);
> +	if (IS_ERR(iommu->reg))
> +		return dev_err_probe(dev, PTR_ERR(iommu->reg),
> +				     "could not map register region\n");
> +
> +	dev_set_drvdata(dev, iommu);
> +
> +	/* Check device reported capabilities / features. */
> +	iommu->caps = riscv_iommu_readq(iommu, RISCV_IOMMU_REG_CAP);
> +	iommu->fctl = riscv_iommu_readl(iommu, RISCV_IOMMU_REG_FCTL);
> +
> +	/* For now we only support WSI */
> +	switch (FIELD_GET(RISCV_IOMMU_CAP_IGS, iommu->caps)) {
> +	case RISCV_IOMMU_CAP_IGS_WSI:
> +	case RISCV_IOMMU_CAP_IGS_BOTH:
> +		break;
> +	default:
> +		return dev_err_probe(dev, -ENODEV,
> +				     "unable to use wire-signaled interrupts\n");
> +	}
> +
> +	iommu->irqs_count = platform_irq_count(pdev);
> +	if (iommu->irqs_count <= 0)
> +		return dev_err_probe(dev, -ENODEV,
> +				     "no IRQ resources provided\n");
> +
> +	for (vec = 0; vec < iommu->irqs_count; vec++)
> +		iommu->irqs[vec] = platform_get_irq(pdev, vec);

And if I've specified 97 interrupts in my DT because I won't let schema 
be the boss of me?

> +	/* Enable wire-signaled interrupts, fctl.WSI */
> +	if (!(iommu->fctl & RISCV_IOMMU_FCTL_WSI)) {
> +		iommu->fctl ^= RISCV_IOMMU_FCTL_WSI;

Using XOR to only ever set a 0 bit to 1 seems a bit obtuse.

> +		riscv_iommu_writel(iommu, RISCV_IOMMU_REG_FCTL, iommu->fctl);
> +	}
> +
> +	return riscv_iommu_init(iommu);
> +};
> +
> +static void riscv_iommu_platform_remove(struct platform_device *pdev)
> +{
> +	riscv_iommu_remove(dev_get_drvdata(&pdev->dev));
> +};
> +
> +static const struct of_device_id riscv_iommu_of_match[] = {
> +	{.compatible = "riscv,iommu",},
> +	{},
> +};
> +
> +MODULE_DEVICE_TABLE(of, riscv_iommu_of_match);

And yet it cannot be a module?

> +static struct platform_driver riscv_iommu_platform_driver = {
> +	.probe = riscv_iommu_platform_probe,
> +	.remove_new = riscv_iommu_platform_remove,
> +	.driver = {
> +		.name = "riscv,iommu",
> +		.of_match_table = riscv_iommu_of_match,
> +		.suppress_bind_attrs = true,
> +	},
> +};
> +
> +module_driver(riscv_iommu_platform_driver, platform_driver_register,
> +	      platform_driver_unregister);

module_platform_driver() is a thing. Or builtin_platform_driver(), as 
things currently stand.

> diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> new file mode 100644
> index 000000000000..af68c89200a9
> --- /dev/null
> +++ b/drivers/iommu/riscv/iommu.c
> @@ -0,0 +1,89 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * IOMMU API for RISC-V IOMMU implementations.
> + *
> + * Copyright © 2022-2024 Rivos Inc.
> + * Copyright © 2023 FORTH-ICS/CARV
> + *
> + * Authors
> + *	Tomasz Jeznach <tjeznach@rivosinc.com>
> + *	Nick Kossifidis <mick@ics.forth.gr>
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

I guess that ends up as "iommu:"? Given that that prefix already belongs 
to the core code, please pick something more specific.

> +#include <linux/compiler.h>
> +#include <linux/dma-mapping.h>
> +#include <linux/init.h>
> +#include <linux/iommu.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +
> +#include "iommu-bits.h"
> +#include "iommu.h"
> +
> +MODULE_DESCRIPTION("Driver for RISC-V IOMMU");
> +MODULE_AUTHOR("Tomasz Jeznach <tjeznach@rivosinc.com>");
> +MODULE_AUTHOR("Nick Kossifidis <mick@ics.forth.gr>");
> +MODULE_LICENSE("GPL");
> +
> +/* Timeouts in [us] */
> +#define RISCV_IOMMU_DDTP_TIMEOUT	50000
> +
> +static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
> +{
> +	u64 ddtp;
> +
> +	/* Hardware must be configured in OFF | BARE mode at system initialization. */
> +	riscv_iommu_readq_timeout(iommu, RISCV_IOMMU_REG_DDTP,
> +				  ddtp, !(ddtp & RISCV_IOMMU_DDTP_BUSY),
> +				  10, RISCV_IOMMU_DDTP_TIMEOUT);
> +	if (FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp) > RISCV_IOMMU_DDTP_MODE_BARE)
> +		return -EBUSY;

It looks like RISC-V already supports kdump, so you probably want to be 
prepared to find the IOMMU with its pants down and deal with it from day 
one.

> +
> +	/* Configure accesses to in-memory data structures for CPU-native byte order. */
> +	if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) != !!(iommu->fctl & RISCV_IOMMU_FCTL_BE)) {
> +		if (!(iommu->caps & RISCV_IOMMU_CAP_END))
> +			return -EINVAL;
> +		riscv_iommu_writel(iommu, RISCV_IOMMU_REG_FCTL,
> +				   iommu->fctl ^ RISCV_IOMMU_FCTL_BE);
> +		iommu->fctl = riscv_iommu_readl(iommu, RISCV_IOMMU_REG_FCTL);
> +		if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) != !!(iommu->fctl & RISCV_IOMMU_FCTL_BE))
> +			return -EINVAL;
> +	}

That's fun. It could be reorganised to avoid the duplicate check, but it 
is rather majestic as-is :)

> +
> +	dma_set_mask_and_coherent(iommu->dev,
> +				  DMA_BIT_MASK(FIELD_GET(RISCV_IOMMU_CAP_PAS, iommu->caps)));

This isn't a check, so I would think it belongs to "_init" rather than 
"_init_check".

Thanks,
Robin.

> +
> +	return 0;
> +}
> +
> +void riscv_iommu_remove(struct riscv_iommu_device *iommu)
> +{
> +	iommu_device_sysfs_remove(&iommu->iommu);
> +}
> +
> +int riscv_iommu_init(struct riscv_iommu_device *iommu)
> +{
> +	int rc;
> +
> +	rc = riscv_iommu_init_check(iommu);
> +	if (rc)
> +		return dev_err_probe(iommu->dev, rc, "unexpected device state\n");
> +	/*
> +	 * Placeholder for a complete IOMMU device initialization.
> +	 * For now, only bare minimum: enable global identity mapping mode and register sysfs.
> +	 */
> +	riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP,
> +			   FIELD_PREP(RISCV_IOMMU_DDTP_MODE, RISCV_IOMMU_DDTP_MODE_BARE));
> +
> +	rc = iommu_device_sysfs_add(&iommu->iommu, NULL, NULL, "riscv-iommu@%s",
> +				    dev_name(iommu->dev));
> +	if (WARN(rc, "cannot register sysfs interface\n"))
> +		goto err_sysfs;
> +
> +	return 0;
> +
> +err_sysfs:
> +	return rc;
> +}
> diff --git a/drivers/iommu/riscv/iommu.h b/drivers/iommu/riscv/iommu.h
> new file mode 100644
> index 000000000000..700e33dc2446
> --- /dev/null
> +++ b/drivers/iommu/riscv/iommu.h
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright © 2022-2024 Rivos Inc.
> + * Copyright © 2023 FORTH-ICS/CARV
> + *
> + * Authors
> + *	Tomasz Jeznach <tjeznach@rivosinc.com>
> + *	Nick Kossifidis <mick@ics.forth.gr>
> + */
> +
> +#ifndef _RISCV_IOMMU_H_
> +#define _RISCV_IOMMU_H_
> +
> +#include <linux/iommu.h>
> +#include <linux/types.h>
> +#include <linux/iopoll.h>
> +
> +#include "iommu-bits.h"
> +
> +struct riscv_iommu_device {
> +	/* iommu core interface */
> +	struct iommu_device iommu;
> +
> +	/* iommu hardware */
> +	struct device *dev;
> +
> +	/* hardware control register space */
> +	void __iomem *reg;
> +
> +	/* supported and enabled hardware capabilities */
> +	u64 caps;
> +	u32 fctl;
> +
> +	/* available interrupt numbers, MSI or WSI */
> +	unsigned int irqs[RISCV_IOMMU_INTR_COUNT];
> +	unsigned int irqs_count;
> +};
> +
> +int riscv_iommu_init(struct riscv_iommu_device *iommu);
> +void riscv_iommu_remove(struct riscv_iommu_device *iommu);
> +
> +#define riscv_iommu_readl(iommu, addr) \
> +	readl_relaxed((iommu)->reg + (addr))
> +
> +#define riscv_iommu_readq(iommu, addr) \
> +	readq_relaxed((iommu)->reg + (addr))
> +
> +#define riscv_iommu_writel(iommu, addr, val) \
> +	writel_relaxed((val), (iommu)->reg + (addr))
> +
> +#define riscv_iommu_writeq(iommu, addr, val) \
> +	writeq_relaxed((val), (iommu)->reg + (addr))
> +
> +#define riscv_iommu_readq_timeout(iommu, addr, val, cond, delay_us, timeout_us) \
> +	readx_poll_timeout(readq_relaxed, (iommu)->reg + (addr), val, cond, \
> +			   delay_us, timeout_us)
> +
> +#define riscv_iommu_readl_timeout(iommu, addr, val, cond, delay_us, timeout_us) \
> +	readx_poll_timeout(readl_relaxed, (iommu)->reg + (addr), val, cond, \
> +			   delay_us, timeout_us)
> +
> +#endif

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 3/7] iommu/riscv: Add RISC-V IOMMU PCIe device driver
  2024-04-18 16:32 ` [PATCH v2 3/7] iommu/riscv: Add RISC-V IOMMU PCIe " Tomasz Jeznach
@ 2024-04-18 22:07   ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2024-04-18 22:07 UTC (permalink / raw)
  To: Tomasz Jeznach, Joerg Roedel, Will Deacon, Paul Walmsley
  Cc: Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux

On 2024-04-18 5:32 pm, Tomasz Jeznach wrote:
> Introduce device driver for PCIe implementation
> of RISC-V IOMMU architected hardware.
> 
> IOMMU hardware and system support for MSI or MSI-X is
> required by this implementation.
> 
> Vendor and device identifiers used in this patch
> matches QEMU implementation of the RISC-V IOMMU PCIe
> device, from Rivos VID (0x1efd) range allocated by the PCI-SIG.
> 
> Link: https://lore.kernel.org/qemu-devel/20240307160319.675044-1-dbarboza@ventanamicro.com/
> Co-developed-by: Nick Kossifidis <mick@ics.forth.gr>
> Signed-off-by: Nick Kossifidis <mick@ics.forth.gr>
> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
> ---
>   MAINTAINERS                     |   1 +
>   drivers/iommu/riscv/Kconfig     |   6 ++
>   drivers/iommu/riscv/Makefile    |   1 +
>   drivers/iommu/riscv/iommu-pci.c | 154 ++++++++++++++++++++++++++++++++
>   4 files changed, 162 insertions(+)
>   create mode 100644 drivers/iommu/riscv/iommu-pci.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 051599c76585..4da290d5e9db 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -18975,6 +18975,7 @@ F:	Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
>   F:	drivers/iommu/riscv/Kconfig
>   F:	drivers/iommu/riscv/Makefile
>   F:	drivers/iommu/riscv/iommu-bits.h
> +F:	drivers/iommu/riscv/iommu-pci.c
>   F:	drivers/iommu/riscv/iommu-platform.c
>   F:	drivers/iommu/riscv/iommu.c
>   F:	drivers/iommu/riscv/iommu.h
> diff --git a/drivers/iommu/riscv/Kconfig b/drivers/iommu/riscv/Kconfig
> index d02326bddb4c..711326992585 100644
> --- a/drivers/iommu/riscv/Kconfig
> +++ b/drivers/iommu/riscv/Kconfig
> @@ -14,3 +14,9 @@ config RISCV_IOMMU
>   
>   	  Say Y here if your SoC includes an IOMMU device implementing
>   	  the RISC-V IOMMU architecture.
> +
> +config RISCV_IOMMU_PCI
> +	def_bool y if RISCV_IOMMU && PCI_MSI
> +	depends on RISCV_IOMMU && PCI_MSI
> +	help
> +	  Support for the PCI implementation of RISC-V IOMMU architecture.

Similar comments as before.

> diff --git a/drivers/iommu/riscv/Makefile b/drivers/iommu/riscv/Makefile
> index e4c189de58d3..f54c9ed17d41 100644
> --- a/drivers/iommu/riscv/Makefile
> +++ b/drivers/iommu/riscv/Makefile
> @@ -1,2 +1,3 @@
>   # SPDX-License-Identifier: GPL-2.0-only
>   obj-$(CONFIG_RISCV_IOMMU) += iommu.o iommu-platform.o
> +obj-$(CONFIG_RISCV_IOMMU_PCI) += iommu-pci.o
> diff --git a/drivers/iommu/riscv/iommu-pci.c b/drivers/iommu/riscv/iommu-pci.c
> new file mode 100644
> index 000000000000..9263c6e475be
> --- /dev/null
> +++ b/drivers/iommu/riscv/iommu-pci.c
> @@ -0,0 +1,154 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +/*
> + * Copyright © 2022-2024 Rivos Inc.
> + * Copyright © 2023 FORTH-ICS/CARV
> + *
> + * RISCV IOMMU as a PCIe device
> + *
> + * Authors
> + *	Tomasz Jeznach <tjeznach@rivosinc.com>
> + *	Nick Kossifidis <mick@ics.forth.gr>
> + */
> +
> +#include <linux/compiler.h>
> +#include <linux/init.h>
> +#include <linux/iommu.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +
> +#include "iommu-bits.h"
> +#include "iommu.h"
> +
> +/* Rivos Inc. assigned PCI Vendor and Device IDs */
> +#ifndef PCI_VENDOR_ID_RIVOS
> +#define PCI_VENDOR_ID_RIVOS             0x1efd
> +#endif
> +
> +#ifndef PCI_DEVICE_ID_RIVOS_IOMMU
> +#define PCI_DEVICE_ID_RIVOS_IOMMU       0xedf1
> +#endif
> +
> +static int riscv_iommu_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> +{
> +	struct device *dev = &pdev->dev;
> +	struct riscv_iommu_device *iommu;
> +	int rc, vec;
> +
> +	rc = pci_enable_device_mem(pdev);
> +	if (rc)
> +		return rc;
> +
> +	rc = pci_request_mem_regions(pdev, KBUILD_MODNAME);
> +	if (rc)
> +		goto fail;
> +
> +	pci_set_master(pdev);
> +
> +	if (!(pci_resource_flags(pdev, 0) & IORESOURCE_MEM))
> +		goto fail;
> +
> +	if (pci_resource_len(pdev, 0) < RISCV_IOMMU_REG_SIZE)
> +		goto fail;
> +
> +	iommu = devm_kzalloc(dev, sizeof(*iommu), GFP_KERNEL);
> +	if (!iommu)
> +		goto fail;
> +
> +	iommu->dev = dev;
> +	iommu->reg = pci_iomap(pdev, 0, RISCV_IOMMU_REG_SIZE);

Maybe consider some of the pcim_* devres helpers, to simplify 
cleanup/remove?

> +
> +	if (!iommu->reg)
> +		goto fail;
> +
> +	dev_set_drvdata(dev, iommu);
> +
> +	/* Check device reported capabilities / features. */
> +	iommu->caps = riscv_iommu_readq(iommu, RISCV_IOMMU_REG_CAP);
> +	iommu->fctl = riscv_iommu_readl(iommu, RISCV_IOMMU_REG_FCTL);
> +
> +	/* The PCI driver only uses MSIs, make sure the IOMMU supports this */
> +	switch (FIELD_GET(RISCV_IOMMU_CAP_IGS, iommu->caps)) {
> +	case RISCV_IOMMU_CAP_IGS_MSI:
> +	case RISCV_IOMMU_CAP_IGS_BOTH:
> +		break;
> +	default:
> +		dev_err(dev, "unable to use message-signaled interrupts\n");
> +		rc = -ENODEV;
> +		goto fail_unmap;
> +	}
> +
> +	/* Allocate and assign IRQ vectors for the various events */
> +	rc = pci_alloc_irq_vectors(pdev, 1, RISCV_IOMMU_INTR_COUNT,
> +				   PCI_IRQ_MSIX | PCI_IRQ_MSI);
> +	if (rc <= 0) {
> +		dev_err(dev, "unable to allocate irq vectors\n");
> +		goto fail_unmap;
> +	}
> +	for (vec = 0; vec < rc; vec++) {
> +		iommu->irqs[vec] = msi_get_virq(dev, vec);
> +		if (!iommu->irqs[vec])

Can that ever fail if the loop is already bounded to the number of 
vectors successfully allocated?

> +			break;
> +	}
> +	iommu->irqs_count = vec;
> +
> +	/* Enable message-signaled interrupts, fctl.WSI */
> +	if (iommu->fctl & RISCV_IOMMU_FCTL_WSI) {
> +		iommu->fctl ^= RISCV_IOMMU_FCTL_WSI;
> +		riscv_iommu_writel(iommu, RISCV_IOMMU_REG_FCTL, iommu->fctl);
> +	}
> +
> +	rc = riscv_iommu_init(iommu);
> +	if (!rc)
> +		return 0;
> +
> +fail_unmap:
> +	iounmap(iommu->reg);
> +	pci_free_irq_vectors(pdev);
> +fail:
> +	pci_release_regions(pdev);
> +	pci_clear_master(pdev);
> +	pci_disable_device(pdev);
> +	return rc;
> +}
> +
> +static void riscv_iommu_pci_remove(struct pci_dev *pdev)
> +{
> +	struct riscv_iommu_device *iommu = dev_get_drvdata(&pdev->dev);
> +
> +	riscv_iommu_remove(iommu);
> +	iounmap(iommu->reg);
> +	pci_free_irq_vectors(pdev);
> +	pci_release_regions(pdev);
> +	pci_clear_master(pdev);
> +	pci_disable_device(pdev);
> +}
> +
> +static const struct pci_device_id riscv_iommu_pci_tbl[] = {
> +	{PCI_VENDOR_ID_RIVOS, PCI_DEVICE_ID_RIVOS_IOMMU,
> +	 PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0},
> +	{0,}
> +};
> +
> +MODULE_DEVICE_TABLE(pci, riscv_iommu_pci_tbl);
> +
> +static const struct of_device_id riscv_iommu_of_match[] = {
> +	{.compatible = "riscv,pci-iommu",},
> +	{},
> +};
> +
> +MODULE_DEVICE_TABLE(of, riscv_iommu_of_match);
> +
> +static struct pci_driver riscv_iommu_pci_driver = {
> +	.name = KBUILD_MODNAME,
> +	.id_table = riscv_iommu_pci_tbl,
> +	.probe = riscv_iommu_pci_probe,
> +	.remove = riscv_iommu_pci_remove,
> +	.driver = {
> +		.of_match_table = riscv_iommu_of_match,

Does an of_match_table serve any functional purpose for a PCI driver? I 
can't find any other examples of this being done.

Thanks,
Robin.

> +		.suppress_bind_attrs = true,
> +	},
> +};
> +
> +module_pci_driver(riscv_iommu_pci_driver);

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 5/7] iommu/riscv: Device directory management.
  2024-04-18 16:32 ` [PATCH v2 5/7] iommu/riscv: Device directory management Tomasz Jeznach
@ 2024-04-19 12:40   ` Jason Gunthorpe
  2024-04-24 23:01     ` Tomasz Jeznach
  2024-04-22  5:11   ` Baolu Lu
  1 sibling, 1 reply; 30+ messages in thread
From: Jason Gunthorpe @ 2024-04-19 12:40 UTC (permalink / raw)
  To: Tomasz Jeznach
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux

On Thu, Apr 18, 2024 at 09:32:23AM -0700, Tomasz Jeznach wrote:
> @@ -31,13 +32,350 @@ MODULE_LICENSE("GPL");
>  /* Timeouts in [us] */
>  #define RISCV_IOMMU_DDTP_TIMEOUT	50000
>  
> -static int riscv_iommu_attach_identity_domain(struct iommu_domain *domain,
> -					      struct device *dev)
> +/* RISC-V IOMMU PPN <> PHYS address conversions, PHYS <=> PPN[53:10] */
> +#define phys_to_ppn(va)  (((va) >> 2) & (((1ULL << 44) - 1) << 10))
> +#define ppn_to_phys(pn)	 (((pn) << 2) & (((1ULL << 44) - 1) << 12))
> +
> +#define dev_to_iommu(dev) \
> +	container_of((dev)->iommu->iommu_dev, struct riscv_iommu_device, iommu)

We have iommu_get_iommu_dev() now

> +static unsigned long riscv_iommu_get_pages(struct riscv_iommu_device *iommu, unsigned int order)
> +{
> +	struct riscv_iommu_devres *devres;
> +	struct page *pages;
> +
> +	pages = alloc_pages_node(dev_to_node(iommu->dev),
> +				 GFP_KERNEL_ACCOUNT | __GFP_ZERO, order);
> +	if (unlikely(!pages)) {
> +		dev_err(iommu->dev, "Page allocation failed, order %u\n", order);
> +		return 0;
> +	}

This needs adjusting for the recently merged allocation accounting

> +static int riscv_iommu_attach_domain(struct riscv_iommu_device *iommu,
> +				     struct device *dev,
> +				     struct iommu_domain *iommu_domain)
> +{
> +	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> +	struct riscv_iommu_dc *dc;
> +	u64 fsc, ta, tc;
> +	int i;
> +
> +	if (!iommu_domain) {
> +		ta = 0;
> +		tc = 0;
> +		fsc = 0;
> +	} else if (iommu_domain->type == IOMMU_DOMAIN_IDENTITY) {
> +		ta = 0;
> +		tc = RISCV_IOMMU_DC_TC_V;
> +		fsc = FIELD_PREP(RISCV_IOMMU_DC_FSC_MODE, RISCV_IOMMU_DC_FSC_MODE_BARE);
> +	} else {
> +		/* This should never happen. */
> +		return -ENODEV;
> +	}

Please don't write it like this. This function is already being called
by functions that are already under specific ops, don't check
domain->type here.

Instead have the caller compute and pass in the ta/tc/fsc
values. Maybe in a tidy struct..

> +	/* Update existing or allocate new entries in device directory */
> +	for (i = 0; i < fwspec->num_ids; i++) {
> +		dc = riscv_iommu_get_dc(iommu, fwspec->ids[i], !iommu_domain);
> +		if (!dc && !iommu_domain)
> +			continue;
> +		if (!dc)
> +			return -ENODEV;

But if this fails some of the fwspecs were left in a weird state ?

Drivers should try hard to have attach functions that fail and make no
change at all or fully succeed.

Meaning ideally preallocate any required memory before doing any
change to the HW visable structures.

> +
> +		/* Swap device context, update TC valid bit as the last operation */
> +		xchg64(&dc->fsc, fsc);
> +		xchg64(&dc->ta, ta);
> +		xchg64(&dc->tc, tc);

This doesn't loook right? When you get to adding PAGING suport fsc has
the page table pfn and ta has the cache tag, so this will end up
tearing the data for sure, eg when asked to replace a PAGING domain
with another PAGING domain? That will create a functional/security
problem, right?

I would encourage you to re-use the ARM sequencing code, ideally moved
to some generic helper library. Every iommu driver dealing with
multi-quanta descriptors seems to have this same fundamental
sequencing problem.

> +static void riscv_iommu_release_device(struct device *dev)
> +{
> +	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> +
> +	riscv_iommu_attach_domain(iommu, dev, NULL);
> +}

The release_domain has landed too now. Please don't invent weird NULL
domain types that have special meaning. I assume clearing the V bit is
a blocking behavior? So please implement a proper blocking domain and
set release_domain = &riscv_iommu_blocking and just omit this release
function.

> @@ -133,12 +480,14 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
>  	rc = riscv_iommu_init_check(iommu);
>  	if (rc)
>  		return dev_err_probe(iommu->dev, rc, "unexpected device state\n");
> -	/*
> -	 * Placeholder for a complete IOMMU device initialization.
> -	 * For now, only bare minimum: enable global identity mapping mode and register sysfs.
> -	 */
> -	riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP,
> -			   FIELD_PREP(RISCV_IOMMU_DDTP_MODE, RISCV_IOMMU_DDTP_MODE_BARE));
> +
> +	rc = riscv_iommu_ddt_alloc(iommu);
> +	if (WARN(rc, "cannot allocate device directory\n"))
> +		goto err_init;

memory allocation failure already makes noisy prints, more prints are
not needed..

> +	rc = riscv_iommu_set_ddtp_mode(iommu, RISCV_IOMMU_DDTP_MODE_MAX);
> +	if (WARN(rc, "cannot enable iommu device\n"))
> +		goto err_init;

This is not a proper use of WARN, it should only be used for things
that cannot happen not undesired error paths.

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 7/7] iommu/riscv: Paging domain support
  2024-04-18 16:32 ` [PATCH v2 7/7] iommu/riscv: Paging domain support Tomasz Jeznach
@ 2024-04-19 12:56   ` Jason Gunthorpe
  2024-04-22  7:40     ` Baolu Lu
  2024-04-24 23:30     ` Tomasz Jeznach
  2024-04-22  5:21   ` Baolu Lu
  2024-04-23 17:00   ` Andrew Jones
  2 siblings, 2 replies; 30+ messages in thread
From: Jason Gunthorpe @ 2024-04-19 12:56 UTC (permalink / raw)
  To: Tomasz Jeznach
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux

On Thu, Apr 18, 2024 at 09:32:25AM -0700, Tomasz Jeznach wrote:

> diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> index a4f74588cdc2..32ddc372432d 100644
> --- a/drivers/iommu/riscv/iommu.c
> +++ b/drivers/iommu/riscv/iommu.c
> @@ -46,6 +46,10 @@ MODULE_LICENSE("GPL");
>  #define dev_to_iommu(dev) \
>  	container_of((dev)->iommu->iommu_dev, struct riscv_iommu_device, iommu)
>  
> +/* IOMMU PSCID allocation namespace. */
> +static DEFINE_IDA(riscv_iommu_pscids);
> +#define RISCV_IOMMU_MAX_PSCID		BIT(20)
> +

You may consider putting this IDA in the riscv_iommu_device() and move
the pscid from the domain to the bond?

>  /* Device resource-managed allocations */
>  struct riscv_iommu_devres {
>  	unsigned long addr;
> @@ -752,12 +756,77 @@ static int riscv_iommu_ddt_alloc(struct riscv_iommu_device *iommu)
>  	return 0;
>  }
>  
> +struct riscv_iommu_bond {
> +	struct list_head list;
> +	struct rcu_head rcu;
> +	struct device *dev;
> +};
> +
> +/* This struct contains protection domain specific IOMMU driver data. */
> +struct riscv_iommu_domain {
> +	struct iommu_domain domain;
> +	struct list_head bonds;
> +	int pscid;
> +	int numa_node;
> +	int amo_enabled:1;
> +	unsigned int pgd_mode;
> +	/* paging domain */
> +	unsigned long pgd_root;
> +};

Glad to see there is no riscv_iommu_device pointer in the domain!

> +static void riscv_iommu_iotlb_inval(struct riscv_iommu_domain *domain,
> +				    unsigned long start, unsigned long end)
> +{
> +	struct riscv_iommu_bond *bond;
> +	struct riscv_iommu_device *iommu;
> +	struct riscv_iommu_command cmd;
> +	unsigned long len = end - start + 1;
> +	unsigned long iova;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(bond, &domain->bonds, list) {
> +		iommu = dev_to_iommu(bond->dev);

Pedantically this locking isn't locked right, there is technically
nothing that prevents bond->dev and the iommu instance struct from
being freed here. eg iommufd can hit races here if userspace can hot
unplug devices.

I suggest storing the iommu pointer itself in the bond instead of the
device then add a synchronize_rcu() to the iommu unregister path.

> +		riscv_iommu_cmd_inval_vma(&cmd);
> +		riscv_iommu_cmd_inval_set_pscid(&cmd, domain->pscid);
> +		if (len > 0 && len < RISCV_IOMMU_IOTLB_INVAL_LIMIT) {
> +			for (iova = start; iova < end; iova += PAGE_SIZE) {
> +				riscv_iommu_cmd_inval_set_addr(&cmd, iova);
> +				riscv_iommu_cmd_send(iommu, &cmd, 0);
> +			}
> +		} else {
> +			riscv_iommu_cmd_send(iommu, &cmd, 0);
> +		}
> +	}

This seems suboptimal, you probably want to copy the new design that
Intel is doing where you allocate "bonds" that are already
de-duplicated. Ie if I have 10 devices on the same iommu sharing the
domain the above will invalidate the PSCID 10 times. It should only be
done once.

ie add a "bond" for the (iommu,pscid) and refcount that based on how
many devices are used. Then another "bond" for the ATS stuff eventually.

> +
> +	list_for_each_entry_rcu(bond, &domain->bonds, list) {
> +		iommu = dev_to_iommu(bond->dev);
> +
> +		riscv_iommu_cmd_iofence(&cmd);
> +		riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_QUEUE_TIMEOUT);
> +	}
> +	rcu_read_unlock();
> +}
> +

> @@ -787,12 +870,390 @@ static int riscv_iommu_attach_domain(struct riscv_iommu_device *iommu,
>  		xchg64(&dc->ta, ta);
>  		xchg64(&dc->tc, tc);
>  
> -		/* Device context invalidation will be required. Ignoring for now. */
> +		if (!(tc & RISCV_IOMMU_DC_TC_V))
> +			continue;

No negative caching in HW?

> +		/* Invalidate device context cache */
> +		riscv_iommu_cmd_iodir_inval_ddt(&cmd);
> +		riscv_iommu_cmd_iodir_set_did(&cmd, fwspec->ids[i]);
> +		riscv_iommu_cmd_send(iommu, &cmd, 0);
> +
> +		if (FIELD_GET(RISCV_IOMMU_PC_FSC_MODE, fsc) == RISCV_IOMMU_DC_FSC_MODE_BARE)
> +			continue;
> +
> +		/* Invalidate last valid PSCID */
> +		riscv_iommu_cmd_inval_vma(&cmd);
> +		riscv_iommu_cmd_inval_set_pscid(&cmd, FIELD_GET(RISCV_IOMMU_DC_TA_PSCID, ta));
> +		riscv_iommu_cmd_send(iommu, &cmd, 0);
> +	}
> +
> +	/* Synchronize directory update */
> +	riscv_iommu_cmd_iofence(&cmd);
> +	riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_IOTINVAL_TIMEOUT);
> +
> +	/* Track domain to devices mapping. */
> +	if (bond)
> +		list_add_rcu(&bond->list, &domain->bonds);

This is in the wrong order, the invalidation on the pscid needs to
start before the pscid is loaded into HW in the first place otherwise
concurrent invalidations may miss HW updates.

> +
> +	/* Remove tracking from previous domain, if needed. */
> +	iommu_domain = iommu_get_domain_for_dev(dev);
> +	if (iommu_domain && !!(iommu_domain->type & __IOMMU_DOMAIN_PAGING)) {

No need for !!, && is already booleanizing

> +		domain = iommu_domain_to_riscv(iommu_domain);
> +		bond = NULL;
> +		rcu_read_lock();
> +		list_for_each_entry_rcu(b, &domain->bonds, list) {
> +			if (b->dev == dev) {
> +				bond = b;
> +				break;
> +			}
> +		}
> +		rcu_read_unlock();
> +
> +		if (bond) {
> +			list_del_rcu(&bond->list);
> +			kfree_rcu(bond, rcu);
> +		}
> +	}
> +
> +	return 0;
> +}

> +static inline size_t get_page_size(size_t size)
> +{
> +	if (size >= IOMMU_PAGE_SIZE_512G)
> +		return IOMMU_PAGE_SIZE_512G;
> +	if (size >= IOMMU_PAGE_SIZE_1G)
> +		return IOMMU_PAGE_SIZE_1G;
> +	if (size >= IOMMU_PAGE_SIZE_2M)
> +		return IOMMU_PAGE_SIZE_2M;
> +	return IOMMU_PAGE_SIZE_4K;
> +}
> +
> +#define _io_pte_present(pte)	((pte) & (_PAGE_PRESENT | _PAGE_PROT_NONE))
> +#define _io_pte_leaf(pte)	((pte) & _PAGE_LEAF)
> +#define _io_pte_none(pte)	((pte) == 0)
> +#define _io_pte_entry(pn, prot)	((_PAGE_PFN_MASK & ((pn) << _PAGE_PFN_SHIFT)) | (prot))
> +
> +static void riscv_iommu_pte_free(struct riscv_iommu_domain *domain,
> +				 unsigned long pte, struct list_head *freelist)
> +{
> +	unsigned long *ptr;
> +	int i;
> +
> +	if (!_io_pte_present(pte) || _io_pte_leaf(pte))
> +		return;
> +
> +	ptr = (unsigned long *)pfn_to_virt(__page_val_to_pfn(pte));
> +
> +	/* Recursively free all sub page table pages */
> +	for (i = 0; i < PTRS_PER_PTE; i++) {
> +		pte = READ_ONCE(ptr[i]);
> +		if (!_io_pte_none(pte) && cmpxchg_relaxed(ptr + i, pte, 0) == pte)
> +			riscv_iommu_pte_free(domain, pte, freelist);
> +	}
> +
> +	if (freelist)
> +		list_add_tail(&virt_to_page(ptr)->lru, freelist);
> +	else
> +		free_page((unsigned long)ptr);
> +}

Consider putting the page table handling in its own file?

> +static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
> +					    struct device *dev)
> +{
> +	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> +	struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> +	struct page *page;
> +
> +	if (!riscv_iommu_pt_supported(iommu, domain->pgd_mode))
> +		return -ENODEV;
> +
> +	domain->numa_node = dev_to_node(iommu->dev);
> +	domain->amo_enabled = !!(iommu->caps & RISCV_IOMMU_CAP_AMO_HWAD);
> +
> +	if (!domain->pgd_root) {
> +		page = alloc_pages_node(domain->numa_node,
> +					GFP_KERNEL_ACCOUNT | __GFP_ZERO, 0);
> +		if (!page)
> +			return -ENOMEM;
> +		domain->pgd_root = (unsigned long)page_to_virt(page);

The pgd_root should be allocated by the alloc_paging function, not
during attach. There is no locking here that will protect against
concurrent attach and also map before attach should work.

You can pick up the numa affinity from the alloc paging dev pointer
(note it may be null still in some cases)

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 5/7] iommu/riscv: Device directory management.
  2024-04-18 16:32 ` [PATCH v2 5/7] iommu/riscv: Device directory management Tomasz Jeznach
  2024-04-19 12:40   ` Jason Gunthorpe
@ 2024-04-22  5:11   ` Baolu Lu
  2024-04-24 23:07     ` Tomasz Jeznach
  1 sibling, 1 reply; 30+ messages in thread
From: Baolu Lu @ 2024-04-22  5:11 UTC (permalink / raw)
  To: Tomasz Jeznach, Joerg Roedel, Will Deacon, Robin Murphy,
	Paul Walmsley
  Cc: baolu.lu, Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L,
	Nick Kossifidis, Sebastien Boeuf, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, devicetree, iommu, linux-riscv,
	linux-kernel, linux

On 4/19/24 12:32 AM, Tomasz Jeznach wrote:
> Introduce device context allocation and device directory tree
> management including capabilities discovery sequence, as described
> in Chapter 2.1 of the RISC-V IOMMU Architecture Specification.
> 
> Device directory mode will be auto detected using DDTP WARL property,
> using highest mode supported by the driver and hardware. If none
> supported can be configured, driver will fall back to global pass-through.
> 
> First level DDTP page can be located in I/O (detected using DDTP WARL)
> and system memory.
> 
> Only identity protection domain is supported by this implementation.
> 
> Co-developed-by: Nick Kossifidis <mick@ics.forth.gr>
> Signed-off-by: Nick Kossifidis <mick@ics.forth.gr>
> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
> ---
>   drivers/iommu/riscv/iommu.c | 369 +++++++++++++++++++++++++++++++++++-
>   drivers/iommu/riscv/iommu.h |   5 +
>   2 files changed, 365 insertions(+), 9 deletions(-)

[ ... ]

> +
> +/*
> + * Discover supported DDT modes starting from requested value,
> + * configure DDTP register with accepted mode and root DDT address.
> + * Accepted iommu->ddt_mode is updated on success.
> + */
> +static int riscv_iommu_set_ddtp_mode(struct riscv_iommu_device *iommu,
> +				     unsigned int ddtp_mode)
> +{
> +	struct device *dev = iommu->dev;
> +	u64 ddtp, rq_ddtp;
> +	unsigned int mode, rq_mode = ddtp_mode;
> +	int rc;
> +
> +	rc = readq_relaxed_poll_timeout(iommu->reg + RISCV_IOMMU_REG_DDTP,
> +					ddtp, !(ddtp & RISCV_IOMMU_DDTP_BUSY),
> +					10, RISCV_IOMMU_DDTP_TIMEOUT);
> +	if (rc < 0)
> +		return -EBUSY;
> +
> +	/* Disallow state transition from xLVL to xLVL. */
> +	switch (FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp)) {
> +	case RISCV_IOMMU_DDTP_MODE_BARE:
> +	case RISCV_IOMMU_DDTP_MODE_OFF:
> +		break;
> +	default:
> +		if (rq_mode != RISCV_IOMMU_DDTP_MODE_BARE &&
> +		    rq_mode != RISCV_IOMMU_DDTP_MODE_OFF)
> +			return -EINVAL;

Is this check duplicate? It appears that it's always true in the default
branch.

> +		break;
> +	}
> +
> +	do {
> +		rq_ddtp = FIELD_PREP(RISCV_IOMMU_DDTP_MODE, rq_mode);
> +		if (rq_mode > RISCV_IOMMU_DDTP_MODE_BARE)
> +			rq_ddtp |= phys_to_ppn(iommu->ddt_phys);
> +
> +		riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP, rq_ddtp);
> +
> +		rc = readq_relaxed_poll_timeout(iommu->reg + RISCV_IOMMU_REG_DDTP,
> +						ddtp, !(ddtp & RISCV_IOMMU_DDTP_BUSY),
> +						10, RISCV_IOMMU_DDTP_TIMEOUT);
> +		if (rc < 0) {
> +			dev_warn(dev, "timeout when setting ddtp (ddt mode: %u, read: %llx)\n",
> +				 rq_mode, ddtp);
> +			return -EBUSY;
> +		}
> +
> +		/* Verify IOMMU hardware accepts new DDTP config. */
> +		mode = FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp);
> +
> +		if (rq_mode == mode)
> +			break;
> +
> +		/* Hardware mandatory DDTP mode has not been accepted. */
> +		if (rq_mode < RISCV_IOMMU_DDTP_MODE_1LVL && rq_ddtp != ddtp) {
> +			dev_warn(dev, "DDTP update failed hw: %llx vs %llx\n", ddtp, rq_ddtp);
> +			return -EINVAL;
> +		}
> +
> +		/*
> +		 * Mode field is WARL, an IOMMU may support a subset of
> +		 * directory table levels in which case if we tried to set
> +		 * an unsupported number of levels we'll readback either
> +		 * a valid xLVL or off/bare. If we got off/bare, try again
> +		 * with a smaller xLVL.
> +		 */
> +		if (mode < RISCV_IOMMU_DDTP_MODE_1LVL &&
> +		    rq_mode > RISCV_IOMMU_DDTP_MODE_1LVL) {
> +			dev_dbg(dev, "DDTP hw mode %u vs %u\n", mode, rq_mode);
> +			rq_mode--;
> +			continue;
> +		}
> +
> +		/*
> +		 * We tried all supported modes and IOMMU hardware failed to
> +		 * accept new settings, something went very wrong since off/bare
> +		 * and at least one xLVL must be supported.
> +		 */
> +		dev_warn(dev, "DDTP hw mode %u, failed to set %u\n", mode, ddtp_mode);
> +		return -EINVAL;
> +	} while (1);
> +
> +	iommu->ddt_mode = mode;
> +	if (mode != ddtp_mode)
> +		dev_warn(dev, "DDTP failover to %u mode, requested %u\n",
> +			 mode, ddtp_mode);
> +
> +	return 0;
> +}
> +

[ ... ]

> +
> +static int riscv_iommu_attach_domain(struct riscv_iommu_device *iommu,
> +				     struct device *dev,
> +				     struct iommu_domain *iommu_domain)
> +{
> +	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> +	struct riscv_iommu_dc *dc;
> +	u64 fsc, ta, tc;
> +	int i;
> +
> +	if (!iommu_domain) {
> +		ta = 0;
> +		tc = 0;
> +		fsc = 0;
> +	} else if (iommu_domain->type == IOMMU_DOMAIN_IDENTITY) {
> +		ta = 0;
> +		tc = RISCV_IOMMU_DC_TC_V;
> +		fsc = FIELD_PREP(RISCV_IOMMU_DC_FSC_MODE, RISCV_IOMMU_DC_FSC_MODE_BARE);
> +	} else {
> +		/* This should never happen. */
> +		return -ENODEV;
> +	}

Move the domain->type check code to the domain-specific ops.

> +
> +	/* Update existing or allocate new entries in device directory */
> +	for (i = 0; i < fwspec->num_ids; i++) {
> +		dc = riscv_iommu_get_dc(iommu, fwspec->ids[i], !iommu_domain);
> +		if (!dc && !iommu_domain)
> +			continue;
> +		if (!dc)
> +			return -ENODEV;
> +
> +		/* Swap device context, update TC valid bit as the last operation */
> +		xchg64(&dc->fsc, fsc);
> +		xchg64(&dc->ta, ta);
> +		xchg64(&dc->tc, tc);
> +
> +		/* Device context invalidation will be required. Ignoring for now. */
> +	}
> +
>   	return 0;
>   }
>   
> +static int riscv_iommu_attach_identity_domain(struct iommu_domain *iommu_domain,
> +					      struct device *dev)
> +{
> +	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> +
> +	/* Global pass-through already enabled, do nothing. */
> +	if (iommu->ddt_mode == RISCV_IOMMU_DDTP_MODE_BARE)
> +		return 0;
> +
> +	return riscv_iommu_attach_domain(iommu, dev, iommu_domain);
> +}
> +
>   static struct iommu_domain riscv_iommu_identity_domain = {
>   	.type = IOMMU_DOMAIN_IDENTITY,
>   	.ops = &(const struct iommu_domain_ops) {
> @@ -82,6 +420,13 @@ static void riscv_iommu_probe_finalize(struct device *dev)
>   	iommu_setup_dma_ops(dev, 0, U64_MAX);
>   }
>   
> +static void riscv_iommu_release_device(struct device *dev)
> +{
> +	struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> +
> +	riscv_iommu_attach_domain(iommu, dev, NULL);

Attaching a NULL domain to a device has already been removed. You can
use the iommu_ops->release_domain here.

> +}
> +
>   static const struct iommu_ops riscv_iommu_ops = {
>   	.owner = THIS_MODULE,
>   	.of_xlate = riscv_iommu_of_xlate,
> @@ -90,6 +435,7 @@ static const struct iommu_ops riscv_iommu_ops = {
>   	.device_group = riscv_iommu_device_group,
>   	.probe_device = riscv_iommu_probe_device,
>   	.probe_finalize = riscv_iommu_probe_finalize,

The probe_finalize op will be removed soon.

https://lore.kernel.org/linux-iommu/bebea331c1d688b34d9862eefd5ede47503961b8.1713523152.git.robin.murphy@arm.com/

> +	.release_device = riscv_iommu_release_device,
>   };
>   
>   static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
> @@ -124,6 +470,7 @@ void riscv_iommu_remove(struct riscv_iommu_device *iommu)
>   {
>   	iommu_device_unregister(&iommu->iommu);
>   	iommu_device_sysfs_remove(&iommu->iommu);
> +	riscv_iommu_set_ddtp_mode(iommu, RISCV_IOMMU_DDTP_MODE_OFF);
>   }
>   
>   int riscv_iommu_init(struct riscv_iommu_device *iommu)
> @@ -133,12 +480,14 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
>   	rc = riscv_iommu_init_check(iommu);
>   	if (rc)
>   		return dev_err_probe(iommu->dev, rc, "unexpected device state\n");
> -	/*
> -	 * Placeholder for a complete IOMMU device initialization.
> -	 * For now, only bare minimum: enable global identity mapping mode and register sysfs.
> -	 */
> -	riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP,
> -			   FIELD_PREP(RISCV_IOMMU_DDTP_MODE, RISCV_IOMMU_DDTP_MODE_BARE));
> +
> +	rc = riscv_iommu_ddt_alloc(iommu);
> +	if (WARN(rc, "cannot allocate device directory\n"))
> +		goto err_init;
> +
> +	rc = riscv_iommu_set_ddtp_mode(iommu, RISCV_IOMMU_DDTP_MODE_MAX);
> +	if (WARN(rc, "cannot enable iommu device\n"))
> +		goto err_init;
>   
>   	rc = iommu_device_sysfs_add(&iommu->iommu, NULL, NULL, "riscv-iommu@%s",
>   				    dev_name(iommu->dev));
> @@ -154,5 +503,7 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
>   err_iommu:
>   	iommu_device_sysfs_remove(&iommu->iommu);
>   err_sysfs:
> +	riscv_iommu_set_ddtp_mode(iommu, RISCV_IOMMU_DDTP_MODE_OFF);
> +err_init:
>   	return rc;
>   }
> diff --git a/drivers/iommu/riscv/iommu.h b/drivers/iommu/riscv/iommu.h
> index 700e33dc2446..f1696926582c 100644
> --- a/drivers/iommu/riscv/iommu.h
> +++ b/drivers/iommu/riscv/iommu.h
> @@ -34,6 +34,11 @@ struct riscv_iommu_device {
>   	/* available interrupt numbers, MSI or WSI */
>   	unsigned int irqs[RISCV_IOMMU_INTR_COUNT];
>   	unsigned int irqs_count;
> +
> +	/* device directory */
> +	unsigned int ddt_mode;
> +	dma_addr_t ddt_phys;
> +	u64 *ddt_root;
>   };
>   
>   int riscv_iommu_init(struct riscv_iommu_device *iommu);

Best regards,
baolu

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 7/7] iommu/riscv: Paging domain support
  2024-04-18 16:32 ` [PATCH v2 7/7] iommu/riscv: Paging domain support Tomasz Jeznach
  2024-04-19 12:56   ` Jason Gunthorpe
@ 2024-04-22  5:21   ` Baolu Lu
  2024-04-22 19:30     ` Jason Gunthorpe
  2024-04-23 17:00   ` Andrew Jones
  2 siblings, 1 reply; 30+ messages in thread
From: Baolu Lu @ 2024-04-22  5:21 UTC (permalink / raw)
  To: Tomasz Jeznach, Joerg Roedel, Will Deacon, Robin Murphy,
	Paul Walmsley
  Cc: baolu.lu, Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L,
	Nick Kossifidis, Sebastien Boeuf, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, devicetree, iommu, linux-riscv,
	linux-kernel, linux

On 4/19/24 12:32 AM, Tomasz Jeznach wrote:
> Introduce first-stage address translation support.
> 
> Page table configured by the IOMMU driver will use the same format
> as the CPU’s MMU, and will fallback to identity translation if the
> page table format configured for the MMU is not supported by the
> IOMMU hardware.
> 
> This change introduces IOTINVAL.VMA command, required to invalidate
> any cached IOATC entries after mapping is updated and/or removed from
> the paging domain. Invalidations for the non-leaf page entries will
> be added to the driver code in separate patch series, following spec
> update to clarify non-leaf cache invalidation command. With this patch,
> allowing only 4K mappings and keeping non-leaf page entries in memory
> this should be a reasonable simplification.
> 
> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
> ---
>   drivers/iommu/riscv/Kconfig |   1 +
>   drivers/iommu/riscv/iommu.c | 467 +++++++++++++++++++++++++++++++++++-
>   2 files changed, 466 insertions(+), 2 deletions(-)
> 

[...]

> +
>   static int riscv_iommu_attach_domain(struct riscv_iommu_device *iommu,
>   				     struct device *dev,
>   				     struct iommu_domain *iommu_domain)
>   {
>   	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> +	struct riscv_iommu_domain *domain;
>   	struct riscv_iommu_dc *dc;
> +	struct riscv_iommu_bond *bond = NULL, *b;
> +	struct riscv_iommu_command cmd;
>   	u64 fsc, ta, tc;
>   	int i;
>   
> @@ -769,6 +838,20 @@ static int riscv_iommu_attach_domain(struct riscv_iommu_device *iommu,
>   		ta = 0;
>   		tc = RISCV_IOMMU_DC_TC_V;
>   		fsc = FIELD_PREP(RISCV_IOMMU_DC_FSC_MODE, RISCV_IOMMU_DC_FSC_MODE_BARE);
> +	} else if (iommu_domain->type & __IOMMU_DOMAIN_PAGING) {
> +		domain = iommu_domain_to_riscv(iommu_domain);
> +
> +		ta = FIELD_PREP(RISCV_IOMMU_PC_TA_PSCID, domain->pscid);
> +		tc = RISCV_IOMMU_DC_TC_V;
> +		if (domain->amo_enabled)
> +			tc |= RISCV_IOMMU_DC_TC_SADE;
> +		fsc = FIELD_PREP(RISCV_IOMMU_PC_FSC_MODE, domain->pgd_mode) |
> +		      FIELD_PREP(RISCV_IOMMU_PC_FSC_PPN, virt_to_pfn(domain->pgd_root));
> +
> +		bond = kzalloc(sizeof(*bond), GFP_KERNEL);
> +		if (!bond)
> +			return -ENOMEM;
> +		bond->dev = dev;
>   	} else {
>   		/* This should never happen. */
>   		return -ENODEV;
> @@ -787,12 +870,390 @@ static int riscv_iommu_attach_domain(struct riscv_iommu_device *iommu,
>   		xchg64(&dc->ta, ta);
>   		xchg64(&dc->tc, tc);
>   
> -		/* Device context invalidation will be required. Ignoring for now. */
> +		if (!(tc & RISCV_IOMMU_DC_TC_V))
> +			continue;
> +
> +		/* Invalidate device context cache */
> +		riscv_iommu_cmd_iodir_inval_ddt(&cmd);
> +		riscv_iommu_cmd_iodir_set_did(&cmd, fwspec->ids[i]);
> +		riscv_iommu_cmd_send(iommu, &cmd, 0);
> +
> +		if (FIELD_GET(RISCV_IOMMU_PC_FSC_MODE, fsc) == RISCV_IOMMU_DC_FSC_MODE_BARE)
> +			continue;
> +
> +		/* Invalidate last valid PSCID */
> +		riscv_iommu_cmd_inval_vma(&cmd);
> +		riscv_iommu_cmd_inval_set_pscid(&cmd, FIELD_GET(RISCV_IOMMU_DC_TA_PSCID, ta));
> +		riscv_iommu_cmd_send(iommu, &cmd, 0);
> +	}
> +
> +	/* Synchronize directory update */
> +	riscv_iommu_cmd_iofence(&cmd);
> +	riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_IOTINVAL_TIMEOUT);
> +
> +	/* Track domain to devices mapping. */
> +	if (bond)
> +		list_add_rcu(&bond->list, &domain->bonds);
> +
> +	/* Remove tracking from previous domain, if needed. */
> +	iommu_domain = iommu_get_domain_for_dev(dev);

Calling iommu_get_domain_for_dev() in the domain attaching path is very
fragile because it heavily depends on the order of calling the attach
callback and setting the domain pointer in the core.

Perhaps the driver can use dev_iommu_priv_set/get() to keep the active
domain in the per-device private data?

> +	if (iommu_domain && !!(iommu_domain->type & __IOMMU_DOMAIN_PAGING)) {
> +		domain = iommu_domain_to_riscv(iommu_domain);
> +		bond = NULL;
> +		rcu_read_lock();
> +		list_for_each_entry_rcu(b, &domain->bonds, list) {
> +			if (b->dev == dev) {
> +				bond = b;
> +				break;
> +			}
> +		}
> +		rcu_read_unlock();
> +
> +		if (bond) {
> +			list_del_rcu(&bond->list);
> +			kfree_rcu(bond, rcu);
> +		}
> +	}
> +
> +	return 0;
> +}

Best regards,
baolu

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 7/7] iommu/riscv: Paging domain support
  2024-04-19 12:56   ` Jason Gunthorpe
@ 2024-04-22  7:40     ` Baolu Lu
  2024-04-24 23:30     ` Tomasz Jeznach
  1 sibling, 0 replies; 30+ messages in thread
From: Baolu Lu @ 2024-04-22  7:40 UTC (permalink / raw)
  To: Jason Gunthorpe, Tomasz Jeznach
  Cc: baolu.lu, Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux

On 2024/4/19 20:56, Jason Gunthorpe wrote:
>> +		riscv_iommu_cmd_inval_vma(&cmd);
>> +		riscv_iommu_cmd_inval_set_pscid(&cmd, domain->pscid);
>> +		if (len > 0 && len < RISCV_IOMMU_IOTLB_INVAL_LIMIT) {
>> +			for (iova = start; iova < end; iova += PAGE_SIZE) {
>> +				riscv_iommu_cmd_inval_set_addr(&cmd, iova);
>> +				riscv_iommu_cmd_send(iommu, &cmd, 0);
>> +			}
>> +		} else {
>> +			riscv_iommu_cmd_send(iommu, &cmd, 0);
>> +		}
>> +	}
> This seems suboptimal, you probably want to copy the new design that
> Intel is doing where you allocate "bonds" that are already
> de-duplicated. Ie if I have 10 devices on the same iommu sharing the
> domain the above will invalidate the PSCID 10 times. It should only be
> done once.
> 
> ie add a "bond" for the (iommu,pscid) and refcount that based on how
> many devices are used. Then another "bond" for the ATS stuff eventually.

The latest version is under discussion here.

https://lore.kernel.org/linux-iommu/20240416080656.60968-1-baolu.lu@linux.intel.com/

Supposedly, you can make such optimization after the base code is landed
in the mainline tree if the change is big.

Best regards,
baolu

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/7] dt-bindings: iommu: riscv: Add bindings for RISC-V IOMMU
  2024-04-18 16:32 ` [PATCH v2 1/7] dt-bindings: iommu: riscv: Add bindings for RISC-V IOMMU Tomasz Jeznach
  2024-04-18 17:04   ` Conor Dooley
@ 2024-04-22 14:04   ` Rob Herring
  1 sibling, 0 replies; 30+ messages in thread
From: Rob Herring @ 2024-04-22 14:04 UTC (permalink / raw)
  To: Tomasz Jeznach
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Krzysztof Kozlowski, Conor Dooley, devicetree,
	iommu, linux-riscv, linux-kernel, linux

On Thu, Apr 18, 2024 at 09:32:19AM -0700, Tomasz Jeznach wrote:
> Add bindings for the RISC-V IOMMU device drivers.
> 
> Co-developed-by: Anup Patel <apatel@ventanamicro.com>
> Signed-off-by: Anup Patel <apatel@ventanamicro.com>
> Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
> ---
>  .../bindings/iommu/riscv,iommu.yaml           | 149 ++++++++++++++++++
>  MAINTAINERS                                   |   7 +
>  2 files changed, 156 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
> 
> diff --git a/Documentation/devicetree/bindings/iommu/riscv,iommu.yaml b/Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
> new file mode 100644
> index 000000000000..d6522ddd43fa
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
> @@ -0,0 +1,149 @@
> +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> +%YAML 1.2
> +---
> +$id: http://devicetree.org/schemas/iommu/riscv,iommu.yaml#
> +$schema: http://devicetree.org/meta-schemas/core.yaml#
> +
> +title: RISC-V IOMMU Architecture Implementation
> +
> +maintainers:
> +  - Tomasz Jeznach <tjeznach@rivosinc.com>
> +
> +description: |+
> +  The RISC-V IOMMU provides memory address translation and isolation for
> +  input and output devices, supporting per-device translation context,
> +  shared process address spaces including the ATS and PRI components of
> +  the PCIe specification, two stage address translation and MSI remapping.
> +  It supports identical translation table format to the RISC-V address
> +  translation tables with page level access and protection attributes.
> +  Hardware uses in-memory command and fault reporting queues with wired
> +  interrupt or MSI notifications.
> +
> +  Visit https://github.com/riscv-non-isa/riscv-iommu for more details.
> +
> +  For information on assigning RISC-V IOMMU to its peripheral devices,
> +  see generic IOMMU bindings.
> +
> +properties:
> +  # For PCIe IOMMU hardware compatible property should contain the vendor
> +  # and device ID according to the PCI Bus Binding specification.
> +  # Since PCI provides built-in identification methods, compatible is not
> +  # actually required. For non-PCIe hardware implementations 'riscv,iommu'
> +  # should be specified along with 'reg' property providing MMIO location.
> +  compatible:
> +    oneOf:
> +      - items:
> +          - const: riscv,pci-iommu
> +          - const: pci1efd,edf1

Given the PCI compatible string is a specific vendor and device, it is 
more specific than "riscv,pci-iommu" and should come first.

> +      - items:
> +          - const: pci1efd,edf1

Why do you need to support this without riscv,pci-iommu?

> +      - items:
> +          - const: riscv,iommu

I agree with what Conor said on this.

Rob

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 7/7] iommu/riscv: Paging domain support
  2024-04-22  5:21   ` Baolu Lu
@ 2024-04-22 19:30     ` Jason Gunthorpe
  0 siblings, 0 replies; 30+ messages in thread
From: Jason Gunthorpe @ 2024-04-22 19:30 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Tomasz Jeznach, Joerg Roedel, Will Deacon, Robin Murphy,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L,
	Nick Kossifidis, Sebastien Boeuf, Rob Herring,
	Krzysztof Kozlowski, Conor Dooley, devicetree, iommu, linux-riscv,
	linux-kernel, linux

On Mon, Apr 22, 2024 at 01:21:05PM +0800, Baolu Lu wrote:
> > +	/* Track domain to devices mapping. */
> > +	if (bond)
> > +		list_add_rcu(&bond->list, &domain->bonds);
> > +
> > +	/* Remove tracking from previous domain, if needed. */
> > +	iommu_domain = iommu_get_domain_for_dev(dev);
> 
> Calling iommu_get_domain_for_dev() in the domain attaching path is very
> fragile because it heavily depends on the order of calling the attach
> callback and setting the domain pointer in the core.

We have a couple places doing this already, the core code accomodates
it well enough for deleting from a list.. So I think it is OK to keep
doing.

> Perhaps the driver can use dev_iommu_priv_set/get() to keep the active
> domain in the per-device private data?
> 
> > +	if (iommu_domain && !!(iommu_domain->type & __IOMMU_DOMAIN_PAGING)) {
> > +		domain = iommu_domain_to_riscv(iommu_domain);
> > +		bond = NULL;
> > +		rcu_read_lock();
> > +		list_for_each_entry_rcu(b, &domain->bonds, list) {
> > +			if (b->dev == dev) {
> > +				bond = b;
> > +				break;
> > +			}
> > +		}
> > +		rcu_read_unlock();

But now that I look again, this is not safe, you have to hold some
kind of per-domain lock to mutate the list. rcu_*read*_lock() cannot
be used for write.

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 7/7] iommu/riscv: Paging domain support
  2024-04-18 16:32 ` [PATCH v2 7/7] iommu/riscv: Paging domain support Tomasz Jeznach
  2024-04-19 12:56   ` Jason Gunthorpe
  2024-04-22  5:21   ` Baolu Lu
@ 2024-04-23 17:00   ` Andrew Jones
  2 siblings, 0 replies; 30+ messages in thread
From: Andrew Jones @ 2024-04-23 17:00 UTC (permalink / raw)
  To: Tomasz Jeznach
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley,
	Anup Patel, devicetree, Conor Dooley, Albert Ou, linux,
	linux-kernel, Rob Herring, Sebastien Boeuf, iommu, Palmer Dabbelt,
	Nick Kossifidis, Krzysztof Kozlowski, linux-riscv

On Thu, Apr 18, 2024 at 09:32:25AM -0700, Tomasz Jeznach wrote:
...
> +static size_t riscv_iommu_unmap_pages(struct iommu_domain *iommu_domain,
> +				      unsigned long iova, size_t pgsize, size_t pgcount,
> +				      struct iommu_iotlb_gather *gather)
> +{
> +	struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> +	size_t size = pgcount << __ffs(pgsize);
> +	unsigned long *ptr, old;
> +	size_t unmapped = 0;
> +	size_t pte_size;
> +
> +	while (unmapped < size) {
> +		ptr = riscv_iommu_pte_fetch(domain, iova, &pte_size);
> +		if (!ptr)
> +			return unmapped;
> +
> +		/* partial unmap is not allowed, fail. */
> +		if (iova & ~(pte_size - 1))
                           ^ Shouldn't this ~ be removed?

> +			return unmapped;
> +
> +		old = READ_ONCE(*ptr);
> +		if (cmpxchg_relaxed(ptr, old, 0) != old)
> +			continue;
> +
> +		iommu_iotlb_gather_add_page(&domain->domain, gather, iova,
> +					    pte_size);
> +
> +		iova += pte_size;
> +		unmapped += pte_size;
> +	}
> +
> +	return unmapped;
> +}
> +

Thanks,
drew

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/7] iommu/riscv: Add RISC-V IOMMU platform device driver
  2024-04-18 21:22   ` Robin Murphy
@ 2024-04-24 21:59     ` Tomasz Jeznach
  2024-04-25 11:23       ` Robin Murphy
  0 siblings, 1 reply; 30+ messages in thread
From: Tomasz Jeznach @ 2024-04-24 21:59 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Joerg Roedel, Will Deacon, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux

On Thu, Apr 18, 2024 at 2:22 PM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 2024-04-18 5:32 pm, Tomasz Jeznach wrote:
> > Introduce platform device driver for implementation of RISC-V IOMMU
> > architected hardware.
> >
> > Hardware interface definition located in file iommu-bits.h is based on
> > ratified RISC-V IOMMU Architecture Specification version 1.0.0.
> >
> > This patch implements platform device initialization, early check and
> > configuration of the IOMMU interfaces and enables global pass-through
> > address translation mode (iommu_mode == BARE), without registering
> > hardware instance in the IOMMU subsystem.
> >
> > Link: https://github.com/riscv-non-isa/riscv-iommu
> > Co-developed-by: Nick Kossifidis <mick@ics.forth.gr>
> > Signed-off-by: Nick Kossifidis <mick@ics.forth.gr>
> > Co-developed-by: Sebastien Boeuf <seb@rivosinc.com>
> > Signed-off-by: Sebastien Boeuf <seb@rivosinc.com>
> > Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
> > ---
> >   MAINTAINERS                          |   6 +
> >   drivers/iommu/Kconfig                |   1 +
> >   drivers/iommu/Makefile               |   2 +-
> >   drivers/iommu/riscv/Kconfig          |  16 +
> >   drivers/iommu/riscv/Makefile         |   2 +
> >   drivers/iommu/riscv/iommu-bits.h     | 707 +++++++++++++++++++++++++++
> >   drivers/iommu/riscv/iommu-platform.c |  94 ++++
> >   drivers/iommu/riscv/iommu.c          |  89 ++++
> >   drivers/iommu/riscv/iommu.h          |  62 +++
> >   9 files changed, 978 insertions(+), 1 deletion(-)
> >   create mode 100644 drivers/iommu/riscv/Kconfig
> >   create mode 100644 drivers/iommu/riscv/Makefile
> >   create mode 100644 drivers/iommu/riscv/iommu-bits.h
> >   create mode 100644 drivers/iommu/riscv/iommu-platform.c
> >   create mode 100644 drivers/iommu/riscv/iommu.c
> >   create mode 100644 drivers/iommu/riscv/iommu.h
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 2657f9eae84c..051599c76585 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -18972,6 +18972,12 @@ L:   iommu@lists.linux.dev
> >   L:  linux-riscv@lists.infradead.org
> >   S:  Maintained
> >   F:  Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
> > +F:   drivers/iommu/riscv/Kconfig
> > +F:   drivers/iommu/riscv/Makefile
> > +F:   drivers/iommu/riscv/iommu-bits.h
> > +F:   drivers/iommu/riscv/iommu-platform.c
> > +F:   drivers/iommu/riscv/iommu.c
> > +F:   drivers/iommu/riscv/iommu.h
>
> I'm pretty sure a single "F: drivers/iommu/riscv/" pattern will suffice.
>

Correct. But will required a workaround for pretty naive MAINTAINERS update
check in scripts/checkpatch.pl:3014 in next patch.

> >   RISC-V MICROCHIP FPGA SUPPORT
> >   M:  Conor Dooley <conor.dooley@microchip.com>
> > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> > index 0af39bbbe3a3..ae762db0365e 100644
> > --- a/drivers/iommu/Kconfig
> > +++ b/drivers/iommu/Kconfig
> > @@ -195,6 +195,7 @@ config MSM_IOMMU
> >   source "drivers/iommu/amd/Kconfig"
> >   source "drivers/iommu/intel/Kconfig"
> >   source "drivers/iommu/iommufd/Kconfig"
> > +source "drivers/iommu/riscv/Kconfig"
> >
> >   config IRQ_REMAP
> >       bool "Support for Interrupt Remapping"
> > diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
> > index 542760d963ec..5e5a83c6c2aa 100644
> > --- a/drivers/iommu/Makefile
> > +++ b/drivers/iommu/Makefile
> > @@ -1,5 +1,5 @@
> >   # SPDX-License-Identifier: GPL-2.0
> > -obj-y += amd/ intel/ arm/ iommufd/
> > +obj-y += amd/ intel/ arm/ iommufd/ riscv/
> >   obj-$(CONFIG_IOMMU_API) += iommu.o
> >   obj-$(CONFIG_IOMMU_API) += iommu-traces.o
> >   obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
> > diff --git a/drivers/iommu/riscv/Kconfig b/drivers/iommu/riscv/Kconfig
> > new file mode 100644
> > index 000000000000..d02326bddb4c
> > --- /dev/null
> > +++ b/drivers/iommu/riscv/Kconfig
> > @@ -0,0 +1,16 @@
> > +# SPDX-License-Identifier: GPL-2.0-only
> > +# RISC-V IOMMU support
> > +
> > +config RISCV_IOMMU
> > +     def_bool y if RISCV && 64BIT && MMU
>
> Drop the dependencies here, they're already dependencies. However, also
> consider allowing users to configure this out without disabling
> IOMMU_SUPPORT entirely - I imagine other IOMMU implementations are going
> to end up paired wiothg RISC-V CPUs sooner or later. Furthermore, if
> it's a regular driver model driver, consider allowing it to build as a
> module. Not to mention that the help text below is rather pointless if
> there's no prompt offered in the first place.
>

Ack. Changed to optional builtin driver for now, also cleaned up all
module info.
I'll revisit proper module build later.

> > +     depends on RISCV && 64BIT && MMU
> > +     select DMA_OPS
>
> Drop this, you're not (and shouldn't be) architecture code implementing
> DMA ops.
>
> > +     select IOMMU_API
> > +     select IOMMU_IOVA
>
> Drop this, you're not using the IOVA library either.
>
> > +     help
> > +       Support for implementations of the RISC-V IOMMU architecture that
> > +       complements the RISC-V MMU capabilities, providing similar address
> > +       translation and protection functions for accesses from I/O devices.
> > +
> > +       Say Y here if your SoC includes an IOMMU device implementing
> > +       the RISC-V IOMMU architecture.
> > diff --git a/drivers/iommu/riscv/Makefile b/drivers/iommu/riscv/Makefile
> > new file mode 100644
> > index 000000000000..e4c189de58d3
> > --- /dev/null
> > +++ b/drivers/iommu/riscv/Makefile
> > @@ -0,0 +1,2 @@
> > +# SPDX-License-Identifier: GPL-2.0-only
> > +obj-$(CONFIG_RISCV_IOMMU) += iommu.o iommu-platform.o
> > diff --git a/drivers/iommu/riscv/iommu-bits.h b/drivers/iommu/riscv/iommu-bits.h
> > new file mode 100644
> > index 000000000000..ba093c29de9f
> > --- /dev/null
> > +++ b/drivers/iommu/riscv/iommu-bits.h
> > @@ -0,0 +1,707 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Copyright © 2022-2024 Rivos Inc.
> > + * Copyright © 2023 FORTH-ICS/CARV
> > + * Copyright © 2023 RISC-V IOMMU Task Group
> > + *
> > + * RISC-V IOMMU - Register Layout and Data Structures.
> > + *
> > + * Based on the 'RISC-V IOMMU Architecture Specification', Version 1.0
> > + * Published at  https://github.com/riscv-non-isa/riscv-iommu
> > + *
> > + */
> > +
> > +#ifndef _RISCV_IOMMU_BITS_H_
> > +#define _RISCV_IOMMU_BITS_H_
> > +
> > +#include <linux/types.h>
> > +#include <linux/bitfield.h>
> > +#include <linux/bits.h>
> > +
> > +/*
> > + * Chapter 5: Memory Mapped register interface
> > + */
> > +
> > +/* Common field positions */
> > +#define RISCV_IOMMU_PPN_FIELD                GENMASK_ULL(53, 10)
> > +#define RISCV_IOMMU_QUEUE_LOGSZ_FIELD        GENMASK_ULL(4, 0)
> > +#define RISCV_IOMMU_QUEUE_INDEX_FIELD        GENMASK_ULL(31, 0)
> > +#define RISCV_IOMMU_QUEUE_ENABLE     BIT(0)
> > +#define RISCV_IOMMU_QUEUE_INTR_ENABLE        BIT(1)
> > +#define RISCV_IOMMU_QUEUE_MEM_FAULT  BIT(8)
> > +#define RISCV_IOMMU_QUEUE_OVERFLOW   BIT(9)
> > +#define RISCV_IOMMU_QUEUE_ACTIVE     BIT(16)
> > +#define RISCV_IOMMU_QUEUE_BUSY               BIT(17)
> > +
> > +#define RISCV_IOMMU_ATP_PPN_FIELD    GENMASK_ULL(43, 0)
> > +#define RISCV_IOMMU_ATP_MODE_FIELD   GENMASK_ULL(63, 60)
> > +
> > +/* 5.3 IOMMU Capabilities (64bits) */
> > +#define RISCV_IOMMU_REG_CAP          0x0000
> > +#define RISCV_IOMMU_CAP_VERSION              GENMASK_ULL(7, 0)
> > +#define RISCV_IOMMU_CAP_S_SV32               BIT_ULL(8)
> > +#define RISCV_IOMMU_CAP_S_SV39               BIT_ULL(9)
> > +#define RISCV_IOMMU_CAP_S_SV48               BIT_ULL(10)
> > +#define RISCV_IOMMU_CAP_S_SV57               BIT_ULL(11)
> > +#define RISCV_IOMMU_CAP_SVPBMT               BIT_ULL(15)
> > +#define RISCV_IOMMU_CAP_G_SV32               BIT_ULL(16)
> > +#define RISCV_IOMMU_CAP_G_SV39               BIT_ULL(17)
> > +#define RISCV_IOMMU_CAP_G_SV48               BIT_ULL(18)
> > +#define RISCV_IOMMU_CAP_G_SV57               BIT_ULL(19)
> > +#define RISCV_IOMMU_CAP_AMO_MRIF     BIT_ULL(21)
> > +#define RISCV_IOMMU_CAP_MSI_FLAT     BIT_ULL(22)
> > +#define RISCV_IOMMU_CAP_MSI_MRIF     BIT_ULL(23)
> > +#define RISCV_IOMMU_CAP_AMO_HWAD     BIT_ULL(24)
> > +#define RISCV_IOMMU_CAP_ATS          BIT_ULL(25)
> > +#define RISCV_IOMMU_CAP_T2GPA                BIT_ULL(26)
> > +#define RISCV_IOMMU_CAP_END          BIT_ULL(27)
> > +#define RISCV_IOMMU_CAP_IGS          GENMASK_ULL(29, 28)
> > +#define RISCV_IOMMU_CAP_HPM          BIT_ULL(30)
> > +#define RISCV_IOMMU_CAP_DBG          BIT_ULL(31)
> > +#define RISCV_IOMMU_CAP_PAS          GENMASK_ULL(37, 32)
> > +#define RISCV_IOMMU_CAP_PD8          BIT_ULL(38)
> > +#define RISCV_IOMMU_CAP_PD17         BIT_ULL(39)
> > +#define RISCV_IOMMU_CAP_PD20         BIT_ULL(40)
> > +
> > +#define RISCV_IOMMU_CAP_VERSION_VER_MASK     0xF0
> > +#define RISCV_IOMMU_CAP_VERSION_REV_MASK     0x0F
> > +
> > +/**
> > + * enum riscv_iommu_igs_settings - Interrupt Generation Support Settings
> > + * @RISCV_IOMMU_CAP_IGS_MSI: I/O MMU supports only MSI generation
> > + * @RISCV_IOMMU_CAP_IGS_WSI: I/O MMU supports only Wired-Signaled interrupt
> > + * @RISCV_IOMMU_CAP_IGS_BOTH: I/O MMU supports both MSI and WSI generation
> > + * @RISCV_IOMMU_CAP_IGS_RSRV: Reserved for standard use
> > + */
> > +enum riscv_iommu_igs_settings {
> > +     RISCV_IOMMU_CAP_IGS_MSI = 0,
> > +     RISCV_IOMMU_CAP_IGS_WSI = 1,
> > +     RISCV_IOMMU_CAP_IGS_BOTH = 2,
> > +     RISCV_IOMMU_CAP_IGS_RSRV = 3
> > +};
> > +
> > +/* 5.4 Features control register (32bits) */
> > +#define RISCV_IOMMU_REG_FCTL         0x0008
> > +#define RISCV_IOMMU_FCTL_BE          BIT(0)
> > +#define RISCV_IOMMU_FCTL_WSI         BIT(1)
> > +#define RISCV_IOMMU_FCTL_GXL         BIT(2)
> > +
> > +/* 5.5 Device-directory-table pointer (64bits) */
> > +#define RISCV_IOMMU_REG_DDTP         0x0010
> > +#define RISCV_IOMMU_DDTP_MODE                GENMASK_ULL(3, 0)
> > +#define RISCV_IOMMU_DDTP_BUSY                BIT_ULL(4)
> > +#define RISCV_IOMMU_DDTP_PPN         RISCV_IOMMU_PPN_FIELD
> > +
> > +/**
> > + * enum riscv_iommu_ddtp_modes - I/O MMU translation modes
> > + * @RISCV_IOMMU_DDTP_MODE_OFF: No inbound transactions allowed
> > + * @RISCV_IOMMU_DDTP_MODE_BARE: Pass-through mode
> > + * @RISCV_IOMMU_DDTP_MODE_1LVL: One-level DDT
> > + * @RISCV_IOMMU_DDTP_MODE_2LVL: Two-level DDT
> > + * @RISCV_IOMMU_DDTP_MODE_3LVL: Three-level DDT
> > + * @RISCV_IOMMU_DDTP_MODE_MAX: Max value allowed by specification
> > + */
> > +enum riscv_iommu_ddtp_modes {
> > +     RISCV_IOMMU_DDTP_MODE_OFF = 0,
> > +     RISCV_IOMMU_DDTP_MODE_BARE = 1,
> > +     RISCV_IOMMU_DDTP_MODE_1LVL = 2,
> > +     RISCV_IOMMU_DDTP_MODE_2LVL = 3,
> > +     RISCV_IOMMU_DDTP_MODE_3LVL = 4,
> > +     RISCV_IOMMU_DDTP_MODE_MAX = 4
> > +};
> > +
> > +/* 5.6 Command Queue Base (64bits) */
> > +#define RISCV_IOMMU_REG_CQB          0x0018
> > +#define RISCV_IOMMU_CQB_ENTRIES              RISCV_IOMMU_QUEUE_LOGSZ_FIELD
> > +#define RISCV_IOMMU_CQB_PPN          RISCV_IOMMU_PPN_FIELD
> > +
> > +/* 5.7 Command Queue head (32bits) */
> > +#define RISCV_IOMMU_REG_CQH          0x0020
> > +#define RISCV_IOMMU_CQH_INDEX                RISCV_IOMMU_QUEUE_INDEX_FIELD
> > +
> > +/* 5.8 Command Queue tail (32bits) */
> > +#define RISCV_IOMMU_REG_CQT          0x0024
> > +#define RISCV_IOMMU_CQT_INDEX                RISCV_IOMMU_QUEUE_INDEX_FIELD
> > +
> > +/* 5.9 Fault Queue Base (64bits) */
> > +#define RISCV_IOMMU_REG_FQB          0x0028
> > +#define RISCV_IOMMU_FQB_ENTRIES              RISCV_IOMMU_QUEUE_LOGSZ_FIELD
> > +#define RISCV_IOMMU_FQB_PPN          RISCV_IOMMU_PPN_FIELD
> > +
> > +/* 5.10 Fault Queue Head (32bits) */
> > +#define RISCV_IOMMU_REG_FQH          0x0030
> > +#define RISCV_IOMMU_FQH_INDEX                RISCV_IOMMU_QUEUE_INDEX_FIELD
> > +
> > +/* 5.11 Fault Queue tail (32bits) */
> > +#define RISCV_IOMMU_REG_FQT          0x0034
> > +#define RISCV_IOMMU_FQT_INDEX                RISCV_IOMMU_QUEUE_INDEX_FIELD
> > +
> > +/* 5.12 Page Request Queue base (64bits) */
> > +#define RISCV_IOMMU_REG_PQB          0x0038
> > +#define RISCV_IOMMU_PQB_ENTRIES              RISCV_IOMMU_QUEUE_LOGSZ_FIELD
> > +#define RISCV_IOMMU_PQB_PPN          RISCV_IOMMU_PPN_FIELD
> > +
> > +/* 5.13 Page Request Queue head (32bits) */
> > +#define RISCV_IOMMU_REG_PQH          0x0040
> > +#define RISCV_IOMMU_PQH_INDEX                RISCV_IOMMU_QUEUE_INDEX_FIELD
> > +
> > +/* 5.14 Page Request Queue tail (32bits) */
> > +#define RISCV_IOMMU_REG_PQT          0x0044
> > +#define RISCV_IOMMU_PQT_INDEX_MASK   RISCV_IOMMU_QUEUE_INDEX_FIELD
> > +
> > +/* 5.15 Command Queue CSR (32bits) */
> > +#define RISCV_IOMMU_REG_CQCSR                0x0048
> > +#define RISCV_IOMMU_CQCSR_CQEN               RISCV_IOMMU_QUEUE_ENABLE
> > +#define RISCV_IOMMU_CQCSR_CIE                RISCV_IOMMU_QUEUE_INTR_ENABLE
> > +#define RISCV_IOMMU_CQCSR_CQMF               RISCV_IOMMU_QUEUE_MEM_FAULT
> > +#define RISCV_IOMMU_CQCSR_CMD_TO     BIT(9)
> > +#define RISCV_IOMMU_CQCSR_CMD_ILL    BIT(10)
> > +#define RISCV_IOMMU_CQCSR_FENCE_W_IP BIT(11)
> > +#define RISCV_IOMMU_CQCSR_CQON               RISCV_IOMMU_QUEUE_ACTIVE
> > +#define RISCV_IOMMU_CQCSR_BUSY               RISCV_IOMMU_QUEUE_BUSY
> > +
> > +/* 5.16 Fault Queue CSR (32bits) */
> > +#define RISCV_IOMMU_REG_FQCSR                0x004C
> > +#define RISCV_IOMMU_FQCSR_FQEN               RISCV_IOMMU_QUEUE_ENABLE
> > +#define RISCV_IOMMU_FQCSR_FIE                RISCV_IOMMU_QUEUE_INTR_ENABLE
> > +#define RISCV_IOMMU_FQCSR_FQMF               RISCV_IOMMU_QUEUE_MEM_FAULT
> > +#define RISCV_IOMMU_FQCSR_FQOF               RISCV_IOMMU_QUEUE_OVERFLOW
> > +#define RISCV_IOMMU_FQCSR_FQON               RISCV_IOMMU_QUEUE_ACTIVE
> > +#define RISCV_IOMMU_FQCSR_BUSY               RISCV_IOMMU_QUEUE_BUSY
> > +
> > +/* 5.17 Page Request Queue CSR (32bits) */
> > +#define RISCV_IOMMU_REG_PQCSR                0x0050
> > +#define RISCV_IOMMU_PQCSR_PQEN               RISCV_IOMMU_QUEUE_ENABLE
> > +#define RISCV_IOMMU_PQCSR_PIE                RISCV_IOMMU_QUEUE_INTR_ENABLE
> > +#define RISCV_IOMMU_PQCSR_PQMF               RISCV_IOMMU_QUEUE_MEM_FAULT
> > +#define RISCV_IOMMU_PQCSR_PQOF               RISCV_IOMMU_QUEUE_OVERFLOW
> > +#define RISCV_IOMMU_PQCSR_PQON               RISCV_IOMMU_QUEUE_ACTIVE
> > +#define RISCV_IOMMU_PQCSR_BUSY               RISCV_IOMMU_QUEUE_BUSY
> > +
> > +/* 5.18 Interrupt Pending Status (32bits) */
> > +#define RISCV_IOMMU_REG_IPSR         0x0054
> > +
> > +#define RISCV_IOMMU_INTR_CQ          0
> > +#define RISCV_IOMMU_INTR_FQ          1
> > +#define RISCV_IOMMU_INTR_PM          2
> > +#define RISCV_IOMMU_INTR_PQ          3
> > +#define RISCV_IOMMU_INTR_COUNT               4
> > +
> > +#define RISCV_IOMMU_IPSR_CIP         BIT(RISCV_IOMMU_INTR_CQ)
> > +#define RISCV_IOMMU_IPSR_FIP         BIT(RISCV_IOMMU_INTR_FQ)
> > +#define RISCV_IOMMU_IPSR_PMIP                BIT(RISCV_IOMMU_INTR_PM)
> > +#define RISCV_IOMMU_IPSR_PIP         BIT(RISCV_IOMMU_INTR_PQ)
> > +
> > +/* 5.19 Performance monitoring counter overflow status (32bits) */
> > +#define RISCV_IOMMU_REG_IOCOUNTOVF   0x0058
> > +#define RISCV_IOMMU_IOCOUNTOVF_CY    BIT(0)
> > +#define RISCV_IOMMU_IOCOUNTOVF_HPM   GENMASK_ULL(31, 1)
> > +
> > +/* 5.20 Performance monitoring counter inhibits (32bits) */
> > +#define RISCV_IOMMU_REG_IOCOUNTINH   0x005C
> > +#define RISCV_IOMMU_IOCOUNTINH_CY    BIT(0)
> > +#define RISCV_IOMMU_IOCOUNTINH_HPM   GENMASK(31, 1)
> > +
> > +/* 5.21 Performance monitoring cycles counter (64bits) */
> > +#define RISCV_IOMMU_REG_IOHPMCYCLES     0x0060
> > +#define RISCV_IOMMU_IOHPMCYCLES_COUNTER      GENMASK_ULL(62, 0)
> > +#define RISCV_IOMMU_IOHPMCYCLES_OVF  BIT_ULL(63)
> > +
> > +/* 5.22 Performance monitoring event counters (31 * 64bits) */
> > +#define RISCV_IOMMU_REG_IOHPMCTR_BASE        0x0068
> > +#define RISCV_IOMMU_REG_IOHPMCTR(_n) (RISCV_IOMMU_REG_IOHPMCTR_BASE + ((_n) * 0x8))
> > +
> > +/* 5.23 Performance monitoring event selectors (31 * 64bits) */
> > +#define RISCV_IOMMU_REG_IOHPMEVT_BASE        0x0160
> > +#define RISCV_IOMMU_REG_IOHPMEVT(_n) (RISCV_IOMMU_REG_IOHPMEVT_BASE + ((_n) * 0x8))
> > +#define RISCV_IOMMU_IOHPMEVT_CNT     31
> > +#define RISCV_IOMMU_IOHPMEVT_EVENT_ID        GENMASK_ULL(14, 0)
> > +#define RISCV_IOMMU_IOHPMEVT_DMASK   BIT_ULL(15)
> > +#define RISCV_IOMMU_IOHPMEVT_PID_PSCID       GENMASK_ULL(35, 16)
> > +#define RISCV_IOMMU_IOHPMEVT_DID_GSCID       GENMASK_ULL(59, 36)
> > +#define RISCV_IOMMU_IOHPMEVT_PV_PSCV BIT_ULL(60)
> > +#define RISCV_IOMMU_IOHPMEVT_DV_GSCV BIT_ULL(61)
> > +#define RISCV_IOMMU_IOHPMEVT_IDT     BIT_ULL(62)
> > +#define RISCV_IOMMU_IOHPMEVT_OF              BIT_ULL(63)
> > +
> > +/**
> > + * enum riscv_iommu_hpmevent_id - Performance-monitoring event identifier
> > + *
> > + * @RISCV_IOMMU_HPMEVENT_INVALID: Invalid event, do not count
> > + * @RISCV_IOMMU_HPMEVENT_URQ: Untranslated requests
> > + * @RISCV_IOMMU_HPMEVENT_TRQ: Translated requests
> > + * @RISCV_IOMMU_HPMEVENT_ATS_RQ: ATS translation requests
> > + * @RISCV_IOMMU_HPMEVENT_TLB_MISS: TLB misses
> > + * @RISCV_IOMMU_HPMEVENT_DD_WALK: Device directory walks
> > + * @RISCV_IOMMU_HPMEVENT_PD_WALK: Process directory walks
> > + * @RISCV_IOMMU_HPMEVENT_S_VS_WALKS: S/VS-Stage page table walks
> > + * @RISCV_IOMMU_HPMEVENT_G_WALKS: G-Stage page table walks
> > + * @RISCV_IOMMU_HPMEVENT_MAX: Value to denote maximum Event IDs
> > + */
> > +enum riscv_iommu_hpmevent_id {
> > +     RISCV_IOMMU_HPMEVENT_INVALID    = 0,
> > +     RISCV_IOMMU_HPMEVENT_URQ        = 1,
> > +     RISCV_IOMMU_HPMEVENT_TRQ        = 2,
> > +     RISCV_IOMMU_HPMEVENT_ATS_RQ     = 3,
> > +     RISCV_IOMMU_HPMEVENT_TLB_MISS   = 4,
> > +     RISCV_IOMMU_HPMEVENT_DD_WALK    = 5,
> > +     RISCV_IOMMU_HPMEVENT_PD_WALK    = 6,
> > +     RISCV_IOMMU_HPMEVENT_S_VS_WALKS = 7,
> > +     RISCV_IOMMU_HPMEVENT_G_WALKS    = 8,
> > +     RISCV_IOMMU_HPMEVENT_MAX        = 9
> > +};
> > +
> > +/* 5.24 Translation request IOVA (64bits) */
> > +#define RISCV_IOMMU_REG_TR_REQ_IOVA     0x0258
> > +#define RISCV_IOMMU_TR_REQ_IOVA_VPN  GENMASK_ULL(63, 12)
> > +
> > +/* 5.25 Translation request control (64bits) */
> > +#define RISCV_IOMMU_REG_TR_REQ_CTL   0x0260
> > +#define RISCV_IOMMU_TR_REQ_CTL_GO_BUSY       BIT_ULL(0)
> > +#define RISCV_IOMMU_TR_REQ_CTL_PRIV  BIT_ULL(1)
> > +#define RISCV_IOMMU_TR_REQ_CTL_EXE   BIT_ULL(2)
> > +#define RISCV_IOMMU_TR_REQ_CTL_NW    BIT_ULL(3)
> > +#define RISCV_IOMMU_TR_REQ_CTL_PID   GENMASK_ULL(31, 12)
> > +#define RISCV_IOMMU_TR_REQ_CTL_PV    BIT_ULL(32)
> > +#define RISCV_IOMMU_TR_REQ_CTL_DID   GENMASK_ULL(63, 40)
> > +
> > +/* 5.26 Translation request response (64bits) */
> > +#define RISCV_IOMMU_REG_TR_RESPONSE  0x0268
> > +#define RISCV_IOMMU_TR_RESPONSE_FAULT        BIT_ULL(0)
> > +#define RISCV_IOMMU_TR_RESPONSE_PBMT GENMASK_ULL(8, 7)
> > +#define RISCV_IOMMU_TR_RESPONSE_SZ   BIT_ULL(9)
> > +#define RISCV_IOMMU_TR_RESPONSE_PPN  RISCV_IOMMU_PPN_FIELD
> > +
> > +/* 5.27 Interrupt cause to vector (64bits) */
> > +#define RISCV_IOMMU_REG_IVEC         0x02F8
> > +#define RISCV_IOMMU_IVEC_CIV         GENMASK_ULL(3, 0)
> > +#define RISCV_IOMMU_IVEC_FIV         GENMASK_ULL(7, 4)
> > +#define RISCV_IOMMU_IVEC_PMIV                GENMASK_ULL(11, 8)
> > +#define RISCV_IOMMU_IVEC_PIV         GENMASK_ULL(15, 12)
> > +
> > +/* 5.28 MSI Configuration table (32 * 64bits) */
> > +#define RISCV_IOMMU_REG_MSI_CONFIG   0x0300
> > +#define RISCV_IOMMU_REG_MSI_ADDR(_n) (RISCV_IOMMU_REG_MSI_CONFIG + ((_n) * 0x10))
> > +#define RISCV_IOMMU_MSI_ADDR         GENMASK_ULL(55, 2)
> > +#define RISCV_IOMMU_REG_MSI_DATA(_n) (RISCV_IOMMU_REG_MSI_CONFIG + ((_n) * 0x10) + 0x08)
> > +#define RISCV_IOMMU_MSI_DATA         GENMASK_ULL(31, 0)
> > +#define RISCV_IOMMU_REG_MSI_VEC_CTL(_n)      (RISCV_IOMMU_REG_MSI_CONFIG + ((_n) * 0x10) + 0x0C)
> > +#define RISCV_IOMMU_MSI_VEC_CTL_M    BIT_ULL(0)
> > +
> > +#define RISCV_IOMMU_REG_SIZE 0x1000
> > +
> > +/*
> > + * Chapter 2: Data structures
> > + */
> > +
> > +/*
> > + * Device Directory Table macros for non-leaf nodes
> > + */
> > +#define RISCV_IOMMU_DDTE_VALID       BIT_ULL(0)
> > +#define RISCV_IOMMU_DDTE_PPN RISCV_IOMMU_PPN_FIELD
> > +
> > +/**
> > + * struct riscv_iommu_dc - Device Context
> > + * @tc: Translation Control
> > + * @iohgatp: I/O Hypervisor guest address translation and protection
> > + *        (Second stage context)
> > + * @ta: Translation Attributes
> > + * @fsc: First stage context
> > + * @msiptp: MSI page table pointer
> > + * @msi_addr_mask: MSI address mask
> > + * @msi_addr_pattern: MSI address pattern
> > + * @_reserved: Reserved for future use, padding
> > + *
> > + * This structure is used for leaf nodes on the Device Directory Table,
> > + * in case RISCV_IOMMU_CAP_MSI_FLAT is not set, the bottom 4 fields are
> > + * not present and are skipped with pointer arithmetic to avoid
> > + * casting, check out riscv_iommu_get_dc().
> > + * See section 2.1 for more details
> > + */
> > +struct riscv_iommu_dc {
> > +     u64 tc;
> > +     u64 iohgatp;
> > +     u64 ta;
> > +     u64 fsc;
> > +     u64 msiptp;
> > +     u64 msi_addr_mask;
> > +     u64 msi_addr_pattern;
> > +     u64 _reserved;
> > +};
> > +
> > +/* Translation control fields */
> > +#define RISCV_IOMMU_DC_TC_V          BIT_ULL(0)
> > +#define RISCV_IOMMU_DC_TC_EN_ATS     BIT_ULL(1)
> > +#define RISCV_IOMMU_DC_TC_EN_PRI     BIT_ULL(2)
> > +#define RISCV_IOMMU_DC_TC_T2GPA              BIT_ULL(3)
> > +#define RISCV_IOMMU_DC_TC_DTF                BIT_ULL(4)
> > +#define RISCV_IOMMU_DC_TC_PDTV               BIT_ULL(5)
> > +#define RISCV_IOMMU_DC_TC_PRPR               BIT_ULL(6)
> > +#define RISCV_IOMMU_DC_TC_GADE               BIT_ULL(7)
> > +#define RISCV_IOMMU_DC_TC_SADE               BIT_ULL(8)
> > +#define RISCV_IOMMU_DC_TC_DPE                BIT_ULL(9)
> > +#define RISCV_IOMMU_DC_TC_SBE                BIT_ULL(10)
> > +#define RISCV_IOMMU_DC_TC_SXL                BIT_ULL(11)
> > +
> > +/* Second-stage (aka G-stage) context fields */
> > +#define RISCV_IOMMU_DC_IOHGATP_PPN   RISCV_IOMMU_ATP_PPN_FIELD
> > +#define RISCV_IOMMU_DC_IOHGATP_GSCID GENMASK_ULL(59, 44)
> > +#define RISCV_IOMMU_DC_IOHGATP_MODE  RISCV_IOMMU_ATP_MODE_FIELD
> > +
> > +/**
> > + * enum riscv_iommu_dc_iohgatp_modes - Guest address translation/protection modes
> > + * @RISCV_IOMMU_DC_IOHGATP_MODE_BARE: No translation/protection
> > + * @RISCV_IOMMU_DC_IOHGATP_MODE_SV32X4: Sv32x4 (2-bit extension of Sv32), when fctl.GXL == 1
> > + * @RISCV_IOMMU_DC_IOHGATP_MODE_SV39X4: Sv39x4 (2-bit extension of Sv39), when fctl.GXL == 0
> > + * @RISCV_IOMMU_DC_IOHGATP_MODE_SV48X4: Sv48x4 (2-bit extension of Sv48), when fctl.GXL == 0
> > + * @RISCV_IOMMU_DC_IOHGATP_MODE_SV57X4: Sv57x4 (2-bit extension of Sv57), when fctl.GXL == 0
> > + */
> > +enum riscv_iommu_dc_iohgatp_modes {
> > +     RISCV_IOMMU_DC_IOHGATP_MODE_BARE = 0,
> > +     RISCV_IOMMU_DC_IOHGATP_MODE_SV32X4 = 8,
> > +     RISCV_IOMMU_DC_IOHGATP_MODE_SV39X4 = 8,
> > +     RISCV_IOMMU_DC_IOHGATP_MODE_SV48X4 = 9,
> > +     RISCV_IOMMU_DC_IOHGATP_MODE_SV57X4 = 10
> > +};
> > +
> > +/* Translation attributes fields */
> > +#define RISCV_IOMMU_DC_TA_PSCID              GENMASK_ULL(31, 12)
> > +
> > +/* First-stage context fields */
> > +#define RISCV_IOMMU_DC_FSC_PPN               RISCV_IOMMU_ATP_PPN_FIELD
> > +#define RISCV_IOMMU_DC_FSC_MODE              RISCV_IOMMU_ATP_MODE_FIELD
> > +
> > +/**
> > + * enum riscv_iommu_dc_fsc_atp_modes - First stage address translation/protection modes
> > + * @RISCV_IOMMU_DC_FSC_MODE_BARE: No translation/protection
> > + * @RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV32: Sv32, when dc.tc.SXL == 1
> > + * @RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39: Sv39, when dc.tc.SXL == 0
> > + * @RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV48: Sv48, when dc.tc.SXL == 0
> > + * @RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57: Sv57, when dc.tc.SXL == 0
> > + * @RISCV_IOMMU_DC_FSC_PDTP_MODE_PD8: 1lvl PDT, 8bit process ids
> > + * @RISCV_IOMMU_DC_FSC_PDTP_MODE_PD17: 2lvl PDT, 17bit process ids
> > + * @RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20: 3lvl PDT, 20bit process ids
> > + *
> > + * FSC holds IOSATP when RISCV_IOMMU_DC_TC_PDTV is 0 and PDTP otherwise.
> > + * IOSATP controls the first stage address translation (same as the satp register on
> > + * the RISC-V MMU), and PDTP holds the process directory table, used to select a
> > + * first stage page table based on a process id (for devices that support multiple
> > + * process ids).
> > + */
> > +enum riscv_iommu_dc_fsc_atp_modes {
> > +     RISCV_IOMMU_DC_FSC_MODE_BARE = 0,
> > +     RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV32 = 8,
> > +     RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV39 = 8,
> > +     RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV48 = 9,
> > +     RISCV_IOMMU_DC_FSC_IOSATP_MODE_SV57 = 10,
> > +     RISCV_IOMMU_DC_FSC_PDTP_MODE_PD8 = 1,
> > +     RISCV_IOMMU_DC_FSC_PDTP_MODE_PD17 = 2,
> > +     RISCV_IOMMU_DC_FSC_PDTP_MODE_PD20 = 3
> > +};
> > +
> > +/* MSI page table pointer */
> > +#define RISCV_IOMMU_DC_MSIPTP_PPN    RISCV_IOMMU_ATP_PPN_FIELD
> > +#define RISCV_IOMMU_DC_MSIPTP_MODE   RISCV_IOMMU_ATP_MODE_FIELD
> > +#define RISCV_IOMMU_DC_MSIPTP_MODE_OFF       0
> > +#define RISCV_IOMMU_DC_MSIPTP_MODE_FLAT      1
> > +
> > +/* MSI address mask */
> > +#define RISCV_IOMMU_DC_MSI_ADDR_MASK GENMASK_ULL(51, 0)
> > +
> > +/* MSI address pattern */
> > +#define RISCV_IOMMU_DC_MSI_PATTERN   GENMASK_ULL(51, 0)
> > +
> > +/**
> > + * struct riscv_iommu_pc - Process Context
> > + * @ta: Translation Attributes
> > + * @fsc: First stage context
> > + *
> > + * This structure is used for leaf nodes on the Process Directory Table
> > + * See section 2.3 for more details
> > + */
> > +struct riscv_iommu_pc {
> > +     u64 ta;
> > +     u64 fsc;
> > +};
> > +
> > +/* Translation attributes fields */
> > +#define RISCV_IOMMU_PC_TA_V  BIT_ULL(0)
> > +#define RISCV_IOMMU_PC_TA_ENS        BIT_ULL(1)
> > +#define RISCV_IOMMU_PC_TA_SUM        BIT_ULL(2)
> > +#define RISCV_IOMMU_PC_TA_PSCID      GENMASK_ULL(31, 12)
> > +
> > +/* First stage context fields */
> > +#define RISCV_IOMMU_PC_FSC_PPN       RISCV_IOMMU_ATP_PPN_FIELD
> > +#define RISCV_IOMMU_PC_FSC_MODE      RISCV_IOMMU_ATP_MODE_FIELD
> > +
> > +/*
> > + * Chapter 3: In-memory queue interface
> > + */
> > +
> > +/**
> > + * struct riscv_iommu_command - Generic I/O MMU command structure
> > + * @dword0: Includes the opcode and the function identifier
> > + * @dword1: Opcode specific data
> > + *
> > + * The commands are interpreted as two 64bit fields, where the first
> > + * 7bits of the first field are the opcode which also defines the
> > + * command's format, followed by a 3bit field that specifies the
> > + * function invoked by that command, and the rest is opcode-specific.
> > + * This is a generic struct which will be populated differently
> > + * according to each command. For more infos on the commands and
> > + * the command queue check section 3.1.
> > + */
> > +struct riscv_iommu_command {
> > +     u64 dword0;
> > +     u64 dword1;
> > +};
> > +
> > +/* Fields on dword0, common for all commands */
> > +#define RISCV_IOMMU_CMD_OPCODE       GENMASK_ULL(6, 0)
> > +#define      RISCV_IOMMU_CMD_FUNC    GENMASK_ULL(9, 7)
> > +
> > +/* 3.1.1 I/O MMU Page-table cache invalidation */
> > +/* Fields on dword0 */
> > +#define RISCV_IOMMU_CMD_IOTINVAL_OPCODE              1
> > +#define RISCV_IOMMU_CMD_IOTINVAL_FUNC_VMA    0
> > +#define RISCV_IOMMU_CMD_IOTINVAL_FUNC_GVMA   1
> > +#define RISCV_IOMMU_CMD_IOTINVAL_AV          BIT_ULL(10)
> > +#define RISCV_IOMMU_CMD_IOTINVAL_PSCID               GENMASK_ULL(31, 12)
> > +#define RISCV_IOMMU_CMD_IOTINVAL_PSCV                BIT_ULL(32)
> > +#define RISCV_IOMMU_CMD_IOTINVAL_GV          BIT_ULL(33)
> > +#define RISCV_IOMMU_CMD_IOTINVAL_GSCID               GENMASK_ULL(59, 44)
> > +/* dword1[61:10] is the 4K-aligned page address */
> > +#define RISCV_IOMMU_CMD_IOTINVAL_ADDR                GENMASK_ULL(61, 10)
> > +
> > +/* 3.1.2 I/O MMU Command Queue Fences */
> > +/* Fields on dword0 */
> > +#define RISCV_IOMMU_CMD_IOFENCE_OPCODE               2
> > +#define RISCV_IOMMU_CMD_IOFENCE_FUNC_C               0
> > +#define RISCV_IOMMU_CMD_IOFENCE_AV           BIT_ULL(10)
> > +#define RISCV_IOMMU_CMD_IOFENCE_WSI          BIT_ULL(11)
> > +#define RISCV_IOMMU_CMD_IOFENCE_PR           BIT_ULL(12)
> > +#define RISCV_IOMMU_CMD_IOFENCE_PW           BIT_ULL(13)
> > +#define RISCV_IOMMU_CMD_IOFENCE_DATA         GENMASK_ULL(63, 32)
> > +/* dword1 is the address, word-size aligned and shifted to the right by two bits. */
> > +
> > +/* 3.1.3 I/O MMU Directory cache invalidation */
> > +/* Fields on dword0 */
> > +#define RISCV_IOMMU_CMD_IODIR_OPCODE         3
> > +#define RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_DDT 0
> > +#define RISCV_IOMMU_CMD_IODIR_FUNC_INVAL_PDT 1
> > +#define RISCV_IOMMU_CMD_IODIR_PID            GENMASK_ULL(31, 12)
> > +#define RISCV_IOMMU_CMD_IODIR_DV             BIT_ULL(33)
> > +#define RISCV_IOMMU_CMD_IODIR_DID            GENMASK_ULL(63, 40)
> > +/* dword1 is reserved for standard use */
> > +
> > +/* 3.1.4 I/O MMU PCIe ATS */
> > +/* Fields on dword0 */
> > +#define RISCV_IOMMU_CMD_ATS_OPCODE           4
> > +#define RISCV_IOMMU_CMD_ATS_FUNC_INVAL               0
> > +#define RISCV_IOMMU_CMD_ATS_FUNC_PRGR                1
> > +#define RISCV_IOMMU_CMD_ATS_PID                      GENMASK_ULL(31, 12)
> > +#define RISCV_IOMMU_CMD_ATS_PV                       BIT_ULL(32)
> > +#define RISCV_IOMMU_CMD_ATS_DSV                      BIT_ULL(33)
> > +#define RISCV_IOMMU_CMD_ATS_RID                      GENMASK_ULL(55, 40)
> > +#define RISCV_IOMMU_CMD_ATS_DSEG             GENMASK_ULL(63, 56)
> > +/* dword1 is the ATS payload, two different payload types for INVAL and PRGR */
> > +
> > +/* ATS.INVAL payload*/
> > +#define RISCV_IOMMU_CMD_ATS_INVAL_G          BIT_ULL(0)
> > +/* Bits 1 - 10 are zeroed */
> > +#define RISCV_IOMMU_CMD_ATS_INVAL_S          BIT_ULL(11)
> > +#define RISCV_IOMMU_CMD_ATS_INVAL_UADDR              GENMASK_ULL(63, 12)
> > +
> > +/* ATS.PRGR payload */
> > +/* Bits 0 - 31 are zeroed */
> > +#define RISCV_IOMMU_CMD_ATS_PRGR_PRG_INDEX   GENMASK_ULL(40, 32)
> > +/* Bits 41 - 43 are zeroed */
> > +#define RISCV_IOMMU_CMD_ATS_PRGR_RESP_CODE   GENMASK_ULL(47, 44)
> > +#define RISCV_IOMMU_CMD_ATS_PRGR_DST_ID              GENMASK_ULL(63, 48)
> > +
> > +/**
> > + * struct riscv_iommu_fq_record - Fault/Event Queue Record
> > + * @hdr: Header, includes fault/event cause, PID/DID, transaction type etc
> > + * @_reserved: Low 32bits for custom use, high 32bits for standard use
> > + * @iotval: Transaction-type/cause specific format
> > + * @iotval2: Cause specific format
> > + *
> > + * The fault/event queue reports events and failures raised when
> > + * processing transactions. Each record is a 32byte structure where
> > + * the first dword has a fixed format for providing generic infos
> > + * regarding the fault/event, and two more dwords are there for
> > + * fault/event-specific information. For more details see section
> > + * 3.2.
> > + */
> > +struct riscv_iommu_fq_record {
> > +     u64 hdr;
> > +     u64 _reserved;
> > +     u64 iotval;
> > +     u64 iotval2;
> > +};
> > +
> > +/* Fields on header */
> > +#define RISCV_IOMMU_FQ_HDR_CAUSE     GENMASK_ULL(11, 0)
> > +#define RISCV_IOMMU_FQ_HDR_PID               GENMASK_ULL(31, 12)
> > +#define RISCV_IOMMU_FQ_HDR_PV                BIT_ULL(32)
> > +#define RISCV_IOMMU_FQ_HDR_PRIV              BIT_ULL(33)
> > +#define RISCV_IOMMU_FQ_HDR_TTYPE     GENMASK_ULL(39, 34)
> > +#define RISCV_IOMMU_FQ_HDR_DID               GENMASK_ULL(63, 40)
> > +
> > +/**
> > + * enum riscv_iommu_fq_causes - Fault/event cause values
> > + * @RISCV_IOMMU_FQ_CAUSE_INST_FAULT: Instruction access fault
> > + * @RISCV_IOMMU_FQ_CAUSE_RD_ADDR_MISALIGNED: Read address misaligned
> > + * @RISCV_IOMMU_FQ_CAUSE_RD_FAULT: Read load fault
> > + * @RISCV_IOMMU_FQ_CAUSE_WR_ADDR_MISALIGNED: Write/AMO address misaligned
> > + * @RISCV_IOMMU_FQ_CAUSE_WR_FAULT: Write/AMO access fault
> > + * @RISCV_IOMMU_FQ_CAUSE_INST_FAULT_S: Instruction page fault
> > + * @RISCV_IOMMU_FQ_CAUSE_RD_FAULT_S: Read page fault
> > + * @RISCV_IOMMU_FQ_CAUSE_WR_FAULT_S: Write/AMO page fault
> > + * @RISCV_IOMMU_FQ_CAUSE_INST_FAULT_VS: Instruction guest page fault
> > + * @RISCV_IOMMU_FQ_CAUSE_RD_FAULT_VS: Read guest page fault
> > + * @RISCV_IOMMU_FQ_CAUSE_WR_FAULT_VS: Write/AMO guest page fault
> > + * @RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED: All inbound transactions disallowed
> > + * @RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT: DDT entry load access fault
> > + * @RISCV_IOMMU_FQ_CAUSE_DDT_INVALID: DDT entry invalid
> > + * @RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED: DDT entry misconfigured
> > + * @RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED: Transaction type disallowed
> > + * @RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT: MSI PTE load access fault
> > + * @RISCV_IOMMU_FQ_CAUSE_MSI_INVALID: MSI PTE invalid
> > + * @RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED: MSI PTE misconfigured
> > + * @RISCV_IOMMU_FQ_CAUSE_MRIF_FAULT: MRIF access fault
> > + * @RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT: PDT entry load access fault
> > + * @RISCV_IOMMU_FQ_CAUSE_PDT_INVALID: PDT entry invalid
> > + * @RISCV_IOMMU_FQ_CAUSE_PDT_MISCONFIGURED: PDT entry misconfigured
> > + * @RISCV_IOMMU_FQ_CAUSE_DDT_CORRUPTED: DDT data corruption
> > + * @RISCV_IOMMU_FQ_CAUSE_PDT_CORRUPTED: PDT data corruption
> > + * @RISCV_IOMMU_FQ_CAUSE_MSI_PT_CORRUPTED: MSI page table data corruption
> > + * @RISCV_IOMMU_FQ_CAUSE_MRIF_CORRUIPTED: MRIF data corruption
> > + * @RISCV_IOMMU_FQ_CAUSE_INTERNAL_DP_ERROR: Internal data path error
> > + * @RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT: IOMMU MSI write access fault
> > + * @RISCV_IOMMU_FQ_CAUSE_PT_CORRUPTED: First/second stage page table data corruption
> > + *
> > + * Values are on table 11 of the spec, encodings 275 - 2047 are reserved for standard
> > + * use, and 2048 - 4095 for custom use.
> > + */
> > +enum riscv_iommu_fq_causes {
> > +     RISCV_IOMMU_FQ_CAUSE_INST_FAULT = 1,
> > +     RISCV_IOMMU_FQ_CAUSE_RD_ADDR_MISALIGNED = 4,
> > +     RISCV_IOMMU_FQ_CAUSE_RD_FAULT = 5,
> > +     RISCV_IOMMU_FQ_CAUSE_WR_ADDR_MISALIGNED = 6,
> > +     RISCV_IOMMU_FQ_CAUSE_WR_FAULT = 7,
> > +     RISCV_IOMMU_FQ_CAUSE_INST_FAULT_S = 12,
> > +     RISCV_IOMMU_FQ_CAUSE_RD_FAULT_S = 13,
> > +     RISCV_IOMMU_FQ_CAUSE_WR_FAULT_S = 15,
> > +     RISCV_IOMMU_FQ_CAUSE_INST_FAULT_VS = 20,
> > +     RISCV_IOMMU_FQ_CAUSE_RD_FAULT_VS = 21,
> > +     RISCV_IOMMU_FQ_CAUSE_WR_FAULT_VS = 23,
> > +     RISCV_IOMMU_FQ_CAUSE_DMA_DISABLED = 256,
> > +     RISCV_IOMMU_FQ_CAUSE_DDT_LOAD_FAULT = 257,
> > +     RISCV_IOMMU_FQ_CAUSE_DDT_INVALID = 258,
> > +     RISCV_IOMMU_FQ_CAUSE_DDT_MISCONFIGURED = 259,
> > +     RISCV_IOMMU_FQ_CAUSE_TTYPE_BLOCKED = 260,
> > +     RISCV_IOMMU_FQ_CAUSE_MSI_LOAD_FAULT = 261,
> > +     RISCV_IOMMU_FQ_CAUSE_MSI_INVALID = 262,
> > +     RISCV_IOMMU_FQ_CAUSE_MSI_MISCONFIGURED = 263,
> > +     RISCV_IOMMU_FQ_CAUSE_MRIF_FAULT = 264,
> > +     RISCV_IOMMU_FQ_CAUSE_PDT_LOAD_FAULT = 265,
> > +     RISCV_IOMMU_FQ_CAUSE_PDT_INVALID = 266,
> > +     RISCV_IOMMU_FQ_CAUSE_PDT_MISCONFIGURED = 267,
> > +     RISCV_IOMMU_FQ_CAUSE_DDT_CORRUPTED = 268,
> > +     RISCV_IOMMU_FQ_CAUSE_PDT_CORRUPTED = 269,
> > +     RISCV_IOMMU_FQ_CAUSE_MSI_PT_CORRUPTED = 270,
> > +     RISCV_IOMMU_FQ_CAUSE_MRIF_CORRUIPTED = 271,
> > +     RISCV_IOMMU_FQ_CAUSE_INTERNAL_DP_ERROR = 272,
> > +     RISCV_IOMMU_FQ_CAUSE_MSI_WR_FAULT = 273,
> > +     RISCV_IOMMU_FQ_CAUSE_PT_CORRUPTED = 274
> > +};
> > +
> > +/**
> > + * enum riscv_iommu_fq_ttypes: Fault/event transaction types
> > + * @RISCV_IOMMU_FQ_TTYPE_NONE: None. Fault not caused by an inbound transaction.
> > + * @RISCV_IOMMU_FQ_TTYPE_UADDR_INST_FETCH: Instruction fetch from untranslated address
> > + * @RISCV_IOMMU_FQ_TTYPE_UADDR_RD: Read from untranslated address
> > + * @RISCV_IOMMU_FQ_TTYPE_UADDR_WR: Write/AMO to untranslated address
> > + * @RISCV_IOMMU_FQ_TTYPE_TADDR_INST_FETCH: Instruction fetch from translated address
> > + * @RISCV_IOMMU_FQ_TTYPE_TADDR_RD: Read from translated address
> > + * @RISCV_IOMMU_FQ_TTYPE_TADDR_WR: Write/AMO to translated address
> > + * @RISCV_IOMMU_FQ_TTYPE_PCIE_ATS_REQ: PCIe ATS translation request
> > + * @RISCV_IOMMU_FW_TTYPE_PCIE_MSG_REQ: PCIe message request
> > + *
> > + * Values are on table 12 of the spec, type 4 and 10 - 31 are reserved for standard use
> > + * and 31 - 63 for custom use.
> > + */
> > +enum riscv_iommu_fq_ttypes {
> > +     RISCV_IOMMU_FQ_TTYPE_NONE = 0,
> > +     RISCV_IOMMU_FQ_TTYPE_UADDR_INST_FETCH = 1,
> > +     RISCV_IOMMU_FQ_TTYPE_UADDR_RD = 2,
> > +     RISCV_IOMMU_FQ_TTYPE_UADDR_WR = 3,
> > +     RISCV_IOMMU_FQ_TTYPE_TADDR_INST_FETCH = 5,
> > +     RISCV_IOMMU_FQ_TTYPE_TADDR_RD = 6,
> > +     RISCV_IOMMU_FQ_TTYPE_TADDR_WR = 7,
> > +     RISCV_IOMMU_FQ_TTYPE_PCIE_ATS_REQ = 8,
> > +     RISCV_IOMMU_FW_TTYPE_PCIE_MSG_REQ = 9,
> > +};
> > +
> > +/**
> > + * struct riscv_iommu_pq_record - PCIe Page Request record
> > + * @hdr: Header, includes PID, DID etc
> > + * @payload: Holds the page address, request group and permission bits
> > + *
> > + * For more infos on the PCIe Page Request queue see chapter 3.3.
> > + */
> > +struct riscv_iommu_pq_record {
> > +     u64 hdr;
> > +     u64 payload;
> > +};
> > +
> > +/* Header fields */
> > +#define RISCV_IOMMU_PREQ_HDR_PID     GENMASK_ULL(31, 12)
> > +#define RISCV_IOMMU_PREQ_HDR_PV              BIT_ULL(32)
> > +#define RISCV_IOMMU_PREQ_HDR_PRIV    BIT_ULL(33)
> > +#define RISCV_IOMMU_PREQ_HDR_EXEC    BIT_ULL(34)
> > +#define RISCV_IOMMU_PREQ_HDR_DID     GENMASK_ULL(63, 40)
> > +
> > +/* Payload fields */
> > +#define RISCV_IOMMU_PREQ_PAYLOAD_R   BIT_ULL(0)
> > +#define RISCV_IOMMU_PREQ_PAYLOAD_W   BIT_ULL(1)
> > +#define RISCV_IOMMU_PREQ_PAYLOAD_L   BIT_ULL(2)
> > +#define RISCV_IOMMU_PREQ_PAYLOAD_M   GENMASK_ULL(2, 0)       /* Mask of RWL for convenience */
> > +#define RISCV_IOMMU_PREQ_PRG_INDEX   GENMASK_ULL(11, 3)
> > +#define RISCV_IOMMU_PREQ_UADDR               GENMASK_ULL(63, 12)
> > +
> > +/**
> > + * struct riscv_iommu_msi_pte - MSI Page Table Entry
> > + * @pte: MSI PTE
> > + * @mrif_info: Memory-resident interrupt file info
> > + *
> > + * The MSI Page Table is used for virtualizing MSIs, so that when
> > + * a device sends an MSI to a guest, the IOMMU can reroute it
> > + * by translating the MSI address, either to a guest interrupt file
> > + * or a memory resident interrupt file (MRIF). Note that this page table
> > + * is an array of MSI PTEs, not a multi-level pt, each entry
> > + * is a leaf entry. For more infos check out the AIA spec, chapter 9.5.
> > + *
> > + * Also in basic mode the mrif_info field is ignored by the IOMMU and can
> > + * be used by software, any other reserved fields on pte must be zeroed-out
> > + * by software.
> > + */
> > +struct riscv_iommu_msi_pte {
> > +     u64 pte;
> > +     u64 mrif_info;
> > +};
> > +
> > +/* Fields on pte */
> > +#define RISCV_IOMMU_MSI_PTE_V                BIT_ULL(0)
> > +#define RISCV_IOMMU_MSI_PTE_M                GENMASK_ULL(2, 1)
> > +#define RISCV_IOMMU_MSI_PTE_MRIF_ADDR        GENMASK_ULL(53, 7)      /* When M == 1 (MRIF mode) */
> > +#define RISCV_IOMMU_MSI_PTE_PPN              RISCV_IOMMU_PPN_FIELD   /* When M == 3 (basic mode) */
> > +#define RISCV_IOMMU_MSI_PTE_C                BIT_ULL(63)
> > +
> > +/* Fields on mrif_info */
> > +#define RISCV_IOMMU_MSI_MRIF_NID     GENMASK_ULL(9, 0)
> > +#define RISCV_IOMMU_MSI_MRIF_NPPN    RISCV_IOMMU_PPN_FIELD
> > +#define RISCV_IOMMU_MSI_MRIF_NID_MSB BIT_ULL(60)
> > +
> > +#endif /* _RISCV_IOMMU_BITS_H_ */
> > diff --git a/drivers/iommu/riscv/iommu-platform.c b/drivers/iommu/riscv/iommu-platform.c
> > new file mode 100644
> > index 000000000000..770086ae2ab3
> > --- /dev/null
> > +++ b/drivers/iommu/riscv/iommu-platform.c
> > @@ -0,0 +1,94 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * RISC-V IOMMU as a platform device
> > + *
> > + * Copyright © 2023 FORTH-ICS/CARV
> > + * Copyright © 2023-2024 Rivos Inc.
> > + *
> > + * Authors
> > + *   Nick Kossifidis <mick@ics.forth.gr>
> > + *   Tomasz Jeznach <tjeznach@rivosinc.com>
> > + */
> > +
> > +#include <linux/kernel.h>
> > +#include <linux/module.h>
> > +#include <linux/of_platform.h>
> > +#include <linux/platform_device.h>
> > +
> > +#include "iommu-bits.h"
> > +#include "iommu.h"
> > +
> > +static int riscv_iommu_platform_probe(struct platform_device *pdev)
> > +{
> > +     struct device *dev = &pdev->dev;
> > +     struct riscv_iommu_device *iommu = NULL;
> > +     struct resource *res = NULL;
> > +     int vec;
> > +
> > +     iommu = devm_kzalloc(dev, sizeof(*iommu), GFP_KERNEL);
> > +     if (!iommu)
> > +             return -ENOMEM;
> > +
> > +     iommu->dev = dev;
> > +     iommu->reg = devm_platform_get_and_ioremap_resource(pdev, 0, &res);
> > +     if (IS_ERR(iommu->reg))
> > +             return dev_err_probe(dev, PTR_ERR(iommu->reg),
> > +                                  "could not map register region\n");
> > +
> > +     dev_set_drvdata(dev, iommu);
> > +
> > +     /* Check device reported capabilities / features. */
> > +     iommu->caps = riscv_iommu_readq(iommu, RISCV_IOMMU_REG_CAP);
> > +     iommu->fctl = riscv_iommu_readl(iommu, RISCV_IOMMU_REG_FCTL);
> > +
> > +     /* For now we only support WSI */
> > +     switch (FIELD_GET(RISCV_IOMMU_CAP_IGS, iommu->caps)) {
> > +     case RISCV_IOMMU_CAP_IGS_WSI:
> > +     case RISCV_IOMMU_CAP_IGS_BOTH:
> > +             break;
> > +     default:
> > +             return dev_err_probe(dev, -ENODEV,
> > +                                  "unable to use wire-signaled interrupts\n");
> > +     }
> > +
> > +     iommu->irqs_count = platform_irq_count(pdev);
> > +     if (iommu->irqs_count <= 0)
> > +             return dev_err_probe(dev, -ENODEV,
> > +                                  "no IRQ resources provided\n");
> > +
> > +     for (vec = 0; vec < iommu->irqs_count; vec++)
> > +             iommu->irqs[vec] = platform_get_irq(pdev, vec);
>
> And if I've specified 97 interrupts in my DT because I won't let schema
> be the boss of me?
>
> > +     /* Enable wire-signaled interrupts, fctl.WSI */
> > +     if (!(iommu->fctl & RISCV_IOMMU_FCTL_WSI)) {
> > +             iommu->fctl ^= RISCV_IOMMU_FCTL_WSI;
>
> Using XOR to only ever set a 0 bit to 1 seems a bit obtuse.
>
> > +             riscv_iommu_writel(iommu, RISCV_IOMMU_REG_FCTL, iommu->fctl);
> > +     }
> > +
> > +     return riscv_iommu_init(iommu);
> > +};
> > +
> > +static void riscv_iommu_platform_remove(struct platform_device *pdev)
> > +{
> > +     riscv_iommu_remove(dev_get_drvdata(&pdev->dev));
> > +};
> > +
> > +static const struct of_device_id riscv_iommu_of_match[] = {
> > +     {.compatible = "riscv,iommu",},
> > +     {},
> > +};
> > +
> > +MODULE_DEVICE_TABLE(of, riscv_iommu_of_match);
>
> And yet it cannot be a module?
>
> > +static struct platform_driver riscv_iommu_platform_driver = {
> > +     .probe = riscv_iommu_platform_probe,
> > +     .remove_new = riscv_iommu_platform_remove,
> > +     .driver = {
> > +             .name = "riscv,iommu",
> > +             .of_match_table = riscv_iommu_of_match,
> > +             .suppress_bind_attrs = true,
> > +     },
> > +};
> > +
> > +module_driver(riscv_iommu_platform_driver, platform_driver_register,
> > +           platform_driver_unregister);
>
> module_platform_driver() is a thing. Or builtin_platform_driver(), as
> things currently stand.
>
> > diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> > new file mode 100644
> > index 000000000000..af68c89200a9
> > --- /dev/null
> > +++ b/drivers/iommu/riscv/iommu.c
> > @@ -0,0 +1,89 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * IOMMU API for RISC-V IOMMU implementations.
> > + *
> > + * Copyright © 2022-2024 Rivos Inc.
> > + * Copyright © 2023 FORTH-ICS/CARV
> > + *
> > + * Authors
> > + *   Tomasz Jeznach <tjeznach@rivosinc.com>
> > + *   Nick Kossifidis <mick@ics.forth.gr>
> > + */
> > +
> > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>
> I guess that ends up as "iommu:"? Given that that prefix already belongs
> to the core code, please pick something more specific.
>
> > +#include <linux/compiler.h>
> > +#include <linux/dma-mapping.h>
> > +#include <linux/init.h>
> > +#include <linux/iommu.h>
> > +#include <linux/kernel.h>
> > +#include <linux/module.h>
> > +
> > +#include "iommu-bits.h"
> > +#include "iommu.h"
> > +
> > +MODULE_DESCRIPTION("Driver for RISC-V IOMMU");
> > +MODULE_AUTHOR("Tomasz Jeznach <tjeznach@rivosinc.com>");
> > +MODULE_AUTHOR("Nick Kossifidis <mick@ics.forth.gr>");
> > +MODULE_LICENSE("GPL");
> > +
> > +/* Timeouts in [us] */
> > +#define RISCV_IOMMU_DDTP_TIMEOUT     50000
> > +
> > +static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
> > +{
> > +     u64 ddtp;
> > +
> > +     /* Hardware must be configured in OFF | BARE mode at system initialization. */
> > +     riscv_iommu_readq_timeout(iommu, RISCV_IOMMU_REG_DDTP,
> > +                               ddtp, !(ddtp & RISCV_IOMMU_DDTP_BUSY),
> > +                               10, RISCV_IOMMU_DDTP_TIMEOUT);
> > +     if (FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp) > RISCV_IOMMU_DDTP_MODE_BARE)
> > +             return -EBUSY;
>
> It looks like RISC-V already supports kdump, so you probably want to be
> prepared to find the IOMMU with its pants down and deal with it from day
> one.
>

This is the simplest check/fail for the kexec and/or boot loaders
leaving IOMMU translations active.
I've been already looking into kexec path to quiesce all devices and
IOMMU in shutdown path.
I'm not convinced it's ready for the prime time on RISC-V, will
address this in follow up patches.

> > +
> > +     /* Configure accesses to in-memory data structures for CPU-native byte order. */
> > +     if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) != !!(iommu->fctl & RISCV_IOMMU_FCTL_BE)) {
> > +             if (!(iommu->caps & RISCV_IOMMU_CAP_END))
> > +                     return -EINVAL;
> > +             riscv_iommu_writel(iommu, RISCV_IOMMU_REG_FCTL,
> > +                                iommu->fctl ^ RISCV_IOMMU_FCTL_BE);
> > +             iommu->fctl = riscv_iommu_readl(iommu, RISCV_IOMMU_REG_FCTL);
> > +             if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN) != !!(iommu->fctl & RISCV_IOMMU_FCTL_BE))
> > +                     return -EINVAL;
> > +     }
>
> That's fun. It could be reorganised to avoid the duplicate check, but it
> is rather majestic as-is :)
>
> > +
> > +     dma_set_mask_and_coherent(iommu->dev,
> > +                               DMA_BIT_MASK(FIELD_GET(RISCV_IOMMU_CAP_PAS, iommu->caps)));
>
> This isn't a check, so I would think it belongs to "_init" rather than
> "_init_check".
>
> Thanks,
> Robin.
>

ACK to all other comments.

Thanks
- Tomasz

> > +
> > +     return 0;
> > +}
> > +
> > +void riscv_iommu_remove(struct riscv_iommu_device *iommu)
> > +{
> > +     iommu_device_sysfs_remove(&iommu->iommu);
> > +}
> > +
> > +int riscv_iommu_init(struct riscv_iommu_device *iommu)
> > +{
> > +     int rc;
> > +
> > +     rc = riscv_iommu_init_check(iommu);
> > +     if (rc)
> > +             return dev_err_probe(iommu->dev, rc, "unexpected device state\n");
> > +     /*
> > +      * Placeholder for a complete IOMMU device initialization.
> > +      * For now, only bare minimum: enable global identity mapping mode and register sysfs.
> > +      */
> > +     riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP,
> > +                        FIELD_PREP(RISCV_IOMMU_DDTP_MODE, RISCV_IOMMU_DDTP_MODE_BARE));
> > +
> > +     rc = iommu_device_sysfs_add(&iommu->iommu, NULL, NULL, "riscv-iommu@%s",
> > +                                 dev_name(iommu->dev));
> > +     if (WARN(rc, "cannot register sysfs interface\n"))
> > +             goto err_sysfs;
> > +
> > +     return 0;
> > +
> > +err_sysfs:
> > +     return rc;
> > +}
> > diff --git a/drivers/iommu/riscv/iommu.h b/drivers/iommu/riscv/iommu.h
> > new file mode 100644
> > index 000000000000..700e33dc2446
> > --- /dev/null
> > +++ b/drivers/iommu/riscv/iommu.h
> > @@ -0,0 +1,62 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Copyright © 2022-2024 Rivos Inc.
> > + * Copyright © 2023 FORTH-ICS/CARV
> > + *
> > + * Authors
> > + *   Tomasz Jeznach <tjeznach@rivosinc.com>
> > + *   Nick Kossifidis <mick@ics.forth.gr>
> > + */
> > +
> > +#ifndef _RISCV_IOMMU_H_
> > +#define _RISCV_IOMMU_H_
> > +
> > +#include <linux/iommu.h>
> > +#include <linux/types.h>
> > +#include <linux/iopoll.h>
> > +
> > +#include "iommu-bits.h"
> > +
> > +struct riscv_iommu_device {
> > +     /* iommu core interface */
> > +     struct iommu_device iommu;
> > +
> > +     /* iommu hardware */
> > +     struct device *dev;
> > +
> > +     /* hardware control register space */
> > +     void __iomem *reg;
> > +
> > +     /* supported and enabled hardware capabilities */
> > +     u64 caps;
> > +     u32 fctl;
> > +
> > +     /* available interrupt numbers, MSI or WSI */
> > +     unsigned int irqs[RISCV_IOMMU_INTR_COUNT];
> > +     unsigned int irqs_count;
> > +};
> > +
> > +int riscv_iommu_init(struct riscv_iommu_device *iommu);
> > +void riscv_iommu_remove(struct riscv_iommu_device *iommu);
> > +
> > +#define riscv_iommu_readl(iommu, addr) \
> > +     readl_relaxed((iommu)->reg + (addr))
> > +
> > +#define riscv_iommu_readq(iommu, addr) \
> > +     readq_relaxed((iommu)->reg + (addr))
> > +
> > +#define riscv_iommu_writel(iommu, addr, val) \
> > +     writel_relaxed((val), (iommu)->reg + (addr))
> > +
> > +#define riscv_iommu_writeq(iommu, addr, val) \
> > +     writeq_relaxed((val), (iommu)->reg + (addr))
> > +
> > +#define riscv_iommu_readq_timeout(iommu, addr, val, cond, delay_us, timeout_us) \
> > +     readx_poll_timeout(readq_relaxed, (iommu)->reg + (addr), val, cond, \
> > +                        delay_us, timeout_us)
> > +
> > +#define riscv_iommu_readl_timeout(iommu, addr, val, cond, delay_us, timeout_us) \
> > +     readx_poll_timeout(readl_relaxed, (iommu)->reg + (addr), val, cond, \
> > +                        delay_us, timeout_us)
> > +
> > +#endif

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/7] dt-bindings: iommu: riscv: Add bindings for RISC-V IOMMU
  2024-04-18 17:04   ` Conor Dooley
@ 2024-04-24 22:37     ` Tomasz Jeznach
  2024-04-25 17:11       ` Conor Dooley
  0 siblings, 1 reply; 30+ messages in thread
From: Tomasz Jeznach @ 2024-04-24 22:37 UTC (permalink / raw)
  To: Conor Dooley
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux

On Thu, Apr 18, 2024 at 10:04 AM Conor Dooley <conor@kernel.org> wrote:
>
> On Thu, Apr 18, 2024 at 09:32:19AM -0700, Tomasz Jeznach wrote:
> > Add bindings for the RISC-V IOMMU device drivers.
> >
> > Co-developed-by: Anup Patel <apatel@ventanamicro.com>
> > Signed-off-by: Anup Patel <apatel@ventanamicro.com>
> > Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
> > ---
> >  .../bindings/iommu/riscv,iommu.yaml           | 149 ++++++++++++++++++
> >  MAINTAINERS                                   |   7 +
> >  2 files changed, 156 insertions(+)
> >  create mode 100644 Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
> >
> > diff --git a/Documentation/devicetree/bindings/iommu/riscv,iommu.yaml b/Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
> > new file mode 100644
> > index 000000000000..d6522ddd43fa
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
> > @@ -0,0 +1,149 @@
> > +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
> > +%YAML 1.2
> > +---
> > +$id: http://devicetree.org/schemas/iommu/riscv,iommu.yaml#
> > +$schema: http://devicetree.org/meta-schemas/core.yaml#
> > +
> > +title: RISC-V IOMMU Architecture Implementation
> > +
> > +maintainers:
> > +  - Tomasz Jeznach <tjeznach@rivosinc.com>
> > +
> > +description: |+
>
> FYI, the + here is probably not needed.
>
> > +  The RISC-V IOMMU provides memory address translation and isolation for
> > +  input and output devices, supporting per-device translation context,
> > +  shared process address spaces including the ATS and PRI components of
> > +  the PCIe specification, two stage address translation and MSI remapping.
> > +  It supports identical translation table format to the RISC-V address
> > +  translation tables with page level access and protection attributes.
> > +  Hardware uses in-memory command and fault reporting queues with wired
> > +  interrupt or MSI notifications.
> > +
> > +  Visit https://github.com/riscv-non-isa/riscv-iommu for more details.
> > +
> > +  For information on assigning RISC-V IOMMU to its peripheral devices,
> > +  see generic IOMMU bindings.
> > +
> > +properties:
> > +  # For PCIe IOMMU hardware compatible property should contain the vendor
> > +  # and device ID according to the PCI Bus Binding specification.
> > +  # Since PCI provides built-in identification methods, compatible is not
> > +  # actually required. For non-PCIe hardware implementations 'riscv,iommu'
> > +  # should be specified along with 'reg' property providing MMIO location.
>
> I dunno, I'd like to see soc-specific compatibles for implementations of
> the RISC-V IOMMU. If you need a DT compatible for use in QEMU, I'd
> suggest doing what was done for the aplic and having a dedicated
> compatible for that and disallow having "riscv,iommu" in isolation.
>

Makes sense.  Will update to something like below:
  compatible:
    oneOf:
      - items:
          - enum:
              - qemu,iommu
          - const: riscv,iommu
      - items:
          - enum:
              - pci1efd,edf1
          - const: riscv,pci-iommu

> > +  compatible:
> > +    oneOf:
> > +      - items:
> > +          - const: riscv,pci-iommu
> > +          - const: pci1efd,edf1
> > +      - items:
> > +          - const: pci1efd,edf1
>
> Why are both versions allowed? If the former is more understandable,
> can't we just go with that?
>
> > +      - items:
> > +          - const: riscv,iommu
>
> Other than the compatible setup I think this is pretty decent though,
> Conor.

ACK to other comments,

Thanks for the review,
- Tomasz

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 5/7] iommu/riscv: Device directory management.
  2024-04-19 12:40   ` Jason Gunthorpe
@ 2024-04-24 23:01     ` Tomasz Jeznach
  2024-04-24 23:07       ` Jason Gunthorpe
  0 siblings, 1 reply; 30+ messages in thread
From: Tomasz Jeznach @ 2024-04-24 23:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux

On Fri, Apr 19, 2024 at 5:40 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Apr 18, 2024 at 09:32:23AM -0700, Tomasz Jeznach wrote:
> > @@ -31,13 +32,350 @@ MODULE_LICENSE("GPL");
> >  /* Timeouts in [us] */
> >  #define RISCV_IOMMU_DDTP_TIMEOUT     50000
> >
> > -static int riscv_iommu_attach_identity_domain(struct iommu_domain *domain,
> > -                                           struct device *dev)
> > +/* RISC-V IOMMU PPN <> PHYS address conversions, PHYS <=> PPN[53:10] */
> > +#define phys_to_ppn(va)  (((va) >> 2) & (((1ULL << 44) - 1) << 10))
> > +#define ppn_to_phys(pn)       (((pn) << 2) & (((1ULL << 44) - 1) << 12))
> > +
> > +#define dev_to_iommu(dev) \
> > +     container_of((dev)->iommu->iommu_dev, struct riscv_iommu_device, iommu)
>
> We have iommu_get_iommu_dev() now
>
> > +static unsigned long riscv_iommu_get_pages(struct riscv_iommu_device *iommu, unsigned int order)
> > +{
> > +     struct riscv_iommu_devres *devres;
> > +     struct page *pages;
> > +
> > +     pages = alloc_pages_node(dev_to_node(iommu->dev),
> > +                              GFP_KERNEL_ACCOUNT | __GFP_ZERO, order);
> > +     if (unlikely(!pages)) {
> > +             dev_err(iommu->dev, "Page allocation failed, order %u\n", order);
> > +             return 0;
> > +     }
>
> This needs adjusting for the recently merged allocation accounting
>
> > +static int riscv_iommu_attach_domain(struct riscv_iommu_device *iommu,
> > +                                  struct device *dev,
> > +                                  struct iommu_domain *iommu_domain)
> > +{
> > +     struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> > +     struct riscv_iommu_dc *dc;
> > +     u64 fsc, ta, tc;
> > +     int i;
> > +
> > +     if (!iommu_domain) {
> > +             ta = 0;
> > +             tc = 0;
> > +             fsc = 0;
> > +     } else if (iommu_domain->type == IOMMU_DOMAIN_IDENTITY) {
> > +             ta = 0;
> > +             tc = RISCV_IOMMU_DC_TC_V;
> > +             fsc = FIELD_PREP(RISCV_IOMMU_DC_FSC_MODE, RISCV_IOMMU_DC_FSC_MODE_BARE);
> > +     } else {
> > +             /* This should never happen. */
> > +             return -ENODEV;
> > +     }
>
> Please don't write it like this. This function is already being called
> by functions that are already under specific ops, don't check
> domain->type here.
>
> Instead have the caller compute and pass in the ta/tc/fsc
> values. Maybe in a tidy struct..
>
> > +     /* Update existing or allocate new entries in device directory */
> > +     for (i = 0; i < fwspec->num_ids; i++) {
> > +             dc = riscv_iommu_get_dc(iommu, fwspec->ids[i], !iommu_domain);
> > +             if (!dc && !iommu_domain)
> > +                     continue;
> > +             if (!dc)
> > +                     return -ENODEV;
>
> But if this fails some of the fwspecs were left in a weird state ?
>
> Drivers should try hard to have attach functions that fail and make no
> change at all or fully succeed.
>
> Meaning ideally preallocate any required memory before doing any
> change to the HW visable structures.
>

Good point. Done.
Looking at the fwspec->ids[] I'm assuming nobody will add/modify the
IDs after iommu_probe_device() completes.

> > +
> > +             /* Swap device context, update TC valid bit as the last operation */
> > +             xchg64(&dc->fsc, fsc);
> > +             xchg64(&dc->ta, ta);
> > +             xchg64(&dc->tc, tc);
>
> This doesn't loook right? When you get to adding PAGING suport fsc has
> the page table pfn and ta has the cache tag, so this will end up
> tearing the data for sure, eg when asked to replace a PAGING domain
> with another PAGING domain? That will create a functional/security
> problem, right?
>
> I would encourage you to re-use the ARM sequencing code, ideally moved
> to some generic helper library. Every iommu driver dealing with
> multi-quanta descriptors seems to have this same fundamental
> sequencing problem.
>

Good point. Reworked.

> > +static void riscv_iommu_release_device(struct device *dev)
> > +{
> > +     struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> > +
> > +     riscv_iommu_attach_domain(iommu, dev, NULL);
> > +}
>
> The release_domain has landed too now. Please don't invent weird NULL
> domain types that have special meaning. I assume clearing the V bit is
> a blocking behavior? So please implement a proper blocking domain and
> set release_domain = &riscv_iommu_blocking and just omit this release
> function.
>

Updated to use release_domain, should be cleaner now.
Clearing TC.V is a blocking (but noisy) behavior, should be fine for
release domain where devices should be quiesced already.

> > @@ -133,12 +480,14 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
> >       rc = riscv_iommu_init_check(iommu);
> >       if (rc)
> >               return dev_err_probe(iommu->dev, rc, "unexpected device state\n");
> > -     /*
> > -      * Placeholder for a complete IOMMU device initialization.
> > -      * For now, only bare minimum: enable global identity mapping mode and register sysfs.
> > -      */
> > -     riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP,
> > -                        FIELD_PREP(RISCV_IOMMU_DDTP_MODE, RISCV_IOMMU_DDTP_MODE_BARE));
> > +
> > +     rc = riscv_iommu_ddt_alloc(iommu);
> > +     if (WARN(rc, "cannot allocate device directory\n"))
> > +             goto err_init;
>
> memory allocation failure already makes noisy prints, more prints are
> not needed..
>
> > +     rc = riscv_iommu_set_ddtp_mode(iommu, RISCV_IOMMU_DDTP_MODE_MAX);
> > +     if (WARN(rc, "cannot enable iommu device\n"))
> > +             goto err_init;
>
> This is not a proper use of WARN, it should only be used for things
> that cannot happen not undesired error paths.
>
> Jason

Thanks, ack to all. Will push updated v3 shortly.
- Tomasz

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 5/7] iommu/riscv: Device directory management.
  2024-04-24 23:01     ` Tomasz Jeznach
@ 2024-04-24 23:07       ` Jason Gunthorpe
  0 siblings, 0 replies; 30+ messages in thread
From: Jason Gunthorpe @ 2024-04-24 23:07 UTC (permalink / raw)
  To: Tomasz Jeznach
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux

On Wed, Apr 24, 2024 at 04:01:04PM -0700, Tomasz Jeznach wrote:
> > > +     /* Update existing or allocate new entries in device directory */
> > > +     for (i = 0; i < fwspec->num_ids; i++) {
> > > +             dc = riscv_iommu_get_dc(iommu, fwspec->ids[i], !iommu_domain);
> > > +             if (!dc && !iommu_domain)
> > > +                     continue;
> > > +             if (!dc)
> > > +                     return -ENODEV;
> >
> > But if this fails some of the fwspecs were left in a weird state ?
> >
> > Drivers should try hard to have attach functions that fail and make no
> > change at all or fully succeed.
> >
> > Meaning ideally preallocate any required memory before doing any
> > change to the HW visable structures.
> 
> Good point. Done.
> Looking at the fwspec->ids[] I'm assuming nobody will add/modify the
> IDs after iommu_probe_device() completes.

Yes

> > > +             /* Swap device context, update TC valid bit as the last operation */
> > > +             xchg64(&dc->fsc, fsc);
> > > +             xchg64(&dc->ta, ta);
> > > +             xchg64(&dc->tc, tc);
> >
> > This doesn't loook right? When you get to adding PAGING suport fsc has
> > the page table pfn and ta has the cache tag, so this will end up
> > tearing the data for sure, eg when asked to replace a PAGING domain
> > with another PAGING domain? That will create a functional/security
> > problem, right?
> >
> > I would encourage you to re-use the ARM sequencing code, ideally moved
> > to some generic helper library. Every iommu driver dealing with
> > multi-quanta descriptors seems to have this same fundamental
> > sequencing problem.
> >
> 
> Good point. Reworked.

I suppose by force clearing the v bit before starting the sequence?

That is OK but won't support some non-embedded focused features in the
long run. It is a good approach to get the driver landed though.
 
> > The release_domain has landed too now. Please don't invent weird NULL
> > domain types that have special meaning. I assume clearing the V bit is
> > a blocking behavior? So please implement a proper blocking domain and
> > set release_domain = &riscv_iommu_blocking and just omit this release
> > function.
> >
> 
> Updated to use release_domain, should be cleaner now.
> Clearing TC.V is a blocking (but noisy) behavior, should be fine for
> release domain where devices should be quiesced already.

blocking is fine to be noisy.

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 5/7] iommu/riscv: Device directory management.
  2024-04-22  5:11   ` Baolu Lu
@ 2024-04-24 23:07     ` Tomasz Jeznach
  0 siblings, 0 replies; 30+ messages in thread
From: Tomasz Jeznach @ 2024-04-24 23:07 UTC (permalink / raw)
  To: Baolu Lu
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux

On Sun, Apr 21, 2024 at 10:13 PM Baolu Lu <baolu.lu@linux.intel.com> wrote:
>
> On 4/19/24 12:32 AM, Tomasz Jeznach wrote:
> > Introduce device context allocation and device directory tree
> > management including capabilities discovery sequence, as described
> > in Chapter 2.1 of the RISC-V IOMMU Architecture Specification.
> >
> > Device directory mode will be auto detected using DDTP WARL property,
> > using highest mode supported by the driver and hardware. If none
> > supported can be configured, driver will fall back to global pass-through.
> >
> > First level DDTP page can be located in I/O (detected using DDTP WARL)
> > and system memory.
> >
> > Only identity protection domain is supported by this implementation.
> >
> > Co-developed-by: Nick Kossifidis <mick@ics.forth.gr>
> > Signed-off-by: Nick Kossifidis <mick@ics.forth.gr>
> > Signed-off-by: Tomasz Jeznach <tjeznach@rivosinc.com>
> > ---
> >   drivers/iommu/riscv/iommu.c | 369 +++++++++++++++++++++++++++++++++++-
> >   drivers/iommu/riscv/iommu.h |   5 +
> >   2 files changed, 365 insertions(+), 9 deletions(-)
>
> [ ... ]
>
> > +
> > +/*
> > + * Discover supported DDT modes starting from requested value,
> > + * configure DDTP register with accepted mode and root DDT address.
> > + * Accepted iommu->ddt_mode is updated on success.
> > + */
> > +static int riscv_iommu_set_ddtp_mode(struct riscv_iommu_device *iommu,
> > +                                  unsigned int ddtp_mode)
> > +{
> > +     struct device *dev = iommu->dev;
> > +     u64 ddtp, rq_ddtp;
> > +     unsigned int mode, rq_mode = ddtp_mode;
> > +     int rc;
> > +
> > +     rc = readq_relaxed_poll_timeout(iommu->reg + RISCV_IOMMU_REG_DDTP,
> > +                                     ddtp, !(ddtp & RISCV_IOMMU_DDTP_BUSY),
> > +                                     10, RISCV_IOMMU_DDTP_TIMEOUT);
> > +     if (rc < 0)
> > +             return -EBUSY;
> > +
> > +     /* Disallow state transition from xLVL to xLVL. */
> > +     switch (FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp)) {
> > +     case RISCV_IOMMU_DDTP_MODE_BARE:
> > +     case RISCV_IOMMU_DDTP_MODE_OFF:
> > +             break;
> > +     default:
> > +             if (rq_mode != RISCV_IOMMU_DDTP_MODE_BARE &&
> > +                 rq_mode != RISCV_IOMMU_DDTP_MODE_OFF)
> > +                     return -EINVAL;
>
> Is this check duplicate? It appears that it's always true in the default
> branch.
>

No. The condition in the switch represents current mode, while check
in default branch checks desired mode device will be configured to.
I've reworked the code in v3 to be more readable.

> > +             break;
> > +     }
> > +
> > +     do {
> > +             rq_ddtp = FIELD_PREP(RISCV_IOMMU_DDTP_MODE, rq_mode);
> > +             if (rq_mode > RISCV_IOMMU_DDTP_MODE_BARE)
> > +                     rq_ddtp |= phys_to_ppn(iommu->ddt_phys);
> > +
> > +             riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP, rq_ddtp);
> > +
> > +             rc = readq_relaxed_poll_timeout(iommu->reg + RISCV_IOMMU_REG_DDTP,
> > +                                             ddtp, !(ddtp & RISCV_IOMMU_DDTP_BUSY),
> > +                                             10, RISCV_IOMMU_DDTP_TIMEOUT);
> > +             if (rc < 0) {
> > +                     dev_warn(dev, "timeout when setting ddtp (ddt mode: %u, read: %llx)\n",
> > +                              rq_mode, ddtp);
> > +                     return -EBUSY;
> > +             }
> > +
> > +             /* Verify IOMMU hardware accepts new DDTP config. */
> > +             mode = FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp);
> > +
> > +             if (rq_mode == mode)
> > +                     break;
> > +
> > +             /* Hardware mandatory DDTP mode has not been accepted. */
> > +             if (rq_mode < RISCV_IOMMU_DDTP_MODE_1LVL && rq_ddtp != ddtp) {
> > +                     dev_warn(dev, "DDTP update failed hw: %llx vs %llx\n", ddtp, rq_ddtp);
> > +                     return -EINVAL;
> > +             }
> > +
> > +             /*
> > +              * Mode field is WARL, an IOMMU may support a subset of
> > +              * directory table levels in which case if we tried to set
> > +              * an unsupported number of levels we'll readback either
> > +              * a valid xLVL or off/bare. If we got off/bare, try again
> > +              * with a smaller xLVL.
> > +              */
> > +             if (mode < RISCV_IOMMU_DDTP_MODE_1LVL &&
> > +                 rq_mode > RISCV_IOMMU_DDTP_MODE_1LVL) {
> > +                     dev_dbg(dev, "DDTP hw mode %u vs %u\n", mode, rq_mode);
> > +                     rq_mode--;
> > +                     continue;
> > +             }
> > +
> > +             /*
> > +              * We tried all supported modes and IOMMU hardware failed to
> > +              * accept new settings, something went very wrong since off/bare
> > +              * and at least one xLVL must be supported.
> > +              */
> > +             dev_warn(dev, "DDTP hw mode %u, failed to set %u\n", mode, ddtp_mode);
> > +             return -EINVAL;
> > +     } while (1);
> > +
> > +     iommu->ddt_mode = mode;
> > +     if (mode != ddtp_mode)
> > +             dev_warn(dev, "DDTP failover to %u mode, requested %u\n",
> > +                      mode, ddtp_mode);
> > +
> > +     return 0;
> > +}
> > +
>
> [ ... ]
>
> > +
> > +static int riscv_iommu_attach_domain(struct riscv_iommu_device *iommu,
> > +                                  struct device *dev,
> > +                                  struct iommu_domain *iommu_domain)
> > +{
> > +     struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
> > +     struct riscv_iommu_dc *dc;
> > +     u64 fsc, ta, tc;
> > +     int i;
> > +
> > +     if (!iommu_domain) {
> > +             ta = 0;
> > +             tc = 0;
> > +             fsc = 0;
> > +     } else if (iommu_domain->type == IOMMU_DOMAIN_IDENTITY) {
> > +             ta = 0;
> > +             tc = RISCV_IOMMU_DC_TC_V;
> > +             fsc = FIELD_PREP(RISCV_IOMMU_DC_FSC_MODE, RISCV_IOMMU_DC_FSC_MODE_BARE);
> > +     } else {
> > +             /* This should never happen. */
> > +             return -ENODEV;
> > +     }
>
> Move the domain->type check code to the domain-specific ops.
>
> > +
> > +     /* Update existing or allocate new entries in device directory */
> > +     for (i = 0; i < fwspec->num_ids; i++) {
> > +             dc = riscv_iommu_get_dc(iommu, fwspec->ids[i], !iommu_domain);
> > +             if (!dc && !iommu_domain)
> > +                     continue;
> > +             if (!dc)
> > +                     return -ENODEV;
> > +
> > +             /* Swap device context, update TC valid bit as the last operation */
> > +             xchg64(&dc->fsc, fsc);
> > +             xchg64(&dc->ta, ta);
> > +             xchg64(&dc->tc, tc);
> > +
> > +             /* Device context invalidation will be required. Ignoring for now. */
> > +     }
> > +
> >       return 0;
> >   }
> >
> > +static int riscv_iommu_attach_identity_domain(struct iommu_domain *iommu_domain,
> > +                                           struct device *dev)
> > +{
> > +     struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> > +
> > +     /* Global pass-through already enabled, do nothing. */
> > +     if (iommu->ddt_mode == RISCV_IOMMU_DDTP_MODE_BARE)
> > +             return 0;
> > +
> > +     return riscv_iommu_attach_domain(iommu, dev, iommu_domain);
> > +}
> > +
> >   static struct iommu_domain riscv_iommu_identity_domain = {
> >       .type = IOMMU_DOMAIN_IDENTITY,
> >       .ops = &(const struct iommu_domain_ops) {
> > @@ -82,6 +420,13 @@ static void riscv_iommu_probe_finalize(struct device *dev)
> >       iommu_setup_dma_ops(dev, 0, U64_MAX);
> >   }
> >
> > +static void riscv_iommu_release_device(struct device *dev)
> > +{
> > +     struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> > +
> > +     riscv_iommu_attach_domain(iommu, dev, NULL);
>
> Attaching a NULL domain to a device has already been removed. You can
> use the iommu_ops->release_domain here.
>
> > +}
> > +
> >   static const struct iommu_ops riscv_iommu_ops = {
> >       .owner = THIS_MODULE,
> >       .of_xlate = riscv_iommu_of_xlate,
> > @@ -90,6 +435,7 @@ static const struct iommu_ops riscv_iommu_ops = {
> >       .device_group = riscv_iommu_device_group,
> >       .probe_device = riscv_iommu_probe_device,
> >       .probe_finalize = riscv_iommu_probe_finalize,
>
> The probe_finalize op will be removed soon.
>
> https://lore.kernel.org/linux-iommu/bebea331c1d688b34d9862eefd5ede47503961b8.1713523152.git.robin.murphy@arm.com/

Thanks, I'm aware of the change, once this change is pulled for
iommu/next I'll just remove probe_finalize.

>
> > +     .release_device = riscv_iommu_release_device,
> >   };
> >
> >   static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
> > @@ -124,6 +470,7 @@ void riscv_iommu_remove(struct riscv_iommu_device *iommu)
> >   {
> >       iommu_device_unregister(&iommu->iommu);
> >       iommu_device_sysfs_remove(&iommu->iommu);
> > +     riscv_iommu_set_ddtp_mode(iommu, RISCV_IOMMU_DDTP_MODE_OFF);
> >   }
> >
> >   int riscv_iommu_init(struct riscv_iommu_device *iommu)
> > @@ -133,12 +480,14 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
> >       rc = riscv_iommu_init_check(iommu);
> >       if (rc)
> >               return dev_err_probe(iommu->dev, rc, "unexpected device state\n");
> > -     /*
> > -      * Placeholder for a complete IOMMU device initialization.
> > -      * For now, only bare minimum: enable global identity mapping mode and register sysfs.
> > -      */
> > -     riscv_iommu_writeq(iommu, RISCV_IOMMU_REG_DDTP,
> > -                        FIELD_PREP(RISCV_IOMMU_DDTP_MODE, RISCV_IOMMU_DDTP_MODE_BARE));
> > +
> > +     rc = riscv_iommu_ddt_alloc(iommu);
> > +     if (WARN(rc, "cannot allocate device directory\n"))
> > +             goto err_init;
> > +
> > +     rc = riscv_iommu_set_ddtp_mode(iommu, RISCV_IOMMU_DDTP_MODE_MAX);
> > +     if (WARN(rc, "cannot enable iommu device\n"))
> > +             goto err_init;
> >
> >       rc = iommu_device_sysfs_add(&iommu->iommu, NULL, NULL, "riscv-iommu@%s",
> >                                   dev_name(iommu->dev));
> > @@ -154,5 +503,7 @@ int riscv_iommu_init(struct riscv_iommu_device *iommu)
> >   err_iommu:
> >       iommu_device_sysfs_remove(&iommu->iommu);
> >   err_sysfs:
> > +     riscv_iommu_set_ddtp_mode(iommu, RISCV_IOMMU_DDTP_MODE_OFF);
> > +err_init:
> >       return rc;
> >   }
> > diff --git a/drivers/iommu/riscv/iommu.h b/drivers/iommu/riscv/iommu.h
> > index 700e33dc2446..f1696926582c 100644
> > --- a/drivers/iommu/riscv/iommu.h
> > +++ b/drivers/iommu/riscv/iommu.h
> > @@ -34,6 +34,11 @@ struct riscv_iommu_device {
> >       /* available interrupt numbers, MSI or WSI */
> >       unsigned int irqs[RISCV_IOMMU_INTR_COUNT];
> >       unsigned int irqs_count;
> > +
> > +     /* device directory */
> > +     unsigned int ddt_mode;
> > +     dma_addr_t ddt_phys;
> > +     u64 *ddt_root;
> >   };
> >
> >   int riscv_iommu_init(struct riscv_iommu_device *iommu);
>
> Best regards,
> baolu

Thank you, Best.
- Tomasz

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 7/7] iommu/riscv: Paging domain support
  2024-04-19 12:56   ` Jason Gunthorpe
  2024-04-22  7:40     ` Baolu Lu
@ 2024-04-24 23:30     ` Tomasz Jeznach
  2024-04-24 23:39       ` Jason Gunthorpe
  1 sibling, 1 reply; 30+ messages in thread
From: Tomasz Jeznach @ 2024-04-24 23:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux

On Fri, Apr 19, 2024 at 5:56 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Apr 18, 2024 at 09:32:25AM -0700, Tomasz Jeznach wrote:
>
> > diff --git a/drivers/iommu/riscv/iommu.c b/drivers/iommu/riscv/iommu.c
> > index a4f74588cdc2..32ddc372432d 100644
> > --- a/drivers/iommu/riscv/iommu.c
> > +++ b/drivers/iommu/riscv/iommu.c
> > @@ -46,6 +46,10 @@ MODULE_LICENSE("GPL");
> >  #define dev_to_iommu(dev) \
> >       container_of((dev)->iommu->iommu_dev, struct riscv_iommu_device, iommu)
> >
> > +/* IOMMU PSCID allocation namespace. */
> > +static DEFINE_IDA(riscv_iommu_pscids);
> > +#define RISCV_IOMMU_MAX_PSCID                BIT(20)
> > +
>
> You may consider putting this IDA in the riscv_iommu_device() and move
> the pscid from the domain to the bond?
>

I've been considering containing IDA inside riscv_iommu_device at some
point,  but it made PCSID management more complicated.  In the follow
up patches it is desired for PSCID to be unique across all IOMMUs in
the system (within guest's GSCID), as the protection domains might
(and will) be shared between more than single IOMMU device.

> >  /* Device resource-managed allocations */
> >  struct riscv_iommu_devres {
> >       unsigned long addr;
> > @@ -752,12 +756,77 @@ static int riscv_iommu_ddt_alloc(struct riscv_iommu_device *iommu)
> >       return 0;
> >  }
> >
> > +struct riscv_iommu_bond {
> > +     struct list_head list;
> > +     struct rcu_head rcu;
> > +     struct device *dev;
> > +};
> > +
> > +/* This struct contains protection domain specific IOMMU driver data. */
> > +struct riscv_iommu_domain {
> > +     struct iommu_domain domain;
> > +     struct list_head bonds;
> > +     int pscid;
> > +     int numa_node;
> > +     int amo_enabled:1;
> > +     unsigned int pgd_mode;
> > +     /* paging domain */
> > +     unsigned long pgd_root;
> > +};
>
> Glad to see there is no riscv_iommu_device pointer in the domain!
>
> > +static void riscv_iommu_iotlb_inval(struct riscv_iommu_domain *domain,
> > +                                 unsigned long start, unsigned long end)
> > +{
> > +     struct riscv_iommu_bond *bond;
> > +     struct riscv_iommu_device *iommu;
> > +     struct riscv_iommu_command cmd;
> > +     unsigned long len = end - start + 1;
> > +     unsigned long iova;
> > +
> > +     rcu_read_lock();
> > +     list_for_each_entry_rcu(bond, &domain->bonds, list) {
> > +             iommu = dev_to_iommu(bond->dev);
>
> Pedantically this locking isn't locked right, there is technically
> nothing that prevents bond->dev and the iommu instance struct from
> being freed here. eg iommufd can hit races here if userspace can hot
> unplug devices.
>
> I suggest storing the iommu pointer itself in the bond instead of the
> device then add a synchronize_rcu() to the iommu unregister path.
>

Very good point. Thanks for pointing this out.
Reworked to add locking around list modifications (and do not
incorrectly rely on iommu group mutex locks).

> > +             riscv_iommu_cmd_inval_vma(&cmd);
> > +             riscv_iommu_cmd_inval_set_pscid(&cmd, domain->pscid);
> > +             if (len > 0 && len < RISCV_IOMMU_IOTLB_INVAL_LIMIT) {
> > +                     for (iova = start; iova < end; iova += PAGE_SIZE) {
> > +                             riscv_iommu_cmd_inval_set_addr(&cmd, iova);
> > +                             riscv_iommu_cmd_send(iommu, &cmd, 0);
> > +                     }
> > +             } else {
> > +                     riscv_iommu_cmd_send(iommu, &cmd, 0);
> > +             }
> > +     }
>
> This seems suboptimal, you probably want to copy the new design that
> Intel is doing where you allocate "bonds" that are already
> de-duplicated. Ie if I have 10 devices on the same iommu sharing the
> domain the above will invalidate the PSCID 10 times. It should only be
> done once.
>
> ie add a "bond" for the (iommu,pscid) and refcount that based on how
> many devices are used. Then another "bond" for the ATS stuff eventually.
>

Agree, not perfect to send duplicate invalidations.
This should improve with follow up patchsets introducing of SVA
(reusing the same, extended bond structure) and update to send IOTLB
range invalidations.

For this change I've decided to go with as simple as possible
implementation and over-invalidate for domains with multiple devices
attached. Hope this makes sense.

> > +
> > +     list_for_each_entry_rcu(bond, &domain->bonds, list) {
> > +             iommu = dev_to_iommu(bond->dev);
> > +
> > +             riscv_iommu_cmd_iofence(&cmd);
> > +             riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_QUEUE_TIMEOUT);
> > +     }
> > +     rcu_read_unlock();
> > +}
> > +
>
> > @@ -787,12 +870,390 @@ static int riscv_iommu_attach_domain(struct riscv_iommu_device *iommu,
> >               xchg64(&dc->ta, ta);
> >               xchg64(&dc->tc, tc);
> >
> > -             /* Device context invalidation will be required. Ignoring for now. */
> > +             if (!(tc & RISCV_IOMMU_DC_TC_V))
> > +                     continue;
>
> No negative caching in HW?
>
No. Disallowed by the spec.

> > +             /* Invalidate device context cache */
> > +             riscv_iommu_cmd_iodir_inval_ddt(&cmd);
> > +             riscv_iommu_cmd_iodir_set_did(&cmd, fwspec->ids[i]);
> > +             riscv_iommu_cmd_send(iommu, &cmd, 0);
> > +
> > +             if (FIELD_GET(RISCV_IOMMU_PC_FSC_MODE, fsc) == RISCV_IOMMU_DC_FSC_MODE_BARE)
> > +                     continue;
> > +
> > +             /* Invalidate last valid PSCID */
> > +             riscv_iommu_cmd_inval_vma(&cmd);
> > +             riscv_iommu_cmd_inval_set_pscid(&cmd, FIELD_GET(RISCV_IOMMU_DC_TA_PSCID, ta));
> > +             riscv_iommu_cmd_send(iommu, &cmd, 0);
> > +     }
> > +
> > +     /* Synchronize directory update */
> > +     riscv_iommu_cmd_iofence(&cmd);
> > +     riscv_iommu_cmd_send(iommu, &cmd, RISCV_IOMMU_IOTINVAL_TIMEOUT);
> > +
> > +     /* Track domain to devices mapping. */
> > +     if (bond)
> > +             list_add_rcu(&bond->list, &domain->bonds);
>
> This is in the wrong order, the invalidation on the pscid needs to
> start before the pscid is loaded into HW in the first place otherwise
> concurrent invalidations may miss HW updates.
>
> > +
> > +     /* Remove tracking from previous domain, if needed. */
> > +     iommu_domain = iommu_get_domain_for_dev(dev);
> > +     if (iommu_domain && !!(iommu_domain->type & __IOMMU_DOMAIN_PAGING)) {
>
> No need for !!, && is already booleanizing
>
> > +             domain = iommu_domain_to_riscv(iommu_domain);
> > +             bond = NULL;
> > +             rcu_read_lock();
> > +             list_for_each_entry_rcu(b, &domain->bonds, list) {
> > +                     if (b->dev == dev) {
> > +                             bond = b;
> > +                             break;
> > +                     }
> > +             }
> > +             rcu_read_unlock();
> > +
> > +             if (bond) {
> > +                     list_del_rcu(&bond->list);
> > +                     kfree_rcu(bond, rcu);
> > +             }
> > +     }
> > +
> > +     return 0;
> > +}
>
> > +static inline size_t get_page_size(size_t size)
> > +{
> > +     if (size >= IOMMU_PAGE_SIZE_512G)
> > +             return IOMMU_PAGE_SIZE_512G;
> > +     if (size >= IOMMU_PAGE_SIZE_1G)
> > +             return IOMMU_PAGE_SIZE_1G;
> > +     if (size >= IOMMU_PAGE_SIZE_2M)
> > +             return IOMMU_PAGE_SIZE_2M;
> > +     return IOMMU_PAGE_SIZE_4K;
> > +}
> > +
> > +#define _io_pte_present(pte) ((pte) & (_PAGE_PRESENT | _PAGE_PROT_NONE))
> > +#define _io_pte_leaf(pte)    ((pte) & _PAGE_LEAF)
> > +#define _io_pte_none(pte)    ((pte) == 0)
> > +#define _io_pte_entry(pn, prot)      ((_PAGE_PFN_MASK & ((pn) << _PAGE_PFN_SHIFT)) | (prot))
> > +
> > +static void riscv_iommu_pte_free(struct riscv_iommu_domain *domain,
> > +                              unsigned long pte, struct list_head *freelist)
> > +{
> > +     unsigned long *ptr;
> > +     int i;
> > +
> > +     if (!_io_pte_present(pte) || _io_pte_leaf(pte))
> > +             return;
> > +
> > +     ptr = (unsigned long *)pfn_to_virt(__page_val_to_pfn(pte));
> > +
> > +     /* Recursively free all sub page table pages */
> > +     for (i = 0; i < PTRS_PER_PTE; i++) {
> > +             pte = READ_ONCE(ptr[i]);
> > +             if (!_io_pte_none(pte) && cmpxchg_relaxed(ptr + i, pte, 0) == pte)
> > +                     riscv_iommu_pte_free(domain, pte, freelist);
> > +     }
> > +
> > +     if (freelist)
> > +             list_add_tail(&virt_to_page(ptr)->lru, freelist);
> > +     else
> > +             free_page((unsigned long)ptr);
> > +}
>
> Consider putting the page table handling in its own file?
>

It was in separate file at some point, but merged to iommu.c, as its
simple enough with ~300 lines only. Probably not worth separating this
out.

> > +static int riscv_iommu_attach_paging_domain(struct iommu_domain *iommu_domain,
> > +                                         struct device *dev)
> > +{
> > +     struct riscv_iommu_device *iommu = dev_to_iommu(dev);
> > +     struct riscv_iommu_domain *domain = iommu_domain_to_riscv(iommu_domain);
> > +     struct page *page;
> > +
> > +     if (!riscv_iommu_pt_supported(iommu, domain->pgd_mode))
> > +             return -ENODEV;
> > +
> > +     domain->numa_node = dev_to_node(iommu->dev);
> > +     domain->amo_enabled = !!(iommu->caps & RISCV_IOMMU_CAP_AMO_HWAD);
> > +
> > +     if (!domain->pgd_root) {
> > +             page = alloc_pages_node(domain->numa_node,
> > +                                     GFP_KERNEL_ACCOUNT | __GFP_ZERO, 0);
> > +             if (!page)
> > +                     return -ENOMEM;
> > +             domain->pgd_root = (unsigned long)page_to_virt(page);
>
> The pgd_root should be allocated by the alloc_paging function, not
> during attach. There is no locking here that will protect against
> concurrent attach and also map before attach should work.
>
> You can pick up the numa affinity from the alloc paging dev pointer
> (note it may be null still in some cases)
>

Good point. Thanks. Will send update shortly with v3.

> Jason

Ack to all other comments, thank you!
Best,
- Tomasz

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 7/7] iommu/riscv: Paging domain support
  2024-04-24 23:30     ` Tomasz Jeznach
@ 2024-04-24 23:39       ` Jason Gunthorpe
  2024-04-24 23:54         ` Tomasz Jeznach
  0 siblings, 1 reply; 30+ messages in thread
From: Jason Gunthorpe @ 2024-04-24 23:39 UTC (permalink / raw)
  To: Tomasz Jeznach
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux

On Wed, Apr 24, 2024 at 04:30:45PM -0700, Tomasz Jeznach wrote:
> > > @@ -46,6 +46,10 @@ MODULE_LICENSE("GPL");
> > >  #define dev_to_iommu(dev) \
> > >       container_of((dev)->iommu->iommu_dev, struct riscv_iommu_device, iommu)
> > >
> > > +/* IOMMU PSCID allocation namespace. */
> > > +static DEFINE_IDA(riscv_iommu_pscids);
> > > +#define RISCV_IOMMU_MAX_PSCID                BIT(20)
> > > +
> >
> > You may consider putting this IDA in the riscv_iommu_device() and move
> > the pscid from the domain to the bond?
> >
> 
> I've been considering containing IDA inside riscv_iommu_device at some
> point,  but it made PCSID management more complicated.  In the follow
> up patches it is desired for PSCID to be unique across all IOMMUs in
> the system (within guest's GSCID), as the protection domains might
> (and will) be shared between more than single IOMMU device.

The PCSID isn't scoped under the GSCID? That doesn't sound very good,
it means VM's can't direct issue invalidation with their local view of
the PCSID space?

> > This seems suboptimal, you probably want to copy the new design that
> > Intel is doing where you allocate "bonds" that are already
> > de-duplicated. Ie if I have 10 devices on the same iommu sharing the
> > domain the above will invalidate the PSCID 10 times. It should only be
> > done once.
> >
> > ie add a "bond" for the (iommu,pscid) and refcount that based on how
> > many devices are used. Then another "bond" for the ATS stuff eventually.
> >
> 
> Agree, not perfect to send duplicate invalidations.
> This should improve with follow up patchsets introducing of SVA
> (reusing the same, extended bond structure) and update to send IOTLB
> range invalidations.
> 
> For this change I've decided to go with as simple as possible
> implementation and over-invalidate for domains with multiple devices
> attached. Hope this makes sense.

It is fine as long as you do fix it..

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 7/7] iommu/riscv: Paging domain support
  2024-04-24 23:39       ` Jason Gunthorpe
@ 2024-04-24 23:54         ` Tomasz Jeznach
  2024-04-25  0:48           ` Jason Gunthorpe
  0 siblings, 1 reply; 30+ messages in thread
From: Tomasz Jeznach @ 2024-04-24 23:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux

On Wed, Apr 24, 2024 at 4:39 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Wed, Apr 24, 2024 at 04:30:45PM -0700, Tomasz Jeznach wrote:
> > > > @@ -46,6 +46,10 @@ MODULE_LICENSE("GPL");
> > > >  #define dev_to_iommu(dev) \
> > > >       container_of((dev)->iommu->iommu_dev, struct riscv_iommu_device, iommu)
> > > >
> > > > +/* IOMMU PSCID allocation namespace. */
> > > > +static DEFINE_IDA(riscv_iommu_pscids);
> > > > +#define RISCV_IOMMU_MAX_PSCID                BIT(20)
> > > > +
> > >
> > > You may consider putting this IDA in the riscv_iommu_device() and move
> > > the pscid from the domain to the bond?
> > >
> >
> > I've been considering containing IDA inside riscv_iommu_device at some
> > point,  but it made PCSID management more complicated.  In the follow
> > up patches it is desired for PSCID to be unique across all IOMMUs in
> > the system (within guest's GSCID), as the protection domains might
> > (and will) be shared between more than single IOMMU device.
>
> The PCSID isn't scoped under the GSCID? That doesn't sound very good,
> it means VM's can't direct issue invalidation with their local view of
> the PCSID space?
>

To clarify: PSCID namespace is per GSCID.
However there might be more than one IOMMU in a single system sharing
the same GSCID, and with e.g. SVA domains attached to more than one
IOMMU. It was simpler to manage PCSID globally.

PSCID management for the VM assigned GSCID will be the VM's responsibility.

> > > This seems suboptimal, you probably want to copy the new design that
> > > Intel is doing where you allocate "bonds" that are already
> > > de-duplicated. Ie if I have 10 devices on the same iommu sharing the
> > > domain the above will invalidate the PSCID 10 times. It should only be
> > > done once.
> > >
> > > ie add a "bond" for the (iommu,pscid) and refcount that based on how
> > > many devices are used. Then another "bond" for the ATS stuff eventually.
> > >
> >
> > Agree, not perfect to send duplicate invalidations.
> > This should improve with follow up patchsets introducing of SVA
> > (reusing the same, extended bond structure) and update to send IOTLB
> > range invalidations.
> >
> > For this change I've decided to go with as simple as possible
> > implementation and over-invalidate for domains with multiple devices
> > attached. Hope this makes sense.
>
> It is fine as long as you do fix it..
>

SG. I'll have a second look if it can be fixed sooner.

> Jason

Best,
- Tomasz

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 7/7] iommu/riscv: Paging domain support
  2024-04-24 23:54         ` Tomasz Jeznach
@ 2024-04-25  0:48           ` Jason Gunthorpe
  0 siblings, 0 replies; 30+ messages in thread
From: Jason Gunthorpe @ 2024-04-25  0:48 UTC (permalink / raw)
  To: Tomasz Jeznach
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux

On Wed, Apr 24, 2024 at 04:54:01PM -0700, Tomasz Jeznach wrote:
> On Wed, Apr 24, 2024 at 4:39 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Wed, Apr 24, 2024 at 04:30:45PM -0700, Tomasz Jeznach wrote:
> > > > > @@ -46,6 +46,10 @@ MODULE_LICENSE("GPL");
> > > > >  #define dev_to_iommu(dev) \
> > > > >       container_of((dev)->iommu->iommu_dev, struct riscv_iommu_device, iommu)
> > > > >
> > > > > +/* IOMMU PSCID allocation namespace. */
> > > > > +static DEFINE_IDA(riscv_iommu_pscids);
> > > > > +#define RISCV_IOMMU_MAX_PSCID                BIT(20)
> > > > > +
> > > >
> > > > You may consider putting this IDA in the riscv_iommu_device() and move
> > > > the pscid from the domain to the bond?
> > > >
> > >
> > > I've been considering containing IDA inside riscv_iommu_device at some
> > > point,  but it made PCSID management more complicated.  In the follow
> > > up patches it is desired for PSCID to be unique across all IOMMUs in
> > > the system (within guest's GSCID), as the protection domains might
> > > (and will) be shared between more than single IOMMU device.
> >
> > The PCSID isn't scoped under the GSCID? That doesn't sound very good,
> > it means VM's can't direct issue invalidation with their local view of
> > the PCSID space?
> >
> 
> To clarify: PSCID namespace is per GSCID.
> However there might be more than one IOMMU in a single system sharing
> the same GSCID

I assume this is because GSCID ends up shared with kvm?

> and with e.g. SVA domains attached to more than one
> IOMMU. It was simpler to manage PCSID globally.

If the PSCID is moved into the invalidation list like Intel structured
it then it doesn't matter for SVA, or really anything.

AFAIK the only reason to do otherwise is if you have a reason to share
the ID with the CPU/MM and the IOMMU probably to coordinate
invalidations. But if you do this then you really just always want to
use the MM's global ID space in the first place...

So I'm not sure :)

Jason

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 2/7] iommu/riscv: Add RISC-V IOMMU platform device driver
  2024-04-24 21:59     ` Tomasz Jeznach
@ 2024-04-25 11:23       ` Robin Murphy
  0 siblings, 0 replies; 30+ messages in thread
From: Robin Murphy @ 2024-04-25 11:23 UTC (permalink / raw)
  To: Tomasz Jeznach
  Cc: Joerg Roedel, Will Deacon, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux

On 24/04/2024 10:59 pm, Tomasz Jeznach wrote:
[...]
>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>> index 2657f9eae84c..051599c76585 100644
>>> --- a/MAINTAINERS
>>> +++ b/MAINTAINERS
>>> @@ -18972,6 +18972,12 @@ L:   iommu@lists.linux.dev
>>>    L:  linux-riscv@lists.infradead.org
>>>    S:  Maintained
>>>    F:  Documentation/devicetree/bindings/iommu/riscv,iommu.yaml
>>> +F:   drivers/iommu/riscv/Kconfig
>>> +F:   drivers/iommu/riscv/Makefile
>>> +F:   drivers/iommu/riscv/iommu-bits.h
>>> +F:   drivers/iommu/riscv/iommu-platform.c
>>> +F:   drivers/iommu/riscv/iommu.c
>>> +F:   drivers/iommu/riscv/iommu.h
>>
>> I'm pretty sure a single "F: drivers/iommu/riscv/" pattern will suffice.
>>
> 
> Correct. But will required a workaround for pretty naive MAINTAINERS update
> check in scripts/checkpatch.pl:3014 in next patch.

As long as what you're doing is clearly reasonable to humans, the 
correct workaround for any checkpatch complaint is to ignore checkpatch.

[...]
>>> +static int riscv_iommu_init_check(struct riscv_iommu_device *iommu)
>>> +{
>>> +     u64 ddtp;
>>> +
>>> +     /* Hardware must be configured in OFF | BARE mode at system initialization. */
>>> +     riscv_iommu_readq_timeout(iommu, RISCV_IOMMU_REG_DDTP,
>>> +                               ddtp, !(ddtp & RISCV_IOMMU_DDTP_BUSY),
>>> +                               10, RISCV_IOMMU_DDTP_TIMEOUT);
>>> +     if (FIELD_GET(RISCV_IOMMU_DDTP_MODE, ddtp) > RISCV_IOMMU_DDTP_MODE_BARE)
>>> +             return -EBUSY;
>>
>> It looks like RISC-V already supports kdump, so you probably want to be
>> prepared to find the IOMMU with its pants down and deal with it from day
>> one.
>>
> 
> This is the simplest check/fail for the kexec and/or boot loaders
> leaving IOMMU translations active.
> I've been already looking into kexec path to quiesce all devices and
> IOMMU in shutdown path.
> I'm not convinced it's ready for the prime time on RISC-V, will
> address this in follow up patches.

Yeah, for regular kexec you definitely want an orderly shutdown of the 
IOMMU, although there's still a bit of an open question about whether 
it's better to actively block any remaining traffic from devices whose 
drivers haven't cleanly stopped them. It's in the kdump crash kernel 
case that you can't have any expectations and need to be able to recover 
the IOMMU into a usable state, since it's likely to be in the way of 
devices which the crash kernel wants to take over and use.

Thanks,
Robin.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/7] dt-bindings: iommu: riscv: Add bindings for RISC-V IOMMU
  2024-04-24 22:37     ` Tomasz Jeznach
@ 2024-04-25 17:11       ` Conor Dooley
  0 siblings, 0 replies; 30+ messages in thread
From: Conor Dooley @ 2024-04-25 17:11 UTC (permalink / raw)
  To: Tomasz Jeznach
  Cc: Joerg Roedel, Will Deacon, Robin Murphy, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Anup Patel, Sunil V L, Nick Kossifidis,
	Sebastien Boeuf, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
	devicetree, iommu, linux-riscv, linux-kernel, linux

[-- Attachment #1: Type: text/plain, Size: 439 bytes --]

On Wed, Apr 24, 2024 at 03:37:14PM -0700, Tomasz Jeznach wrote:
> Makes sense.  Will update to something like below:
>   compatible:
>     oneOf:
>       - items:
>           - enum:
>               - qemu,iommu
>           - const: riscv,iommu
>       - items:
>           - enum:
>               - pci1efd,edf1
>           - const: riscv,pci-iommu

That seems reasonable to me, addresses both what Rob and I pointed out.

Thanks,
Conor.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2024-04-25 17:11 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-18 16:32 [PATCH v2 0/7] Linux RISC-V IOMMU Support Tomasz Jeznach
2024-04-18 16:32 ` [PATCH v2 1/7] dt-bindings: iommu: riscv: Add bindings for RISC-V IOMMU Tomasz Jeznach
2024-04-18 17:04   ` Conor Dooley
2024-04-24 22:37     ` Tomasz Jeznach
2024-04-25 17:11       ` Conor Dooley
2024-04-22 14:04   ` Rob Herring
2024-04-18 16:32 ` [PATCH v2 2/7] iommu/riscv: Add RISC-V IOMMU platform device driver Tomasz Jeznach
2024-04-18 21:22   ` Robin Murphy
2024-04-24 21:59     ` Tomasz Jeznach
2024-04-25 11:23       ` Robin Murphy
2024-04-18 16:32 ` [PATCH v2 3/7] iommu/riscv: Add RISC-V IOMMU PCIe " Tomasz Jeznach
2024-04-18 22:07   ` Robin Murphy
2024-04-18 16:32 ` [PATCH v2 4/7] iommu/riscv: Enable IOMMU registration and device probe Tomasz Jeznach
2024-04-18 16:32 ` [PATCH v2 5/7] iommu/riscv: Device directory management Tomasz Jeznach
2024-04-19 12:40   ` Jason Gunthorpe
2024-04-24 23:01     ` Tomasz Jeznach
2024-04-24 23:07       ` Jason Gunthorpe
2024-04-22  5:11   ` Baolu Lu
2024-04-24 23:07     ` Tomasz Jeznach
2024-04-18 16:32 ` [PATCH v2 6/7] iommu/riscv: Command and fault queue support Tomasz Jeznach
2024-04-18 16:32 ` [PATCH v2 7/7] iommu/riscv: Paging domain support Tomasz Jeznach
2024-04-19 12:56   ` Jason Gunthorpe
2024-04-22  7:40     ` Baolu Lu
2024-04-24 23:30     ` Tomasz Jeznach
2024-04-24 23:39       ` Jason Gunthorpe
2024-04-24 23:54         ` Tomasz Jeznach
2024-04-25  0:48           ` Jason Gunthorpe
2024-04-22  5:21   ` Baolu Lu
2024-04-22 19:30     ` Jason Gunthorpe
2024-04-23 17:00   ` Andrew Jones

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).