All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-07 20:38 ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes,
	dledford
  Cc: corbet, james.l.morris, serge, haggaie, ogerlitz, matanb, raindel,
	akpm, linux-security-module, pandit.parav

Currently user space applications can easily take away all the rdma
device specific resources such as AH, CQ, QP, MR etc. Due to which other
applications in other cgroup or kernel space ULPs may not even get chance
to allocate any rdma resources.

This patch-set allows limiting rdma resources to set of processes.
It extend device cgroup controller for limiting rdma device limits.

With this patch, user verbs module queries rdma device cgroup controller
to query process's limit to consume such resource. It uncharge resource 
counter after resource is being freed.

It extends the task structure to hold the statistic information about process's 
rdma resource usage so that when process migrates from one to other controller,
right amount of resources can be migrated from one to other cgroup.

Future patches will support RDMA flows resource and will be enhanced further
to enforce limit of other resources and capabilities.

Parav Pandit (7):
  devcg: Added user option to rdma resource tracking.
  devcg: Added rdma resource tracking module.
  devcg: Added infrastructure for rdma device cgroup.
  devcg: Added rdma resource tracker object per task
  devcg: device cgroup's extension for RDMA resource.
  devcg: Added support to use RDMA device cgroup.
  devcg: Added Documentation of RDMA device cgroup.

 Documentation/cgroups/devices.txt     |  32 ++-
 drivers/infiniband/core/uverbs_cmd.c  | 139 +++++++++--
 drivers/infiniband/core/uverbs_main.c |  39 +++-
 include/linux/device_cgroup.h         |  53 +++++
 include/linux/device_rdma_cgroup.h    |  83 +++++++
 include/linux/sched.h                 |  12 +-
 init/Kconfig                          |  12 +
 security/Makefile                     |   1 +
 security/device_cgroup.c              | 119 +++++++---
 security/device_rdma_cgroup.c         | 422 ++++++++++++++++++++++++++++++++++
 10 files changed, 850 insertions(+), 62 deletions(-)
 create mode 100644 include/linux/device_rdma_cgroup.h
 create mode 100644 security/device_rdma_cgroup.c

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-07 20:38 ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w,
	dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, haggaie-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA,
	pandit.parav-Re5JQEeQqe8AvxtiuMwx3w

Currently user space applications can easily take away all the rdma
device specific resources such as AH, CQ, QP, MR etc. Due to which other
applications in other cgroup or kernel space ULPs may not even get chance
to allocate any rdma resources.

This patch-set allows limiting rdma resources to set of processes.
It extend device cgroup controller for limiting rdma device limits.

With this patch, user verbs module queries rdma device cgroup controller
to query process's limit to consume such resource. It uncharge resource 
counter after resource is being freed.

It extends the task structure to hold the statistic information about process's 
rdma resource usage so that when process migrates from one to other controller,
right amount of resources can be migrated from one to other cgroup.

Future patches will support RDMA flows resource and will be enhanced further
to enforce limit of other resources and capabilities.

Parav Pandit (7):
  devcg: Added user option to rdma resource tracking.
  devcg: Added rdma resource tracking module.
  devcg: Added infrastructure for rdma device cgroup.
  devcg: Added rdma resource tracker object per task
  devcg: device cgroup's extension for RDMA resource.
  devcg: Added support to use RDMA device cgroup.
  devcg: Added Documentation of RDMA device cgroup.

 Documentation/cgroups/devices.txt     |  32 ++-
 drivers/infiniband/core/uverbs_cmd.c  | 139 +++++++++--
 drivers/infiniband/core/uverbs_main.c |  39 +++-
 include/linux/device_cgroup.h         |  53 +++++
 include/linux/device_rdma_cgroup.h    |  83 +++++++
 include/linux/sched.h                 |  12 +-
 init/Kconfig                          |  12 +
 security/Makefile                     |   1 +
 security/device_cgroup.c              | 119 +++++++---
 security/device_rdma_cgroup.c         | 422 ++++++++++++++++++++++++++++++++++
 10 files changed, 850 insertions(+), 62 deletions(-)
 create mode 100644 include/linux/device_rdma_cgroup.h
 create mode 100644 security/device_rdma_cgroup.c

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 95+ messages in thread

* [PATCH 1/7] devcg: Added user option to rdma resource tracking.
@ 2015-09-07 20:38   ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes,
	dledford
  Cc: corbet, james.l.morris, serge, haggaie, ogerlitz, matanb, raindel,
	akpm, linux-security-module, pandit.parav

Added user configuration option to enable/disable RDMA resource tracking
feature of device cgroup as sub module.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 init/Kconfig | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index 2184b34..089db85 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -977,6 +977,18 @@ config CGROUP_DEVICE
 	  Provides a cgroup implementing whitelists for devices which
 	  a process in the cgroup can mknod or open.
 
+config CGROUP_RDMA_RESOURCE
+	bool "RDMA Resource Controller for cgroups"
+	depends on CGROUP_DEVICE
+	default n
+	help
+	  This option enables limiting rdma resources for a device cgroup.
+	  Using this option, user space processes can be limited to use
+	  limited number of RDMA resources such as MR, PD, QP, AH, FLOW, CQ
+	  etc.
+
+	  Say N if unsure.
+
 config CPUSETS
 	bool "Cpuset support"
 	help
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 1/7] devcg: Added user option to rdma resource tracking.
@ 2015-09-07 20:38   ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w,
	dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, haggaie-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA,
	pandit.parav-Re5JQEeQqe8AvxtiuMwx3w

Added user configuration option to enable/disable RDMA resource tracking
feature of device cgroup as sub module.

Signed-off-by: Parav Pandit <pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 init/Kconfig | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index 2184b34..089db85 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -977,6 +977,18 @@ config CGROUP_DEVICE
 	  Provides a cgroup implementing whitelists for devices which
 	  a process in the cgroup can mknod or open.
 
+config CGROUP_RDMA_RESOURCE
+	bool "RDMA Resource Controller for cgroups"
+	depends on CGROUP_DEVICE
+	default n
+	help
+	  This option enables limiting rdma resources for a device cgroup.
+	  Using this option, user space processes can be limited to use
+	  limited number of RDMA resources such as MR, PD, QP, AH, FLOW, CQ
+	  etc.
+
+	  Say N if unsure.
+
 config CPUSETS
 	bool "Cpuset support"
 	help
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 2/7] devcg: Added rdma resource tracking module.
@ 2015-09-07 20:38   ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes,
	dledford
  Cc: corbet, james.l.morris, serge, haggaie, ogerlitz, matanb, raindel,
	akpm, linux-security-module, pandit.parav

Added RDMA resource tracking object of device cgroup.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 security/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/security/Makefile b/security/Makefile
index c9bfbc8..c9ad56d 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -23,6 +23,7 @@ obj-$(CONFIG_SECURITY_TOMOYO)		+= tomoyo/
 obj-$(CONFIG_SECURITY_APPARMOR)		+= apparmor/
 obj-$(CONFIG_SECURITY_YAMA)		+= yama/
 obj-$(CONFIG_CGROUP_DEVICE)		+= device_cgroup.o
+obj-$(CONFIG_CGROUP_RDMA_RESOURCE)	+= device_rdma_cgroup.o
 
 # Object integrity file lists
 subdir-$(CONFIG_INTEGRITY)		+= integrity
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 2/7] devcg: Added rdma resource tracking module.
@ 2015-09-07 20:38   ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w,
	dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, haggaie-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA,
	pandit.parav-Re5JQEeQqe8AvxtiuMwx3w

Added RDMA resource tracking object of device cgroup.

Signed-off-by: Parav Pandit <pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 security/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/security/Makefile b/security/Makefile
index c9bfbc8..c9ad56d 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -23,6 +23,7 @@ obj-$(CONFIG_SECURITY_TOMOYO)		+= tomoyo/
 obj-$(CONFIG_SECURITY_APPARMOR)		+= apparmor/
 obj-$(CONFIG_SECURITY_YAMA)		+= yama/
 obj-$(CONFIG_CGROUP_DEVICE)		+= device_cgroup.o
+obj-$(CONFIG_CGROUP_RDMA_RESOURCE)	+= device_rdma_cgroup.o
 
 # Object integrity file lists
 subdir-$(CONFIG_INTEGRITY)		+= integrity
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 3/7] devcg: Added infrastructure for rdma device cgroup.
  2015-09-07 20:38 ` Parav Pandit
                   ` (2 preceding siblings ...)
  (?)
@ 2015-09-07 20:38 ` Parav Pandit
  2015-09-08  5:31     ` Haggai Eran
  -1 siblings, 1 reply; 95+ messages in thread
From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes,
	dledford
  Cc: corbet, james.l.morris, serge, haggaie, ogerlitz, matanb, raindel,
	akpm, linux-security-module, pandit.parav

1. Moved necessary functions and data structures to header file to
reuse them at device cgroup white list functionality and for rdma
functionality.
2. Added infrastructure to invoke RDMA specific routines for resource
configuration, query and during fork handling.
3. Added sysfs interface files for configuring max limit of each rdma
resource and one file for querying controllers current resource usage.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 include/linux/device_cgroup.h |  53 +++++++++++++++++++
 security/device_cgroup.c      | 119 +++++++++++++++++++++++++++++-------------
 2 files changed, 136 insertions(+), 36 deletions(-)

diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
index 8b64221..cdbdd60 100644
--- a/include/linux/device_cgroup.h
+++ b/include/linux/device_cgroup.h
@@ -1,6 +1,57 @@
+#ifndef _DEVICE_CGROUP
+#define _DEVICE_CGROUP
+
 #include <linux/fs.h>
+#include <linux/cgroup.h>
+#include <linux/device_rdma_cgroup.h>
 
 #ifdef CONFIG_CGROUP_DEVICE
+
+enum devcg_behavior {
+	DEVCG_DEFAULT_NONE,
+	DEVCG_DEFAULT_ALLOW,
+	DEVCG_DEFAULT_DENY,
+};
+
+/*
+ * exception list locking rules:
+ * hold devcgroup_mutex for update/read.
+ * hold rcu_read_lock() for read.
+ */
+
+struct dev_exception_item {
+	u32 major, minor;
+	short type;
+	short access;
+	struct list_head list;
+	struct rcu_head rcu;
+};
+
+struct dev_cgroup {
+	struct cgroup_subsys_state css;
+	struct list_head exceptions;
+	enum devcg_behavior behavior;
+
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+	struct devcgroup_rdma rdma;
+#endif
+};
+
+static inline struct dev_cgroup *css_to_devcgroup(struct cgroup_subsys_state *s)
+{
+	return s ? container_of(s, struct dev_cgroup, css) : NULL;
+}
+
+static inline struct dev_cgroup *parent_devcgroup(struct dev_cgroup *dev_cg)
+{
+	return css_to_devcgroup(dev_cg->css.parent);
+}
+
+static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
+{
+	return css_to_devcgroup(task_css(task, devices_cgrp_id));
+}
+
 extern int __devcgroup_inode_permission(struct inode *inode, int mask);
 extern int devcgroup_inode_mknod(int mode, dev_t dev);
 static inline int devcgroup_inode_permission(struct inode *inode, int mask)
@@ -17,3 +68,5 @@ static inline int devcgroup_inode_permission(struct inode *inode, int mask)
 static inline int devcgroup_inode_mknod(int mode, dev_t dev)
 { return 0; }
 #endif
+
+#endif
diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 188c1d2..a0b3239 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -25,42 +25,6 @@
 
 static DEFINE_MUTEX(devcgroup_mutex);
 
-enum devcg_behavior {
-	DEVCG_DEFAULT_NONE,
-	DEVCG_DEFAULT_ALLOW,
-	DEVCG_DEFAULT_DENY,
-};
-
-/*
- * exception list locking rules:
- * hold devcgroup_mutex for update/read.
- * hold rcu_read_lock() for read.
- */
-
-struct dev_exception_item {
-	u32 major, minor;
-	short type;
-	short access;
-	struct list_head list;
-	struct rcu_head rcu;
-};
-
-struct dev_cgroup {
-	struct cgroup_subsys_state css;
-	struct list_head exceptions;
-	enum devcg_behavior behavior;
-};
-
-static inline struct dev_cgroup *css_to_devcgroup(struct cgroup_subsys_state *s)
-{
-	return s ? container_of(s, struct dev_cgroup, css) : NULL;
-}
-
-static inline struct dev_cgroup *task_devcgroup(struct task_struct *task)
-{
-	return css_to_devcgroup(task_css(task, devices_cgrp_id));
-}
-
 /*
  * called under devcgroup_mutex
  */
@@ -223,6 +187,9 @@ devcgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	INIT_LIST_HEAD(&dev_cgroup->exceptions);
 	dev_cgroup->behavior = DEVCG_DEFAULT_NONE;
 
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+	init_devcgroup_rdma_tracker(dev_cgroup);
+#endif
 	return &dev_cgroup->css;
 }
 
@@ -234,6 +201,25 @@ static void devcgroup_css_free(struct cgroup_subsys_state *css)
 	kfree(dev_cgroup);
 }
 
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+static int devcgroup_can_attach(struct cgroup_subsys_state *dst_css,
+				struct cgroup_taskset *tset)
+{
+	return devcgroup_rdma_can_attach(dst_css, tset);
+}
+
+static void devcgroup_cancel_attach(struct cgroup_subsys_state *dst_css,
+				    struct cgroup_taskset *tset)
+{
+	devcgroup_cancel_attach(dst_css, tset);
+}
+
+static void devcgroup_fork(struct task_struct *task, void *priv)
+{
+	devcgroup_rdma_fork(task, priv);
+}
+#endif
+
 #define DEVCG_ALLOW 1
 #define DEVCG_DENY 2
 #define DEVCG_LIST 3
@@ -788,6 +774,62 @@ static struct cftype dev_cgroup_files[] = {
 		.seq_show = devcgroup_seq_show,
 		.private = DEVCG_LIST,
 	},
+
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+	{
+		.name = "rdma.resource.uctx.max",
+		.write = devcgroup_rdma_set_max_resource,
+		.seq_show = devcgroup_rdma_get_max_resource,
+		.private = DEVCG_RDMA_RES_TYPE_UCTX,
+	},
+	{
+		.name = "rdma.resource.cq.max",
+		.write = devcgroup_rdma_set_max_resource,
+		.seq_show = devcgroup_rdma_get_max_resource,
+		.private = DEVCG_RDMA_RES_TYPE_CQ,
+	},
+	{
+		.name = "rdma.resource.ah.max",
+		.write = devcgroup_rdma_set_max_resource,
+		.seq_show = devcgroup_rdma_get_max_resource,
+		.private = DEVCG_RDMA_RES_TYPE_AH,
+	},
+	{
+		.name = "rdma.resource.pd.max",
+		.write = devcgroup_rdma_set_max_resource,
+		.seq_show = devcgroup_rdma_get_max_resource,
+		.private = DEVCG_RDMA_RES_TYPE_PD,
+	},
+	{
+		.name = "rdma.resource.flow.max",
+		.write = devcgroup_rdma_set_max_resource,
+		.seq_show = devcgroup_rdma_get_max_resource,
+		.private = DEVCG_RDMA_RES_TYPE_FLOW,
+	},
+	{
+		.name = "rdma.resource.srq.max",
+		.write = devcgroup_rdma_set_max_resource,
+		.seq_show = devcgroup_rdma_get_max_resource,
+		.private = DEVCG_RDMA_RES_TYPE_SRQ,
+	},
+	{
+		.name = "rdma.resource.qp.max",
+		.write = devcgroup_rdma_set_max_resource,
+		.seq_show = devcgroup_rdma_get_max_resource,
+		.private = DEVCG_RDMA_RES_TYPE_QP,
+	},
+	{
+		.name = "rdma.resource.mr.max",
+		.write = devcgroup_rdma_set_max_resource,
+		.seq_show = devcgroup_rdma_get_max_resource,
+		.private = DEVCG_RDMA_RES_TYPE_MR,
+	},
+	{
+		.name = "rdma.resource.usage",
+		.seq_show = devcgroup_rdma_show_usage,
+		.private = DEVCG_RDMA_LIST_USAGE,
+	},
+#endif
 	{ }	/* terminate */
 };
 
@@ -796,6 +838,11 @@ struct cgroup_subsys devices_cgrp_subsys = {
 	.css_free = devcgroup_css_free,
 	.css_online = devcgroup_online,
 	.css_offline = devcgroup_offline,
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+	.fork = devcgroup_fork,
+	.can_attach = devcgroup_can_attach,
+	.cancel_attach = devcgroup_cancel_attach,
+#endif
 	.legacy_cftypes = dev_cgroup_files,
 };
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 4/7] devcg: Added rdma resource tracker object per task
  2015-09-07 20:38 ` Parav Pandit
                   ` (3 preceding siblings ...)
  (?)
@ 2015-09-07 20:38 ` Parav Pandit
  2015-09-08  5:48     ` Haggai Eran
  -1 siblings, 1 reply; 95+ messages in thread
From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes,
	dledford
  Cc: corbet, james.l.morris, serge, haggaie, ogerlitz, matanb, raindel,
	akpm, linux-security-module, pandit.parav

Added RDMA device resource tracking object per task.
Added comments to capture usage of task lock by device cgroup
for rdma.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 include/linux/sched.h | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae21f15..a5f79b6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1334,6 +1334,8 @@ union rcu_special {
 };
 struct rcu_node;
 
+struct task_rdma_res_counter;
+
 enum perf_event_task_context {
 	perf_invalid_context = -1,
 	perf_hw_context = 0,
@@ -1637,6 +1639,14 @@ struct task_struct {
 	struct css_set __rcu *cgroups;
 	/* cg_list protected by css_set_lock and tsk->alloc_lock */
 	struct list_head cg_list;
+
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+	/* RDMA resource accounting counters, allocated only
+	 * when RDMA resources are created by a task.
+	 */
+	struct task_rdma_res_counter *rdma_res_counter;
+#endif
+
 #endif
 #ifdef CONFIG_FUTEX
 	struct robust_list_head __user *robust_list;
@@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct task_struct *p)
  * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
  * subscriptions and synchronises with wait4().  Also used in procfs.  Also
  * pins the final release of task.io_context.  Also protects ->cpuset and
- * ->cgroup.subsys[]. And ->vfork_done.
+ * ->cgroup.subsys[]. Also projtects ->vfork_done and ->rdma_res_counter.
  *
  * Nests both inside and outside of read_lock(&tasklist_lock).
  * It must not be nested with write_lock_irq(&tasklist_lock),
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.
@ 2015-09-07 20:38   ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes,
	dledford
  Cc: corbet, james.l.morris, serge, haggaie, ogerlitz, matanb, raindel,
	akpm, linux-security-module, pandit.parav

Extension of device cgroup for RDMA device resources.
This implements RDMA resource tracker to limit RDMA resources such as
AH, CQ, PD, QP, MR, SRQ etc resources for processes of the cgroup.
It implements RDMA resource limit module to limit consuming RDMA
resources for processes of the cgroup.
RDMA resources are tracked on per task basis.
RDMA resources across multiple such devices are limited among multiple
processes of the owning device cgroup.

RDMA device cgroup extension returns error when user space applications
try to allocate resources more than its configured limit.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 include/linux/device_rdma_cgroup.h |  83 ++++++++
 security/device_rdma_cgroup.c      | 422 +++++++++++++++++++++++++++++++++++++
 2 files changed, 505 insertions(+)
 create mode 100644 include/linux/device_rdma_cgroup.h
 create mode 100644 security/device_rdma_cgroup.c

diff --git a/include/linux/device_rdma_cgroup.h b/include/linux/device_rdma_cgroup.h
new file mode 100644
index 0000000..a2c261b
--- /dev/null
+++ b/include/linux/device_rdma_cgroup.h
@@ -0,0 +1,83 @@
+#ifndef _DEVICE_RDMA_CGROUP_H
+#define _DEVICE_RDMA_CGROUP_H
+
+#include <linux/cgroup.h>
+
+/* RDMA resources from device cgroup perspective */
+enum devcgroup_rdma_rt {
+	DEVCG_RDMA_RES_TYPE_UCTX,
+	DEVCG_RDMA_RES_TYPE_CQ,
+	DEVCG_RDMA_RES_TYPE_PD,
+	DEVCG_RDMA_RES_TYPE_AH,
+	DEVCG_RDMA_RES_TYPE_MR,
+	DEVCG_RDMA_RES_TYPE_MW,
+	DEVCG_RDMA_RES_TYPE_SRQ,
+	DEVCG_RDMA_RES_TYPE_QP,
+	DEVCG_RDMA_RES_TYPE_FLOW,
+	DEVCG_RDMA_RES_TYPE_MAX,
+};
+
+struct ib_ucontext;
+
+#define DEVCG_RDMA_MAX_RESOURCES S32_MAX
+
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+
+#define DEVCG_RDMA_MAX_RESOURCE_STR "max"
+
+enum devcgroup_rdma_access_files {
+	DEVCG_RDMA_LIST_USAGE,
+};
+
+struct task_rdma_res_counter {
+	/* allows atomic increment of task and cgroup counters
+	 *  to avoid race with migration task.
+	 */
+	spinlock_t lock;
+	u32 usage[DEVCG_RDMA_RES_TYPE_MAX];
+};
+
+struct devcgroup_rdma_tracker {
+	int limit;
+	atomic_t usage;
+	int failcnt;
+};
+
+struct devcgroup_rdma {
+	struct devcgroup_rdma_tracker tracker[DEVCG_RDMA_RES_TYPE_MAX];
+};
+
+struct dev_cgroup;
+
+void init_devcgroup_rdma_tracker(struct dev_cgroup *dev_cg);
+ssize_t devcgroup_rdma_set_max_resource(struct kernfs_open_file *of,
+					char *buf,
+					size_t nbytes, loff_t off);
+int devcgroup_rdma_get_max_resource(struct seq_file *m, void *v);
+int devcgroup_rdma_show_usage(struct seq_file *m, void *v);
+
+int devcgroup_rdma_try_charge_resource(enum devcgroup_rdma_rt type, int num);
+void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
+				      enum devcgroup_rdma_rt type, int num);
+void devcgroup_rdma_fork(struct task_struct *task, void *priv);
+
+int devcgroup_rdma_can_attach(struct cgroup_subsys_state *css,
+			      struct cgroup_taskset *tset);
+void devcgroup_rdma_cancel_attach(struct cgroup_subsys_state *css,
+				  struct cgroup_taskset *tset);
+int devcgroup_rdma_query_resource_limit(enum devcgroup_rdma_rt type);
+#else
+
+static inline int devcgroup_rdma_try_charge_resource(
+				enum devcgroup_rdma_rt type, int num)
+{ return 0; }
+static inline void devcgroup_rdma_uncharge_resource(
+				struct ib_ucontext *ucontext,
+				enum devcgroup_rdma_rt type, int num)
+{ }
+static inline int devcgroup_rdma_query_resource_limit(
+				enum devcgroup_rdma_rt type)
+{ return DEVCG_RDMA_MAX_RESOURCES; }
+#endif
+
+#endif
diff --git a/security/device_rdma_cgroup.c b/security/device_rdma_cgroup.c
new file mode 100644
index 0000000..fb4cc59
--- /dev/null
+++ b/security/device_rdma_cgroup.c
@@ -0,0 +1,422 @@
+/*
+ * RDMA device cgroup controller of device controller cgroup.
+ *
+ * Provides a cgroup hierarchy to limit various RDMA resource allocation to a
+ * configured limit of the cgroup.
+ *
+ * Its easy for user space applications to consume of RDMA device specific
+ * hardware resources. Such resource exhaustion should be prevented so that
+ * user space applications and other kernel consumers gets chance to allocate
+ * and effectively use the hardware resources.
+ *
+ * In order to use the device rdma controller, set the maximum resource count
+ * per cgroup, which ensures that total rdma resources for processes belonging
+ * to a cgroup doesn't exceed configured limit.
+ *
+ * RDMA resource limits are hierarchical, so the highest configured limit of
+ * the hierarchy is enforced. Allowing resource limit configuration to default
+ * cgroup allows fair share to kernel space ULPs as well.
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include <linux/slab.h>
+#include <linux/device_rdma_cgroup.h>
+#include <linux/device_cgroup.h>
+#include <rdma/ib_verbs.h>
+
+/**
+ * init_devcgroup_rdma_tracker - initialize resource limits.
+ * @dev_cg: device cgroup pointer for which limits should be
+ * initialized.
+ */
+void init_devcgroup_rdma_tracker(struct dev_cgroup *dev_cg)
+{
+	int i;
+
+	for (i = 0; i < DEVCG_RDMA_RES_TYPE_MAX; i++)
+		dev_cg->rdma.tracker[i].limit = DEVCG_RDMA_MAX_RESOURCES;
+}
+
+ssize_t devcgroup_rdma_set_max_resource(struct kernfs_open_file *of,
+					char *buf,
+					size_t nbytes, loff_t off)
+{
+	struct cgroup_subsys_state *css = of_css(of);
+	struct dev_cgroup *dev_cg = css_to_devcgroup(css);
+	s64 new_limit;
+	int type = of_cft(of)->private;
+	int err;
+
+	buf = strstrip(buf);
+	if (!strcmp(buf, DEVCG_RDMA_MAX_RESOURCE_STR)) {
+		new_limit = DEVCG_RDMA_MAX_RESOURCES;
+		goto max_limit;
+	}
+
+	err = kstrtoll(buf, 0, &new_limit);
+	if (err)
+		return err;
+
+	if (new_limit < 0 || new_limit >= DEVCG_RDMA_MAX_RESOURCES)
+		return -EINVAL;
+
+max_limit:
+	dev_cg->rdma.tracker[type].limit = new_limit;
+	return nbytes;
+}
+
+int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
+{
+	struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
+	int type = seq_cft(sf)->private;
+	u32 usage;
+
+	if (dev_cg->rdma.tracker[type].limit ==	DEVCG_RDMA_MAX_RESOURCES) {
+		seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
+	} else {
+		usage = dev_cg->rdma.tracker[type].limit;
+		seq_printf(sf, "%u\n", usage);
+	}
+	return 0;
+}
+
+static const char * const rdma_res_name[] = {
+	[DEVCG_RDMA_RES_TYPE_UCTX] = "uctx",
+	[DEVCG_RDMA_RES_TYPE_CQ] = "cq",
+	[DEVCG_RDMA_RES_TYPE_PD] = "pd",
+	[DEVCG_RDMA_RES_TYPE_AH] = "ah",
+	[DEVCG_RDMA_RES_TYPE_MR] = "mr",
+	[DEVCG_RDMA_RES_TYPE_MW] = "mw",
+	[DEVCG_RDMA_RES_TYPE_SRQ] = "srq",
+	[DEVCG_RDMA_RES_TYPE_QP] = "qp",
+	[DEVCG_RDMA_RES_TYPE_FLOW] = "flow",
+};
+
+int devcgroup_rdma_show_usage(struct seq_file *m, void *v)
+{
+	struct dev_cgroup *devcg = css_to_devcgroup(seq_css(m));
+	const char *res_name = NULL;
+	u32 usage;
+	int i;
+
+	for (i = 0; i < DEVCG_RDMA_RES_TYPE_MAX; i++) {
+		res_name = rdma_res_name[i];
+		usage = atomic_read(&devcg->rdma.tracker[i].usage);
+		if (usage == DEVCG_RDMA_MAX_RESOURCES)
+			seq_printf(m, "%s %s\n", res_name,
+				   DEVCG_RDMA_MAX_RESOURCE_STR);
+		else
+			seq_printf(m, "%s %u\n", res_name, usage);
+	};
+	return 0;
+}
+
+static void rdma_free_res_counter(struct task_struct *task)
+{
+	struct task_rdma_res_counter *res_cnt = NULL;
+	bool free_res = false;
+
+	task_lock(task);
+	res_cnt = task->rdma_res_counter;
+	if (res_cnt &&
+	    res_cnt->usage[DEVCG_RDMA_RES_TYPE_UCTX] == 0) {
+		/* free resource counters if this is the last
+		 * ucontext, which is getting deallocated.
+		 */
+		task->rdma_res_counter = NULL;
+		free_res = true;
+	}
+	task_unlock(task);
+
+	/* synchronize with task migration activity from one to other cgroup
+	 * which might be reading this task's resource counters.
+	 */
+	synchronize_rcu();
+	if (free_res)
+		kfree(res_cnt);
+}
+
+static void uncharge_resource(struct dev_cgroup *dev_cg,
+			      enum devcgroup_rdma_rt type, s64 num)
+{
+	/*
+	 * A negative count (or overflow for that matter) is invalid,
+	 * and indicates a bug in the device rdma controller.
+	 */
+	WARN_ON_ONCE(atomic_add_negative(-num,
+					 &dev_cg->rdma.tracker[type].usage));
+}
+
+static void uncharge_task_resource(struct task_struct *task,
+				   struct dev_cgroup *cg,
+				   enum devcgroup_rdma_rt type,
+				   int num)
+{
+	struct dev_cgroup *p;
+
+	if (!num)
+		return;
+
+	/* protect against actual task which might be
+	 * freeing resource counter memory due to no resource
+	 * consumption.
+	 */
+	task_lock(task);
+	if (!task->rdma_res_counter) {
+		task_unlock(task);
+		return;
+	}
+	for (p = cg; p; p = parent_devcgroup(p))
+		uncharge_resource(p, type, num);
+
+	task_unlock(task);
+}
+
+/**
+ * devcgroup_rdma_uncharge_resource - hierarchically uncharge
+ * rdma resource count
+ * @ucontext: the ucontext from which to uncharge the resource
+ * pass null when caller knows that there was past allocation
+ * and its calling from same process context to which this resource
+ * belongs.
+ * @type: the type of resource to uncharge
+ * @num: the number of resource to uncharge
+ */
+void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
+				      enum devcgroup_rdma_rt type, int num)
+{
+	struct dev_cgroup *dev_cg, *p;
+	struct task_struct *ctx_task;
+
+	if (!num)
+		return;
+
+	/* get cgroup of ib_ucontext it belong to, to uncharge
+	 * so that when its called from any worker tasks or any
+	 * other tasks to which this resource doesn't belong to,
+	 * it can be uncharged correctly.
+	 */
+	if (ucontext)
+		ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
+	else
+		ctx_task = current;
+	dev_cg = task_devcgroup(ctx_task);
+
+	spin_lock(&ctx_task->rdma_res_counter->lock);
+	ctx_task->rdma_res_counter->usage[type] -= num;
+
+	for (p = dev_cg; p; p = parent_devcgroup(p))
+		uncharge_resource(p, type, num);
+
+	spin_unlock(&ctx_task->rdma_res_counter->lock);
+
+	if (type == DEVCG_RDMA_RES_TYPE_UCTX)
+		rdma_free_res_counter(ctx_task);
+}
+EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource);
+
+/**
+ * This function does not follow configured rdma resource limit.
+ * It cannot fail and the new rdma resource count may exceed the limit.
+ * This is only used during task migration where there is no other
+ * way out than violating the limit.
+ */
+static void charge_resource(struct dev_cgroup *dev_cg,
+			    enum devcgroup_rdma_rt type, int num)
+{
+	struct dev_cgroup *p;
+
+	for (p = dev_cg; p; p = parent_devcgroup(p)) {
+		struct devcgroup_rdma *rdma = &p->rdma;
+
+		atomic_add(num, &rdma->tracker[type].usage);
+	}
+}
+
+/**
+ * try_charge_resource - hierarchically try to charge
+ * the rdma resource count
+ * @type: the type of resource to uncharge
+ * @num: the number of rdma resource to charge
+ *
+ * This function follows the set limit. It will fail if the charge would cause
+ * the new value to exceed the hierarchical limit. Returns 0 if the charge
+ * succeded, otherwise -EAGAIN.
+ */
+static int try_charge_resource(struct dev_cgroup *dev_cg,
+			       enum devcgroup_rdma_rt type, int num)
+{
+	struct dev_cgroup *p, *q;
+
+	for (p = dev_cg; p; p = parent_devcgroup(p)) {
+		struct devcgroup_rdma *rdma = &p->rdma;
+		s64 new = atomic_add_return(num,
+					&rdma->tracker[type].usage);
+
+		if (new > rdma->tracker[type].limit)
+			goto revert;
+	}
+	return 0;
+
+revert:
+	for (q = dev_cg; q != p; q = parent_devcgroup(q))
+		uncharge_resource(q, type, num);
+	uncharge_resource(q, type, num);
+	return -EAGAIN;
+}
+
+/**
+ * devcgroup_rdma_try_charge_resource - hierarchically try to charge
+ * the rdma resource count
+ * @type: the type of resource to uncharge
+ * @num: the number of rdma resource to charge
+ *
+ * This function follows the set limit in hierarchical way.
+ * It will fail if the charge would cause the new value to exceed the
+ * hierarchical limit.
+ * Returns 0 if the charge succeded, otherwise -EAGAIN.
+ */
+int devcgroup_rdma_try_charge_resource(enum devcgroup_rdma_rt type, int num)
+{
+	struct dev_cgroup *dev_cg = task_devcgroup(current);
+	struct task_rdma_res_counter *res_cnt = current->rdma_res_counter;
+	int status;
+
+	if (!res_cnt) {
+		res_cnt = kzalloc(sizeof(*res_cnt), GFP_KERNEL);
+		if (!res_cnt)
+			return -ENOMEM;
+
+		spin_lock_init(&res_cnt->lock);
+		rcu_assign_pointer(current->rdma_res_counter, res_cnt);
+	}
+
+	/* synchronize with migration task by taking lock, to avoid
+	 * race condition of performing cgroup resource migration
+	 * in non atomic way with this task, which can leads to leaked
+	 * resources in older cgroup.
+	 */
+	spin_lock(&res_cnt->lock);
+	status = try_charge_resource(dev_cg, type, num);
+	if (status)
+		goto busy;
+
+	/* single task updating its rdma resource usage, so atomic is
+	 * not required.
+	 */
+	current->rdma_res_counter->usage[type] += num;
+
+busy:
+	spin_unlock(&res_cnt->lock);
+	return status;
+}
+EXPORT_SYMBOL(devcgroup_rdma_try_charge_resource);
+
+/**
+ * devcgroup_rdma_query_resource_limit - query the resource limit
+ * for a given resource type of the calling user process. It returns the
+ * hierarchically smallest limit of the cgroup hierarchy.
+ * @type: the type of resource to query the limit
+ * Returns resource limit across all the RDMA devices accessible
+ * to this process.
+ */
+int devcgroup_rdma_query_resource_limit(enum devcgroup_rdma_rt type)
+{
+	struct dev_cgroup *dev_cg, *p;
+	int cur_limit, limit;
+
+	dev_cg = task_devcgroup(current);
+	limit = dev_cg->rdma.tracker[type].limit;
+
+	/* find the controller in the given hirerchy with lowest limit,
+	 * and report its limit to avoid confusion to user and applications,
+	 * who rely on the query functionality.
+	 */
+	for (p = dev_cg; p; p = parent_devcgroup(p)) {
+		cur_limit = p->rdma.tracker[type].limit;
+		limit = min_t(int, cur_limit, limit);
+	}
+	return limit;
+}
+EXPORT_SYMBOL(devcgroup_rdma_query_resource_limit);
+
+int devcgroup_rdma_can_attach(struct cgroup_subsys_state *dst_css,
+			      struct cgroup_taskset *tset)
+{
+	struct dev_cgroup *dst_cg = css_to_devcgroup(dst_css);
+	struct dev_cgroup *old_cg;
+	struct task_struct *task;
+	struct task_rdma_res_counter *task_res_cnt;
+	int val, i;
+
+	cgroup_taskset_for_each(task, tset) {
+		old_cg = task_devcgroup(task);
+
+		/* protect against a task which might be deallocating
+		 * rdma_res_counter structure because last resource
+		 * of the task might undergoing deallocation.
+		 */
+		rcu_read_lock();
+		task_res_cnt = rcu_dereference(task->rdma_res_counter);
+		if (!task_res_cnt)
+			goto empty_task;
+
+		spin_lock(&task_res_cnt->lock);
+		for (i = 0; i < DEVCG_RDMA_RES_TYPE_MAX; i++) {
+			val = task_res_cnt->usage[i];
+
+			charge_resource(dst_cg, i, val);
+			uncharge_task_resource(task, old_cg, i, val);
+		}
+		spin_unlock(&task_res_cnt->lock);
+
+empty_task:
+		rcu_read_unlock();
+	}
+	return 0;
+}
+
+void devcgroup_rdma_cancel_attach(struct cgroup_subsys_state *dst_css,
+				  struct cgroup_taskset *tset)
+{
+	struct dev_cgroup *dst_cg = css_to_devcgroup(dst_css);
+	struct dev_cgroup *old_cg;
+	struct task_struct *task;
+	struct task_rdma_res_counter *task_res_cnt;
+	u32 val; int i;
+
+	cgroup_taskset_for_each(task, tset) {
+		old_cg = task_devcgroup(task);
+
+		/* protect against task deallocating rdma_res_counter structure
+		 * because last ucontext resource of the task might be
+		 * getting deallocated.
+		 */
+		rcu_read_lock();
+		task_res_cnt = rcu_dereference(task->rdma_res_counter);
+		if (!task_res_cnt)
+			goto empty_task;
+
+		spin_lock(&task_res_cnt->lock);
+		for (i = 0; i < DEVCG_RDMA_RES_TYPE_MAX; i++) {
+			val = task_res_cnt->usage[i];
+
+			charge_resource(old_cg, i, val);
+			uncharge_task_resource(task, dst_cg, i, val);
+		}
+		spin_unlock(&task_res_cnt->lock);
+empty_task:
+		rcu_read_unlock();
+	}
+}
+
+void devcgroup_rdma_fork(struct task_struct *task, void *priv)
+{
+	/* There is per task resource counters,
+	 * so whatever clone as copied over, ignore it.
+	 */
+	task->rdma_res_counter = NULL;
+}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.
@ 2015-09-07 20:38   ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w,
	dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, haggaie-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA,
	pandit.parav-Re5JQEeQqe8AvxtiuMwx3w

Extension of device cgroup for RDMA device resources.
This implements RDMA resource tracker to limit RDMA resources such as
AH, CQ, PD, QP, MR, SRQ etc resources for processes of the cgroup.
It implements RDMA resource limit module to limit consuming RDMA
resources for processes of the cgroup.
RDMA resources are tracked on per task basis.
RDMA resources across multiple such devices are limited among multiple
processes of the owning device cgroup.

RDMA device cgroup extension returns error when user space applications
try to allocate resources more than its configured limit.

Signed-off-by: Parav Pandit <pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 include/linux/device_rdma_cgroup.h |  83 ++++++++
 security/device_rdma_cgroup.c      | 422 +++++++++++++++++++++++++++++++++++++
 2 files changed, 505 insertions(+)
 create mode 100644 include/linux/device_rdma_cgroup.h
 create mode 100644 security/device_rdma_cgroup.c

diff --git a/include/linux/device_rdma_cgroup.h b/include/linux/device_rdma_cgroup.h
new file mode 100644
index 0000000..a2c261b
--- /dev/null
+++ b/include/linux/device_rdma_cgroup.h
@@ -0,0 +1,83 @@
+#ifndef _DEVICE_RDMA_CGROUP_H
+#define _DEVICE_RDMA_CGROUP_H
+
+#include <linux/cgroup.h>
+
+/* RDMA resources from device cgroup perspective */
+enum devcgroup_rdma_rt {
+	DEVCG_RDMA_RES_TYPE_UCTX,
+	DEVCG_RDMA_RES_TYPE_CQ,
+	DEVCG_RDMA_RES_TYPE_PD,
+	DEVCG_RDMA_RES_TYPE_AH,
+	DEVCG_RDMA_RES_TYPE_MR,
+	DEVCG_RDMA_RES_TYPE_MW,
+	DEVCG_RDMA_RES_TYPE_SRQ,
+	DEVCG_RDMA_RES_TYPE_QP,
+	DEVCG_RDMA_RES_TYPE_FLOW,
+	DEVCG_RDMA_RES_TYPE_MAX,
+};
+
+struct ib_ucontext;
+
+#define DEVCG_RDMA_MAX_RESOURCES S32_MAX
+
+#ifdef CONFIG_CGROUP_RDMA_RESOURCE
+
+#define DEVCG_RDMA_MAX_RESOURCE_STR "max"
+
+enum devcgroup_rdma_access_files {
+	DEVCG_RDMA_LIST_USAGE,
+};
+
+struct task_rdma_res_counter {
+	/* allows atomic increment of task and cgroup counters
+	 *  to avoid race with migration task.
+	 */
+	spinlock_t lock;
+	u32 usage[DEVCG_RDMA_RES_TYPE_MAX];
+};
+
+struct devcgroup_rdma_tracker {
+	int limit;
+	atomic_t usage;
+	int failcnt;
+};
+
+struct devcgroup_rdma {
+	struct devcgroup_rdma_tracker tracker[DEVCG_RDMA_RES_TYPE_MAX];
+};
+
+struct dev_cgroup;
+
+void init_devcgroup_rdma_tracker(struct dev_cgroup *dev_cg);
+ssize_t devcgroup_rdma_set_max_resource(struct kernfs_open_file *of,
+					char *buf,
+					size_t nbytes, loff_t off);
+int devcgroup_rdma_get_max_resource(struct seq_file *m, void *v);
+int devcgroup_rdma_show_usage(struct seq_file *m, void *v);
+
+int devcgroup_rdma_try_charge_resource(enum devcgroup_rdma_rt type, int num);
+void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
+				      enum devcgroup_rdma_rt type, int num);
+void devcgroup_rdma_fork(struct task_struct *task, void *priv);
+
+int devcgroup_rdma_can_attach(struct cgroup_subsys_state *css,
+			      struct cgroup_taskset *tset);
+void devcgroup_rdma_cancel_attach(struct cgroup_subsys_state *css,
+				  struct cgroup_taskset *tset);
+int devcgroup_rdma_query_resource_limit(enum devcgroup_rdma_rt type);
+#else
+
+static inline int devcgroup_rdma_try_charge_resource(
+				enum devcgroup_rdma_rt type, int num)
+{ return 0; }
+static inline void devcgroup_rdma_uncharge_resource(
+				struct ib_ucontext *ucontext,
+				enum devcgroup_rdma_rt type, int num)
+{ }
+static inline int devcgroup_rdma_query_resource_limit(
+				enum devcgroup_rdma_rt type)
+{ return DEVCG_RDMA_MAX_RESOURCES; }
+#endif
+
+#endif
diff --git a/security/device_rdma_cgroup.c b/security/device_rdma_cgroup.c
new file mode 100644
index 0000000..fb4cc59
--- /dev/null
+++ b/security/device_rdma_cgroup.c
@@ -0,0 +1,422 @@
+/*
+ * RDMA device cgroup controller of device controller cgroup.
+ *
+ * Provides a cgroup hierarchy to limit various RDMA resource allocation to a
+ * configured limit of the cgroup.
+ *
+ * Its easy for user space applications to consume of RDMA device specific
+ * hardware resources. Such resource exhaustion should be prevented so that
+ * user space applications and other kernel consumers gets chance to allocate
+ * and effectively use the hardware resources.
+ *
+ * In order to use the device rdma controller, set the maximum resource count
+ * per cgroup, which ensures that total rdma resources for processes belonging
+ * to a cgroup doesn't exceed configured limit.
+ *
+ * RDMA resource limits are hierarchical, so the highest configured limit of
+ * the hierarchy is enforced. Allowing resource limit configuration to default
+ * cgroup allows fair share to kernel space ULPs as well.
+ *
+ * This file is subject to the terms and conditions of version 2 of the GNU
+ * General Public License.  See the file COPYING in the main directory of the
+ * Linux distribution for more details.
+ */
+
+#include <linux/slab.h>
+#include <linux/device_rdma_cgroup.h>
+#include <linux/device_cgroup.h>
+#include <rdma/ib_verbs.h>
+
+/**
+ * init_devcgroup_rdma_tracker - initialize resource limits.
+ * @dev_cg: device cgroup pointer for which limits should be
+ * initialized.
+ */
+void init_devcgroup_rdma_tracker(struct dev_cgroup *dev_cg)
+{
+	int i;
+
+	for (i = 0; i < DEVCG_RDMA_RES_TYPE_MAX; i++)
+		dev_cg->rdma.tracker[i].limit = DEVCG_RDMA_MAX_RESOURCES;
+}
+
+ssize_t devcgroup_rdma_set_max_resource(struct kernfs_open_file *of,
+					char *buf,
+					size_t nbytes, loff_t off)
+{
+	struct cgroup_subsys_state *css = of_css(of);
+	struct dev_cgroup *dev_cg = css_to_devcgroup(css);
+	s64 new_limit;
+	int type = of_cft(of)->private;
+	int err;
+
+	buf = strstrip(buf);
+	if (!strcmp(buf, DEVCG_RDMA_MAX_RESOURCE_STR)) {
+		new_limit = DEVCG_RDMA_MAX_RESOURCES;
+		goto max_limit;
+	}
+
+	err = kstrtoll(buf, 0, &new_limit);
+	if (err)
+		return err;
+
+	if (new_limit < 0 || new_limit >= DEVCG_RDMA_MAX_RESOURCES)
+		return -EINVAL;
+
+max_limit:
+	dev_cg->rdma.tracker[type].limit = new_limit;
+	return nbytes;
+}
+
+int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
+{
+	struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
+	int type = seq_cft(sf)->private;
+	u32 usage;
+
+	if (dev_cg->rdma.tracker[type].limit ==	DEVCG_RDMA_MAX_RESOURCES) {
+		seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
+	} else {
+		usage = dev_cg->rdma.tracker[type].limit;
+		seq_printf(sf, "%u\n", usage);
+	}
+	return 0;
+}
+
+static const char * const rdma_res_name[] = {
+	[DEVCG_RDMA_RES_TYPE_UCTX] = "uctx",
+	[DEVCG_RDMA_RES_TYPE_CQ] = "cq",
+	[DEVCG_RDMA_RES_TYPE_PD] = "pd",
+	[DEVCG_RDMA_RES_TYPE_AH] = "ah",
+	[DEVCG_RDMA_RES_TYPE_MR] = "mr",
+	[DEVCG_RDMA_RES_TYPE_MW] = "mw",
+	[DEVCG_RDMA_RES_TYPE_SRQ] = "srq",
+	[DEVCG_RDMA_RES_TYPE_QP] = "qp",
+	[DEVCG_RDMA_RES_TYPE_FLOW] = "flow",
+};
+
+int devcgroup_rdma_show_usage(struct seq_file *m, void *v)
+{
+	struct dev_cgroup *devcg = css_to_devcgroup(seq_css(m));
+	const char *res_name = NULL;
+	u32 usage;
+	int i;
+
+	for (i = 0; i < DEVCG_RDMA_RES_TYPE_MAX; i++) {
+		res_name = rdma_res_name[i];
+		usage = atomic_read(&devcg->rdma.tracker[i].usage);
+		if (usage == DEVCG_RDMA_MAX_RESOURCES)
+			seq_printf(m, "%s %s\n", res_name,
+				   DEVCG_RDMA_MAX_RESOURCE_STR);
+		else
+			seq_printf(m, "%s %u\n", res_name, usage);
+	};
+	return 0;
+}
+
+static void rdma_free_res_counter(struct task_struct *task)
+{
+	struct task_rdma_res_counter *res_cnt = NULL;
+	bool free_res = false;
+
+	task_lock(task);
+	res_cnt = task->rdma_res_counter;
+	if (res_cnt &&
+	    res_cnt->usage[DEVCG_RDMA_RES_TYPE_UCTX] == 0) {
+		/* free resource counters if this is the last
+		 * ucontext, which is getting deallocated.
+		 */
+		task->rdma_res_counter = NULL;
+		free_res = true;
+	}
+	task_unlock(task);
+
+	/* synchronize with task migration activity from one to other cgroup
+	 * which might be reading this task's resource counters.
+	 */
+	synchronize_rcu();
+	if (free_res)
+		kfree(res_cnt);
+}
+
+static void uncharge_resource(struct dev_cgroup *dev_cg,
+			      enum devcgroup_rdma_rt type, s64 num)
+{
+	/*
+	 * A negative count (or overflow for that matter) is invalid,
+	 * and indicates a bug in the device rdma controller.
+	 */
+	WARN_ON_ONCE(atomic_add_negative(-num,
+					 &dev_cg->rdma.tracker[type].usage));
+}
+
+static void uncharge_task_resource(struct task_struct *task,
+				   struct dev_cgroup *cg,
+				   enum devcgroup_rdma_rt type,
+				   int num)
+{
+	struct dev_cgroup *p;
+
+	if (!num)
+		return;
+
+	/* protect against actual task which might be
+	 * freeing resource counter memory due to no resource
+	 * consumption.
+	 */
+	task_lock(task);
+	if (!task->rdma_res_counter) {
+		task_unlock(task);
+		return;
+	}
+	for (p = cg; p; p = parent_devcgroup(p))
+		uncharge_resource(p, type, num);
+
+	task_unlock(task);
+}
+
+/**
+ * devcgroup_rdma_uncharge_resource - hierarchically uncharge
+ * rdma resource count
+ * @ucontext: the ucontext from which to uncharge the resource
+ * pass null when caller knows that there was past allocation
+ * and its calling from same process context to which this resource
+ * belongs.
+ * @type: the type of resource to uncharge
+ * @num: the number of resource to uncharge
+ */
+void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
+				      enum devcgroup_rdma_rt type, int num)
+{
+	struct dev_cgroup *dev_cg, *p;
+	struct task_struct *ctx_task;
+
+	if (!num)
+		return;
+
+	/* get cgroup of ib_ucontext it belong to, to uncharge
+	 * so that when its called from any worker tasks or any
+	 * other tasks to which this resource doesn't belong to,
+	 * it can be uncharged correctly.
+	 */
+	if (ucontext)
+		ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
+	else
+		ctx_task = current;
+	dev_cg = task_devcgroup(ctx_task);
+
+	spin_lock(&ctx_task->rdma_res_counter->lock);
+	ctx_task->rdma_res_counter->usage[type] -= num;
+
+	for (p = dev_cg; p; p = parent_devcgroup(p))
+		uncharge_resource(p, type, num);
+
+	spin_unlock(&ctx_task->rdma_res_counter->lock);
+
+	if (type == DEVCG_RDMA_RES_TYPE_UCTX)
+		rdma_free_res_counter(ctx_task);
+}
+EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource);
+
+/**
+ * This function does not follow configured rdma resource limit.
+ * It cannot fail and the new rdma resource count may exceed the limit.
+ * This is only used during task migration where there is no other
+ * way out than violating the limit.
+ */
+static void charge_resource(struct dev_cgroup *dev_cg,
+			    enum devcgroup_rdma_rt type, int num)
+{
+	struct dev_cgroup *p;
+
+	for (p = dev_cg; p; p = parent_devcgroup(p)) {
+		struct devcgroup_rdma *rdma = &p->rdma;
+
+		atomic_add(num, &rdma->tracker[type].usage);
+	}
+}
+
+/**
+ * try_charge_resource - hierarchically try to charge
+ * the rdma resource count
+ * @type: the type of resource to uncharge
+ * @num: the number of rdma resource to charge
+ *
+ * This function follows the set limit. It will fail if the charge would cause
+ * the new value to exceed the hierarchical limit. Returns 0 if the charge
+ * succeded, otherwise -EAGAIN.
+ */
+static int try_charge_resource(struct dev_cgroup *dev_cg,
+			       enum devcgroup_rdma_rt type, int num)
+{
+	struct dev_cgroup *p, *q;
+
+	for (p = dev_cg; p; p = parent_devcgroup(p)) {
+		struct devcgroup_rdma *rdma = &p->rdma;
+		s64 new = atomic_add_return(num,
+					&rdma->tracker[type].usage);
+
+		if (new > rdma->tracker[type].limit)
+			goto revert;
+	}
+	return 0;
+
+revert:
+	for (q = dev_cg; q != p; q = parent_devcgroup(q))
+		uncharge_resource(q, type, num);
+	uncharge_resource(q, type, num);
+	return -EAGAIN;
+}
+
+/**
+ * devcgroup_rdma_try_charge_resource - hierarchically try to charge
+ * the rdma resource count
+ * @type: the type of resource to uncharge
+ * @num: the number of rdma resource to charge
+ *
+ * This function follows the set limit in hierarchical way.
+ * It will fail if the charge would cause the new value to exceed the
+ * hierarchical limit.
+ * Returns 0 if the charge succeded, otherwise -EAGAIN.
+ */
+int devcgroup_rdma_try_charge_resource(enum devcgroup_rdma_rt type, int num)
+{
+	struct dev_cgroup *dev_cg = task_devcgroup(current);
+	struct task_rdma_res_counter *res_cnt = current->rdma_res_counter;
+	int status;
+
+	if (!res_cnt) {
+		res_cnt = kzalloc(sizeof(*res_cnt), GFP_KERNEL);
+		if (!res_cnt)
+			return -ENOMEM;
+
+		spin_lock_init(&res_cnt->lock);
+		rcu_assign_pointer(current->rdma_res_counter, res_cnt);
+	}
+
+	/* synchronize with migration task by taking lock, to avoid
+	 * race condition of performing cgroup resource migration
+	 * in non atomic way with this task, which can leads to leaked
+	 * resources in older cgroup.
+	 */
+	spin_lock(&res_cnt->lock);
+	status = try_charge_resource(dev_cg, type, num);
+	if (status)
+		goto busy;
+
+	/* single task updating its rdma resource usage, so atomic is
+	 * not required.
+	 */
+	current->rdma_res_counter->usage[type] += num;
+
+busy:
+	spin_unlock(&res_cnt->lock);
+	return status;
+}
+EXPORT_SYMBOL(devcgroup_rdma_try_charge_resource);
+
+/**
+ * devcgroup_rdma_query_resource_limit - query the resource limit
+ * for a given resource type of the calling user process. It returns the
+ * hierarchically smallest limit of the cgroup hierarchy.
+ * @type: the type of resource to query the limit
+ * Returns resource limit across all the RDMA devices accessible
+ * to this process.
+ */
+int devcgroup_rdma_query_resource_limit(enum devcgroup_rdma_rt type)
+{
+	struct dev_cgroup *dev_cg, *p;
+	int cur_limit, limit;
+
+	dev_cg = task_devcgroup(current);
+	limit = dev_cg->rdma.tracker[type].limit;
+
+	/* find the controller in the given hirerchy with lowest limit,
+	 * and report its limit to avoid confusion to user and applications,
+	 * who rely on the query functionality.
+	 */
+	for (p = dev_cg; p; p = parent_devcgroup(p)) {
+		cur_limit = p->rdma.tracker[type].limit;
+		limit = min_t(int, cur_limit, limit);
+	}
+	return limit;
+}
+EXPORT_SYMBOL(devcgroup_rdma_query_resource_limit);
+
+int devcgroup_rdma_can_attach(struct cgroup_subsys_state *dst_css,
+			      struct cgroup_taskset *tset)
+{
+	struct dev_cgroup *dst_cg = css_to_devcgroup(dst_css);
+	struct dev_cgroup *old_cg;
+	struct task_struct *task;
+	struct task_rdma_res_counter *task_res_cnt;
+	int val, i;
+
+	cgroup_taskset_for_each(task, tset) {
+		old_cg = task_devcgroup(task);
+
+		/* protect against a task which might be deallocating
+		 * rdma_res_counter structure because last resource
+		 * of the task might undergoing deallocation.
+		 */
+		rcu_read_lock();
+		task_res_cnt = rcu_dereference(task->rdma_res_counter);
+		if (!task_res_cnt)
+			goto empty_task;
+
+		spin_lock(&task_res_cnt->lock);
+		for (i = 0; i < DEVCG_RDMA_RES_TYPE_MAX; i++) {
+			val = task_res_cnt->usage[i];
+
+			charge_resource(dst_cg, i, val);
+			uncharge_task_resource(task, old_cg, i, val);
+		}
+		spin_unlock(&task_res_cnt->lock);
+
+empty_task:
+		rcu_read_unlock();
+	}
+	return 0;
+}
+
+void devcgroup_rdma_cancel_attach(struct cgroup_subsys_state *dst_css,
+				  struct cgroup_taskset *tset)
+{
+	struct dev_cgroup *dst_cg = css_to_devcgroup(dst_css);
+	struct dev_cgroup *old_cg;
+	struct task_struct *task;
+	struct task_rdma_res_counter *task_res_cnt;
+	u32 val; int i;
+
+	cgroup_taskset_for_each(task, tset) {
+		old_cg = task_devcgroup(task);
+
+		/* protect against task deallocating rdma_res_counter structure
+		 * because last ucontext resource of the task might be
+		 * getting deallocated.
+		 */
+		rcu_read_lock();
+		task_res_cnt = rcu_dereference(task->rdma_res_counter);
+		if (!task_res_cnt)
+			goto empty_task;
+
+		spin_lock(&task_res_cnt->lock);
+		for (i = 0; i < DEVCG_RDMA_RES_TYPE_MAX; i++) {
+			val = task_res_cnt->usage[i];
+
+			charge_resource(old_cg, i, val);
+			uncharge_task_resource(task, dst_cg, i, val);
+		}
+		spin_unlock(&task_res_cnt->lock);
+empty_task:
+		rcu_read_unlock();
+	}
+}
+
+void devcgroup_rdma_fork(struct task_struct *task, void *priv)
+{
+	/* There is per task resource counters,
+	 * so whatever clone as copied over, ignore it.
+	 */
+	task->rdma_res_counter = NULL;
+}
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 6/7] devcg: Added support to use RDMA device cgroup.
  2015-09-07 20:38 ` Parav Pandit
                   ` (5 preceding siblings ...)
  (?)
@ 2015-09-07 20:38 ` Parav Pandit
  2015-09-08  8:40     ` Haggai Eran
  -1 siblings, 1 reply; 95+ messages in thread
From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes,
	dledford
  Cc: corbet, james.l.morris, serge, haggaie, ogerlitz, matanb, raindel,
	akpm, linux-security-module, pandit.parav

RDMA uverbs modules now queries associated device cgroup rdma controller
before allocating device resources and uncharge them while freeing
rdma device resources.
Since fput() sequence can free the resources from the workqueue
context (instead of task context which allocated the resource),
it passes associated ucontext pointer during uncharge, so that
rdma cgroup controller can correctly free the resource of right
task and right cgroup.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 drivers/infiniband/core/uverbs_cmd.c  | 139 +++++++++++++++++++++++++++++-----
 drivers/infiniband/core/uverbs_main.c |  39 +++++++++-
 2 files changed, 156 insertions(+), 22 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index bbb02ff..c080374 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -37,6 +37,7 @@
 #include <linux/fs.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/device_rdma_cgroup.h>
 
 #include <asm/uaccess.h>
 
@@ -281,6 +282,19 @@ static void put_xrcd_read(struct ib_uobject *uobj)
 	put_uobj_read(uobj);
 }
 
+static void init_ucontext_lists(struct ib_ucontext *ucontext)
+{
+	INIT_LIST_HEAD(&ucontext->pd_list);
+	INIT_LIST_HEAD(&ucontext->mr_list);
+	INIT_LIST_HEAD(&ucontext->mw_list);
+	INIT_LIST_HEAD(&ucontext->cq_list);
+	INIT_LIST_HEAD(&ucontext->qp_list);
+	INIT_LIST_HEAD(&ucontext->srq_list);
+	INIT_LIST_HEAD(&ucontext->ah_list);
+	INIT_LIST_HEAD(&ucontext->xrcd_list);
+	INIT_LIST_HEAD(&ucontext->rule_list);
+}
+
 ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 			      const char __user *buf,
 			      int in_len, int out_len)
@@ -313,22 +327,18 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
 		   (unsigned long) cmd.response + sizeof resp,
 		   in_len - sizeof cmd, out_len - sizeof resp);
 
+	ret = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_UCTX, 1);
+	if (ret)
+		goto err;
+
 	ucontext = ibdev->alloc_ucontext(ibdev, &udata);
 	if (IS_ERR(ucontext)) {
 		ret = PTR_ERR(ucontext);
-		goto err;
+		goto err_alloc;
 	}
 
 	ucontext->device = ibdev;
-	INIT_LIST_HEAD(&ucontext->pd_list);
-	INIT_LIST_HEAD(&ucontext->mr_list);
-	INIT_LIST_HEAD(&ucontext->mw_list);
-	INIT_LIST_HEAD(&ucontext->cq_list);
-	INIT_LIST_HEAD(&ucontext->qp_list);
-	INIT_LIST_HEAD(&ucontext->srq_list);
-	INIT_LIST_HEAD(&ucontext->ah_list);
-	INIT_LIST_HEAD(&ucontext->xrcd_list);
-	INIT_LIST_HEAD(&ucontext->rule_list);
+	init_ucontext_lists(ucontext);
 	rcu_read_lock();
 	ucontext->tgid = get_task_pid(current->group_leader, PIDTYPE_PID);
 	rcu_read_unlock();
@@ -395,6 +405,8 @@ err_free:
 	put_pid(ucontext->tgid);
 	ibdev->dealloc_ucontext(ucontext);
 
+err_alloc:
+	devcgroup_rdma_uncharge_resource(NULL, DEVCG_RDMA_RES_TYPE_UCTX, 1);
 err:
 	mutex_unlock(&file->mutex);
 	return ret;
@@ -412,15 +424,23 @@ static void copy_query_dev_fields(struct ib_uverbs_file *file,
 	resp->vendor_id		= attr->vendor_id;
 	resp->vendor_part_id	= attr->vendor_part_id;
 	resp->hw_ver		= attr->hw_ver;
-	resp->max_qp		= attr->max_qp;
+	resp->max_qp		= min_t(int, attr->max_qp,
+					devcgroup_rdma_query_resource_limit(
+						DEVCG_RDMA_RES_TYPE_QP));
 	resp->max_qp_wr		= attr->max_qp_wr;
 	resp->device_cap_flags	= attr->device_cap_flags;
 	resp->max_sge		= attr->max_sge;
 	resp->max_sge_rd	= attr->max_sge_rd;
-	resp->max_cq		= attr->max_cq;
+	resp->max_cq		= min_t(int, attr->max_cq,
+					devcgroup_rdma_query_resource_limit(
+						DEVCG_RDMA_RES_TYPE_CQ));
 	resp->max_cqe		= attr->max_cqe;
-	resp->max_mr		= attr->max_mr;
-	resp->max_pd		= attr->max_pd;
+	resp->max_mr		= min_t(int, attr->max_mr,
+					devcgroup_rdma_query_resource_limit(
+						DEVCG_RDMA_RES_TYPE_MR));
+	resp->max_pd		= min_t(int, attr->max_pd,
+					devcgroup_rdma_query_resource_limit(
+						DEVCG_RDMA_RES_TYPE_PD));
 	resp->max_qp_rd_atom	= attr->max_qp_rd_atom;
 	resp->max_ee_rd_atom	= attr->max_ee_rd_atom;
 	resp->max_res_rd_atom	= attr->max_res_rd_atom;
@@ -429,16 +449,22 @@ static void copy_query_dev_fields(struct ib_uverbs_file *file,
 	resp->atomic_cap		= attr->atomic_cap;
 	resp->max_ee			= attr->max_ee;
 	resp->max_rdd			= attr->max_rdd;
-	resp->max_mw			= attr->max_mw;
+	resp->max_mw			= min_t(int, attr->max_mw,
+					devcgroup_rdma_query_resource_limit(
+						DEVCG_RDMA_RES_TYPE_MW));
 	resp->max_raw_ipv6_qp		= attr->max_raw_ipv6_qp;
 	resp->max_raw_ethy_qp		= attr->max_raw_ethy_qp;
 	resp->max_mcast_grp		= attr->max_mcast_grp;
 	resp->max_mcast_qp_attach	= attr->max_mcast_qp_attach;
 	resp->max_total_mcast_qp_attach	= attr->max_total_mcast_qp_attach;
-	resp->max_ah			= attr->max_ah;
+	resp->max_ah			= min_t(int, attr->max_ah,
+					devcgroup_rdma_query_resource_limit(
+						DEVCG_RDMA_RES_TYPE_AH));
 	resp->max_fmr			= attr->max_fmr;
 	resp->max_map_per_fmr		= attr->max_map_per_fmr;
-	resp->max_srq			= attr->max_srq;
+	resp->max_srq			= min_t(int, attr->max_srq,
+					devcgroup_rdma_query_resource_limit(
+						DEVCG_RDMA_RES_TYPE_SRQ));
 	resp->max_srq_wr		= attr->max_srq_wr;
 	resp->max_srq_sge		= attr->max_srq_sge;
 	resp->max_pkeys			= attr->max_pkeys;
@@ -550,6 +576,12 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uverbs_file *file,
 	if (!uobj)
 		return -ENOMEM;
 
+	ret = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_PD, 1);
+	if (ret) {
+		kfree(uobj);
+		return -EPERM;
+	}
+
 	init_uobj(uobj, 0, file->ucontext, &pd_lock_class);
 	down_write(&uobj->mutex);
 
@@ -595,6 +627,9 @@ err_idr:
 	ib_dealloc_pd(pd);
 
 err:
+	devcgroup_rdma_uncharge_resource(file->ucontext,
+					 DEVCG_RDMA_RES_TYPE_PD, 1);
+
 	put_uobj_write(uobj);
 	return ret;
 }
@@ -623,6 +658,9 @@ ssize_t ib_uverbs_dealloc_pd(struct ib_uverbs_file *file,
 	if (ret)
 		return ret;
 
+	devcgroup_rdma_uncharge_resource(file->ucontext,
+					 DEVCG_RDMA_RES_TYPE_PD, 1);
+
 	idr_remove_uobj(&ib_uverbs_pd_idr, uobj);
 
 	mutex_lock(&file->mutex);
@@ -987,6 +1025,10 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file,
 		}
 	}
 
+	ret = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_MR, 1);
+	if (ret)
+		goto err_charge;
+
 	mr = pd->device->reg_user_mr(pd, cmd.start, cmd.length, cmd.hca_va,
 				     cmd.access_flags, &udata);
 	if (IS_ERR(mr)) {
@@ -1033,8 +1075,10 @@ err_copy:
 
 err_unreg:
 	ib_dereg_mr(mr);
-
 err_put:
+	devcgroup_rdma_uncharge_resource(file->ucontext,
+					 DEVCG_RDMA_RES_TYPE_MR, 1);
+err_charge:
 	put_pd_read(pd);
 
 err_free:
@@ -1162,6 +1206,9 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file,
 	if (ret)
 		return ret;
 
+	devcgroup_rdma_uncharge_resource(file->ucontext,
+					 DEVCG_RDMA_RES_TYPE_MR, 1);
+
 	idr_remove_uobj(&ib_uverbs_mr_idr, uobj);
 
 	mutex_lock(&file->mutex);
@@ -1379,6 +1426,10 @@ static struct ib_ucq_object *create_cq(struct ib_uverbs_file *file,
 	if (cmd_sz > offsetof(typeof(*cmd), flags) + sizeof(cmd->flags))
 		attr.flags = cmd->flags;
 
+	ret = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_CQ, 1);
+	if (ret)
+		goto err_charge;
+
 	cq = file->device->ib_dev->create_cq(file->device->ib_dev, &attr,
 					     file->ucontext, uhw);
 	if (IS_ERR(cq)) {
@@ -1426,6 +1477,9 @@ err_free:
 	ib_destroy_cq(cq);
 
 err_file:
+	devcgroup_rdma_uncharge_resource(file->ucontext,
+					 DEVCG_RDMA_RES_TYPE_CQ, 1);
+err_charge:
 	if (ev_file)
 		ib_uverbs_release_ucq(file, ev_file, obj);
 
@@ -1700,6 +1754,9 @@ ssize_t ib_uverbs_destroy_cq(struct ib_uverbs_file *file,
 	if (ret)
 		return ret;
 
+	devcgroup_rdma_uncharge_resource(file->ucontext,
+					 DEVCG_RDMA_RES_TYPE_CQ, 1);
+
 	idr_remove_uobj(&ib_uverbs_cq_idr, uobj);
 
 	mutex_lock(&file->mutex);
@@ -1818,6 +1875,10 @@ ssize_t ib_uverbs_create_qp(struct ib_uverbs_file *file,
 	INIT_LIST_HEAD(&obj->uevent.event_list);
 	INIT_LIST_HEAD(&obj->mcast_list);
 
+	ret = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_QP, 1);
+	if (ret)
+		goto err_put;
+
 	if (cmd.qp_type == IB_QPT_XRC_TGT)
 		qp = ib_create_qp(pd, &attr);
 	else
@@ -1825,7 +1886,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uverbs_file *file,
 
 	if (IS_ERR(qp)) {
 		ret = PTR_ERR(qp);
-		goto err_put;
+		goto err_create;
 	}
 
 	if (cmd.qp_type != IB_QPT_XRC_TGT) {
@@ -1900,6 +1961,9 @@ err_copy:
 err_destroy:
 	ib_destroy_qp(qp);
 
+err_create:
+	devcgroup_rdma_uncharge_resource(file->ucontext,
+					 DEVCG_RDMA_RES_TYPE_QP, 1);
 err_put:
 	if (xrcd)
 		put_xrcd_read(xrcd_uobj);
@@ -2256,6 +2320,9 @@ ssize_t ib_uverbs_destroy_qp(struct ib_uverbs_file *file,
 	if (ret)
 		return ret;
 
+	devcgroup_rdma_uncharge_resource(file->ucontext,
+					 DEVCG_RDMA_RES_TYPE_QP, 1);
+
 	if (obj->uxrcd)
 		atomic_dec(&obj->uxrcd->refcnt);
 
@@ -2665,10 +2732,14 @@ ssize_t ib_uverbs_create_ah(struct ib_uverbs_file *file,
 	memset(&attr.dmac, 0, sizeof(attr.dmac));
 	memcpy(attr.grh.dgid.raw, cmd.attr.grh.dgid, 16);
 
+	ret = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_AH, 1);
+	if (ret)
+		goto err_put;
+
 	ah = ib_create_ah(pd, &attr);
 	if (IS_ERR(ah)) {
 		ret = PTR_ERR(ah);
-		goto err_put;
+		goto err_create;
 	}
 
 	ah->uobject  = uobj;
@@ -2704,6 +2775,9 @@ err_copy:
 err_destroy:
 	ib_destroy_ah(ah);
 
+err_create:
+	devcgroup_rdma_uncharge_resource(file->ucontext,
+					 DEVCG_RDMA_RES_TYPE_AH, 1);
 err_put:
 	put_pd_read(pd);
 
@@ -2737,6 +2811,9 @@ ssize_t ib_uverbs_destroy_ah(struct ib_uverbs_file *file,
 	if (ret)
 		return ret;
 
+	devcgroup_rdma_uncharge_resource(file->ucontext,
+					 DEVCG_RDMA_RES_TYPE_AH, 1);
+
 	idr_remove_uobj(&ib_uverbs_ah_idr, uobj);
 
 	mutex_lock(&file->mutex);
@@ -2986,10 +3063,15 @@ int ib_uverbs_ex_create_flow(struct ib_uverbs_file *file,
 		err = -EINVAL;
 		goto err_free;
 	}
+
+	err = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_FLOW, 1);
+	if (err)
+		goto err_free;
+
 	flow_id = ib_create_flow(qp, flow_attr, IB_FLOW_DOMAIN_USER);
 	if (IS_ERR(flow_id)) {
 		err = PTR_ERR(flow_id);
-		goto err_free;
+		goto err_create;
 	}
 	flow_id->qp = qp;
 	flow_id->uobject = uobj;
@@ -3023,6 +3105,9 @@ err_copy:
 	idr_remove_uobj(&ib_uverbs_rule_idr, uobj);
 destroy_flow:
 	ib_destroy_flow(flow_id);
+err_create:
+	devcgroup_rdma_uncharge_resource(file->ucontext,
+					 DEVCG_RDMA_RES_TYPE_FLOW, 1);
 err_free:
 	kfree(flow_attr);
 err_put:
@@ -3064,6 +3149,9 @@ int ib_uverbs_ex_destroy_flow(struct ib_uverbs_file *file,
 	if (!ret)
 		uobj->live = 0;
 
+	devcgroup_rdma_uncharge_resource(file->ucontext,
+					 DEVCG_RDMA_RES_TYPE_FLOW, 1);
+
 	put_uobj_write(uobj);
 
 	idr_remove_uobj(&ib_uverbs_rule_idr, uobj);
@@ -3129,6 +3217,10 @@ static int __uverbs_create_xsrq(struct ib_uverbs_file *file,
 	obj->uevent.events_reported = 0;
 	INIT_LIST_HEAD(&obj->uevent.event_list);
 
+	ret = devcgroup_rdma_try_charge_resource(DEVCG_RDMA_RES_TYPE_SRQ, 1);
+	if (ret)
+		goto err_put_cq;
+
 	srq = pd->device->create_srq(pd, &attr, udata);
 	if (IS_ERR(srq)) {
 		ret = PTR_ERR(srq);
@@ -3193,6 +3285,8 @@ err_destroy:
 	ib_destroy_srq(srq);
 
 err_put:
+	devcgroup_rdma_uncharge_resource(file->ucontext,
+					 DEVCG_RDMA_RES_TYPE_SRQ, 1);
 	put_pd_read(pd);
 
 err_put_cq:
@@ -3372,6 +3466,9 @@ ssize_t ib_uverbs_destroy_srq(struct ib_uverbs_file *file,
 	if (ret)
 		return ret;
 
+	devcgroup_rdma_uncharge_resource(file->ucontext,
+					 DEVCG_RDMA_RES_TYPE_SRQ, 1);
+
 	if (srq_type == IB_SRQT_XRC) {
 		us = container_of(obj, struct ib_usrq_object, uevent);
 		atomic_dec(&us->uxrcd->refcnt);
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index f6eef2d..31544d4 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -45,6 +45,7 @@
 #include <linux/cdev.h>
 #include <linux/anon_inodes.h>
 #include <linux/slab.h>
+#include <linux/device_rdma_cgroup.h>
 
 #include <asm/uaccess.h>
 
@@ -200,6 +201,7 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 				      struct ib_ucontext *context)
 {
 	struct ib_uobject *uobj, *tmp;
+	int uobj_cnt = 0, ret;
 
 	if (!context)
 		return 0;
@@ -212,8 +214,12 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 		idr_remove_uobj(&ib_uverbs_ah_idr, uobj);
 		ib_destroy_ah(ah);
 		kfree(uobj);
+		uobj_cnt++;
 	}
+	devcgroup_rdma_uncharge_resource(context,
+					 DEVCG_RDMA_RES_TYPE_AH, uobj_cnt);
 
+	uobj_cnt = 0;
 	/* Remove MWs before QPs, in order to support type 2A MWs. */
 	list_for_each_entry_safe(uobj, tmp, &context->mw_list, list) {
 		struct ib_mw *mw = uobj->object;
@@ -221,16 +227,24 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 		idr_remove_uobj(&ib_uverbs_mw_idr, uobj);
 		ib_dealloc_mw(mw);
 		kfree(uobj);
+		uobj_cnt++;
 	}
+	devcgroup_rdma_uncharge_resource(context,
+					 DEVCG_RDMA_RES_TYPE_MW, uobj_cnt);
 
+	uobj_cnt = 0;
 	list_for_each_entry_safe(uobj, tmp, &context->rule_list, list) {
 		struct ib_flow *flow_id = uobj->object;
 
 		idr_remove_uobj(&ib_uverbs_rule_idr, uobj);
 		ib_destroy_flow(flow_id);
 		kfree(uobj);
+		uobj_cnt++;
 	}
+	devcgroup_rdma_uncharge_resource(context,
+					 DEVCG_RDMA_RES_TYPE_FLOW, uobj_cnt);
 
+	uobj_cnt = 0;
 	list_for_each_entry_safe(uobj, tmp, &context->qp_list, list) {
 		struct ib_qp *qp = uobj->object;
 		struct ib_uqp_object *uqp =
@@ -245,8 +259,12 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 		}
 		ib_uverbs_release_uevent(file, &uqp->uevent);
 		kfree(uqp);
+		uobj_cnt++;
 	}
+	devcgroup_rdma_uncharge_resource(context,
+					 DEVCG_RDMA_RES_TYPE_QP, uobj_cnt);
 
+	uobj_cnt = 0;
 	list_for_each_entry_safe(uobj, tmp, &context->srq_list, list) {
 		struct ib_srq *srq = uobj->object;
 		struct ib_uevent_object *uevent =
@@ -256,8 +274,12 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 		ib_destroy_srq(srq);
 		ib_uverbs_release_uevent(file, uevent);
 		kfree(uevent);
+		uobj_cnt++;
 	}
+	devcgroup_rdma_uncharge_resource(context,
+					 DEVCG_RDMA_RES_TYPE_SRQ, uobj_cnt);
 
+	uobj_cnt = 0;
 	list_for_each_entry_safe(uobj, tmp, &context->cq_list, list) {
 		struct ib_cq *cq = uobj->object;
 		struct ib_uverbs_event_file *ev_file = cq->cq_context;
@@ -268,15 +290,22 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 		ib_destroy_cq(cq);
 		ib_uverbs_release_ucq(file, ev_file, ucq);
 		kfree(ucq);
+		uobj_cnt++;
 	}
+	devcgroup_rdma_uncharge_resource(context,
+					 DEVCG_RDMA_RES_TYPE_CQ, uobj_cnt);
 
+	uobj_cnt = 0;
 	list_for_each_entry_safe(uobj, tmp, &context->mr_list, list) {
 		struct ib_mr *mr = uobj->object;
 
 		idr_remove_uobj(&ib_uverbs_mr_idr, uobj);
 		ib_dereg_mr(mr);
 		kfree(uobj);
+		uobj_cnt++;
 	}
+	devcgroup_rdma_uncharge_resource(context,
+					 DEVCG_RDMA_RES_TYPE_MR, uobj_cnt);
 
 	mutex_lock(&file->device->xrcd_tree_mutex);
 	list_for_each_entry_safe(uobj, tmp, &context->xrcd_list, list) {
@@ -290,17 +319,25 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
 	}
 	mutex_unlock(&file->device->xrcd_tree_mutex);
 
+	uobj_cnt = 0;
 	list_for_each_entry_safe(uobj, tmp, &context->pd_list, list) {
 		struct ib_pd *pd = uobj->object;
 
 		idr_remove_uobj(&ib_uverbs_pd_idr, uobj);
 		ib_dealloc_pd(pd);
 		kfree(uobj);
+		uobj_cnt++;
 	}
+	devcgroup_rdma_uncharge_resource(context,
+					 DEVCG_RDMA_RES_TYPE_PD, uobj_cnt);
 
 	put_pid(context->tgid);
 
-	return context->device->dealloc_ucontext(context);
+	ret = context->device->dealloc_ucontext(context);
+
+	devcgroup_rdma_uncharge_resource(context,
+					 DEVCG_RDMA_RES_TYPE_UCTX, 1);
+	return ret;
 }
 
 static void ib_uverbs_release_file(struct kref *ref)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 7/7] devcg: Added Documentation of RDMA device cgroup.
@ 2015-09-07 20:38   ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes,
	dledford
  Cc: corbet, james.l.morris, serge, haggaie, ogerlitz, matanb, raindel,
	akpm, linux-security-module, pandit.parav

Modified device cgroup documentation to reflect its dual purpose
without creating new cgroup subsystem for rdma.

Added documentation to describe functionality and usage of device cgroup
extension for RDMA.

Signed-off-by: Parav Pandit <pandit.parav@gmail.com>
---
 Documentation/cgroups/devices.txt | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroups/devices.txt b/Documentation/cgroups/devices.txt
index 3c1095c..eca5b70 100644
--- a/Documentation/cgroups/devices.txt
+++ b/Documentation/cgroups/devices.txt
@@ -1,9 +1,12 @@
-Device Whitelist Controller
+Device Controller
 
 1. Description:
 
-Implement a cgroup to track and enforce open and mknod restrictions
-on device files.  A device cgroup associates a device access
+Device controller implements a cgroup for two purposes.
+
+1.1 Device white list controller
+It implement a cgroup to track and enforce open and mknod
+restrictions on device files.  A device cgroup associates a device access
 whitelist with each cgroup.  A whitelist entry has 4 fields.
 'type' is a (all), c (char), or b (block).  'all' means it applies
 to all types and all major and minor numbers.  Major and minor are
@@ -15,8 +18,15 @@ cgroup gets a copy of the parent.  Administrators can then remove
 devices from the whitelist or add new entries.  A child cgroup can
 never receive a device access which is denied by its parent.
 
+1.2 RDMA device resource controller
+It implements a cgroup to limit various RDMA device resources for
+a controller. Such resource includes RDMA PD, CQ, AH, MR, SRQ, QP, FLOW.
+It limits RDMA resources access to tasks of the cgroup across multiple
+RDMA devices.
+
 2. User Interface
 
+2.1 Device white list controller
 An entry is added using devices.allow, and removed using
 devices.deny.  For instance
 
@@ -33,6 +43,22 @@ will remove the default 'a *:* rwm' entry. Doing
 
 will add the 'a *:* rwm' entry to the whitelist.
 
+2.2 RDMA device controller
+
+RDMA resources are limited using devices.rdma.resource.max.<resource_name>.
+Doing
+	echo 200 > /sys/fs/cgroup/1/rdma.resource.max_qp
+will limit maximum number of QP across all the process of cgroup to 200.
+
+More examples:
+	echo 200 > /sys/fs/cgroup/1/rdma.resource.max_flow
+	echo 10  > /sys/fs/cgroup/1/rdma.resource.max_pd
+	echo 15  > /sys/fs/cgroup/1/rdma.resource.max_srq
+	echo 1   > /sys/fs/cgroup/1/rdma.resource.max_uctx
+
+RDMA resource current usage can be tracked using devices.rdma.resource.usage
+	cat /sys/fs/cgroup/1/devices.rdma.resource.usage
+
 3. Security
 
 Any task can move itself between cgroups.  This clearly won't
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 95+ messages in thread

* [PATCH 7/7] devcg: Added Documentation of RDMA device cgroup.
@ 2015-09-07 20:38   ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-07 20:38 UTC (permalink / raw)
  To: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w,
	dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, haggaie-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA,
	pandit.parav-Re5JQEeQqe8AvxtiuMwx3w

Modified device cgroup documentation to reflect its dual purpose
without creating new cgroup subsystem for rdma.

Added documentation to describe functionality and usage of device cgroup
extension for RDMA.

Signed-off-by: Parav Pandit <pandit.parav-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
---
 Documentation/cgroups/devices.txt | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroups/devices.txt b/Documentation/cgroups/devices.txt
index 3c1095c..eca5b70 100644
--- a/Documentation/cgroups/devices.txt
+++ b/Documentation/cgroups/devices.txt
@@ -1,9 +1,12 @@
-Device Whitelist Controller
+Device Controller
 
 1. Description:
 
-Implement a cgroup to track and enforce open and mknod restrictions
-on device files.  A device cgroup associates a device access
+Device controller implements a cgroup for two purposes.
+
+1.1 Device white list controller
+It implement a cgroup to track and enforce open and mknod
+restrictions on device files.  A device cgroup associates a device access
 whitelist with each cgroup.  A whitelist entry has 4 fields.
 'type' is a (all), c (char), or b (block).  'all' means it applies
 to all types and all major and minor numbers.  Major and minor are
@@ -15,8 +18,15 @@ cgroup gets a copy of the parent.  Administrators can then remove
 devices from the whitelist or add new entries.  A child cgroup can
 never receive a device access which is denied by its parent.
 
+1.2 RDMA device resource controller
+It implements a cgroup to limit various RDMA device resources for
+a controller. Such resource includes RDMA PD, CQ, AH, MR, SRQ, QP, FLOW.
+It limits RDMA resources access to tasks of the cgroup across multiple
+RDMA devices.
+
 2. User Interface
 
+2.1 Device white list controller
 An entry is added using devices.allow, and removed using
 devices.deny.  For instance
 
@@ -33,6 +43,22 @@ will remove the default 'a *:* rwm' entry. Doing
 
 will add the 'a *:* rwm' entry to the whitelist.
 
+2.2 RDMA device controller
+
+RDMA resources are limited using devices.rdma.resource.max.<resource_name>.
+Doing
+	echo 200 > /sys/fs/cgroup/1/rdma.resource.max_qp
+will limit maximum number of QP across all the process of cgroup to 200.
+
+More examples:
+	echo 200 > /sys/fs/cgroup/1/rdma.resource.max_flow
+	echo 10  > /sys/fs/cgroup/1/rdma.resource.max_pd
+	echo 15  > /sys/fs/cgroup/1/rdma.resource.max_srq
+	echo 1   > /sys/fs/cgroup/1/rdma.resource.max_uctx
+
+RDMA resource current usage can be tracked using devices.rdma.resource.usage
+	cat /sys/fs/cgroup/1/devices.rdma.resource.usage
+
 3. Security
 
 Any task can move itself between cgroups.  This clearly won't
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-07 20:38 ` Parav Pandit
                   ` (7 preceding siblings ...)
  (?)
@ 2015-09-07 20:55 ` Parav Pandit
  -1 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-07 20:55 UTC (permalink / raw)
  To: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan, hannes,
	Doug Ledford
  Cc: corbet, james.l.morris, serge, Haggai Eran, Or Gerlitz,
	Matan Barak, raindel, akpm, linux-security-module, Parav Pandit

Hi Doug, Tejun,

This is from cgroups for-4.3 branch.
linux-rdma trunk will face compilation error as its behind Tejun's
for-4.3 branch.
Patch has dependency on the some of the cgroup subsystem functionality
for fork().
Therefore its required to merge those changes first to linux-rdma trunk.

Parav


On Tue, Sep 8, 2015 at 2:08 AM, Parav Pandit <pandit.parav@gmail.com> wrote:
> Currently user space applications can easily take away all the rdma
> device specific resources such as AH, CQ, QP, MR etc. Due to which other
> applications in other cgroup or kernel space ULPs may not even get chance
> to allocate any rdma resources.
>
> This patch-set allows limiting rdma resources to set of processes.
> It extend device cgroup controller for limiting rdma device limits.
>
> With this patch, user verbs module queries rdma device cgroup controller
> to query process's limit to consume such resource. It uncharge resource
> counter after resource is being freed.
>
> It extends the task structure to hold the statistic information about process's
> rdma resource usage so that when process migrates from one to other controller,
> right amount of resources can be migrated from one to other cgroup.
>
> Future patches will support RDMA flows resource and will be enhanced further
> to enforce limit of other resources and capabilities.
>
> Parav Pandit (7):
>   devcg: Added user option to rdma resource tracking.
>   devcg: Added rdma resource tracking module.
>   devcg: Added infrastructure for rdma device cgroup.
>   devcg: Added rdma resource tracker object per task
>   devcg: device cgroup's extension for RDMA resource.
>   devcg: Added support to use RDMA device cgroup.
>   devcg: Added Documentation of RDMA device cgroup.
>
>  Documentation/cgroups/devices.txt     |  32 ++-
>  drivers/infiniband/core/uverbs_cmd.c  | 139 +++++++++--
>  drivers/infiniband/core/uverbs_main.c |  39 +++-
>  include/linux/device_cgroup.h         |  53 +++++
>  include/linux/device_rdma_cgroup.h    |  83 +++++++
>  include/linux/sched.h                 |  12 +-
>  init/Kconfig                          |  12 +
>  security/Makefile                     |   1 +
>  security/device_cgroup.c              | 119 +++++++---
>  security/device_rdma_cgroup.c         | 422 ++++++++++++++++++++++++++++++++++
>  10 files changed, 850 insertions(+), 62 deletions(-)
>  create mode 100644 include/linux/device_rdma_cgroup.h
>  create mode 100644 security/device_rdma_cgroup.c
>
> --
> 1.8.3.1
>

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 3/7] devcg: Added infrastructure for rdma device cgroup.
@ 2015-09-08  5:31     ` Haggai Eran
  0 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-08  5:31 UTC (permalink / raw)
  To: Parav Pandit, cgroups, linux-doc, linux-kernel, linux-rdma, tj,
	lizefan, hannes, dledford
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, raindel, akpm,
	linux-security-module

On 07/09/2015 23:38, Parav Pandit wrote:
> diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
> index 8b64221..cdbdd60 100644
> --- a/include/linux/device_cgroup.h
> +++ b/include/linux/device_cgroup.h
> @@ -1,6 +1,57 @@
> +#ifndef _DEVICE_CGROUP
> +#define _DEVICE_CGROUP
> +
>  #include <linux/fs.h>
> +#include <linux/cgroup.h>
> +#include <linux/device_rdma_cgroup.h>

You cannot add this include line before adding the device_rdma_cgroup.h
(added in patch 5). You should reorder the patches so that after each
patch the kernel builds correctly.

I also noticed in patch 2 you add device_rdma_cgroup.o to the Makefile
before it was added to the kernel.

Regards,
Haggai

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 3/7] devcg: Added infrastructure for rdma device cgroup.
@ 2015-09-08  5:31     ` Haggai Eran
  0 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-08  5:31 UTC (permalink / raw)
  To: Parav Pandit, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w,
	dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

On 07/09/2015 23:38, Parav Pandit wrote:
> diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
> index 8b64221..cdbdd60 100644
> --- a/include/linux/device_cgroup.h
> +++ b/include/linux/device_cgroup.h
> @@ -1,6 +1,57 @@
> +#ifndef _DEVICE_CGROUP
> +#define _DEVICE_CGROUP
> +
>  #include <linux/fs.h>
> +#include <linux/cgroup.h>
> +#include <linux/device_rdma_cgroup.h>

You cannot add this include line before adding the device_rdma_cgroup.h
(added in patch 5). You should reorder the patches so that after each
patch the kernel builds correctly.

I also noticed in patch 2 you add device_rdma_cgroup.o to the Makefile
before it was added to the kernel.

Regards,
Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 4/7] devcg: Added rdma resource tracker object per task
@ 2015-09-08  5:48     ` Haggai Eran
  0 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-08  5:48 UTC (permalink / raw)
  To: Parav Pandit, cgroups, linux-doc, linux-kernel, linux-rdma, tj,
	lizefan, hannes, dledford
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, raindel, akpm,
	linux-security-module

On 07/09/2015 23:38, Parav Pandit wrote:
> @@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct task_struct *p)
>   * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
>   * subscriptions and synchronises with wait4().  Also used in procfs.  Also
>   * pins the final release of task.io_context.  Also protects ->cpuset and
> - * ->cgroup.subsys[]. And ->vfork_done.
> + * ->cgroup.subsys[]. Also projtects ->vfork_done and ->rdma_res_counter.
s/projtects/protects/
>   *
>   * Nests both inside and outside of read_lock(&tasklist_lock).
>   * It must not be nested with write_lock_irq(&tasklist_lock),


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 4/7] devcg: Added rdma resource tracker object per task
@ 2015-09-08  5:48     ` Haggai Eran
  0 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-08  5:48 UTC (permalink / raw)
  To: Parav Pandit, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w,
	dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

On 07/09/2015 23:38, Parav Pandit wrote:
> @@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct task_struct *p)
>   * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
>   * subscriptions and synchronises with wait4().  Also used in procfs.  Also
>   * pins the final release of task.io_context.  Also protects ->cpuset and
> - * ->cgroup.subsys[]. And ->vfork_done.
> + * ->cgroup.subsys[]. Also projtects ->vfork_done and ->rdma_res_counter.
s/projtects/protects/
>   *
>   * Nests both inside and outside of read_lock(&tasklist_lock).
>   * It must not be nested with write_lock_irq(&tasklist_lock),

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 3/7] devcg: Added infrastructure for rdma device cgroup.
@ 2015-09-08  7:02       ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-08  7:02 UTC (permalink / raw)
  To: Haggai Eran
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On Tue, Sep 8, 2015 at 11:01 AM, Haggai Eran <haggaie@mellanox.com> wrote:
> On 07/09/2015 23:38, Parav Pandit wrote:
>> diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
>> index 8b64221..cdbdd60 100644
>> --- a/include/linux/device_cgroup.h
>> +++ b/include/linux/device_cgroup.h
>> @@ -1,6 +1,57 @@
>> +#ifndef _DEVICE_CGROUP
>> +#define _DEVICE_CGROUP
>> +
>>  #include <linux/fs.h>
>> +#include <linux/cgroup.h>
>> +#include <linux/device_rdma_cgroup.h>
>
> You cannot add this include line before adding the device_rdma_cgroup.h
> (added in patch 5). You should reorder the patches so that after each
> patch the kernel builds correctly.
>
o.k. got it. I will send V1 with this suggested changes.

> I also noticed in patch 2 you add device_rdma_cgroup.o to the Makefile
> before it was added to the kernel.
>
o.k.

> Regards,
> Haggai

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 3/7] devcg: Added infrastructure for rdma device cgroup.
@ 2015-09-08  7:02       ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-08  7:02 UTC (permalink / raw)
  To: Haggai Eran
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	lizefan-hv44wF8Li93QT0dZR+AlfA, Johannes Weiner, Doug Ledford,
	Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

On Tue, Sep 8, 2015 at 11:01 AM, Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> On 07/09/2015 23:38, Parav Pandit wrote:
>> diff --git a/include/linux/device_cgroup.h b/include/linux/device_cgroup.h
>> index 8b64221..cdbdd60 100644
>> --- a/include/linux/device_cgroup.h
>> +++ b/include/linux/device_cgroup.h
>> @@ -1,6 +1,57 @@
>> +#ifndef _DEVICE_CGROUP
>> +#define _DEVICE_CGROUP
>> +
>>  #include <linux/fs.h>
>> +#include <linux/cgroup.h>
>> +#include <linux/device_rdma_cgroup.h>
>
> You cannot add this include line before adding the device_rdma_cgroup.h
> (added in patch 5). You should reorder the patches so that after each
> patch the kernel builds correctly.
>
o.k. got it. I will send V1 with this suggested changes.

> I also noticed in patch 2 you add device_rdma_cgroup.o to the Makefile
> before it was added to the kernel.
>
o.k.

> Regards,
> Haggai

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 4/7] devcg: Added rdma resource tracker object per task
  2015-09-08  5:48     ` Haggai Eran
  (?)
@ 2015-09-08  7:04     ` Parav Pandit
  2015-09-08  8:24         ` Haggai Eran
  -1 siblings, 1 reply; 95+ messages in thread
From: Parav Pandit @ 2015-09-08  7:04 UTC (permalink / raw)
  To: Haggai Eran
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On Tue, Sep 8, 2015 at 11:18 AM, Haggai Eran <haggaie@mellanox.com> wrote:
> On 07/09/2015 23:38, Parav Pandit wrote:
>> @@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct task_struct *p)
>>   * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
>>   * subscriptions and synchronises with wait4().  Also used in procfs.  Also
>>   * pins the final release of task.io_context.  Also protects ->cpuset and
>> - * ->cgroup.subsys[]. And ->vfork_done.
>> + * ->cgroup.subsys[]. Also projtects ->vfork_done and ->rdma_res_counter.
> s/projtects/protects/
>>   *
>>   * Nests both inside and outside of read_lock(&tasklist_lock).
>>   * It must not be nested with write_lock_irq(&tasklist_lock),
>

Hi Haggai Eran,
Did you miss to put comments or I missed something?

Parav

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.
  2015-09-07 20:38   ` Parav Pandit
@ 2015-09-08  8:22     ` Haggai Eran
  -1 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-08  8:22 UTC (permalink / raw)
  To: Parav Pandit, cgroups, linux-doc, linux-kernel, linux-rdma, tj,
	lizefan, hannes, dledford
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, raindel, akpm,
	linux-security-module

On 07/09/2015 23:38, Parav Pandit wrote:
> +/* RDMA resources from device cgroup perspective */
> +enum devcgroup_rdma_rt {
> +	DEVCG_RDMA_RES_TYPE_UCTX,
> +	DEVCG_RDMA_RES_TYPE_CQ,
> +	DEVCG_RDMA_RES_TYPE_PD,
> +	DEVCG_RDMA_RES_TYPE_AH,
> +	DEVCG_RDMA_RES_TYPE_MR,
> +	DEVCG_RDMA_RES_TYPE_MW,
I didn't see memory windows in dev_cgroup_files in patch 3. Is it used?
> +	DEVCG_RDMA_RES_TYPE_SRQ,
> +	DEVCG_RDMA_RES_TYPE_QP,
> +	DEVCG_RDMA_RES_TYPE_FLOW,
> +	DEVCG_RDMA_RES_TYPE_MAX,
> +};

> +struct devcgroup_rdma_tracker {
> +	int limit;
> +	atomic_t usage;
> +	int failcnt;
> +};
Have you considered using struct res_counter?

> + * RDMA resource limits are hierarchical, so the highest configured limit of
> + * the hierarchy is enforced. Allowing resource limit configuration to default
> + * cgroup allows fair share to kernel space ULPs as well.
In what way is the highest configured limit of the hierarchy enforced? I
would expect all the limits along the hierarchy to be enforced.

> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
> +{
> +	struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
> +	int type = seq_cft(sf)->private;
> +	u32 usage;
> +
> +	if (dev_cg->rdma.tracker[type].limit ==	DEVCG_RDMA_MAX_RESOURCES) {
> +		seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
> +	} else {
> +		usage = dev_cg->rdma.tracker[type].limit;
If this is the resource limit, don't name it 'usage'.

> +		seq_printf(sf, "%u\n", usage);
> +	}
> +	return 0;
> +}

> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
> +{
> +	struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
> +	int type = seq_cft(sf)->private;
> +	u32 usage;
> +
> +	if (dev_cg->rdma.tracker[type].limit ==	DEVCG_RDMA_MAX_RESOURCES) {
> +		seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
I'm not sure hiding the actual number is good, especially in the
show_usage case.

> +	} else {
> +		usage = dev_cg->rdma.tracker[type].limit;
> +		seq_printf(sf, "%u\n", usage);
> +	}
> +	return 0;
> +}

> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
> +				      enum devcgroup_rdma_rt type, int num)
> +{
> +	struct dev_cgroup *dev_cg, *p;
> +	struct task_struct *ctx_task;
> +
> +	if (!num)
> +		return;
> +
> +	/* get cgroup of ib_ucontext it belong to, to uncharge
> +	 * so that when its called from any worker tasks or any
> +	 * other tasks to which this resource doesn't belong to,
> +	 * it can be uncharged correctly.
> +	 */
> +	if (ucontext)
> +		ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
> +	else
> +		ctx_task = current;
> +	dev_cg = task_devcgroup(ctx_task);
> +
> +	spin_lock(&ctx_task->rdma_res_counter->lock);
Don't you need an rcu read lock and rcu_dereference to access
rdma_res_counter?

> +	ctx_task->rdma_res_counter->usage[type] -= num;
> +
> +	for (p = dev_cg; p; p = parent_devcgroup(p))
> +		uncharge_resource(p, type, num);
> +
> +	spin_unlock(&ctx_task->rdma_res_counter->lock);
> +
> +	if (type == DEVCG_RDMA_RES_TYPE_UCTX)
> +		rdma_free_res_counter(ctx_task);
> +}
> +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource);

> +int devcgroup_rdma_try_charge_resource(enum devcgroup_rdma_rt type, int num)
> +{
> +	struct dev_cgroup *dev_cg = task_devcgroup(current);
> +	struct task_rdma_res_counter *res_cnt = current->rdma_res_counter;
> +	int status;
> +
> +	if (!res_cnt) {
> +		res_cnt = kzalloc(sizeof(*res_cnt), GFP_KERNEL);
> +		if (!res_cnt)
> +			return -ENOMEM;
> +
> +		spin_lock_init(&res_cnt->lock);
> +		rcu_assign_pointer(current->rdma_res_counter, res_cnt);
Don't you need the task lock to update rdma_res_counter here?

> +	}
> +
> +	/* synchronize with migration task by taking lock, to avoid
> +	 * race condition of performing cgroup resource migration
> +	 * in non atomic way with this task, which can leads to leaked
> +	 * resources in older cgroup.
> +	 */
> +	spin_lock(&res_cnt->lock);
> +	status = try_charge_resource(dev_cg, type, num);
> +	if (status)
> +		goto busy;
> +
> +	/* single task updating its rdma resource usage, so atomic is
> +	 * not required.
> +	 */
> +	current->rdma_res_counter->usage[type] += num;
> +
> +busy:
> +	spin_unlock(&res_cnt->lock);
> +	return status;
> +}
> +EXPORT_SYMBOL(devcgroup_rdma_try_charge_resource);

Regards,
Haggai

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.
@ 2015-09-08  8:22     ` Haggai Eran
  0 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-08  8:22 UTC (permalink / raw)
  To: Parav Pandit, cgroups, linux-doc, linux-kernel, linux-rdma, tj,
	lizefan, hannes, dledford
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, raindel, akpm,
	linux-security-module

On 07/09/2015 23:38, Parav Pandit wrote:
> +/* RDMA resources from device cgroup perspective */
> +enum devcgroup_rdma_rt {
> +	DEVCG_RDMA_RES_TYPE_UCTX,
> +	DEVCG_RDMA_RES_TYPE_CQ,
> +	DEVCG_RDMA_RES_TYPE_PD,
> +	DEVCG_RDMA_RES_TYPE_AH,
> +	DEVCG_RDMA_RES_TYPE_MR,
> +	DEVCG_RDMA_RES_TYPE_MW,
I didn't see memory windows in dev_cgroup_files in patch 3. Is it used?
> +	DEVCG_RDMA_RES_TYPE_SRQ,
> +	DEVCG_RDMA_RES_TYPE_QP,
> +	DEVCG_RDMA_RES_TYPE_FLOW,
> +	DEVCG_RDMA_RES_TYPE_MAX,
> +};

> +struct devcgroup_rdma_tracker {
> +	int limit;
> +	atomic_t usage;
> +	int failcnt;
> +};
Have you considered using struct res_counter?

> + * RDMA resource limits are hierarchical, so the highest configured limit of
> + * the hierarchy is enforced. Allowing resource limit configuration to default
> + * cgroup allows fair share to kernel space ULPs as well.
In what way is the highest configured limit of the hierarchy enforced? I
would expect all the limits along the hierarchy to be enforced.

> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
> +{
> +	struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
> +	int type = seq_cft(sf)->private;
> +	u32 usage;
> +
> +	if (dev_cg->rdma.tracker[type].limit ==	DEVCG_RDMA_MAX_RESOURCES) {
> +		seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
> +	} else {
> +		usage = dev_cg->rdma.tracker[type].limit;
If this is the resource limit, don't name it 'usage'.

> +		seq_printf(sf, "%u\n", usage);
> +	}
> +	return 0;
> +}

> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
> +{
> +	struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
> +	int type = seq_cft(sf)->private;
> +	u32 usage;
> +
> +	if (dev_cg->rdma.tracker[type].limit ==	DEVCG_RDMA_MAX_RESOURCES) {
> +		seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
I'm not sure hiding the actual number is good, especially in the
show_usage case.

> +	} else {
> +		usage = dev_cg->rdma.tracker[type].limit;
> +		seq_printf(sf, "%u\n", usage);
> +	}
> +	return 0;
> +}

> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
> +				      enum devcgroup_rdma_rt type, int num)
> +{
> +	struct dev_cgroup *dev_cg, *p;
> +	struct task_struct *ctx_task;
> +
> +	if (!num)
> +		return;
> +
> +	/* get cgroup of ib_ucontext it belong to, to uncharge
> +	 * so that when its called from any worker tasks or any
> +	 * other tasks to which this resource doesn't belong to,
> +	 * it can be uncharged correctly.
> +	 */
> +	if (ucontext)
> +		ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
> +	else
> +		ctx_task = current;
> +	dev_cg = task_devcgroup(ctx_task);
> +
> +	spin_lock(&ctx_task->rdma_res_counter->lock);
Don't you need an rcu read lock and rcu_dereference to access
rdma_res_counter?

> +	ctx_task->rdma_res_counter->usage[type] -= num;
> +
> +	for (p = dev_cg; p; p = parent_devcgroup(p))
> +		uncharge_resource(p, type, num);
> +
> +	spin_unlock(&ctx_task->rdma_res_counter->lock);
> +
> +	if (type == DEVCG_RDMA_RES_TYPE_UCTX)
> +		rdma_free_res_counter(ctx_task);
> +}
> +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource);

> +int devcgroup_rdma_try_charge_resource(enum devcgroup_rdma_rt type, int num)
> +{
> +	struct dev_cgroup *dev_cg = task_devcgroup(current);
> +	struct task_rdma_res_counter *res_cnt = current->rdma_res_counter;
> +	int status;
> +
> +	if (!res_cnt) {
> +		res_cnt = kzalloc(sizeof(*res_cnt), GFP_KERNEL);
> +		if (!res_cnt)
> +			return -ENOMEM;
> +
> +		spin_lock_init(&res_cnt->lock);
> +		rcu_assign_pointer(current->rdma_res_counter, res_cnt);
Don't you need the task lock to update rdma_res_counter here?

> +	}
> +
> +	/* synchronize with migration task by taking lock, to avoid
> +	 * race condition of performing cgroup resource migration
> +	 * in non atomic way with this task, which can leads to leaked
> +	 * resources in older cgroup.
> +	 */
> +	spin_lock(&res_cnt->lock);
> +	status = try_charge_resource(dev_cg, type, num);
> +	if (status)
> +		goto busy;
> +
> +	/* single task updating its rdma resource usage, so atomic is
> +	 * not required.
> +	 */
> +	current->rdma_res_counter->usage[type] += num;
> +
> +busy:
> +	spin_unlock(&res_cnt->lock);
> +	return status;
> +}
> +EXPORT_SYMBOL(devcgroup_rdma_try_charge_resource);

Regards,
Haggai

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 4/7] devcg: Added rdma resource tracker object per task
@ 2015-09-08  8:24         ` Haggai Eran
  0 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-08  8:24 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On 08/09/2015 10:04, Parav Pandit wrote:
> On Tue, Sep 8, 2015 at 11:18 AM, Haggai Eran <haggaie@mellanox.com> wrote:
>> On 07/09/2015 23:38, Parav Pandit wrote:
>>> @@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct task_struct *p)
>>>   * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
>>>   * subscriptions and synchronises with wait4().  Also used in procfs.  Also
>>>   * pins the final release of task.io_context.  Also protects ->cpuset and
>>> - * ->cgroup.subsys[]. And ->vfork_done.
>>> + * ->cgroup.subsys[]. Also projtects ->vfork_done and ->rdma_res_counter.
>> s/projtects/protects/
>>>   *
>>>   * Nests both inside and outside of read_lock(&tasklist_lock).
>>>   * It must not be nested with write_lock_irq(&tasklist_lock),
>>
> 
> Hi Haggai Eran,
> Did you miss to put comments or I missed something?

Yes, I wrote "s/projtects/protects/" to tell you that you have a typo in
your comment. You should change the word "projtects" to "protects".

Haggai


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 4/7] devcg: Added rdma resource tracker object per task
@ 2015-09-08  8:24         ` Haggai Eran
  0 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-08  8:24 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	lizefan-hv44wF8Li93QT0dZR+AlfA, Johannes Weiner, Doug Ledford,
	Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

On 08/09/2015 10:04, Parav Pandit wrote:
> On Tue, Sep 8, 2015 at 11:18 AM, Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>> On 07/09/2015 23:38, Parav Pandit wrote:
>>> @@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct task_struct *p)
>>>   * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
>>>   * subscriptions and synchronises with wait4().  Also used in procfs.  Also
>>>   * pins the final release of task.io_context.  Also protects ->cpuset and
>>> - * ->cgroup.subsys[]. And ->vfork_done.
>>> + * ->cgroup.subsys[]. Also projtects ->vfork_done and ->rdma_res_counter.
>> s/projtects/protects/
>>>   *
>>>   * Nests both inside and outside of read_lock(&tasklist_lock).
>>>   * It must not be nested with write_lock_irq(&tasklist_lock),
>>
> 
> Hi Haggai Eran,
> Did you miss to put comments or I missed something?

Yes, I wrote "s/projtects/protects/" to tell you that you have a typo in
your comment. You should change the word "projtects" to "protects".

Haggai

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 4/7] devcg: Added rdma resource tracker object per task
  2015-09-08  8:24         ` Haggai Eran
  (?)
@ 2015-09-08  8:26         ` Parav Pandit
  -1 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-08  8:26 UTC (permalink / raw)
  To: Haggai Eran
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On Tue, Sep 8, 2015 at 1:54 PM, Haggai Eran <haggaie@mellanox.com> wrote:
> On 08/09/2015 10:04, Parav Pandit wrote:
>> On Tue, Sep 8, 2015 at 11:18 AM, Haggai Eran <haggaie@mellanox.com> wrote:
>>> On 07/09/2015 23:38, Parav Pandit wrote:
>>>> @@ -2676,7 +2686,7 @@ static inline int thread_group_empty(struct task_struct *p)
>>>>   * Protects ->fs, ->files, ->mm, ->group_info, ->comm, keyring
>>>>   * subscriptions and synchronises with wait4().  Also used in procfs.  Also
>>>>   * pins the final release of task.io_context.  Also protects ->cpuset and
>>>> - * ->cgroup.subsys[]. And ->vfork_done.
>>>> + * ->cgroup.subsys[]. Also projtects ->vfork_done and ->rdma_res_counter.
>>> s/projtects/protects/
>>>>   *
>>>>   * Nests both inside and outside of read_lock(&tasklist_lock).
>>>>   * It must not be nested with write_lock_irq(&tasklist_lock),
>>>
>>
>> Hi Haggai Eran,
>> Did you miss to put comments or I missed something?
>
> Yes, I wrote "s/projtects/protects/" to tell you that you have a typo in
> your comment. You should change the word "projtects" to "protects".
>
> Haggai
>
ah. ok. Right. Will correct it.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.
  2015-09-07 20:38   ` Parav Pandit
@ 2015-09-08  8:36     ` Haggai Eran
  -1 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-08  8:36 UTC (permalink / raw)
  To: Parav Pandit, cgroups, linux-doc, linux-kernel, linux-rdma, tj,
	lizefan, hannes, dledford
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, raindel, akpm,
	linux-security-module

On 07/09/2015 23:38, Parav Pandit wrote:
> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
> +				      enum devcgroup_rdma_rt type, int num)
> +{
> +	struct dev_cgroup *dev_cg, *p;
> +	struct task_struct *ctx_task;
> +
> +	if (!num)
> +		return;
> +
> +	/* get cgroup of ib_ucontext it belong to, to uncharge
> +	 * so that when its called from any worker tasks or any
> +	 * other tasks to which this resource doesn't belong to,
> +	 * it can be uncharged correctly.
> +	 */
> +	if (ucontext)
> +		ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
> +	else
> +		ctx_task = current;
So what happens if a process creates a ucontext, forks, and then the
child creates and destroys a CQ? If I understand correctly, created
resources are always charged to the current process (the child), but
when it is destroyed the owner of the ucontext (the parent) will be
uncharged.

Since ucontexts are not meant to be used by multiple processes, I think
it would be okay to always charge the owner process (the one that
created the ucontext).

> +	dev_cg = task_devcgroup(ctx_task);
> +
> +	spin_lock(&ctx_task->rdma_res_counter->lock);
> +	ctx_task->rdma_res_counter->usage[type] -= num;
> +
> +	for (p = dev_cg; p; p = parent_devcgroup(p))
> +		uncharge_resource(p, type, num);
> +
> +	spin_unlock(&ctx_task->rdma_res_counter->lock);
> +
> +	if (type == DEVCG_RDMA_RES_TYPE_UCTX)
> +		rdma_free_res_counter(ctx_task);
> +}
> +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource);


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.
@ 2015-09-08  8:36     ` Haggai Eran
  0 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-08  8:36 UTC (permalink / raw)
  To: Parav Pandit, cgroups, linux-doc, linux-kernel, linux-rdma, tj,
	lizefan, hannes, dledford
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, raindel, akpm,
	linux-security-module

On 07/09/2015 23:38, Parav Pandit wrote:
> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
> +				      enum devcgroup_rdma_rt type, int num)
> +{
> +	struct dev_cgroup *dev_cg, *p;
> +	struct task_struct *ctx_task;
> +
> +	if (!num)
> +		return;
> +
> +	/* get cgroup of ib_ucontext it belong to, to uncharge
> +	 * so that when its called from any worker tasks or any
> +	 * other tasks to which this resource doesn't belong to,
> +	 * it can be uncharged correctly.
> +	 */
> +	if (ucontext)
> +		ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
> +	else
> +		ctx_task = current;
So what happens if a process creates a ucontext, forks, and then the
child creates and destroys a CQ? If I understand correctly, created
resources are always charged to the current process (the child), but
when it is destroyed the owner of the ucontext (the parent) will be
uncharged.

Since ucontexts are not meant to be used by multiple processes, I think
it would be okay to always charge the owner process (the one that
created the ucontext).

> +	dev_cg = task_devcgroup(ctx_task);
> +
> +	spin_lock(&ctx_task->rdma_res_counter->lock);
> +	ctx_task->rdma_res_counter->usage[type] -= num;
> +
> +	for (p = dev_cg; p; p = parent_devcgroup(p))
> +		uncharge_resource(p, type, num);
> +
> +	spin_unlock(&ctx_task->rdma_res_counter->lock);
> +
> +	if (type == DEVCG_RDMA_RES_TYPE_UCTX)
> +		rdma_free_res_counter(ctx_task);
> +}
> +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource);


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 6/7] devcg: Added support to use RDMA device cgroup.
@ 2015-09-08  8:40     ` Haggai Eran
  0 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-08  8:40 UTC (permalink / raw)
  To: Parav Pandit, cgroups, linux-doc, linux-kernel, linux-rdma, tj,
	lizefan, hannes, dledford
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, raindel, akpm,
	linux-security-module

On 07/09/2015 23:38, Parav Pandit wrote:
> +static void init_ucontext_lists(struct ib_ucontext *ucontext)
> +{
> +	INIT_LIST_HEAD(&ucontext->pd_list);
> +	INIT_LIST_HEAD(&ucontext->mr_list);
> +	INIT_LIST_HEAD(&ucontext->mw_list);
> +	INIT_LIST_HEAD(&ucontext->cq_list);
> +	INIT_LIST_HEAD(&ucontext->qp_list);
> +	INIT_LIST_HEAD(&ucontext->srq_list);
> +	INIT_LIST_HEAD(&ucontext->ah_list);
> +	INIT_LIST_HEAD(&ucontext->xrcd_list);
> +	INIT_LIST_HEAD(&ucontext->rule_list);
> +}

I don't see how this change is related to the patch.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 6/7] devcg: Added support to use RDMA device cgroup.
@ 2015-09-08  8:40     ` Haggai Eran
  0 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-08  8:40 UTC (permalink / raw)
  To: Parav Pandit, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w,
	dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

On 07/09/2015 23:38, Parav Pandit wrote:
> +static void init_ucontext_lists(struct ib_ucontext *ucontext)
> +{
> +	INIT_LIST_HEAD(&ucontext->pd_list);
> +	INIT_LIST_HEAD(&ucontext->mr_list);
> +	INIT_LIST_HEAD(&ucontext->mw_list);
> +	INIT_LIST_HEAD(&ucontext->cq_list);
> +	INIT_LIST_HEAD(&ucontext->qp_list);
> +	INIT_LIST_HEAD(&ucontext->srq_list);
> +	INIT_LIST_HEAD(&ucontext->ah_list);
> +	INIT_LIST_HEAD(&ucontext->xrcd_list);
> +	INIT_LIST_HEAD(&ucontext->rule_list);
> +}

I don't see how this change is related to the patch.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.
  2015-09-08  8:22     ` Haggai Eran
  (?)
@ 2015-09-08 10:18     ` Parav Pandit
  2015-09-08 13:50         ` Haggai Eran
  -1 siblings, 1 reply; 95+ messages in thread
From: Parav Pandit @ 2015-09-08 10:18 UTC (permalink / raw)
  To: Haggai Eran
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On Tue, Sep 8, 2015 at 1:52 PM, Haggai Eran <haggaie@mellanox.com> wrote:
> On 07/09/2015 23:38, Parav Pandit wrote:
>> +/* RDMA resources from device cgroup perspective */
>> +enum devcgroup_rdma_rt {
>> +     DEVCG_RDMA_RES_TYPE_UCTX,
>> +     DEVCG_RDMA_RES_TYPE_CQ,
>> +     DEVCG_RDMA_RES_TYPE_PD,
>> +     DEVCG_RDMA_RES_TYPE_AH,
>> +     DEVCG_RDMA_RES_TYPE_MR,
>> +     DEVCG_RDMA_RES_TYPE_MW,
> I didn't see memory windows in dev_cgroup_files in patch 3. Is it used?

ib_uverbs_dereg_mr() needs a fix in my patch for MW and alloc_mw()
also needs to use it.
I will fix it.

>> +     DEVCG_RDMA_RES_TYPE_SRQ,
>> +     DEVCG_RDMA_RES_TYPE_QP,
>> +     DEVCG_RDMA_RES_TYPE_FLOW,
>> +     DEVCG_RDMA_RES_TYPE_MAX,
>> +};
>
>> +struct devcgroup_rdma_tracker {
>> +     int limit;
>> +     atomic_t usage;
>> +     int failcnt;
>> +};
> Have you considered using struct res_counter?

No. I will look into the structure and see if it fits or not.

>
>> + * RDMA resource limits are hierarchical, so the highest configured limit of
>> + * the hierarchy is enforced. Allowing resource limit configuration to default
>> + * cgroup allows fair share to kernel space ULPs as well.
> In what way is the highest configured limit of the hierarchy enforced? I
> would expect all the limits along the hierarchy to be enforced.
>
In  hierarchy, of say 3 cgroups, the smallest limit of the cgroup is applied.

Lets take example to clarify.
Say cg_A, cg_B, cg_C
Role              name                           limit
Parent           cg_A                           100
Child_level1  cg_B (child of cg_A)    20
Child_level2: cg_C (child of cg_B)    50

If the process allocating rdma resource belongs to cg_C, limit lowest
limit in the hierarchy is applied during charge() stage.
If cg_A limit happens to be 10, since 10 is lowest, its limit would be
applicable as you expected.
this is similar to newly added PID subsystem in functionality.

>> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
>> +{
>> +     struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
>> +     int type = seq_cft(sf)->private;
>> +     u32 usage;
>> +
>> +     if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
>> +             seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
>> +     } else {
>> +             usage = dev_cg->rdma.tracker[type].limit;
> If this is the resource limit, don't name it 'usage'.
>
o.k. This is typo mistake from usage show function I made. I will change it.

>> +             seq_printf(sf, "%u\n", usage);
>> +     }
>> +     return 0;
>> +}
>
>> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
>> +{
>> +     struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
>> +     int type = seq_cft(sf)->private;
>> +     u32 usage;
>> +
>> +     if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
>> +             seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
> I'm not sure hiding the actual number is good, especially in the
> show_usage case.

This is similar to following other controller same as newly added PID
subsystem in showing max limit.

>
>> +     } else {
>> +             usage = dev_cg->rdma.tracker[type].limit;
>> +             seq_printf(sf, "%u\n", usage);
>> +     }
>> +     return 0;
>> +}
>
>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
>> +                                   enum devcgroup_rdma_rt type, int num)
>> +{
>> +     struct dev_cgroup *dev_cg, *p;
>> +     struct task_struct *ctx_task;
>> +
>> +     if (!num)
>> +             return;
>> +
>> +     /* get cgroup of ib_ucontext it belong to, to uncharge
>> +      * so that when its called from any worker tasks or any
>> +      * other tasks to which this resource doesn't belong to,
>> +      * it can be uncharged correctly.
>> +      */
>> +     if (ucontext)
>> +             ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
>> +     else
>> +             ctx_task = current;
>> +     dev_cg = task_devcgroup(ctx_task);
>> +
>> +     spin_lock(&ctx_task->rdma_res_counter->lock);
> Don't you need an rcu read lock and rcu_dereference to access
> rdma_res_counter?

I believe, its not required because when uncharge() is happening, it
can happen only from 3 contexts.
(a) from the caller task context, who has made allocation call, so no
synchronizing needed.
(b) from the dealloc resource context, again this is from the same
task context which allocated, it so this is single threaded, no need
to syncronize.
(c) from the fput() context when process is terminated abruptly or as
part of differed cleanup, when this is happening there cannot be
allocator task anyway.

>
>> +     ctx_task->rdma_res_counter->usage[type] -= num;
>> +
>> +     for (p = dev_cg; p; p = parent_devcgroup(p))
>> +             uncharge_resource(p, type, num);
>> +
>> +     spin_unlock(&ctx_task->rdma_res_counter->lock);
>> +
>> +     if (type == DEVCG_RDMA_RES_TYPE_UCTX)
>> +             rdma_free_res_counter(ctx_task);
>> +}
>> +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource);
>
>> +int devcgroup_rdma_try_charge_resource(enum devcgroup_rdma_rt type, int num)
>> +{
>> +     struct dev_cgroup *dev_cg = task_devcgroup(current);
>> +     struct task_rdma_res_counter *res_cnt = current->rdma_res_counter;
>> +     int status;
>> +
>> +     if (!res_cnt) {
>> +             res_cnt = kzalloc(sizeof(*res_cnt), GFP_KERNEL);
>> +             if (!res_cnt)
>> +                     return -ENOMEM;
>> +
>> +             spin_lock_init(&res_cnt->lock);
>> +             rcu_assign_pointer(current->rdma_res_counter, res_cnt);
> Don't you need the task lock to update rdma_res_counter here?
>
No. this is the caller task allocating it, so its single threaded.
It needs to syncronize with migration thread which is reading counters
of all the processes, while they are getting allocated and freed.
Therefore rcu() is sufficient.

>> +     }
>> +
>> +     /* synchronize with migration task by taking lock, to avoid
>> +      * race condition of performing cgroup resource migration
>> +      * in non atomic way with this task, which can leads to leaked
>> +      * resources in older cgroup.
>> +      */
>> +     spin_lock(&res_cnt->lock);
>> +     status = try_charge_resource(dev_cg, type, num);
>> +     if (status)
>> +             goto busy;
>> +
>> +     /* single task updating its rdma resource usage, so atomic is
>> +      * not required.
>> +      */
>> +     current->rdma_res_counter->usage[type] += num;
>> +
>> +busy:
>> +     spin_unlock(&res_cnt->lock);
>> +     return status;
>> +}
>> +EXPORT_SYMBOL(devcgroup_rdma_try_charge_resource);
>
> Regards,
> Haggai

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 6/7] devcg: Added support to use RDMA device cgroup.
  2015-09-08  8:40     ` Haggai Eran
  (?)
@ 2015-09-08 10:22     ` Parav Pandit
  2015-09-08 13:40         ` Haggai Eran
  -1 siblings, 1 reply; 95+ messages in thread
From: Parav Pandit @ 2015-09-08 10:22 UTC (permalink / raw)
  To: Haggai Eran
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On Tue, Sep 8, 2015 at 2:10 PM, Haggai Eran <haggaie@mellanox.com> wrote:
> On 07/09/2015 23:38, Parav Pandit wrote:
>> +static void init_ucontext_lists(struct ib_ucontext *ucontext)
>> +{
>> +     INIT_LIST_HEAD(&ucontext->pd_list);
>> +     INIT_LIST_HEAD(&ucontext->mr_list);
>> +     INIT_LIST_HEAD(&ucontext->mw_list);
>> +     INIT_LIST_HEAD(&ucontext->cq_list);
>> +     INIT_LIST_HEAD(&ucontext->qp_list);
>> +     INIT_LIST_HEAD(&ucontext->srq_list);
>> +     INIT_LIST_HEAD(&ucontext->ah_list);
>> +     INIT_LIST_HEAD(&ucontext->xrcd_list);
>> +     INIT_LIST_HEAD(&ucontext->rule_list);
>> +}
>
> I don't see how this change is related to the patch.

Its not but code which I added makes this function to grow longer, so
to keep it to same readability level, I did the cleanup.
May be I can send separate patch for cleanup?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.
@ 2015-09-08 10:50       ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-08 10:50 UTC (permalink / raw)
  To: Haggai Eran
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On Tue, Sep 8, 2015 at 2:06 PM, Haggai Eran <haggaie@mellanox.com> wrote:
> On 07/09/2015 23:38, Parav Pandit wrote:
>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
>> +                                   enum devcgroup_rdma_rt type, int num)
>> +{
>> +     struct dev_cgroup *dev_cg, *p;
>> +     struct task_struct *ctx_task;
>> +
>> +     if (!num)
>> +             return;
>> +
>> +     /* get cgroup of ib_ucontext it belong to, to uncharge
>> +      * so that when its called from any worker tasks or any
>> +      * other tasks to which this resource doesn't belong to,
>> +      * it can be uncharged correctly.
>> +      */
>> +     if (ucontext)
>> +             ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
>> +     else
>> +             ctx_task = current;
> So what happens if a process creates a ucontext, forks, and then the
> child creates and destroys a CQ? If I understand correctly, created
> resources are always charged to the current process (the child), but
> when it is destroyed the owner of the ucontext (the parent) will be
> uncharged.
>
> Since ucontexts are not meant to be used by multiple processes, I think
> it would be okay to always charge the owner process (the one that
> created the ucontext).

I need to think about it. I would like to avoid keep per task resource
counters for two reasons.
For a while I thought that native fork() doesn't take care to share
the RDMA resources and all CQ, QP dmaable memory from PID namespace
perspective.

1. Because, it could well happen that process and its child process is
created in PID namespace_A, after which child is migrated to new PID
namespace_B.
after which parent from the namespace_A is terminated. I am not sure
how the ucontext ownership changes from parent to child process at
that point today.
I prefer to keep this complexity out if at all it exists as process
migration across namespaces is not a frequent event for which to
optimize the code for.

2. by having per task counter (as cost of memory some memory) allows
to avoid using atomic during charge(), uncharge().

The intent is to have per task (process and thread) to have their
resource counter instance, but I can see that its broken where its
charging parent process as of now without atomics.
As you said its ok to always charge the owner process, I have to relax
2nd requirement and fallback to use atomics for charge(), uncharge()
or I have to get rid of ucontext from the uncharge() API which is
difficult due to fput() being in worker thread context.

>
>> +     dev_cg = task_devcgroup(ctx_task);
>> +
>> +     spin_lock(&ctx_task->rdma_res_counter->lock);
>> +     ctx_task->rdma_res_counter->usage[type] -= num;
>> +
>> +     for (p = dev_cg; p; p = parent_devcgroup(p))
>> +             uncharge_resource(p, type, num);
>> +
>> +     spin_unlock(&ctx_task->rdma_res_counter->lock);
>> +
>> +     if (type == DEVCG_RDMA_RES_TYPE_UCTX)
>> +             rdma_free_res_counter(ctx_task);
>> +}
>> +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource);
>

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.
@ 2015-09-08 10:50       ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-08 10:50 UTC (permalink / raw)
  To: Haggai Eran
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	lizefan-hv44wF8Li93QT0dZR+AlfA, Johannes Weiner, Doug Ledford,
	Jonathan Corbet, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

On Tue, Sep 8, 2015 at 2:06 PM, Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> On 07/09/2015 23:38, Parav Pandit wrote:
>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
>> +                                   enum devcgroup_rdma_rt type, int num)
>> +{
>> +     struct dev_cgroup *dev_cg, *p;
>> +     struct task_struct *ctx_task;
>> +
>> +     if (!num)
>> +             return;
>> +
>> +     /* get cgroup of ib_ucontext it belong to, to uncharge
>> +      * so that when its called from any worker tasks or any
>> +      * other tasks to which this resource doesn't belong to,
>> +      * it can be uncharged correctly.
>> +      */
>> +     if (ucontext)
>> +             ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
>> +     else
>> +             ctx_task = current;
> So what happens if a process creates a ucontext, forks, and then the
> child creates and destroys a CQ? If I understand correctly, created
> resources are always charged to the current process (the child), but
> when it is destroyed the owner of the ucontext (the parent) will be
> uncharged.
>
> Since ucontexts are not meant to be used by multiple processes, I think
> it would be okay to always charge the owner process (the one that
> created the ucontext).

I need to think about it. I would like to avoid keep per task resource
counters for two reasons.
For a while I thought that native fork() doesn't take care to share
the RDMA resources and all CQ, QP dmaable memory from PID namespace
perspective.

1. Because, it could well happen that process and its child process is
created in PID namespace_A, after which child is migrated to new PID
namespace_B.
after which parent from the namespace_A is terminated. I am not sure
how the ucontext ownership changes from parent to child process at
that point today.
I prefer to keep this complexity out if at all it exists as process
migration across namespaces is not a frequent event for which to
optimize the code for.

2. by having per task counter (as cost of memory some memory) allows
to avoid using atomic during charge(), uncharge().

The intent is to have per task (process and thread) to have their
resource counter instance, but I can see that its broken where its
charging parent process as of now without atomics.
As you said its ok to always charge the owner process, I have to relax
2nd requirement and fallback to use atomics for charge(), uncharge()
or I have to get rid of ucontext from the uncharge() API which is
difficult due to fput() being in worker thread context.

>
>> +     dev_cg = task_devcgroup(ctx_task);
>> +
>> +     spin_lock(&ctx_task->rdma_res_counter->lock);
>> +     ctx_task->rdma_res_counter->usage[type] -= num;
>> +
>> +     for (p = dev_cg; p; p = parent_devcgroup(p))
>> +             uncharge_resource(p, type, num);
>> +
>> +     spin_unlock(&ctx_task->rdma_res_counter->lock);
>> +
>> +     if (type == DEVCG_RDMA_RES_TYPE_UCTX)
>> +             rdma_free_res_counter(ctx_task);
>> +}
>> +EXPORT_SYMBOL(devcgroup_rdma_uncharge_resource);
>

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-08 12:45   ` Haggai Eran
  0 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-08 12:45 UTC (permalink / raw)
  To: Parav Pandit, cgroups, linux-doc, linux-kernel, linux-rdma, tj,
	lizefan, hannes, dledford
  Cc: corbet, james.l.morris, serge, ogerlitz, matanb, raindel, akpm,
	linux-security-module

On 07/09/2015 23:38, Parav Pandit wrote:
> Currently user space applications can easily take away all the rdma
> device specific resources such as AH, CQ, QP, MR etc. Due to which other
> applications in other cgroup or kernel space ULPs may not even get chance
> to allocate any rdma resources.
> 
> This patch-set allows limiting rdma resources to set of processes.
> It extend device cgroup controller for limiting rdma device limits.
I don't think extending the device cgroup is the right place for these
limits. It is currently a very generic controller and adding various
RDMA resources to it looks out of place. Why not create a new controller
for rdma?

Another thing I noticed is that all limits in this cgroup are global,
while the resources they control are hardware device specific.
I think it would be better if the cgroup controlled the limits of each
device separately.

> With this patch, user verbs module queries rdma device cgroup controller
> to query process's limit to consume such resource. It uncharge resource 
> counter after resource is being freed.
This is another reason why per-device limits would be better. Since
limits are reflected to user-space when querying a specific device, it
will show the same maximum limit on every device opened. If the user
opens 3 devices they might expect to be able to open 3 times the number
of the resources they actually can.

Regards,
Haggai

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-08 12:45   ` Haggai Eran
  0 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-08 12:45 UTC (permalink / raw)
  To: Parav Pandit, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, tj-DgEjT+Ai2ygdnm+yROfE0A,
	lizefan-hv44wF8Li93QT0dZR+AlfA, hannes-druUgvl0LCNAfugRpC6u6w,
	dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, ogerlitz-VPRAkNaXOzVWk0Htik3J/w,
	matanb-VPRAkNaXOzVWk0Htik3J/w, raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

On 07/09/2015 23:38, Parav Pandit wrote:
> Currently user space applications can easily take away all the rdma
> device specific resources such as AH, CQ, QP, MR etc. Due to which other
> applications in other cgroup or kernel space ULPs may not even get chance
> to allocate any rdma resources.
> 
> This patch-set allows limiting rdma resources to set of processes.
> It extend device cgroup controller for limiting rdma device limits.
I don't think extending the device cgroup is the right place for these
limits. It is currently a very generic controller and adding various
RDMA resources to it looks out of place. Why not create a new controller
for rdma?

Another thing I noticed is that all limits in this cgroup are global,
while the resources they control are hardware device specific.
I think it would be better if the cgroup controlled the limits of each
device separately.

> With this patch, user verbs module queries rdma device cgroup controller
> to query process's limit to consume such resource. It uncharge resource 
> counter after resource is being freed.
This is another reason why per-device limits would be better. Since
limits are reflected to user-space when querying a specific device, it
will show the same maximum limit on every device opened. If the user
opens 3 devices they might expect to be able to open 3 times the number
of the resources they actually can.

Regards,
Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 6/7] devcg: Added support to use RDMA device cgroup.
  2015-09-08 10:22     ` Parav Pandit
@ 2015-09-08 13:40         ` Haggai Eran
  0 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-08 13:40 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On 08/09/2015 13:22, Parav Pandit wrote:
> On Tue, Sep 8, 2015 at 2:10 PM, Haggai Eran <haggaie@mellanox.com> wrote:
>> On 07/09/2015 23:38, Parav Pandit wrote:
>>> +static void init_ucontext_lists(struct ib_ucontext *ucontext)
>>> +{
>>> +     INIT_LIST_HEAD(&ucontext->pd_list);
>>> +     INIT_LIST_HEAD(&ucontext->mr_list);
>>> +     INIT_LIST_HEAD(&ucontext->mw_list);
>>> +     INIT_LIST_HEAD(&ucontext->cq_list);
>>> +     INIT_LIST_HEAD(&ucontext->qp_list);
>>> +     INIT_LIST_HEAD(&ucontext->srq_list);
>>> +     INIT_LIST_HEAD(&ucontext->ah_list);
>>> +     INIT_LIST_HEAD(&ucontext->xrcd_list);
>>> +     INIT_LIST_HEAD(&ucontext->rule_list);
>>> +}
>>
>> I don't see how this change is related to the patch.
> 
> Its not but code which I added makes this function to grow longer, so
> to keep it to same readability level, I did the cleanup.
> May be I can send separate patch for cleanup?

Sounds good to me.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 6/7] devcg: Added support to use RDMA device cgroup.
@ 2015-09-08 13:40         ` Haggai Eran
  0 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-08 13:40 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On 08/09/2015 13:22, Parav Pandit wrote:
> On Tue, Sep 8, 2015 at 2:10 PM, Haggai Eran <haggaie@mellanox.com> wrote:
>> On 07/09/2015 23:38, Parav Pandit wrote:
>>> +static void init_ucontext_lists(struct ib_ucontext *ucontext)
>>> +{
>>> +     INIT_LIST_HEAD(&ucontext->pd_list);
>>> +     INIT_LIST_HEAD(&ucontext->mr_list);
>>> +     INIT_LIST_HEAD(&ucontext->mw_list);
>>> +     INIT_LIST_HEAD(&ucontext->cq_list);
>>> +     INIT_LIST_HEAD(&ucontext->qp_list);
>>> +     INIT_LIST_HEAD(&ucontext->srq_list);
>>> +     INIT_LIST_HEAD(&ucontext->ah_list);
>>> +     INIT_LIST_HEAD(&ucontext->xrcd_list);
>>> +     INIT_LIST_HEAD(&ucontext->rule_list);
>>> +}
>>
>> I don't see how this change is related to the patch.
> 
> Its not but code which I added makes this function to grow longer, so
> to keep it to same readability level, I did the cleanup.
> May be I can send separate patch for cleanup?

Sounds good to me.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.
  2015-09-08 10:18     ` Parav Pandit
@ 2015-09-08 13:50         ` Haggai Eran
  0 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-08 13:50 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On 08/09/2015 13:18, Parav Pandit wrote:
>> >
>>> >> + * RDMA resource limits are hierarchical, so the highest configured limit of
>>> >> + * the hierarchy is enforced. Allowing resource limit configuration to default
>>> >> + * cgroup allows fair share to kernel space ULPs as well.
>> > In what way is the highest configured limit of the hierarchy enforced? I
>> > would expect all the limits along the hierarchy to be enforced.
>> >
> In  hierarchy, of say 3 cgroups, the smallest limit of the cgroup is applied.
> 
> Lets take example to clarify.
> Say cg_A, cg_B, cg_C
> Role              name                           limit
> Parent           cg_A                           100
> Child_level1  cg_B (child of cg_A)    20
> Child_level2: cg_C (child of cg_B)    50
> 
> If the process allocating rdma resource belongs to cg_C, limit lowest
> limit in the hierarchy is applied during charge() stage.
> If cg_A limit happens to be 10, since 10 is lowest, its limit would be
> applicable as you expected.

Looking at the code, the usage in every level is charged. This is what I
would expect. I just think the comment is a bit misleading.

>>> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
>>> +{
>>> +     struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
>>> +     int type = seq_cft(sf)->private;
>>> +     u32 usage;
>>> +
>>> +     if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
>>> +             seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
>> I'm not sure hiding the actual number is good, especially in the
>> show_usage case.
> 
> This is similar to following other controller same as newly added PID
> subsystem in showing max limit.

Okay.

>>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
>>> +                                   enum devcgroup_rdma_rt type, int num)
>>> +{
>>> +     struct dev_cgroup *dev_cg, *p;
>>> +     struct task_struct *ctx_task;
>>> +
>>> +     if (!num)
>>> +             return;
>>> +
>>> +     /* get cgroup of ib_ucontext it belong to, to uncharge
>>> +      * so that when its called from any worker tasks or any
>>> +      * other tasks to which this resource doesn't belong to,
>>> +      * it can be uncharged correctly.
>>> +      */
>>> +     if (ucontext)
>>> +             ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
>>> +     else
>>> +             ctx_task = current;
>>> +     dev_cg = task_devcgroup(ctx_task);
>>> +
>>> +     spin_lock(&ctx_task->rdma_res_counter->lock);
>> Don't you need an rcu read lock and rcu_dereference to access
>> rdma_res_counter?
> 
> I believe, its not required because when uncharge() is happening, it
> can happen only from 3 contexts.
> (a) from the caller task context, who has made allocation call, so no
> synchronizing needed.
> (b) from the dealloc resource context, again this is from the same
> task context which allocated, it so this is single threaded, no need
> to syncronize.
I don't think it is true. You can access uverbs from multiple threads.
What may help your case here I think is the fact that only when the last
ucontext is released you can change the rdma_res_counter field, and
ucontext release takes the ib_uverbs_file->mutex.

Still, I think it would be best to use rcu_dereference(), if only for
documentation and sparse.

> (c) from the fput() context when process is terminated abruptly or as
> part of differed cleanup, when this is happening there cannot be
> allocator task anyway.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.
@ 2015-09-08 13:50         ` Haggai Eran
  0 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-08 13:50 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On 08/09/2015 13:18, Parav Pandit wrote:
>> >
>>> >> + * RDMA resource limits are hierarchical, so the highest configured limit of
>>> >> + * the hierarchy is enforced. Allowing resource limit configuration to default
>>> >> + * cgroup allows fair share to kernel space ULPs as well.
>> > In what way is the highest configured limit of the hierarchy enforced? I
>> > would expect all the limits along the hierarchy to be enforced.
>> >
> In  hierarchy, of say 3 cgroups, the smallest limit of the cgroup is applied.
> 
> Lets take example to clarify.
> Say cg_A, cg_B, cg_C
> Role              name                           limit
> Parent           cg_A                           100
> Child_level1  cg_B (child of cg_A)    20
> Child_level2: cg_C (child of cg_B)    50
> 
> If the process allocating rdma resource belongs to cg_C, limit lowest
> limit in the hierarchy is applied during charge() stage.
> If cg_A limit happens to be 10, since 10 is lowest, its limit would be
> applicable as you expected.

Looking at the code, the usage in every level is charged. This is what I
would expect. I just think the comment is a bit misleading.

>>> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
>>> +{
>>> +     struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
>>> +     int type = seq_cft(sf)->private;
>>> +     u32 usage;
>>> +
>>> +     if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
>>> +             seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
>> I'm not sure hiding the actual number is good, especially in the
>> show_usage case.
> 
> This is similar to following other controller same as newly added PID
> subsystem in showing max limit.

Okay.

>>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
>>> +                                   enum devcgroup_rdma_rt type, int num)
>>> +{
>>> +     struct dev_cgroup *dev_cg, *p;
>>> +     struct task_struct *ctx_task;
>>> +
>>> +     if (!num)
>>> +             return;
>>> +
>>> +     /* get cgroup of ib_ucontext it belong to, to uncharge
>>> +      * so that when its called from any worker tasks or any
>>> +      * other tasks to which this resource doesn't belong to,
>>> +      * it can be uncharged correctly.
>>> +      */
>>> +     if (ucontext)
>>> +             ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
>>> +     else
>>> +             ctx_task = current;
>>> +     dev_cg = task_devcgroup(ctx_task);
>>> +
>>> +     spin_lock(&ctx_task->rdma_res_counter->lock);
>> Don't you need an rcu read lock and rcu_dereference to access
>> rdma_res_counter?
> 
> I believe, its not required because when uncharge() is happening, it
> can happen only from 3 contexts.
> (a) from the caller task context, who has made allocation call, so no
> synchronizing needed.
> (b) from the dealloc resource context, again this is from the same
> task context which allocated, it so this is single threaded, no need
> to syncronize.
I don't think it is true. You can access uverbs from multiple threads.
What may help your case here I think is the fact that only when the last
ucontext is released you can change the rdma_res_counter field, and
ucontext release takes the ib_uverbs_file->mutex.

Still, I think it would be best to use rcu_dereference(), if only for
documentation and sparse.

> (c) from the fput() context when process is terminated abruptly or as
> part of differed cleanup, when this is happening there cannot be
> allocator task anyway.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.
  2015-09-08 10:50       ` Parav Pandit
@ 2015-09-08 14:10         ` Haggai Eran
  -1 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-08 14:10 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On 08/09/2015 13:50, Parav Pandit wrote:
> On Tue, Sep 8, 2015 at 2:06 PM, Haggai Eran <haggaie@mellanox.com> wrote:
>> On 07/09/2015 23:38, Parav Pandit wrote:
>>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
>>> +                                   enum devcgroup_rdma_rt type, int num)
>>> +{
>>> +     struct dev_cgroup *dev_cg, *p;
>>> +     struct task_struct *ctx_task;
>>> +
>>> +     if (!num)
>>> +             return;
>>> +
>>> +     /* get cgroup of ib_ucontext it belong to, to uncharge
>>> +      * so that when its called from any worker tasks or any
>>> +      * other tasks to which this resource doesn't belong to,
>>> +      * it can be uncharged correctly.
>>> +      */
>>> +     if (ucontext)
>>> +             ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
>>> +     else
>>> +             ctx_task = current;
>> So what happens if a process creates a ucontext, forks, and then the
>> child creates and destroys a CQ? If I understand correctly, created
>> resources are always charged to the current process (the child), but
>> when it is destroyed the owner of the ucontext (the parent) will be
>> uncharged.
>>
>> Since ucontexts are not meant to be used by multiple processes, I think
>> it would be okay to always charge the owner process (the one that
>> created the ucontext).
> 
> I need to think about it. I would like to avoid keep per task resource
> counters for two reasons.
> For a while I thought that native fork() doesn't take care to share
> the RDMA resources and all CQ, QP dmaable memory from PID namespace
> perspective.
> 
> 1. Because, it could well happen that process and its child process is
> created in PID namespace_A, after which child is migrated to new PID
> namespace_B.
> after which parent from the namespace_A is terminated. I am not sure
> how the ucontext ownership changes from parent to child process at
> that point today.
> I prefer to keep this complexity out if at all it exists as process
> migration across namespaces is not a frequent event for which to
> optimize the code for.
> 
> 2. by having per task counter (as cost of memory some memory) allows
> to avoid using atomic during charge(), uncharge().
> 
> The intent is to have per task (process and thread) to have their
> resource counter instance, but I can see that its broken where its
> charging parent process as of now without atomics.
> As you said its ok to always charge the owner process, I have to relax
> 2nd requirement and fallback to use atomics for charge(), uncharge()
> or I have to get rid of ucontext from the uncharge() API which is
> difficult due to fput() being in worker thread context.
> 

I think the cost of atomic operations here would normally be negligible
compared to the cost of accessing the hardware to allocate or deallocate
these resources.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.
@ 2015-09-08 14:10         ` Haggai Eran
  0 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-08 14:10 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On 08/09/2015 13:50, Parav Pandit wrote:
> On Tue, Sep 8, 2015 at 2:06 PM, Haggai Eran <haggaie@mellanox.com> wrote:
>> On 07/09/2015 23:38, Parav Pandit wrote:
>>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
>>> +                                   enum devcgroup_rdma_rt type, int num)
>>> +{
>>> +     struct dev_cgroup *dev_cg, *p;
>>> +     struct task_struct *ctx_task;
>>> +
>>> +     if (!num)
>>> +             return;
>>> +
>>> +     /* get cgroup of ib_ucontext it belong to, to uncharge
>>> +      * so that when its called from any worker tasks or any
>>> +      * other tasks to which this resource doesn't belong to,
>>> +      * it can be uncharged correctly.
>>> +      */
>>> +     if (ucontext)
>>> +             ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
>>> +     else
>>> +             ctx_task = current;
>> So what happens if a process creates a ucontext, forks, and then the
>> child creates and destroys a CQ? If I understand correctly, created
>> resources are always charged to the current process (the child), but
>> when it is destroyed the owner of the ucontext (the parent) will be
>> uncharged.
>>
>> Since ucontexts are not meant to be used by multiple processes, I think
>> it would be okay to always charge the owner process (the one that
>> created the ucontext).
> 
> I need to think about it. I would like to avoid keep per task resource
> counters for two reasons.
> For a while I thought that native fork() doesn't take care to share
> the RDMA resources and all CQ, QP dmaable memory from PID namespace
> perspective.
> 
> 1. Because, it could well happen that process and its child process is
> created in PID namespace_A, after which child is migrated to new PID
> namespace_B.
> after which parent from the namespace_A is terminated. I am not sure
> how the ucontext ownership changes from parent to child process at
> that point today.
> I prefer to keep this complexity out if at all it exists as process
> migration across namespaces is not a frequent event for which to
> optimize the code for.
> 
> 2. by having per task counter (as cost of memory some memory) allows
> to avoid using atomic during charge(), uncharge().
> 
> The intent is to have per task (process and thread) to have their
> resource counter instance, but I can see that its broken where its
> charging parent process as of now without atomics.
> As you said its ok to always charge the owner process, I have to relax
> 2nd requirement and fallback to use atomics for charge(), uncharge()
> or I have to get rid of ucontext from the uncharge() API which is
> difficult due to fput() being in worker thread context.
> 

I think the cost of atomic operations here would normally be negligible
compared to the cost of accessing the hardware to allocate or deallocate
these resources.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 5/7] devcg: device cgroup's extension for RDMA resource.
  2015-09-08 13:50         ` Haggai Eran
  (?)
@ 2015-09-08 14:13         ` Parav Pandit
  -1 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-08 14:13 UTC (permalink / raw)
  To: Haggai Eran
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, tj, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On Tue, Sep 8, 2015 at 7:20 PM, Haggai Eran <haggaie@mellanox.com> wrote:
> On 08/09/2015 13:18, Parav Pandit wrote:
>>> >
>>>> >> + * RDMA resource limits are hierarchical, so the highest configured limit of
>>>> >> + * the hierarchy is enforced. Allowing resource limit configuration to default
>>>> >> + * cgroup allows fair share to kernel space ULPs as well.
>>> > In what way is the highest configured limit of the hierarchy enforced? I
>>> > would expect all the limits along the hierarchy to be enforced.
>>> >
>> In  hierarchy, of say 3 cgroups, the smallest limit of the cgroup is applied.
>>
>> Lets take example to clarify.
>> Say cg_A, cg_B, cg_C
>> Role              name                           limit
>> Parent           cg_A                           100
>> Child_level1  cg_B (child of cg_A)    20
>> Child_level2: cg_C (child of cg_B)    50
>>
>> If the process allocating rdma resource belongs to cg_C, limit lowest
>> limit in the hierarchy is applied during charge() stage.
>> If cg_A limit happens to be 10, since 10 is lowest, its limit would be
>> applicable as you expected.
>
> Looking at the code, the usage in every level is charged. This is what I
> would expect. I just think the comment is a bit misleading.
>
>>>> +int devcgroup_rdma_get_max_resource(struct seq_file *sf, void *v)
>>>> +{
>>>> +     struct dev_cgroup *dev_cg = css_to_devcgroup(seq_css(sf));
>>>> +     int type = seq_cft(sf)->private;
>>>> +     u32 usage;
>>>> +
>>>> +     if (dev_cg->rdma.tracker[type].limit == DEVCG_RDMA_MAX_RESOURCES) {
>>>> +             seq_printf(sf, "%s\n", DEVCG_RDMA_MAX_RESOURCE_STR);
>>> I'm not sure hiding the actual number is good, especially in the
>>> show_usage case.
>>
>> This is similar to following other controller same as newly added PID
>> subsystem in showing max limit.
>
> Okay.
>
>>>> +void devcgroup_rdma_uncharge_resource(struct ib_ucontext *ucontext,
>>>> +                                   enum devcgroup_rdma_rt type, int num)
>>>> +{
>>>> +     struct dev_cgroup *dev_cg, *p;
>>>> +     struct task_struct *ctx_task;
>>>> +
>>>> +     if (!num)
>>>> +             return;
>>>> +
>>>> +     /* get cgroup of ib_ucontext it belong to, to uncharge
>>>> +      * so that when its called from any worker tasks or any
>>>> +      * other tasks to which this resource doesn't belong to,
>>>> +      * it can be uncharged correctly.
>>>> +      */
>>>> +     if (ucontext)
>>>> +             ctx_task = get_pid_task(ucontext->tgid, PIDTYPE_PID);
>>>> +     else
>>>> +             ctx_task = current;
>>>> +     dev_cg = task_devcgroup(ctx_task);
>>>> +
>>>> +     spin_lock(&ctx_task->rdma_res_counter->lock);
>>> Don't you need an rcu read lock and rcu_dereference to access
>>> rdma_res_counter?
>>
>> I believe, its not required because when uncharge() is happening, it
>> can happen only from 3 contexts.
>> (a) from the caller task context, who has made allocation call, so no
>> synchronizing needed.
>> (b) from the dealloc resource context, again this is from the same
>> task context which allocated, it so this is single threaded, no need
>> to syncronize.
> I don't think it is true. You can access uverbs from multiple threads.
Yes, thats right. Though I design counter structure allocation on per
task basis for individual thread access, I totally missed out ucontext
sharing among threads. I replied in other thread to make counters
during charge, uncharge to atomic to cover that case.
Therefore I need rcu lock and deference as well.

> What may help your case here I think is the fact that only when the last
> ucontext is released you can change the rdma_res_counter field, and
> ucontext release takes the ib_uverbs_file->mutex.
>
> Still, I think it would be best to use rcu_dereference(), if only for
> documentation and sparse.

yes.

>
>> (c) from the fput() context when process is terminated abruptly or as
>> part of differed cleanup, when this is happening there cannot be
>> allocator task anyway.
>

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-08 15:23   ` Tejun Heo
  0 siblings, 0 replies; 95+ messages in thread
From: Tejun Heo @ 2015-09-08 15:23 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan, hannes,
	dledford, corbet, james.l.morris, serge, haggaie, ogerlitz,
	matanb, raindel, akpm, linux-security-module

Hello, Parav.

On Tue, Sep 08, 2015 at 02:08:16AM +0530, Parav Pandit wrote:
> Currently user space applications can easily take away all the rdma
> device specific resources such as AH, CQ, QP, MR etc. Due to which other
> applications in other cgroup or kernel space ULPs may not even get chance
> to allocate any rdma resources.

Is there something simple I can read up on what each resource is?
What's the usual access control mechanism?

> This patch-set allows limiting rdma resources to set of processes.
> It extend device cgroup controller for limiting rdma device limits.

I don't think this belongs to devcg.  If these make sense as a set of
resources to be controlled via cgroup, the right way prolly would be a
separate controller.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-08 15:23   ` Tejun Heo
  0 siblings, 0 replies; 95+ messages in thread
From: Tejun Heo @ 2015-09-08 15:23 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	hannes-druUgvl0LCNAfugRpC6u6w, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	corbet-T1hC0tSOHrs, james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, haggaie-VPRAkNaXOzVWk0Htik3J/w,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w, matanb-VPRAkNaXOzVWk0Htik3J/w,
	raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

Hello, Parav.

On Tue, Sep 08, 2015 at 02:08:16AM +0530, Parav Pandit wrote:
> Currently user space applications can easily take away all the rdma
> device specific resources such as AH, CQ, QP, MR etc. Due to which other
> applications in other cgroup or kernel space ULPs may not even get chance
> to allocate any rdma resources.

Is there something simple I can read up on what each resource is?
What's the usual access control mechanism?

> This patch-set allows limiting rdma resources to set of processes.
> It extend device cgroup controller for limiting rdma device limits.

I don't think this belongs to devcg.  If these make sense as a set of
resources to be controlled via cgroup, the right way prolly would be a
separate controller.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-08 15:23   ` Tejun Heo
  (?)
@ 2015-09-09  3:57   ` Parav Pandit
  2015-09-10 16:49     ` Tejun Heo
  -1 siblings, 1 reply; 95+ messages in thread
From: Parav Pandit @ 2015-09-09  3:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On Tue, Sep 8, 2015 at 8:53 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Parav.
>
> On Tue, Sep 08, 2015 at 02:08:16AM +0530, Parav Pandit wrote:
>> Currently user space applications can easily take away all the rdma
>> device specific resources such as AH, CQ, QP, MR etc. Due to which other
>> applications in other cgroup or kernel space ULPs may not even get chance
>> to allocate any rdma resources.
>
> Is there something simple I can read up on what each resource is?
> What's the usual access control mechanism?
>
Hi Tejun,
This is one old white paper, but most of the reasoning still holds true on RDMA.
http://h10032.www1.hp.com/ctg/Manual/c00257031.pdf

More notes on RDMA resources and summary:
RDMA allows data transport from one system to other system where RDMA
device implements OSI layers 4 to 1 typically in hardware, drivers.
RDMA device provides data path semantics to perform data transfer in
zero copy manner from one to other host, very similar to local dma
controller.
It also allows data transfer operation from user space application of
one to other system.
In order to do so, all the resources are created using trusted kernel
space which also provides isolation among applications.
These resources include are-  QP (queue pair) to transfer data, CQ
(Completion queue) to indicate completion of data transfer operation,
MR (memory region) to represent user application memory as source or
destination for data transfer.
Common resources are QP, SRQ (shared received queue), CQ, MR, AH
(Address handle), FLOW, PD (protection domain), user context etc.

>> This patch-set allows limiting rdma resources to set of processes.
>> It extend device cgroup controller for limiting rdma device limits.
>
> I don't think this belongs to devcg.  If these make sense as a set of
> resources to be controlled via cgroup, the right way prolly would be a
> separate controller.
>

In past there has been similar comment to have dedicated cgroup
controller for RDMA instead of merging with device cgroup.
I am ok with both the approach, however I prefer to utilize device
controller instead of spinning of new controller for new devices
category.
I anticipate more such need would arise and for new device category,
it might not be worth to have new cgroup controller.
RapidIO though very less popular and upcoming PCIe are on horizon to
offer similar benefits as that of RDMA and in future having one
controller for each of them again would not be right approach.

I certainly seek your and others inputs in this email thread here whether
(a) to continue to extend device cgroup (which support character,
block devices white list) and now RDMA devices
or
(b) to spin of new controller, if so what are the compelling reasons
that it can provide compare to extension.

Current scope of the patch is limited to RDMA resources as first
patch, but for fact I am sure that there are more functionality in
pipe to support via this cgroup by me and others.
So keeping atleast these two aspects in mind, I need input on
direction of dedicated controller or new one.

In future, I anticipate that we might have sub directory to device
cgroup for individual device class to control.
such as,
<sys/fs/cgroup/devices/
     /char
     /block
     /rdma
     /pcie
     /child_cgroup..1..N
Each controllers cgroup access files would remain within their own
scope. We are not there yet from base infrastructure but something to
be done as it matures and users start using it.

> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-09  3:57   ` Parav Pandit
@ 2015-09-10 16:49     ` Tejun Heo
  2015-09-10 17:46         ` Parav Pandit
  2015-09-10 17:48       ` Hefty, Sean
  0 siblings, 2 replies; 95+ messages in thread
From: Tejun Heo @ 2015-09-10 16:49 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

Hello, Parav.

On Wed, Sep 09, 2015 at 09:27:40AM +0530, Parav Pandit wrote:
> This is one old white paper, but most of the reasoning still holds true on RDMA.
> http://h10032.www1.hp.com/ctg/Manual/c00257031.pdf

Just read it.  Much appreciated.

...
> These resources include are-  QP (queue pair) to transfer data, CQ
> (Completion queue) to indicate completion of data transfer operation,
> MR (memory region) to represent user application memory as source or
> destination for data transfer.
> Common resources are QP, SRQ (shared received queue), CQ, MR, AH
> (Address handle), FLOW, PD (protection domain), user context etc.

It's kinda bothering that all these are disparate resources.  I
suppose that each restriction comes from the underlying hardware and
there's no accepted higher level abstraction for these things?

> >> This patch-set allows limiting rdma resources to set of processes.
> >> It extend device cgroup controller for limiting rdma device limits.
> >
> > I don't think this belongs to devcg.  If these make sense as a set of
> > resources to be controlled via cgroup, the right way prolly would be a
> > separate controller.
> >
> 
> In past there has been similar comment to have dedicated cgroup
> controller for RDMA instead of merging with device cgroup.
> I am ok with both the approach, however I prefer to utilize device
> controller instead of spinning of new controller for new devices
> category.
> I anticipate more such need would arise and for new device category,
> it might not be worth to have new cgroup controller.
> RapidIO though very less popular and upcoming PCIe are on horizon to
> offer similar benefits as that of RDMA and in future having one
> controller for each of them again would not be right approach.
>
> I certainly seek your and others inputs in this email thread here whether
> (a) to continue to extend device cgroup (which support character,
> block devices white list) and now RDMA devices
> or
> (b) to spin of new controller, if so what are the compelling reasons
> that it can provide compare to extension.

I'm doubtful that these things are gonna be mainstream w/o building up
higher level abstractions on top and if we ever get there we won't be
talking about MR or CQ or whatever.  Also, whatever next-gen is
unlikely to have enough commonalities when the proposed resource knobs
are this low level, so let's please keep it separate, so that if/when
this goes out of fashion for one reason or another, the controller can
silently wither away too.

> Current scope of the patch is limited to RDMA resources as first
> patch, but for fact I am sure that there are more functionality in
> pipe to support via this cgroup by me and others.
> So keeping atleast these two aspects in mind, I need input on
> direction of dedicated controller or new one.
> 
> In future, I anticipate that we might have sub directory to device
> cgroup for individual device class to control.
> such as,
> <sys/fs/cgroup/devices/
>      /char
>      /block
>      /rdma
>      /pcie
>      /child_cgroup..1..N
> Each controllers cgroup access files would remain within their own
> scope. We are not there yet from base infrastructure but something to
> be done as it matures and users start using it.

I don't think that jives with the rest of cgroup and what generic
block or pcie attributes are directly exposed to applications and need
to be hierarchically controlled via cgroup?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-10 17:46         ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-10 17:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On Thu, Sep 10, 2015 at 10:19 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Parav.
>
> On Wed, Sep 09, 2015 at 09:27:40AM +0530, Parav Pandit wrote:
>> This is one old white paper, but most of the reasoning still holds true on RDMA.
>> http://h10032.www1.hp.com/ctg/Manual/c00257031.pdf
>
> Just read it.  Much appreciated.
>
> ...
>> These resources include are-  QP (queue pair) to transfer data, CQ
>> (Completion queue) to indicate completion of data transfer operation,
>> MR (memory region) to represent user application memory as source or
>> destination for data transfer.
>> Common resources are QP, SRQ (shared received queue), CQ, MR, AH
>> (Address handle), FLOW, PD (protection domain), user context etc.
>
> It's kinda bothering that all these are disparate resources.

Actually not. They are linked resources. Every QP needs associated one
or two CQ, one PD.
Every QP will use few MRs for data transfer.
Here is the good programming guide of the RDMA APIs exposed to the
user space application.

http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf
So first version of the cgroups patch will address the control
operation for section 3.4.


> I suppose that each restriction comes from the underlying hardware and
> there's no accepted higher level abstraction for these things?
>
There is higher level abstraction which is through the verbs layer
currently which does actually expose the hardware resource but in
vendor agnostic way.
There are many vendors who support these verbs layer, some of them
which I know are Mellanox, Intel, Chelsio, Avago/Emulex whose drivers
which support these verbs are in <drivers/infiniband/hw/> kernel tree.

There is higher level APIs above the verb layer, such as MPI,
libfabric, rsocket, rds, pgas, dapl which uses underlying verbs layer.
They all rely on the hardware resource. All of these higher level
abstraction is accepted and well used by certain application class. It
would be long discussion to go over them here.


>> >> This patch-set allows limiting rdma resources to set of processes.
>> >> It extend device cgroup controller for limiting rdma device limits.
>> >
>> > I don't think this belongs to devcg.  If these make sense as a set of
>> > resources to be controlled via cgroup, the right way prolly would be a
>> > separate controller.
>> >
>>
>> In past there has been similar comment to have dedicated cgroup
>> controller for RDMA instead of merging with device cgroup.
>> I am ok with both the approach, however I prefer to utilize device
>> controller instead of spinning of new controller for new devices
>> category.
>> I anticipate more such need would arise and for new device category,
>> it might not be worth to have new cgroup controller.
>> RapidIO though very less popular and upcoming PCIe are on horizon to
>> offer similar benefits as that of RDMA and in future having one
>> controller for each of them again would not be right approach.
>>
>> I certainly seek your and others inputs in this email thread here whether
>> (a) to continue to extend device cgroup (which support character,
>> block devices white list) and now RDMA devices
>> or
>> (b) to spin of new controller, if so what are the compelling reasons
>> that it can provide compare to extension.
>
> I'm doubtful that these things are gonna be mainstream w/o building up
> higher level abstractions on top and if we ever get there we won't be
> talking about MR or CQ or whatever.

Some of the higher level examples I gave above will adapt to resource
allocation failure. Some are actually adaptive to few resource
allocation failure, they do query resources. But its not completely
there yet. Once we have this notion of limited resource in place,
abstraction layer would adapt to relatively smaller value of such
resource.
These higher level abstraction is mainstream. Its shipped at least in
Redhat Enterprise Linux.

> Also, whatever next-gen is
> unlikely to have enough commonalities when the proposed resource knobs
> are this low level,

I agree that resource won't be common in next-gen other transport
whenever they arrive.
But with my existing background working on some of those transport,
they appear similar in nature and it might seek similar knobs.

> so let's please keep it separate, so that if/when
> this goes out of fashion for one reason or another, the controller can
> silently wither away too.
>
>> Current scope of the patch is limited to RDMA resources as first
>> patch, but for fact I am sure that there are more functionality in
>> pipe to support via this cgroup by me and others.
>> So keeping atleast these two aspects in mind, I need input on
>> direction of dedicated controller or new one.
>>
>> In future, I anticipate that we might have sub directory to device
>> cgroup for individual device class to control.
>> such as,
>> <sys/fs/cgroup/devices/
>>      /char
>>      /block
>>      /rdma
>>      /pcie
>>      /child_cgroup..1..N
>> Each controllers cgroup access files would remain within their own
>> scope. We are not there yet from base infrastructure but something to
>> be done as it matures and users start using it.
>
> I don't think that jives with the rest of cgroup and what generic
> block or pcie attributes are directly exposed to applications and need
> to be hierarchically controlled via cgroup?
>
I do agree that currently cgroup doesn't have notion of sub cgroup or
above hierarchy today.
so until than I was considering to implement it under devices cgroup
as generic place without the hierarchy shown above.
Therefore current interface is at device cgroup level.

If you are suggesting to have rdma cgroup as separate entity for near
future, its fine with me.
Later on when next-gen arrives we might have scope to make rdma cgroup
as more generic one. But than it might look like what I described
above.

In past I have discussions with Liran Liss from Mellanox as well on
this topic and we also agreed to have such cgroup controller.
He has recent presentation at Linux foundation event indicating to
have cgroup for RDMA.
Below is the link to it.
http://events.linuxfoundation.org/sites/events/files/slides/containing_rdma_final.pdf
Slides 1 to 7 and slide 13 will give you more insight to it.
Liran and I had similar presentation to RDMA audience with less slides
in RDMA openfabrics summit in March 2015.

I am ok to create separate cgroup for rdma, if community thinks that way.
My preference would be still use device cgroup for above extensions
unless there are fundamental issues that I am missing.
I would let you make the call.
Rdma and other is just another type of device with different
characteristics than character or block, so one device cgroup with sub
functionalities can allow setting knobs.
Every device category will have their own set of knobs for resources,
ACL, limits, policy.
And I think cgroup is certainly better control point than sysfs or
spinning of new control infrastructure for this.
That said, I would like to hear your and communities view on how they
would like to see this shaping up.

> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-10 17:46         ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-10 17:46 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Doug Ledford, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Haggai Eran, Or Gerlitz,
	Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

On Thu, Sep 10, 2015 at 10:19 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello, Parav.
>
> On Wed, Sep 09, 2015 at 09:27:40AM +0530, Parav Pandit wrote:
>> This is one old white paper, but most of the reasoning still holds true on RDMA.
>> http://h10032.www1.hp.com/ctg/Manual/c00257031.pdf
>
> Just read it.  Much appreciated.
>
> ...
>> These resources include are-  QP (queue pair) to transfer data, CQ
>> (Completion queue) to indicate completion of data transfer operation,
>> MR (memory region) to represent user application memory as source or
>> destination for data transfer.
>> Common resources are QP, SRQ (shared received queue), CQ, MR, AH
>> (Address handle), FLOW, PD (protection domain), user context etc.
>
> It's kinda bothering that all these are disparate resources.

Actually not. They are linked resources. Every QP needs associated one
or two CQ, one PD.
Every QP will use few MRs for data transfer.
Here is the good programming guide of the RDMA APIs exposed to the
user space application.

http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf
So first version of the cgroups patch will address the control
operation for section 3.4.


> I suppose that each restriction comes from the underlying hardware and
> there's no accepted higher level abstraction for these things?
>
There is higher level abstraction which is through the verbs layer
currently which does actually expose the hardware resource but in
vendor agnostic way.
There are many vendors who support these verbs layer, some of them
which I know are Mellanox, Intel, Chelsio, Avago/Emulex whose drivers
which support these verbs are in <drivers/infiniband/hw/> kernel tree.

There is higher level APIs above the verb layer, such as MPI,
libfabric, rsocket, rds, pgas, dapl which uses underlying verbs layer.
They all rely on the hardware resource. All of these higher level
abstraction is accepted and well used by certain application class. It
would be long discussion to go over them here.


>> >> This patch-set allows limiting rdma resources to set of processes.
>> >> It extend device cgroup controller for limiting rdma device limits.
>> >
>> > I don't think this belongs to devcg.  If these make sense as a set of
>> > resources to be controlled via cgroup, the right way prolly would be a
>> > separate controller.
>> >
>>
>> In past there has been similar comment to have dedicated cgroup
>> controller for RDMA instead of merging with device cgroup.
>> I am ok with both the approach, however I prefer to utilize device
>> controller instead of spinning of new controller for new devices
>> category.
>> I anticipate more such need would arise and for new device category,
>> it might not be worth to have new cgroup controller.
>> RapidIO though very less popular and upcoming PCIe are on horizon to
>> offer similar benefits as that of RDMA and in future having one
>> controller for each of them again would not be right approach.
>>
>> I certainly seek your and others inputs in this email thread here whether
>> (a) to continue to extend device cgroup (which support character,
>> block devices white list) and now RDMA devices
>> or
>> (b) to spin of new controller, if so what are the compelling reasons
>> that it can provide compare to extension.
>
> I'm doubtful that these things are gonna be mainstream w/o building up
> higher level abstractions on top and if we ever get there we won't be
> talking about MR or CQ or whatever.

Some of the higher level examples I gave above will adapt to resource
allocation failure. Some are actually adaptive to few resource
allocation failure, they do query resources. But its not completely
there yet. Once we have this notion of limited resource in place,
abstraction layer would adapt to relatively smaller value of such
resource.
These higher level abstraction is mainstream. Its shipped at least in
Redhat Enterprise Linux.

> Also, whatever next-gen is
> unlikely to have enough commonalities when the proposed resource knobs
> are this low level,

I agree that resource won't be common in next-gen other transport
whenever they arrive.
But with my existing background working on some of those transport,
they appear similar in nature and it might seek similar knobs.

> so let's please keep it separate, so that if/when
> this goes out of fashion for one reason or another, the controller can
> silently wither away too.
>
>> Current scope of the patch is limited to RDMA resources as first
>> patch, but for fact I am sure that there are more functionality in
>> pipe to support via this cgroup by me and others.
>> So keeping atleast these two aspects in mind, I need input on
>> direction of dedicated controller or new one.
>>
>> In future, I anticipate that we might have sub directory to device
>> cgroup for individual device class to control.
>> such as,
>> <sys/fs/cgroup/devices/
>>      /char
>>      /block
>>      /rdma
>>      /pcie
>>      /child_cgroup..1..N
>> Each controllers cgroup access files would remain within their own
>> scope. We are not there yet from base infrastructure but something to
>> be done as it matures and users start using it.
>
> I don't think that jives with the rest of cgroup and what generic
> block or pcie attributes are directly exposed to applications and need
> to be hierarchically controlled via cgroup?
>
I do agree that currently cgroup doesn't have notion of sub cgroup or
above hierarchy today.
so until than I was considering to implement it under devices cgroup
as generic place without the hierarchy shown above.
Therefore current interface is at device cgroup level.

If you are suggesting to have rdma cgroup as separate entity for near
future, its fine with me.
Later on when next-gen arrives we might have scope to make rdma cgroup
as more generic one. But than it might look like what I described
above.

In past I have discussions with Liran Liss from Mellanox as well on
this topic and we also agreed to have such cgroup controller.
He has recent presentation at Linux foundation event indicating to
have cgroup for RDMA.
Below is the link to it.
http://events.linuxfoundation.org/sites/events/files/slides/containing_rdma_final.pdf
Slides 1 to 7 and slide 13 will give you more insight to it.
Liran and I had similar presentation to RDMA audience with less slides
in RDMA openfabrics summit in March 2015.

I am ok to create separate cgroup for rdma, if community thinks that way.
My preference would be still use device cgroup for above extensions
unless there are fundamental issues that I am missing.
I would let you make the call.
Rdma and other is just another type of device with different
characteristics than character or block, so one device cgroup with sub
functionalities can allow setting knobs.
Every device category will have their own set of knobs for resources,
ACL, limits, policy.
And I think cgroup is certainly better control point than sysfs or
spinning of new control infrastructure for this.
That said, I would like to hear your and communities view on how they
would like to see this shaping up.

> Thanks.
>
> --
> tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 95+ messages in thread

* RE: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-10 16:49     ` Tejun Heo
  2015-09-10 17:46         ` Parav Pandit
@ 2015-09-10 17:48       ` Hefty, Sean
  1 sibling, 0 replies; 95+ messages in thread
From: Hefty, Sean @ 2015-09-10 17:48 UTC (permalink / raw)
  To: Tejun Heo, Parav Pandit
  Cc: cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
	lizefan@huawei.com, Johannes Weiner, Doug Ledford,
	Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel@mellanox.com,
	akpm@linux-foundation.org, linux-security-module@vger.kernel.org

> > In past there has been similar comment to have dedicated cgroup
> > controller for RDMA instead of merging with device cgroup.
> > I am ok with both the approach, however I prefer to utilize device
> > controller instead of spinning of new controller for new devices
> > category.
> > I anticipate more such need would arise and for new device category,
> > it might not be worth to have new cgroup controller.
> > RapidIO though very less popular and upcoming PCIe are on horizon to
> > offer similar benefits as that of RDMA and in future having one
> > controller for each of them again would not be right approach.
> >
> > I certainly seek your and others inputs in this email thread here
> whether
> > (a) to continue to extend device cgroup (which support character,
> > block devices white list) and now RDMA devices
> > or
> > (b) to spin of new controller, if so what are the compelling reasons
> > that it can provide compare to extension.
> 
> I'm doubtful that these things are gonna be mainstream w/o building up
> higher level abstractions on top and if we ever get there we won't be
> talking about MR or CQ or whatever.  Also, whatever next-gen is
> unlikely to have enough commonalities when the proposed resource knobs
> are this low level, so let's please keep it separate, so that if/when
> this goes out of fashion for one reason or another, the controller can
> silently wither away too.

As an attempt to abstract the hardware resources only, what these devices are exposing to apps can be viewed as command queues (RDMA QPs and SRQs), notification queues (RDMA CQs and EQs), and space in the device cache and allocated memory (RDMA MRs and AHs, maybe PDs).

If one wanted a higher level of abstraction, associations exist between these resources.  For example, command queues feed into notification queues.  Address handles are required resources to use an unconnected queue pair.

- Sean

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-10 17:46         ` Parav Pandit
  (?)
@ 2015-09-10 20:22         ` Tejun Heo
  2015-09-11  3:39           ` Parav Pandit
  -1 siblings, 1 reply; 95+ messages in thread
From: Tejun Heo @ 2015-09-10 20:22 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

Hello, Parav.

On Thu, Sep 10, 2015 at 11:16:49PM +0530, Parav Pandit wrote:
> >> These resources include are-  QP (queue pair) to transfer data, CQ
> >> (Completion queue) to indicate completion of data transfer operation,
> >> MR (memory region) to represent user application memory as source or
> >> destination for data transfer.
> >> Common resources are QP, SRQ (shared received queue), CQ, MR, AH
> >> (Address handle), FLOW, PD (protection domain), user context etc.
> >
> > It's kinda bothering that all these are disparate resources.
> 
> Actually not. They are linked resources. Every QP needs associated one
> or two CQ, one PD.
> Every QP will use few MRs for data transfer.

So, if that's the case, let's please implement something higher level.
The goal is providing reasonable isolation or protection.  If that can
be achieved at a higher level of abstraction, please do that.

> Here is the good programming guide of the RDMA APIs exposed to the
> user space application.
> 
> http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf
> So first version of the cgroups patch will address the control
> operation for section 3.4.
> 
> > I suppose that each restriction comes from the underlying hardware and
> > there's no accepted higher level abstraction for these things?
>
> There is higher level abstraction which is through the verbs layer
> currently which does actually expose the hardware resource but in
> vendor agnostic way.
> There are many vendors who support these verbs layer, some of them
> which I know are Mellanox, Intel, Chelsio, Avago/Emulex whose drivers
> which support these verbs are in <drivers/infiniband/hw/> kernel tree.
> 
> There is higher level APIs above the verb layer, such as MPI,
> libfabric, rsocket, rds, pgas, dapl which uses underlying verbs layer.
> They all rely on the hardware resource. All of these higher level
> abstraction is accepted and well used by certain application class. It
> would be long discussion to go over them here.

Well, the programming interface that userland builds on top doesn't
matter too much here but if there is a common resource abstraction
which can be made in terms of constructs that consumers of the
facility would care about, that likely is a better choice than
exposing whatever hardware exposes.

> > I'm doubtful that these things are gonna be mainstream w/o building up
> > higher level abstractions on top and if we ever get there we won't be
> > talking about MR or CQ or whatever.
> 
> Some of the higher level examples I gave above will adapt to resource
> allocation failure. Some are actually adaptive to few resource
> allocation failure, they do query resources. But its not completely
> there yet. Once we have this notion of limited resource in place,
> abstraction layer would adapt to relatively smaller value of such
> resource.
>
> These higher level abstraction is mainstream. Its shipped at least in
> Redhat Enterprise Linux.

Again, I was talking more about resource abstraction - e.g. something
along the line of "I want N command buffers".

> > Also, whatever next-gen is
> > unlikely to have enough commonalities when the proposed resource knobs
> > are this low level,
> 
> I agree that resource won't be common in next-gen other transport
> whenever they arrive.
> But with my existing background working on some of those transport,
> they appear similar in nature and it might seek similar knobs.

I don't know.  What's proposed in this thread seems way too low level
to be useful anywhere else.  Also, what if there are multiple devices?
Is that a problem to worry about?

> In past I have discussions with Liran Liss from Mellanox as well on
> this topic and we also agreed to have such cgroup controller.
> He has recent presentation at Linux foundation event indicating to
> have cgroup for RDMA.
> Below is the link to it.
> http://events.linuxfoundation.org/sites/events/files/slides/containing_rdma_final.pdf
> Slides 1 to 7 and slide 13 will give you more insight to it.
> Liran and I had similar presentation to RDMA audience with less slides
> in RDMA openfabrics summit in March 2015.
>
> I am ok to create separate cgroup for rdma, if community thinks that way.
> My preference would be still use device cgroup for above extensions
> unless there are fundamental issues that I am missing.

The thing is that they aren't related at all in any way.  There's no
reason to tie them together.  In fact, the way we did devcg is
backward.  The ideal solution would have been extending the usual ACL
to understand cgroups so that it's a natural growth of the permission
system.

You're talking about actual hardware resources.  That has nothing to
do with access permissions on device nodes.

> I would let you make the call.
> Rdma and other is just another type of device with different
> characteristics than character or block, so one device cgroup with sub
> functionalities can allow setting knobs.
> Every device category will have their own set of knobs for resources,
> ACL, limits, policy.

I'm kinda doubtful we're gonna have too many of these.  Hardware
details being exposed to userland this directly isn't common.

> And I think cgroup is certainly better control point than sysfs or
> spinning of new control infrastructure for this.
> That said, I would like to hear your and communities view on how they
> would like to see this shaping up.

I'd say keep it simple and do the minimum. :)

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-10 20:22         ` Tejun Heo
@ 2015-09-11  3:39           ` Parav Pandit
  2015-09-11  4:04               ` Tejun Heo
  0 siblings, 1 reply; 95+ messages in thread
From: Parav Pandit @ 2015-09-11  3:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On Fri, Sep 11, 2015 at 1:52 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Parav.
>
> On Thu, Sep 10, 2015 at 11:16:49PM +0530, Parav Pandit wrote:
>> >> These resources include are-  QP (queue pair) to transfer data, CQ
>> >> (Completion queue) to indicate completion of data transfer operation,
>> >> MR (memory region) to represent user application memory as source or
>> >> destination for data transfer.
>> >> Common resources are QP, SRQ (shared received queue), CQ, MR, AH
>> >> (Address handle), FLOW, PD (protection domain), user context etc.
>> >
>> > It's kinda bothering that all these are disparate resources.
>>
>> Actually not. They are linked resources. Every QP needs associated one
>> or two CQ, one PD.
>> Every QP will use few MRs for data transfer.
>
> So, if that's the case, let's please implement something higher level.
> The goal is providing reasonable isolation or protection.  If that can
> be achieved at a higher level of abstraction, please do that.
>
>> Here is the good programming guide of the RDMA APIs exposed to the
>> user space application.
>>
>> http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf
>> So first version of the cgroups patch will address the control
>> operation for section 3.4.
>>
>> > I suppose that each restriction comes from the underlying hardware and
>> > there's no accepted higher level abstraction for these things?
>>
>> There is higher level abstraction which is through the verbs layer
>> currently which does actually expose the hardware resource but in
>> vendor agnostic way.
>> There are many vendors who support these verbs layer, some of them
>> which I know are Mellanox, Intel, Chelsio, Avago/Emulex whose drivers
>> which support these verbs are in <drivers/infiniband/hw/> kernel tree.
>>
>> There is higher level APIs above the verb layer, such as MPI,
>> libfabric, rsocket, rds, pgas, dapl which uses underlying verbs layer.
>> They all rely on the hardware resource. All of these higher level
>> abstraction is accepted and well used by certain application class. It
>> would be long discussion to go over them here.
>
> Well, the programming interface that userland builds on top doesn't
> matter too much here but if there is a common resource abstraction
> which can be made in terms of constructs that consumers of the
> facility would care about, that likely is a better choice than
> exposing whatever hardware exposes.
>

Tejun,
The fact is that user level application uses hardware resources.
Verbs layer is software abstraction for it. Drivers are hiding how
they implement this QP or CQ or whatever hardware resource they
project via API layer.
For all of the userland on top of verb layer I mentioned above, the
common resource abstraction is these resources AH, QP, CQ, MR etc.
Hardware (and driver) might have different view of this resource in
their real implementation.
For example, verb layer can say that it has 100 QPs, but hardware
might actually have 20 QPs that driver decide how to efficiently use
it.

>> > I'm doubtful that these things are gonna be mainstream w/o building up
>> > higher level abstractions on top and if we ever get there we won't be
>> > talking about MR or CQ or whatever.
>>
>> Some of the higher level examples I gave above will adapt to resource
>> allocation failure. Some are actually adaptive to few resource
>> allocation failure, they do query resources. But its not completely
>> there yet. Once we have this notion of limited resource in place,
>> abstraction layer would adapt to relatively smaller value of such
>> resource.
>>
>> These higher level abstraction is mainstream. Its shipped at least in
>> Redhat Enterprise Linux.
>
> Again, I was talking more about resource abstraction - e.g. something
> along the line of "I want N command buffers".
>

Yes. We are still talking of resource abstraction here.
RDMA and IBTA defines these resources. On top of these resources
various frameworks are build.
so for example,
User land is tuning environment deploying for MPI application,
it would configure:
10 processes from the PID controller,
10 CPUs in cpuset controller,
1 PD, 20 CQ, 10 QP, 100 MRs in rdma controller,

say user land is tuning environment for deploying rsocket application
for 100 connections,
it would configure, 100 PD, 100 QP, 200 MR.
When verb layer see failure with it, they will adapt to live with what
they have at lower performance.

Since every higher level which I mentioned in different in the way, it
uses RDMA resources, we cannot generalize it as "N command buffers".
That generalization in my mind is the - rdma resources - central common entity.

>> > Also, whatever next-gen is
>> > unlikely to have enough commonalities when the proposed resource knobs
>> > are this low level,
>>
>> I agree that resource won't be common in next-gen other transport
>> whenever they arrive.
>> But with my existing background working on some of those transport,
>> they appear similar in nature and it might seek similar knobs.
>
> I don't know.  What's proposed in this thread seems way too low level
> to be useful anywhere else.  Also, what if there are multiple devices?
> Is that a problem to worry about?
>
o.k. It doesn't have to be useful anywhere else. If it suffice the
need of RDMA applications, its fine for near future.
This patch allows limiting resources across multiple devices.
As we go along the path, and if requirement come up to have knob on
per device basis, thats something we can extend in future.

>
>> I would let you make the call.
>> Rdma and other is just another type of device with different
>> characteristics than character or block, so one device cgroup with sub
>> functionalities can allow setting knobs.
>> Every device category will have their own set of knobs for resources,
>> ACL, limits, policy.
>
> I'm kinda doubtful we're gonna have too many of these.  Hardware
> details being exposed to userland this directly isn't common.
>

Its common in RDMA applications. Again they may not be real hardware
resource, its just API layer which defines those RDMA constructs.

>> And I think cgroup is certainly better control point than sysfs or
>> spinning of new control infrastructure for this.
>> That said, I would like to hear your and communities view on how they
>> would like to see this shaping up.
>
> I'd say keep it simple and do the minimum. :)
>
o.k. In that case new rdma cgroup controller which does rdma resource
accounting is possibly the most simplest form?
Make sense?

> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-11  4:04               ` Tejun Heo
  0 siblings, 0 replies; 95+ messages in thread
From: Tejun Heo @ 2015-09-11  4:04 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

Hello, Parav.

On Fri, Sep 11, 2015 at 09:09:58AM +0530, Parav Pandit wrote:
> The fact is that user level application uses hardware resources.
> Verbs layer is software abstraction for it. Drivers are hiding how
> they implement this QP or CQ or whatever hardware resource they
> project via API layer.
> For all of the userland on top of verb layer I mentioned above, the
> common resource abstraction is these resources AH, QP, CQ, MR etc.
> Hardware (and driver) might have different view of this resource in
> their real implementation.
> For example, verb layer can say that it has 100 QPs, but hardware
> might actually have 20 QPs that driver decide how to efficiently use
> it.

My uneducated suspicion is that the abstraction is just not developed
enough.  It should be possible to virtualize these resources through,
most likely, time-sharing to the level where userland simply says "I
want this chunk transferred there" and OS schedules the transfer
prioritizing competing requests.

It could be that given the use cases rdma might not need such level of
abstraction - e.g. most users want to be and are pretty close to bare
metal, but, if that's true, it also kinda is weird to build
hierarchical resource distribution scheme on top of such bare
abstraction.

...
> > I don't know.  What's proposed in this thread seems way too low level
> > to be useful anywhere else.  Also, what if there are multiple devices?
> > Is that a problem to worry about?
>
> o.k. It doesn't have to be useful anywhere else. If it suffice the
> need of RDMA applications, its fine for near future.
> This patch allows limiting resources across multiple devices.
> As we go along the path, and if requirement come up to have knob on
> per device basis, thats something we can extend in future.

You kinda have to decide that upfront cuz it gets baked into the
interface.

> > I'm kinda doubtful we're gonna have too many of these.  Hardware
> > details being exposed to userland this directly isn't common.
> 
> Its common in RDMA applications. Again they may not be real hardware
> resource, its just API layer which defines those RDMA constructs.

It's still a very low level of abstraction which pretty much gets
decided by what the hardware and driver decide to do.

> > I'd say keep it simple and do the minimum. :)
>
> o.k. In that case new rdma cgroup controller which does rdma resource
> accounting is possibly the most simplest form?
> Make sense?

So, this fits cgroup's purpose to certain level but it feels like
we're trying to build too much on top of something which hasn't
developed sufficiently.  I suppose it could be that this is the level
of development that rdma is gonna reach and dumb cgroup controller can
be useful for some use cases.  I don't know, so, yeah, let's keep it
simple and avoid doing crazy stuff.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-11  4:04               ` Tejun Heo
  0 siblings, 0 replies; 95+ messages in thread
From: Tejun Heo @ 2015-09-11  4:04 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Doug Ledford, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Haggai Eran, Or Gerlitz,
	Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

Hello, Parav.

On Fri, Sep 11, 2015 at 09:09:58AM +0530, Parav Pandit wrote:
> The fact is that user level application uses hardware resources.
> Verbs layer is software abstraction for it. Drivers are hiding how
> they implement this QP or CQ or whatever hardware resource they
> project via API layer.
> For all of the userland on top of verb layer I mentioned above, the
> common resource abstraction is these resources AH, QP, CQ, MR etc.
> Hardware (and driver) might have different view of this resource in
> their real implementation.
> For example, verb layer can say that it has 100 QPs, but hardware
> might actually have 20 QPs that driver decide how to efficiently use
> it.

My uneducated suspicion is that the abstraction is just not developed
enough.  It should be possible to virtualize these resources through,
most likely, time-sharing to the level where userland simply says "I
want this chunk transferred there" and OS schedules the transfer
prioritizing competing requests.

It could be that given the use cases rdma might not need such level of
abstraction - e.g. most users want to be and are pretty close to bare
metal, but, if that's true, it also kinda is weird to build
hierarchical resource distribution scheme on top of such bare
abstraction.

...
> > I don't know.  What's proposed in this thread seems way too low level
> > to be useful anywhere else.  Also, what if there are multiple devices?
> > Is that a problem to worry about?
>
> o.k. It doesn't have to be useful anywhere else. If it suffice the
> need of RDMA applications, its fine for near future.
> This patch allows limiting resources across multiple devices.
> As we go along the path, and if requirement come up to have knob on
> per device basis, thats something we can extend in future.

You kinda have to decide that upfront cuz it gets baked into the
interface.

> > I'm kinda doubtful we're gonna have too many of these.  Hardware
> > details being exposed to userland this directly isn't common.
> 
> Its common in RDMA applications. Again they may not be real hardware
> resource, its just API layer which defines those RDMA constructs.

It's still a very low level of abstraction which pretty much gets
decided by what the hardware and driver decide to do.

> > I'd say keep it simple and do the minimum. :)
>
> o.k. In that case new rdma cgroup controller which does rdma resource
> accounting is possibly the most simplest form?
> Make sense?

So, this fits cgroup's purpose to certain level but it feels like
we're trying to build too much on top of something which hasn't
developed sufficiently.  I suppose it could be that this is the level
of development that rdma is gonna reach and dumb cgroup controller can
be useful for some use cases.  I don't know, so, yeah, let's keep it
simple and avoid doing crazy stuff.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-11  4:24                 ` Doug Ledford
  0 siblings, 0 replies; 95+ messages in thread
From: Doug Ledford @ 2015-09-11  4:24 UTC (permalink / raw)
  To: Tejun Heo, Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Jonathan Corbet, james.l.morris, serge,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

[-- Attachment #1: Type: text/plain, Size: 6374 bytes --]

On 09/11/2015 12:04 AM, Tejun Heo wrote:
> Hello, Parav.
> 
> On Fri, Sep 11, 2015 at 09:09:58AM +0530, Parav Pandit wrote:
>> The fact is that user level application uses hardware resources.
>> Verbs layer is software abstraction for it. Drivers are hiding how
>> they implement this QP or CQ or whatever hardware resource they
>> project via API layer.
>> For all of the userland on top of verb layer I mentioned above, the
>> common resource abstraction is these resources AH, QP, CQ, MR etc.
>> Hardware (and driver) might have different view of this resource in
>> their real implementation.
>> For example, verb layer can say that it has 100 QPs, but hardware
>> might actually have 20 QPs that driver decide how to efficiently use
>> it.
> 
> My uneducated suspicion is that the abstraction is just not developed
> enough.

The abstraction is 10+ years old.  It has had plenty of time to ferment
and something better for the specific use case has not emerged.

>  It should be possible to virtualize these resources through,
> most likely, time-sharing to the level where userland simply says "I
> want this chunk transferred there" and OS schedules the transfer
> prioritizing competing requests.

No.  And if you think this, then you miss the *entire* point of RDMA
technologies.  An analogy that I have used many times in presentations
is that, in the networking world, the kernel is both a postman and a
copy machine.  It receives all incoming packets and must sort them to
the right recipient (the postman job) and when the user space
application is ready to use the information it must copy it into the
user's VM space because it couldn't just put the user's data buffer on
the RX buffer list since each buffer might belong to anyone (the copy
machine).  In the RDMA world, you create a new queue pair, it is often a
long lived connection (like a socket), but it belongs now to the app and
the app can directly queue both send and receive buffers to the card and
on incoming packets the card will be able to know that the packet
belongs to a specific queue pair and will immediately go to that apps
buffer.  You can *not* do this with TCP without moving to complete TCP
offload on the card, registration of specific sockets on the card, and
then allowing the application to pre-register receive buffers for a
specific socket to the card so that incoming data on the wire can go
straight to the right place.  If you ever get to the point of "OS
schedules the transfer" then you might as well throw RDMA out the window
because you have totally trashed the benefit it provides.

> It could be that given the use cases rdma might not need such level of
> abstraction - e.g. most users want to be and are pretty close to bare
> metal, but, if that's true, it also kinda is weird to build
> hierarchical resource distribution scheme on top of such bare
> abstraction.

Not really.  If you are going to have a bare abstraction, this one isn't
really a bad one.  You have devices.  On a device, you allocate
protection domains (PDs).  If you don't care about cross connection
issues, you ignore this and only use one.  If you do care, this acts
like a process's unique VM space only for RDMA buffers, it is a domain
to protect the data of one connection from another.  Then you have queue
pairs (QPs) which are roughly the equivalent of a socket.  Each QP has
at least one Completion Queue where you get the events that tell you
things have completed (although they often use two, one for send
completions and one for receive completions).  And then you use some
number of memory registrations (MRs) and address handles (AHs) depending
on your usage.  Since RDMA stands for Remote Direct Memory Access, as
you can imagine, giving a remote machine free reign to access all of the
physical memory in your machine is a security issue.  The MRs help to
control what memory the remote host on a specific QP has access to.  The
AHs control how we actually route packets from ourselves to the remote host.

Here's the deal.  You might be able to create an abstraction above this
that hides *some* of this.  But it can't hide even nearly all of it
without loosing significant functionality.  The problem here is that you
are thinking about RDMA connections like sockets.  They aren't.  Not
even close.  They are "how do I allow a remote machine to directly read
and write into my machines physical memory in an even remotely close to
secure manner?"  These resources aren't hardware resources, they are the
abstraction resources needed to answer that question.

> ...
>>> I don't know.  What's proposed in this thread seems way too low level
>>> to be useful anywhere else.  Also, what if there are multiple devices?
>>> Is that a problem to worry about?
>>
>> o.k. It doesn't have to be useful anywhere else. If it suffice the
>> need of RDMA applications, its fine for near future.
>> This patch allows limiting resources across multiple devices.
>> As we go along the path, and if requirement come up to have knob on
>> per device basis, thats something we can extend in future.
> 
> You kinda have to decide that upfront cuz it gets baked into the
> interface.
> 
>>> I'm kinda doubtful we're gonna have too many of these.  Hardware
>>> details being exposed to userland this directly isn't common.
>>
>> Its common in RDMA applications. Again they may not be real hardware
>> resource, its just API layer which defines those RDMA constructs.
> 
> It's still a very low level of abstraction which pretty much gets
> decided by what the hardware and driver decide to do.
> 
>>> I'd say keep it simple and do the minimum. :)
>>
>> o.k. In that case new rdma cgroup controller which does rdma resource
>> accounting is possibly the most simplest form?
>> Make sense?
> 
> So, this fits cgroup's purpose to certain level but it feels like
> we're trying to build too much on top of something which hasn't
> developed sufficiently.  I suppose it could be that this is the level
> of development that rdma is gonna reach and dumb cgroup controller can
> be useful for some use cases.  I don't know, so, yeah, let's keep it
> simple and avoid doing crazy stuff.
> 
> Thanks.
> 


-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: 0E572FDD



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-11  4:24                 ` Doug Ledford
  0 siblings, 0 replies; 95+ messages in thread
From: Doug Ledford @ 2015-09-11  4:24 UTC (permalink / raw)
  To: Tejun Heo, Parav Pandit
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Haggai Eran, Or Gerlitz,
	Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 6403 bytes --]

On 09/11/2015 12:04 AM, Tejun Heo wrote:
> Hello, Parav.
> 
> On Fri, Sep 11, 2015 at 09:09:58AM +0530, Parav Pandit wrote:
>> The fact is that user level application uses hardware resources.
>> Verbs layer is software abstraction for it. Drivers are hiding how
>> they implement this QP or CQ or whatever hardware resource they
>> project via API layer.
>> For all of the userland on top of verb layer I mentioned above, the
>> common resource abstraction is these resources AH, QP, CQ, MR etc.
>> Hardware (and driver) might have different view of this resource in
>> their real implementation.
>> For example, verb layer can say that it has 100 QPs, but hardware
>> might actually have 20 QPs that driver decide how to efficiently use
>> it.
> 
> My uneducated suspicion is that the abstraction is just not developed
> enough.

The abstraction is 10+ years old.  It has had plenty of time to ferment
and something better for the specific use case has not emerged.

>  It should be possible to virtualize these resources through,
> most likely, time-sharing to the level where userland simply says "I
> want this chunk transferred there" and OS schedules the transfer
> prioritizing competing requests.

No.  And if you think this, then you miss the *entire* point of RDMA
technologies.  An analogy that I have used many times in presentations
is that, in the networking world, the kernel is both a postman and a
copy machine.  It receives all incoming packets and must sort them to
the right recipient (the postman job) and when the user space
application is ready to use the information it must copy it into the
user's VM space because it couldn't just put the user's data buffer on
the RX buffer list since each buffer might belong to anyone (the copy
machine).  In the RDMA world, you create a new queue pair, it is often a
long lived connection (like a socket), but it belongs now to the app and
the app can directly queue both send and receive buffers to the card and
on incoming packets the card will be able to know that the packet
belongs to a specific queue pair and will immediately go to that apps
buffer.  You can *not* do this with TCP without moving to complete TCP
offload on the card, registration of specific sockets on the card, and
then allowing the application to pre-register receive buffers for a
specific socket to the card so that incoming data on the wire can go
straight to the right place.  If you ever get to the point of "OS
schedules the transfer" then you might as well throw RDMA out the window
because you have totally trashed the benefit it provides.

> It could be that given the use cases rdma might not need such level of
> abstraction - e.g. most users want to be and are pretty close to bare
> metal, but, if that's true, it also kinda is weird to build
> hierarchical resource distribution scheme on top of such bare
> abstraction.

Not really.  If you are going to have a bare abstraction, this one isn't
really a bad one.  You have devices.  On a device, you allocate
protection domains (PDs).  If you don't care about cross connection
issues, you ignore this and only use one.  If you do care, this acts
like a process's unique VM space only for RDMA buffers, it is a domain
to protect the data of one connection from another.  Then you have queue
pairs (QPs) which are roughly the equivalent of a socket.  Each QP has
at least one Completion Queue where you get the events that tell you
things have completed (although they often use two, one for send
completions and one for receive completions).  And then you use some
number of memory registrations (MRs) and address handles (AHs) depending
on your usage.  Since RDMA stands for Remote Direct Memory Access, as
you can imagine, giving a remote machine free reign to access all of the
physical memory in your machine is a security issue.  The MRs help to
control what memory the remote host on a specific QP has access to.  The
AHs control how we actually route packets from ourselves to the remote host.

Here's the deal.  You might be able to create an abstraction above this
that hides *some* of this.  But it can't hide even nearly all of it
without loosing significant functionality.  The problem here is that you
are thinking about RDMA connections like sockets.  They aren't.  Not
even close.  They are "how do I allow a remote machine to directly read
and write into my machines physical memory in an even remotely close to
secure manner?"  These resources aren't hardware resources, they are the
abstraction resources needed to answer that question.

> ...
>>> I don't know.  What's proposed in this thread seems way too low level
>>> to be useful anywhere else.  Also, what if there are multiple devices?
>>> Is that a problem to worry about?
>>
>> o.k. It doesn't have to be useful anywhere else. If it suffice the
>> need of RDMA applications, its fine for near future.
>> This patch allows limiting resources across multiple devices.
>> As we go along the path, and if requirement come up to have knob on
>> per device basis, thats something we can extend in future.
> 
> You kinda have to decide that upfront cuz it gets baked into the
> interface.
> 
>>> I'm kinda doubtful we're gonna have too many of these.  Hardware
>>> details being exposed to userland this directly isn't common.
>>
>> Its common in RDMA applications. Again they may not be real hardware
>> resource, its just API layer which defines those RDMA constructs.
> 
> It's still a very low level of abstraction which pretty much gets
> decided by what the hardware and driver decide to do.
> 
>>> I'd say keep it simple and do the minimum. :)
>>
>> o.k. In that case new rdma cgroup controller which does rdma resource
>> accounting is possibly the most simplest form?
>> Make sense?
> 
> So, this fits cgroup's purpose to certain level but it feels like
> we're trying to build too much on top of something which hasn't
> developed sufficiently.  I suppose it could be that this is the level
> of development that rdma is gonna reach and dumb cgroup controller can
> be useful for some use cases.  I don't know, so, yeah, let's keep it
> simple and avoid doing crazy stuff.
> 
> Thanks.
> 


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-11  4:04               ` Tejun Heo
  (?)
  (?)
@ 2015-09-11  4:43               ` Parav Pandit
  2015-09-11 15:03                 ` Tejun Heo
  -1 siblings, 1 reply; 95+ messages in thread
From: Parav Pandit @ 2015-09-11  4:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On Fri, Sep 11, 2015 at 9:34 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Parav.
>
> On Fri, Sep 11, 2015 at 09:09:58AM +0530, Parav Pandit wrote:
>> The fact is that user level application uses hardware resources.
>> Verbs layer is software abstraction for it. Drivers are hiding how
>> they implement this QP or CQ or whatever hardware resource they
>> project via API layer.
>> For all of the userland on top of verb layer I mentioned above, the
>> common resource abstraction is these resources AH, QP, CQ, MR etc.
>> Hardware (and driver) might have different view of this resource in
>> their real implementation.
>> For example, verb layer can say that it has 100 QPs, but hardware
>> might actually have 20 QPs that driver decide how to efficiently use
>> it.
>
> My uneducated suspicion is that the abstraction is just not developed
> enough.  It should be possible to virtualize these resources through,
> most likely, time-sharing to the level where userland simply says "I
> want this chunk transferred there" and OS schedules the transfer
> prioritizing competing requests.

Tejun,
That is such a perfect abstraction to have at OS level, but not sure
how much close it can be to bare metal RDMA it can be.
I have started discussion on that front as well as part of other
thread, but its certainly long way to go.
Most want to enjoy the performance benefit of the bare metal
interfaces it provides.

Such abstraction that you mentioned, exists, the only difference is
instead of its OS as central entity, its the higher level libraries,
drivers and hw together does it today for the applications.


>
> It could be that given the use cases rdma might not need such level of
> abstraction - e.g. most users want to be and are pretty close to bare
> metal, but, if that's true, it also kinda is weird to build
> hierarchical resource distribution scheme on top of such bare
> abstraction.
>
> ...
>> > I don't know.  What's proposed in this thread seems way too low level
>> > to be useful anywhere else.  Also, what if there are multiple devices?
>> > Is that a problem to worry about?
>>
>> o.k. It doesn't have to be useful anywhere else. If it suffice the
>> need of RDMA applications, its fine for near future.
>> This patch allows limiting resources across multiple devices.
>> As we go along the path, and if requirement come up to have knob on
>> per device basis, thats something we can extend in future.
>
> You kinda have to decide that upfront cuz it gets baked into the
> interface.

Well, all the interfaces are not yet defined. Except the test and
benchmark utilities, real world applications wouldn't really bother
much about which device are they are going through.
so I expect that per device level control would nice for very specific
applications, but I don't anticipate that in first place.
If others have different view, I would be happy to hear that.

Even if we extend per device control, I would expect per cgroup
control at top level without which its uncontrolled access.

>
>> > I'm kinda doubtful we're gonna have too many of these.  Hardware
>> > details being exposed to userland this directly isn't common.
>>
>> Its common in RDMA applications. Again they may not be real hardware
>> resource, its just API layer which defines those RDMA constructs.
>
> It's still a very low level of abstraction which pretty much gets
> decided by what the hardware and driver decide to do.
>
>> > I'd say keep it simple and do the minimum. :)
>>
>> o.k. In that case new rdma cgroup controller which does rdma resource
>> accounting is possibly the most simplest form?
>> Make sense?
>
> So, this fits cgroup's purpose to certain level but it feels like
> we're trying to build too much on top of something which hasn't
> developed sufficiently.  I suppose it could be that this is the level
> of development that rdma is gonna reach and dumb cgroup controller can
> be useful for some use cases.  I don't know, so, yeah, let's keep it
> simple and avoid doing crazy stuff.
>

o.k. thanks. I would wait for some more time to collect more feedback.
In absence of that,

I will send updated patch V1 which will include,
(a) functionality of this patch in new rdma cgroup as you recommended,
(b) fixes for comments from Haggai for this patch
(c) more fixes which I have done in mean time

> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-11 14:52                   ` Tejun Heo
  0 siblings, 0 replies; 95+ messages in thread
From: Tejun Heo @ 2015-09-11 14:52 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Parav Pandit, cgroups, linux-doc, linux-kernel, linux-rdma,
	lizefan, Johannes Weiner, Jonathan Corbet, james.l.morris, serge,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

Hello, Doug.

On Fri, Sep 11, 2015 at 12:24:33AM -0400, Doug Ledford wrote:
> > My uneducated suspicion is that the abstraction is just not developed
> > enough.
> 
> The abstraction is 10+ years old.  It has had plenty of time to ferment
> and something better for the specific use case has not emerged.

I think that is likely more reflective of the use cases rather than
anything inherent in the concept.

> >  It should be possible to virtualize these resources through,
> > most likely, time-sharing to the level where userland simply says "I
> > want this chunk transferred there" and OS schedules the transfer
> > prioritizing competing requests.
> 
> No.  And if you think this, then you miss the *entire* point of RDMA
> technologies.  An analogy that I have used many times in presentations
> is that, in the networking world, the kernel is both a postman and a
> copy machine.  It receives all incoming packets and must sort them to
> the right recipient (the postman job) and when the user space
> application is ready to use the information it must copy it into the
> user's VM space because it couldn't just put the user's data buffer on
> the RX buffer list since each buffer might belong to anyone (the copy
> machine).  In the RDMA world, you create a new queue pair, it is often a
> long lived connection (like a socket), but it belongs now to the app and
> the app can directly queue both send and receive buffers to the card and
> on incoming packets the card will be able to know that the packet
> belongs to a specific queue pair and will immediately go to that apps
> buffer.  You can *not* do this with TCP without moving to complete TCP
> offload on the card, registration of specific sockets on the card, and
> then allowing the application to pre-register receive buffers for a
> specific socket to the card so that incoming data on the wire can go
> straight to the right place.  If you ever get to the point of "OS
> schedules the transfer" then you might as well throw RDMA out the window
> because you have totally trashed the benefit it provides.

I don't know.  This sounds like classic "this is painful so it must be
good" bare metal fantasy.  I get that rdma succeeds at bypassing a lot
of overhead.  That's great but that really isn't exclusive with having
more accessible mechanisms built on top.  The crux of cost saving is
the hardware knowing where the incoming data belongs and putting it
there directly.  Everything else is there to facilitate that and if
you're declaring that it's impossible to build accessible abstractions
for that, I can't agree with you.

Note that this is not to say that rdma should do that in the operating
system.  As you said, people have been happy with the bare abstraction
for a long time and, given relatively specialized use cases, that can
be completely fine but please do note that the lack of proper
abstraction isn't an inherent feature.  It's just easier that way and
putting in more effort hasn't been necessary.

> > It could be that given the use cases rdma might not need such level of
> > abstraction - e.g. most users want to be and are pretty close to bare
> > metal, but, if that's true, it also kinda is weird to build
> > hierarchical resource distribution scheme on top of such bare
> > abstraction.
> 
> Not really.  If you are going to have a bare abstraction, this one isn't
> really a bad one.  You have devices.  On a device, you allocate
> protection domains (PDs).  If you don't care about cross connection
> issues, you ignore this and only use one.  If you do care, this acts
> like a process's unique VM space only for RDMA buffers, it is a domain
> to protect the data of one connection from another.  Then you have queue
> pairs (QPs) which are roughly the equivalent of a socket.  Each QP has
> at least one Completion Queue where you get the events that tell you
> things have completed (although they often use two, one for send
> completions and one for receive completions).  And then you use some
> number of memory registrations (MRs) and address handles (AHs) depending
> on your usage.  Since RDMA stands for Remote Direct Memory Access, as
> you can imagine, giving a remote machine free reign to access all of the
> physical memory in your machine is a security issue.  The MRs help to
> control what memory the remote host on a specific QP has access to.  The
> AHs control how we actually route packets from ourselves to the remote host.
> 
> Here's the deal.  You might be able to create an abstraction above this
> that hides *some* of this.  But it can't hide even nearly all of it
> without loosing significant functionality.  The problem here is that you
> are thinking about RDMA connections like sockets.  They aren't.  Not
> even close.  They are "how do I allow a remote machine to directly read
> and write into my machines physical memory in an even remotely close to
> secure manner?"  These resources aren't hardware resources, they are the
> abstraction resources needed to answer that question.

So, the existence of resource limitations is fine.  That's what we
deal with all the time.  The problem usually with this sort of
interfaces which expose implementation details to users directly is
that it severely limits engineering manuevering space.  You usually
want your users to express their intentions and a mechanism to
arbitrate resources to satisfy those intentions (and in a way more
graceful than "we can't, maybe try later?"); otherwise, implementing
any sort of high level resource distribution scheme becomes painful
and usually the only thing possible is preventing runaway disasters -
you don't wanna pin unused resource permanently if there actually is
contention around it, so usually all you can do with hard limits is
overcommiting limits so that it at least prevents disasters.

cpuset is a special case but think of cpu, memory or io controllers.
Their resource distribution schemes are a lot more developed than
what's proposed in this patchset and that's a necessity because nobody
wants to cripple their machines for resource control.  This is a lot
more like the pids controller and that controller's almost sole
purpose is preventing runaway workload wrecking the whole machine.

It's getting rambly but the point is that if the resource being
controlled by this controller is actually contended for performance
reasons, this sort of hard limiting is inherently unlikely to be very
useful.  If the resource isn't and the main goal is preventing runaway
hogs, it'll be able to do that but is that the goal here?  For this to
be actually useful for performance contended cases, it'd need higher
level abstractions.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-11 14:52                   ` Tejun Heo
  0 siblings, 0 replies; 95+ messages in thread
From: Tejun Heo @ 2015-09-11 14:52 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Parav Pandit, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Haggai Eran, Or Gerlitz,
	Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

Hello, Doug.

On Fri, Sep 11, 2015 at 12:24:33AM -0400, Doug Ledford wrote:
> > My uneducated suspicion is that the abstraction is just not developed
> > enough.
> 
> The abstraction is 10+ years old.  It has had plenty of time to ferment
> and something better for the specific use case has not emerged.

I think that is likely more reflective of the use cases rather than
anything inherent in the concept.

> >  It should be possible to virtualize these resources through,
> > most likely, time-sharing to the level where userland simply says "I
> > want this chunk transferred there" and OS schedules the transfer
> > prioritizing competing requests.
> 
> No.  And if you think this, then you miss the *entire* point of RDMA
> technologies.  An analogy that I have used many times in presentations
> is that, in the networking world, the kernel is both a postman and a
> copy machine.  It receives all incoming packets and must sort them to
> the right recipient (the postman job) and when the user space
> application is ready to use the information it must copy it into the
> user's VM space because it couldn't just put the user's data buffer on
> the RX buffer list since each buffer might belong to anyone (the copy
> machine).  In the RDMA world, you create a new queue pair, it is often a
> long lived connection (like a socket), but it belongs now to the app and
> the app can directly queue both send and receive buffers to the card and
> on incoming packets the card will be able to know that the packet
> belongs to a specific queue pair and will immediately go to that apps
> buffer.  You can *not* do this with TCP without moving to complete TCP
> offload on the card, registration of specific sockets on the card, and
> then allowing the application to pre-register receive buffers for a
> specific socket to the card so that incoming data on the wire can go
> straight to the right place.  If you ever get to the point of "OS
> schedules the transfer" then you might as well throw RDMA out the window
> because you have totally trashed the benefit it provides.

I don't know.  This sounds like classic "this is painful so it must be
good" bare metal fantasy.  I get that rdma succeeds at bypassing a lot
of overhead.  That's great but that really isn't exclusive with having
more accessible mechanisms built on top.  The crux of cost saving is
the hardware knowing where the incoming data belongs and putting it
there directly.  Everything else is there to facilitate that and if
you're declaring that it's impossible to build accessible abstractions
for that, I can't agree with you.

Note that this is not to say that rdma should do that in the operating
system.  As you said, people have been happy with the bare abstraction
for a long time and, given relatively specialized use cases, that can
be completely fine but please do note that the lack of proper
abstraction isn't an inherent feature.  It's just easier that way and
putting in more effort hasn't been necessary.

> > It could be that given the use cases rdma might not need such level of
> > abstraction - e.g. most users want to be and are pretty close to bare
> > metal, but, if that's true, it also kinda is weird to build
> > hierarchical resource distribution scheme on top of such bare
> > abstraction.
> 
> Not really.  If you are going to have a bare abstraction, this one isn't
> really a bad one.  You have devices.  On a device, you allocate
> protection domains (PDs).  If you don't care about cross connection
> issues, you ignore this and only use one.  If you do care, this acts
> like a process's unique VM space only for RDMA buffers, it is a domain
> to protect the data of one connection from another.  Then you have queue
> pairs (QPs) which are roughly the equivalent of a socket.  Each QP has
> at least one Completion Queue where you get the events that tell you
> things have completed (although they often use two, one for send
> completions and one for receive completions).  And then you use some
> number of memory registrations (MRs) and address handles (AHs) depending
> on your usage.  Since RDMA stands for Remote Direct Memory Access, as
> you can imagine, giving a remote machine free reign to access all of the
> physical memory in your machine is a security issue.  The MRs help to
> control what memory the remote host on a specific QP has access to.  The
> AHs control how we actually route packets from ourselves to the remote host.
> 
> Here's the deal.  You might be able to create an abstraction above this
> that hides *some* of this.  But it can't hide even nearly all of it
> without loosing significant functionality.  The problem here is that you
> are thinking about RDMA connections like sockets.  They aren't.  Not
> even close.  They are "how do I allow a remote machine to directly read
> and write into my machines physical memory in an even remotely close to
> secure manner?"  These resources aren't hardware resources, they are the
> abstraction resources needed to answer that question.

So, the existence of resource limitations is fine.  That's what we
deal with all the time.  The problem usually with this sort of
interfaces which expose implementation details to users directly is
that it severely limits engineering manuevering space.  You usually
want your users to express their intentions and a mechanism to
arbitrate resources to satisfy those intentions (and in a way more
graceful than "we can't, maybe try later?"); otherwise, implementing
any sort of high level resource distribution scheme becomes painful
and usually the only thing possible is preventing runaway disasters -
you don't wanna pin unused resource permanently if there actually is
contention around it, so usually all you can do with hard limits is
overcommiting limits so that it at least prevents disasters.

cpuset is a special case but think of cpu, memory or io controllers.
Their resource distribution schemes are a lot more developed than
what's proposed in this patchset and that's a necessity because nobody
wants to cripple their machines for resource control.  This is a lot
more like the pids controller and that controller's almost sole
purpose is preventing runaway workload wrecking the whole machine.

It's getting rambly but the point is that if the resource being
controlled by this controller is actually contended for performance
reasons, this sort of hard limiting is inherently unlikely to be very
useful.  If the resource isn't and the main goal is preventing runaway
hogs, it'll be able to do that but is that the goal here?  For this to
be actually useful for performance contended cases, it'd need higher
level abstractions.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-11  4:43               ` Parav Pandit
@ 2015-09-11 15:03                 ` Tejun Heo
  0 siblings, 0 replies; 95+ messages in thread
From: Tejun Heo @ 2015-09-11 15:03 UTC (permalink / raw)
  To: Parav Pandit
  Cc: cgroups, linux-doc, linux-kernel, linux-rdma, lizefan,
	Johannes Weiner, Doug Ledford, Jonathan Corbet, james.l.morris,
	serge, Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

Hello, Parav.

On Fri, Sep 11, 2015 at 10:13:59AM +0530, Parav Pandit wrote:
> > My uneducated suspicion is that the abstraction is just not developed
> > enough.  It should be possible to virtualize these resources through,
> > most likely, time-sharing to the level where userland simply says "I
> > want this chunk transferred there" and OS schedules the transfer
> > prioritizing competing requests.
> 
> Tejun,
> That is such a perfect abstraction to have at OS level, but not sure
> how much close it can be to bare metal RDMA it can be.
> I have started discussion on that front as well as part of other
> thread, but its certainly long way to go.
> Most want to enjoy the performance benefit of the bare metal
> interfaces it provides.

Yeah, sure, I'm not trying to say that rdma needs or should do that.

> Such abstraction that you mentioned, exists, the only difference is
> instead of its OS as central entity, its the higher level libraries,
> drivers and hw together does it today for the applications.

But more that having resource control in the OS and actual arbitration
higher up in the stack isn't likely to lead to an effective resource
distribution scheme.

> > You kinda have to decide that upfront cuz it gets baked into the
> > interface.
> 
> Well, all the interfaces are not yet defined. Except the test and

I meant the cgroup interface.

> benchmark utilities, real world applications wouldn't really bother
> much about which device are they are going through.

Weights can work fine across multiple devices.  Hard limits don't.  It
just doesn't make any sense.  Unless you can exclude multiple device
scenarios, you'll have to implement per-device limits.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-11 14:52                   ` Tejun Heo
  (?)
@ 2015-09-11 16:26                   ` Parav Pandit
  2015-09-11 16:34                       ` Tejun Heo
  -1 siblings, 1 reply; 95+ messages in thread
From: Parav Pandit @ 2015-09-11 16:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Doug Ledford, cgroups, linux-doc, linux-kernel, linux-rdma,
	lizefan, Johannes Weiner, Jonathan Corbet, james.l.morris, serge,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

> If the resource isn't and the main goal is preventing runaway
> hogs, it'll be able to do that but is that the goal here?  For this to
> be actually useful for performance contended cases, it'd need higher
> level abstractions.
>

Resource run away by application can lead to (a) kernel and (b) other
applications left out with no resources situation.
Both the problems are the target of this patch set by accounting via cgroup.

Performance contention can be resolved with higher level user space,
which will tune it.
Threshold and fail counters are on the way in follow on patch.

> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-11 16:34                       ` Tejun Heo
  0 siblings, 0 replies; 95+ messages in thread
From: Tejun Heo @ 2015-09-11 16:34 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Doug Ledford, cgroups, linux-doc, linux-kernel, linux-rdma,
	lizefan, Johannes Weiner, Jonathan Corbet, james.l.morris, serge,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

Hello, Parav.

On Fri, Sep 11, 2015 at 09:56:31PM +0530, Parav Pandit wrote:
> Resource run away by application can lead to (a) kernel and (b) other
> applications left out with no resources situation.

Yeap, that this controller would be able to prevent to a reasonable
extent.

> Both the problems are the target of this patch set by accounting via cgroup.
> 
> Performance contention can be resolved with higher level user space,
> which will tune it.

If individual applications are gonna be allowed to do that, what's to
prevent them from jacking up their limits?  So, I assume you're
thinking of a central authority overseeing distribution and enforcing
the policy through cgroups?

> Threshold and fail counters are on the way in follow on patch.

If you're planning on following what the existing memcg did in this
area, it's unlikely to go well.  Would you mind sharing what you have
on mind in the long term?  Where do you see this going?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-11 16:34                       ` Tejun Heo
  0 siblings, 0 replies; 95+ messages in thread
From: Tejun Heo @ 2015-09-11 16:34 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Doug Ledford, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Haggai Eran, Or Gerlitz,
	Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

Hello, Parav.

On Fri, Sep 11, 2015 at 09:56:31PM +0530, Parav Pandit wrote:
> Resource run away by application can lead to (a) kernel and (b) other
> applications left out with no resources situation.

Yeap, that this controller would be able to prevent to a reasonable
extent.

> Both the problems are the target of this patch set by accounting via cgroup.
> 
> Performance contention can be resolved with higher level user space,
> which will tune it.

If individual applications are gonna be allowed to do that, what's to
prevent them from jacking up their limits?  So, I assume you're
thinking of a central authority overseeing distribution and enforcing
the policy through cgroups?

> Threshold and fail counters are on the way in follow on patch.

If you're planning on following what the existing memcg did in this
area, it's unlikely to go well.  Would you mind sharing what you have
on mind in the long term?  Where do you see this going?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-11 16:39                         ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-11 16:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Doug Ledford, cgroups, linux-doc, linux-kernel, linux-rdma,
	lizefan, Johannes Weiner, Jonathan Corbet, james.l.morris, serge,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On Fri, Sep 11, 2015 at 10:04 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Parav.
>
> On Fri, Sep 11, 2015 at 09:56:31PM +0530, Parav Pandit wrote:
>> Resource run away by application can lead to (a) kernel and (b) other
>> applications left out with no resources situation.
>
> Yeap, that this controller would be able to prevent to a reasonable
> extent.
>
>> Both the problems are the target of this patch set by accounting via cgroup.
>>
>> Performance contention can be resolved with higher level user space,
>> which will tune it.
>
> If individual applications are gonna be allowed to do that, what's to
> prevent them from jacking up their limits?
I should have been more explicit. I didnt mean the application to
control which is allocating it.
> So, I assume you're
> thinking of a central authority overseeing distribution and enforcing
> the policy through cgroups?
>
Exactly.



>> Threshold and fail counters are on the way in follow on patch.
>
> If you're planning on following what the existing memcg did in this
> area, it's unlikely to go well.  Would you mind sharing what you have
> on mind in the long term?  Where do you see this going?
>
At least current thoughts are: central entity authority monitors fail
count and new threashold count.
Fail count - as similar to other indicates how many time resource
failure occured
threshold count - indicates upto what this resource has gone upto in
usage. (application might not be able to poll on thousands of such
resources entries).
So based on fail count and threshold count, it can tune it further.




> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-11 16:39                         ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-11 16:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Doug Ledford, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Haggai Eran, Or Gerlitz,
	Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

On Fri, Sep 11, 2015 at 10:04 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello, Parav.
>
> On Fri, Sep 11, 2015 at 09:56:31PM +0530, Parav Pandit wrote:
>> Resource run away by application can lead to (a) kernel and (b) other
>> applications left out with no resources situation.
>
> Yeap, that this controller would be able to prevent to a reasonable
> extent.
>
>> Both the problems are the target of this patch set by accounting via cgroup.
>>
>> Performance contention can be resolved with higher level user space,
>> which will tune it.
>
> If individual applications are gonna be allowed to do that, what's to
> prevent them from jacking up their limits?
I should have been more explicit. I didnt mean the application to
control which is allocating it.
> So, I assume you're
> thinking of a central authority overseeing distribution and enforcing
> the policy through cgroups?
>
Exactly.



>> Threshold and fail counters are on the way in follow on patch.
>
> If you're planning on following what the existing memcg did in this
> area, it's unlikely to go well.  Would you mind sharing what you have
> on mind in the long term?  Where do you see this going?
>
At least current thoughts are: central entity authority monitors fail
count and new threashold count.
Fail count - as similar to other indicates how many time resource
failure occured
threshold count - indicates upto what this resource has gone upto in
usage. (application might not be able to poll on thousands of such
resources entries).
So based on fail count and threshold count, it can tune it further.




> Thanks.
>
> --
> tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-11 16:47                     ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-11 16:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Doug Ledford, cgroups, linux-doc, linux-kernel, linux-rdma,
	lizefan, Johannes Weiner, Jonathan Corbet, james.l.morris, serge,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

> cpuset is a special case but think of cpu, memory or io controllers.
> Their resource distribution schemes are a lot more developed than
> what's proposed in this patchset and that's a necessity because nobody
> wants to cripple their machines for resource control.

IO controller and applications are mature in nature.
When IO controller throttles the IO, applications are pretty mature
where if IO takes longer to complete, there is possibly almost no way
to cancel the system call or rather application might not want to
cancel the IO at least the non asynchronous one.
So application just notice lower performance than throttled way.
Its really not possible at RDMA level with RDMA resource to hold up
resource creation call for longer time, because reusing existing
resource with failed status can likely to give better performance.
As Doug explained in his example, many RDMA resources as its been used
by applications are relatively long lived. So holding ups resource
creation while its taken by other process will certainly will look bad
on application performance front compare to returning failure and
reusing existing one once its available or once new one is available.

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-11 16:47                     ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-11 16:47 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Doug Ledford, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Haggai Eran, Or Gerlitz,
	Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

> cpuset is a special case but think of cpu, memory or io controllers.
> Their resource distribution schemes are a lot more developed than
> what's proposed in this patchset and that's a necessity because nobody
> wants to cripple their machines for resource control.

IO controller and applications are mature in nature.
When IO controller throttles the IO, applications are pretty mature
where if IO takes longer to complete, there is possibly almost no way
to cancel the system call or rather application might not want to
cancel the IO at least the non asynchronous one.
So application just notice lower performance than throttled way.
Its really not possible at RDMA level with RDMA resource to hold up
resource creation call for longer time, because reusing existing
resource with failed status can likely to give better performance.
As Doug explained in his example, many RDMA resources as its been used
by applications are relatively long lived. So holding ups resource
creation while its taken by other process will certainly will look bad
on application performance front compare to returning failure and
reusing existing one once its available or once new one is available.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-11 19:05                       ` Tejun Heo
  0 siblings, 0 replies; 95+ messages in thread
From: Tejun Heo @ 2015-09-11 19:05 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Doug Ledford, cgroups, linux-doc, linux-kernel, linux-rdma,
	lizefan, Johannes Weiner, Jonathan Corbet, james.l.morris, serge,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

Hello, Parav.

On Fri, Sep 11, 2015 at 10:17:42PM +0530, Parav Pandit wrote:
> IO controller and applications are mature in nature.
> When IO controller throttles the IO, applications are pretty mature
> where if IO takes longer to complete, there is possibly almost no way
> to cancel the system call or rather application might not want to
> cancel the IO at least the non asynchronous one.

I was more talking about the fact that they allow resources to be
consumed when they aren't contended.

> So application just notice lower performance than throttled way.
> Its really not possible at RDMA level with RDMA resource to hold up
> resource creation call for longer time, because reusing existing
> resource with failed status can likely to give better performance.
> As Doug explained in his example, many RDMA resources as its been used
> by applications are relatively long lived. So holding ups resource
> creation while its taken by other process will certainly will look bad
> on application performance front compare to returning failure and
> reusing existing one once its available or once new one is available.

I'm not really sold on the idea that this can be used to implement
performance based resource distribution.  I'll write more about that
on the other subthread.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-11 19:05                       ` Tejun Heo
  0 siblings, 0 replies; 95+ messages in thread
From: Tejun Heo @ 2015-09-11 19:05 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Doug Ledford, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Haggai Eran, Or Gerlitz,
	Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

Hello, Parav.

On Fri, Sep 11, 2015 at 10:17:42PM +0530, Parav Pandit wrote:
> IO controller and applications are mature in nature.
> When IO controller throttles the IO, applications are pretty mature
> where if IO takes longer to complete, there is possibly almost no way
> to cancel the system call or rather application might not want to
> cancel the IO at least the non asynchronous one.

I was more talking about the fact that they allow resources to be
consumed when they aren't contended.

> So application just notice lower performance than throttled way.
> Its really not possible at RDMA level with RDMA resource to hold up
> resource creation call for longer time, because reusing existing
> resource with failed status can likely to give better performance.
> As Doug explained in his example, many RDMA resources as its been used
> by applications are relatively long lived. So holding ups resource
> creation while its taken by other process will certainly will look bad
> on application performance front compare to returning failure and
> reusing existing one once its available or once new one is available.

I'm not really sold on the idea that this can be used to implement
performance based resource distribution.  I'll write more about that
on the other subthread.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* RE: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-11 14:52                   ` Tejun Heo
                                     ` (2 preceding siblings ...)
  (?)
@ 2015-09-11 19:22                   ` Hefty, Sean
  2015-09-11 19:43                       ` Jason Gunthorpe
  2015-09-14 10:15                     ` Parav Pandit
  -1 siblings, 2 replies; 95+ messages in thread
From: Hefty, Sean @ 2015-09-11 19:22 UTC (permalink / raw)
  To: Tejun Heo, Doug Ledford
  Cc: Parav Pandit, cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
	lizefan@huawei.com, Johannes Weiner, Jonathan Corbet,
	james.l.morris@oracle.com, serge@hallyn.com, Haggai Eran,
	Or Gerlitz, Matan Barak, raindel@mellanox.com,
	akpm@linux-foundation.org, linux-security-module@vger.kernel.org

> So, the existence of resource limitations is fine.  That's what we
> deal with all the time.  The problem usually with this sort of
> interfaces which expose implementation details to users directly is
> that it severely limits engineering manuevering space.  You usually
> want your users to express their intentions and a mechanism to
> arbitrate resources to satisfy those intentions (and in a way more
> graceful than "we can't, maybe try later?"); otherwise, implementing
> any sort of high level resource distribution scheme becomes painful
> and usually the only thing possible is preventing runaway disasters -
> you don't wanna pin unused resource permanently if there actually is
> contention around it, so usually all you can do with hard limits is
> overcommiting limits so that it at least prevents disasters.

I agree with Tejun that this proposal is at the wrong level of abstraction.

If you look at just trying to limit QPs, it's not clear what that attempts to accomplish.  Conceptually, a QP is little more than an addressable endpoint.  It may or may not map to HW resources (for Intel NICs it does not).  Even when HW resources do back the QP, the hardware is limited by how many QPs can realistically be active at any one time, based on how much caching is available in the NIC.

Trying to limit the number of QPs that an app can allocate, therefore, just limits how much of the address space an app can use.  There's no clear link between QP limits and HW resource limits, unless you assume a very specific underlying implementation.

- Sean

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-11 16:39                         ` Parav Pandit
  (?)
@ 2015-09-11 19:25                         ` Tejun Heo
  2015-09-14 10:18                             ` Parav Pandit
  -1 siblings, 1 reply; 95+ messages in thread
From: Tejun Heo @ 2015-09-11 19:25 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Doug Ledford, cgroups, linux-doc, linux-kernel, linux-rdma,
	lizefan, Johannes Weiner, Jonathan Corbet, james.l.morris, serge,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

Hello, Parav.

On Fri, Sep 11, 2015 at 10:09:48PM +0530, Parav Pandit wrote:
> > If you're planning on following what the existing memcg did in this
> > area, it's unlikely to go well.  Would you mind sharing what you have
> > on mind in the long term?  Where do you see this going?
>
> At least current thoughts are: central entity authority monitors fail
> count and new threashold count.
> Fail count - as similar to other indicates how many time resource
> failure occured
> threshold count - indicates upto what this resource has gone upto in
> usage. (application might not be able to poll on thousands of such
> resources entries).
> So based on fail count and threshold count, it can tune it further.

So, regardless of the specific resource in question, implementing
adaptive resource distribution requires more than simple thresholds
and failcnts.  The very minimum would be a way to exert reclaim
pressure and then a way to measure how much lack of a given resource
is affecting the workload.  Maybe it can adaptively lower the limits
and then watch how often allocation fails but that's highly unlikely
to be an effective measure as it can't do anything to hoarders and the
frequency of allocation failure doesn't necessarily correlate with the
amount of impact the workload is getting (it's not a measure of
usage).

This is what I'm awry about.  The kernel-userland interface here is
cut pretty low in the stack leaving most of arbitration and management
logic in the userland, which seems to be what people wanted and that's
fine, but then you're trying to implement an intelligent resource
control layer which straddles across kernel and userland with those
low level primitives which inevitably would increase the required
interface surface as nobody has enough information.

Just to illustrate the point, please think of the alsa interface.  We
expose hardware capabilities pretty much as-is leaving management and
multiplexing to userland and there's nothing wrong with it.  It fits
better that way; however, we don't then go try to implement cgroup
controller for PCM channels.  To do any high-level resource
management, you gotta do it where the said resource is actually
managed and arbitrated.

What's the allocation frequency you're expecting?  It might be better
to just let allocations themselves go through the agent that you're
planning.  You sure can use cgroup membership to identify who's asking
tho.  Given how the whole thing is architectured, I'd suggest thinking
more about how the whole thing should turn out eventually.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-11 19:43                       ` Jason Gunthorpe
  0 siblings, 0 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2015-09-11 19:43 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Tejun Heo, Doug Ledford, Parav Pandit, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, lizefan@huawei.com, Johannes Weiner,
	Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel@mellanox.com,
	akpm@linux-foundation.org, linux-security-module@vger.kernel.org

On Fri, Sep 11, 2015 at 07:22:56PM +0000, Hefty, Sean wrote:
 
> Trying to limit the number of QPs that an app can allocate,
> therefore, just limits how much of the address space an app can use.
> There's no clear link between QP limits and HW resource limits,
> unless you assume a very specific underlying implementation.

Isn't that the point though? We have several vendors with hardware
that does impose hard limits on specific resources. There is no way to
avoid that, and ultimately, those exact HW resources need to be
limited.

If we want to talk about abstraction, then I'd suggest something very
general and simple - two limits:
 '% of the RDMA hardware resource pool' (per device or per ep?)
 'bytes of kernel memory for RDMA structures' (all devices)

That comfortably covers all the various kinds of hardware we support
in a reasonable fashion.

Unless there really is a reason why we need to constrain exactly
and precisely PD/QP/MR/AH (I can't think of one off hand)

The 'RDMA hardware resource pool' is a vendor-driver-device specific
thing, with no generic definition beyond something that doesn't fit in
the other limit.

Jason

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-11 19:43                       ` Jason Gunthorpe
  0 siblings, 0 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2015-09-11 19:43 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Tejun Heo, Doug Ledford, Parav Pandit,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org, Johannes Weiner,
	Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org, Haggai Eran,
	Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Fri, Sep 11, 2015 at 07:22:56PM +0000, Hefty, Sean wrote:
 
> Trying to limit the number of QPs that an app can allocate,
> therefore, just limits how much of the address space an app can use.
> There's no clear link between QP limits and HW resource limits,
> unless you assume a very specific underlying implementation.

Isn't that the point though? We have several vendors with hardware
that does impose hard limits on specific resources. There is no way to
avoid that, and ultimately, those exact HW resources need to be
limited.

If we want to talk about abstraction, then I'd suggest something very
general and simple - two limits:
 '% of the RDMA hardware resource pool' (per device or per ep?)
 'bytes of kernel memory for RDMA structures' (all devices)

That comfortably covers all the various kinds of hardware we support
in a reasonable fashion.

Unless there really is a reason why we need to constrain exactly
and precisely PD/QP/MR/AH (I can't think of one off hand)

The 'RDMA hardware resource pool' is a vendor-driver-device specific
thing, with no generic definition beyond something that doesn't fit in
the other limit.

Jason

^ permalink raw reply	[flat|nested] 95+ messages in thread

* RE: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-11 19:43                       ` Jason Gunthorpe
  (?)
@ 2015-09-11 20:06                       ` Hefty, Sean
  2015-09-14 11:09                         ` Parav Pandit
  -1 siblings, 1 reply; 95+ messages in thread
From: Hefty, Sean @ 2015-09-11 20:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tejun Heo, Doug Ledford, Parav Pandit, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, lizefan@huawei.com, Johannes Weiner,
	Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel@mellanox.com,
	akpm@linux-foundation.org, linux-security-module@vger.kernel.org

> > Trying to limit the number of QPs that an app can allocate,
> > therefore, just limits how much of the address space an app can use.
> > There's no clear link between QP limits and HW resource limits,
> > unless you assume a very specific underlying implementation.
> 
> Isn't that the point though? We have several vendors with hardware
> that does impose hard limits on specific resources. There is no way to
> avoid that, and ultimately, those exact HW resources need to be
> limited.

My point is that limiting the number of QPs that an app can allocate doesn't necessarily mean anything.  Is allocating 1000 QPs with 1 entry each better or worse than 1 QP with 10,000 entries?  Who knows?

> If we want to talk about abstraction, then I'd suggest something very
> general and simple - two limits:
>  '% of the RDMA hardware resource pool' (per device or per ep?)
>  'bytes of kernel memory for RDMA structures' (all devices)

Yes - this makes more sense to me.


^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-11 19:22                   ` Hefty, Sean
  2015-09-11 19:43                       ` Jason Gunthorpe
@ 2015-09-14 10:15                     ` Parav Pandit
  1 sibling, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-14 10:15 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Tejun Heo, Doug Ledford, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, lizefan@huawei.com, Johannes Weiner,
	Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel@mellanox.com,
	akpm@linux-foundation.org, linux-security-module@vger.kernel.org

On Sat, Sep 12, 2015 at 12:52 AM, Hefty, Sean <sean.hefty@intel.com> wrote:
>> So, the existence of resource limitations is fine.  That's what we
>> deal with all the time.  The problem usually with this sort of
>> interfaces which expose implementation details to users directly is
>> that it severely limits engineering manuevering space.  You usually
>> want your users to express their intentions and a mechanism to
>> arbitrate resources to satisfy those intentions (and in a way more
>> graceful than "we can't, maybe try later?"); otherwise, implementing
>> any sort of high level resource distribution scheme becomes painful
>> and usually the only thing possible is preventing runaway disasters -
>> you don't wanna pin unused resource permanently if there actually is
>> contention around it, so usually all you can do with hard limits is
>> overcommiting limits so that it at least prevents disasters.
>
> I agree with Tejun that this proposal is at the wrong level of abstraction.
>
> If you look at just trying to limit QPs, it's not clear what that attempts to accomplish.  Conceptually, a QP is little more than an addressable endpoint.  It may or may not map to HW resources (for Intel NICs it does not).  Even when HW resources do back the QP, the hardware is limited by how many QPs can realistically be active at any one time, based on how much caching is available in the NIC.
>

cgroups as it stands today provides resource controls in effective
manner of existing defined resource, such as cpu cycles, memory in
user and kernel space, tcp bytes, IOPS etc.
Similarly RDMA programming model defines its own set of resources
which is used by applications which accesses those resources directly.

What we are debating here is that, RDMA exposing hardware resources is
not correct, and therefore whether a cgroup controller is needed or
not.
There are two points here.
1. Whether RDMA programming model is correct or not which works on
defined resources of IB spec.
2. Assuming that programming model is fine, (because we have actively
maintained IB stack in kernel and adoption of user space components in
OS),
whether we need to control those resources or not via cgroup.

Tejun trying to say that because point_1 is doesn't seem to be right
way to solve problem, point_2 should not be done or done at different
level of abstraction.
More questions/comments in Jason and Sean thread.

Sean,
Even though there is no one to one map of verb-QP to hw-QP, in order
for driver or lower layer to effectively map the right verb-QP to
hw-QP, such vendor specific layer needs to know how is it going to be
used. Otherwise two contending applications for a QP may not get the
right number of hw-QPs to use.

> Trying to limit the number of QPs that an app can allocate, therefore, just limits how much of the address space an app can use.  There's no clear link between QP limits and HW resource limits, unless you assume a very specific underlying implementation.
>
> - Sean

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-14 10:18                             ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-14 10:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Doug Ledford, cgroups, linux-doc, linux-kernel, linux-rdma,
	lizefan, Johannes Weiner, Jonathan Corbet, james.l.morris, serge,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel, akpm,
	linux-security-module

On Sat, Sep 12, 2015 at 12:55 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Parav.
>
> On Fri, Sep 11, 2015 at 10:09:48PM +0530, Parav Pandit wrote:
>> > If you're planning on following what the existing memcg did in this
>> > area, it's unlikely to go well.  Would you mind sharing what you have
>> > on mind in the long term?  Where do you see this going?
>>
>> At least current thoughts are: central entity authority monitors fail
>> count and new threashold count.
>> Fail count - as similar to other indicates how many time resource
>> failure occured
>> threshold count - indicates upto what this resource has gone upto in
>> usage. (application might not be able to poll on thousands of such
>> resources entries).
>> So based on fail count and threshold count, it can tune it further.
>
> So, regardless of the specific resource in question, implementing
> adaptive resource distribution requires more than simple thresholds
> and failcnts.

May be yes. Buts in difficult to go through the whole design to shape
up right now.
This is the infrastructure getting build with few capabilities.
I see this as starting point instead of end point.

> The very minimum would be a way to exert reclaim
> pressure and then a way to measure how much lack of a given resource
> is affecting the workload.  Maybe it can adaptively lower the limits
> and then watch how often allocation fails but that's highly unlikely
> to be an effective measure as it can't do anything to hoarders and the
> frequency of allocation failure doesn't necessarily correlate with the
> amount of impact the workload is getting (it's not a measure of
> usage).

It can always kill the hoarding process(es), which is holding up the
resources without using it.
Such processes will eventually will get restarted but will not be able
to hoard so much because its been on the radar for hoarding and its
limits have been reduced.

>
> This is what I'm awry about.  The kernel-userland interface here is
> cut pretty low in the stack leaving most of arbitration and management
> logic in the userland, which seems to be what people wanted and that's
> fine, but then you're trying to implement an intelligent resource
> control layer which straddles across kernel and userland with those
> low level primitives which inevitably would increase the required
> interface surface as nobody has enough information.
>
We might be able to get the information as we go along.
Such arbitration and management layer outside (instead of inside) has
more visibility into multiple systems which are part of single cluster
and processes are spreaded across cgroup in each such system.
While a logic inside can manage just a manage a process of single node
which are using multiple cgroups.

> Just to illustrate the point, please think of the alsa interface.  We
> expose hardware capabilities pretty much as-is leaving management and
> multiplexing to userland and there's nothing wrong with it.  It fits
> better that way; however, we don't then go try to implement cgroup
> controller for PCM channels.  To do any high-level resource
> management, you gotta do it where the said resource is actually
> managed and arbitrated.
>
> What's the allocation frequency you're expecting?  It might be better
> to just let allocations themselves go through the agent that you're
> planning.
In that case we might need to build FUSE style infrastructure.
Frequency for RDMA resource allocation is certainly less than read/write calls.

> You sure can use cgroup membership to identify who's asking
> tho.  Given how the whole thing is architectured, I'd suggest thinking
> more about how the whole thing should turn out eventually.
>
Yes, I agree.
At this point, its software solution to provide resource isolation in
simple manner which has scope to become adaptive in future.

> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-14 10:18                             ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-14 10:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Doug Ledford, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-doc-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA,
	Johannes Weiner, Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA,
	serge-A9i7LUbDfNHQT0dZR+AlfA, Haggai Eran, Or Gerlitz,
	Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA

On Sat, Sep 12, 2015 at 12:55 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello, Parav.
>
> On Fri, Sep 11, 2015 at 10:09:48PM +0530, Parav Pandit wrote:
>> > If you're planning on following what the existing memcg did in this
>> > area, it's unlikely to go well.  Would you mind sharing what you have
>> > on mind in the long term?  Where do you see this going?
>>
>> At least current thoughts are: central entity authority monitors fail
>> count and new threashold count.
>> Fail count - as similar to other indicates how many time resource
>> failure occured
>> threshold count - indicates upto what this resource has gone upto in
>> usage. (application might not be able to poll on thousands of such
>> resources entries).
>> So based on fail count and threshold count, it can tune it further.
>
> So, regardless of the specific resource in question, implementing
> adaptive resource distribution requires more than simple thresholds
> and failcnts.

May be yes. Buts in difficult to go through the whole design to shape
up right now.
This is the infrastructure getting build with few capabilities.
I see this as starting point instead of end point.

> The very minimum would be a way to exert reclaim
> pressure and then a way to measure how much lack of a given resource
> is affecting the workload.  Maybe it can adaptively lower the limits
> and then watch how often allocation fails but that's highly unlikely
> to be an effective measure as it can't do anything to hoarders and the
> frequency of allocation failure doesn't necessarily correlate with the
> amount of impact the workload is getting (it's not a measure of
> usage).

It can always kill the hoarding process(es), which is holding up the
resources without using it.
Such processes will eventually will get restarted but will not be able
to hoard so much because its been on the radar for hoarding and its
limits have been reduced.

>
> This is what I'm awry about.  The kernel-userland interface here is
> cut pretty low in the stack leaving most of arbitration and management
> logic in the userland, which seems to be what people wanted and that's
> fine, but then you're trying to implement an intelligent resource
> control layer which straddles across kernel and userland with those
> low level primitives which inevitably would increase the required
> interface surface as nobody has enough information.
>
We might be able to get the information as we go along.
Such arbitration and management layer outside (instead of inside) has
more visibility into multiple systems which are part of single cluster
and processes are spreaded across cgroup in each such system.
While a logic inside can manage just a manage a process of single node
which are using multiple cgroups.

> Just to illustrate the point, please think of the alsa interface.  We
> expose hardware capabilities pretty much as-is leaving management and
> multiplexing to userland and there's nothing wrong with it.  It fits
> better that way; however, we don't then go try to implement cgroup
> controller for PCM channels.  To do any high-level resource
> management, you gotta do it where the said resource is actually
> managed and arbitrated.
>
> What's the allocation frequency you're expecting?  It might be better
> to just let allocations themselves go through the agent that you're
> planning.
In that case we might need to build FUSE style infrastructure.
Frequency for RDMA resource allocation is certainly less than read/write calls.

> You sure can use cgroup membership to identify who's asking
> tho.  Given how the whole thing is architectured, I'd suggest thinking
> more about how the whole thing should turn out eventually.
>
Yes, I agree.
At this point, its software solution to provide resource isolation in
simple manner which has scope to become adaptive in future.

> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-11 20:06                       ` Hefty, Sean
@ 2015-09-14 11:09                         ` Parav Pandit
  2015-09-14 14:04                           ` Parav Pandit
  2015-09-14 17:28                             ` Jason Gunthorpe
  0 siblings, 2 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-14 11:09 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Jason Gunthorpe, Tejun Heo, Doug Ledford, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, lizefan@huawei.com, Johannes Weiner,
	Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel@mellanox.com,
	akpm@linux-foundation.org, linux-security-module@vger.kernel.org

On Sat, Sep 12, 2015 at 1:36 AM, Hefty, Sean <sean.hefty@intel.com> wrote:
>> > Trying to limit the number of QPs that an app can allocate,
>> > therefore, just limits how much of the address space an app can use.
>> > There's no clear link between QP limits and HW resource limits,
>> > unless you assume a very specific underlying implementation.
>>
>> Isn't that the point though? We have several vendors with hardware
>> that does impose hard limits on specific resources. There is no way to
>> avoid that, and ultimately, those exact HW resources need to be
>> limited.
>
> My point is that limiting the number of QPs that an app can allocate doesn't necessarily mean anything.  Is allocating 1000 QPs with 1 entry each better or worse than 1 QP with 10,000 entries?  Who knows?

I think it means if its RDMA RC QP, than whether you can talk to 1000
nodes or 1 node in network.
When we deploy MPI application, it know the rank of the application,
we know the cluster size of the deployment and based on that resource
allocation can be done.
If you meant to say from performance point of view, than resource
count is possibly not the right measure.

Just because we have not defined those interface for performance today
in this patch set, doesn't mean that we won't do it.
I could easily see a number_of_messages/sec as one interface to be
added in future.
But that won't stop process hoarders to stop taking away all the QPs,
just the way we needed PID controller.

Now when it comes to Intel implementation, if it driver layer knows
(in future we new APIs) that whether 10 or 100 user QPs should map to
few hw-QPs or more hw-QPs (uSNIC).
so that hw-QP exposed to one cgroup is isolated from hw-QP exposed to
other cgroup.
If hw- implementation doesn't require isolation, it could just
continue from single pool, its left to the vendor implementation on
how to use this information (this API is not present in the patch).

So cgroup can also provides a control point for vendor layer to tune
internal resource allocation based on provided matrix, which cannot be
done by just providing "memory usage by RDMA structures".

If I have to compare it with other cgroup knobs, low level individual
knobs by itself, doesn't serve any meaningful purpose either.
Just by defined how much CPU to use or how much memory to use, it
cannot define the application performance either.
I am not sure, whether iocontroller can achieve 10 million IOPs by
defining single CPU and 64KB of memory.
all the knobs needs to be set in right way to reach desired number.

In similar line RDMA resource knobs as individual knobs are not
definition of performance, its just another knob.

>
>> If we want to talk about abstraction, then I'd suggest something very
>> general and simple - two limits:
>>  '% of the RDMA hardware resource pool' (per device or per ep?)
>>  'bytes of kernel memory for RDMA structures' (all devices)
>
> Yes - this makes more sense to me.
>

Sean, Jason,
Help me to understand this scheme.

1. How does the % of resource, is different than absolute number? With
rest of the cgroups systems we define absolute number at most places
to my knowledge.
Such as (a) number_of_tcp_bytes, (b) IOPs of block device, (c) cpu cycles etc.
20% of QP = 20 QPs when 100 QPs are with hw.
I prefer to keep the resource scheme consistent with other resource
control points - i.e. absolute number.

2. bytes of  kernel memory for RDMA structures
One QP of one vendor might consume X bytes and other Y bytes. How does
the application knows how much memory to give.
application can allocate 100 QP of each 1 entry deep or 1 QP of 100
entries deep as in Sean's example.
Both might consume almost same memory.
Application doing 100 QP allocation, still within limit of memory of
cgroup leaves other applications without any QP.
I don't see a point of memory footprint based scheme, as memory limits
are well addressed by more smarter memory controller anyway.

I do agree with Tejun, Sean on the point that abstraction level has to
be different for using RDMA and thats why libfabrics and other
interfaces are emerging which will take its own time to get stabilize,
integrated.

Until pure IB style RDMA programming model exist - based on RDMA
resource based scheme, I think control point also has to be on
resources.
Once a stable abstraction level is on table (possibly across fabric
not just RDMA), than a right resource controller can be implemented.
Even when RDMA abstraction layer arrives, as Jason mentioned, at the
end it would consume some hw resource anyway, that needs to be
controlled too.

Jason,
If the hardware vendor defines the resource pool without saying its
resource QP or MR, how would actually management/control point can
decide what should be controlled to what limit?
We will need additional user space library component to decode than,
after that it needs to be abstracted out as QP or MR so that it can be
deal in vendor agnostic way as application layer.
and than it would look similar to what is being proposed here?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-14 11:09                         ` Parav Pandit
@ 2015-09-14 14:04                           ` Parav Pandit
  2015-09-14 15:21                               ` Tejun Heo
  2015-09-14 17:28                             ` Jason Gunthorpe
  1 sibling, 1 reply; 95+ messages in thread
From: Parav Pandit @ 2015-09-14 14:04 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Jason Gunthorpe, Tejun Heo, Doug Ledford, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, lizefan@huawei.com, Johannes Weiner,
	Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel@mellanox.com,
	akpm@linux-foundation.org, linux-security-module@vger.kernel.org

Hi Tejun,

I missed to acknowledge your point that we need both - hard limit and
soft limit/weight. Current patchset is only based on hard limit.
I see that weight would be another helfpul layer in chain that we can
implement after this as incremental that makes review, debugging
manageable?

Parav



On Mon, Sep 14, 2015 at 4:39 PM, Parav Pandit <pandit.parav@gmail.com> wrote:
> On Sat, Sep 12, 2015 at 1:36 AM, Hefty, Sean <sean.hefty@intel.com> wrote:
>>> > Trying to limit the number of QPs that an app can allocate,
>>> > therefore, just limits how much of the address space an app can use.
>>> > There's no clear link between QP limits and HW resource limits,
>>> > unless you assume a very specific underlying implementation.
>>>
>>> Isn't that the point though? We have several vendors with hardware
>>> that does impose hard limits on specific resources. There is no way to
>>> avoid that, and ultimately, those exact HW resources need to be
>>> limited.
>>
>> My point is that limiting the number of QPs that an app can allocate doesn't necessarily mean anything.  Is allocating 1000 QPs with 1 entry each better or worse than 1 QP with 10,000 entries?  Who knows?
>
> I think it means if its RDMA RC QP, than whether you can talk to 1000
> nodes or 1 node in network.
> When we deploy MPI application, it know the rank of the application,
> we know the cluster size of the deployment and based on that resource
> allocation can be done.
> If you meant to say from performance point of view, than resource
> count is possibly not the right measure.
>
> Just because we have not defined those interface for performance today
> in this patch set, doesn't mean that we won't do it.
> I could easily see a number_of_messages/sec as one interface to be
> added in future.
> But that won't stop process hoarders to stop taking away all the QPs,
> just the way we needed PID controller.
>
> Now when it comes to Intel implementation, if it driver layer knows
> (in future we new APIs) that whether 10 or 100 user QPs should map to
> few hw-QPs or more hw-QPs (uSNIC).
> so that hw-QP exposed to one cgroup is isolated from hw-QP exposed to
> other cgroup.
> If hw- implementation doesn't require isolation, it could just
> continue from single pool, its left to the vendor implementation on
> how to use this information (this API is not present in the patch).
>
> So cgroup can also provides a control point for vendor layer to tune
> internal resource allocation based on provided matrix, which cannot be
> done by just providing "memory usage by RDMA structures".
>
> If I have to compare it with other cgroup knobs, low level individual
> knobs by itself, doesn't serve any meaningful purpose either.
> Just by defined how much CPU to use or how much memory to use, it
> cannot define the application performance either.
> I am not sure, whether iocontroller can achieve 10 million IOPs by
> defining single CPU and 64KB of memory.
> all the knobs needs to be set in right way to reach desired number.
>
> In similar line RDMA resource knobs as individual knobs are not
> definition of performance, its just another knob.
>
>>
>>> If we want to talk about abstraction, then I'd suggest something very
>>> general and simple - two limits:
>>>  '% of the RDMA hardware resource pool' (per device or per ep?)
>>>  'bytes of kernel memory for RDMA structures' (all devices)
>>
>> Yes - this makes more sense to me.
>>
>
> Sean, Jason,
> Help me to understand this scheme.
>
> 1. How does the % of resource, is different than absolute number? With
> rest of the cgroups systems we define absolute number at most places
> to my knowledge.
> Such as (a) number_of_tcp_bytes, (b) IOPs of block device, (c) cpu cycles etc.
> 20% of QP = 20 QPs when 100 QPs are with hw.
> I prefer to keep the resource scheme consistent with other resource
> control points - i.e. absolute number.
>
> 2. bytes of  kernel memory for RDMA structures
> One QP of one vendor might consume X bytes and other Y bytes. How does
> the application knows how much memory to give.
> application can allocate 100 QP of each 1 entry deep or 1 QP of 100
> entries deep as in Sean's example.
> Both might consume almost same memory.
> Application doing 100 QP allocation, still within limit of memory of
> cgroup leaves other applications without any QP.
> I don't see a point of memory footprint based scheme, as memory limits
> are well addressed by more smarter memory controller anyway.
>
> I do agree with Tejun, Sean on the point that abstraction level has to
> be different for using RDMA and thats why libfabrics and other
> interfaces are emerging which will take its own time to get stabilize,
> integrated.
>
> Until pure IB style RDMA programming model exist - based on RDMA
> resource based scheme, I think control point also has to be on
> resources.
> Once a stable abstraction level is on table (possibly across fabric
> not just RDMA), than a right resource controller can be implemented.
> Even when RDMA abstraction layer arrives, as Jason mentioned, at the
> end it would consume some hw resource anyway, that needs to be
> controlled too.
>
> Jason,
> If the hardware vendor defines the resource pool without saying its
> resource QP or MR, how would actually management/control point can
> decide what should be controlled to what limit?
> We will need additional user space library component to decode than,
> after that it needs to be abstracted out as QP or MR so that it can be
> deal in vendor agnostic way as application layer.
> and than it would look similar to what is being proposed here?

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-14 15:21                               ` Tejun Heo
  0 siblings, 0 replies; 95+ messages in thread
From: Tejun Heo @ 2015-09-14 15:21 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Hefty, Sean, Jason Gunthorpe, Doug Ledford,
	cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
	lizefan@huawei.com, Johannes Weiner, Jonathan Corbet,
	james.l.morris@oracle.com, serge@hallyn.com, Haggai Eran,
	Or Gerlitz, Matan Barak, raindel@mellanox.com,
	akpm@linux-foundation.org, linux-security-module@vger.kernel.org

Hello, Parav.

On Mon, Sep 14, 2015 at 07:34:09PM +0530, Parav Pandit wrote:
> I missed to acknowledge your point that we need both - hard limit and
> soft limit/weight. Current patchset is only based on hard limit.
> I see that weight would be another helfpul layer in chain that we can
> implement after this as incremental that makes review, debugging
> manageable?

At this point, I'm very unsure that doing this as a cgroup controller
is a good direction.  From userland interface standpoint, publishing a
cgroup controller is a big commitment.  It is true that we haven't
been doing a good job of gatekeeping or polishing controller
interfaces but we're trying hard to change that and what's being
proposed in this thread doesn't really seem to be mature enough.  It's
not even clear what's being identified as resources here are things
that the users would actually care about or if it's even possible to
implement sensible resource control in the kernel via the proposed
resource restrictions.

So, I'd suggest going back to the board and figuring out what the
actual resources are, their distribution strategies should be and at
which layer such strategies can be implemented best.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-14 15:21                               ` Tejun Heo
  0 siblings, 0 replies; 95+ messages in thread
From: Tejun Heo @ 2015-09-14 15:21 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Hefty, Sean, Jason Gunthorpe, Doug Ledford,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org, Johannes Weiner,
	Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org, Haggai Eran,
	Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Hello, Parav.

On Mon, Sep 14, 2015 at 07:34:09PM +0530, Parav Pandit wrote:
> I missed to acknowledge your point that we need both - hard limit and
> soft limit/weight. Current patchset is only based on hard limit.
> I see that weight would be another helfpul layer in chain that we can
> implement after this as incremental that makes review, debugging
> manageable?

At this point, I'm very unsure that doing this as a cgroup controller
is a good direction.  From userland interface standpoint, publishing a
cgroup controller is a big commitment.  It is true that we haven't
been doing a good job of gatekeeping or polishing controller
interfaces but we're trying hard to change that and what's being
proposed in this thread doesn't really seem to be mature enough.  It's
not even clear what's being identified as resources here are things
that the users would actually care about or if it's even possible to
implement sensible resource control in the kernel via the proposed
resource restrictions.

So, I'd suggest going back to the board and figuring out what the
actual resources are, their distribution strategies should be and at
which layer such strategies can be implemented best.

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-14 17:28                             ` Jason Gunthorpe
  0 siblings, 0 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2015-09-14 17:28 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Hefty, Sean, Tejun Heo, Doug Ledford, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, lizefan@huawei.com, Johannes Weiner,
	Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel@mellanox.com,
	akpm@linux-foundation.org, linux-security-module@vger.kernel.org

On Mon, Sep 14, 2015 at 04:39:33PM +0530, Parav Pandit wrote:

> 1. How does the % of resource, is different than absolute number? With
> rest of the cgroups systems we define absolute number at most places
> to my knowledge.

There isn't really much choice if the abstraction is a bundle of all
resources. You can't use an absolute number unless every possible
hardware limited resource is defined, which doesn't seem smart to me
either. It is not abstract enough, and doesn't match our universe of
hardware very well.

> 2. bytes of  kernel memory for RDMA structures
> One QP of one vendor might consume X bytes and other Y bytes. How does
> the application knows how much memory to give.

I don't see this distinction being useful at such a fine granularity
where the control side needs to distinguish between 1 and 2 QPs.

The majority use for control groups has been along with containers to
prevent a container for exhausting resources in a way that impacts
another.

In that use model limiting each container to N MB of kernel memory
makes it straightforward to reason about resource exhaustion in a
multi-tennant environment. We have other controllers that do this,
just more indirectly (ie limiting the number of inotifies, or the
number of fds indirectly cap kernel memory consumption)

ie Presumably some fairly small limitation like 10MB is enough for
most non-MPI jobs.

> Application doing 100 QP allocation, still within limit of memory of
> cgroup leaves other applications without any QP.

No, if the HW has a fixed QP pool then it would hit #1 above. Both are
active at once. For example you'd say a container cannot use more than
10% of the device's hardware resources, or more than 10MB of kernel
memory.

If on an mlx card, you probably hit the 10% of QP resources first. If
on an qib card there is no HW QP pool (well, almost, QPNs are always
limited), so you'd hit the memory limit instead.

In either case, we don't want to see a container able to exhaust
either all of kernel memory or all of the HW resources to deny other
containers.

If you have a non-container use case in mind I'd be curious to hear
it..

> I don't see a point of memory footprint based scheme, as memory limits
> are well addressed by more smarter memory controller anyway.

I don't thing #1 is controlled but another controller. This is long
lived kernel-side memory allocations to support RDMA resource
allocation - we certainly have nothing in the rdma layer that is
tracking this stuff.

> If the hardware vendor defines the resource pool without saying its
> resource QP or MR, how would actually management/control point can
> decide what should be controlled to what limit?

In the kernel each HW driver has to be involved to declare what it's
hardware resource limits are.

In user space, it is just a simple limiter knob to prevent resource
exhaustion.

UAPI wise, nobdy has to care if the limit is actually # of QPs or
something else.

Jason

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-14 17:28                             ` Jason Gunthorpe
  0 siblings, 0 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2015-09-14 17:28 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Hefty, Sean, Tejun Heo, Doug Ledford,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org, Johannes Weiner,
	Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org, Haggai Eran,
	Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Mon, Sep 14, 2015 at 04:39:33PM +0530, Parav Pandit wrote:

> 1. How does the % of resource, is different than absolute number? With
> rest of the cgroups systems we define absolute number at most places
> to my knowledge.

There isn't really much choice if the abstraction is a bundle of all
resources. You can't use an absolute number unless every possible
hardware limited resource is defined, which doesn't seem smart to me
either. It is not abstract enough, and doesn't match our universe of
hardware very well.

> 2. bytes of  kernel memory for RDMA structures
> One QP of one vendor might consume X bytes and other Y bytes. How does
> the application knows how much memory to give.

I don't see this distinction being useful at such a fine granularity
where the control side needs to distinguish between 1 and 2 QPs.

The majority use for control groups has been along with containers to
prevent a container for exhausting resources in a way that impacts
another.

In that use model limiting each container to N MB of kernel memory
makes it straightforward to reason about resource exhaustion in a
multi-tennant environment. We have other controllers that do this,
just more indirectly (ie limiting the number of inotifies, or the
number of fds indirectly cap kernel memory consumption)

ie Presumably some fairly small limitation like 10MB is enough for
most non-MPI jobs.

> Application doing 100 QP allocation, still within limit of memory of
> cgroup leaves other applications without any QP.

No, if the HW has a fixed QP pool then it would hit #1 above. Both are
active at once. For example you'd say a container cannot use more than
10% of the device's hardware resources, or more than 10MB of kernel
memory.

If on an mlx card, you probably hit the 10% of QP resources first. If
on an qib card there is no HW QP pool (well, almost, QPNs are always
limited), so you'd hit the memory limit instead.

In either case, we don't want to see a container able to exhaust
either all of kernel memory or all of the HW resources to deny other
containers.

If you have a non-container use case in mind I'd be curious to hear
it..

> I don't see a point of memory footprint based scheme, as memory limits
> are well addressed by more smarter memory controller anyway.

I don't thing #1 is controlled but another controller. This is long
lived kernel-side memory allocations to support RDMA resource
allocation - we certainly have nothing in the rdma layer that is
tracking this stuff.

> If the hardware vendor defines the resource pool without saying its
> resource QP or MR, how would actually management/control point can
> decide what should be controlled to what limit?

In the kernel each HW driver has to be involved to declare what it's
hardware resource limits are.

In user space, it is just a simple limiter knob to prevent resource
exhaustion.

UAPI wise, nobdy has to care if the limit is actually # of QPs or
something else.

Jason

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-14 18:54                               ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-14 18:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Hefty, Sean, Tejun Heo, Doug Ledford, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, lizefan@huawei.com, Johannes Weiner,
	Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel@mellanox.com,
	akpm@linux-foundation.org, linux-security-module@vger.kernel.org

On Mon, Sep 14, 2015 at 10:58 PM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Mon, Sep 14, 2015 at 04:39:33PM +0530, Parav Pandit wrote:
>
>> 1. How does the % of resource, is different than absolute number? With
>> rest of the cgroups systems we define absolute number at most places
>> to my knowledge.
>
> There isn't really much choice if the abstraction is a bundle of all
> resources. You can't use an absolute number unless every possible
> hardware limited resource is defined, which doesn't seem smart to me
> either.

Absolute number of percentage is representation for a given property.
That property needs definition. Isn't it?
How do we say that "Some undefined" resource you give certain amount,
which user doesn't know about what to administer, or configure.
It has to be quantifiable entity.

It is not abstract enough, and doesn't match our universe of
> hardware very well.
>
Why does the user need to know the actual hardware resource limits or
define hardware based resource.

RDMA verbs is the abstraction point.
We could well define
(a) how many number of RDMA connections are allowed instead of QP, or CQ or AH.
(b) how many data transfer buffers to use.

The fact is we have so many mid layers, which uses these resources
differently, above abstraction does not fit the bill.
But we know the mid layers how they operate, and how they use the RDMA
resource keeping.
So if we deploy MPI application for given cluster of container, we can
accurately configure the RDMA resource, isn't it?

Another example would be, if we don't want only 50% resources to be
given to all containers and rest 50% to kernel consumers such as NFS,
all containers can reside in single rdma cgroup limited to given
limits.


>> 2. bytes of  kernel memory for RDMA structures
>> One QP of one vendor might consume X bytes and other Y bytes. How does
>> the application knows how much memory to give.
>
> I don't see this distinction being useful at such a fine granularity
> where the control side needs to distinguish between 1 and 2 QPs.
>
> The majority use for control groups has been along with containers to
> prevent a container for exhausting resources in a way that impacts
> another.
>
Right. Thats the intention.

> In that use model limiting each container to N MB of kernel memory
> makes it straightforward to reason about resource exhaustion in a
> multi-tennant environment. We have other controllers that do this,
> just more indirectly (ie limiting the number of inotifies, or the
> number of fds indirectly cap kernel memory consumption)
>
> ie Presumably some fairly small limitation like 10MB is enough for
> most non-MPI jobs.

Container application always write a simple for loop code to take away
majority of QP with 10MB limit.
>
>> Application doing 100 QP allocation, still within limit of memory of
>> cgroup leaves other applications without any QP.
>
> No, if the HW has a fixed QP pool then it would hit #1 above. Both are
> active at once. For example you'd say a container cannot use more than
> 10% of the device's hardware resources, or more than 10MB of kernel
> memory.
>
Right. we need to define this resource pool, right?
Why it cannot be verbs abstraction?
How many resources are really used to implement verb layer in reality
is left to hardware vendor
Abstract pool just added confusion instead of clarity.

Imagine instead of tcp_bytes or kmem bytes, its "some memory
resource", how would someone debug/tune a system with abstract knobs?

> If on an mlx card, you probably hit the 10% of QP resources first. If
> on an qib card there is no HW QP pool (well, almost, QPNs are always
> limited), so you'd hit the memory limit instead.
>
> In either case, we don't want to see a container able to exhaust
> either all of kernel memory or all of the HW resources to deny other
> containers.
>
> If you have a non-container use case in mind I'd be curious to hear
> it..

Container is the prime case. Additionally equally prime case of non
container use case.
Today, application can take up all the resource being first class
citizan, and NFS mount will fail.
So without container also we should be able to restrict resources to
user mode app.


>
>> I don't see a point of memory footprint based scheme, as memory limits
>> are well addressed by more smarter memory controller anyway.
>
> I don't thing #1 is controlled but another controller. This is long
> lived kernel-side memory allocations to support RDMA resource
> allocation - we certainly have nothing in the rdma layer that is
> tracking this stuff.
>
Some drivers performs mmap() of kernel memory to user space, some
drivers does user space page allocation and maps to device.
Putting or tracking all those is just so intrusive changes spreading
down the vendor drivers or ib layer which may not be right way to
track.
Memory allocation tracking I believe should be left to memcg.


>> If the hardware vendor defines the resource pool without saying its
>> resource QP or MR, how would actually management/control point can
>> decide what should be controlled to what limit?
>
> In the kernel each HW driver has to be involved to declare what it's
> hardware resource limits are.
>
> In user space, it is just a simple limiter knob to prevent resource
> exhaustion.
>
> UAPI wise, nobdy has to care if the limit is actually # of QPs or
> something else.
>

If we dont care about resource, we cannot tune or limit it. number of
MRs used by MPI vs rsocket vs accelio is way different.



> Jason

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-14 18:54                               ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-14 18:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Hefty, Sean, Tejun Heo, Doug Ledford,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org, Johannes Weiner,
	Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org, Haggai Eran,
	Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Mon, Sep 14, 2015 at 10:58 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Mon, Sep 14, 2015 at 04:39:33PM +0530, Parav Pandit wrote:
>
>> 1. How does the % of resource, is different than absolute number? With
>> rest of the cgroups systems we define absolute number at most places
>> to my knowledge.
>
> There isn't really much choice if the abstraction is a bundle of all
> resources. You can't use an absolute number unless every possible
> hardware limited resource is defined, which doesn't seem smart to me
> either.

Absolute number of percentage is representation for a given property.
That property needs definition. Isn't it?
How do we say that "Some undefined" resource you give certain amount,
which user doesn't know about what to administer, or configure.
It has to be quantifiable entity.

It is not abstract enough, and doesn't match our universe of
> hardware very well.
>
Why does the user need to know the actual hardware resource limits or
define hardware based resource.

RDMA verbs is the abstraction point.
We could well define
(a) how many number of RDMA connections are allowed instead of QP, or CQ or AH.
(b) how many data transfer buffers to use.

The fact is we have so many mid layers, which uses these resources
differently, above abstraction does not fit the bill.
But we know the mid layers how they operate, and how they use the RDMA
resource keeping.
So if we deploy MPI application for given cluster of container, we can
accurately configure the RDMA resource, isn't it?

Another example would be, if we don't want only 50% resources to be
given to all containers and rest 50% to kernel consumers such as NFS,
all containers can reside in single rdma cgroup limited to given
limits.


>> 2. bytes of  kernel memory for RDMA structures
>> One QP of one vendor might consume X bytes and other Y bytes. How does
>> the application knows how much memory to give.
>
> I don't see this distinction being useful at such a fine granularity
> where the control side needs to distinguish between 1 and 2 QPs.
>
> The majority use for control groups has been along with containers to
> prevent a container for exhausting resources in a way that impacts
> another.
>
Right. Thats the intention.

> In that use model limiting each container to N MB of kernel memory
> makes it straightforward to reason about resource exhaustion in a
> multi-tennant environment. We have other controllers that do this,
> just more indirectly (ie limiting the number of inotifies, or the
> number of fds indirectly cap kernel memory consumption)
>
> ie Presumably some fairly small limitation like 10MB is enough for
> most non-MPI jobs.

Container application always write a simple for loop code to take away
majority of QP with 10MB limit.
>
>> Application doing 100 QP allocation, still within limit of memory of
>> cgroup leaves other applications without any QP.
>
> No, if the HW has a fixed QP pool then it would hit #1 above. Both are
> active at once. For example you'd say a container cannot use more than
> 10% of the device's hardware resources, or more than 10MB of kernel
> memory.
>
Right. we need to define this resource pool, right?
Why it cannot be verbs abstraction?
How many resources are really used to implement verb layer in reality
is left to hardware vendor
Abstract pool just added confusion instead of clarity.

Imagine instead of tcp_bytes or kmem bytes, its "some memory
resource", how would someone debug/tune a system with abstract knobs?

> If on an mlx card, you probably hit the 10% of QP resources first. If
> on an qib card there is no HW QP pool (well, almost, QPNs are always
> limited), so you'd hit the memory limit instead.
>
> In either case, we don't want to see a container able to exhaust
> either all of kernel memory or all of the HW resources to deny other
> containers.
>
> If you have a non-container use case in mind I'd be curious to hear
> it..

Container is the prime case. Additionally equally prime case of non
container use case.
Today, application can take up all the resource being first class
citizan, and NFS mount will fail.
So without container also we should be able to restrict resources to
user mode app.


>
>> I don't see a point of memory footprint based scheme, as memory limits
>> are well addressed by more smarter memory controller anyway.
>
> I don't thing #1 is controlled but another controller. This is long
> lived kernel-side memory allocations to support RDMA resource
> allocation - we certainly have nothing in the rdma layer that is
> tracking this stuff.
>
Some drivers performs mmap() of kernel memory to user space, some
drivers does user space page allocation and maps to device.
Putting or tracking all those is just so intrusive changes spreading
down the vendor drivers or ib layer which may not be right way to
track.
Memory allocation tracking I believe should be left to memcg.


>> If the hardware vendor defines the resource pool without saying its
>> resource QP or MR, how would actually management/control point can
>> decide what should be controlled to what limit?
>
> In the kernel each HW driver has to be involved to declare what it's
> hardware resource limits are.
>
> In user space, it is just a simple limiter knob to prevent resource
> exhaustion.
>
> UAPI wise, nobdy has to care if the limit is actually # of QPs or
> something else.
>

If we dont care about resource, we cannot tune or limit it. number of
MRs used by MPI vs rsocket vs accelio is way different.



> Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-14 18:54                               ` Parav Pandit
  (?)
@ 2015-09-14 20:18                               ` Jason Gunthorpe
  2015-09-15  3:08                                 ` Parav Pandit
  -1 siblings, 1 reply; 95+ messages in thread
From: Jason Gunthorpe @ 2015-09-14 20:18 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Hefty, Sean, Tejun Heo, Doug Ledford, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, lizefan@huawei.com, Johannes Weiner,
	Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel@mellanox.com,
	akpm@linux-foundation.org, linux-security-module@vger.kernel.org

On Tue, Sep 15, 2015 at 12:24:41AM +0530, Parav Pandit wrote:
> On Mon, Sep 14, 2015 at 10:58 PM, Jason Gunthorpe
> <jgunthorpe@obsidianresearch.com> wrote:
> > On Mon, Sep 14, 2015 at 04:39:33PM +0530, Parav Pandit wrote:
> >
> >> 1. How does the % of resource, is different than absolute number? With
> >> rest of the cgroups systems we define absolute number at most places
> >> to my knowledge.
> >
> > There isn't really much choice if the abstraction is a bundle of all
> > resources. You can't use an absolute number unless every possible
> > hardware limited resource is defined, which doesn't seem smart to me
> > either.
> 
> Absolute number of percentage is representation for a given property.
> That property needs definition. Isn't it?
> How do we say that "Some undefined" resource you give certain amount,
> which user doesn't know about what to administer, or configure.
> It has to be quantifiable entity.

Each vendor can quantify exactly what HW resources their
implementation has and how the above limit impacts their card. There
will be many variations, and IIRC, some vendors have resource pools
not directly related to the standard PD/QP/MR/CQ/AH verbs resources.

> > It is not abstract enough, and doesn't match our universe of
> > hardware very well.

> Why does the user need to know the actual hardware resource limits or
> define hardware based resource.

Because actual hardware resources *ARE* the limit. We cannot abstract
it away. The hardware/driver has real, fixed, immutable limits. No API
abstraction can possibly change that.

The limits are such there *IS NO* API boundary that can bundle them
into something simpler. There will always be apps that require wildly
different ratios of the basic verbs resources (PD/QP/CQ/AH/MR)

Either we control each and every vendor's limited resource directly
(which is where you started), or we just roll them up into a 'all
resource' bundle and control them indirectly. There just isn't a
mythical third 'better API' choice with the hardware we have today.

> (a) how many number of RDMA connections are allowed instead of QP, or CQ or AH.
> (b) how many data transfer buffers to use.

None of that accurately reflects what the real HW limits actually are.

> > ie Presumably some fairly small limitation like 10MB is enough for
> > most non-MPI jobs.
> 
> Container application always write a simple for loop code to take away
> majority of QP with 10MB limit.

No, the HW and kmem limits must work together, the HW limit would
prevent exhaustion outside the container.

> Imagine instead of tcp_bytes or kmem bytes, its "some memory
> resource", how would someone debug/tune a system with abstract knobs?

Well, we have the memcg controller that does track kmem. The subsystem
specific kmem limit is to force fair sharing of the limited kmem
resource within the overall memcg limit.

They are complementary.

A fictional rdma_kmem and tcp_kmem would serve very similar purposes.

> > UAPI wise, nobdy has to care if the limit is actually # of QPs or
> > something else.

> If we dont care about resource, we cannot tune or limit it. number of
> MRs used by MPI vs rsocket vs accelio is way different.

So? I don't think it is really important to have an exact, precise,
limit. The HW pools are pretty big, unless you plan to run tens of
thousands of containers eacg with tiny RDMA limits, it is fine to talk
in broader terms (ie 10% of all HW limited resource) which is totally
adaquate to hard-prevent run away or exhaustion scenarios.

Jason

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
  2015-09-14 20:18                               ` Jason Gunthorpe
@ 2015-09-15  3:08                                 ` Parav Pandit
  2015-09-15  3:45                                     ` Jason Gunthorpe
  0 siblings, 1 reply; 95+ messages in thread
From: Parav Pandit @ 2015-09-15  3:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Hefty, Sean, Tejun Heo, Doug Ledford, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, lizefan@huawei.com, Johannes Weiner,
	Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel@mellanox.com,
	akpm@linux-foundation.org, linux-security-module@vger.kernel.org

> Because actual hardware resources *ARE* the limit. We cannot abstract
> it away. The hardware/driver has real, fixed, immutable limits. No API
> abstraction can possibly change that.
>
> The limits are such there *IS NO* API boundary that can bundle them
> into something simpler. There will always be apps that require wildly
> different ratios of the basic verbs resources (PD/QP/CQ/AH/MR)
>
> Either we control each and every vendor's limited resource directly
> (which is where you started), or we just roll them up into a 'all
> resource' bundle and control them indirectly. There just isn't a
> mythical third 'better API' choice with the hardware we have today.
>

As you precisely described, about wild ratio,
we are asking vendor driver (bottom most layer) to statically define
what the resource pool is, without telling him which application are
we going to run to use those pool.
Therefore vendor layer cannot ever define "right" resource pool.

If we try to fix defining "right" resource pool, we will have to come
up with API to modify/tune individual element of the pool.
Once we bring that complexity, it becomes what is proposed in this pachset.

Instead of bringing such complex solution, that affecting all the
layers which solves the same problem as this patch,
its better to keep definition of "bundle" in the user
library/application deployment engine.
where bundle is set of those resources.

May be instead of having invidividual files for each resource, at user
interface level, we can have rdma.bundle file.
this bundle cgroup file defines these resources such as
"ah 100
mr 100
qp 10"

> So? I don't think it is really important to have an exact, precise,
> limit. The HW pools are pretty big, unless you plan to run tens of
> thousands of containers eacg with tiny RDMA limits, it is fine to talk
> in broader terms (ie 10% of all HW limited resource) which is totally
> adaquate to hard-prevent run away or exhaustion scenarios.
>

rdma cgroup will allow us to run post 512 or 1024 containers without
using PCIe SR-IOV, without creating any vendor specific resource
pools.


> Jason

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-15  3:45                                     ` Jason Gunthorpe
  0 siblings, 0 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2015-09-15  3:45 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Hefty, Sean, Tejun Heo, Doug Ledford, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, lizefan@huawei.com, Johannes Weiner,
	Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel@mellanox.com,
	akpm@linux-foundation.org, linux-security-module@vger.kernel.org

On Tue, Sep 15, 2015 at 08:38:54AM +0530, Parav Pandit wrote:

> As you precisely described, about wild ratio,
> we are asking vendor driver (bottom most layer) to statically define
> what the resource pool is, without telling him which application are
> we going to run to use those pool.
> Therefore vendor layer cannot ever define "right" resource pool.

No, I'm saying the resource pool is *well defined* and *fixed* by each
hardware.

The only question is how do we expose the N resource limits, the list
of which is totally vendor specific.

Yes, using a % scheme fixes the ratios, 1% is going to be a certain
number of PD's, QP's, MRs, CQ's, etc at a ratio fixed by the driver
configuration. That is the trade off for API simplicity.

Yes, this results in some resources being over provisioned.

I have no idea if that is usable for the workloads people want to run..

But *there is no middle option*. Either each and every single hardware
limited resources has a dedicated per-container limit, or they are
*somehow* bundled and the ratios become fixed.

If Tejun says we can't have something so emphemeral as a vendor
specific list of hardware resource pools - then what choice is
left?

> Instead of bringing such complex solution, that affecting all the
> layers which solves the same problem as this patch,
> its better to keep definition of "bundle" in the user
> library/application deployment engine.
> where bundle is set of those resources.

The kernel has to do the restriction, so at some point you are telling
the kernel to limit each and every unique resource the HW has, which
is back to the original patch set, munging how the data is passed
makes no difference to the basic objection, IMHO.

> rdma cgroup will allow us to run post 512 or 1024 containers without
> using PCIe SR-IOV, without creating any vendor specific resource
> pools.

If you ignore any vendor specific resource limits then you've just
left open a hole, a wayward container can exhaust all others - so what
was the point of doing all this work?

Jason

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-15  3:45                                     ` Jason Gunthorpe
  0 siblings, 0 replies; 95+ messages in thread
From: Jason Gunthorpe @ 2015-09-15  3:45 UTC (permalink / raw)
  To: Parav Pandit
  Cc: Hefty, Sean, Tejun Heo, Doug Ledford,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org, Johannes Weiner,
	Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org, Haggai Eran,
	Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Tue, Sep 15, 2015 at 08:38:54AM +0530, Parav Pandit wrote:

> As you precisely described, about wild ratio,
> we are asking vendor driver (bottom most layer) to statically define
> what the resource pool is, without telling him which application are
> we going to run to use those pool.
> Therefore vendor layer cannot ever define "right" resource pool.

No, I'm saying the resource pool is *well defined* and *fixed* by each
hardware.

The only question is how do we expose the N resource limits, the list
of which is totally vendor specific.

Yes, using a % scheme fixes the ratios, 1% is going to be a certain
number of PD's, QP's, MRs, CQ's, etc at a ratio fixed by the driver
configuration. That is the trade off for API simplicity.

Yes, this results in some resources being over provisioned.

I have no idea if that is usable for the workloads people want to run..

But *there is no middle option*. Either each and every single hardware
limited resources has a dedicated per-container limit, or they are
*somehow* bundled and the ratios become fixed.

If Tejun says we can't have something so emphemeral as a vendor
specific list of hardware resource pools - then what choice is
left?

> Instead of bringing such complex solution, that affecting all the
> layers which solves the same problem as this patch,
> its better to keep definition of "bundle" in the user
> library/application deployment engine.
> where bundle is set of those resources.

The kernel has to do the restriction, so at some point you are telling
the kernel to limit each and every unique resource the HW has, which
is back to the original patch set, munging how the data is passed
makes no difference to the basic objection, IMHO.

> rdma cgroup will allow us to run post 512 or 1024 containers without
> using PCIe SR-IOV, without creating any vendor specific resource
> pools.

If you ignore any vendor specific resource limits then you've just
left open a hole, a wayward container can exhaust all others - so what
was the point of doing all this work?

Jason

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-16  4:41                                       ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-16  4:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Hefty, Sean, Tejun Heo, Doug Ledford, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, lizefan@huawei.com, Johannes Weiner,
	Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com,
	Haggai Eran, Or Gerlitz, Matan Barak, raindel@mellanox.com,
	akpm@linux-foundation.org, linux-security-module@vger.kernel.org

Hi Jason, Sean, Tejun,

I am in process of defining new approach, design based on the feedback
given here for new RDMA cgroup from all of you.
I have also collected feedback from Liran yesterday and ORNL folks too.

Soon I will post the new approach, high level APIs and functionality
for review before submitting actual implementation.

Regards,
Parav Pandit

On Tue, Sep 15, 2015 at 9:15 AM, Jason Gunthorpe
<jgunthorpe@obsidianresearch.com> wrote:
> On Tue, Sep 15, 2015 at 08:38:54AM +0530, Parav Pandit wrote:
>
>> As you precisely described, about wild ratio,
>> we are asking vendor driver (bottom most layer) to statically define
>> what the resource pool is, without telling him which application are
>> we going to run to use those pool.
>> Therefore vendor layer cannot ever define "right" resource pool.
>
> No, I'm saying the resource pool is *well defined* and *fixed* by each
> hardware.
>
> The only question is how do we expose the N resource limits, the list
> of which is totally vendor specific.
>


>> rdma cgroup will allow us to run post 512 or 1024 containers without
>> using PCIe SR-IOV, without creating any vendor specific resource
>> pools.
>
> If you ignore any vendor specific resource limits then you've just
> left open a hole, a wayward container can exhaust all others - so what
> was the point of doing all this work?
>
> Jason

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-16  4:41                                       ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-09-16  4:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Hefty, Sean, Tejun Heo, Doug Ledford,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org, Johannes Weiner,
	Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org, Haggai Eran,
	Or Gerlitz, Matan Barak,
	raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Hi Jason, Sean, Tejun,

I am in process of defining new approach, design based on the feedback
given here for new RDMA cgroup from all of you.
I have also collected feedback from Liran yesterday and ORNL folks too.

Soon I will post the new approach, high level APIs and functionality
for review before submitting actual implementation.

Regards,
Parav Pandit

On Tue, Sep 15, 2015 at 9:15 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Tue, Sep 15, 2015 at 08:38:54AM +0530, Parav Pandit wrote:
>
>> As you precisely described, about wild ratio,
>> we are asking vendor driver (bottom most layer) to statically define
>> what the resource pool is, without telling him which application are
>> we going to run to use those pool.
>> Therefore vendor layer cannot ever define "right" resource pool.
>
> No, I'm saying the resource pool is *well defined* and *fixed* by each
> hardware.
>
> The only question is how do we expose the N resource limits, the list
> of which is totally vendor specific.
>


>> rdma cgroup will allow us to run post 512 or 1024 containers without
>> using PCIe SR-IOV, without creating any vendor specific resource
>> pools.
>
> If you ignore any vendor specific resource limits then you've just
> left open a hole, a wayward container can exhaust all others - so what
> was the point of doing all this work?
>
> Jason

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-20 10:35                                       ` Haggai Eran
  0 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-20 10:35 UTC (permalink / raw)
  To: Jason Gunthorpe, Parav Pandit
  Cc: Hefty, Sean, Tejun Heo, Doug Ledford, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-rdma@vger.kernel.org, lizefan@huawei.com, Johannes Weiner,
	Jonathan Corbet, james.l.morris@oracle.com, serge@hallyn.com,
	Or Gerlitz, Matan Barak, raindel@mellanox.com,
	akpm@linux-foundation.org, linux-security-module@vger.kernel.org

On 15/09/2015 06:45, Jason Gunthorpe wrote:
> No, I'm saying the resource pool is *well defined* and *fixed* by each
> hardware.
> 
> The only question is how do we expose the N resource limits, the list
> of which is totally vendor specific.

I don't see why you say the limits are vendor specific. It is true that
different RDMA devices have different implementations and capabilities,
but they all use the expose the same set of RDMA objects with their
limitations. Whether those limitations come from hardware limitations,
from the driver, or just because the address space is limited, they can
still be exhausted.

> Yes, using a % scheme fixes the ratios, 1% is going to be a certain
> number of PD's, QP's, MRs, CQ's, etc at a ratio fixed by the driver
> configuration. That is the trade off for API simplicity.
>
> 
> Yes, this results in some resources being over provisioned.

I agree that such a scheme will be easy to configure, but I don't think
it can work well in all situations. Imagine you want to let one
container use almost all RC QPs as you want it to connect to the entire
cluster through RC. Other containers can still use a single datagram QP
to connect to the entire cluster, but they would require many address
handles. If you force a fixed ratio of resources given to each container
it would be hard to describe such a partitioning.

I think it would be better to expose different controls for the
different RDMA resources.

Regards,
Haggai

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-09-20 10:35                                       ` Haggai Eran
  0 siblings, 0 replies; 95+ messages in thread
From: Haggai Eran @ 2015-09-20 10:35 UTC (permalink / raw)
  To: Jason Gunthorpe, Parav Pandit
  Cc: Hefty, Sean, Tejun Heo, Doug Ledford,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org, Johannes Weiner,
	Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org, Or Gerlitz,
	Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On 15/09/2015 06:45, Jason Gunthorpe wrote:
> No, I'm saying the resource pool is *well defined* and *fixed* by each
> hardware.
> 
> The only question is how do we expose the N resource limits, the list
> of which is totally vendor specific.

I don't see why you say the limits are vendor specific. It is true that
different RDMA devices have different implementations and capabilities,
but they all use the expose the same set of RDMA objects with their
limitations. Whether those limitations come from hardware limitations,
from the driver, or just because the address space is limited, they can
still be exhausted.

> Yes, using a % scheme fixes the ratios, 1% is going to be a certain
> number of PD's, QP's, MRs, CQ's, etc at a ratio fixed by the driver
> configuration. That is the trade off for API simplicity.
>
> 
> Yes, this results in some resources being over provisioned.

I agree that such a scheme will be easy to configure, but I don't think
it can work well in all situations. Imagine you want to let one
container use almost all RC QPs as you want it to connect to the entire
cluster through RC. Other containers can still use a single datagram QP
to connect to the entire cluster, but they would require many address
handles. If you force a fixed ratio of resources given to each container
it would be hard to describe such a partitioning.

I think it would be better to expose different controls for the
different RDMA resources.

Regards,
Haggai

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-10-28  8:14                                         ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-10-28  8:14 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Jason Gunthorpe, Hefty, Sean, Tejun Heo, Doug Ledford,
	cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
	lizefan@huawei.com, Johannes Weiner, Jonathan Corbet,
	james.l.morris@oracle.com, serge@hallyn.com, Or Gerlitz,
	Matan Barak, raindel@mellanox.com, akpm@linux-foundation.org,
	linux-security-module@vger.kernel.org

Hi,

I finally got some chance and progress on redesigning rdma cgroup
controller for the most use cases that we discussed in this email
chain.
I am posting RFC and soon code in new email.

Parav


On Sun, Sep 20, 2015 at 4:05 PM, Haggai Eran <haggaie@mellanox.com> wrote:
> On 15/09/2015 06:45, Jason Gunthorpe wrote:
>> No, I'm saying the resource pool is *well defined* and *fixed* by each
>> hardware.
>>
>> The only question is how do we expose the N resource limits, the list
>> of which is totally vendor specific.
>
> I don't see why you say the limits are vendor specific. It is true that
> different RDMA devices have different implementations and capabilities,
> but they all use the expose the same set of RDMA objects with their
> limitations. Whether those limitations come from hardware limitations,
> from the driver, or just because the address space is limited, they can
> still be exhausted.
>
>> Yes, using a % scheme fixes the ratios, 1% is going to be a certain
>> number of PD's, QP's, MRs, CQ's, etc at a ratio fixed by the driver
>> configuration. That is the trade off for API simplicity.
>>
>>
>> Yes, this results in some resources being over provisioned.
>
> I agree that such a scheme will be easy to configure, but I don't think
> it can work well in all situations. Imagine you want to let one
> container use almost all RC QPs as you want it to connect to the entire
> cluster through RC. Other containers can still use a single datagram QP
> to connect to the entire cluster, but they would require many address
> handles. If you force a fixed ratio of resources given to each container
> it would be hard to describe such a partitioning.
>
> I think it would be better to expose different controls for the
> different RDMA resources.
>
> Regards,
> Haggai

^ permalink raw reply	[flat|nested] 95+ messages in thread

* Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
@ 2015-10-28  8:14                                         ` Parav Pandit
  0 siblings, 0 replies; 95+ messages in thread
From: Parav Pandit @ 2015-10-28  8:14 UTC (permalink / raw)
  To: Haggai Eran
  Cc: Jason Gunthorpe, Hefty, Sean, Tejun Heo, Doug Ledford,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-doc-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org, Johannes Weiner,
	Jonathan Corbet,
	james.l.morris-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
	serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org, Or Gerlitz,
	Matan Barak, raindel-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org,
	linux-security-module-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Hi,

I finally got some chance and progress on redesigning rdma cgroup
controller for the most use cases that we discussed in this email
chain.
I am posting RFC and soon code in new email.

Parav


On Sun, Sep 20, 2015 at 4:05 PM, Haggai Eran <haggaie-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> On 15/09/2015 06:45, Jason Gunthorpe wrote:
>> No, I'm saying the resource pool is *well defined* and *fixed* by each
>> hardware.
>>
>> The only question is how do we expose the N resource limits, the list
>> of which is totally vendor specific.
>
> I don't see why you say the limits are vendor specific. It is true that
> different RDMA devices have different implementations and capabilities,
> but they all use the expose the same set of RDMA objects with their
> limitations. Whether those limitations come from hardware limitations,
> from the driver, or just because the address space is limited, they can
> still be exhausted.
>
>> Yes, using a % scheme fixes the ratios, 1% is going to be a certain
>> number of PD's, QP's, MRs, CQ's, etc at a ratio fixed by the driver
>> configuration. That is the trade off for API simplicity.
>>
>>
>> Yes, this results in some resources being over provisioned.
>
> I agree that such a scheme will be easy to configure, but I don't think
> it can work well in all situations. Imagine you want to let one
> container use almost all RC QPs as you want it to connect to the entire
> cluster through RC. Other containers can still use a single datagram QP
> to connect to the entire cluster, but they would require many address
> handles. If you force a fixed ratio of resources given to each container
> it would be hard to describe such a partitioning.
>
> I think it would be better to expose different controls for the
> different RDMA resources.
>
> Regards,
> Haggai

^ permalink raw reply	[flat|nested] 95+ messages in thread

end of thread, other threads:[~2015-10-28  8:20 UTC | newest]

Thread overview: 95+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-07 20:38 [PATCH 0/7] devcg: device cgroup extension for rdma resource Parav Pandit
2015-09-07 20:38 ` Parav Pandit
2015-09-07 20:38 ` [PATCH 1/7] devcg: Added user option to rdma resource tracking Parav Pandit
2015-09-07 20:38   ` Parav Pandit
2015-09-07 20:38 ` [PATCH 2/7] devcg: Added rdma resource tracking module Parav Pandit
2015-09-07 20:38   ` Parav Pandit
2015-09-07 20:38 ` [PATCH 3/7] devcg: Added infrastructure for rdma device cgroup Parav Pandit
2015-09-08  5:31   ` Haggai Eran
2015-09-08  5:31     ` Haggai Eran
2015-09-08  7:02     ` Parav Pandit
2015-09-08  7:02       ` Parav Pandit
2015-09-07 20:38 ` [PATCH 4/7] devcg: Added rdma resource tracker object per task Parav Pandit
2015-09-08  5:48   ` Haggai Eran
2015-09-08  5:48     ` Haggai Eran
2015-09-08  7:04     ` Parav Pandit
2015-09-08  8:24       ` Haggai Eran
2015-09-08  8:24         ` Haggai Eran
2015-09-08  8:26         ` Parav Pandit
2015-09-07 20:38 ` [PATCH 5/7] devcg: device cgroup's extension for RDMA resource Parav Pandit
2015-09-07 20:38   ` Parav Pandit
2015-09-08  8:22   ` Haggai Eran
2015-09-08  8:22     ` Haggai Eran
2015-09-08 10:18     ` Parav Pandit
2015-09-08 13:50       ` Haggai Eran
2015-09-08 13:50         ` Haggai Eran
2015-09-08 14:13         ` Parav Pandit
2015-09-08  8:36   ` Haggai Eran
2015-09-08  8:36     ` Haggai Eran
2015-09-08 10:50     ` Parav Pandit
2015-09-08 10:50       ` Parav Pandit
2015-09-08 14:10       ` Haggai Eran
2015-09-08 14:10         ` Haggai Eran
2015-09-07 20:38 ` [PATCH 6/7] devcg: Added support to use RDMA device cgroup Parav Pandit
2015-09-08  8:40   ` Haggai Eran
2015-09-08  8:40     ` Haggai Eran
2015-09-08 10:22     ` Parav Pandit
2015-09-08 13:40       ` Haggai Eran
2015-09-08 13:40         ` Haggai Eran
2015-09-07 20:38 ` [PATCH 7/7] devcg: Added Documentation of " Parav Pandit
2015-09-07 20:38   ` Parav Pandit
2015-09-07 20:55 ` [PATCH 0/7] devcg: device cgroup extension for rdma resource Parav Pandit
2015-09-08 12:45 ` Haggai Eran
2015-09-08 12:45   ` Haggai Eran
2015-09-08 15:23 ` Tejun Heo
2015-09-08 15:23   ` Tejun Heo
2015-09-09  3:57   ` Parav Pandit
2015-09-10 16:49     ` Tejun Heo
2015-09-10 17:46       ` Parav Pandit
2015-09-10 17:46         ` Parav Pandit
2015-09-10 20:22         ` Tejun Heo
2015-09-11  3:39           ` Parav Pandit
2015-09-11  4:04             ` Tejun Heo
2015-09-11  4:04               ` Tejun Heo
2015-09-11  4:24               ` Doug Ledford
2015-09-11  4:24                 ` Doug Ledford
2015-09-11 14:52                 ` Tejun Heo
2015-09-11 14:52                   ` Tejun Heo
2015-09-11 16:26                   ` Parav Pandit
2015-09-11 16:34                     ` Tejun Heo
2015-09-11 16:34                       ` Tejun Heo
2015-09-11 16:39                       ` Parav Pandit
2015-09-11 16:39                         ` Parav Pandit
2015-09-11 19:25                         ` Tejun Heo
2015-09-14 10:18                           ` Parav Pandit
2015-09-14 10:18                             ` Parav Pandit
2015-09-11 16:47                   ` Parav Pandit
2015-09-11 16:47                     ` Parav Pandit
2015-09-11 19:05                     ` Tejun Heo
2015-09-11 19:05                       ` Tejun Heo
2015-09-11 19:22                   ` Hefty, Sean
2015-09-11 19:43                     ` Jason Gunthorpe
2015-09-11 19:43                       ` Jason Gunthorpe
2015-09-11 20:06                       ` Hefty, Sean
2015-09-14 11:09                         ` Parav Pandit
2015-09-14 14:04                           ` Parav Pandit
2015-09-14 15:21                             ` Tejun Heo
2015-09-14 15:21                               ` Tejun Heo
2015-09-14 17:28                           ` Jason Gunthorpe
2015-09-14 17:28                             ` Jason Gunthorpe
2015-09-14 18:54                             ` Parav Pandit
2015-09-14 18:54                               ` Parav Pandit
2015-09-14 20:18                               ` Jason Gunthorpe
2015-09-15  3:08                                 ` Parav Pandit
2015-09-15  3:45                                   ` Jason Gunthorpe
2015-09-15  3:45                                     ` Jason Gunthorpe
2015-09-16  4:41                                     ` Parav Pandit
2015-09-16  4:41                                       ` Parav Pandit
2015-09-20 10:35                                     ` Haggai Eran
2015-09-20 10:35                                       ` Haggai Eran
2015-10-28  8:14                                       ` Parav Pandit
2015-10-28  8:14                                         ` Parav Pandit
2015-09-14 10:15                     ` Parav Pandit
2015-09-11  4:43               ` Parav Pandit
2015-09-11 15:03                 ` Tejun Heo
2015-09-10 17:48       ` Hefty, Sean

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.