From: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
To: Parav Pandit <pandit.parav@gmail.com>
Cc: "Hefty, Sean" <sean.hefty@intel.com>, Tejun Heo <tj@kernel.org>,
	Doug Ledford <dledford@redhat.com>,
	"cgroups@vger.kernel.org" <cgroups@vger.kernel.org>,
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	"lizefan@huawei.com" <lizefan@huawei.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Jonathan Corbet <corbet@lwn.net>,
	"james.l.morris@oracle.com" <james.l.morris@oracle.com>,
	"serge@hallyn.com" <serge@hallyn.com>,
	Haggai Eran <haggaie@mellanox.com>,
	Or Gerlitz <ogerlitz@mellanox.com>,
	Matan Barak <matanb@mellanox.com>,
	"raindel@mellanox.com" <raindel@mellanox.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"linux-security-module@vger.kernel.org" 
	<linux-security-module@vger.kernel.org>
Subject: Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
Date: Mon, 14 Sep 2015 11:28:32 -0600	[thread overview]
Message-ID: <20150914172832.GA21652@obsidianresearch.com> (raw)
In-Reply-To: <CAG53R5XsMwnLK7L4q1mQx3_wEJNv1qthOr5TsX0o43kRWaiWrg@mail.gmail.com>

On Mon, Sep 14, 2015 at 04:39:33PM +0530, Parav Pandit wrote:

> 1. How is a % of a resource different from an absolute number? With
> the rest of the cgroup subsystems we define absolute numbers in most
> places, to my knowledge.

There isn't really much choice if the abstraction is a bundle of all
resources. You can't use an absolute number unless every possible
hardware-limited resource is defined, which doesn't seem smart to me
either. It is not abstract enough, and doesn't match our universe of
hardware very well.
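
To illustrate what I mean, here is a rough C sketch - every name in
it is invented for the example, nothing is from the patches:

  #include <stdint.h>

  struct hw_pool {
          uint32_t total;  /* device-reported pool size (QPs, MRs, ...) */
          uint32_t used;   /* allocations charged to this cgroup */
  };

  /* One fractional knob maps onto whatever pool a device actually
   * has: 10% of a 64K-QP pool is ~6553 QPs, 10% of a 1K pool on some
   * other device is ~102. The controller never names the resource. */
  static int pool_try_alloc(struct hw_pool *p, uint32_t percent)
  {
          uint64_t limit = (uint64_t)p->total * percent / 100;

          if (p->used >= limit)
                  return -1;      /* over the fractional limit */
          p->used++;
          return 0;
  }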

> 2. bytes of kernel memory for RDMA structures
> One QP of one vendor might consume X bytes and another's Y bytes. How
> does the application know how much memory to give?

I don't see this distinction being useful at such a fine granularity
where the control side needs to distinguish between 1 and 2 QPs.

The majority use of control groups has been along with containers, to
prevent one container from exhausting resources in a way that impacts
another.

In that use model, limiting each container to N MB of kernel memory
makes it straightforward to reason about resource exhaustion in a
multi-tenant environment. We have other controllers that do this,
just more indirectly (i.e. limiting the number of inotify watches, or
the number of fds, indirectly caps kernel memory consumption).

That is, presumably some fairly small limit like 10MB is enough for
most non-MPI jobs.
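
Sketched the same way (again, made-up names, continuing the example
above), the memory side is just a per-cgroup byte counter that every
RDMA object's kernel allocation is charged against:

  struct rdma_cg {
          uint64_t kmem_limit;    /* e.g. 10MB for a non-MPI container */
          uint64_t kmem_used;
  };

  /* Charge the kernel-side footprint of an object at creation time.
   * A vendor whose QP costs X bytes and one whose QP costs Y bytes
   * both land on the same counter - the cgroup never needs to know
   * the per-object cost in advance. */
  static int rdma_cg_try_charge(struct rdma_cg *cg, uint64_t nbytes)
  {
          if (cg->kmem_used + nbytes > cg->kmem_limit)
                  return -1;      /* container is at its limit */
          cg->kmem_used += nbytes;
          return 0;
  }

  static void rdma_cg_uncharge(struct rdma_cg *cg, uint64_t nbytes)
  {
          cg->kmem_used -= nbytes;
  }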

> An application doing 100 QP allocations, while still within its
> cgroup's memory limit, leaves other applications without any QPs.

No, if the HW has a fixed QP pool then it would hit #1 above. Both are
active at once. For example, you'd say a container cannot use more than
10% of the device's hardware resources, or more than 10MB of kernel
memory.

On an mlx card you'd probably hit the 10% QP limit first. On a qib
card there is no HW QP pool (well, almost - QPNs are always limited),
so you'd hit the memory limit instead.
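
In the same hypothetical terms as the sketches above, "both active at
once" just means an allocation has to pass both checks, and the
hardware decides which one is scarcer:

  static int rdma_cg_try_alloc_qp(struct rdma_cg *cg, struct hw_pool *qps,
                                  uint32_t percent, uint64_t qp_kmem_bytes)
  {
          if (pool_try_alloc(qps, percent))
                  return -1;      /* mlx-style: HW QP pool hit first */
          if (rdma_cg_try_charge(cg, qp_kmem_bytes)) {
                  qps->used--;    /* roll back the pool reservation */
                  return -1;      /* qib-style: memory limit hit first */
          }
          return 0;
  }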

In either case, we don't want to see a container able to exhaust
either all of kernel memory or all of the HW resources to deny other
containers.

If you have a non-container use case in mind I'd be curious to hear
it..

> I don't see the point of a memory-footprint-based scheme, as memory
> limits are already well addressed by the smarter memory controller
> anyway.

I don't think #1 is controlled by another controller. These are
long-lived kernel-side memory allocations made to support RDMA
resource allocation - we certainly have nothing in the rdma layer
that is tracking this stuff.

> If the hardware vendor defines the resource pool without saying
> whether the resource is a QP or an MR, how would the actual
> management/control point decide what should be controlled to what
> limit?

In the kernel, each HW driver has to be involved to declare what its
hardware resource limits are.

In user space, it is just a simple limiter knob to prevent resource
exhaustion.

UAPI-wise, nobody has to care whether the limit is actually # of QPs
or something else.
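
The shape of that split, sketched with the same invented identifiers
as above (none of this exists in the patch set):

  struct rdma_cg_pool_decl {
          const char *name;       /* driver-internal only, never in the UAPI */
          uint32_t total;         /* device-reported pool size */
  };

  /* An mlx driver might declare { "qp", 65536 } and { "mr", 1 << 20 };
   * a qib driver declares little or nothing. Userspace still only
   * ever writes a single percentage, unaware of which pools exist
   * underneath. */
  int rdma_cg_register_device(const struct rdma_cg_pool_decl *pools,
                              unsigned int npools);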

Jason
