From: Parav Pandit <pandit.parav@gmail.com>
To: Tejun Heo <tj@kernel.org>
Cc: Doug Ledford <dledford@redhat.com>,
	cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org,
	lizefan@huawei.com, Johannes Weiner <hannes@cmpxchg.org>,
	Jonathan Corbet <corbet@lwn.net>,
	james.l.morris@oracle.com, serge@hallyn.com,
	Haggai Eran <haggaie@mellanox.com>,
	Or Gerlitz <ogerlitz@mellanox.com>,
	Matan Barak <matanb@mellanox.com>,
	raindel@mellanox.com, akpm@linux-foundation.org,
	linux-security-module@vger.kernel.org
Subject: Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
Date: Mon, 14 Sep 2015 15:48:51 +0530	[thread overview]
Message-ID: <CAG53R5VuBHLud0bXQmhHsDZ9oPdaEgK4P9MCKV9ARA0vrqOhsA@mail.gmail.com> (raw)
In-Reply-To: <20150911192517.GU8114@mtj.duckdns.org>

On Sat, Sep 12, 2015 at 12:55 AM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Parav.
>
> On Fri, Sep 11, 2015 at 10:09:48PM +0530, Parav Pandit wrote:
>> > If you're planning on following what the existing memcg did in this
>> > area, it's unlikely to go well.  Would you mind sharing what you have
>> > on mind in the long term?  Where do you see this going?
>>
>> At least the current thought is: a central authority entity monitors the
>> fail count and a new threshold count.
>> Fail count - as with other controllers, indicates how many times a
>> resource allocation failed.
>> Threshold count - indicates how high the usage of this resource has gone.
>> (An application might not be able to poll on thousands of such
>> resource entries.)
>> So based on the fail count and threshold count, it can tune the limits further.
>
> So, regardless of the specific resource in question, implementing
> adaptive resource distribution requires more than simple thresholds
> and failcnts.

Maybe, yes. But it is difficult to go through the whole design and shape it
all up right now.
This is infrastructure being built with a few capabilities.
I see it as a starting point rather than an end point.
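To make that concrete, below is a rough sketch of the kind of monitoring loop
such a central entity could run; the rdma.resource.* file names and the cgroup
path are placeholders for illustration only, not the actual interface of this
patch set.

# Sketch of a userspace agent polling per-cgroup fail count and high-water
# usage, then deciding whether the limit should be tuned up or down.
# The rdma.resource.* file names and the cgroup path are hypothetical.
import time

CGROUP_DIR = "/sys/fs/cgroup/devices/app_group"  # hypothetical cgroup path

def read_counter(name):
    with open(f"{CGROUP_DIR}/{name}") as f:
        return int(f.read().strip())

def monitor(interval=5):
    while True:
        failcnt = read_counter("rdma.resource.failcnt")   # allocation failures
        high = read_counter("rdma.resource.usage_high")   # usage high-water mark
        limit = read_counter("rdma.resource.limit")
        if failcnt > 0:
            print(f"{failcnt} failures; consider raising limit above {limit}")
        elif high < limit // 2:
            print(f"peak usage {high} far below limit {limit}; consider lowering it")
        time.sleep(interval)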

> The very minimum would be a way to exert reclaim
> pressure and then a way to measure how much lack of a given resource
> is affecting the workload.  Maybe it can adaptively lower the limits
> and then watch how often allocation fails but that's highly unlikely
> to be an effective measure as it can't do anything to hoarders and the
> frequency of allocation failure doesn't necessarily correlate with the
> amount of impact the workload is getting (it's not a measure of
> usage).

It can always kill the hoarding process(es) that are holding resources
without using them.
Such processes will eventually get restarted, but they will not be able
to hoard as much, because they have been on the radar for hoarding and
their limits have been reduced.
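As a rough illustration of that policy, an agent that flags a cgroup as a
hoarder could lower its limit and terminate its members so they restart under
the tighter limit; rdma.resource.limit is again a placeholder name, while
cgroup.procs is the standard membership file.

# Sketch: restrict a cgroup flagged as hoarding by lowering its (hypothetical)
# RDMA resource limit and terminating its member processes.
import os
import signal

def restrict_hoarder(cgroup_dir, new_limit):
    # Lower the per-cgroup limit first so restarted processes inherit it.
    with open(os.path.join(cgroup_dir, "rdma.resource.limit"), "w") as f:
        f.write(str(new_limit))
    # cgroup.procs lists the member PIDs; ask each one to terminate.
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        for pid in f:
            os.kill(int(pid), signal.SIGTERM)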

>
> This is what I'm wary about.  The kernel-userland interface here is
> cut pretty low in the stack leaving most of arbitration and management
> logic in the userland, which seems to be what people wanted and that's
> fine, but then you're trying to implement an intelligent resource
> control layer which straddles across kernel and userland with those
> low level primitives which inevitably would increase the required
> interface surface as nobody has enough information.
>
We might be able to get that information as we go along.
Such an arbitration and management layer outside the kernel (instead of
inside it) has more visibility across the multiple systems that make up a
single cluster, where processes are spread across cgroups on each system.
Logic inside the kernel can only manage the processes of a single node
that are using multiple cgroups.
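As a sketch of that split, each node could run a small reporter that collects
per-cgroup counters and hands them to a cluster-wide arbiter (transport and
arbitration logic not shown); the failcnt file name is a placeholder.

# Hypothetical per-node reporter feeding a cluster-wide arbiter.
import os

def collect_node_stats(cgroup_root="/sys/fs/cgroup/devices"):
    stats = {}
    for entry in os.listdir(cgroup_root):
        path = os.path.join(cgroup_root, entry)
        failcnt_file = os.path.join(path, "rdma.resource.failcnt")  # assumed name
        if os.path.isdir(path) and os.path.exists(failcnt_file):
            with open(failcnt_file) as f:
                stats[entry] = int(f.read().strip())
    return stats  # a cluster arbiter would merge these from every node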

> Just to illustrate the point, please think of the alsa interface.  We
> expose hardware capabilities pretty much as-is leaving management and
> multiplexing to userland and there's nothing wrong with it.  It fits
> better that way; however, we don't then go try to implement cgroup
> controller for PCM channels.  To do any high-level resource
> management, you gotta do it where the said resource is actually
> managed and arbitrated.
>
> What's the allocation frequency you're expecting?  It might be better
> to just let allocations themselves go through the agent that you're
> planning.
In that case we might need to build FUSE-style infrastructure.
The frequency of RDMA resource allocation is certainly lower than that of
read/write calls.

> You sure can use cgroup membership to identify who's asking
> tho.  Given how the whole thing is architectured, I'd suggest thinking
> more about how the whole thing should turn out eventually.
>
Yes, I agree.
At this point it is a software solution that provides resource isolation in
a simple manner, with scope to become adaptive in the future.
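For reference, identifying the requesting process's cgroup from userspace is
already straightforward via /proc/<pid>/cgroup; a small sketch of how an
allocation agent could do that lookup (the agent framing itself is
hypothetical):

# Sketch: resolve which cgroup (v1 hierarchy) a requesting process belongs to,
# using the standard /proc/<pid>/cgroup format "<id>:<controllers>:<path>".
def cgroup_of(pid, controller="devices"):
    with open(f"/proc/{pid}/cgroup") as f:
        for line in f:
            _, controllers, path = line.rstrip("\n").split(":", 2)
            if controller in controllers.split(","):
                return path
    return None

# Example: an agent could key its per-cgroup accounting on cgroup_of(pid).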

> Thanks.
>
> --
> tejun
