From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1751737AbbINUTH (ORCPT ); Mon, 14 Sep 2015 16:19:07 -0400
Received: from quartz.orcorp.ca ([184.70.90.242]:46643 "EHLO quartz.orcorp.ca"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1751061AbbINUTD (ORCPT ); Mon, 14 Sep 2015 16:19:03 -0400
Date: Mon, 14 Sep 2015 14:18:40 -0600
From: Jason Gunthorpe
To: Parav Pandit
Cc: "Hefty, Sean", Tejun Heo, Doug Ledford,
        "cgroups@vger.kernel.org", "linux-doc@vger.kernel.org",
        "linux-kernel@vger.kernel.org", "linux-rdma@vger.kernel.org",
        "lizefan@huawei.com", Johannes Weiner, Jonathan Corbet,
        "james.l.morris@oracle.com", "serge@hallyn.com", Haggai Eran,
        Or Gerlitz, Matan Barak, "raindel@mellanox.com",
        "akpm@linux-foundation.org",
        "linux-security-module@vger.kernel.org"
Subject: Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource
Message-ID: <20150914201840.GA8764@obsidianresearch.com>
References: <20150911040413.GA18850@htj.duckdns.org>
        <55F25781.20308@redhat.com>
        <20150911145213.GQ8114@mtj.duckdns.org>
        <1828884A29C6694DAF28B7E6B8A82373A903A586@ORSMSX109.amr.corp.intel.com>
        <20150911194311.GA18755@obsidianresearch.com>
        <1828884A29C6694DAF28B7E6B8A82373A903A5DB@ORSMSX109.amr.corp.intel.com>
        <20150914172832.GA21652@obsidianresearch.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.23 (2014-03-12)
X-Broken-Reverse-DNS: no host name found for IP address 10.0.0.160
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Sep 15, 2015 at 12:24:41AM +0530, Parav Pandit wrote:
> On Mon, Sep 14, 2015 at 10:58 PM, Jason Gunthorpe wrote:
> > On Mon, Sep 14, 2015 at 04:39:33PM +0530, Parav Pandit wrote:
> >
> >> 1. How does the % of resource, is different than absolute number? With
> >> rest of the cgroups systems we define absolute number at most places
> >> to my knowledge.
> >
> > There isn't really much choice if the abstraction is a bundle of all
> > resources. You can't use an absolute number unless every possible
> > hardware limited resource is defined, which doesn't seem smart to me
> > either.
>
> Absolute number of percentage is representation for a given property.
> That property needs definition. Isn't it?
> How do we say that "Some undefined" resource you give certain amount,
> which user doesn't know about what to administer, or configure.
> It has to be quantifiable entity.

Each vendor can quantify exactly what HW resources their
implementation has and how the above limit impacts their card. There
will be many variations, and IIRC, some vendors have resource pools
not directly related to the standard PD/QP/MR/CQ/AH verbs resources.

> > It is not abstract enough, and doesn't match our universe of
> > hardware very well.
> Why does the user need to know the actual hardware resource limits or
> define hardware based resource.

Because actual hardware resources *ARE* the limit. We cannot abstract
it away. The hardware/driver has real, fixed, immutable limits. No API
abstraction can possibly change that.

The limits are such that there *IS NO* API boundary that can bundle
them into something simpler. There will always be apps that require
wildly different ratios of the basic verbs resources (PD/QP/CQ/AH/MR).

Either we control each and every vendor's limited resource directly
(which is where you started), or we just roll them up into an 'all
resource' bundle and control them indirectly. There just isn't a
mythical third 'better API' choice with the hardware we have today.

> (a) how many number of RDMA connections are allowed instead of QP, or CQ or AH.
> (b) how many data transfer buffers to use.

None of that accurately reflects what the real HW limits actually are.

> > ie Presumably some fairly small limitation like 10MB is enough for
> > most non-MPI jobs.
>
> Container application always write a simple for loop code to take away
> majority of QP with 10MB limit.

No, the HW and kmem limits must work together; the HW limit would
prevent exhaustion outside the container.

> Imagine instead of tcp_bytes or kmem bytes, its "some memory
> resource", how would someone debug/tune a system with abstract knobs?

Well, we have the memcg controller that does track kmem. The subsystem
specific kmem limit is to force fair sharing of the limited kmem
resource within the overall memcg limit. They are complementary.

A fictional rdma_kmem and tcp_kmem would serve very similar purposes.

> > UAPI wise, nobody has to care if the limit is actually # of QPs or
> > something else.
> If we don't care about resource, we cannot tune or limit it. number of
> MRs used by MPI vs rsocket vs accelio is way different.

So? I don't think it is really important to have an exact, precise
limit. The HW pools are pretty big; unless you plan to run tens of
thousands of containers, each with tiny RDMA limits, it is fine to talk
in broader terms (ie 10% of all HW limited resource), which is totally
adequate to hard-prevent runaway or exhaustion scenarios.

Jason