[PATCH 1/3] memcg: collect kmem bypass conditions into __memcg_kmem

All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 1/3] memcg: collect kmem bypass conditions into __memcg_kmem_bypass()
@ 2015-09-13 20:14 Tejun Heo
  2015-09-13 20:14   ` Tejun Heo
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-13 20:14 UTC (permalink / raw)
  To: akpm, hannes, mhocko; +Cc: cgroups, linux-mm, vdavydov, kernel-team

memcg_kmem_newpage_charge() and memcg_kmem_get_cache() are testing the
same series of conditions to decide whether to bypass kmem accounting.
Collect the tests into __memcg_kmem_bypass().

This is pure refactoring.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
Hello,

These three patches are on top of mmotm as of Sep 13th and the two
patches from the following thread.

 http://lkml.kernel.org/g/20150913185940.GA25369@htj.duckdns.org

Thanks.

 include/linux/memcontrol.h |   46 +++++++++++++++++++++------------------------
 1 file changed, 22 insertions(+), 24 deletions(-)

--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -776,20 +776,7 @@ int memcg_charge_kmem(struct mem_cgroup
 		      unsigned long nr_pages);
 void memcg_uncharge_kmem(struct mem_cgroup *memcg, unsigned long nr_pages);
 
-/**
- * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
- * @gfp: the gfp allocation flags.
- * @memcg: a pointer to the memcg this was charged against.
- * @order: allocation order.
- *
- * returns true if the memcg where the current task belongs can hold this
- * allocation.
- *
- * We return true automatically if this allocation is not to be accounted to
- * any memcg.
- */
-static inline bool
-memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
+static inline bool __memcg_kmem_bypass(gfp_t gfp)
 {
 	if (!memcg_kmem_enabled())
 		return true;
@@ -811,6 +798,26 @@ memcg_kmem_newpage_charge(gfp_t gfp, str
 	if (unlikely(fatal_signal_pending(current)))
 		return true;
 
+	return false;
+}
+
+/**
+ * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
+ * @gfp: the gfp allocation flags.
+ * @memcg: a pointer to the memcg this was charged against.
+ * @order: allocation order.
+ *
+ * returns true if the memcg where the current task belongs can hold this
+ * allocation.
+ *
+ * We return true automatically if this allocation is not to be accounted to
+ * any memcg.
+ */
+static inline bool
+memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
+{
+	if (__memcg_kmem_bypass(gfp))
+		return true;
 	return __memcg_kmem_newpage_charge(gfp, memcg, order);
 }
 
@@ -853,17 +860,8 @@ memcg_kmem_commit_charge(struct page *pa
 static __always_inline struct kmem_cache *
 memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 {
-	if (!memcg_kmem_enabled())
-		return cachep;
-	if (gfp & __GFP_NOACCOUNT)
-		return cachep;
-	if (gfp & __GFP_NOFAIL)
+	if (__memcg_kmem_bypass(gfp))
 		return cachep;
-	if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
-		return cachep;
-	if (unlikely(fatal_signal_pending(current)))
-		return cachep;
-
 	return __memcg_kmem_get_cache(cachep);
 }
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 2/3] memcg: ratify and consolidate over-charge handling
@ 2015-09-13 20:14   ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-13 20:14 UTC (permalink / raw)
  To: akpm, hannes, mhocko; +Cc: cgroups, linux-mm, vdavydov, kernel-team

try_charge() is the main charging logic of memcg.  When it hits the
limit but either can't fail the allocation due to __GFP_NOFAIL or the
task is likely to free memory very soon, being OOM killed, has SIGKILL
pending or exiting, it "bypasses" the charge to the root memcg and
returns -EINTR.  While this is one approach which can be taken for
these situations, it has several issues.

* It unnecessarily lies about the reality.  The number itself doesn't
  go over the limit but the actual usage does.  memcg is either forced
  to or actively chooses to go over the limit because that is the
  right behavior under the circumstances, which is completely fine,
  but, if at all avoidable, it shouldn't be misrepresenting what's
  happening by sneaking the charges into the root memcg.

* Despite trying, we already do over-charge.  kmemcg can't deal with
  switching over to the root memcg by the point try_charge() returns
  -EINTR, so it open-codes over-charing.

* It complicates the callers.  Each try_charge() user has to handle
  the weird -EINTR exception.  memcg_charge_kmem() does the manual
  over-charging.  mem_cgroup_do_precharge() performs unnecessary
  uncharging of root memcg, which BTW is inconsistent with what
  memcg_charge_kmem() does.  mem_cgroup_try_charge() needs to switch
  the returned cgroup to the root one.

The reality is that in memcg there are cases where we are forced
and/or willing to go over the limit.  Each such case needs to be
scrutinized and justified but there definitely are situations where
that is the right thing to do.  We alredy do this but with a
superficial and inconsistent disguise which leads to unnecessary
complications.

This patch updates try_charge() so that it over-charges and returns 0
when deemed necessary.  -EINTR return is removed along with all
special case handling in the callers.

While at it, remove the local variable @ret, which was initialized to
zero and never changed, along with done: label which just returned the
always zero @ret.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 mm/memcontrol.c |   69 ++++++++++++++++----------------------------------------
 1 file changed, 20 insertions(+), 49 deletions(-)

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1999,13 +1999,12 @@ static int try_charge(struct mem_cgroup
 	unsigned long nr_reclaimed;
 	bool may_swap = true;
 	bool drained = false;
-	int ret = 0;
 
 	if (mem_cgroup_is_root(memcg))
-		goto done;
+		return 0;
 retry:
 	if (consume_stock(memcg, nr_pages))
-		goto done;
+		return 0;
 
 	if (!do_swap_account ||
 	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
@@ -2033,7 +2032,7 @@ retry:
 	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
 		     fatal_signal_pending(current) ||
 		     current->flags & PF_EXITING))
-		goto bypass;
+		goto force;
 
 	if (unlikely(task_in_memcg_oom(current)))
 		goto nomem;
@@ -2079,10 +2078,10 @@ retry:
 		goto retry;
 
 	if (gfp_mask & __GFP_NOFAIL)
-		goto bypass;
+		goto force;
 
 	if (fatal_signal_pending(current))
-		goto bypass;
+		goto force;
 
 	mem_cgroup_events(mem_over_limit, MEMCG_OOM, 1);
 
@@ -2090,8 +2089,18 @@ retry:
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
 		return -ENOMEM;
-bypass:
-	return -EINTR;
+force:
+	/*
+	 * The allocation either can't fail or will lead to more memory
+	 * being freed very soon.  Allow memory usage go over the limit
+	 * temporarily by force charging it.
+	 */
+	page_counter_charge(&memcg->memory, nr_pages);
+	if (do_swap_account)
+		page_counter_charge(&memcg->memsw, nr_pages);
+	css_get_many(&memcg->css, nr_pages);
+
+	return 0;
 
 done_restock:
 	css_get_many(&memcg->css, batch);
@@ -2114,8 +2123,8 @@ done_restock:
 			break;
 		}
 	} while ((memcg = parent_mem_cgroup(memcg)));
-done:
-	return ret;
+
+	return 0;
 }
 
 static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
@@ -2207,28 +2216,7 @@ int memcg_charge_kmem(struct mem_cgroup
 		return ret;
 
 	ret = try_charge(memcg, gfp, nr_pages);
-	if (ret == -EINTR)  {
-		/*
-		 * try_charge() chose to bypass to root due to OOM kill or
-		 * fatal signal.  Since our only options are to either fail
-		 * the allocation or charge it to this cgroup, do it as a
-		 * temporary condition. But we can't fail. From a kmem/slab
-		 * perspective, the cache has already been selected, by
-		 * mem_cgroup_kmem_get_cache(), so it is too late to change
-		 * our minds.
-		 *
-		 * This condition will only trigger if the task entered
-		 * memcg_charge_kmem in a sane state, but was OOM-killed
-		 * during try_charge() above. Tasks that were already dying
-		 * when the allocation triggers should have been already
-		 * directed to the root cgroup in memcontrol.h
-		 */
-		page_counter_charge(&memcg->memory, nr_pages);
-		if (do_swap_account)
-			page_counter_charge(&memcg->memsw, nr_pages);
-		css_get_many(&memcg->css, nr_pages);
-		ret = 0;
-	} else if (ret)
+	if (ret)
 		page_counter_uncharge(&memcg->kmem, nr_pages);
 
 	return ret;
@@ -4433,22 +4421,10 @@ static int mem_cgroup_do_precharge(unsig
 		mc.precharge += count;
 		return ret;
 	}
-	if (ret == -EINTR) {
-		cancel_charge(root_mem_cgroup, count);
-		return ret;
-	}
 
 	/* Try charges one by one with reclaim */
 	while (count--) {
 		ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_NORETRY, 1);
-		/*
-		 * In case of failure, any residual charges against
-		 * mc.to will be dropped by mem_cgroup_clear_mc()
-		 * later on.  However, cancel any charges that are
-		 * bypassed to root right away or they'll be lost.
-		 */
-		if (ret == -EINTR)
-			cancel_charge(root_mem_cgroup, 1);
 		if (ret)
 			return ret;
 		mc.precharge++;
@@ -5353,11 +5329,6 @@ int mem_cgroup_try_charge(struct page *p
 	ret = try_charge(memcg, gfp_mask, nr_pages);
 
 	css_put(&memcg->css);
-
-	if (ret == -EINTR) {
-		memcg = root_mem_cgroup;
-		ret = 0;
-	}
 out:
 	*memcgp = memcg;
 	return ret;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 2/3] memcg: ratify and consolidate over-charge handling
@ 2015-09-13 20:14   ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-13 20:14 UTC (permalink / raw)
  To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	vdavydov-bzQdu9zFT3WakBO8gow8eQ, kernel-team-b10kYP2dOMg

try_charge() is the main charging logic of memcg.  When it hits the
limit but either can't fail the allocation due to __GFP_NOFAIL or the
task is likely to free memory very soon, being OOM killed, has SIGKILL
pending or exiting, it "bypasses" the charge to the root memcg and
returns -EINTR.  While this is one approach which can be taken for
these situations, it has several issues.

* It unnecessarily lies about the reality.  The number itself doesn't
  go over the limit but the actual usage does.  memcg is either forced
  to or actively chooses to go over the limit because that is the
  right behavior under the circumstances, which is completely fine,
  but, if at all avoidable, it shouldn't be misrepresenting what's
  happening by sneaking the charges into the root memcg.

* Despite trying, we already do over-charge.  kmemcg can't deal with
  switching over to the root memcg by the point try_charge() returns
  -EINTR, so it open-codes over-charing.

* It complicates the callers.  Each try_charge() user has to handle
  the weird -EINTR exception.  memcg_charge_kmem() does the manual
  over-charging.  mem_cgroup_do_precharge() performs unnecessary
  uncharging of root memcg, which BTW is inconsistent with what
  memcg_charge_kmem() does.  mem_cgroup_try_charge() needs to switch
  the returned cgroup to the root one.

The reality is that in memcg there are cases where we are forced
and/or willing to go over the limit.  Each such case needs to be
scrutinized and justified but there definitely are situations where
that is the right thing to do.  We alredy do this but with a
superficial and inconsistent disguise which leads to unnecessary
complications.

This patch updates try_charge() so that it over-charges and returns 0
when deemed necessary.  -EINTR return is removed along with all
special case handling in the callers.

While at it, remove the local variable @ret, which was initialized to
zero and never changed, along with done: label which just returned the
always zero @ret.

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 mm/memcontrol.c |   69 ++++++++++++++++----------------------------------------
 1 file changed, 20 insertions(+), 49 deletions(-)

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1999,13 +1999,12 @@ static int try_charge(struct mem_cgroup
 	unsigned long nr_reclaimed;
 	bool may_swap = true;
 	bool drained = false;
-	int ret = 0;
 
 	if (mem_cgroup_is_root(memcg))
-		goto done;
+		return 0;
 retry:
 	if (consume_stock(memcg, nr_pages))
-		goto done;
+		return 0;
 
 	if (!do_swap_account ||
 	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
@@ -2033,7 +2032,7 @@ retry:
 	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
 		     fatal_signal_pending(current) ||
 		     current->flags & PF_EXITING))
-		goto bypass;
+		goto force;
 
 	if (unlikely(task_in_memcg_oom(current)))
 		goto nomem;
@@ -2079,10 +2078,10 @@ retry:
 		goto retry;
 
 	if (gfp_mask & __GFP_NOFAIL)
-		goto bypass;
+		goto force;
 
 	if (fatal_signal_pending(current))
-		goto bypass;
+		goto force;
 
 	mem_cgroup_events(mem_over_limit, MEMCG_OOM, 1);
 
@@ -2090,8 +2089,18 @@ retry:
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
 		return -ENOMEM;
-bypass:
-	return -EINTR;
+force:
+	/*
+	 * The allocation either can't fail or will lead to more memory
+	 * being freed very soon.  Allow memory usage go over the limit
+	 * temporarily by force charging it.
+	 */
+	page_counter_charge(&memcg->memory, nr_pages);
+	if (do_swap_account)
+		page_counter_charge(&memcg->memsw, nr_pages);
+	css_get_many(&memcg->css, nr_pages);
+
+	return 0;
 
 done_restock:
 	css_get_many(&memcg->css, batch);
@@ -2114,8 +2123,8 @@ done_restock:
 			break;
 		}
 	} while ((memcg = parent_mem_cgroup(memcg)));
-done:
-	return ret;
+
+	return 0;
 }
 
 static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
@@ -2207,28 +2216,7 @@ int memcg_charge_kmem(struct mem_cgroup
 		return ret;
 
 	ret = try_charge(memcg, gfp, nr_pages);
-	if (ret == -EINTR)  {
-		/*
-		 * try_charge() chose to bypass to root due to OOM kill or
-		 * fatal signal.  Since our only options are to either fail
-		 * the allocation or charge it to this cgroup, do it as a
-		 * temporary condition. But we can't fail. From a kmem/slab
-		 * perspective, the cache has already been selected, by
-		 * mem_cgroup_kmem_get_cache(), so it is too late to change
-		 * our minds.
-		 *
-		 * This condition will only trigger if the task entered
-		 * memcg_charge_kmem in a sane state, but was OOM-killed
-		 * during try_charge() above. Tasks that were already dying
-		 * when the allocation triggers should have been already
-		 * directed to the root cgroup in memcontrol.h
-		 */
-		page_counter_charge(&memcg->memory, nr_pages);
-		if (do_swap_account)
-			page_counter_charge(&memcg->memsw, nr_pages);
-		css_get_many(&memcg->css, nr_pages);
-		ret = 0;
-	} else if (ret)
+	if (ret)
 		page_counter_uncharge(&memcg->kmem, nr_pages);
 
 	return ret;
@@ -4433,22 +4421,10 @@ static int mem_cgroup_do_precharge(unsig
 		mc.precharge += count;
 		return ret;
 	}
-	if (ret == -EINTR) {
-		cancel_charge(root_mem_cgroup, count);
-		return ret;
-	}
 
 	/* Try charges one by one with reclaim */
 	while (count--) {
 		ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_NORETRY, 1);
-		/*
-		 * In case of failure, any residual charges against
-		 * mc.to will be dropped by mem_cgroup_clear_mc()
-		 * later on.  However, cancel any charges that are
-		 * bypassed to root right away or they'll be lost.
-		 */
-		if (ret == -EINTR)
-			cancel_charge(root_mem_cgroup, 1);
 		if (ret)
 			return ret;
 		mc.precharge++;
@@ -5353,11 +5329,6 @@ int mem_cgroup_try_charge(struct page *p
 	ret = try_charge(memcg, gfp_mask, nr_pages);
 
 	css_put(&memcg->css);
-
-	if (ret == -EINTR) {
-		memcg = root_mem_cgroup;
-		ret = 0;
-	}
 out:
 	*memcgp = memcg;
 	return ret;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 3/3] memcg: drop unnecessary cold-path tests from __memcg_kmem_bypass()
@ 2015-09-13 20:15     ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-13 20:15 UTC (permalink / raw)
  To: akpm, hannes, mhocko; +Cc: cgroups, linux-mm, vdavydov, kernel-team

__memcg_kmem_bypass() decides whether a kmem allocation should be
bypassed to the root memcg.  Some conditions that it tests are valid
criteria regarding who should be held accountable; however, there are
a couple unnecessary tests for cold paths - __GFP_FAIL and
fatal_signal_pending().

The previous patch updated try_charge() to handle both __GFP_FAIL and
dying tasks correctly and the only thing these two tests are doing is
making accounting less accurate and sprinkling tests for cold path
conditions in the hot paths.  There's nothing meaningful gained by
these extra tests.

This patch removes the two unnecessary tests from
__memcg_kmem_bypass().

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/memcontrol.h |   14 --------------
 1 file changed, 14 deletions(-)

--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -780,24 +780,10 @@ static inline bool __memcg_kmem_bypass(g
 {
 	if (!memcg_kmem_enabled())
 		return true;
-
 	if (gfp & __GFP_NOACCOUNT)
 		return true;
-	/*
-	 * __GFP_NOFAIL allocations will move on even if charging is not
-	 * possible. Therefore we don't even try, and have this allocation
-	 * unaccounted. We could in theory charge it forcibly, but we hope
-	 * those allocations are rare, and won't be worth the trouble.
-	 */
-	if (gfp & __GFP_NOFAIL)
-		return true;
 	if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
 		return true;
-
-	/* If the test is dying, just let it go. */
-	if (unlikely(fatal_signal_pending(current)))
-		return true;
-
 	return false;
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 3/3] memcg: drop unnecessary cold-path tests from __memcg_kmem_bypass()
@ 2015-09-13 20:15     ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-13 20:15 UTC (permalink / raw)
  To: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	vdavydov-bzQdu9zFT3WakBO8gow8eQ, kernel-team-b10kYP2dOMg

__memcg_kmem_bypass() decides whether a kmem allocation should be
bypassed to the root memcg.  Some conditions that it tests are valid
criteria regarding who should be held accountable; however, there are
a couple unnecessary tests for cold paths - __GFP_FAIL and
fatal_signal_pending().

The previous patch updated try_charge() to handle both __GFP_FAIL and
dying tasks correctly and the only thing these two tests are doing is
making accounting less accurate and sprinkling tests for cold path
conditions in the hot paths.  There's nothing meaningful gained by
these extra tests.

This patch removes the two unnecessary tests from
__memcg_kmem_bypass().

Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
---
 include/linux/memcontrol.h |   14 --------------
 1 file changed, 14 deletions(-)

--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -780,24 +780,10 @@ static inline bool __memcg_kmem_bypass(g
 {
 	if (!memcg_kmem_enabled())
 		return true;
-
 	if (gfp & __GFP_NOACCOUNT)
 		return true;
-	/*
-	 * __GFP_NOFAIL allocations will move on even if charging is not
-	 * possible. Therefore we don't even try, and have this allocation
-	 * unaccounted. We could in theory charge it forcibly, but we hope
-	 * those allocations are rare, and won't be worth the trouble.
-	 */
-	if (gfp & __GFP_NOFAIL)
-		return true;
 	if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
 		return true;
-
-	/* If the test is dying, just let it go. */
-	if (unlikely(fatal_signal_pending(current)))
-		return true;
-
 	return false;
 }

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] memcg: collect kmem bypass conditions into __memcg_kmem_bypass()
@ 2015-09-14  9:03   ` Vladimir Davydov
  0 siblings, 0 replies; 24+ messages in thread
From: Vladimir Davydov @ 2015-09-14  9:03 UTC (permalink / raw)
  To: Tejun Heo; +Cc: akpm, hannes, mhocko, cgroups, linux-mm, kernel-team

On Sun, Sep 13, 2015 at 04:14:16PM -0400, Tejun Heo wrote:
> memcg_kmem_newpage_charge() and memcg_kmem_get_cache() are testing the
> same series of conditions to decide whether to bypass kmem accounting.
> Collect the tests into __memcg_kmem_bypass().
> 
> This is pure refactoring.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] memcg: collect kmem bypass conditions into __memcg_kmem_bypass()
@ 2015-09-14  9:03   ` Vladimir Davydov
  0 siblings, 0 replies; 24+ messages in thread
From: Vladimir Davydov @ 2015-09-14  9:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kernel-team-b10kYP2dOMg

On Sun, Sep 13, 2015 at 04:14:16PM -0400, Tejun Heo wrote:
> memcg_kmem_newpage_charge() and memcg_kmem_get_cache() are testing the
> same series of conditions to decide whether to bypass kmem accounting.
> Collect the tests into __memcg_kmem_bypass().
> 
> This is pure refactoring.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Reviewed-by: Vladimir Davydov <vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] memcg: ratify and consolidate over-charge handling
@ 2015-09-14 12:44     ` Vladimir Davydov
  0 siblings, 0 replies; 24+ messages in thread
From: Vladimir Davydov @ 2015-09-14 12:44 UTC (permalink / raw)
  To: Tejun Heo; +Cc: akpm, hannes, mhocko, cgroups, linux-mm, kernel-team

On Sun, Sep 13, 2015 at 04:14:42PM -0400, Tejun Heo wrote:
> try_charge() is the main charging logic of memcg.  When it hits the
> limit but either can't fail the allocation due to __GFP_NOFAIL or the
> task is likely to free memory very soon, being OOM killed, has SIGKILL
> pending or exiting, it "bypasses" the charge to the root memcg and
> returns -EINTR.  While this is one approach which can be taken for
> these situations, it has several issues.
> 
> * It unnecessarily lies about the reality.  The number itself doesn't
>   go over the limit but the actual usage does.  memcg is either forced
>   to or actively chooses to go over the limit because that is the
>   right behavior under the circumstances, which is completely fine,
>   but, if at all avoidable, it shouldn't be misrepresenting what's
>   happening by sneaking the charges into the root memcg.
> 
> * Despite trying, we already do over-charge.  kmemcg can't deal with
>   switching over to the root memcg by the point try_charge() returns
>   -EINTR, so it open-codes over-charing.
> 
> * It complicates the callers.  Each try_charge() user has to handle
>   the weird -EINTR exception.  memcg_charge_kmem() does the manual
>   over-charging.  mem_cgroup_do_precharge() performs unnecessary
>   uncharging of root memcg, which BTW is inconsistent with what

Hmm, cancel_charge(root_mem_cgroup) is a no-op. Looks like this is a
leftover from the times when we did charge root_mem_cgroup.

Anyway, the rationale makes sense to me, and the patch looks good.

Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>

>   memcg_charge_kmem() does.  mem_cgroup_try_charge() needs to switch
>   the returned cgroup to the root one.
> 
> The reality is that in memcg there are cases where we are forced
> and/or willing to go over the limit.  Each such case needs to be
> scrutinized and justified but there definitely are situations where
> that is the right thing to do.  We alredy do this but with a
> superficial and inconsistent disguise which leads to unnecessary
> complications.
> 
> This patch updates try_charge() so that it over-charges and returns 0
> when deemed necessary.  -EINTR return is removed along with all
> special case handling in the callers.
> 
> While at it, remove the local variable @ret, which was initialized to
> zero and never changed, along with done: label which just returned the
> always zero @ret.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
...

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] memcg: ratify and consolidate over-charge handling
@ 2015-09-14 12:44     ` Vladimir Davydov
  0 siblings, 0 replies; 24+ messages in thread
From: Vladimir Davydov @ 2015-09-14 12:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kernel-team-b10kYP2dOMg

On Sun, Sep 13, 2015 at 04:14:42PM -0400, Tejun Heo wrote:
> try_charge() is the main charging logic of memcg.  When it hits the
> limit but either can't fail the allocation due to __GFP_NOFAIL or the
> task is likely to free memory very soon, being OOM killed, has SIGKILL
> pending or exiting, it "bypasses" the charge to the root memcg and
> returns -EINTR.  While this is one approach which can be taken for
> these situations, it has several issues.
> 
> * It unnecessarily lies about the reality.  The number itself doesn't
>   go over the limit but the actual usage does.  memcg is either forced
>   to or actively chooses to go over the limit because that is the
>   right behavior under the circumstances, which is completely fine,
>   but, if at all avoidable, it shouldn't be misrepresenting what's
>   happening by sneaking the charges into the root memcg.
> 
> * Despite trying, we already do over-charge.  kmemcg can't deal with
>   switching over to the root memcg by the point try_charge() returns
>   -EINTR, so it open-codes over-charing.
> 
> * It complicates the callers.  Each try_charge() user has to handle
>   the weird -EINTR exception.  memcg_charge_kmem() does the manual
>   over-charging.  mem_cgroup_do_precharge() performs unnecessary
>   uncharging of root memcg, which BTW is inconsistent with what

Hmm, cancel_charge(root_mem_cgroup) is a no-op. Looks like this is a
leftover from the times when we did charge root_mem_cgroup.

Anyway, the rationale makes sense to me, and the patch looks good.

Reviewed-by: Vladimir Davydov <vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

>   memcg_charge_kmem() does.  mem_cgroup_try_charge() needs to switch
>   the returned cgroup to the root one.
> 
> The reality is that in memcg there are cases where we are forced
> and/or willing to go over the limit.  Each such case needs to be
> scrutinized and justified but there definitely are situations where
> that is the right thing to do.  We alredy do this but with a
> superficial and inconsistent disguise which leads to unnecessary
> complications.
> 
> This patch updates try_charge() so that it over-charges and returns 0
> when deemed necessary.  -EINTR return is removed along with all
> special case handling in the callers.
> 
> While at it, remove the local variable @ret, which was initialized to
> zero and never changed, along with done: label which just returned the
> always zero @ret.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] memcg: drop unnecessary cold-path tests from __memcg_kmem_bypass()
@ 2015-09-14 12:51       ` Vladimir Davydov
  0 siblings, 0 replies; 24+ messages in thread
From: Vladimir Davydov @ 2015-09-14 12:51 UTC (permalink / raw)
  To: Tejun Heo; +Cc: akpm, hannes, mhocko, cgroups, linux-mm, kernel-team

On Sun, Sep 13, 2015 at 04:15:09PM -0400, Tejun Heo wrote:
> __memcg_kmem_bypass() decides whether a kmem allocation should be
> bypassed to the root memcg.  Some conditions that it tests are valid
> criteria regarding who should be held accountable; however, there are
> a couple unnecessary tests for cold paths - __GFP_FAIL and
> fatal_signal_pending().
> 
> The previous patch updated try_charge() to handle both __GFP_FAIL and
> dying tasks correctly and the only thing these two tests are doing is
> making accounting less accurate and sprinkling tests for cold path
> conditions in the hot paths.  There's nothing meaningful gained by
> these extra tests.
> 
> This patch removes the two unnecessary tests from
> __memcg_kmem_bypass().
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] memcg: drop unnecessary cold-path tests from __memcg_kmem_bypass()
@ 2015-09-14 12:51       ` Vladimir Davydov
  0 siblings, 0 replies; 24+ messages in thread
From: Vladimir Davydov @ 2015-09-14 12:51 UTC (permalink / raw)
  To: Tejun Heo
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kernel-team-b10kYP2dOMg

On Sun, Sep 13, 2015 at 04:15:09PM -0400, Tejun Heo wrote:
> __memcg_kmem_bypass() decides whether a kmem allocation should be
> bypassed to the root memcg.  Some conditions that it tests are valid
> criteria regarding who should be held accountable; however, there are
> a couple unnecessary tests for cold paths - __GFP_FAIL and
> fatal_signal_pending().
> 
> The previous patch updated try_charge() to handle both __GFP_FAIL and
> dying tasks correctly and the only thing these two tests are doing is
> making accounting less accurate and sprinkling tests for cold path
> conditions in the hot paths.  There's nothing meaningful gained by
> these extra tests.
> 
> This patch removes the two unnecessary tests from
> __memcg_kmem_bypass().
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Reviewed-by: Vladimir Davydov <vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/3] memcg: collect kmem bypass conditions into __memcg_kmem_bypass()
  2015-09-13 20:14 [PATCH 1/3] memcg: collect kmem bypass conditions into __memcg_kmem_bypass() Tejun Heo
  2015-09-13 20:14   ` Tejun Heo
  2015-09-14  9:03   ` Vladimir Davydov
@ 2015-09-14 15:21 ` Michal Hocko
  2 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2015-09-14 15:21 UTC (permalink / raw)
  To: Tejun Heo; +Cc: akpm, hannes, cgroups, linux-mm, vdavydov, kernel-team

On Sun 13-09-15 16:14:16, Tejun Heo wrote:
> memcg_kmem_newpage_charge() and memcg_kmem_get_cache() are testing the
> same series of conditions to decide whether to bypass kmem accounting.
> Collect the tests into __memcg_kmem_bypass().
> 
> This is pure refactoring.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
> Hello,
> 
> These three patches are on top of mmotm as of Sep 13th and the two
> patches from the following thread.
> 
>  http://lkml.kernel.org/g/20150913185940.GA25369@htj.duckdns.org
> 
> Thanks.
> 
>  include/linux/memcontrol.h |   46 +++++++++++++++++++++------------------------
>  1 file changed, 22 insertions(+), 24 deletions(-)
> 
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -776,20 +776,7 @@ int memcg_charge_kmem(struct mem_cgroup
>  		      unsigned long nr_pages);
>  void memcg_uncharge_kmem(struct mem_cgroup *memcg, unsigned long nr_pages);
>  
> -/**
> - * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
> - * @gfp: the gfp allocation flags.
> - * @memcg: a pointer to the memcg this was charged against.
> - * @order: allocation order.
> - *
> - * returns true if the memcg where the current task belongs can hold this
> - * allocation.
> - *
> - * We return true automatically if this allocation is not to be accounted to
> - * any memcg.
> - */
> -static inline bool
> -memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
> +static inline bool __memcg_kmem_bypass(gfp_t gfp)
>  {
>  	if (!memcg_kmem_enabled())
>  		return true;
> @@ -811,6 +798,26 @@ memcg_kmem_newpage_charge(gfp_t gfp, str
>  	if (unlikely(fatal_signal_pending(current)))
>  		return true;
>  
> +	return false;
> +}
> +
> +/**
> + * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
> + * @gfp: the gfp allocation flags.
> + * @memcg: a pointer to the memcg this was charged against.
> + * @order: allocation order.
> + *
> + * returns true if the memcg where the current task belongs can hold this
> + * allocation.
> + *
> + * We return true automatically if this allocation is not to be accounted to
> + * any memcg.
> + */
> +static inline bool
> +memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
> +{
> +	if (__memcg_kmem_bypass(gfp))
> +		return true;
>  	return __memcg_kmem_newpage_charge(gfp, memcg, order);
>  }
>  
> @@ -853,17 +860,8 @@ memcg_kmem_commit_charge(struct page *pa
>  static __always_inline struct kmem_cache *
>  memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
>  {
> -	if (!memcg_kmem_enabled())
> -		return cachep;
> -	if (gfp & __GFP_NOACCOUNT)
> -		return cachep;
> -	if (gfp & __GFP_NOFAIL)
> +	if (__memcg_kmem_bypass(gfp))
>  		return cachep;
> -	if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
> -		return cachep;
> -	if (unlikely(fatal_signal_pending(current)))
> -		return cachep;
> -
>  	return __memcg_kmem_get_cache(cachep);
>  }
>  

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] memcg: ratify and consolidate over-charge handling
  2015-09-14 12:44     ` Vladimir Davydov
@ 2015-09-14 15:51       ` Tejun Heo
  -1 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-14 15:51 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: akpm, hannes, mhocko, cgroups, linux-mm, kernel-team

Hello, Vladmir.

On Mon, Sep 14, 2015 at 03:44:20PM +0300, Vladimir Davydov wrote:
> Hmm, cancel_charge(root_mem_cgroup) is a no-op. Looks like this is a
> leftover from the times when we did charge root_mem_cgroup.

Yeap, it's inconsistent but not broken.  Will not that in the
description.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] memcg: ratify and consolidate over-charge handling
@ 2015-09-14 15:51       ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-14 15:51 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	hannes-druUgvl0LCNAfugRpC6u6w, mhocko-DgEjT+Ai2ygdnm+yROfE0A,
	cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	kernel-team-b10kYP2dOMg

Hello, Vladmir.

On Mon, Sep 14, 2015 at 03:44:20PM +0300, Vladimir Davydov wrote:
> Hmm, cancel_charge(root_mem_cgroup) is a no-op. Looks like this is a
> leftover from the times when we did charge root_mem_cgroup.

Yeap, it's inconsistent but not broken.  Will not that in the
description.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] memcg: ratify and consolidate over-charge handling
  2015-09-13 20:14   ` Tejun Heo
                     ` (2 preceding siblings ...)
  (?)
@ 2015-09-14 19:32   ` Michal Hocko
  2015-09-14 19:56       ` Tejun Heo
  -1 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2015-09-14 19:32 UTC (permalink / raw)
  To: Tejun Heo; +Cc: akpm, hannes, cgroups, linux-mm, vdavydov, kernel-team

On Sun 13-09-15 16:14:42, Tejun Heo wrote:
> try_charge() is the main charging logic of memcg.  When it hits the
> limit but either can't fail the allocation due to __GFP_NOFAIL or the
> task is likely to free memory very soon, being OOM killed, has SIGKILL
> pending or exiting, it "bypasses" the charge to the root memcg and
> returns -EINTR.  While this is one approach which can be taken for
> these situations, it has several issues.
> 
> * It unnecessarily lies about the reality.  The number itself doesn't
>   go over the limit but the actual usage does.  memcg is either forced
>   to or actively chooses to go over the limit because that is the
>   right behavior under the circumstances, which is completely fine,
>   but, if at all avoidable, it shouldn't be misrepresenting what's
>   happening by sneaking the charges into the root memcg.
> 
> * Despite trying, we already do over-charge.  kmemcg can't deal with
>   switching over to the root memcg by the point try_charge() returns
>   -EINTR, so it open-codes over-charing.
> 
> * It complicates the callers.  Each try_charge() user has to handle
>   the weird -EINTR exception.  memcg_charge_kmem() does the manual
>   over-charging.  mem_cgroup_do_precharge() performs unnecessary
>   uncharging of root memcg, which BTW is inconsistent with what
>   memcg_charge_kmem() does. 

This is a left over from ce00a967377b ("mm: memcontrol: revert use of
root_mem_cgroup res_counter")

>   mem_cgroup_try_charge() needs to switch
>   the returned cgroup to the root one.
> 
> The reality is that in memcg there are cases where we are forced
> and/or willing to go over the limit.  Each such case needs to be
> scrutinized and justified but there definitely are situations where
> that is the right thing to do.  We alredy do this but with a
> superficial and inconsistent disguise which leads to unnecessary
> complications.
>
> This patch updates try_charge() so that it over-charges and returns 0
> when deemed necessary.  -EINTR return is removed along with all
> special case handling in the callers.

OK the code is easier in the end, although I would argue that try_charge
could return ENOMEM for GFP_NOWAIT instead of overcharging (this would
e.g. allow precharge to bail out earlier). Something for a separate patch I
guess.

Anyway I still do not like usage > max/hard limit presented to userspace
because it looks like a clear breaking of max/hard limit semantic. I
realize that we cannot solve the underlying problem easily or it might
be unfeasible but we should consider how to present this state to the
userspace.
We have basically 2 options AFAICS. We can either document that a
_temporal_ breach of the max/hard limit is allowed or we can hide this
fact and always present max(current,max).
The first one might be better for an easier debugging and it is also
more honest about the current state but the definition of the hard limit
is a bit weird. It also exposes implementation details to the userspace.
The other choice is clearly lying but users shouldn't care about the
implementation details and if the state is really temporal then the
userspace shouldn't even notice. There is also a risk that somebody is
already depending on current < max which happened to work without kmem
until now.

This is something to be solved in a separate patch I guess but we
should think about that. I am not entirely clear on that myself but I am
more inclined to the first option and simply document the potential
corner case and temporal breach.

> While at it, remove the local variable @ret, which was initialized to
> zero and never changed, along with done: label which just returned the
> always zero @ret.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c |   69 ++++++++++++++++----------------------------------------
>  1 file changed, 20 insertions(+), 49 deletions(-)
> 
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1999,13 +1999,12 @@ static int try_charge(struct mem_cgroup
>  	unsigned long nr_reclaimed;
>  	bool may_swap = true;
>  	bool drained = false;
> -	int ret = 0;
>  
>  	if (mem_cgroup_is_root(memcg))
> -		goto done;
> +		return 0;
>  retry:
>  	if (consume_stock(memcg, nr_pages))
> -		goto done;
> +		return 0;
>  
>  	if (!do_swap_account ||
>  	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
> @@ -2033,7 +2032,7 @@ retry:
>  	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
>  		     fatal_signal_pending(current) ||
>  		     current->flags & PF_EXITING))
> -		goto bypass;
> +		goto force;
>  
>  	if (unlikely(task_in_memcg_oom(current)))
>  		goto nomem;
> @@ -2079,10 +2078,10 @@ retry:
>  		goto retry;
>  
>  	if (gfp_mask & __GFP_NOFAIL)
> -		goto bypass;
> +		goto force;
>  
>  	if (fatal_signal_pending(current))
> -		goto bypass;
> +		goto force;
>  
>  	mem_cgroup_events(mem_over_limit, MEMCG_OOM, 1);
>  
> @@ -2090,8 +2089,18 @@ retry:
>  nomem:
>  	if (!(gfp_mask & __GFP_NOFAIL))
>  		return -ENOMEM;
> -bypass:
> -	return -EINTR;
> +force:
> +	/*
> +	 * The allocation either can't fail or will lead to more memory
> +	 * being freed very soon.  Allow memory usage go over the limit
> +	 * temporarily by force charging it.
> +	 */
> +	page_counter_charge(&memcg->memory, nr_pages);
> +	if (do_swap_account)
> +		page_counter_charge(&memcg->memsw, nr_pages);
> +	css_get_many(&memcg->css, nr_pages);
> +
> +	return 0;
>  
>  done_restock:
>  	css_get_many(&memcg->css, batch);
> @@ -2114,8 +2123,8 @@ done_restock:
>  			break;
>  		}
>  	} while ((memcg = parent_mem_cgroup(memcg)));
> -done:
> -	return ret;
> +
> +	return 0;
>  }
>  
>  static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
> @@ -2207,28 +2216,7 @@ int memcg_charge_kmem(struct mem_cgroup
>  		return ret;
>  
>  	ret = try_charge(memcg, gfp, nr_pages);
> -	if (ret == -EINTR)  {
> -		/*
> -		 * try_charge() chose to bypass to root due to OOM kill or
> -		 * fatal signal.  Since our only options are to either fail
> -		 * the allocation or charge it to this cgroup, do it as a
> -		 * temporary condition. But we can't fail. From a kmem/slab
> -		 * perspective, the cache has already been selected, by
> -		 * mem_cgroup_kmem_get_cache(), so it is too late to change
> -		 * our minds.
> -		 *
> -		 * This condition will only trigger if the task entered
> -		 * memcg_charge_kmem in a sane state, but was OOM-killed
> -		 * during try_charge() above. Tasks that were already dying
> -		 * when the allocation triggers should have been already
> -		 * directed to the root cgroup in memcontrol.h
> -		 */
> -		page_counter_charge(&memcg->memory, nr_pages);
> -		if (do_swap_account)
> -			page_counter_charge(&memcg->memsw, nr_pages);
> -		css_get_many(&memcg->css, nr_pages);
> -		ret = 0;
> -	} else if (ret)
> +	if (ret)
>  		page_counter_uncharge(&memcg->kmem, nr_pages);
>  
>  	return ret;
> @@ -4433,22 +4421,10 @@ static int mem_cgroup_do_precharge(unsig
>  		mc.precharge += count;
>  		return ret;
>  	}
> -	if (ret == -EINTR) {
> -		cancel_charge(root_mem_cgroup, count);
> -		return ret;
> -	}
>  
>  	/* Try charges one by one with reclaim */
>  	while (count--) {
>  		ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_NORETRY, 1);
> -		/*
> -		 * In case of failure, any residual charges against
> -		 * mc.to will be dropped by mem_cgroup_clear_mc()
> -		 * later on.  However, cancel any charges that are
> -		 * bypassed to root right away or they'll be lost.
> -		 */
> -		if (ret == -EINTR)
> -			cancel_charge(root_mem_cgroup, 1);
>  		if (ret)
>  			return ret;
>  		mc.precharge++;
> @@ -5353,11 +5329,6 @@ int mem_cgroup_try_charge(struct page *p
>  	ret = try_charge(memcg, gfp_mask, nr_pages);
>  
>  	css_put(&memcg->css);
> -
> -	if (ret == -EINTR) {
> -		memcg = root_mem_cgroup;
> -		ret = 0;
> -	}
>  out:
>  	*memcgp = memcg;
>  	return ret;

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] memcg: drop unnecessary cold-path tests from __memcg_kmem_bypass()
@ 2015-09-14 19:40       ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2015-09-14 19:40 UTC (permalink / raw)
  To: Tejun Heo; +Cc: akpm, hannes, cgroups, linux-mm, vdavydov, kernel-team

On Sun 13-09-15 16:15:09, Tejun Heo wrote:
> __memcg_kmem_bypass() decides whether a kmem allocation should be
> bypassed to the root memcg.  Some conditions that it tests are valid
> criteria regarding who should be held accountable; however, there are
> a couple unnecessary tests for cold paths - __GFP_FAIL and
> fatal_signal_pending().
> 
> The previous patch updated try_charge() to handle both __GFP_FAIL and
> dying tasks correctly and the only thing these two tests are doing is
> making accounting less accurate and sprinkling tests for cold path
> conditions in the hot paths.  There's nothing meaningful gained by
> these extra tests.
> 
> This patch removes the two unnecessary tests from
> __memcg_kmem_bypass().
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/memcontrol.h |   14 --------------
>  1 file changed, 14 deletions(-)
> 
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -780,24 +780,10 @@ static inline bool __memcg_kmem_bypass(g
>  {
>  	if (!memcg_kmem_enabled())
>  		return true;
> -
>  	if (gfp & __GFP_NOACCOUNT)
>  		return true;
> -	/*
> -	 * __GFP_NOFAIL allocations will move on even if charging is not
> -	 * possible. Therefore we don't even try, and have this allocation
> -	 * unaccounted. We could in theory charge it forcibly, but we hope
> -	 * those allocations are rare, and won't be worth the trouble.
> -	 */
> -	if (gfp & __GFP_NOFAIL)
> -		return true;
>  	if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
>  		return true;
> -
> -	/* If the test is dying, just let it go. */
> -	if (unlikely(fatal_signal_pending(current)))
> -		return true;
> -
>  	return false;
>  }
>  

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/3] memcg: drop unnecessary cold-path tests from __memcg_kmem_bypass()
@ 2015-09-14 19:40       ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2015-09-14 19:40 UTC (permalink / raw)
  To: Tejun Heo
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ,
	kernel-team-b10kYP2dOMg

On Sun 13-09-15 16:15:09, Tejun Heo wrote:
> __memcg_kmem_bypass() decides whether a kmem allocation should be
> bypassed to the root memcg.  Some conditions that it tests are valid
> criteria regarding who should be held accountable; however, there are
> a couple unnecessary tests for cold paths - __GFP_FAIL and
> fatal_signal_pending().
> 
> The previous patch updated try_charge() to handle both __GFP_FAIL and
> dying tasks correctly and the only thing these two tests are doing is
> making accounting less accurate and sprinkling tests for cold path
> conditions in the hot paths.  There's nothing meaningful gained by
> these extra tests.
> 
> This patch removes the two unnecessary tests from
> __memcg_kmem_bypass().
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>

> ---
>  include/linux/memcontrol.h |   14 --------------
>  1 file changed, 14 deletions(-)
> 
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -780,24 +780,10 @@ static inline bool __memcg_kmem_bypass(g
>  {
>  	if (!memcg_kmem_enabled())
>  		return true;
> -
>  	if (gfp & __GFP_NOACCOUNT)
>  		return true;
> -	/*
> -	 * __GFP_NOFAIL allocations will move on even if charging is not
> -	 * possible. Therefore we don't even try, and have this allocation
> -	 * unaccounted. We could in theory charge it forcibly, but we hope
> -	 * those allocations are rare, and won't be worth the trouble.
> -	 */
> -	if (gfp & __GFP_NOFAIL)
> -		return true;
>  	if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
>  		return true;
> -
> -	/* If the test is dying, just let it go. */
> -	if (unlikely(fatal_signal_pending(current)))
> -		return true;
> -
>  	return false;
>  }
>  

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] memcg: ratify and consolidate over-charge handling
@ 2015-09-14 19:56       ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-14 19:56 UTC (permalink / raw)
  To: Michal Hocko; +Cc: akpm, hannes, cgroups, linux-mm, vdavydov, kernel-team

Hello, Michal.

On Mon, Sep 14, 2015 at 09:32:25PM +0200, Michal Hocko wrote:
> >   mem_cgroup_try_charge() needs to switch
> >   the returned cgroup to the root one.
> > 
> > The reality is that in memcg there are cases where we are forced
> > and/or willing to go over the limit.  Each such case needs to be
> > scrutinized and justified but there definitely are situations where
> > that is the right thing to do.  We alredy do this but with a
> > superficial and inconsistent disguise which leads to unnecessary
> > complications.
> >
> > This patch updates try_charge() so that it over-charges and returns 0
> > when deemed necessary.  -EINTR return is removed along with all
> > special case handling in the callers.
> 
> OK the code is easier in the end, although I would argue that try_charge
> could return ENOMEM for GFP_NOWAIT instead of overcharging (this would
> e.g. allow precharge to bail out earlier). Something for a separate patch I
> guess.

Hmm... GFP_NOWAIT is failed unless it also has __GFP_NOFAIL.

> Anyway I still do not like usage > max/hard limit presented to userspace
> because it looks like a clear breaking of max/hard limit semantic. I
> realize that we cannot solve the underlying problem easily or it might
> be unfeasible but we should consider how to present this state to the
> userspace.
> We have basically 2 options AFAICS. We can either document that a
> _temporal_ breach of the max/hard limit is allowed or we can hide this
> fact and always present max(current,max).
> The first one might be better for an easier debugging and it is also
> more honest about the current state but the definition of the hard limit
> is a bit weird. It also exposes implementation details to the userspace.
> The other choice is clearly lying but users shouldn't care about the
> implementation details and if the state is really temporal then the
> userspace shouldn't even notice. There is also a risk that somebody is
> already depending on current < max which happened to work without kmem
> until now.
> This is something to be solved in a separate patch I guess but we
> should think about that. I am not entirely clear on that myself but I am
> more inclined to the first option and simply document the potential
> corner case and temporal breach.

I'm pretty sure we don't wanna lie.  Just document that temporal
small-scale breaches may happen.  I don't even think this is an
implementation detail.  The fact that we have separate high and max
limits is already admitting that this is inherently different from
global case and that memcg is consciously and actively making
trade-offs regarding handling of global and local memory pressure and
I think that's the right thing to do and something inherent to what
memcg is doing here.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] memcg: ratify and consolidate over-charge handling
@ 2015-09-14 19:56       ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-14 19:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	hannes-druUgvl0LCNAfugRpC6u6w, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ,
	kernel-team-b10kYP2dOMg

Hello, Michal.

On Mon, Sep 14, 2015 at 09:32:25PM +0200, Michal Hocko wrote:
> >   mem_cgroup_try_charge() needs to switch
> >   the returned cgroup to the root one.
> > 
> > The reality is that in memcg there are cases where we are forced
> > and/or willing to go over the limit.  Each such case needs to be
> > scrutinized and justified but there definitely are situations where
> > that is the right thing to do.  We alredy do this but with a
> > superficial and inconsistent disguise which leads to unnecessary
> > complications.
> >
> > This patch updates try_charge() so that it over-charges and returns 0
> > when deemed necessary.  -EINTR return is removed along with all
> > special case handling in the callers.
> 
> OK the code is easier in the end, although I would argue that try_charge
> could return ENOMEM for GFP_NOWAIT instead of overcharging (this would
> e.g. allow precharge to bail out earlier). Something for a separate patch I
> guess.

Hmm... GFP_NOWAIT is failed unless it also has __GFP_NOFAIL.

> Anyway I still do not like usage > max/hard limit presented to userspace
> because it looks like a clear breaking of max/hard limit semantic. I
> realize that we cannot solve the underlying problem easily or it might
> be unfeasible but we should consider how to present this state to the
> userspace.
> We have basically 2 options AFAICS. We can either document that a
> _temporal_ breach of the max/hard limit is allowed or we can hide this
> fact and always present max(current,max).
> The first one might be better for an easier debugging and it is also
> more honest about the current state but the definition of the hard limit
> is a bit weird. It also exposes implementation details to the userspace.
> The other choice is clearly lying but users shouldn't care about the
> implementation details and if the state is really temporal then the
> userspace shouldn't even notice. There is also a risk that somebody is
> already depending on current < max which happened to work without kmem
> until now.
> This is something to be solved in a separate patch I guess but we
> should think about that. I am not entirely clear on that myself but I am
> more inclined to the first option and simply document the potential
> corner case and temporal breach.

I'm pretty sure we don't wanna lie.  Just document that temporal
small-scale breaches may happen.  I don't even think this is an
implementation detail.  The fact that we have separate high and max
limits is already admitting that this is inherently different from
global case and that memcg is consciously and actively making
trade-offs regarding handling of global and local memory pressure and
I think that's the right thing to do and something inherent to what
memcg is doing here.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 2/3] memcg: ratify and consolidate over-charge handling
  2015-09-13 20:14   ` Tejun Heo
                     ` (3 preceding siblings ...)
  (?)
@ 2015-09-14 20:07   ` Tejun Heo
  2015-09-15 16:18       ` Johannes Weiner
  -1 siblings, 1 reply; 24+ messages in thread
From: Tejun Heo @ 2015-09-14 20:07 UTC (permalink / raw)
  To: akpm, hannes, mhocko; +Cc: cgroups, linux-mm, vdavydov, kernel-team

try_charge() is the main charging logic of memcg.  When it hits the
limit but either can't fail the allocation due to __GFP_NOFAIL or the
task is likely to free memory very soon, being OOM killed, has SIGKILL
pending or exiting, it "bypasses" the charge to the root memcg and
returns -EINTR.  While this is one approach which can be taken for
these situations, it has several issues.

* It unnecessarily lies about the reality.  The number itself doesn't
  go over the limit but the actual usage does.  memcg is either forced
  to or actively chooses to go over the limit because that is the
  right behavior under the circumstances, which is completely fine,
  but, if at all avoidable, it shouldn't be misrepresenting what's
  happening by sneaking the charges into the root memcg.

* Despite trying, we already do over-charge.  kmemcg can't deal with
  switching over to the root memcg by the point try_charge() returns
  -EINTR, so it open-codes over-charing.

* It complicates the callers.  Each try_charge() user has to handle
  the weird -EINTR exception.  memcg_charge_kmem() does the manual
  over-charging.  mem_cgroup_do_precharge() performs unnecessary
  uncharging of root memcg, which BTW is inconsistent with what
  memcg_charge_kmem() does but not broken as [un]charging are noops on
  root memcg.  mem_cgroup_try_charge() needs to switch the returned
  cgroup to the root one.

The reality is that in memcg there are cases where we are forced
and/or willing to go over the limit.  Each such case needs to be
scrutinized and justified but there definitely are situations where
that is the right thing to do.  We alredy do this but with a
superficial and inconsistent disguise which leads to unnecessary
complications.

This patch updates try_charge() so that it over-charges and returns 0
when deemed necessary.  -EINTR return is removed along with all
special case handling in the callers.

While at it, remove the local variable @ret, which was initialized to
zero and never changed, along with done: label which just returned the
always zero @ret.

v2: Minor update to patch description as per Vladimir.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/memcontrol.c |   69 ++++++++++++++++----------------------------------------
 1 file changed, 20 insertions(+), 49 deletions(-)

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1999,13 +1999,12 @@ static int try_charge(struct mem_cgroup
 	unsigned long nr_reclaimed;
 	bool may_swap = true;
 	bool drained = false;
-	int ret = 0;
 
 	if (mem_cgroup_is_root(memcg))
-		goto done;
+		return 0;
 retry:
 	if (consume_stock(memcg, nr_pages))
-		goto done;
+		return 0;
 
 	if (!do_swap_account ||
 	    !page_counter_try_charge(&memcg->memsw, batch, &counter)) {
@@ -2033,7 +2032,7 @@ retry:
 	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
 		     fatal_signal_pending(current) ||
 		     current->flags & PF_EXITING))
-		goto bypass;
+		goto force;
 
 	if (unlikely(task_in_memcg_oom(current)))
 		goto nomem;
@@ -2079,10 +2078,10 @@ retry:
 		goto retry;
 
 	if (gfp_mask & __GFP_NOFAIL)
-		goto bypass;
+		goto force;
 
 	if (fatal_signal_pending(current))
-		goto bypass;
+		goto force;
 
 	mem_cgroup_events(mem_over_limit, MEMCG_OOM, 1);
 
@@ -2090,8 +2089,18 @@ retry:
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
 		return -ENOMEM;
-bypass:
-	return -EINTR;
+force:
+	/*
+	 * The allocation either can't fail or will lead to more memory
+	 * being freed very soon.  Allow memory usage go over the limit
+	 * temporarily by force charging it.
+	 */
+	page_counter_charge(&memcg->memory, nr_pages);
+	if (do_swap_account)
+		page_counter_charge(&memcg->memsw, nr_pages);
+	css_get_many(&memcg->css, nr_pages);
+
+	return 0;
 
 done_restock:
 	css_get_many(&memcg->css, batch);
@@ -2114,8 +2123,8 @@ done_restock:
 			break;
 		}
 	} while ((memcg = parent_mem_cgroup(memcg)));
-done:
-	return ret;
+
+	return 0;
 }
 
 static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
@@ -2207,28 +2216,7 @@ int memcg_charge_kmem(struct mem_cgroup
 		return ret;
 
 	ret = try_charge(memcg, gfp, nr_pages);
-	if (ret == -EINTR)  {
-		/*
-		 * try_charge() chose to bypass to root due to OOM kill or
-		 * fatal signal.  Since our only options are to either fail
-		 * the allocation or charge it to this cgroup, do it as a
-		 * temporary condition. But we can't fail. From a kmem/slab
-		 * perspective, the cache has already been selected, by
-		 * mem_cgroup_kmem_get_cache(), so it is too late to change
-		 * our minds.
-		 *
-		 * This condition will only trigger if the task entered
-		 * memcg_charge_kmem in a sane state, but was OOM-killed
-		 * during try_charge() above. Tasks that were already dying
-		 * when the allocation triggers should have been already
-		 * directed to the root cgroup in memcontrol.h
-		 */
-		page_counter_charge(&memcg->memory, nr_pages);
-		if (do_swap_account)
-			page_counter_charge(&memcg->memsw, nr_pages);
-		css_get_many(&memcg->css, nr_pages);
-		ret = 0;
-	} else if (ret)
+	if (ret)
 		page_counter_uncharge(&memcg->kmem, nr_pages);
 
 	return ret;
@@ -4433,22 +4421,10 @@ static int mem_cgroup_do_precharge(unsig
 		mc.precharge += count;
 		return ret;
 	}
-	if (ret == -EINTR) {
-		cancel_charge(root_mem_cgroup, count);
-		return ret;
-	}
 
 	/* Try charges one by one with reclaim */
 	while (count--) {
 		ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_NORETRY, 1);
-		/*
-		 * In case of failure, any residual charges against
-		 * mc.to will be dropped by mem_cgroup_clear_mc()
-		 * later on.  However, cancel any charges that are
-		 * bypassed to root right away or they'll be lost.
-		 */
-		if (ret == -EINTR)
-			cancel_charge(root_mem_cgroup, 1);
 		if (ret)
 			return ret;
 		mc.precharge++;
@@ -5353,11 +5329,6 @@ int mem_cgroup_try_charge(struct page *p
 	ret = try_charge(memcg, gfp_mask, nr_pages);
 
 	css_put(&memcg->css);
-
-	if (ret == -EINTR) {
-		memcg = root_mem_cgroup;
-		ret = 0;
-	}
 out:
 	*memcgp = memcg;
 	return ret;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] memcg: ratify and consolidate over-charge handling
  2015-09-14 19:56       ` Tejun Heo
  (?)
@ 2015-09-15  8:01       ` Michal Hocko
  2015-09-15 15:50         ` Tejun Heo
  -1 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2015-09-15  8:01 UTC (permalink / raw)
  To: Tejun Heo; +Cc: akpm, hannes, cgroups, linux-mm, vdavydov, kernel-team

On Mon 14-09-15 15:56:08, Tejun Heo wrote:
> Hello, Michal.
> 
> On Mon, Sep 14, 2015 at 09:32:25PM +0200, Michal Hocko wrote:
> > >   mem_cgroup_try_charge() needs to switch
> > >   the returned cgroup to the root one.
> > > 
> > > The reality is that in memcg there are cases where we are forced
> > > and/or willing to go over the limit.  Each such case needs to be
> > > scrutinized and justified but there definitely are situations where
> > > that is the right thing to do.  We alredy do this but with a
> > > superficial and inconsistent disguise which leads to unnecessary
> > > complications.
> > >
> > > This patch updates try_charge() so that it over-charges and returns 0
> > > when deemed necessary.  -EINTR return is removed along with all
> > > special case handling in the callers.
> > 
> > OK the code is easier in the end, although I would argue that try_charge
> > could return ENOMEM for GFP_NOWAIT instead of overcharging (this would
> > e.g. allow precharge to bail out earlier). Something for a separate patch I
> > guess.
> 
> Hmm... GFP_NOWAIT is failed unless it also has __GFP_NOFAIL.

Yes I wasn't clear, sorry, it fails but TIF_MEMDIE or killed/exiting
context would still overcharge GFP_NOWAIT requests rather than failing
them. Something for a separate patch though.

> > Anyway I still do not like usage > max/hard limit presented to userspace
> > because it looks like a clear breaking of max/hard limit semantic. I
> > realize that we cannot solve the underlying problem easily or it might
> > be unfeasible but we should consider how to present this state to the
> > userspace.
> > We have basically 2 options AFAICS. We can either document that a
> > _temporal_ breach of the max/hard limit is allowed or we can hide this
> > fact and always present max(current,max).
> > The first one might be better for an easier debugging and it is also
> > more honest about the current state but the definition of the hard limit
> > is a bit weird. It also exposes implementation details to the userspace.
> > The other choice is clearly lying but users shouldn't care about the
> > implementation details and if the state is really temporal then the
> > userspace shouldn't even notice. There is also a risk that somebody is
> > already depending on current < max which happened to work without kmem
> > until now.
> > This is something to be solved in a separate patch I guess but we
> > should think about that. I am not entirely clear on that myself but I am
> > more inclined to the first option and simply document the potential
> > corner case and temporal breach.
> 
> I'm pretty sure we don't wanna lie.  Just document that temporal
> small-scale breaches may happen. 

It goes against hard limit semantic but I am more and more convinced this
is the right way to go because I think we want to mimic the global case
which allows to access accounted reserves in some cases (e.g. reclaimers
should be able to get access to more memory to free memory). I will cook
up a documentation patch sometimes this week but we have an internal
conference so I might be too busy to do it right away.

> I don't even think this is an implementation detail.

I really think this is an implementation detail because we can force
the implementation to never overcharge. Just retry indefinitely for
__GFP_NOFAIL, fail the charge for others and be done with that. Of
course it is not that easy. Retrying indefinitely is deadlock prone.
Relaxing conditions for exiting tasks is merely an optimization.  The
slab code could be reorganized to cope with the ENOMEM as well. But I
guess this is not worth the effort and a small and ephemeral overcharge
is justified.

> The fact that we have separate high and max
> limits is already admitting that this is inherently different from
> global case and that memcg is consciously and actively making
> trade-offs regarding handling of global and local memory pressure and

I am not sure I understand here. High and max are basically resembling
watermarks for the global case. Sure max/high can be set independently
which is not the case for the global case which calculates them from
min_free_kbytes but why would that matter and make them different?

As mentioned above the resemblance with the global case should make it
more understandable for users.

> I think that's the right thing to do and something inherent to what
> memcg is doing here.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/3] memcg: ratify and consolidate over-charge handling
  2015-09-15  8:01       ` Michal Hocko
@ 2015-09-15 15:50         ` Tejun Heo
  0 siblings, 0 replies; 24+ messages in thread
From: Tejun Heo @ 2015-09-15 15:50 UTC (permalink / raw)
  To: Michal Hocko; +Cc: akpm, hannes, cgroups, linux-mm, vdavydov, kernel-team

Hello, Michal.

On Tue, Sep 15, 2015 at 10:01:10AM +0200, Michal Hocko wrote:
> > > OK the code is easier in the end, although I would argue that try_charge
> > > could return ENOMEM for GFP_NOWAIT instead of overcharging (this would
> > > e.g. allow precharge to bail out earlier). Something for a separate patch I
> > > guess.
> > 
> > Hmm... GFP_NOWAIT is failed unless it also has __GFP_NOFAIL.
> 
> Yes I wasn't clear, sorry, it fails but TIF_MEMDIE or killed/exiting
> context would still overcharge GFP_NOWAIT requests rather than failing
> them. Something for a separate patch though.

Ah, I see.  I'm not sure that'd matter one way or the other tho.

> > I don't even think this is an implementation detail.
> 
> I really think this is an implementation detail because we can force
> the implementation to never overcharge. Just retry indefinitely for

But we actively choose not to and I think that's an architectural
decision.

> I am not sure I understand here. High and max are basically resembling
> watermarks for the global case. Sure max/high can be set independently
> which is not the case for the global case which calculates them from
> min_free_kbytes but why would that matter and make them different?

Yes, there are similarities but if we really wanted to emulate global
case, we'd just have the hardlimit and then have async reclaim to keep
certain level of reserves and so on.

I don't think it'd be meaningful to try to follow the limit strictly
under all circumstances.  It's not like we can track all memory
consumption to begin with.  There always is common consumption which
can't clearly be distributed to specific cgroups which makes literal
strictness rather pointless.  Also, I'm pretty sure there will always
be enough cases where it is saner to temporarily breach the limit in
small scale just because exhaustion of memory inside a cgroup doesn't
mean global exhaustion and that's an inherent characteristic of what
memcg does.

Anyways, I don't think this difference in viewpoints matters that
much.

Thanks.

-- 
tejun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/3] memcg: ratify and consolidate over-charge handling
@ 2015-09-15 16:18       ` Johannes Weiner
  0 siblings, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2015-09-15 16:18 UTC (permalink / raw)
  To: Tejun Heo; +Cc: akpm, mhocko, cgroups, linux-mm, vdavydov, kernel-team

On Mon, Sep 14, 2015 at 04:07:32PM -0400, Tejun Heo wrote:
> try_charge() is the main charging logic of memcg.  When it hits the
> limit but either can't fail the allocation due to __GFP_NOFAIL or the
> task is likely to free memory very soon, being OOM killed, has SIGKILL
> pending or exiting, it "bypasses" the charge to the root memcg and
> returns -EINTR.  While this is one approach which can be taken for
> these situations, it has several issues.
> 
> * It unnecessarily lies about the reality.  The number itself doesn't
>   go over the limit but the actual usage does.  memcg is either forced
>   to or actively chooses to go over the limit because that is the
>   right behavior under the circumstances, which is completely fine,
>   but, if at all avoidable, it shouldn't be misrepresenting what's
>   happening by sneaking the charges into the root memcg.
> 
> * Despite trying, we already do over-charge.  kmemcg can't deal with
>   switching over to the root memcg by the point try_charge() returns
>   -EINTR, so it open-codes over-charing.
> 
> * It complicates the callers.  Each try_charge() user has to handle
>   the weird -EINTR exception.  memcg_charge_kmem() does the manual
>   over-charging.  mem_cgroup_do_precharge() performs unnecessary
>   uncharging of root memcg, which BTW is inconsistent with what
>   memcg_charge_kmem() does but not broken as [un]charging are noops on
>   root memcg.  mem_cgroup_try_charge() needs to switch the returned
>   cgroup to the root one.
> 
> The reality is that in memcg there are cases where we are forced
> and/or willing to go over the limit.  Each such case needs to be
> scrutinized and justified but there definitely are situations where
> that is the right thing to do.  We alredy do this but with a
> superficial and inconsistent disguise which leads to unnecessary
> complications.
> 
> This patch updates try_charge() so that it over-charges and returns 0
> when deemed necessary.  -EINTR return is removed along with all
> special case handling in the callers.
> 
> While at it, remove the local variable @ret, which was initialized to
> zero and never changed, along with done: label which just returned the
> always zero @ret.
> 
> v2: Minor update to patch description as per Vladimir.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/3] memcg: ratify and consolidate over-charge handling
@ 2015-09-15 16:18       ` Johannes Weiner
  0 siblings, 0 replies; 24+ messages in thread
From: Johannes Weiner @ 2015-09-15 16:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mhocko-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, vdavydov-bzQdu9zFT3WakBO8gow8eQ,
	kernel-team-b10kYP2dOMg

On Mon, Sep 14, 2015 at 04:07:32PM -0400, Tejun Heo wrote:
> try_charge() is the main charging logic of memcg.  When it hits the
> limit but either can't fail the allocation due to __GFP_NOFAIL or the
> task is likely to free memory very soon, being OOM killed, has SIGKILL
> pending or exiting, it "bypasses" the charge to the root memcg and
> returns -EINTR.  While this is one approach which can be taken for
> these situations, it has several issues.
> 
> * It unnecessarily lies about the reality.  The number itself doesn't
>   go over the limit but the actual usage does.  memcg is either forced
>   to or actively chooses to go over the limit because that is the
>   right behavior under the circumstances, which is completely fine,
>   but, if at all avoidable, it shouldn't be misrepresenting what's
>   happening by sneaking the charges into the root memcg.
> 
> * Despite trying, we already do over-charge.  kmemcg can't deal with
>   switching over to the root memcg by the point try_charge() returns
>   -EINTR, so it open-codes over-charing.
> 
> * It complicates the callers.  Each try_charge() user has to handle
>   the weird -EINTR exception.  memcg_charge_kmem() does the manual
>   over-charging.  mem_cgroup_do_precharge() performs unnecessary
>   uncharging of root memcg, which BTW is inconsistent with what
>   memcg_charge_kmem() does but not broken as [un]charging are noops on
>   root memcg.  mem_cgroup_try_charge() needs to switch the returned
>   cgroup to the root one.
> 
> The reality is that in memcg there are cases where we are forced
> and/or willing to go over the limit.  Each such case needs to be
> scrutinized and justified but there definitely are situations where
> that is the right thing to do.  We alredy do this but with a
> superficial and inconsistent disguise which leads to unnecessary
> complications.
> 
> This patch updates try_charge() so that it over-charges and returns 0
> when deemed necessary.  -EINTR return is removed along with all
> special case handling in the callers.
> 
> While at it, remove the local variable @ret, which was initialized to
> zero and never changed, along with done: label which just returned the
> always zero @ret.
> 
> v2: Minor update to patch description as per Vladimir.
> 
> Signed-off-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> Reviewed-by: Vladimir Davydov <vdavydov-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> Acked-by: Michal Hocko <mhocko-IBi9RG/b67k@public.gmane.org>

Acked-by: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2015-09-15 16:18 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-13 20:14 [PATCH 1/3] memcg: collect kmem bypass conditions into __memcg_kmem_bypass() Tejun Heo
2015-09-13 20:14 ` [PATCH 2/3] memcg: ratify and consolidate over-charge handling Tejun Heo
2015-09-13 20:14   ` Tejun Heo
2015-09-13 20:15   ` [PATCH 3/3] memcg: drop unnecessary cold-path tests from __memcg_kmem_bypass() Tejun Heo
2015-09-13 20:15     ` Tejun Heo
2015-09-14 12:51     ` Vladimir Davydov
2015-09-14 12:51       ` Vladimir Davydov
2015-09-14 19:40     ` Michal Hocko
2015-09-14 19:40       ` Michal Hocko
2015-09-14 12:44   ` [PATCH 2/3] memcg: ratify and consolidate over-charge handling Vladimir Davydov
2015-09-14 12:44     ` Vladimir Davydov
2015-09-14 15:51     ` Tejun Heo
2015-09-14 15:51       ` Tejun Heo
2015-09-14 19:32   ` Michal Hocko
2015-09-14 19:56     ` Tejun Heo
2015-09-14 19:56       ` Tejun Heo
2015-09-15  8:01       ` Michal Hocko
2015-09-15 15:50         ` Tejun Heo
2015-09-14 20:07   ` [PATCH v2 " Tejun Heo
2015-09-15 16:18     ` Johannes Weiner
2015-09-15 16:18       ` Johannes Weiner
2015-09-14  9:03 ` [PATCH 1/3] memcg: collect kmem bypass conditions into __memcg_kmem_bypass() Vladimir Davydov
2015-09-14  9:03   ` Vladimir Davydov
2015-09-14 15:21 ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.