* [PATCH 0/5] oom reaper v5
@ 2016-02-03 13:13 Michal Hocko
  2016-02-03 13:13 ` [PATCH 1/5] mm, oom: introduce oom reaper Michal Hocko
                   ` (4 more replies)
  0 siblings, 5 replies; 48+ messages in thread
From: Michal Hocko @ 2016-02-03 13:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

Hi,
I am reposting the whole patchset on top of mmotm, with the previous
version of the patchset reverted, for easier review. The series applies
cleanly on top of the current Linus tree as well.

The previous version was posted at
http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
and I have tried to address most of the feedback. There was some push
for extending the current implementation further, but I do not feel
comfortable doing that right now. I believe that we should start as
simple as possible and add extensions on top; that shouldn't be hard
with the current architecture.

Compared to the previous version, I have added patches 4 and 5. Patch 4
reports success/failure of reaping a task, which is useful to see how
the reaper operates. Patch 5 implements a more robust API between the
oom killer and the oom reaper: we now allow multiple tasks to be queued
for the reaper at the same time rather than the original single-task
mode.

Patch 1 also dropped the oom_reaper thread priority handling, as per
David. I ended up keeping the vma filtering code inside __oom_reap_task;
I still believe this is a better fit because the filtering rules are
specific to the reaper and cannot be shared with a larger code base.

In the meantime I have prepared a down_write_killable rw_semaphore
variant
http://lkml.kernel.org/r/1454444369-2146-1-git-send-email-mhocko@kernel.org
and also have a tentative patch to convert some users of mmap_sem for
write to use the killable version. This needs more checking, but I
expect to have something ready in two weeks or so (I will be on
vacation next week).
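
To illustrate the direction, here is a minimal sketch of the conversion
pattern, assuming the down_write_killable() semantics from the series
linked above (0 on success, -EINTR when interrupted by a fatal signal);
the surrounding function is hypothetical:

	/* back off instead of blocking when a fatal signal arrives */
	if (down_write_killable(&mm->mmap_sem))
		return -EINTR;
	/* ... manipulate the address space with the lock held ... */
	up_write(&mm->mmap_sem);
	return 0;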

For the general description of the oom_reaper functionality, please
refer to Patch 1.

I would be really grateful if we could postpone any
functional/semantic enhancements to a later discussion and focus on the
correctness of these particular patches, as there were no fundamental
objections to the current approach.

Thanks!


* [PATCH 1/5] mm, oom: introduce oom reaper
  2016-02-03 13:13 [PATCH 0/5] oom reaper v5 Michal Hocko
@ 2016-02-03 13:13 ` Michal Hocko
  2016-02-03 23:48   ` David Rientjes
  2016-02-06 13:22   ` Tetsuo Handa
  2016-02-03 13:13 ` [PATCH 2/5] oom reaper: handle mlocked pages Michal Hocko
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 48+ messages in thread
From: Michal Hocko @ 2016-02-03 13:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
independently brought up by Oleg Nesterov.

The OOM killer currently allows killing only a single task in the hope
that the task will terminate in a reasonable time and free up its
memory.  Such a task (the OOM victim) gets access to memory reserves
via mark_oom_victim to allow forward progress should it need
additional memory during the exit path.

It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above and
the OOM victim might take an unbounded amount of time to exit because it
might be blocked in the uninterruptible state waiting for an event
(e.g. a lock) held by another task looping in the page
allocator.

This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped out memory
owned by the oom victim under the assumption that such memory won't
be needed once its owner is killed and kicked out of userspace anyway.
There is one notable exception, though: if the OOM victim was
in the process of coredumping, the result would be incomplete. This is
considered a reasonable constraint because the overall system health is
more important than the debuggability of a particular application.

A kernel thread has been chosen because we need a reliable way of
invocation: a workqueue context is not appropriate because all the
workers might be busy (e.g. allocating memory), and kswapd, which sounds
like another good fit, is not appropriate either because it might get
blocked on locks during reclaim.

oom_reaper has to take the target task's mmap_sem for reading, so the
solution is not 100% reliable because the semaphore might be held for
write, but the probability of a stall is reduced considerably compared
to basically any lock blocking forward progress as described above. In
order to avoid blocking on the lock without making forward progress we
use only a trylock, retried 10 times with a short sleep in between.
Users of mmap_sem which need it for write should be carefully reviewed
to use _killable waiting as much as possible and to reduce allocation
requests done with the lock held to the absolute minimum, lowering the
risk even further.

The API between the OOM killer and the OOM reaper is quite trivial.
wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only the
NULL->mm transition, and oom_reaper clears it atomically once it is done
with the work. This means that only a single mm_struct can be reaped at
a time. As the operation is potentially disruptive we try to limit it to
the necessary minimum, and the reaper blocks any updates while it
operates on an mm. The mm_struct is pinned by mm_count to allow a
parallel exit_mmap, and a race is detected by
atomic_inc_not_zero(mm_users).
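
For illustration, the handoff protocol distilled from the patch below
(declarations and error handling omitted):

	/* OOM killer side: publish the mm, but only into an empty slot */
	atomic_inc(&mm->mm_count);			/* pin for the reaper */
	old_mm = cmpxchg(&mm_to_reap, NULL, mm);	/* NULL -> mm only */
	if (!old_mm)
		wake_up(&oom_reaper_wait);
	else
		mmdrop(mm);				/* slot busy, drop the pin */

	/* oom_reaper side: reap, then atomically release the slot */
	oom_reap_vmas(mm);
	WRITE_ONCE(mm_to_reap, NULL);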

Changes since v4
- drop MAX_RT_PRIO-1 as per David - memcg/cpuset/mempolicy OOM killing
  might interfere with the rest of the system
Changes since v3
- many style/compile fixups by Andrew
- unmap_mapping_range_tree needs full initialization of zap_details
  to prevent from missing unmaps and follow up BUG_ON during truncate
  resp. misaccounting - Kirill/Andrew
- exclude mlocked pages because they need an explicit munlock by Kirill
- use subsys_initcall instead of module_init - Paul Gortmaker
- do not tear down mm if it is shared with the global init because this
  could lead to SEGV and panic - Tetsuo
Changes since v2
- fix mm_count reference leak reported by Tetsuo
- make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
- use wait_event_freezable rather than open coded wait loop - suggested
  by Tetsuo
Changes since v1
- fix the screwed up detail->check_swap_entries - Johannes
- do not use kthread_should_stop because that would need a cleanup
  and we do not have anybody to stop us - Tetsuo
- move wake_oom_reaper to oom_kill_process because we have to wait
  for all tasks sharing the same mm to get killed - Tetsuo
- do not reap mm structs which are shared with unkillable tasks - Tetsuo

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mm.h |   2 +
 mm/internal.h      |   5 ++
 mm/memory.c        |  17 +++---
 mm/oom_kill.c      | 151 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 4 files changed, 162 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 05b9fbbceb01..8a67ea2a6323 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1111,6 +1111,8 @@ struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
+	bool ignore_dirty;			/* Ignore dirty pages */
+	bool check_swap_entries;		/* Check also swap entries */
 };
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/internal.h b/mm/internal.h
index ed90298c12db..cac6eb458727 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -42,6 +42,11 @@ extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
+void unmap_page_range(struct mmu_gather *tlb,
+			     struct vm_area_struct *vma,
+			     unsigned long addr, unsigned long end,
+			     struct zap_details *details);
+
 static inline void set_page_count(struct page *page, int v)
 {
 	atomic_set(&page->_count, v);
diff --git a/mm/memory.c b/mm/memory.c
index 42d5bec9bb91..c158dc53ca3d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1105,6 +1105,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			if (!PageAnon(page)) {
 				if (pte_dirty(ptent)) {
+					/*
+					 * oom_reaper cannot tear down dirty
+					 * pages
+					 */
+					if (unlikely(details && details->ignore_dirty))
+						continue;
 					force_flush = 1;
 					set_page_dirty(page);
 				}
@@ -1123,8 +1129,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			}
 			continue;
 		}
-		/* If details->check_mapping, we leave swap entries. */
-		if (unlikely(details))
+		/* only check swap_entries if explicitly asked for in details */
+		if (unlikely(details && !details->check_swap_entries))
 			continue;
 
 		entry = pte_to_swp_entry(ptent);
@@ -1229,7 +1235,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 	return addr;
 }
 
-static void unmap_page_range(struct mmu_gather *tlb,
+void unmap_page_range(struct mmu_gather *tlb,
 			     struct vm_area_struct *vma,
 			     unsigned long addr, unsigned long end,
 			     struct zap_details *details)
@@ -1237,9 +1243,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
 	pgd_t *pgd;
 	unsigned long next;
 
-	if (details && !details->check_mapping)
-		details = NULL;
-
 	BUG_ON(addr >= end);
 	tlb_start_vma(tlb, vma);
 	pgd = pgd_offset(vma->vm_mm, addr);
@@ -2419,7 +2422,7 @@ static inline void unmap_mapping_range_tree(struct rb_root *root,
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows)
 {
-	struct zap_details details;
+	struct zap_details details = { };
 	pgoff_t hba = holebegin >> PAGE_SHIFT;
 	pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e3ab892903ee..9a0e4e5f50b4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,6 +35,11 @@
 #include <linux/freezer.h>
 #include <linux/ftrace.h>
 #include <linux/ratelimit.h>
+#include <linux/kthread.h>
+#include <linux/init.h>
+
+#include <asm/tlb.h>
+#include "internal.h"
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/oom.h>
@@ -406,6 +411,133 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
 
 bool oom_killer_disabled __read_mostly;
 
+#ifdef CONFIG_MMU
+/*
+ * OOM Reaper kernel thread which tries to reap the memory used by the OOM
+ * victim (if that is possible) to help the OOM killer to move on.
+ */
+static struct task_struct *oom_reaper_th;
+static struct mm_struct *mm_to_reap;
+static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
+
+static bool __oom_reap_vmas(struct mm_struct *mm)
+{
+	struct mmu_gather tlb;
+	struct vm_area_struct *vma;
+	struct zap_details details = {.check_swap_entries = true,
+				      .ignore_dirty = true};
+	bool ret = true;
+
+	/* We might have raced with exit path */
+	if (!atomic_inc_not_zero(&mm->mm_users))
+		return true;
+
+	if (!down_read_trylock(&mm->mmap_sem)) {
+		ret = false;
+		goto out;
+	}
+
+	tlb_gather_mmu(&tlb, mm, 0, -1);
+	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+		if (is_vm_hugetlb_page(vma))
+			continue;
+
+		/*
+		 * mlocked VMAs require explicit munlocking before unmap.
+		 * Let's keep it simple here and skip such VMAs.
+		 */
+		if (vma->vm_flags & VM_LOCKED)
+			continue;
+
+		/*
+		 * Only anonymous pages have a good chance to be dropped
+		 * without additional steps which we cannot afford as we
+		 * are OOM already.
+		 *
+		 * We do not even care about fs backed pages because all
+		 * which are reclaimable have already been reclaimed and
+		 * we do not want to block exit_mmap by keeping mm ref
+		 * count elevated without a good reason.
+		 */
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
+					 &details);
+	}
+	tlb_finish_mmu(&tlb, 0, -1);
+	up_read(&mm->mmap_sem);
+out:
+	mmput(mm);
+	return ret;
+}
+
+static void oom_reap_vmas(struct mm_struct *mm)
+{
+	int attempts = 0;
+
+	/* Retry the down_read_trylock(mmap_sem) a few times */
+	while (attempts++ < 10 && !__oom_reap_vmas(mm))
+		schedule_timeout_idle(HZ/10);
+
+	/* Drop a reference taken by wake_oom_reaper */
+	mmdrop(mm);
+}
+
+static int oom_reaper(void *unused)
+{
+	while (true) {
+		struct mm_struct *mm;
+
+		wait_event_freezable(oom_reaper_wait,
+				     (mm = READ_ONCE(mm_to_reap)));
+		oom_reap_vmas(mm);
+		WRITE_ONCE(mm_to_reap, NULL);
+	}
+
+	return 0;
+}
+
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+	struct mm_struct *old_mm;
+
+	if (!oom_reaper_th)
+		return;
+
+	/*
+	 * Pin the given mm. Use mm_count instead of mm_users because
+	 * we do not want to delay the address space tear down.
+	 */
+	atomic_inc(&mm->mm_count);
+
+	/*
+	 * Make sure that only a single mm is ever queued for the reaper
+	 * because multiple are not necessary and the operation might be
+	 * disruptive so better reduce it to the bare minimum.
+	 */
+	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
+	if (!old_mm)
+		wake_up(&oom_reaper_wait);
+	else
+		mmdrop(mm);
+}
+
+static int __init oom_init(void)
+{
+	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+	if (IS_ERR(oom_reaper_th)) {
+		pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
+				PTR_ERR(oom_reaper_th));
+		oom_reaper_th = NULL;
+	}
+	return 0;
+}
+subsys_initcall(oom_init)
+#else
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+}
+#endif
+
 /**
  * mark_oom_victim - mark the given task as OOM victim
  * @tsk: task to mark
@@ -511,6 +643,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
+	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -601,17 +734,23 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD))
-			continue;
-		if (is_global_init(p))
-			continue;
-		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
+		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+			/*
+			 * We cannot use oom_reaper for the mm shared by this
+			 * process because it wouldn't get killed and so the
+			 * memory might be still used.
+			 */
+			can_oom_reap = false;
 			continue;
-
+		}
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
+	if (can_oom_reap)
+		wake_oom_reaper(mm);
+
 	mmdrop(mm);
 	put_task_struct(victim);
 }
-- 
2.7.0


* [PATCH 2/5] oom reaper: handle mlocked pages
  2016-02-03 13:13 [PATCH 0/5] oom reaper v5 Michal Hocko
  2016-02-03 13:13 ` [PATCH 1/5] mm, oom: introduce oom reaper Michal Hocko
@ 2016-02-03 13:13 ` Michal Hocko
  2016-02-03 23:57   ` David Rientjes
  2016-02-23  1:36   ` David Rientjes
  2016-02-03 13:13 ` [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space Michal Hocko
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 48+ messages in thread
From: Michal Hocko @ 2016-02-03 13:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__oom_reap_vmas currently skips over all mlocked vmas because they need
special treatment before they are unmapped. This is done primarily for
simplicity, but there is no reason to skip them and reduce the amount of
reclaimed memory. It is safe from the semantic point of view because
try_to_unmap_one during the rmap walk would tell the reclaim path to
cull the page and mlock it again.

munlock_vma_pages_all is also safe to call from the oom reaper context
because it relies on no locks other than mmap_sem (held for read).

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9a0e4e5f50b4..840e03986497 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -443,13 +443,6 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
 			continue;
 
 		/*
-		 * mlocked VMAs require explicit munlocking before unmap.
-		 * Let's keep it simple here and skip such VMAs.
-		 */
-		if (vma->vm_flags & VM_LOCKED)
-			continue;
-
-		/*
 		 * Only anonymous pages have a good chance to be dropped
 		 * without additional steps which we cannot afford as we
 		 * are OOM already.
@@ -459,9 +452,12 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
 		 * we do not want to block exit_mmap by keeping mm ref
 		 * count elevated without a good reason.
 		 */
-		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
+			if (vma->vm_flags & VM_LOCKED)
+				munlock_vma_pages_all(vma);
 			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
 					 &details);
+		}
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
 	up_read(&mm->mmap_sem);
-- 
2.7.0


* [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-03 13:13 [PATCH 0/5] oom reaper v5 Michal Hocko
  2016-02-03 13:13 ` [PATCH 1/5] mm, oom: introduce oom reaper Michal Hocko
  2016-02-03 13:13 ` [PATCH 2/5] oom reaper: handle mlocked pages Michal Hocko
@ 2016-02-03 13:13 ` Michal Hocko
  2016-02-04 14:22   ` Tetsuo Handa
  2016-02-03 13:13 ` [PATCH 4/5] mm, oom_reaper: report success/failure Michal Hocko
  2016-02-03 13:14 ` [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing Michal Hocko
  4 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2016-02-03 13:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

When oom_reaper manages to unmap all the eligible vmas there shouldn't
be much freeable memory held by the oom victim left anymore, so it makes
sense to clear the TIF_MEMDIE flag for the victim and allow the OOM
killer to select another task.

The lack of TIF_MEMDIE also means that the victim cannot access memory
reserves anymore, but that shouldn't be a problem because it would get
access again if it needs to allocate and hits the OOM killer again due
to the fatal_signal_pending or PF_EXITING check. We can safely hide the
task from the OOM killer because it is clearly not a good candidate
anymore, as everything reclaimable has been torn down already.

This patch will allow capping the time an OOM victim can keep
TIF_MEMDIE and thus hold off further global OOM killer actions, provided
the oom reaper is able to take mmap_sem for the associated mm struct.
This is not guaranteed now, but further steps should make sure that
mmap_sem for write is taken in a killable fashion, which will help
reduce such lock contention. That is not done by this patch.

Note that exit_oom_victim might be called on a remote task from
__oom_reap_task now, so we have to check and clear the flag atomically;
otherwise we might race and underflow oom_victims or wake up waiters
too early.
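
For illustration, the two paths that can now race on the same victim (a
sketch of the invariant, distilled from the diff below rather than new
code):

	/*
	 * exit_mm()         -> exit_oom_victim(tsk)  (victim's own exit)
	 * __oom_reap_task() -> exit_oom_victim(tsk)  (remote, oom_reaper)
	 *
	 * test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE) lets exactly one
	 * caller perform the atomic_dec(&oom_victims), so the counter
	 * cannot underflow and oom_killer_disable() waiters are not woken
	 * prematurely.
	 */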

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/oom.h |  2 +-
 kernel/exit.c       |  2 +-
 mm/oom_kill.c       | 73 +++++++++++++++++++++++++++++++++++------------------
 3 files changed, 50 insertions(+), 27 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 03e6257321f0..45993b840ed6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -91,7 +91,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 extern bool out_of_memory(struct oom_control *oc);
 
-extern void exit_oom_victim(void);
+extern void exit_oom_victim(struct task_struct *tsk);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/kernel/exit.c b/kernel/exit.c
index 07110c6020a0..4b1c6e2658bd 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -436,7 +436,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim();
+		exit_oom_victim(tsk);
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 840e03986497..8e345126d73e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -417,20 +417,36 @@ bool oom_killer_disabled __read_mostly;
  * victim (if that is possible) to help the OOM killer to move on.
  */
 static struct task_struct *oom_reaper_th;
-static struct mm_struct *mm_to_reap;
+static struct task_struct *task_to_reap;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 
-static bool __oom_reap_vmas(struct mm_struct *mm)
+static bool __oom_reap_task(struct task_struct *tsk)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	struct task_struct *p;
 	struct zap_details details = {.check_swap_entries = true,
 				      .ignore_dirty = true};
 	bool ret = true;
 
-	/* We might have raced with exit path */
-	if (!atomic_inc_not_zero(&mm->mm_users))
+	/*
+	 * Make sure we find the associated mm_struct even when the particular
+	 * thread has already terminated and cleared its mm.
+	 * We might race with the exit path so consider our work done if there
+	 * is no mm.
+	 */
+	p = find_lock_task_mm(tsk);
+	if (!p)
+		return true;
+
+	mm = p->mm;
+	if (!atomic_inc_not_zero(&mm->mm_users)) {
+		task_unlock(p);
 		return true;
+	}
+
+	task_unlock(p);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
@@ -461,60 +477,66 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
 	up_read(&mm->mmap_sem);
+
+	/*
+	 * Clear TIF_MEMDIE because the task shouldn't be sitting on
+	 * reasonably reclaimable memory anymore. The OOM killer can continue
+	 * by selecting another victim if unmapping hasn't led to any
+	 * improvements. This also means that selecting this task doesn't
+	 * make any sense.
+	 */
+	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+	exit_oom_victim(tsk);
 out:
 	mmput(mm);
 	return ret;
 }
 
-static void oom_reap_vmas(struct mm_struct *mm)
+static void oom_reap_task(struct task_struct *tsk)
 {
 	int attempts = 0;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < 10 && !__oom_reap_vmas(mm))
+	while (attempts++ < 10 && !__oom_reap_task(tsk))
 		schedule_timeout_idle(HZ/10);
 
 	/* Drop a reference taken by wake_oom_reaper */
-	mmdrop(mm);
+	put_task_struct(tsk);
 }
 
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct mm_struct *mm;
+		struct task_struct *tsk;
 
 		wait_event_freezable(oom_reaper_wait,
-				     (mm = READ_ONCE(mm_to_reap)));
-		oom_reap_vmas(mm);
-		WRITE_ONCE(mm_to_reap, NULL);
+				     (tsk = READ_ONCE(task_to_reap)));
+		oom_reap_task(tsk);
+		WRITE_ONCE(task_to_reap, NULL);
 	}
 
 	return 0;
 }
 
-static void wake_oom_reaper(struct mm_struct *mm)
+static void wake_oom_reaper(struct task_struct *tsk)
 {
-	struct mm_struct *old_mm;
+	struct task_struct *old_tsk;
 
 	if (!oom_reaper_th)
 		return;
 
-	/*
-	 * Pin the given mm. Use mm_count instead of mm_users because
-	 * we do not want to delay the address space tear down.
-	 */
-	atomic_inc(&mm->mm_count);
+	get_task_struct(tsk);
 
 	/*
 	 * Make sure that only a single mm is ever queued for the reaper
 	 * because multiple are not necessary and the operation might be
 	 * disruptive so better reduce it to the bare minimum.
 	 */
-	old_mm = cmpxchg(&mm_to_reap, NULL, mm);
-	if (!old_mm)
+	old_tsk = cmpxchg(&task_to_reap, NULL, tsk);
+	if (!old_tsk)
 		wake_up(&oom_reaper_wait);
 	else
-		mmdrop(mm);
+		put_task_struct(tsk);
 }
 
 static int __init oom_init(void)
@@ -529,7 +551,7 @@ static int __init oom_init(void)
 }
 subsys_initcall(oom_init)
 #else
-static void wake_oom_reaper(struct mm_struct *mm)
+static void wake_oom_reaper(struct task_struct *mm)
 {
 }
 #endif
@@ -560,9 +582,10 @@ void mark_oom_victim(struct task_struct *tsk)
 /**
  * exit_oom_victim - note the exit of an OOM victim
  */
-void exit_oom_victim(void)
+void exit_oom_victim(struct task_struct *tsk)
 {
-	clear_thread_flag(TIF_MEMDIE);
+	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
+		return;
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
@@ -745,7 +768,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	rcu_read_unlock();
 
 	if (can_oom_reap)
-		wake_oom_reaper(mm);
+		wake_oom_reaper(victim);
 
 	mmdrop(mm);
 	put_task_struct(victim);
-- 
2.7.0


* [PATCH 4/5] mm, oom_reaper: report success/failure
  2016-02-03 13:13 [PATCH 0/5] oom reaper v5 Michal Hocko
                   ` (2 preceding siblings ...)
  2016-02-03 13:13 ` [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space Michal Hocko
@ 2016-02-03 13:13 ` Michal Hocko
  2016-02-03 23:10   ` David Rientjes
  2016-02-03 13:14 ` [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing Michal Hocko
  4 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2016-02-03 13:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Inform about successful/failed oom_reaper attempts and dump all the
held locks to tell us more about who is blocking the progress.
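
For illustration, a successful reap would leave a line like the
following in the kernel log (pid, name and counter values are made up;
the format string is the one added by the patch below):

	oom_reaper: reaped process 4314 (memeater) anon-rss:0kB, file-rss:12kB, shmem-rss:0kB

A failed reap would instead print the "unable to reap" message followed
by the debug_show_all_locks() output.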

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8e345126d73e..b87acdca2a41 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -420,6 +420,7 @@ static struct task_struct *oom_reaper_th;
 static struct task_struct *task_to_reap;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 
+#define K(x) ((x) << (PAGE_SHIFT-10))
 static bool __oom_reap_task(struct task_struct *tsk)
 {
 	struct mmu_gather tlb;
@@ -476,6 +477,11 @@ static bool __oom_reap_task(struct task_struct *tsk)
 		}
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
+	pr_info("oom_reaper: reaped process :%d (%s) anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lulB\n",
+			task_pid_nr(tsk), tsk->comm,
+			K(get_mm_counter(mm, MM_ANONPAGES)),
+			K(get_mm_counter(mm, MM_FILEPAGES)),
+			K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
 
 	/*
@@ -492,14 +498,21 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	return ret;
 }
 
+#define MAX_OOM_REAP_RETRIES 10
 static void oom_reap_task(struct task_struct *tsk)
 {
 	int attempts = 0;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < 10 && !__oom_reap_task(tsk))
+	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
 		schedule_timeout_idle(HZ/10);
 
+	if (attempts > MAX_OOM_REAP_RETRIES) {
+		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+				task_pid_nr(tsk), tsk->comm);
+		debug_show_all_locks();
+	}
+
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
 }
@@ -646,7 +659,6 @@ static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 	return false;
 }
 
-#define K(x) ((x) << (PAGE_SHIFT-10))
 /*
  * Must be called while holding a reference to p, which will be released upon
  * returning.
-- 
2.7.0


* [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
  2016-02-03 13:13 [PATCH 0/5] oom reaper v5 Michal Hocko
                   ` (3 preceding siblings ...)
  2016-02-03 13:13 ` [PATCH 4/5] mm, oom_reaper: report success/failure Michal Hocko
@ 2016-02-03 13:14 ` Michal Hocko
  2016-02-04 10:49   ` Tetsuo Handa
  2016-02-17  9:48   ` [PATCH 6/5] oom, oom_reaper: disable oom_reaper for Michal Hocko
  4 siblings, 2 replies; 48+ messages in thread
From: Michal Hocko @ 2016-02-03 13:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

wake_oom_reaper has allowed only one oom victim to be queued. The main
reason for that was simplicity, as other solutions would require some
way of queuing. The current approach is racy and that was deemed
sufficient, as the oom_reaper is considered a best-effort approach to
help with oom handling when the OOM victim cannot terminate in a
reasonable time. The race could lead to missing an oom victim which can
get stuck:

out_of_memory
  wake_oom_reaper
    cmpxchg // OK
    			oom_reaper
			  oom_reap_task
			    __oom_reap_task
oom_victim terminates
			      atomic_inc_not_zero // fail
out_of_memory
  wake_oom_reaper
    cmpxchg // fails
			  task_to_reap = NULL

This race requires 2 OOM invocations in a short time period, which is
not very likely but certainly not impossible. E.g. the original victim
might not have released a lot of memory for some reason.

The situation would improve considerably if wake_oom_reaper used more
robust queuing. This is what this patch implements. This means adding
an oom_reaper_list list_head into task_struct (filling a hole before the
embedded thread_struct for that purpose) and an oom_reaper_lock spinlock
for queuing synchronization. wake_oom_reaper will then add the task to
the queue and oom_reaper will dequeue it.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h |  3 +++
 mm/oom_kill.c         | 36 +++++++++++++++++++-----------------
 2 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a9cdd032b988..c25996c336de 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1814,6 +1814,9 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
+#ifdef CONFIG_MMU
+	struct list_head oom_reaper_list;
+#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b87acdca2a41..87d644c97ac9 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -417,8 +417,10 @@ bool oom_killer_disabled __read_mostly;
  * victim (if that is possible) to help the OOM killer to move on.
  */
 static struct task_struct *oom_reaper_th;
-static struct task_struct *task_to_reap;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
+static LIST_HEAD(oom_reaper_list);
+static DEFINE_SPINLOCK(oom_reaper_lock);
+
 
 #define K(x) ((x) << (PAGE_SHIFT-10))
 static bool __oom_reap_task(struct task_struct *tsk)
@@ -520,12 +522,20 @@ static void oom_reap_task(struct task_struct *tsk)
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct task_struct *tsk;
+		struct task_struct *tsk = NULL;
 
 		wait_event_freezable(oom_reaper_wait,
-				     (tsk = READ_ONCE(task_to_reap)));
-		oom_reap_task(tsk);
-		WRITE_ONCE(task_to_reap, NULL);
+				     (!list_empty(&oom_reaper_list)));
+		spin_lock(&oom_reaper_lock);
+		if (!list_empty(&oom_reaper_list)) {
+			tsk = list_first_entry(&oom_reaper_list,
+					struct task_struct, oom_reaper_list);
+			list_del(&tsk->oom_reaper_list);
+		}
+		spin_unlock(&oom_reaper_lock);
+
+		if (tsk)
+			oom_reap_task(tsk);
 	}
 
 	return 0;
@@ -533,23 +543,15 @@ static int oom_reaper(void *unused)
 
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	struct task_struct *old_tsk;
-
 	if (!oom_reaper_th)
 		return;
 
 	get_task_struct(tsk);
 
-	/*
-	 * Make sure that only a single mm is ever queued for the reaper
-	 * because multiple are not necessary and the operation might be
-	 * disruptive so better reduce it to the bare minimum.
-	 */
-	old_tsk = cmpxchg(&task_to_reap, NULL, tsk);
-	if (!old_tsk)
-		wake_up(&oom_reaper_wait);
-	else
-		put_task_struct(tsk);
+	spin_lock(&oom_reaper_lock);
+	list_add(&tsk->oom_reaper_list, &oom_reaper_list);
+	spin_unlock(&oom_reaper_lock);
+	wake_up(&oom_reaper_wait);
 }
 
 static int __init oom_init(void)
-- 
2.7.0


* Re: [PATCH 4/5] mm, oom_reaper: report success/failure
  2016-02-03 13:13 ` [PATCH 4/5] mm, oom_reaper: report success/failure Michal Hocko
@ 2016-02-03 23:10   ` David Rientjes
  2016-02-04  6:46     ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: David Rientjes @ 2016-02-03 23:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

On Wed, 3 Feb 2016, Michal Hocko wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 8e345126d73e..b87acdca2a41 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -420,6 +420,7 @@ static struct task_struct *oom_reaper_th;
>  static struct task_struct *task_to_reap;
>  static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>  
> +#define K(x) ((x) << (PAGE_SHIFT-10))
>  static bool __oom_reap_task(struct task_struct *tsk)
>  {
>  	struct mmu_gather tlb;
> @@ -476,6 +477,11 @@ static bool __oom_reap_task(struct task_struct *tsk)
>  		}
>  	}
>  	tlb_finish_mmu(&tlb, 0, -1);
> +	pr_info("oom_reaper: reaped process %d (%s) anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> +			task_pid_nr(tsk), tsk->comm,
> +			K(get_mm_counter(mm, MM_ANONPAGES)),
> +			K(get_mm_counter(mm, MM_FILEPAGES)),
> +			K(get_mm_counter(mm, MM_SHMEMPAGES)));
>  	up_read(&mm->mmap_sem);
>  
>  	/*

This is a bit misleading: it would appear that the rss values are what
was reaped when in fact they represent just the values of the mm being
reaped. We have already printed these values as an artifact in the
kernel log.

I think it would be helpful to show anon-rss after reaping, however, so we 
can compare to the previous anon-rss that was reported.  And, I agree that 
leaving behind a message in the kernel log that reaping has been 
successful is worthwhile.  So this line should just show what anon-rss is 
after reaping and make it clear that this is not the memory reaped.


* Re: [PATCH 1/5] mm, oom: introduce oom reaper
  2016-02-03 13:13 ` [PATCH 1/5] mm, oom: introduce oom reaper Michal Hocko
@ 2016-02-03 23:48   ` David Rientjes
  2016-02-04  6:41     ` Michal Hocko
  2016-02-06 13:22   ` Tetsuo Handa
  1 sibling, 1 reply; 48+ messages in thread
From: David Rientjes @ 2016-02-03 23:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

On Wed, 3 Feb 2016, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> independently brought up by Oleg Nesterov.
> 
> The OOM killer currently allows killing only a single task in the hope
> that the task will terminate in a reasonable time and free up its
> memory.  Such a task (the OOM victim) gets access to memory reserves
> via mark_oom_victim to allow forward progress should it need
> additional memory during the exit path.
> 
> It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
> construct workloads which break the core assumption mentioned above and
> the OOM victim might take an unbounded amount of time to exit because it
> might be blocked in the uninterruptible state waiting for an event
> (e.g. a lock) held by another task looping in the page
> allocator.
> 
> This patch reduces the probability of such a lockup by introducing a
> specialized kernel thread (oom_reaper) which tries to reclaim additional
> memory by preemptively reaping the anonymous or swapped out memory
> owned by the oom victim under the assumption that such memory won't
> be needed once its owner is killed and kicked out of userspace anyway.
> There is one notable exception, though: if the OOM victim was
> in the process of coredumping, the result would be incomplete. This is
> considered a reasonable constraint because the overall system health is
> more important than the debuggability of a particular application.
> 
> A kernel thread has been chosen because we need a reliable way of
> invocation: a workqueue context is not appropriate because all the
> workers might be busy (e.g. allocating memory), and kswapd, which sounds
> like another good fit, is not appropriate either because it might get
> blocked on locks during reclaim.
> 
> oom_reaper has to take the target task's mmap_sem for reading, so the
> solution is not 100% reliable because the semaphore might be held for
> write, but the probability of a stall is reduced considerably compared
> to basically any lock blocking forward progress as described above. In
> order to avoid blocking on the lock without making forward progress we
> use only a trylock, retried 10 times with a short sleep in between.
> Users of mmap_sem which need it for write should be carefully reviewed
> to use _killable waiting as much as possible and to reduce allocation
> requests done with the lock held to the absolute minimum, lowering the
> risk even further.
> 
> The API between the OOM killer and the OOM reaper is quite trivial.
> wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only the
> NULL->mm transition, and oom_reaper clears it atomically once it is done
> with the work. This means that only a single mm_struct can be reaped at
> a time. As the operation is potentially disruptive we try to limit it to
> the necessary minimum, and the reaper blocks any updates while it
> operates on an mm. The mm_struct is pinned by mm_count to allow a
> parallel exit_mmap, and a race is detected by atomic_inc_not_zero(mm_users).
> 
> Changes since v4
> - drop MAX_RT_PRIO-1 as per David - memcg/cpuset/mempolicy OOM killing
>   might interfere with the rest of the system
> Changes since v3
> - many style/compile fixups by Andrew
> - unmap_mapping_range_tree needs full initialization of zap_details
>   to prevent from missing unmaps and follow up BUG_ON during truncate
>   resp. misaccounting - Kirill/Andrew
> - exclude mlocked pages because they need an explicit munlock by Kirill
> - use subsys_initcall instead of module_init - Paul Gortmaker
> - do not tear down mm if it is shared with the global init because this
>   could lead to SEGV and panic - Tetsuo
> Changes since v2
> - fix mm_count reference leak reported by Tetsuo
> - make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
> - use wait_event_freezable rather than open coded wait loop - suggested
>   by Tetsuo
> Changes since v1
> - fix the screwed up detail->check_swap_entries - Johannes
> - do not use kthread_should_stop because that would need a cleanup
>   and we do not have anybody to stop us - Tetsuo
> - move wake_oom_reaper to oom_kill_process because we have to wait
>   for all tasks sharing the same mm to get killed - Tetsuo
> - do not reap mm structs which are shared with unkillable tasks - Tetsuo
> 
> Suggested-by: Oleg Nesterov <oleg@redhat.com>
> Suggested-by: Mel Gorman <mgorman@suse.de>
> Acked-by: Mel Gorman <mgorman@suse.de>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: David Rientjes <rientjes@google.com>

I think all the patches could really have been squashed together because 
subsequent patches just overwrite already added code.  I was going to 
suggest not doing atomic_inc(&mm->mm_count) in wake_oom_reaper() and 
change oom_kill_process() to do

	if (can_oom_reap)
		wake_oom_reaper(mm);
	else
		mmdrop(mm);

but I see that we don't even touch mm->mm_count after the third patch.


* Re: [PATCH 2/5] oom reaper: handle mlocked pages
  2016-02-03 13:13 ` [PATCH 2/5] oom reaper: handle mlocked pages Michal Hocko
@ 2016-02-03 23:57   ` David Rientjes
  2016-02-23  1:36   ` David Rientjes
  1 sibling, 0 replies; 48+ messages in thread
From: David Rientjes @ 2016-02-03 23:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

On Wed, 3 Feb 2016, Michal Hocko wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> __oom_reap_vmas currently skips over all mlocked vmas because they need
> special treatment before they are unmapped. This is done primarily for
> simplicity, but there is no reason to skip them and reduce the amount of
> reclaimed memory. It is safe from the semantic point of view because
> try_to_unmap_one during the rmap walk would tell the reclaim path to
> cull the page and mlock it again.
> 
> munlock_vma_pages_all is also safe to call from the oom reaper context
> because it relies on no locks other than mmap_sem (held for read).
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: David Rientjes <rientjes@google.com>


* Re: [PATCH 1/5] mm, oom: introduce oom reaper
  2016-02-03 23:48   ` David Rientjes
@ 2016-02-04  6:41     ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-02-04  6:41 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Wed 03-02-16 15:48:18, David Rientjes wrote:
> On Wed, 3 Feb 2016, Michal Hocko wrote:
> 
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> > independently brought up by Oleg Nesterov.
> > 
> > The OOM killer currently allows killing only a single task in the hope
> > that the task will terminate in a reasonable time and free up its
> > memory.  Such a task (the OOM victim) gets access to memory reserves
> > via mark_oom_victim to allow forward progress should it need
> > additional memory during the exit path.
> > 
> > It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
> > construct workloads which break the core assumption mentioned above and
> > the OOM victim might take an unbounded amount of time to exit because it
> > might be blocked in the uninterruptible state waiting for an event
> > (e.g. a lock) held by another task looping in the page
> > allocator.
> > 
> > This patch reduces the probability of such a lockup by introducing a
> > specialized kernel thread (oom_reaper) which tries to reclaim additional
> > memory by preemptively reaping the anonymous or swapped out memory
> > owned by the oom victim under the assumption that such memory won't
> > be needed once its owner is killed and kicked out of userspace anyway.
> > There is one notable exception, though: if the OOM victim was
> > in the process of coredumping, the result would be incomplete. This is
> > considered a reasonable constraint because the overall system health is
> > more important than the debuggability of a particular application.
> > 
> > A kernel thread has been chosen because we need a reliable way of
> > invocation: a workqueue context is not appropriate because all the
> > workers might be busy (e.g. allocating memory), and kswapd, which sounds
> > like another good fit, is not appropriate either because it might get
> > blocked on locks during reclaim.
> > 
> > oom_reaper has to take the target task's mmap_sem for reading, so the
> > solution is not 100% reliable because the semaphore might be held for
> > write, but the probability of a stall is reduced considerably compared
> > to basically any lock blocking forward progress as described above. In
> > order to avoid blocking on the lock without making forward progress we
> > use only a trylock, retried 10 times with a short sleep in between.
> > Users of mmap_sem which need it for write should be carefully reviewed
> > to use _killable waiting as much as possible and to reduce allocation
> > requests done with the lock held to the absolute minimum, lowering the
> > risk even further.
> > 
> > The API between the OOM killer and the OOM reaper is quite trivial.
> > wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only the
> > NULL->mm transition, and oom_reaper clears it atomically once it is done
> > with the work. This means that only a single mm_struct can be reaped at
> > a time. As the operation is potentially disruptive we try to limit it to
> > the necessary minimum, and the reaper blocks any updates while it
> > operates on an mm. The mm_struct is pinned by mm_count to allow a
> > parallel exit_mmap, and a race is detected by atomic_inc_not_zero(mm_users).
> > 
> > Changes since v4
> > - drop MAX_RT_PRIO-1 as per David - memcg/cpuset/mempolicy OOM killing
> >   might interfere with the rest of the system
> > Changes since v3
> > - many style/compile fixups by Andrew
> > - unmap_mapping_range_tree needs full initialization of zap_details
> >   to prevent from missing unmaps and follow up BUG_ON during truncate
> >   resp. misaccounting - Kirill/Andrew
> > - exclude mlocked pages because they need an explicit munlock by Kirill
> > - use subsys_initcall instead of module_init - Paul Gortmaker
> > - do not tear down mm if it is shared with the global init because this
> >   could lead to SEGV and panic - Tetsuo
> > Changes since v2
> > - fix mm_count reference leak reported by Tetsuo
> > - make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
> > - use wait_event_freezable rather than open coded wait loop - suggested
> >   by Tetsuo
> > Changes since v1
> > - fix the screwed up detail->check_swap_entries - Johannes
> > - do not use kthread_should_stop because that would need a cleanup
> >   and we do not have anybody to stop us - Tetsuo
> > - move wake_oom_reaper to oom_kill_process because we have to wait
> >   for all tasks sharing the same mm to get killed - Tetsuo
> > - do not reap mm structs which are shared with unkillable tasks - Tetsuo
> > 
> > Suggested-by: Oleg Nesterov <oleg@redhat.com>
> > Suggested-by: Mel Gorman <mgorman@suse.de>
> > Acked-by: Mel Gorman <mgorman@suse.de>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Acked-by: David Rientjes <rientjes@google.com>

Thanks!
 
> I think all the patches could really have been squashed together because 
> subsequent patches just overwrite already added code. 

The primary reasons are better bisectability and the incremental nature
of the changes.

> I was going to 
> suggest not doing atomic_inc(&mm->mm_count) in wake_oom_reaper() and 
> change oom_kill_process() to do

I found it easier to follow the reference counting that way (pin the mm
at the place where I hand it over to the async thread).

> 
> 	if (can_oom_reap)
> 		wake_oom_reaper(mm);
> 	else
> 		mmdrop(mm);
> 
> but I see that we don't even touch mm->mm_count after the third patch.

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 4/5] mm, oom_reaper: report success/failure
  2016-02-03 23:10   ` David Rientjes
@ 2016-02-04  6:46     ` Michal Hocko
  2016-02-04 22:31       ` David Rientjes
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2016-02-04  6:46 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Wed 03-02-16 15:10:57, David Rientjes wrote:
> On Wed, 3 Feb 2016, Michal Hocko wrote:
> 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 8e345126d73e..b87acdca2a41 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -420,6 +420,7 @@ static struct task_struct *oom_reaper_th;
> >  static struct task_struct *task_to_reap;
> >  static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
> >  
> > +#define K(x) ((x) << (PAGE_SHIFT-10))
> >  static bool __oom_reap_task(struct task_struct *tsk)
> >  {
> >  	struct mmu_gather tlb;
> > @@ -476,6 +477,11 @@ static bool __oom_reap_task(struct task_struct *tsk)
> >  		}
> >  	}
> >  	tlb_finish_mmu(&tlb, 0, -1);
> > +	pr_info("oom_reaper: reaped process %d (%s) anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> > +			task_pid_nr(tsk), tsk->comm,
> > +			K(get_mm_counter(mm, MM_ANONPAGES)),
> > +			K(get_mm_counter(mm, MM_FILEPAGES)),
> > +			K(get_mm_counter(mm, MM_SHMEMPAGES)));
> >  	up_read(&mm->mmap_sem);
> >  
> >  	/*
> 
> This is a bit misleading: it would appear that the rss values are what
> was reaped when in fact they represent just the values of the mm being
> reaped. We have already printed these values as an artifact in the
> kernel log.

Yes, and the idea was to provide the after state so it can be compared
with the before state. That's why I have kept a similar format; I just
dropped the virtual memory size because it doesn't make any sense in
this context now.

> I think it would be helpful to show anon-rss after reaping, however, so we 
> can compare to the previous anon-rss that was reported.  And, I agree that 
> leaving behind a message in the kernel log that reaping has been 
> successful is worthwhile.  So this line should just show what anon-rss is 
> after reaping and make it clear that this is not the memory reaped.

Does
"oom_reaper: reaped process %d (%s) current memory anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB "

sound any better?
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
  2016-02-03 13:14 ` [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing Michal Hocko
@ 2016-02-04 10:49   ` Tetsuo Handa
  2016-02-04 14:53     ` Michal Hocko
  2016-02-17  9:48   ` [PATCH 6/5] oom, oom_reaper: disable oom_reaper for Michal Hocko
  1 sibling, 1 reply; 48+ messages in thread
From: Tetsuo Handa @ 2016-02-04 10:49 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: rientjes, mgorman, oleg, torvalds, hughd, andrea, riel, linux-mm,
	linux-kernel, mhocko

Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> wake_oom_reaper has allowed only one oom victim to be queued. The main
> reason for that was simplicity, as other solutions would require some
> way of queuing. The current approach is racy and that was deemed
> sufficient, as the oom_reaper is considered a best-effort approach to
> help with oom handling when the OOM victim cannot terminate in a
> reasonable time. The race could lead to missing an oom victim which can
> get stuck:
> 
> out_of_memory
>   wake_oom_reaper
>     cmpxchg // OK
>     			oom_reaper
> 			  oom_reap_task
> 			    __oom_reap_task
> oom_victim terminates
> 			      atomic_inc_not_zero // fail
> out_of_memory
>   wake_oom_reaper
>     cmpxchg // fails
> 			  task_to_reap = NULL
> 
> This race requires 2 OOM invocations in a short time period, which is
> not very likely but certainly not impossible. E.g. the original victim
> might not have released a lot of memory for some reason.
> 
> The situation would improve considerably if wake_oom_reaper used more
> robust queuing. This is what this patch implements. This means adding
> an oom_reaper_list list_head into task_struct (filling a hole before the
> embedded thread_struct for that purpose) and an oom_reaper_lock spinlock
> for queuing synchronization. wake_oom_reaper will then add the task to
> the queue and oom_reaper will dequeue it.
> 

I think we want to rewrite this patch's description from a different point
of view.

As of "[PATCH 1/5] mm, oom: introduce oom reaper", we assumed that we try to
manage OOM livelock caused by system-wide OOM events using the OOM reaper.
Therefore, the OOM reaper had high scheduling priority and we considered side
effect of the OOM reaper as a reasonable constraint.

But as the discussion went on, we started trying to manage OOM livelocks
caused by non-system-wide OOM events (e.g. memcg OOM) using the OOM
reaper as well. Therefore, the OOM reaper now has a normal scheduling
priority. For non-system-wide OOM events, the side effects of the OOM
reaper might not be a reasonable constraint; some administrators might
expect that the OOM reaper does not break coredumping unless the system
is under a system-wide OOM event.

The race described in this patch's description sounds as if the 2 OOM
invocations are system-wide OOM events. If we consider only system-wide
OOM events, there is no need to keep task_to_reap != NULL after the OOM
reaper has found a task to reap (shown below), because the existing
victim will prevent the OOM killer from calling wake_oom_reaper().

----------------------------------------
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 012dd6f..c919ddb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1835,9 +1835,6 @@ struct task_struct {
 	unsigned long	task_state_change;
 #endif
 	int pagefault_disabled;
-#ifdef CONFIG_MMU
-	struct list_head oom_reaper_list;
-#endif
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b42c6bc..fa6a302 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -422,10 +422,8 @@ bool oom_killer_disabled __read_mostly;
  * victim (if that is possible) to help the OOM killer to move on.
  */
 static struct task_struct *oom_reaper_th;
+static struct task_struct *task_to_reap;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static LIST_HEAD(oom_reaper_list);
-static DEFINE_SPINLOCK(oom_reaper_lock);
-
 
 static bool __oom_reap_task(struct task_struct *tsk)
 {
@@ -526,20 +524,11 @@ static void oom_reap_task(struct task_struct *tsk)
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct task_struct *tsk = NULL;
+		struct task_struct *tsk;
 
 		wait_event_freezable(oom_reaper_wait,
-				     (!list_empty(&oom_reaper_list)));
-		spin_lock(&oom_reaper_lock);
-		if (!list_empty(&oom_reaper_list)) {
-			tsk = list_first_entry(&oom_reaper_list,
-					struct task_struct, oom_reaper_list);
-			list_del(&tsk->oom_reaper_list);
-		}
-		spin_unlock(&oom_reaper_lock);
-
-		if (tsk)
-			oom_reap_task(tsk);
+				     (tsk = xchg(&task_to_reap, NULL)));
+		oom_reap_task(tsk);
 	}
 
 	return 0;
@@ -551,11 +540,11 @@ static void wake_oom_reaper(struct task_struct *tsk)
 		return;
 
 	get_task_struct(tsk);
-
-	spin_lock(&oom_reaper_lock);
-	list_add(&tsk->oom_reaper_list, &oom_reaper_list);
-	spin_unlock(&oom_reaper_lock);
-	wake_up(&oom_reaper_wait);
+	tsk = xchg(&task_to_reap, tsk);
+	if (!tsk)
+		wake_up(&oom_reaper_wait);
+	else
+		put_task_struct(tsk);
 }
 
 static int __init oom_init(void)
----------------------------------------

But if we consider non-system-wide OOM events, hitting this race is not
so unlikely. The queue is useful for situations where memcg1 and memcg2
hit a memcg OOM at the same time and victim1 in memcg1 cannot terminate
immediately.

I expect parallel reaping (shown below) because there is no need to
serialize victim tasks (e.g. waiting for the reaping of victim1 in
memcg1, which can take up to 1 second, before starting to reap victim2
in memcg2) once we implement this queue.

----------------------------------------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b42c6bc..c2d6472 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -427,7 +427,7 @@ static LIST_HEAD(oom_reaper_list);
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
 
-static bool __oom_reap_task(struct task_struct *tsk)
+static bool oom_reap_task(struct task_struct *tsk)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
@@ -504,42 +504,42 @@ out:
 	return ret;
 }
 
-#define MAX_OOM_REAP_RETRIES 10
-static void oom_reap_task(struct task_struct *tsk)
-{
-	int attempts = 0;
-
-	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
-		schedule_timeout_idle(HZ/10);
-
-	if (attempts > MAX_OOM_REAP_RETRIES) {
-		pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-				task_pid_nr(tsk), tsk->comm);
-		debug_show_all_locks();
-	}
-
-	/* Drop a reference taken by wake_oom_reaper */
-	put_task_struct(tsk);
-}
-
 static int oom_reaper(void *unused)
 {
 	while (true) {
-		struct task_struct *tsk = NULL;
-
+		struct task_struct *tsk;
+		struct task_struct *t;
+		LIST_HEAD(list);
+		int i;
+
 		wait_event_freezable(oom_reaper_wait,
 				     (!list_empty(&oom_reaper_list)));
 		spin_lock(&oom_reaper_lock);
-		if (!list_empty(&oom_reaper_list)) {
-			tsk = list_first_entry(&oom_reaper_list,
-					struct task_struct, oom_reaper_list);
-			list_del(&tsk->oom_reaper_list);
-		}
+		list_splice(&oom_reaper_list, &list);
+		INIT_LIST_HEAD(&oom_reaper_list);
 		spin_unlock(&oom_reaper_lock);
-
-		if (tsk)
-			oom_reap_task(tsk);
+		/* Retry the down_read_trylock(mmap_sem) a few times */
+		for (i = 0; i < 10; i++) {
+			list_for_each_entry_safe(tsk, t, &list,
+						 oom_reaper_list) {
+				if (!oom_reap_task(tsk))
+					continue;
+				list_del(&tsk->oom_reaper_list);
+				/* Drop a reference taken by wake_oom_reaper */
+				put_task_struct(tsk);
+			}
+			if (list_empty(&list))
+				break;
+			schedule_timeout_idle(HZ/10);
+		}
+		if (list_empty(&list))
+			continue;
+		list_for_each_entry(tsk, &list, oom_reaper_list) {
+			pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+				task_pid_nr(tsk), tsk->comm);
+			put_task_struct(tsk);
+		}
+		debug_show_all_locks();
 	}
 
 	return 0;
----------------------------------------

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-03 13:13 ` [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space Michal Hocko
@ 2016-02-04 14:22   ` Tetsuo Handa
  2016-02-04 14:43     ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Tetsuo Handa @ 2016-02-04 14:22 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: rientjes, mgorman, oleg, torvalds, hughd, andrea, riel, linux-mm,
	linux-kernel, mhocko

Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> When oom_reaper manages to unmap all the eligible vmas there shouldn't
> be much of the freable memory held by the oom victim left anymore so it
> makes sense to clear the TIF_MEMDIE flag for the victim and allow the
> OOM killer to select another task.

Just a confirmation. Is it safe to clear TIF_MEMDIE without reaching do_exit()
with regard to freezing_slow_path()? Since clearing TIF_MEMDIE from the OOM
reaper confuses

    wait_event(oom_victims_wait, !atomic_read(&oom_victims));

in oom_killer_disable(), I'm worried that the freezing operation continues
before the OOM victim which escaped the __refrigerator() actually releases
memory. Does this cause a consistency problem?
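
For reference, the interaction I am worried about looks roughly like this
(a paraphrased sketch of the relevant mm/oom_kill.c pieces, not the exact
code):

----------------------------------------
/*
 * Suspend path: must not return while an OOM victim may still be
 * releasing memory.
 */
bool oom_killer_disable(void)
{
	oom_killer_disabled = true;
	/* Woken up when the last victim calls exit_oom_victim(). */
	wait_event(oom_victims_wait, !atomic_read(&oom_victims));
	return true;
}

/*
 * With this patch the OOM reaper can call this right after unmapping,
 * i.e. possibly before the (frozen) victim has reached do_exit().
 */
void exit_oom_victim(struct task_struct *tsk)
{
	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
		return;

	if (!atomic_dec_return(&oom_victims))
		wake_up_all(&oom_victims_wait);
}
----------------------------------------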

> +	/*
> +	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
> +	 * reasonably reclaimable memory anymore. OOM killer can continue
> +	 * by selecting other victim if unmapping hasn't led to any
> +	 * improvements. This also means that selecting this task doesn't
> +	 * make any sense.
> +	 */
> +	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
> +	exit_oom_victim(tsk);

I noticed that updating only one thread group's oom_score_adj disables
further wake_oom_reaper() calls due to the coarse-grained can_oom_reap check at

  p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN

in oom_kill_process(). I think we need to either update all thread groups'
oom_score_adj using the reaped mm equally or use more fine-grained can_oom_reap
check which ignores OOM_SCORE_ADJ_MIN if all threads in that thread group are
dying or exiting.

----------
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>

/* Keep dirtying page cache by appending to a file as long as write() succeeds. */
static int writer(void *unused)
{
	static char buffer[4096];
	int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
	while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer));
	return 0;
}

int main(int argc, char *argv[])
{
	unsigned long size;
	char *buf = NULL;
	unsigned long i;
	if (fork() == 0) {
		/* Child: volunteer as the preferred OOM victim. */
		int fd = open("/proc/self/oom_score_adj", O_WRONLY);
		write(fd, "1000", 4);
		close(fd);
		/*
		 * Spawn tasks sharing the mm (CLONE_VM) but not the signal
		 * handlers (no CLONE_SIGHAND), i.e. separate thread groups
		 * using a single mm.
		 */
		for (i = 0; i < 2; i++) {
			char *stack = malloc(4096);
			if (stack)
				clone(writer, stack + 4096, CLONE_VM, NULL);
		}
		writer(NULL);
		while (1)
			pause();
	}
	sleep(1);
	/* Reserve as much virtual memory as overcommit allows. */
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sleep(5);
	/* Will cause OOM due to overcommit */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	pause();
	return 0;
}
----------

----------
[  177.722853] a.out invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[  177.724956] a.out cpuset=/ mems_allowed=0
[  177.725735] CPU: 3 PID: 3962 Comm: a.out Not tainted 4.5.0-rc2-next-20160204 #291
(...snipped...)
[  177.802889] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[  177.872248] [ 3941]  1000  3941    28880      124      14       3        0             0 bash
[  177.874279] [ 3962]  1000  3962   541717   395780     784       6        0             0 a.out
[  177.876274] [ 3963]  1000  3963     1078       21       7       3        0          1000 a.out
[  177.878261] [ 3964]  1000  3964     1078       21       7       3        0          1000 a.out
[  177.880194] [ 3965]  1000  3965     1078       21       7       3        0          1000 a.out
[  177.882262] Out of memory: Kill process 3963 (a.out) score 998 or sacrifice child
[  177.884129] Killed process 3963 (a.out) total-vm:4312kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  177.887100] oom_reaper: reaped process :3963 (a.out) anon-rss:0kB, file-rss:0kB, shmem-rss:0lB
[  179.638399] crond invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
[  179.647708] crond cpuset=/ mems_allowed=0
[  179.652996] CPU: 3 PID: 742 Comm: crond Not tainted 4.5.0-rc2-next-20160204 #291
(...snipped...)
[  179.771311] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[  179.836221] [ 3941]  1000  3941    28880      124      14       3        0             0 bash
[  179.838278] [ 3962]  1000  3962   541717   396308     785       6        0             0 a.out
[  179.840328] [ 3963]  1000  3963     1078        0       7       3        0         -1000 a.out
[  179.842443] [ 3965]  1000  3965     1078        0       7       3        0          1000 a.out
[  179.844557] Out of memory: Kill process 3965 (a.out) score 998 or sacrifice child
[  179.846404] Killed process 3965 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
----------

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-04 14:22   ` Tetsuo Handa
@ 2016-02-04 14:43     ` Michal Hocko
  2016-02-04 15:08       ` Tetsuo Handa
  2016-02-06  6:45       ` Michal Hocko
  0 siblings, 2 replies; 48+ messages in thread
From: Michal Hocko @ 2016-02-04 14:43 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Thu 04-02-16 23:22:18, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > When oom_reaper manages to unmap all the eligible vmas there shouldn't
> > be much of the freable memory held by the oom victim left anymore so it
> > makes sense to clear the TIF_MEMDIE flag for the victim and allow the
> > OOM killer to select another task.
> 
> Just a confirmation. Is it safe to clear TIF_MEMDIE without reaching do_exit()
> with regard to freezing_slow_path()? Since clearing TIF_MEMDIE from the OOM
> reaper confuses
> 
>     wait_event(oom_victims_wait, !atomic_read(&oom_victims));
> 
> in oom_killer_disable(), I'm worried that the freezing operation continues
> before the OOM victim which escaped the __refrigerator() actually releases
> memory. Does this cause a consistency problem?

This is a good question! At first sight it seems this is not safe and we
might need to make the oom_reaper freezable so that it doesn't wake up
during suspend and interfere. Let me think about that.

> > +	/*
> > +	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
> > +	 * reasonably reclaimable memory anymore. OOM killer can continue
> > +	 * by selecting other victim if unmapping hasn't led to any
> > +	 * improvements. This also means that selecting this task doesn't
> > +	 * make any sense.
> > +	 */
> > +	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
> > +	exit_oom_victim(tsk);
> 
> I noticed that updating only one thread group's oom_score_adj disables
> further wake_oom_reaper() calls due to the coarse-grained can_oom_reap check at
> 
>   p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
> 
> in oom_kill_process(). I think we need to either update all thread groups'
> oom_score_adj using the reaped mm equally or use more fine-grained can_oom_reap
> check which ignores OOM_SCORE_ADJ_MIN if all threads in that thread group are
> dying or exiting.

I do not understand. Why would you want to reap the mm again when
this has been done already? The mm is shared, right?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
  2016-02-04 10:49   ` Tetsuo Handa
@ 2016-02-04 14:53     ` Michal Hocko
  2016-02-06  5:54       ` Tetsuo Handa
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2016-02-04 14:53 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Thu 04-02-16 19:49:29, Tetsuo Handa wrote:
[...]
> I think we want to rewrite this patch's description from a different point
> of view.
> 
> As of "[PATCH 1/5] mm, oom: introduce oom reaper", we assumed that we try to
> manage OOM livelock caused by system-wide OOM events using the OOM reaper.
> Therefore, the OOM reaper had high scheduling priority and we considered side
> effect of the OOM reaper as a reasonable constraint.
> 
> But as the discussion went by, we started to try to manage OOM livelock
> caused by non system-wide OOM events (e.g. memcg OOM) using the OOM reaper.
> Therefore, the OOM reaper now has normal scheduling priority. For non
> system-wide OOM events, side effect of the OOM reaper might not be a
> reasonable constraint. Some administrator might expect that the OOM reaper
> does not break coredumping unless the system is under system-wide OOM events.

I am willing to discuss this as an option after we actually hear about a
_real_ usecase.

[...]

> But if we consider non system-wide OOM events, it is not very unlikely to hit
> this race. This queue is useful for situations where memcg1 and memcg2 hit
> memcg OOM at the same time and victim1 in memcg1 cannot terminate immediately.

This can happen of course but the likelihood is _much_ smaller without
the global OOM because the memcg OOM killer is invoked from a lockless
context so the oom context cannot block the victim from proceeding.

> I expect parallel reaping (shown below) because there is no need to serialize
> victim tasks (e.g. wait for reaping victim1 in memcg1 which can take up to
> 1 second to complete before starting to reap victim2 in memcg2) if we implement
> this queue.

I would really prefer to go a simpler way first and extend the code when
we see the current approach insufficient for real life loads. Please do
not get me wrong, of course the code can be enhanced in many different
ways and optimize for lots of pathological cases but I really believe
that we should start with correctness first and only later care about
optimizing corner cases. Realistically, who really cares about
oom_reaper acting on the Nth completely stuck task after N seconds?
For now, my target is to guarantee that oom_reaper will _eventually_
process a queued task and reap its memory if the target hasn't exited
yet. I do not think this is an unreasonable goal...

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-04 14:43     ` Michal Hocko
@ 2016-02-04 15:08       ` Tetsuo Handa
  2016-02-04 16:31         ` Michal Hocko
  2016-02-06  6:45       ` Michal Hocko
  1 sibling, 1 reply; 48+ messages in thread
From: Tetsuo Handa @ 2016-02-04 15:08 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> > > +	/*
> > > +	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
> > > +	 * reasonably reclaimable memory anymore. OOM killer can continue
> > > +	 * by selecting other victim if unmapping hasn't led to any
> > > +	 * improvements. This also means that selecting this task doesn't
> > > +	 * make any sense.
> > > +	 */
> > > +	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
> > > +	exit_oom_victim(tsk);
> > 
> > I noticed that updating only one thread group's oom_score_adj disables
> > further wake_oom_reaper() calls due to the coarse-grained can_oom_reap check at
> > 
> >   p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
> > 
> > in oom_kill_process(). I think we need to either update all thread groups'
> > oom_score_adj using the reaped mm equally or use more fine-grained can_oom_reap
> > check which ignores OOM_SCORE_ADJ_MIN if all threads in that thread group are
> > dying or exiting.
> 
> I do not understand. Why would you want to reap the mm again when
> this has been done already? The mm is shared, right?

The mm is shared between previous victim and next victim, but these victims
are in different thread groups. The OOM killer selects next victim whose mm
was already reaped due to sharing previous victim's memory. We don't want
the OOM killer to select such next victim. Maybe set MMF_OOM_REAP_DONE on
the previous victim's mm and check it instead of TIF_MEMDIE when selecting
a victim? That will also avoid problems caused by clearing TIF_MEMDIE?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-04 15:08       ` Tetsuo Handa
@ 2016-02-04 16:31         ` Michal Hocko
  2016-02-05 11:14           ` Tetsuo Handa
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2016-02-04 16:31 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Fri 05-02-16 00:08:25, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > > +	/*
> > > > +	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
> > > > +	 * reasonably reclaimable memory anymore. OOM killer can continue
> > > > +	 * by selecting other victim if unmapping hasn't led to any
> > > > +	 * improvements. This also means that selecting this task doesn't
> > > > +	 * make any sense.
> > > > +	 */
> > > > +	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
> > > > +	exit_oom_victim(tsk);
> > > 
> > > I noticed that updating only one thread group's oom_score_adj disables
> > > further wake_oom_reaper() calls due to the coarse-grained can_oom_reap check at
> > > 
> > >   p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
> > > 
> > > in oom_kill_process(). I think we need to either update all thread groups'
> > > oom_score_adj using the reaped mm equally or use more fine-grained can_oom_reap
> > > check which ignores OOM_SCORE_ADJ_MIN if all threads in that thread group are
> > > dying or exiting.
> > 
> > I do not understand. Why would you want to reap the mm again when
> > this has been done already? The mm is shared, right?
> 
> The mm is shared between previous victim and next victim, but these victims
> are in different thread groups. The OOM killer selects next victim whose mm
> was already reaped due to sharing previous victim's memory.

OK, now I got your point. From your previous email it sounded like you
were talking about oom_reaper and its invocation, which was confusing.

> We don't want the OOM killer to select such next victim.

Yes, selecting such a task doesn't make much sense. It has been killed
so it has fatal_signal_pending. If it wanted to allocate it would get
TIF_MEMDIE already and its address space has been reaped so there is
nothing to free left. This CLONE_VM without CLONE_SIGHAND is a really
crazy combo, it is just causing troubles all over and I am not convinced
it is actually that helpful </rant>.
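
To illustrate the combo, a minimal userspace sketch (the same pattern your
reproducer above uses):

----------------------------------------
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <unistd.h>

static int worker(void *unused)
{
	pause();	/* keep referencing the shared mm */
	return 0;
}

int main(void)
{
	char *stack = malloc(4096);

	/*
	 * CLONE_VM without CLONE_SIGHAND/CLONE_THREAD: the child shares
	 * our mm but lives in a separate thread group, so a SIGKILL
	 * aimed at one of us leaves the other alive and still using
	 * the same mm.
	 */
	if (stack)
		clone(worker, stack + 4096, CLONE_VM, NULL);
	pause();
	return 0;
}
----------------------------------------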


> Maybe set MMF_OOM_REAP_DONE on
> the previous victim's mm and check it instead of TIF_MEMDIE when selecting
> a victim? That will also avoid problems caused by clearing TIF_MEMDIE?

Hmm, it doesn't seem we are under MMF_ available bits pressure right now
so using the flag sounds like the easiest way to go. Then we even do not
have to play with OOM_SCORE_ADJ_MIN which might be updated from the
userspace after the oom reaper has done that. Care to send a patch?

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/5] mm, oom_reaper: report success/failure
  2016-02-04  6:46     ` Michal Hocko
@ 2016-02-04 22:31       ` David Rientjes
  2016-02-05  9:26         ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: David Rientjes @ 2016-02-04 22:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Thu, 4 Feb 2016, Michal Hocko wrote:

> > I think it would be helpful to show anon-rss after reaping, however, so we 
> > can compare to the previous anon-rss that was reported.  And, I agree that 
> > leaving behind a message in the kernel log that reaping has been 
> > successful is worthwhile.  So this line should just show what anon-rss is 
> > after reaping and make it clear that this is not the memory reaped.
> 
> Does
> "oom_reaper: reaped process %d (%s) current memory anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB "
> 
> sound any better?

oom_reaper: reaped process %d (%s), now anon-rss:%lukB

would probably be better until additional support is added to do other 
kinds of reaping other than just primarily heap.  This should help to 
quantify the exact amount of memory that could be reaped (or otherwise 
unmapped) iff oom_reaper has to get involved rather than fluctuations that
have nothing to do with it.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/5] mm, oom_reaper: report success/failure
  2016-02-04 22:31       ` David Rientjes
@ 2016-02-05  9:26         ` Michal Hocko
  2016-02-06  6:34           ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2016-02-05  9:26 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Thu 04-02-16 14:31:26, David Rientjes wrote:
> On Thu, 4 Feb 2016, Michal Hocko wrote:
> 
> > > I think it would be helpful to show anon-rss after reaping, however, so we 
> > > can compare to the previous anon-rss that was reported.  And, I agree that 
> > > leaving behind a message in the kernel log that reaping has been 
> > > successful is worthwhile.  So this line should just show what anon-rss is 
> > > after reaping and make it clear that this is not the memory reaped.
> > 
> > Does
> > "oom_reaper: reaped process %d (%s) current memory anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB "
> > 
> > sound any better?
> 
> oom_reaper: reaped process %d (%s), now anon-rss:%lukB
> 
> would probably be better until additional support is added to do other 
> kinds of reaping other than just primarily heap.  This should help to 
> quantify the exact amount of memory that could be reaped (or otherwise 
> unmapped) iff oom_reaper has to get involved rather than fluctuations that
> have nothing to do with it.

---
>From 402090df64de7f80d7d045b0b17e860220837fa6 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Fri, 5 Feb 2016 10:24:23 +0100
Subject: [PATCH] mm-oom_reaper-report-success-failure-fix

update the log message to be more specific

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 87d644c97ac9..ca61e6cfae52 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -479,7 +479,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 		}
 	}
 	tlb_finish_mmu(&tlb, 0, -1);
-	pr_info("oom_reaper: reaped process :%d (%s) anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lulB\n",
+	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lulB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
 			K(get_mm_counter(mm, MM_FILEPAGES)),
-- 
2.7.0


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-04 16:31         ` Michal Hocko
@ 2016-02-05 11:14           ` Tetsuo Handa
  2016-02-06  8:30             ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Tetsuo Handa @ 2016-02-05 11:14 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> On Fri 05-02-16 00:08:25, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > > +	/*
> > > > > +	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
> > > > > +	 * reasonably reclaimable memory anymore. OOM killer can continue
> > > > > +	 * by selecting other victim if unmapping hasn't led to any
> > > > > +	 * improvements. This also means that selecting this task doesn't
> > > > > +	 * make any sense.
> > > > > +	 */
> > > > > +	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
> > > > > +	exit_oom_victim(tsk);
> > > >
> > > > I noticed that updating only one thread group's oom_score_adj disables
> > > > further wake_oom_reaper() calls due to the coarse-grained can_oom_reap check at
> > > >
> > > >   p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
> > > >
> > > > in oom_kill_process(). I think we need to either update all thread groups'
> > > > oom_score_adj using the reaped mm equally or use more fine-grained can_oom_reap
> > > > check which ignores OOM_SCORE_ADJ_MIN if all threads in that thread group are
> > > > dying or exiting.
> > >
> > > I do not understand. Why would you want to reap the mm again when
> > > this has been done already? The mm is shared, right?
> >
> > The mm is shared between previous victim and next victim, but these victims
> > are in different thread groups. The OOM killer selects next victim whose mm
> > was already reaped due to sharing previous victim's memory.
>
> OK, now I got your point. From your previous email it sounded like you
> were talking about oom_reaper and its invocation, which was confusing.
>
> > We don't want the OOM killer to select such next victim.
>
> Yes, selecting such a task doesn't make much sense. It has been killed
> so it has fatal_signal_pending. If it wanted to allocate it would get
> TIF_MEMDIE already and its address space has been reaped so there is
> nothing to free left. This CLONE_VM without CLONE_SIGHAND is a really
> crazy combo, it is just causing troubles all over and I am not convinced
> it is actually that helpful </rant>.
>

I think moving "whether a mm is reapable or not" check to the OOM reaper
is preferable (shown below). In most cases, mm_is_reapable() will return
true.

----------------------------------------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b42c6bc..fc114b3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -426,6 +426,39 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static LIST_HEAD(oom_reaper_list);
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
+static bool mm_is_reapable(struct mm_struct *mm)
+{
+	struct task_struct *g;
+	struct task_struct *p;
+
+	/*
+	 * Since it is possible that p voluntarily called do_exit() or
+	 * somebody other than the OOM killer sent SIGKILL on p, this mm used
+	 * by p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN is reapable if p
+	 * has pending SIGKILL or already reached do_exit().
+	 *
+	 * On the other hand, it is possible that mark_oom_victim(p) is called
+	 * without sending SIGKILL to all tasks using this mm. In this case,
+	 * the OOM reaper cannot reap this mm unless p is the only task using
+	 * this mm.
+	 *
+	 * Therefore, determine whether this mm is reapable by testing whether
+	 * all tasks using this mm are dying or already exiting rather than
+	 * depending on p->signal->oom_score_adj value which is updated by the
+	 * OOM reaper.
+	 */
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (mm != READ_ONCE(p->mm) ||
+		    fatal_signal_pending(p) || (p->flags & PF_EXITING))
+			continue;
+		mm = NULL;
+		goto out;
+	}
+ out:
+	rcu_read_unlock();
+	return mm != NULL;
+}
 
 static bool __oom_reap_task(struct task_struct *tsk)
 {
@@ -455,7 +488,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
 
 	task_unlock(p);
 
-	if (!down_read_trylock(&mm->mmap_sem)) {
+	if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) {
 		ret = false;
 		goto out;
 	}
@@ -596,6 +629,7 @@ void mark_oom_victim(struct task_struct *tsk)
 	 */
 	__thaw_task(tsk);
 	atomic_inc(&oom_victims);
+	wake_oom_reaper(tsk);
 }
 
 /**
@@ -680,7 +714,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -771,23 +804,17 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 			continue;
 		if (same_thread_group(p, victim))
 			continue;
-		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
-			/*
-			 * We cannot use oom_reaper for the mm shared by this
-			 * process because it wouldn't get killed and so the
-			 * memory might be still used.
-			 */
-			can_oom_reap = false;
+		if (unlikely(p->flags & PF_KTHREAD))
 			continue;
-		}
+		if (is_global_init(p))
+			continue;
+		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+			continue;
+
 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
 	}
 	rcu_read_unlock();
 
-	if (can_oom_reap)
-		wake_oom_reaper(victim);
-
 	mmdrop(mm);
 	put_task_struct(victim);
 }
----------------------------------------

Then, I think we need to kill two lies in allocation retry loop.

The first lie is that we pretend as if we are making forward progress
without hitting the OOM killer. This lie is preventing any OOM victim
(with SIGKILL and without TIF_MEMDIE) doing !__GFP_FS allocations from
hitting the OOM killer, which can prevent the OOM victim tasks from
reaching do_exit(), and which conflicts with the assumption

  * That thread will now get access to memory reserves since it has a
  * pending fatal signal.

in oom_kill_process(), and with similar assertions such as the one in this patch's description.

  The lack of TIF_MEMDIE also means that the victim cannot access memory
  reserves anymore but that shouldn't be a problem because it would get
  the access again if it needs to allocate and hits the OOM killer again
  due to the fatal_signal_pending resp. PF_EXITING check.

If the callers of !__GFP_FS allocations do not need to loop until somebody
else reclaims memory on their behalf, they can add __GFP_NORETRY. Otherwise,
the callers of !__GFP_FS allocation can and should call out_of_memory().
That's all we need for killing this lie.
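
For example, a hypothetical !__GFP_FS call site that prefers failing over
looping could do:

----------------------------------------
	struct page *page;

	/*
	 * GFP_NOFS cannot invoke the OOM killer; opt out of the endless
	 * retry loop and handle the failure instead.
	 */
	page = alloc_pages(GFP_NOFS | __GFP_NORETRY, 0);
	if (!page)
		return -ENOMEM;
----------------------------------------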

----------------------------------------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1668159..67591a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2772,20 +2772,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		/* The OOM killer does not needlessly kill tasks for lowmem */
 		if (ac->high_zoneidx < ZONE_NORMAL)
 			goto out;
-		/* The OOM killer does not compensate for IO-less reclaim */
-		if (!(gfp_mask & __GFP_FS)) {
-			/*
-			 * XXX: Page reclaim didn't yield anything,
-			 * and the OOM killer can't be invoked, but
-			 * keep looping as per tradition.
-			 *
-			 * But do not keep looping if oom_killer_disable()
-			 * was already called, for the system is trying to
-			 * enter a quiescent state during suspend.
-			 */
-			*did_some_progress = !oom_killer_disabled;
-			goto out;
-		}
 		if (pm_suspended_storage())
 			goto out;
 		/* The OOM killer may not free memory on a specific node */
----------------------------------------

The second lie is that we pretend as if we are making forward progress
without taking any action when the OOM killer found a TIF_MEMDIE task.
This lie is preventing any task (without SIGKILL and without TIF_MEMDIE)
which is blocking the OOM victim (which might be looping without getting
TIF_MEMDIE due to doing !__GFP_FS allocation) from making forward progress
if the OOM reaper does not clear TIF_MEMDIE.

----------------------------------------
	/* Retry as long as the OOM killer is making progress */
	if (did_some_progress) {
		no_progress_loops = 0;
		goto retry;
	}
----------------------------------------
        vs.
----------------------------------------
	/*
	 * This task already has access to memory reserves and is being killed.
	 * Don't allow any other task to have access to the reserves.
	 */
	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
		if (!is_sysrq_oom(oc))
			return OOM_SCAN_ABORT;
	}
----------------------------------------

By moving "whether a mm is reapable or not" check to the OOM reaper, we can
delegate the duty of clearing TIF_MEMDIE to the OOM reaper because the OOM
reaper is tracking all TIF_MEMDIE tasks. Since mm_is_reapable() can return
true for most situations, it becomes an unlikely corner case that we need to
clear TIF_MEMDIE and prevent the OOM killer from setting TIF_MEMDIE on the
same task again when the OOM reaper gave up. Like you commented in [PATCH 5/5],
falling back to a simple timer would be sufficient for handling such corner cases.

| I would really prefer to go a simpler way first and extend the code when
| we see the current approach insufficient for real life loads. Please do
| not get me wrong, of course the code can be enhanced in many different
| ways and optimize for lots of pathological cases but I really believe
| that we should start with correctness first and only later care about
| optimizing corner cases.

>
> > Maybe set MMF_OOM_REAP_DONE on
> > the previous victim's mm and check it instead of TIF_MEMDIE when selecting
> > a victim? That will also avoid problems caused by clearing TIF_MEMDIE?
>
> > Hmm, it doesn't seem we are under MMF_ available bits pressure right now
> so using the flag sounds like the easiest way to go. Then we even do not
> have to play with OOM_SCORE_ADJ_MIN which might be updated from the
> userspace after the oom reaper has done that. Care to send a patch?

Not only do we not need to worry about ->oom_score_adj being modified from
outside the SIGKILL pending tasks, I think we also don't need to clear remote
TIF_MEMDIE if we use MMF_OOM_REAP_DONE. Something like the untested patch below?

----------------------------------------
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 45993b8..03e6257 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -91,7 +91,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 
 extern bool out_of_memory(struct oom_control *oc);
 
-extern void exit_oom_victim(struct task_struct *tsk);
+extern void exit_oom_victim(void);
 
 extern int register_oom_notifier(struct notifier_block *nb);
 extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 012dd6f..442ba46 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -515,6 +515,8 @@ static inline int get_dumpable(struct mm_struct *mm)
 
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
+#define MMF_OOM_REAP_DONE       21      /* set when OOM reap completed */
+
 struct sighand_struct {
 	atomic_t		count;
 	struct k_sigaction	action[_NSIG];
diff --git a/kernel/exit.c b/kernel/exit.c
index ba3bd29..10e0882 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -434,7 +434,7 @@ static void exit_mm(struct task_struct *tsk)
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim(tsk);
+		exit_oom_victim();
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b42c6bc..b67d8bf 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -271,19 +271,24 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc,
 enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,
 			struct task_struct *task, unsigned long totalpages)
 {
+	struct mm_struct *mm;
+
 	if (oom_unkillable_task(task, NULL, oc->nodemask))
 		return OOM_SCAN_CONTINUE;
 
+	mm = task->mm;
+	if (!mm)
+		return OOM_SCAN_CONTINUE;
 	/*
 	 * This task already has access to memory reserves and is being killed.
-	 * Don't allow any other task to have access to the reserves.
+	 * Don't allow any other task to have access to the reserves unless
+	 * this task's memory was OOM reaped.
 	 */
 	if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
-		if (!is_sysrq_oom(oc))
+		if (!is_sysrq_oom(oc) &&
+		    !test_bit(MMF_OOM_REAP_DONE, &mm->flags))
 			return OOM_SCAN_ABORT;
 	}
-	if (!task->mm)
-		return OOM_SCAN_CONTINUE;
 
 	/*
 	 * If task is allocating a lot of memory and has been marked to be
@@ -448,7 +453,8 @@ static bool __oom_reap_task(struct task_struct *tsk)
 		return true;
 
 	mm = p->mm;
-	if (!atomic_inc_not_zero(&mm->mm_users)) {
+	if (test_bit(MMF_OOM_REAP_DONE, &mm->flags) ||
+	    !atomic_inc_not_zero(&mm->mm_users)) {
 		task_unlock(p);
 		return true;
 	}
@@ -491,14 +497,11 @@ static bool __oom_reap_task(struct task_struct *tsk)
 	up_read(&mm->mmap_sem);
 
 	/*
-	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
-	 * reasonably reclaimable memory anymore. OOM killer can continue
-	 * by selecting other victim if unmapping hasn't led to any
-	 * improvements. This also means that selecting this task doesn't
-	 * make any sense.
+	 * Set MMF_OOM_REAP_DONE on this mm so that OOM killer can continue
+	 * by selecting other victim which does not use this mm if unmapping
+	 * this mm hasn't led to any improvements.
 	 */
-	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
-	exit_oom_victim(tsk);
+	set_bit(MMF_OOM_REAP_DONE, &mm->flags);
 out:
 	mmput(mm);
 	return ret;
@@ -601,10 +604,9 @@ void mark_oom_victim(struct task_struct *tsk)
 /**
  * exit_oom_victim - note the exit of an OOM victim
  */
-void exit_oom_victim(struct task_struct *tsk)
+void exit_oom_victim(void)
 {
-	if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
-		return;
+	clear_thread_flag(TIF_MEMDIE);
 
 	if (!atomic_dec_return(&oom_victims))
 		wake_up_all(&oom_victims_wait);
----------------------------------------

I suggested many changes in this post because [PATCH 3/5] and [PATCH 5/5]
made it possible for us to simplify [PATCH 1/5] back towards the old versions.
I think you will want to rebuild this series with these changes merged as
appropriate.

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
  2016-02-04 14:53     ` Michal Hocko
@ 2016-02-06  5:54       ` Tetsuo Handa
  2016-02-06  8:37         ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Tetsuo Handa @ 2016-02-06  5:54 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> > But if we consider non system-wide OOM events, it is not very unlikely to hit
> > this race. This queue is useful for situations where memcg1 and memcg2 hit
> > memcg OOM at the same time and victim1 in memcg1 cannot terminate immediately.
> 
> This can happen of course but the likelihood is _much_ smaller without
> the global OOM because the memcg OOM killer is invoked from a lockless
> context so the oom context cannot block the victim from proceeding.

Suppose mem_cgroup_out_of_memory() is called from a lockless context via
mem_cgroup_oom_synchronize() called from pagefault_out_of_memory(); that
"lockless" only applies to the current thread, doesn't it?

Since oom_kill_process() sets TIF_MEMDIE on the first mm!=NULL thread of a
victim process, it is possible that a non-first mm!=NULL thread triggers
pagefault_out_of_memory() and the first mm!=NULL thread gets TIF_MEMDIE,
isn't it?
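
(Here "first mm!=NULL thread" means the thread which find_lock_task_mm()
returns; roughly, paraphrased from mm/oom_kill.c:)

----------------------------------------
/*
 * Return the first live (mm != NULL) thread of the process p, with
 * task_lock held on the returned thread.
 */
struct task_struct *find_lock_task_mm(struct task_struct *p)
{
	struct task_struct *t;

	rcu_read_lock();
	for_each_thread(p, t) {
		task_lock(t);
		if (likely(t->mm))
			goto found;
		task_unlock(t);
	}
	t = NULL;
found:
	rcu_read_unlock();
	return t;
}
----------------------------------------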

Then, where is the guarantee that victim1 (first mm!=NULL thread in memcg1
which got TIF_MEMDIE) is not waiting at down_read(&victim2->mm->mmap_sem)
when victim2 (first mm!=NULL thread in memcg2 which got TIF_MEMDIE) is
waiting at down_write(&victim2->mm->mmap_sem) or both victim1 and victim2
are waiting on a lock somewhere in memory reclaim path (e.g.
mutex_lock(&inode->i_mutex))?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 4/5] mm, oom_reaper: report success/failure
  2016-02-05  9:26         ` Michal Hocko
@ 2016-02-06  6:34           ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-02-06  6:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Fri 05-02-16 10:26:40, Michal Hocko wrote:
[...]
> From 402090df64de7f80d7d045b0b17e860220837fa6 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Fri, 5 Feb 2016 10:24:23 +0100
> Subject: [PATCH] mm-oom_reaper-report-success-failure-fix
> 
> update the log message to be more specific
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/oom_kill.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 87d644c97ac9..ca61e6cfae52 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -479,7 +479,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
>  		}
>  	}
>  	tlb_finish_mmu(&tlb, 0, -1);
> -	pr_info("oom_reaper: reaped process :%d (%s) anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lulB\n",
> +	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lulB\n",

Dohh, s@lulB@lukB@

>  			task_pid_nr(tsk), tsk->comm,
>  			K(get_mm_counter(mm, MM_ANONPAGES)),
>  			K(get_mm_counter(mm, MM_FILEPAGES)),
> -- 
> 2.7.0
> 
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-04 14:43     ` Michal Hocko
  2016-02-04 15:08       ` Tetsuo Handa
@ 2016-02-06  6:45       ` Michal Hocko
  2016-02-06 14:33         ` Tetsuo Handa
  1 sibling, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2016-02-06  6:45 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Thu 04-02-16 15:43:19, Michal Hocko wrote:
> On Thu 04-02-16 23:22:18, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > When oom_reaper manages to unmap all the eligible vmas there shouldn't
> > > be much of the freable memory held by the oom victim left anymore so it
> > > makes sense to clear the TIF_MEMDIE flag for the victim and allow the
> > > OOM killer to select another task.
> > 
> > Just a confirmation. Is it safe to clear TIF_MEMDIE without reaching do_exit()
> > with regard to freezing_slow_path()? Since clearing TIF_MEMDIE from the OOM
> > reaper confuses
> > 
> >     wait_event(oom_victims_wait, !atomic_read(&oom_victims));
> > 
> > in oom_killer_disable(), I'm worried that the freezing operation continues
> > before the OOM victim which escaped the __refrigerator() actually releases
> > memory. Does this cause a consistency problem?
> 
> This is a good question! At first sight it seems this is not safe and we
> might need to make the oom_reaper freezable so that it doesn't wake up
> during suspend and interfere. Let me think about that.

OK, I was thinking about it some more and it seems you are right here.
oom_reaper as a kernel thread is not freezable automatically and so it
might interfere after all the processes/kernel threads are considered
frozen. Then it really might clear TIF_MEMDIE too early and wake up
oom_killer_disable(). wait_event_freezable is not sufficient because the
oom_reaper might be running while the PM freezer is freezing tasks, and
the freezer will simply miss it because it doesn't see the reaper as
freezable.

So I think we might need this. I am heading to vacation today and will
be offline for the next week so I will prepare the full patch with the
proper changelog after I get back:

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ca61e6cfae52..7e9953a64489 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -521,6 +521,8 @@ static void oom_reap_task(struct task_struct *tsk)
 
 static int oom_reaper(void *unused)
 {
+	set_freezable();
+
 	while (true) {
 		struct task_struct *tsk = NULL;
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-05 11:14           ` Tetsuo Handa
@ 2016-02-06  8:30             ` Michal Hocko
  2016-02-06 11:23               ` Tetsuo Handa
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2016-02-06  8:30 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Fri 05-02-16 20:14:40, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 05-02-16 00:08:25, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > > > +	/*
> > > > > > +	 * Clear TIF_MEMDIE because the task shouldn't be sitting on a
> > > > > > +	 * reasonably reclaimable memory anymore. OOM killer can continue
> > > > > > +	 * by selecting other victim if unmapping hasn't led to any
> > > > > > +	 * improvements. This also means that selecting this task doesn't
> > > > > > +	 * make any sense.
> > > > > > +	 */
> > > > > > +	tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
> > > > > > +	exit_oom_victim(tsk);
> > > > >
> > > > > I noticed that updating only one thread group's oom_score_adj disables
> > > > > further wake_oom_reaper() calls due to the coarse-grained can_oom_reap check at
> > > > >
> > > > >   p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN
> > > > >
> > > > > in oom_kill_process(). I think we need to either update all thread groups'
> > > > > oom_score_adj using the reaped mm equally or use more fine-grained can_oom_reap
> > > > > check which ignores OOM_SCORE_ADJ_MIN if all threads in that thread group are
> > > > > dying or exiting.
> > > >
> > > > I do not understand. Why would you want to reap the mm again when
> > > > this has been done already? The mm is shared, right?
> > >
> > > The mm is shared between previous victim and next victim, but these victims
> > > are in different thread groups. The OOM killer selects next victim whose mm
> > > was already reaped due to sharing previous victim's memory.
> >
> > OK, now I got your point. From your previous email it sounded like you
> > were talking about oom_reaper and its invocation, which was confusing.
> >
> > > We don't want the OOM killer to select such next victim.
> >
> > Yes, selecting such a task doesn't make much sense. It has been killed
> > so it has fatal_signal_pending. If it wanted to allocate it would get
> > TIF_MEMDIE already and its address space has been reaped so there is
> > nothing to free left. This CLONE_VM without CLONE_SIGHAND is a really
> > crazy combo, it is just causing troubles all over and I am not convinced
> > it is actually that helpful </rant>.
> >
> 
> I think moving "whether a mm is reapable or not" check to the OOM reaper
> is preferable (shown below). In most cases, mm_is_reapable() will return
> true.

Why should we select such a task in the first place when it has already been
oom reaped? I think this check belongs in oom_scan_process_thread.
 
> ----------------------------------------
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index b42c6bc..fc114b3 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -426,6 +426,39 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>  static LIST_HEAD(oom_reaper_list);
>  static DEFINE_SPINLOCK(oom_reaper_lock);
>  
> +static bool mm_is_reapable(struct mm_struct *mm)
> +{
> +	struct task_struct *g;
> +	struct task_struct *p;
> +
> +	/*
> +	 * Since it is possible that p voluntarily called do_exit() or
> +	 * somebody other than the OOM killer sent SIGKILL on p, this mm used
> +	 * by p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN is reapable if p
> +	 * has pending SIGKILL or already reached do_exit().
> +	 *
> +	 * On the other hand, it is possible that mark_oom_victim(p) is called
> +	 * without sending SIGKILL to all tasks using this mm. In this case,
> +	 * the OOM reaper cannot reap this mm unless p is the only task using
> +	 * this mm.
> +	 *
> +	 * Therefore, determine whether this mm is reapable by testing whether
> +	 * all tasks using this mm are dying or already exiting rather than
> +	 * depending on p->signal->oom_score_adj value which is updated by the
> +	 * OOM reaper.
> +	 */
> +	rcu_read_lock();
> +	for_each_process_thread(g, p) {
> +		if (mm != READ_ONCE(p->mm) ||
> +		    fatal_signal_pending(p) || (p->flags & PF_EXITING))
> +			continue;
> +		mm = NULL;
> +		goto out;
> +	}
> + out:
> +	rcu_read_unlock();
> +	return mm != NULL;
> +}
>  
>  static bool __oom_reap_task(struct task_struct *tsk)
>  {
> @@ -455,7 +488,7 @@ static bool __oom_reap_task(struct task_struct *tsk)
>  
>  	task_unlock(p);
>  
> -	if (!down_read_trylock(&mm->mmap_sem)) {
> +	if (!mm_is_reapable(mm) || !down_read_trylock(&mm->mmap_sem)) {
>  		ret = false;
>  		goto out;
>  	}
> @@ -596,6 +629,7 @@ void mark_oom_victim(struct task_struct *tsk)
>  	 */
>  	__thaw_task(tsk);
>  	atomic_inc(&oom_victims);
> +	wake_oom_reaper(tsk);
>  }
>  
>  /**
> @@ -680,7 +714,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	unsigned int victim_points = 0;
>  	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
>  					      DEFAULT_RATELIMIT_BURST);
> -	bool can_oom_reap = true;
>  
>  	/*
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
> @@ -771,23 +804,17 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  			continue;
>  		if (same_thread_group(p, victim))
>  			continue;
> -		if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
> -		    p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
> -			/*
> -			 * We cannot use oom_reaper for the mm shared by this
> -			 * process because it wouldn't get killed and so the
> -			 * memory might be still used.
> -			 */
> -			can_oom_reap = false;
> +		if (unlikely(p->flags & PF_KTHREAD))
>  			continue;
> -		}
> +		if (is_global_init(p))
> +			continue;
> +		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> +			continue;
> +
>  		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
>  	}
>  	rcu_read_unlock();
>  
> -	if (can_oom_reap)
> -		wake_oom_reaper(victim);
> -
>  	mmdrop(mm);
>  	put_task_struct(victim);
>  }
> ----------------------------------------

this is unnecessarily complex IMO

> Then, I think we need to kill two lies in allocation retry loop.

This is completely unrelated to the topic discussed here.
[...]

> By moving "whether a mm is reapable or not" check to the OOM reaper, we can
> delegate the duty of clearing TIF_MEMDIE to the OOM reaper because the OOM
> reaper is tracking all TIF_MEMDIE tasks. Since mm_is_reapable() can return
> true for most situations, it becomes an unlikely corner case that we need to
> clear TIF_MEMDIE and prevent the OOM killer from setting TIF_MEMDIE on the
> same task again when the OOM reaper gave up. Like you commented in [PATCH 5/5],
> falling back to simple timer would be sufficient for handling such corner cases.

I am not really sure I understand what you are trying to say here, to be honest,
but no, I am not going to add any timers at this stage.

> | I would really prefer to go a simpler way first and extend the code when
> | we see the current approach insufficient for real life loads. Please do
> | not get me wrong, of course the code can be enhanced in many different
> | ways and optimize for lots of pathological cases but I really believe
> | that we should start with correctness first and only later care about
> | optimizing corner cases.
> 
> >
> > > Maybe set MMF_OOM_REAP_DONE on
> > > the previous victim's mm and check it instead of TIF_MEMDIE when selecting
> > > a victim? That will also avoid problems caused by clearing TIF_MEMDIE?
> >
> > Hmm, it doesn't seem we are under MMF_ available bits pressure right now
> > so using the flag sounds like the easiest way to go. Then we even do not
> > have to play with OOM_SCORE_ADJ_MIN which might be updated from the
> > userspace after the oom reaper has done that. Care to send a patch?
> 
> Not only do we not need to worry about ->oom_score_adj being modified from
> outside the SIGKILL pending tasks, I think we also don't need to clear remote
> TIF_MEMDIE if we use MMF_OOM_REAP_DONE. Something like the untested patch below?

Dropping TIF_MEMDIE will help to unlock the OOM killer as soon as we know
the current victim is no longer interesting for the OOM killer, allowing
further victim selection. If we add MMF_OOM_REAP_DONE after reaping and
oom_scan_process_thread is taught to ignore those, you will get all cases
of shared memory handled properly AFAICS. Such a patch should be a really
trivial enhancement on top of the current code.

[...]

> I suggested many changes in this post because [PATCH 3/5] and [PATCH 5/5]
> made it possible for us to simplify [PATCH 1/5] back towards the old versions.
> I think you will want to rebuild this series with these changes merged as
> appropriate.

I would really _appreciate_ incremental changes as mentioned several
times already. And it would be really helpful if you could stick to
that. If you want to propose additional enhancements on top of the
current code you are free to do so but I am really reluctant to respin
everything to add more stuff into each patch and risk introduction of new
bugs. As things stand now the patches aim to be as simple as possible so
they do not introduce new bugs (except for the PM freezer issue which should
be fixable by a trivial patch which I will post after I get back from
vacation) and they improve things considerably in their current form
already.

I would like to target the next merge window rather than have this out
of tree for another release cycle which means that we should really
focus on the current functionality and make sure we haven't missed
anything. As there is no fundamental disagreement with the approach all
the rest are just technicalities.

Thanks

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
  2016-02-06  5:54       ` Tetsuo Handa
@ 2016-02-06  8:37         ` Michal Hocko
  2016-02-06 15:33           ` Tetsuo Handa
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2016-02-06  8:37 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Sat 06-02-16 14:54:24, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > But if we consider non system-wide OOM events, it is not very unlikely to hit
> > > this race. This queue is useful for situations where memcg1 and memcg2 hit
> > > memcg OOM at the same time and victim1 in memcg1 cannot terminate immediately.
> > 
> > This can happen of course but the likelihood is _much_ smaller without
> > the global OOM because the memcg OOM killer is invoked from a lockless
> > context so the oom context cannot block the victim from proceeding.
> 
> Suppose mem_cgroup_out_of_memory() is called from a lockless context via
> mem_cgroup_oom_synchronize() called from pagefault_out_of_memory(); that
> "lockless" only applies to the current thread, doesn't it?

Yes and you need the OOM context to sit on the same lock as the victim
to form a deadlock. So while the victim might be blocked somewhere it is
much less likely it would be deadlocked.

> Since oom_kill_process() sets TIF_MEMDIE on the first mm!=NULL thread of a
> victim process, it is possible that a non-first mm!=NULL thread triggers
> pagefault_out_of_memory() and the first mm!=NULL thread gets TIF_MEMDIE,
> isn't it?

I got lost here completely. Maybe it is your usage of thread terminology
again.
 
> Then, where is the guarantee that victim1 (first mm!=NULL thread in memcg1
> which got TIF_MEMDIE) is not waiting at down_read(&victim2->mm->mmap_sem)
> when victim2 (first mm!=NULL thread in memcg2 which got TIF_MEMDIE) is
> waiting at down_write(&victim2->mm->mmap_sem)

All threads/processes sharing the same mm are in fact in the same memory
cgroup. That is the reason we have the owner field in the task_struct.

> or both victim1 and victim2
> are waiting on a lock somewhere in memory reclaim path (e.g.
> mutex_lock(&inode->i_mutex))?

Such waiting has to make forward progress at some point in time
because the lock itself cannot be deadlocked by the memcg OOM context.


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-06  8:30             ` Michal Hocko
@ 2016-02-06 11:23               ` Tetsuo Handa
  2016-02-15 20:47                 ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Tetsuo Handa @ 2016-02-06 11:23 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> I am not really sure I understand what you are trying to say here, to be honest,
> but no, I am not going to add any timers at this stage.

> Dropping TIF_MEMDIE will help to unlock the OOM killer as soon as we know
> the current victim is no longer interesting for the OOM killer, allowing
> further victim selection. If we add MMF_OOM_REAP_DONE after reaping and
> oom_scan_process_thread is taught to ignore those, you will get all cases
> of shared memory handled properly AFAICS. Such a patch should be a really
> trivial enhancement on top of the current code.

What I'm trying to say is that we should prepare for corner cases where
dropping TIF_MEMDIE (or reaping the victim's memory) does not take place.

What happens if TIF_MEMDIE was set at

	if (current->mm &&
	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
		mark_oom_victim(current);
		return true;
	}

in out_of_memory() because current thread received SIGKILL while doing
a GFP_KERNEL allocation of

  Do a GFP_KERNEL allocation.
  Bail out if GFP_KERNEL allocation failed.
  Hold a lock.
  Do a GFP_NOFS allocation.
  Release a lock.

sequence? If the current thread is blocked waiting for that lock, held by
somebody else doing memory allocation, nothing will unlock the OOM killer
(because the OOM reaper is not woken up and a timer for unlocking the OOM
killer does not exist).
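
A hypothetical call site with this shape:

----------------------------------------
/*
 * Illustration only. If SIGKILL arrives during the GFP_KERNEL
 * allocation, out_of_memory() marks current as an OOM victim and
 * returns without killing anything; if we then block on the lock
 * below, held by somebody else doing memory allocation, nothing
 * ever unlocks the OOM killer.
 */
static int hypothetical_op(struct inode *inode)
{
	void *buf = kmalloc(PAGE_SIZE, GFP_KERNEL);	/* may set TIF_MEMDIE */
	void *meta;

	if (!buf)
		return -ENOMEM;

	mutex_lock(&inode->i_mutex);	/* may block indefinitely */
	meta = kmalloc(64, GFP_NOFS);	/* cannot invoke the OOM killer */
	mutex_unlock(&inode->i_mutex);

	kfree(meta);
	kfree(buf);
	return 0;
}
----------------------------------------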

What happens if TIF_MEMDIE was set at

	task_lock(p);
	if (p->mm && task_will_free_mem(p)) {
		mark_oom_victim(p);
		task_unlock(p);
		put_task_struct(p);
		return;
	}
	task_unlock(p);

in oom_kill_process() when p is waiting for a lock held by somebody else
doing memory allocation? Since the OOM reaper will not be woken up,
nothing will unlock the OOM killer.

If TIF_MEMDIE was set at

	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
	mark_oom_victim(victim);
	pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",

in oom_kill_process(), but the OOM reaper was not woken up because of the
p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN case, nothing will unlock
the OOM killer.

By always waking the OOM reaper up, we can delegate the duty of unlocking
the OOM killer (by clearing TIF_MEMDIE or some other means) to the OOM
reaper because the OOM reaper is tracking all TIF_MEMDIE tasks.
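
A minimal sketch of that idea (paraphrasing mark_oom_victim(); the
unconditional wake_oom_reaper() call at the end is the proposed change,
not what the current series does):

	void mark_oom_victim(struct task_struct *tsk)
	{
		WARN_ON(oom_killer_disabled);
		/* OOM killer might race with memcg OOM */
		if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
			return;
		atomic_inc(&oom_victims);
		/*
		 * Proposed: make the reaper see every victim so that it can
		 * judge reapability and clear TIF_MEMDIE later.
		 */
		wake_oom_reaper(tsk);
	}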

Of course, it is possible to handle such corner cases by adding a timer
for unlocking the OOM killer (regardless of the availability of the OOM
reaper). But if we do the "whether a mm is reapable or not" check on the
OOM reaper side, by waking the OOM reaper up whenever TIF_MEMDIE is set
on some task, we increase the likelihood that dropping TIF_MEMDIE (after
reaping the victim's memory) actually takes place, and we reduce how
often such a timer has to kill further OOM victims to unlock the OOM
killer.

> I would like to target the next merge window rather than have this out
> of tree for another release cycle which means that we should really
> focus on the current functionality and make sure we haven't missed
> anything. As there is no fundamental disagreement with the approach, all
> the rest are just technicalities.

Of course, we can target the OOM reaper for the next merge window. I'm
suggesting to you that my changes would help handle corner cases (bugs)
you are not paying attention to.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/5] mm, oom: introduce oom reaper
  2016-02-03 13:13 ` [PATCH 1/5] mm, oom: introduce oom reaper Michal Hocko
  2016-02-03 23:48   ` David Rientjes
@ 2016-02-06 13:22   ` Tetsuo Handa
  2016-02-15 20:50     ` Michal Hocko
  1 sibling, 1 reply; 48+ messages in thread
From: Tetsuo Handa @ 2016-02-06 13:22 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: rientjes, mgorman, oleg, torvalds, hughd, andrea, riel, linux-mm,
	linux-kernel, mhocko

Michal Hocko wrote:
> There is one notable exception to this, though: if the OOM victim was
> in the process of coredumping, the result would be incomplete. This is
> considered a reasonable constraint because the overall system health is
> more important than the debuggability of a particular application.

Is it possible to clarify what "the result would be incomplete" means?

  (1) The size of the coredump file becomes smaller than it should be, and
      data in reaped pages is not included in the file.

  (2) The size of the coredump file does not change, and data in reaped pages
      is included in the file as NUL bytes.

  (3) The size of the coredump file does not change, and data in reaped pages
      is included in the file as-is (i.e. an information leak security risk).

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-06  6:45       ` Michal Hocko
@ 2016-02-06 14:33         ` Tetsuo Handa
  2016-02-15 20:40           ` [PATCH 3.1/5] oom: make oom_reaper freezable Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Tetsuo Handa @ 2016-02-06 14:33 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> On Thu 04-02-16 15:43:19, Michal Hocko wrote:
> > On Thu 04-02-16 23:22:18, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > From: Michal Hocko <mhocko@suse.com>
> > > > 
> > > > When oom_reaper manages to unmap all the eligible vmas there shouldn't
> > > > be much of the freeable memory held by the oom victim left anymore, so it
> > > > makes sense to clear the TIF_MEMDIE flag for the victim and allow the
> > > > OOM killer to select another task.
> > > 
> > > Just a confirmation. Is it safe to clear TIF_MEMDIE without reaching do_exit()
> > > with regard to freezing_slow_path()? Since clearing TIF_MEMDIE from the OOM
> > > reaper confuses
> > > 
> > >     wait_event(oom_victims_wait, !atomic_read(&oom_victims));
> > > 
> > > in oom_killer_disable(), I'm worried that the freezing operation continues
> > > before the OOM victim which escaped the __refrigerator() actually releases
> > > memory. Does this cause a consistency problem?
> > 
> > This is a good question! At first sight it seems this is not safe and we
> > might need to make the oom_reaper freezable so that it doesn't wake up
> > during suspend and interfere. Let me think about that.
> 
> OK, I was thinking about it some more and it seems you are right here.
> oom_reaper as a kernel thread is not freezable automatically and so it
> might interfere after all the processes/kernel threads are considered
> frozen. Then it really might clear TIF_MEMDIE too early and wake up the
> waiter in oom_killer_disable. wait_event_freezable is not sufficient because
> the oom_reaper might be running while the PM freezer is freezing tasks, and
> the freezer will miss it because it doesn't see it.

I'm not using the PM freezer, but your answer is the opposite of my guess.
I thought try_to_freeze_tasks(false) is called by freeze_kernel_threads()
after oom_killer_disable() succeeded, and try_to_freeze_tasks(false) will
freeze both userspace tasks (including OOM victims which got TIF_MEMDIE
cleared by the OOM reaper) and kernel threads (including the OOM reaper).
Thus, I was guessing that clearing TIF_MEMDIE without reaching do_exit() is
safe.

> 
> So I think we might need this. I am heading off on vacation today and will
> be offline for the next week, so I will prepare the full patch with the
> proper changelog after I get back:
> 
I can't judge whether we need this set_freezable().

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ca61e6cfae52..7e9953a64489 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -521,6 +521,8 @@ static void oom_reap_task(struct task_struct *tsk)
>  
>  static int oom_reaper(void *unused)
>  {
> +	set_freezable();
> +
>  	while (true) {
>  		struct task_struct *tsk = NULL;
>  
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
  2016-02-06  8:37         ` Michal Hocko
@ 2016-02-06 15:33           ` Tetsuo Handa
  2016-02-15 20:15             ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Tetsuo Handa @ 2016-02-06 15:33 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> On Sat 06-02-16 14:54:24, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > But if we consider non system-wide OOM events, it is not very unlikely to hit
> > > > this race. This queue is useful for situations where memcg1 and memcg2 hit
> > > > memcg OOM at the same time and victim1 in memcg1 cannot terminate immediately.
> > > 
> > > This can happen of course but the likelihood is _much_ smaller without
> > > the global OOM because the memcg OOM killer is invoked from a lockless
> > > context, so the oom context cannot block the victim from proceeding.
> > 
> > Suppose mem_cgroup_out_of_memory() is called from a lockless context via
> > mem_cgroup_oom_synchronize() called from pagefault_out_of_memory(), that
> > "lockless" is talking about only current thread, doesn't it?
> 
> Yes and you need the OOM context to sit on the same lock as the victim
> to form a deadlock. So while the victim might be blocked somewhere it is
> much less likely it would be deadlocked.
> 
> > Since oom_kill_process() sets TIF_MEMDIE on the first mm!=NULL thread of a
> > victim process, it is possible that a non-first mm!=NULL thread triggers
> > pagefault_out_of_memory() and the first mm!=NULL thread gets TIF_MEMDIE,
> > isn't it?
> 
> I got lost here completely. Maybe it is your usage of thread terminology
> again.

I'm using "process" == "thread group" which contains at least one "thread",
and "thread" == "struct task_struct".
My assumptions are

   (1) app1 process has two threads named app1t1 and app1t2
   (2) app2 process has two threads named app2t1 and app2t2
   (3) app1t1->mm == app1t2->mm != NULL and app2t1->mm == app2t2->mm != NULL
   (4) app1 is in memcg1 and app2 is in memcg2

and the sequence is

   (1) app1t2 triggers pagefault_out_of_memory()
   (2) app1t2 calls mem_cgroup_out_of_memory() via mem_cgroup_oom_synchronize()
   (3) oom_scan_process_thread() selects app1 as an OOM victim process
   (4) find_lock_task_mm() selects app1t1 as an OOM victim thread
   (5) app1t1 gets TIF_MEMDIE
   (6) app2t2 triggers pagefault_out_of_memory()
   (7) app2t2 calls mem_cgroup_out_of_memory() via mem_cgroup_oom_synchronize()
   (8) oom_scan_process_thread() selects app2 as an OOM victim process
   (9) find_lock_task_mm() selects app2t1 as an OOM victim thread
   (10) app2t1 gets TIF_MEMDIE

.

I'm talking about a situation where app1t1 is blocked at down_write(&app1t1->mm->mmap_sem)
because somebody else is already waiting at down_read(&app1t1->mm->mmap_sem) or is
doing memory allocation between down_read(&app1t1->mm->mmap_sem) and
up_read(&app1t1->mm->mmap_sem). In this case, this [PATCH 5/5] helps the OOM reaper to
reap app2t1->mm after giving up waiting for down_read(&app1t1->mm->mmap_sem) to succeed.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
  2016-02-06 15:33           ` Tetsuo Handa
@ 2016-02-15 20:15             ` Michal Hocko
  2016-02-16 11:11               ` Tetsuo Handa
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2016-02-15 20:15 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Sun 07-02-16 00:33:38, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sat 06-02-16 14:54:24, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > > But if we consider non system-wide OOM events, it is not very unlikely to hit
> > > > > this race. This queue is useful for situations where memcg1 and memcg2 hit
> > > > > memcg OOM at the same time and victim1 in memcg1 cannot terminate immediately.
> > > > 
> > > > This can happen of course but the likelihood is _much_ smaller without
> > > > the global OOM because the memcg OOM killer is invoked from a lockless
> > > > context, so the oom context cannot block the victim from proceeding.
> > > 
> > > Suppose mem_cgroup_out_of_memory() is called from a lockless context via
> > > mem_cgroup_oom_synchronize() called from pagefault_out_of_memory(), that
> > > "lockless" is talking about only current thread, doesn't it?
> > 
> > Yes and you need the OOM context to sit on the same lock as the victim
> > to form a deadlock. So while the victim might be blocked somewhere it is
> > much less likely it would be deadlocked.
> > 
> > > Since oom_kill_process() sets TIF_MEMDIE on the first mm!=NULL thread of a
> > > victim process, it is possible that a non-first mm!=NULL thread triggers
> > > pagefault_out_of_memory() and the first mm!=NULL thread gets TIF_MEMDIE,
> > > isn't it?
> > 
> > I got lost here completely. Maybe it is your usage of thread terminology
> > again.
> 
> I'm using "process" == "thread group" which contains at least one "thread",
> and "thread" == "struct task_struct".
> My assumptions are
> 
>    (1) app1 process has two threads named app1t1 and app1t2
>    (2) app2 process has two threads named app2t1 and app2t2
>    (3) app1t1->mm == app1t2->mm != NULL and app2t1->mm == app2t2->mm != NULL
>    (4) app1 is in memcg1 and app2 is in memcg2
> 
> and the sequence is
> 
>    (1) app1t2 triggers pagefault_out_of_memory()
>    (2) app1t2 calls mem_cgroup_out_of_memory() via mem_cgroup_oom_synchronize()
>    (3) oom_scan_process_thread() selects app1 as an OOM victim process
>    (4) find_lock_task_mm() selects app1t1 as an OOM victim thread
>    (5) app1t1 gets TIF_MEMDIE

OK so we have a victim in memcg1 and app1t2 will get to do_exit right away
because we are in the page fault path...

>    (6) app2t2 triggers pagefault_out_of_memory()
>    (7) app2t2 calls mem_cgroup_out_of_memory() via mem_cgroup_oom_synchronize()
>    (8) oom_scan_process_thread() selects app2 as an OOM victim process
>    (9) find_lock_task_mm() selects app2t1 as an OOM victim thread
>    (10) app2t1 gets TIF_MEMDIE
> 
> .
> 
> I'm talking about a situation where app1t1 is blocked at down_write(&app1t1->mm->mmap_sem)
> because somebody else is already waiting at down_read(&app1t1->mm->mmap_sem) or is
> doing memory allocation between down_read(&app1t1->mm->mmap_sem) and
> up_read(&app1t1->mm->mmap_sem).

Unless we are under global OOM, this doesn't matter much because the
allocation request should succeed at some point in time and memcg
charges are bypassed for tasks with pending fatal signals. So we can
make forward progress.
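
(The bypass I mean is the check in try_charge(), roughly - paraphrasing
from memory, the exact label and placement may differ:

	if (unlikely(test_thread_flag(TIF_MEMDIE) ||
		     fatal_signal_pending(current) ||
		     current->flags & PF_EXITING))
		goto force;	/* charge anyway, do not block a dying task */

so a killed task is not stuck on the memcg limit.)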

> In this case, this [PATCH 5/5] helps the OOM reaper to reap app2t1->mm
> after giving up waiting for down_read(&app1t1->mm->mmap_sem) to
> succeed.

Why does that matter at all?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 3.1/5] oom: make oom_reaper freezable
  2016-02-06 14:33         ` Tetsuo Handa
@ 2016-02-15 20:40           ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-02-15 20:40 UTC (permalink / raw)
  To: akpm, Tetsuo Handa
  Cc: rientjes, mgorman, oleg, torvalds, hughd, andrea, riel, linux-mm,
	linux-kernel

Andrew,
this can either be folded into the 3/5 patch or go in as a standalone one. I
would be inclined to keep it standalone for the record (the description
should be pretty clear about the intention) and because the issue is
highly unlikely: an OOM during the PM freezer doesn't happen in 99% of cases.

On Sat 06-02-16 23:33:24, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > OK, I was thinking about it some more and it seems you are right here.
> > oom_reaper as a kernel thread is not freezable automatically and so it
> > might interfere after all the processes/kernel threads are considered
> > frozen. Then it really might clear TIF_MEMDIE too early and wake up the
> > waiter in oom_killer_disable. wait_event_freezable is not sufficient because
> > the oom_reaper might be running while the PM freezer is freezing tasks, and
> > the freezer will miss it because it doesn't see it.
> 
> I'm not using the PM freezer, but your answer is the opposite of my guess.
> I thought try_to_freeze_tasks(false) is called by freeze_kernel_threads()
> after oom_killer_disable() succeeded, and try_to_freeze_tasks(false) will
> freeze both userspace tasks (including OOM victims which got TIF_MEMDIE
> cleared by the OOM reaper) and kernel threads (including the OOM reaper).

Kernel threads which are not freezable are ignored by the freezer.

> Thus, I was guessing that clearing TIF_MEMDIE without reaching do_exit() is
> safe.

Does the following explain it better?
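
For context, the canonical freezable-kthread pattern the patch below
relies on looks roughly like this (placeholder names, not the actual
oom_reaper body):

	#include <linux/freezer.h>
	#include <linux/kthread.h>
	#include <linux/wait.h>

	static DECLARE_WAIT_QUEUE_HEAD(example_wait);
	static bool example_ready;

	static int example_thread(void *unused)
	{
		set_freezable();	/* opt in: the PM freezer now waits for us */
		while (!kthread_should_stop()) {
			/* freezes here instead of interfering with suspend */
			wait_event_freezable(example_wait, example_ready);
			/* ... do the work, only ever while unfrozen ... */
		}
		return 0;
	}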
---
>From d7f57b1ac07532657312c91f3bba67cf0332b32f Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 15 Feb 2016 10:09:39 +0100
Subject: [PATCH] oom: make oom_reaper freezable

After "oom: clear TIF_MEMDIE after oom_reaper managed to unmap the
address space" oom_reaper will call exit_oom_victim on the target
task after it is done. This might however race with the PM freezer:

CPU0				CPU1				CPU2
freeze_processes
  try_to_freeze_tasks
  				# Allocation request
				out_of_memory
  oom_killer_disable
				  wake_oom_reaper(P1)
				  				__oom_reap_task
								  exit_oom_victim(P1)
    wait_event(oom_victims==0)
[...]
    				do_exit(P1)
				  perform IO/interfere with the freezer

which breaks the oom_killer_disable semantic. We no longer have a
guarantee that the oom victim won't interfere with the freezer because
it might be anywhere on the way to do_exit while the freezer thinks the
task has already terminated. It might trigger IO or touch devices which
are frozen already.

In order to close this race, make the oom_reaper thread freezable. This
will work because
	a) an already running oom_reaper will block the freezer from
	   entering the quiescent state
	b) wake_oom_reaper will not wake up the reaper after it has been
	   frozen
	c) the only way to call exit_oom_victim after try_to_freeze_tasks
	   is from the oom victim's context, when we know that further
	   interference shouldn't be possible

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ca61e6cfae52..7e9953a64489 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -521,6 +521,8 @@ static void oom_reap_task(struct task_struct *tsk)
 
 static int oom_reaper(void *unused)
 {
+	set_freezable();
+
 	while (true) {
 		struct task_struct *tsk = NULL;
 
-- 
2.7.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
  2016-02-06 11:23               ` Tetsuo Handa
@ 2016-02-15 20:47                 ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-02-15 20:47 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Sat 06-02-16 20:23:43, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> By always waking the OOM reaper up, we can delegate the duty of unlocking
> the OOM killer (by clearing TIF_MEMDIE or some other means) to the OOM
> reaper because the OOM reaper is tracking all TIF_MEMDIE tasks.

And again, I didn't say this would be incorrect. I am just saying that
this will get more complex if we want to handle all the cases properly.

> > I would like to target the next merge window rather than have this out
> > of tree for another release cycle which means that we should really
> > focus on the current functionality and make sure we haven't missed
> > anything. As there is no fundamental disagreement with the approach, all
> > the rest are just technicalities.
> 
> Of course, we can target the OOM reaper for the next merge window. I'm
> suggesting to you that my changes would help handle corner cases (bugs)
> you are not paying attention to.

I am paying attention to them. I just think that incremental changes are
preferable and we should start with the simpler cases before taking
further steps. There is no reason to rush this.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 1/5] mm, oom: introduce oom reaper
  2016-02-06 13:22   ` Tetsuo Handa
@ 2016-02-15 20:50     ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-02-15 20:50 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Sat 06-02-16 22:22:20, Tetsuo Handa wrote:
> Michal Hocko wrote:
> There is one notable exception to this, though: if the OOM victim was
> in the process of coredumping, the result would be incomplete. This is
> considered a reasonable constraint because the overall system health is
> more important than the debuggability of a particular application.
> 
> Is it possible to clarify what "the result would be incomplete" means?
> 
>   (1) The size of the coredump file becomes smaller than it should be, and
>       data in reaped pages is not included in the file.
> 
>   (2) The size of the coredump file does not change, and data in reaped pages
>       is included in the file as NUL bytes.

AFAIU this will be the case. We are not destroying VMAs; we are just
unmapping the page ranges. So what would happen is that the core dump
will contain zero pages for anonymous mappings. This might change in
the future, though, because the oom reaper might be extended to do more
work (e.g. drop the associated page tables, in which case I would expect
the core dumping could SEGV).
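
A userspace sketch of what that means (not runnable end-to-end, because
the reap step in the middle cannot be triggered from here):

	#include <assert.h>
	#include <sys/mman.h>

	int main(void)
	{
		char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		p[0] = 42;		/* page is populated with data */
		/* ... imagine oom_reaper unmapping the page range here ... */
		assert(p[0] == 0);	/* VMA intact: the access refaults a zero page */
		return 0;
	}

The core dumper would hit the same refault, hence zeroes rather than a
SEGV.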

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
  2016-02-15 20:15             ` Michal Hocko
@ 2016-02-16 11:11               ` Tetsuo Handa
  2016-02-16 15:53                 ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Tetsuo Handa @ 2016-02-16 11:11 UTC (permalink / raw)
  To: mhocko
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

Michal Hocko wrote:
> Unless we are under global OOM, this doesn't matter much because the
> allocation request should succeed at some point in time and memcg
> charges are bypassed for tasks with pending fatal signals. So we can
> make forward progress.

Hmm, then I wonder how memcg OOM livelock occurs. Anyway, OK for now.

But the current OOM reaper lacks protection against a list item "double
add" bug. Precisely speaking, this is not the OOM reaper's own bug.

----------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

/* Keep appending to /tmp/file until write() fails, generating dirty
 * page cache pages. */
static int file_writer(void)
{
	static char buffer[4096] = { }; 
	const int fd = open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
	while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer));
	return 0;
}

/* Grab as much anonymous memory as realloc() will give us, then touch
 * it all; with overcommit this forces the OOM killer to fire. */
static int memory_consumer(void)
{
	const int fd = open("/dev/zero", O_RDONLY);
	unsigned long size;
	char *buf = NULL;
	sleep(1);
	unlink("/tmp/file");
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	read(fd, buf, size); /* Will cause OOM due to overcommit */
	return 0;
}

int main(int argc, char *argv[])
{
	int i;
	for (i = 0; i < 1024; i++)
		if (fork() == 0) {
			file_writer();
			_exit(0);
		}
	memory_consumer();
	while (1)
		pause();
}
----------

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160216.txt.xz .
----------
[  140.758667] a.out invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[  140.760706] a.out cpuset=/ mems_allowed=0
(...snipped...)
[  140.860676] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
[  140.864883] Killed process 10596 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[  140.868483] oom_reaper: reaped process 10596 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
** 3 printk messages dropped ** [  206.416481] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
** 2 printk messages dropped ** [  206.418908] Killed process 10600 (a.out) total-vm:4176kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
** 2 printk messages dropped ** [  206.421956] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
** 2 printk messages dropped ** [  206.424293] INFO: rcu_sched self-detected stall on CPU
** 3 printk messages dropped ** [  206.424300] oom_reaper      R  running task        0    33      2 0x00000008
[  206.424302]  ffff88007cd35900 000000000b04b66b ffff88007fcc3dd0 ffffffff8109db68
** 1 printk messages dropped ** [  206.424304]  ffffffff810a0bc4 0000000000000003 ffff88007fcc3e18 ffffffff8113e092
** 4 printk messages dropped ** [  206.424316]  [<ffffffff8113e092>] rcu_dump_cpu_stacks+0x73/0x94
** 3 printk messages dropped ** [  206.424322]  [<ffffffff810e2d44>] update_process_times+0x34/0x60
** 2 printk messages dropped ** [  206.424327]  [<ffffffff810f2a50>] ? tick_sched_handle.isra.20+0x40/0x40
** 4 printk messages dropped ** [  206.424334]  [<ffffffff81049598>] smp_apic_timer_interrupt+0x38/0x50
** 3 printk messages dropped ** [  206.424342]  [<ffffffff810d0f9a>] vprintk_default+0x1a/0x20
** 2 printk messages dropped ** [  206.424346]  [<ffffffff81708291>] ? _raw_spin_unlock_irqrestore+0x31/0x60
** 5 printk messages dropped ** [  206.424355]  [<ffffffff81090950>] ? kthread_create_on_node+0x230/0x230
** 2 printk messages dropped ** [  206.431196] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
** 3 printk messages dropped ** [  206.434491] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
** 1 printk messages dropped ** [  206.436152] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
** 3 printk messages dropped ** [  206.439359] Out of memory (oom_kill_allocating_task): Kill process 10595 (a.out) score 0 or sacrifice child
(...snipped...)
[  312.387913] general protection fault: 0000 [#1] SMP 
[  312.387943] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_raw iptable_filter coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel glue_helper lrw gf128mul ablk_helper cryptd ppdev vmw_balloon pcspkr sg parport_pc shpchp parport i2c_piix4 vmw_vmci ip_tables sd_mod ata_generic pata_acpi crc32c_intel serio_raw mptspi scsi_transport_spi mptscsih vmwgfx mptbase drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci ttm libahci drm ata_piix e1000 libata i2c_core
[  312.387945] CPU: 0 PID: 33 Comm: oom_reaper Not tainted 4.5.0-rc4-next-20160216 #305
[  312.387946] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  312.387947] task: ffff88007cd35900 ti: ffff88007cdf0000 task.ti: ffff88007cdf0000
[  312.387953] RIP: 0010:[<ffffffff8114425c>]  [<ffffffff8114425c>] oom_reaper+0x9c/0x1e0
[  312.387954] RSP: 0018:ffff88007cdf3e00  EFLAGS: 00010287
[  312.387954] RAX: dead000000000200 RBX: ffff8800621bac80 RCX: 0000000000000001
[  312.387955] RDX: dead000000000100 RSI: 0000000000000000 RDI: ffffffff81c4cac0
[  312.387955] RBP: ffff88007cdf3e60 R08: 0000000000000001 R09: 0000000000000000
[  312.387956] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88007cd35900
[  312.387956] R13: 0000000000000001 R14: ffff88007cdf3e20 R15: ffff8800621bbe50
[  312.387957] FS:  0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[  312.387958] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  312.387958] CR2: 00007f6d8eaa5000 CR3: 000000006361e000 CR4: 00000000001406f0
[  312.387988] Stack:
[  312.387990]  ffff88007cd35900 ffffffff00000000 ffff88007cd35900 ffffffff810b6100
[  312.387991]  ffff88007cdf3e20 ffff88007cdf3e20 000000000b04b66b ffff88007cd9f340
[  312.387992]  0000000000000000 ffffffff811441c0 ffffffff81c4e300 0000000000000000
[  312.387992] Call Trace:
[  312.387998]  [<ffffffff810b6100>] ? wait_woken+0x90/0x90
[  312.388003]  [<ffffffff811441c0>] ? __oom_reap_task+0x220/0x220
[  312.388005]  [<ffffffff81090a49>] kthread+0xf9/0x110
[  312.388011]  [<ffffffff81708c32>] ret_from_fork+0x22/0x50
[  312.388012]  [<ffffffff81090950>] ? kthread_create_on_node+0x230/0x230
[  312.388025] Code: cb c4 81 0f 84 a0 00 00 00 4c 8b 3d bf 88 b0 00 48 c7 c7 c0 ca c4 81 41 bd 01 00 00 00 49 8b 47 08 49 8b 17 49 8d 9f 30 ee ff ff <48> 89 42 08 48 89 10 48 b8 00 01 00 00 00 00 ad de 49 89 07 66 
[  312.388027] RIP  [<ffffffff8114425c>] oom_reaper+0x9c/0x1e0
[  312.388028]  RSP <ffff88007cdf3e00>
[  312.388029] ---[ end trace ea62d9784868759a ]---
[  312.388031] BUG: sleeping function called from invalid context at include/linux/sched.h:2819
[  312.388032] in_atomic(): 1, irqs_disabled(): 0, pid: 33, name: oom_reaper
[  312.388033] INFO: lockdep is turned off.
[  312.388034] CPU: 0 PID: 33 Comm: oom_reaper Tainted: G      D         4.5.0-rc4-next-20160216 #305
[  312.388035] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  312.388036]  0000000000000286 000000000b04b66b ffff88007cdf3be0 ffffffff8139e83d
[  312.388037]  ffff88007cd35900 ffffffff819b931a ffff88007cdf3c08 ffffffff810961eb
[  312.388038]  ffffffff819b931a 0000000000000b03 0000000000000000 ffff88007cdf3c30
[  312.388038] Call Trace:
[  312.388043]  [<ffffffff8139e83d>] dump_stack+0x85/0xc8
[  312.388046]  [<ffffffff810961eb>] ___might_sleep+0x14b/0x240
[  312.388047]  [<ffffffff81096324>] __might_sleep+0x44/0x80
[  312.388050]  [<ffffffff8107f7ee>] exit_signals+0x2e/0x150
[  312.388051]  [<ffffffff81091fa1>] ? blocking_notifier_call_chain+0x11/0x20
[  312.388054]  [<ffffffff81072e32>] do_exit+0xc2/0xb50
[  312.388057]  [<ffffffff8101893c>] oops_end+0x9c/0xd0
[  312.388058]  [<ffffffff81018ba6>] die+0x46/0x60
[  312.388059]  [<ffffffff8101602b>] do_general_protection+0xdb/0x1b0
[  312.388061]  [<ffffffff8170a4f8>] general_protection+0x28/0x30
[  312.388064]  [<ffffffff8114425c>] ? oom_reaper+0x9c/0x1e0
[  312.388066]  [<ffffffff810b6100>] ? wait_woken+0x90/0x90
[  312.388067]  [<ffffffff811441c0>] ? __oom_reap_task+0x220/0x220
[  312.388068]  [<ffffffff81090a49>] kthread+0xf9/0x110
[  312.388071]  [<ffffffff81708c32>] ret_from_fork+0x22/0x50
[  312.388072]  [<ffffffff81090950>] ? kthread_create_on_node+0x230/0x230
[  312.388075] note: oom_reaper[33] exited with preempt_count 1
----------

For the oom_kill_allocating_task = 1 case (despite the name, it still tries
to kill children first), the OOM killer does not wait for the OOM victim to
clear TIF_MEMDIE because select_bad_process() is not called. Therefore, if
an OOM victim fails to terminate because the OOM reaper failed to reap
enough memory, the kernel is flooded with OOM killer messages trying to
kill that stuck victim (with an OOM reaper lockup due to list corruption).
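
The corruption itself is plain list misuse - nothing prevents the same
task from being queued twice, and list_add() does not check for
membership (sketch):

	spin_lock(&oom_reaper_lock);
	list_add(&tsk->oom_reaper_list, &oom_reaper_list);	/* first kill */
	spin_unlock(&oom_reaper_lock);
	/* the victim cannot exit, so the OOM killer fires again ... */
	spin_lock(&oom_reaper_lock);
	list_add(&tsk->oom_reaper_list, &oom_reaper_list);	/* second kill: list corrupted */
	spin_unlock(&oom_reaper_lock);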

Adding an OOM victim which was already added to oom_reaper_list is wrong.
What should we do here?

(Choice 1) Make sure that a TIF_MEMDIE thread is not chosen as an OOM victim.
           This will avoid list corruption, but choosing other !TIF_MEMDIE
           threads sharing the TIF_MEMDIE thread's mm and adding them to
           oom_reaper_list does not make sense.

(Choice 2) Make sure that any process which includes a TIF_MEMDIE thread is
           not chosen as an OOM victim. But choosing other processes without
           TIF_MEMDIE thread sharing the TIF_MEMDIE thread's mm and adding
           them to oom_reaper_list does not make sense. A mm should be added
           to oom_reaper_list at most once.

(Choice 3) Make sure that any process which uses a mm which was added to
           oom_reaper_list is not chosen as an OOM victim. This would mean
           replacing test_tsk_thread_flag(task, TIF_MEMDIE) with
           (mm = task->mm, mm && test_bit(MMF_OOM_VICTIM, &mm->flags))
           in OOM killer and replacing test_thread_flag(TIF_MEMDIE) with
           (current->mm && test_bit(MMF_OOM_VICTIM, &current->mm->flags) &&
            (fatal_signal_pending(current) || (current->flags & PF_EXITING)))
           in ALLOC_NO_WATERMARKS check.

(Choice 4) Call select_bad_process() for oom_kill_allocating_task = 1 case.

----------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7653055..5e3e2f2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -880,15 +880,6 @@ bool out_of_memory(struct oom_control *oc)
 		oc->nodemask = NULL;
 	check_panic_on_oom(oc, constraint, NULL);
 
-	if (sysctl_oom_kill_allocating_task && current->mm &&
-	    !oom_unkillable_task(current, NULL, oc->nodemask) &&
-	    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
-		get_task_struct(current);
-		oom_kill_process(oc, current, 0, totalpages, NULL,
-				 "Out of memory (oom_kill_allocating_task)");
-		return true;
-	}
-
 	p = select_bad_process(oc, &points, totalpages);
 	/* Found nothing?!?! Either we hang forever, or we panic. */
 	if (!p && !is_sysrq_oom(oc)) {
@@ -896,8 +887,16 @@ bool out_of_memory(struct oom_control *oc)
 		panic("Out of memory and no killable processes...\n");
 	}
 	if (p && p != (void *)-1UL) {
-		oom_kill_process(oc, p, points, totalpages, NULL,
-				 "Out of memory");
+		if (sysctl_oom_kill_allocating_task && current->mm &&
+		    !oom_unkillable_task(current, NULL, oc->nodemask) &&
+		    current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
+			put_task_struct(p);
+			get_task_struct(current);
+			oom_kill_process(oc, current, 0, totalpages, NULL,
+					 "Out of memory (oom_kill_allocating_task)");
+		} else
+			oom_kill_process(oc, p, points, totalpages, NULL,
+					 "Out of memory");
 		/*
 		 * Give the killed process a good chance to exit before trying
 		 * to allocate memory again.
----------

(Choice 5) Any other ideas?

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
  2016-02-16 11:11               ` Tetsuo Handa
@ 2016-02-16 15:53                 ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-02-16 15:53 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Tue 16-02-16 20:11:24, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Unless we are under global OOM then this doesn't matter much because the
> > allocation request should succeed at some point in time and memcg
> > charges are bypassed for tasks with pending fatal signals. So we can
> > make a forward progress.
> 
> Hmm, then I wonder how memcg OOM livelock occurs. Anyway, OK for now.
> 
> But current OOM reaper forgot a protection for list item "double add" bug.
> Precisely speaking, this is not a OOM reaper's bug.
[...]
> For the oom_kill_allocating_task = 1 case (despite the name, it still tries
> to kill children first),

Yes, this is long-standing behavior and I cannot say I am happy about
it because it clearly breaks the defined semantics.

> the OOM killer does not wait for the OOM victim to clear TIF_MEMDIE
> because select_bad_process() is not called. Therefore, if an OOM victim fails to
> terminate because the OOM reaper failed to reap enough memory, the kernel is
> flooded with OOM killer messages trying to kill that stuck victim (with an
> OOM reaper lockup due to list corruption).

Hmmm, I didn't consider this possibility. For now I would simply disable
the oom_reaper for sysctl_oom_kill_allocating_task. oom_kill_allocating_task
needs some more changes IMO: a) we shouldn't kill children as a
heuristic, and b) we should go ahead and panic if the current task is
already TIF_MEMDIE, because that means we cannot do anything about the
OOM. But I think this should be handled separately.

Would the following be acceptable for now?
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7e9953a64489..357cee067950 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -678,7 +678,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
+	bool can_oom_reap;
+
+	/*
+	 * oom_kill_allocating_task doesn't follow normal OOM exclusion
+	 * and so the same task might enter oom_kill_process which oom_reaper
+	 * cannot handle currently.
+	 */
+	can_oom_reap = !sysctl_oom_kill_allocating_task;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 6/5] oom, oom_reaper: disable oom_reaper for
  2016-02-03 13:14 ` [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing Michal Hocko
  2016-02-04 10:49   ` Tetsuo Handa
@ 2016-02-17  9:48   ` Michal Hocko
  2016-02-17 10:41     ` Tetsuo Handa
  2016-02-19 18:34     ` Michal Hocko
  1 sibling, 2 replies; 48+ messages in thread
From: Michal Hocko @ 2016-02-17  9:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

Hi Andrew,
although this can be folded into patch 5
(mm-oom_reaper-implement-oom-victims-queuing.patch) I think it would be
better to have it separate and revert after we sort out the proper
oom_kill_allocating_task behavior or handle exclusion at oom_reaper
level.

Thanks!
---
>From 7d8c953994f97fb38a8d71b53c06ecf8418616e9 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 17 Feb 2016 10:40:41 +0100
Subject: [PATCH] oom, oom_reaper: disable oom_reaper for
 oom_kill_allocating_task

Tetsuo has reported that oom_kill_allocating_task=1 will cause
oom_reaper_list corruption because oom_kill_process doesn't follow the
standard OOM exclusion (aka ignores TIF_MEMDIE) and allows enqueueing
the same task multiple times - e.g. by sacrificing the same child
multiple times. Let's work around this issue for now until we decide
how to handle oom_kill_allocating_task properly (should it sacrifice
children at all?) or come up with some other protection.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7e9953a64489..078e07ec0906 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -678,7 +678,14 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
+	bool can_oom_reap;
+
+	/*
+	 * XXX: oom_kill_allocating_task doesn't follow normal OOM exclusion
+	 * and so the same task might enter oom_kill_process which oom_reaper
+	 * cannot handle currently.
+	 */
+	can_oom_reap = !sysctl_oom_kill_allocating_task;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
-- 
2.7.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for
  2016-02-17  9:48   ` [PATCH 6/5] oom, oom_reaper: disable oom_reaper for Michal Hocko
@ 2016-02-17 10:41     ` Tetsuo Handa
  2016-02-17 11:33       ` Michal Hocko
  2016-02-19 18:34     ` Michal Hocko
  1 sibling, 1 reply; 48+ messages in thread
From: Tetsuo Handa @ 2016-02-17 10:41 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: rientjes, mgorman, oleg, torvalds, hughd, andrea, riel, linux-mm,
	linux-kernel

Michal Hocko wrote:
> Hi Andrew,
> although this can be folded into patch 5
> (mm-oom_reaper-implement-oom-victims-queuing.patch) I think it would be
> better to have it separate and revert after we sort out the proper
> oom_kill_allocating_task behavior or handle exclusion at oom_reaper
> level.

What a rough workaround. sysctl_oom_kill_allocating_task == 1 does not
always mean we must skip the OOM reaper, because OOM-unkillable callers
take the sysctl_oom_kill_allocating_task == 0 path.

I've just posted a patchset which allows you to merge the OOM reaper
without correcting problems found in "[PATCH 3/5] oom: clear TIF_MEMDIE
after oom_reaper managed to unmap the address space" and "[PATCH 5/5]
mm, oom_reaper: implement OOM victims queuing".

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for
  2016-02-17 10:41     ` Tetsuo Handa
@ 2016-02-17 11:33       ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-02-17 11:33 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Wed 17-02-16 19:41:53, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Hi Andrew,
> > although this can be folded into patch 5
> > (mm-oom_reaper-implement-oom-victims-queuing.patch) I think it would be
> > better to have it separate and revert after we sort out the proper
> > oom_kill_allocating_task behavior or handle exclusion at oom_reaper
> > level.
> 
> What a rough workaround. sysctl_oom_kill_allocating_task == 1 does not
> always mean we must skip the OOM reaper, because OOM-unkillable callers
> take the sysctl_oom_kill_allocating_task == 0 path.

Yes, it is indeed rough, but it also shouldn't add new issues. I consider
oom_kill_allocating_task a borderline case which can be sorted out later.
So while I do not like workarounds like this in general, I would rather
go with obvious code first before going for more complex solutions.

> I've just posted a patchset which allows you to merge the OOM reaper
> without correcting problems found in "[PATCH 3/5] oom: clear TIF_MEMDIE
> after oom_reaper managed to unmap the address space" and "[PATCH 5/5]
> mm, oom_reaper: implement OOM victims queuing".

I will try to look at your patches, but the series seems unnecessarily
heavy to be a prerequisite for the oom_reaper.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for
  2016-02-17  9:48   ` [PATCH 6/5] oom, oom_reaper: disable oom_reaper for Michal Hocko
  2016-02-17 10:41     ` Tetsuo Handa
@ 2016-02-19 18:34     ` Michal Hocko
  2016-02-20  2:32       ` [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task Tetsuo Handa
  1 sibling, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2016-02-19 18:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Wed 17-02-16 10:48:55, Michal Hocko wrote:
> Hi Andrew,
> although this can be folded into patch 5
> (mm-oom_reaper-implement-oom-victims-queuing.patch) I think it would be
> better to have it separate and revert after we sort out the proper
> oom_kill_allocating_task behavior or handle exclusion at oom_reaper
> level.

An alternative would be something like the following. It is definitely
less hackish, but it steals one bit in mm->flags. We do not seem to be
short of bits there now, but who knows. Does this sound better? Later
changes might even consider the flag during victim selection and ignore
tasks which already have the flag set, but I haven't thought about it
enough to form a patch yet.
---
>From 8b17e66a70edac65ecd6df411a675cf3d840a9fe Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 17 Feb 2016 10:40:41 +0100
Subject: [PATCH] oom, oom_reaper: disable oom_reaper for
 oom_kill_allocating_task

Tetsuo has reported that oom_kill_allocating_task=1 will cause
oom_reaper_list corruption because oom_kill_process doesn't follow the
standard OOM exclusion (aka ignores TIF_MEMDIE) and allows enqueueing
the same task multiple times - e.g. by sacrificing the same child
multiple times.

This patch fixes the issue by introducing a new MMF_OOM_KILLED mm flag,
which is set atomically in oom_kill_process; the oom reaper is disabled
if the flag was already set.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h | 2 ++
 mm/oom_kill.c         | 6 +++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c25996c336de..0552cd5696c2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -509,6 +509,8 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_UPROBES		19	/* has uprobes */
 #define MMF_RECALC_UPROBES	20	/* MMF_HAS_UPROBES can be wrong */
 
+#define MMF_OOM_KILLED		21	/* OOM killer has chosen this mm */
+
 #define MMF_INIT_MASK		(MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)
 
 struct sighand_struct {
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 7e9953a64489..32ce05b1aa10 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -678,7 +678,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	unsigned int victim_points = 0;
 	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
 					      DEFAULT_RATELIMIT_BURST);
-	bool can_oom_reap = true;
+	bool can_oom_reap;
 
 	/*
 	 * If the task is already exiting, don't alarm the sysadmin or kill
@@ -740,6 +740,10 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
 	/* Get a reference to safely compare mm after task_unlock(victim) */
 	mm = victim->mm;
 	atomic_inc(&mm->mm_count);
+
+	/* Make sure we do not try to oom reap the mm multiple times */
+	can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);
+
 	/*
 	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
 	 * the OOM victim from depleting the memory reserves from the user
-- 
2.7.0


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-02-19 18:34     ` Michal Hocko
@ 2016-02-20  2:32       ` Tetsuo Handa
  2016-02-22  9:41         ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Tetsuo Handa @ 2016-02-20  2:32 UTC (permalink / raw)
  To: mhocko, akpm
  Cc: rientjes, mgorman, oleg, torvalds, hughd, andrea, riel, linux-mm,
	linux-kernel

Michal Hocko wrote:
> On Wed 17-02-16 10:48:55, Michal Hocko wrote:
> > Hi Andrew,
> > although this can be folded into patch 5
> > (mm-oom_reaper-implement-oom-victims-queuing.patch) I think it would be
> > better to have it separate and revert after we sort out the proper
> > oom_kill_allocating_task behavior or handle exclusion at oom_reaper
> > level.
> 
> An alternative would be something like the following. It is definitely
> less hackish, but it steals one bit in mm->flags. We do not seem to be
> short of bits there now, but who knows. Does this sound better? Later
> changes might even consider the flag during victim selection and ignore
> tasks which already have the flag set, but I haven't thought about it
> enough to form a patch yet.

This sounds better than "can_oom_reap = !sysctl_oom_kill_allocating_task;".

> @@ -740,6 +740,10 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>  	/* Get a reference to safely compare mm after task_unlock(victim) */
>  	mm = victim->mm;
>  	atomic_inc(&mm->mm_count);
> +
> +	/* Make sure we do not try to oom reap the mm multiple times */
> +	can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);
> +
>  	/*
>  	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
>  	 * the OOM victim from depleting the memory reserves from the user

But as of this line we don't know whether this mm is reapable.

Shouldn't this be done like

  static void wake_oom_reaper(struct task_struct *tsk)
  {
          /* Make sure we do not try to oom reap the mm multiple times */
          if (!oom_reaper_th || test_and_set_bit(MMF_OOM_KILLED, &mm->flags))
                  return;

          get_task_struct(tsk);

          spin_lock(&oom_reaper_lock);
          list_add(&tsk->oom_reaper_list, &oom_reaper_list);
          spin_unlock(&oom_reaper_lock);
          wake_up(&oom_reaper_wait);
  }

?

Moreover, why don't you do something like

  struct mm_struct {
  	(...snipped...)
  	struct list_head oom_reaper_list;
  	(...snipped...)
  }

rather than

  struct task_struct {
  	(...snipped...)
  	struct list_head oom_reaper_list;
  	(...snipped...)
  }

so that we can update the ->oom_score_adj of all tasks using this mm_struct,
for handling the crazy combo ( http://lkml.kernel.org/r/20160204163113.GF14425@dhcp22.suse.cz ) ?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
  2016-02-20  2:32       ` [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task Tetsuo Handa
@ 2016-02-22  9:41         ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-02-22  9:41 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: akpm, rientjes, mgorman, oleg, torvalds, hughd, andrea, riel,
	linux-mm, linux-kernel

On Sat 20-02-16 11:32:07, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 17-02-16 10:48:55, Michal Hocko wrote:
> > > Hi Andrew,
> > > although this can be folded into patch 5
> > > (mm-oom_reaper-implement-oom-victims-queuing.patch) I think it would be
> > > better to have it separate and revert after we sort out the proper
> > > oom_kill_allocating_task behavior or handle exclusion at oom_reaper
> > > level.
> > 
> > An alternative would be something like the following. It is definitely
> > less hackish but it steals one bit in mm->flags. We do not seem to be
> > in shortage there now but who knows. Does this sound better? Later
> > changes might even consider the flag for the victim selection and ignore
> > those which already have the flag set. But I didn't think about it more
> > to form a patch yet.
> 
> This sounds better than "can_oom_reap = !sysctl_oom_kill_allocating_task;".
> 
> > @@ -740,6 +740,10 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
> >  	/* Get a reference to safely compare mm after task_unlock(victim) */
> >  	mm = victim->mm;
> >  	atomic_inc(&mm->mm_count);
> > +
> > +	/* Make sure we do not try to oom reap the mm multiple times */
> > +	can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);
> > +
> >  	/*
> >  	 * We should send SIGKILL before setting TIF_MEMDIE in order to prevent
> >  	 * the OOM victim from depleting the memory reserves from the user
> 
> But as of this line we don't know whether this mm is reapable.

Which is not really important. We know that the mm is eligible only if it
wasn't part of an OOM kill before. Later checks are, of course,
allowed to veto the default and disable the oom reaper.

> Shouldn't this be done like
> 
>   static void wake_oom_reaper(struct task_struct *tsk)
>   {
>           /* Make sure we do not try to oom reap the mm multiple times */
>           if (!oom_reaper_th || test_and_set_bit(MMF_OOM_KILLED, &mm->flags))
>                   return;

We do not have the mm here. We have a task and would need the task_lock.
I find it much easier to evaluate mm while we still have it and we know
the task holding this mm will receive SIGKILL and TIF_MEMDIE.
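
For completeness, doing it in wake_oom_reaper would have to look
something like this sketch (mm_first_oom_kill is a made-up helper name):

	static bool mm_first_oom_kill(struct task_struct *tsk)
	{
		struct task_struct *t;
		bool first;

		t = find_lock_task_mm(tsk);	/* returns with task_lock(t) held */
		if (!t)
			return false;		/* no mm left, nothing to reap */
		first = !test_and_set_bit(MMF_OOM_KILLED, &t->mm->flags);
		task_unlock(t);
		return first;
	}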
 
>           get_task_struct(tsk);
> 
>           spin_lock(&oom_reaper_lock);
>           list_add(&tsk->oom_reaper_list, &oom_reaper_list);
>           spin_unlock(&oom_reaper_lock);
>           wake_up(&oom_reaper_wait);
>   }
> 
> ?
> 
> Moreover, why don't you do something like
> 
>   struct mm_struct {
>   	(...snipped...)
>   	struct list_head oom_reaper_list;
>   	(...snipped...)
>   }

Because we would need to search all tasks sharing the same mm in order
to call exit_oom_victim.

> rather than
> 
>   struct task_struct {
>   	(...snipped...)
>   	struct list_head oom_reaper_list;
>   	(...snipped...)
>   }
> 
> so that we can update the ->oom_score_adj of all tasks using this mm_struct,
> for handling the crazy combo ( http://lkml.kernel.org/r/20160204163113.GF14425@dhcp22.suse.cz ) ?

I find it much easier to simply skip over tasks with MMF_OOM_KILLED
already when selecting a victim. We won't need oom_score_adj games at
all. This needs a deeper evaluation though. I didn't get to it yet,
but the point of having an MMF flag which is not oom_reaper specific
was to have it reusable in other contexts as well.
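
Something like the following check inside oom_scan_process_thread() is
what I have in mind (a sketch only; locking around ->mm is elided):

	if (task->mm && test_bit(MMF_OOM_KILLED, &task->mm->flags))
		return OOM_SCAN_CONTINUE;	/* already being reaped, skip it */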

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/5] oom reaper: handle mlocked pages
  2016-02-03 13:13 ` [PATCH 2/5] oom reaper: handle mlocked pages Michal Hocko
  2016-02-03 23:57   ` David Rientjes
@ 2016-02-23  1:36   ` David Rientjes
  2016-02-23 13:21     ` Michal Hocko
  1 sibling, 1 reply; 48+ messages in thread
From: David Rientjes @ 2016-02-23  1:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML, Michal Hocko

On Wed, 3 Feb 2016, Michal Hocko wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 9a0e4e5f50b4..840e03986497 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -443,13 +443,6 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
>  			continue;
>  
>  		/*
> -		 * mlocked VMAs require explicit munlocking before unmap.
> -		 * Let's keep it simple here and skip such VMAs.
> -		 */
> -		if (vma->vm_flags & VM_LOCKED)
> -			continue;
> -
> -		/*
>  		 * Only anonymous pages have a good chance to be dropped
>  		 * without additional steps which we cannot afford as we
>  		 * are OOM already.
> @@ -459,9 +452,12 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
>  		 * we do not want to block exit_mmap by keeping mm ref
>  		 * count elevated without a good reason.
>  		 */
> -		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> +		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
> +			if (vma->vm_flags & VM_LOCKED)
> +				munlock_vma_pages_all(vma);
>  			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
>  					 &details);
> +		}
>  	}
>  	tlb_finish_mmu(&tlb, 0, -1);
>  	up_read(&mm->mmap_sem);

Are we concerned about munlock_vma_pages_all() taking lock_page() and 
perhaps stalling forever, the same way it would stall in exit_mmap() for 
VM_LOCKED vmas, if another thread has locked the same page and is doing an 
allocation?  I'm wondering if in that case it would be better to do a 
best-effort munlock_vma_pages_all() with trylock_page() and just give up 
on releasing memory from that particular vma.  In that case, there may be 
other memory that can be freed with unmap_page_range() that would handle 
this livelock.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/5] oom reaper: handle mlocked pages
  2016-02-23  1:36   ` David Rientjes
@ 2016-02-23 13:21     ` Michal Hocko
  2016-02-29  3:19       ` Hugh Dickins
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2016-02-23 13:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Tetsuo Handa, Oleg Nesterov,
	Linus Torvalds, Hugh Dickins, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Mon 22-02-16 17:36:07, David Rientjes wrote:
> On Wed, 3 Feb 2016, Michal Hocko wrote:
> 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 9a0e4e5f50b4..840e03986497 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -443,13 +443,6 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
> >  			continue;
> >  
> >  		/*
> > -		 * mlocked VMAs require explicit munlocking before unmap.
> > -		 * Let's keep it simple here and skip such VMAs.
> > -		 */
> > -		if (vma->vm_flags & VM_LOCKED)
> > -			continue;
> > -
> > -		/*
> >  		 * Only anonymous pages have a good chance to be dropped
> >  		 * without additional steps which we cannot afford as we
> >  		 * are OOM already.
> > @@ -459,9 +452,12 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
> >  		 * we do not want to block exit_mmap by keeping mm ref
> >  		 * count elevated without a good reason.
> >  		 */
> > -		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> > +		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
> > +			if (vma->vm_flags & VM_LOCKED)
> > +				munlock_vma_pages_all(vma);
> >  			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
> >  					 &details);
> > +		}
> >  	}
> >  	tlb_finish_mmu(&tlb, 0, -1);
> >  	up_read(&mm->mmap_sem);
> 
> Are we concerned about munlock_vma_pages_all() taking lock_page() and 
> perhaps stalling forever, the same way it would stall in exit_mmap() for 
> VM_LOCKED vmas, if another thread has locked the same page and is doing an 
> allocation?

This is a good question. I have checked for that particular case
previously and managed to convince myself that this is OK(ish).
munlock_vma_pages_range locks only THP pages, to prevent a parallel
split-up AFAICS. And split_huge_page_to_list doesn't seem to depend
on an allocation. It can block on the anon_vma lock, but I didn't
see any allocation requests from there either. I might be missing
something of course. Do you have any specific path in mind?

> I'm wondering if in that case it would be better to do a 
> best-effort munlock_vma_pages_all() with trylock_page() and just give up 
> on releasing memory from that particular vma.  In that case, there may be 
> other memory that can be freed with unmap_page_range() that would handle 
> this livelock.

I have tried to code it up, but I am not really sure the whole churn is
really worth it - unless I am missing something that would make
the THP case likely to hit in real life.

Just for the reference this is what I came up with (just compile tested).
---
diff --git a/mm/internal.h b/mm/internal.h
index cac6eb458727..63dcdd60aca8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -249,11 +249,13 @@ void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
 #ifdef CONFIG_MMU
 extern long populate_vma_page_range(struct vm_area_struct *vma,
 		unsigned long start, unsigned long end, int *nonblocking);
-extern void munlock_vma_pages_range(struct vm_area_struct *vma,
-			unsigned long start, unsigned long end);
-static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
+
+/* Can fail only if enforce == false */
+extern int munlock_vma_pages_range(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end, bool enforce);
+static inline int munlock_vma_pages_all(struct vm_area_struct *vma, bool enforce)
 {
-	munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end);
+	return munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end, enforce);
 }
 
 /*
diff --git a/mm/mlock.c b/mm/mlock.c
index 96f001041928..934c0f8f8ebc 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -431,8 +431,9 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
  * and re-mlocked by try_to_{munlock|unmap} before we unmap and
  * free them.  This will result in freeing mlocked pages.
  */
-void munlock_vma_pages_range(struct vm_area_struct *vma,
-			     unsigned long start, unsigned long end)
+int munlock_vma_pages_range(struct vm_area_struct *vma,
+			     unsigned long start, unsigned long end,
+			     bool enforce)
 {
 	vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
 
@@ -460,7 +461,13 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 				VM_BUG_ON_PAGE(PageMlocked(page), page);
 				put_page(page); /* follow_page_mask() */
 			} else if (PageTransHuge(page)) {
-				lock_page(page);
+				if (enforce) {
+					lock_page(page);
+				} else if (!trylock_page(page)) {
+					put_page(page);
+					return -EAGAIN;
+				}
+
 				/*
 				 * Any THP page found by follow_page_mask() may
 				 * have gotten split before reaching
@@ -497,6 +504,8 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 next:
 		cond_resched();
 	}
+
+	return 0;
 }
 
 /*
@@ -561,7 +570,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	if (lock)
 		vma->vm_flags = newflags;
 	else
-		munlock_vma_pages_range(vma, start, end);
+		munlock_vma_pages_range(vma, start, end, true);
 
 out:
 	*prev = vma;
diff --git a/mm/mmap.c b/mm/mmap.c
index cfc0cdca421e..7c2ed6e7b415 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2592,7 +2592,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len)
 		while (tmp && tmp->vm_start < end) {
 			if (tmp->vm_flags & VM_LOCKED) {
 				mm->locked_vm -= vma_pages(tmp);
-				munlock_vma_pages_all(tmp);
+				munlock_vma_pages_all(tmp, true);
 			}
 			tmp = tmp->vm_next;
 		}
@@ -2683,7 +2683,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 	if (vma->vm_flags & VM_LOCKED) {
 		flags |= MAP_LOCKED;
 		/* drop PG_Mlocked flag for over-mapped range */
-		munlock_vma_pages_range(vma, start, start + size);
+		munlock_vma_pages_range(vma, start, start + size, true);
 	}
 
 	file = get_file(vma->vm_file);
@@ -2825,7 +2825,7 @@ void exit_mmap(struct mm_struct *mm)
 		vma = mm->mmap;
 		while (vma) {
 			if (vma->vm_flags & VM_LOCKED)
-				munlock_vma_pages_all(vma);
+				munlock_vma_pages_all(vma, true);
 			vma = vma->vm_next;
 		}
 	}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 32ce05b1aa10..09e6f3211f1c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -473,7 +473,8 @@ static bool __oom_reap_task(struct task_struct *tsk)
 		 */
 		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
 			if (vma->vm_flags & VM_LOCKED)
-				munlock_vma_pages_all(vma);
+				if (munlock_vma_pages_all(vma, false))
+					continue;
 			unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
 					 &details);
 		}

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/5] oom reaper: handle mlocked pages
  2016-02-23 13:21     ` Michal Hocko
@ 2016-02-29  3:19       ` Hugh Dickins
  2016-02-29 13:41         ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Hugh Dickins @ 2016-02-29  3:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Andrew Morton, Mel Gorman, Tetsuo Handa,
	Oleg Nesterov, Linus Torvalds, Hugh Dickins, Andrea Argangeli,
	Rik van Riel, linux-mm, LKML

On Tue, 23 Feb 2016, Michal Hocko wrote:
> On Mon 22-02-16 17:36:07, David Rientjes wrote:
> > 
> > Are we concerned about munlock_vma_pages_all() taking lock_page() and 
> > perhaps stalling forever, the same way it would stall in exit_mmap() for 
> > VM_LOCKED vmas, if another thread has locked the same page and is doing an 
> > allocation?
> 
> This is a good question. I have checked for that particular case
> previously and managed to convince myself that this is OK(ish).
> munlock_vma_pages_range locks only THP pages to prevent a
> parallel split-up AFAICS.

I think you're mistaken on that: there is also the lock_page()
on every page in Phase 2 of __munlock_pagevec().

> And split_huge_page_to_list doesn't seem
> to depend on an allocation. It can block on anon_vma lock but I didn't
> see any allocation requests from there either. I might be missing
> something of course. Do you have any specific path in mind?
> 
> > I'm wondering if in that case it would be better to do a 
> > best-effort munlock_vma_pages_all() with trylock_page() and just give up 
> > on releasing memory from that particular vma.  In that case, there may be 
> > other memory that can be freed with unmap_page_range() that would handle 
> > this livelock.

I agree with David that we ought to trylock_page() throughout munlock:
just so long as it gets to do the TestClearPageMlocked without demanding
the page lock, the rest is the usual sugarcoating for accurate Mlocked
stats, which we can leave for reclaim to fix up.
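
For reference, here is roughly what the Phase 2 loop mentioned above
looks like (heavily simplified from mm/mlock.c of this era - the
isolation check and the fast-path putback are elided), with the
trylock variant sketched in the trailing comment:

	/* Phase 2 of __munlock_pagevec(), heavily simplified */
	for (i = 0; i < nr; i++) {
		struct page *page = pvec->pages[i];

		if (!page)
			continue;

		lock_page(page);	/* blocks behind any page-lock holder */
		get_page(page);		/* putback_lru_page() will drop one ref */
		__munlock_isolated_page(page);	/* try_to_munlock() + putback */
		unlock_page(page);
		put_page(page);		/* pin from follow_page_mask() */

		/*
		 * The trylock variant would instead be:
		 *
		 *	if (!trylock_page(page)) {
		 *		put_page(page);  <- skip; reclaim fixes stats later
		 *		continue;
		 *	}
		 */
	}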

> 
> I have tried to code it up but I am not really sure the whole churn is
> really worth it - unless I am missing something that would really make
> the THP case likely to hit in real life.

Though I must have known about it forever, it was a shock to see all
those page locks demanded in exit, brought home to us a week or so ago.

The proximate cause in this case was my own change, to defer pte_alloc
to suit huge tmpfs: it had not previously occurred to me that I was
now doing the pte_alloc while __do_fault holds page lock.  Bad Hugh.
But change not yet upstream, so not so urgent for you.

From time immemorial, free_swap_and_cache() and free_swap_cache() only
ever trylock a page, precisely so that they never hold up munmap or exit
(well, if I looked harder, I might find lock ordering reasons too).
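
The pattern in question, paraphrased from free_swap_and_cache() in
mm/swapfile.c (the swap-count bookkeeping and locking around it are
elided):

	/* only ever trylock, so munmap/exit never sleep on a page lock */
	page = find_get_page(swap_address_space(entry), entry.val);
	if (page && !trylock_page(page)) {
		put_page(page);		/* somebody holds the lock: give up */
		page = NULL;		/* leave the swapcache copy to reclaim */
	}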

> 
> Just for reference, this is what I came up with (just compile tested).

I tried something similar internally (on an earlier kernel).  Like
you, I've set that work aside for now; there were quicker ways to fix
the issue at hand.  But it does continue to offend me that munlock
demands all those page locks: so if you don't get back to it before me,
I shall eventually.

I didn't understand why you complicated yours with the "enforce"
arg to munlock_vma_pages_range(): why not just trylock in all cases?

Hugh

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/5] oom reaper: handle mlocked pages
  2016-02-29  3:19       ` Hugh Dickins
@ 2016-02-29 13:41         ` Michal Hocko
  2016-03-08 13:40           ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2016-02-29 13:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Rientjes, Andrew Morton, Mel Gorman, Tetsuo Handa,
	Oleg Nesterov, Linus Torvalds, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Sun 28-02-16 19:19:11, Hugh Dickins wrote:
> On Tue, 23 Feb 2016, Michal Hocko wrote:
> > On Mon 22-02-16 17:36:07, David Rientjes wrote:
> > > 
> > > Are we concerned about munlock_vma_pages_all() taking lock_page() and 
> > > perhaps stalling forever, the same way it would stall in exit_mmap() for 
> > > VM_LOCKED vmas, if another thread has locked the same page and is doing an 
> > > allocation?
> > 
> > This is a good question. I have checked for that particular case
> > previously and managed to convince myself that this is OK(ish).
> > munlock_vma_pages_range locks only THP pages to prevent a
> > parallel split-up AFAICS.
> 
> I think you're mistaken on that: there is also the lock_page()
> on every page in Phase 2 of __munlock_pagevec().

Ohh, I have missed that one. Thanks for pointing it out!

[...]
> > Just for reference, this is what I came up with (just compile tested).
> 
> I tried something similar internally (on an earlier kernel).  Like
> you, I've set that work aside for now; there were quicker ways to fix
> the issue at hand.  But it does continue to offend me that munlock
> demands all those page locks: so if you don't get back to it before me,
> I shall eventually.
> 
> I didn't understand why you complicated yours with the "enforce"
> arg to munlock_vma_pages_range(): why not just trylock in all cases?

Well, I have to confess that I am not really sure I understand all the
consequences of the locking here. There have always been subtle and
weird issues popping up from time to time. So I only wanted to have
that change limited to the oom_reaper. So I would really appreciate it
if somebody more knowledgeable had a look. We can drop the mlock patch
for now.

Thanks for looking into this, Hugh!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/5] oom reaper: handle mlocked pages
  2016-02-29 13:41         ` Michal Hocko
@ 2016-03-08 13:40           ` Michal Hocko
  2016-03-08 20:07             ` Hugh Dickins
  0 siblings, 1 reply; 48+ messages in thread
From: Michal Hocko @ 2016-03-08 13:40 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Rientjes, Andrew Morton, Mel Gorman, Tetsuo Handa,
	Oleg Nesterov, Linus Torvalds, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Mon 29-02-16 14:41:39, Michal Hocko wrote:
> On Sun 28-02-16 19:19:11, Hugh Dickins wrote:
> > On Tue, 23 Feb 2016, Michal Hocko wrote:
> > > On Mon 22-02-16 17:36:07, David Rientjes wrote:
> > > > 
> > > > Are we concerned about munlock_vma_pages_all() taking lock_page() and 
> > > > perhaps stalling forever, the same way it would stall in exit_mmap() for 
> > > > VM_LOCKED vmas, if another thread has locked the same page and is doing an 
> > > > allocation?
> > > 
> > > This is a good question. I have checked for that particular case
> > > previously and managed to convince myself that this is OK(ish).
> > > munlock_vma_pages_range locks only THP pages to prevent a
> > > parallel split-up AFAICS.
> > 
> > I think you're mistaken on that: there is also the lock_page()
> > on every page in Phase 2 of __munlock_pagevec().
> 
> Ohh, I have missed that one. Thanks for pointing it out!
> 
> [...]
> > > Just for reference, this is what I came up with (just compile tested).
> > 
> > I tried something similar internally (on an earlier kernel).  Like
> > you, I've set that work aside for now; there were quicker ways to fix
> > the issue at hand.  But it does continue to offend me that munlock
> > demands all those page locks: so if you don't get back to it before me,
> > I shall eventually.
> > 
> > I didn't understand why you complicated yours with the "enforce"
> > arg to munlock_vma_pages_range(): why not just trylock in all cases?
> 
> Well, I have to confess that I am not really sure I understand all the
> consequences of the locking here. There have always been subtle and
> weird issues popping up from time to time. So I only wanted to have
> that change limited to the oom_reaper. So I would really appreciate it
> if somebody more knowledgeable had a look. We can drop the mlock patch
> for now.

According to the rc7 announcement, it seems we are approaching the
merge window. Should we drop the patch for now? Or is the risk of the
lockup too low to care about, in which case we keep it in, as it might
already be useful, and change the munlock path to not depend on page
locks later on?

I am OK with either way.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/5] oom reaper: handle mlocked pages
  2016-03-08 13:40           ` Michal Hocko
@ 2016-03-08 20:07             ` Hugh Dickins
  2016-03-09  8:26               ` Michal Hocko
  0 siblings, 1 reply; 48+ messages in thread
From: Hugh Dickins @ 2016-03-08 20:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hugh Dickins, David Rientjes, Andrew Morton, Mel Gorman,
	Tetsuo Handa, Oleg Nesterov, Linus Torvalds, Andrea Argangeli,
	Rik van Riel, linux-mm, LKML

On Tue, 8 Mar 2016, Michal Hocko wrote:
> On Mon 29-02-16 14:41:39, Michal Hocko wrote:
> > On Sun 28-02-16 19:19:11, Hugh Dickins wrote:
> > > On Tue, 23 Feb 2016, Michal Hocko wrote:
> > > > On Mon 22-02-16 17:36:07, David Rientjes wrote:
> > > > > 
> > > > > Are we concerned about munlock_vma_pages_all() taking lock_page() and 
> > > > > perhaps stalling forever, the same way it would stall in exit_mmap() for 
> > > > > VM_LOCKED vmas, if another thread has locked the same page and is doing an 
> > > > > allocation?
> > > > 
> > > > This is a good question. I have checked for that particular case
> > > > previously and managed to convince myself that this is OK(ish).
> > > > munlock_vma_pages_range locks only THP pages to prevent a
> > > > parallel split-up AFAICS.
> > > 
> > > I think you're mistaken on that: there is also the lock_page()
> > > on every page in Phase 2 of __munlock_pagevec().
> > 
> > Ohh, I have missed that one. Thanks for pointing it out!
> > 
> > [...]
> > > > Just for reference, this is what I came up with (just compile tested).
> > > 
> > > I tried something similar internally (on an earlier kernel).  Like
> > > you, I've set that work aside for now; there were quicker ways to fix
> > > the issue at hand.  But it does continue to offend me that munlock
> > > demands all those page locks: so if you don't get back to it before me,
> > > I shall eventually.
> > > 
> > > I didn't understand why you complicated yours with the "enforce"
> > > arg to munlock_vma_pages_range(): why not just trylock in all cases?
> > 
> > Well, I have to confess that I am not really sure I understand all the
> > consequences of the locking here. There have always been subtle and
> > weird issues popping up from time to time. So I only wanted to have
> > that change limited to the oom_reaper. So I would really appreciate it
> > if somebody more knowledgeable had a look. We can drop the mlock patch
> > for now.
> 
> According to the rc7 announcement, it seems we are approaching the
> merge window. Should we drop the patch for now? Or is the risk of the
> lockup too low to care about, in which case we keep it in, as it might
> already be useful, and change the munlock path to not depend on page
> locks later on?
> 
> I am OK with either way.

You're asking about the Subject patch, "oom reaper: handle mlocked pages",
I presume.  Your Work-In-Progress mods to munlock_vma_pages_range() should
certainly be dropped for now, and revisited by one of us another time.

I vote for dropping "oom reaper: handle mlocked pages" for now too.
If I understand correctly, the purpose of the oom reaper is to free up
as much memory from the targeted task as possible, while avoiding getting
stuck on locks; in advance of the task actually exiting and doing the
freeing itself, but perhaps getting stuck on locks as it does so.

If that's a fair description, then it's inappropriate for the oom reaper
to call munlock_vma_pages_all(), with the risk of getting stuck on many
page locks; best leave that risk to the task when it exits as at present.
Of course we should come back to this later, fix munlock_vma_pages_range()
with trylocks (on the pages only? rmap mutexes also?), and then integrate
"oom reaper: handle mlocked pages".

(Or if we had the old mechanism for scanning unevictable lrus on demand,
perhaps simply not avoid the VM_LOCKED vmas in __oom_reap_vmas(), let
the clear_page_mlock() in page_remove_*rmap() handle all the singly
mapped and mlocked pages, and un-mlock the rest by scanning unevictables.)
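
A very rough sketch of such an on-demand rescan, purely hypothetical
(check_move_unevictable_pages() in mm/vmscan.c is real - it is what
shmem_unlock_mapping() uses - but the helper around it is invented):

	/*
	 * Hypothetical: pages gathered from the unevictable LRU, with
	 * references held; check_move_unevictable_pages() re-checks
	 * page_evictable() and moves the no-longer-mlocked ones back
	 * to an evictable LRU.
	 */
	static void rescan_unevictable(struct pagevec *pvec)
	{
		check_move_unevictable_pages(pvec->pages, pagevec_count(pvec));
		pagevec_release(pvec);
	}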

Hugh

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 2/5] oom reaper: handle mlocked pages
  2016-03-08 20:07             ` Hugh Dickins
@ 2016-03-09  8:26               ` Michal Hocko
  0 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2016-03-09  8:26 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Rientjes, Andrew Morton, Mel Gorman, Tetsuo Handa,
	Oleg Nesterov, Linus Torvalds, Andrea Argangeli, Rik van Riel,
	linux-mm, LKML

On Tue 08-03-16 12:07:24, Hugh Dickins wrote:
> On Tue, 8 Mar 2016, Michal Hocko wrote:
> > On Mon 29-02-16 14:41:39, Michal Hocko wrote:
> > > On Sun 28-02-16 19:19:11, Hugh Dickins wrote:
> > > > On Tue, 23 Feb 2016, Michal Hocko wrote:
> > > > > On Mon 22-02-16 17:36:07, David Rientjes wrote:
> > > > > > 
> > > > > > Are we concerned about munlock_vma_pages_all() taking lock_page() and 
> > > > > > perhaps stalling forever, the same way it would stall in exit_mmap() for 
> > > > > > VM_LOCKED vmas, if another thread has locked the same page and is doing an 
> > > > > > allocation?
> > > > > 
> > > > > This is a good question. I have checked for that particular case
> > > > > previously and managed to convince myself that this is OK(ish).
> > > > > munlock_vma_pages_range locks only THP pages to prevent a
> > > > > parallel split-up AFAICS.
> > > > 
> > > > I think you're mistaken on that: there is also the lock_page()
> > > > on every page in Phase 2 of __munlock_pagevec().
> > > 
> > > Ohh, I have missed that one. Thanks for pointing it out!
> > > 
> > > [...]
> > > > > Just for reference, this is what I came up with (just compile tested).
> > > > 
> > > > I tried something similar internally (on an earlier kernel).  Like
> > > > you, I've set that work aside for now; there were quicker ways to fix
> > > > the issue at hand.  But it does continue to offend me that munlock
> > > > demands all those page locks: so if you don't get back to it before me,
> > > > I shall eventually.
> > > > 
> > > > I didn't understand why you complicated yours with the "enforce"
> > > > arg to munlock_vma_pages_range(): why not just trylock in all cases?
> > > 
> > > Well, I have to confess that I am not really sure I understand all the
> > > consequences of the locking here. There have always been subtle and
> > > weird issues popping up from time to time. So I only wanted to have
> > > that change limited to the oom_reaper. So I would really appreciate it
> > > if somebody more knowledgeable had a look. We can drop the mlock patch
> > > for now.
> > 
> > According to the rc7 announcement, it seems we are approaching the
> > merge window. Should we drop the patch for now? Or is the risk of the
> > lockup too low to care about, in which case we keep it in, as it might
> > already be useful, and change the munlock path to not depend on page
> > locks later on?
> > 
> > I am OK with either way.
> 
> You're asking about the Subject patch, "oom reaper: handle mlocked pages",
> I presume.  Your Work-In-Progress mods to munlock_vma_pages_range() should
> certainly be dropped for now, and revisited by one of us another time.

I believe it hasn't landed in the mmotm yet.

> I vote for dropping "oom reaper: handle mlocked pages" for now too.

OK, Andrew, could you drop oom-reaper-handle-mlocked-pages.patch for
now? We will revisit it later on, after we make the munlock path free
of page locks.

> If I understand correctly, the purpose of the oom reaper is to free up
> as much memory from the targeted task as possible, while avoiding getting
> stuck on locks; in advance of the task actually exiting and doing the
> freeing itself, but perhaps getting stuck on locks as it does so.
> 
> If that's a fair description, then it's inappropriate for the oom reaper
> to call munlock_vma_pages_all(), with the risk of getting stuck on many
> page locks; best leave that risk to the task when it exits as at present.
> Of course we should come back to this later, fix munlock_vma_pages_range()
> with trylocks (on the pages only? rmap mutexes also?), and then integrate
> "oom reaper: handle mlocked pages".

Fair enough.

> (Or if we had the old mechanism for scanning unevictable lrus on demand,
> perhaps simply not avoid the VM_LOCKED vmas in __oom_reap_vmas(), let
> the clear_page_mlock() in page_remove_*rmap() handle all the singly
> mapped and mlocked pages, and un-mlock the rest by scanning unevictables.)

I will have a look at this possibility as well.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread

Thread overview: 48+ messages
2016-02-03 13:13 [PATCH 0/5] oom reaper v5 Michal Hocko
2016-02-03 13:13 ` [PATCH 1/5] mm, oom: introduce oom reaper Michal Hocko
2016-02-03 23:48   ` David Rientjes
2016-02-04  6:41     ` Michal Hocko
2016-02-06 13:22   ` Tetsuo Handa
2016-02-15 20:50     ` Michal Hocko
2016-02-03 13:13 ` [PATCH 2/5] oom reaper: handle mlocked pages Michal Hocko
2016-02-03 23:57   ` David Rientjes
2016-02-23  1:36   ` David Rientjes
2016-02-23 13:21     ` Michal Hocko
2016-02-29  3:19       ` Hugh Dickins
2016-02-29 13:41         ` Michal Hocko
2016-03-08 13:40           ` Michal Hocko
2016-03-08 20:07             ` Hugh Dickins
2016-03-09  8:26               ` Michal Hocko
2016-02-03 13:13 ` [PATCH 3/5] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space Michal Hocko
2016-02-04 14:22   ` Tetsuo Handa
2016-02-04 14:43     ` Michal Hocko
2016-02-04 15:08       ` Tetsuo Handa
2016-02-04 16:31         ` Michal Hocko
2016-02-05 11:14           ` Tetsuo Handa
2016-02-06  8:30             ` Michal Hocko
2016-02-06 11:23               ` Tetsuo Handa
2016-02-15 20:47                 ` Michal Hocko
2016-02-06  6:45       ` Michal Hocko
2016-02-06 14:33         ` Tetsuo Handa
2016-02-15 20:40           ` [PATCH 3.1/5] oom: make oom_reaper freezable Michal Hocko
2016-02-03 13:13 ` [PATCH 4/5] mm, oom_reaper: report success/failure Michal Hocko
2016-02-03 23:10   ` David Rientjes
2016-02-04  6:46     ` Michal Hocko
2016-02-04 22:31       ` David Rientjes
2016-02-05  9:26         ` Michal Hocko
2016-02-06  6:34           ` Michal Hocko
2016-02-03 13:14 ` [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing Michal Hocko
2016-02-04 10:49   ` Tetsuo Handa
2016-02-04 14:53     ` Michal Hocko
2016-02-06  5:54       ` Tetsuo Handa
2016-02-06  8:37         ` Michal Hocko
2016-02-06 15:33           ` Tetsuo Handa
2016-02-15 20:15             ` Michal Hocko
2016-02-16 11:11               ` Tetsuo Handa
2016-02-16 15:53                 ` Michal Hocko
2016-02-17  9:48   ` [PATCH 6/5] oom, oom_reaper: disable oom_reaper for Michal Hocko
2016-02-17 10:41     ` Tetsuo Handa
2016-02-17 11:33       ` Michal Hocko
2016-02-19 18:34     ` Michal Hocko
2016-02-20  2:32       ` [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task Tetsuo Handa
2016-02-22  9:41         ` Michal Hocko
