From: Tetsuo Handa
To: mhocko@kernel.org, akpm@linux-foundation.org
Cc: rientjes@google.com, mgorman@suse.de, oleg@redhat.com,
    torvalds@linux-foundation.org, hughd@google.com, andrea@kernel.org,
    riel@redhat.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    mhocko@suse.com
Subject: Re: [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing
Date: Thu, 4 Feb 2016 19:49:29 +0900
Message-Id: <201602041949.BIG30715.QVFLFOOOHMtSFJ@I-love.SAKURA.ne.jp>
In-Reply-To: <1454505240-23446-6-git-send-email-mhocko@kernel.org>
References: <1454505240-23446-1-git-send-email-mhocko@kernel.org>
            <1454505240-23446-6-git-send-email-mhocko@kernel.org>

Michal Hocko wrote:
> From: Michal Hocko
>
> wake_oom_reaper has allowed only 1 oom victim to be queued. The main
> reason for that was the simplicity as other solutions would require
> some way of queuing. The current approach is racy and that was deemed
> sufficient as the oom_reaper is considered a best effort approach
> to help with oom handling when the OOM victim cannot terminate in a
> reasonable time. The race could lead to missing an oom victim which can
> get stuck:
>
>   out_of_memory
>     wake_oom_reaper
>       cmpxchg // OK
>                             oom_reaper
>                               oom_reap_task
>                                 __oom_reap_task
>   oom_victim terminates
>                                   atomic_inc_not_zero // fail
>   out_of_memory
>     wake_oom_reaper
>       cmpxchg // fails
>                             task_to_reap = NULL
>
> This race requires 2 OOM invocations in a short time period, which is
> not very likely but certainly not impossible. E.g. the original victim
> might have not released a lot of memory for some reason.
>
> The situation would improve considerably if wake_oom_reaper used a more
> robust queuing. This is what this patch implements. This means adding
> an oom_reaper_list list_head into task_struct (eat a hole before the
> embedded thread_struct for that purpose) and an oom_reaper_lock spinlock
> for queuing synchronization. wake_oom_reaper will then add the task to
> the queue and oom_reaper will dequeue it.

I think we want to rewrite this patch's description from a different
point of view.

As of "[PATCH 1/5] mm, oom: introduce oom reaper", we assumed that the
OOM reaper manages OOM livelocks caused by system-wide OOM events.
Therefore, the OOM reaper had high scheduling priority, and we considered
its side effects a reasonable constraint. But as the discussion
progressed, we started trying to manage OOM livelocks caused by non
system-wide OOM events (e.g. memcg OOM) with the OOM reaper as well, and
the OOM reaper now has normal scheduling priority. For non system-wide
OOM events, the OOM reaper's side effects might not be a reasonable
constraint: some administrators might expect that the OOM reaper does
not break coredumping unless the system is under a system-wide OOM
event.

The race described in this patch's description reads as if both OOM
invocations were system-wide OOM events.
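The race is easier to see if the pre-patch single-slot hand-off is
written out. The following is only a paraphrased sketch built from the
names in the quoted diagram (task_to_reap, cmpxchg, oom_reaper_wait);
refcounting and the retry logic are omitted, and details may differ
from the actual pre-patch code.
----------------------------------------
/*
 * Paraphrased sketch of the racy single-slot hand-off. Not the exact
 * pre-patch code; get_task_struct()/put_task_struct() omitted.
 */
static struct task_struct *task_to_reap;

static void wake_oom_reaper(struct task_struct *tsk)
{
        /* Publish the victim only if the slot is currently empty. */
        if (cmpxchg(&task_to_reap, NULL, tsk) == NULL)
                wake_up(&oom_reaper_wait);
        /* Otherwise the new victim is silently dropped. */
}

static int oom_reaper(void *unused)
{
        while (true) {
                wait_event_freezable(oom_reaper_wait,
                                     task_to_reap != NULL);
                oom_reap_task(task_to_reap);
                /*
                 * The slot stays occupied until this point, so a second
                 * wake_oom_reaper() racing with oom_reap_task() above
                 * fails its cmpxchg and its victim is never reaped.
                 */
                task_to_reap = NULL;
        }
        return 0;
}
----------------------------------------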
If we consider only system-wide OOM events, there is no need to keep
task_to_reap != NULL after the OOM reaper has found a task to reap
(shown below), because the existing victim will prevent the OOM killer
from calling wake_oom_reaper() again.
----------------------------------------
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 012dd6f..c919ddb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1835,9 +1835,6 @@ struct task_struct {
         unsigned long task_state_change;
 #endif
         int pagefault_disabled;
-#ifdef CONFIG_MMU
-        struct list_head oom_reaper_list;
-#endif
 /* CPU-specific state of this task */
         struct thread_struct thread;
 /*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b42c6bc..fa6a302 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -422,10 +422,8 @@ bool oom_killer_disabled __read_mostly;
  * victim (if that is possible) to help the OOM killer to move on.
  */
 static struct task_struct *oom_reaper_th;
+static struct task_struct *task_to_reap;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static LIST_HEAD(oom_reaper_list);
-static DEFINE_SPINLOCK(oom_reaper_lock);
-
 
 static bool __oom_reap_task(struct task_struct *tsk)
 {
@@ -526,20 +524,11 @@ static void oom_reap_task(struct task_struct *tsk)
 static int oom_reaper(void *unused)
 {
         while (true) {
-                struct task_struct *tsk = NULL;
+                struct task_struct *tsk;
 
                 wait_event_freezable(oom_reaper_wait,
-                                     (!list_empty(&oom_reaper_list)));
-                spin_lock(&oom_reaper_lock);
-                if (!list_empty(&oom_reaper_list)) {
-                        tsk = list_first_entry(&oom_reaper_list,
-                                        struct task_struct, oom_reaper_list);
-                        list_del(&tsk->oom_reaper_list);
-                }
-                spin_unlock(&oom_reaper_lock);
-
-                if (tsk)
-                        oom_reap_task(tsk);
+                                     (tsk = xchg(&task_to_reap, NULL)));
+                oom_reap_task(tsk);
         }
 
         return 0;
@@ -551,11 +540,11 @@ static void wake_oom_reaper(struct task_struct *tsk)
                 return;
 
         get_task_struct(tsk);
-
-        spin_lock(&oom_reaper_lock);
-        list_add(&tsk->oom_reaper_list, &oom_reaper_list);
-        spin_unlock(&oom_reaper_lock);
-        wake_up(&oom_reaper_wait);
+        tsk = xchg(&task_to_reap, tsk);
+        if (!tsk)
+                wake_up(&oom_reaper_wait);
+        else
+                put_task_struct(tsk);
 }
 
 static int __init oom_init(void)
----------------------------------------

But if we consider non system-wide OOM events, hitting this race is not
so unlikely. This queue is useful for situations where memcg1 and memcg2
hit memcg OOM at the same time and victim1 in memcg1 cannot terminate
immediately. If we implement this queue, I expect parallel reaping
(shown below), because there is no need to serialize victim tasks (e.g.
wait for reaping of victim1 in memcg1, which can take up to 1 second to
complete, before starting to reap victim2 in memcg2).
----------------------------------------
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b42c6bc..c2d6472 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -427,7 +427,7 @@ static LIST_HEAD(oom_reaper_list);
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
 
-static bool __oom_reap_task(struct task_struct *tsk)
+static bool oom_reap_task(struct task_struct *tsk)
 {
         struct mmu_gather tlb;
         struct vm_area_struct *vma;
@@ -504,42 +504,42 @@ out:
         return ret;
 }
 
-#define MAX_OOM_REAP_RETRIES 10
-static void oom_reap_task(struct task_struct *tsk)
-{
-        int attempts = 0;
-
-        /* Retry the down_read_trylock(mmap_sem) a few times */
-        while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
-                schedule_timeout_idle(HZ/10);
-
-        if (attempts > MAX_OOM_REAP_RETRIES) {
-                pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-                                task_pid_nr(tsk), tsk->comm);
-                debug_show_all_locks();
-        }
-
-        /* Drop a reference taken by wake_oom_reaper */
-        put_task_struct(tsk);
-}
-
 static int oom_reaper(void *unused)
 {
         while (true) {
-                struct task_struct *tsk = NULL;
-
+                struct task_struct *tsk;
+                struct task_struct *t;
+                LIST_HEAD(list);
+                int i;
+
                 wait_event_freezable(oom_reaper_wait,
                                      (!list_empty(&oom_reaper_list)));
                 spin_lock(&oom_reaper_lock);
-                if (!list_empty(&oom_reaper_list)) {
-                        tsk = list_first_entry(&oom_reaper_list,
-                                        struct task_struct, oom_reaper_list);
-                        list_del(&tsk->oom_reaper_list);
-                }
+                list_splice(&oom_reaper_list, &list);
+                INIT_LIST_HEAD(&oom_reaper_list);
                 spin_unlock(&oom_reaper_lock);
-
-                if (tsk)
-                        oom_reap_task(tsk);
+                /* Retry the down_read_trylock(mmap_sem) a few times */
+                for (i = 0; i < 10; i++) {
+                        list_for_each_entry_safe(tsk, t, &list,
+                                                 oom_reaper_list) {
+                                if (!oom_reap_task(tsk))
+                                        continue;
+                                list_del(&tsk->oom_reaper_list);
+                                /* Drop a reference taken by wake_oom_reaper */
+                                put_task_struct(tsk);
+                        }
+                        if (list_empty(&list))
+                                break;
+                        schedule_timeout_idle(HZ/10);
+                }
+                if (list_empty(&list))
+                        continue;
+                list_for_each_entry(tsk, &list, oom_reaper_list) {
+                        pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+                                task_pid_nr(tsk), tsk->comm);
+                        put_task_struct(tsk);
+                }
+                debug_show_all_locks();
         }
 
         return 0;
----------------------------------------
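Note the design choice in the loop above: a victim whose mmap_sem cannot
be taken no longer delays other queued victims. Each of the 10 retry
passes (separated by schedule_timeout_idle(HZ/10), i.e. up to about 1
second in total) walks every pending victim, so victim2 in memcg2 is
attempted within the same pass as victim1 in memcg1 rather than after
victim1's retries are exhausted.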