* [PATCH] sched,numa: limit amount of virtual memory scanned in task_numa_work
@ 2015-09-11 13:00 Rik van Riel
  2015-09-11 15:05 ` Mel Gorman
  2015-09-18  8:48 ` [tip:sched/core] sched/numa: Limit the amount of virtual memory scanned in task_numa_work() tip-bot for Rik van Riel
  0 siblings, 2 replies; 5+ messages in thread
From: Rik van Riel @ 2015-09-11 13:00 UTC
  To: linux-kernel; +Cc: peterz, mingo, mgorman, Andrea Arcangeli, Jan Stancek

Currently task_numa_work scans up to numa_balancing_scan_size_mb worth
of memory per invocation, but only counts memory areas that have at
least one PTE that is still present and not marked for numa hint faulting.

It will skip over arbitrarily large amounts of memory that are either
unused, full of swap ptes, or full of PTEs that were already marked
for NUMA hint faults but have not been faulted on yet.

This can cause excessive amounts of CPU use, due to there being
essentially no upper limit on the scan rate of very large processes
that are not yet in a phase where they are actively accessing old
memory pages (e.g. they are still initializing their data).

Avoid that problem by placing an upper limit on the amount of virtual
memory that task_numa_work scans in each invocation. This can be a
higher limit than "pages", to ensure the task still skips over unused
areas fairly quickly.
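
In rough C-like pseudocode the resulting loop works like this (a
simplified sketch of the idea only, not the exact kernel code; the
chunk iteration helper below is pseudocode):

	pages = sysctl_numa_balancing_scan_size;
	pages <<= 20 - PAGE_SHIFT;	/* budget of populated memory to mark */
	virtpages = pages * 8;		/* cap on virtual space walked */

	for_each_scan_chunk(start, end) {	/* pseudocode helper */
		nr_pte_updates = change_prot_numa(vma, start, end);
		if (nr_pte_updates)		/* chunk had usable PTEs */
			pages -= (end - start) >> PAGE_SHIFT;
		virtpages -= (end - start) >> PAGE_SHIFT; /* always charged */
		if (pages <= 0 || virtpages <= 0)
			break;			/* stop this invocation */
	}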

While we are here, also fix the "nr_pte_updates" logic, so it only
counts page ranges with ptes in them.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
---
 kernel/sched/fair.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e2e3483b1ec..ff51b559ccaf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2157,7 +2157,7 @@ void task_numa_work(struct callback_head *work)
 	struct vm_area_struct *vma;
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
-	long pages;
+	long pages, virtpages;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
 
@@ -2203,9 +2203,11 @@ void task_numa_work(struct callback_head *work)
 	start = mm->numa_scan_offset;
 	pages = sysctl_numa_balancing_scan_size;
 	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+	virtpages = pages * 8;	   /* Scan up to this much virtual space */
 	if (!pages)
 		return;
 
+
 	down_read(&mm->mmap_sem);
 	vma = find_vma(mm, start);
 	if (!vma) {
@@ -2240,18 +2242,22 @@ void task_numa_work(struct callback_head *work)
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
 			end = min(end, vma->vm_end);
-			nr_pte_updates += change_prot_numa(vma, start, end);
+			nr_pte_updates = change_prot_numa(vma, start, end);
 
 			/*
-			 * Scan sysctl_numa_balancing_scan_size but ensure that
-			 * at least one PTE is updated so that unused virtual
-			 * address space is quickly skipped.
+			 * Try to scan sysctl_numa_balancing_size worth of
+			 * hpages that have at least one present PTE that
+			 * is not already pte-numa. If the VMA contains
+			 * areas that are unused or already full of prot_numa
+			 * PTEs, scan up to virtpages, to skip through those
+			 * areas faster.
 			 */
 			if (nr_pte_updates)
 				pages -= (end - start) >> PAGE_SHIFT;
+			virtpages -= (end - start) >> PAGE_SHIFT;
 
 			start = end;
-			if (pages <= 0)
+			if (pages <= 0 || virtpages <= 0)
 				goto out;
 
 			cond_resched();


* Re: [PATCH] sched,numa: limit amount of virtual memory scanned in task_numa_work
  2015-09-11 13:00 [PATCH] sched,numa: limit amount of virtual memory scanned in task_numa_work Rik van Riel
@ 2015-09-11 15:05 ` Mel Gorman
  2015-09-11 15:57   ` Rik van Riel
  2015-09-18  8:48 ` [tip:sched/core] sched/numa: Limit the amount of virtual memory scanned in task_numa_work() tip-bot for Rik van Riel
  1 sibling, 1 reply; 5+ messages in thread
From: Mel Gorman @ 2015-09-11 15:05 UTC
  To: Rik van Riel; +Cc: linux-kernel, peterz, mingo, Andrea Arcangeli, Jan Stancek

On Fri, Sep 11, 2015 at 09:00:27AM -0400, Rik van Riel wrote:
> Currently task_numa_work scans up to numa_balancing_scan_size_mb worth
> of memory per invocation, but only counts memory areas that have at
> least one PTE that is still present and not marked for numa hint faulting.
> 
> It will skip over arbitrarily large amounts of memory that are either
> unused, full of swap ptes, or full of PTEs that were already marked
> for NUMA hint faults but have not been faulted on yet.
> 

This was deliberate and intended to cover a case whereby a process sparsely
using the address space would quickly skip over the sparse portions and
reach the active portions. Obviously you've found that this is not always
a great idea.

> This can cause excessive amounts of CPU use, due to there being
> essentially no upper limit on the scan rate of very large processes
> that are not yet in a phase where they are actively accessing old
> memory pages (e.g. they are still initializing their data).
> 
> Avoid that problem by placing an upper limit on the amount of virtual
> memory that task_numa_work scans in each invocation. This can be a
> higher limit than "pages", to ensure the task still skips over unused
> areas fairly quickly.
> 
> While we are here, also fix the "nr_pte_updates" logic, so it only
> counts page ranges with ptes in them.
> 
> Signed-off-by: Rik van Riel <riel@redhat.com>
> Reported-by: Andrea Arcangeli <aarcange@redhat.com>
> Reported-by: Jan Stancek <jstancek@redhat.com>
> ---
>  kernel/sched/fair.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e2e3483b1ec..ff51b559ccaf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2157,7 +2157,7 @@ void task_numa_work(struct callback_head *work)
>  	struct vm_area_struct *vma;
>  	unsigned long start, end;
>  	unsigned long nr_pte_updates = 0;
> -	long pages;
> +	long pages, virtpages;
>  
>  	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
>  
> @@ -2203,9 +2203,11 @@ void task_numa_work(struct callback_head *work)
>  	start = mm->numa_scan_offset;
>  	pages = sysctl_numa_balancing_scan_size;
>  	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
> +	virtpages = pages * 8;	   /* Scan up to this much virtual space */
>  	if (!pages)
>  		return;
>  
> +
>  	down_read(&mm->mmap_sem);
>  	vma = find_vma(mm, start);
>  	if (!vma) {


> @@ -2240,18 +2242,22 @@ void task_numa_work(struct callback_head *work)
>  			start = max(start, vma->vm_start);
>  			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
>  			end = min(end, vma->vm_end);
> -			nr_pte_updates += change_prot_numa(vma, start, end);
> +			nr_pte_updates = change_prot_numa(vma, start, end);
>  

Are you *sure* about this particular change?

The intent is that sparse space be skipped until the first updated PTE
is found and then scan sysctl_numa_balancing_scan_size pages after that.
With this change, if we find a single PTE in the middle of a sparse space
then we stop updating pages in the nr_pte_updates check below. You get
protected from a lot of scanning by the virtpages check but it does not
seem this fix is necessary.  It has an odd side-effect whereby we possibly
scan more with this patch in some cases.
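
For instance (rough numbers, assuming the default 256MB scan size and
the 8x virtpages cap): take a VMA with one populated hpage followed by
several GB of empty space. Previously the accumulated nr_pte_updates
kept decrementing "pages" for every later chunk, so the walk stopped
roughly 256MB past that hpage. With the per-chunk assignment the empty
chunks only charge virtpages, so the walk can continue for up to 2GB of
address space in that invocation.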

>  			/*
> -			 * Scan sysctl_numa_balancing_scan_size but ensure that
> -			 * at least one PTE is updated so that unused virtual
> -			 * address space is quickly skipped.
> +			 * Try to scan sysctl_numa_balancing_size worth of
> +			 * hpages that have at least one present PTE that
> +			 * is not already pte-numa. If the VMA contains
> +			 * areas that are unused or already full of prot_numa
> +			 * PTEs, scan up to virtpages, to skip through those
> +			 * areas faster.
>  			 */
>  			if (nr_pte_updates)
>  				pages -= (end - start) >> PAGE_SHIFT;
> +			virtpages -= (end - start) >> PAGE_SHIFT;
>  

It's a pity there will potentially be a lot of useless dead scanning on
those processes but caching start addresses is both outside the scope of
this patch and has its own problems.

>  			start = end;
> -			if (pages <= 0)
> +			if (pages <= 0 || virtpages <= 0)
>  				goto out;
>  
>  			cond_resched();

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched,numa: limit amount of virtual memory scanned in task_numa_work
  2015-09-11 15:05 ` Mel Gorman
@ 2015-09-11 15:57   ` Rik van Riel
  2015-09-11 16:16     ` Mel Gorman
  0 siblings, 1 reply; 5+ messages in thread
From: Rik van Riel @ 2015-09-11 15:57 UTC
  To: Mel Gorman; +Cc: linux-kernel, peterz, mingo, Andrea Arcangeli, Jan Stancek

On 09/11/2015 11:05 AM, Mel Gorman wrote:
> On Fri, Sep 11, 2015 at 09:00:27AM -0400, Rik van Riel wrote:
>> Currently task_numa_work scans up to numa_balancing_scan_size_mb worth
>> of memory per invocation, but only counts memory areas that have at
>> least one PTE that is still present and not marked for numa hint faulting.
>>
>> It will skip over arbitrarily large amounts of memory that are either
>> unused, full of swap ptes, or full of PTEs that were already marked
>> for NUMA hint faults but have not been faulted on yet.
>>
> 
> This was deliberate and intended to cover a case whereby a process sparsely
> using the address space would quickly skip over the sparse portions and
> reach the active portions. Obviously you've found that this is not always
> a great idea.

Skipping over non-present pages is fine, since the scan
rate is keyed off the RSS.

However, skipping over pages that are already marked
PROT_NONE / PTE_NUMA results in unmapping pages at a much
accelerated rate (sometimes using >90% of the CPU of the
task), because the pages that are already PROT_NONE / NUMA
_are_ counted as part of the RSS.
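
As a rough worked example (made-up numbers, assuming the default 256MB
scan size): a task with 100GB resident that is already almost entirely
PROT_NONE gets nearly no nr_pte_updates per chunk, so "pages" is rarely
decremented and a single task_numa_work() invocation can end up walking
page tables for most of those 100GB. With virtpages capped at
8 * 256MB = 2GB, each invocation gives up after at most 2GB of virtual
address space instead.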

>> @@ -2240,18 +2242,22 @@ void task_numa_work(struct callback_head *work)
>>  			start = max(start, vma->vm_start);
>>  			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
>>  			end = min(end, vma->vm_end);
>> -			nr_pte_updates += change_prot_numa(vma, start, end);
>> +			nr_pte_updates = change_prot_numa(vma, start, end);
>>  
> 
> Are you *sure* about this particular change?
> 
> The intent is that sparse space be skipped until the first updated PTE
> is found and then scan sysctl_numa_balancing_scan_size pages after that.
> With this change, if we find a single PTE in the middle of a sparse space
> then we stop updating pages in the nr_pte_updates check below. You get
> protected from a lot of scanning by the virtpages check but it does not
> seem this fix is necessary.  It has an odd side-effect whereby we possibly
> scan more with this patch in some cases.

True, it is possible that this patch would lead to more scanning
than before, if a process has present PTEs interleaved with areas
that are either sparsely populated, or already marked PROT_NONE.

However, was your intention to not quickly skip over empty areas
that come right after one single present PTE, but only over empty
areas at the beginning of a scan area?

If so, I don't understand the logic behind that, and would like
to know more :)

>>  			/*
>> -			 * Scan sysctl_numa_balancing_scan_size but ensure that
>> -			 * at least one PTE is updated so that unused virtual
>> -			 * address space is quickly skipped.
>> +			 * Try to scan sysctl_numa_balancing_size worth of
>> +			 * hpages that have at least one present PTE that
>> +			 * is not already pte-numa. If the VMA contains
>> +			 * areas that are unused or already full of prot_numa
>> +			 * PTEs, scan up to virtpages, to skip through those
>> +			 * areas faster.
>>  			 */
>>  			if (nr_pte_updates)
>>  				pages -= (end - start) >> PAGE_SHIFT;
>> +			virtpages -= (end - start) >> PAGE_SHIFT;
>>  
> 
> It's a pity there will potentially be a lot of useless dead scanning on
> those processes but caching start addresses is both outside the scope of
> this patch and has its own problems.

The problem has been observed when processes already have a lot of
pages marked PROT_NONE by change_prot_numa(), with change_prot_numa()
returning zero because no PTEs were changed.

In that case, the amount of useless dead scanning should be a whole
lot less with this patch than without.

I do not quite understand how this patch makes it worse, though.

-- 
All rights reversed


* Re: [PATCH] sched,numa: limit amount of virtual memory scanned in task_numa_work
  2015-09-11 15:57   ` Rik van Riel
@ 2015-09-11 16:16     ` Mel Gorman
  0 siblings, 0 replies; 5+ messages in thread
From: Mel Gorman @ 2015-09-11 16:16 UTC
  To: Rik van Riel; +Cc: linux-kernel, peterz, mingo, Andrea Arcangeli, Jan Stancek

On Fri, Sep 11, 2015 at 11:57:31AM -0400, Rik van Riel wrote:
> On 09/11/2015 11:05 AM, Mel Gorman wrote:
> > On Fri, Sep 11, 2015 at 09:00:27AM -0400, Rik van Riel wrote:
> >> Currently task_numa_work scans up to numa_balancing_scan_size_mb worth
> >> of memory per invocation, but only counts memory areas that have at
> >> least one PTE that is still present and not marked for numa hint faulting.
> >>
> >> It will skip over arbitrarily large amounts of memory that are either
> >> unused, full of swap ptes, or full of PTEs that were already marked
> >> for NUMA hint faults but have not been faulted on yet.
> >>
> > 
> > This was deliberate and intended to cover a case whereby a process sparsely
> > using the address space would quickly skip over the sparse portions and
> > reach the active portions. Obviously you've found that this is not always
> > a great idea.
> 
> Skipping over non-present pages is fine, since the scan
> rate is keyed off the RSS.
> 
> However, skipping over pages that are already marked
> PROT_NONE / PTE_NUMA results in unmapping pages at a much
> accelerated rate (sometimes using >90% of the CPU of the
> task), because the pages that are already PROT_NONE / NUMA
> _are_ counted as part of the RSS.
> 

True.

> >> @@ -2240,18 +2242,22 @@ void task_numa_work(struct callback_head *work)
> >>  			start = max(start, vma->vm_start);
> >>  			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
> >>  			end = min(end, vma->vm_end);
> >> -			nr_pte_updates += change_prot_numa(vma, start, end);
> >> +			nr_pte_updates = change_prot_numa(vma, start, end);
> >>  
> > 
> > Are you *sure* about this particular change?
> > 
> > The intent is that sparse space be skipped until the first updated PTE
> > is found and then scan sysctl_numa_balancing_scan_size pages after that.
> > With this change, if we find a single PTE in the middle of a sparse space
> > then we stop updating pages in the nr_pte_updates check below. You get
> > protected from a lot of scanning by the virtpages check but it does not
> > seem this fix is necessary.  It has an odd side-effect whereby we possibly
> > scan more with this patch in some cases.
> 
> True, it is possible that this patch would lead to more scanning
> than before, if a process has present PTEs interleaved with areas
> that are either sparsely populated, or already marked PROT_NONE.
> 
> However, was your intention to not quickly skip over empty areas
> that come right after one single present PTE, but only over empty
> areas at the beginning of a scan area?
> 

The intent was to skip over inactive areas which are potentially marked
PROT_NONE but not being accessed.

Just because it was the intent does not mean it was the best idea
though. I can easily see how the accelerated scan rate would occur and
why it needs to be mitigated. I just wanted to be 100% sure I understand
what you were thinking and what problem you encountered.

Acked-by: Mel Gorman <mgorman@suse.de>

Thanks.

-- 
Mel Gorman
SUSE Labs


* [tip:sched/core] sched/numa: Limit the amount of virtual memory scanned in task_numa_work()
  2015-09-11 13:00 [PATCH] sched,numa: limit amount of virtual memory scanned in task_numa_work Rik van Riel
  2015-09-11 15:05 ` Mel Gorman
@ 2015-09-18  8:48 ` tip-bot for Rik van Riel
  1 sibling, 0 replies; 5+ messages in thread
From: tip-bot for Rik van Riel @ 2015-09-18  8:48 UTC
  To: linux-tip-commits
  Cc: hpa, tglx, peterz, mingo, linux-kernel, aarcange, jstancek,
	torvalds, mgorman, riel

Commit-ID:  4620f8c1fda2af4ccbd11e194e2dd785f7d7f279
Gitweb:     http://git.kernel.org/tip/4620f8c1fda2af4ccbd11e194e2dd785f7d7f279
Author:     Rik van Riel <riel@redhat.com>
AuthorDate: Fri, 11 Sep 2015 09:00:27 -0400
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 18 Sep 2015 09:23:14 +0200

sched/numa: Limit the amount of virtual memory scanned in task_numa_work()

Currently task_numa_work() scans up to numa_balancing_scan_size_mb worth
of memory per invocation, but only counts memory areas that have at
least one PTE that is still present and not marked for numa hint faulting.

It will skip over arbitrarily large amounts of memory that are either
unused, full of swap ptes, or full of PTEs that were already marked
for NUMA hint faults but have not been faulted on yet.

This can cause excessive amounts of CPU use, due to there being
essentially no upper limit on the scan rate of very large processes
that are not yet in a phase where they are actively accessing old
memory pages (e.g. they are still initializing their data).

Avoid that problem by placing an upper limit on the amount of virtual
memory that task_numa_work() scans in each invocation. This can be a
higher limit than "pages", to ensure the task still skips over unused
areas fairly quickly.

While we are here, also fix the "nr_pte_updates" logic, so it only
counts page ranges with ptes in them.

Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150911090027.4a7987bd@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9176f7c..1bfad9f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2157,7 +2157,7 @@ void task_numa_work(struct callback_head *work)
 	struct vm_area_struct *vma;
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
-	long pages;
+	long pages, virtpages;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
 
@@ -2203,9 +2203,11 @@ void task_numa_work(struct callback_head *work)
 	start = mm->numa_scan_offset;
 	pages = sysctl_numa_balancing_scan_size;
 	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+	virtpages = pages * 8;	   /* Scan up to this much virtual space */
 	if (!pages)
 		return;
 
+
 	down_read(&mm->mmap_sem);
 	vma = find_vma(mm, start);
 	if (!vma) {
@@ -2240,18 +2242,22 @@ void task_numa_work(struct callback_head *work)
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
 			end = min(end, vma->vm_end);
-			nr_pte_updates += change_prot_numa(vma, start, end);
+			nr_pte_updates = change_prot_numa(vma, start, end);
 
 			/*
-			 * Scan sysctl_numa_balancing_scan_size but ensure that
-			 * at least one PTE is updated so that unused virtual
-			 * address space is quickly skipped.
+			 * Try to scan sysctl_numa_balancing_size worth of
+			 * hpages that have at least one present PTE that
+			 * is not already pte-numa. If the VMA contains
+			 * areas that are unused or already full of prot_numa
+			 * PTEs, scan up to virtpages, to skip through those
+			 * areas faster.
 			 */
 			if (nr_pte_updates)
 				pages -= (end - start) >> PAGE_SHIFT;
+			virtpages -= (end - start) >> PAGE_SHIFT;
 
 			start = end;
-			if (pages <= 0)
+			if (pages <= 0 || virtpages <= 0)
 				goto out;
 
 			cond_resched();

