+ bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch

All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed

* + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch
@ 2024-04-26 19:08 Andrew Morton
  2024-04-26 19:48 ` Yury Norov
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2024-04-26 19:08 UTC (permalink / raw)
  To: mm-commits, yury.norov, jserv, visitorckw, akpm


The patch titled
     Subject: bitops: optimize fns() for improved performance
has been added to the -mm mm-nonmm-unstable branch.  Its filename is
     bitops-optimize-fns-for-improved-performance.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/bitops-optimize-fns-for-improved-performance.patch

This patch will later appear in the mm-nonmm-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: Kuan-Wei Chiu <visitorckw@gmail.com>
Subject: bitops: optimize fns() for improved performance
Date: Fri, 26 Apr 2024 11:51:52 +0800

The current fns() repeatedly uses __ffs() to find the index of the least
significant bit and then clears the corresponding bit using __clear_bit().
The method for clearing the least significant bit can be optimized by
using word &= word - 1 instead.

Typically, the execution time of one __ffs() plus one __clear_bit() is
longer than that of a bitwise AND operation and a subtraction.  To improve
performance, the loop for clearing the least significant bit has been
replaced with word &= word - 1, followed by a single __ffs() operation to
obtain the answer.  This change reduces the number of __ffs() iterations
from n to just one, enhancing overall performance.

The following microbenchmark data, conducted on my x86-64 machine, shows
the execution time (in microseconds) required for 1000000 test data
generated by get_random_u64() and executed by fns() under different values
of n:

+-----+---------------+---------------+
|  n  |   time_old    |   time_new    |
+-----+---------------+---------------+
|  0  |     29194     |     25878     |
|  1  |     25510     |     25497     |
|  2  |     27836     |     25721     |
|  3  |     30140     |     25673     |
|  4  |     32569     |     25426     |
|  5  |     34792     |     25690     |
|  6  |     37117     |     25651     |
|  7  |     39742     |     25383     |
|  8  |     42360     |     25657     |
|  9  |     44672     |     25897     |
| 10  |     47237     |     25819     |
| 11  |     49884     |     26530     |
| 12  |     51864     |     26647     |
| 13  |     54265     |     28915     |
| 14  |     56440     |     28373     |
| 15  |     58839     |     28616     |
| 16  |     62383     |     29128     |
| 17  |     64257     |     30041     |
| 18  |     66805     |     29773     |
| 19  |     69368     |     33203     |
| 20  |     72942     |     33688     |
| 21  |     77006     |     34518     |
| 22  |     80926     |     34298     |
| 23  |     85723     |     35586     |
| 24  |     90324     |     36376     |
| 25  |     95992     |     37465     |
| 26  |    101101     |     37599     |
| 27  |    106520     |     37466     |
| 28  |    113287     |     38163     |
| 29  |    120552     |     38810     |
| 30  |    128040     |     39373     |
| 31  |    135624     |     40500     |
| 32  |    142580     |     40343     |
| 33  |    148915     |     40460     |
| 34  |    154005     |     41294     |
| 35  |    157996     |     41730     |
| 36  |    160806     |     41523     |
| 37  |    162975     |     42088     |
| 38  |    163426     |     41530     |
| 39  |    164872     |     41789     |
| 40  |    164477     |     42505     |
| 41  |    164758     |     41879     |
| 42  |    164182     |     41415     |
| 43  |    164842     |     42119     |
| 44  |    164881     |     42297     |
| 45  |    164870     |     42145     |
| 46  |    164673     |     42066     |
| 47  |    164616     |     42051     |
| 48  |    165055     |     41902     |
| 49  |    164847     |     41862     |
| 50  |    165171     |     41960     |
| 51  |    164851     |     42089     |
| 52  |    164763     |     41717     |
| 53  |    164635     |     42154     |
| 54  |    164757     |     41983     |
| 55  |    165095     |     41419     |
| 56  |    164641     |     42381     |
| 57  |    164601     |     41654     |
| 58  |    164864     |     41834     |
| 59  |    164594     |     41920     |
| 60  |    165207     |     42020     |
| 61  |    165056     |     41185     |
| 62  |    165160     |     41722     |
| 63  |    164923     |     41702     |
| 64  |    164777     |     41880     |
+-----+---------------+---------------+

Link: https://lkml.kernel.org/r/20240426035152.956702-1-visitorckw@gmail.com
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/bitops.h |   12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

--- a/include/linux/bitops.h~bitops-optimize-fns-for-improved-performance
+++ a/include/linux/bitops.h
@@ -254,16 +254,12 @@ static inline unsigned long __ffs64(u64
  */
 static inline unsigned long fns(unsigned long word, unsigned int n)
 {
-	unsigned int bit;
+	unsigned int i;
 
-	while (word) {
-		bit = __ffs(word);
-		if (n-- == 0)
-			return bit;
-		__clear_bit(bit, &word);
-	}
+	for (i = 0; word && i < n; i++)
+		word &= word - 1;
 
-	return BITS_PER_LONG;
+	return word ? __ffs(word) : BITS_PER_LONG;
 }
 
 /**
_

Patches currently in -mm which might be from visitorckw@gmail.com are

bitops-optimize-fns-for-improved-performance.patch


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch
  2024-04-26 19:08 + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch Andrew Morton
@ 2024-04-26 19:48 ` Yury Norov
  2024-04-27  5:33   ` Kuan-Wei Chiu
  0 siblings, 1 reply; 11+ messages in thread
From: Yury Norov @ 2024-04-26 19:48 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mm-commits, jserv, visitorckw

On Fri, Apr 26, 2024 at 12:08 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
>
> The patch titled
>      Subject: bitops: optimize fns() for improved performance
> has been added to the -mm mm-nonmm-unstable branch.  Its filename is
>      bitops-optimize-fns-for-improved-performance.patch
>
> This patch will shortly appear at
>      https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/bitops-optimize-fns-for-improved-performance.patch
>
> This patch will later appear in the mm-nonmm-unstable branch at
>     git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
>
> Before you just go and hit "reply", please:
>    a) Consider who else should be cc'ed
>    b) Prefer to cc a suitable mailing list as well
>    c) Ideally: find the original patch on the mailing list and do a
>       reply-to-all to that, adding suitable additional cc's
>
> *** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
>
> The -mm tree is included into linux-next via the mm-everything
> branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> and is updated there every 2-3 working days
>
> ------------------------------------------------------
> From: Kuan-Wei Chiu <visitorckw@gmail.com>
> Subject: bitops: optimize fns() for improved performance
> Date: Fri, 26 Apr 2024 11:51:52 +0800
>
> The current fns() repeatedly uses __ffs() to find the index of the least
> significant bit and then clears the corresponding bit using __clear_bit().
> The method for clearing the least significant bit can be optimized by
> using word &= word - 1 instead.
>
> Typically, the execution time of one __ffs() plus one __clear_bit() is
> longer than that of a bitwise AND operation and a subtraction.  To improve
> performance, the loop for clearing the least significant bit has been
> replaced with word &= word - 1, followed by a single __ffs() operation to
> obtain the answer.  This change reduces the number of __ffs() iterations
> from n to just one, enhancing overall performance.
>
> The following microbenchmark data, conducted on my x86-64 machine, shows
> the execution time (in microseconds) required for 1000000 test data
> generated by get_random_u64() and executed by fns() under different values
> of n:
>
> +-----+---------------+---------------+
> |  n  |   time_old    |   time_new    |
> +-----+---------------+---------------+
> |  0  |     29194     |     25878     |
> |  1  |     25510     |     25497     |
> |  2  |     27836     |     25721     |
> |  3  |     30140     |     25673     |
> |  4  |     32569     |     25426     |
> |  5  |     34792     |     25690     |
> |  6  |     37117     |     25651     |
> |  7  |     39742     |     25383     |
> |  8  |     42360     |     25657     |
> |  9  |     44672     |     25897     |
> | 10  |     47237     |     25819     |
> | 11  |     49884     |     26530     |
> | 12  |     51864     |     26647     |
> | 13  |     54265     |     28915     |
> | 14  |     56440     |     28373     |
> | 15  |     58839     |     28616     |
> | 16  |     62383     |     29128     |
> | 17  |     64257     |     30041     |
> | 18  |     66805     |     29773     |
> | 19  |     69368     |     33203     |
> | 20  |     72942     |     33688     |
> | 21  |     77006     |     34518     |
> | 22  |     80926     |     34298     |
> | 23  |     85723     |     35586     |
> | 24  |     90324     |     36376     |
> | 25  |     95992     |     37465     |
> | 26  |    101101     |     37599     |
> | 27  |    106520     |     37466     |
> | 28  |    113287     |     38163     |
> | 29  |    120552     |     38810     |
> | 30  |    128040     |     39373     |
> | 31  |    135624     |     40500     |
> | 32  |    142580     |     40343     |
> | 33  |    148915     |     40460     |
> | 34  |    154005     |     41294     |
> | 35  |    157996     |     41730     |
> | 36  |    160806     |     41523     |
> | 37  |    162975     |     42088     |
> | 38  |    163426     |     41530     |
> | 39  |    164872     |     41789     |
> | 40  |    164477     |     42505     |
> | 41  |    164758     |     41879     |
> | 42  |    164182     |     41415     |
> | 43  |    164842     |     42119     |
> | 44  |    164881     |     42297     |
> | 45  |    164870     |     42145     |
> | 46  |    164673     |     42066     |
> | 47  |    164616     |     42051     |
> | 48  |    165055     |     41902     |
> | 49  |    164847     |     41862     |
> | 50  |    165171     |     41960     |
> | 51  |    164851     |     42089     |
> | 52  |    164763     |     41717     |
> | 53  |    164635     |     42154     |
> | 54  |    164757     |     41983     |
> | 55  |    165095     |     41419     |
> | 56  |    164641     |     42381     |
> | 57  |    164601     |     41654     |
> | 58  |    164864     |     41834     |
> | 59  |    164594     |     41920     |
> | 60  |    165207     |     42020     |
> | 61  |    165056     |     41185     |
> | 62  |    165160     |     41722     |
> | 63  |    164923     |     41702     |
> | 64  |    164777     |     41880     |
> +-----+---------------+---------------+

Hi Kuan-Wei,

I didn't receive the original email for some reason...
We've got a performance test for the function in find_bit_benchmark.
Can you print before/after here?

Thanks,
Yury

> Link: https://lkml.kernel.org/r/20240426035152.956702-1-visitorckw@gmail.com
> Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
> Cc: Yury Norov <yury.norov@gmail.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
>  include/linux/bitops.h |   12 ++++--------
>  1 file changed, 4 insertions(+), 8 deletions(-)
>
> --- a/include/linux/bitops.h~bitops-optimize-fns-for-improved-performance
> +++ a/include/linux/bitops.h
> @@ -254,16 +254,12 @@ static inline unsigned long __ffs64(u64
>   */
>  static inline unsigned long fns(unsigned long word, unsigned int n)
>  {
> -       unsigned int bit;
> +       unsigned int i;
>
> -       while (word) {
> -               bit = __ffs(word);
> -               if (n-- == 0)
> -                       return bit;
> -               __clear_bit(bit, &word);
> -       }
> +       for (i = 0; word && i < n; i++)
> +               word &= word - 1;
>
> -       return BITS_PER_LONG;
> +       return word ? __ffs(word) : BITS_PER_LONG;
>  }
>
>  /**
> _
>
> Patches currently in -mm which might be from visitorckw@gmail.com are
>
> bitops-optimize-fns-for-improved-performance.patch
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch
  2024-04-26 19:48 ` Yury Norov
@ 2024-04-27  5:33   ` Kuan-Wei Chiu
  2024-04-28 16:08     ` Yury Norov
  0 siblings, 1 reply; 11+ messages in thread
From: Kuan-Wei Chiu @ 2024-04-27  5:33 UTC (permalink / raw)
  To: Yury Norov; +Cc: Andrew Morton, mm-commits, jserv, n26122115

On Fri, Apr 26, 2024 at 12:48:48PM -0700, Yury Norov wrote:
> On Fri, Apr 26, 2024 at 12:08 PM Andrew Morton
> <akpm@linux-foundation.org> wrote:
> >
> >
> > The patch titled
> >      Subject: bitops: optimize fns() for improved performance
> > has been added to the -mm mm-nonmm-unstable branch.  Its filename is
> >      bitops-optimize-fns-for-improved-performance.patch
> >
> > This patch will shortly appear at
> >      https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/bitops-optimize-fns-for-improved-performance.patch
> >
> > This patch will later appear in the mm-nonmm-unstable branch at
> >     git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> >
> > Before you just go and hit "reply", please:
> >    a) Consider who else should be cc'ed
> >    b) Prefer to cc a suitable mailing list as well
> >    c) Ideally: find the original patch on the mailing list and do a
> >       reply-to-all to that, adding suitable additional cc's
> >
> > *** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
> >
> > The -mm tree is included into linux-next via the mm-everything
> > branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> > and is updated there every 2-3 working days
> >
> > ------------------------------------------------------
> > From: Kuan-Wei Chiu <visitorckw@gmail.com>
> > Subject: bitops: optimize fns() for improved performance
> > Date: Fri, 26 Apr 2024 11:51:52 +0800
> >
> > The current fns() repeatedly uses __ffs() to find the index of the least
> > significant bit and then clears the corresponding bit using __clear_bit().
> > The method for clearing the least significant bit can be optimized by
> > using word &= word - 1 instead.
> >
> > Typically, the execution time of one __ffs() plus one __clear_bit() is
> > longer than that of a bitwise AND operation and a subtraction.  To improve
> > performance, the loop for clearing the least significant bit has been
> > replaced with word &= word - 1, followed by a single __ffs() operation to
> > obtain the answer.  This change reduces the number of __ffs() iterations
> > from n to just one, enhancing overall performance.
> >
> > The following microbenchmark data, conducted on my x86-64 machine, shows
> > the execution time (in microseconds) required for 1000000 test data
> > generated by get_random_u64() and executed by fns() under different values
> > of n:
> >
> > +-----+---------------+---------------+
> > |  n  |   time_old    |   time_new    |
> > +-----+---------------+---------------+
> > |  0  |     29194     |     25878     |
> > |  1  |     25510     |     25497     |
> > |  2  |     27836     |     25721     |
> > |  3  |     30140     |     25673     |
> > |  4  |     32569     |     25426     |
> > |  5  |     34792     |     25690     |
> > |  6  |     37117     |     25651     |
> > |  7  |     39742     |     25383     |
> > |  8  |     42360     |     25657     |
> > |  9  |     44672     |     25897     |
> > | 10  |     47237     |     25819     |
> > | 11  |     49884     |     26530     |
> > | 12  |     51864     |     26647     |
> > | 13  |     54265     |     28915     |
> > | 14  |     56440     |     28373     |
> > | 15  |     58839     |     28616     |
> > | 16  |     62383     |     29128     |
> > | 17  |     64257     |     30041     |
> > | 18  |     66805     |     29773     |
> > | 19  |     69368     |     33203     |
> > | 20  |     72942     |     33688     |
> > | 21  |     77006     |     34518     |
> > | 22  |     80926     |     34298     |
> > | 23  |     85723     |     35586     |
> > | 24  |     90324     |     36376     |
> > | 25  |     95992     |     37465     |
> > | 26  |    101101     |     37599     |
> > | 27  |    106520     |     37466     |
> > | 28  |    113287     |     38163     |
> > | 29  |    120552     |     38810     |
> > | 30  |    128040     |     39373     |
> > | 31  |    135624     |     40500     |
> > | 32  |    142580     |     40343     |
> > | 33  |    148915     |     40460     |
> > | 34  |    154005     |     41294     |
> > | 35  |    157996     |     41730     |
> > | 36  |    160806     |     41523     |
> > | 37  |    162975     |     42088     |
> > | 38  |    163426     |     41530     |
> > | 39  |    164872     |     41789     |
> > | 40  |    164477     |     42505     |
> > | 41  |    164758     |     41879     |
> > | 42  |    164182     |     41415     |
> > | 43  |    164842     |     42119     |
> > | 44  |    164881     |     42297     |
> > | 45  |    164870     |     42145     |
> > | 46  |    164673     |     42066     |
> > | 47  |    164616     |     42051     |
> > | 48  |    165055     |     41902     |
> > | 49  |    164847     |     41862     |
> > | 50  |    165171     |     41960     |
> > | 51  |    164851     |     42089     |
> > | 52  |    164763     |     41717     |
> > | 53  |    164635     |     42154     |
> > | 54  |    164757     |     41983     |
> > | 55  |    165095     |     41419     |
> > | 56  |    164641     |     42381     |
> > | 57  |    164601     |     41654     |
> > | 58  |    164864     |     41834     |
> > | 59  |    164594     |     41920     |
> > | 60  |    165207     |     42020     |
> > | 61  |    165056     |     41185     |
> > | 62  |    165160     |     41722     |
> > | 63  |    164923     |     41702     |
> > | 64  |    164777     |     41880     |
> > +-----+---------------+---------------+
> 
> Hi Kuan-Wei,
> 
> I didn't receive the original email for some reason...
> We've got a performance test for the function in find_bit_benchmark.
> Can you print before/after here?
> 
> Thanks,
> Yury
>

Hi Yury,

Here are the benchmark results:

Before:
               Start testing find_bit() with random-filled bitmap
[    0.299085] fbcon: Taking over console
[    0.299820] find_next_bit:                  606286 ns, 164169 iterations
[    0.300463] find_next_zero_bit:             641072 ns, 163512 iterations
[    0.300996] find_last_bit:                  531027 ns, 164169 iterations
[    0.305233] find_nth_bit:                  4235859 ns,  16454 iterations
[    0.306434] find_first_bit:                1199357 ns,  16455 iterations
[    0.321616] find_first_and_bit:           15179667 ns,  32869 iterations
[    0.321917] find_next_and_bit:              298836 ns,  73875 iterations
[    0.321918] 
               Start testing find_bit() with sparse bitmap
[    0.321953] find_next_bit:                    7931 ns,    656 iterations
[    0.323201] find_next_zero_bit:            1246980 ns, 327025 iterations
[    0.323210] find_last_bit:                    8000 ns,    656 iterations
[    0.324427] find_nth_bit:                  1213161 ns,    655 iterations
[    0.324813] find_first_bit:                 384747 ns,    656 iterations
[    0.324817] find_first_and_bit:               2220 ns,      1 iterations
[    0.324820] find_next_and_bit:                1831 ns,      1 iterations

After:
               Start testing find_bit() with random-filled bitmap
[    0.305081] fbcon: Taking over console
[    0.306126] find_next_bit:                  854517 ns, 163960 iterations
[    0.307041] find_next_zero_bit:             911725 ns, 163721 iterations
[    0.307711] find_last_bit:                  668261 ns, 163960 iterations
[    0.311160] find_nth_bit:                  3447530 ns,  16372 iterations
[    0.312358] find_first_bit:                1196633 ns,  16373 iterations
[    0.327191] find_first_and_bit:           14830129 ns,  32951 iterations
[    0.327503] find_next_and_bit:              310560 ns,  73719 iterations
[    0.327504] 
               Start testing find_bit() with sparse bitmap
[    0.327539] find_next_bit:                    7633 ns,    656 iterations
[    0.328787] find_next_zero_bit:            1247398 ns, 327025 iterations
[    0.328797] find_last_bit:                    8425 ns,    656 iterations
[    0.330034] find_nth_bit:                  1234044 ns,    655 iterations
[    0.330428] find_first_bit:                 392086 ns,    656 iterations
[    0.330431] find_first_and_bit:               1980 ns,      1 iterations
[    0.330434] find_next_and_bit:                1831 ns,      1 iterations

Some benchmarks seem to have worsened after applying this patch.
However, unless I'm mistaken, the fns() changes should only affect the
results of find_nth_bit, while the others are just random fluctuations.
Should I include the above benchmark data in the commit message and
send a v2 patch?

Additionally, I apologize for you not receiving the email. I received
the following "Message not delivered" email, but I'm unsure if it's
related and what caused the error:

Date: Sat, 27 Apr 2024 04:29:04 +0000 (UTC)
From: do-not-reply@sophosemail.com
To: visitorckw@gmail.com
Subject: Undelivered Mail

This is an automated message from mail service of vivek.yagnik@sophosemail.com

⚠ Message not delivered
------------------ Message details ------------------
From: visitorckw@gmail.com
To: vivek.yagnik@sophosemail.com
Sent: 2024-04-27T04:29:03.000Z
Subject: [PATCH] bitops: Optimize fns() for improved performance
Failure reason:  <vivek.yagnik@sophosemail.com>: host     sophosemail-com.mail.protection.outlook.com[52.101.144.3]
+said: 451 4.4.4     Mail received as unauthenticated, incoming to a recipient domain configured     in a hosted tenant
+which has no mail-enabled subscriptions. ATTR5     [MA1PEPF000072B2.INDPRD01.PROD.OUTLOOK.COM 2024-04-27T04:29:03.836Z
+08DC631634A0BBEB] (in reply to end of DATA command)

Regards,
Kuan-Wei

> > Link: https://lkml.kernel.org/r/20240426035152.956702-1-visitorckw@gmail.com
> > Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
> > Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
> > Cc: Yury Norov <yury.norov@gmail.com>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > ---
> >
> >  include/linux/bitops.h |   12 ++++--------
> >  1 file changed, 4 insertions(+), 8 deletions(-)
> >
> > --- a/include/linux/bitops.h~bitops-optimize-fns-for-improved-performance
> > +++ a/include/linux/bitops.h
> > @@ -254,16 +254,12 @@ static inline unsigned long __ffs64(u64
> >   */
> >  static inline unsigned long fns(unsigned long word, unsigned int n)
> >  {
> > -       unsigned int bit;
> > +       unsigned int i;
> >
> > -       while (word) {
> > -               bit = __ffs(word);
> > -               if (n-- == 0)
> > -                       return bit;
> > -               __clear_bit(bit, &word);
> > -       }
> > +       for (i = 0; word && i < n; i++)
> > +               word &= word - 1;
> >
> > -       return BITS_PER_LONG;
> > +       return word ? __ffs(word) : BITS_PER_LONG;
> >  }
> >
> >  /**
> > _
> >
> > Patches currently in -mm which might be from visitorckw@gmail.com are
> >
> > bitops-optimize-fns-for-improved-performance.patch
> >

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch
  2024-04-27  5:33   ` Kuan-Wei Chiu
@ 2024-04-28 16:08     ` Yury Norov
  2024-04-29  8:05       ` Rasmus Villemoes
  2024-04-29 15:31       ` Kuan-Wei Chiu
  0 siblings, 2 replies; 11+ messages in thread
From: Yury Norov @ 2024-04-28 16:08 UTC (permalink / raw)
  To: Kuan-Wei Chiu
  Cc: Andrew Morton, mm-commits, jserv, n26122115, Rasmus Villemoes

+ Rasmus

On Sat, Apr 27, 2024 at 01:33:55PM +0800, Kuan-Wei Chiu wrote:
> Before:
>                Start testing find_bit() with random-filled bitmap
> [    0.299085] fbcon: Taking over console
> [    0.299820] find_next_bit:                  606286 ns, 164169 iterations
> [    0.300463] find_next_zero_bit:             641072 ns, 163512 iterations
> [    0.300996] find_last_bit:                  531027 ns, 164169 iterations
> [    0.305233] find_nth_bit:                  4235859 ns,  16454 iterations
> [    0.306434] find_first_bit:                1199357 ns,  16455 iterations
> [    0.321616] find_first_and_bit:           15179667 ns,  32869 iterations
> [    0.321917] find_next_and_bit:              298836 ns,  73875 iterations
> [    0.321918] 
>                Start testing find_bit() with sparse bitmap
> [    0.321953] find_next_bit:                    7931 ns,    656 iterations
> [    0.323201] find_next_zero_bit:            1246980 ns, 327025 iterations
> [    0.323210] find_last_bit:                    8000 ns,    656 iterations
> [    0.324427] find_nth_bit:                  1213161 ns,    655 iterations
> [    0.324813] find_first_bit:                 384747 ns,    656 iterations
> [    0.324817] find_first_and_bit:               2220 ns,      1 iterations
> [    0.324820] find_next_and_bit:                1831 ns,      1 iterations
> 
> After:
>                Start testing find_bit() with random-filled bitmap
> [    0.305081] fbcon: Taking over console
> [    0.306126] find_next_bit:                  854517 ns, 163960 iterations
> [    0.307041] find_next_zero_bit:             911725 ns, 163721 iterations
> [    0.307711] find_last_bit:                  668261 ns, 163960 iterations
> [    0.311160] find_nth_bit:                  3447530 ns,  16372 iterations
> [    0.312358] find_first_bit:                1196633 ns,  16373 iterations
> [    0.327191] find_first_and_bit:           14830129 ns,  32951 iterations
> [    0.327503] find_next_and_bit:              310560 ns,  73719 iterations
> [    0.327504] 
>                Start testing find_bit() with sparse bitmap
> [    0.327539] find_next_bit:                    7633 ns,    656 iterations
> [    0.328787] find_next_zero_bit:            1247398 ns, 327025 iterations
> [    0.328797] find_last_bit:                    8425 ns,    656 iterations
> [    0.330034] find_nth_bit:                  1234044 ns,    655 iterations
> [    0.330428] find_first_bit:                 392086 ns,    656 iterations
> [    0.330431] find_first_and_bit:               1980 ns,      1 iterations
> [    0.330434] find_next_and_bit:                1831 ns,      1 iterations
> 
> Some benchmarks seem to have worsened after applying this patch.
> However, unless I'm mistaken, the fns() changes should only affect the
> results of find_nth_bit, while the others are just random fluctuations.

So...

The patch itself looks good, and indeed it should not affect
anything except find_nth_bit(). But -40% for find_next_bit()
and find_next_zero_bit(), and -25% for find_last_bit() don't
look like a fluctuation by any measure.

Looking at the numbers, I can only guess that your testing machine
isn't trustworthy, which means that +18% for find_nth_bit() is not
trustworthy as well. Maybe it's a sort of virtualized environment?

Can you share more about your hardware? Can you make sure it's a bare
metal machine and run your test compiled in the kernel? That way it
will run at boot time, at least before any possible userspace payload.
Can you also run test_bitmap? It has some functional testing for the
function?

> Should I include the above benchmark data in the commit message and
> send a v2 patch?

Yes, I'd like to do so. Because the current benchmark numbers look so
weird, can you run it again? Maybe several times before and after, to
estimate the scatter. Also please run the test_bitmap to check
functional correctness.

Since you already have a code testing the performance of fns(), can
you add it in lib/find_bit_benchmark or lib/test_bitops in a following
patch?

Thanks,
Yury
 
> Additionally, I apologize for you not receiving the email. I received
> the following "Message not delivered" email, but I'm unsure if it's
> related and what caused the error:

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch
  2024-04-28 16:08     ` Yury Norov
@ 2024-04-29  8:05       ` Rasmus Villemoes
  2024-04-29 15:40         ` Kuan-Wei Chiu
  2024-04-29 16:26         ` Yury Norov
  2024-04-29 15:31       ` Kuan-Wei Chiu
  1 sibling, 2 replies; 11+ messages in thread
From: Rasmus Villemoes @ 2024-04-29  8:05 UTC (permalink / raw)
  To: Yury Norov, Kuan-Wei Chiu; +Cc: Andrew Morton, mm-commits, jserv, n26122115

On 28/04/2024 18.08, Yury Norov wrote:
> + Rasmus
> 
> On Sat, Apr 27, 2024 at 01:33:55PM +0800, Kuan-Wei Chiu wrote:
>> Before:
>>                Start testing find_bit() with random-filled bitmap
>> [    0.299085] fbcon: Taking over console
>> [    0.299820] find_next_bit:                  606286 ns, 164169 iterations
>> [    0.300463] find_next_zero_bit:             641072 ns, 163512 iterations
>> [    0.300996] find_last_bit:                  531027 ns, 164169 iterations
>> [    0.305233] find_nth_bit:                  4235859 ns,  16454 iterations
>> [    0.306434] find_first_bit:                1199357 ns,  16455 iterations
>> [    0.321616] find_first_and_bit:           15179667 ns,  32869 iterations
>> [    0.321917] find_next_and_bit:              298836 ns,  73875 iterations
>> [    0.321918] 
>>                Start testing find_bit() with sparse bitmap
>> [    0.321953] find_next_bit:                    7931 ns,    656 iterations
>> [    0.323201] find_next_zero_bit:            1246980 ns, 327025 iterations
>> [    0.323210] find_last_bit:                    8000 ns,    656 iterations
>> [    0.324427] find_nth_bit:                  1213161 ns,    655 iterations
>> [    0.324813] find_first_bit:                 384747 ns,    656 iterations
>> [    0.324817] find_first_and_bit:               2220 ns,      1 iterations
>> [    0.324820] find_next_and_bit:                1831 ns,      1 iterations
>>
>> After:
>>                Start testing find_bit() with random-filled bitmap
>> [    0.305081] fbcon: Taking over console
>> [    0.306126] find_next_bit:                  854517 ns, 163960 iterations
>> [    0.307041] find_next_zero_bit:             911725 ns, 163721 iterations
>> [    0.307711] find_last_bit:                  668261 ns, 163960 iterations
>> [    0.311160] find_nth_bit:                  3447530 ns,  16372 iterations
>> [    0.312358] find_first_bit:                1196633 ns,  16373 iterations
>> [    0.327191] find_first_and_bit:           14830129 ns,  32951 iterations
>> [    0.327503] find_next_and_bit:              310560 ns,  73719 iterations
>> [    0.327504] 
>>                Start testing find_bit() with sparse bitmap
>> [    0.327539] find_next_bit:                    7633 ns,    656 iterations
>> [    0.328787] find_next_zero_bit:            1247398 ns, 327025 iterations
>> [    0.328797] find_last_bit:                    8425 ns,    656 iterations
>> [    0.330034] find_nth_bit:                  1234044 ns,    655 iterations
>> [    0.330428] find_first_bit:                 392086 ns,    656 iterations
>> [    0.330431] find_first_and_bit:               1980 ns,      1 iterations
>> [    0.330434] find_next_and_bit:                1831 ns,      1 iterations
>>
>> Some benchmarks seem to have worsened after applying this patch.
>> However, unless I'm mistaken, the fns() changes should only affect the
>> results of find_nth_bit, while the others are just random fluctuations.
> 
> So...
> 
> The patch itself looks good, 

Well, I think it could be even better. While I agree that bit=ffs();
clear_bit(bit, ) is probably a bad way of doing it, I think the basic
structure of the function is good. Introducing a "count from 0 up to n"
loop is rarely a good thing, keeping the n counting down to 0 is likely
better.

So I'd instead just change the function to

static inline unsigned long fns(unsigned long word, unsigned int n)
{
	while (word) {
		if (n-- == 0)
			return __ffs(word);
		word &= word - 1;
	}

	return BITS_PER_LONG;
}


Now that I look closer, I think the

 * Returns the bit number of the N'th set bit.
 * If no such, returns @size.

in the find_nth_bit and friends' docs is wrong. Tell me what happens here:


  DECLARE_BITMAP(x, 12);
  int i;

  x[0] = 3;
  i = find_nth_bit(&x, 12, 7);

So I'm asking for the seventh (counting from 0) bit set in a bitmap of
12 bits, but only two bits are set. So i should be 12? No, i will be
BITS_PER_LONG, because fns() doesn't know anything about the limit of
12. Do we really not have any tests covering that? Or indeed covering
any of the small_const_nbits optimizations?

I've said this before, and I repeat. It was a mistake for the bitmap
functions to promise a return of exactly @size when stuff is not found,
they should always have just promised to return something >= @size. The
checking in the callers would be just as easy (and some indeed do >=
instead of ==), but the implementation can be somewhat cheaper. I'm
afraid that ship has sailed, but it annoys me every time I stumble on this.

Rasmus


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch
  2024-04-28 16:08     ` Yury Norov
  2024-04-29  8:05       ` Rasmus Villemoes
@ 2024-04-29 15:31       ` Kuan-Wei Chiu
  1 sibling, 0 replies; 11+ messages in thread
From: Kuan-Wei Chiu @ 2024-04-29 15:31 UTC (permalink / raw)
  To: Yury Norov; +Cc: Andrew Morton, mm-commits, jserv, n26122115, Rasmus Villemoes

On Sun, Apr 28, 2024 at 09:08:26AM -0700, Yury Norov wrote:
> + Rasmus
> 
> On Sat, Apr 27, 2024 at 01:33:55PM +0800, Kuan-Wei Chiu wrote:
> > Before:
> >                Start testing find_bit() with random-filled bitmap
> > [    0.299085] fbcon: Taking over console
> > [    0.299820] find_next_bit:                  606286 ns, 164169 iterations
> > [    0.300463] find_next_zero_bit:             641072 ns, 163512 iterations
> > [    0.300996] find_last_bit:                  531027 ns, 164169 iterations
> > [    0.305233] find_nth_bit:                  4235859 ns,  16454 iterations
> > [    0.306434] find_first_bit:                1199357 ns,  16455 iterations
> > [    0.321616] find_first_and_bit:           15179667 ns,  32869 iterations
> > [    0.321917] find_next_and_bit:              298836 ns,  73875 iterations
> > [    0.321918] 
> >                Start testing find_bit() with sparse bitmap
> > [    0.321953] find_next_bit:                    7931 ns,    656 iterations
> > [    0.323201] find_next_zero_bit:            1246980 ns, 327025 iterations
> > [    0.323210] find_last_bit:                    8000 ns,    656 iterations
> > [    0.324427] find_nth_bit:                  1213161 ns,    655 iterations
> > [    0.324813] find_first_bit:                 384747 ns,    656 iterations
> > [    0.324817] find_first_and_bit:               2220 ns,      1 iterations
> > [    0.324820] find_next_and_bit:                1831 ns,      1 iterations
> > 
> > After:
> >                Start testing find_bit() with random-filled bitmap
> > [    0.305081] fbcon: Taking over console
> > [    0.306126] find_next_bit:                  854517 ns, 163960 iterations
> > [    0.307041] find_next_zero_bit:             911725 ns, 163721 iterations
> > [    0.307711] find_last_bit:                  668261 ns, 163960 iterations
> > [    0.311160] find_nth_bit:                  3447530 ns,  16372 iterations
> > [    0.312358] find_first_bit:                1196633 ns,  16373 iterations
> > [    0.327191] find_first_and_bit:           14830129 ns,  32951 iterations
> > [    0.327503] find_next_and_bit:              310560 ns,  73719 iterations
> > [    0.327504] 
> >                Start testing find_bit() with sparse bitmap
> > [    0.327539] find_next_bit:                    7633 ns,    656 iterations
> > [    0.328787] find_next_zero_bit:            1247398 ns, 327025 iterations
> > [    0.328797] find_last_bit:                    8425 ns,    656 iterations
> > [    0.330034] find_nth_bit:                  1234044 ns,    655 iterations
> > [    0.330428] find_first_bit:                 392086 ns,    656 iterations
> > [    0.330431] find_first_and_bit:               1980 ns,      1 iterations
> > [    0.330434] find_next_and_bit:                1831 ns,      1 iterations
> > 
> > Some benchmarks seem to have worsened after applying this patch.
> > However, unless I'm mistaken, the fns() changes should only affect the
> > results of find_nth_bit, while the others are just random fluctuations.
> 
> So...
> 
> The patch itself looks good, and indeed it should not affect
> anything except find_nth_bit(). But -40% for find_next_bit()
> and find_next_zero_bit(), and -25% for find_last_bit() don't
> look like a fluctuation by any measure.
> 
> Looking at the numbers, I can only guess that your testing machine
> isn't trustworthy, which means that +18% for find_nth_bit() is not
> trustworthy as well. Maybe it's a sort of virtualized environment?
>
I've retested the patch against Linux v6.9-rc6, running three
iterations before and after applying the patch. The data now appears to
be much more consistent and shows a clear efficiency improvement in the
benchmark for find_nth_bit.

I believe there may have been an error in my previous testing that I
didn't catch. Here are the results from the three runs before and after
applying the patch:

Before:

# run1:
[    0.295555] 
               Start testing find_bit() with random-filled bitmap
[    0.295557] fbcon: Taking over console
[    0.296287] find_next_bit:                  602426 ns, 163845 iterations
[    0.296933] find_next_zero_bit:             643572 ns, 163836 iterations
[    0.297456] find_last_bit:                  521223 ns, 163844 iterations
[    0.301680] find_nth_bit:                  4223086 ns,  16358 iterations
[    0.302865] find_first_bit:                1183479 ns,  16359 iterations
[    0.318173] find_first_and_bit:           15304868 ns,  32955 iterations
[    0.318473] find_next_and_bit:              297802 ns,  74031 iterations
[    0.318474] 
               Start testing find_bit() with sparse bitmap
[    0.318508] find_next_bit:                    7892 ns,    653 iterations
[    0.319758] find_next_zero_bit:            1248034 ns, 327028 iterations
[    0.319767] find_last_bit:                    7818 ns,    653 iterations
[    0.320995] find_nth_bit:                  1224578 ns,    652 iterations
[    0.321383] find_first_bit:                 387053 ns,    653 iterations
[    0.321387] find_first_and_bit:               2005 ns,      1 iterations
[    0.321390] find_next_and_bit:                1833 ns,      1 iterations
[    0.321391] test_bitmap: loaded.
[    0.321454] test_bitmap: parselist: 14: input is '0-2047:128/256' OK, Time: 189
[    0.321465] test_bitmap: bitmap_print_to_pagebuf: input is '0-32767
               ', Time: 564
[    0.322496] test_bitmap: all 6564 tests passed

# run2:
[    0.295442] 
               Start testing find_bit() with random-filled bitmap
[    0.295445] fbcon: Taking over console
[    0.296182] find_next_bit:                  603891 ns, 164170 iterations
[    0.296825] find_next_zero_bit:             642025 ns, 163511 iterations
[    0.297357] find_last_bit:                  529755 ns, 164170 iterations
[    0.301671] find_nth_bit:                  4312387 ns,  16590 iterations
[    0.302876] find_first_bit:                1203730 ns,  16591 iterations
[    0.317946] find_first_and_bit:           15067894 ns,  32773 iterations
[    0.318247] find_next_and_bit:              297303 ns,  73781 iterations
[    0.318248] 
               Start testing find_bit() with sparse bitmap
[    0.318283] find_next_bit:                    8193 ns,    654 iterations
[    0.319532] find_next_zero_bit:            1247954 ns, 327027 iterations
[    0.319541] find_last_bit:                    7530 ns,    654 iterations
[    0.320750] find_nth_bit:                  1205862 ns,    653 iterations
[    0.321134] find_first_bit:                 381843 ns,    654 iterations
[    0.321138] find_first_and_bit:               2087 ns,      1 iterations
[    0.321141] find_next_and_bit:                1843 ns,      1 iterations
[    0.321142] test_bitmap: loaded.
[    0.321207] test_bitmap: parselist: 14: input is '0-2047:128/256' OK, Time: 162
[    0.321225] test_bitmap: bitmap_print_to_pagebuf: input is '0-32767
               ', Time: 1079
[    0.322286] test_bitmap: all 6564 tests passed

# run3:
[    0.295315] 
               Start testing find_bit() with random-filled bitmap
[    0.295318] fbcon: Taking over console
[    0.296054] find_next_bit:                  604012 ns, 163612 iterations
[    0.296697] find_next_zero_bit:             641051 ns, 164069 iterations
[    0.297220] find_last_bit:                  521720 ns, 163611 iterations
[    0.301476] find_nth_bit:                  4254313 ns,  16525 iterations
[    0.302667] find_first_bit:                1189415 ns,  16526 iterations
[    0.317626] find_first_and_bit:           14955907 ns,  32801 iterations
[    0.317924] find_next_and_bit:              296492 ns,  73456 iterations
[    0.317925] 
               Start testing find_bit() with sparse bitmap
[    0.317959] find_next_bit:                    8187 ns,    656 iterations
[    0.319208] find_next_zero_bit:            1247622 ns, 327025 iterations
[    0.319218] find_last_bit:                    8146 ns,    656 iterations
[    0.320406] find_nth_bit:                  1184997 ns,    655 iterations
[    0.320784] find_first_bit:                 376883 ns,    656 iterations
[    0.320788] find_first_and_bit:               2270 ns,      1 iterations
[    0.320791] find_next_and_bit:                1844 ns,      1 iterations
[    0.320793] test_bitmap: loaded.
[    0.320856] test_bitmap: parselist: 14: input is '0-2047:128/256' OK, Time: 176
[    0.320868] test_bitmap: bitmap_print_to_pagebuf: input is '0-32767
               ', Time: 624
[    0.321903] test_bitmap: all 6564 tests passed

After:

# run1:
[    0.299444] 
               Start testing find_bit() with random-filled bitmap
[    0.299446] fbcon: Taking over console
[    0.300176] find_next_bit:                  601818 ns, 163940 iterations
[    0.300820] find_next_zero_bit:             642027 ns, 163741 iterations
[    0.301350] find_last_bit:                  528024 ns, 163939 iterations
[    0.304680] find_nth_bit:                  3329159 ns,  16479 iterations
[    0.305868] find_first_bit:                1186153 ns,  16480 iterations
[    0.320205] find_first_and_bit:           14334269 ns,  32836 iterations
[    0.320504] find_next_and_bit:              297451 ns,  73788 iterations
[    0.320505] 
               Start testing find_bit() with sparse bitmap
[    0.320539] find_next_bit:                    7372 ns,    655 iterations
[    0.321789] find_next_zero_bit:            1248188 ns, 327026 iterations
[    0.321799] find_last_bit:                    8284 ns,    655 iterations
[    0.323032] find_nth_bit:                  1229792 ns,    654 iterations
[    0.323422] find_first_bit:                 389751 ns,    655 iterations
[    0.323426] find_first_and_bit:               2166 ns,      1 iterations
[    0.323430] find_next_and_bit:                1834 ns,      1 iterations
[    0.323431] test_bitmap: loaded.
[    0.323494] test_bitmap: parselist: 14: input is '0-2047:128/256' OK, Time: 184
[    0.323505] test_bitmap: bitmap_print_to_pagebuf: input is '0-32767
               ', Time: 503
[    0.324560] test_bitmap: all 6564 tests passed

# run2:
[    0.295331] 
               Start testing find_bit() with random-filled bitmap
[    0.295333] fbcon: Taking over console
[    0.296066] find_next_bit:                  600626 ns, 163710 iterations
[    0.296713] find_next_zero_bit:             644866 ns, 163971 iterations
[    0.297243] find_last_bit:                  528601 ns, 163711 iterations
[    0.300570] find_nth_bit:                  3325597 ns,  16300 iterations
[    0.301751] find_first_bit:                1179476 ns,  16301 iterations
[    0.316235] find_first_and_bit:           14480647 ns,  32619 iterations
[    0.316536] find_next_and_bit:              298766 ns,  73682 iterations
[    0.316537] 
               Start testing find_bit() with sparse bitmap
[    0.316571] find_next_bit:                    7277 ns,    656 iterations
[    0.317820] find_next_zero_bit:            1248426 ns, 327025 iterations
[    0.317829] find_last_bit:                    7801 ns,    656 iterations
[    0.319052] find_nth_bit:                  1219313 ns,    655 iterations
[    0.319439] find_first_bit:                 386162 ns,    656 iterations
[    0.319443] find_first_and_bit:               1869 ns,      1 iterations
[    0.319446] find_next_and_bit:                1827 ns,      1 iterations
[    0.319448] test_bitmap: loaded.
[    0.319511] test_bitmap: parselist: 14: input is '0-2047:128/256' OK, Time: 172
[    0.319522] test_bitmap: bitmap_print_to_pagebuf: input is '0-32767
               ', Time: 492
[    0.320568] test_bitmap: all 6564 tests passed

# run3:
[    0.299355] 
               Start testing find_bit() with random-filled bitmap
[    0.299357] fbcon: Taking over console
[    0.300087] find_next_bit:                  601452 ns, 163877 iterations
[    0.300735] find_next_zero_bit:             646048 ns, 163804 iterations
[    0.301256] find_last_bit:                  519870 ns, 163877 iterations
[    0.304620] find_nth_bit:                  3362863 ns,  16501 iterations
[    0.305818] find_first_bit:                1195522 ns,  16502 iterations
[    0.320875] find_first_and_bit:           15054509 ns,  32826 iterations
[    0.321172] find_next_and_bit:              295444 ns,  73697 iterations
[    0.321173] 
               Start testing find_bit() with sparse bitmap
[    0.321207] find_next_bit:                    7366 ns,    656 iterations
[    0.322456] find_next_zero_bit:            1248232 ns, 327025 iterations
[    0.322466] find_last_bit:                    8266 ns,    656 iterations
[    0.323732] find_nth_bit:                  1262568 ns,    655 iterations
[    0.324133] find_first_bit:                 400010 ns,    656 iterations
[    0.324137] find_first_and_bit:               2000 ns,      1 iterations
[    0.324140] find_next_and_bit:                1824 ns,      1 iterations
[    0.324141] test_bitmap: loaded.
[    0.324204] test_bitmap: parselist: 14: input is '0-2047:128/256' OK, Time: 187
[    0.324218] test_bitmap: bitmap_print_to_pagebuf: input is '0-32767
               ', Time: 540
[    0.325256] test_bitmap: all 6564 tests passed

> Can you share more about your hardware? Can you make sure it's a bare
> metal machine and run your test compiled in the kernel? That way it
> will run at boot time, at least before any possible userspace payload.
> Can you also run test_bitmap? It has some functional testing for the
> function?
> 
Both the previous and current tests were conducted on real hardware
during the boot phase. While I did test in a qemu environment
previously, it was solely for validation purposes and not for
performance testing. Below is the hardware information I used:

$ lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   16
  On-line CPU(s) list:    0-15
Vendor ID:                GenuineIntel
  Model name:             11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz
    CPU family:           6
    Model:                167
    Thread(s) per core:   2
    Core(s) per socket:   8
    Socket(s):            1
    Stepping:             1
    CPU max MHz:          4900.0000
    CPU min MHz:          800.0000
    BogoMIPS:             4992.00
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe sy
                          scall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc
                          _known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic mo
                          vbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibr
                          s_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512
                          dq rdseed adx smap avx512ifma clflushopt intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves dtherm 
                          ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes v
                          pclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear flush_l1d arch_capabilities
Virtualization features:  
  Virtualization:         VT-x
Caches (sum of all):      
  L1d:                    384 KiB (8 instances)
  L1i:                    256 KiB (8 instances)
  L2:                     4 MiB (8 instances)
  L3:                     16 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-15
Vulnerabilities:          
  Gather data sampling:   Mitigation; Microcode
  Itlb multihit:          Not affected
  L1tf:                   Not affected
  Mds:                    Not affected
  Meltdown:               Not affected
  Mmio stale data:        Mitigation; Clear CPU buffers; SMT vulnerable
  Reg file data sampling: Not affected
  Retbleed:               Mitigation; Enhanced IBRS
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
  Srbds:                  Not affected
  Tsx async abort:        Not affected


> > Should I include the above benchmark data in the commit message and
> > send a v2 patch?
> 
> Yes, I'd like to do so. Because the current benchmark numbers look so
> weird, can you run it again? Maybe several times before and after, to
> estimate the scatter. Also please run the test_bitmap to check
> functional correctness.
> 
> Since you already have a code testing the performance of fns(), can
> you add it in lib/find_bit_benchmark or lib/test_bitops in a following
> patch?
> 
Sure, I can do that.
I'll include the benchmark data along with the testing code in the v2
patch series that I'll send.

Regards,
Kuan-Wei

>  
> > Additionally, I apologize for you not receiving the email. I received
> > the following "Message not delivered" email, but I'm unsure if it's
> > related and what caused the error:

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch
  2024-04-29  8:05       ` Rasmus Villemoes
@ 2024-04-29 15:40         ` Kuan-Wei Chiu
  2024-04-29 16:34           ` Yury Norov
  2024-04-29 16:26         ` Yury Norov
  1 sibling, 1 reply; 11+ messages in thread
From: Kuan-Wei Chiu @ 2024-04-29 15:40 UTC (permalink / raw)
  To: Rasmus Villemoes; +Cc: Yury Norov, Andrew Morton, mm-commits, jserv, n26122115

On Mon, Apr 29, 2024 at 10:05:12AM +0200, Rasmus Villemoes wrote:
> On 28/04/2024 18.08, Yury Norov wrote:
> > + Rasmus
> > 
> > On Sat, Apr 27, 2024 at 01:33:55PM +0800, Kuan-Wei Chiu wrote:
> >> Before:
> >>                Start testing find_bit() with random-filled bitmap
> >> [    0.299085] fbcon: Taking over console
> >> [    0.299820] find_next_bit:                  606286 ns, 164169 iterations
> >> [    0.300463] find_next_zero_bit:             641072 ns, 163512 iterations
> >> [    0.300996] find_last_bit:                  531027 ns, 164169 iterations
> >> [    0.305233] find_nth_bit:                  4235859 ns,  16454 iterations
> >> [    0.306434] find_first_bit:                1199357 ns,  16455 iterations
> >> [    0.321616] find_first_and_bit:           15179667 ns,  32869 iterations
> >> [    0.321917] find_next_and_bit:              298836 ns,  73875 iterations
> >> [    0.321918] 
> >>                Start testing find_bit() with sparse bitmap
> >> [    0.321953] find_next_bit:                    7931 ns,    656 iterations
> >> [    0.323201] find_next_zero_bit:            1246980 ns, 327025 iterations
> >> [    0.323210] find_last_bit:                    8000 ns,    656 iterations
> >> [    0.324427] find_nth_bit:                  1213161 ns,    655 iterations
> >> [    0.324813] find_first_bit:                 384747 ns,    656 iterations
> >> [    0.324817] find_first_and_bit:               2220 ns,      1 iterations
> >> [    0.324820] find_next_and_bit:                1831 ns,      1 iterations
> >>
> >> After:
> >>                Start testing find_bit() with random-filled bitmap
> >> [    0.305081] fbcon: Taking over console
> >> [    0.306126] find_next_bit:                  854517 ns, 163960 iterations
> >> [    0.307041] find_next_zero_bit:             911725 ns, 163721 iterations
> >> [    0.307711] find_last_bit:                  668261 ns, 163960 iterations
> >> [    0.311160] find_nth_bit:                  3447530 ns,  16372 iterations
> >> [    0.312358] find_first_bit:                1196633 ns,  16373 iterations
> >> [    0.327191] find_first_and_bit:           14830129 ns,  32951 iterations
> >> [    0.327503] find_next_and_bit:              310560 ns,  73719 iterations
> >> [    0.327504] 
> >>                Start testing find_bit() with sparse bitmap
> >> [    0.327539] find_next_bit:                    7633 ns,    656 iterations
> >> [    0.328787] find_next_zero_bit:            1247398 ns, 327025 iterations
> >> [    0.328797] find_last_bit:                    8425 ns,    656 iterations
> >> [    0.330034] find_nth_bit:                  1234044 ns,    655 iterations
> >> [    0.330428] find_first_bit:                 392086 ns,    656 iterations
> >> [    0.330431] find_first_and_bit:               1980 ns,      1 iterations
> >> [    0.330434] find_next_and_bit:                1831 ns,      1 iterations
> >>
> >> Some benchmarks seem to have worsened after applying this patch.
> >> However, unless I'm mistaken, the fns() changes should only affect the
> >> results of find_nth_bit, while the others are just random fluctuations.
> > 
> > So...
> > 
> > The patch itself looks good, 
> 
> Well, I think it could be even better. While I agree that bit=ffs();
> clear_bit(bit, ) is probably a bad way of doing it, I think the basic
> structure of the function is good. Introducing a "count from 0 up to n"
> loop is rarely a good thing, keeping the n counting down to 0 is likely
> better.
> 
> So I'd instead just change the function to
> 
> static inline unsigned long fns(unsigned long word, unsigned int n)
> {
> 	while (word) {
> 		if (n-- == 0)
> 			return __ffs(word);
> 		word &= word - 1;
> 	}
> 
> 	return BITS_PER_LONG;
> }
>
How about rewriting it as follows:

static inline unsigned long fns(unsigned long word, unsigned int n)
{
    while (word && n--)
        word &= word - 1;

    return word ? __ffs(word) : BITS_PER_LONG;
}

IMHO, this way the code can be shorter and more elegant.

Regards,
Kuan-Wei

> 
> Now that I look closer, I think the
> 
>  * Returns the bit number of the N'th set bit.
>  * If no such, returns @size.
> 
> in the find_nth_bit and friends' docs is wrong. Tell me what happens here:
> 
> 
>   DECLARE_BITMAP(x, 12);
>   int i;
> 
>   x[0] = 3;
>   i = find_nth_bit(&x, 12, 7);
> 
> So I'm asking for the seventh (counting from 0) bit set in a bitmap of
> 12 bits, but only two bits are set. So i should be 12? No, i will be
> BITS_PER_LONG, because fns() doesn't know anything about the limit of
> 12. Do we really not have any tests covering that? Or indeed covering
> any of the small_const_nbits optimizations?
> 
> I've said this before, and I repeat. It was a mistake for the bitmap
> functions to promise a return of exactly @size when stuff is not found,
> they should always have just promised to return something >= @size. The
> checking in the callers would be just as easy (and some indeed do >=
> instead of ==), but the implementation can be somewhat cheaper. I'm
> afraid that ship has sailed, but it annoys me every time I stumble on this.
> 
> Rasmus
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch
  2024-04-29  8:05       ` Rasmus Villemoes
  2024-04-29 15:40         ` Kuan-Wei Chiu
@ 2024-04-29 16:26         ` Yury Norov
  1 sibling, 0 replies; 11+ messages in thread
From: Yury Norov @ 2024-04-29 16:26 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Kuan-Wei Chiu, Andrew Morton, mm-commits, jserv, n26122115

On Mon, Apr 29, 2024 at 10:05:12AM +0200, Rasmus Villemoes wrote:
> On 28/04/2024 18.08, Yury Norov wrote:
> > + Rasmus
> > 
> > On Sat, Apr 27, 2024 at 01:33:55PM +0800, Kuan-Wei Chiu wrote:
> >> Before:
> >>                Start testing find_bit() with random-filled bitmap
> >> [    0.299085] fbcon: Taking over console
> >> [    0.299820] find_next_bit:                  606286 ns, 164169 iterations
> >> [    0.300463] find_next_zero_bit:             641072 ns, 163512 iterations
> >> [    0.300996] find_last_bit:                  531027 ns, 164169 iterations
> >> [    0.305233] find_nth_bit:                  4235859 ns,  16454 iterations
> >> [    0.306434] find_first_bit:                1199357 ns,  16455 iterations
> >> [    0.321616] find_first_and_bit:           15179667 ns,  32869 iterations
> >> [    0.321917] find_next_and_bit:              298836 ns,  73875 iterations
> >> [    0.321918] 
> >>                Start testing find_bit() with sparse bitmap
> >> [    0.321953] find_next_bit:                    7931 ns,    656 iterations
> >> [    0.323201] find_next_zero_bit:            1246980 ns, 327025 iterations
> >> [    0.323210] find_last_bit:                    8000 ns,    656 iterations
> >> [    0.324427] find_nth_bit:                  1213161 ns,    655 iterations
> >> [    0.324813] find_first_bit:                 384747 ns,    656 iterations
> >> [    0.324817] find_first_and_bit:               2220 ns,      1 iterations
> >> [    0.324820] find_next_and_bit:                1831 ns,      1 iterations
> >>
> >> After:
> >>                Start testing find_bit() with random-filled bitmap
> >> [    0.305081] fbcon: Taking over console
> >> [    0.306126] find_next_bit:                  854517 ns, 163960 iterations
> >> [    0.307041] find_next_zero_bit:             911725 ns, 163721 iterations
> >> [    0.307711] find_last_bit:                  668261 ns, 163960 iterations
> >> [    0.311160] find_nth_bit:                  3447530 ns,  16372 iterations
> >> [    0.312358] find_first_bit:                1196633 ns,  16373 iterations
> >> [    0.327191] find_first_and_bit:           14830129 ns,  32951 iterations
> >> [    0.327503] find_next_and_bit:              310560 ns,  73719 iterations
> >> [    0.327504] 
> >>                Start testing find_bit() with sparse bitmap
> >> [    0.327539] find_next_bit:                    7633 ns,    656 iterations
> >> [    0.328787] find_next_zero_bit:            1247398 ns, 327025 iterations
> >> [    0.328797] find_last_bit:                    8425 ns,    656 iterations
> >> [    0.330034] find_nth_bit:                  1234044 ns,    655 iterations
> >> [    0.330428] find_first_bit:                 392086 ns,    656 iterations
> >> [    0.330431] find_first_and_bit:               1980 ns,      1 iterations
> >> [    0.330434] find_next_and_bit:                1831 ns,      1 iterations
> >>
> >> Some benchmarks seem to have worsened after applying this patch.
> >> However, unless I'm mistaken, the fns() changes should only affect the
> >> results of find_nth_bit, while the others are just random fluctuations.
> > 
> > So...
> > 
> > The patch itself looks good, 
> 
> Well, I think it could be even better. While I agree that bit=ffs();
> clear_bit(bit, ) is probably a bad way of doing it, I think the basic
> structure of the function is good. Introducing a "count from 0 up to n"
> loop is rarely a good thing, keeping the n counting down to 0 is likely
> better.
> 
> So I'd instead just change the function to
> 
> static inline unsigned long fns(unsigned long word, unsigned int n)
> {
> 	while (word) {
> 		if (n-- == 0)
> 			return __ffs(word);
> 		word &= word - 1;
> 	}
> 
> 	return BITS_PER_LONG;
> }
 
Agree. Even better.
 
> Now that I look closer, I think the
> 
>  * Returns the bit number of the N'th set bit.
>  * If no such, returns @size.
> 
> in the find_nth_bit and friends' docs is wrong. Tell me what happens here:
> 
> 
>   DECLARE_BITMAP(x, 12);
>   int i;
> 
>   x[0] = 3;
>   i = find_nth_bit(&x, 12, 7);
> 
> So I'm asking for the seventh (counting from 0) bit set in a bitmap of
> 12 bits, but only two bits are set. So i should be 12? No, i will be
> BITS_PER_LONG, because fns() doesn't know anything about the limit of
> 12. Do we really not have any tests covering that? Or indeed covering
> any of the small_const_nbits optimizations?

We test non-small_const_nbits() case in  test_find_nth_bit() tests at
line 257:

        expect_eq_uint(64 * 3 - 1, find_nth_bit(bmap, 64 * 3 - 1, 8));

And this is enforced in FIND_NTH_BIT() like this:

        sz = min(idx * BITS_PER_LONG + fns(tmp, nr), sz);
        
We don't test small_const_nbits() case, and it seems violating that
rule.

> I've said this before, and I repeat. It was a mistake for the bitmap
> functions to promise a return of exactly @size when stuff is not found,
> they should always have just promised to return something >= @size. The
> checking in the callers would be just as easy (and some indeed do >=
> instead of ==), but the implementation can be somewhat cheaper. I'm
> afraid that ship has sailed, but it annoys me every time I stumble on this.

The ship has sailed for an old API. For the new one we can make new
rules. Moreover, almost all kernel users of the function use it as
cpumask_nth(), and in cpumask API we always claim to return >= @size
if nothing is found.

So, we need to either fix small_const_bit() branch, or relax the
'== @size' requirement in FIND_NTH_BIT() and in the comment. I'd
rather do 2nd despite it breaks consistency...

Can you send a patch?  Or I can do it myself if you prefer.

Thanks,
Yury

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch
  2024-04-29 15:40         ` Kuan-Wei Chiu
@ 2024-04-29 16:34           ` Yury Norov
  2024-04-29 17:06             ` Kuan-Wei Chiu
  0 siblings, 1 reply; 11+ messages in thread
From: Yury Norov @ 2024-04-29 16:34 UTC (permalink / raw)
  To: Kuan-Wei Chiu
  Cc: Rasmus Villemoes, Andrew Morton, mm-commits, jserv, n26122115

On Mon, Apr 29, 2024 at 11:40:34PM +0800, Kuan-Wei Chiu wrote:
> On Mon, Apr 29, 2024 at 10:05:12AM +0200, Rasmus Villemoes wrote:
> > On 28/04/2024 18.08, Yury Norov wrote:
> > > + Rasmus
> > > 
> > > On Sat, Apr 27, 2024 at 01:33:55PM +0800, Kuan-Wei Chiu wrote:
> > >> Before:
> > >>                Start testing find_bit() with random-filled bitmap
> > >> [    0.299085] fbcon: Taking over console
> > >> [    0.299820] find_next_bit:                  606286 ns, 164169 iterations
> > >> [    0.300463] find_next_zero_bit:             641072 ns, 163512 iterations
> > >> [    0.300996] find_last_bit:                  531027 ns, 164169 iterations
> > >> [    0.305233] find_nth_bit:                  4235859 ns,  16454 iterations
> > >> [    0.306434] find_first_bit:                1199357 ns,  16455 iterations
> > >> [    0.321616] find_first_and_bit:           15179667 ns,  32869 iterations
> > >> [    0.321917] find_next_and_bit:              298836 ns,  73875 iterations
> > >> [    0.321918] 
> > >>                Start testing find_bit() with sparse bitmap
> > >> [    0.321953] find_next_bit:                    7931 ns,    656 iterations
> > >> [    0.323201] find_next_zero_bit:            1246980 ns, 327025 iterations
> > >> [    0.323210] find_last_bit:                    8000 ns,    656 iterations
> > >> [    0.324427] find_nth_bit:                  1213161 ns,    655 iterations
> > >> [    0.324813] find_first_bit:                 384747 ns,    656 iterations
> > >> [    0.324817] find_first_and_bit:               2220 ns,      1 iterations
> > >> [    0.324820] find_next_and_bit:                1831 ns,      1 iterations
> > >>
> > >> After:
> > >>                Start testing find_bit() with random-filled bitmap
> > >> [    0.305081] fbcon: Taking over console
> > >> [    0.306126] find_next_bit:                  854517 ns, 163960 iterations
> > >> [    0.307041] find_next_zero_bit:             911725 ns, 163721 iterations
> > >> [    0.307711] find_last_bit:                  668261 ns, 163960 iterations
> > >> [    0.311160] find_nth_bit:                  3447530 ns,  16372 iterations
> > >> [    0.312358] find_first_bit:                1196633 ns,  16373 iterations
> > >> [    0.327191] find_first_and_bit:           14830129 ns,  32951 iterations
> > >> [    0.327503] find_next_and_bit:              310560 ns,  73719 iterations
> > >> [    0.327504] 
> > >>                Start testing find_bit() with sparse bitmap
> > >> [    0.327539] find_next_bit:                    7633 ns,    656 iterations
> > >> [    0.328787] find_next_zero_bit:            1247398 ns, 327025 iterations
> > >> [    0.328797] find_last_bit:                    8425 ns,    656 iterations
> > >> [    0.330034] find_nth_bit:                  1234044 ns,    655 iterations
> > >> [    0.330428] find_first_bit:                 392086 ns,    656 iterations
> > >> [    0.330431] find_first_and_bit:               1980 ns,      1 iterations
> > >> [    0.330434] find_next_and_bit:                1831 ns,      1 iterations
> > >>
> > >> Some benchmarks seem to have worsened after applying this patch.
> > >> However, unless I'm mistaken, the fns() changes should only affect the
> > >> results of find_nth_bit, while the others are just random fluctuations.
> > > 
> > > So...
> > > 
> > > The patch itself looks good, 
> > 
> > Well, I think it could be even better. While I agree that bit=ffs();
> > clear_bit(bit, ) is probably a bad way of doing it, I think the basic
> > structure of the function is good. Introducing a "count from 0 up to n"
> > loop is rarely a good thing, keeping the n counting down to 0 is likely
> > better.
> > 
> > So I'd instead just change the function to
> > 
> > static inline unsigned long fns(unsigned long word, unsigned int n)
> > {
> > 	while (word) {
> > 		if (n-- == 0)
> > 			return __ffs(word);
> > 		word &= word - 1;
> > 	}
> > 
> > 	return BITS_PER_LONG;
> > }
> >
> How about rewriting it as follows:
> 
> static inline unsigned long fns(unsigned long word, unsigned int n)
> {
>     while (word && n--)
>         word &= word - 1;
> 
>     return word ? __ffs(word) : BITS_PER_LONG;
> }
> 
> IMHO, this way the code can be shorter and more elegant.

Rasmus' code looks shorter because it tests the 'word != 0' only once
when n == 0, for example. But we never sure how the compiler would
optimize it. Can you show a disassembly of both and pick the best?

Thanks,
Yury

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch
  2024-04-29 16:34           ` Yury Norov
@ 2024-04-29 17:06             ` Kuan-Wei Chiu
  0 siblings, 0 replies; 11+ messages in thread
From: Kuan-Wei Chiu @ 2024-04-29 17:06 UTC (permalink / raw)
  To: Yury Norov; +Cc: Rasmus Villemoes, Andrew Morton, mm-commits, jserv, n26122115

On Mon, Apr 29, 2024 at 09:34:06AM -0700, Yury Norov wrote:
> On Mon, Apr 29, 2024 at 11:40:34PM +0800, Kuan-Wei Chiu wrote:
> > On Mon, Apr 29, 2024 at 10:05:12AM +0200, Rasmus Villemoes wrote:
> > > On 28/04/2024 18.08, Yury Norov wrote:
> > > > + Rasmus
> > > > 
> > > > On Sat, Apr 27, 2024 at 01:33:55PM +0800, Kuan-Wei Chiu wrote:
> > > >> Before:
> > > >>                Start testing find_bit() with random-filled bitmap
> > > >> [    0.299085] fbcon: Taking over console
> > > >> [    0.299820] find_next_bit:                  606286 ns, 164169 iterations
> > > >> [    0.300463] find_next_zero_bit:             641072 ns, 163512 iterations
> > > >> [    0.300996] find_last_bit:                  531027 ns, 164169 iterations
> > > >> [    0.305233] find_nth_bit:                  4235859 ns,  16454 iterations
> > > >> [    0.306434] find_first_bit:                1199357 ns,  16455 iterations
> > > >> [    0.321616] find_first_and_bit:           15179667 ns,  32869 iterations
> > > >> [    0.321917] find_next_and_bit:              298836 ns,  73875 iterations
> > > >> [    0.321918] 
> > > >>                Start testing find_bit() with sparse bitmap
> > > >> [    0.321953] find_next_bit:                    7931 ns,    656 iterations
> > > >> [    0.323201] find_next_zero_bit:            1246980 ns, 327025 iterations
> > > >> [    0.323210] find_last_bit:                    8000 ns,    656 iterations
> > > >> [    0.324427] find_nth_bit:                  1213161 ns,    655 iterations
> > > >> [    0.324813] find_first_bit:                 384747 ns,    656 iterations
> > > >> [    0.324817] find_first_and_bit:               2220 ns,      1 iterations
> > > >> [    0.324820] find_next_and_bit:                1831 ns,      1 iterations
> > > >>
> > > >> After:
> > > >>                Start testing find_bit() with random-filled bitmap
> > > >> [    0.305081] fbcon: Taking over console
> > > >> [    0.306126] find_next_bit:                  854517 ns, 163960 iterations
> > > >> [    0.307041] find_next_zero_bit:             911725 ns, 163721 iterations
> > > >> [    0.307711] find_last_bit:                  668261 ns, 163960 iterations
> > > >> [    0.311160] find_nth_bit:                  3447530 ns,  16372 iterations
> > > >> [    0.312358] find_first_bit:                1196633 ns,  16373 iterations
> > > >> [    0.327191] find_first_and_bit:           14830129 ns,  32951 iterations
> > > >> [    0.327503] find_next_and_bit:              310560 ns,  73719 iterations
> > > >> [    0.327504] 
> > > >>                Start testing find_bit() with sparse bitmap
> > > >> [    0.327539] find_next_bit:                    7633 ns,    656 iterations
> > > >> [    0.328787] find_next_zero_bit:            1247398 ns, 327025 iterations
> > > >> [    0.328797] find_last_bit:                    8425 ns,    656 iterations
> > > >> [    0.330034] find_nth_bit:                  1234044 ns,    655 iterations
> > > >> [    0.330428] find_first_bit:                 392086 ns,    656 iterations
> > > >> [    0.330431] find_first_and_bit:               1980 ns,      1 iterations
> > > >> [    0.330434] find_next_and_bit:                1831 ns,      1 iterations
> > > >>
> > > >> Some benchmarks seem to have worsened after applying this patch.
> > > >> However, unless I'm mistaken, the fns() changes should only affect the
> > > >> results of find_nth_bit, while the others are just random fluctuations.
> > > > 
> > > > So...
> > > > 
> > > > The patch itself looks good, 
> > > 
> > > Well, I think it could be even better. While I agree that bit=ffs();
> > > clear_bit(bit, ) is probably a bad way of doing it, I think the basic
> > > structure of the function is good. Introducing a "count from 0 up to n"
> > > loop is rarely a good thing, keeping the n counting down to 0 is likely
> > > better.
> > > 
> > > So I'd instead just change the function to
> > > 
> > > static inline unsigned long fns(unsigned long word, unsigned int n)
> > > {
> > > 	while (word) {
> > > 		if (n-- == 0)
> > > 			return __ffs(word);
> > > 		word &= word - 1;
> > > 	}
> > > 
> > > 	return BITS_PER_LONG;
> > > }
> > >
> > How about rewriting it as follows:
> > 
> > static inline unsigned long fns(unsigned long word, unsigned int n)
> > {
> >     while (word && n--)
> >         word &= word - 1;
> > 
> >     return word ? __ffs(word) : BITS_PER_LONG;
> > }
> > 
> > IMHO, this way the code can be shorter and more elegant.
> 
> Rasmus' code looks shorter because it tests the 'word != 0' only once
> when n == 0, for example. But we never sure how the compiler would
> optimize it. Can you show a disassembly of both and pick the best?
> 
> Thanks,
> Yury

It appears that gcc with the O2 flag is smart enough to generate the
same code. The only difference in my disassembly results is the
generated label names. Tested with gcc 11.4.0.

Regards,
Kuan-Wei

^ permalink raw reply	[flat|nested] 11+ messages in thread

* + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch
@ 2024-05-02 15:18 Andrew Morton
  0 siblings, 0 replies; 11+ messages in thread
From: Andrew Morton @ 2024-05-02 15:18 UTC (permalink / raw)
  To: mm-commits, yury.norov, linux, jserv, visitorckw, akpm


The patch titled
     Subject: bitops: optimize fns() for improved performance
has been added to the -mm mm-nonmm-unstable branch.  Its filename is
     bitops-optimize-fns-for-improved-performance.patch

This patch will shortly appear at
     https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/bitops-optimize-fns-for-improved-performance.patch

This patch will later appear in the mm-nonmm-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days

------------------------------------------------------
From: Kuan-Wei Chiu <visitorckw@gmail.com>
Subject: bitops: optimize fns() for improved performance
Date: Thu, 2 May 2024 17:24:43 +0800

The current fns() repeatedly uses __ffs() to find the index of the least
significant bit and then clears the corresponding bit using __clear_bit().
The method for clearing the least significant bit can be optimized by
using word &= word - 1 instead.

Typically, the execution time of one __ffs() plus one __clear_bit() is
longer than that of a bitwise AND operation and a subtraction.  To improve
performance, the loop for clearing the least significant bit has been
replaced with word &= word - 1, followed by a single __ffs() operation to
obtain the answer.  This change reduces the number of __ffs() iterations
from n to just one, enhancing overall performance.

This modification significantly accelerates the fns() function in the
test_bitops benchmark, improving its speed by approximately 7.6 times. 
Additionally, it enhances the performance of find_nth_bit() in the
find_bit benchmark by approximately 26%.

Before:
test_bitops: fns:            58033164 ns
find_nth_bit:                  4254313 ns,  16525 iterations

After:
test_bitops: fns:             7637268 ns
find_nth_bit:                  3362863 ns,  16501 iterations

Link: https://lkml.kernel.org/r/20240502092443.6845-3-visitorckw@gmail.com
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/bitops.h |   12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

--- a/include/linux/bitops.h~bitops-optimize-fns-for-improved-performance
+++ a/include/linux/bitops.h
@@ -254,16 +254,10 @@ static inline unsigned long __ffs64(u64
  */
 static inline unsigned long fns(unsigned long word, unsigned int n)
 {
-	unsigned int bit;
+	while (word && n--)
+		word &= word - 1;
 
-	while (word) {
-		bit = __ffs(word);
-		if (n-- == 0)
-			return bit;
-		__clear_bit(bit, &word);
-	}
-
-	return BITS_PER_LONG;
+	return word ? __ffs(word) : BITS_PER_LONG;
 }
 
 /**
_

Patches currently in -mm which might be from visitorckw@gmail.com are

lib-test_bitops-add-benchmark-test-for-fns.patch
bitops-optimize-fns-for-improved-performance.patch


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-05-02 15:18 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-26 19:08 + bitops-optimize-fns-for-improved-performance.patch added to mm-nonmm-unstable branch Andrew Morton
2024-04-26 19:48 ` Yury Norov
2024-04-27  5:33   ` Kuan-Wei Chiu
2024-04-28 16:08     ` Yury Norov
2024-04-29  8:05       ` Rasmus Villemoes
2024-04-29 15:40         ` Kuan-Wei Chiu
2024-04-29 16:34           ` Yury Norov
2024-04-29 17:06             ` Kuan-Wei Chiu
2024-04-29 16:26         ` Yury Norov
2024-04-29 15:31       ` Kuan-Wei Chiu
  -- strict thread matches above, loose matches on Subject: below --
2024-05-02 15:18 Andrew Morton

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.