cmogstored dev/user discussion/issues/patches/etc
* Segfaults on http_close?
@ 2021-01-11 20:48 Xiao Yu
  2021-01-11 21:26 ` Eric Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Xiao Yu @ 2021-01-11 20:48 UTC (permalink / raw)
  To: cmogstored-public

Howdy, we are running a 96-node cmogstored cluster and have noticed
that when the cluster is busy with lots of writes we occasionally get
segfaults in cmogstored. This has happened 7 times in the past week,
each time on a different, seemingly random cmogstored node. The abrt
backtrace of the core dump shows something similar to the following
in each instance:

---
{   "signal": 11
,   "executable": "/usr/local/bin/cmogstored"
,   "stacktrace":
      [ {   "crash_thread": true
        ,   "frames":
              [ {   "address": 140389358944542
                ,   "build_id": "3c61131d1dac9da79b73188e7702bef786c2ad54"
                ,   "build_id_offset": 528670
                ,   "function_name": "_int_free"
                ,   "file_name": "/usr/lib64/libc-2.17.so"
                }
              , {   "address": 4225373
                ,   "build_id": "9ca387b687027c0bac678943337d72b109fdf1e7"
                ,   "build_id_offset": 31069
                ,   "function_name": "http_close"
                ,   "file_name": "/usr/local/bin/cmogstored"
                }
              , {   "address": 4228819
                ,   "build_id": "9ca387b687027c0bac678943337d72b109fdf1e7"
                ,   "build_id_offset": 34515
                ,   "function_name": "mog_http_queue_step"
                ,   "file_name": "/usr/local/bin/cmogstored"
                }
              , {   "address": 4256381
                ,   "build_id": "9ca387b687027c0bac678943337d72b109fdf1e7"
                ,   "build_id_offset": 62077
                ,   "function_name": "mog_queue_loop"
                ,   "file_name": "/usr/local/bin/cmogstored"
                }
              , {   "address": 140389362433493
                ,   "build_id": "3d9441083d079dc2977f1bd50c8068d11767232d"
                ,   "build_id_offset": 32213
                ,   "function_name": "start_thread"
                ,   "file_name": "/usr/lib64/libpthread-2.17.so"
                }
              , {   "address": 140389359455917
                ,   "build_id": "3c61131d1dac9da79b73188e7702bef786c2ad54"
                ,   "build_id_offset": 1040045
                ,   "function_name": "__clone"
                ,   "file_name": "/usr/lib64/libc-2.17.so"
                } ]
        } ]
}
---

We are using the latest 1.8.0 release on SL 7
(5.8.7-1.el7.elrepo.x86_64) and here's what it's linked against:

---
# ldd -v /usr/local/bin/cmogstored
  linux-vdso.so.1 =>  (0x00007ffc2898d000)
  libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f5c0ccff000)
  libc.so.6 => /lib64/libc.so.6 (0x00007f5c0c932000)
  /lib64/ld-linux-x86-64.so.2 (0x00007f5c0cf1b000)

  Version information:
  /usr/local/bin/cmogstored:
    libpthread.so.0 (GLIBC_2.3.2) => /lib64/libpthread.so.0
    libpthread.so.0 (GLIBC_2.2.5) => /lib64/libpthread.so.0
    libc.so.6 (GLIBC_2.3) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.9) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.14) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.8) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.3.2) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.7) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.10) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.6) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.17) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.4) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.3.4) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6
  /lib64/libpthread.so.0:
    ld-linux-x86-64.so.2 (GLIBC_2.2.5) => /lib64/ld-linux-x86-64.so.2
    ld-linux-x86-64.so.2 (GLIBC_2.3) => /lib64/ld-linux-x86-64.so.2
    ld-linux-x86-64.so.2 (GLIBC_PRIVATE) => /lib64/ld-linux-x86-64.so.2
    libc.so.6 (GLIBC_2.14) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.3.2) => /lib64/libc.so.6
    libc.so.6 (GLIBC_PRIVATE) => /lib64/libc.so.6
    libc.so.6 (GLIBC_2.2.5) => /lib64/libc.so.6
  /lib64/libc.so.6:
    ld-linux-x86-64.so.2 (GLIBC_2.3) => /lib64/ld-linux-x86-64.so.2
    ld-linux-x86-64.so.2 (GLIBC_PRIVATE) => /lib64/ld-linux-x86-64.so.2
---

Looking at http_close(), it does not appear to do all that much, and
mog_rbuf_free() already appears to check whether the rbuf pointer is
NULL before freeing it, so I'm not sure what the issue is. (Sorry,
I'm not really a C dev, so I don't have a strong grasp of what is
happening.) I'm not sure how to debug this further; is there any
other data I could collect, or something I can do to help track
down the issue?
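
For reference, here's the shape of what I think I'm reading, as a
rough sketch only; this is NOT the actual cmogstored source, and the
struct layout and names here are invented:

---
#include <stdlib.h>

/* invented layout, for illustration only */
struct mog_rbuf {
        size_t rcapa;
        char rptr[];    /* flexible array member holding the read buffer */
};

static void rbuf_free_sketch(struct mog_rbuf *rbuf)
{
        /* tolerate NULL, like free(3) itself does */
        if (rbuf)
                free(rbuf);
}
---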

Thanks!
Xiao Yu

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Segfaults on http_close?
  2021-01-11 20:48 Segfaults on http_close? Xiao Yu
@ 2021-01-11 21:26 ` Eric Wong
  2021-01-17  9:51   ` Eric Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Wong @ 2021-01-11 21:26 UTC (permalink / raw)
  To: Xiao Yu; +Cc: cmogstored-public

Xiao Yu <xyu@automattic.com> wrote:
> Howdy, we are running a 96-node cmogstored cluster and have noticed
> that when the cluster is busy with lots of writes we occasionally get
> segfaults in cmogstored. This has happened 7 times in the past week,
> each time on a different, seemingly random cmogstored node. The abrt
> backtrace of the core dump shows something similar to the following
> in each instance:

Thanks for the bug report, sorry this caused you trouble
and I wonder if this is the same issue Arkadi was hitting
last year...

> ---
> {   "signal": 11
> ,   "executable": "/usr/local/bin/cmogstored"
> ,   "stacktrace":
>       [ {   "crash_thread": true
>         ,   "frames":
>               [ {   "address": 140389358944542
>                 ,   "build_id": "3c61131d1dac9da79b73188e7702bef786c2ad54"
>                 ,   "build_id_offset": 528670
>                 ,   "function_name": "_int_free"
>                 ,   "file_name": "/usr/lib64/libc-2.17.so"

Anything in stderr or dmesg kernel logs from that?  glibc malloc
will emit something to stderr, I think...

>                 }
>               , {   "address": 4225373
>                 ,   "build_id": "9ca387b687027c0bac678943337d72b109fdf1e7"
>                 ,   "build_id_offset": 31069
>                 ,   "function_name": "http_close"
>                 ,   "file_name": "/usr/local/bin/cmogstored"
>                 }
>               , {   "address": 4228819
>                 ,   "build_id": "9ca387b687027c0bac678943337d72b109fdf1e7"
>                 ,   "build_id_offset": 34515
>                 ,   "function_name": "mog_http_queue_step"
>                 ,   "file_name": "/usr/local/bin/cmogstored"
>                 }

<snip>

That's a pretty standard code path, though a better backtrace and
core dumps with line numbers and pointer values would be more
useful.

> We are using the latest 1.8.0 release on SL 7
> (5.8.7-1.el7.elrepo.x86_64) and here's what it's linked against:

I'll need more time to investigate, but can you try mixing in
some older versions (1.6, 1.7.x) and maybe see if it reproduces
on a test cluster?

I know 1.6 has gone through several PB of traffic without
problems (but with older kernels and IPv4); the newer releases
are more focused on my smaller home setup.

I don't know how long you've run cmogstored, what kernels you've
tried, or what versions you've used in the past.

Is this your first time seeing this?

<snip>

> Looking at http_close(), it does not appear to do all that much, and
> mog_rbuf_free() already appears to check whether the rbuf pointer is
> NULL before freeing it, so I'm not sure what the issue is. (Sorry,
> I'm not really a C dev, so I don't have a strong grasp of what is
> happening.) I'm not sure how to debug this further; is there any
> other data I could collect, or something I can do to help track
> down the issue?

Actually, mog_packaddr_free could be there, too, if you're using
IPv6.  (I haven't used IPv6 heavily).

But it could also be a double-free in mog_rbuf_free.
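
To illustrate with a contrived, hypothetical example (not actual
cmogstored code): the NULL check inside the free helper can't save
you if the owner never clears its pointer after the first free:

#include <stdlib.h>

struct conn {
        void *rbuf;
};

static void rbuf_free(void *rbuf)
{
        if (rbuf)       /* fine for never-allocated buffers... */
                free(rbuf);
}

static void conn_close(struct conn *c)
{
        rbuf_free(c->rbuf);
        /* bug: missing c->rbuf = NULL, so a second close double-frees */
}

int main(void)
{
        struct conn c = { .rbuf = malloc(16) };

        conn_close(&c);
        conn_close(&c); /* same non-NULL pointer freed again: UB */
        return 0;
}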

Would you happen to have anybody on staff that can look at core
dumps and poke at it from gdb?

Basically you need to ensure you're getting backtraces that
gdb can inspect.  It would be helpful to see exactly the
contents and owner of the pointer being freed.

The default build uses -ggdb3 to maximize debug info.
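
For example, something along these lines (the core path below is just
an example, and the frame number for http_close may differ on your
build):

  gdb /usr/local/bin/cmogstored /path/to/core
  (gdb) bt full              # crashing thread, with local variables
  (gdb) thread apply all bt  # stacks for every thread
  (gdb) frame 1              # http_close was frame 1 in your abrt trace
  (gdb) info locals          # inspect the pointer being handed to free()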

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Segfaults on http_close?
  2021-01-11 21:26 ` Eric Wong
@ 2021-01-17  9:51   ` Eric Wong
  2021-01-20  5:21     ` Xiao Yu
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Wong @ 2021-01-17  9:51 UTC (permalink / raw)
  To: Xiao Yu, Arkadi Colson; +Cc: cmogstored-public

Eric Wong <e@80x24.org> wrote:
> Xiao Yu <xyu@automattic.com> wrote:
> > Howdy, we are running a 96-node cmogstored cluster and have noticed
> > that when the cluster is busy with lots of writes we occasionally get
> > segfaults in cmogstored. This has happened 7 times in the past week,
> > each time on a different, seemingly random cmogstored node. The abrt
> > backtrace of the core dump shows something similar to the following
> > in each instance:
> 
> Thanks for the bug report, sorry this caused you trouble
> and I wonder if this is the same issue Arkadi was hitting
> last year...

Hi Xiao and Arkadi: Can either of you try the 1-line patch
below to disable pthread_attr_setstacksize?

I took another look at the code and couldn't find any other
culprits... (though I admit I'm not mentally as sharp given
pandemic-induced stress and anxiety :<).

Given the mysterious nature of this problem and my inability to
reproduce it, I wonder if there's stack corruption with certain
compilers/glibc happening and blowing past the 4K guard page...
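
As a contrived demonstration of that failure mode (hypothetical code,
not from cmogstored; build with -pthread and expect SIGSEGV once the
recursion hits the guard page of the deliberately tiny stack):

#include <pthread.h>
#include <limits.h>
#include <stdio.h>

static int deep(int n)
{
        volatile char pad[1024];        /* burn ~1K of stack per frame */

        pad[0] = (char)n;
        pad[sizeof(pad) - 1] = (char)n;
        if (n <= 0)
                return pad[0];
        return deep(n - 1) + pad[sizeof(pad) - 1];
}

static void *thr(void *arg)
{
        (void)arg;
        printf("%d\n", deep(1 << 20)); /* far deeper than the stack allows */
        return NULL;
}

int main(void)
{
        pthread_t t;
        pthread_attr_t attr;

        pthread_attr_init(&attr);
        /* near-minimum stack, in the spirit of MOG_THR_STACK_SIZE */
        pthread_attr_setstacksize(&attr, PTHREAD_STACK_MIN + 8192);
        pthread_create(&t, &attr, thr, NULL);
        pthread_join(t, NULL);
        return 0;
}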

@Arkadi: Xiao recently brought up this (or similar) issue again:
  https://yhbt.net/cmogstored-public/20210111212621.GA12555@dcvr/T/

diff --git a/thrpool.c b/thrpool.c
index bc67ea0..bd71f95 100644
--- a/thrpool.c
+++ b/thrpool.c
@@ -141,7 +141,7 @@ thrpool_add(struct mog_thrpool *tp, unsigned size, unsigned long *nr_eagain)
 
 	CHECK(int, 0, pthread_attr_init(&attr));
 
-	if (stacksize > 0)
+	if (0 && stacksize > 0)
 		CHECK(int, 0, pthread_attr_setstacksize(&attr, stacksize));
 
 	thr = &tp->threads[tp->n_threads].thr;



In retrospect, running a small stack is unnecessary on 64-bit
systems due to practically unlimited virtual address space and
lazy allocation.  It may still make sense for 32-bit (some
embedded systems), though they can set RLIMIT_STACK before
launching.
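
For example, "ulimit -s 512" in a wrapper script would do it, or (a
hypothetical sketch, not something cmogstored ships) a tiny launcher
that caps RLIMIT_STACK before exec; on glibc, threads created without
an explicit stack size take their default from this limit:

#include <sys/resource.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        /* 512K ceiling; the path below is an example install location */
        struct rlimit rl = { .rlim_cur = 512 * 1024, .rlim_max = 512 * 1024 };

        (void)argc;
        if (setrlimit(RLIMIT_STACK, &rl) != 0)
                perror("setrlimit");
        execv("/usr/local/bin/cmogstored", argv);
        perror("execv");        /* only reached if exec fails */
        return 1;
}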

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: Segfaults on http_close?
  2021-01-17  9:51   ` Eric Wong
@ 2021-01-20  5:21     ` Xiao Yu
  2021-01-20  8:57       ` Eric Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Xiao Yu @ 2021-01-20  5:21 UTC (permalink / raw)
  To: Eric Wong; +Cc: Xiao Yu, Arkadi Colson, cmogstored-public

Thanks for the quick response! Sorry about the delay, but I ran into
a couple of issues (sorry, kinda learning gdb and compiling binaries
in general on the go here) and have not been able to capture more
useful logs yet, as crashes seem to have slowed / stopped since
recompiling and reloading. For the record, I recompiled cmogstored
with the newer RH `devtoolset-9-toolchain` (9.1) and it has not
crashed since.

Also, sorry about the lack of useful logs in my initial message, but
neither the kernel logs nor messages contained anything interesting
around the segfaults. Making matters worse, we didn't consistently
reload cmogstored as various versions of the compiled binary were
installed across the cluster, and we didn't save the debugging
symbols from each of the compilations, so I can't reply with a more
useful stack trace from the current core dumps.

For what it's worth, we also run another cluster with 1.7.3 that we've
upgraded over the years and have never seen this issue. Those nodes
are on a different distro (Debian) if that makes any difference.

On Sun, Jan 17, 2021 at 9:51 AM Eric Wong <e@80x24.org> wrote:
>
> Eric Wong <e@80x24.org> wrote:
> > Xiao Yu <xyu@automattic.com> wrote:
> > > Howdy, we are running a 96-node cmogstored cluster and have noticed
> > > that when the cluster is busy with lots of writes we occasionally get
> > > segfaults in cmogstored. This has happened 7 times in the past week,
> > > each time on a different, seemingly random cmogstored node. The abrt
> > > backtrace of the core dump shows something similar to the following
> > > in each instance:
> >
> > Thanks for the bug report, sorry this caused you trouble
> > and I wonder if this is the same issue Arkadi was hitting
> > last year...
>
> Hi Xiao and Arkadi: Can either of you try the 1-line patch
> below to disable pthread_attr_setstacksize?

I'm going to let the current version run for a bit more and hope we
get a core dump. I think I finally have things cleaned up so that all
96 of our servers are running the same binary, and I've saved the
corresponding debugging symbols file so we can check the stack trace
later. If we see another crash I'll recompile with the patch below
and try again. :)

> I took another look at the code and couldn't find any other
> culprits... (though I admit I'm not mentally as sharp given
> pandemic-induced stress and anxiety :<).

Oof, sorry to hear that, take care of yourself man!

> Given the mysterious nature of this problem and my inability to
> reproduce it, I wonder if there's stack corruption with certain
> compilers/glibc happening and blowing past the 4K guard page...
>
> @Arkadi: Xiao recently brought up this (or similar) issue again:
>   https://yhbt.net/cmogstored-public/20210111212621.GA12555@dcvr/T/
>
> diff --git a/thrpool.c b/thrpool.c
> index bc67ea0..bd71f95 100644
> --- a/thrpool.c
> +++ b/thrpool.c
> @@ -141,7 +141,7 @@ thrpool_add(struct mog_thrpool *tp, unsigned size, unsigned long *nr_eagain)
>
>         CHECK(int, 0, pthread_attr_init(&attr));
>
> -       if (stacksize > 0)
> +       if (0 && stacksize > 0)
>                 CHECK(int, 0, pthread_attr_setstacksize(&attr, stacksize));
>
>         thr = &tp->threads[tp->n_threads].thr;
>
>
>
> In retrospect, running a small stack is unnecessary on 64-bit
> systems due to practically unlimited virtual address space and
> lazy allocation.  It may still make sense for 32-bit (some
> embedded systems), though they can set RLIMIT_STACK before
> launching.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Segfaults on http_close?
  2021-01-20  5:21     ` Xiao Yu
@ 2021-01-20  8:57       ` Eric Wong
  2021-01-20 21:13         ` Xiao Yu
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Wong @ 2021-01-20  8:57 UTC (permalink / raw)
  To: Xiao Yu; +Cc: Arkadi Colson, cmogstored-public

Xiao Yu <xyu@automattic.com> wrote:
> Thanks for the quick response! Sorry about the delay, but I ran into
> a couple of issues (sorry, kinda learning gdb and compiling binaries
> in general on the go here) and have not been able to capture more
> useful logs yet, as crashes seem to have slowed / stopped since
> recompiling and reloading. For the record, I recompiled cmogstored
> with the newer RH `devtoolset-9-toolchain` (9.1) and it has not
> crashed since.

No worries.  Based on the toolchain change being a success, I
more strongly suspect the tiny patch I sent you guys will fix
the problem with all compilers.

> Also, sorry about the lack of useful logs in my initial message, but
> neither the kernel logs nor messages contained anything interesting
> around the segfaults. Making matters worse, we didn't consistently
> reload cmogstored as various versions of the compiled binary were
> installed across the cluster, and we didn't save the debugging
> symbols from each of the compilations, so I can't reply with a more
> useful stack trace from the current core dumps.

That's fine.  One thing I suggest is NOT using the --daemonize/-d
flag; instead, rely on systemd or something similar that can
capture stderr.

The --daemonize/-d flag (unfortunately) matches Perlbal
mogstored behavior, which causes stderr to be unconditionally
redirected to /dev/null and hides some errors.
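
That is, daemonization ends up doing the classic redirect dance,
roughly like this sketch (not the exact cmogstored code):

#include <fcntl.h>
#include <unistd.h>

static void stdio_to_devnull(void)
{
        int fd = open("/dev/null", O_RDWR);

        if (fd >= 0) {
                dup2(fd, STDIN_FILENO);
                dup2(fd, STDOUT_FILENO);
                dup2(fd, STDERR_FILENO); /* glibc abort messages vanish here */
                if (fd > STDERR_FILENO)
                        close(fd);
        }
}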

> For what it's worth, we also run another cluster with 1.7.3 that we've
> upgraded over the years and have never seen this issue. Those nodes
> are on a different distro (Debian) if that makes any difference.

Good to know and thanks for the data point.  I suspect it
depends on the compiler version/settings, and maybe other
versions of gcc/clang shipped with Debian will still trigger the
problem without my proposed patch to use the default stack size.

Thanks again for the report and will look forward to more info
whenever you can provide it.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Segfaults on http_close?
  2021-01-20  8:57       ` Eric Wong
@ 2021-01-20 21:13         ` Xiao Yu
  2021-01-20 21:22           ` Eric Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Xiao Yu @ 2021-01-20 21:13 UTC (permalink / raw)
  To: Eric Wong; +Cc: Xiao Yu, Arkadi Colson, cmogstored-public

On Wed, Jan 20, 2021 at 8:57 AM Eric Wong <e@80x24.org> wrote:
>
> Xiao Yu <xyu@automattic.com> wrote:
> > Thanks for the quick response! Sorry about the delay, but I ran into
> > a couple of issues (sorry, kinda learning gdb and compiling binaries
> > in general on the go here) and have not been able to capture more
> > useful logs yet, as crashes seem to have slowed / stopped since
> > recompiling and reloading. For the record, I recompiled cmogstored
> > with the newer RH `devtoolset-9-toolchain` (9.1) and it has not
> > crashed since.
>
> No worries.  Based on the toolchain change being a success, I
> more strongly suspect the tiny patch I sent you guys will fix
> the problem with all compilers.

That's good to know, would it help if I applied the patch and then
compiled with the older compiler to check?

> > Also, sorry about the lack of useful logs in my initial message, but
> > neither the kernel logs nor messages contained anything interesting
> > around the segfaults. Making matters worse, we didn't consistently
> > reload cmogstored as various versions of the compiled binary were
> > installed across the cluster, and we didn't save the debugging
> > symbols from each of the compilations, so I can't reply with a more
> > useful stack trace from the current core dumps.
>
> That's fine.  One thing I suggest is NOT using the --daemonize/-d
> flag; instead, rely on systemd or something similar that can
> capture stderr.
>
> The --daemonize/-d flag (unfortunately) matches Perlbal
> mogstored behavior, which causes stderr to be unconditionally
> redirected to /dev/null and hides some errors.

I noticed the lack of verbosity / logs :) so right now we're running
daemonized but also using syslog to capture logs. It's still pretty
quiet overall: nothing was logged around when the segfaults happened,
and it's pretty much only premature-EOF messages when timeouts occur,
plus things like the following when we reload cmogstored:

```
2021-01-14T15:04:15.888261-05:00 host cmogstored[44351]: upgrade spawned PID:27575
2021-01-14T15:04:15.891551-05:00 host cmogstored[27575]: inherited 0.0.0.0:7501 on fd=4
2021-01-14T15:04:15.891564-05:00 host cmogstored[27575]: inherited 0.0.0.0:7500 on fd=5

Would non-daemonized mode and capturing stderr provide more verbosity?

Thanks again, Xiao

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Segfaults on http_close?
  2021-01-20 21:13         ` Xiao Yu
@ 2021-01-20 21:22           ` Eric Wong
  2021-01-25 17:36             ` Xiao Yu
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Wong @ 2021-01-20 21:22 UTC (permalink / raw)
  To: Xiao Yu; +Cc: Arkadi Colson, cmogstored-public

Xiao Yu <xyu@automattic.com> wrote:
> That's good to know, would it help if I applied the patch and then
> compiled with the older compiler to check?

Yes, that would be tremendously helpful, thanks.

> Would non-daemonized mode and capturing stderr provide more verbosity?

Yes, syslog can't capture fatal errors from glibc, unfortunately.
It might only be one line, but that could tell you what kind of free()
failure there was.

--daemonize/-d behavior was a mistake copied from Perl mogstored :<
I'm not sure if I want to add another option or env variable...
(I use systemd myself)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Segfaults on http_close?
  2021-01-20 21:22           ` Eric Wong
@ 2021-01-25 17:36             ` Xiao Yu
  2021-01-25 17:47               ` Eric Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Xiao Yu @ 2021-01-25 17:36 UTC (permalink / raw)
  To: Eric Wong; +Cc: Xiao Yu, Arkadi Colson, cmogstored-public

On Wed, Jan 20, 2021 at 9:22 PM Eric Wong <e@80x24.org> wrote:
>
> Xiao Yu <xyu@automattic.com> wrote:
> > That's good to know, would it help if I applied the patch and then
> > compiled with the older compiler to check?
>
> Yes, that would be tremendously helpful, thanks.
>
> > Would non-daemonized mode and capturing stderr provide more verbosity?

We've been running the patched version compiled with the older GCC
for a couple of days now, and so far so good; hopefully I can provide
more details if it does crash again in the future. Thanks again for
all your help!

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Segfaults on http_close?
  2021-01-25 17:36             ` Xiao Yu
@ 2021-01-25 17:47               ` Eric Wong
  2021-01-25 19:27                 ` Xiao Yu
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Wong @ 2021-01-25 17:47 UTC (permalink / raw)
  To: Xiao Yu; +Cc: Arkadi Colson, cmogstored-public

Xiao Yu <xyu@automattic.com> wrote:
> On Wed, Jan 20, 2021 at 9:22 PM Eric Wong <e@80x24.org> wrote:
> > Xiao Yu <xyu@automattic.com> wrote:
> > > That's good to know, would it help if I applied the patch and then
> > > compiled with the older compiler to check?
> >
> > Yes, that would be tremendously helpful, thanks.
> >
> > > Would non-daemonized mode and capturing stderr provide more verbosity?
> 
> We've been running the patched version compiled with the older GCC
> for a couple of days now, and so far so good; hopefully I can provide
> more details if it does crash again in the future. Thanks again for
> all your help!

No problem and thanks for the update!  Any thoughts on whether I
should release 1.8.1 with the patch soon or maybe wait a bit longer?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Segfaults on http_close?
  2021-01-25 17:47               ` Eric Wong
@ 2021-01-25 19:27                 ` Xiao Yu
  2021-02-12  6:54                   ` Eric Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Xiao Yu @ 2021-01-25 19:27 UTC (permalink / raw)
  To: Eric Wong; +Cc: Xiao Yu, Arkadi Colson, cmogstored-public

On Mon, Jan 25, 2021 at 5:47 PM Eric Wong <e@80x24.org> wrote:
>
> No problem and thanks for the update!  Any thoughts on whether I
> should release 1.8.1 with the patch soon or maybe wait a bit longer?

Let's wait a week. I think it's pretty safe: across the cluster we
were seeing ~1 segfault / day, and so far none of the nodes have
experienced any issues. However, I just want to be sure it's not
random chance, so let's give it a couple more days of full weekday
load.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Segfaults on http_close?
  2021-01-25 19:27                 ` Xiao Yu
@ 2021-02-12  6:54                   ` Eric Wong
  2021-02-12 21:18                     ` Xiao Yu
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Wong @ 2021-02-12  6:54 UTC (permalink / raw)
  To: Xiao Yu; +Cc: Arkadi Colson, cmogstored-public

Xiao Yu <xyu@automattic.com> wrote:
> On Mon, Jan 25, 2021 at 5:47 PM Eric Wong <e@80x24.org> wrote:
> >
> > No problem and thanks for the update!  Any thoughts on whether I
> > should release 1.8.1 with the patch soon or maybe wait a bit longer?
> 
> Let's wait a week. I think it's pretty safe: across the cluster we
> were seeing ~1 segfault / day, and so far none of the nodes have
> experienced any issues. However, I just want to be sure it's not
> random chance, so let's give it a couple more days of full weekday
> load.

Ping?  It's been a few weeks :>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Segfaults on http_close?
  2021-02-12  6:54                   ` Eric Wong
@ 2021-02-12 21:18                     ` Xiao Yu
  2021-02-13  2:19                       ` Eric Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Xiao Yu @ 2021-02-12 21:18 UTC (permalink / raw)
  To: Eric Wong; +Cc: Xiao Yu, Arkadi Colson, cmogstored-public

On Fri, Feb 12, 2021 at 6:54 AM Eric Wong <e@80x24.org> wrote:
>
> Ping?  It's been a few weeks :>

Apologies for not getting back to you earlier! I just checked all of
our logs across our 96 nodes and there have been no segfaults since
the patch was deployed. I think we can call that fix good and tag a
new version.

Thanks again for the quick fix and for diagnosing the issue even with
the limited amount of info I was able to provide.

--Xiao

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Segfaults on http_close?
  2021-02-12 21:18                     ` Xiao Yu
@ 2021-02-13  2:19                       ` Eric Wong
  0 siblings, 0 replies; 13+ messages in thread
From: Eric Wong @ 2021-02-13  2:19 UTC (permalink / raw)
  To: Xiao Yu; +Cc: Arkadi Colson, cmogstored-public

Xiao Yu <xyu@automattic.com> wrote:
> On Fri, Feb 12, 2021 at 6:54 AM Eric Wong <e@80x24.org> wrote:
> >
> > Ping?  It's been a few weeks :>
> 
> Apologies for not getting back to you earlier! I just checked all of
> our logs across our 96 nodes and there have been no segfaults since
> the patch was deployed. I think we can call that fix good and tag a
> new version.
>
> Thanks again for the quick fix and for diagnosing the issue even with
> the limited amount of info I was able to provide.

No problem and thanks for the update!

I've pushed out the final patch as commit f441da11c290373
and will push out a 1.8.1 release in a few.
-------------8<------------
Date: Sat, 13 Feb 2021 01:03:58 +0000
Subject: [PATCH] thrpool: remove stack size changing for all platforms

As compilers and system C libraries change, using a
non-default stack size is too risky and can lead to
difficult-to-diagnose problems.

Using the default stack size seems to solve the segfaults
at http_close reported by Xiao Yu <xyu@automattic.com>.

Users on modern 64-bit systems were unlikely to find any benefit
in using a small stack size with this code base.
Users on 32-bit systems who wish to continue with a minimal
stack should use "ulimit -s" in startup scripts or configure
their process manager appropriately (e.g. setting the
"LimitSTACK" directive described in systemd.exec(5)).

Reported-and-tested-by: Xiao Yu <xyu@automattic.com>
Link: https://yhbt.net/cmogstored-public/CABfxMcW+kb5gwq3pSB_89P49EVv+4UkJXz+mUPQTy19AdrwbAg@mail.gmail.com/T/
---
 thrpool.c | 21 ---------------------
 1 file changed, 21 deletions(-)

diff --git a/thrpool.c b/thrpool.c
index bc67ea0..56a7ef0 100644
--- a/thrpool.c
+++ b/thrpool.c
@@ -12,24 +12,6 @@ struct mog_thr_start_arg {
 	unsigned *do_quit;
 };
 
-/*
- * we can lower this if we can test with lower values, NPTL minimum is 16K.
- * We also use syslog() and *printf() functions which take a lot of
- * stack under glibc, so we'll add BUFSIZ (8192 on glibc) to that
- */
-#if MOG_LIBKQUEUE /* libkqueue uses quite a bit of stack */
-#  define MOG_THR_STACK_SIZE (0)
-#elif defined(__GLIBC__) || defined(__FreeBSD__)
-#  define MOG_THR_STACK_SIZE ((16 * 1024) + MAX(8192,BUFSIZ))
-#  if defined(PTHREAD_STACK_MIN) && (PTHREAD_STACK_MIN > MOG_THR_STACK_SIZE)
-#    undef MOG_THR_STACK_SIZE
-#    define MOG_THR_STACK_SIZE PTHREAD_STACK_MIN
-#  endif
-#else
-#  define MOG_THR_STACK_SIZE (0)
-#endif
-static const size_t stacksize = (size_t)MOG_THR_STACK_SIZE;
-
 static sigset_t quitset;
 
 __attribute__((constructor)) static void thrpool_init(void)
@@ -141,9 +123,6 @@ thrpool_add(struct mog_thrpool *tp, unsigned size, unsigned long *nr_eagain)
 
 	CHECK(int, 0, pthread_attr_init(&attr));
 
-	if (stacksize > 0)
-		CHECK(int, 0, pthread_attr_setstacksize(&attr, stacksize));
-
 	thr = &tp->threads[tp->n_threads].thr;
 
 	CHECK(int, 0, pthread_mutex_lock(&arg.mtx));

^ permalink raw reply related	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-02-13  2:19 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-11 20:48 Segfaults on http_close? Xiao Yu
2021-01-11 21:26 ` Eric Wong
2021-01-17  9:51   ` Eric Wong
2021-01-20  5:21     ` Xiao Yu
2021-01-20  8:57       ` Eric Wong
2021-01-20 21:13         ` Xiao Yu
2021-01-20 21:22           ` Eric Wong
2021-01-25 17:36             ` Xiao Yu
2021-01-25 17:47               ` Eric Wong
2021-01-25 19:27                 ` Xiao Yu
2021-02-12  6:54                   ` Eric Wong
2021-02-12 21:18                     ` Xiao Yu
2021-02-13  2:19                       ` Eric Wong

Code repositories for project(s) associated with this public inbox

	https://yhbt.net/cmogstored.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).