From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <e@80x24.org>
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level: 
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id BFA9E1F4B4;
	Mon, 11 Jan 2021 21:26:21 +0000 (UTC)
Date: Mon, 11 Jan 2021 21:26:21 +0000
From: Eric Wong <e@80x24.org>
To: Xiao Yu <xyu@automattic.com>
Cc: cmogstored-public@yhbt.net
Subject: Re: Segfaults on http_close?
Message-ID: <20210111212621.GA12555@dcvr>
References: <CABfxMcXPr7q8o1ayRdn1x-Fukuh7-s3YG=04KX=vAzz4DYqhuQ@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <CABfxMcXPr7q8o1ayRdn1x-Fukuh7-s3YG=04KX=vAzz4DYqhuQ@mail.gmail.com>
List-Id: <cmogstored-public.yhbt.net>

Xiao Yu <xyu@automattic.com> wrote:
> Howdy, we are running a 96 node cmogstored cluster and have noticed
> that when the cluster is busy with lots of writes we occasionally get
> segfaults in cmogstored. This has happened 7 times in the past week
> each time on a random and different cmogstored node. Looking at the
> abrt backtrace of the core dump shows something similar to the
> following in each instance:

Thanks for the bug report, and sorry this caused you trouble.
I wonder if this is the same issue Arkadi was hitting last
year...

> ---
> {   "signal": 11
> ,   "executable": "/usr/local/bin/cmogstored"
> ,   "stacktrace":
>       [ {   "crash_thread": true
>         ,   "frames":
>               [ {   "address": 140389358944542
>                 ,   "build_id": "3c61131d1dac9da79b73188e7702bef786c2ad54"
>                 ,   "build_id_offset": 528670
>                 ,   "function_name": "_int_free"
>                 ,   "file_name": "/usr/lib64/libc-2.17.so"

Is there anything in stderr or the dmesg kernel logs from that?
glibc malloc will emit something to stderr when it detects
heap corruption, I think...

>                 }
>               , {   "address": 4225373
>                 ,   "build_id": "9ca387b687027c0bac678943337d72b109fdf1e7"
>                 ,   "build_id_offset": 31069
>                 ,   "function_name": "http_close"
>                 ,   "file_name": "/usr/local/bin/cmogstored"
>                 }
>               , {   "address": 4228819
>                 ,   "build_id": "9ca387b687027c0bac678943337d72b109fdf1e7"
>                 ,   "build_id_offset": 34515
>                 ,   "function_name": "mog_http_queue_step"
>                 ,   "file_name": "/usr/local/bin/cmogstored"
>                 }

<snip>

That's a pretty standard code path, though a better backtrace
(from a core dump, with line numbers and pointer values) would
be more useful.

> We are using the latest 1.8.0 release on SL 7
> (5.8.7-1.el7.elrepo.x86_64) and here's what it's linked against:

I'll need more time to investigate, but can you try mixing in
some older versions (1.6, 1.7.x) and maybe see if it reproduces
on a test cluster?

I know 1.6 has gone through several PB of traffic without
problems (but with older kernels and IPv4); the newer releases
are more focused on my smaller home setup.

I don't know how long you've been running cmogstored, which
kernels you've tried it with, or which versions you've used in
the past.

Is this your first time seeing this?

<snip>

> Looking at http_close() it does not appear to really do all that much
> and mog_rbuf_free() appears to already test to see if the rbuf pointer
> is null before freeing it so I'm not sure what the issue is. (Sorry
> I'm not really a C dev so don't have a strong grasp on what is
> happening.) I'm not really sure how to debug this issue further, is
> there any other data I could collect or something I can do to try and
> track down the issue?

Actually, the free could be coming from mog_packaddr_free, too,
if you're using IPv6.  (I haven't used IPv6 heavily).

But it could also be a double-free in mog_rbuf_free.
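
To illustrate why a NULL check before freeing doesn't rule that
out, here is a minimal sketch.  It is not cmogstored code: every
name in it (demo_buf, demo_conn, demo_rbuf_free, demo_scenario)
is hypothetical, and it flips a "live" flag instead of calling
free() so that it is safe to run.  The assumption is that two
owners end up holding the same pointer; whether that can actually
happen in cmogstored is exactly what needs to be confirmed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names exist in cmogstored. */
struct demo_buf {
	int live; /* 1 while "allocated", 0 once "freed" */
};

struct demo_conn {
	struct demo_buf *rbuf;
};

static int demo_double_frees; /* frees of an already-freed buffer */

/* NULL-checked free, similar in shape to a guarded free() wrapper */
static void demo_rbuf_free(struct demo_buf *buf)
{
	if (!buf)
		return; /* the NULL check only catches this case */
	if (!buf->live) {
		demo_double_frees++; /* a real free() may corrupt the heap */
		return;
	}
	buf->live = 0;
}

/* Returns the number of double-frees detected in the scenario. */
static int demo_scenario(void)
{
	struct demo_buf buf = { 1 };
	struct demo_conn a = { &buf };
	struct demo_conn b = { &buf }; /* second owner of the same buffer */

	demo_double_frees = 0;
	demo_rbuf_free(a.rbuf);
	a.rbuf = NULL;          /* clears only a's copy of the pointer */
	demo_rbuf_free(a.rbuf); /* harmless: the NULL check catches it */
	demo_rbuf_free(b.rbuf); /* stale pointer is non-NULL: double-free */
	return demo_double_frees;
}
```

Calling demo_scenario() reports one double-free: setting a.rbuf
to NULL protects only that copy of the pointer, so the stale copy
in b.rbuf passes the NULL check and the buffer is freed again.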

Would you happen to have anybody on staff who can look at core
dumps and poke at them from gdb?

Basically you need to ensure you're getting backtraces that
gdb can inspect.  It would be helpful to see exactly the
contents and owner of the pointer being freed.

The default build uses -ggdb3 to maximize debug info.