From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-Status: No, score=-3.9 required=3.0 tests=ALL_TRUSTED,AWL,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 2B76A1F4B4;
	Wed, 20 Jan 2021 08:57:45 +0000 (UTC)
Date: Wed, 20 Jan 2021 08:57:45 +0000
From: Eric Wong
To: Xiao Yu
Cc: Arkadi Colson, cmogstored-public@yhbt.net
Subject: Re: Segfaults on http_close?
Message-ID: <20210120085745.GB29704@dcvr>
References: <20210111212621.GA12555@dcvr> <20210117095109.GA28219@dcvr>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To:
List-Id:

Xiao Yu wrote:
> Thanks for the quick response! Sorry about the delay but I ran into a
> couple issues (sorry kinda learning gdb and compiling binaries in
> general on the go here) and have not been able to capture more useful
> logs yet as crashes have seem to have slowed / stopped since
> recompiling and reloading. For the record I recompiled cmogstored with
> the newer RH `devtoolset-9-toolchain` (9.1) and it has not crash
> since.

No worries. Based on the toolchain change being a success, I more
strongly suspect the tiny patch I sent you guys will fix the problem
with all compilers.

> Also sorry about the lack of useful logs in my initial message but
> neither kernel logs nor messages contained anything interesting around
> the segfaults. Making matters worse, we didn't consistently reload
> cmogstored as various versions of the compiled binary was installed
> across the cluster and didn't really save the debugging symbols from
> each of the compilations so can't really reply with a more useful
> stack trace with the current core dumps.

That's fine. One thing I suggest is NOT using the --daemonize/-d flag
and instead relying on systemd or something similar that can capture
stderr.
The --daemonize/-d flag (unfortunately) matches Perlbal mogstored
behavior, which causes stderr to be unconditionally redirected to
/dev/null and hides some errors.

> For what it's worth, we also run another cluster with 1.7.3 that we've
> upgraded over the years and have never seen this issue. Those nodes
> are on a different distro (Debian) if that makes any difference.

Good to know and thanks for the data point. I suspect it depends on
the compiler version/settings, and maybe other versions of gcc/clang
shipped with Debian will still trigger the problem without my proposed
patch to use the default stack size.

Thanks again for the report; I look forward to more info whenever
you can provide it.