From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id BFA9E1F4B4;
	Mon, 11 Jan 2021 21:26:21 +0000 (UTC)
Date: Mon, 11 Jan 2021 21:26:21 +0000
From: Eric Wong
To: Xiao Yu
Cc: cmogstored-public@yhbt.net
Subject: Re: Segfaults on http_close?
Message-ID: <20210111212621.GA12555@dcvr>
References:
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To:
List-Id:

Xiao Yu wrote:
> Howdy, we are running a 96 node cmogstored cluster and have noticed
> that when the cluster is busy with lots of writes we occasionally get
> segfaults in cmogstored. This has happened 7 times in the past week
> each time on a random and different cmogstored node. Looking at the
> abrt backtrace of the core dump shows something similar to the
> following in each instance:

Thanks for the bug report, sorry this caused you trouble and
I wonder if this is the same issue Arkadi was hitting
last year...

> ---
> { "signal": 11
> , "executable": "/usr/local/bin/cmogstored"
> , "stacktrace":
>   [ { "crash_thread": true
>     , "frames":
>       [ { "address": 140389358944542
>         , "build_id": "3c61131d1dac9da79b73188e7702bef786c2ad54"
>         , "build_id_offset": 528670
>         , "function_name": "_int_free"
>         , "file_name": "/usr/lib64/libc-2.17.so"

Anything in stderr or dmesg kernel logs from that?
glibc malloc will emit something to stderr, I think...

>         }
>       , { "address": 4225373
>         , "build_id": "9ca387b687027c0bac678943337d72b109fdf1e7"
>         , "build_id_offset": 31069
>         , "function_name": "http_close"
>         , "file_name": "/usr/local/bin/cmogstored"
>         }
>       , { "address": 4228819
>         , "build_id": "9ca387b687027c0bac678943337d72b109fdf1e7"
>         , "build_id_offset": 34515
>         , "function_name": "mog_http_queue_step"
>         , "file_name": "/usr/local/bin/cmogstored"
>         }

That's a pretty standard code path; though a better backtrace +
core dumps with line numbers and pointer values would be more
useful.

> We are using the latest 1.8.0 release on SL 7
> (5.8.7-1.el7.elrepo.x86_64) and here's what it's linked against:

I'll need more time to investigate, but can you try mixing
in some older versions (1.6, 1.7.x) and maybe see if it
reproduces on a test cluster?

I know 1.6 has gone through several PB of traffic without
problems (but with older kernels and IPv4); the newer releases
are more focused on my smaller home setup.

I don't know how long or what kernels you've tried with
cmogstored and what versions you've used in the past.
Is this your first time seeing this?

> Looking at http_close() it does not appear to really do all that much
> and mog_rbuf_free() appears to already test to see if the rbuf pointer
> is null before freeing it so I'm not sure what the issue is. (Sorry
> I'm not really a C dev so don't have a strong grasp on what is
> happening.) I'm not really sure how to debug this issue further, is
> there any other data I could collect or something I can do to try and
> track down the issue?

Actually, mog_packaddr_free could be there, too, if you're using
IPv6. (I haven't used IPv6 heavily). But it could also be
a double-free in mog_rbuf_free.

Would you happen to have anybody on staff that can look at core
dumps and poke at it from gdb?

Basically you need to ensure you're getting backtraces that gdb
can inspect. It would be helpful to see exactly the contents
and owner of the pointer being freed.
The default build uses -ggdb3 to maximize debug info.
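
To illustrate why a NULL test before free() does not rule out a double-free: the test only helps if every owner of the pointer clears its copy after freeing. A stale, non-NULL pointer passes the test and gets freed again, which is the kind of thing that can blow up inside _int_free. Below is a minimal sketch of that idiom. All names in it (demo_buf, demo_conn, demo_buf_new, demo_conn_release) are made up for illustration and are not cmogstored's actual structures or functions; whether the real mog_rbuf_free path clears its pointer would need to be checked against the source.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a read buffer; NOT cmogstored's struct mog_rbuf. */
struct demo_buf {
	size_t cap;
	char data[64];
};

/* Hypothetical stand-in for a connection that owns one buffer. */
struct demo_conn {
	struct demo_buf *buf;
};

/* Allocate a zeroed buffer; returns NULL if allocation fails. */
static struct demo_buf *demo_buf_new(void)
{
	struct demo_buf *b = calloc(1, sizeof(*b));

	if (b)
		b->cap = sizeof(b->data);
	return b;
}

/*
 * Free the buffer owned by the connection and clear the owner's pointer.
 * Returns 1 if a buffer was freed, 0 if there was nothing to free.
 *
 * The NULL test alone is not enough: without the "c->buf = NULL" below,
 * a second call would see a stale non-NULL pointer and free it again.
 * Clearing only protects this owner; any other struct holding a copy
 * of the same pointer would still be left dangling.
 */
static int demo_conn_release(struct demo_conn *c)
{
	assert(c != NULL);
	if (!c->buf)
		return 0;
	free(c->buf);
	c->buf = NULL;
	return 1;
}
```

Calling demo_conn_release() twice on the same connection is harmless here because the second call finds NULL. That says nothing about a second owner of the same pointer, which is why seeing the owner of the freed pointer in gdb matters.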