From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 18 Dec 2019 17:58:49 +0000
From: Eric Wong
To: Arkadi Colson
Cc: cmogstored-public@bogomips.org
Subject: Re: Heavy load
Message-ID: <20191218175849.GA27056@dcvr>
References: <20191211170658.GA16951@dcvr>
 <20191212191642.GA16109@dcvr>
 <20191217084330.GB32606@dcvr>
 <2fc09aa7-dc0a-d37e-4015-2ba834aa4525@smartbit.be>
 <20191217194205.GA31282@dcvr>
 <06b931d9-fc30-ba20-6b02-ddd2b9120766@smartbit.be>
In-Reply-To: <06b931d9-fc30-ba20-6b02-ddd2b9120766@smartbit.be>

Arkadi Colson wrote:
> On 17/12/19 20:42, Eric Wong wrote:
> > Arkadi Colson wrote:
> >>> Any idea? If you need more info, please just ask!
> >> We have about 192 devices spread over about 23 cmogstored hosts.
> >> Each device is one disk with one partition...
> >>> How many "/devXYZ" devices do you have?  Are they all
> >>> on different partitions?
> >
> > OK, thanks.  I've only got a single host nowadays with 3
> > rotational HDD.  Most I ever had was 20 rotational HDD on a
> > host, but that place is out of business.
> >
> > Since your build did not include -ggdb3 with debug_info by default,
> > I wonder if there's something broken in your build system or
> > build scripts...  Which compiler are you using?
> >
> > Can you share the output of "ldd /path/to/cmogstored"?
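For reference, a rebuild along these lines should produce usable
backtraces (a sketch only, assuming a standard autotools source tree;
the -O2 level and the use of pidof/gdb are illustrative, not from this
thread):

```shell
# rebuild with debug_info so backtraces resolve to source lines
./configure CFLAGS='-ggdb3 -O2'
make

# after reproducing the heavy load, dump backtraces of every thread
# of the running daemon (assumes gdb is installed and one instance runs):
gdb -p "$(pidof cmogstored)" -batch -ex 'thread apply all bt'
```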
> root@mogstore:~# ldd /usr/bin/cmogstored
>     linux-vdso.so.1 (0x00007fff6584b000)
>     libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0
> (0x00007f634a7f2000)
>     libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f634a453000)
>     /lib64/ld-linux-x86-64.so.2 (0x00007f634ac50000)

> > Since this is Linux, you're not using libkqueue, are you?

I was hoping for a simple explanation with libkqueue being the
culprit, but that's not it.

Have you gotten a better backtrace with debug_info? (-ggdb3)

> > Also, which Linux kernel is it?
>
> In fact it's a clean Debian stretch installation with this kernel:
>
> root@mogstore:~# uname -a
> Linux mogstore 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2
> (2019-11-11) x86_64 GNU/Linux

OK, I don't think there are known problems with that kernel;
cmogstored is pretty sensitive to OS bugs and to bugs in emulation
layers like libkqueue.

> > Are you using "server aio_threads =" via the mgmt interface?
>
> I don't think so. How can I verify this?

You'd have something connecting to the mgmt port (7501 in your
case) and setting "server aio_threads = $NUMBER".

> > Are you using the undocumented -W/--worker-processes or
> > -M (multi-config) option(s)?
>
> I don't think so. The config looks like this:
>
> httplisten  = 0.0.0.0:7500
> mgmtlisten  = 0.0.0.0:7501
> maxconns    = 10000
> docroot     = /var/mogdata
> daemonize   = 1
> server      = none

OK.

> > Is your traffic read-heavy or write-heavy?
>
> We saw peaks of 3Gb traffic on the newest cmogstored when marking one
> host dead...

Are you able to reproduce the problem on a test instance with just
cmogstored?  (No need for a full MogileFS instance, just PUT/GET
over HTTP.)

Also, are you on SSD or HDD?  The lower latency of SSD could trigger
some bugs.  The design is for high-latency HDD, but it ought to work
well with SSD, too.  I haven't tested much with SSD, unfortunately.
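A minimal standalone reproduction could look something like this (a
sketch under assumptions: cmogstored is on PATH, ports 7500/7501 are
free, and the /tmp paths and fid name are made up for illustration;
the flags mirror the config quoted above):

```shell
# standalone cmogstored, no MogileFS tracker needed
mkdir -p /tmp/mogtest/dev1
cmogstored --docroot=/tmp/mogtest \
           --httplisten=127.0.0.1:7500 --mgmtlisten=127.0.0.1:7501 &
sleep 1

# write-heavy vs read-heavy traffic, shaped like what MogileFS sends:
curl -sf -T /etc/os-release http://127.0.0.1:7500/dev1/0000001.fid  # PUT
curl -sf -o /dev/null http://127.0.0.1:7500/dev1/0000001.fid        # GET

# the aio_threads check from above, against the mgmt sidechannel:
printf 'server aio_threads = 30\r\n' | nc -w1 127.0.0.1 7501

kill %1
```

Looping the PUT/GET pair under something like xargs -P or ab would
approximate the 3Gb peaks without involving the trackers.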