From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS33070 50.56.128.0/17 X-Spam-Status: No, score=0.0 required=3.0 tests=MSGID_FROM_MTA_HEADER, TVD_RCVD_IP shortcircuit=no autolearn=unavailable version=3.3.2 Path: news.gmane.org!not-for-mail From: Samuel Kadolph Newsgroups: gmane.comp.lang.ruby.rainbows.general Subject: Re: Unicorn is killing our rainbows workers Date: Thu, 19 Jul 2012 16:57:31 -0400 Message-ID: References: <20120718215222.GA11539@dcvr.yhbt.net> <20120719002641.GA17210@dcvr.yhbt.net> <20120719201633.GA8203@dcvr.yhbt.net> NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Trace: dough.gmane.org 1342731461 10404 80.91.229.3 (19 Jul 2012 20:57:41 GMT) X-Complaints-To: usenet@dough.gmane.org NNTP-Posting-Date: Thu, 19 Jul 2012 20:57:41 +0000 (UTC) Cc: Cody Fauser , ops , Harry Brundage , Jonathan Rudenberg To: "Rainbows! list" Original-X-From: rainbows-talk-bounces-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org Thu Jul 19 22:57:40 2012 Return-path: Envelope-to: gclrrg-rainbows-talk@m.gmane.org X-Original-To: rainbows-talk-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org Delivered-To: rainbows-talk-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=shopify.com; s=google; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=KHxk4aptpyy8qVFzClzlsJWlI/tF8OaRmPTaz+uFDGk=; b=blRTbf3z3hG4pU09AwM4vIYocoBrH9rWFb+AQ4CokKrwypoHhLlTDPkb0t/lHlF6S4 3S1l2m+p8rFlndEV92fNoYEK6evnJGTSA5WvXTxy8Pfsh3ZPBfjXhSd5eKZPzL67jamw LuSihRpuXbbbcpNeQmSIan7DAPGPSBD2rzBjc= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type:x-gm-message-state; bh=KHxk4aptpyy8qVFzClzlsJWlI/tF8OaRmPTaz+uFDGk=; b=Kld04DdWECEIdeMlNr2ukpgXpanW3fwRDjsP3rDBvvYhmIkhPQ/8VMvzpm/4W/ZFal B4e9JjwUTJM75OYmkChdI9VfG5NRB8vGNfzV2NwltQ9PudiOJlKGY/UIGZflllDLNnhG KPTbg6h/8V4VG2xlqoOIpZs2Hnai8318Mgh7FXBsZyL36xD7B+LF2S5xjscCKeRZ289L wVWb4QakowKdzWJ56X+Hz0PsO5q26ZHpM8BheRWEJtipF5AAdvHnaC3EM6puQhBRBYJx e+nx0hf5ec7QvCmMVwlMr4Ppl2L3lo/Z2idwx6bfv+pFqhV2d3Gjtk6aF1QlieLc4DHm 7N+w== In-Reply-To: <20120719201633.GA8203-yBiyF41qdooeIZ0/mPfg9Q@public.gmane.org> X-Gm-Message-State: ALoCoQk43xrgRpGoozIjnnPl/viOE/jw4IMkgI2fWIw/DPrHwtv1F9FkTpTCKX423bgCquUwlZez X-BeenThere: rainbows-talk-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org X-Mailman-Version: 2.1.12 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: rainbows-talk-bounces-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org Errors-To: rainbows-talk-bounces-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org Xref: news.gmane.org gmane.comp.lang.ruby.rainbows.general:378 Archived-At: Received: from 50-56-192-79.static.cloud-ips.com ([50.56.192.79] helo=rubyforge.org) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1Srxn0-00029D-Ic for gclrrg-rainbows-talk@m.gmane.org; Thu, 19 Jul 2012 22:57:38 +0200 Received: from localhost.localdomain (localhost [127.0.0.1]) by rubyforge.org (Postfix) with ESMTP id A1C0C2E068; Thu, 19 Jul 2012 20:57:36 +0000 (UTC) Received: from mail-yw0-f50.google.com (mail-yw0-f50.google.com [209.85.213.50]) by rubyforge.org (Postfix) with ESMTP id 900322E063 for ; Thu, 19 Jul 2012 20:57:32 +0000 (UTC) Received: by yhjj63 with SMTP id j63so3941815yhj.23 for ; Thu, 19 Jul 2012 13:57:32 -0700 (PDT) Received: by 10.66.75.133 with SMTP id c5mr6526734paw.24.1342731451758; Thu, 19 Jul 2012 13:57:31 -0700 (PDT) Received: by 10.66.217.225 with HTTP; Thu, 19 Jul 2012 13:57:31 -0700 (PDT) On Thu, Jul 19, 2012 at 4:16 PM, Eric Wong wrote: > > Samuel Kadolph wrote: > > On Wed, Jul 18, 2012 at 8:26 PM, Eric Wong wrote: > > > Samuel Kadolph wrote: > > >> On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong wrote: > > >> > Samuel Kadolph wrote: > > >> >> https://gist.github.com/9ec96922e55a59753997. Any insight into why > > >> >> unicorn is killing our ThreadPool workers would help us greatly. If > > >> >> you require additional info I would be happy to provide it. > > > > > > Also, are you using "preload_app true" ? > > > > Yes we are using preload_app true. > > > > > I'm a bit curious how these messages are happening, too: > > > D, [2012-07-18T15:12:43.185808 #17213] DEBUG -- : waiting 151.5s after > > > suspend/hibernation > > > > They are strange. My current hunch is the killing and that message are > > symptoms of the same issue. Since it always follows a killing. > > I wonder if there's some background thread one of your gems spawns on > load that causes the master to stall. I'm not seeing how else unicorn > could think it was in suspend/hibernation. > > > > Can you tell (from Rails logs) if the to-be-killed workers are still > > > processing requests/responses the 300s before when the unicorn timeout > > > hits it? AFAIK, Rails logs the PID of each worker processing the > > > request. > > > > rails doesn't log the pid but it would seem that after upgrading to > > mysql 0.2.18 it is no longer killing workers that are busy with > > requests. > > Oops, I think I've been spoiled into thinking the Hodel3000CompliantLogger > is the default Rails logger :) > > > > If anything, I'd lower the unicorn timeout to something low (maybe > > > 5-10s) since that detects hard lockups at the VM level. Individual > > > requests in Rainbows! _are_ allowed to take longer than the unicorn > > > timeout. > > > > We lowered the unicorn timeout to 5 seconds and but that did not > > change the killings but they seem to be happening less often. I have > > some of our stderr logs after setting the timeout to 5 seconds at > > https://gist.github.com/3144250. > > Thanks for trying that! > > > > Newer versions of mysql2 should avoid potential issues with > > > ThreadTimeout/Timeout (or anything that hits Thread#kill). I think > > > mysql2 0.2.9 fixed a fairly important bug, and 0.2.18 fixed a very rare > > > (but possibly related to your issue) bug, > > > > Upgrading mysql2 seems to have stopped unicorn from killing workers > > that are currently busy. We were stress testing it last night and > > after we upgraded to 0.2.18 we had no more 502s from the app but this > > could be a coincidence since the killings are still happen. > > Alright, good to know 0.2.18 solved your problems. Btw, have you > noticed any general connectivity issues to your MySQL server? > There were quite a few bugfixes from 0.2.6..0.2.18, though. > > Anyways, I'm happy your problem seems to be fixed with the mysql2 > upgrade :) Unfortunately that didn't fix the problem. We had a large sale today and had 2 502s. We're going to try p194 on next week and I'll let you know if that fixes it. > > Our ops guys say we had this problem before we were using ThreadTimeout. > > OK. That's somewhat reassuring to know (especially since the culprit > seems to be an old mysql2 gem). I've had other users (privately) report > issues with recursive locking because of ensure clauses (e.g. > Mutex#synchronize) that I forgot to document. We're going to try going without ThreadTimeout again to make sure that's not the issue. _______________________________________________ Rainbows! mailing list - rainbows-talk-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org http://rubyforge.org/mailman/listinfo/rainbows-talk Do not quote signatures (like this one) or top post when replying