From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on dcvr.yhbt.net X-Spam-Level: * X-Spam-ASN: AS33070 50.56.128.0/17 X-Spam-Status: No, score=1.0 required=3.0 tests=AWL,HK_RANDOM_FROM, MSGID_FROM_MTA_HEADER,TVD_RCVD_IP shortcircuit=no autolearn=no version=3.3.2 Path: news.gmane.org!not-for-mail From: Eric Wong Newsgroups: gmane.comp.lang.ruby.rainbows.general Subject: Re: Unicorn is killing our rainbows workers Date: Thu, 19 Jul 2012 20:16:33 +0000 Message-ID: <20120719201633.GA8203@dcvr.yhbt.net> References: <20120718215222.GA11539@dcvr.yhbt.net> <20120719002641.GA17210@dcvr.yhbt.net> NNTP-Posting-Host: plane.gmane.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Trace: dough.gmane.org 1342729154 24921 80.91.229.3 (19 Jul 2012 20:19:14 GMT) X-Complaints-To: usenet@dough.gmane.org NNTP-Posting-Date: Thu, 19 Jul 2012 20:19:14 +0000 (UTC) Cc: Cody Fauser , ops , Harry Brundage , Jonathan Rudenberg To: Rainbows! list Original-X-From: rainbows-talk-bounces-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org Thu Jul 19 22:19:14 2012 Return-path: Envelope-to: gclrrg-rainbows-talk@m.gmane.org X-Original-To: rainbows-talk-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org Delivered-To: rainbows-talk-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-BeenThere: rainbows-talk-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org X-Mailman-Version: 2.1.12 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: rainbows-talk-bounces-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org Errors-To: rainbows-talk-bounces-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org Xref: news.gmane.org gmane.comp.lang.ruby.rainbows.general:377 Archived-At: Received: from 50-56-192-79.static.cloud-ips.com ([50.56.192.79] helo=rubyforge.org) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1SrxBk-0007MH-Mt for gclrrg-rainbows-talk@m.gmane.org; Thu, 19 Jul 2012 22:19:09 +0200 Received: from localhost.localdomain (localhost [127.0.0.1]) by rubyforge.org (Postfix) with ESMTP id 660072E06D; Thu, 19 Jul 2012 20:19:07 +0000 (UTC) Received: from dcvr.yhbt.net (dcvr.yhbt.net [64.71.152.64]) by rubyforge.org (Postfix) with ESMTP id 80A3E2E067 for ; Thu, 19 Jul 2012 20:16:34 +0000 (UTC) Received: from localhost (dcvr.yhbt.net [127.0.0.1]) by dcvr.yhbt.net (Postfix) with ESMTP id 1FB471F449; Thu, 19 Jul 2012 20:16:34 +0000 (UTC) Samuel Kadolph wrote: > On Wed, Jul 18, 2012 at 8:26 PM, Eric Wong wrote: > > Samuel Kadolph wrote: > >> On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong wrote: > >> > Samuel Kadolph wrote: > >> >> https://gist.github.com/9ec96922e55a59753997. Any insight into why > >> >> unicorn is killing our ThreadPool workers would help us greatly. If > >> >> you require additional info I would be happy to provide it. > > > > Also, are you using "preload_app true" ? > > Yes we are using preload_app true. > > > I'm a bit curious how these messages are happening, too: > > D, [2012-07-18T15:12:43.185808 #17213] DEBUG -- : waiting 151.5s after > > suspend/hibernation > > They are strange. My current hunch is the killing and that message are > symptoms of the same issue. Since it always follows a killing. I wonder if there's some background thread one of your gems spawns on load that causes the master to stall. I'm not seeing how else unicorn could think it was in suspend/hibernation. > > Can you tell (from Rails logs) if the to-be-killed workers are still > > processing requests/responses the 300s before when the unicorn timeout > > hits it? AFAIK, Rails logs the PID of each worker processing the > > request. > > rails doesn't log the pid but it would seem that after upgrading to > mysql 0.2.18 it is no longer killing workers that are busy with > requests. Oops, I think I've been spoiled into thinking the Hodel3000CompliantLogger is the default Rails logger :) > > If anything, I'd lower the unicorn timeout to something low (maybe > > 5-10s) since that detects hard lockups at the VM level. Individual > > requests in Rainbows! _are_ allowed to take longer than the unicorn > > timeout. > > We lowered the unicorn timeout to 5 seconds and but that did not > change the killings but they seem to be happening less often. I have > some of our stderr logs after setting the timeout to 5 seconds at > https://gist.github.com/3144250. Thanks for trying that! > > Newer versions of mysql2 should avoid potential issues with > > ThreadTimeout/Timeout (or anything that hits Thread#kill). I think > > mysql2 0.2.9 fixed a fairly important bug, and 0.2.18 fixed a very rare > > (but possibly related to your issue) bug, > > Upgrading mysql2 seems to have stopped unicorn from killing workers > that are currently busy. We were stress testing it last night and > after we upgraded to 0.2.18 we had no more 502s from the app but this > could be a coincidence since the killings are still happen. Alright, good to know 0.2.18 solved your problems. Btw, have you noticed any general connectivity issues to your MySQL server? There were quite a few bugfixes from 0.2.6..0.2.18, though. Anyways, I'm happy your problem seems to be fixed with the mysql2 upgrade :) > Our ops guys say we had this problem before we were using ThreadTimeout. OK. That's somewhat reassuring to know (especially since the culprit seems to be an old mysql2 gem). I've had other users (privately) report issues with recursive locking because of ensure clauses (e.g. Mutex#synchronize) that I forgot to document. _______________________________________________ Rainbows! mailing list - rainbows-talk-GrnCvJ7WPxnNLxjTenLetw@public.gmane.org http://rubyforge.org/mailman/listinfo/rainbows-talk Do not quote signatures (like this one) or top post when replying