From: Eric Wong
Newsgroups: gmane.comp.lang.ruby.rainbows.general
Subject: Re: Unicorn is killing our rainbows workers
Date: Thu, 19 Jul 2012 00:26:41 +0000
Message-ID: <20120719002641.GA17210@dcvr.yhbt.net>
References: <20120718215222.GA11539@dcvr.yhbt.net>
To: Rainbows! list
Cc: Cody Fauser, ops, Harry Brundage, Jonathan Rudenberg

Samuel Kadolph wrote:
> On Wed, Jul 18, 2012 at 5:52 PM, Eric Wong wrote:
> > Samuel Kadolph wrote:
> >> Hey rainbows-talk,
> >>
> >> We have 40 servers that each run rainbows with 2 workers with 100
> >> threads using ThreadPool. We're having an issue where unicorn is
> >> killing the worker process. We use ThreadTimeout (set to 70 seconds)
> >> and originally had the unicorn timeout set to 150 seconds and we're
> >> seeing unicorn eventually killing each worker. So we bumped the
> >> timeout to 300 seconds and it took about 5 minutes but we started
> >> seeing unicorn starting to kill workers again. You can see our stderr
> >> log file (timeout at 300s) at
> >> https://gist.github.com/9ec96922e55a59753997. Any insight into why
> >> unicorn is killing our ThreadPool workers would help us greatly. If
> >> you require additional info I would be happy to provide it.

Also, are you using "preload_app true" ?
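
For reference, the kind of setup I'm picturing from your description is
roughly the following.  The worker/thread/timeout numbers are taken from
your mail; everything else here is a guess, not your actual config:

    # unicorn/rainbows config file (sketch only)
    Rainbows! do
      use :ThreadPool
      worker_connections 100  # 100 threads per worker, as you described
    end

    worker_processes 2
    timeout 300          # unicorn-level timeout (master SIGKILLs the worker)
    preload_app true     # <-- this is the setting I'm asking about

    # I'm assuming ThreadTimeout is the Rainbows::ThreadTimeout Rack
    # middleware, enabled in your config.ru along the lines of:
    #   use Rainbows::ThreadTimeout, :timeout => 70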
I'm a bit curious how these messages are happening, too:

  D, [2012-07-18T15:12:43.185808 #17213] DEBUG -- : waiting 151.5s after suspend/hibernation

Can you tell (from the Rails logs) whether the to-be-killed workers are
still processing requests/responses in the 300s before the unicorn
timeout hits?  AFAIK, Rails logs the PID of the worker processing each
request.

Also, what in your app takes 150s, or even 70s?  I'm curious why the
timeouts are so high.  I wonder if there are bugs in unicorn/rainbows
with huge timeout values, too...

If anything, I'd lower the unicorn timeout to something low (maybe
5-10s) since that only needs to detect hard lockups at the VM level.
Individual requests in Rainbows! _are_ allowed to take longer than the
unicorn timeout.

Can you reproduce this in a simulation environment, or only with real
traffic?

If possible, can you set up an instance with a single worker process
and get an strace ("strace -f") of all the threads when this happens?

> We're running ruby 1.9.3-p125 with the performance patches at
> https://gist.github.com/1688857.

Can you reproduce this with an unpatched 1.9.3-p194?  I'm not too
familiar with the performance patches, but I'd like to reduce the
amount of less-common/tested code to isolate the issue.

> I listed the gems we use and which
> ones that have c extension at https://gist.github.com/3139226.

Fortunately, I'm familiar with nearly all of these C gems.

Newer versions of mysql2 should avoid potential issues with
ThreadTimeout/Timeout (or anything that hits Thread#kill).  I think
mysql2 0.2.9 fixed a fairly important bug, and 0.2.18 fixed a very
rare (but possibly related to your issue) bug.

Unrelated to your current issue, I strongly suggest Ruby 1.9.3-p194;
previous versions had a nasty GC memory corruption bug triggered by
Nokogiri (ref: https://github.com/tenderlove/nokogiri/issues/616).

I also have no idea why mongrel is in there :x

> We'll try running without the ThreadTimeout. We don't think we're
> having deadlock issues because our stress tests do not timeout but
> they do 502 when the rainbows worker gets killed during a request.

OK.  I'm starting to believe ThreadTimeout isn't good for the majority
of applications out there, and perhaps the only way to do this
correctly is with support tightly coupled to the VM.  Even then,
"ensure" clauses would still be tricky/ugly to deal with...  So maybe
forcing developers to use app/library-level timeouts for everything
they do is the only way.
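
To make that concrete, something like the following is what I mean by
app/library-level timeouts.  This is only a sketch: the host/credential
values are made up, and the exact option names depend on the mysql2
version you end up running, so double-check them against your gems:

    # Let the drivers time out on their own sockets instead of relying
    # on Thread#kill from ThreadTimeout.
    require 'mysql2'
    require 'net/http'

    client = Mysql2::Client.new(
      :host => "db.example.com",    # hypothetical host
      :username => "app",
      :connect_timeout => 5,
      :read_timeout => 10,          # a slow query fails, not the worker
      :write_timeout => 10
    )

    http = Net::HTTP.new("api.example.com", 80)  # hypothetical service
    http.open_timeout = 5
    http.read_timeout = 10

That way a slow query or HTTP call raises in the thread that made it,
with its own "ensure" clauses running normally.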