From mboxrd@z Thu Jan 1 00:00:00 1970 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS14383 205.234.109.0/24 X-Spam-Status: No, score=-0.7 required=5.0 tests=MSGID_FROM_MTA_HEADER, RP_MATCHES_RCVD shortcircuit=no autolearn=unavailable version=3.3.2 Path: news.gmane.org!not-for-mail From: "J. Austin Hughey" Newsgroups: gmane.comp.lang.ruby.unicorn.general Subject: Sending ABRT to timeout-errant process before KILL Date: Thu, 8 Sep 2011 03:04:37 -0600 Message-ID: <6310128B-282E-40C0-801B-15DBBE9E6AC3@engineyard.com> NNTP-Posting-Host: lo.gmane.org Mime-Version: 1.0 (Apple Message framework v1244.3) Content-Type: text/plain; charset="windows-1252" Content-Transfer-Encoding: quoted-printable X-Trace: dough.gmane.org 1315474079 9136 80.91.229.12 (8 Sep 2011 09:27:59 GMT) X-Complaints-To: usenet@dough.gmane.org NNTP-Posting-Date: Thu, 8 Sep 2011 09:27:59 +0000 (UTC) To: mongrel-unicorn@rubyforge.org Original-X-From: mongrel-unicorn-bounces@rubyforge.org Thu Sep 08 11:27:54 2011 Return-path: Envelope-to: gclrug-mongrel-unicorn@m.gmane.org X-Original-To: mongrel-unicorn@rubyforge.org Delivered-To: mongrel-unicorn@rubyforge.org X-Mailer: Apple Mail (2.1244.3) X-BeenThere: mongrel-unicorn@rubyforge.org X-Mailman-Version: 2.1.12 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Original-Sender: mongrel-unicorn-bounces@rubyforge.org Errors-To: mongrel-unicorn-bounces@rubyforge.org Xref: news.gmane.org gmane.comp.lang.ruby.unicorn.general:1160 Archived-At: Received: from rubyforge.org ([205.234.109.19]) by lo.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1R1atl-0006BR-5a for gclrug-mongrel-unicorn@m.gmane.org; Thu, 08 Sep 2011 11:27:53 +0200 Received: from rubyforge.org (rubyforge.org [127.0.0.1]) by rubyforge.org (Postfix) with ESMTP id AC6B51678363; Thu, 8 Sep 2011 05:27:46 -0400 (EDT) Received: from mail-yw0-f50.google.com (mail-yw0-f50.google.com [209.85.213.50]) by rubyforge.org (Postfix) with ESMTP id 87986185836C for ; Thu, 8 Sep 2011 05:04:39 -0400 (EDT) Received: by ywa8 with SMTP id 8so552625ywa.23 for ; Thu, 08 Sep 2011 02:04:39 -0700 (PDT) Received: by 10.236.173.67 with SMTP id u43mr2426141yhl.119.1315472679601; Thu, 08 Sep 2011 02:04:39 -0700 (PDT) Received: from [192.168.1.145] (adsl-75-54-102-119.dsl.elpstx.sbcglobal.net [75.54.102.119]) by mx.google.com with ESMTPS id s77sm2883907yhe.22.2011.09.08.02.04.38 (version=TLSv1/SSLv3 cipher=OTHER); Thu, 08 Sep 2011 02:04:38 -0700 (PDT) Hello there - I've recently been working with a customer in my capacity as a support engi= neer at Engine Yard who's having some trouble with Unicorn. Basically, the= y're finding their application being force-killed once it reaches the defau= lt timeout. Rather than simply increasing the timeout, we're trying to fin= d out _why_ their application is being killed. Unfortunately we can't quit= e do that because the application is being force-killed without warning, ma= king it so that the customer's logging can't actually be written to the dis= k. (This is an intermittent issue as opposed to something that happens 100%= of the time.) In discussing the matter internally and with our customer, I came up with a= simple monkey patch to Unicorn that _sort of_ works, but I'm having some t= rouble with it once the number of Unicorn workers goes beyond one. I origi= nally limited it to just one worker because I wanted to limit the possibili= ty that multiple workers could cause problems in figuring things out -- tur= ns out I was right. I'm going to show the patch in two ways: 1) inline, at the bottom of this p= ost, and 2) by link to GitHub: https://github.com/jaustinhughey/unicorn/blob/abort_prior_to_kill_workers/l= ib/unicorn/http_server.rb#L438 The general idea is that I'd like to have some way to "warn" the applicatio= n when it's about to be killed. I've patched murder_lazy_workers to send a= n abort signal via kill_worker, sleep for 5 seconds, then look and see if t= he process still exists by using Process.getpgid. If so, it sends the orig= inal kill command, and if not, a rescue block kicks in to prevent the raise= d error from Process.getpgid from making things explode. I've created a simulation app, built on Rails 3.0, that uses a generic "pos= ts" controller to simulate a long-running request. Instead of just throwin= g a straight-up sleep 65 in there, I have it running through sleeping 1 sec= ond on a decrementing counter, and doing that 65 times. The reason is beca= use, assuming I've read the code correctly, even with my "skip sleeping wor= kers" commented line below, it'll skip over the process, thus rendering my = simulation of a long-running process invalid. However, clarification on th= is point is certainly welcome. You can see the app here: https://github.co= m/jaustinhughey/unicorn_test/blob/master/app/controllers/posts_controller.rb The problem I'm running into, and where I could use some help, is when I in= crease the number of Unicorn workers from one to two. When running only on= e Unicorn worker, I can access my application's posts_controller's index ac= tion, which has the intentionally long-running code. At that time I tail -= f unicorn.log and production.log. Those two logs look like this with one U= nicorn worker: WITH ONE UNICORN WORKER: =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D production.log: --------------- Sleeping 1 second (65 to go)=85 ... continued ... Sleeping 1 second (7 to go)... Sleeping 1 second (6 to go)... Sleeping 1 second (5 to go)... Caught Unicorn kill exception! Sleeping 1 second (4 to go)... Sleeping 1 second (3 to go)... Sleeping 1 second (2 to go)... Sleeping 1 second (1 to go)... Completed 500 Internal Server Error in 65131ms NoMethodError (undefined method `query_options' for nil:NilClass): app/controllers/posts_controller.rb:32:in `index' (I think the NoMethodError issue above is due to me calling a disconnect on= ActiveRecord in the Signal.trap block, so I think that can be safely ignor= ed.) As you can see, the Signal.trap block inside the aforementioned posts_contr= oller is working in this case. Corresponding log entries in unicorn.log co= ncur: unicorn.log: ------------ worker=3D0 ready master process ready [2011-09-08 00:31:01 PDT] worker=3D0 PID: 28921 timeout hit, sending ABRT t= o process 28921 then sleeping 5 seconds... [2011-09-08 00:31:06 PDT] worker=3D0 PID:28921 timeout (61s > 60s), killing reaped # worker=3D0 worker=3D0 ready So with one worker, everything seems cool. But with two workers? WITH TWO UNICORN WORKERS: =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D production.log: --------------- Sleeping 1 second (8 to go)... Sleeping 1 second (7 to go)... Sleeping 1 second (6 to go)... Sleeping 1 second (5 to go)... Sleeping 1 second (4 to go)... Sleeping 1 second (3 to go)... Sleeping 1 second (2 to go)... Sleeping 1 second (1 to go)... Rendered posts/index.html.erb within layouts/application (13.2ms) Completed 200 OK in 65311ms (Views: 16.9ms | ActiveRecord: 0.5ms) Note that there is no notice that the ABRT signal was trapped, nor is there= a NoMethodError (likely caused by disconnecting from the database) as abov= e. Odd. unicorn.log: ------------ Nothing. No new data whatsoever. The only potential clue I can see at this point would be a start-up message= in unicorn.log. After increasing the number of Unicorn workers to two, I = examined unicorn.log again and found this: master complete I, [2011-09-08T00:34:40.499437 #29572] INFO -- : unlinking existing socket= =3D/var/run/engineyard/unicorn_ut.sock I, [2011-09-08T00:34:40.499888 #29572] INFO -- : listening on addr=3D/var/= run/engineyard/unicorn_ut.sock fd=3D5 I, [2011-09-08T00:34:40.504542 #29572] INFO -- : Refreshing Gem list worker=3D0 ready master process ready [2011-09-08 00:34:49 PDT] worker=3D1 PID: 29582 timeout hit, sending ABRT t= o process 29582 then sleeping 5 seconds... [2011-09-08 00:34:50 PDT] worker=3D1 PID:29582 timeout (1315467289s > 60s),= killing reaped # worker=3D1 worker=3D1 ready So it looks like Worker 1 is hitting a strange/false timeout of 1315467289 = seconds, which isn't really possible as it wasn't even running 1315467289 s= econds prior to that (which equates to roughly 41 years ago if my math is r= ight). --- Needless to say, I'm a bit stumped at this point, and would sincerely appre= ciate another point of view on this. Am I going about this all wrong? Is = there a better approach I should consider? And if I'm on the right track, = how can I get this to work regardless of how many Unicorn workers are runni= ng? Thank you very much for any assistance you can provide! -- INLINE VERSION OF PATCH -- diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb index 78d80b4..8a2323f 100644 --- a/lib/unicorn/http_server.rb +++ b/lib/unicorn/http_server.rb @@ -429,6 +429,11 @@ class Unicorn::HttpServer proc_name 'master (old)' end = + # A custom formatted timestamp for debugging + def custom_timestamp + return Time.now.strftime("[%Y-%m-%d %T %Z]") + end + # forcibly terminate all workers that haven't checked in in timeout seco= nds. The timeout is implemented using an unlinked File def murder_lazy_workers t =3D @timeout @@ -436,16 +441,40 @@ class Unicorn::HttpServer now =3D Time.now.to_i WORKERS.dup.each_pair do |wpid, worker| tick =3D worker.tick - 0 =3D=3D tick and next # skip workers that are sleeping + + # REMOVE THE FOLLOWING COMMENT WHEN TESTING PRODUCTION +# 0 =3D=3D tick and next # skip workers that are sleeping + # ^ needs to be active, commented here for simulation purposes + diff =3D now - tick tmp =3D t - diff if tmp >=3D 0 next_sleep < tmp and next_sleep =3D tmp next end - logger.error "worker=3D#{worker.nr} PID:#{wpid} timeout " \ - "(#{diff}s > #{t}s), killing" - kill_worker(:KILL, wpid) # take no prisoners for timeout violations + + + # Send an ABRT signal to Unicorn and wait 5 seconds before attemptin= g an + # actual kill, if and only if the process is still running. + + begin + # Send the ABRT signal. + logger.debug "#{custom_timestamp} worker=3D#{worker.nr} PID: #{wpi= d} timeout hit, sending ABRT to process #{wpid} then sleeping 5 seconds..." + kill_worker(:ABRT, wpid) + + sleep 5 + + # Now see if the process still exists after being given five + # seconds to terminate on its own, and if so, do a hard kill. + if Process.getpgid(wpid) + logger.error "#{custom_timestamp} worker=3D#{worker.nr} PID:#{wp= id} timeout " \ + "(#{diff}s > #{@timeout}s), killing" + kill_worker(:KILL, wpid) # take no prisoners for timeout violati= ons + end + rescue Errno::ESRCH =3D> e + # No process identified - maybe it exited on its own? + logger.debug "#{custom_timestamp} worker=3D#{worker.nr} PID: #{wpi= d} responded to ABRT on its own and no longer exists. (Received message: #{= e})" + end end next_sleep end -- END INLINE VERSION OF PATCH -- -- J. Austin Hughey Application Support Engineer - Engine Yard www.engineyard.com | jhughey@engineyard.com _______________________________________________ Unicorn mailing list - mongrel-unicorn@rubyforge.org http://rubyforge.org/mailman/listinfo/mongrel-unicorn Do not quote signatures (like this one) or top post when replying