Sending ABRT to timeout-errant process before KILL

From: "J. Austin Hughey" <jhughey@engineyard.com>
To: mongrel-unicorn@rubyforge.org
Subject: Sending ABRT to timeout-errant process before KILL
Date: Thu, 8 Sep 2011 03:04:37 -0600	[thread overview]
Message-ID: <6310128B-282E-40C0-801B-15DBBE9E6AC3@engineyard.com> (raw)

Hello there -

I've recently been working with a customer in my capacity as a support engineer at Engine Yard who's having some trouble with Unicorn.  Basically, they're finding their application being force-killed once it reaches the default timeout.  Rather than simply increasing the timeout, we're trying to find out _why_ their application is being killed.  Unfortunately we can't quite do that because the application is being force-killed without warning, making it so that the customer's logging can't actually be written to the disk. (This is an intermittent issue as opposed to something that happens 100% of the time.)

In discussing the matter internally and with our customer, I came up with a simple monkey patch to Unicorn that _sort of_ works, but I'm having some trouble with it once the number of Unicorn workers goes beyond one.  I originally limited it to just one worker because I wanted to limit the possibility that multiple workers could cause problems in figuring things out -- turns out I was right.

I'm going to show the patch in two ways: 1) inline, at the bottom of this post, and 2) by link to GitHub:
https://github.com/jaustinhughey/unicorn/blob/abort_prior_to_kill_workers/lib/unicorn/http_server.rb#L438

The general idea is that I'd like to have some way to "warn" the application when it's about to be killed.  I've patched murder_lazy_workers to send an abort signal via kill_worker, sleep for 5 seconds, then look and see if the process still exists by using Process.getpgid.  If so, it sends the original kill command, and if not, a rescue block kicks in to prevent the raised error from Process.getpgid from making things explode.

I've created a simulation app, built on Rails 3.0, that uses a generic "posts" controller to simulate a long-running request.  Instead of just throwing a straight-up sleep 65 in there, I have it running through sleeping 1 second on a decrementing counter, and doing that 65 times.  The reason is because, assuming I've read the code correctly, even with my "skip sleeping workers" commented line below, it'll skip over the process, thus rendering my simulation of a long-running process invalid.  However, clarification on this point is certainly welcome.  You can see the app here: https://github.com/jaustinhughey/unicorn_test/blob/master/app/controllers/posts_controller.rb

The problem I'm running into, and where I could use some help, is when I increase the number of Unicorn workers from one to two.  When running only one Unicorn worker, I can access my application's posts_controller's index action, which has the intentionally long-running code.  At that time I tail -f unicorn.log and production.log.  Those two logs look like this with one Unicorn worker:

WITH ONE UNICORN WORKER:
========================

production.log:
---------------
Sleeping 1 second (65 to go)…
 ... continued ...
Sleeping 1 second (7 to go)...
Sleeping 1 second (6 to go)...
Sleeping 1 second (5 to go)...
Caught Unicorn kill exception!
Sleeping 1 second (4 to go)...
Sleeping 1 second (3 to go)...
Sleeping 1 second (2 to go)...
Sleeping 1 second (1 to go)...
Completed 500 Internal Server Error in 65131ms

NoMethodError (undefined method `query_options' for nil:NilClass):
  app/controllers/posts_controller.rb:32:in `index'

(I think the NoMethodError issue above is due to me calling a disconnect on ActiveRecord in the Signal.trap block, so I think that can be safely ignored.)

As you can see, the Signal.trap block inside the aforementioned posts_controller is working in this case.  Corresponding log entries in unicorn.log concur:

unicorn.log:
------------
worker=0 ready
master process ready
[2011-09-08 00:31:01 PDT] worker=0 PID: 28921 timeout hit, sending ABRT to process 28921 then sleeping 5 seconds...
[2011-09-08 00:31:06 PDT] worker=0 PID:28921 timeout (61s > 60s), killing
reaped #<Process::Status: pid 28921 SIGKILL (signal 9)> worker=0
worker=0 ready

So with one worker, everything seems cool.  But with two workers?


WITH TWO UNICORN WORKERS:
=========================

production.log:
---------------
Sleeping 1 second (8 to go)...
Sleeping 1 second (7 to go)...
Sleeping 1 second (6 to go)...
Sleeping 1 second (5 to go)...
Sleeping 1 second (4 to go)...
Sleeping 1 second (3 to go)...
Sleeping 1 second (2 to go)...
Sleeping 1 second (1 to go)...
Rendered posts/index.html.erb within layouts/application (13.2ms)
Completed 200 OK in 65311ms (Views: 16.9ms | ActiveRecord: 0.5ms)

Note that there is no notice that the ABRT signal was trapped, nor is there a NoMethodError (likely caused by disconnecting from the database) as above.  Odd.

unicorn.log:
------------
Nothing.  No new data whatsoever.

The only potential clue I can see at this point would be a start-up message in unicorn.log.  After increasing the number of Unicorn workers to two, I examined unicorn.log again and found this:

master complete
I, [2011-09-08T00:34:40.499437 #29572]  INFO -- : unlinking existing socket=/var/run/engineyard/unicorn_ut.sock
I, [2011-09-08T00:34:40.499888 #29572]  INFO -- : listening on addr=/var/run/engineyard/unicorn_ut.sock fd=5
I, [2011-09-08T00:34:40.504542 #29572]  INFO -- : Refreshing Gem list
worker=0 ready
master process ready
[2011-09-08 00:34:49 PDT] worker=1 PID: 29582 timeout hit, sending ABRT to process 29582 then sleeping 5 seconds...
[2011-09-08 00:34:50 PDT] worker=1 PID:29582 timeout (1315467289s > 60s), killing
reaped #<Process::Status: pid 29582 SIGIOT (signal 6)> worker=1
worker=1 ready

So it looks like Worker 1 is hitting a strange/false timeout of 1315467289 seconds, which isn't really possible as it wasn't even running 1315467289 seconds prior to that (which equates to roughly 41 years ago if my math is right).

---

Needless to say, I'm a bit stumped at this point, and would sincerely appreciate another point of view on this.  Am I going about this all wrong?  Is there a better approach I should consider?  And if I'm on the right track, how can I get this to work regardless of how many Unicorn workers are running?

Thank you very much for any assistance you can provide!


-- INLINE VERSION OF PATCH --

diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb
index 78d80b4..8a2323f 100644
--- a/lib/unicorn/http_server.rb
+++ b/lib/unicorn/http_server.rb
@@ -429,6 +429,11 @@ class Unicorn::HttpServer
     proc_name 'master (old)'
   end
 
+  # A custom formatted timestamp for debugging
+  def custom_timestamp
+    return Time.now.strftime("[%Y-%m-%d %T %Z]")
+  end
+
   # forcibly terminate all workers that haven't checked in in timeout seconds.  The timeout is implemented using an unlinked File
   def murder_lazy_workers
     t = @timeout
@@ -436,16 +441,40 @@ class Unicorn::HttpServer
     now = Time.now.to_i
     WORKERS.dup.each_pair do |wpid, worker|
       tick = worker.tick
-      0 == tick and next # skip workers that are sleeping
+
+      # REMOVE THE FOLLOWING COMMENT WHEN TESTING PRODUCTION
+# 0 == tick and next # skip workers that are sleeping
+      # ^ needs to be active, commented here for simulation purposes
+
       diff = now - tick
       tmp = t - diff
       if tmp >= 0
         next_sleep < tmp and next_sleep = tmp
         next
       end
-      logger.error "worker=#{worker.nr} PID:#{wpid} timeout " \
-                   "(#{diff}s > #{t}s), killing"
-      kill_worker(:KILL, wpid) # take no prisoners for timeout violations
+
+
+      # Send an ABRT signal to Unicorn and wait 5 seconds before attempting an
+      # actual kill, if and only if the process is still running.
+
+      begin
+        # Send the ABRT signal.
+        logger.debug "#{custom_timestamp} worker=#{worker.nr} PID: #{wpid} timeout hit, sending ABRT to process #{wpid} then sleeping 5 seconds..."
+        kill_worker(:ABRT, wpid)
+
+        sleep 5
+
+        # Now see if the process still exists after being given five
+        # seconds to terminate on its own, and if so, do a hard kill.
+        if Process.getpgid(wpid)
+          logger.error "#{custom_timestamp} worker=#{worker.nr} PID:#{wpid} timeout " \
+                       "(#{diff}s > #{@timeout}s), killing"
+          kill_worker(:KILL, wpid) # take no prisoners for timeout violations
+        end
+      rescue Errno::ESRCH => e
+        # No process identified - maybe it exited on its own?
+        logger.debug "#{custom_timestamp} worker=#{worker.nr} PID: #{wpid} responded to ABRT on its own and no longer exists. (Received message: #{e})"
+      end
     end
     next_sleep
   end

-- END INLINE VERSION OF PATCH --

--
J. Austin Hughey
Application Support Engineer - Engine Yard
www.engineyard.com | jhughey@engineyard.com

_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying