unicorn Ruby/Rack server user+dev discussion/patches/pulls/bugs/help
 help / color / mirror / code / Atom feed
* Sending ABRT to timeout-errant process before KILL
@ 2011-09-08  9:04 J. Austin Hughey
  2011-09-08 19:13 ` Eric Wong
  0 siblings, 1 reply; 3+ messages in thread
From: J. Austin Hughey @ 2011-09-08  9:04 UTC (permalink / raw)
  To: mongrel-unicorn

Hello there -

I've recently been working with a customer in my capacity as a support engineer at Engine Yard who's having some trouble with Unicorn.  Basically, they're finding their application being force-killed once it reaches the default timeout.  Rather than simply increasing the timeout, we're trying to find out _why_ their application is being killed.  Unfortunately we can't quite do that because the application is being force-killed without warning, making it so that the customer's logging can't actually be written to the disk. (This is an intermittent issue as opposed to something that happens 100% of the time.)

In discussing the matter internally and with our customer, I came up with a simple monkey patch to Unicorn that _sort of_ works, but I'm having some trouble with it once the number of Unicorn workers goes beyond one.  I originally limited it to just one worker because I wanted to limit the possibility that multiple workers could cause problems in figuring things out -- turns out I was right.

I'm going to show the patch in two ways: 1) inline, at the bottom of this post, and 2) by link to GitHub:
https://github.com/jaustinhughey/unicorn/blob/abort_prior_to_kill_workers/lib/unicorn/http_server.rb#L438

The general idea is that I'd like to have some way to "warn" the application when it's about to be killed.  I've patched murder_lazy_workers to send an abort signal via kill_worker, sleep for 5 seconds, then look and see if the process still exists by using Process.getpgid.  If so, it sends the original kill command, and if not, a rescue block kicks in to prevent the raised error from Process.getpgid from making things explode.

I've created a simulation app, built on Rails 3.0, that uses a generic "posts" controller to simulate a long-running request.  Instead of just throwing a straight-up sleep 65 in there, I have it running through sleeping 1 second on a decrementing counter, and doing that 65 times.  The reason is because, assuming I've read the code correctly, even with my "skip sleeping workers" commented line below, it'll skip over the process, thus rendering my simulation of a long-running process invalid.  However, clarification on this point is certainly welcome.  You can see the app here: https://github.com/jaustinhughey/unicorn_test/blob/master/app/controllers/posts_controller.rb

The problem I'm running into, and where I could use some help, is when I increase the number of Unicorn workers from one to two.  When running only one Unicorn worker, I can access my application's posts_controller's index action, which has the intentionally long-running code.  At that time I tail -f unicorn.log and production.log.  Those two logs look like this with one Unicorn worker:

WITH ONE UNICORN WORKER:
========================

production.log:
---------------
Sleeping 1 second (65 to go)…
 ... continued ...
Sleeping 1 second (7 to go)...
Sleeping 1 second (6 to go)...
Sleeping 1 second (5 to go)...
Caught Unicorn kill exception!
Sleeping 1 second (4 to go)...
Sleeping 1 second (3 to go)...
Sleeping 1 second (2 to go)...
Sleeping 1 second (1 to go)...
Completed 500 Internal Server Error in 65131ms

NoMethodError (undefined method `query_options' for nil:NilClass):
  app/controllers/posts_controller.rb:32:in `index'

(I think the NoMethodError issue above is due to me calling a disconnect on ActiveRecord in the Signal.trap block, so I think that can be safely ignored.)

As you can see, the Signal.trap block inside the aforementioned posts_controller is working in this case.  Corresponding log entries in unicorn.log concur:

unicorn.log:
------------
worker=0 ready
master process ready
[2011-09-08 00:31:01 PDT] worker=0 PID: 28921 timeout hit, sending ABRT to process 28921 then sleeping 5 seconds...
[2011-09-08 00:31:06 PDT] worker=0 PID:28921 timeout (61s > 60s), killing
reaped #<Process::Status: pid 28921 SIGKILL (signal 9)> worker=0
worker=0 ready

So with one worker, everything seems cool.  But with two workers?


WITH TWO UNICORN WORKERS:
=========================

production.log:
---------------
Sleeping 1 second (8 to go)...
Sleeping 1 second (7 to go)...
Sleeping 1 second (6 to go)...
Sleeping 1 second (5 to go)...
Sleeping 1 second (4 to go)...
Sleeping 1 second (3 to go)...
Sleeping 1 second (2 to go)...
Sleeping 1 second (1 to go)...
Rendered posts/index.html.erb within layouts/application (13.2ms)
Completed 200 OK in 65311ms (Views: 16.9ms | ActiveRecord: 0.5ms)

Note that there is no notice that the ABRT signal was trapped, nor is there a NoMethodError (likely caused by disconnecting from the database) as above.  Odd.

unicorn.log:
------------
Nothing.  No new data whatsoever.

The only potential clue I can see at this point would be a start-up message in unicorn.log.  After increasing the number of Unicorn workers to two, I examined unicorn.log again and found this:

master complete
I, [2011-09-08T00:34:40.499437 #29572]  INFO -- : unlinking existing socket=/var/run/engineyard/unicorn_ut.sock
I, [2011-09-08T00:34:40.499888 #29572]  INFO -- : listening on addr=/var/run/engineyard/unicorn_ut.sock fd=5
I, [2011-09-08T00:34:40.504542 #29572]  INFO -- : Refreshing Gem list
worker=0 ready
master process ready
[2011-09-08 00:34:49 PDT] worker=1 PID: 29582 timeout hit, sending ABRT to process 29582 then sleeping 5 seconds...
[2011-09-08 00:34:50 PDT] worker=1 PID:29582 timeout (1315467289s > 60s), killing
reaped #<Process::Status: pid 29582 SIGIOT (signal 6)> worker=1
worker=1 ready

So it looks like Worker 1 is hitting a strange/false timeout of 1315467289 seconds, which isn't really possible as it wasn't even running 1315467289 seconds prior to that (which equates to roughly 41 years ago if my math is right).

---

Needless to say, I'm a bit stumped at this point, and would sincerely appreciate another point of view on this.  Am I going about this all wrong?  Is there a better approach I should consider?  And if I'm on the right track, how can I get this to work regardless of how many Unicorn workers are running?

Thank you very much for any assistance you can provide!


-- INLINE VERSION OF PATCH --

diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb
index 78d80b4..8a2323f 100644
--- a/lib/unicorn/http_server.rb
+++ b/lib/unicorn/http_server.rb
@@ -429,6 +429,11 @@ class Unicorn::HttpServer
     proc_name 'master (old)'
   end
 
+  # A custom formatted timestamp for debugging
+  def custom_timestamp
+    return Time.now.strftime("[%Y-%m-%d %T %Z]")
+  end
+
   # forcibly terminate all workers that haven't checked in in timeout seconds.  The timeout is implemented using an unlinked File
   def murder_lazy_workers
     t = @timeout
@@ -436,16 +441,40 @@ class Unicorn::HttpServer
     now = Time.now.to_i
     WORKERS.dup.each_pair do |wpid, worker|
       tick = worker.tick
-      0 == tick and next # skip workers that are sleeping
+
+      # REMOVE THE FOLLOWING COMMENT WHEN TESTING PRODUCTION
+# 0 == tick and next # skip workers that are sleeping
+      # ^ needs to be active, commented here for simulation purposes
+
       diff = now - tick
       tmp = t - diff
       if tmp >= 0
         next_sleep < tmp and next_sleep = tmp
         next
       end
-      logger.error "worker=#{worker.nr} PID:#{wpid} timeout " \
-                   "(#{diff}s > #{t}s), killing"
-      kill_worker(:KILL, wpid) # take no prisoners for timeout violations
+
+
+      # Send an ABRT signal to Unicorn and wait 5 seconds before attempting an
+      # actual kill, if and only if the process is still running.
+
+      begin
+        # Send the ABRT signal.
+        logger.debug "#{custom_timestamp} worker=#{worker.nr} PID: #{wpid} timeout hit, sending ABRT to process #{wpid} then sleeping 5 seconds..."
+        kill_worker(:ABRT, wpid)
+
+        sleep 5
+
+        # Now see if the process still exists after being given five
+        # seconds to terminate on its own, and if so, do a hard kill.
+        if Process.getpgid(wpid)
+          logger.error "#{custom_timestamp} worker=#{worker.nr} PID:#{wpid} timeout " \
+                       "(#{diff}s > #{@timeout}s), killing"
+          kill_worker(:KILL, wpid) # take no prisoners for timeout violations
+        end
+      rescue Errno::ESRCH => e
+        # No process identified - maybe it exited on its own?
+        logger.debug "#{custom_timestamp} worker=#{worker.nr} PID: #{wpid} responded to ABRT on its own and no longer exists. (Received message: #{e})"
+      end
     end
     next_sleep
   end

-- END INLINE VERSION OF PATCH --

--
J. Austin Hughey
Application Support Engineer - Engine Yard
www.engineyard.com | jhughey@engineyard.com

_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: Sending ABRT to timeout-errant process before KILL
  2011-09-08  9:04 Sending ABRT to timeout-errant process before KILL J. Austin Hughey
@ 2011-09-08 19:13 ` Eric Wong
  2011-09-09 23:01   ` Eric Wong
  0 siblings, 1 reply; 3+ messages in thread
From: Eric Wong @ 2011-09-08 19:13 UTC (permalink / raw)
  To: unicorn list; +Cc: J. Austin Hughey

"J. Austin Hughey" <jhughey@engineyard.com> wrote:
> The general idea is that I'd like to have some way to "warn" the
> application when it's about to be killed.  I've patched
> murder_lazy_workers to send an abort signal via kill_worker, sleep for
> 5 seconds, then look and see if the process still exists by using
> Process.getpgid.  If so, it sends the original kill command, and if
> not, a rescue block kicks in to prevent the raised error from
> Process.getpgid from making things explode.

The problem with anything other than SIGKILL (or SIGSTOP) is that it
assumes the Ruby VM is working and in a good state.

> I've created a simulation app, built on Rails 3.0, that uses a generic
> "posts" controller to simulate a long-running request.  Instead of
> just throwing a straight-up sleep 65 in there, I have it running
> through sleeping 1 second on a decrementing counter, and doing that 65
> times.  The reason is because, assuming I've read the code correctly,
> even with my "skip sleeping workers" commented line below, it'll skip
> over the process, thus rendering my simulation of a long-running
> process invalid.  However, clarification on this point is certainly
> welcome.  You can see the app here:
> https://github.com/jaustinhughey/unicorn_test/blob/master/app/controllers/posts_controller.rb

(purely for educational purposes, since I'll point you towards another
approach I believe is better)

              Signal.trap(:ABRT) do
                # Write some stuff to the Rails log
                logger.info "Caught Unicorn kill exception!"

If this is the logger that ships with Ruby, it locks a Mutex, so it'll
deadlock if another SIGABRT is received while logging the above
statement (a very small window, admittedly).

                # Do a controlled disconnect from ActiveRecord
                ActiveRecord::Base.connection.disconnect!

Likewise, if AR needs to lock internal structures before disconnecting,
it also must be reentrant.  Ruby's normal Mutex implementation is not
reentrant-safe.


> So it looks like Worker 1 is hitting a strange/false timeout of
> 1315467289 seconds, which isn't really possible as it wasn't even
> running 1315467289 seconds prior to that (which equates to roughly 41
> years ago if my math is right).

You're getting this because you removed the following line:

      0 == tick and next # skip workers that are sleeping

sleeping means they haven't accepted a client connection, yet.  Not
sleeping while processing a client request.   I'll clarify that in the
code.

> Needless to say, I'm a bit stumped at this point, and would sincerely
> appreciate another point of view on this.  Am I going about this all
> wrong?  Is there a better approach I should consider?  And if I'm on
> the right track, how can I get this to work regardless of how many
> Unicorn workers are running?

Since it's an application error, it can be done as middleware.  You can
try something like the Rainbows::ThreadTimeout middleware, it's
currently Rainbows! specific but can easily be made to work with
Unicorn.

	git clone git://bogomips.org/rainbows
	cat rainbows/lib/rainbows/thread_timeout.rb

This is conceptually similar to "timeout" in the Ruby standard library,
but does not allow nesting.

I'll try to clarify more later today if you have questions, in a bit of
a rush right now.

-- 
Eric Wong
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Sending ABRT to timeout-errant process before KILL
  2011-09-08 19:13 ` Eric Wong
@ 2011-09-09 23:01   ` Eric Wong
  0 siblings, 0 replies; 3+ messages in thread
From: Eric Wong @ 2011-09-09 23:01 UTC (permalink / raw)
  To: unicorn list; +Cc: J. Austin Hughey

Eric Wong <normalperson@yhbt.net> wrote:
> sleeping means they haven't accepted a client connection, yet.  Not
> sleeping while processing a client request.   I'll clarify that in the
> code.

Pushed to git://bogomips.org/unicorn.git
(commit d209910e29d4983f8346233262a49541464252c1)

> 	git clone git://bogomips.org/rainbows
> 	cat rainbows/lib/rainbows/thread_timeout.rb
> 
> This is conceptually similar to "timeout" in the Ruby standard library,
> but does not allow nesting.
> 
> I'll try to clarify more later today if you have questions, in a bit of
> a rush right now.

Did you manage to get anything going based on this?  I should add that
this has the same chance of working as a SIGABRT handler written in
Ruby (but is less intrusive).

Hopefully the following diagram/chart from a previous post can explain
why you can't rely on trappable signal handlers if the Ruby VM is in a
bad state:

  http://mid.gmane.org/20110817201323.GA24581@dcvr.yhbt.net

-- 
Eric Wong
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2011-09-09 23:50 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-09-08  9:04 Sending ABRT to timeout-errant process before KILL J. Austin Hughey
2011-09-08 19:13 ` Eric Wong
2011-09-09 23:01   ` Eric Wong

Code repositories for project(s) associated with this public inbox

	https://yhbt.net/unicorn.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).