unicorn Ruby/Rack server user+dev discussion/patches/pulls/bugs/help
 help / color / mirror / code / Atom feed
* Master repeatedly killing workers due to timeouts
@ 2011-12-08 18:19 Jonathan del Strother
  2011-12-08 20:23 ` Eric Wong
  2011-12-11 14:19 ` Troex Nevelin
  0 siblings, 2 replies; 6+ messages in thread
From: Jonathan del Strother @ 2011-12-08 18:19 UTC (permalink / raw)
  To: mongrel-unicorn

Hi,
We're using unicorn as a Rails server on Solaris, and it's been
running great for several months.  We've recently been having a few
problems and I'm at a loss what might cause it.  A number of times in
the past few days, our unicorn slaves keep timing out & the master
keeps restarting them.  unicorn.log looks something like :


E, [2011-12-08T18:11:32.912237 #26661] ERROR -- : worker=5 PID:15367
timeout (61s > 60s), killing
E, [2011-12-08T18:11:32.952041 #26661] ERROR -- : reaped
#<Process::Status: pid 15367 SIGKILL (signal 9)> worker=5
I, [2011-12-08T18:11:32.985925 #17824]  INFO -- : worker=5 ready
E, [2011-12-08T18:11:42.962869 #26661] ERROR -- : worker=3 PID:15499
timeout (61s > 60s), killing
E, [2011-12-08T18:11:43.003741 #26661] ERROR -- : reaped
#<Process::Status: pid 15499 SIGKILL (signal 9)> worker=3
I, [2011-12-08T18:11:43.043797 #18263]  INFO -- : worker=3 ready
E, [2011-12-08T18:11:51.016129 #26661] ERROR -- : worker=0 PID:16072
timeout (61s > 60s), killing
E, [2011-12-08T18:11:51.053344 #26661] ERROR -- : reaped
#<Process::Status: pid 16072 SIGKILL (signal 9)> worker=0
I, [2011-12-08T18:11:51.084365 #18532]  INFO -- : worker=0 ready
E, [2011-12-08T18:11:53.063737 #26661] ERROR -- : worker=4 PID:16126
timeout (61s > 60s), killing
E, [2011-12-08T18:11:53.117573 #26661] ERROR -- : reaped
#<Process::Status: pid 16126 SIGKILL (signal 9)> worker=4
I, [2011-12-08T18:11:53.154511 #18577]  INFO -- : worker=4 ready
E, [2011-12-08T18:11:59.130075 #26661] ERROR -- : worker=6 PID:16165
timeout (61s > 60s), killing
E, [2011-12-08T18:11:59.176834 #26661] ERROR -- : reaped
#<Process::Status: pid 16165 SIGKILL (signal 9)> worker=6
I, [2011-12-08T18:11:59.220115 #18814]  INFO -- : worker=6 ready
E, [2011-12-08T18:12:12.188372 #26661] ERROR -- : worker=7 PID:16931
timeout (61s > 60s), killing
E, [2011-12-08T18:12:12.245647 #26661] ERROR -- : reaped
#<Process::Status: pid 16931 SIGKILL (signal 9)> worker=7
I, [2011-12-08T18:12:12.282607 #19718]  INFO -- : worker=7 ready
E, [2011-12-08T18:12:29.256887 #26661] ERROR -- : worker=1 PID:17406
timeout (61s > 60s), killing
E, [2011-12-08T18:12:29.302142 #26661] ERROR -- : reaped
#<Process::Status: pid 17406 SIGKILL (signal 9)> worker=1
I, [2011-12-08T18:12:29.335366 #20731]  INFO -- : worker=1 ready
E, [2011-12-08T18:12:31.313734 #26661] ERROR -- : worker=2 PID:17659
timeout (61s > 60s), killing
E, [2011-12-08T18:12:31.357066 #26661] ERROR -- : reaped
#<Process::Status: pid 17659 SIGKILL (signal 9)> worker=2
I, [2011-12-08T18:12:31.390368 #20800]  INFO -- : worker=2 ready



- which seems like quite a lot of timeouts.  However, our database
connection, network, NFS etc all seem fine.  There's no useful errors
in the rails log files.  The master still serves occasional requests,
but drops a high percentage of them.


Can anyone suggest where to look into this?  I'm not even sure if that
timeout is occurring during the initial fork of the worker process, or
if it's timing out during a slow rails request.

-Jonathan
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Master repeatedly killing workers due to timeouts
  2011-12-08 18:19 Master repeatedly killing workers due to timeouts Jonathan del Strother
@ 2011-12-08 20:23 ` Eric Wong
  2011-12-08 22:24   ` Jonathan del Strother
  2011-12-11 14:19 ` Troex Nevelin
  1 sibling, 1 reply; 6+ messages in thread
From: Eric Wong @ 2011-12-08 20:23 UTC (permalink / raw)
  To: unicorn list

Jonathan del Strother <maillist@steelskies.com> wrote:
> Hi,
> We're using unicorn as a Rails server on Solaris, and it's been
> running great for several months.  We've recently been having a few
> problems and I'm at a loss what might cause it.  A number of times in
> the past few days, our unicorn slaves keep timing out & the master
> keeps restarting them.  unicorn.log looks something like :

Which versions of Ruby and Unicorn are you running?

> E, [2011-12-08T18:11:32.912237 #26661] ERROR -- : worker=5 PID:15367
> timeout (61s > 60s), killing
> E, [2011-12-08T18:11:32.952041 #26661] ERROR -- : reaped
> #<Process::Status: pid 15367 SIGKILL (signal 9)> worker=5
> I, [2011-12-08T18:11:32.985925 #17824]  INFO -- : worker=5 ready

The time between the "killing" and "ready" lines is small enough
to suggest the app isn't getting killed while it's loading

> E, [2011-12-08T18:11:42.962869 #26661] ERROR -- : worker=3 PID:15499
> timeout (61s > 60s), killing

Most of these timeouts are several minutes apart.  How many requests
do you service per second/minute during these times?

<snip>

> - which seems like quite a lot of timeouts.  However, our database
> connection, network, NFS etc all seem fine.  There's no useful errors
> in the rails log files.  The master still serves occasional requests,
> but drops a high percentage of them.

Since the errors you showed us were several minutes apart,
"high percentage" suggests your app sits idle for longish periods
of time.  Is there any external dependency that doesn't like periods
of idleness?

60s is an eternity for a typical HTTP request for Rails.

> Can anyone suggest where to look into this?  I'm not even sure if that
> timeout is occurring during the initial fork of the worker process, or
> if it's timing out during a slow rails request.

It seems to be the latter based on the timestamps.

You can try auditing potential slow paths in your code, and maybe
add alerts/timeouts to something lower than 60s.  Also, take a
look at: http://unicorn.bogomips.org/Application_Timeouts.html
if you haven't already.

I've also seen inexperienced developers do things like:
"SELECT * FROM huge_table" to kill servers in production because
their test fixtures only had 30 rows in huge_table while the
production DB has millions of rows.

MySQL has the ability to log slow queries, other DBs should, too.  Can
you check to see if you don't have any of those taking a long time?


In a related note, I also released the terrible_timeout RubyGem
yesterday: http://bogomips.org/terrible_timeout/
I hate it, but at least it documents is failings :)
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Master repeatedly killing workers due to timeouts
  2011-12-08 20:23 ` Eric Wong
@ 2011-12-08 22:24   ` Jonathan del Strother
  0 siblings, 0 replies; 6+ messages in thread
From: Jonathan del Strother @ 2011-12-08 22:24 UTC (permalink / raw)
  To: unicorn list

On 8 Dec 2011, at 20:23, Eric Wong <normalperson@yhbt.net> wrote:

> Jonathan del Strother <maillist@steelskies.com> wrote:
>> Hi,
>> We're using unicorn as a Rails server on Solaris, and it's been
>> running great for several months.  We've recently been having a few
>> problems and I'm at a loss what might cause it.  A number of times in
>> the past few days, our unicorn slaves keep timing out & the master
>> keeps restarting them.  unicorn.log looks something like :
> 
> Which versions of Ruby and Unicorn are you running?
> <snip>

Hi Eric, thanks for the comprehensive reply. As it happens, shortly after posting this, the problem started again...in the past it's been fairly sporadic, whereas this time it was extremely persistent so we were forced into tracking it down then & there. I ended up replacing unicorn with a single webrick instance, which would serve a few requests then hang...after a long debugging session we finally managed to track down the problem to a hung sphinx instance, which we use for our search engine.  Ruby would try to talk to that and just sit there without ever timing out. I'm going to go through the sphinx code tomorrow and see if I can figure out why that might occur.

Thanks for the help,
Jonathan
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Master repeatedly killing workers due to timeouts
  2011-12-08 18:19 Master repeatedly killing workers due to timeouts Jonathan del Strother
  2011-12-08 20:23 ` Eric Wong
@ 2011-12-11 14:19 ` Troex Nevelin
  2011-12-11 20:13   ` Eric Wong
  1 sibling, 1 reply; 6+ messages in thread
From: Troex Nevelin @ 2011-12-11 14:19 UTC (permalink / raw)
  To: unicorn list

Hello,

I have a simillar problem that started after update, 3 days ago we
moved from REE 1.8 / Rails 2 to Ruby 1.9.3 and Rails 3.1, also we
replaced memcached with redis so this is huge update and there are
many places where we can have an issue.

I also have checked my MySQL slow log and there are no requests which
take more than 5 seconds to complete.

My question is it possible to log the last request (URI) made to the
worker before killing it on timeout?

It's really hard to track this problem, usually unicorn start killing
workers when we got 1500-2000 rpms on 8 workers.

-- 
E, [2011-12-11T18:11:42.335656 #31736] ERROR -- : worker=1 PID:1632 
timeout (61s > 60s), killing
E, [2011-12-11T18:11:42.364076 #31736] ERROR -- : reaped 
#<Process::Status: pid 1632 SIGKILL (signal 9)> worker=1
I, [2011-12-11T18:11:42.364364 #31736]  INFO -- : worker=1 spawning...
I, [2011-12-11T18:11:42.366721 #5869]  INFO -- : worker=1 spawned pid=5869
I, [2011-12-11T18:11:42.366942 #5869]  INFO -- : Refreshing Gem list
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Master repeatedly killing workers due to timeouts
  2011-12-11 14:19 ` Troex Nevelin
@ 2011-12-11 20:13   ` Eric Wong
  2011-12-15 20:29     ` Troex Nevelin
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Wong @ 2011-12-11 20:13 UTC (permalink / raw)
  To: unicorn list

Troex Nevelin <list@mrtech.ru> wrote:
> Hello,
> 
> I have a simillar problem that started after update, 3 days ago we
> moved from REE 1.8 / Rails 2 to Ruby 1.9.3 and Rails 3.1, also we
> replaced memcached with redis so this is huge update and there are
> many places where we can have an issue.
> 
> I also have checked my MySQL slow log and there are no requests which
> take more than 5 seconds to complete.
> 
> My question is it possible to log the last request (URI) made to the
> worker before killing it on timeout?

There's no way to trap/handle SIGKILL in the worker, and telling the
master every URI the worker makes is too expensive.

This is horrible middleware, but _may_ work if you just forgot a timeout
and Ruby itself isn't blocked (due to a bad C extension):

    class LogBeforeTimeout < Struct.new(:app)
      def call(env)
        thr = Thread.new do
          sleep(59) # set this to Unicorn timeout - 1
          unless Thread.current[:done]
            path = env["PATH_INFO"]
            qs = env["QUERY_STRING"] and path = "#{path}?#{qs}"
            env["rack.logger"].warn("#{path} about to die from SIGKILL")
          end
        end
        app.call(env)
      ensure
        thr[:done] = true
        thr.run
      end
    end

Or you can just log the nev before app.call:

  def call(env)
    path = env["PATH_INFO"]
    qs = env["QUERY_STRING"] and path = "#{path}?#{qs}"
    env["rack.logger"].debug "START #{path}"
    app.call(env)
  ensure
    env["rack.logger"].debug "FINISH #{path}"
  end

And look for START lines without corresponding FINISH lines.

All code in this post is totally untested, so may contain syntax errors
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Master repeatedly killing workers due to timeouts
  2011-12-11 20:13   ` Eric Wong
@ 2011-12-15 20:29     ` Troex Nevelin
  0 siblings, 0 replies; 6+ messages in thread
From: Troex Nevelin @ 2011-12-15 20:29 UTC (permalink / raw)
  To: mongrel-unicorn

Thanks,

Your horrible middleware did the job pretty well, we've tracked down the 
prbolem in a few hours. Now our unicorn looks stable and calm, not 
killing poor-hard-workers on a busy server.


On 11/12/11 22:13, Eric Wong wrote:
> This is horrible middleware, but _may_ work if you just forgot a
> timeout
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2011-12-15 20:50 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-08 18:19 Master repeatedly killing workers due to timeouts Jonathan del Strother
2011-12-08 20:23 ` Eric Wong
2011-12-08 22:24   ` Jonathan del Strother
2011-12-11 14:19 ` Troex Nevelin
2011-12-11 20:13   ` Eric Wong
2011-12-15 20:29     ` Troex Nevelin

Code repositories for project(s) associated with this public inbox

	https://yhbt.net/unicorn.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).