unicorn Ruby/Rack server user+dev discussion/patches/pulls/bugs/help
 help / color / mirror / code / Atom feed
* Strange quit behavior
@ 2011-08-02 20:09 James Cox
  2011-08-02 20:34 ` Alex Sharp
                   ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: James Cox @ 2011-08-02 20:09 UTC (permalink / raw)
  To: unicorn list

Hey,

So here are some tasks for managing unicorn:

https://gist.github.com/1121076

I've found that it's very unreliable for quitting / terminating
unicorn and restarting with new code. When doing rapid deployments
particularly, i've found that i have to go in and kill -9 the master
process, and start again.

any thoughts on why it seems ineffective?

Thanks.

PS: here's the unicorn config:

https://gist.github.com/0dd07c5ad00c56d161c7

James
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-02 20:09 Strange quit behavior James Cox
@ 2011-08-02 20:34 ` Alex Sharp
  2011-08-02 21:54   ` Eric Wong
  2011-08-02 20:45 ` cliftonk
  2011-08-02 21:53 ` Eric Wong
  2 siblings, 1 reply; 30+ messages in thread
From: Alex Sharp @ 2011-08-02 20:34 UTC (permalink / raw)
  To: unicorn list

> I've found that it's very unreliable for quitting / terminating
> unicorn and restarting with new code. When doing rapid deployments
> particularly, i've found that i have to go in and kill -9 the master
> process, and start again.

We've noticed some of this behavior as well. We've seen the new master
start spinning and consume 100% cpu (according to top). The old master
and workers are still running and working, but the new master just
hangs, and we have to kill -9. Our solution was to add the following
to our unicorn config, which *seems* to have solved the problem:

Unicorn::HttpServer::START_CTX[0] =
"#{path_to_app}/shared/bundle/ruby/1.9.1/bin/unicorn_rails"

This is outlined a bit here: http://unicorn.bogomips.org/Sandbox.html.
I'm not really sure why this seems to have fixed the problem -- all I
can tell you is that we haven't seen it since adding this line to the
config.

- Alex
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-02 20:09 Strange quit behavior James Cox
  2011-08-02 20:34 ` Alex Sharp
@ 2011-08-02 20:45 ` cliftonk
  2011-08-02 21:53 ` Eric Wong
  2 siblings, 0 replies; 30+ messages in thread
From: cliftonk @ 2011-08-02 20:45 UTC (permalink / raw)
  To: unicorn list

Your config has:

    sig = (worker.nr + 1) >= server.worker_processes ? :TERM : :TTOU
    Process.kill(sig, File.read(old_pid).to_i)

I don't have such a conditional in mine, and I use a QUIT signal (although I don't see why TERM isn't working for you).

    Process.kill("QUIT", File.read(old_pid).to_i)

You might also want to try logging a puts File.read(old_pid) to the unicorn.stdout.log to ensure the old_pid file is being correctly written to (and in the right path).
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-02 20:09 Strange quit behavior James Cox
  2011-08-02 20:34 ` Alex Sharp
  2011-08-02 20:45 ` cliftonk
@ 2011-08-02 21:53 ` Eric Wong
  2011-08-02 22:46   ` James Cox
  2011-08-03  0:34   ` Eric Wong
  2 siblings, 2 replies; 30+ messages in thread
From: Eric Wong @ 2011-08-02 21:53 UTC (permalink / raw)
  To: unicorn list; +Cc: James Cox

James Cox <james@imaj.es> wrote:
> Hey,
> 
> So here are some tasks for managing unicorn:
> 
> https://gist.github.com/1121076

Can we ignore the :restart task?   It's a bit fragile since it doesn't
wait for the old process to die (SIGTERM means
kill-as-quickly-as-possible, but given a loaded system it can still take
some time).

> I've found that it's very unreliable for quitting / terminating
> unicorn and restarting with new code. When doing rapid deployments
> particularly, i've found that i have to go in and kill -9 the master
> process, and start again.

If you SIGQUIT/SIGTERM before the app is loaded, the signal could
be ignored.  This behavior should probably change...

> any thoughts on why it seems ineffective?
> 
> Thanks.

Which version of Unicorn are you using?  I recall some minor tweaks
to avoid/minimize race conditions over the years so maybe some
are fixed.

> PS: here's the unicorn config:
> 
> https://gist.github.com/0dd07c5ad00c56d161c7

Avoid the top piece of the before_fork hook to TTOU workers, it's
needlessly complex for most deployments unless you're really
memory-constrained.

-- 
Eric Wong
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-02 20:34 ` Alex Sharp
@ 2011-08-02 21:54   ` Eric Wong
  2011-08-05  4:09     ` Alex Sharp
  0 siblings, 1 reply; 30+ messages in thread
From: Eric Wong @ 2011-08-02 21:54 UTC (permalink / raw)
  To: unicorn list

Alex Sharp <ajsharp@gmail.com> wrote:
> > I've found that it's very unreliable for quitting / terminating
> > unicorn and restarting with new code. When doing rapid deployments
> > particularly, i've found that i have to go in and kill -9 the master
> > process, and start again.
> 
> We've noticed some of this behavior as well. We've seen the new master
> start spinning and consume 100% cpu (according to top). The old master
> and workers are still running and working, but the new master just
> hangs, and we have to kill -9.

Can you try to strace (or equivalent) the old master to see what's using
100% CPU?

-- 
Eric Wong
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-02 21:53 ` Eric Wong
@ 2011-08-02 22:46   ` James Cox
  2011-08-02 23:08     ` Eric Wong
  2011-08-03  0:34   ` Eric Wong
  1 sibling, 1 reply; 30+ messages in thread
From: James Cox @ 2011-08-02 22:46 UTC (permalink / raw)
  To: Eric Wong, unicorn list

On Tue, Aug 2, 2011 at 17:53, Eric Wong <normalperson@yhbt.net> wrote:
> James Cox <james@imaj.es> wrote:
>> Hey,
>>
>> So here are some tasks for managing unicorn:
>>
>> https://gist.github.com/1121076
>
> Can we ignore the :restart task?   It's a bit fragile since it doesn't
> wait for the old process to die (SIGTERM means
> kill-as-quickly-as-possible, but given a loaded system it can still take
> some time).

We mostly restart (what surprise). What pattern works best here for
that? (speed of deploy is important, but definitely assume a long boot
time)


>
>> I've found that it's very unreliable for quitting / terminating
>> unicorn and restarting with new code. When doing rapid deployments
>> particularly, i've found that i have to go in and kill -9 the master
>> process, and start again.
>
> If you SIGQUIT/SIGTERM before the app is loaded, the signal could
> be ignored.  This behavior should probably change...
>
>> any thoughts on why it seems ineffective?
>>
>> Thanks.
>
> Which version of Unicorn are you using?  I recall some minor tweaks
> to avoid/minimize race conditions over the years so maybe some
> are fixed.
>
gem 'unicorn' - so whatever seems up to date. My lock says 4.0.1

>> PS: here's the unicorn config:
>>
>> https://gist.github.com/0dd07c5ad00c56d161c7
>
> Avoid the top piece of the before_fork hook to TTOU workers, it's
> needlessly complex for most deployments unless you're really
> memory-constrained.
>

So what should that look like? all but that nr-workers stuff? can i
just remove the before fork? and what would you say is a super good
unicorn config to start from?

thanks!
james
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-02 22:46   ` James Cox
@ 2011-08-02 23:08     ` Eric Wong
  2011-08-02 23:49       ` Alex Sharp
  0 siblings, 1 reply; 30+ messages in thread
From: Eric Wong @ 2011-08-02 23:08 UTC (permalink / raw)
  To: unicorn list; +Cc: James Cox

James Cox <james@imaj.es> wrote:
> On Tue, Aug 2, 2011 at 17:53, Eric Wong <normalperson@yhbt.net> wrote:
> > James Cox <james@imaj.es> wrote:
> >> Hey,
> >>
> >> So here are some tasks for managing unicorn:
> >>
> >> https://gist.github.com/1121076
> >
> > Can we ignore the :restart task?   It's a bit fragile since it doesn't
> > wait for the old process to die (SIGTERM means
> > kill-as-quickly-as-possible, but given a loaded system it can still take
> > some time).
> 
> We mostly restart (what surprise). What pattern works best here for
> that? (speed of deploy is important, but definitely assume a long boot
> time)

Yeah, if you do a full shutdown, I would definitely wait a little
for the old process to shut down (wait for pid to disappear), first.

However, full shutdown means you drop connections, so I would use
USR2 followed by QUIT (on same pid, USR2 will replace the current
pid file).  See http://unicorn.bogomips.org/SIGNALS.html

> So what should that look like? all but that nr-workers stuff? can i
> just remove the before fork? and what would you say is a super good
> unicorn config to start from?

Yeah, skip the before_fork and also after_fork.  Those are mainly for
disconnect/reconnect of ActiveRecord and anything else that might need a
network connection.

I try to have as little-as-possible in my unicorn config.

-- 
Eric Wong
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-02 23:08     ` Eric Wong
@ 2011-08-02 23:49       ` Alex Sharp
  0 siblings, 0 replies; 30+ messages in thread
From: Alex Sharp @ 2011-08-02 23:49 UTC (permalink / raw)
  To: unicorn list; +Cc: James Cox

On Tue, Aug 2, 2011 at 4:08 PM, Eric Wong <normalperson@yhbt.net> wrote:
> James Cox <james@imaj.es> wrote:
>> So what should that look like? all but that nr-workers stuff? can i
>> just remove the before fork? and what would you say is a super good
>> unicorn config to start from?
>
> Yeah, skip the before_fork and also after_fork.  Those are mainly for
> disconnect/reconnect of ActiveRecord and anything else that might need a
> network connection.

FWIW, we use the before_fork hook to automatically kill the old master
by sending it QUIT:

before_fork do |server, worker|
  old_pid = '/var/www/api/shared/pids/unicorn.pid.oldbin'

  if File.exists?(old_pid) && server.pid != old_pid
    begin
      Process.kill("QUIT", File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH
      # another newly forked workers has already killed it
    end
  end
end

- alex
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-02 21:53 ` Eric Wong
  2011-08-02 22:46   ` James Cox
@ 2011-08-03  0:34   ` Eric Wong
  2011-08-03  1:36     ` Eric Wong
  2011-08-19 22:18     ` Eric Wong
  1 sibling, 2 replies; 30+ messages in thread
From: Eric Wong @ 2011-08-03  0:34 UTC (permalink / raw)
  To: unicorn list; +Cc: James Cox

Eric Wong <normalperson@yhbt.net> wrote:
> If you SIGQUIT/SIGTERM before the app is loaded, the signal could
> be ignored.  This behavior should probably change...

I pushed the following to git://bogomips.org/unicorn

>From 406b8b0e2ed6e5be34d8ec3cd4b16048233c2856 Mon Sep 17 00:00:00 2001
From: Eric Wong <normalperson@yhbt.net>
Date: Tue, 2 Aug 2011 23:52:14 +0000
Subject: [PATCH] trap death signals in the worker sooner

This helps close a race condition preventing shutdown if
loading the application (preload_app=false) takes a long
time and the user decides to kil workers instead.
---
 lib/unicorn/http_server.rb |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb
index ad5d6f0..565f132 100644
--- a/lib/unicorn/http_server.rb
+++ b/lib/unicorn/http_server.rb
@@ -535,12 +535,17 @@ class Unicorn::HttpServer
     handle_error(client, e)
   end
 
+  EXIT_SIGS = [ :QUIT, :TERM, :INT ]
+  WORKER_QUEUE_SIGS = QUEUE_SIGS - EXIT_SIGS
+
   # gets rid of stuff the worker has no business keeping track of
   # to free some resources and drops all sig handlers.
   # traps for USR1, USR2, and HUP may be set in the after_fork Proc
   # by the user.
   def init_worker_process(worker)
-    QUEUE_SIGS.each { |sig| trap(sig, nil) }
+    # we'll re-trap :QUIT later for graceful shutdown iff we accept clients
+    EXIT_SIGS.each { |sig| trap(sig) { exit!(0) } }
+    WORKER_QUEUE_SIGS.each { |sig| trap(sig, nil) }
     trap(:CHLD, 'DEFAULT')
     SIG_QUEUE.clear
     proc_name "worker[#{worker.nr}]"
@@ -578,7 +583,6 @@ class Unicorn::HttpServer
     # closing anything we IO.select on will raise EBADF
     trap(:USR1) { nr = -65536; SELF_PIPE[0].close rescue nil }
     trap(:QUIT) { worker = nil; LISTENERS.each { |s| s.close rescue nil }.clear }
-    [:TERM, :INT].each { |sig| trap(sig) { exit!(0) } } # instant shutdown
     logger.info "worker=#{worker.nr} ready"
 
     begin
-- 
Eric Wong
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-03  0:34   ` Eric Wong
@ 2011-08-03  1:36     ` Eric Wong
  2011-08-19 22:18     ` Eric Wong
  1 sibling, 0 replies; 30+ messages in thread
From: Eric Wong @ 2011-08-03  1:36 UTC (permalink / raw)
  To: unicorn list; +Cc: James Cox

Eric Wong <normalperson@yhbt.net> wrote:
> Eric Wong <normalperson@yhbt.net> wrote:
> > If you SIGQUIT/SIGTERM before the app is loaded, the signal could
> > be ignored.  This behavior should probably change...
> 
> I pushed the following to git://bogomips.org/unicorn
> 
> From 406b8b0e2ed6e5be34d8ec3cd4b16048233c2856 Mon Sep 17 00:00:00 2001

Also pushed to rubygems.org for easier testing:

  gem install --pre unicorn -v 4.0.1.4.g406b

This will probably be unicorn 4.0.2 shortly.
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-02 21:54   ` Eric Wong
@ 2011-08-05  4:09     ` Alex Sharp
  2011-08-05  4:12       ` Alex Sharp
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Sharp @ 2011-08-05  4:09 UTC (permalink / raw)
  To: unicorn list

On Tue, Aug 2, 2011 at 2:54 PM, Eric Wong <normalperson@yhbt.net> wrote:
> Can you try to strace (or equivalent) the old master to see what's using
> 100% CPU?
>

All I see is a whole lot of this:

sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0

- alex
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-05  4:09     ` Alex Sharp
@ 2011-08-05  4:12       ` Alex Sharp
  2011-08-05  8:07         ` Eric Wong
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Sharp @ 2011-08-05  4:12 UTC (permalink / raw)
  To: unicorn list

On Thu, Aug 4, 2011 at 9:09 PM, Alex Sharp <ajsharp@gmail.com> wrote:
> On Tue, Aug 2, 2011 at 2:54 PM, Eric Wong <normalperson@yhbt.net> wrote:
>> Can you try to strace (or equivalent) the old master to see what's using
>> 100% CPU?
>>
>
> All I see is a whole lot of this:
>
> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0

Actually, my fault -- the last email was the output of new master.
Running strace on the old master shows the following:

select(4, [3], NULL, NULL, {13, 466686}) = 0 (Timeout)
wait4(-1, 0x7fff57e7bfcc, WNOHANG, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1312517345, 425411398}) = 0
fstat(9, {st_mode=S_IFREG, st_size=0, ...}) = 0
clock_gettime(CLOCK_REALTIME, {1312517345, 425625251}) = 0
fstat(11, {st_mode=S_IFREG, st_size=0, ...}) = 0
clock_gettime(CLOCK_REALTIME, {1312517345, 425779281}) = 0
fstat(12, {st_mode=S_IFREG, st_size=0, ...}) = 0
clock_gettime(CLOCK_REALTIME, {1312517345, 425927762}) = 0

The first line was when the master was idle, and then I threw a few
requests at it.

- alex
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-05  4:12       ` Alex Sharp
@ 2011-08-05  8:07         ` Eric Wong
  2011-08-17  4:26           ` Alex Sharp
  0 siblings, 1 reply; 30+ messages in thread
From: Eric Wong @ 2011-08-05  8:07 UTC (permalink / raw)
  To: unicorn list

Alex Sharp <ajsharp@gmail.com> wrote:
> On Thu, Aug 4, 2011 at 9:09 PM, Alex Sharp <ajsharp@gmail.com> wrote:
> > On Tue, Aug 2, 2011 at 2:54 PM, Eric Wong <normalperson@yhbt.net> wrote:
> >> Can you try to strace (or equivalent) the old master to see what's using
> >> 100% CPU?
> >>
> 
> Actually, my fault -- the last email was the output of new master.
> Running strace on the old master shows the following:

Is this Unicorn 3.x?  Which 3.x version exactly?  Maybe give 4.0.1
or the 4.0.2 git prerelease a try, too.

> select(4, [3], NULL, NULL, {13, 466686}) = 0 (Timeout)
> wait4(-1, 0x7fff57e7bfcc, WNOHANG, NULL) = 0
> clock_gettime(CLOCK_REALTIME, {1312517345, 425411398}) = 0
> fstat(9, {st_mode=S_IFREG, st_size=0, ...}) = 0
> clock_gettime(CLOCK_REALTIME, {1312517345, 425625251}) = 0
> fstat(11, {st_mode=S_IFREG, st_size=0, ...}) = 0
> clock_gettime(CLOCK_REALTIME, {1312517345, 425779281}) = 0
> fstat(12, {st_mode=S_IFREG, st_size=0, ...}) = 0
> clock_gettime(CLOCK_REALTIME, {1312517345, 425927762}) = 0

Can I get more? (until the next select() call, at least).  I'd like to
see the timeout argument that gets passed to the next select().

If you see select() with very small timeout, use "strace -v" to get the
st_ctime from fstat()s...

I could have a math bug in murder_lazy_workers (please read/review the
logic in that method, I haven't noticed issues myself).

I made some tweaks to the master sleep strategy within the past year to
reduce wakeups during idle hours.  This is intended to go with Ruby
1.9.3 which will have /much/ better thread wakeup strategy that reduces
power consumption during idle times.

> The first line was when the master was idle, and then I threw a few
> requests at it.

Are all workers responding as expected and not dying?

-- 
Eric Wong
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-05  8:07         ` Eric Wong
@ 2011-08-17  4:26           ` Alex Sharp
  2011-08-17  9:22             ` Eric Wong
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Sharp @ 2011-08-17  4:26 UTC (permalink / raw)
  To: unicorn list

On Fri, Aug 5, 2011 at 1:07 AM, Eric Wong <normalperson@yhbt.net> wrote:
> Is this Unicorn 3.x?  Which 3.x version exactly?  Maybe give 4.0.1
> or the 4.0.2 git prerelease a try, too.

We're using version 3.6.2 with ruby 1.9.2-p180 on ubuntu 11.x. I know
the kernel on this version of ubuntu has a know signal trapping bug,
but I don't think that's what happening here. The processes respond
after I send the restart signal to them again (USR2 + QUIT).

> Can I get more? (until the next select() call, at least).  I'd like to
> see the timeout argument that gets passed to the next select().

Here's some more output. This is from the old master though, not a new
master that is pegging the CPU. In this instance it's almost like the
old master just ignores the signal.

select(4, [3], NULL, NULL, {8, 129984}) = 0 (Timeout)
wait4(-1, 0x7fff16b82e4c, WNOHANG, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 259408364}) = 0
fstat(12, {st_dev=makedev(202, 1), st_ino=20373, st_mode=S_IFREG,
st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
st_size=0, st_atime=2011/08/12-23:14:20, st_mtime=2011/08/12-23:14:20,
st_ctime=2011/08/17-04:22:21}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 259775504}) = 0
fstat(13, {st_dev=makedev(202, 1), st_ino=20381, st_mode=S_IFREG,
st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
st_size=0, st_atime=2011/08/12-23:14:20, st_mtime=2011/08/12-23:14:20,
st_ctime=2011/08/17-04:22:17}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 259936816}) = 0
fstat(14, {st_dev=makedev(202, 1), st_ino=20382, st_mode=S_IFREG,
st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
st_size=0, st_atime=2011/08/12-23:14:21, st_mtime=2011/08/12-23:14:21,
st_ctime=2011/08/17-04:22:19}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 260086380}) = 0
fstat(15, {st_dev=makedev(202, 1), st_ino=20185, st_mode=S_IFREG,
st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
st_size=0, st_atime=2011/08/12-23:14:21, st_mtime=2011/08/12-23:14:21,
st_ctime=2011/08/17-04:22:17}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 260235797}) = 0
fstat(16, {st_dev=makedev(202, 1), st_ino=20255, st_mode=S_IFREG,
st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
st_size=0, st_atime=2011/08/12-23:14:21, st_mtime=2011/08/12-23:14:21,
st_ctime=2011/08/17-04:22:19}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 260384849}) = 0
fstat(17, {st_dev=makedev(202, 1), st_ino=20383, st_mode=S_IFREG,
st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
st_size=0, st_atime=2011/08/12-23:14:21, st_mtime=2011/08/12-23:14:21,
st_ctime=2011/08/17-04:22:19}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 260534792}) = 0
fstat(18, {st_dev=makedev(202, 1), st_ino=20384, st_mode=S_IFREG,
st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
st_size=0, st_atime=2011/08/12-23:14:21, st_mtime=2011/08/12-23:14:21,
st_ctime=2011/08/17-04:22:19}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 260684278}) = 0
fstat(10, {st_dev=makedev(202, 1), st_ino=20196, st_mode=S_IFREG,
st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
st_size=0, st_atime=2011/08/16-16:33:46, st_mtime=2011/08/16-16:33:46,
st_ctime=2011/08/17-04:22:19}) = 0
clock_gettime(CLOCK_REALTIME, {1313554942, 260833725}) = 0
select(4, [3], NULL, NULL, {8, 976047}

> If you see select() with very small timeout, use "strace -v" to get the
> st_ctime from fstat()s...
>
> I could have a math bug in murder_lazy_workers (please read/review the
> logic in that method, I haven't noticed issues myself).
>
> I made some tweaks to the master sleep strategy within the past year to
> reduce wakeups during idle hours.  This is intended to go with Ruby
> 1.9.3 which will have /much/ better thread wakeup strategy that reduces
> power consumption during idle times.
>
>> The first line was when the master was idle, and then I threw a few
>> requests at it.
>
> Are all workers responding as expected and not dying?

The old workers appear to be serving requests.

- Alex
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-17  4:26           ` Alex Sharp
@ 2011-08-17  9:22             ` Eric Wong
  2011-08-17 20:13               ` Eric Wong
  2011-08-31  0:33               ` Eric Wong
  0 siblings, 2 replies; 30+ messages in thread
From: Eric Wong @ 2011-08-17  9:22 UTC (permalink / raw)
  To: unicorn list

Alex Sharp <ajsharp@gmail.com> wrote:
> On Fri, Aug 5, 2011 at 1:07 AM, Eric Wong <normalperson@yhbt.net> wrote:
> > Can I get more? (until the next select() call, at least).  I'd like to
> > see the timeout argument that gets passed to the next select().
> 
> Here's some more output. This is from the old master though, not a new
> master that is pegging the CPU. In this instance it's almost like the
> old master just ignores the signal.

Wait, weren't you trying to track down a problem with the master that's
pegging the CPU?  Isn't the CPU usage the problem here?  I was also
thinking the CPU usage in the new master was caused by constant
worker respawning because bundler wasn't loaded correctly somehow....

> select(4, [3], NULL, NULL, {8, 129984}) = 0 (Timeout)
<snip>
> select(4, [3], NULL, NULL, {8, 976047}

OK, so /this/ master is confirmed to be sleeping with a reasonable
timeout and not pegging the CPU...

If you want to actually track down whether or not a signal is being
delivered, use "strace -f -e '!futex'" since Ruby 1.9 has a dedicated
signal handling thread (at the OS level).

(You only need the -e '!futex' part to filter out the noise from the
 signal handling thread in <= 1.9.2, 1.9.3 is much quieter :)


Below is a proposed patch (to unicorn.git) which may help debug issues
in the signal -> handler master path (but only once it enters the Ruby
runtime).  I'm a hesitant to commit it since it worthless if the Ruby
process is stuck because of some bad C extension.  That's the most
common cause of stuck/unresponsive processes I've seen.

Subject: [PATCH] http_server: add debug statements for master sig handlers

This should help users track down what's going on as soon as
Ruby can process the signal.  It's still not going to be useful
if the Ruby process is stuck because of a bug in a C extension
or Ruby itself, though.

Unicorn always defers signals in the master process to serialize
signal handling since some of the important actions (e.g. HUP,
USR1, USR2) are NOT reentrant.
---
 lib/unicorn/http_server.rb |   13 +++++++++++--
 1 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb
index aa8212e..b45d6d6 100644
--- a/lib/unicorn/http_server.rb
+++ b/lib/unicorn/http_server.rb
@@ -126,8 +126,17 @@ class Unicorn::HttpServer
     # setup signal handlers before writing pid file in case people get
     # trigger happy and send signals as soon as the pid file exists.
     # Note that signals don't actually get handled until the #join method
-    QUEUE_SIGS.each { |sig| trap(sig) { SIG_QUEUE << sig; awaken_master } }
-    trap(:CHLD) { awaken_master }
+    QUEUE_SIGS.each do |sig|
+      trap(sig) do
+        @logger.debug("received SIG#{sig}")
+        SIG_QUEUE << sig
+        awaken_master
+      end
+    end
+    trap(:CHLD) do
+      @logger.debug("received SIGCHLD")
+      awaken_master
+    end
     self.pid = config[:pid]
 
     self.master_pid = $$
-- 
Eric Wong
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-17  9:22             ` Eric Wong
@ 2011-08-17 20:13               ` Eric Wong
  2011-08-18 23:13                 ` Alex Sharp
  2011-08-31  0:33               ` Eric Wong
  1 sibling, 1 reply; 30+ messages in thread
From: Eric Wong @ 2011-08-17 20:13 UTC (permalink / raw)
  To: unicorn list

Eric Wong <normalperson@yhbt.net> wrote:
> Below is a proposed patch (to unicorn.git) which may help debug issues
> in the signal -> handler master path (but only once it enters the Ruby
> runtime).  I'm a hesitant to commit it since it worthless if the Ruby
> process is stuck because of some bad C extension.  That's the most
> common cause of stuck/unresponsive processes I've seen.

I think that was a bad patch, adding signal handler debugging at the
Ruby layer leads to the false assumption that interpreter/VM is in a
good state.  If you need to debug signal handlers, something is already
broken and tracing syscalls is the most reliable way to go.


Ruby (and any other high-level language) signal handling is not
straight forward[1].

Here's how things work in Matz Ruby 1.9.x[2]:

  you                 C timer thread             Ruby Thread(s)
  -------------------------------------------------------------------
                      traps signals              ignores most signals
                      sleeps                     runs Ruby...
  kill -USR2 ...
                      receives signal (async)
                      runs (system) sighandler[1]
                      wakes up from sleep
                      signals Ruby Thread(s)
                                                 *hopefully wakes up*
                                                 runs Ruby sighandler


The "*hopefully wakes up*" part is the part most likely to fail
as a result of a bad C extension or Ruby bug.


PS. In Ruby 1.9.3, timer thread uses the "self-pipe" sighandler
    implementation that the unicorn master process always used.
    This allows Ruby 1.9.3 to conserve power on idle processes.
    In 1.9.2, the timer thread signal handler just polls in
    10ms intervals to check if any signals were received.
    This is why "strace -f" is noisy and I recommend "-e '!futex'"
    for 1.9.2.

PPS. Unicorn still uses the "self-pipe" signal handler in Ruby-land
     because Ruby signal handlers are reentrant so must execute
     reentrant-safe code.  So without the self-pipe to serialize
     the signal handler dispatch, the Ruby signal handler execution
     can nest and overlap execution with itself.  This means if USR2
     is sent multiple times in short succession, you could spawn
     multiple new unicorn masters


[1] - See "man 7 signal" in Linux manpages or POSIX specs for the
      small list of safe functions that may be called in system-level
      sighandlers.  Ruby-level signal handlers can't run inside
      system-level signal handlers for this reason.

[2] - I think any high-level language that implements signal handlers
      AND native threads must do something similar.  The only valid
      variation I can think of is to execute the high-level language
      code inside the timer thread, but that requires the coders of
      the high-level language to have thread-safety (not just
      reentrancy) in mind when writing signal handlers even if the
      rest of their code uses no threads.

-- 
Eric Wong
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-17 20:13               ` Eric Wong
@ 2011-08-18 23:13                 ` Alex Sharp
  2011-08-19  1:53                   ` Eric Wong
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Sharp @ 2011-08-18 23:13 UTC (permalink / raw)
  To: unicorn list

Ok, thanks a lot for the for the patch attempt and subsequent in-depth
explanation. I know this thread is long-lived and a bit confusing,
mostly because I've seen two different sets of scenarios that yield
the same result, which is that unicorn does not restart.

The first scenario (and the one that started this thread) was that
unicorn actually receives the signal, spawns a new master, and that
process pegs the cpu and spins into eternity. The second scenario is
that the old master seems to completely ignore the USR2 signal the
first time, and when sent again, properly restarts. In both scenarios
the old master and workers continue to serve the old code.

I thought that setting the Unicorn::HttpServer::START_CTX[0] in my
unicorn config had solved the first scenario (pegged cpu), but I found
a new occurrence of it today, and I now have some new strace output
for this scenario (strace -v):

select(4, [3], NULL, NULL, {60, 727229}) = 0 (Timeout)
wait4(-1, 0x7ffffd70f72c, WNOHANG, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1313708081, 720757662}) = 0
fstat(8, {st_dev=makedev(202, 1), st_ino=39806, st_mode=S_IFREG,
st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10,
st_ctime=2011/08/18-22:54:06}) = 0
clock_gettime(CLOCK_REALTIME, {1313708081, 721131305}) = 0
fstat(10, {st_dev=makedev(202, 1), st_ino=45370, st_mode=S_IFREG,
st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10,
st_ctime=2011/08/18-22:54:05}) = 0
clock_gettime(CLOCK_REALTIME, {1313708081, 721290800}) = 0
select(4, [3], NULL, NULL, {565, 34005}) = 0 (Timeout)
wait4(-1, 0x7ffffd70f72c, WNOHANG, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1313708646, 853870540}) = 0
fstat(8, {st_dev=makedev(202, 1), st_ino=39806, st_mode=S_IFREG,
st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10,
st_ctime=2011/08/18-22:59:06}) = 0
clock_gettime(CLOCK_REALTIME, {1313708646, 854102750}) = 0
fstat(10, {st_dev=makedev(202, 1), st_ino=45370, st_mode=S_IFREG,
st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10,
st_ctime=2011/08/18-23:04:05}) = 0
clock_gettime(CLOCK_REALTIME, {1313708646, 854260655}) = 0
select(4, [3], NULL, NULL, {598, 630876}

With respect to the second scenario (ignoring signals), I'm going to
recommend we table that issue for now, as we're currently running on a
version of ubuntu (11.10) which has a known signal trapping bug with
ruby 1.9.2-p180, and downgrading to 10.04 may solve that problem.
Granted, when I've observed this in the past with other libraries, the
process becomes completely unresponsive, whereas the behavior in
unicorn is more intermittent. Either way, we are in the process of
downgrading our servers to ubuntu 10.04, so let's not waste anymore
time trying to debug something that may well be occurring due to a
kernel bug. If we continue to see this after the downgrade to 10.04,
I'll report back and we can keep digging.

Again, my apologies for the confusion between the two scenarios, and
thanks for all your help.

- Alex
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-18 23:13                 ` Alex Sharp
@ 2011-08-19  1:53                   ` Eric Wong
  2011-08-19  9:42                     ` Alex Sharp
  0 siblings, 1 reply; 30+ messages in thread
From: Eric Wong @ 2011-08-19  1:53 UTC (permalink / raw)
  To: unicorn list

Alex Sharp <ajsharp@gmail.com> wrote:
> Ok, thanks a lot for the for the patch attempt and subsequent in-depth
> explanation. I know this thread is long-lived and a bit confusing,
> mostly because I've seen two different sets of scenarios that yield
> the same result, which is that unicorn does not restart.

No problem, I'm here to help and mutt makes it easy to follow long threads :>
<snip>

> I thought that setting the Unicorn::HttpServer::START_CTX[0] in my
> unicorn config had solved the first scenario (pegged cpu), but I found
> a new occurrence of it today, and I now have some new strace output
> for this scenario (strace -v):
> 
> select(4, [3], NULL, NULL, {60, 727229}) = 0 (Timeout)
> wait4(-1, 0x7ffffd70f72c, WNOHANG, NULL) = 0
> clock_gettime(CLOCK_REALTIME, {1313708081, 720757662}) = 0
> fstat(8, {st_dev=makedev(202, 1), st_ino=39806, st_mode=S_IFREG,
> st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
> st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10,
> st_ctime=2011/08/18-22:54:06}) = 0
> clock_gettime(CLOCK_REALTIME, {1313708081, 721131305}) = 0
> fstat(10, {st_dev=makedev(202, 1), st_ino=45370, st_mode=S_IFREG,
> st_nlink=0, st_uid=1001, st_gid=1001, st_blksize=4096, st_blocks=0,
> st_size=0, st_atime=2011/08/18-14:32:10, st_mtime=2011/08/18-14:32:10,
> st_ctime=2011/08/18-22:54:05}) = 0
> clock_gettime(CLOCK_REALTIME, {1313708081, 721290800}) = 0
> select(4, [3], NULL, NULL, {565, 34005}) = 0 (Timeout)

<snip>  OK, this looks good from the system call level in the main
thread of the master process:

  I can see that from select() timing out despite being called with
  long intervals.  I don't see workers dying, either...

Can you add "-f -e '!futex'" to the strace invocation so we see
all the threads?  You can also add '-T' to get timing information
to see how long each syscall takes or '-tt' to get timestamps
in strace if you think it's useful.

Since the main thread doesn't appear to be doing much, it's
probably a child thread somewhere...

Also, I assume you're using preload_app?  If you are, do you see this
issue with preload_app=false?  I suspect there's some background thread
that could be running in the master process taking up all the CPU time.
Unicorn itself never spawns background threads.

Basically anything you can tell use about the app, the configuration,
and which gems/libraries would be useful.

> With respect to the second scenario (ignoring signals), I'm going to
> recommend we table that issue for now, as we're currently running on a
> version of ubuntu (11.10) which has a known signal trapping bug with
> ruby 1.9.2-p180, and downgrading to 10.04 may solve that problem.
> Granted, when I've observed this in the past with other libraries, the
> process becomes completely unresponsive, whereas the behavior in
> unicorn is more intermittent. Either way, we are in the process of
> downgrading our servers to ubuntu 10.04, so let's not waste anymore
> time trying to debug something that may well be occurring due to a
> kernel bug. If we continue to see this after the downgrade to 10.04,
> I'll report back and we can keep digging.

OK.  I wouldn't rule out the CPU usage as unrelated to the signaling
issue, though.  The Ruby 1.9 timer thread could be going crazy
because it can't signal or receive signals properly...

-- 
Eric Wong
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-19  1:53                   ` Eric Wong
@ 2011-08-19  9:42                     ` Alex Sharp
  2011-08-19 16:51                       ` Eric Wong
  2011-08-23  2:59                       ` Alex Sharp
  0 siblings, 2 replies; 30+ messages in thread
From: Alex Sharp @ 2011-08-19  9:42 UTC (permalink / raw)
  To: unicorn list

On Thu, Aug 18, 2011 at 6:53 PM, Eric Wong <normalperson@yhbt.net> wrote:
>  I can see that from select() timing out despite being called with
>  long intervals.  I don't see workers dying, either...

That's correct, the old workers are staying alive as well.


> Can you add "-f -e '!futex'" to the strace invocation so we see
> all the threads?  You can also add '-T' to get timing information
> to see how long each syscall takes or '-tt' to get timestamps
> in strace if you think it's useful.
>
> Also, I assume you're using preload_app?  If you are, do you see this
> issue with preload_app=false?  I suspect there's some background thread
> that could be running in the master process taking up all the CPU time.
> Unicorn itself never spawns background threads.

Yes, we're using preload_app=true. I haven't tried it with
preload_app=false -- if I get to it tomorrow I'll report back here.

We *are* using newrelic, which operates in a background thread. I've
just found this ticket in the newrelic support
forums: https://support.newrelic.com/help/discussions/support/7692-newrelic_rpm-gem-with-unicorn-40.

I'll re-run strace with the suggested flags above and report back.

> Basically anything you can tell use about the app, the configuration,
> and which gems/libraries would be useful.

Gemfile: https://gist.github.com/05a9445471ad7edfdcb7

> OK.  I wouldn't rule out the CPU usage as unrelated to the signaling
> issue, though.  The Ruby 1.9 timer thread could be going crazy
> because it can't signal or receive signals properly...

--
alex sharp
github.com/ajsharp
twitter.com/ajsharp
alexjsharp.com
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-19  9:42                     ` Alex Sharp
@ 2011-08-19 16:51                       ` Eric Wong
  2011-08-23  2:59                       ` Alex Sharp
  1 sibling, 0 replies; 30+ messages in thread
From: Eric Wong @ 2011-08-19 16:51 UTC (permalink / raw)
  To: unicorn list

Alex Sharp <ajsharp@gmail.com> wrote:
> On Thu, Aug 18, 2011 at 6:53 PM, Eric Wong <normalperson@yhbt.net> wrote:
> > Basically anything you can tell use about the app, the configuration,
> > and which gems/libraries would be useful.
> 
> Gemfile: https://gist.github.com/05a9445471ad7edfdcb7

OK, out of the few gems I recognize, I don't see anything else besides
newrelic_rpm that might run a background thread.  I would scan through
the rest of them to be sure (and maybe other folks here can give you
pointers).

Personally, I never depend on anything before I've had a chance to
scan through the code and know what it does, first.

-- 
Eric Wong
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-03  0:34   ` Eric Wong
  2011-08-03  1:36     ` Eric Wong
@ 2011-08-19 22:18     ` Eric Wong
  1 sibling, 0 replies; 30+ messages in thread
From: Eric Wong @ 2011-08-19 22:18 UTC (permalink / raw)
  To: unicorn list

Eric Wong <normalperson@yhbt.net> wrote:
> Eric Wong <normalperson@yhbt.net> wrote:
> > If you SIGQUIT/SIGTERM before the app is loaded, the signal could
> > be ignored.  This behavior should probably change...
> 
> I pushed the following to git://bogomips.org/unicorn
> 
> From 406b8b0e2ed6e5be34d8ec3cd4b16048233c2856 Mon Sep 17 00:00:00 2001
> From: Eric Wong <normalperson@yhbt.net>
> Date: Tue, 2 Aug 2011 23:52:14 +0000
> Subject: [PATCH] trap death signals in the worker sooner
> 
> This helps close a race condition preventing shutdown if
> loading the application (preload_app=false) takes a long
> time and the user decides to kil workers instead.

The following completely eliminates the race condition:

>From 8de6ab371c1623669b86a5dfa8703c8fd539011f Mon Sep 17 00:00:00 2001
From: Eric Wong <normalperson@yhbt.net>
Date: Fri, 19 Aug 2011 22:13:04 +0000
Subject: [PATCH] close race if an exit signal hits the worker before trap

The signal handler from the master is still active and will
push the pending signal to SIG_QUEUE if a worker receives
a signal immediately after forking.
---
 lib/unicorn/http_server.rb |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb
index aa8212e..4f516c9 100644
--- a/lib/unicorn/http_server.rb
+++ b/lib/unicorn/http_server.rb
@@ -549,6 +549,7 @@ class Unicorn::HttpServer
   def init_worker_process(worker)
     # we'll re-trap :QUIT later for graceful shutdown iff we accept clients
     EXIT_SIGS.each { |sig| trap(sig) { exit!(0) } }
+    exit!(0) if (SIG_QUEUE & EXIT_SIGS)[0]
     WORKER_QUEUE_SIGS.each { |sig| trap(sig, nil) }
     trap(:CHLD, 'DEFAULT')
     SIG_QUEUE.clear
-- 
Eric Wong
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-19  9:42                     ` Alex Sharp
  2011-08-19 16:51                       ` Eric Wong
@ 2011-08-23  2:59                       ` Alex Sharp
  2011-08-23  7:12                         ` Eric Wong
  1 sibling, 1 reply; 30+ messages in thread
From: Alex Sharp @ 2011-08-23  2:59 UTC (permalink / raw)
  To: unicorn list

>> Can you add "-f -e '!futex'" to the strace invocation so we see
>> all the threads?  You can also add '-T' to get timing information
>> to see how long each syscall takes or '-tt' to get timestamps
>> in strace if you think it's useful.

Here's the result of strace with suggested flags on the old master.
(To clarify, this is still the spinning CPU issue)

→ sudo strace -v -f -e '!futex' -p 18862
Process 18862 attached with 2 threads - interrupt to quit
[pid 22170] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 18862] select(4, [3], NULL, NULL, {3, 546106}) = 0 (Timeout)
[pid 18862] wait4(-1, 0x7fffbb907d3c, WNOHANG, NULL) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067031, 459472514}) = 0
[pid 18862] fstat(7, {st_dev=makedev(202, 1), st_ino=31695,
st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001,
st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05,
st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:06}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067031, 459813102}) = 0
[pid 18862] fstat(11, {st_dev=makedev(202, 1), st_ino=31696,
st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001,
st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05,
st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:07}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067031, 459999675}) = 0
[pid 18862] fstat(13, {st_dev=makedev(202, 1), st_ino=31697,
st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001,
st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05,
st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:07}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067031, 460168995}) = 0
[pid 18862] fstat(14, {st_dev=makedev(202, 1), st_ino=31698,
st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001,
st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05,
st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:07}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067031, 460343593}) = 0
[pid 18862] select(4, [3], NULL, NULL, {6, 255959}) = 0 (Timeout)
[pid 18862] wait4(-1, 0x7fffbb907d3c, WNOHANG, NULL) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067037, 739352608}) = 0
[pid 18862] fstat(7, {st_dev=makedev(202, 1), st_ino=31695,
st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001,
st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05,
st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:16}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067037, 739570659}) = 0
[pid 18862] fstat(11, {st_dev=makedev(202, 1), st_ino=31696,
st_mode=S_IFREG, st_nlink=0, st_uid=1001, st_gid=1001,
st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05,
st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:17}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067037, 739768755}) = 0
[pid 18862] fstat(13, {st_dev=makedev(202, 1), st_ino=31697,
st_mode=S_IFREG|01, st_nlink=0, st_uid=1001, st_gid=1001,
st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05,
st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:17}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067037, 739951254}) = 0
[pid 18862] fstat(14, {st_dev=makedev(202, 1), st_ino=31698,
st_mode=S_IFREG|01, st_nlink=0, st_uid=1001, st_gid=1001,
st_blksize=4096, st_blocks=0, st_size=0, st_atime=2011/08/23-00:09:05,
st_mtime=2011/08/23-00:09:05, st_ctime=2011/08/23-02:37:17}) = 0
[pid 18862] clock_gettime(CLOCK_REALTIME, {1314067037, 740130965}) = 0
[pid 18862] select(4, [3], NULL, NULL, {9, 996190}^C <unfinished ...>

>> Also, I assume you're using preload_app?  If you are, do you see this
>> issue with preload_app=false?  I suspect there's some background thread
>> that could be running in the master process taking up all the CPU time.
>> Unicorn itself never spawns background threads.

I went ahead and ran strace with the same flags on the *new* master,
and saw a bunch of output that looked bundler-related:
https://gist.github.com/138344b5b19ec6ba1a4c

Even more bizarre, eventually the process started successfully :-/ Is
it possible this had something to do with strace de-taching?

You can see this in the unicorn.stderr.log file I included in the
gist. Check out these two lines in particular, which occur 25 minutes
apart:

I, [2011-08-23T02:15:08.396868 #22169]  INFO -- : Refreshing Gem list
I, [2011-08-23T02:40:16.621210 #22925]  INFO -- : worker=1 spawned pid=22925

Question: If I turn on debug flags in unicorn, will I get much of this
output in the unicorn.stderr log?

- Alex
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-23  2:59                       ` Alex Sharp
@ 2011-08-23  7:12                         ` Eric Wong
  2011-08-23 16:49                           ` Alex Sharp
  0 siblings, 1 reply; 30+ messages in thread
From: Eric Wong @ 2011-08-23  7:12 UTC (permalink / raw)
  To: unicorn list

Alex Sharp <ajsharp@gmail.com> wrote:
> >> Can you add "-f -e '!futex'" to the strace invocation so we see
> >> all the threads?  You can also add '-T' to get timing information
> >> to see how long each syscall takes or '-tt' to get timestamps
> >> in strace if you think it's useful.
> 
> Here's the result of strace with suggested flags on the old master.
> (To clarify, this is still the spinning CPU issue)
> 
> → sudo strace -v -f -e '!futex' -p 18862
> Process 18862 attached with 2 threads - interrupt to quit
> [pid 22170] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
> [pid 18862] select(4, [3], NULL, NULL, {3, 546106}) = 0 (Timeout)

<snip>

22170 is presumably the timer thread.

Did you send any signals to 18862 while you were making this strace?

If you did send a signal to 18862 while stracing, it's worrying it
didn't show up in your strace.  If you didn't send a signal, then
try sending a signal.

> >> Also, I assume you're using preload_app?  If you are, do you see this
> >> issue with preload_app=false?  I suspect there's some background thread
> >> that could be running in the master process taking up all the CPU time.
> >> Unicorn itself never spawns background threads.
> 
> I went ahead and ran strace with the same flags on the *new* master,
> and saw a bunch of output that looked bundler-related:
> https://gist.github.com/138344b5b19ec6ba1a4c
> 
> Even more bizarre, eventually the process started successfully :-/ Is
> it possible this had something to do with strace de-taching?

That looks like pretty normal "require" behavior.  strace would slow
down requires, a lot.

So this was with preload_app=true?  While you're debugging problems,
I suggest keeping preload_app=false and worker_problems=1 to minimize
the variables.

> You can see this in the unicorn.stderr.log file I included in the
> gist. Check out these two lines in particular, which occur 25 minutes
> apart:
> 
> I, [2011-08-23T02:15:08.396868 #22169]  INFO -- : Refreshing Gem list
> I, [2011-08-23T02:40:16.621210 #22925]  INFO -- : worker=1 spawned pid=22925

Wow, it takes 25 minutes to load your application?  strace makes the
application /much/ slower, so I can actually believe it takes that long.

Loading large applications is very slow under Ruby 1.9.2, there's some
pathological load performance issues fixed in 1.9.3.

So you're saying /without/ strace, CPU usage _stays_ at 100% and _never_
goes down?

> Question: If I turn on debug flags in unicorn, will I get much of this
> output in the unicorn.stderr log?

You'll get every exception that's thrown even if it's rescued.
Unicorn 3.x itself doesn't throw that many exceptions.


Generally, can you reproduce this behavior on a plain (empty) Rails
application with no extra gems?

-- 
Eric Wong
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-23  7:12                         ` Eric Wong
@ 2011-08-23 16:49                           ` Alex Sharp
  2011-08-23 20:23                             ` Eric Wong
  0 siblings, 1 reply; 30+ messages in thread
From: Alex Sharp @ 2011-08-23 16:49 UTC (permalink / raw)
  To: unicorn list

On Tuesday, August 23, 2011 at 12:12 AM, Eric Wong wrote:
> Did you send any signals to 18862 while you were making this strace?
Not while, no. I had already sent a USR2 signal to this process (the old master), and then I attached strace to it. 

I'll try sending another USR2 signal next time while strace is attached. 
> >  I went ahead and ran strace with the same flags on the *new* master,
> > and saw a bunch of output that looked bundler-related:
> > https://gist.github.com/138344b5b19ec6ba1a4c
> > 
> > Even more bizarre, eventually the process started successfully :-/ Is
> > it possible this had something to do with strace de-taching?
> 
> That looks like pretty normal "require" behavior. strace would slow
>  down requires, a lot.
> 
> So this was with preload_app=true? While you're debugging problems,
> I suggest keeping preload_app=false and worker_problems=1 to minimize
> the variables.
Ok, I'll change those and report back. I'm guessing you meant worker_processes (not problems)? 
> 
> > You can see this in the unicorn.stderr.log file I included in the
> > gist. Check out these two lines in particular, which occur 25 minutes
> >  apart:
> > 
> > I, [2011-08-23T02:15:08.396868 #22169] INFO -- : Refreshing Gem list
> > I, [2011-08-23T02:40:16.621210 #22925] INFO -- : worker=1 spawned pid=22925
> 
> Wow, it takes 25 minutes to load your application? strace makes the
>  application /much/ slower, so I can actually believe it takes that long.
No, my mistake. Loading the application only takes about 10 seconds, and I only had strace attached to this process for a few seconds (less than 10). My point here was to show that the new master just spun for a good 25 minutes (presumably trying to load files over and over again), and then, seemingly out of nowhere, the new master came up and spawned the new workers. 
Next time I'll try to get attached with strace earlier and record more output.

> 
>  Loading large applications is very slow under Ruby 1.9.2, there's some
> pathological load performance issues fixed in 1.9.3.
> 
Yep, I've read about those, and I've seen Xavier's patch, but I don't think that's the issue here (though, it appears that's why the files attempting to be loaded in the strace output do not exist). Under normal circumstances, loading the app takes about 10 seconds and doesn't peg the cpu while doing it.
> 
> So you're saying /without/ strace, CPU usage _stays_ at 100% and _never_
> goes down?
Correct. 
> 
> 
> Generally, can you reproduce this behavior on a plain (empty) Rails
>  application with no extra gems?
> 
Good idea, I'll try that next.

- Alex 
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-23 16:49                           ` Alex Sharp
@ 2011-08-23 20:23                             ` Eric Wong
  0 siblings, 0 replies; 30+ messages in thread
From: Eric Wong @ 2011-08-23 20:23 UTC (permalink / raw)
  To: unicorn list

Alex Sharp <ajsharp@gmail.com> wrote:
> On Tuesday, August 23, 2011 at 12:12 AM, Eric Wong wrote:
> > So this was with preload_app=true? While you're debugging problems,
> > I suggest keeping preload_app=false and worker_problems=1 to minimize
> > the variables.
> Ok, I'll change those and report back. I'm guessing you meant
> worker_processes (not problems)? 

LOL yes, it was late last night :)

> > Generally, can you reproduce this behavior on a plain (empty) Rails
> >  application with no extra gems?
> > 
> Good idea, I'll try that next.

I rarely run into problems, but doing things like this (starting with
empty app, Unicorn as default-as-possible (worker_processes=1,
preload_app=false)) should be standard procedure for troubleshooting.

These things are so second nature to me and I forget to mention it :<

I'm also the type who reads the code of /every/ library I introduce into
the app so I know (or am very quickly capable) of knowing exactly what
an app is doing.

-- 
Eric Wong
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-17  9:22             ` Eric Wong
  2011-08-17 20:13               ` Eric Wong
@ 2011-08-31  0:33               ` Eric Wong
  2011-09-01 18:45                 ` Alex Sharp
  1 sibling, 1 reply; 30+ messages in thread
From: Eric Wong @ 2011-08-31  0:33 UTC (permalink / raw)
  To: unicorn list

Eric Wong <normalperson@yhbt.net> wrote:
> +      trap(sig) do
> +        @logger.debug("received SIG#{sig}")

I'm even more glad I didn't apply this patch to Unicorn.  I completely
forgot Logger uses a mutex internally (even though it doesn't need to
when writing to a POSIX-compliant file system).

Rainbows! has a similar issue I fixed/worked around:
  http://mid.gmane.org/20110830233232.GA19633@dcvr.yhbt.net

-- 
Eric Wong
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-08-31  0:33               ` Eric Wong
@ 2011-09-01 18:45                 ` Alex Sharp
  2011-09-01 19:46                   ` Eric Wong
  2011-09-01 19:57                   ` Alex Sharp
  0 siblings, 2 replies; 30+ messages in thread
From: Alex Sharp @ 2011-09-01 18:45 UTC (permalink / raw)
  To: unicorn list

On Tuesday, August 30, 2011 at 5:33 PM, Eric Wong wrote:
> Eric Wong <normalperson@yhbt.net (mailto:normalperson@yhbt.net)> wrote:
> > + trap(sig) do
> > + @logger.debug("received SIG#{sig}")
> 
> I'm even more glad I didn't apply this patch to Unicorn. I completely
> forgot Logger uses a mutex internally (even though it doesn't need to
> when writing to a POSIX-compliant file system).
> 
> Rainbows! has a similar issue I fixed/worked around:
> http://mid.gmane.org/20110830233232.GA19633@dcvr.yhbt.net
So are you saying that Unicorn is affected with the same issue? I'm confused -- does the master send USR1 signals to the workers when it receives a USR2? I'm not sending a USR1 signal to the master. 

- Alex
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-09-01 18:45                 ` Alex Sharp
@ 2011-09-01 19:46                   ` Eric Wong
  2011-09-01 19:57                   ` Alex Sharp
  1 sibling, 0 replies; 30+ messages in thread
From: Eric Wong @ 2011-09-01 19:46 UTC (permalink / raw)
  To: unicorn list

Alex Sharp <ajsharp@gmail.com> wrote:
> On Tuesday, August 30, 2011 at 5:33 PM, Eric Wong wrote:
> > Eric Wong <normalperson@yhbt.net (mailto:normalperson@yhbt.net)> wrote:
> > > + trap(sig) do
> > > + @logger.debug("received SIG#{sig}")
> > 
> > I'm even more glad I didn't apply this patch to Unicorn. I completely
> > forgot Logger uses a mutex internally (even though it doesn't need to
> > when writing to a POSIX-compliant file system).
> > 
> > Rainbows! has a similar issue I fixed/worked around:
> > http://mid.gmane.org/20110830233232.GA19633@dcvr.yhbt.net
> 
> So are you saying that Unicorn is affected with the same issue? I'm
> confused -- does the master send USR1 signals to the workers when it
> receives a USR2? I'm not sending a USR1 signal to the master. 

No, Unicorn is not affected.  It /would have/ been, had I applied that
patch.

-- 
Eric Wong
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-09-01 18:45                 ` Alex Sharp
  2011-09-01 19:46                   ` Eric Wong
@ 2011-09-01 19:57                   ` Alex Sharp
  2011-09-01 20:26                     ` Eric Wong
  1 sibling, 1 reply; 30+ messages in thread
From: Alex Sharp @ 2011-09-01 19:57 UTC (permalink / raw)
  To: unicorn list

 Also, today I did an strace on an old unicorn master that was not responding to USR2 signals (but not pegging the cpu), and I saw tons of gettimeofday calls in the output. Is this normal? Also, I was seeing this after having removed newrelic from the project. 

--Alex Sharp

github.com/ajsharp
twitter.com/ajsharp
alexjsharp.com



_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Strange quit behavior
  2011-09-01 19:57                   ` Alex Sharp
@ 2011-09-01 20:26                     ` Eric Wong
  0 siblings, 0 replies; 30+ messages in thread
From: Eric Wong @ 2011-09-01 20:26 UTC (permalink / raw)
  To: unicorn list

Alex Sharp <ajsharp@gmail.com> wrote:
>  Also, today I did an strace on an old unicorn master that was not
>  responding to USR2 signals (but not pegging the cpu), and I saw tons
>  of gettimeofday calls in the output. Is this normal? Also, I was
>  seeing this after having removed newrelic from the project. 

Yes, if it's from the same (timer) thread that's making a bunch of
futex() calls (which you've probably filtered out) on 1.9.2.
They'll go away on Ruby 1.9.3.

On x86_64 and a modern-ish kernel + glibc, gettimeofday() isn't even
a syscall (and won't show up on strace):
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-General_System_Tuning-gettimeofday_speedup.html
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2011-09-01 20:54 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-02 20:09 Strange quit behavior James Cox
2011-08-02 20:34 ` Alex Sharp
2011-08-02 21:54   ` Eric Wong
2011-08-05  4:09     ` Alex Sharp
2011-08-05  4:12       ` Alex Sharp
2011-08-05  8:07         ` Eric Wong
2011-08-17  4:26           ` Alex Sharp
2011-08-17  9:22             ` Eric Wong
2011-08-17 20:13               ` Eric Wong
2011-08-18 23:13                 ` Alex Sharp
2011-08-19  1:53                   ` Eric Wong
2011-08-19  9:42                     ` Alex Sharp
2011-08-19 16:51                       ` Eric Wong
2011-08-23  2:59                       ` Alex Sharp
2011-08-23  7:12                         ` Eric Wong
2011-08-23 16:49                           ` Alex Sharp
2011-08-23 20:23                             ` Eric Wong
2011-08-31  0:33               ` Eric Wong
2011-09-01 18:45                 ` Alex Sharp
2011-09-01 19:46                   ` Eric Wong
2011-09-01 19:57                   ` Alex Sharp
2011-09-01 20:26                     ` Eric Wong
2011-08-02 20:45 ` cliftonk
2011-08-02 21:53 ` Eric Wong
2011-08-02 22:46   ` James Cox
2011-08-02 23:08     ` Eric Wong
2011-08-02 23:49       ` Alex Sharp
2011-08-03  0:34   ` Eric Wong
2011-08-03  1:36     ` Eric Wong
2011-08-19 22:18     ` Eric Wong

Code repositories for project(s) associated with this public inbox

	https://yhbt.net/unicorn.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).