unicorn Ruby/Rack server user+dev discussion/patches/pulls/bugs/help
* feature request - when_ready() hook
From: Suraj Kurapati @ 2009-11-26  4:50 UTC
  To: unicorn list

Hello,

I've been trying to achieve truly transparent zero downtime deploys
with Unicorn and Rails for some time now (using SIGUSR2 and SIGQUIT
strategy) and I've always hit the problem of my "last worker sends
SIGQUIT to the old master" logic being executed way too soon.

In particular, I tried killing the old master in:

* before_fork() -- approx. 2 minute downtime
* after_fork() -- approx. 2 minute downtime
* storing the old-master-killing logic inside a lambda in after_fork()
(for the last worker only) and later executing that lambda in Rails'
config.after_initialize() hook -- approx. 20 second downtime

As you can see, the more I delayed the execution of that "killing the
old master" logic, the closer I got to zero downtime deploys.  In this
manner, I request the addition of a when_ready() hook which is
executed just after Unicorn prints "worker=# ready!" to its error log
inside Unicorn::HttpServer#worker_loop().
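
Roughly, I imagine using such a hook like this (just a sketch -- the
when_ready() directive does not exist yet, so the name and signature
here are my assumption):

  when_ready do |server, worker|
    # only the last worker of the new Unicorn kills the old master
    if worker.nr == server.worker_processes - 1
      old_pid_file = server.config[:pid].to_s + '.oldbin'
      begin
        Process.kill :QUIT, File.read(old_pid_file).to_i
      rescue Errno::ENOENT, Errno::ESRCH
        # old master is already gone
      end
    end
  end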

I am happy to implement this (with tests) and submit a patch, but I
first wanted to know your opinion on this approach.  (I should note
that my unicorn setup does not run very close to the memory limit of
its host; instead, the amount of free memory is more than double of
the current unicorn memory footprint, so I can safely spawn a second
set of Unicorn master + workers (via SIGUSR2) without worrying about
the SIGTTOU before_fork() strategy shown in the Unicorn configuration
example.)
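
For reference, the before_fork() strategy I mean is roughly the
following, as shown in the example configuration shipped with Unicorn
(quoted from memory, so treat it as a sketch):

  before_fork do |server, worker|
    old_pid = "#{server.config[:pid]}.oldbin"
    if old_pid != server.pid
      begin
        # each new worker nudges one old worker out with TTOU; the last
        # new worker QUITs the old master outright
        sig = (worker.nr + 1) >= server.worker_processes ? :QUIT : :TTOU
        Process.kill(sig, File.read(old_pid).to_i)
      rescue Errno::ENOENT, Errno::ESRCH
      end
    end
  end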

Thanks for your consideration.

* Re: feature request - when_ready() hook
From: Eric Wong @ 2009-11-26  6:05 UTC
  To: unicorn list

Suraj Kurapati <sunaku@gmail.com> wrote:
> Hello,
> 
> I've been trying to achieve truly transparent zero downtime deploys
> with Unicorn and Rails for some time now (using SIGUSR2 and SIGQUIT
> strategy) and I've always hit the problem of my "last worker sends
> SIGQUIT to the old master" logic being executed way too soon.
> 
> In particular, I tried killing the old master in:
> 
> * before_fork() -- approx. 2 minute downtime
> * after_fork() -- approx. 2 minute downtime
> * storing the old-master-killing logic inside a lambda in after_fork()
> (for the last worker only) and later executing that lambda in Rails'
> config.after_initialize() hook -- approx. 20 second downtime

Hi Suraj,

I'm looking at those times and can't help but wonder if there's
something very weird/broken with your setup..  20 seconds is already an
eternity (even with preload_app=false), but 2 minutes?(!).

Are you doing per-process listeners and retrying?  The new ones could be
fighting for a port held by the old workers...  Other than that...

I have many questions, because those times look extremely scary to me
and I wonder if such a hook would only be masking the symptoms of
a bigger problem.

What kind of software/hardware stack are you running?
(please don't say NSLU2 :)

How many workers?

How heavy is traffic on the site when you're deploying?

How long does it take for a single worker to get ready and start
serving requests?

Are you using preload_app?  It should be faster if you do, but there
really appears to be something else wrong based on those times.

Thanks in advance.

> As you can see, the more I delayed the execution of that "killing the
> old master" logic, the closer I got to zero downtime deploys.  In this
> manner, I request the addition of a when_ready() hook which is
> executed just after Unicorn prints "worker=# ready!" to its error log
> inside Unicorn::HttpServer#worker_loop().

At this stage, maybe even implementing something as middleware and
making it hook into request processing (that way you really know the
worker is really responding to requests) is the way to go...
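
Something along these lines, perhaps (an untested sketch; the
".oldbin" pid file path follows the usual convention, so adjust it
for your setup), caching the check so that only the first request
pays for it:

  class KillOldMasterOnce
    def initialize(app, old_pid_file)
      @app, @old_pid_file, @done = app, old_pid_file, false
    end

    def call(env)
      unless @done
        @done = true   # remember the check so later requests skip it
        begin
          Process.kill(:QUIT, File.read(@old_pid_file).to_i)
        rescue Errno::ENOENT, Errno::ESRCH
          # old master is already gone, nothing to do
        end
      end
      @app.call(env)
    end
  end

  # e.g. in config.ru:
  #   use KillOldMasterOnce, '/path/to/unicorn.pid.oldbin'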

> I am happy to implement this (with tests) and submit a patch, but I
> first wanted to know your opinion on this approach.  (I should note
> that my unicorn setup does not run very close to the memory limit of
> its host; instead, the amount of free memory is more than double of
> the current unicorn memory footprint, so I can safely spawn a second
> set of Unicorn master + workers (via SIGUSR2) without worrying about
> the SIGTTOU before_fork() strategy shown in the Unicorn configuration
> example.)

Given your memory availability, I wouldn't even worry about the
automated killing of the old workers.

Automatically killing old workers means you need a redeploy to roll back
changes, whereas if you SIGWINCH the old set away, you can HUP the old
master to bring them back in case the new set is having problems.
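
In other words, the whole cycle can be driven by hand with nothing
but signals -- a sketch, assuming the default pid file layout and a
daemonized master (SIGWINCH is ignored when the master runs in the
foreground):

  old_master = File.read('/path/to/unicorn.pid').to_i
  Process.kill(:USR2,  old_master)   # bring up the new master + workers
  # ...wait for the new workers to finish booting, then:
  Process.kill(:WINCH, old_master)   # old workers exit; old master stays resident
  # if the new set looks healthy, retire the old master for good:
  Process.kill(:QUIT,  old_master)
  # if the new set misbehaves, bring the old workers back instead:
  # Process.kill(:HUP,  old_master)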

-- 
Eric Wong

* Re: feature request - when_ready() hook
From: Suraj Kurapati @ 2009-11-26 19:05 UTC
  To: unicorn list

On Wed, Nov 25, 2009 at 10:05 PM, Eric Wong <normalperson@yhbt.net> wrote:
> Suraj Kurapati <sunaku@gmail.com> wrote:
>> * before_fork() -- approx. 2 minute downtime
>> * after_fork() -- approx. 2 minute downtime
>> * storing the old-master-killing logic inside a lambda in after_fork()
>> (for the last worker only) and later executing that lambda in Rails'
>> config.after_initialize() hook -- approx. 20 second downtime
>
> I'm looking at those times and can't help but wonder if there's
> something very weird/broken with your setup..  20 seconds is already an
> eternity (even with preload_app=false), but 2 minutes?(!).

Yes, I am using preload_app=false.  These delays mainly come from
establishing DB connections and loading XML datasets into the Rails
app.  Our production DBs are pretty slow to give out new connections.
The startup time is much faster in development, where I use SQLite.

Please note that the reported downtimes are shocking only because they
were *visible* downtimes, where the last worker of the new Unicorn
master killed the old Unicorn master too soon.  IMHO, it doesn't
matter how long it takes for the Rails app to become ready, so long as
the old Unicorn master + workers continue to exist & service requests
until the new Unicorn master + workers take over.

> Are you doing per-process listeners and retrying?  The new ones could be
> fighting for a port held by the old workers...  Other than that...

No, I have one listen() call (on a UNIX socket) at the top level of my
Unicorn configuration file.  Nothing fancy.

> I have many questions, because those times look extremely scary to me
> and I wonder if such a hook would only be masking the symptoms of
> a bigger problem.
>
> What kind of software/hardware stack are you running?
> (please don't say NSLU2 :)

The hardware is a VM of some kind, running on a VM server farm under CentOS 4.

The software is Ruby 1.9.1-p243 with Rails 2.3.3, running on Unicorn
0.95.1, behind Nginx, behind M$ IIS.

> How many workers?

Three.

> How heavy is traffic on the site when you're deploying?

About 15 to 20 users.

> How long does it take for a single worker to get ready and start
> serving requests?

Approximately 2 minutes.

> Are you using preload_app?  It should be faster if you do, but there
> really appears to be something else wrong based on those times.

I was for a few weeks, but I stopped because the XML dataset loading
(see above) kept increasing the master's (and the new set of workers')
memory footprint by 1.5x every time Unicorn was restarted via SIGUSR2.

>> As you can see, the more I delayed the execution of that "killing the
>> old master" logic, the closer I got to zero downtime deploys.  In this
>> manner, I request the addition of a when_ready() hook which is
>> executed just after Unicorn prints "worker=# ready!" to its error log
>> inside Unicorn::HttpServer#worker_loop().
>
> At this stage, maybe even implementing something as middleware and
> making it hook into request processing (that way you really know the
> worker is really responding to requests) is the way to go...

Hmm, but that would incur a penalty on each request (check if I've
already killed the old master and do it if necessary).  I'm pretty
confident that killing the old master in the when_ready() hook will be
Good Enough for my setup (at most I expect to see 1-2 second
"down"time).  Let me try this out and I'll tell you the results &
submit a patch.

>> my unicorn setup does not run very close to the memory limit of
>> its host; instead, the amount of free memory is more than double of
>> the current unicorn memory footprint, so I can safely spawn a second
>> set of Unicorn master + workers (via SIGUSR2) without worrying about
>> the SIGTTOU before_fork() strategy shown in the Unicorn configuration
>> example.)
>
> Given your memory availability, I wouldn't even worry about the
> automated killing of the old workers.
>
> Automatically killing old workers means you need a redeploy to roll back
> changes, whereas if you SIGWINCH the old set away, you can HUP the old
> master to bring them back in case the new set is having problems.

Wow this is cool.  Perhaps this strategy could be mentioned in the
documentation?

Thanks for your consideration.

* Re: feature request - when_ready() hook
From: Eric Wong @ 2009-11-26 19:53 UTC
  To: unicorn list

Suraj Kurapati <sunaku@gmail.com> wrote:
> On Wed, Nov 25, 2009 at 10:05 PM, Eric Wong <normalperson@yhbt.net> wrote:
> > Suraj Kurapati <sunaku@gmail.com> wrote:
> >> * before_fork() -- approx. 2 minute downtime
> >> * after_fork() -- approx. 2 minute downtime
> >> * storing the old-master-killing logic inside a lambda in after_fork()
> >> (for the last worker only) and later executing that lambda in Rails'
> >> config.after_initialize() hook -- approx. 20 second downtime
> >
> > I'm looking at those times and can't help but wonder if there's
> > something very weird/broken with your setup..  20 seconds is already an
> > eternity (even with preload_app=false), but 2 minutes?(!).
> 
> Yes, I am using preload_app=false.  These delays mainly come from
> establishing DB connections and loading XML datasets into the Rails
> app.  Our production DBs are pretty slow to give out new connections.
> The startup time is much faster in development, where I use SQLite.

Yikes.  Is there some sort of misconfiguration in the DBs?  Perhaps
they're trying to do reverse DNS on every client connection; that can
really ruin your day if your app servers don't have reverse DNS entries.

> Please note that the reported downtimes are shocking only because they
> were *visible* downtimes, where the last worker of the new Unicorn
> master killed the old Unicorn master too soon.  IMHO, it doesn't
> matter how long it takes for the Rails app to become ready, so long as
> the old Unicorn master + workers continue to exist & service requests
> until the new Unicorn master + workers take over.

<snip>

> > Are you using preload_app?  It should be faster if you do, but there
> > really appears to be something else wrong based on those times.
> 
> I was for a few weeks, but I stopped because the XML dataset loading
> (see above) kept increasing the master's (and the new set of workers')
> memory footprint by 1.5x every time Unicorn was restarted via SIGUSR2.

Side problem, but another thing that makes me go "Huh?"

Did the new master's footprint increase?  Are you mmap()-ing the XML
dataset?  Is RSS increasing or just VmSize?  Unicorn sets FD_CLOEXEC on
the first 1024 (non-listener) file descriptors, so combined with exec(),
that should give the new master (and subsequent workers) a clean memory
footprint.
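
(FD_CLOEXEC is the standard close-on-exec flag; as an illustration --
not Unicorn's actual code -- setting it by hand in Ruby looks like
this:)

  require 'fcntl'

  io = File.open('/dev/null')                  # stand-in for an inherited descriptor
  io.fcntl(Fcntl::F_SETFD, Fcntl::FD_CLOEXEC)  # close this FD across exec()
  # when the master re-exec()s itself on SIGUSR2, descriptors flagged
  # this way are closed automatically, so the new master starts clean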

> >> As you can see, the more I delayed the execution of that "killing the
> >> old master" logic, the closer I got to zero downtime deploys.  In this
> >> manner, I request the addition of a when_ready() hook which is
> >> executed just after Unicorn prints "worker=# ready!" to its error log
> >> inside Unicorn::HttpServer#worker_loop().
> >
> > At this stage, maybe even implementing something as middleware and
> > making it hook into request processing (that way you really know the
> > worker is really responding to requests) is the way to go...
> 
> Hmm, but that would incur a penalty on each request (check if I've
> already killed the old master and do it if necessary).  I'm pretty
> confident that killing the old master in the when_ready() hook will be
> Good Enough for my setup (at most I expect to see 1-2 second
> "down"time).  Let me try this out and I'll tell you the results &
> submit a patch.

I don't think a runtime condition would be any more expensive than all
the routing/filters/checks that any Rails app already does and you can
cache the result into a global variable.

As you may have noticed, I'm quite hesitant to add new features,
especially for uncommon/rare cases.  Things like supporting the
"working_directory" directive and user-switching took *months* of
debating with myself before they were finally added.

> >> my unicorn setup does not run very close to the memory limit of
> >> its host; instead, the amount of free memory is more than double of
> >> the current unicorn memory footprint, so I can safely spawn a second
> >> set of Unicorn master + workers (via SIGUSR2) without worrying about
> >> the SIGTTOU before_fork() strategy shown in the Unicorn configuration
> >> example.)
> >
> > Given your memory availability, I wouldn't even worry about the
> > automated killing of the old workers.
> >
> > Automatically killing old workers means you need a redeploy to roll back
> > changes, whereas if you SIGWINCH the old set away, you can HUP the old
> > master to bring them back in case the new set is having problems.
> 
> Wow this is cool.  Perhaps this strategy could be mentioned in the
> documentation?

It's already covered at the bottom of the SIGNALS document
(http://unicorn.bogomips.org/SIGNALS.html).

> Thanks for your consideration.

No problem, let us know if it's the DB doing reverse DNS because
I've seen that to be a problem in a lot of cases.

-- 
Eric Wong

* Re: feature request - when_ready() hook
From: Suraj Kurapati @ 2009-11-30 23:47 UTC
  To: unicorn list

On Thu, Nov 26, 2009 at 11:53 AM, Eric Wong <normalperson@yhbt.net> wrote:
> Suraj Kurapati <sunaku@gmail.com> wrote:
>> the XML dataset loading
>> (see above) kept increasing the master's (and the new set of workers')
>> memory footprint by 1.5x every time Unicorn was restarted via SIGUSR2.
>
> Side problem, but another thing that makes me go "Huh?"
>
> Did the new master's footprint increase?

Yes, but this seems to have been my fault.  I programmed the master
(via the Unicorn configuration file) to accept a SIGPWR signal, which
made it (1) reload the XML dataset (adding memory bloat) and (2) send
SIGUSR2 to itself (thereby providing the fresh XML data to the new
Unicorn master + workers).  The idea was to cut down the time required
to load the XML dataset.  Unfortunately, the extra memory bloat added
by reloading the XML dataset carried over to the new Unicorn
generation as a side effect.

> Are you mmap()-ing the XML  dataset?

Nope, nothing fancy like that.

> Is RSS increasing or just VmSize?

Hmm, I did not pay attention to these individual stats.  I just saw
that the memory% statistic in `ps xv` would increase by 10% every time
Unicorn was restarted through my SIGPWR handler.  Again, this was my
fault for using such a non-standard approach.

> Unicorn sets FD_CLOEXEC on
> the first 1024 (non-listener) file descriptors, so combined with exec(),
> that should give the new master (and subsequent workers) a clean memory
> footprint.

Thanks.  This is good to know, now that I'm using the standard
SIGUSR2/QUIT method.

>> > At this stage, maybe even implementing something as middleware and
>> > making it hook into request processing (that way you really know the
>> > worker is really responding to requests) is the way to go...
>>
>> Hmm, but that would incur a penalty on each request (check if I've
>> already killed the old master and do it if necessary).
>
> I don't think a runtime condition would be any more expensive than all
> the routing/filters/checks that any Rails app already does and you can
> cache the result into a global variable.

Okay.

> As you may have noticed, I'm quite hesitant to add new features,
> especially for uncommon/rare cases.  Things like supporting the
> "working_directory" directive and user-switching took *months* of
> debating with myself before they were finally added.

No problem.  I ended up using a simple workaround for this whole
problem:  from Capistrano, I send SIGUSR2 to the existing Unicorn
master (which will become the old Unicorn master), wait 90 seconds,
and then send SIGQUIT to the old Unicorn master.  There's nothing
fancy in my Unicorn configuration file anymore -- no before/after
hooks at all; just a worker_processes count and a listen directive.
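
The Capistrano side amounts to roughly this (Capistrano 2 syntax; the
pid file path here is a placeholder for the one my setup actually
uses):

  task :restart, :roles => :app do
    # spawn the new master + workers alongside the old ones
    run "kill -USR2 `cat /path/to/unicorn.pid`"
    # give the new set time to boot (slow DB connections, XML datasets)
    sleep 90
    # retire the old master + workers
    run "kill -QUIT `cat /path/to/unicorn.pid.oldbin`"
  end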

This configuration is working out pretty well, and I have finally
achieved zero downtime deploys.  (Yay! :-)  The only thing I'm worried
about is that I'll have to keep adjusting this timeout as the
infrastructure my app depends upon becomes slower/faster.  A
when_ready() hook would really do wonders for me, and I will implement
and try it as planned when I get some time.

> let us know if it's the DB doing reverse DNS because
> I've seen that to be a problem in a lot of cases.

I'll ask about this and let you know.

Thanks for your consideration.

* Re: feature request - when_ready() hook
From: Suraj Kurapati @ 2010-03-08 22:58 UTC
  To: unicorn list

Hi Eric,

On Mon, Nov 30, 2009 at 3:47 PM, Suraj Kurapati wrote:
> from Capistrano, I send SIGUSR2 to the existing Unicorn master (which
> will become the old Unicorn master), wait 90 seconds, and then send
> SIGQUIT to the old Unicorn master. [...] The only thing I'm worried
> about is that I'll have to keep adjusting this timeout as the
> infrastructure my app depends upon becomes slower/faster.  A
> when_ready() hook would really do wonders for me

Inspired by your solution[1] to a recent question about forking, I
solved this problem by temporarily sneaking into the build_app! method
and killing the old Unicorn master after the Rails app has been built:

  before_fork do |server, worker|
    #
    # the following lets the last worker spawned by a new master
    # process kill off the old master process with a SIGQUIT, but
    # only after that worker's Rails app has been built and is ready
    # to serve requests (important in the "preload_app false" case
    # when doing a transparent upgrade)
    #
    old_pid_file = server.config[:pid].to_s + '.oldbin'

    if File.exist? old_pid_file and
      server.pid != old_pid_file and
      worker.nr == server.worker_processes-1
    then
      #
      # wait (in the last worker process of the new Unicorn) until the
      # Rails app is built and ready to serve -- we detect this by
      # sneaking into Unicorn's build_app! method -- before stopping
      # the old Unicorn
      #
      orig_meth_name = :build_app!
      orig_meth_impl = server.method(orig_meth_name)

      server_metaclass = class << server; self; end
      server_metaclass.class_eval do

        # replace Unicorn's method with our own sneaky version
        define_method orig_meth_name do

          # behave like Unicorn's original method
          orig_meth_impl.call

          # do our sneaky business! (kill the old Unicorn)
          begin
            Process.kill :QUIT, File.read(old_pid_file).to_i
          rescue Errno::ENOENT, Errno::ESRCH
            # ignore
          end

          # restore Unicorn's original method
          server_metaclass.class_eval do
            define_method orig_meth_name, orig_meth_impl
          end

        end
      end
    end
  end

Thanks for your help! :)

[1]: http://article.gmane.org/gmane.comp.lang.ruby.unicorn.general/425