unicorn Ruby/Rack server user+dev discussion/patches/pulls/bugs/help
* On USR2, new master runs with same PID
@ 2015-03-12  1:04 Kevin Yank
  2015-03-12  1:45 ` Eric Wong
  0 siblings, 1 reply; 7+ messages in thread
From: Kevin Yank @ 2015-03-12  1:04 UTC (permalink / raw)
  To: unicorn-public

Having recently migrated our Rails app to MRI 2.2.0 (which may or may not be related), we’re experiencing problems with our Unicorn zero-downtime restarts.

When I send USR2 to the master process (PID 19216 in this example), I get the following in the Unicorn log:

I, [2015-03-11T23:47:33.992274 #6848]  INFO -- : executing ["/srv/ourapp/shared/bundle/ruby/2.2.0/bin/unicorn", "/srv/ourapp/current/config.ru", "-Dc", "/srv/ourapp/shared/config/unicorn.rb", {10=>#<Kgio::UNIXServer:/srv/ourapp/shared/sockets/unicorn.sock>}] (in /srv/ourapp/releases/a0e8b5df474ad5129200654f92a76af00a750f47)
I, [2015-03-11T23:47:36.504235 #6848]  INFO -- : inherited addr=/srv/ourapp/shared/sockets/unicorn.sock fd=10
/srv/ourapp/shared/bundle/ruby/2.2.0/gems/unicorn-4.8.1/lib/unicorn/http_server.rb:206:in `pid=': Already running on PID:19216 (or pid=/srv/ourapp/shared/pids/unicorn.pid is stale) (ArgumentError)
  from /srv/ourapp/shared/bundle/ruby/2.2.0/gems/unicorn-4.8.1/lib/unicorn/http_server.rb:134:in `start'
  from /srv/ourapp/shared/bundle/ruby/2.2.0/gems/unicorn-4.8.1/bin/unicorn:126:in `<top (required)>'
  from /srv/ourapp/shared/bundle/ruby/2.2.0/bin/unicorn:23:in `load'
  from /srv/ourapp/shared/bundle/ruby/2.2.0/bin/unicorn:23:in `<main>'
E, [2015-03-11T23:47:36.519549 #19216] ERROR -- : reaped #<Process::Status: pid 6848 exit 1> exec()-ed
E, [2015-03-11T23:47:36.520296 #19216] ERROR -- : master loop error: Already running on PID:19216 (or pid=/srv/ourapp/shared/pids/unicorn.pid is stale) (ArgumentError)
E, [2015-03-11T23:47:36.520496 #19216] ERROR -- : /srv/ourapp/shared/bundle/ruby/2.2.0/gems/unicorn-4.8.1/lib/unicorn/http_server.rb:206:in `pid='
E, [2015-03-11T23:47:36.520650 #19216] ERROR -- : /srv/ourapp/shared/bundle/ruby/2.2.0/gems/unicorn-4.8.1/lib/unicorn/http_server.rb:404:in `reap_all_workers'
E, [2015-03-11T23:47:36.520790 #19216] ERROR -- : /srv/ourapp/shared/bundle/ruby/2.2.0/gems/unicorn-4.8.1/lib/unicorn/http_server.rb:279:in `join'
E, [2015-03-11T23:47:36.520928 #19216] ERROR -- : /srv/ourapp/shared/bundle/ruby/2.2.0/gems/unicorn-4.8.1/bin/unicorn:126:in `<top (required)>'
E, [2015-03-11T23:47:36.521115 #19216] ERROR -- : /srv/ourapp/shared/bundle/ruby/2.2.0/bin/unicorn:23:in `load'
E, [2015-03-11T23:47:36.521254 #19216] ERROR -- : /srv/ourapp/shared/bundle/ruby/2.2.0/bin/unicorn:23:in `<main>'

And when I check, indeed, there are now both unicorn.pid and unicorn.pid.oldbin, each containing 19216.

What could cause this situation to arise?


Here’s my unicorn.rb FWIW:

# Set your full path to application.
app_path = "/srv/ourapp/current"

# Set unicorn options
worker_processes 3
preload_app true
timeout 30
listen "/srv/ourapp/shared/sockets/unicorn.sock", :backlog => 64

# Spawn unicorn master worker for user deploy (group: deploy)
user 'deploy', 'deploy'

# Fill path to your app
working_directory app_path

# Should be 'production' by default, otherwise use other env
rails_env = ENV['RAILS_ENV'] || 'production'

# Log everything to one file
stderr_path "/srv/ourapp/shared/log/unicorn.log"
stdout_path "/srv/ourapp/shared/log/unicorn.log"

# Set master PID location
pid "/srv/ourapp/shared/pids/unicorn.pid"

before_exec do |server|
  ENV["BUNDLE_GEMFILE"] = "#{app_path}/Gemfile"
end

before_fork do |server, worker|
  ActiveRecord::Base.connection.disconnect!

  sleep 10

  old_pid = "#{server.config[:pid]}.oldbin"
  if File.exists?(old_pid) && server.pid != old_pid
    begin
      Process.kill("QUIT", File.read(old_pid).to_i)
    rescue Errno::ENOENT, Errno::ESRCH
      # someone else did our job for us
    end
  end
end

after_fork do |server, worker|
  ActiveRecord::Base.establish_connection

  Sidekiq.configure_client do |config|
    config.redis = { namespace: 'sidekiq' }
  end
end


--
Kevin Yank
Chief Technology Officer, Avalanche Technology Group
http://avalanche.com.au/

ph: +61 4 2241 0083



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: On USR2, new master runs with same PID
  2015-03-12  1:04 On USR2, new master runs with same PID Kevin Yank
@ 2015-03-12  1:45 ` Eric Wong
  2015-03-12  6:26   ` Kevin Yank
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Wong @ 2015-03-12  1:45 UTC (permalink / raw)
  To: Kevin Yank; +Cc: unicorn-public

Kevin Yank <kyank@avalanche.com.au> wrote:
> Having recently migrated our Rails app to MRI 2.2.0 (which may or may
> not be related), we’re experiencing problems with our Unicorn
> zero-downtime restarts.
> 
> When I send USR2 to the master process (PID 19216 in this example), I
> get the following in the Unicorn log:
>
> I, [2015-03-11T23:47:33.992274 #6848]  INFO -- : executing ["/srv/ourapp/shared/bundle/ruby/2.2.0/bin/unicorn", "/srv/ourapp/current/config.ru", "-Dc", "/srv/ourapp/shared/config/unicorn.rb", {10=>#<Kgio::UNIXServer:/srv/ourapp/shared/sockets/unicorn.sock>}] (in /srv/ourapp/releases/a0e8b5df474ad5129200654f92a76af00a750f47)
> I, [2015-03-11T23:47:36.504235 #6848]  INFO -- : inherited addr=/srv/ourapp/shared/sockets/unicorn.sock fd=10
> /srv/ourapp/shared/bundle/ruby/2.2.0/gems/unicorn-4.8.1/lib/unicorn/http_server.rb:206:in `pid=': Already running on PID:19216 (or pid=/srv/ourapp/shared/pids/unicorn.pid is stale) (ArgumentError)

Nothing suspicious until the above line...

<snip>

> And when I check, indeed, there are now both unicorn.pid and
> unicorn.pid.oldbin, each containing 19216.
> 
> What could cause this situation to arise?

Any chance you have a process manager or something else creating the
(non-.oldbin) pid file behind unicorn's back?

Can you check your process table to ensure there aren't multiple
unicorn instances running off the same config and pid files, too?

Also, was there any other activity (another USR2 or HUP) in the logs
a few seconds beforehand?

What kind of filesystem / kernel is the pid file on?
A network FS or anything without the consistency guarantees provided
by POSIX would not work for pid files.

pid files are unfortunately prone to nasty race conditions,
but I'm not sure the failure you're seeing happens very often.

Likewise, the check for stale Unix domain socket paths at startup is
inevitably racy, too, but the window is usually small enough to be
unnoticeable.  But yes, just in case, check the process table to make
sure there aren't multiple, non-related masters running off the
same paths.

<snip>

> before_fork do |server, worker|
>   ActiveRecord::Base.connection.disconnect!

-------------------8<-------------------
>   sleep 10
> 
>   old_pid = "#{server.config[:pid]}.oldbin"
>   if File.exists?(old_pid) && server.pid != old_pid
>     begin
>       Process.kill("QUIT", File.read(old_pid).to_i)
>     rescue Errno::ENOENT, Errno::ESRCH
>       # someone else did our job for us
>     end
>   end
-------------------8<-------------------

I'd get rid of that hunk starting with the "sleep 10" (at least while
debugging this issue).  If you did a USR2 previously, maybe it could
affect the current USR2 upgrade process.  Sleeping so long in the master
like that is pretty bad: it throws off timing and delays signal handling.

That's a pretty fragile config and I regret ever including it in the
example config files.
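A minimal pair of hooks without that hunk might look like the following (a sketch, not code from the thread; the old master would instead be retired with a QUIT sent from the deploy script):

```ruby
# Sketch only: before_fork/after_fork without the sleep/auto-QUIT hunk.
# The old master is stopped later from the deploy script, e.g.
#   kill -QUIT $(cat /path/to/unicorn.pid.oldbin)
# rather than from inside these hooks.
before_fork do |server, worker|
  # release the master's DB connection before forking workers
  ActiveRecord::Base.connection.disconnect!
end

after_fork do |server, worker|
  # each worker gets its own DB connection
  ActiveRecord::Base.establish_connection
end
```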


* Re: On USR2, new master runs with same PID
  2015-03-12  1:45 ` Eric Wong
@ 2015-03-12  6:26   ` Kevin Yank
  2015-03-12  6:45     ` Eric Wong
  0 siblings, 1 reply; 7+ messages in thread
From: Kevin Yank @ 2015-03-12  6:26 UTC (permalink / raw)
  To: Eric Wong; +Cc: unicorn-public


Thanks for your help, Eric!

> On 12 Mar 2015, at 12:45 pm, Eric Wong <e@80x24.org> wrote:
> 
> Kevin Yank <kyank@avalanche.com.au> wrote:
>> When I send USR2 to the master process (PID 19216 in this example), I
>> get the following in the Unicorn log:
>> 
>> I, [2015-03-11T23:47:33.992274 #6848]  INFO -- : executing ["/srv/ourapp/shared/bundle/ruby/2.2.0/bin/unicorn", "/srv/ourapp/current/config.ru", "-Dc", "/srv/ourapp/shared/config/unicorn.rb", {10=>#<Kgio::UNIXServer:/srv/ourapp/shared/sockets/unicorn.sock>}] (in /srv/ourapp/releases/a0e8b5df474ad5129200654f92a76af00a750f47)
>> I, [2015-03-11T23:47:36.504235 #6848]  INFO -- : inherited addr=/srv/ourapp/shared/sockets/unicorn.sock fd=10
>> /srv/ourapp/shared/bundle/ruby/2.2.0/gems/unicorn-4.8.1/lib/unicorn/http_server.rb:206:in `pid=': Already running on PID:19216 (or pid=/srv/ourapp/shared/pids/unicorn.pid is stale) (ArgumentError)
> 
> Nothing suspicious until the above line...

That’s right.

>> And when I check, indeed, there are now both unicorn.pid and
>> unicorn.pid.oldbin, each containing 19216.
> 
> Any chance you have a process manager or something else creating the
> (non-.oldbin) pid file behind unicorn's back?

It’s possible; I’m using eye (https://github.com/kostya/eye) as a process monitor. I’m not aware of it writing .pid files for processes that self-daemonize like Unicorn, though. And once one of my servers “goes bad” (i.e. Unicorn starts failing to restart in response to a USR2), it does so 100% consistently until I stop and restart Unicorn entirely. Based on that, I don’t believe it’s a race condition where my process monitor is slipping in a new unicorn.pid file some of the time.

> Can you check your process table to ensure there aren't multiple
> unicorn instances running off the same config and pid files, too?

As far as I can tell, I have some servers that have gotten themselves into this “USR2 restart fails” state, while others are working just fine. In both cases, the Unicorn process tree (as shown in htop) looks like this “at rest” (i.e. no deployments/restarts in progress):

unicorn master
`- unicorn master
`- unicorn worker[2]
|  `- unicorn worker[2]
`- unicorn worker[1]
|  `- unicorn worker[1]
`- unicorn worker[0]
   `- unicorn worker[0]

At first glance I’d definitely say it appears that I have two masters running from the same config files. However, there’s only one unicorn.pid file of course (the root process in the tree above), and when I try to kill -TERM the master process that doesn’t have a .pid file, the entire process tree exits. Am I misinterpreting the process table? Is this process list actually normal?

Thus far I’ve been unable to find any difference in state between a properly-behaving server and a misbehaving server, apart from the behaviour of the Unicorn master when it receives a USR2.

> Also, was there any other activity (another USR2 or HUP) in the logs
> a few seconds beforehand?

No, didn’t see anything like that (and I was looking for it).

> What kind of filesystem / kernel is the pid file on?

EXT4 / Ubuntu Server 12.04 LTS

> A network FS or anything without the consistency guarantees provided
> by POSIX would not work for pid files.

Given my environment above, I should be OK, right?

> pid files are unfortunately prone to nasty race conditions,
> but I'm not sure the failure you're seeing happens very often.

This has been happening pretty frequently across multiple server instances, and again once it starts happening on an instance, it keeps happening 100% of the time (until I restart Unicorn completely). So it’s not a rare edge case.

> -------------------8<-------------------
>>  sleep 10
>> 
>>  old_pid = "#{server.config[:pid]}.oldbin"
>>  if File.exists?(old_pid) && server.pid != old_pid
>>    begin
>>      Process.kill("QUIT", File.read(old_pid).to_i)
>>    rescue Errno::ENOENT, Errno::ESRCH
>>      # someone else did our job for us
>>    end
>>  end
> -------------------8<-------------------
> 
> I'd get rid of that hunk starting with the "sleep 10" (at least while
> debugging this issue).

If I simply delete this hunk, I’ll have old masters left running on my servers because they’ll never get sent the QUIT signal. I can definitely remove it temporarily (and kill the old master myself) while debugging, though.

> If you did a USR2 previously, maybe it could
> affect the current USR2 upgrade process.  Sleeping so long in the master
> like that is pretty bad: it throws off timing and delays signal handling.

I’d definitely like to get rid of the sleep, as my restarts definitely feel slow. I’m not clear on what a better approach would be, though.

> That's a pretty fragile config and I regret ever including it in the
> example config files

Any better examples/docs you’d recommend I consult for guidance? Or is expecting to achieve a robust zero-downtime restart using before_fork/after_fork hooks unrealistic?



* Re: On USR2, new master runs with same PID
  2015-03-12  6:26   ` Kevin Yank
@ 2015-03-12  6:45     ` Eric Wong
  2015-03-20  1:55       ` Kevin Yank
  2015-03-20  1:58       ` Kevin Yank
  0 siblings, 2 replies; 7+ messages in thread
From: Eric Wong @ 2015-03-12  6:45 UTC (permalink / raw)
  To: Kevin Yank; +Cc: unicorn-public

Kevin Yank <kyank@avalanche.com.au> wrote:
> It’s possible; I’m using eye (https://github.com/kostya/eye) as a

Aha!  I forgot about that one, try upgrading to unicorn 4.8.3 which
fixed this issue last year.  ref:

http://bogomips.org/unicorn-public/m/20140504023338.GA583@dcvr.yhbt.net.html
http://bogomips.org/unicorn-public/m/20140502231558.GA4215@dcvr.yhbt.net.html

<snip>

> unicorn master
> `- unicorn master 
> `- unicorn worker[2]
> |  `- unicorn worker[2]
> `- unicorn worker[1]
> |  `- unicorn worker[1]
> `- unicorn worker[0]
>    `- unicorn worker[0]

The second master there is the new one, and the first is the old one.

I was more concerned if there are multiple masters with no parent-child
relationship to each other; but that seems to not be the case.

But yeah, give 4.8.3 a try since it should've fixed the problem
(which was reported and confirmed privately).

> This has been happening pretty frequently across multiple server
> instances, and again once it starts happening on an instance, it keeps
> happening 100% of the time (until I restart Unicorn completely). So
> it’s not a rare edge case.

You can probably recover by removing the pid files entirely
and HUP the (only) master; but I haven't tried it...
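In script form, that recovery (untested, as noted) might look like this sketch; the function name is mine, and passing "echo" as the second argument previews the signal instead of sending it:

```shell
# Untested recovery sketch, per the caveat above.
# $1 is the pid file path; pass "echo" as $2 for a dry run.
recover_master() {
  pid_file=$1
  kill_cmd=${2:-kill}
  master=$(cat "$pid_file")            # read the pid before removing the file
  rm -f "$pid_file" "$pid_file.oldbin" # clear both pid files
  $kill_cmd -HUP "$master"             # HUP the surviving master
}
```

e.g. `recover_master /srv/ourapp/shared/pids/unicorn.pid`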

> > That's a pretty fragile config and I regret ever including it in the
> > example config files

> Any better examples/docs you’d recommend I consult for guidance? Or is
> expecting to achieve a robust zero-downtime restart using
> before_fork/after_fork hooks unrealistic?

Best bet would be to run with double the workers temporarily unless
you're too low on memory (and swapping) or backend (DB) connections or
any other resource.

If you're really low on memory/connections, do WINCH + USR2 + QUIT (and
pray the new master works right :).  Clients will suffer in response
time as your new app loads, so do it during non-peak traffic and/or
shift traffic away from that machine.

You can also send TTOU a few times to reduce worker counts instead of
WINCH to nuke all of them, but I'd send all signals via deploy
script/commands rather than automatically via (synchronous) hooks.

But that's just me; probably others do things differently...
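The low-memory WINCH + USR2 + QUIT sequence above, as a dry-runnable sketch (the function name is mine; $1 is whatever pid path your config sets, and passing "echo" as $2 previews the signals):

```shell
# Sketch of the WINCH + USR2 + QUIT sequence described above.
# $1 is the pid file path; pass "echo" as $2 for a dry run.
low_memory_upgrade() {
  pid_file=$1
  kill_cmd=${2:-kill}
  master=$(cat "$pid_file")
  $kill_cmd -WINCH "$master"  # old master stops its workers, keeps the socket
  $kill_cmd -USR2 "$master"   # re-exec a new master from the updated code
  # (wait here until the new master and its workers are up)
  $kill_cmd -QUIT "$(cat "$pid_file.oldbin")"  # then retire the old master
}
```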


* Re: On USR2, new master runs with same PID
  2015-03-12  6:45     ` Eric Wong
@ 2015-03-20  1:55       ` Kevin Yank
  2015-03-20  1:58       ` Kevin Yank
  1 sibling, 0 replies; 7+ messages in thread
From: Kevin Yank @ 2015-03-20  1:55 UTC (permalink / raw)
  To: Eric Wong; +Cc: unicorn-public



> On 12 Mar 2015, at 5:45 pm, Eric Wong <e@80x24.org> wrote:
> 
> Kevin Yank <kyank@avalanche.com.au> wrote:
>> It’s possible; I’m using eye (https://github.com/kostya/eye) as a
> 
> Aha!  I forgot about that one, try upgrading to unicorn 4.8.3 which
> fixed this issue last year.  ref:
> 
> http://bogomips.org/unicorn-public/m/20140504023338.GA583@dcvr.yhbt.net.html
> http://bogomips.org/unicorn-public/m/20140502231558.GA4215@dcvr.yhbt.net.html

Finally solved this definitively. It was user error to do with my setup of the eye process monitor.

I’d accidentally deployed a buggy logrotate configuration for eye, which was causing a second eye daemon to be spawned once a day (when the logs were rotated). Those two eye daemons ran side by side and fought with each other whenever one was told to restart Unicorn. I’d already anticipated and fixed this problem, but failed to deploy the correct version of the config to our cluster.

All fixed now. Thanks for your pointers; they put me on the right track. :)


* Re: On USR2, new master runs with same PID
  2015-03-12  6:45     ` Eric Wong
  2015-03-20  1:55       ` Kevin Yank
@ 2015-03-20  1:58       ` Kevin Yank
  2015-03-20  2:08         ` Eric Wong
  1 sibling, 1 reply; 7+ messages in thread
From: Kevin Yank @ 2015-03-20  1:58 UTC (permalink / raw)
  To: Eric Wong; +Cc: unicorn-public


Regarding zero-downtime deploys:

On 12 Mar 2015, at 5:45 pm, Eric Wong <e@80x24.org> wrote:

> Best bet would be to run with double the workers temporarily unless
> you're too low on memory (and swapping) or backend (DB) connections or
> any other resource.

I’d like to take this approach as I do have enough memory to spare. How do you usually implement this? Any good write-ups or sample configs you can point me to?

Thanks again,


* Re: On USR2, new master runs with same PID
  2015-03-20  1:58       ` Kevin Yank
@ 2015-03-20  2:08         ` Eric Wong
  0 siblings, 0 replies; 7+ messages in thread
From: Eric Wong @ 2015-03-20  2:08 UTC (permalink / raw)
  To: Kevin Yank; +Cc: unicorn-public

Kevin Yank <kyank@avalanche.com.au> wrote:
> Regarding zero-downtime deploys:
> 
> On 12 Mar 2015, at 5:45 pm, Eric Wong <e@80x24.org> wrote:
> 
> > Best bet would be to run with double the workers temporarily unless
> > you're too low on memory (and swapping) or backend (DB) connections or
> > any other resource.
>  
> I’d like to take this approach as I do have enough memory to spare.
> How do you usually implement this? Any good write-ups or sample
> configs you can point me to?

Only send SIGUSR2 to the master, leaving you with two masters and two
sets of workers.  Skip (automated) sending of SIGTTOU signals to lower
worker count to the old master.

Eventually, you'll decide to send SIGQUIT to the old master to stop
it (or the new one, if you decide the new code is broken).

You can still combine this with SIGWINCH (or SIGTTOU) to stop traffic
flow to the old master, too.
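Spelled out as a deploy-script sketch (the function name is mine; the pid file path comes from your unicorn config, and passing "echo" as $2 previews the signals without sending them):

```shell
# Sketch of the USR2-then-QUIT flow described above.
# $1 is the pid file path; pass "echo" as $2 for a dry run.
double_worker_upgrade() {
  pid_file=$1
  kill_cmd=${2:-kill}
  $kill_cmd -USR2 "$(cat "$pid_file")"  # new master + workers start alongside the old
  # ... two full worker sets now run; verify the new code is healthy ...
  $kill_cmd -QUIT "$(cat "$pid_file.oldbin")"  # then retire the old master
  # (or QUIT the new master instead, if the new code turns out broken)
}
```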

Thanks for following up on your logrotate/eye issue, by the way.

