Simon Eskildsen <simon.eskildsen@shopify.com> wrote: Thank you for your reply. It is a good reminder of how far away I am from the rest of the web development world :> <snip> > It becomes difficult, because sometimes you have legitimate requests > that take 10-20s, because the merchant's data set is so large that it > exposes anomalies. Again, with the size of our code-base, we need this > wiggle room in the global timeout to not just error on users. You can > have endpoints that do 4 HTTP requests, 5 RPC requests, 4 MySQL > queries, and 30 calls to Memcached. In that case, your worst case is > the timeout of all of those actions, which easily exceeds the Unicorn > timeout. Wow, that is frightening... People will actually wait for a web page to load in that case? I guess that's why ccc exists :> > We've debated having "budgets" and "shitlisting" > (http://sirupsen.com/shitlists/) paths that obviously take longer than > the budget for a single resource. The probability of more than one > resource being very slow at once, is quite low (and if it is, again, > we rely on the Unicorn timeout). Interesting. I guess for now, you can use nginx or similar to route to differently-configured unicorns with different timeouts (or even other servers)? Anyways, I'm coming around to reconciling the two mindsets of "typical" code running on unicorn ("it's alright to crash") and the "no room for error: nuclear war starts if you screw up" mindset I adopt for other projects. > Some of these bugs are even deep in Ruby, Jean B, one of my co-workers > submitted a bug about there being no write_timeout in Net::HTTP (you > even replied!): https://bugs.ruby-lang.org/issues/13396 Yeah, that got me thinking of improving core timeouts again... One big problem is the lack of a portable standard asynchronous name resolution mechanism in the C standard library. I'm not sure how well resolv.rb/resolv-replace.rb holds up in real-world usage, nor if pulling in something like ares2 would be an acceptable dependency for ruby-core... > BTW we deployed 5.3.0 and replaced our `before_murder` hook with > `after_worker_exit`. Everything works perfectly and we finally are not > using a forked version of Unicorn anymore. Thanks for the release! Cool, good to know.
I agree with you in principle, absolutely. However, when you have a code-base the size of ours (100Ks of lines of Ruby) with 100s of developers, and new ones coming on every month with no prior Ruby or Rails experience, we can't rely on everyone doing the right thing all the time. With a surface area of that size, there will be things that are missed, especially when you run gems like Liquid that have billions of different ways of composing templates—some of these paths, unfortunately, are going to be slow. We definitely chase down all the worst offenders, but when new ones creep up, we do our best to chase them down when time allows. Using this hook, allows us to monitor when that happens, and how often it happens, and for which endpoints. With 100Ks of lines of code, 100s of developers, and 10s of thousands of requests per second—once in a million happens every couple of seconds. Multiply that with the size of the code-base, and Unicorn timeouts due to the conditions below will happen somewhat often. It becomes difficult, because sometimes you have legitimate requests that take 10-20s, because the merchant's data set is so large that it exposes anomalies. Again, with the size of our code-base, we need this wiggle room in the global timeout to not just error on users. You can have endpoints that do 4 HTTP requests, 5 RPC requests, 4 MySQL queries, and 30 calls to Memcached. In that case, your worst case is the timeout of all of those actions, which easily exceeds the Unicorn timeout. We've debated having "budgets" and "shitlisting" (http://sirupsen.com/shitlists/) paths that obviously take longer than the budget for a single resource. The probability of more than one resource being very slow at once, is quite low (and if it is, again, we rely on the Unicorn timeout). In other words, the Unicorn timeout is not a crutch for timeouts in the application, but a global timeout to as a last line of defense against many timeouts, or some bug we didn't foresee. This seems unavoidable in my eyes, unless you have very aggressive timeouts and meticulously keep track of the budgets in a testing environment and raise if the budget is exceeded. When we hit many timeouts, we use Semian (http://github.com/shopify/semian) to trigger circuit breakers—so the reliance on Unicorn should be brief. Some of these bugs are even deep in Ruby, Jean B, one of my co-workers submitted a bug about there being no write_timeout in Net::HTTP (you even replied!): https://bugs.ruby-lang.org/issues/13396 BTW we deployed 5.3.0 and replaced our `before_murder` hook with `after_worker_exit`. Everything works perfectly and we finally are not using a forked version of Unicorn anymore. Thanks for the release!
Simon Eskildsen <simon.eskildsen@shopify.com> wrote: > With Jeremy Evans' work on `after_worker_exit`, I was hoping I could > replace our internal fork which has a `before_murder` hook to report > to monitoring systems when workers are forcibly killed. However, the > `after_worker_exit` is only called on reaping—not when murdering. Hi Simon, it looks like Jeremy clarified after_worker_exit for you. Anyways... I remember rejecting patches to add more to timeout support in unicorn over the years since I did not want it to be a crutch for application developers, or worse; a reason for people to feel locked into unicorn. Instead I wrote things like https://bogomips.org/unicorn/Application_Timeouts.html to discourage relying on unicorn's built-in `timeout' feature. But, it seems like there's still a reliance on the built-in timeout... Why is that? (If you're allowed to disclose) I don't mean to put you guys (Shopify) on the spot, as I'm sure other folks do it, too; but you're here :) Anyways, is this something that could or should be improved in Rack or ruby itself? Or are there buggy external libraries or even external dependencies like NFS to blame? (Or, perhaps "I don't know" is fine :) The Ruby timeout handling in stdlib could be way better if done natively, I think. Perhaps `ensure` could be better declared, too; but there's still the problem of getting external code to work well with it... Anyways I am of two minds; on one hand, buggy code is fine! unicorn handles it by nuking a process. But on the other hand... it's irritatingly inefficient and shoving things under the rug still rots and stinks eventually.
You're right, I misread the code—I thought the wait(2) would happen in the kill worker call, but the reap worker call takes care of that. Thanks for clearing that up Jeremy!
On 04/04 10:08, Simon Eskildsen wrote:
> Hi!
>
> With Jeremy Evans' work on `after_worker_exit`, I was hoping I could
> replace our internal fork which has a `before_murder` hook to report
> to monitoring systems when workers are forcibly killed. However, the
> `after_worker_exit` is only called on reaping???not when murdering.
>
> How would you feel about executing this on the murder path?
> Alternatively, we could add an `after_murder` hook for this path.
I checked and after_worker_exit is called when reaping the worker
after murdering, so I don't think any changes are necessary. I would
guess if you added it when murdering, it would be called twice on the
same worker.
Thanks,
Jeremy
Hi! With Jeremy Evans' work on `after_worker_exit`, I was hoping I could replace our internal fork which has a `before_murder` hook to report to monitoring systems when workers are forcibly killed. However, the `after_worker_exit` is only called on reaping—not when murdering. How would you feel about executing this on the murder path? Alternatively, we could add an `after_murder` hook for this path.
Followup-to: 650e01ab0b118803486b56f3ee59521d59042dae ("doc: add version annotations for new features") --- Oops, missed the 5.3.0.pre1 release; I managed to squash in the worker_exec change for that, though. lib/unicorn/configurator.rb | 2 ++ 1 file changed, 2 insertions(+) diff --git a/lib/unicorn/configurator.rb b/lib/unicorn/configurator.rb index 5203537b..7fb9030f 100644 --- a/lib/unicorn/configurator.rb +++ b/lib/unicorn/configurator.rb @@ -172,6 +172,8 @@ def after_fork(*args, &block) # server.logger.error("worker process failure: #{status.inspect}") # end # end + # + # after_worker_exit is only available in unicorn 5.3.0+ def after_worker_exit(*args, &block) set_hook(:after_worker_exit, block_given? ? block : args[0], 3) end -- EW
Jeremy Evans <code@jeremyevans.net> wrote: > Here's a separate commit for the documentation fix. Thanks, pushed as c8f06be298d667ba85573668ee916680a258c2c7 to 'master'. I added a "Fixes: " tag to link it the original. > From 6efb4b8e07a0459180579b72fac5003e424837c0 Mon Sep 17 00:00:00 2001 Hint: you can omit the above line, and add "----8<----" so the applier can use "git am --scissors" without having to just part of the message. It's no big deal, but I guess I'll do what I can to promote --scissors as it's not too well-known. (and usually, the entire email is the patch for "git am")
Here's a separate commit for the documentation fix. From 6efb4b8e07a0459180579b72fac5003e424837c0 Mon Sep 17 00:00:00 2001 From: Jeremy Evans <code@jeremyevans.net> Date: Thu, 23 Feb 2017 10:17:19 -0800 Subject: [PATCH] Fix code example in after_worker_exit documentation --- lib/unicorn/configurator.rb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/unicorn/configurator.rb b/lib/unicorn/configurator.rb index 5bad925..1e2b6e4 100644 --- a/lib/unicorn/configurator.rb +++ b/lib/unicorn/configurator.rb @@ -162,7 +162,7 @@ def after_fork(*args, &block) # sets after_worker_exit hook to a given block. This block will be called # by the master process after a worker exits: # - # after_fork do |server,worker,status| + # after_worker_exit do |server,worker,status| # # status is a Process::Status instance for the exited worker process # unless status.success? # server.logger.error("worker process failure: #{status.inspect}") -- 2.11.0
Jeremy Evans <code@jeremyevans.net> wrote: > Here's a revised patch that should address the issues you identified: Thanks. Pushed as 2af91a1fef70d6546ee03760011c170a082db667 I needed one change to an integration test to workaround the lack of reader attribute, but I'd rather pay the cost in the test than the server, itself: https://bogomips.org/unicorn.git/patch?id=2c6aa5878d052abb ("t/t0012-reload-empty-config.sh: access ivars directly if needed")
On 02/21 08:15, Eric Wong wrote: > Jeremy Evans <code@jeremyevans.net> wrote: > > On 02/21 07:43, Eric Wong wrote: > > > Jeremy Evans <code@jeremyevans.net> wrote: > > > > This option can be used to implement custom behavior for handling > > > > worker exits. For example, let's say you have a specific request > > > > that crashes a worker process, which you expect to be due to a > > > > improperly programmed C extension. By modifying your worker to > > > > save request related data in a temporary file and using this option, > > > > you can get a record of what request is crashing the application, > > > > which will make debugging easier. > > > > > > > > This is not a complete patch as it doesn't include tests, but > > > > before writing tests I wanted to see if this is something you'd > > > > consider including in unicorn. > > > > > > What advantage does this have over Ruby's builtin at_exit? > > > > at_exit is not called if the interpreter crashes: > > > > ruby -e 'at_exit{File.write('a.txt', 'a')}; Process.kill :SEGV, $$' 2>/dev/null > > ([ -f a.txt ] && echo at_exit called) || echo at_exit not called > > Ah, thanks. I didn't read the code carefully, enough. > The commit message and documentation should reflect that it's > called in the master and not the worker. > > Anyways, this is probably OK since I can't think of a less > intrusive or more generic (across all Rack servers) way of doing > it. So I'm inclined to accept some version of this. > > > --- a/lib/unicorn/http_server.rb > > +++ b/lib/unicorn/http_server.rb > > @@ -14,7 +14,8 @@ class Unicorn::HttpServer > > attr_accessor :app, :timeout, :worker_processes, > > :before_fork, :after_fork, :before_exec, > > :listener_opts, :preload_app, > > - :orig_app, :config, :ready_pipe, :user > > + :orig_app, :config, :ready_pipe, :user, > > + :after_worker_exit > > I've been trying to reduce the method entry overhead across > the board. Since this is a new field, can it be attr_writer > solely for configurator? > > I don't think there's reader dependency other than below... > > > @@ -395,8 +396,7 @@ def reap_all_workers > > proc_name 'master' > > else > > worker = @workers.delete(wpid) and worker.close rescue nil > > - m = "reaped #{status.inspect} worker=#{worker.nr rescue 'unknown'}" > > - status.success? ? logger.info(m) : logger.error(m) > > + after_worker_exit.call(self, worker, status) > > Which can be made to access the ivar directly: > > @after_worker_exit.call(self, worker, status) Here's a revised patch that should address the issues you identified: From 91b95652e4bdb7fc9af77e9ac06ad7400faef796 Mon Sep 17 00:00:00 2001 From: Jeremy Evans <code@jeremyevans.net> Date: Fri, 17 Feb 2017 16:12:33 -0800 Subject: [PATCH] Add after_worker_exit configuration option This option is executed in the master process following all worker process exits. It is most useful in the case where the worker process crashes the ruby interpreter, as the worker process may not be able to send error notifications appropriately. For example, let's say you have a specific request that crashes a worker process, which you expect to be due to a improperly programmed C extension. By modifying your worker to save request related data in a temporary file and using this option, you can get a record of what request is crashing the application, which will make debugging easier. Example: after_worker_exit do |server, worker, status| server.logger.info "worker #{status.success? ? 'exit' : 'crash'}: #{status}" file = "request.#{status.pid}.txt" if File.exist?(file) do_something_with(File.read(file)) unless status.success? File.delete(file) end end --- lib/unicorn/configurator.rb | 21 +++++++++++++++++++++ lib/unicorn/http_server.rb | 4 ++-- 2 files changed, 23 insertions(+), 2 deletions(-) diff --git a/lib/unicorn/configurator.rb b/lib/unicorn/configurator.rb index 3329c10..5bad925 100644 --- a/lib/unicorn/configurator.rb +++ b/lib/unicorn/configurator.rb @@ -41,6 +41,14 @@ class Unicorn::Configurator :before_exec => lambda { |server| server.logger.info("forked child re-executing...") }, + :after_worker_exit => lambda { |server, worker, status| + m = "reaped #{status.inspect} worker=#{worker.nr rescue 'unknown'}" + if status.success? + server.logger.info(m) + else + server.logger.error(m) + end + }, :pid => nil, :preload_app => false, :check_client_connection => false, @@ -151,6 +159,19 @@ def after_fork(*args, &block) set_hook(:after_fork, block_given? ? block : args[0]) end + # sets after_worker_exit hook to a given block. This block will be called + # by the master process after a worker exits: + # + # after_fork do |server,worker,status| + # # status is a Process::Status instance for the exited worker process + # unless status.success? + # server.logger.error("worker process failure: #{status.inspect}") + # end + # end + def after_worker_exit(*args, &block) + set_hook(:after_worker_exit, block_given? ? block : args[0], 3) + end + # sets before_fork got be a given Proc object. This Proc # object will be called by the master process before forking # each worker. diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb index 35bd100..c2086cb 100644 --- a/lib/unicorn/http_server.rb +++ b/lib/unicorn/http_server.rb @@ -15,6 +15,7 @@ class Unicorn::HttpServer :before_fork, :after_fork, :before_exec, :listener_opts, :preload_app, :orig_app, :config, :ready_pipe, :user + attr_writer :after_worker_exit attr_reader :pid, :logger include Unicorn::SocketHelper @@ -395,8 +396,7 @@ def reap_all_workers proc_name 'master' else worker = @workers.delete(wpid) and worker.close rescue nil - m = "reaped #{status.inspect} worker=#{worker.nr rescue 'unknown'}" - status.success? ? logger.info(m) : logger.error(m) + @after_worker_exit.call(self, worker, status) end rescue Errno::ECHILD break -- 2.11.0
Jeremy Evans <code@jeremyevans.net> wrote: > On 02/21 07:43, Eric Wong wrote: > > Jeremy Evans <code@jeremyevans.net> wrote: > > > This option can be used to implement custom behavior for handling > > > worker exits. For example, let's say you have a specific request > > > that crashes a worker process, which you expect to be due to a > > > improperly programmed C extension. By modifying your worker to > > > save request related data in a temporary file and using this option, > > > you can get a record of what request is crashing the application, > > > which will make debugging easier. > > > > > > This is not a complete patch as it doesn't include tests, but > > > before writing tests I wanted to see if this is something you'd > > > consider including in unicorn. > > > > What advantage does this have over Ruby's builtin at_exit? > > at_exit is not called if the interpreter crashes: > > ruby -e 'at_exit{File.write('a.txt', 'a')}; Process.kill :SEGV, $$' 2>/dev/null > ([ -f a.txt ] && echo at_exit called) || echo at_exit not called Ah, thanks. I didn't read the code carefully, enough. The commit message and documentation should reflect that it's called in the master and not the worker. Anyways, this is probably OK since I can't think of a less intrusive or more generic (across all Rack servers) way of doing it. So I'm inclined to accept some version of this. > --- a/lib/unicorn/http_server.rb > +++ b/lib/unicorn/http_server.rb > @@ -14,7 +14,8 @@ class Unicorn::HttpServer > attr_accessor :app, :timeout, :worker_processes, > :before_fork, :after_fork, :before_exec, > :listener_opts, :preload_app, > - :orig_app, :config, :ready_pipe, :user > + :orig_app, :config, :ready_pipe, :user, > + :after_worker_exit I've been trying to reduce the method entry overhead across the board. Since this is a new field, can it be attr_writer solely for configurator? I don't think there's reader dependency other than below... > @@ -395,8 +396,7 @@ def reap_all_workers > proc_name 'master' > else > worker = @workers.delete(wpid) and worker.close rescue nil > - m = "reaped #{status.inspect} worker=#{worker.nr rescue 'unknown'}" > - status.success? ? logger.info(m) : logger.error(m) > + after_worker_exit.call(self, worker, status) Which can be made to access the ivar directly: @after_worker_exit.call(self, worker, status)
On 02/21 07:43, Eric Wong wrote: > Jeremy Evans <code@jeremyevans.net> wrote: > > This option can be used to implement custom behavior for handling > > worker exits. For example, let's say you have a specific request > > that crashes a worker process, which you expect to be due to a > > improperly programmed C extension. By modifying your worker to > > save request related data in a temporary file and using this option, > > you can get a record of what request is crashing the application, > > which will make debugging easier. > > > > This is not a complete patch as it doesn't include tests, but > > before writing tests I wanted to see if this is something you'd > > consider including in unicorn. > > What advantage does this have over Ruby's builtin at_exit? at_exit is not called if the interpreter crashes: ruby -e 'at_exit{File.write('a.txt', 'a')}; Process.kill :SEGV, $$' 2>/dev/null ([ -f a.txt ] && echo at_exit called) || echo at_exit not called > > AFAIK, at_exit may be registered inside after_fork hooks to the > same effect. > > Thanks.
Jeremy Evans <code@jeremyevans.net> wrote:
> This option can be used to implement custom behavior for handling
> worker exits. For example, let's say you have a specific request
> that crashes a worker process, which you expect to be due to a
> improperly programmed C extension. By modifying your worker to
> save request related data in a temporary file and using this option,
> you can get a record of what request is crashing the application,
> which will make debugging easier.
>
> This is not a complete patch as it doesn't include tests, but
> before writing tests I wanted to see if this is something you'd
> consider including in unicorn.
What advantage does this have over Ruby's builtin at_exit?
AFAIK, at_exit may be registered inside after_fork hooks to the
same effect.
Thanks.
This option can be used to implement custom behavior for handling worker exits. For example, let's say you have a specific request that crashes a worker process, which you expect to be due to a improperly programmed C extension. By modifying your worker to save request related data in a temporary file and using this option, you can get a record of what request is crashing the application, which will make debugging easier. This is not a complete patch as it doesn't include tests, but before writing tests I wanted to see if this is something you'd consider including in unicorn. Example: after_worker_exit do |server, worker, status| server.logger.info "worker #{status.success? ? 'exit' : 'crash'}: #{status}" file = "request.#{status.pid}.txt" if File.exist?(file) do_something_with(File.read(file)) unless status.success? File.delete(file) end end --- lib/unicorn/configurator.rb | 14 ++++++++++++++ lib/unicorn/http_server.rb | 6 +++--- 2 files changed, 17 insertions(+), 3 deletions(-) diff --git a/lib/unicorn/configurator.rb b/lib/unicorn/configurator.rb index 3329c10..81589b0 100644 --- a/lib/unicorn/configurator.rb +++ b/lib/unicorn/configurator.rb @@ -41,6 +41,14 @@ class Unicorn::Configurator :before_exec => lambda { |server| server.logger.info("forked child re-executing...") }, + :after_worker_exit => lambda { |server, worker, status| + m = "reaped #{status.inspect} worker=#{worker.nr rescue 'unknown'}" + if status.success? + server.logger.info(m) + else + server.logger.error(m) + end + }, :pid => nil, :preload_app => false, :check_client_connection => false, @@ -151,6 +159,12 @@ def after_fork(*args, &block) set_hook(:after_fork, block_given? ? block : args[0]) end + # sets after_worker_exit hook to a given block. This block will be called + # by the master process after a worker exits. + def after_worker_exit(*args, &block) + set_hook(:after_worker_exit, block_given? ? block : args[0], 3) + end + # sets before_fork got be a given Proc object. This Proc # object will be called by the master process before forking # each worker. diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb index 35bd100..567ee0e 100644 --- a/lib/unicorn/http_server.rb +++ b/lib/unicorn/http_server.rb @@ -14,7 +14,8 @@ class Unicorn::HttpServer attr_accessor :app, :timeout, :worker_processes, :before_fork, :after_fork, :before_exec, :listener_opts, :preload_app, - :orig_app, :config, :ready_pipe, :user + :orig_app, :config, :ready_pipe, :user, + :after_worker_exit attr_reader :pid, :logger include Unicorn::SocketHelper @@ -395,8 +396,7 @@ def reap_all_workers proc_name 'master' else worker = @workers.delete(wpid) and worker.close rescue nil - m = "reaped #{status.inspect} worker=#{worker.nr rescue 'unknown'}" - status.success? ? logger.info(m) : logger.error(m) + after_worker_exit.call(self, worker, status) end rescue Errno::ECHILD break -- 2.11.0