* What would happen if a worker thread died?
From: Lin Jen-Shin (godfat) @ 2015-05-08 12:20 UTC
To: yahns-public; +Cc: wildjcrt

Hi,

We just tried running yahns in production, and would like to give
some feedback. Eventually we went back to Unicorn, but that's
because apparently Rails, or to be specific, ActiveRecord (4.1.10),
is not thread-safe and gives random errors once in a while.

However, we're also trying to split out an API server which is not
running on Rails, so we should be able to use a threaded server.

During the experiments, we found that a worker thread died whenever
the application raised a LoadError, which is not a StandardError and
therefore was not rescued at all, crashing the worker thread
(assumed, not verified).

When this happened, the client just hung forever with yahns.
Is there something we can do about this? Would yahns respawn
a new worker thread? Can we close the socket when this happens?

I am aware that yahns is *extremely sensitive to fatal bugs in the
applications it hosts*, so I am just curious.

For reference, Puma would immediately close the socket without
sending anything, and Unicorn would log the error backtrace and
kill the worker (if I read it correctly).

In this case, Unicorn helped me figure out what happened.

Cheers,
* Re: What would happen if a worker thread died?
From: Eric Wong @ 2015-05-08 17:03 UTC
To: Lin Jen-Shin (godfat); +Cc: yahns-public, wildjcrt

"Lin Jen-Shin (godfat)" <godfat@godfat.org> wrote:
> During the experiments, we found that a worker thread died whenever
> the application raised a LoadError, which is not a StandardError and
> therefore was not rescued at all, crashing the worker thread
> (assumed, not verified).

Ugh, I guess since it happened in a thread, the error message got
swallowed unless you were running in $DEBUG.

Loading code after the server is ready and serving requests is a bad
idea. It leads to really nasty thread-safety problems as well as
invalidating the method/constant caches.

> When this happened, the client just hung forever with yahns.
> Is there something we can do about this? Would yahns respawn
> a new worker thread? Can we close the socket when this happens?

It's unfortunately difficult to detect thread death from Ruby (no
SIGCHLD handler, unlike for processes) besides polling Thread#join.

We had this issue in ruby-core a few years back, but apparently
it was forgotten/ignored by matz. Care to chime in?
https://bugs.ruby-lang.org/issues/6647

> I am aware that yahns is *extremely sensitive to fatal bugs in the
> applications it hosts*, so I am just curious.
>
> For reference, Puma would immediately close the socket without
> sending anything, and Unicorn would log the error backtrace and
> kill the worker (if I read it correctly).
>
> In this case, Unicorn helped me figure out what happened.

yahns can probably rescue Exception (or Object(!) like puma)
and then log + abort the entire process.
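The two points above (a bare `rescue` missing LoadError, and thread death only surfacing through Thread#join) can be illustrated with a minimal sketch. The method names and the `hypothetical_lib` file name are made up for illustration and are not part of yahns. The sketch turns off `report_on_exception` to reproduce the silent behavior described in this thread; as far as I know, newer Ruby versions print uncaught thread exceptions to stderr by default.

```ruby
# Returns true when a bare `rescue` (shorthand for `rescue StandardError`)
# catches an exception of the given class.
def bare_rescue_catches?(exception_class)
  begin
    raise exception_class, "boom"
  rescue => _e
    return true
  end
rescue Exception
  # Only reached when the bare rescue above did not match.
  false
end

# Starts a thread that dies from a LoadError and returns the exception
# that Thread#join re-raises in the calling thread.
def error_seen_by_join
  worker = Thread.new do
    Thread.current.report_on_exception = false # mimic the silent death
    raise LoadError, "cannot load such file -- hypothetical_lib"
  end
  worker.join
  nil
rescue LoadError => e
  e
end

puts bare_rescue_catches?(RuntimeError)
puts bare_rescue_catches?(LoadError)
puts error_seen_by_join.class
```

Nothing is printed for the dying thread itself here; the LoadError only becomes visible once something joins the thread, which is why polling Thread#join is mentioned as the only detection mechanism.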
* Re: What would happen if a worker thread died?
From: Lin Jen-Shin (godfat) @ 2015-05-08 17:36 UTC
To: Eric Wong; +Cc: yahns-public, wildjcrt

On Sat, May 9, 2015 at 1:03 AM, Eric Wong <e@80x24.org> wrote:
> "Lin Jen-Shin (godfat)" <godfat@godfat.org> wrote:
> Ugh, I guess since it happened in a thread, the error message got
> swallowed unless you were running in $DEBUG.

I like the idea of setting $DEBUG, however it's too noisy given the
current state :( See how many messages get printed with this:

    $ ruby -dS gem list > /dev/null

> Loading code after the server is ready and serving requests is a bad
> idea. It leads to really nasty thread-safety problems as well as
> invalidating the method/constant caches.

Yeah, I did that in the first place because I didn't want my runtime
to be polluted with those methods from activesupport. However, at some
point we just needed to use some things from activesupport in some
cases, so I put those requires in that particular method. I thought it
would only be used in a few cases, but eventually it got triggered
from some unexpected places.

Now it's removed and replaced with code copied directly from
activesupport. No activesupport or requires are needed now.

>> When this happened, the client just hung forever with yahns.
>> Is there something we can do about this? Would yahns respawn
>> a new worker thread? Can we close the socket when this happens?
>
> It's unfortunately difficult to detect thread death from Ruby (no
> SIGCHLD handler, unlike for processes) besides polling Thread#join.
>
> We had this issue in ruby-core a few years back, but apparently
> it was forgotten/ignored by matz. Care to chime in?
> https://bugs.ruby-lang.org/issues/6647

I just sent a few characters, hope that would speed up the process.

>> I am aware that yahns is *extremely sensitive to fatal bugs in the
>> applications it hosts*, so I am just curious.
>>
>> For reference, Puma would immediately close the socket without
>> sending anything, and Unicorn would log the error backtrace and
>> kill the worker (if I read it correctly).
>>
>> In this case, Unicorn helped me figure out what happened.
>
> yahns can probably rescue Exception (or Object(!) like puma)
> and then log + abort the entire process.

I think rescuing Object is misleading. AFAIK, we cannot raise
an instance which is not a kind of Exception.

I learned that rescuing Exception is a bad idea because things like
signal handling also use Exception to communicate, and of course we
don't want to interfere.

However, for a worker thread, I guess that might be ok?

Cheers,
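The claims in the message above can be checked with a short sketch: raising a non-exception object ends up as a TypeError, and several control-flow exceptions sit outside StandardError, which is why a blanket `rescue Exception` can interfere with signal handling and process exit. The method names are made up for illustration.

```ruby
# `raise` only accepts exception classes or instances; anything else
# results in a TypeError being raised instead.
def raise_plain_object
  raise Object.new
rescue Exception => e
  e
end

# Exception classes used for signals, exit, and resource exhaustion.
# None of them descend from StandardError, so only a blanket
# `rescue Exception` (or `rescue Object`) would swallow them.
def control_flow_exceptions
  [Interrupt, SignalException, SystemExit, NoMemoryError].reject do |klass|
    klass <= StandardError
  end
end

puts raise_plain_object.class
puts control_flow_exceptions.inspect
```

Since every raised value is already a kind of Exception, `rescue Object` does not appear to catch anything from Ruby code that `rescue Exception` would miss.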
* Re: What would happen if a worker thread died?
From: Eric Wong @ 2015-05-09 1:03 UTC
To: Lin Jen-Shin (godfat); +Cc: yahns-public, wildjcrt

"Lin Jen-Shin (godfat)" <godfat@godfat.org> wrote:
> On Sat, May 9, 2015 at 1:03 AM, Eric Wong <e@80x24.org> wrote:
> > It's unfortunately difficult to detect thread death from Ruby (no
> > SIGCHLD handler, unlike for processes) besides polling Thread#join.
> >
> > We had this issue in ruby-core a few years back, but apparently
> > it was forgotten/ignored by matz. Care to chime in?
> > https://bugs.ruby-lang.org/issues/6647
>
> I just sent a few characters, hope that would speed up the process.

Thanks for reminding us of this, care to examine/fix some of the MRI
test failures in the patch I posted to MRI? :)

> >> I am aware that yahns is *extremely sensitive to fatal bugs in the
> >> applications it hosts*, so I am just curious.
> >>
> >> For reference, Puma would immediately close the socket without
> >> sending anything, and Unicorn would log the error backtrace and
> >> kill the worker (if I read it correctly).

Actually, unicorn doesn't do that explicitly, it's standard Ruby
behavior for the main thread.

> >> In this case, Unicorn helped me figure out what happened.
> >
> > yahns can probably rescue Exception (or Object(!) like puma)
> > and then log + abort the entire process.
>
> I think rescuing Object is misleading. AFAIK, we cannot raise
> an instance which is not a kind of Exception.

I guess, there's some internal non-object interrupts in MRI for threads
(eKillSignal, eTerminateSignal) but I don't think those get exposed to
Ruby-land...

> I learned that rescuing Exception is a bad idea because things like
> signal handling also use Exception to communicate, and of course we
> don't want to interfere.

Right.

> However, for a worker thread, I guess that might be ok?

Maybe limiting it to the common types {Standard,Load,Syntax}Error
is sufficient.

Below, I'm choosing to both leave the socket open and keep the worker
running to slow down a potentially malicious client if this happens and
to hopefully prevent an evil client from taking others down with it.

The process may be in a bad state from Load/SyntaxErrors anyway, with
partially loaded code, though.

yahns cannot be made error-tolerant when given buggy code, but it should
at least allow users to find problems since the Ruby default behavior
sucks right now:

diff --git a/lib/yahns/queue_epoll.rb b/lib/yahns/queue_epoll.rb
index 4f3289e..2875920 100644
--- a/lib/yahns/queue_epoll.rb
+++ b/lib/yahns/queue_epoll.rb
@@ -64,7 +64,7 @@ class Yahns::Queue < SleepyPenguin::Epoll::IO # :nodoc:
             raise "BUG: #{io.inspect}#yahns_step returned: #{rv.inspect}"
           end
         end
-      rescue => e
+      rescue StandardError, LoadError, SyntaxError => e
         break if closed? # can still happen due to shutdown_timeout
         Yahns::Log.exception(logger, 'queue loop', e)
       end while true
diff --git a/lib/yahns/queue_kqueue.rb b/lib/yahns/queue_kqueue.rb
index 4176f7a..33f5f8b 100644
--- a/lib/yahns/queue_kqueue.rb
+++ b/lib/yahns/queue_kqueue.rb
@@ -72,7 +72,7 @@ class Yahns::Queue < SleepyPenguin::Kqueue::IO # :nodoc:
             raise "BUG: #{io.inspect}#yahns_step returned: #{rv.inspect}"
           end
         end
-      rescue => e
+      rescue StandardError, LoadError, SyntaxError => e
         break if closed? # can still happen due to shutdown_timeout
         Yahns::Log.exception(logger, 'queue loop', e)
       end while true

Thoughts?
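The effect of widening the rescue clause can be sketched outside of yahns. The following is only an illustration of the technique, not yahns's actual queue loop: `run_steps`, `demo`, and the missing library name are hypothetical, and a plain Logger writing to a StringIO stands in for yahns's own logging.

```ruby
require "logger"
require "stringio"

# Hypothetical stand-in for a worker's loop: run each step, log the
# listed error types, and keep going instead of letting the thread die.
def run_steps(steps, logger)
  completed = 0
  steps.each do |step|
    begin
      step.call
      completed += 1
    rescue StandardError, LoadError, SyntaxError => e
      logger.error("queue loop: #{e.message} (#{e.class})")
    end
  end
  completed
end

# Runs a mix of healthy and failing steps; returns the number of steps
# that completed and everything that was logged.
def demo
  log = StringIO.new
  steps = [
    -> { :ok },
    -> { require "hypothetical_missing_library_for_demo" }, # LoadError
    -> { eval("1 +") },                                      # SyntaxError
    -> { raise ArgumentError, "bad input" },                 # StandardError
    -> { :ok },
  ]
  [run_steps(steps, Logger.new(log)), log.string]
end

completed, output = demo
puts completed
puts output
```

With a bare `rescue => e` in `run_steps`, the second step would propagate out of the loop and the remaining steps would never run.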
* Re: What would happen if a worker thread died?
From: Lin Jen-Shin (godfat) @ 2015-05-09 7:26 UTC
To: Eric Wong; +Cc: yahns-public, wildjcrt

On Sat, May 9, 2015 at 9:03 AM, Eric Wong <e@80x24.org> wrote:
> "Lin Jen-Shin (godfat)" <godfat@godfat.org> wrote:
>> On Sat, May 9, 2015 at 1:03 AM, Eric Wong <e@80x24.org> wrote:
>> > It's unfortunately difficult to detect thread death from Ruby (no
>> > SIGCHLD handler, unlike for processes) besides polling Thread#join.
>> >
>> > We had this issue in ruby-core a few years back, but apparently
>> > it was forgotten/ignored by matz. Care to chime in?
>> > https://bugs.ruby-lang.org/issues/6647
>>
>> I just sent a few characters, hope that would speed up the process.
>
> Thanks for reminding us of this, care to examine/fix some of the MRI
> test failures in the patch I posted to MRI? :)

Haha, cool. Probably not now though. I just took a quick look.
Ignoring warnings, I guess some of the tests were trying to capture
stdout or stderr and assert on messages. Along with abort_on_exception
and using join to peek at the exception, this probably breaks those
tests. So I assume most of them were bugs in the tests, not in MRI
itself. Testing error messages is hard :(

>> I think rescuing Object is misleading. AFAIK, we cannot raise
>> an instance which is not a kind of Exception.
>
> I guess, there's some internal non-object interrupts in MRI for threads
> (eKillSignal, eTerminateSignal) but I don't think those get exposed to
> Ruby-land...

Got it, makes sense.

>> However, for a worker thread, I guess that might be ok?
>
> Maybe limiting it to the common types {Standard,Load,Syntax}Error
> is sufficient.

Those are what I can think of right now, too.

> Below, I'm choosing to both leave the socket open and keep the worker
> running to slow down a potentially malicious client if this happens and
> to hopefully prevent an evil client from taking others down with it.

I am curious how this could slow down a malicious client. Is it because
this might make them think the worker is still working?

> The process may be in a bad state from Load/SyntaxErrors anyway, with
> partially loaded code, though.
>
> yahns cannot be made error-tolerant when given buggy code, but it should
> at least allow users to find problems since the Ruby default behavior
> sucks right now:
>
> diff --git a/lib/yahns/queue_epoll.rb b/lib/yahns/queue_epoll.rb
> index 4f3289e..2875920 100644
> --- a/lib/yahns/queue_epoll.rb
> +++ b/lib/yahns/queue_epoll.rb
> @@ -64,7 +64,7 @@ class Yahns::Queue < SleepyPenguin::Epoll::IO # :nodoc:
>              raise "BUG: #{io.inspect}#yahns_step returned: #{rv.inspect}"
>            end
>          end
> -      rescue => e
> +      rescue StandardError, LoadError, SyntaxError => e
>          break if closed? # can still happen due to shutdown_timeout
>          Yahns::Log.exception(logger, 'queue loop', e)
>        end while true
> diff --git a/lib/yahns/queue_kqueue.rb b/lib/yahns/queue_kqueue.rb
> index 4176f7a..33f5f8b 100644
> --- a/lib/yahns/queue_kqueue.rb
> +++ b/lib/yahns/queue_kqueue.rb
> @@ -72,7 +72,7 @@ class Yahns::Queue < SleepyPenguin::Kqueue::IO # :nodoc:
>              raise "BUG: #{io.inspect}#yahns_step returned: #{rv.inspect}"
>            end
>          end
> -      rescue => e
> +      rescue StandardError, LoadError, SyntaxError => e
>          break if closed? # can still happen due to shutdown_timeout
>          Yahns::Log.exception(logger, 'queue loop', e)
>        end while true
>
> Thoughts?

A backtrace showing what happened is quite enough for me for now.
Still curious though: could this worker do anything else if this
happened? I am guessing that if the application no longer does
anything, then this worker would not do anything. Or might the socket
time out eventually?
* Re: What would happen if a worker thread died?
From: Eric Wong @ 2015-05-09 8:47 UTC
To: Lin Jen-Shin (godfat); +Cc: yahns-public, wildjcrt

"Lin Jen-Shin (godfat)" <godfat@godfat.org> wrote:
> On Sat, May 9, 2015 at 9:03 AM, Eric Wong <e@80x24.org> wrote:
> > Below, I'm choosing to both leave the socket open and keep the worker
> > running to slow down a potentially malicious client if this happens and
> > to hopefully prevent an evil client from taking others down with it.
>
> I am curious how this could slow down a malicious client. Is it because
> this might make them think the worker is still working?

Right, it might not know if the app server is throttling responses
or if there's packet loss on the network. Other than the small
amount of memory used for the socket, it won't use other system
resources once the error is logged.

> A backtrace showing what happened is quite enough for me for now.
> Still curious though: could this worker do anything else if this
> happened? I am guessing that if the application no longer does
> anything, then this worker would not do anything. Or might the socket
> time out eventually?

It depends on the application structure.
Often apps have very different code paths for different endpoints so
some endpoint being fatally broken may not affect others. A simple
endpoint (e.g. static files) could function at 100% and serve other
clients without any problems.

Eventually the socket will time out if the client_expire_threshold is
reached, otherwise it's fairly harmless to keep the socket around
(aside from memory overhead).
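To make the point about separate code paths concrete, here is a rough sketch of a Rack-style app in which one endpoint is fatally broken while another keeps working. Everything in it is hypothetical: `app_call`, `serve`, the paths, and the missing library name are made up, and `serve` only imitates a server that rescues and records the error without writing a response.

```ruby
# Hypothetical Rack-style app: each path takes its own code path.
def app_call(env)
  case env["PATH_INFO"]
  when "/static"
    [200, { "content-type" => "text/plain" }, ["hello\n"]]
  when "/broken"
    require "hypothetical_missing_library_for_demo" # raises LoadError
    [200, { "content-type" => "text/plain" }, ["never reached\n"]]
  else
    [404, { "content-type" => "text/plain" }, ["not found\n"]]
  end
end

# Dispatches each path in turn. A failing request yields no status
# (the client would be left waiting), but later requests still run.
def serve(paths)
  errors = []
  statuses = paths.map do |path|
    begin
      app_call("PATH_INFO" => path).first
    rescue StandardError, LoadError, SyntaxError => e
      errors << e.class
      nil
    end
  end
  [statuses, errors]
end

statuses, errors = serve(%w[/static /broken /static])
puts statuses.inspect
puts errors.inspect
```

The request to the broken endpoint never gets a response, which matches the hanging client described at the start of the thread, while requests to the healthy endpoint before and after it are unaffected.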
* [PATCH] worker threads log LoadError and SyntaxError, too
From: Eric Wong @ 2015-05-09 9:03 UTC
To: Lin Jen-Shin (godfat); +Cc: yahns-public, wildjcrt

Some applications may lazily load code during app dispatch,
triggering LoadError or SyntaxError exceptions. Log the error and
backtrace so application maintainers can more easily notice and
diagnose problems.

Keep in mind users are likely to have performance and race condition
problems with lazy loading, and the process may still be in a bad
state due to partially-loaded code. This commit is only intended to
give application authors a chance to notice and fix or avoid problems
in the future.

Note: logging fatal exceptions by default in all threads was proposed
in ruby-core, but currently not implemented in any released version:
  https://bugs.ruby-lang.org/issues/6647

Reported-by: Lin Jen-Shin (godfat) <godfat@godfat.org>
  <CAA2_N1umJO12XH9r+JHnA6r=z=Mwp_PqOrdnW65oqW2K2-iAoQ@mail.gmail.com>
---
  I'll push this over the weekend and release 1.7 from master
  (proxy_pass isn't production-ready, but there's still a good deal
  of small improvements going in).

 lib/yahns/queue_epoll.rb  | 2 +-
 lib/yahns/queue_kqueue.rb | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/yahns/queue_epoll.rb b/lib/yahns/queue_epoll.rb
index 4f3289e..2875920 100644
--- a/lib/yahns/queue_epoll.rb
+++ b/lib/yahns/queue_epoll.rb
@@ -64,7 +64,7 @@ class Yahns::Queue < SleepyPenguin::Epoll::IO # :nodoc:
             raise "BUG: #{io.inspect}#yahns_step returned: #{rv.inspect}"
           end
         end
-      rescue => e
+      rescue StandardError, LoadError, SyntaxError => e
         break if closed? # can still happen due to shutdown_timeout
         Yahns::Log.exception(logger, 'queue loop', e)
       end while true
diff --git a/lib/yahns/queue_kqueue.rb b/lib/yahns/queue_kqueue.rb
index 4176f7a..33f5f8b 100644
--- a/lib/yahns/queue_kqueue.rb
+++ b/lib/yahns/queue_kqueue.rb
@@ -72,7 +72,7 @@ class Yahns::Queue < SleepyPenguin::Kqueue::IO # :nodoc:
             raise "BUG: #{io.inspect}#yahns_step returned: #{rv.inspect}"
           end
         end
-      rescue => e
+      rescue StandardError, LoadError, SyntaxError => e
         break if closed? # can still happen due to shutdown_timeout
         Yahns::Log.exception(logger, 'queue loop', e)
       end while true
-- 
EW
* Re: What would happen if a worker thread died?
From: Lin Jen-Shin (godfat) @ 2015-05-09 9:06 UTC
To: Eric Wong; +Cc: yahns-public, wildjcrt

On Sat, May 9, 2015 at 4:47 PM, Eric Wong <e@80x24.org> wrote:
> "Lin Jen-Shin (godfat)" <godfat@godfat.org> wrote:
>> A backtrace showing what happened is quite enough for me for now.
>> Still curious though: could this worker do anything else if this
>> happened? I am guessing that if the application no longer does
>> anything, then this worker would not do anything. Or might the socket
>> time out eventually?
>
> It depends on the application structure.
> Often apps have very different code paths for different endpoints so
> some endpoint being fatally broken may not affect others. A simple
> endpoint (e.g. static files) could function at 100% and serve other
> clients without any problems.
>
> Eventually the socket will time out if the client_expire_threshold is
> reached, otherwise it's fairly harmless to keep the socket around
> (aside from memory overhead).

Great! I was just worried that idle workers would pile up and
eventually no other workers would be able to do any work. As long as
there's a timeout for this and it can recover by itself, I think this
could be the best solution given all the trade-offs. Thank you!
Code repositories for project(s) associated with this public inbox: https://yhbt.net/yahns.git/