What would happen if a worker thread died?

yahns Ruby server user/dev discussion
 help / color / mirror / code / Atom feed

* What would happen if a worker thread died?
@ 2015-05-08 12:20 Lin Jen-Shin (godfat)
  2015-05-08 17:03 ` Eric Wong
  0 siblings, 1 reply; 8+ messages in thread
From: Lin Jen-Shin (godfat) @ 2015-05-08 12:20 UTC (permalink / raw)
  To: yahns-public; +Cc: wildjcrt

Hi,

We just tried running yahns on production, and would like to
give some feedback. Eventually we turned back to Unicorn,
but that's because apparently Rails, or to be specific,
ActiveRecord (4.1.10) is not thread safe and giving random
errors once in a while.

However, we're also trying to split out an API server which is
not running on Rails, therefore we should be able to use a
threaded server.

During the experiments, we found that whenever a worker
thread died due to LoadError raised from the application,
which is not a StandardError therefore was not rescued at all,
crashing the worker thread (assumed, not verified).

When this happened, the client just hanged forever with yahns.
Is there something we can do about this? Would yahns respawn
a new worker thread? Can we close the socket when this happen?

I am aware that yahns is *extremely sensitive to fatal bugs in the
applications it hosts*, so I am just curious.

For reference, Puma would immediately close the socket without
sending anything, and Unicorn would log the error backtrace and
kill the worker (If I read it correctly).

In this case, Unicorn helped me figure out what's happened.

Cheers,

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: What would happen if a worker thread died?
  2015-05-08 12:20 What would happen if a worker thread died? Lin Jen-Shin (godfat)
@ 2015-05-08 17:03 ` Eric Wong
  2015-05-08 17:36   ` Lin Jen-Shin (godfat)
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Wong @ 2015-05-08 17:03 UTC (permalink / raw)
  To: Lin Jen-Shin (godfat); +Cc: yahns-public, wildjcrt

"Lin Jen-Shin (godfat)" <godfat@godfat.org> wrote:
> During the experiments, we found that whenever a worker
> thread died due to LoadError raised from the application,
> which is not a StandardError therefore was not rescued at all,
> crashing the worker thread (assumed, not verified).

Ugh, I guess since it happened in a thread, the error message got
swallowed unless you were running in $DEBUG.

Loading code after the server is ready and serving requests is a bad
idea.  It leads to really nasty thread-safety problems as well as
invalidating the method/constant caches.

> When this happened, the client just hanged forever with yahns.
> Is there something we can do about this? Would yahns respawn
> a new worker thread? Can we close the socket when this happen?

It's unfortunately difficult to detect thread death from ruby (no
SIGCHLD handler unlike for processes) besides polling Thread#join

We had this issue in ruby-core a few years back, but apparently
it was forgotten/ignored by matz.  Care to chime in?
https://bugs.ruby-lang.org/issues/6647

> I am aware that yahns is *extremely sensitive to fatal bugs in the
> applications it hosts*, so I am just curious.
> 
> For reference, Puma would immediately close the socket without
> sending anything, and Unicorn would log the error backtrace and
> kill the worker (If I read it correctly).
> 
> In this case, Unicorn helped me figure out what's happened.

yahns can probably rescue Exception (or Object(!) like puma)
and then log + abort the entire process.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: What would happen if a worker thread died?
  2015-05-08 17:03 ` Eric Wong
@ 2015-05-08 17:36   ` Lin Jen-Shin (godfat)
  2015-05-09  1:03     ` Eric Wong
  0 siblings, 1 reply; 8+ messages in thread
From: Lin Jen-Shin (godfat) @ 2015-05-08 17:36 UTC (permalink / raw)
  To: Eric Wong; +Cc: yahns-public, wildjcrt

On Sat, May 9, 2015 at 1:03 AM, Eric Wong <e@80x24.org> wrote:
> "Lin Jen-Shin (godfat)" <godfat@godfat.org> wrote:
> Ugh, I guess since it happened in a thread, the error message got
> swallowed unless you were running in $DEBUG.

I like the idea of setting $DEBUG, however it's too noisy given
current state :( See how many messages got printed with this:

$ ruby -dS gem list > /dev/null

> Loading code after the server is ready and serving requests is a bad
> idea.  It leads to really nasty thread-safety problems as well as
> invalidating the method/constant caches.

Yeah, I did that in the first place because I don't want my runtime
being polluted with those methods from activesupport. However at
some point we just need to use some stuffs from activesupport
in some cases. Therefore I put those requires in that particular
method. Only a few cases it would be used, that's what I was thinking.
But eventually it got triggered from some unexpected places.

Now it's removed and replaced with codes copied directly from
activesupport. No activesupport and requires need now.

>> When this happened, the client just hanged forever with yahns.
>> Is there something we can do about this? Would yahns respawn
>> a new worker thread? Can we close the socket when this happen?
>
> It's unfortunately difficult to detect thread death from ruby (no
> SIGCHLD handler unlike for processes) besides polling Thread#join
>
> We had this issue in ruby-core a few years back, but apparently
> it was forgotten/ignored by matz.  Care to chime in?
> https://bugs.ruby-lang.org/issues/6647

I just sent a few characters, hope that would speed up the process.

>> I am aware that yahns is *extremely sensitive to fatal bugs in the
>> applications it hosts*, so I am just curious.
>>
>> For reference, Puma would immediately close the socket without
>> sending anything, and Unicorn would log the error backtrace and
>> kill the worker (If I read it correctly).
>>
>> In this case, Unicorn helped me figure out what's happened.
>
> yahns can probably rescue Exception (or Object(!) like puma)
> and then log + abort the entire process.

I think rescuing Object is misleading. AFAIK, we cannot raise
an instance which is not a kind of Exception.

I learned that rescuing Exception is a bad idea because like
signal handling and some other stuffs are also using Exception
to communicate, and of course we won't want to interfere.

However for a worker thread, I guess that might be ok?

Cheers,

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: What would happen if a worker thread died?
  2015-05-08 17:36   ` Lin Jen-Shin (godfat)
@ 2015-05-09  1:03     ` Eric Wong
  2015-05-09  7:26       ` Lin Jen-Shin (godfat)
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Wong @ 2015-05-09  1:03 UTC (permalink / raw)
  To: Lin Jen-Shin (godfat); +Cc: yahns-public, wildjcrt

"Lin Jen-Shin (godfat)" <godfat@godfat.org> wrote:
> On Sat, May 9, 2015 at 1:03 AM, Eric Wong <e@80x24.org> wrote:
> > It's unfortunately difficult to detect thread death from ruby (no
> > SIGCHLD handler unlike for processes) besides polling Thread#join
> >
> > We had this issue in ruby-core a few years back, but apparently
> > it was forgotten/ignored by matz.  Care to chime in?
> > https://bugs.ruby-lang.org/issues/6647
> 
> I just sent a few characters, hope that would speed up the process.

Thanks for reminding us of this, care to examine/fix some of the MRI
test failures in the patch I posted to MRI? :)

> >> I am aware that yahns is *extremely sensitive to fatal bugs in the
> >> applications it hosts*, so I am just curious.
> >>
> >> For reference, Puma would immediately close the socket without
> >> sending anything, and Unicorn would log the error backtrace and
> >> kill the worker (If I read it correctly).

Actually, unicorn doesn't do that explicitly, it's standard Ruby
behavior for the main thread.

> >> In this case, Unicorn helped me figure out what's happened.
> >
> > yahns can probably rescue Exception (or Object(!) like puma)
> > and then log + abort the entire process.
> 
> I think rescuing Object is misleading. AFAIK, we cannot raise
> an instance which is not a kind of Exception.

I guess, there's some internal non-object interrupts in MRI for threads
(eKillSignal, eTerminateSignal) but I don't think those get exposed to
Ruby-land...

> I learned that rescuing Exception is a bad idea because like
> signal handling and some other stuffs are also using Exception
> to communicate, and of course we won't want to interfere.

Right.

> However for a worker thread, I guess that might be ok?

Maybe limiting it to the common types {Standard,Load,Syntax}Error
is sufficient.

Below, I'm choosing to both leave the socket open and keep the worker
running to slow down a potentially malicious client if this happens and
to hopefully prevent an evil client from taking others down with it.

The process may be in bad state from Load/SyntaxErrors anyways with
partially loaded code, though.

yahns cannot be made error-tolerant when given buggy code, but it should
at least allow users to find problems since the Ruby default behavior
sucks right now:

diff --git a/lib/yahns/queue_epoll.rb b/lib/yahns/queue_epoll.rb
index 4f3289e..2875920 100644
--- a/lib/yahns/queue_epoll.rb
+++ b/lib/yahns/queue_epoll.rb
@@ -64,7 +64,7 @@ class Yahns::Queue < SleepyPenguin::Epoll::IO # :nodoc:
             raise "BUG: #{io.inspect}#yahns_step returned: #{rv.inspect}"
           end
         end
-      rescue => e
+      rescue StandardError, LoadError, SyntaxError => e
         break if closed? # can still happen due to shutdown_timeout
         Yahns::Log.exception(logger, 'queue loop', e)
       end while true
diff --git a/lib/yahns/queue_kqueue.rb b/lib/yahns/queue_kqueue.rb
index 4176f7a..33f5f8b 100644
--- a/lib/yahns/queue_kqueue.rb
+++ b/lib/yahns/queue_kqueue.rb
@@ -72,7 +72,7 @@ class Yahns::Queue < SleepyPenguin::Kqueue::IO # :nodoc:
             raise "BUG: #{io.inspect}#yahns_step returned: #{rv.inspect}"
           end
         end
-      rescue => e
+      rescue StandardError, LoadError, SyntaxError => e
         break if closed? # can still happen due to shutdown_timeout
         Yahns::Log.exception(logger, 'queue loop', e)
       end while true

Thoughts?

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: What would happen if a worker thread died?
  2015-05-09  1:03     ` Eric Wong
@ 2015-05-09  7:26       ` Lin Jen-Shin (godfat)
  2015-05-09  8:47         ` Eric Wong
  0 siblings, 1 reply; 8+ messages in thread
From: Lin Jen-Shin (godfat) @ 2015-05-09  7:26 UTC (permalink / raw)
  To: Eric Wong; +Cc: yahns-public, wildjcrt

On Sat, May 9, 2015 at 9:03 AM, Eric Wong <e@80x24.org> wrote:
> "Lin Jen-Shin (godfat)" <godfat@godfat.org> wrote:
>> On Sat, May 9, 2015 at 1:03 AM, Eric Wong <e@80x24.org> wrote:
>> > It's unfortunately difficult to detect thread death from ruby (no
>> > SIGCHLD handler unlike for processes) besides polling Thread#join
>> >
>> > We had this issue in ruby-core a few years back, but apparently
>> > it was forgotten/ignored by matz.  Care to chime in?
>> > https://bugs.ruby-lang.org/issues/6647
>>
>> I just sent a few characters, hope that would speed up the process.
>
> Thanks for reminding us of this, care to examine/fix some of the MRI
> test failures in the patch I posted to MRI? :)

Haha, cool. Probably not now though. I just took some look, ignoring
warnings, I guess some of the tests were trying to capture stdout or
stderr and assert on messages. Along with abort_on_exception and
using join to peek the exception, this probably breaks those tests.

So I assume most of them were bugs in the tests, not in MRI itself.
Testing error messages is hard :(

>> I think rescuing Object is misleading. AFAIK, we cannot raise
>> an instance which is not a kind of Exception.
>
> I guess, there's some internal non-object interrupts in MRI for threads
> (eKillSignal, eTerminateSignal) but I don't think those get exposed to
> Ruby-land...

Got it, makes sense.

>> However for a worker thread, I guess that might be ok?
>
> Maybe limiting it to the common types {Standard,Load,Syntax}Error
> is sufficient.

Those are what I can think of right now, too.

> Below, I'm choosing to both leave the socket open and keep the worker
> running to slow down a potentially malicious client if this happens and
> to hopefully prevent an evil client from taking others down with it.

I am curious how this could slow down a malicious client? Because this
might somehow confuse them that the worker is still working?

> The process may be in bad state from Load/SyntaxErrors anyways with
> partially loaded code, though.
>
> yahns cannot be made error-tolerant when given buggy code, but it should
> at least allow users to find problems since the Ruby default behavior
> sucks right now:
>
> diff --git a/lib/yahns/queue_epoll.rb b/lib/yahns/queue_epoll.rb
> index 4f3289e..2875920 100644
> --- a/lib/yahns/queue_epoll.rb
> +++ b/lib/yahns/queue_epoll.rb
> @@ -64,7 +64,7 @@ class Yahns::Queue < SleepyPenguin::Epoll::IO # :nodoc:
>              raise "BUG: #{io.inspect}#yahns_step returned: #{rv.inspect}"
>            end
>          end
> -      rescue => e
> +      rescue StandardError, LoadError, SyntaxError => e
>          break if closed? # can still happen due to shutdown_timeout
>          Yahns::Log.exception(logger, 'queue loop', e)
>        end while true
> diff --git a/lib/yahns/queue_kqueue.rb b/lib/yahns/queue_kqueue.rb
> index 4176f7a..33f5f8b 100644
> --- a/lib/yahns/queue_kqueue.rb
> +++ b/lib/yahns/queue_kqueue.rb
> @@ -72,7 +72,7 @@ class Yahns::Queue < SleepyPenguin::Kqueue::IO # :nodoc:
>              raise "BUG: #{io.inspect}#yahns_step returned: #{rv.inspect}"
>            end
>          end
> -      rescue => e
> +      rescue StandardError, LoadError, SyntaxError => e
>          break if closed? # can still happen due to shutdown_timeout
>          Yahns::Log.exception(logger, 'queue loop', e)
>        end while true
>
> Thoughts?

A backtrace for knowing what's happening I think is quite enough for me now.
Still curious though, could this worker do anything else if this happened?
I am guessing that if the application no longer does anything, then this worker
would not do anything. Or the socket might timeout eventually?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: What would happen if a worker thread died?
  2015-05-09  7:26       ` Lin Jen-Shin (godfat)
@ 2015-05-09  8:47         ` Eric Wong
  2015-05-09  9:03           ` [PATCH] worker threads log LoadError and SyntaxError, too Eric Wong
  2015-05-09  9:06           ` What would happen if a worker thread died? Lin Jen-Shin (godfat)
  0 siblings, 2 replies; 8+ messages in thread
From: Eric Wong @ 2015-05-09  8:47 UTC (permalink / raw)
  To: Lin Jen-Shin (godfat); +Cc: yahns-public, wildjcrt

"Lin Jen-Shin (godfat)" <godfat@godfat.org> wrote:
> On Sat, May 9, 2015 at 9:03 AM, Eric Wong <e@80x24.org> wrote:
> > Below, I'm choosing to both leave the socket open and keep the worker
> > running to slow down a potentially malicious client if this happens and
> > to hopefully prevent an evil client from taking others down with it.
> 
> I am curious how this could slow down a malicious client? Because this
> might somehow confuse them that the worker is still working?

Right, it might not know if the app server is throttling responses or if
there's packet loss on the network.  Other than the small amount of
memory used for the socket, it won't use other system resources once the
error is logged.

> A backtrace for knowing what's happening I think is quite enough for me now.
> Still curious though, could this worker do anything else if this happened?
> I am guessing that if the application no longer does anything, then this worker
> would not do anything. Or the socket might timeout eventually?

It depends on the application structure.
Often apps have very different code paths for different endpoints so
some endpoint being fatally broken may not affect others.  A simple
endpoint (e.g. static files) could function at 100% and serve other
clients without any problems.

Eventually the socket will timeout if the client_expire_threshold is
reached, otherwise it's fairly harmless to keep the socket around
(aside from memory overhead).

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH] worker threads log LoadError and SyntaxError, too
  2015-05-09  8:47         ` Eric Wong
@ 2015-05-09  9:03           ` Eric Wong
  2015-05-09  9:06           ` What would happen if a worker thread died? Lin Jen-Shin (godfat)
  1 sibling, 0 replies; 8+ messages in thread
From: Eric Wong @ 2015-05-09  9:03 UTC (permalink / raw)
  To: Lin Jen-Shin (godfat); +Cc: yahns-public, wildjcrt

Some applications may lazily load code during app dispatch,
triggering LoadError or SyntaxError exceptions.  Log the error and
backtrace so application maintainers can more easily notice and
diagnose problems.

Keep in mind users are likely to have performance and race condition
problems with lazy loading, and the process may still be in a bad
state due to partially-loaded code.  This commit is only intended to
give application authors a chance to notice and fix or avoid
problems in the future.

Note: logging fatal exceptions by default in all threads was
proposed in ruby-core, but currently not implemented in any released
version:

	https://bugs.ruby-lang.org/issues/6647

Reported-by: Lin Jen-Shin (godfat) <godfat@godfat.org>
<CAA2_N1umJO12XH9r+JHnA6r=z=Mwp_PqOrdnW65oqW2K2-iAoQ@mail.gmail.com>
---
 I'll push this over the weekend and release 1.7 from master
 (proxy_pass isn't production-ready, but there's still a good deal
  of small improvements going in).

 lib/yahns/queue_epoll.rb  | 2 +-
 lib/yahns/queue_kqueue.rb | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/yahns/queue_epoll.rb b/lib/yahns/queue_epoll.rb
index 4f3289e..2875920 100644
--- a/lib/yahns/queue_epoll.rb
+++ b/lib/yahns/queue_epoll.rb
@@ -64,7 +64,7 @@ class Yahns::Queue < SleepyPenguin::Epoll::IO # :nodoc:
             raise "BUG: #{io.inspect}#yahns_step returned: #{rv.inspect}"
           end
         end
-      rescue => e
+      rescue StandardError, LoadError, SyntaxError => e
         break if closed? # can still happen due to shutdown_timeout
         Yahns::Log.exception(logger, 'queue loop', e)
       end while true
diff --git a/lib/yahns/queue_kqueue.rb b/lib/yahns/queue_kqueue.rb
index 4176f7a..33f5f8b 100644
--- a/lib/yahns/queue_kqueue.rb
+++ b/lib/yahns/queue_kqueue.rb
@@ -72,7 +72,7 @@ class Yahns::Queue < SleepyPenguin::Kqueue::IO # :nodoc:
             raise "BUG: #{io.inspect}#yahns_step returned: #{rv.inspect}"
           end
         end
-      rescue => e
+      rescue StandardError, LoadError, SyntaxError => e
         break if closed? # can still happen due to shutdown_timeout
         Yahns::Log.exception(logger, 'queue loop', e)
       end while true
-- 
EW


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: What would happen if a worker thread died?
  2015-05-09  8:47         ` Eric Wong
  2015-05-09  9:03           ` [PATCH] worker threads log LoadError and SyntaxError, too Eric Wong
@ 2015-05-09  9:06           ` Lin Jen-Shin (godfat)
  1 sibling, 0 replies; 8+ messages in thread
From: Lin Jen-Shin (godfat) @ 2015-05-09  9:06 UTC (permalink / raw)
  To: Eric Wong; +Cc: yahns-public, wildjcrt

On Sat, May 9, 2015 at 4:47 PM, Eric Wong <e@80x24.org> wrote:
> "Lin Jen-Shin (godfat)" <godfat@godfat.org> wrote:
>> A backtrace for knowing what's happening I think is quite enough for me now.
>> Still curious though, could this worker do anything else if this happened?
>> I am guessing that if the application no longer does anything, then this worker
>> would not do anything. Or the socket might timeout eventually?
>
> It depends on the application structure.
> Often apps have very different code paths for different endpoints so
> some endpoint being fatally broken may not affect others.  A simple
> endpoint (e.g. static files) could function at 100% and serve other
> clients without any problems.
>
> Eventually the socket will timeout if the client_expire_threshold is
> reached, otherwise it's fairly harmless to keep the socket around
> (aside from memory overhead).

Great! I was just worried that idled workers would get piled up and
eventually no other workers would be able to do any work.
As long as there's a timeout for this and it could recover itself,
I think this is could be the best solution given all the trade off.

Thank you!

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2015-05-09  9:06 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-08 12:20 What would happen if a worker thread died? Lin Jen-Shin (godfat)
2015-05-08 17:03 ` Eric Wong
2015-05-08 17:36   ` Lin Jen-Shin (godfat)
2015-05-09  1:03     ` Eric Wong
2015-05-09  7:26       ` Lin Jen-Shin (godfat)
2015-05-09  8:47         ` Eric Wong
2015-05-09  9:03           ` [PATCH] worker threads log LoadError and SyntaxError, too Eric Wong
2015-05-09  9:06           ` What would happen if a worker thread died? Lin Jen-Shin (godfat)

Code repositories for project(s) associated with this public inbox

	https://yhbt.net/yahns.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).