unicorn Ruby/Rack server user+dev discussion/patches/pulls/bugs/help
* unicorn stuck in sched_yield after ERESTARTNOHAND
       [not found] ` <BANLkTimF78PW9YgEAURS604Q8mucNwSDrg@mail.gmail.com>
@ 2011-05-31 12:02   ` Bharanee Rathna
  2011-05-31 15:17     ` Eric Wong
  2011-05-31 23:48     ` Eric Wong
  0 siblings, 2 replies; 7+ messages in thread
From: Bharanee Rathna @ 2011-05-31 12:02 UTC (permalink / raw)
  To: mongrel-unicorn

Hi,

I'm encountering a weird error where the unicorn workers are stuck in
a loop after hitting a 500 on the backend sinatra app.

strace at the point where it starts to go into a loop of death


select(7, [4 5], NULL, [3 6], {30, 0})  = 1 (in [5], left {27, 274382})
fchmod(8, 01)                           = 0
fcntl(5, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
accept4(5, {sa_family=AF_INET, sin_port=htons(56728),
sin_addr=inet_addr("10.1.1.4")}, [16], SOCK_CLOEXEC) = 12
recvfrom(12, 0x1c99fb0, 16384, 64, 0, 0) = -1 EAGAIN (Resource
temporarily unavailable)
select(13, [12], NULL, NULL, NULL)      = ? ERESTARTNOHAND (To be restarted)
--- SIGINT (Interrupt) @ 0 (0) ---
rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0
sched_yield()                           = 0

Longer  strace outputs can be found over at
https://gist.github.com/fe4e3172994e5de21317
I close any open db connections in before_fork and reopen connections
in after_fork. A bit of research suggests that rb_thread_wait has
issues when the select receives ERESTARTNOHAND, any ideas as to why
this might be happening ?

I'm running

$ uname -a
Linux bbox 2.6.38-02063806-generic #201105121509 SMP Thu May 12
15:14:14 UTC 2011 x86_64 GNU/Linux

$ ruby -v
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]

Thanks
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying



* Re: unicorn stuck in sched_yield after ERESTARTNOHAND
  2011-05-31 12:02   ` unicorn stuck in sched_yield after ERESTARTNOHAND Bharanee Rathna
@ 2011-05-31 15:17     ` Eric Wong
  2011-05-31 22:28       ` Bharanee Rathna
  2011-05-31 23:48     ` Eric Wong
  1 sibling, 1 reply; 7+ messages in thread
From: Eric Wong @ 2011-05-31 15:17 UTC (permalink / raw)
  To: unicorn list

Bharanee Rathna <deepfryed@gmail.com> wrote:
> A bit of research suggests that rb_thread_wait has
> issues when the select receives ERESTARTNOHAND, any ideas as to why
> this might be happening ?

Not sure, could be a bug in Ruby itself, kernel, or glibc.

I've seen similar reports of this in the past outside of Unicorn but
don't recall ever finding a satisfactory explanation.

> I'm running
> 
> $ uname -a
> Linux bbox 2.6.38-02063806-generic #201105121509 SMP Thu May 12
> 15:14:14 UTC 2011 x86_64 GNU/Linux
> 
> $ ruby -v
> ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]

Can you try 1.9.2-p180 or Ruby trunk?  Or maybe a different version of
glibc, too.  Do you have any non-standard kernel patches/scheduler
configs?

-- 
Eric Wong

* Re: unicorn stuck in sched_yield after ERESTARTNOHAND
  2011-05-31 15:17     ` Eric Wong
@ 2011-05-31 22:28       ` Bharanee Rathna
  0 siblings, 0 replies; 7+ messages in thread
From: Bharanee Rathna @ 2011-05-31 22:28 UTC (permalink / raw)
  To: unicorn list

> Can you try 1.9.2-p180 or Ruby trunk?  Or maybe a different version of
> glibc, too.

upgraded libc6 & ruby and was able to replicate it under libc6
2.12.1-0ubuntu10.2

$ ruby -v
ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-linux]


> Do you have any non-standard kernel patches/scheduler
> configs?

No

* Re: unicorn stuck in sched_yield after ERESTARTNOHAND
  2011-05-31 12:02   ` unicorn stuck in sched_yield after ERESTARTNOHAND Bharanee Rathna
  2011-05-31 15:17     ` Eric Wong
@ 2011-05-31 23:48     ` Eric Wong
  2011-06-01  0:31       ` Bharanee Rathna
  1 sibling, 1 reply; 7+ messages in thread
From: Eric Wong @ 2011-05-31 23:48 UTC (permalink / raw)
  To: unicorn list

Bharanee Rathna <deepfryed@gmail.com> wrote:
> I'm encountering a weird error where the unicorn workers are stuck in
> a loop after hitting a 500 on the backend sinatra app.

Also, what extensions are you using in your app?

> strace at the point where it starts to go into a loop of death

> select(7, [4 5], NULL, [3 6], {30, 0})  = 1 (in [5], left {27, 274382})
> fchmod(8, 01)                           = 0
> fcntl(5, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
> accept4(5, {sa_family=AF_INET, sin_port=htons(56728),
> sin_addr=inet_addr("10.1.1.4")}, [16], SOCK_CLOEXEC) = 12
> recvfrom(12, 0x1c99fb0, 16384, 64, 0, 0) = -1 EAGAIN (Resource
> temporarily unavailable)

(I'm somewhat more awake, now, haven't been sleeping much)

Two things look off in the line above:

1) recvfrom() isn't using the MSG_DONTWAIT flag.  I know you're using
   Linux, so kgio should be using MSG_DONTWAIT to do non-blocking
   recv...  Which versions of unicorn/kgio are you using?

2) TCP_DEFER_ACCEPT should prevent recvfrom() from hitting EAGAIN
   in the common case under Linux.
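
For readers unfamiliar with point 2, here is a minimal Ruby sketch of
setting TCP_DEFER_ACCEPT on a listening socket.  It is an illustration
only, not code from unicorn or kgio: the loopback address, the
ephemeral port and the 1-second value are assumptions, and the option
is Linux-only, so the sketch guards on the constant being defined.

```ruby
require "socket"

# Listen on an ephemeral loopback port (assumption for illustration).
server = TCPServer.new("127.0.0.1", 0)

if Socket.const_defined?(:TCP_DEFER_ACCEPT)
  # Ask the kernel to hold new connections back from accept() until the
  # client has sent data, or until roughly this many seconds have passed.
  server.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_DEFER_ACCEPT, 1)

  # Read the option back; the kernel may round the value it reports.
  defer_secs = server.getsockopt(Socket::IPPROTO_TCP,
                                 Socket::TCP_DEFER_ACCEPT).int
else
  # Not on Linux: the option is unavailable and nothing is set.
  defer_secs = nil
end

server.close
```

With the option in effect, a connection is normally handed to accept()
only once request bytes are already queued, which is why an immediate
EAGAIN on the first read is unexpected in the common case.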

> select(13, [12], NULL, NULL, NULL)      = ? ERESTARTNOHAND (To be restarted)
> --- SIGINT (Interrupt) @ 0 (0) ---
> rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)

What triggered SIGINT?

> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0
> sched_yield()                           = 0
> 
> Longer  strace outputs can be found over at
> https://gist.github.com/fe4e3172994e5de21317

Actually, after many lines of sched_yield() in your gist, I can see it
does actually exit the process.  Did you kill it with SIGINT?  If so, I
see nothing wrong...

Ruby 1.9 seems to sched_yield a lot during shutdown, but it does
eventually finish.

-- 
Eric Wong

* Re: unicorn stuck in sched_yield after ERESTARTNOHAND
  2011-05-31 23:48     ` Eric Wong
@ 2011-06-01  0:31       ` Bharanee Rathna
  2011-06-01  0:44         ` Bharanee Rathna
  2011-06-01 16:48         ` Eric Wong
  0 siblings, 2 replies; 7+ messages in thread
From: Bharanee Rathna @ 2011-06-01  0:31 UTC (permalink / raw)
  To: unicorn list

thanks for the quick response eric,

On Wed, Jun 1, 2011 at 9:48 AM, Eric Wong <normalperson@yhbt.net> wrote:
> Bharanee Rathna <deepfryed@gmail.com> wrote:
>> I'm encountering a weird error where the unicorn workers are stuck in
>> a loop after hitting a 500 on the backend sinatra app.
>
> Also, what extensions are you using in your app?

heaps of em. yajl, swift, rmagick, fastcaptcha, flock, nokogiri &
curb.  except swift and curb none of the others would be touching the
network.

>> strace at the point where it starts to go into a loop of death
>
>> select(7, [4 5], NULL, [3 6], {30, 0})  = 1 (in [5], left {27, 274382})
>> fchmod(8, 01)                           = 0
>> fcntl(5, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
>> accept4(5, {sa_family=AF_INET, sin_port=htons(56728),
>> sin_addr=inet_addr("10.1.1.4")}, [16], SOCK_CLOEXEC) = 12
>> recvfrom(12, 0x1c99fb0, 16384, 64, 0, 0) = -1 EAGAIN (Resource
>> temporarily unavailable)
>
> (I'm somewhat more awake, now, haven't been sleeping much)
>
> Two things look off in the line above:
>
> 1) recvfrom() isn't using the MSG_DONTWAIT flag.  I know you're using
>   Linux, so kgio should be using MSG_DONTWAIT to do non-blocking
>   recv...  Which versions of unicorn/kgio are you using?

using kgio 2.3.2, i'll upgrade it and give it another try

>
> 2) TCP_DEFER_ACCEPT should prevent recvfrom() from hitting EAGAIN
>   in the common case under Linux.
>
>> select(13, [12], NULL, NULL, NULL)      = ? ERESTARTNOHAND (To be restarted)
>> --- SIGINT (Interrupt) @ 0 (0) ---
>> rt_sigreturn(0x2)                       = -1 EINTR (Interrupted system call)
>
> What triggered SIGINT?

not sure

>
> Actually, after many lines of sched_yield() in your gist, I can see it
> does actually exit the process.  Did you kill it with SIGINT?  If so, I
> see nothing wrong...

yes i killed it after the worker looked stuck and wasn't responding for 30s

* Re: unicorn stuck in sched_yield after ERESTARTNOHAND
  2011-06-01  0:31       ` Bharanee Rathna
@ 2011-06-01  0:44         ` Bharanee Rathna
  2011-06-01 16:48         ` Eric Wong
  1 sibling, 0 replies; 7+ messages in thread
From: Bharanee Rathna @ 2011-06-01  0:44 UTC (permalink / raw)
  To: unicorn list

>> 1) recvfrom() isn't using the MSG_DONTWAIT flag.  I know you're using
>>   Linux, so kgio should be using MSG_DONTWAIT to do non-blocking
>>   recv...  Which versions of unicorn/kgio are you using?
>
> using kgio 2.3.2, i'll upgrade it and give it another try

repeatable with kgio 2.4.1

* Re: unicorn stuck in sched_yield after ERESTARTNOHAND
  2011-06-01  0:31       ` Bharanee Rathna
  2011-06-01  0:44         ` Bharanee Rathna
@ 2011-06-01 16:48         ` Eric Wong
  1 sibling, 0 replies; 7+ messages in thread
From: Eric Wong @ 2011-06-01 16:48 UTC (permalink / raw)
  To: unicorn list

Bharanee Rathna <deepfryed@gmail.com> wrote:
> On Wed, Jun 1, 2011 at 9:48 AM, Eric Wong <normalperson@yhbt.net> wrote:
> > Bharanee Rathna <deepfryed@gmail.com> wrote:
> >> I'm encountering a weird error where the unicorn workers are stuck in
> >> a loop after hitting a 500 on the backend sinatra app.
> >
> > Also, what extensions are you using in your app?
> 
> heaps of em. yajl, swift, rmagick, fastcaptcha, flock, nokogiri &
> curb.  except swift and curb none of the others would be touching the
> network.

So are any of them hitting Unicorn from a Unicorn worker itself?
This only happens when your app hits a 500?  I would not write
an app that does that, but if you do, be sure to shutdown any
open connections on a 500 (or avoid the error in the first place) ...

> >> strace at the point where it starts to go into a loop of death

Actually, the sched_yield() calls only started when you sent SIGINT.

> > What triggered SIGINT?
> 
> not sure
> >
> > Actually, after many lines of sched_yield() in your gist, I can see it
> > does actually exit the process.  Did you kill it with SIGINT?  If so, I
> > see nothing wrong...
> 
> yes i killed it after the worker looked stuck and wasn't responding for 30s

So you hit Ctrl-C (which sends SIGINT)?

So basically somebody from 10.1.1.4 opened a connection to Unicorn
and just let it sit idle there.  That somebody at 10.1.1.4 could've
been the worker that triggered the 500 (and forgot it had open
connections) or a third party that made a request but sat idle
because it couldn't handle the 500 which your Unicorn worker sent.
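
A minimal Ruby sketch of that situation, not taken from unicorn or
from the app in question: one peer connects and never writes, so the
first non-blocking read hits EAGAIN and the following select() has
nothing to wake it.  The helper name read_first_bytes and the
0.2-second timeout are assumptions for illustration; the select() in
the strace had no timeout at all, so it would wait indefinitely.

```ruby
require "socket"

# Returns the first bytes the client sent, or :idle if nothing arrives
# within `timeout` seconds.  (Hypothetical helper for illustration.)
def read_first_bytes(client, timeout)
  client.read_nonblock(16384)
rescue IO::WaitReadable
  # No data yet: the same EAGAIN seen on recvfrom() in the strace.
  return :idle unless IO.select([client], nil, nil, timeout)
  client.read_nonblock(16384)
end

server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]

# A peer that opens a connection and then sits idle.
idle_peer = TCPSocket.new("127.0.0.1", port)
idle_conn = server.accept
idle_result = read_first_bytes(idle_conn, 0.2)

# A peer that sends its request right away.
busy_peer = TCPSocket.new("127.0.0.1", port)
busy_peer.write("GET / HTTP/1.0\r\n\r\n")
busy_conn = server.accept
busy_result = read_first_bytes(busy_conn, 0.2)

[idle_peer, idle_conn, busy_peer, busy_conn, server].each(&:close)
```

Here idle_result is :idle while busy_result holds the request bytes;
a worker waiting on the idle peer without a timeout looks "stuck" even
though it is only waiting for data that never comes.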

From what I understand so far, it's not a Unicorn problem, but
something in your app.

Also, you wouldn't have this problem if nginx is in front of Unicorn
since nginx won't open a connection and sit idle before making a
request.

-- 
Eric Wong


Code repositories for project(s) associated with this public inbox

	https://yhbt.net/unicorn.git/
