Weird Unicorn Timeout Issues (Hibernation problem?)

unicorn Ruby/Rack server user+dev discussion/patches/pulls/bugs/help

* Weird Unicorn Timeout Issues (Hibernation problem?)
@ 2014-08-04 18:12 Tony Devlin
  2014-08-04 18:39 ` Eric Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Tony Devlin @ 2014-08-04 18:12 UTC (permalink / raw)
  To: unicorn-public

Current Setup:  (the problem existed before updating nginx, unicorn and
rails; but in an attempt to solve the problem I updated them).

CentOS 6.3 (x64)
Ruby 2.1.2p95
Rails 4.0.0
Nginx 1.6.0
Unicorn 4.8.3

====

We have an issue where, if a site is not accessed for around 30 minutes
(on average), the next query will time out, and it will time out on
every open worker.  I.e., if I have two workers, both of them will time
out: even once the first one is working again after its timeout, the
second worker will time out as soon as it is called upon.  Then everything
runs perfectly until the site is not accessed for 30 minutes or more, and
the timeout issue starts all over again.

This happens on our dev machine on two different applications (only apps on
the server) and on our production machine (also running two apps).

I can't figure out exactly what is going on.  I tried running strace on
the master unicorn process, but got no relevant information.  One of the
applications is also tied to a New Relic account, and it gives no
information either.

Nginx logs show this: (sorry, obfuscated certain details due to the
internal enterprise nature)

2014/08/04 11:31:21 [error] 6353#0: *339 upstream prematurely closed
connection while reading response header from upstream, client: *.*.*.*,
server: ****.org, request: "GET /assets/application.css HTTP/1.1",
upstream:
"http://unix:/var/www/sites/****/shared/sockets/.unicorn.sock.0:/assets/application.css",
host: "****.org", referrer: "http://****.org/outages"

2014/08/04 11:31:53 [error] 31058#0: *1 upstream prematurely closed
connection while reading response header from upstream, client: *.*.*.*,
server: ****.org, request: "GET /outages HTTP/1.1", upstream:
"http://unix:/var/www/sites/****/shared/sockets/.unicorn.sock.0:/outages",
host: "****.org", referrer: "http://****.org/outages"

Unicorn stderr shows this: (sorry, obfuscated certain details due to the
internal enterprise nature)

E, [2014-08-04T11:31:21.620379 #11991] ERROR -- : worker=1 PID:29701
timeout (21s > 20s), killing
E, [2014-08-04T11:31:21.630521 #11991] ERROR -- : reaped #<Process::Status:
pid 29701 SIGKILL (signal 9)> worker=1
I, [2014-08-04T11:31:21.639881 #30521]  INFO -- : worker=1 ready

E, [2014-08-04T11:31:53.666300 #11991] ERROR -- : worker=0 PID:29705
timeout (21s > 20s), killing
E, [2014-08-04T11:31:53.676984 #11991] ERROR -- : reaped #<Process::Status:
pid 29705 SIGKILL (signal 9)> worker=0
I, [2014-08-04T11:31:53.687157 #30528]  INFO -- : worker=0 ready

It's interesting that the timeout occurs at different stages.  For
example, if you look at the two nginx errors, one timed out at
/assets/application.css and the other at the root /outages/; I've had it be
small gifs as well.  I'd also like to point out that I have changed the
timeout from 30s to 60s to 300s to 20s, and it occurs at every setting, so
it is not a "true" timeout issue.  Also, the two apps on the server are
completely different, with only Ruby and Rails in common.  The gems are
different in both, the database is different in both; in fact, one of them
has no assets, no css and no ui files at all.  The only similarities are
Ruby, Rails, Nginx and Unicorn.

It's like there is some sort of hibernation problem with the unicorn
workers.  The hibernation thought comes from these debug messages (I
dropped the log level to debug while trying to locate the problem).  I get
these every now and then:

D, [2014-08-03T22:14:49.776538 #11991] DEBUG -- : waiting 11.0s after
suspend/hibernation

But then they cease to occur for days; for example, that message was the
last time it occurred (none in the logs for today).

I've been trying to solve this for the past 4 days and have made a large
number of changes, to no avail.  Any help would be GREATLY appreciated.


* Re: Weird Unicorn Timeout Issues (Hibernation problem?)
  2014-08-04 18:12 Weird Unicorn Timeout Issues (Hibernation problem?) Tony Devlin
@ 2014-08-04 18:39 ` Eric Wong
  2014-08-04 18:41   ` Kapil Israni
                     ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Eric Wong @ 2014-08-04 18:39 UTC (permalink / raw)
  To: Tony Devlin; +Cc: unicorn-public

Tony Devlin <tonydevlin@gmail.com> wrote:
> We have an issue where, if a site is not accessed for around 30 minutes
> (on average), the next query will time out, and it will time out on
> every open worker.  I.e., if I have two workers, both of them will time
> out: even once the first one is working again after its timeout, the
> second worker will time out as soon as it is called upon.  Then everything
> runs perfectly until the site is not accessed for 30 minutes or more, and
> the timeout issue starts all over again.

This sounds like the idle timeout for MySQL (or similar) kicking in.
What database(s) or other backends are you using?

That said, we've had problems with hibernate/suspend in the past,
so I'll double check.


* Re: Weird Unicorn Timeout Issues (Hibernation problem?)
  2014-08-04 18:39 ` Eric Wong
@ 2014-08-04 18:41   ` Kapil Israni
  2014-08-04 18:48     ` Eric Wong
  2014-08-04 18:45   ` Tony Devlin
  2014-08-04 18:55   ` Daniel Condomitti
  2 siblings, 1 reply; 18+ messages in thread
From: Kapil Israni @ 2014-08-04 18:41 UTC (permalink / raw)
  To: unicorn-public

I unsubscribed from this list a few days back, still getting emails?




-- 
Kapil



* Re: Weird Unicorn Timeout Issues (Hibernation problem?)
  2014-08-04 18:39 ` Eric Wong
  2014-08-04 18:41   ` Kapil Israni
@ 2014-08-04 18:45   ` Tony Devlin
  2014-08-04 19:11     ` Eric Wong
  2014-08-04 18:55   ` Daniel Condomitti
  2 siblings, 1 reply; 18+ messages in thread
From: Tony Devlin @ 2014-08-04 18:45 UTC (permalink / raw)
  To: Eric Wong; +Cc: unicorn-public

Eric,

Thank you for responding.  We use a database on only one of the apps; it is
an Oracle 11g RAC server.  I'll get the DBA to double check the idle timeout
for that DB.  Though the other app does not use a database.



* Re: Weird Unicorn Timeout Issues (Hibernation problem?)
  2014-08-04 18:41   ` Kapil Israni
@ 2014-08-04 18:48     ` Eric Wong
  0 siblings, 0 replies; 18+ messages in thread
From: Eric Wong @ 2014-08-04 18:48 UTC (permalink / raw)
  To: Kapil Israni; +Cc: unicorn-public

Kapil Israni <kapil.israni@gmail.com> wrote:
> I unsubscribed from this list a few days back, still getting emails?

*checking logs*  You sent the unsubscribe request, but never responded
to the confirmation email.  If you lost the confirmation email, you
can send the unsubscribe request again; make sure you respond to
the confirmation.


* Re: Weird Unicorn Timeout Issues (Hibernation problem?)
  2014-08-04 18:39 ` Eric Wong
  2014-08-04 18:41   ` Kapil Israni
  2014-08-04 18:45   ` Tony Devlin
@ 2014-08-04 18:55   ` Daniel Condomitti
  2014-08-04 19:34     ` Eric Wong
  2 siblings, 1 reply; 18+ messages in thread
From: Daniel Condomitti @ 2014-08-04 18:55 UTC (permalink / raw)
  To: Eric Wong; +Cc: Tony Devlin, unicorn-public

It could also be that your TCP keepalive interval is higher than your
database server’s connection timeout. I’ve run into that in the past.



* Re: Weird Unicorn Timeout Issues (Hibernation problem?)
  2014-08-04 18:45   ` Tony Devlin
@ 2014-08-04 19:11     ` Eric Wong
  2014-08-04 19:21       ` Tony Devlin
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Wong @ 2014-08-04 19:11 UTC (permalink / raw)
  To: Tony Devlin; +Cc: unicorn-public

Tony Devlin <tonydevlin@gmail.com> wrote:
> Eric,
> 
> Thank you for responding.  We use a database on only one of the apps; it is
> an Oracle 11g RAC server.  I'll get the DBA to double check the idle timeout
> for that DB.  Though the other app does not use a database.

OK.  Does the other app connect to any other external services?
If you're not sure, you can check with: lsof -p $WORKER_PID

Do lsof once a worker has had a few requests served, as some libraries
lazily open connections.  I suggest debugging problems on an instance
with only one worker to make reproducing the problem easier.

There should be a general query timeout for all DB/external connections:
  http://unicorn.bogomips.org/Application_Timeouts.html

I haven't been able to reproduce the issue on a hello-world app with
no external dependencies:

$ unicorn -E none -c unicorn.conf.rb
---------- unicorn.conf.rb ----------
timeout 20
------------ config.ru --------------
require 'rack/lobster'
use Rack::ContentLength
run Rack::Lobster.new
-- 
EW


* Re: Weird Unicorn Timeout Issues (Hibernation problem?)
  2014-08-04 19:11     ` Eric Wong
@ 2014-08-04 19:21       ` Tony Devlin
  2014-08-04 19:45         ` Eric Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Tony Devlin @ 2014-08-04 19:21 UTC (permalink / raw)
  To: Eric Wong; +Cc: unicorn-public

Thank you Eric,

I will look into the other worker to see what is going on with it.  I still
appreciate any hints you all can give me on where I can check.   I'm also
looking into the OS TCP timeouts to see if what Daniel said may be a
problem.



* Re: Weird Unicorn Timeout Issues (Hibernation problem?)
  2014-08-04 18:55   ` Daniel Condomitti
@ 2014-08-04 19:34     ` Eric Wong
  2014-08-04 20:24       ` Tony Devlin
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Wong @ 2014-08-04 19:34 UTC (permalink / raw)
  To: Daniel Condomitti; +Cc: Tony Devlin, unicorn-public

Daniel Condomitti <daniel@condomitti.com> wrote:
> It could also be that your TCP keepalive interval is higher than your
> database server’s connection timeout. I’ve run into that in the past.

That kicks in at around 2 hours by default on Linux systems.
I'm not sure it would matter for Tony's case since he hit it
after ~30 minutes of idle (unless he tuned the knobs himself).

ref: tcp_keep* knobs in
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/networking/ip-sysctl.txt

unicorn itself has no timers outside of the configurable timeout.
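
As a rough illustration of how those knobs add up, here is a Ruby sketch.
It assumes the stock Linux defaults (tcp_keepalive_time=7200,
tcp_keepalive_intvl=75, tcp_keepalive_probes=9); the helper names are
hypothetical and not part of unicorn.  Keepalive probes are only sent on
sockets that have SO_KEEPALIVE enabled.

```ruby
# Hypothetical helpers (not part of unicorn) for reasoning about the
# Linux tcp_keepalive_* sysctls.
KEEPALIVE_DIR = "/proc/sys/net/ipv4"

# Returns [time, intvl, probes] as integers; an entry is nil when the
# corresponding file is not readable (e.g. on a non-Linux system).
def keepalive_knobs(dir = KEEPALIVE_DIR)
  %w[tcp_keepalive_time tcp_keepalive_intvl tcp_keepalive_probes].map do |knob|
    path = File.join(dir, knob)
    File.readable?(path) ? File.read(path).to_i : nil
  end
end

# Seconds of idle time until the kernel gives up on an unresponsive peer:
# the initial idle period plus every unanswered probe interval.
def dead_peer_detection_seconds(time, intvl, probes)
  time + intvl * probes
end

# True only if the first probe would be sent before a middlebox with the
# given idle timeout (in seconds) forgets the connection.
def first_probe_beats_idle_timeout?(keepalive_time, idle_timeout)
  keepalive_time < idle_timeout
end
```

With the defaults, the first probe goes out after 7200s, well past a
30-minute (1800s) idle window, so the defaults alone would not keep such a
connection alive.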


* Re: Weird Unicorn Timeout Issues (Hibernation problem?)
  2014-08-04 19:21       ` Tony Devlin
@ 2014-08-04 19:45         ` Eric Wong
  2014-08-04 20:06           ` Michael Fischer
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Wong @ 2014-08-04 19:45 UTC (permalink / raw)
  To: Tony Devlin; +Cc: unicorn-public

Tony Devlin <tonydevlin@gmail.com> wrote:
> Thank you Eric,
> 
> I will look into the other worker to see what is going on with it.  I still
> appreciate any hints you all can give me on where I can check.   I'm also
> looking into the OS TCP timeouts to see if what Daniel said may be a
> problem.

General rule for me is to get the problem reproducible in the smallest
possible way.  That could mean removing features, cutting out large
chunks of code, cutting out certain request types, reducing
workers.

More things:

1) Can you make sure nginx is not trying to maintain persistent
  connections?  nginx should respect unicorn closing the connection
  but I haven't checked the latest versions of nginx.
  lsof can help here, too.

  unicorn currently does not do persistent connections, allowing
  an M:N relationship between nginx instances and unicorn
  instances[1]

2) Any other odd external dependencies such as NFS mounts,
   file locks, FIFOs, etc?

[1] Perhaps persistent connections will be an option in the future
    if the support/documentation overhead is worth it, as nginx
    supports persistent connections to backends nowadays.


* Re: Weird Unicorn Timeout Issues (Hibernation problem?)
  2014-08-04 19:45         ` Eric Wong
@ 2014-08-04 20:06           ` Michael Fischer
  0 siblings, 0 replies; 18+ messages in thread
From: Michael Fischer @ 2014-08-04 20:06 UTC (permalink / raw)
  To: Eric Wong; +Cc: Tony Devlin, unicorn-public

On Mon, Aug 4, 2014 at 12:45 PM, Eric Wong <e@80x24.org> wrote:

> [1] Perhaps persistent connections will be an option in the future
>     if the support/documentation overhead is worth it, as nginx
>     supports persistent connections to backends nowadays.
>

I don't believe the added complexity is worth the effort.

--Michael



* Re: Weird Unicorn Timeout Issues (Hibernation problem?)
  2014-08-04 19:34     ` Eric Wong
@ 2014-08-04 20:24       ` Tony Devlin
  2014-08-04 20:44         ` Eric Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Tony Devlin @ 2014-08-04 20:24 UTC (permalink / raw)
  To: Eric Wong; +Cc: Daniel Condomitti, unicorn-public

Yep, it occurs after 30 minutes of inactivity.  Down to the minute; I hit
the site at 3:40 and tried at 4:10 and sure enough:

E, [2014-08-04T16:10:52.143541 #2596] ERROR -- : worker=0 PID:2599
timeout (21s > 20s), killing
E, [2014-08-04T16:10:52.158459 #2596] ERROR -- : reaped #<Process::Status:
pid 2599 SIGKILL (signal 9)> worker=0
I, [2014-08-04T16:10:52.181648 #3086]  INFO -- : worker=0 ready

2014/08/04 16:10:52 [error] 1684#0: *13 upstream prematurely closed
connection while reading response header from upstream, client: *.*.*.*,
server: ***.org, request: "GET /outages HTTP/1.1", upstream:
"http://unix:/var/www/sites/***/shared/sockets/.unicorn.sock.0:/outages",
host: "***.org", referrer: "http://***.org/outages"

===

This occurs on both instances of unicorn workers that we have opened.  I'm
going to reduce that to one instance, per Eric, to continue troubleshooting
in the smallest possible way.

1) It does not appear to be an nginx persistent connection issue, because
once the worker is reaped and restarted, nginx serves the content with no
problems.
2) No NFS mounts, no file locks, no FIFO issues.  (note: one of the apps
does write to files, aside from logs, but problem exists in both apps).

It's also important to note that once the worker is reaped the site is
blazingly fast, with sub-second responses (2s at most to show the biggest
page), until 30 minutes of inactivity pass, at which point the timeout hits
and the worker is reaped (rinse and repeat).

For the database portion, the DBA says idle connections are killed after
3 hours, a far greater time span than the one at which this issue occurs.

Have any other ideas of places I can look?  It's too consistent, it has to
be some specific setting or functionality that does this.

I checked my TCP Timeout settings just in case, but the timeout is set to
2hrs.


* Re: Weird Unicorn Timeout Issues (Hibernation problem?)
  2014-08-04 20:24       ` Tony Devlin
@ 2014-08-04 20:44         ` Eric Wong
  2014-08-04 20:46           ` Eric Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Wong @ 2014-08-04 20:44 UTC (permalink / raw)
  To: Tony Devlin; +Cc: Daniel Condomitti, unicorn-public

Tony Devlin <tonydevlin@gmail.com> wrote:
> Have any other ideas of places I can look?  It's too consistent, it has to
> be some specific setting or functionality that does this.

Unless you find something out-of-the-ordinary with lsof,
we'd have to pull apart your apps to see what they're doing.

I just tested my hello world app (inactive for ~45 minutes) and
could not reproduce the error.

Did you try strace-ing for 30 minutes and reproducing the error?

I'm running out of ideas...

Perhaps your NTP setup is broken?  Or even hardware clock failure
(one of my machines hit that a few weeks ago).


* Re: Weird Unicorn Timeout Issues (Hibernation problem?)
  2014-08-04 20:44         ` Eric Wong
@ 2014-08-04 20:46           ` Eric Wong
  2014-08-05 14:46             ` Tony Devlin
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Wong @ 2014-08-04 20:46 UTC (permalink / raw)
  To: Tony Devlin; +Cc: Daniel Condomitti, unicorn-public

Eric Wong <e@80x24.org> wrote:
> Did you try strace-ing for 30 minutes and reproducing the error?

You can also try setting the unicorn timeout to longer than 30
minutes and get a longer/stalled strace.
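
A throwaway debugging config along those lines might look like this.  It
is only a sketch: the 2400s value and the strace output path are arbitrary
examples, and it is not meant for production use.

```ruby
# unicorn.conf.rb -- temporary settings for reproducing the stall

# A single worker means there is only one PID to attach strace to.
worker_processes 1

# 40 minutes: longer than the ~30 minute idle window, so the master does
# not SIGKILL the worker before the stalled syscall shows up in the trace.
timeout 2400

# Then attach to the worker, for example:
#   strace -f -tt -o /tmp/worker.strace -p $WORKER_PID
```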


* Re: Weird Unicorn Timeout Issues (Hibernation problem?)
  2014-08-04 20:46           ` Eric Wong
@ 2014-08-05 14:46             ` Tony Devlin
  2014-08-06  9:45               ` Eric Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Tony Devlin @ 2014-08-05 14:46 UTC (permalink / raw)
  To: Eric Wong; +Cc: unicorn-public, Daniel Condomitti

I appreciate all your help Eric and Daniel.  I have not solved this yet,
but I think I have narrowed it down to a Firewall timeout issue.  One app
uses a database connection to Oracle, the other app uses a 3rd Party API
(still on location, but across the network).  The ping times to both of
these devices are extremely fast, however 30 minutes of inactivity across
the Firewall seems to disconnect these connections.  At least that appears
to be what the strace is telling me.  The place in the strace that the
timeout occurs is consistent, every time.  For example the strace of the
app that connects to Oracle shows this:

[pid  7825] write(14,
"\0\373\0\0\6\0\0\0\0\0\21iB\376\377\377\377\377\377\377\377\1\0\0\0\0\0\0\0\v\0\0\0\3^Ca\201\0\0\0\0\0\0\376\377\377\377\377\377\377\377\22\0\0\0\376\377\377\377\377\377\377\377\r\0\0\0\376\377\377\377\377\377\377\377\376\377\377\377\377\377\377\377\0\0\0\0d\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\376\377\377\377\377\377\377\377\0\0\0\0\0\0\0\0\376\377\377\377\377\377\377\377\376\377\377\377\377\377\377\377\376\377\377\377\377\377\377\377\0\0\0\0\0\0\0\0\376\377\377\377\377\377\377\377\376\377\377\377\377\377\377\377\22select
1 from
dual\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
251) = 251
[pid  7825] read(14,  <unfinished ...>
[pid  7827] +++ killed by SIGKILL +++
PANIC: handle_group_exit: 7827 leader 7825
[pid  7846] +++ killed by SIGKILL +++
PANIC: handle_group_exit: 7846 leader 7825
+++ killed by SIGKILL +++

Clearly that is a database query 'select 1 from dual'.  It times out trying
to read the response.  At the same time if I watch the lsof -p <pid>, I see
that the database connection drops after 30 minutes.

I'll update this thread again once it is solved, for historical and future
issues (in case someone else experiences something similar).

Again thank you for your help Eric!


* Re: Weird Unicorn Timeout Issues (Hibernation problem?)
  2014-08-05 14:46             ` Tony Devlin
@ 2014-08-06  9:45               ` Eric Wong
  2014-08-06 14:05                 ` Tony Devlin
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Wong @ 2014-08-06  9:45 UTC (permalink / raw)
  To: Tony Devlin; +Cc: unicorn-public, Daniel Condomitti

Tony Devlin <tonydevlin@gmail.com> wrote:
> [pid  7825] write(14,
> "\0\373\0\0\6\0\0\0\0\0\21iB\376\377\377\377\377\377\377\377\1\0\0\0\0\0\0\0\v\0\0\0\3^Ca\201\0\0\0\0\0\0\376\377\377\377\377\377\377\377\22\0\0\0\376\377\377\377\377\377\377\377\r\0\0\0\376\377\377\377\377\377\377\377\376\377\377\377\377\377\377\377\0\0\0\0d\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\376\377\377\377\377\377\377\377\0\0\0\0\0\0\0\0\376\377\377\377\377\377\377\377\376\377\377\377\377\377\377\377\376\377\377\377\377\377\377\377\0\0\0\0\0\0\0\0\376\377\377\377\377\377\377\377\376\377\377\377\377\377\377\377\22select
> 1 from
> dual\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
> 251) = 251
> [pid  7825] read(14,  <unfinished ...>
> [pid  7827] +++ killed by SIGKILL +++

Any update?  It looks like your DB driver is not using/respecting any
timeout at all[1].  It is bad to not have a timeout there.  There should
be a way to set a timeout so you can at least tell the user the DB
connection dropped or maybe get your app to disconnect+retry once.

A better looking strace would be something like:

    write(fd, ...); => success
    (poll|select|ppoll) syscall ...
    read(fd, ...); /* only if (poll|select|ppoll) was successful[2] */

This goes for configuring all connections/services for any app.

[1] or if it's relying on SO_RCVTIMEO socket option(rare), that's set
    way too high.  Any timeout set for any external connection should
    be lower than the unicorn (last-resort) timeout feature.

[2] any read() syscall after (poll|select|ppoll) should be non-blocking,
    because (poll|select|ppoll) may spuriously wakeup.
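
A minimal Ruby sketch of that wait-then-read pattern follows.  It assumes
a plain IO object rather than the Oracle driver's internals, and
read_with_timeout and ReadTimeout are hypothetical names used only for
illustration.

```ruby
# Raised when no data arrives before the deadline.
class ReadTimeout < StandardError; end

# Wait up to `timeout` seconds for `io` to become readable, then do a
# non-blocking read of at most `maxlen` bytes.  Returns the data read, or
# nil on EOF.  Raises ReadTimeout if nothing arrives in time.
def read_with_timeout(io, maxlen, timeout)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    raise ReadTimeout, "no response within #{timeout}s" if remaining <= 0

    # IO.select returns nil when the wait times out; loop back so the
    # deadline check above raises.
    next unless IO.select([io], nil, nil, remaining)

    # Non-blocking read: a spurious wakeup yields :wait_readable instead
    # of blocking, in which case we wait again with the time left.
    chunk = io.read_nonblock(maxlen, exception: false)
    return chunk unless chunk == :wait_readable
  end
end
```

The timeout passed in should be lower than the unicorn timeout, so the
application can report the dropped connection (or reconnect and retry
once) before the master kills the worker.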


* Re: Weird Unicorn Timeout Issues (Hibernation problem?)
  2014-08-06  9:45               ` Eric Wong
@ 2014-08-06 14:05                 ` Tony Devlin
  2014-08-06 18:26                   ` Daniel Condomitti
  0 siblings, 1 reply; 18+ messages in thread
From: Tony Devlin @ 2014-08-06 14:05 UTC (permalink / raw)
  To: Eric Wong; +Cc: unicorn-public, Daniel Condomitti

Eric,

The problem is a firewall that sits between the servers and the database.
It has an idle session timeout of 30 minutes, so it is silently killing the
connection.  I have reached out to our Network Engineering department, but
they say they cannot change that idle session timeout, nor create a
special rule to allow this connection to bypass it.

Currently, I have set up a polling device that calls the application's URL
every 20 minutes.  This causes the connection between the server and DB to
refresh its idle timeout.  This is obviously a very hacky way to handle
it, so I am looking into AR and Oracle_Enhanced to see if they have
some sort of keepalive option for the database.  I thought it would work
with the reaping_frequency, but apparently that does not work as I had
expected when you are not running in pools or a thread.  So I'm still on
the lookout for something to handle that.


* Re: Weird Unicorn Timeout Issues (Hibernation problem?)
  2014-08-06 14:05                 ` Tony Devlin
@ 2014-08-06 18:26                   ` Daniel Condomitti
  0 siblings, 0 replies; 18+ messages in thread
From: Daniel Condomitti @ 2014-08-06 18:26 UTC (permalink / raw)
  To: Tony Devlin; +Cc: Eric Wong, unicorn-public

This is exactly what happened to us, and I should have been clearer.  I
wasn’t referring to the default Linux kernel settings killing the
connection; it was a network device between our application servers and
the database server.  It only affected certain applications: some were hit
hundreds of times per second and would never be disconnected, while the
ones that did disconnect were only hit a few times per hour.  I -believe-
we just dropped the keepalive interval on both sides of the firewall below
its idle timeout.
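
For a socket the application opens itself, lowering the keepalive interval
per connection might look like the Ruby sketch below.  This is an
illustration under assumptions, not what was deployed here:
enable_keepalive is a hypothetical helper, the 1200s/60s/5 values are
examples chosen to stay under a 30-minute idle timeout, and the TCP_KEEP*
options are Linux-specific.  Connections owned by a database client
library may need that library's or the OS's own settings instead.

```ruby
require "socket"

# Enable TCP keepalive on `sock` so the kernel sends a probe after `idle`
# seconds of silence, then every `interval` seconds, giving up after
# `probes` unanswered probes.
def enable_keepalive(sock, idle:, interval:, probes:)
  sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_KEEPALIVE, true)
  # The per-socket TCP_KEEP* options are not defined on every platform;
  # where they are missing, the system-wide defaults apply.
  if Socket.const_defined?(:TCP_KEEPIDLE)
    sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPIDLE, idle)
    sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPINTVL, interval)
    sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_KEEPCNT, probes)
  end
  sock
end

# Example: first probe after 20 minutes, i.e. before a 30-minute firewall
# idle timeout would expire.
#   enable_keepalive(sock, idle: 1200, interval: 60, probes: 5)
```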


