From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS15169 209.85.128.0/17 X-Spam-Status: No, score=-2.6 required=3.0 tests=BAYES_00,RCVD_IN_DNSWL_LOW, URIBL_BLOCKED shortcircuit=no autolearn=ham version=3.3.2 X-Original-To: unicorn-public@bogomips.org Received: from mail-pd0-f173.google.com (mail-pd0-f173.google.com [209.85.192.173]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by dcvr.yhbt.net (Postfix) with ESMTPS id 9EBC21F5CD for ; Thu, 12 Mar 2015 06:27:03 +0000 (UTC) Received: by pdjz10 with SMTP id z10so17468672pdj.11 for ; Wed, 11 Mar 2015 23:27:03 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:content-type:mime-version:subject:from :in-reply-to:date:cc:message-id:references:to; bh=9LgI110g9+5WV1hseDbQAus4C7I6RAWd7NzDdfr0ufA=; b=GCxFuvKhsn95T0Zc0WQvV+PUf370cpScc63d95FeLEQyJ24qdBPolk+hY0FMrrXgF1 t+/gihCJ3wNJxqIZSciij2l+KwBbjUZZLxZWt30v60Yz6rn4rrRKvRMie8tUmyIwnB9C lSMLHtFsg1ASMiR0d+NKTkeI2vmTwMFhy+3P8zBjbHkmiRiUxoP4trFL0VdJv1ocGFJn whErfK1FckkBeenS10jmQZ8H1nZlKoFfF1U518K0jNBzpNMi3KnCExZaVWlejIRxF6D8 QkXw1ispYr7XRzbobzZWTxVNlfMtXXg8LxKcKKYK0SpvOxzmGmx5qeLu2yFY6FEGNoeV wKCQ== X-Gm-Message-State: ALoCoQmnqzFxW15q+pN2SruAxwH7qvEuI7hIp1P4ucEVypLqt1OH3mjoIeKVdRKq0qHHdLjSKa7+azrZo2FD8jgjz/gLXmygTPJY0XLhFHSaVm/F6KpnkZ8LvgStybHja8pc6+BI5Zjh0wHhyFA3uRdCtXL4tv8qFBDsULuEM11oy9eRkOxLOC4BVkMN+k2iCp6ccauxJRoL X-Received: by 10.66.55.42 with SMTP id o10mr36564906pap.148.1426141622914; Wed, 11 Mar 2015 23:27:02 -0700 (PDT) Received: from [192.168.0.11] (203-217-74-216.dyn.iinet.net.au. [203.217.74.216]) by mx.google.com with ESMTPSA id om6sm9048844pdb.40.2015.03.11.23.27.00 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 11 Mar 2015 23:27:01 -0700 (PDT) Content-Type: multipart/signed; boundary="Apple-Mail=_335DD994-AB5B-485F-903B-9CD852EA1CE2"; micalg="sha1"; protocol="application/pkcs7-signature" Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2087\)) Subject: Re: On USR2, new master runs with same PID From: Kevin Yank In-Reply-To: <20150312014515.GA10605@dcvr.yhbt.net> Date: Thu, 12 Mar 2015 17:26:53 +1100 Cc: unicorn-public@bogomips.org Message-Id: References: <2C0638F8-338C-4B06-B641-0DCF48D4D16A@avalanche.com.au> <20150312014515.GA10605@dcvr.yhbt.net> To: Eric Wong X-Mailer: Apple Mail (2.2087) X-Content-Filtered-By: PublicInbox::Filter 0.0.1 List-Id: --Apple-Mail=_335DD994-AB5B-485F-903B-9CD852EA1CE2 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 Thanks for your help, Eric! > On 12 Mar 2015, at 12:45 pm, Eric Wong wrote: >=20 > Kevin Yank wrote: >> When I send USR2 to the master process (PID 19216 in this example), I >> get the following in the Unicorn log: >>=20 >> I, [2015-03-11T23:47:33.992274 #6848] INFO -- : executing = ["/srv/ourapp/shared/bundle/ruby/2.2.0/bin/unicorn", = "/srv/ourapp/current/config.ru", "-Dc", = "/srv/ourapp/shared/config/unicorn.rb", = {10=3D>#}] (in = /srv/ourapp/releases/a0e8b5df474ad5129200654f92a76af00a750f47) >> I, [2015-03-11T23:47:36.504235 #6848] INFO -- : inherited = addr=3D/srv/ourapp/shared/sockets/unicorn.sock fd=3D10 >> = /srv/ourapp/shared/bundle/ruby/2.2.0/gems/unicorn-4.8.1/lib/unicorn/http_s= erver.rb:206:in `pid=3D': Already running on PID:19216 (or = pid=3D/srv/ourapp/shared/pids/unicorn.pid is stale) (ArgumentError) >=20 > Nothing suspicious until the above line... That=E2=80=99s right. >> And when I check, indeed, there is now unicorn.pid and >> unicorn.pid.oldbin, both containing 19216. >=20 > Any chance you have a process manager or something else creating the > (non-.oldbin) pid file behind unicorn's back? It=E2=80=99s possible; I=E2=80=99m using eye = (https://github.com/kostya/eye) as a process monitor. I=E2=80=99m not = aware of it writing .pid files for processes that self-daemonize like = Unicorn, though. And once one of my servers =E2=80=9Cgoes bad=E2=80=9D = (i.e. Unicorn starts failing to restart in response to a USR2), it does = so 100% consistently until I stop and restart Unicorn entirely. Based on = that, I don=E2=80=99t believe it=E2=80=99s a race condition where my = process monitor is slipping in a new unicorn.pid file some of the time. > Can you check your process table to ensure there's not multiple > unicorn instances running off the same config and pid files, too? As far as I can tell, I have some servers that have gotten themselves = into this =E2=80=9CUSR2 restart fails=E2=80=9D state, while others are = working just fine. In both cases, the Unicorn process tree (as show in = htop) looks like this =E2=80=9Cat rest=E2=80=9D (i.e. no = deployments/restarts in progress): unicorn master `- unicorn master `- unicorn worker[2] | `- unicorn worker[2] `- unicorn worker[1] | `- unicorn worker[1] `- unicorn worker[0] `- unicorn worker[0] At first glance I=E2=80=99d definitely say it appears that I have two = masters running from the same config files. However, there=E2=80=99s = only one unicorn.pid file of course (the root process in the tree = above), and when I try to kill -TERM the master process that doesn=E2=80=99= t have a .pid file, the entire process tree exits. Am I misinterpreting = the process table? Is this process list actually normal? Thus far I=E2=80=99ve been unable to find any difference in state = between a properly-behaving server and a misbehaving server, apart from = the behaviour of the Unicorn master when it receives a USR2. > Also, were there other activity (another USR2 or HUP) in the logs > a few seconds beforehand? No, didn=E2=80=99t see anything like that (and I was looking for it). > What kind of filesystem / kernel is the pid file on? EXT4 / Ubuntu Server 12.04 LTS > A network FS or anything without the consistency guarantees provided > by POSIX would not work for pid files. Given my environment above, I should be OK, right? > pid files are unfortunately prone to to nasty race conditions, > but I'm not sure what you're seeing happens very often. This has been happening pretty frequently across multiple server = instances, and again once it starts happening on an instance, it keeps = happening 100% of the time (until I restart Unicorn completely). So = it=E2=80=99s not a rare edge case. > -------------------8<------------------- >> sleep 10 >>=20 >> old_pid =3D "#{server.config[:pid]}.oldbin" >> if File.exists?(old_pid) && server.pid !=3D old_pid >> begin >> Process.kill("QUIT", File.read(old_pid).to_i) >> rescue Errno::ENOENT, Errno::ESRCH >> # someone else did our job for us >> end >> end > -------------------8<------------------- >=20 > I'd get rid of that hunk starting with the "sleep 10" (at least while > debugging this issue). If I simply delete this hunk, I=E2=80=99ll have old masters left running = on my servers because they=E2=80=99ll never get sent the QUIT signal. I = can definitely remove it temporarily (and kill the old master myself) = while debugging, though. > If you did a USR2 previously, maybe it could > affect the current USR2 upgrade process. Sleeping so long in the = master > like that is pretty bad it throws off timing and delays signal = handling. I=E2=80=99d definitely like to get rid of the sleep, as my restarts = definitely feel slow. I=E2=80=99m not clear on what a better approach = would be, though. > That's a pretty fragile config and I regret ever including it in the > example config files Any better examples/docs you=E2=80=99d recommend I consult for guidance? = Or is expecting to achieve a robust zero-downtime restart using = before_fork/after_fork hooks unrealistic? -- Kevin Yank Chief Technology Officer, Avalanche Technology Group http://avalanche.com.au/ ph: +61 4 2241 0083 --Apple-Mail=_335DD994-AB5B-485F-903B-9CD852EA1CE2--