unicorn Ruby/Rack server user+dev discussion/patches/pulls/bugs/help
 help / color / mirror / code / Atom feed
* [PATCH] Master promotion with SIGURG (CoW optimization)
@ 2022-07-05 20:05 Jean Boussier
  2022-07-06  2:33 ` Eric Wong
  0 siblings, 1 reply; 7+ messages in thread
From: Jean Boussier @ 2022-07-05 20:05 UTC (permalink / raw)
  To: unicorn-public

Unicorn rely on Copy on Write to limit applications' memory usage,
however Ruby Copy on Write performance isn't exactly perfect.
As code get executed various internal performance caches get warmed
which cause shared pages to be invalidated.

While there's certainly some improvements to be made to Ruby itself
on that front, it will likely get worse in the near future if YJIT
become popular, as it will generate executable code in the workers
hence won't be shared.

One way effective CoW performance could be improved would be to
periodically promote one worker as the new master. Since that worker
processed some requests, VM caches, JITed code, etc should be somewhat
warm already, hence shared pages should be dirtied less frequently after
each promotion.

Puma 5.0.0 introduced a similar experimental feature called
`fork_worker` or `refork`.

It's a bit more limited though, as instead of promoting a new
master and exiting, they only ask worker #0 to fork itself to
replace its siblings. So once all workers are replaces, there
is 3 levels of processes: cluster -> first_worker -> other_workers.

They also include automatic reforking after a predetermined amount
of requests (1k by default).

The happy path works pretty much like this:

- Assume daemonized mode.

- master: on `SIGURG`
   - Forward `SIGURG` to a chosen worker.

- worker: on `SIGURG` (become a master)
   - do the prep, register traps, open self-pipe, etc
   - don't spawn any worker, set number of wokers to 0.
   - send `SIGWHINCH` to master.

- old master: on `SIGWHINCH`
   - Gracefully shutdown workers, like during a reexec.

- old master: when a worker is reaped.
   - Send `SIGTTIN` to the new master
   - If it was the last worker:
     - replace pidfile
     - exit

This patch is not exactly production quality yet:

   - I need to write some tests
   - There's a race condition that can cause the promoted master
     master to have one less worker than required. Need to be addressed.
   - The pidfile replacement should be improved.
   - Multiple corner cases like a `QUIT` during promotion are not handled.

However it work well enough for demonstration.

So I figured I'd ask wether the feature is desired upstream
before investing more effort into it.

For this to be used in production without too much integration effort
I think automatic promotion based on some criteria like number of
request or process lifetime would be needed.

Ideally a hook interface to programatically trigger promotion would
be very valuable as I'd likely want to trigger promotion
based on memory metrics read from `/proc/<pid>/smaps_rollup`.
---
  SIGNALS                        |   4 ++
  lib/unicorn.rb                 |   2 +-
  lib/unicorn/http_server.rb     | 115 +++++++++++++++++++++++++--------
  lib/unicorn/promoted_worker.rb |  40 ++++++++++++
  4 files changed, 134 insertions(+), 27 deletions(-)
  create mode 100644 lib/unicorn/promoted_worker.rb

diff --git a/SIGNALS b/SIGNALS
index 7321f2b..f5716b9 100644
--- a/SIGNALS
+++ b/SIGNALS
@@ -39,6 +39,10 @@ https://yhbt.net/unicorn/examples/init.sh
  * WINCH - gracefully stops workers but keep the master running.
    This will only work for daemonized processes.

+* URG - promote one of the existing workers as a new master, and gracefully
+  stop workers.
+  This will only work for daemonized processes.
+
  * TTIN - increment the number of worker processes by one

  * TTOU - decrement the number of worker processes by one
diff --git a/lib/unicorn.rb b/lib/unicorn.rb
index 1a50631..832f78d 100644
--- a/lib/unicorn.rb
+++ b/lib/unicorn.rb
@@ -133,6 +133,6 @@ def self.pipe # :nodoc:
  # :enddoc:

  %w(const socket_helper stream_input tee_input http_request configurator
-   tmpio util http_response worker http_server).each do |s|
+   tmpio util http_response worker promoted_worker http_server).each do |s|
    require_relative "unicorn/#{s}"
  end
diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb
index 21f2a05..dd8f021 100644
--- a/lib/unicorn/http_server.rb
+++ b/lib/unicorn/http_server.rb
@@ -50,6 +50,7 @@ class Unicorn::HttpServer
    #   Unicorn::HttpServer::START_CTX[0] = "/home/bofh/2.3.0/bin/unicorn"
    START_CTX = {
      :argv => ARGV.map(&:dup),
+    :generation => 0,
      0 => $0.dup,
    }
    # We favor ENV['PWD'] since it is (usually) symlink aware for Capistrano
@@ -106,7 +107,7 @@ def initialize(app, options = {})
      @orig_app = app
      # list of signals we care about and trap in master.
      @queue_sigs = [
-      :WINCH, :QUIT, :INT, :TERM, :USR1, :USR2, :HUP, :TTIN, :TTOU ]
+      :WINCH, :QUIT, :INT, :TERM, :USR1, :USR2, :HUP, :TTIN, :TTOU, :URG ]

      @worker_data = if worker_data = ENV['UNICORN_WORKER']
        worker_data = worker_data.split(',').map!(&:to_i)
@@ -119,7 +120,24 @@ def initialize(app, options = {})

    # Runs the thing.  Returns self so you can run join on it
    def start
+    @new_master = false
      inherit_listeners!
+    promote
+
+    # write pid early for Mongrel compatibility if we're not inheriting 
sockets
+    # This is needed for compatibility some Monit setups at least.
+    # This unfortunately has the side effect of clobbering valid PID if
+    # we upgrade and the upgrade breaks during preload_app==true && 
build_app!
+    self.pid = config[:pid]
+
+    build_app! if preload_app
+    bind_new_listeners!
+
+    spawn_missing_workers
+    self
+  end
+
+  def promote
      # this pipe is used to wake us up from select(2) in #join when signals
      # are trapped.  See trap_deferred.
      @self_pipe.replace(Unicorn.pipe)
@@ -130,18 +148,6 @@ def start
      # Note that signals don't actually get handled until the #join method
      @queue_sigs.each { |sig| trap(sig) { @sig_queue << sig; 
awaken_master } }
      trap(:CHLD) { awaken_master }
-
-    # write pid early for Mongrel compatibility if we're not inheriting 
sockets
-    # This is needed for compatibility some Monit setups at least.
-    # This unfortunately has the side effect of clobbering valid PID if
-    # we upgrade and the upgrade breaks during preload_app==true && 
build_app!
-    self.pid = config[:pid]
-
-    build_app! if preload_app
-    bind_new_listeners!
-
-    spawn_missing_workers
-    self
    end

    # replaces current listener set with +listeners+.  This will
@@ -178,16 +184,16 @@ def logger=(obj)
      Unicorn::HttpRequest::DEFAULTS["rack.logger"] = @logger = obj
    end

-  def clobber_pid(path)
+  def clobber_pid(path, content = $$)
      unlink_pid_safe(@pid) if @pid
      if path
        fp = begin
-        tmp = "#{File.dirname(path)}/#{rand}.#$$"
+        tmp = "#{File.dirname(path)}/#{rand}.#{content}"
          File.open(tmp, File::RDWR|File::CREAT|File::EXCL, 0644)
        rescue Errno::EEXIST
          retry
        end
-      fp.syswrite("#$$\n")
+      fp.syswrite("#{content}\n")
        File.rename(fp.path, path)
        fp.close
      end
@@ -279,6 +285,11 @@ def join
        end
        @ready_pipe = @ready_pipe.close rescue nil
      end
+
+    if @promoted
+      Process.kill(:WINCH, Process.ppid)
+    end
+
      begin
        reap_all_workers
        case @sig_queue.shift
@@ -292,6 +303,13 @@ def join
            @logger.debug("waiting #{sleep_time}s after 
suspend/hibernation")
          end
          maintain_worker_count if respawn
+
+        if @new_master && @new_master.ready? && @workers.empty?
+          # TODO: we should handle the new master dying like with reexec.
+          clobber_pid(pid, @new_master.pid)
+          break
+        end
+
          master_sleep(sleep_time)
        when :QUIT # graceful shutdown
          break
@@ -305,10 +323,20 @@ def join
          soft_kill_each_worker(:USR1)
        when :USR2 # exec binary, stay alive in case something went wrong
          reexec
+      when :URG
+        if $stdin.tty?
+          logger.info "SIGURG ignored because we're not daemonized"
+        else
+          promote_new_master
+        end
        when :WINCH
          if $stdin.tty?
            logger.info "SIGWINCH ignored because we're not daemonized"
          else
+          if @new_master
+            @new_master.ready!
+          end
+
            respawn = false
            logger.info "gracefully stopping all workers"
            soft_kill_each_worker(:QUIT)
@@ -408,11 +436,30 @@ def reap_all_workers
          worker = @workers.delete(wpid) and worker.close rescue nil
          @after_worker_exit.call(self, worker, status)
        end
+
+      if @new_master && @new_master.ready?
+        @new_master.scale(@workers.size)
+      end
      rescue Errno::ECHILD
        break
      end while true
    end

+  def promote_new_master
+    # Promoting the oldest worker
+    # TODO: handle `@new_master` being dead.
+    if @new_master
+      logger.error "can't promote because worker=#{@new_master.nr} is 
being promoted"
+    elsif pair = @workers.first
+      @new_master = Unicorn::PromotedWorker.new(*pair, worker_processes)
+      @workers.delete(@new_master.pid)
+      logger.info "master promoting worker=#{@new_master.worker.nr}"
+      @new_master.promote
+    else
+      logger.error "can't promote because there is no existing workers"
+    end
+  end
+
    # reexecutes the START_CTX with a new binary
    def reexec
      if @reexec_pid > 0
@@ -516,10 +563,11 @@ def murder_lazy_workers
    end

    def after_fork_internal
+    self.worker_processes = 0
      @self_pipe.each(&:close).clear # this is master-only, now
      @ready_pipe.close if @ready_pipe
      Unicorn::Configurator::RACKUP.clear
-    @ready_pipe = @init_listeners = @before_exec = @before_fork = nil
+    @ready_pipe = nil

      # The OpenSSL PRNG is seeded with only the pid, and apps with 
frequently
      # dying workers can recycle pids
@@ -545,6 +593,13 @@ def spawn_missing_workers
        unless pid
          after_fork_internal
          worker_loop(worker)
+
+        if @promoted
+          worker.tick = 0
+          promote
+          join
+        end
+
          exit
        end

@@ -678,19 +733,22 @@ def init_worker_process(worker)
      trap(:CHLD, 'DEFAULT')
      @sig_queue.clear
      proc_name "worker[#{worker.nr}]"
-    START_CTX.clear
      @workers.clear

      after_fork.call(self, worker) # can drop perms and create listeners
      LISTENERS.each { |sock| sock.close_on_exec = true }

      worker.user(*user) if user.kind_of?(Array) && ! worker.switched
-    @config = nil
      build_app! unless preload_app
-    @after_fork = @listener_opts = @orig_app = nil
+    @listener_opts = @orig_app = nil
      readers = LISTENERS.dup
      readers << worker
      trap(:QUIT) { nuke_listeners!(readers) }
+    @promoted = false
+    trap(:URG) do
+      @promoted = true
+      START_CTX[:generation] += 1
+    end
      readers
    end

@@ -706,11 +764,11 @@ def reopen_worker_logs(worker_nr)

    def prep_readers(readers)
      wtr = Unicorn::Waiter.prep_readers(readers)
-    @timeout *= 500 # to milliseconds for epoll, but halved
+    @select_timeout = @timeout * 500 # to milliseconds for epoll, but 
halved
      wtr
    rescue
      require_relative 'select_waiter'
-    @timeout /= 2.0 # halved for IO.select
+    @select_timeout = @timeout / 2.0 # halved for IO.select
      Unicorn::SelectWaiter.new
    end

@@ -720,7 +778,7 @@ def prep_readers(readers)
    def worker_loop(worker)
      readers = init_worker_process(worker)
      waiter = prep_readers(readers)
-    reopen = false
+    promote = reopen = false

      # this only works immediately if the master sent us the signal
      # (which is the normal case)
@@ -739,12 +797,17 @@ def worker_loop(worker)
            process_client(client)
            worker.tick = time_now.to_i
          end
+        if @promoted
+          worker.tick = time_now.to_i
+          return
+        end
+
          break if reopen
        end

        # timeout so we can .tick and keep parent from SIGKILL-ing us
        worker.tick = time_now.to_i
-      waiter.get_readers(ready, readers, @timeout)
+      waiter.get_readers(ready, readers, @select_timeout)
      rescue => e
        redo if reopen && readers[0]
        Unicorn.log_error(@logger, "listen loop error", e) if readers[0]
@@ -823,8 +886,8 @@ def build_app!
    end

    def proc_name(tag)
-    $0 = ([ File.basename(START_CTX[0]), tag
-          ]).concat(START_CTX[:argv]).join(' ')
+    $0 = ([ File.basename(START_CTX[0]), tag, "(gen: 
#{START_CTX[:generation]})",
+          ]).concat(START_CTX[:argv]).compact.join(' ')
    end

    def redirect_io(io, path)
diff --git a/lib/unicorn/promoted_worker.rb b/lib/unicorn/promoted_worker.rb
new file mode 100644
index 0000000..2595182
--- /dev/null
+++ b/lib/unicorn/promoted_worker.rb
@@ -0,0 +1,40 @@
+# -*- encoding: binary -*-
+
+class Unicorn::PromotedWorker
+  attr_reader :pid, :worker
+
+  def initialize(pid, worker, expected_worker_processes)
+    @pid = pid
+    @worker = worker
+    @worker_processes = 0
+    @expected_worker_processes = expected_worker_processes
+    @ready = false
+  end
+
+  def ready?
+    @ready
+  end
+
+  def ready!
+    @ready = true
+  end
+
+  def promote
+    @worker.soft_kill(:URG)
+  end
+
+  def scale(old_master_worker_processes)
+    diff = @expected_worker_processes -
+      old_master_worker_processes -
+      @worker_processes
+
+    if diff > 0
+      diff.times { kill(:TTIN) }
+      @worker_processes += diff
+    end
+  end
+
+  def kill(sig)
+    Process.kill(sig, @pid)
+  end
+end
-- 
2.35.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Master promotion with SIGURG (CoW optimization)
  2022-07-05 20:05 [PATCH] Master promotion with SIGURG (CoW optimization) Jean Boussier
@ 2022-07-06  2:33 ` Eric Wong
  2022-07-06  7:40   ` Jean Boussier
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Wong @ 2022-07-06  2:33 UTC (permalink / raw)
  To: Jean Boussier; +Cc: unicorn-public

Jean Boussier <jean.boussier@shopify.com> wrote:
> Unicorn rely on Copy on Write to limit applications' memory usage,
> however Ruby Copy on Write performance isn't exactly perfect.
> As code get executed various internal performance caches get warmed
> which cause shared pages to be invalidated.
> 
> While there's certainly some improvements to be made to Ruby itself
> on that front, it will likely get worse in the near future if YJIT
> become popular, as it will generate executable code in the workers
> hence won't be shared.
> 
> One way effective CoW performance could be improved would be to
> periodically promote one worker as the new master. Since that worker
> processed some requests, VM caches, JITed code, etc should be somewhat
> warm already, hence shared pages should be dirtied less frequently after
> each promotion.

Agreed, there's potential for savings, here...

> Puma 5.0.0 introduced a similar experimental feature called
> `fork_worker` or `refork`.
> 
> It's a bit more limited though, as instead of promoting a new
> master and exiting, they only ask worker #0 to fork itself to
> replace its siblings. So once all workers are replaces, there
> is 3 levels of processes: cluster -> first_worker -> other_workers.

OK, any numbers from Puma users which can be used to project
improvements for unicorn (and other servers)?

> They also include automatic reforking after a predetermined amount
> of requests (1k by default).

1k seems like a lot of requests (more on this below)

> The happy path works pretty much like this:
> 
> - Assume daemonized mode.
> 
> - master: on `SIGURG`
>   - Forward `SIGURG` to a chosen worker.

OK, I guess SIGURG is safe to use since nobody else relies on it (right?).

> - worker: on `SIGURG` (become a master)
>   - do the prep, register traps, open self-pipe, etc
>   - don't spawn any worker, set number of wokers to 0.
>   - send `SIGWHINCH` to master.
> 
> - old master: on `SIGWHINCH`
>   - Gracefully shutdown workers, like during a reexec.

nit: only one `H' in `SIGWINCH`

> - old master: when a worker is reaped.
>   - Send `SIGTTIN` to the new master
>   - If it was the last worker:
>     - replace pidfile
>     - exit
> 
> This patch is not exactly production quality yet:
> 
>   - I need to write some tests
>   - There's a race condition that can cause the promoted master
>     master to have one less worker than required. Need to be addressed.
>   - The pidfile replacement should be improved.
>   - Multiple corner cases like a `QUIT` during promotion are not handled.

Right.  PID file races and corner cases were painful to deal
with in the earliest days of this project, and I don't look
forward to having to deal with them ever again...

It looked like the world was moving towards socket-activation
with dedicated process managers and away from fragile PID files;
so being self-daemonization dependent seems like a step back.

> However it work well enough for demonstration.
> 
> So I figured I'd ask wether the feature is desired upstream
> before investing more effort into it.

It really depends on:

1) what memory savings and/or speedups can be achieved

2) what support can we expect from you (and other users) for
   this feature and potential regressions going forward?

I don't have the free time nor mental capacity to deal with
fallout from major new features such as this, myself.

> For this to be used in production without too much integration effort
> I think automatic promotion based on some criteria like number of
> request or process lifetime would be needed.
> 
> Ideally a hook interface to programatically trigger promotion would
> be very valuable as I'd likely want to trigger promotion
> based on memory metrics read from `/proc/<pid>/smaps_rollup`.

The hook interface would be yet another documentation+support
burden I'm worried about (and also more unicorn-specific stuff
to lock people in :<).


A completely different idea which should achieve the same goal,
completely independent of unicorn:

  App startup/loading can be made to trigger warmup paths which
  hit common endpoints.  IOW, app loading could run a small
  warmup suite which simulates common requests to heat up caches
  and trigger JIT.

  That would be completely self-contained in each app and work
  on every single app server; not just ones with special
  provisions to deal with this.  Of course, CoW-aware setups
  (preload_app) will still have the advantage, here.

  Best of all, it could be deterministic, too, instead of
  waiting and hoping 1k random requests will be sufficient
  warmup, developers can tune and mock exactly the requests
  required for warmup.  Of course, a lazy developer replay 1k
  requests from an old log, too.


Ultimately (long-term for ruby-core), it would be better to make
JIT (and perhaps even VM cache) results persistent and shareable
across different processes, somehow...  That would help all Ruby
tools, even short-lived CLI tools.  Portable locking would be a
pain, I suppose, but proprietary OSes are always a pain :P

>  4 files changed, 134 insertions(+), 27 deletions(-)
>  create mode 100644 lib/unicorn/promoted_worker.rb

There's some line-wrapping issues caused by your MUA;
I got this from "git am":

  warning: Patch sent with format=flowed; space at the end of lines might be lost.
  Applying: Master promotion with SIGURG (CoW optimization)
  error: corrupt patch at line 14

The git-format-patch(1) manpage has a section for Thunderbird users
which may help.

> --- a/SIGNALS
> +++ b/SIGNALS
> @@ -39,6 +39,10 @@ https://yhbt.net/unicorn/examples/init.sh
>  * WINCH - gracefully stops workers but keep the master running.
>    This will only work for daemonized processes.
> 
> +* URG - promote one of the existing workers as a new master, and gracefully
> +  stop workers.
> +  This will only work for daemonized processes.

Perhaps documenting this as experimental and subject to removal
at any time would make the addition of a major new feature an
easier pill to swallow; but I still think this is better outside
of app servers.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Master promotion with SIGURG (CoW optimization)
  2022-07-06  2:33 ` Eric Wong
@ 2022-07-06  7:40   ` Jean Boussier
  2022-07-07 10:23     ` Eric Wong
  0 siblings, 1 reply; 7+ messages in thread
From: Jean Boussier @ 2022-07-06  7:40 UTC (permalink / raw)
  Cc: unicorn-public

> OK, any numbers from Puma users which can be used to project
> improvements for unicorn (and other servers)?

There are some user reports here: https://github.com/puma/puma/issues/2258
but they are mixed in reports for two other new featurs.

Some reports are up to 20-30% savings, but I'd expect unicorn to benefit
even more from it, given that typical puma users spawn less processes
than with unicorn.

> > They also include automatic reforking after a predetermined amount
> > of requests (1k by default).
>
> 1k seems like a lot of requests (more on this below)

Agreed. Shared pages start being invalidated much faster than that.

> > - master: on `SIGURG`
> >   - Forward `SIGURG` to a chosen worker.
>
> OK, I guess SIGURG is safe to use since nobody else relies on it (right?).

That's my understanding, an alternative could be to re-use USR2 and have a
config flag to define wether it is a rexec or refork.

> Right.  PID file races and corner cases were painful to deal
> with in the earliest days of this project, and I don't look
> forward to having to deal with them ever again...
>
> It looked like the world was moving towards socket-activation
> with dedicated process managers and away from fragile PID files;
> so being self-daemonization dependent seems like a step back.

Agreed. We're currently running unicorn as PID 1 inside containers
and I'm not exactly looking forward to have to minitor a PID file.

Another avenue I explored was to keep the existing master and
refork from one of the worker like puma, but re-assign the parent
to the orignal master using PR_SET_CHILD_SUBREAPER.

Here's my notes of how it works:

### Requirements:

- Need `PR_SET_CHILD_SUBREAPER`, so Linux 3.4+ only (May 2012)
- Need `File.mkfifo`, so Ruby 2.3 (Dec 2015), but can be shimed for older
  rubies.

### Flow:

- master: set `PR_SET_CHILD_SUBREAPER` during boot.
- master: create a temporary directory (`$TMP`).
- master: spawn initial workers the old way.

- master: on `SIGCHLD` (a child died):
  - Fake signal oldest worker (`SIGURG`).
  - Write the new worker number in the fake signal pipe (at the same time).

- worker: on `SIGURG` (should spawn a sibling):
  - Note: worker fake signals are processed after the current request is
    completed.
  - Run `before_fork` callback (MAYBE: need a special `before_refork`
    callback?)
  - create a pipe.
  - daemonize (fork twice so that the new process is attributed to the
    nearest `CHILD_SUBREAPER`, aka the original master).
  - wait for the first child to exit.
  - write into the pipe to signal the parent is dead.
  - Run `after_fork` callback (MAYBE: need a special `after_refork` callback?)

- new sibling after fork:
  - wait for the parent to exit (though the temporary pipe).
  - create a named pipe (`File.mkfifo`) at `$TMP/$WORKER_NUM.pipe`.
  - create a pidfile at `$TMP/WORKER_NUM.pid`.
  - Open the named pipe in `IO::RDONLY | IO::NONBLOCK` (otherwise it would
    block until the master open in write mode).
    - NB: Need to convert it to a `Kgio::Pipe` with
      `Kgio::Pipe.for_fd(f.to_i)`, and keep `f` need `autoclose = false`.
  - Create the `Unicorn::Worker` instance with that pipe and worker number.
  - Send `SIGURG` to the parent process (should be the master).
  - Wait for `SIGCONT` in the named pipe.
  - enter the worker loop.

- master: on `SIGURG`:
  - for each outstanding refork order:
    - Look for `$TMP/$WORKER_NUM.pid` and `$TMP/$WORKER_NUM.pipe`
    - Read the pidfile
    - Open the pipe with `IO::RDONLY | IO::NONBLOCK`
      - NB: Need to convert it to a `Kgio::Pipe` with
        `Kgio::Pipe.for_fd(f.to_i)`, and keep `f` need `autoclose = false`.
    - Delete pidfile and named pipe.
    - Register that new worker normally.
    - Write `SIGCONT` into the pipe.

I can share the patch if you are interested, but the extra complexity
and Linux only features made me prefer the master-promotion approach.

> > However it work well enough for demonstration.
> >
> > So I figured I'd ask wether the feature is desired upstream
> > before investing more effort into it.
>
> It really depends on:
>
> 1) what memory savings and/or speedups can be achieved
>
> 2) what support can we expect from you (and other users) for
>    this feature and potential regressions going forward?
>
> I don't have the free time nor mental capacity to deal with
> fallout from major new features such as this, myself.

Yeah, this is just a RFC, I wouldn't submit a final patch without
first deploying it on our infra with good results. I'm just on the
fence between trying to get this upstream vs maintaining our own
server that solely work like this, hence somewhat simpler and that
can make more assumptions.

> The hook interface would be yet another documentation+support
> burden I'm worried about (and also more unicorn-specific stuff
> to lock people in :<).

Understandable. I suppose it can be done with a monitoring process.
Check `smaps_rollup` and send `SIGURG` when deemed necessary.

More moving pieces but keep unicorn simpler.

> A completely different idea which should achieve the same goal,
> completely independent of unicorn:
>
>   App startup/loading can be made to trigger warmup paths which
>   hit common endpoints.  IOW, app loading could run a small
>   warmup suite which simulates common requests to heat up caches
>   and trigger JIT.

Yeah, I don't really believe in this for a few reasons:

  - That would slow boot time.
  - On large apps there's just too many codepath for this to be realistic.
  - Many endpoints require database and other network access you probably
    don't want in the master.
  - Endpoints may have side effects.

> Ultimately (long-term for ruby-core), it would be better to make
> JIT (and perhaps even VM cache) results persistent and shareable
> across different processes, somehow...  That would help all Ruby
> tools, even short-lived CLI tools.  Portable locking would be a
> pain, I suppose, but proprietary OSes are always a pain :P

Based on my limited understanding of both JIT and VM caches, I don't think
that's really possible.

The VM itself could definitely do better CoW wise, and I have some proposals
on that front (https://bugs.ruby-lang.org/issues/18885) but that will take
time and will never be perfect.

That's why I think periodically promoting a new master has potential.

> There's some line-wrapping issues caused by your MUA;

Urk. Ok, trying another client now, and I'll resend the patch.

> Perhaps documenting this as experimental and subject to removal
> at any time would make the addition of a major new feature an
> easier pill to swallow; but I still think this is better outside
> of app servers.

Up to you, if you don't feel like maintaining such a feature I would
perfectly understand.

---
 SIGNALS                        |   4 ++
 lib/unicorn.rb                 |   2 +-
 lib/unicorn/http_server.rb     | 115 +++++++++++++++++++++++++--------
 lib/unicorn/promoted_worker.rb |  40 ++++++++++++
 4 files changed, 134 insertions(+), 27 deletions(-)
 create mode 100644 lib/unicorn/promoted_worker.rb

diff --git a/SIGNALS b/SIGNALS
index 7321f2b..f5716b9 100644
--- a/SIGNALS
+++ b/SIGNALS
@@ -39,6 +39,10 @@ https://yhbt.net/unicorn/examples/init.sh
 * WINCH - gracefully stops workers but keep the master running.
   This will only work for daemonized processes.

+* URG - promote one of the existing workers as a new master, and gracefully
+  stop workers.
+  This will only work for daemonized processes.
+
 * TTIN - increment the number of worker processes by one

 * TTOU - decrement the number of worker processes by one
diff --git a/lib/unicorn.rb b/lib/unicorn.rb
index 1a50631..832f78d 100644
--- a/lib/unicorn.rb
+++ b/lib/unicorn.rb
@@ -133,6 +133,6 @@ def self.pipe # :nodoc:
 # :enddoc:

 %w(const socket_helper stream_input tee_input http_request configurator
-   tmpio util http_response worker http_server).each do |s|
+   tmpio util http_response worker promoted_worker http_server).each do |s|
   require_relative "unicorn/#{s}"
 end
diff --git a/lib/unicorn/http_server.rb b/lib/unicorn/http_server.rb
index 21f2a05..dd8f021 100644
--- a/lib/unicorn/http_server.rb
+++ b/lib/unicorn/http_server.rb
@@ -50,6 +50,7 @@ class Unicorn::HttpServer
   #   Unicorn::HttpServer::START_CTX[0] = "/home/bofh/2.3.0/bin/unicorn"
   START_CTX = {
     :argv => ARGV.map(&:dup),
+    :generation => 0,
     0 => $0.dup,
   }
   # We favor ENV['PWD'] since it is (usually) symlink aware for Capistrano
@@ -106,7 +107,7 @@ def initialize(app, options = {})
     @orig_app = app
     # list of signals we care about and trap in master.
     @queue_sigs = [
-      :WINCH, :QUIT, :INT, :TERM, :USR1, :USR2, :HUP, :TTIN, :TTOU ]
+      :WINCH, :QUIT, :INT, :TERM, :USR1, :USR2, :HUP, :TTIN, :TTOU, :URG ]

     @worker_data = if worker_data = ENV['UNICORN_WORKER']
       worker_data = worker_data.split(',').map!(&:to_i)
@@ -119,7 +120,24 @@ def initialize(app, options = {})

   # Runs the thing.  Returns self so you can run join on it
   def start
+    @new_master = false
     inherit_listeners!
+    promote
+
+    # write pid early for Mongrel compatibility if we're not inheriting sockets
+    # This is needed for compatibility some Monit setups at least.
+    # This unfortunately has the side effect of clobbering valid PID if
+    # we upgrade and the upgrade breaks during preload_app==true && build_app!
+    self.pid = config[:pid]
+
+    build_app! if preload_app
+    bind_new_listeners!
+
+    spawn_missing_workers
+    self
+  end
+
+  def promote
     # this pipe is used to wake us up from select(2) in #join when signals
     # are trapped.  See trap_deferred.
     @self_pipe.replace(Unicorn.pipe)
@@ -130,18 +148,6 @@ def start
     # Note that signals don't actually get handled until the #join method
     @queue_sigs.each { |sig| trap(sig) { @sig_queue << sig; awaken_master } }
     trap(:CHLD) { awaken_master }
-
-    # write pid early for Mongrel compatibility if we're not inheriting sockets
-    # This is needed for compatibility some Monit setups at least.
-    # This unfortunately has the side effect of clobbering valid PID if
-    # we upgrade and the upgrade breaks during preload_app==true && build_app!
-    self.pid = config[:pid]
-
-    build_app! if preload_app
-    bind_new_listeners!
-
-    spawn_missing_workers
-    self
   end

   # replaces current listener set with +listeners+.  This will
@@ -178,16 +184,16 @@ def logger=(obj)
     Unicorn::HttpRequest::DEFAULTS["rack.logger"] = @logger = obj
   end

-  def clobber_pid(path)
+  def clobber_pid(path, content = $$)
     unlink_pid_safe(@pid) if @pid
     if path
       fp = begin
-        tmp = "#{File.dirname(path)}/#{rand}.#$$"
+        tmp = "#{File.dirname(path)}/#{rand}.#{content}"
         File.open(tmp, File::RDWR|File::CREAT|File::EXCL, 0644)
       rescue Errno::EEXIST
         retry
       end
-      fp.syswrite("#$$\n")
+      fp.syswrite("#{content}\n")
       File.rename(fp.path, path)
       fp.close
     end
@@ -279,6 +285,11 @@ def join
       end
       @ready_pipe = @ready_pipe.close rescue nil
     end
+
+    if @promoted
+      Process.kill(:WINCH, Process.ppid)
+    end
+
     begin
       reap_all_workers
       case @sig_queue.shift
@@ -292,6 +303,13 @@ def join
           @logger.debug("waiting #{sleep_time}s after suspend/hibernation")
         end
         maintain_worker_count if respawn
+
+        if @new_master && @new_master.ready? && @workers.empty?
+          # TODO: we should handle the new master dying like with reexec.
+          clobber_pid(pid, @new_master.pid)
+          break
+        end
+
         master_sleep(sleep_time)
       when :QUIT # graceful shutdown
         break
@@ -305,10 +323,20 @@ def join
         soft_kill_each_worker(:USR1)
       when :USR2 # exec binary, stay alive in case something went wrong
         reexec
+      when :URG
+        if $stdin.tty?
+          logger.info "SIGURG ignored because we're not daemonized"
+        else
+          promote_new_master
+        end
       when :WINCH
         if $stdin.tty?
           logger.info "SIGWINCH ignored because we're not daemonized"
         else
+          if @new_master
+            @new_master.ready!
+          end
+
           respawn = false
           logger.info "gracefully stopping all workers"
           soft_kill_each_worker(:QUIT)
@@ -408,11 +436,30 @@ def reap_all_workers
         worker = @workers.delete(wpid) and worker.close rescue nil
         @after_worker_exit.call(self, worker, status)
       end
+
+      if @new_master && @new_master.ready?
+        @new_master.scale(@workers.size)
+      end
     rescue Errno::ECHILD
       break
     end while true
   end

+  def promote_new_master
+    # Promoting the oldest worker
+    # TODO: handle `@new_master` being dead.
+    if @new_master
+      logger.error "can't promote because worker=#{@new_master.nr} is
being promoted"
+    elsif pair = @workers.first
+      @new_master = Unicorn::PromotedWorker.new(*pair, worker_processes)
+      @workers.delete(@new_master.pid)
+      logger.info "master promoting worker=#{@new_master.worker.nr}"
+      @new_master.promote
+    else
+      logger.error "can't promote because there is no existing workers"
+    end
+  end
+
   # reexecutes the START_CTX with a new binary
   def reexec
     if @reexec_pid > 0
@@ -516,10 +563,11 @@ def murder_lazy_workers
   end

   def after_fork_internal
+    self.worker_processes = 0
     @self_pipe.each(&:close).clear # this is master-only, now
     @ready_pipe.close if @ready_pipe
     Unicorn::Configurator::RACKUP.clear
-    @ready_pipe = @init_listeners = @before_exec = @before_fork = nil
+    @ready_pipe = nil

     # The OpenSSL PRNG is seeded with only the pid, and apps with frequently
     # dying workers can recycle pids
@@ -545,6 +593,13 @@ def spawn_missing_workers
       unless pid
         after_fork_internal
         worker_loop(worker)
+
+        if @promoted
+          worker.tick = 0
+          promote
+          join
+        end
+
         exit
       end

@@ -678,19 +733,22 @@ def init_worker_process(worker)
     trap(:CHLD, 'DEFAULT')
     @sig_queue.clear
     proc_name "worker[#{worker.nr}]"
-    START_CTX.clear
     @workers.clear

     after_fork.call(self, worker) # can drop perms and create listeners
     LISTENERS.each { |sock| sock.close_on_exec = true }

     worker.user(*user) if user.kind_of?(Array) && ! worker.switched
-    @config = nil
     build_app! unless preload_app
-    @after_fork = @listener_opts = @orig_app = nil
+    @listener_opts = @orig_app = nil
     readers = LISTENERS.dup
     readers << worker
     trap(:QUIT) { nuke_listeners!(readers) }
+    @promoted = false
+    trap(:URG) do
+      @promoted = true
+      START_CTX[:generation] += 1
+    end
     readers
   end

@@ -706,11 +764,11 @@ def reopen_worker_logs(worker_nr)

   def prep_readers(readers)
     wtr = Unicorn::Waiter.prep_readers(readers)
-    @timeout *= 500 # to milliseconds for epoll, but halved
+    @select_timeout = @timeout * 500 # to milliseconds for epoll, but halved
     wtr
   rescue
     require_relative 'select_waiter'
-    @timeout /= 2.0 # halved for IO.select
+    @select_timeout = @timeout / 2.0 # halved for IO.select
     Unicorn::SelectWaiter.new
   end

@@ -720,7 +778,7 @@ def prep_readers(readers)
   def worker_loop(worker)
     readers = init_worker_process(worker)
     waiter = prep_readers(readers)
-    reopen = false
+    promote = reopen = false

     # this only works immediately if the master sent us the signal
     # (which is the normal case)
@@ -739,12 +797,17 @@ def worker_loop(worker)
           process_client(client)
           worker.tick = time_now.to_i
         end
+        if @promoted
+          worker.tick = time_now.to_i
+          return
+        end
+
         break if reopen
       end

       # timeout so we can .tick and keep parent from SIGKILL-ing us
       worker.tick = time_now.to_i
-      waiter.get_readers(ready, readers, @timeout)
+      waiter.get_readers(ready, readers, @select_timeout)
     rescue => e
       redo if reopen && readers[0]
       Unicorn.log_error(@logger, "listen loop error", e) if readers[0]
@@ -823,8 +886,8 @@ def build_app!
   end

   def proc_name(tag)
-    $0 = ([ File.basename(START_CTX[0]), tag
-          ]).concat(START_CTX[:argv]).join(' ')
+    $0 = ([ File.basename(START_CTX[0]), tag, "(gen:
#{START_CTX[:generation]})",
+          ]).concat(START_CTX[:argv]).compact.join(' ')
   end

   def redirect_io(io, path)
diff --git a/lib/unicorn/promoted_worker.rb b/lib/unicorn/promoted_worker.rb
new file mode 100644
index 0000000..2595182
--- /dev/null
+++ b/lib/unicorn/promoted_worker.rb
@@ -0,0 +1,40 @@
+# -*- encoding: binary -*-
+
+class Unicorn::PromotedWorker
+  attr_reader :pid, :worker
+
+  def initialize(pid, worker, expected_worker_processes)
+    @pid = pid
+    @worker = worker
+    @worker_processes = 0
+    @expected_worker_processes = expected_worker_processes
+    @ready = false
+  end
+
+  def ready?
+    @ready
+  end
+
+  def ready!
+    @ready = true
+  end
+
+  def promote
+    @worker.soft_kill(:URG)
+  end
+
+  def scale(old_master_worker_processes)
+    diff = @expected_worker_processes -
+      old_master_worker_processes -
+      @worker_processes
+
+    if diff > 0
+      diff.times { kill(:TTIN) }
+      @worker_processes += diff
+    end
+  end
+
+  def kill(sig)
+    Process.kill(sig, @pid)
+  end
+end
-- 
2.35.1

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Master promotion with SIGURG (CoW optimization)
  2022-07-06  7:40   ` Jean Boussier
@ 2022-07-07 10:23     ` Eric Wong
       [not found]       ` <CANPRWbHTNiEcYq5qhN6Kio8Wg9a+2gXmc2bAcB2oVw4LZv8rcw@mail.gmail.com>
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Wong @ 2022-07-07 10:23 UTC (permalink / raw)
  To: Jean Boussier; +Cc: unicorn-public

Jean Boussier <jean.boussier@shopify.com> wrote:
> > OK, any numbers from Puma users which can be used to project
> > improvements for unicorn (and other servers)?
> 
> There are some user reports here: https://github.com/puma/puma/issues/2258
> but they are mixed in reports for two other new featurs.
> 
> Some reports are up to 20-30% savings, but I'd expect unicorn to benefit
> even more from it, given that typical puma users spawn less processes
> than with unicorn.

OK, those numbers sound very good (but I can't view
graphics/screenshots from w3m)

> > > They also include automatic reforking after a predetermined amount
> > > of requests (1k by default).
> >
> > 1k seems like a lot of requests (more on this below)
> 
> Agreed. Shared pages start being invalidated much faster than that.
> 
> > > - master: on `SIGURG`
> > >   - Forward `SIGURG` to a chosen worker.
> >
> > OK, I guess SIGURG is safe to use since nobody else relies on it (right?).
> 
> That's my understanding, an alternative could be to re-use USR2 and have a
> config flag to define wether it is a rexec or refork.

SIGURG is fine, having SIGUSR2 mean two things seems fragile and error-prone

> > Right.  PID file races and corner cases were painful to deal
> > with in the earliest days of this project, and I don't look
> > forward to having to deal with them ever again...
> >
> > It looked like the world was moving towards socket-activation
> > with dedicated process managers and away from fragile PID files;
> > so being self-daemonization dependent seems like a step back.
> 
> Agreed. We're currently running unicorn as PID 1 inside containers
> and I'm not exactly looking forward to have to minitor a PID file.
> 
> Another avenue I explored was to keep the existing master and
> refork from one of the worker like puma, but re-assign the parent
> to the orignal master using PR_SET_CHILD_SUBREAPER.
> 
> Here's my notes of how it works:
> 
> ### Requirements:
> 
> - Need `PR_SET_CHILD_SUBREAPER`, so Linux 3.4+ only (May 2012)

I'm fine with requiring Linux for this feature, existing
features need to continue working on *BSD...

> - Need `File.mkfifo`, so Ruby 2.3 (Dec 2015), but can be shimed for older
>   rubies.

I don't think it's necessary to use FIFOs at all (see below)

> ### Flow:
> 
> - master: set `PR_SET_CHILD_SUBREAPER` during boot.
> - master: create a temporary directory (`$TMP`).
> - master: spawn initial workers the old way.
> 
> - master: on `SIGCHLD` (a child died):
>   - Fake signal oldest worker (`SIGURG`).
>   - Write the new worker number in the fake signal pipe (at the same time).
> 
> - worker: on `SIGURG` (should spawn a sibling):
>   - Note: worker fake signals are processed after the current request is
>     completed.
>   - Run `before_fork` callback (MAYBE: need a special `before_refork`
>     callback?)
>   - create a pipe.
>   - daemonize (fork twice so that the new process is attributed to the
>     nearest `CHILD_SUBREAPER`, aka the original master).
>   - wait for the first child to exit.
>   - write into the pipe to signal the parent is dead.
>   - Run `after_fork` callback (MAYBE: need a special `after_refork` callback?)

*_fork callbacks already work for your current patch, though, right?;
so hopefully *_refork won't be needed...

> - new sibling after fork:
>   - wait for the parent to exit (though the temporary pipe).
>   - create a named pipe (`File.mkfifo`) at `$TMP/$WORKER_NUM.pipe`.
>   - create a pidfile at `$TMP/WORKER_NUM.pid`.
>   - Open the named pipe in `IO::RDONLY | IO::NONBLOCK` (otherwise it would
>     block until the master open in write mode).
>     - NB: Need to convert it to a `Kgio::Pipe` with
>       `Kgio::Pipe.for_fd(f.to_i)`, and keep `f` need `autoclose = false`.

sidenote: no need for kgio, I've been meaning to get rid of it
(Ruby 2.3+ has everything we'd need, I think...)

>   - Create the `Unicorn::Worker` instance with that pipe and worker number.
>   - Send `SIGURG` to the parent process (should be the master).
>   - Wait for `SIGCONT` in the named pipe.
>   - enter the worker loop.
> 
> - master: on `SIGURG`:
>   - for each outstanding refork order:
>     - Look for `$TMP/$WORKER_NUM.pid` and `$TMP/$WORKER_NUM.pipe`
>     - Read the pidfile
>     - Open the pipe with `IO::RDONLY | IO::NONBLOCK`
>       - NB: Need to convert it to a `Kgio::Pipe` with
>         `Kgio::Pipe.for_fd(f.to_i)`, and keep `f` need `autoclose = false`.
>     - Delete pidfile and named pipe.
>     - Register that new worker normally.
>     - Write `SIGCONT` into the pipe.
> 
> I can share the patch if you are interested, but the extra complexity
> and Linux only features made me prefer the master-promotion approach.

I would prefer this Linux-only approach if it gets rid of PID
files and doesn't introduce new temporary files/FIFOs.

It seems socketpair(..., SOCK_SEQPACKET) can be used for
packetized bidirectional IPC, perhaps with send_io + recv_io iff
necessary.  There shouldn't be any need for new, fragile FS
interactions.


Another idea, have you considered letting master work on some requests?
Signal handling would be delayed, and EINTR handling bugs in
gems could be exposed, but it'd be completely portable...

> > > However it work well enough for demonstration.
> > >
> > > So I figured I'd ask wether the feature is desired upstream
> > > before investing more effort into it.
> >
> > It really depends on:
> >
> > 1) what memory savings and/or speedups can be achieved
> >
> > 2) what support can we expect from you (and other users) for
> >    this feature and potential regressions going forward?
> >
> > I don't have the free time nor mental capacity to deal with
> > fallout from major new features such as this, myself.
> 
> Yeah, this is just a RFC, I wouldn't submit a final patch without
> first deploying it on our infra with good results. I'm just on the
> fence between trying to get this upstream vs maintaining our own
> server that solely work like this, hence somewhat simpler and that
> can make more assumptions.

Of course you also realize unicorn remains culturally-incompatible
with 99.9% of the open source world since I refuse to rely on
graphics drivers to work on a daemon, proprietary software,
terms-of-service, etc...

If anything goes wrong, I'd likely CC you anyways.

> > The hook interface would be yet another documentation+support
> > burden I'm worried about (and also more unicorn-specific stuff
> > to lock people in :<).
> 
> Understandable. I suppose it can be done with a monitoring process.
> Check `smaps_rollup` and send `SIGURG` when deemed necessary.
> 
> More moving pieces but keep unicorn simpler.

*shrug*  I've also been favoring more vertical integration in
other places to keep the overall stack simpler.

It depends on whether the monitoring process/library can work
with other Rack servers, probably; and signal compatibility.

> > A completely different idea which should achieve the same goal,
> > completely independent of unicorn:
> >
> >   App startup/loading can be made to trigger warmup paths which
> >   hit common endpoints.  IOW, app loading could run a small
> >   warmup suite which simulates common requests to heat up caches
> >   and trigger JIT.
> 
> Yeah, I don't really believe in this for a few reasons:
> 
>   - That would slow boot time.

Yes, and startup time is already a nasty thing, anyways...

>   - On large apps there's just too many codepath for this to be realistic.

I was thinking a lazy solution could be to have some middleware
measure coverage and log requests to support auto-replay, somehow.

>   - Many endpoints require database and other network access you probably
>     don't want in the master.

Wouldn't the *_fork hooks handle that, anyways?

>   - Endpoints may have side effects.

Yeah, this would optimize the GET endpoints, at least.
Not sure what percentage of POST needs to be optimized...

> > Ultimately (long-term for ruby-core), it would be better to make
> > JIT (and perhaps even VM cache) results persistent and shareable
> > across different processes, somehow...  That would help all Ruby
> > tools, even short-lived CLI tools.  Portable locking would be a
> > pain, I suppose, but proprietary OSes are always a pain :P
> 
> Based on my limited understanding of both JIT and VM caches, I don't think
> that's really possible.

Ugh, I still think it can be made possible, somehow.  But I no longer
use Ruby for new projects...

> The VM itself could definitely do better CoW wise, and I have some proposals
> on that front (https://bugs.ruby-lang.org/issues/18885) but that will take
> time and will never be perfect.
> 
> That's why I think periodically promoting a new master has potential.
> 
> > There's some line-wrapping issues caused by your MUA;
> 
> Urk. Ok, trying another client now, and I'll resend the patch.

Nope, still wrapped.  According to the git-format-patch(1) manpage,
Thunderbird can work w/o extensions, actually.

IME attachments might work mostly well, nowadays (test sending
to yourself and using "git am" to apply, first, of course).
Most on vger will disagree with me on attachments, though...

"git send-email" and "git imap-send" remain the recommended
options, but mutt works well, too.

> > Perhaps documenting this as experimental and subject to removal
> > at any time would make the addition of a major new feature an
> > easier pill to swallow; but I still think this is better outside
> > of app servers.
> 
> Up to you, if you don't feel like maintaining such a feature I would
> perfectly understand.

I'm fine with "maintaining it" if it just means Cc-ing you on
bugs related to this :>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Master promotion with SIGURG (CoW optimization)
       [not found]         ` <CANPRWbGArasDtbAej4LsCOGeYZSrNz87p5kLjG+x__jHAn-5ng@mail.gmail.com>
@ 2022-07-08  0:13           ` Eric Wong
  0 siblings, 0 replies; 7+ messages in thread
From: Eric Wong @ 2022-07-08  0:13 UTC (permalink / raw)
  To: Jean Boussier; +Cc: unicorn-public

Jean Boussier <jean.boussier@shopify.com> wrote:
> >> It seems socketpair(..., SOCK_SEQPACKET) can be used for
> >> packetized bidirectional IPC, perhaps with send_io + recv_io iff
> >> necessary.  There shouldn't be any need for new, fragile FS
> >> interactions.
> 
> > Hum. I'm not familiar with socketpair, I'll have to read on it.

Please keep unicorn-public@yhbt.net Cc-ed for archival purposes
in case I'm arrested/kidnapped/killed.  I'll quote and reply to
your other message separately, no need to resend.

> Could you clarify what you had in mind? Because I've read on socketpair, and
> from my understanding it's similar to anonymous pipes, as they have to
> be created prior
> to forking.

It's bidirectional, and you can use send_io/recv_io to send
pipes or any other IO objects across.  SOCK_SEQPACKET
(sequenced-packet) also frees you from having to deal with
splitting on message boundaries or interleaving.

Ruby's send_io/recv_io does have a weakness in that it can't
send a useful string buffer along with the IO object, but a little
C can be used with sendmsg/recvmsg to make it happen (but
sendmsg + send_io is probably fine)

Ruby's socket ext just sends a 1 byte dummy padding buffer, but
it can allow whatever the socket buffers can handle (over 200k,
IIRC).

This should allow the a socketpair created from a master to
communicate bidirectionally with any of its kids.

If you don't feel like writing C, it might be easier to have
two socketpairs:

1) for string buffer messages
2) another exclusively for send_io/recv_io

> So if we start forking from a worker, I don't see how the new worker and
> the master can open a line of communication without exchanging on the
> filesystem.

If you're sending one end of a pipe from the worker to master:

   m1, w1 = socketpair ...
   m2, w2 = socketpair ...

   w1.sendmsg("#$$-#{io.stat.ino}")
   w2.send_io(io)

And the master will:

   m1.recvmsg => "#{worker_pid}-#{inode}"
   m2.recv_io => map inode of received IO to worker

One major weakness of the above is the non-atomicity if a worker
dies in between w1.sendmsg and w2.send_io; the master would
be hanging on m2.recv_io.

This is why having a bit of C to bundle the IO object with
sendmsg/recvmsg in a single syscall would be nice (along with
saving 2 FDs).


Fwiw, here's the equivalent send/recv wrappers I wrote for Perl
Inline::C a while back, but it should be easy to translate into
Ruby's C API.

In case you can't infer what the Sv* functions do by name,
pretty much all the Perl C API is documented in the perlapi(1)
manpage.  Perl's API was easy enough for a doof like me to learn
after doing Ruby C for a bit...

/* allow send/recv of up to 10 IO objects at once */
#define SEND_FD_CAPA 10
#define SEND_FD_SPACE (SEND_FD_CAPA * sizeof(int))
union my_cmsg {
	struct cmsghdr hdr;
	char pad[sizeof(struct cmsghdr) + 16 + SEND_FD_SPACE];
};

static int sleep_wait(unsigned *tries, int err)
{
	const struct timespec req = { 0, 100000000 }; /* 100ms */
	switch (err) {
	case ENOBUFS: case ENOMEM: case ETOOMANYREFS:
		if (++*tries < 50) {
			fprintf(stderr, "sleeping on sendmsg: %s (#%u)\n",
				strerror(err), *tries);
			nanosleep(&req, NULL);
			return 1;
		}
	default:
		return 0;
	}
}

SV *send_foo(PerlIO *s, SV *svfds, SV *data, int flags)
{
	struct msghdr msg = { 0 };
	union my_cmsg cmsg = { 0 };
	STRLEN dlen = 0;
	struct iovec iov;
	ssize_t sent;
	AV *fds = (AV *)SvRV(svfds);
	I32 i, nfds = av_len(fds) + 1;
	int *fdp;
	unsigned tries = 0;

	if (SvOK(data)) {
		iov.iov_base = SvPV(data, dlen);
		iov.iov_len = dlen;
	}
	if (!dlen) { /* must be non-zero */
		iov.iov_base = &msg.msg_namelen; /* whatever */
		iov.iov_len = 1;
	}
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	if (nfds) {
		if (nfds > SEND_FD_CAPA) {
			fprintf(stderr, "FIXME: bump SEND_FD_CAPA=%d\n", nfds);
			nfds = SEND_FD_CAPA;
		}
		msg.msg_control = &cmsg.hdr;
		msg.msg_controllen = CMSG_SPACE(nfds * sizeof(int));
		cmsg.hdr.cmsg_level = SOL_SOCKET;
		cmsg.hdr.cmsg_type = SCM_RIGHTS;
		cmsg.hdr.cmsg_len = CMSG_LEN(nfds * sizeof(int));
		fdp = (int *)CMSG_DATA(&cmsg.hdr);
		for (i = 0; i < nfds; i++) {
			SV **fd = av_fetch(fds, i, 0);
			*fdp++ = SvIV(*fd);
		}
	}
	do {
		sent = sendmsg(PerlIO_fileno(s), &msg, flags);
	} while (sent < 0 && sleep_wait(&tries, errno));
	return sent >= 0 ? newSViv(sent) : &PL_sv_undef;
}

void recv_foo(PerlIO *s, SV *buf, STRLEN n)
{
	union my_cmsg cmsg = { 0 };
	struct msghdr msg = { 0 };
	struct iovec iov;
	ssize_t i;
	Inline_Stack_Vars;
	Inline_Stack_Reset;

	if (!SvOK(buf))
		sv_setpvn(buf, "", 0);
	iov.iov_base = SvGROW(buf, n + 1);
	iov.iov_len = n;
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = &cmsg.hdr;
	msg.msg_controllen = CMSG_SPACE(SEND_FD_SPACE);

	i = recvmsg(PerlIO_fileno(s), &msg, 0);
	if (i < 0)
		Inline_Stack_Push(&PL_sv_undef);
	else
		SvCUR_set(buf, i);
	if (i > 0 && cmsg.hdr.cmsg_level == SOL_SOCKET &&
			cmsg.hdr.cmsg_type == SCM_RIGHTS) {
		size_t len = cmsg.hdr.cmsg_len;
		int *fdp = (int *)CMSG_DATA(&cmsg.hdr);
		for (i = 0; CMSG_LEN((i + 1) * sizeof(int)) <= len; i++)
			Inline_Stack_Push(sv_2mortal(newSViv(*fdp++)));
	}
	Inline_Stack_Done;
}

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Master promotion with SIGURG (CoW optimization)
       [not found]       ` <CANPRWbHTNiEcYq5qhN6Kio8Wg9a+2gXmc2bAcB2oVw4LZv8rcw@mail.gmail.com>
       [not found]         ` <CANPRWbGArasDtbAej4LsCOGeYZSrNz87p5kLjG+x__jHAn-5ng@mail.gmail.com>
@ 2022-07-08  0:30         ` Eric Wong
  2022-07-08  6:22           ` Jean Boussier
  1 sibling, 1 reply; 7+ messages in thread
From: Eric Wong @ 2022-07-08  0:30 UTC (permalink / raw)
  To: Jean Boussier; +Cc: unicorn-public

Jean Boussier <jean.boussier@shopify.com> wrote:

(fully quoting original message for archival purposes, original
was sent privately)

> > SIGURG is fine, having SIGUSR2 mean two things seems fragile and error-prone
> 
> Ack.
> 
> 
> > I'm fine with requiring Linux for this feature, existing
> > features need to continue working on *BSD...
> 
> Sounds good.
> 
> 
> > *_fork callbacks already work for your current patch, though, right?;
> > so hopefully *_refork won't be needed...
> 
> Yeah, I don't think it's necessary.
> 
> > sidenote: no need for kgio, I've been meaning to get rid of it
> > (Ruby 2.3+ has everything we'd need, I think...)
> 
> Yes. I just didn't want to make such change in my patch.
> I can send a patch to remove kgio and require Ruby 2.3 if you wish.

I don't think Ruby 2.3+ is even necessary, just yet, nor kgio.
None of these new code paths should be exception-intensive to be
a performance problem.

> > I would prefer this Linux-only approach if it gets rid of PID
> > files and doesn't introduce new temporary files/FIFOs.
> 
> Ack. I'll get back to that one then.
> 
> > It seems socketpair(..., SOCK_SEQPACKET) can be used for
> > packetized bidirectional IPC, perhaps with send_io + recv_io iff
> > necessary.  There shouldn't be any need for new, fragile FS
> > interactions.
> 
> Hum. I'm not familiar with socketpair, I'll have to read on it.

SOCK_SEQPACKET has worked well on Linux for ages; and seems to
work well on FreeBSD 12+, at least; but I haven't tried it on
other OSes...

socketpair itself is ancient and every *nix has it (but only
SOCK_STREAM and SOCK_DGRAM, SOCK_SEQPACKET came a bit later).

> > Another idea, have you considered letting master work on some requests?
> > Signal handling would be delayed, and EINTR handling bugs in
> > gems could be exposed, but it'd be completely portable...
> 
> I haven't considered it no, but I'm not really kin on it, because if you do
> this, requests handled by the master can't be timed out.

OK, fair enough.

> If anything, the reliable timeout is the number one reason why I still
> believe unicorn
> is the superior server out there.

/me hides under a desk :<

Honestly, you have no idea how embarrased I am of that (mis)feature.

> > It depends on whether the monitoring process/library can work
> > with other Rack servers, probably; and signal compatibility.
> 
> That's a good point. Assuming unicorn and puma both do reforking on SIGURG,
> then that monitoring process could work with both. Not a bad idea.
> 
> > >   - Many endpoints require database and other network access you probably
> > >     don't want in the master.
> >
> > Wouldn't the *_fork hooks handle that, anyways?
> 
> They are supposed to yes. But I suspect many apps currently work fine
> because some connection are lazily established on first access and just never
> accessed as part of the boot process.
> 
> Once you enable reforking, you might realized that you are missing some code
> in `before_fork`.

OK, fair enough.  I think it's simpler to tell people to improve
existing hooks than introduce new names for similar concepts
(and likely a chunk of duplicated code in their configs)

> > >   - Endpoints may have side effects.
> >
> > Yeah, this would optimize the GET endpoints, at least.
> 
> Well, `GET` endpoints may have side effects too, e.g. a visit counter.
> 
> > Nope, still wrapped.  According to the git-format-patch(1) manpage,
> > Thunderbird can work w/o extensions, actually.
> 
> Damn it, I sent that one using GMail's plain test mode.
> 
> I'm sorry but my work laptop is kinda locked down, so I'm very limited
> in my choice of mail clients. I'll try to send it from my personal computer.

OK, your initial message was from Thunderbird and probably close;
so I figure git@vger.kernel.org could use the feedback if the
git manpage has fallen out-of-date.

> Also less convenient for you, but I guess you can pull my fork:
> 
> git@github.com:casperisfine/unicorn.git
> 
> The two branches are: master-promote and worker-refork.

That would be: https://github.com/casperisfine/unicorn.git

Anonymous HTTPS is mostly fine (more steps to quote/reply to,
and inaccessible offline)

Registration is what I have major problems with.

> > I'm fine with "maintaining it" if it just means Cc-ing you on
> > bugs related to this :>
> 
> Oh that's fine, that was pretty much implied.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] Master promotion with SIGURG (CoW optimization)
  2022-07-08  0:30         ` Eric Wong
@ 2022-07-08  6:22           ` Jean Boussier
  0 siblings, 0 replies; 7+ messages in thread
From: Jean Boussier @ 2022-07-08  6:22 UTC (permalink / raw)
  To: Eric Wong; +Cc: unicorn-public

> (fully quoting original message for archival purposes, original
> was sent privately)

Sorry for this. Again our company laptops got locked down recently
and now I'm just struggling with shitty mail clients I don't know how to use.


> I don't think Ruby 2.3+ is even necessary, just yet, nor kgio.
> None of these new code paths should be exception-intensive to be
> a performance problem.

True.


> > If anything, the reliable timeout is the number one reason why I still
> > believe unicorn
> > is the superior server out there.
>
> /me hides under a desk :<
>
> Honestly, you have no idea how embarrased I am of that (mis)feature.

It's a bit off topic, but really you shouldn't. A proper timeout mechanism
is absolutely essential to ensure resiliency of the service.

Unless your app is really trivial, it's pretty much impossible to ensure
that all your endpoints will always complete in a reasonable amount
of time, even more so when you have malicious actors trying to DOS
you, or some bugs in database clients and such.

Killing these stuck workers allows to keep the service healthy until the
problem is fixed. The only thing we changed is that we first send SIGTERM
so that we can report the backtrace and other debug data.

> send pipes or any other IO objects across.

Oh, that's what I missed, that makes sense. I'll try the C wrapper
to send both the message and the IO at the same time.

Thanks again, I'll likely get back to you next week with a patch.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-07-08  6:22 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-05 20:05 [PATCH] Master promotion with SIGURG (CoW optimization) Jean Boussier
2022-07-06  2:33 ` Eric Wong
2022-07-06  7:40   ` Jean Boussier
2022-07-07 10:23     ` Eric Wong
     [not found]       ` <CANPRWbHTNiEcYq5qhN6Kio8Wg9a+2gXmc2bAcB2oVw4LZv8rcw@mail.gmail.com>
     [not found]         ` <CANPRWbGArasDtbAej4LsCOGeYZSrNz87p5kLjG+x__jHAn-5ng@mail.gmail.com>
2022-07-08  0:13           ` Eric Wong
2022-07-08  0:30         ` Eric Wong
2022-07-08  6:22           ` Jean Boussier

Code repositories for project(s) associated with this inbox:

	../../../unicorn.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).