From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <e@80x24.org>
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level: 
X-Spam-ASN:  
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 2E7451F9F4;
	Mon, 22 Nov 2021 08:42:00 +0000 (UTC)
Date: Mon, 22 Nov 2021 08:42:00 +0000
From: Eric Wong <e@80x24.org>
To: KJ Tsanaktsidis <ktsanaktsidis@zendesk.com>
Cc: raindrops-public@yhbt.net
Subject: Re: [PATCH] Allow Raindrops objects to be backed by a memfd file
Message-ID: <20211122084200.GA10182@dcvr>
References: <20211122010336.43463-1-ktsanaktsidis@zendesk.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20211122010336.43463-1-ktsanaktsidis@zendesk.com>
List-Id: <raindrops-public.yhbt.net>

KJ Tsanaktsidis <ktsanaktsidis@zendesk.com> wrote:
> My use-case for this feature is that I want to collect memory statistics
> of a Unicorn master process in a way that does not itself cause any
> allocations. This is both because that would bias the measurement, but
> more importantly because we very tightly control when the GC runs in our
> Unicorn masters and any garbage created by high-frequency polling of
> GC.stat would potentially live for a long time.
> 
> With this solution, we can simply store values from rb_objspace into a
> Raindrops counter directly, and read the values out-of-process for
> submission to our metrics collection system.

Thanks for the description and justification.  I will accept
this feature, but some tweaks to the patch will be needed.
If this is too much for you, I'll try to take care of
adjustments (but still give you credit) sometime this week.

See comments inline below...

> diff --git a/ext/raindrops/extconf.rb b/ext/raindrops/extconf.rb
> index 792e509..1ddcdf2 100644
> --- a/ext/raindrops/extconf.rb
> +++ b/ext/raindrops/extconf.rb
> @@ -7,6 +7,7 @@
>  
>  $CPPFLAGS += " -D_GNU_SOURCE "
>  have_func('mremap', 'sys/mman.h')
> +have_func('memfd_create', 'sys/mman.h')
>  headers = %w(sys/types.h netdb.h string.h sys/socket.h netinet/in.h)
>  if have_header('linux/tcp.h')
>    headers << 'linux/tcp.h'
> diff --git a/ext/raindrops/raindrops.c b/ext/raindrops/raindrops.c
> index 837084c..c9bf8c8 100644
> --- a/ext/raindrops/raindrops.c
> +++ b/ext/raindrops/raindrops.c
> @@ -34,6 +34,7 @@ struct raindrops {
>  	size_t size;
>  	size_t capa;
>  	pid_t pid;
> +	int fd;
>  	struct raindrop *drops;
>  };

I suggest using "VALUE io" here.  I think it'll make some
code below easier-to-follow and maintain while offering
some flexibility for portability.

The downside is higher memory use (few hundred for the IO object
if the IO is used, 4 bytes if not on 64-bit).  Adding an rd_mark
callback to call rb_gc_mark will be required, too.

> @@ -47,6 +48,11 @@ static void rd_free(void *ptr)
>  		if (rv != 0)
>  			rb_bug("munmap failed in gc: %s", strerror(errno));
>  	}
> +	if (r->fd != -1) {
> +	  int rv = close(r->fd);
> +	  if (rv != 0)
> +	    rb_bug("close failed in gc: %s", strerror(errno));
> +	}

The using an IO here would eliminate this block.

Regardless, whenever possible and reasonable, favor comparisons
against zero instead of -1 since compilers can generate smaller
code.  Pedantically it's not POSIX in some cases, but glibc does
it, at least.

Also, follow existing indentation conventions (this applies to
any project(*)).  The C code here is mostly formatted the same way
as git.git or the Linux kernel, so it uses hard tabs.

(*) I have deal with indentation styles which annoy me for other
    projects, too.

>  	xfree(ptr);
>  }
> @@ -88,34 +94,74 @@ static struct raindrops *get(VALUE self)
>  
>  /*
>   * call-seq:
> - *	Raindrops.new(size)	-> raindrops object
> + *	Raindrops.new(size, name: nil)	-> raindrops object
>   *
>   * Initializes a Raindrops object to hold +size+ counters.  +size+ is
>   * only a hint and the actual number of counters the object has is
>   * dependent on the CPU model, number of cores, and page size of
>   * the machine.  The actual size of the object will always be equal
>   * or greater than the specified +size+.
> + * If +name+ is provided, and the platform is supported, the raindrop
> + * memory region will be backed by a memfd object with the provided
> + * name, so that it can be shared with other, non-child processes.
>   */
> -static VALUE init(VALUE self, VALUE size)
> +static VALUE init(int argc, VALUE *argv, VALUE self)
>  {
>  	struct raindrops *r = DATA_PTR(self);
>  	int tries = 1;
>  	size_t tmp;
> +  VALUE size;
> +  VALUE kwargs_hash;
> +  ID kwargs_ids[1] = { rb_intern_const("name") };

That may be C99-specific, and I think Ruby itself only went to
C99 in 2.7 or 3.x?  In any case, I don't think it's appropriate
to introduce C99 at this time.

(I like using C99 in other projects, too, but I don't think it's
 appropriate for us, just yet).

> +  VALUE kwargs[1];
> +  VALUE name;
> 
>  	if (r->drops != MAP_FAILED)
>  		rb_raise(rb_eRuntimeError, "already initialized");
>  
> +  rb_scan_args(argc, argv, "1:", &size, &kwargs_hash);
> +  rb_get_kwargs(kwargs_hash, kwargs_ids, 0, 1, kwargs);

rb_get_kwargs is 2.2+ only, isn't it?

I suggest using a Ruby wrapper and rename the C method.

Since you mentioned you were using this for unicorn, that's still
Ruby 1.9.3-compatible and will be 2.0+ for the next release.

I'm OK with bumping raindrops to Ruby 2.0+, for now; but the
requirement should be <= the Ruby version used by its users
(e.g. unicorn).

Fwiw, some other popular gems I keep an eye on (e.g. `sequel',
`mail') tend to be conservative when it comes to Ruby version
compatibility, too.

For portability to non-Linux systems, I wonder if:

	Raindrops.new(size, io: Tempfile.new('foo'))

can also be supported.  I can give this a shot on a FreeBSD VM
I sometimes have access to...

And then maybe drop support for `name:' entirely:

	Raindrops.new(size, io: Raindrops::Memfd.new('foo'))

> +  name = kwargs[0];
> +
>  	r->size = NUM2SIZET(size);
>  	if (r->size < 1)
>  		rb_raise(rb_eArgError, "size must be >= 1");
>  
> +  if (name != Qundef && name != Qnil) {

RTEST() reads more easily, and covers Qfalse.

> +#ifdef HAVE_MEMFD_CREATE
> +    r->fd = memfd_create(StringValueCStr(name), MFD_CLOEXEC);
> +    if (r->fd == -1) {
> +      int err = errno;
> +      if (err == ENOSYS) {
> +        rb_raise(rb_eRuntimeError, "system does not support memfd_create");
> +      } else {
> +        rb_raise(rb_eRuntimeError, "error calling memfd_create: %s", strerror(err));
> +      }

I don't think the special case for ENOSYS is necessary.  And
rb_sys_fail("memfd_create") is sufficient (it matches the
rb_sys_fail(funcname) pattern used in Ruby code.

Something like:

	int fd = memfd_create(StringValueCStr(name), MFD_CLOEXEC);
	if (fd < 0)
		rb_sys_fail("memfd_create");
	rd->io = rb_funcall(klass, rb_intern("for_fd"), 1, INT2NUM(fd));

> +    }
> +#else
> +    rb_raise(rb_eRuntimeError, "extension not compiled with memfd_create");
> +#endif
> +  } else {
> +    r->fd = -1;
> +  }
> +
>  	tmp = PAGE_ALIGN(raindrop_size * r->size);
>  	r->capa = tmp / raindrop_size;
>  	assert(PAGE_ALIGN(raindrop_size * r->capa) == tmp && "not aligned");
>  
>  retry:
> -	r->drops = mmap(NULL, tmp,
> -	                PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED, -1, 0);
> +  if (r->fd != -1) {
> +    if (ftruncate(r->fd, tmp) == -1) {
> +      r->drops = MAP_FAILED;
> +    } else {
> +    r->drops = mmap(NULL, tmp,
> +                    PROT_READ|PROT_WRITE, MAP_SHARED, r->fd, 0);
> +    }
> +  } else {
> +  	r->drops = mmap(NULL, tmp,
> +  	                PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED, -1, 0);
> +  }
> +
>  	if (r->drops == MAP_FAILED) {
>  		int err = errno;
>  
> @@ -153,6 +199,10 @@ static void resize(struct raindrops *r, size_t new_rd_size)
>  	if (r->pid != getpid())
>  		rb_raise(rb_eRuntimeError, "cannot mremap() from child");
>  
> +	if (r->fd != -1) {
> +	  rb_raise(rb_eRuntimeError, "resize not implemented with named raindrops");
> +	}
> +
>  	rv = mremap(old_address, old_size, new_size, MREMAP_MAYMOVE);
>  	if (rv == MAP_FAILED) {
>  		int err = errno;
> @@ -213,6 +263,18 @@ static VALUE capa(VALUE self)
>  	return SIZET2NUM(get(self)->capa);
>  }
>  
> +/*
> + * call-seq:
> + *   rd.fd    -> Integer
> + *
> + * Returns the file descriptor number associated with this Raindrop, if
> + * it was created with a name.
> + */
> +static VALUE fd(VALUE self)
> +{
> +  return INT2NUM(get(self)->fd);
> +}

If we go with "VALUE io" as I suggested above, I suggest a
"to_io" method.  That would allow compatibility with other APIs
(e.g. IO.select), and access to IO#stat, IO#close_on_exec=,
IO#ioctl, IO#fileno, and any number of potentially useful
methods.

I've never used memfd before, but the mmap-ed regions are usable
after a regular file FD is closed.  So an FD-starved process
could use rd.to_io.close and continue using the mmap-ed section
as normal.

Regardless, naming a method "fd" is unnecessary, the
function/method called "fileno" means the same thing across
C stdio.h, Ruby, Perl, and likely other languages.

>  /*
>   * call-seq:
>   *	rd.dup		-> rd_copy
> @@ -223,8 +285,9 @@ static VALUE init_copy(VALUE dest, VALUE source)
>  {
>  	struct raindrops *dst = DATA_PTR(dest);
>  	struct raindrops *src = get(source);
> +	VALUE init_argv[1] = { SIZET2NUM(src->size) };
>  
> -	init(dest, SIZET2NUM(src->size));
> +	init(1, init_argv, dest);
>  	memcpy(dst->drops, src->drops, raindrop_size * src->size);

Same comments as before wrt. C compiler and Ruby compatibility.

>  	return dest;
> @@ -372,6 +435,11 @@ static VALUE evaporate_bang(VALUE self)
>  	r->drops = MAP_FAILED;
>  	if (munmap(addr, raindrop_size * r->capa) != 0)
>  		rb_sys_fail("munmap");
> +	if (r->fd != -1) {
> +	  if (close(r->fd) != 0)
> +	    rb_sys_fail("close");
> +	  r->fd = -1;
> +	}

This could just use rb_io_close.  No need to comment on the
rest... (or I'm too sleepy atm and missed something :x)

Anyways thanks again for bringing this up.  Let us know if you
want to tackle handling my suggestions or if you want to wait
for me to do it.