From: Eric Wong <normalperson@yhbt.net>
To: mongrel-development@rubyforge.org
Cc: mongrel-unicorn@rubyforge.org
Subject: merging Unicorn HTTP parser back to Mongrel
Date: Sat, 5 Sep 2009 14:50:45 -0700
Message-ID: <20090905215045.GB28829@dcvr.yhbt.net>
Hello,
(OK, this email got longer than expected; I now consider the first and
last paragraphs of the last footnote to be the most important parts.)
The Unicorn HTTP parser is feature complete as far as I can tell and
supports things the Mongrel one does not. I would very much like to see
it used in places that Unicorn isn't suited for[1]. In fact, a chunk of
the new features are much better suited for a server with better slow
client handling like Mongrel.
The big roadblock to getting this back into Mongrel is the Java/JRuby
version of the parser Mongrel uses. Simply put, I don't do Java;
somebody else will have to port it. But I'll have to convince you
that these features are worth going into Mongrel, too :)
I could provide a standalone C parser that can be wrapped with FFI, but
I'm not sure if the performance would be acceptable. I'm fairly certain
that a pure-Ruby version with Ragel-generated code would not provide
acceptable performance anywhere; maybe a hand-coded one could, but I'm
not particularly excited about doing that...
The MRI-C parser should just work on Win32.
Unlike the rest of Unicorn, the HTTP parser remains portable to non-UNIX
platforms and thread-safe. There are no system-calls made directly
through it (only memory allocations through the Ruby C API).
New features that aren't in Mongrel are:
* HTTP/0.9 support - blame a network BOFH hell-bent on saving
bytes with a health-checker config for this :)
The HttpParser#headers? method has been added to determine if headers
should be sent in the response (HTTP/0.9 didn't have response
headers).
* "Transfer-Encoding: chunked" request decoding support
I've been told mobile devices[2] do uploads like this (since they
may lack the storage capacity to store large files). This will be
useful to Mongrel since Mongrel can handle slow clients better
(mobile devices).
I also have a use case that goes like this:
tar zc $BIG_DIRECTORY | curl -T- http://unicorn/path/to/upload
This is designed to be slurp-resistant, so clients cannot control the
memory usage of the server and DoS it, even with huge chunk sizes.
* Trailers support (with Transfer-Encoding: chunked). I haven't
run across applications that use this yet (Amazon S3 maybe?) but one
use case that I can foresee is generating a Content-MD5 trailer with
the above "tar | curl" command.
* Multiline continuation headers - Pound sends them. I don't care for
Pound, but I figured I might as well do it just in case somebody else
starts doing it...
* Absolute Request URI parsing - It was done with URI.parse
originally, but I figured I might as well do it in Ragel since it's
part of RFC 2616. I think client-side proxies use it so maybe one day
somebody can turn Mongrel or a derived server into a client-side
HTTP proxy...
* Repeated headers handling - they're joined with commas now since Rack
doesn't accept arrays in HTTP_* entries. I posted
a standalone patch for this in <20090810001022.GA17572@dcvr.yhbt.net>
* HttpParser#keepalive? method - the parser can tell you if it's safe
to handle a keepalive request. Not used with Unicorn at the moment.
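
To make the continuation-line and repeated-header items above more
concrete, here is a minimal pure-Ruby sketch of the resulting behavior.
It is not the Ragel/C code; rack_headers is a hypothetical helper, the
single space used to join continuation lines and the bare "," separator
are assumptions, and it ignores Rack's special treatment of
Content-Type and Content-Length:

```ruby
# Hypothetical helper for illustration only (not part of Unicorn or Mongrel).
# Turns raw header lines into Rack-style HTTP_* entries.
def rack_headers(raw_lines)
  env = {}
  last_key = nil
  raw_lines.each do |line|
    if line =~ /\A[ \t]+/ && last_key
      # Continuation line: fold it into the previous header's value.
      env[last_key] << " " << line.strip
      next
    end
    name, value = line.split(":", 2)
    next if value.nil? # not a header line; a real parser would reject it
    key = "HTTP_" + name.strip.upcase.tr("-", "_")
    value = value.strip
    if env.key?(key)
      # Repeated header: join with a comma so the entry stays a String.
      env[key] << "," << value
    else
      env[key] = value
    end
    last_key = key
  end
  env
end

env = rack_headers([
  "X-Forwarded-For: 10.0.0.1",
  "X-Long: first",
  "\tsecond",
  "X-Forwarded-For: 10.0.0.2",
])
```

Every value stays a plain String, which is what Rack expects for HTTP_*
entries.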
Chunk extensions are one thing that the parser currently just ignores.
This is because I've yet to see any use of them anywhere and Rack
does not mention them.
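
For anyone unfamiliar with the wire format, a rough pure-Ruby sketch of
chunked request decoding follows. It only illustrates the idea and is
not the actual parser: each_chunk and max_read are hypothetical names,
chunk extensions are dropped, and trailers after the last chunk are left
unread. Note that it never reads more than max_read bytes at once, no
matter how large a chunk size the client declares:

```ruby
require "stringio"

# Hypothetical decoder for illustration only; the real parser is
# Ragel-generated C. Yields body data in pieces of at most max_read bytes.
def each_chunk(io, max_read = 16 * 1024)
  loop do
    size_line = io.gets("\r\n") or raise EOFError, "missing chunk size"
    # Keep only the hex size; anything after ";" is a chunk extension.
    hex = size_line.chomp("\r\n")[/\A[^;]*/].strip
    size = Integer(hex, 16)
    break if size.zero? # last chunk; trailers (if any) follow
    remaining = size
    while remaining > 0
      # The client picks the chunk size, but the server picks the read size.
      buf = io.read([remaining, max_read].min) or raise EOFError, "truncated chunk"
      remaining -= buf.bytesize
      yield buf
    end
    io.read(2) == "\r\n" or raise ArgumentError, "chunk not terminated by CRLF"
  end
end

body = "5;ext=1\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
pieces = []
each_chunk(StringIO.new(body), 4) { |buf| pieces << buf }
```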
Parser Limits:
Request body handling:
Maximum Content-Length is the maximum value of off_t. I don't think
this should be a problem for anyone as Ruby defaults to
_FILE_OFFSET_BITS=64 on 32-bit arches. Mongrel does not have this
limit in the parser, but since it buffers large uploads to a
Tempfile, the limit always existed anyways.
Maximum chunk size is also the maximum value of off_t, which is
usually 64 bits wide (since Ruby defaults to _FILE_OFFSET_BITS=64
on my 32-bit boxes). I don't expect valid clients to send any
values close to this limit, but that's just what it is.
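
To put a number on that limit: assuming a signed 64-bit off_t, the
largest accepted value is 2**63 - 1. A small sketch of the equivalent
check in Ruby (OFF_T_MAX and parse_content_length are hypothetical
names, not part of the parser API):

```ruby
# Assumes a signed 64-bit off_t, as with _FILE_OFFSET_BITS=64.
OFF_T_MAX = 2**63 - 1 # 9223372036854775807

# Hypothetical helper: returns the length as an Integer, or nil if the
# value is malformed or would not fit in off_t.
def parse_content_length(value)
  return nil unless value =~ /\A\d+\z/
  length = value.to_i
  length <= OFF_T_MAX ? length : nil
end
```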
Headers:
Mostly the same as Mongrel: all headers must fit into the same
<=112K string object, which shouldn't be a problem for anything
capable of running Ruby.
Continuation lines can bypass the per-header size limit, but
everything still stays under 112K which is a pretty large limit.
Trailers:
These can fit into another <=112K string, space taken up during
header processing doesn't affect Trailer processing, so you could
end up with 224K of combined metadata.
You can get a full changelog since I branched from fauna/mongrel via:
git log v0.0.0.. -- ext
Finally, the new API is documented via RDoc here:
http://unicorn.bogomips.org/Unicorn/HttpParser.html
I don't consider the API set in stone, but I do consider the header
handling part a bit simpler/less error prone than the old one.
Disclaimer:
Due to the large number of changes to the C/Ragel portions, another
security audit/pair-of-eyes would be nice. All use of Unicorn so far
has been on LANs with trusted clients or with nginx in front. While
I'm very comfortable with C and fairly comfortable with Ragel, I'm far
from infallible so close review from a second pair of eyes would be
greatly appreciated.
Future:
I'm also planning on porting this to Rubinius. I haven't had a
chance to look at it yet but the Mongrel/C one has already been ported
so it shouldn't be too hard (I only know/can stomach a small amount of
C++, though I suspect I won't even need it ...)
Footnotes:
[1] - Comet/long-polling/reverse HTTP, and sites that rely heavily on
external services (including OpenID) are all badly suited for
Unicorn.
[2] - As a side effect, Unicorn also uses a TeeInput class that allows
the request body to be read in real-time within the Rack
application (while "tee-ing" to a temporary file to provide
rewindability). This also allows Mongrel Upload Progress to
be implemented in the future in a Rack::Lint-compliant manner.
The one weird thing about TeeInput is that:
env["rack.input"].read(NR_BYTES)
Is not guaranteed to return NR_BYTES, only NR_BYTES at most. So
every #read can provide "last block" semantics. Rack does not
enforce this behavior, so it should be fine. This should not be a
problem in practice since most read() and read()-like APIs provide
no such guarantee even if implied when reading from "fast" devices
like the filesystem. CGI apps that get a socket as stdin also
get semantics similar to what apps under Unicorn get.
I imagine this feature to be hugely useful for slow mobile clients
that stream data slowly as it allows the server to start processing
data as it is being uploaded.
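
An application that really does need exactly NR_BYTES can loop until it
has them or hits EOF. A minimal sketch, assuming only that #read(length)
returns nil or an empty string at EOF; ShortReadInput is a hypothetical
stand-in for env["rack.input"] that deliberately returns short reads,
and read_fully is not part of Unicorn or Rack:

```ruby
require "stringio"

# Hypothetical stand-in for an input whose #read may return fewer bytes
# than requested, similar to reading from a slow socket.
class ShortReadInput
  def initialize(data, max_per_read)
    @io = StringIO.new(data)
    @max = max_per_read
  end

  def read(length)
    @io.read([length, @max].min)
  end
end

# Keep calling #read until nr_bytes have arrived or the input is exhausted.
def read_fully(input, nr_bytes)
  buf = String.new # empty binary string
  while buf.bytesize < nr_bytes
    part = input.read(nr_bytes - buf.bytesize)
    break if part.nil? || part.empty? # EOF
    buf << part
  end
  buf
end

input = ShortReadInput.new("abcdefghij", 3)
first = read_fully(input, 8)
rest = read_fully(input, 8)
```

The second call returns fewer than 8 bytes only because the input has
run out, which is the "last block" case described above.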
--
Eric Wong