unicorn Ruby/Rack server user+dev discussion/patches/pulls/bugs/help
 help / color / mirror / code / Atom feed
* Some issues in the HTTP parser (and suggestions)
@ 2010-05-07 13:51 Iñaki Baz Castillo
  2010-05-07 18:33 ` Eric Wong
  0 siblings, 1 reply; 5+ messages in thread
From: Iñaki Baz Castillo @ 2010-05-07 13:51 UTC (permalink / raw)
  To: unicorn list

Hi, by inspecting the Ragel grammar of the HTTP parser (coming from
Mongrel) I've realized of some possible issues and bugs:


hostname = (alnum | "-" | "." | "_")+;

- It doesn't allow IPv6. This is important IMHO.
- It allows "_" which is an invalid symbol (not valid for a domain).

I suggest:
  hostname = (alnum | "-" | "." | "[" | "]" | ":")+;

------------------

host_with_port = (hostname (":" digit*)?) >mark %host;

- It allows something ugly as "mydomain.org:"

I suggest:
  host_with_port = (hostname (":" digit{1,5})?) >mark %host;

------------------

message_header = ((field_name ":" " "* field_value)|value_cont) :> CRLF;

- It doesn't allow valid spaces before ":" as:
     Host : mydomain.org

- Tabulators are also allowed.

I suggest:
  message_header = ((field_name [ \t]* ":" [ \t]*
field_value)|value_cont) :> CRLF;



Regards.



-- 
Iñaki Baz Castillo
<ibc@aliax.net>
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Some issues in the HTTP parser (and suggestions)
  2010-05-07 13:51 Some issues in the HTTP parser (and suggestions) Iñaki Baz Castillo
@ 2010-05-07 18:33 ` Eric Wong
  2010-05-08 13:11   ` Iñaki Baz Castillo
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Wong @ 2010-05-07 18:33 UTC (permalink / raw)
  To: unicorn list

Iñaki Baz Castillo <ibc@aliax.net> wrote:
> Hi, by inspecting the Ragel grammar of the HTTP parser (coming from
> Mongrel) I've realized of some possible issues and bugs:
> 
> 
> hostname = (alnum | "-" | "." | "_")+;
> 
> - It doesn't allow IPv6. This is important IMHO.
> - It allows "_" which is an invalid symbol (not valid for a domain).
> 
> I suggest:
>   hostname = (alnum | "-" | "." | "[" | "]" | ":")+;

Hi Iñaki,

Underscore isn't valid for hostnames, but it is allowed in domain names
and most DNS servers will resolve them.  I've personally seen websites
with underscores in their domain names in the wild[1].

We'll have to test the IPv6 addresses and probably split that out into a
separate regexp since ":" would raise issues with the port number in
existing cases.  This is probably something for post-1.0.

> ------------------
> 
> host_with_port = (hostname (":" digit*)?) >mark %host;
> 
> - It allows something ugly as "mydomain.org:"
> 
> I suggest:
>   host_with_port = (hostname (":" digit{1,5})?) >mark %host;

It's ugly, but section 3.2.2 of RFC 2396 appears to allows it.

> ------------------
> 
> message_header = ((field_name ":" " "* field_value)|value_cont) :> CRLF;
> 
> - It doesn't allow valid spaces before ":" as:
>      Host : mydomain.org

Spaces before the ":" aren't allowed in rfc2616, and I have yet to see
evidence of clients sending headers like this in ~4 years of using this
parser.

> - Tabulators are also allowed.
> 
> I suggest:
>   message_header = ((field_name [ \t]* ":" [ \t]*
> field_value)|value_cont) :> CRLF;

I just pushed this out to unicorn.git to allow horizontal tabs:

>From 935912a422cabfd323f9b4ff268ded09a2ebe7a6 Mon Sep 17 00:00:00 2001
From: Eric Wong <normalperson@yhbt.net>
Date: Fri, 7 May 2010 18:20:49 +0000
Subject: [PATCH] http: allow horizontal tab as leading whitespace in header values
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This is allowed by RFC 2616, section 2.2, where spaces and
horizontal tabs are counted as linear white space and linear
white space (not just regular spaces) may prefix field-values
(section 4.2).

This has _not_ been a real issue in ~4 years of using this
parser (starting with Mongrel) with clients in the wild.

Thanks to Iñaki Baz Castillo for pointing this out.
---
 ext/unicorn_http/unicorn_http_common.rl |    2 +-
 test/unit/test_http_parser.rb           |    8 ++++++++
 2 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/ext/unicorn_http/unicorn_http_common.rl b/ext/unicorn_http/unicorn_http_common.rl
index 6fca604..f165e3f 100644
--- a/ext/unicorn_http/unicorn_http_common.rl
+++ b/ext/unicorn_http/unicorn_http_common.rl
@@ -54,7 +54,7 @@
 
   value_cont = lws+ any* >start_value %write_cont_value;
 
-  message_header = ((field_name ":" " "* field_value)|value_cont) :> CRLF;
+  message_header = ((field_name ":" lws* field_value)|value_cont) :> CRLF;
   chunk_ext_val = token*;
   chunk_ext_name = token*;
   chunk_extension = ( ";" " "* chunk_ext_name ("=" chunk_ext_val)? )*;
diff --git a/test/unit/test_http_parser.rb b/test/unit/test_http_parser.rb
index 5b0ca9f..b7c8a1c 100644
--- a/test/unit/test_http_parser.rb
+++ b/test/unit/test_http_parser.rb
@@ -51,6 +51,14 @@ class HttpParserTest < Test::Unit::TestCase
     assert parser.keepalive?
   end
 
+  def test_tab_lws
+    parser = HttpParser.new
+    req = {}
+    tmp = "GET / HTTP/1.1\r\nHost:\tfoo.bar\r\n\r\n"
+    assert_equal req.object_id, parser.headers(req, tmp).object_id
+    assert_equal "foo.bar", req['HTTP_HOST']
+  end
+
   def test_connection_close_no_ka
     parser = HttpParser.new
     req = {}
-- 

[1] - and those were wild NSFW sites, of course :>

-- 
Eric Wong
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: Some issues in the HTTP parser (and suggestions)
  2010-05-07 18:33 ` Eric Wong
@ 2010-05-08 13:11   ` Iñaki Baz Castillo
  2010-05-08 16:17     ` Eric Wong
  0 siblings, 1 reply; 5+ messages in thread
From: Iñaki Baz Castillo @ 2010-05-08 13:11 UTC (permalink / raw)
  To: unicorn list

2010/5/7 Eric Wong <normalperson@yhbt.net>:
> Underscore isn't valid for hostnames, but it is allowed in domain names
> and most DNS servers will resolve them.  I've personally seen websites
> with underscores in their domain names in the wild[1].

Hi Eric, could you point me to the spec stating that underscore is
allowed for a domain? In the past I've done a SIP parser [*] with
Ragel, being 100% strict at BNF grammar, and note that SIP reuses 80%
of the grammar of HTTP. I'm pretty sure that "_" is not valid in a
domain (host, hostname or whatever). Anyhow it's better just to allow
it at parsing level :)



> We'll have to test the IPv6 addresses and probably split that out into a
> separate regexp since ":" would raise issues with the port number in
> existing cases.  This is probably something for post-1.0.

There is a IETF draft to improve and *fix* the existing BNF grammar for IPv6.
It also improves the grammar for IPv4 (by dissallowing values greater than 255):

    http://tools.ietf.org/html/draft-ietf-sip-ipv6-abnf-fix


I've already implemented it in Ragel and I can sure that it's 100%
valid and strict (I've done lots of tests):

alphanum  =  ALPHA / DIGIT
domainlabel = alphanum | ( alphanum ( alphanum | "-" )* alphanum );
toplabel = ALPHA | ( ALPHA ( alphanum | "-" )* alphanum );
hostname = ( domainlabel "." )* toplabel "."?;
dec_octet = DIGIT | ( 0x31..0x39 DIGIT ) | ( "1" DIGIT{2} ) | ( "2"
0x30..0x34 DIGIT ) | ( "25" 0x30..0x35 );
IPv4address = dec_octet "." dec_octet "." dec_octet "." dec_octet;
h16 = HEXDIG{1,4};
ls32 = ( h16 ":" h16 ) | IPv4address;
IPv6address = ( ( h16 ":" ){6} ls32 ) | ( "::" ( h16 ":" ){5} ls32 ) |
( h16? "::" ( h16 ":" ){4} ls32 ) | ( ( ( h16 ":" )? h16 )? "::" ( h16
":" ){3} ls32 ) | ( ( ( h16 ":" ){,2} h16 )? "::" ( h16 ":" ){2} ls32
) | ( ( ( h16 ":" ){,3} h16 )? "::" h16 ":" ls32 ) | ( ( ( h16 ":"
){,4} h16 )? "::" ls32 ) | ( ( ( h16 ":" ){,5} h16 )? "::" h16 ) | ( (
( h16 ":" ){,6} h16 )? "::" );
IPv6reference = "[" IPv6address "]";
host = hostname | IPv4address | IPv6reference;
port = DIGIT{1,5};
hostport = host ( ":" port )?;


This is much better than the deprecated and bogus grammar in RFC 2396 ;)





>> ------------------
>>
>> host_with_port = (hostname (":" digit*)?) >mark %host;
>>
>> - It allows something ugly as "mydomain.org:"
>>
>> I suggest:
>>   host_with_port = (hostname (":" digit{1,5})?) >mark %host;
>
> It's ugly, but section 3.2.2 of RFC 2396 appears to allows it.

Sometimes there are bugs in the RFC's related to parsing and BNF
grammars. I know several cases. Unfortunatelly RFC's cannot be fixed,
instead the errors are reported and a new draft or RFC "xxx-fix"
appears some years later.



>> message_header = ((field_name ":" " "* field_value)|value_cont) :> CRLF;
>>
>> - It doesn't allow valid spaces before ":" as:
>>      Host : mydomain.org
>
> Spaces before the ":" aren't allowed in rfc2616, and I have yet to see
> evidence of clients sending headers like this in ~4 years of using this
> parser.

In SIP protocol spaces and tabulators before ":" are allowed, I really
expected that in HTTP the same occurs as SIP grammar is based on HTTP
grammar. But it could be different in some aspects, of course.


>> - Tabulators are also allowed.
>>
>> I suggest:
>>   message_header = ((field_name [ \t]* ":" [ \t]*
>> field_value)|value_cont) :> CRLF;
>
> I just pushed this out to unicorn.git to allow horizontal tabs:

Thanks.


[*] http://dev.sipdoc.net/projects/ragel-sip-parser/wiki/Phase1


-- 
Iñaki Baz Castillo
<ibc@aliax.net>
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Some issues in the HTTP parser (and suggestions)
  2010-05-08 13:11   ` Iñaki Baz Castillo
@ 2010-05-08 16:17     ` Eric Wong
  2010-05-08 16:50       ` Iñaki Baz Castillo
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Wong @ 2010-05-08 16:17 UTC (permalink / raw)
  To: unicorn list

Iñaki Baz Castillo <ibc@aliax.net> wrote:
> 2010/5/7 Eric Wong <normalperson@yhbt.net>:
> > Underscore isn't valid for hostnames, but it is allowed in domain names
> > and most DNS servers will resolve them.  I've personally seen websites
> > with underscores in their domain names in the wild[1].
> 
> Hi Eric, could you point me to the spec stating that underscore is
> allowed for a domain? In the past I've done a SIP parser [*] with
> Ragel, being 100% strict at BNF grammar, and note that SIP reuses 80%
> of the grammar of HTTP. I'm pretty sure that "_" is not valid in a
> domain (host, hostname or whatever). Anyhow it's better just to allow
> it at parsing level :)

http://www.ietf.org/rfc/rfc2782.txt

Even if it's not part of the RFC, our parser will match reality and
accommodate broken things we see in the wild, as it has done in the
past:

  http://mid.gmane.org/20080327215027.GA14531@untitled

> > We'll have to test the IPv6 addresses and probably split that out into a
> > separate regexp since ":" would raise issues with the port number in
> > existing cases.  This is probably something for post-1.0.
> 
> There is a IETF draft to improve and *fix* the existing BNF grammar for IPv6.
> It also improves the grammar for IPv4 (by dissallowing values greater than 255):
> 
>     http://tools.ietf.org/html/draft-ietf-sip-ipv6-abnf-fix
> 
> 
> I've already implemented it in Ragel and I can sure that it's 100%
> valid and strict (I've done lots of tests):
> 
> alphanum  =  ALPHA / DIGIT
> domainlabel = alphanum | ( alphanum ( alphanum | "-" )* alphanum );
> toplabel = ALPHA | ( ALPHA ( alphanum | "-" )* alphanum );
> hostname = ( domainlabel "." )* toplabel "."?;
> dec_octet = DIGIT | ( 0x31..0x39 DIGIT ) | ( "1" DIGIT{2} ) | ( "2"
> 0x30..0x34 DIGIT ) | ( "25" 0x30..0x35 );
> IPv4address = dec_octet "." dec_octet "." dec_octet "." dec_octet;
> h16 = HEXDIG{1,4};
> ls32 = ( h16 ":" h16 ) | IPv4address;
> IPv6address = ( ( h16 ":" ){6} ls32 ) | ( "::" ( h16 ":" ){5} ls32 ) |
> ( h16? "::" ( h16 ":" ){4} ls32 ) | ( ( ( h16 ":" )? h16 )? "::" ( h16
> ":" ){3} ls32 ) | ( ( ( h16 ":" ){,2} h16 )? "::" ( h16 ":" ){2} ls32
> ) | ( ( ( h16 ":" ){,3} h16 )? "::" h16 ":" ls32 ) | ( ( ( h16 ":"
> ){,4} h16 )? "::" ls32 ) | ( ( ( h16 ":" ){,5} h16 )? "::" h16 ) | ( (
> ( h16 ":" ){,6} h16 )? "::" );
> IPv6reference = "[" IPv6address "]";
> host = hostname | IPv4address | IPv6reference;
> port = DIGIT{1,5};
> hostport = host ( ":" port )?;
> 
> 
> This is much better than the deprecated and bogus grammar in RFC 2396 ;)

Thanks, it might be worth simplifying a bit for readability, simplicity
(and possibly performance) at the expense of 100% conformance.

-- 
Eric Wong
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Some issues in the HTTP parser (and suggestions)
  2010-05-08 16:17     ` Eric Wong
@ 2010-05-08 16:50       ` Iñaki Baz Castillo
  0 siblings, 0 replies; 5+ messages in thread
From: Iñaki Baz Castillo @ 2010-05-08 16:50 UTC (permalink / raw)
  To: unicorn list

2010/5/8 Eric Wong <normalperson@yhbt.net>:
> Iñaki Baz Castillo <ibc@aliax.net> wrote:
>> 2010/5/7 Eric Wong <normalperson@yhbt.net>:
>> > Underscore isn't valid for hostnames, but it is allowed in domain names
>> > and most DNS servers will resolve them.  I've personally seen websites
>> > with underscores in their domain names in the wild[1].
>>
>> Hi Eric, could you point me to the spec stating that underscore is
>> allowed for a domain? In the past I've done a SIP parser [*] with
>> Ragel, being 100% strict at BNF grammar, and note that SIP reuses 80%
>> of the grammar of HTTP. I'm pretty sure that "_" is not valid in a
>> domain (host, hostname or whatever). Anyhow it's better just to allow
>> it at parsing level :)
>
> http://www.ietf.org/rfc/rfc2782.txt

Hi Eric. DNS SRV are not domain names, but DNS queries. For example,
in SIP (and also in XMPP) if a phone is configured to use
"myproxy.org" as proxy then it must perform a DNS SRV query for these
entries:

  _sip._udp.myproxy.org
  _sip._tcp.myproxy.org

Then the DNS query recevives some DNS A records for which the client
also retrieves the associated IP's. But after it, when the client
generates a SIP request it uses "myproxy.org" rather than
"_sip._udp.myproxy.org". This is, "_sip._udp.myproxy.org" is not a
domain/hostname, but a format for querying DNS SRV records.


> Even if it's not part of the RFC, our parser will match reality and
> accommodate broken things we see in the wild, as it has done in the
> past:
>
>  http://mid.gmane.org/20080327215027.GA14531@untitled

Yes, I agree.



>> alphanum  =  ALPHA / DIGIT
>> domainlabel = alphanum | ( alphanum ( alphanum | "-" )* alphanum );
>> toplabel = ALPHA | ( ALPHA ( alphanum | "-" )* alphanum );
>> hostname = ( domainlabel "." )* toplabel "."?;
>> dec_octet = DIGIT | ( 0x31..0x39 DIGIT ) | ( "1" DIGIT{2} ) | ( "2"
>> 0x30..0x34 DIGIT ) | ( "25" 0x30..0x35 );
>> IPv4address = dec_octet "." dec_octet "." dec_octet "." dec_octet;
>> h16 = HEXDIG{1,4};
>> ls32 = ( h16 ":" h16 ) | IPv4address;
>> IPv6address = ( ( h16 ":" ){6} ls32 ) | ( "::" ( h16 ":" ){5} ls32 ) |
>> ( h16? "::" ( h16 ":" ){4} ls32 ) | ( ( ( h16 ":" )? h16 )? "::" ( h16
>> ":" ){3} ls32 ) | ( ( ( h16 ":" ){,2} h16 )? "::" ( h16 ":" ){2} ls32
>> ) | ( ( ( h16 ":" ){,3} h16 )? "::" h16 ":" ls32 ) | ( ( ( h16 ":"
>> ){,4} h16 )? "::" ls32 ) | ( ( ( h16 ":" ){,5} h16 )? "::" h16 ) | ( (
>> ( h16 ":" ){,6} h16 )? "::" );
>> IPv6reference = "[" IPv6address "]";
>> host = hostname | IPv4address | IPv6reference;
>> port = DIGIT{1,5};
>> hostport = host ( ":" port )?;
>>
>>
>> This is much better than the deprecated and bogus grammar in RFC 2396 ;)
>
> Thanks, it might be worth simplifying a bit for readability, simplicity
> (and possibly performance) at the expense of 100% conformance.

You can try to simplicity it, but note that some previous IPv6 BNF
grammars failed to cover all the valid cases and they are bogus. For
example, the original IPv6 BNF grammar appearing in RFC 3261 (SIP
protocol) is buggy (even if it seems simpler) and the specification
has been fixed with the draft I linked in my previous mail.


Regards.


-- 
Iñaki Baz Castillo
<ibc@aliax.net>
_______________________________________________
Unicorn mailing list - mongrel-unicorn@rubyforge.org
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2010-05-08 16:54 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-05-07 13:51 Some issues in the HTTP parser (and suggestions) Iñaki Baz Castillo
2010-05-07 18:33 ` Eric Wong
2010-05-08 13:11   ` Iñaki Baz Castillo
2010-05-08 16:17     ` Eric Wong
2010-05-08 16:50       ` Iñaki Baz Castillo

Code repositories for project(s) associated with this public inbox

	https://yhbt.net/unicorn.git/

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).